**Overview**
- Create an AI chatbot that can answer questions about 2024 tax filing using embedded tax docs as knowledge base

**Steps**
1. Document embedding
2. Pinecone for vector storage
3. GPT for answer generation
4. Streamlit for chatbot UI

# Prepare tax filing documents 

## Chunk & preprocess documents
* convert pdfs to text
* break text into manageable chunks (e.g. 500-1000 tokens) for embedding

In [None]:
import fitz  # PyMuPDF

def extract_text_pymupdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

pdf_text = extract_text_pymupdf("./i1040gi.pdf")
print(pdf_text[:500])  # Print first 500 characters


Instructions for Form 1040 (2024)  Catalog Number 24811V
Dec 16, 2024
Department of the Treasury  Internal Revenue Service  www.irs.gov
Future Developments
2024 Changes
R
INSTRUCTIONS
See What’s New in these instructions.
See IRS.gov and IRS.gov/Forms, and for the latest information about developments related to Forms 1040 and 
1040-SR and their instructions, such as legislation enacted after they were published, go to IRS.gov/Form1040.
Free File is the fast, safe, and free way to prepare and e-


In [1]:
import pdfplumber

def extract_text_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += page.extract_text() or ""  # Some pages might be empty
    return text

# Example
pdf_text = extract_text_pdfplumber("./2025_seeded_PDI.pdf")
print(pdf_text[:500])


A novel scheme for measuring the growth of Alfv´en wave parametric decay instability
using counter-propagating waves
Feiyu Li1,∗ Seth Dorfman2,3, and Xiangrong Fu1,4
1 New Mexico Consortium, Los Alamos, New Mexico 87544, USA
2 Space Science Institute, Boulder, Colorado 80301, USA
3 University of California Los Angeles, Los Angeles, California 90095, USA
4 Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
(Dated: May 13, 2025)
The parametric decay instability (PDI) of Alfv´en wave


In [2]:
len(pdf_text)

29869

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.create_documents(pdf_text)


In [1]:
# chunks

In [10]:
chunks[0].page_content

'A'

## Generate embeddings
* use OpenAI, HuggingFace, or Cohere to covert text chunks into embeddings

In [None]:
from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(openai_api_key='')
vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])


RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [None]:
# # using huggingface
# from langchain.embeddings import HuggingFaceEmbeddings  
# embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks]) 

# using sentence_transformers
from langchain.embeddings import SentenceTransformerEmbeddings
embedding_model = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks]) 

# using OpenAI
from langchain.embeddings import OpenAIEmbeddings
embedding_model = OpenAIEmbeddings(openai_api_key='')
vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])     

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

## Store embeddings in Pinecone
* use Pinecone to store the embeddings for efficient retrieval
* ensure you have a Pinecone index created and configured

In [None]:
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
index = pinecone.Index("tax-rag")

# Upload
for i, vector in enumerate(vectors):
    index.upsert([(f"id-{i}", vector, {"text": chunks[i].page_content})])


## Store the vectors in Pinecone

In [None]:
import pinecone
# pinecone.init(api_key="", environment="us-west1-gcp")
pc = pinecone.Pinecone(api_key='')
index = pc.Index("tax-rag")

# Upload
for i, vector in enumerate(vectors):
    index.upsert([(f"id-{i}", vector, {"text": chunks[i].page_content})])




NotFoundException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2025-04', 'x-cloud-trace-context': 'e55bc7cdcacf5dbd205b1d7fcddf94fd', 'date': 'Tue, 01 Jul 2025 02:44:04 GMT', 'server': 'Google Frontend', 'Content-Length': '82', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"NOT_FOUND","message":"Resource tax-rag not found"},"status":404}


## Build the Retrieval-Augmented Generation (RAG) pipeline

In [32]:
def retrieve_context(query):
    embedded_query = embedding_model.embed_query(query)
    results = index.query(vector=embedded_query, top_k=5, include_metadata=True)
    return [match['metadata']['text'] for match in results['matches']]

def generate_answer(query):
    context = "\n\n".join(retrieve_context(query))
    prompt = f"""You are a tax assistant. Answer based only on the following documents:\n\n{context}\n\nQ: {query}\nA:"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['choices'][0]['message']['content']


## Create a Chat UI with Streamlit

In [33]:
import streamlit as st

st.title("Chatbot")

query = st.text_input("Ask me a tax question:")
if query:
    with st.spinner("Searching..."):
        answer = generate_answer(query)
        st.write("**Answer:**", answer)


2025-06-30 20:36:27.212 
  command:

    streamlit run /opt/anaconda3/lib/python3.12/site-packages/ipykernel_launcher.py [ARGUMENTS]
2025-06-30 20:36:27.213 Session state does not function when running a script without `streamlit run`


In [34]:
! streamlit run /opt/anaconda3/lib/python3.12/site-packages/ipykernel_launcher.py

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



      👋 [1mWelcome to Streamlit![0m

      If you’d like to receive helpful onboarding emails, news, offers, promotions,
      and the occasional swag, please enter your email address below. Otherwise,
      leave this field blank.

      [34mEmail: [0m ^C
