**Overview**
- Create an AI chatbot that answers questions about 2025 tax filing, using embedded tax documents as its knowledge base

**Steps**
1. Document embedding
2. Pinecone for vector storage
3. GPT for answer generation
4. Streamlit for chatbot UI

In [1]:
import fitz  # PyMuPDF
import re
import os
from dotenv import load_dotenv

# Load the OpenAI and Pinecone API keys from a local .env file
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")

# Chunk & preprocess documents
* convert PDFs to text
* break text into manageable chunks for embedding (e.g. 500-1000 tokens; note that the splitter used below measures `chunk_size` in characters, not tokens)
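
To make the effect of chunk size and overlap concrete, here is a minimal sketch of fixed-size character windows with overlap. It is a simplified illustration only: `chunk_text` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` works, which prefers to split on separators such as paragraph and sentence boundaries.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 25  # 250 characters of stand-in text
pieces = chunk_text(sample, chunk_size=100, chunk_overlap=20)
print(len(pieces), [len(p) for p in pieces])
```

Each window shares its last `chunk_overlap` characters with the start of the next one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.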

In [5]:
def extract_text_pymupdf(pdf_path):
    text = ""
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

def clean_pdf_text(text):
    # Collapse every run of whitespace (newlines, tabs, spaces) into one space
    text = re.sub(r'\s+', ' ', text)
    # Strip leading/trailing whitespace
    return text.strip()

pdf_text = extract_text_pymupdf("./i1040gi.pdf")

# Clean the extracted text
pdf_text = clean_pdf_text(pdf_text)

# print(pdf_text[:500])  # Print first 500 characters
print(pdf_text)


Line Instructions for Forms 1040 and 1040-SR Also see the instructions for Schedule 1 through Schedule 3 that follow the Form 1040 and 1040-SR instructions. What form to file. Everyone can file Form 1040. Form 1040-SR is available to you if you were born before January 2, 1960. Fiscal year filers. If you are a fiscal year filer using a tax year other than January 1 through December 31, 2024, enter the beginning and ending months of your fiscal year in the entry space provided at the top of page 1 of Form 1040 or 1040-SR. Write-in information. If you need to write a word, code, and/or dollar amount on Form 1040 or 1040-SR to explain an item of income or deduction, but don't have enough space to enter the word, code, and/or dollar amount, you can put an asterisk next to the applicable line number and put a footnote at the bottom of page 2 of your tax return indicating the line number and the word, code, and/or dollar amount you need to enter. Section references are to the Internal Revenu

In [18]:
len(pdf_text)

5045

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# Note the difference between pdf_text and [pdf_text]: create_documents()
# expects a list of texts, not a single string. Given a bare string, it
# iterates over it and treats each character as a separate document.
chunks = text_splitter.create_documents([pdf_text])


In [8]:
chunks

[Document(metadata={}, page_content="Line Instructions for Forms 1040 and 1040-SR Also see the instructions for Schedule 1 through Schedule 3 that follow the Form 1040 and 1040-SR instructions. What form to file. Everyone can file Form 1040. Form 1040-SR is available to you if you were born before January 2, 1960. Fiscal year filers. If you are a fiscal year filer using a tax year other than January 1 through December 31, 2024, enter the beginning and ending months of your fiscal year in the entry space provided at the top of page 1 of Form 1040 or 1040-SR. Write-in information. If you need to write a word, code, and/or dollar amount on Form 1040 or 1040-SR to explain an item of income or deduction, but don't have enough space to enter the word, code, and/or dollar amount, you can put an asterisk next to the applicable line number and put a footnote at the bottom of page 2 of your tax return indicating the line number and the word, code, and/or dollar amount you need to enter. Section 

In [9]:
len(chunks)

32

In [10]:
chunks[0].page_content

"Line Instructions for Forms 1040 and 1040-SR Also see the instructions for Schedule 1 through Schedule 3 that follow the Form 1040 and 1040-SR instructions. What form to file. Everyone can file Form 1040. Form 1040-SR is available to you if you were born before January 2, 1960. Fiscal year filers. If you are a fiscal year filer using a tax year other than January 1 through December 31, 2024, enter the beginning and ending months of your fiscal year in the entry space provided at the top of page 1 of Form 1040 or 1040-SR. Write-in information. If you need to write a word, code, and/or dollar amount on Form 1040 or 1040-SR to explain an item of income or deduction, but don't have enough space to enter the word, code, and/or dollar amount, you can put an asterisk next to the applicable line number and put a footnote at the bottom of page 2 of your tax return indicating the line number and the word, code, and/or dollar amount you need to enter. Section references are to the Internal"

# Generate embeddings
* use OpenAI, HuggingFace, or Cohere to convert text chunks into embeddings
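
Embeddings are useful because texts with similar meaning map to vectors that point in similar directions. The Pinecone index created later in this notebook uses `metric="cosine"`, so here is a minimal sketch of that measure. The three-dimensional vectors are toy stand-ins (real OpenAI embeddings here have 1536 dimensions), and `cosine_similarity` is a hypothetical helper, not part of any library used in the notebook.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: close_vec points the same way as query_vec, far_vec is orthogonal
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

The score depends only on direction, not on vector length, which is why `close_vec` scores as a near-perfect match even though its magnitude differs.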

In [11]:
# from langchain.embeddings import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])


In [12]:
len(vectors)

32

In [13]:
# Alternative embedding backends (commented out; uncomment one to use it).
# Note: all-MiniLM-L6-v2 returns 384-dimensional vectors, so the Pinecone
# index dimension would have to be changed to match.

# Using HuggingFace
# from langchain.embeddings import HuggingFaceEmbeddings
# embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Using sentence_transformers
# from langchain.embeddings import SentenceTransformerEmbeddings
# embedding_model = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Using OpenAI
# from langchain.embeddings import OpenAIEmbeddings
# embedding_model = OpenAIEmbeddings(openai_api_key='')
# vectors = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

# Store embeddings in Pinecone
* use Pinecone to store the embeddings for efficient retrieval
* ensure you have a Pinecone index created and configured

In [None]:
from pinecone import Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)
pc.list_indexes()

# # Delete multiple indexes
# pc.delete_index('tax-rag')
# pc.delete_index('tax-rag2')

[
    {
        "name": "tax-rag3",
        "metric": "cosine",
        "host": "tax-rag3-n6gatrn.svc.aped-4627-b74a.pinecone.io",
        "spec": {
            "serverless": {
                "cloud": "aws",
                "region": "us-east-1"
            }
        },
        "status": {
            "ready": true,
            "state": "Ready"
        },
        "vector_type": "dense",
        "dimension": 1536,
        "deletion_protection": "disabled",
        "tags": null
    }
]

In [None]:
from pinecone import Pinecone, ServerlessSpec
import time
import tqdm

pc = Pinecone(api_key=PINECONE_API_KEY)

# Step 1: Create index if not exists
index_name = "tax-rag3"
if index_name not in [idx.name for idx in pc.list_indexes()]:
    pc.create_index(
        name=index_name,
        dimension=1536,  # must match the embedding size (1536 for text-embedding-ada-002 and text-embedding-3-small)
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"  # free tier region
        )
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

# Step 2: Get the index host
index_info = pc.describe_index(index_name)
index_host = index_info.host

# Step 3: Connect using host
index = pc.Index(host=index_host)

# Step 4: Upsert embeddings
for i in tqdm.tqdm(range(len(vectors))):
    index.upsert([
        (f"id-{i}", vectors[i], {"text": chunks[i].page_content})
    ])


100%|██████████| 32/32 [00:04<00:00,  6.97it/s]
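
The loop above sends one vector per request. Because `index.upsert` accepts a list of vectors, the same records can be grouped into batches to reduce round trips. The following is a minimal sketch under that assumption: `batched` and `build_records` are hypothetical helpers, the toy data stands in for `vectors` and `chunks`, and the batch size of 100 in the commented usage is an illustrative choice rather than a documented limit.

```python
def batched(items, batch_size):
    """Yield successive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_records(vectors, texts):
    """Pair each vector with an id and its source text as metadata."""
    return [
        (f"id-{i}", vec, {"text": text})
        for i, (vec, text) in enumerate(zip(vectors, texts))
    ]


# Stand-in data so the sketch runs without Pinecone or OpenAI access
toy_vectors = [[0.1 * i, 0.2 * i] for i in range(5)]
toy_texts = [f"chunk {i}" for i in range(5)]

records = build_records(toy_vectors, toy_texts)
batches = list(batched(records, batch_size=2))
print([len(b) for b in batches])

# With the real index from this notebook, the upload would look like:
# records = build_records(vectors, [c.page_content for c in chunks])
# for batch in batched(records, batch_size=100):
#     index.upsert(vectors=batch)
```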


In [16]:
# Verify upload
stats = index.describe_index_stats()
print(f"Index now contains {stats.total_vector_count} vectors")

Index now contains 32 vectors
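
With the vectors stored, retrieval means embedding the user's question with the same model and asking the index for its nearest neighbours. A minimal sketch follows, assuming the `index` and `embedding_model` objects created earlier in this notebook. `retrieve_context` is a hypothetical helper, and `top_k=3` is an arbitrary illustrative default.

```python
def retrieve_context(index, embedding_model, question, top_k=3):
    """Return the stored text of the top_k chunks most similar to the question."""
    # Embed the question with the same model used for the document chunks
    query_vector = embedding_model.embed_query(question)
    # include_metadata=True returns the {"text": ...} stored at upsert time
    response = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
    )
    return [match["metadata"]["text"] for match in response["matches"]]


# Usage with the objects defined above (requires network access and API keys):
# contexts = retrieve_context(index, embedding_model, "Who can file Form 1040-SR?")
# for text in contexts:
#     print(text[:200])
```

The question must be embedded with the same model as the chunks. Vectors from a different model live in a different space, and may not even have the index's dimension.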


# Finally, run app.py

Run the following in the terminal:

streamlit run app.py
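
app.py itself is not included in this notebook. For the answer-generation step, the following is only a sketch of one way retrieved chunks might be assembled into a prompt for a GPT model. `build_prompt` and its instruction wording are hypothetical and are not taken from app.py.

```python
def build_prompt(question, contexts):
    """Combine retrieved chunks and the user's question into one prompt string."""
    # Number each chunk so an answer can refer back to its source
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example with sentences taken from the extracted 1040 instructions
sample_contexts = [
    "Everyone can file Form 1040.",
    "Form 1040-SR is available to you if you were born before January 2, 1960.",
]
prompt = build_prompt("Who can file Form 1040-SR?", sample_contexts)
print(prompt)
```

Restricting the model to the supplied context is a common way to keep answers tied to the tax documents rather than to the model's general knowledge.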