# PDF RAG Chatbot with LangChain and Gemini 1.5 Flash

This notebook demonstrates how to build a RAG (Retrieval-Augmented Generation) system that can:
1. Load and process PDF documents
2. Create embeddings and store them in a vector database
3. Retrieve relevant context based on user questions
4. Generate accurate answers using Google's Gemini 1.5 Flash model

## 1. Install Required Packages

In [2]:
# pip3 install langchain langchain-google-genai langchain-community faiss-cpu pypdf
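Before running the rest of the notebook, it can help to confirm that the packages above are importable. The following is a minimal sketch using only the standard library. The module names in `REQUIRED_MODULES` are assumptions about how each pip package is imported (for example, `faiss-cpu` is imported as `faiss`), and `missing_modules` is a hypothetical helper, not part of any library.

```python
from importlib.util import find_spec

# Assumed import names for the pip packages listed above
REQUIRED_MODULES = [
    "langchain",
    "langchain_google_genai",
    "langchain_community",
    "faiss",
    "pypdf",
]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    missing = []
    for name in names:
        try:
            if find_spec(name) is None:
                missing.append(name)
        except (ImportError, ValueError):
            # Treat modules with a broken import state as missing too
            missing.append(name)
    return missing

print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

If the list is not empty, run the install command above and restart the kernel.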

## 2. Import Libraries

In [2]:
import os
import tempfile
from IPython.display import display, Markdown
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory


## 3. Configure API Key

Set your Google API key for accessing Gemini 1.5 Flash and embedding models.

In [3]:
# Set your Google API key here (do not commit a real key to a shared notebook)
os.environ["GOOGLE_API_KEY"] = "YOUR_GOOGLE_API_KEY"

## 4. Define Functions for PDF Processing and RAG Setup

In [6]:
def process_pdf(pdf_path):
    """
    Process a PDF file and create a vector store from its content.

    Args:
        pdf_path (str): Path to the PDF file

    Returns:
        FAISS vector store containing document chunks
    """
    print(f"Loading PDF from {pdf_path}...")

    # Load the PDF
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()

    print(f"Loaded {len(documents)} pages from the PDF.")

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=150
    )
    chunks = text_splitter.split_documents(documents)

    print(f"Split into {len(chunks)} chunks.")

    # Create embeddings and vector store
    print("Creating embeddings and vector store...")
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    vector_store = FAISS.from_documents(chunks, embeddings)

    print("Vector store created successfully!")
    return vector_store

def get_conversation_chain(vector_store):
    """
    Create a conversational retrieval chain using the vector store.

    Args:
        vector_store: FAISS vector store containing document chunks

    Returns:
        ConversationalRetrievalChain for answering questions
    """
    # Initialize the Gemini model
    llm = ChatGoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0.2,
        convert_system_message_to_human=True
    )

    # Initialize memory for conversation history
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        return_messages=True,
        # Explicitly set the output key to 'answer'
        output_key='answer'
    )

    # Create a conversational retrieval chain
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
        memory=memory,
        return_source_documents=True
    )

    return conversation_chain
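To see how `chunk_size` and `chunk_overlap` interact, here is a simplified sketch of overlapping chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that class first tries to split on separators such as paragraphs, lines, and words); `sliding_chunks` is a hypothetical helper that only illustrates the overlap idea with fixed-size character windows.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each window repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. With the settings used above (1500 and 150), roughly 10% of each chunk is shared with its neighbor.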

## 5. Load and Process Your PDF

Specify the path to your PDF file and process it.

In [7]:
# Set the path to your PDF file
pdf_path = "/Users/aaronjoju/Downloads/event_demo .pdf"  # Replace with your actual PDF path

# Process the PDF
vector_store = process_pdf(pdf_path)

# Create the conversation chain
conversation_chain = get_conversation_chain(vector_store)

Loading PDF from /Users/aaronjoju/Downloads/event_demo .pdf...
Loaded 3 pages from the PDF.
Split into 3 chunks.
Creating embeddings and vector store...



Vector store created successfully!
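Once the vector store exists, it can be useful to check what the retriever would return before involving the language model. The sketch below wraps the store's `similarity_search_with_score` method; `preview_matches` is a hypothetical helper, and the example question in the usage comment is made up. With the default FAISS settings the score should be a distance, so lower values mean closer matches.

```python
def preview_matches(store, query, k=3, width=80):
    """Return (score, page, snippet) tuples for the k closest chunks."""
    rows = []
    for doc, score in store.similarity_search_with_score(query, k=k):
        # Collapse whitespace so the snippet fits on one line
        snippet = " ".join(doc.page_content.split())[:width]
        rows.append((float(score), doc.metadata.get("page", "unknown"), snippet))
    return rows

# Usage in this notebook (needs the API key, because the query is embedded first):
# for score, page, snippet in preview_matches(vector_store, "When is the event?"):
#     print(f"{score:.3f}  page {page}  {snippet}")
```

If the top matches look unrelated to the question, adjusting `chunk_size` or the retriever's `k` is usually a better first step than changing the prompt.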




## 6. Chat Interface

Ask questions about the PDF document and get responses from the RAG system.

In [8]:
def ask_question(question):
    """
    Ask a question about the PDF and display the response.

    Args:
        question (str): The question to ask
    """
    print(f"\nQuestion: {question}")
    print("\nThinking...")

    # Get response from conversation chain (invoke() replaces the deprecated direct call)
    response = conversation_chain.invoke({"question": question})

    # Extract the response text
    answer = response["answer"]

    # Build formatted response with sources
    formatted_response = f"### Answer:\n{answer}\n"

    # Add citations if available
    source_docs = response.get("source_documents", [])
    if source_docs:
        formatted_response += "\n### Sources:\n"
        for i, doc in enumerate(source_docs[:3]):  # Limit to top 3 sources
            page_info = f"Page {doc.metadata.get('page', 'unknown')}"
            formatted_response += f"{i+1}. {page_info}\n"

    # Display the formatted response
    display(Markdown(formatted_response))
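Two details of the source list are worth knowing: `PyPDFLoader` stores page numbers starting at 0, and several retrieved chunks can come from the same page. The following sketch shows one way to produce 1-based, de-duplicated page labels. `format_sources` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for retrieved documents.

```python
from types import SimpleNamespace

def format_sources(source_docs, limit=3):
    """Return unique, human-readable page labels in retrieval order."""
    labels = []
    for doc in source_docs:
        page = doc.metadata.get("page")
        # Shift integer page indexes by one; anything else is reported as unknown
        label = f"Page {page + 1}" if isinstance(page, int) else "Page unknown"
        if label not in labels:
            labels.append(label)
        if len(labels) == limit:
            break
    return labels

# Stand-in objects with the same .metadata attribute as retrieved documents
docs = [SimpleNamespace(metadata={"page": p}) for p in (1, 1, 2, 0)]
print(format_sources(docs))
```

The labels could replace the `page_info` loop in `ask_question` if you prefer page numbers that match what a PDF viewer shows.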

## 7. Ask Questions About Your PDF

Use the cell below to ask questions about the content of your PDF. You can run this cell multiple times with different questions.

In [9]:
question = "What is the document about?"  # Replace with your question
ask_question(question)


Question: What is the document about?

Thinking...



### Answer:
The document is about the Computer Science Fest 2025,  an event hosted by the Department of Computer Science at TechVille University.  It details the event's schedule, activities (including a hackathon, coding competitions, paper presentations, workshops, keynote speeches, and networking sessions), rules, prizes, registration process, and contact information.

### Sources:
1. Page 1
2. Page 2
3. Page 0


## 8. Follow-up Questions

The system maintains conversation history, so you can ask follow-up questions.

In [10]:
follow_up_question = "Can you provide more details about the first topic?"  # Replace with your follow-up
ask_question(follow_up_question)


Question: Can you provide more details about the first topic?

Thinking...




### Answer:
The Computer Science Fest 2025 will be held at TechVille University Auditorium, Block A, from April 15-17, 2025, 9:00 AM to 6:00 PM.  It's organized by the Department of Computer Science at TechVille University.  The event includes keynote speeches, technical workshops (requiring pre-registration and with limited seats), a 24-hour hackathon (teams of up to 4 members), coding competitions (individual, various skill levels, using Python, Java, C++, or JavaScript), paper presentations (with a submission deadline of April 10th), tech exhibitions, panel discussions, and networking sessions.  Registration closes April 5th, 2025, and can be done online at www.techvillecsfest.com.  Prizes are awarded for the hackathon, coding competition, and best paper presentation.  Participation certificates will be given to all registered participants.  Contact information is: csfest@techville.edu or +1 (555) 123-4567.

### Sources:
1. Page 0
2. Page 2
3. Page 1
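The follow-up works because `ConversationalRetrievalChain` first rewrites the new question, together with the stored chat history, into a standalone question and only then runs retrieval. To look at what has been stored so far, a sketch like the following can help. `history_lines` is a hypothetical helper; it assumes the chain was built with a `ConversationBufferMemory`, as in `get_conversation_chain` above.

```python
def history_lines(chain):
    """Return the stored conversation as 'role: text' lines."""
    messages = chain.memory.chat_memory.messages
    return [f"{message.type}: {message.content}" for message in messages]

# Usage in this notebook:
# for line in history_lines(conversation_chain):
#     print(line)
# conversation_chain.memory.clear()  # start a fresh conversation
```

Clearing the memory is useful before switching to an unrelated topic, since old turns otherwise influence how the next question is rewritten.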


## 9. Save Vector Store (Optional)

You can save the vector store to disk for later use.

In [11]:
# Save the vector store to disk
vector_store.save_local("faiss_index")
print("Vector store saved to 'faiss_index' directory.")

Vector store saved to 'faiss_index' directory.
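To confirm what was written, you can list the contents of the saved directory. In current `langchain_community` releases, `save_local` should produce two files: `index.faiss` (the vectors) and `index.pkl` (the pickled document store and ID mapping). The sketch below uses only the standard library; `describe_index_dir` is a hypothetical helper.

```python
from pathlib import Path

def describe_index_dir(path):
    """Return {file name: size in bytes} for the files in a saved index directory."""
    directory = Path(path)
    if not directory.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(directory.iterdir()) if p.is_file()}

print(describe_index_dir("faiss_index"))
```

An empty dictionary means the directory does not exist relative to the notebook's working directory.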


## 10. Load Vector Store (Optional)

You can load a previously saved vector store.

In [12]:
# Load the vector store from disk
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
# Add allow_dangerous_deserialization=True to the load_local call
loaded_vector_store = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)
loaded_conversation_chain = get_conversation_chain(loaded_vector_store)
print("Vector store loaded successfully!")

Vector store loaded successfully!
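The `allow_dangerous_deserialization=True` flag is required because part of the saved index is a pickle file, and unpickling data from an untrusted source can execute arbitrary code. Only load indexes you created yourself or received from a trusted source. One possible safeguard, sketched below with the standard library, is to record a checksum of each file right after saving and compare it before loading. `sha256_of` and `index_is_unchanged` are hypothetical helpers, not part of LangChain.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()

def index_is_unchanged(index_dir, expected_digests):
    """Check every listed file in index_dir against its recorded digest."""
    return all(
        (Path(index_dir) / name).is_file()
        and sha256_of(Path(index_dir) / name) == digest
        for name, digest in expected_digests.items()
    )

# Usage in this notebook:
# recorded = {name: sha256_of(f"faiss_index/{name}") for name in ("index.faiss", "index.pkl")}
# ... later, before FAISS.load_local(...):
# assert index_is_unchanged("faiss_index", recorded)
```

A checksum only detects changes made after you recorded it; it does not make an index from an unknown source safe to load.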




In [13]:
# Ask a question using the loaded vector store
question = "What is the main topic of the document?"  # Replace with your question
response = loaded_conversation_chain.invoke({"question": question})

# Extract and display the answer
answer = response["answer"]
print(f"Answer: {answer}")

# Display sources if available
source_docs = response.get("source_documents", [])
if source_docs:
    print("\nSources:")
    for i, doc in enumerate(source_docs[:3]):  # Limit to top 3 sources
        page_info = f"Page {doc.metadata.get('page', 'unknown')}"
        print(f"{i+1}. {page_info}")



Answer: The main topic of the document is the Computer Science Fest 2025, including details about its events, registration, prizes, and contact information.

Sources:
1. Page 1
2. Page 2
3. Page 0


In [None]:
# Path to the new PDF file
new_pdf_path = "/path/to/your/new_pdf.pdf"  # Replace with the actual path to the new PDF

# Process the new PDF to create its vector store
new_vector_store = process_pdf(new_pdf_path)

# Merge the new vector store with the existing one
vector_store.merge_from(new_vector_store)

print("New PDF added successfully. The vector store now contains data from both PDFs.")

# Save the updated vector store to disk (optional)
vector_store.save_local("faiss_index_combined")
print("Updated vector store saved to 'faiss_index_combined' directory.")

In [None]:
# Reload the combined vector store saved in the previous cell
combined_vector_store = FAISS.load_local("faiss_index_combined", embeddings, allow_dangerous_deserialization=True)

# Keep only the chunks that did not come from the first PDF.
# PyPDFLoader records the file path under the "source" metadata key.
# InMemoryDocstore keeps its documents in the private `_dict` attribute.
valid_documents = [
    doc for doc in combined_vector_store.docstore._dict.values()
    if doc.metadata.get("source") != pdf_path  # Exclude documents from the first PDF
]

# Rebuild the vector store from the remaining documents (this re-embeds them)
filtered_vector_store = FAISS.from_documents(valid_documents, embeddings)

# Save the updated vector store
filtered_vector_store.save_local("faiss_index_filtered")
print("Filtered vector store saved to 'faiss_index_filtered' directory.")
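Rebuilding the store sends every remaining chunk through the embedding API again. An alternative is to remove the unwanted entries in place. The sketch below collects the IDs of chunks from a given source; `ids_from_source` is a hypothetical helper, and it assumes the store exposes `index_to_docstore_id` and `docstore.search` as the LangChain FAISS wrapper does. The usage comment further assumes that `FAISS.delete` is available in your installed version.

```python
def ids_from_source(store, source_path):
    """Collect docstore IDs of chunks whose 'source' metadata equals source_path."""
    matching = []
    for doc_id in store.index_to_docstore_id.values():
        doc = store.docstore.search(doc_id)
        # search() returns a message string instead of a document for unknown IDs
        metadata = getattr(doc, "metadata", None) or {}
        if metadata.get("source") == source_path:
            matching.append(doc_id)
    return matching

# Usage in this notebook, on the merged store that holds both PDFs:
# stale_ids = ids_from_source(vector_store, pdf_path)
# if stale_ids:
#     vector_store.delete(stale_ids)
```

Whichever approach you use, save the result under a new directory name so the combined index stays available if something goes wrong.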

## 11. Further Actions After Saving the Vector Store

After saving the vector store, you have several options depending on your needs:

1. Deployment for production use:
   - Package the saved vector store (the "faiss_index" directory in this case) along with your application code.
   - For high-traffic applications, consider a dedicated vector database (such as Weaviate, Pinecone, or Chroma) instead of a local FAISS index. FAISS is great for experimentation and development, but a local index lives inside a single process and has to be saved and reloaded by hand.
   - Deploy your application to a cloud platform (e.g., Google Cloud, AWS, Azure) or a server so that users can interact with the chatbot.
   - Set up API endpoints to handle incoming questions and return answers from your chatbot.

2. Updating the vector store:
   - If new PDF documents are added or existing documents change, re-process them and update the vector store. Load the existing vector store (as demonstrated in step 10), add the new chunks with `vector_store.add_documents()` (or merge a second store with `merge_from()`, as in the merge cell above), then save the updated vector store again.

3. Offline usage:
   - The saved index can be loaded from disk without re-embedding the documents. In this notebook, however, each question is still embedded through the Google API and answered by Gemini, so asking questions requires network access. A fully offline setup would need local embedding and language models.

Why do we need a vector store?

A vector store makes similarity search over the document efficient. Here's why that matters:

- Similarity search: The text is converted into vector embeddings (numerical representations). When a user asks a question, the question is also converted to a vector. The vector store then finds the document chunks whose vectors are closest to the question's vector. Unlike keyword search, this can match chunks that are relevant even when they use different wording than the question.

- Contextual relevance: The retrieved chunks are passed to the language model (here, Gemini) as context, which helps it generate accurate answers grounded in the document. Without retrieval, the LLM would only have access to its pre-trained knowledge, which does not include the contents of your particular PDF.

- Scalability: Vector stores enable efficient searches across large datasets. For a PDF that is hundreds or thousands of pages long, sending the whole text to the model for every question would be slow and could exceed the model's context window.

- Speed: Similarity search compares vectors mathematically (by computing distances) rather than scanning the full text, and vector indexes are designed to keep this fast as the number of chunks grows.
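The similarity search idea can be sketched in a few lines of plain Python. The three-dimensional "embeddings" below are made up for illustration (real embeddings have hundreds of dimensions), and the sketch ranks by cosine similarity, whereas the FAISS store in this notebook should use Euclidean distance by default. `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, chunk_vectors, k=2):
    """Return the k chunk names ranked by similarity to the query."""
    ranked = sorted(
        chunk_vectors,
        key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
        reverse=True,
    )
    return ranked[:k]

# Made-up vectors standing in for embedded document chunks
chunks = {
    "schedule": [0.9, 0.1, 0.0],
    "prizes": [0.1, 0.9, 0.1],
    "contact": [0.0, 0.2, 0.9],
}
query = [0.8, 0.3, 0.0]  # pretend embedding of a question about dates
print(top_k(query, chunks))
```

The query vector points in nearly the same direction as the "schedule" vector, so that chunk is ranked first. A vector store performs the same kind of comparison, but over many more vectors and with an index built for it.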


In [14]:
import json
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

def process_json(json_data):
    """
    Process JSON data and create a vector store from its content.

    Args:
        json_data (dict): JSON data to process

    Returns:
        FAISS vector store containing document chunks
    """
    # Convert JSON data to a string representation
    json_text = json.dumps(json_data, indent=2)

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=150
    )
    chunks = text_splitter.split_text(json_text)

    # Create embeddings and vector store
    print("Creating embeddings and vector store...")
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    vector_store = FAISS.from_texts(chunks, embeddings)

    print("Vector store created successfully!")
    return vector_store

# Example JSON data
json_data = {
    "name": "Tech Fest 2025",
    "description": "Annual technology festival showcasing student innovations.",
    "conductedDates": {
        "start": "2025-04-15T09:00:00Z",
        "end": "2025-04-17T17:00:00Z"
    },
    "targetedAudience": {
        "departments": ["Computer Science", "Electrical Engineering"],
        "courses": ["B.Tech", "M.Tech"]
    },
    "organizingInstitution": "XYZ University",
    "maximumStudents": 200,
    "maxEventsPerStudent": 3,
    "organizingCollege": "ABC College of Engineering",
    "generalRules": ["No outside food", "ID required for entry"],
    "contactInfo": {
        "email": "events@xyz.edu",
        "phone": "+1234567890"
    },
    "subEvents": [
        {
            "name": "Hackathon",
            "overview": "24-hour coding challenge",
            "venue": "Main Hall",
            "prizePools": [{ "rank": 1, "amount": 1000 }]
        },
        {
            "name": "Robotics Workshop",
            "overview": "Hands-on robotics session",
            "venue": "Lab 3",
            "prizePools": []
        }
    ]
}

# Process the JSON data
vector_store = process_json(json_data)

# Rebuild the conversation chain on the JSON vector store.
# ask_question() reads the global conversation_chain, so rebinding it here switches the chatbot to the JSON data.
conversation_chain = get_conversation_chain(vector_store)

# Example question
question = "What is the event description?"
ask_question(question)

Creating embeddings and vector store...



Vector store created successfully!

Question: What is the event description?

Thinking...



### Answer:
Annual technology festival showcasing student innovations.

### Sources:
1. Page unknown
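The source shows "Page unknown" because `FAISS.from_texts` was called without metadata, so the chunks carry no `page` entry. One option is to turn the JSON into one short text record per field and attach the field's path as metadata. The sketch below is one possible approach, not the only one; `flatten_json` is a hypothetical helper, and the `"path"` and `"source"` metadata keys are arbitrary names chosen for illustration.

```python
def flatten_json(value, prefix=""):
    """Yield (path, text) pairs for every leaf value in nested JSON data."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_json(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_json(child, f"{prefix}[{index}]")
    else:
        yield prefix, f"{prefix}: {value}"

sample = {
    "name": "Tech Fest 2025",
    "subEvents": [{"name": "Hackathon", "venue": "Main Hall"}],
}
records = list(flatten_json(sample))
texts = [text for _, text in records]
metadatas = [{"source": "json", "path": path} for path, _ in records]
print(texts)

# In this notebook the two lists could then be passed to
# FAISS.from_texts(texts, embeddings, metadatas=metadatas)
```

With metadata attached, `ask_question` could report `doc.metadata.get("path")` instead of a page number. Note that empty lists (such as an empty `prizePools`) produce no records, and one record per field may be too fine-grained for large documents, where grouping related fields into a single record tends to retrieve better.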
