# Creation of Chatbot with RAG and OpenAI

**Core Goal:**

Build a domain-specific Q&A chatbot using OpenAI’s API for both embeddings and chat completions, plus Gradio for the front end.


# Let's understand the process


**Step 1:** Install and import packages.

**Step 2:** Set your OpenAI API key.

**Step 3:** Load or create text documents.

**Step 4:** (Optional) Split documents into chunks.

**Step 5:** Create embeddings with OpenAIEmbeddings and store them in FAISS.

**Step 6:** Instantiate an OpenAI LLM with OpenAI().

**Step 7:** Define a custom prompt template to integrate conversation + retrieved chunks.

**Step 8:** Build a ConversationalRetrievalChain with memory.

**Step 9:** Ask multi-turn queries and observe how the chain fetches context from your documents.

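**Hugging Face and OpenAI components are interchangeable:**

Embeddings: HuggingFaceEmbeddings ↔ OpenAIEmbeddings

LLM: HuggingFaceHub ↔ OpenAI

The code cells below run the Hugging Face variants (HuggingFaceEmbeddings and HuggingFaceHub). To use OpenAI instead, swap in the two OpenAI classes; everything else in LangChain (splitting, vector store, retrieval chain, memory) remains the same.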

# Adding functionality for uploading and chunking PDF documents

# 1. Create two small PDFs: RAG and LangChain


Document1: RAG.pdf

Title: Retrieval-Augmented Generation (RAG) Summary

Retrieval-Augmented Generation is a technique that combines
Language Models (like GPT) with external knowledge sources.

Instead of relying solely on the model's internal parameters,
RAG fetches relevant information from a vector database
(or other retrieval mechanisms) and feeds it into
the LLM’s prompt.

This approach helps reduce hallucinations and
improves factual accuracy by grounding outputs
in real documents.


Document2: LangChain.pdf

Title: LangChain Overview

LangChain is a development framework designed
to streamline the creation of powerful LLM applications.

It provides abstractions for prompts, chains, memory,
agents, and retrieval strategies. By combining these
components, developers can quickly assemble AI-driven
tools such as chatbots, question-answering systems,
and more.

LangChain also integrates with different LLM providers
(e.g., OpenAI, Hugging Face) and vector databases
(e.g., FAISS, Pinecone).

# 2. Upload the PDFs into Colab

Once you have your PDFs locally, work through the following steps in Google Colab:

STEPS:
1. Set the API key (OpenAI, or Hugging Face for the models used below).
2. Select the relevant PDFs and load them into the Colab notebook, either by uploading them or by fetching them from Git.
3. Create sentence-transformer embeddings.
4. Review the intermediate functions and adjust them if any change is required.
5. Test with custom queries.
6. Incorporate Gradio.

In [1]:
!pip install langchain openai faiss-cpu pypdf




In [2]:
!pip install langchain-community


In [3]:
!pip install pymupdf



In [32]:
!pip install gradio


In [1]:
import os
import requests
import glob

# Third-party integrations live in langchain_community (installed above)
from langchain_community.document_loaders import (
    OnlinePDFLoader,
    PyMuPDFLoader,
    UnstructuredPDFLoader,
)
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import CharacterTextSplitter

from langchain_community.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain


In [2]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFaceHub
import gradio as gr


In [3]:
from google.colab import files

uploaded = files.upload()  # This will prompt you to upload multiple files

Saving Securitisation.pdf to Securitisation (1).pdf


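Select your PDFs (for example RAG_Summary.pdf and LangChain_Overview.pdf) from your local machine. The run shown here uploaded Securitisation.pdf, which Colab saved as "Securitisation (1).pdf" because a file with that name already existed.

After the upload, Colab lists the files it received. Make sure they appear in the Colab file explorer in the left panel.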
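
Because Colab may rename a re-uploaded file (as with "Securitisation (1).pdf" above), hard-coded file names are fragile. A minimal sketch of one alternative is to pick up whatever PDFs are in the working directory. The helper name `find_pdfs` is hypothetical and not part of the original notebook.

```python
import glob
import os


def find_pdfs(directory="."):
    """Return the sorted paths of all .pdf files directly inside `directory`."""
    pattern = os.path.join(directory, "*.pdf")
    return sorted(glob.glob(pattern))


# In Colab, uploaded files land in the current working directory.
for path in find_pdfs("."):
    print(path)
```

The resulting list could be used in place of a manually typed `pdf_files` list.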

# 3. Parse and Chunk the PDF Text

LangChain offers multiple PDF loaders, such as PyPDFLoader and PyMuPDFLoader; the pypdf package can also extract text from each PDF directly. This notebook uses PyMuPDFLoader.

Then we’ll split the text into smaller chunks.

# 3.1 Install and Import Dependencies

The required packages were installed and imported in the cells above.

# 3.2 Parse the PDFs

For each PDF, we create a PyMuPDFLoader to load the text as Document objects.

We’ll combine the text of all documents into a single list.

In [6]:
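# Note: a GitHub "blob" URL serves an HTML page, not the PDF itself; point this at the raw file URL.
# OnlinePDFLoader also depends on the `unstructured` package (see the error below).
url = "https://github.com/adimehta97/NLP_Final_Project/blob/main/Securitisation.pdf"
loader = OnlinePDFLoader(url)  # replace with your URL
pages = loader.load_and_split()  # returns a list of Documents
all_texts = []

for page in pages:
    all_texts.append(page.page_content)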

ImportError: unstructured package not found, please install it with `pip install unstructured`

In [4]:
# Let's assume both files are uploaded in Colab's current working directory
pdf_files = ["Securitisation (1).pdf"]

all_texts = []

for pdf_file in pdf_files:
    loader = PyMuPDFLoader(pdf_file)  # local loader; replaces OnlinePDFLoader
    pages = loader.load_and_split()  # one Document per page, with overly long pages split further
    for page in pages:
        # each 'page' is a Document with .page_content
        all_texts.append(page.page_content)

print(f"Loaded {len(all_texts)} document segments from PDFs.")


Loaded 3 document segments from PDFs.


# Explanation:

loader.load_and_split(): PyMuPDFLoader produces one Document object per page; load_and_split() additionally splits any page that exceeds the default text splitter's chunk size.

We’re grabbing the .page_content from each Document and appending it to all_texts.

(If your PDFs are short, you might get 1 page per PDF. That’s fine.)

# 3.3 (Optional) Additional Text Splitting

If the text on each page is still too long, we can apply finer-grained chunking. After the cell below runs, chunked_texts is a list of strings, each roughly 500 characters long, with up to 100 characters of overlap between neighbouring chunks.

In [5]:
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=100
)

chunked_texts = []
for text in all_texts:
    chunks = text_splitter.split_text(text)
    chunked_texts.extend(chunks)

print(f"Total chunks after splitting: {len(chunked_texts)}")


Total chunks after splitting: 28
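
To build intuition for how chunk_size and chunk_overlap interact, here is a deliberately simplified sliding-window splitter. It is not the algorithm CharacterTextSplitter uses (that one first splits on the separator and then merges pieces); the function name `sliding_chunks` and the sample text are made up for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split `text` into fixed-size character windows that overlap.

    Simplified illustration only: separators such as newlines are ignored.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1,200 characters of dummy text
chunks = sliding_chunks(sample, chunk_size=500, chunk_overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Each window starts 400 characters after the previous one, so the last 100 characters of one chunk reappear at the start of the next. That repetition is what lets a sentence cut at a chunk boundary still be retrieved intact.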


# 4. Create Embeddings and the Vector Store
Using a Hugging Face sentence-transformer model for embeddings (OpenAIEmbeddings can be swapped in):

In [6]:
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [7]:
vectorstore = FAISS.from_texts(chunked_texts, embeddings)
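
Conceptually, the vector store keeps one vector per chunk and returns the chunks whose vectors lie closest to the query vector. The brute-force sketch below shows that idea with plain Python lists and Euclidean distance. The 3-dimensional vectors are invented for illustration (all-MiniLM-L6-v2 produces 384-dimensional vectors), `top_k` is a hypothetical helper, and FAISS performs this kind of search far more efficiently.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose vectors are closest to query_vec."""
    ranked = sorted(
        zip(chunk_vecs, chunks),
        key=lambda pair: l2_distance(query_vec, pair[0]),
    )
    return [chunk for _, chunk in ranked[:k]]


# Toy "embeddings" (made up; real ones come from the embedding model).
chunks = ["securitization basics", "subprime crisis", "leasing example"]
chunk_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]

print(top_k(query_vec, chunk_vecs, chunks, k=2))
# ['securitization basics', 'subprime crisis']
```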

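# 5. Set Up the LLM

This run uses a model hosted on the Hugging Face Hub. facebook/bart-large is only an example: it is not instruction-tuned, so an instruction-following model (or OpenAI's GPT-3.5 or GPT-4, whichever your key has access to) will give better answers. For a basic example: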

In [30]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = "YOUR_HF_API_TOKEN"  # placeholder: never paste a real token into a shared notebook

llm = HuggingFaceHub(
    repo_id="facebook/bart-large",  # Example model
    task="text-generation",  # Add this line to specify the task
    model_kwargs={"temperature": 0.7, "max_length": 512}
)
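
Since API tokens should not be hard-coded in a notebook that gets shared, one option is to read the token from the environment and fall back to a hidden prompt. This is a minimal sketch under that assumption; the helper name `load_token` is hypothetical.

```python
import os
from getpass import getpass


def load_token(env_var="HUGGINGFACEHUB_API_TOKEN", prompt_if_missing=True):
    """Return the token from the environment, optionally prompting for it."""
    token = os.environ.get(env_var, "").strip()
    if not token and prompt_if_missing:
        # getpass hides the typed value, so it is not stored in the cell output.
        token = getpass(f"Enter {env_var}: ").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set")
    os.environ[env_var] = token  # expose it to libraries that read the variable
    return token
```

Calling `load_token()` once before creating the LLM would replace the hard-coded assignment.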

# 6. Build a Custom Prompt (Optional)


Let’s define a prompt template to incorporate retrieved context and conversation history:

In [31]:
prompt_template = """
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:
{chat_history}

Relevant context from PDFs:
{context}

User's question: {question}

Assistant:
"""

custom_prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["chat_history", "context", "question"]
)
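
To see what the LLM actually receives, the same template text can be filled in by hand with Python's str.format, which substitutes the three placeholders in the same way. The history and context values below are sample strings chosen for illustration, not output from the chain.

```python
prompt_template = """
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:
{chat_history}

Relevant context from PDFs:
{context}

User's question: {question}

A:
"""

# Sample values (illustrative only).
filled = prompt_template.format(
    chat_history="Human: What is securitisation?\nAssistant: (previous answer)",
    context=(
        "Securitization is the process in which certain types of assets are "
        "pooled so that they can be repackaged into interest-bearing securities."
    ),
    question="Who underwrites them?",
)
print(filled)
```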


# 7. Create the Conversational Retrieval Chain

Now let’s set up memory for multi-turn chat and a chain that uses the vector store:

In [32]:
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2}),
    memory=memory,
    # Pass the custom prompt via combine_docs_chain_kwargs (not combine_docs_chain)
    combine_docs_chain_kwargs={"prompt": custom_prompt},
    verbose=True
)



In [35]:
# Custom Prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer:"""
QA_PROMPT = PromptTemplate(template=template, input_variables=["context", "question"])



In [37]:
chain = ConversationalRetrievalChain.from_llm(
    llm,
    vectorstore.as_retriever(),
    # The custom prompt goes in via combine_docs_chain_kwargs; passing qa_prompt caused an error here
    combine_docs_chain_kwargs={"prompt": QA_PROMPT},
    return_source_documents=False,
)

chat_history = []  # Initialize chat history


In [33]:
def predict(message, history):
    """Generate a response by running the message through the retrieval chain."""
    # The chain's own memory tracks the conversation, so Gradio's `history` is not passed on.
    result = conversational_chain({"question": message})
    return result["answer"]


In [34]:
# Create the Gradio interface
iface = gr.ChatInterface(
    fn=predict,
    title="BART Chatbot",
    description="Chat with a BART-large model.",
    examples=["What is securitisation?"],  # example prompts shown to the user
    chatbot=gr.Chatbot(height=500), # Adjust height as needed
)

# Launch the interface
iface.launch()





Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://00e6de4b37a74774d6.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




# Explanation:

search_kwargs={"k": 2}: We retrieve the top 2 chunks for each query.

memory=memory: Tracks the conversation so the user can ask follow-ups referencing prior Q&A.

combine_docs_chain_kwargs={"prompt": custom_prompt}: We feed the conversation + retrieved chunks into our custom prompt, then call the LLM.
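
The idea behind the conversation memory can be sketched with a toy class: each turn is appended to a list, and the list is rendered as text to fill {chat_history}. This is a hypothetical stand-in (`SimpleBufferMemory`) with made-up answers, not LangChain's implementation.

```python
class SimpleBufferMemory:
    """Toy stand-in for a conversation buffer: stores every turn in order."""

    def __init__(self):
        self.turns = []  # list of (question, answer) tuples

    def save(self, question, answer):
        """Record one completed question/answer turn."""
        self.turns.append((question, answer))

    def as_text(self):
        """Render the stored turns as the text that fills {chat_history}."""
        lines = []
        for question, answer in self.turns:
            lines.append(f"Human: {question}")
            lines.append(f"Assistant: {answer}")
        return "\n".join(lines)


memory_demo = SimpleBufferMemory()
memory_demo.save("What is securitisation?", "Pooling assets into securities.")
memory_demo.save("Who underwrites them?", "I don't know.")
print(memory_demo.as_text())
```

A plain buffer like this grows with every turn, so long conversations eventually consume much of the prompt; that is one reason to keep answers short or to cap the number of stored turns.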

# 8. Test the Chatbot

We can now do a multi-turn conversation referencing the PDF content:

In [30]:
# 1st question
query1 = "What is securitisation?"
response1 = conversational_chain({"question": query1})
print("User:", query1)
print("Assistant:", response1["answer"])

# 2nd question
query2 = "And how does this impact the global financial system?"
response2 = conversational_chain({"question": query2})
print("\nUser:", query2)
print("Assistant:", response2["answer"])

# 3rd question
query3 = "Who underwrites them?"
response3 = conversational_chain({"question": query3})
print("\nUser:", query3)
print("Assistant:", response3["answer"])


> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:

You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:


Relevant context from PDFs:
Securitization started as a way for ﬁnancial institutions and corporations to ﬁnd new
sources of funding—either by moving assets oK their balance sheets or by borrowing
against them to reﬁnance their origination at a fair market rate. It reduced their
borrowing costs and, in the case of banks, lowered regulatory minimum capital
requirements. For example, suppose a leasing company needed to raise cash. Under

The subprime mortgage crisis that began in 2007 has given the decades-old concept of
securitization a bad name. Securitization is the process in which certain types of assets
are pooled so that they can be repackaged into interest-bearing securities. The


> Finished chain.

> Finished chain.
User: What is securitisation?
Assistant: 
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:


Relevant context from PDFs:
Securitization started as a way for ﬁnancial institutions and corporations to ﬁnd new 
sources of funding—either by moving assets oK their balance sheets or by borrowing 
against them to reﬁnance their origination at a fair market rate. It reduced their 
borrowing costs and, in the case of banks, lowered regulatory minimum capital 
requirements. For example, suppose a leasing company needed to raise cash. Under

The subprime mortgage crisis that began in 2007 has given the decades-old concept of 
securitization a bad name. Securitization is the process in which certain types of assets 
are pooled so that they can be repackaged into interest-bearing securities. The interest 
and principal payments from th

User: And how does this impact the global financial system?
Assistant: 
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:

Human: What is securitisation?
Assistant: 
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:


Relevant context from PDFs:
Securitization started as a way for ﬁnancial institutions and corporations to ﬁnd new 
sources of funding—either by moving assets oK their balance sheets or by borrowing 
against them to reﬁnance their origination at a fair market rate. It reduced their 
borrowing costs and, in the case of banks, lowered regulatory minimum capital 
requirements. For example, suppose a leasing company needed to raise cash. Under

The subprime mortgage crisis that began in 2007 has given the

User: Who underwrites them?
Assistant: 
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:

Human: What is securitisation?
Assistant: 
You are a helpful AI assistant.
Use the following context to answer the user's question.
If you don't know the answer, say you don't know.

Conversation so far:


Relevant context from PDFs:
Securitization started as a way for ﬁnancial institutions and corporations to ﬁnd new 
sources of funding—either by moving assets oK their balance sheets or by borrowing 
against them to reﬁnance their origination at a fair market rate. It reduced their 
borrowing costs and, in the case of banks, lowered regulatory minimum capital 
requirements. For example, suppose a leasing company needed to raise cash. Under

The subprime mortgage crisis that began in 2007 has given the decades-old concept of 
securit

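You should see the chain retrieving relevant chunks from your PDFs (here Securitisation.pdf; RAG_Summary.pdf and LangChain_Overview.pdf if you used the sample documents), merging them with the conversation history, and passing the result to the LLM. In the run above, the "answers" are just the formatted prompt echoed back and cut off by the length limit: facebook/bart-large is not an instruction-tuned model, so it does not actually answer the question. The retrieval step itself works as intended; swapping in an instruction-following model should produce real answers.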

# What’s Happening?

Loading PDFs: We read the text from each PDF page.

Chunking: (Optional) we further split pages into ~500-char segments.

Embeddings: We embed each chunk using HuggingFaceEmbeddings (or OpenAIEmbeddings()).

Vector Store: We store these embeddings in FAISS for semantic retrieval.

Conversational Retrieval: For each user question, the chain uses the conversation context + the top matching chunks from the PDFs to produce a grounded answer.


# What we have done

By following these steps, you’ve created a RAG chatbot that references PDF documents rather than plain text. Here’s the overall flow:

Upload PDFs into Colab.

Parse them (via PyMuPDFLoader).

Chunk the text if needed.

Embed with HuggingFaceEmbeddings (or OpenAIEmbeddings).

Store in FAISS.

Initialize an LLM (Hugging Face Hub here, or OpenAI) + memory.

Build a ConversationalRetrievalChain.

Ask questions and watch it pull info from the PDFs.

Feel free to modify the text chunking strategy, the prompt template, and the number of retrieved chunks (k) to suit your data and desired level of detail. The same steps apply to more PDFs or larger corpora (within Colab’s memory limits).



# How to extend

Example Workflow Outline

Step 1: Upload small text/PDF files (e.g., a short HR manual, a 3-page finance policy, a mini case study).

Step 2: Parse/extract text in Colab (using pypdf for PDFs, or direct text for .txt).

Step 3: Chunk the text if needed (500–1,000 tokens) to improve retrieval.

Step 4: Embed each chunk with a Hugging Face model (e.g., "sentence-transformers/all-MiniLM-L6-v2").

Step 5: Store embeddings in a FAISS or Chroma vector DB.

Step 6: Create a LangChain chain (e.g., ConversationalRetrievalChain) with a simple custom prompt instructing the LLM to ground answers in retrieved chunks.

Step 7: Ask queries. The chain retrieves relevant doc pieces and the LLM references them in its final answer.

Step 8: (Optional) Add a small Gradio UI for user-friendly Q&A.

# What You Can Add or Modify

Swap Models: Use a more chat-oriented Hugging Face model (e.g., facebook/blenderbot-400M-distill or google/flan-t5-large), or a more advanced model if you have access.

Real Documents: Parse PDFs, Word docs, or other formats, then embed them. You might use PyPDFLoader or UnstructuredLoader from LangChain for robust parsing.

Improve Prompt: Add system instructions, style preferences, or mention a knowledge cutoff to refine the assistant’s answers.

Chunking Strategy: Tune chunk_size and chunk_overlap to suit your doc length and structure.

Advanced Retrieval: Explore more complex chain types (e.g., “map_reduce” or “refine”) to better combine multiple doc chunks.
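
As a rough picture of the difference between these strategies: "stuff" (the StuffDocumentsChain seen in the log above) places every retrieved chunk into one prompt, while "map_reduce" queries each chunk separately and then combines the partial answers. The sketch below is conceptual only. The stub `fake_llm`, the prompt wording, and the sample chunks are assumptions for illustration, not LangChain's implementation.

```python
def fake_llm(prompt):
    """Stub model for illustration: echoes the first 60 characters of the prompt."""
    return prompt[:60]


def stuff(chunks, question, llm):
    """'stuff': put every chunk into a single prompt and call the model once."""
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def map_reduce(chunks, question, llm):
    """'map_reduce': query each chunk separately, then combine the partial answers."""
    partials = [llm(f"Context:\n{c}\n\nQuestion: {question}") for c in chunks]
    combined = "\n".join(partials)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")


# Toy chunks (made up for the demo).
demo_chunks = ["Securitization pools assets.", "It lowered banks' borrowing costs."]
print(stuff(demo_chunks, "What is securitisation?", fake_llm))
print(map_reduce(demo_chunks, "What is securitisation?", fake_llm))
```

With n chunks, "stuff" makes one model call but must fit everything into the context window, whereas this "map_reduce" sketch makes n + 1 calls and keeps each prompt small.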
