# RAG

Retrieval-Augmented Generation (RAG) is a technique in which the LLM is given extra context retrieved from external data sources (such as PDFs, websites, or text files) so that it can answer questions more accurately, especially when the information is not present in the model's training data.


Informally, imagine the LLM is like a student answering questions.

Traditional LLM: Answers from memory.

RAG LLM: First opens a textbook, goes to the right chapter, then answers.

We will use OpenAI and LangChain to demonstrate RAG.

In [None]:
openai_api_key = '<your_api_key>'


In [None]:
!pip install langchain-openai



In [None]:
from langchain_openai import ChatOpenAI
import os

os.environ["OPENAI_API_KEY"] = openai_api_key

llm = ChatOpenAI(
    model="gpt-3.5-turbo",  # or gpt-4
    temperature=0.7,        # optional
    # openai_api_key=openai_api_key,  # alternative to setting the environment variable
)

# Now you can use it in a chain, or call it directly as below
response = llm.invoke("When was Acme Inc. founded?")
print(response.content)

There have been several companies with the name Acme Inc. founded throughout history, so it depends on which specific company you are referring to. Can you please provide more context or details?


Sample outputs obtained from the above:
- "It is unclear which specific company named Acme Inc. you are referring to, as there are many companies with similar names. Can you please provide more information or context so I can accurately answer your question?"

- "It is not possible to determine the specific founding date of Acme Inc. as it is a fictional company commonly used in cartoons, comic strips, and other forms of media. The name "Acme" is often used as a generic placeholder for a company in popular culture."


Other questions that may be asked:

- What new product line was launched in 2024 in Acme Inc?
- How frequently does the HR team of Acme Inc meet and what do they assess?

If the company is very specific, private, or newer than the model's training data, the LLM cannot answer reliably from memory alone. Hence, we will use RAG to supply the relevant facts.

# RAG steps
1. Read the data for retrieval
2. Split the text
3. Produce Embeddings for splits
4. Store the embeddings in a vectorDB
5. Create a retriever to fetch the relevant chunks from the vector DB
6. Combine the LLM and the retriever, and produce results.
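
To make the six steps concrete before bringing in the libraries, here is a minimal sketch in plain Python. Everything in it is a stand-in and not part of LangChain: `toy_embed` counts words instead of calling an embedding model, a Python list plays the role of the vector DB, and the last step only builds the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter

# Step 1: read the data (here, an in-memory string instead of a file).
text = "\n".join([
    "Acme Inc. was founded in 1987 in Helsinki, Finland.",
    "In 2024, Acme released a new product line: Jet Sneakers.",
    "Acme's HR team meets every 2 weeks to assess wellness metrics.",
])

# Step 2: split the text into chunks (here, one line per chunk).
toy_chunks = text.split("\n")

def toy_embed(s):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", s.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 3 and 4: embed every chunk and store (vector, chunk) pairs.
toy_store = [(toy_embed(c), c) for c in toy_chunks]

# Step 5: retrieve the k chunks whose vectors are most similar to the query vector.
def retrieve(query, k=1):
    q = toy_embed(query)
    ranked = sorted(toy_store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# Step 6: combine the retrieved context with the question into a prompt for the LLM.
question = "When was Acme founded?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

The rest of this notebook replaces each stand-in with the real component: document loaders, a text splitter, OpenAI embeddings, FAISS, and a retrieval chain.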

# 1. Read the data for retrieval

Before doing RAG, we need data. There are different ways to read the data:

1.1. Using simple text\
1.2. Using a text file\
1.3. Using a pdf file

## 1.1 Using a simple text

### `Document` class

LangChain wraps all content in Document objects, which hold both text and optional metadata.

In [None]:
from langchain.schema import Document

doc1 = Document(page_content="This is the first document", metadata={"source": "file1.txt"})


In [None]:
doc1

Document(metadata={'source': 'file1.txt'}, page_content='This is the first document')

In [None]:
doc1.page_content

'This is the first document'

In [None]:
doc1.metadata

{'source': 'file1.txt'}

Each document is a `Document` object with two attributes: `page_content` (the text) and `metadata` (a dictionary with information such as the source).
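
As a rough illustration of why the two attributes are kept together, the sketch below uses a hypothetical `SimpleDocument` dataclass (a stand-in written for this example, not LangChain's class) and filters documents by their source.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two attributes as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

sample_docs = [
    SimpleDocument("This is the first document", {"source": "file1.txt"}),
    SimpleDocument("This is the second document", {"source": "file2.txt"}),
]

# Metadata travels with the text, so documents can be filtered or cited by source.
from_file1 = [d for d in sample_docs if d.metadata.get("source") == "file1.txt"]
print(from_file1[0].page_content)
```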


## 1.2 Using a txt file

### `document_loaders`

Loaders help you load documents from .txt, .pdf, .csv, URLs, etc.



In [None]:
!pip install langchain langchain-community



In [None]:
from langchain.document_loaders import TextLoader

# In recent LangChain versions the loaders live in langchain-community.
# If the import above fails or warns, use:
# from langchain_community.document_loaders import TextLoader

loader = TextLoader("RAG_file.txt")
docs1 = loader.load()

In [None]:
# docs1 is a list of Document objects
for doc in docs1:
    print(doc.page_content)   # The text
    print(doc.metadata)       # File name, etc.

Acme Inc. was founded in 1987 in Helsinki, Finland. It specializes in anti-gravity footwear and rocket-powered pogo sticks.

In 2024, Acme released a new product line: "Jet Sneakers", designed for low-orbit recreational use.

As per internal policy, Acme's HR team meets every 2 weeks to assess wellness metrics of staff based on holographic surveys.

{'source': 'RAG_file.txt'}




Even when loading a single .txt file, `TextLoader.load()` returns a list containing one `Document` object.

LangChain loaders always return a list of documents, whether you load one file or many, so the later steps work the same way in both cases.



## 1.3 Using a pdf file


In [None]:
!pip install pypdf



In [None]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("RAG_file.pdf")
docs2 = loader.load()


In [None]:
# docs2 is again a list of Document objects (PyPDFLoader creates one per page)
for doc in docs2:
    print(doc.page_content)   # The text
    print(doc.metadata)       # File name, etc.

Acme Inc. was founded in 1987 in Helsinki, Finland. It specializes in anti-gravity footwear and
rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers",
designed for low-orbit recreational use. As per internal policy, Acme's HR team meets every 2
weeks to assess wellness metrics of staff based on holographic surveys.
{'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}


Note:
- We can also read csv files using `CSVLoader` from `langchain.document_loaders` or `langchain_community.document_loaders`.
- We can also read from html files using `UnstructuredHTMLLoader` from `langchain.document_loaders` or `langchain_community.document_loaders`.
- We can also read from online PDF files using `OnlinePDFLoader` from `langchain.document_loaders` or `langchain_community.document_loaders`.


## 1.4 Mixing Different File Types

You can load different formats separately and then combine them into a single list.

In [None]:
# Assuming file1.txt, file2.pdf, file3.csv, and file4.html are available
# from langchain.document_loaders import TextLoader, PyPDFLoader, CSVLoader, UnstructuredHTMLLoader

# Loaders for different file types
# txt_docs = TextLoader("file1.txt").load()
# pdf_docs = PyPDFLoader("file2.pdf").load()
# csv_docs = CSVLoader("file3.csv").load()
# html_docs = UnstructuredHTMLLoader("file4.html").load()

# Merge all into one list
# all_docs = txt_docs + pdf_docs + csv_docs + html_docs


`all_docs` is just a list of `Document` objects, ready for splitting, embedding, and vector storage.

# 2. Split the text

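LLMs and embedding models have token limits, and retrieval works better on small, focused passages than on whole files, so long files need to be split into chunks.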

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Sizes are measured in characters by default; for longer texts try e.g. 300 and 50.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)

In [None]:
chunks = splitter.split_documents(docs2)


In [None]:
chunks

[Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Acme Inc. was founded in 1987 in Helsinki, Finland. It specializes in anti-gravity footwear and'),
 Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers",'),
 Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="designed for low-orbit recreational use. As per internal policy, Acme's HR team meets every 2"),
 Document(metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page':

Like `split_documents()`, there is also a method `split_text()`, which splits a plain string directly. The cells below use it to show the effect of the `chunk_size` and `chunk_overlap` parameters.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks1 = splitter.split_text("Very long document text here...")
chunks1

['Very long document text here...']

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=3, chunk_overlap=1)
chunks2 = splitter.split_text("Very long document text here...")
chunks2

['Ver',
 'ry',
 'lo',
 'ong',
 'do',
 'ocu',
 'ume',
 'ent',
 'te',
 'ext',
 'he',
 'ere',
 'e..',
 '..']
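
To isolate what the two parameters mean, here is a simplified fixed-window splitter. `sliding_window_split` is a hypothetical helper written for this illustration and is not LangChain's algorithm: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraphs, newlines, and spaces, which is why its output above differs from plain fixed windows.

```python
def sliding_window_split(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces

print(sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=1))
print(sliding_window_split("abcdefghij", chunk_size=4, chunk_overlap=0))
```

With an overlap of 1, the last character of each window is repeated as the first character of the next one, so a fact that straddles a boundary is less likely to be cut in half.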

# 3. Produce Embeddings for splits

For each chunk, we generate an embedding: a numerical vector that captures semantic meaning, such that:

- Embeddings of similar texts are close together.

- Embeddings of dissimilar texts are far apart.
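
A small numeric sketch may help here. The three vectors below are made-up 3-dimensional stand-ins chosen by hand for illustration (they do not come from any embedding model), and cosine similarity is used as the closeness measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = same direction, near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (assumed values, for illustration only).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
rocket = [0.0, 0.1, 0.9]

print("cat vs kitten:", cosine_similarity(cat, kitten))  # similar meaning, high score
print("cat vs rocket:", cosine_similarity(cat, rocket))  # unrelated meaning, low score
```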



In [None]:
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()

# The following extracts the text from each Document chunk, and then converts each chunk into a vector (list of floats).

vectors = embedding_model.embed_documents([doc.page_content for doc in chunks])

In [None]:
len(vectors)

4

In [None]:
len(vectors[0])

1536

In [None]:
# vectors[0]

# 4. Store the embeddings in a vectorDB
The embeddings (and chunks) are stored in a vector database (FAISS, Pinecone, Weaviate, Chroma, etc.).

This lets us search semantically, not just by keywords.

(Later, when a user asks a question, the query itself is embedded. The DB then finds the nearest chunk embeddings and returns the corresponding chunks.)

In this demo, we will use FAISS as the vector store.

FAISS stands for **Facebook AI Similarity Search**. It is an open-source library from Meta (Facebook AI), optimized for fast similarity search over high-dimensional vectors such as embeddings.

It is used for:

- Nearest neighbor search
- Clustering
- Efficient retrieval in RAG pipelines

In [None]:
!pip install faiss-cpu



In [None]:
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embedding_model)

# This stores both embeddings and data together.

In [None]:
vectorstore

<langchain_community.vectorstores.faiss.FAISS at 0x7d69e899fd10>

### Access stored documents (Optional)

In [None]:
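# Read the stored documents back (the embeddings themselves live inside the FAISS index).
# Note: `_dict` is a private attribute of the in-memory docstore and may change between versions.
all_docs = vectorstore.docstore._dict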

for doc_id, doc in all_docs.items():
    print("ID:", doc_id)
    print("Content:", doc.page_content)
    print("Metadata:", doc.metadata)
    print("-" * 40)


ID: 3f6cd2c9-42c5-4e95-a56e-5e5c5fa1da1e
Content: Acme Inc. was founded in 1987 in Helsinki, Finland. It specializes in anti-gravity footwear and
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}
----------------------------------------
ID: 3e61c826-0d6c-4c82-a4f3-e42efde3bf43
Content: rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers",
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}
----------------------------------------
ID: 499a4a3b-9497-4a44-acf0-78b2a09eea04
Content: designed for low-orbit recreational use. As per internal policy, Acme's HR team meets every 2
Metadata: {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}

### Save & reload FAISS (Optional)

In [None]:
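# Save locally
vectorstore.save_local("faiss_index")

# Reload later. The saved docstore is a pickle file, so only pass
# allow_dangerous_deserialization=True for index files you created or trust.
new_store = FAISS.load_local("faiss_index", OpenAIEmbeddings(), allow_dangerous_deserialization=True)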

After creating a vectorstore, we can do similarity search using `similarity_search()` which uses the following process:

1. The query is converted into an embedding vector using the same embedding model you used for your documents.

2. FAISS computes similarity between the query embedding and all stored embeddings.

3. It returns the top-k most similar chunks (Document objects).

Output = a list of Documents (List[Document]).
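
The three steps can be sketched with a brute-force search in plain Python. `ToyVectorStore` and `keyword_embed` are hypothetical names invented for this example. The embedding is a simple keyword count rather than a real model, and FAISS uses optimized index structures rather than the Python loop shown here.

```python
import math

VOCAB = ["founded", "product", "hr"]  # assumed toy vocabulary

def keyword_embed(text):
    """Toy embedding: how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

class ToyVectorStore:
    """Hypothetical in-memory store that keeps (vector, text) pairs."""

    def __init__(self, embed):
        self.embed = embed
        self.items = []

    def add_texts(self, texts):
        for t in texts:
            self.items.append((self.embed(t), t))

    def similarity_search(self, query, k=2):
        q = self.embed(query)                                   # 1. embed the query
        scored = [(math.dist(q, v), t) for v, t in self.items]  # 2. distance to every stored vector
        scored.sort(key=lambda pair: pair[0])                   #    smaller distance = more similar
        return [t for _, t in scored[:k]]                       # 3. return the top-k texts

toy_store = ToyVectorStore(keyword_embed)
toy_store.add_texts([
    "Acme was founded in 1987",
    "A new product line launched in 2024",
    "The HR team meets every 2 weeks",
])
print(toy_store.similarity_search("Which product was launched?", k=1))
```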

In [None]:
# Optional: search the vector store directly
retrieved_docs = vectorstore.similarity_search('What was the launch date?', k=2)  # k is the number of documents to retrieve
retrieved_docs

[Document(id='3e61c826-0d6c-4c82-a4f3-e42efde3bf43', metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers",'),
 Document(id='0770a2d5-f717-424a-a917-b5c63bcd0dfc', metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='weeks to assess wellness metrics of staff based on holographic surveys.')]

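How is similarity measured?

FAISS compares the query embedding with the chunk embeddings using a vector distance metric such as L2 (Euclidean) distance or cosine similarity. LangChain's FAISS wrapper uses L2 distance by default.

For a distance metric, smaller distance = higher similarity.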
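
L2 distance and cosine similarity are closely related: for unit-length vectors, the squared L2 distance equals `2 - 2 * cosine`, so both produce the same ranking. The sketch below checks this numerically on made-up vectors. The values and helper names are assumptions for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up vectors standing in for a query embedding and two chunk embeddings.
q_vec = normalize([1.0, 2.0, 0.5])
chunk_a = normalize([1.0, 1.8, 0.4])   # points in nearly the same direction as the query
chunk_b = normalize([-0.5, 0.2, 2.0])  # points elsewhere

for name, vec in [("chunk_a", chunk_a), ("chunk_b", chunk_b)]:
    print(name, "L2^2:", l2_squared(q_vec, vec), "2 - 2cos:", 2 - 2 * cosine(q_vec, vec))
```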

Another variant:

`similarity_search_with_score`: also returns a score along with each document. With the default FAISS setup this score is a distance, so a lower score means a closer match.

In [None]:
results = vectorstore.similarity_search_with_score("What was the launch date?", k=2)
for doc, score in results:
    print(doc.page_content, score)


rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers", 0.44364873
weeks to assess wellness metrics of staff based on holographic surveys. 0.52889407


The previous four steps were to prepare the data for RAG. Now we can do the retrieval.

# 5. Create a retriever to fetch the relevant chunks from the vector DB

A retriever is a wrapper around the vector store that defines how to fetch documents given a query.

In [None]:
retriever = vectorstore.as_retriever()  # default settings (returns up to 4 chunks)
# or set the number of chunks explicitly
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
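
Conceptually, a retriever fixes the search settings once so that callers only pass a query. The following is a minimal sketch of that idea with hypothetical names (`ToyRetriever`, `keyword_search`). It uses a word-overlap search as a stand-in for the vector store and is not how LangChain implements its retrievers.

```python
class ToyRetriever:
    """Hypothetical wrapper: stores the search function and its settings."""

    def __init__(self, search_fn, k=4):
        self.search_fn = search_fn
        self.k = k

    def invoke(self, query):
        # Callers pass only the query; k was fixed when the retriever was created.
        return self.search_fn(query, k=self.k)

def keyword_search(query, k):
    """Stand-in for a vector store search: ranks texts by number of shared words."""
    corpus = [
        "Acme was founded in 1987",
        "Acme released a new product line in 2024",
        "The HR team meets every 2 weeks",
    ]
    q_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda t: len(q_words & set(t.lower().split())), reverse=True)
    return ranked[:k]

toy_retriever = ToyRetriever(keyword_search, k=1)
print(toy_retriever.invoke("new product line"))
```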



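(Optional)
A retriever is a standardized interface that implements `get_relevant_documents(query)`. In recent LangChain versions this method is deprecated in favor of `retriever.invoke(query)`, which returns the same list of documents.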

In [None]:
docs = retriever.get_relevant_documents("What was the launch date?")
for d in docs:
    print(d.page_content, d.metadata)
    print()


rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers", {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}

weeks to assess wellness metrics of staff based on holographic surveys. {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}

designed for low-orbit recreational use. As per internal policy, Acme's HR team meets every 2 {'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}



# 6. Combine the LLM and the retriever, and produce results.

Finally, the LLM and the retriever are combined into a `RetrievalQA` chain.

### `RetrievalQA`

It is a LangChain chain designed specifically for retrieval-augmented generation (RAG).

It combines the following:

1. Retriever: pulls back the most relevant documents from your vector database.
2. LLM: takes the retrieved documents + user's query, and generates an answer.

So instead of answering from memory alone, the LLM grounds its answers in your documents, which reduces (but does not eliminate) hallucination.
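
The combination can be pictured as placing all retrieved chunks into one prompt (the "stuff" approach, which is the default chain type of `RetrievalQA.from_chain_type`). The sketch below is an assumption-laden illustration: the template wording, `build_stuff_prompt`, and `fake_llm` are invented here so that the example runs offline, and they are not LangChain's actual prompt or model call.

```python
def build_stuff_prompt(question, retrieved_texts):
    """Concatenate all retrieved chunks into a single prompt (illustrative template)."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def fake_llm(prompt):
    """Stand-in for a real model call so the example runs without an API key."""
    if "1987" in prompt:
        return "Acme Inc. was founded in 1987 in Helsinki, Finland."
    return "I don't know."

retrieved = [
    "Acme Inc. was founded in 1987 in Helsinki, Finland.",
    "In 2024, Acme released a new product line.",
]
question = "When and where was Acme Inc. founded?"
stuffed_prompt = build_stuff_prompt(question, retrieved)

# Mirror the shape of the chain's output: the answer plus the chunks it was based on.
answer = {"result": fake_llm(stuffed_prompt), "source_documents": retrieved}
print(answer["result"])
```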

In [None]:
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)


### `.invoke({"query": query})`

Runs the chain.

- Input: a dictionary with the key `"query"`.

- Output: a dictionary with keys such as:
  - `"result"`: the final LLM-generated answer.
  - `"source_documents"` (if `return_source_documents=True`): the list of retrieved documents.

In [None]:
query = "When and where was Acme Inc. founded?"
response = qa_chain.invoke({"query": query})
print(response["result"])

Acme Inc. was founded in 1987 in Helsinki, Finland.


In [None]:
print(response["source_documents"])

[Document(id='3f6cd2c9-42c5-4e95-a56e-5e5c5fa1da1e', metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='Acme Inc. was founded in 1987 in Helsinki, Finland. It specializes in anti-gravity footwear and'), Document(id='3e61c826-0d6c-4c82-a4f3-e42efde3bf43', metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content='rocket-powered pogo sticks. In 2024, Acme released a new product line: "Jet Sneakers",'), Document(id='499a4a3b-9497-4a44-acf0-78b2a09eea04', metadata={'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250730135006', 'source': 'RAG_file.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="designed for low-orbit recreational use. As per internal policy, Acme's HR team meets every 2")]
