# 📝 05 – Retrieval & RAG Basics

In this chapter, we’ll learn about **RAG (Retrieval Augmented Generation)** and implement a simple RAG pipeline using LangChain with Gemini.

---

## 📌 5.1 What is RAG?

**Retrieval-Augmented Generation (RAG)** is a technique that improves Large Language Models by combining:
1. **Retrieval** → fetching relevant information from an external knowledge base (typically a vector database).
2. **Generation** → feeding that information into the LLM to produce grounded, accurate responses.

### Why RAG?
- LLMs have limited **context windows** (they can only “see” a certain number of tokens at once).
- LLMs can **hallucinate** (make up facts).
- With RAG, the LLM can be given **up-to-date, external information** at query time, without retraining the model.


## 📌 5.2 Vector Databases

To perform retrieval, we need a **vector database**:
- Convert text into embeddings (numeric vectors).
- Store embeddings in a database (Chroma, FAISS, Pinecone).
- Perform similarity search to fetch most relevant chunks.

We’ll use **Chroma** here (easy, lightweight, local).


In [1]:
import os
import sys
from pathlib import Path

sys.path.append(os.path.abspath(".."))

In [2]:
# initializing the llm
from llm.load_llm import initialize_llm

llm = initialize_llm()

LLM ready: ChatGoogleGenerativeAI


Pipeline:
1. Convert docs → embeddings (vectors)
2. Store in a vector database
3. Retrieve relevant chunks based on query
4. Pass them + question into LLM

We’ll now build a simple demo with a few documents.

In [5]:
from langchain.schema import Document

# Example documents
docs = [
    Document(page_content="LangChain helps build apps with LLMs."),
    Document(page_content="RAG means Retrieval Augmented Generation."),
    Document(page_content="Chroma is a vector database for embeddings."),
    Document(page_content="Embeddings convert text into numerical vectors."),
]


## ✂️ 5.3 Split Documents into Chunks

Large documents need to be **chunked** into smaller pieces for embeddings.
We’ll use `CharacterTextSplitter`.

Why use `CharacterTextSplitter`?

* LLMs and embedding models have a token limit → can’t process huge text at once.
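* Embedding smaller chunks improves retrieval, because each vector represents one focused piece of context.
* `CharacterTextSplitter` splits text on a separator (by default a blank line, `"\n\n"`) and merges the pieces into chunks of up to `chunk_size` characters. It never cuts inside a piece, so a piece longer than `chunk_size` is kept whole.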
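
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal plain-Python sketch of a sliding character window. It is only an illustration of the two parameters, not LangChain's implementation; the helper name `chunk_text` and the sample values are assumptions made for this example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    (Hypothetical helper for illustration, not part of LangChain.)
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "Embeddings convert text into numerical vectors."
pieces = chunk_text(sample, chunk_size=20, chunk_overlap=5)
for i, piece in enumerate(pieces, start=1):
    print(f"Chunk {i}: {piece!r}")
```

The overlap means the last 5 characters of one chunk are repeated at the start of the next, so a sentence cut at a chunk boundary still appears intact in at least part of a neighbouring chunk.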

In [6]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap = 10)
split_docs = text_splitter.split_documents(docs)

print("Number of chunks:" , len(split_docs))
for i, d in enumerate(split_docs[:5]):
    print(f"Chunk {i+1}: {d.page_content}")

Number of chunks: 4
Chunk 1: LangChain helps build apps with LLMs.
Chunk 2: RAG means Retrieval Augmented Generation.
Chunk 3: Chroma is a vector database for embeddings.
Chunk 4: Embeddings convert text into numerical vectors.


In [33]:
#Example

data =  """
LangChain is a powerful framework for building applications using large language models.
It provides tools for chaining prompts, memory, agents, and retrieval systems.
RAG (Retrieval-Augmented Generation) is one of the most useful techniques in LangChain.
"""
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap = 10, separator="\n")
chunks = text_splitter.split_text(data)

for i , c in enumerate(chunks):
    print(f"{i}\tchunk:{c}\n")

Created a chunk of size 88, which is longer than the specified 50
Created a chunk of size 78, which is longer than the specified 50


0	chunk:LangChain is a powerful framework for building applications using large language models.

1	chunk:It provides tools for chaining prompts, memory, agents, and retrieval systems.

2	chunk:RAG (Retrieval-Augmented Generation) is one of the most useful techniques in LangChain.
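
The "Created a chunk of size 88" warnings above appear because the splitter only cuts at the separator: each line of `data` is a single piece longer than 50 characters, so it is kept whole. The sketch below is a simplified split-then-merge routine that reproduces this behaviour. It ignores `chunk_overlap`, and `split_on_separator` is a hypothetical helper, not LangChain's actual code.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on separator, then greedily merge pieces up to chunk_size.

    Simplified illustration: no overlap handling, and a single piece
    longer than chunk_size is emitted as-is rather than cut.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # still fits, keep merging
        else:
            if current:
                chunks.append(current)   # close the chunk built so far
            current = piece              # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


data = """
LangChain is a powerful framework for building applications using large language models.
It provides tools for chaining prompts, memory, agents, and retrieval systems.
RAG (Retrieval-Augmented Generation) is one of the most useful techniques in LangChain.
"""
chunks = split_on_separator(data, "\n", 50)
for c in chunks:
    print(len(c), c)
```

To get chunks that respect the limit on text like this, either raise `chunk_size` or split on a finer separator such as a space.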



## 🔢 5.4 Create Embeddings and Store in Chroma

Embeddings = numerical representation of text.  
We’ll use **Google Generative AI embeddings** (`models/embedding-001`).


What is an Embedding?

* An embedding is a numerical representation of text.
* Instead of words/characters, the text is turned into a vector (list of numbers) in high-dimensional space.
* Similar meanings → closer vectors (small cosine distance).
* Different meanings → far apart vectors.

👉 This allows us to search by meaning, not just exact words.
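
As a rough illustration of "closer vectors", the sketch below computes cosine similarity by hand. The three-dimensional vectors are made-up toy values chosen for this example; real embeddings come from the embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print("cat vs kitten :", round(cosine_similarity(cat, kitten), 3))
print("cat vs invoice:", round(cosine_similarity(cat, invoice), 3))
```

Cosine distance is simply `1 - cosine_similarity`, so a high similarity corresponds to a small distance.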

In [9]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain.vectorstores import Chroma

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

vectordb = Chroma.from_documents(split_docs , embeddings)

In [35]:

# Create embedding model
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Example text
text = "LangChain makes it easier to build LLM apps."

# Get embedding vector
vector = embeddings.embed_query(text)
print(len(vector))  # dimension size (e.g., 768 or more)
print(vector[:10])  # first 10 values


768
[0.04500666633248329, -0.008530666120350361, -0.035316307097673416, 0.02876937761902809, 0.052248261868953705, 0.017567235976457596, 0.03779364377260208, -0.0038489217404276133, 0.050953008234500885, 0.0346384271979332]


## 🔍 5.5 Create a Retriever

Retriever = takes query → finds top `k` most relevant docs.
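
Conceptually, top-`k` retrieval scores every stored vector against the query vector and keeps the best `k`. The sketch below shows that idea in plain Python. The two-dimensional vectors and the `top_k` helper are assumptions made for illustration; Chroma uses real embeddings and its own index internally.

```python
import heapq
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int):
    """Return the k (score, text) pairs most similar to the query vector."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in store
    )
    return heapq.nlargest(k, scored)


# Toy "vector store": texts from this chapter with invented 2-d vectors.
store = [
    ("Chroma is a vector database for embeddings.", [0.9, 0.3]),
    ("Embeddings convert text into numerical vectors.", [0.7, 0.6]),
    ("LangChain helps build apps with LLMs.", [0.1, 0.9]),
    ("RAG means Retrieval Augmented Generation.", [0.2, 0.8]),
]
query_vec = [1.0, 0.2]  # stands in for the embedding of "What is Chroma"

results = top_k(query_vec, store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```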


In [21]:
retriever = vectordb.as_retriever(search_kwargs={"k":2})

results = retriever.get_relevant_documents("What is Chroma")
for r in results:
    print(r.page_content)

Chroma is a vector database for embeddings.
Embeddings convert text into numerical vectors.


## 🤖 5.6 Build RAG QA Chain

Now we connect:
- Retriever (fetches context)
- LLM (answers using context)

We’ll use `RetrievalQA` with `chain_type="stuff"`, which places all retrieved chunks into a single prompt.
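
The "stuff" approach amounts to concatenating the retrieved chunks and the question into one prompt. A minimal sketch of that step is shown below. The template wording and the `build_stuffed_prompt` helper are assumptions for illustration and differ from the prompt LangChain actually ships.

```python
def build_stuffed_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into a single prompt string.

    Hypothetical helper; the instruction text is illustrative only.
    """
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n"
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Chroma is a vector database for embeddings.",
    "Embeddings convert text into numerical vectors.",
]
prompt = build_stuffed_prompt("Which database is used for embeddings", retrieved)
print(prompt)
```

Because every retrieved chunk goes into one prompt, this approach only works while `k` chunks plus the question fit inside the model's context window.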


In [27]:
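from langchain.chains import RetrievalQA
from langchain_community.llms import Ollama

# This cell uses a small local model served by Ollama (the `tinyllama` model must be available locally).
# To answer with Gemini instead, pass the `llm` created earlier.
llm2 = Ollama(model="tinyllama")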

qa_chain = RetrievalQA.from_chain_type(
    llm = llm2,
    retriever = retriever,
    chain_type = "stuff",
    verbose = True
)


In [28]:
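# Ask a question. `run` returns only the answer string; newer LangChain releases deprecate it in favor of `invoke`.
print(qa_chain.run("What does RAG mean"))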



> Entering new RetrievalQA chain...

> Finished chain.
Yes, I can provide the context for the question "What is RAG and what is it used for?"

RAG (Retrieval Augmented Generation) is a type of AI-based machine learning model designed to improve natural language processing (NLP). The main purpose of RAG is to help the system understand and analyze large amounts of unstructured data, such as text or spoken dialogues, by providing contextual insights and interpretations.

RAG uses embedding techniques to convert text into numerical vectors, allowing it to learn and understand contextual relationships between words and phrases within a given document or conversation. This helps the system better understand the underlying meaning and structure of the data, which can then be used to make more informed decisions and recommendations.


In [29]:
print(qa_chain.invoke("Which database is used for embeddings"))



> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'Which database is used for embeddings', 'result': "If you don't know the answer to this question, just skip to the next context piece.\n\nChroma is a vector database for embedding conversations and text into numerical vectors."}


## ✅ Summary

In this notebook we built a **basic RAG pipeline**:
1. Prepared documents
2. Split into chunks
3. Created embeddings
4. Stored in **Chroma vector DB**
5. Retrieved relevant chunks
6. Built a **RAG QA chain** (answered here by a local `tinyllama` model via Ollama; the Gemini `llm` can be passed in instead)

👉 This is the foundation for **chat-with-PDF**, **Q&A bots**, and **knowledge assistants**.
