<a href="https://colab.research.google.com/github/anahid-rr/Classdemo2/blob/main/RAG_application.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧰 RAG Detailed Walkthrough

We’ll split the pipeline into three main parts:

1. **Indexing**: load web content → split into chunks → embed → store in vector store  
2. **Retrieval & Generation**: load prompt → retrieve relevant chunks → generate answer  
3. **Assembly via LangGraph**: wire retrieval and generation into a single application  


## 1️⃣ Indexing: Building Your Knowledge Base

### 1.1 Install Dependencies


In [1]:
# Cell 1.1
!pip install -qU google-generativeai langchain langchain-core langchain-community \
               langchain-openai langgraph beautifulsoup4 tiktoken



### 1.2 Load the Web Page as a Document

We use **WebBaseLoader** + BeautifulSoup’s **SoupStrainer** to extract only the title, headers, and content.


In [2]:
# Cell 1.2
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep elements with these classes
strainer = bs4.SoupStrainer(class_=("post-title","post-header","post-content"))
loader = WebBaseLoader(
    web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
    bs_kwargs={"parse_only": strainer},
)
docs = loader.load()

assert len(docs) == 1
print(f"✔ Loaded 1 document, total characters: {len(docs[0].page_content)}")
print(docs[0].page_content[:200], "…")




✔ Loaded 1 document, total characters: 43047


      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a  …


### 1.3 Split into Chunks

A long document may not fit into a model's context window, and smaller chunks also make retrieval more precise.  
We’ll split each Document into chunks of up to 1,000 characters with a 200-character overlap.


In [3]:
# Cell 1.3
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,  # track original position
)
all_splits = splitter.split_documents(docs)
print(f"✔ Split into {len(all_splits)} chunks")
print("First chunk preview:", all_splits[0].page_content[:100], "…")
print("Metadata:", all_splits[0].metadata)


✔ Split into 63 chunks
First chunk preview: LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author …
Metadata: {'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 8}
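
To make `chunk_size`, `chunk_overlap`, and `start_index` concrete, here is a minimal sliding-window sketch in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunk boundaries differ. The function name `chunk_text` and the sample text are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Return (start_index, chunk) pairs from a fixed-size sliding window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 250  # 2,500 characters of stand-in text
pieces = chunk_text(sample)
for start, chunk in pieces:
    print(start, len(chunk))
```

Each chunk begins 800 characters after the previous one, so consecutive chunks share 200 characters. That shared region is what keeps a sentence that straddles a boundary retrievable from at least one chunk.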


### 1.4 Embed & Store Chunks

We define a Gemini embedding wrapper, initialize the LLM and an in-memory vector store, then embed and store all chunks with a single `add_documents` call.


In [4]:
# Cell 1.4
import os
from getpass import getpass
from langchain_openai import ChatOpenAI
from langchain_core.embeddings import Embeddings
from langchain_core.vectorstores import InMemoryVectorStore
from google import generativeai as genai

# Prompt for API keys
os.environ["GEMINI_API_KEY"]  = getpass("🔐 Gemini API Key: ")
# ChatOpenAI reads OPENAI_API_KEY and OPENAI_API_BASE, so the OpenRouter key and URL go there
os.environ["OPENAI_API_KEY"]  = getpass("🔐 OpenRouter API Key: ")
os.environ["OPENAI_API_BASE"] = "https://openrouter.ai/api/v1"
os.environ["LANGSMITH_API_KEY"] = getpass("Enter your LangSmith API Key: ")
os.environ["LANGSMITH_TRACING"] = "true"

# Gemini embedding wrapper
class GeminiEmbeddings(Embeddings):
    def __init__(self, model="models/embedding-001", task_type="RETRIEVAL_DOCUMENT"):
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.client    = genai
        self.model     = model
        self.task_type = task_type
    def embed_documents(self, texts):
        return [
            self.client.embed_content(
                model=self.model, content=t, task_type=self.task_type
            )["embedding"]
            for t in texts
        ]
    def embed_query(self, query):
        return self.client.embed_content(
            model=self.model, content=query, task_type="RETRIEVAL_QUERY"
        )["embedding"]

# Initialize LLM + vector store
llm           = ChatOpenAI(model="openai/gpt-4")
embeddings    = GeminiEmbeddings()
vector_store  = InMemoryVectorStore(embeddings)

# Add and embed chunks
doc_ids = vector_store.add_documents(all_splits)
print("✔ Stored chunks, sample IDs:", doc_ids[:3])


🔐 Gemini API Key: ··········
🔐 OpenRouter API Key: ··········
Enter your LangSmith API Key: ··········
✔ Stored chunks, sample IDs: ['b5fe49b9-6b69-4e42-9a17-aa224c9e5f0c', '8d0c400b-4013-499f-a912-fe88e1454c3c', '0a8b60d2-d0be-4a7c-bd0b-67f05f36c0f7']


## 2️⃣ Retrieval & Generation Logic

### 2.1 Load the RAG Prompt Template

We pull the `rlm/rag-prompt` template, the prompt used in LangChain's RAG tutorial, from the LangChain Hub.


In [8]:
# Cell 2.1: Load the RAG prompt and preview it
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")
example = prompt.invoke({
    "context": "(your context here)",
    "question": "(your question here)"
}).to_messages()[0].content

print("✔ Prompt template preview:\n", example)


✔ Prompt template preview:
 You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: (your question here) 
Context: (your context here) 
Answer:


### 2.2 Define the Application State

Use a TypedDict to track `question`, `context`, and `answer`.



In [9]:
# Cell 2.2
from langchain_core.documents import Document
from typing_extensions import TypedDict, List

class State(TypedDict):
    question: str
    context: List[Document]
    answer: str


### 2.3 Retrieval Step

Perform a vector similarity search to get the top-k most relevant chunks (here `k=10`).


In [10]:
# Cell 2.3
def retrieve(state: State):
    docs = vector_store.similarity_search(state["question"], k=10)
    return {"context": docs}

# Quick test
sample_context = retrieve({"question": "What are autonomous agents?"})
print("✔ Retrieved chunks start indices:",
      [d.metadata["start_index"] for d in sample_context["context"]])


✔ Retrieved chunks start indices: [971, 26585, 8, 39948, 22956, 40518, 24126, 3549, 25955, 25125]
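
As a rough picture of what a similarity search does, the sketch below ranks toy vectors against a query vector in plain Python. It assumes cosine similarity as the metric, which is our assumption about the store rather than something shown in this notebook. The vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones have hundreds of dimensions
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.8, 0.2, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```

The result is ordered by similarity, not by position in the document, which is why the start indices printed above are not sorted.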


### 2.4 Generation Step

Concatenate retrieved chunks, fill into the RAG prompt, and call the LLM to generate an answer.


In [11]:
# Cell 2.4
def generate(state: State):
    ctx = "\n\n".join(d.page_content for d in state["context"])
    # Reuse the prompt loaded in Cell 2.1 rather than pulling it from the Hub on every call
    msgs = prompt.invoke({
        "context": ctx,
        "question": state["question"]
    })
    resp = llm.invoke(msgs)
    return {"answer": resp.content}

# Quick test
gen = generate({
    "question": "What are autonomous agents?",
    "context": sample_context["context"]
})
print("✔ Sample answer:", gen["answer"][:200], "…")


✔ Sample answer: Autonomous agents are systems that use a large language model (LLM) as their core controller. These agents can perform complex tasks by breaking them down into smaller, manageable subgoals, reflecting …
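
The "concatenate and fill" step can be sketched without any LangChain objects. The template text below is copied from the prompt preview in section 2.1. The names `RAG_TEMPLATE`, `build_prompt`, and the demo chunks are hypothetical, and the real prompt object produces chat messages rather than a plain string.

```python
# Template text taken from the prompt preview printed in Cell 2.1
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(chunks)  # blank line between chunks, as in generate()
    return RAG_TEMPLATE.format(question=question, context=context)


demo_chunks = ["Chunk A about planning.", "Chunk B about memory."]
demo_prompt = build_prompt("What are autonomous agents?", demo_chunks)
print(demo_prompt)
```

Because every retrieved chunk is pasted into the prompt, `k` in the retrieval step directly controls how long the prompt gets.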


## 3️⃣ Assemble with LangGraph

Wire the two steps into a single pipeline and run a full query.

### 3.1 Build the Graph


In [13]:
# Cell 3.1
from langgraph.graph import START, StateGraph

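# add_sequence registers both nodes and the retrieve -> generate edge
builder = StateGraph(State).add_sequence([retrieve, generate])
# One START edge is enough; the additional set_entry_point call was redundant
builder.add_edge(START, "retrieve")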
graph = builder.compile()
print("✔ LangGraph pipeline compiled.")


✔ LangGraph pipeline compiled.
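
Conceptually, a two-node sequence passes a state dictionary through each step and merges the partial update each step returns. The sketch below imitates that flow in plain Python. It is an illustration of the idea only, not LangGraph's implementation, and `run_sequence`, `toy_retrieve`, and `toy_generate` are hypothetical stand-ins.

```python
def run_sequence(steps, initial_state):
    """Call each step with the current state and merge the dict it returns."""
    state = dict(initial_state)
    for step in steps:
        update = step(state)   # each step returns only the keys it changes
        state.update(update)
    return state


def toy_retrieve(state):
    # Stand-in for the vector store lookup
    return {"context": [f"stub chunk for: {state['question']}"]}


def toy_generate(state):
    # Stand-in for the LLM call
    return {"answer": f"answer built from {len(state['context'])} chunk(s)"}


final_state = run_sequence([toy_retrieve, toy_generate], {"question": "What is RAG?"})
print(final_state)
```

This is why `retrieve` and `generate` each return a small dictionary: every node contributes its own keys, and the final state holds `question`, `context`, and `answer` together.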


### 3.2 Test the Full Pipeline


In [14]:
# Cell 3.2: run the compiled graph end to end
result = graph.invoke({"question": "What is Task Decomposition?"})

# Print the start_index of each chunk
print("Retrieved chunk start_indices:", [d.metadata["start_index"] for d in result["context"]])

# Then print the first 200 characters of each chunk so you can see its content
for i, doc in enumerate(result["context"], 1):
    snippet = doc.page_content.strip().replace("\n", " ")
    print(f"\n--- Chunk {i} (start_index={doc.metadata['start_index']}) ---")
    print(snippet[:200] + "…")

# Finally, show the model’s answer
print("\nAnswer:\n", result["answer"])


Retrieved chunk start_indices: [1638, 2578, 19303, 17352, 18591, 37831, 30868, 31986, 19944, 35043]

--- Chunk 1 (start_index=1638) ---
Component One: Planning# A complicated task usually involves many steps. An agent needs to know what they are and plan ahead. Task Decomposition# Chain of thought (CoT; Wei et al. 2022) has become a s…

--- Chunk 2 (start_index=2578) ---
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outl…

--- Chunk 3 (start_index=19303) ---
(3) Task execution: Expert models execute on the specific tasks and log results. Instruction:  With the input and the inference results, the AI assistant needs to describe the process and results. The…

--- Chunk 4 (start_index=17352) ---
Illustration of how HuggingGPT works. (Image source: Shen et al. 2023)  The system comprises of 4 stages: (1) Task planning: LLM works as the brain 

In [15]:
!pip install requests beautifulsoup4 sentence-transformers faiss-cpu



In [16]:
import requests
from bs4 import BeautifulSoup

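def scrape_webpage(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    # Pass raw bytes so BeautifulSoup detects the page encoding itself;
    # response.text can mis-decode UTF-8 pages and yield characters like "â"
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract the text of all <p> elements
    paragraphs = soup.find_all('p')
    text = "\n".join(p.get_text() for p in paragraphs)
    return text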

url = "https://pandas.pydata.org/docs/"  # Example: Pandas Documentation
web_text = scrape_webpage(url)


In [17]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Chunk the text (naive method for now)
chunks = [web_text[i:i+500] for i in range(0, len(web_text), 500)]

embedder = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedder.encode(chunks, convert_to_numpy=True)

# Store in FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)




In [20]:
query = "tag for reading csv in pandas"
query_vector = embedder.encode([query])

# D holds the squared L2 distances, I the indices of the 3 nearest chunks
D, I = index.search(np.array(query_vector), 3)

# Show results
for idx in I[0]:
    print("\n--- Retrieved Chunk ---\n")
    print(chunks[idx])



--- Retrieved Chunk ---


    Site Navigation
  

    Site Navigation
  
Date: Sep 20, 2024 Version: 2.2.3
Download documentation: Zipped HTML
Previous versions: Documentation of previous pandas versions is available at
pandas.pydata.org.
Useful links:
Binary Installers |
Source Repository |
Issues & Ideas |
Q&A Support |
Mailing List
pandas is an open source, BSD-licensed library providing high-performance,
easy-to-use data structures and data analysis tools for the Python
programming language.
New to pandas? Check out

--- Retrieved Chunk ---

 the getting started guides. They contain an
introduction to pandas’ main concepts and links to additional tutorials.
To the getting started guides
The user guide provides in-depth information on the
key concepts of pandas with useful background information and explanation.
To the user guide
The reference guide contains a detailed description of
the pandas API. The reference describes how the methods work and which parameters can
be used. It
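
To show what `index.search` computes, here is a brute-force sketch in plain Python. `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, and the toy function below mirrors that for a single query. The name `flat_l2_search` and the vectors are hypothetical, and FAISS itself runs optimized native code over NumPy arrays.

```python
def flat_l2_search(vectors, query, k):
    """Compare the query with every vector and return the k nearest.

    Returns (distances, indices), with squared L2 distances sorted ascending.
    """
    scored = sorted(
        (sum((v - q) ** 2 for v, q in zip(vec, query)), i)
        for i, vec in enumerate(vectors)
    )
    distances = [d for d, _ in scored[:k]]
    indices = [i for _, i in scored[:k]]
    return distances, indices


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
dists, ids = flat_l2_search(toy_vectors, [1.0, 1.0], k=2)
print(dists, ids)
```

A smaller distance means a closer match, so the first index returned is the best candidate. Closeness in embedding space is no guarantee of a useful answer: the chunks retrieved above are site navigation text because the scraped landing page never mentions reading CSV files.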

In [21]:
from transformers import pipeline

# The original cell loaded "facebook/rag-sequence-nq" here. That checkpoint is a
# RagSequenceForGeneration model, which the text2text-generation pipeline does not
# support, and the cell failed with:
#   AttributeError: 'Text2TextGenerationPipeline' object has no attribute 'prefix'
# As a stand-in (our assumption, not part of the original run) we load a seq2seq
# model that the pipeline supports. You can also use GPT via the OpenAI API.
rag = pipeline("text2text-generation", model="google/flan-t5-base")

context = " ".join([chunks[i] for i in I[0]])
input_text = f"question: What is the tag for reading CSV files in pandas? context: {context}"

result = rag(input_text)
print(result[0]['generated_text'])

In [19]:
# Step 1: Install required libraries
!pip install sentence-transformers faiss-cpu transformers

# Step 2: Load documents (you can also read .ipynb or plain text files)
from pathlib import Path

data_dir = Path("/content/sample_data")  # change this to your notebook/data path
documents = []

for file in data_dir.glob("*.txt"):  # adapt for .ipynb if needed
    documents.append({"text": file.read_text(encoding="utf-8"), "source": str(file)})

# Without any documents, encode([]) returns an empty array and
# embeddings.shape[1] fails with "IndexError: tuple index out of range"
if not documents:
    raise FileNotFoundError(f"No .txt files found in {data_dir}")

# Step 3: Create embeddings and index them with FAISS
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')
texts = [doc['text'] for doc in documents]
embeddings = embedder.encode(texts, convert_to_numpy=True)

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Step 4: Query using a tag (example: "pandas", "data cleaning", etc.)
query = "tag for loading CSV files"
query_embedding = embedder.encode([query])

top_k = min(3, len(documents))  # FAISS pads missing neighbours with index -1
D, I = index.search(np.array(query_embedding), top_k)

# Step 5: Display retrieved documents
retrieved_docs = [documents[i] for i in I[0]]
for doc in retrieved_docs:
    print(f"\n--- Retrieved from: {doc['source']} ---\n")
    print(doc['text'][:1000])  # Print first 1000 chars