# RAG (Retrieval Augmented Generation) Demo

This notebook demonstrates how to build a RAG system using LangChain, Hugging Face models, and the Chroma vector database. RAG pairs a retrieval step with a generative language model so that responses are grounded in your own documents.

## What is RAG?
RAG is a technique that:
1. **Retrieves** relevant documents from a knowledge base
2. **Augments** the input prompt with this retrieved context
3. **Generates** a response using a language model

This approach helps overcome the limitations of a standalone language model, such as missing or outdated knowledge, by supplying specific, relevant information at query time.
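The three steps above can be sketched in plain Python. This is a toy illustration only: the word-overlap scoring, the `KNOWLEDGE_BASE` list, and the stub `generate` function are assumptions made for this example and stand in for the embedding search and the language model used later in the notebook.

```python
# Toy illustration of retrieve -> augment -> generate (no real model involved).
import re

# Hypothetical mini knowledge base used only for this sketch
KNOWLEDGE_BASE = [
    "Chroma is a vector database used to store embeddings.",
    "LangChain provides building blocks for LLM applications.",
    "Kubernetes schedules containers across a cluster.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=1):
    """Rank documents by the number of words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(question, context_docs):
    """Place the retrieved text in front of the question."""
    context = "\n".join(context_docs)
    return f"Context: {context}\nQuestion: {question}\nHelpful answer:"

def generate(prompt):
    """Stand-in for a language model call; it only reports the prompt length."""
    return f"[model output for a {len(prompt)}-character prompt]"

question = "What is Chroma used to store?"
docs = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = augment(question, docs)
print(prompt)
print(generate(prompt))
```

In the real pipeline, `retrieve` becomes a vector similarity search and `generate` becomes a call to the language model, but the data flow stays the same.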

## Prerequisites and Setup

Install the dependencies listed in `requirements.txt`, then authenticate with the Hugging Face Hub.

In [3]:
# Install dependencies
!pip install -r requirements.txt

# Log in to the Hugging Face Hub (required for gated models such as Llama)
# Prefer the HF_TOKEN environment variable; the placeholder is only a fallback
import os
from huggingface_hub import login
login(token=os.environ.get("HF_TOKEN", "your_hugging_face_key"))



## Import Required Libraries

We'll import all necessary components for our RAG system:
- **Document loaders and text splitters** for processing PDFs
- **Embedding models** for vector representations
- **Vector store** for storing and retrieving documents
- **LLM components** for generating responses


In [4]:
# Import all required modules for the RAG system
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


## Download and Setup Language Model

We'll download and set up our language model. In this example, we're using **Llama 3.2 1B**, which is:
- Relatively small and fast for demonstration purposes
- Good balance between performance and resource requirements
- Suitable for local execution

The model will be downloaded and saved locally for faster subsequent access.


In [5]:
# Configuration for model download and storage
save_path = "meta-llama/Llama-3.2-1B"  # Local directory to store the model
hf_model = 'meta-llama/Llama-3.2-1B'   # Hugging Face model identifier
access_token = 'your_token'             # Replace with your HF token

print(f"📥 Downloading model: {hf_model}")
print(f"💾 Saving to: {save_path}")

# Download model and tokenizer from Hugging Face
# This downloads the model weights, architecture, and configuration
model = AutoModelForCausalLM.from_pretrained(
    hf_model, 
    return_dict=True, 
    trust_remote_code=True, 
    token=access_token
)
tokenizer = AutoTokenizer.from_pretrained(hf_model, token=access_token)

# Save the model and tokenizer locally for future use
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print("✅ Model downloaded and saved successfully!")

📥 Downloading model: meta-llama/Llama-3.2-1B
💾 Saving to: meta-llama/Llama-3.2-1B
✅ Model downloaded and saved successfully!
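Because the model is saved locally, later runs can skip the download. The following is a minimal sketch of that idea, assuming the saved folder contains a `config.json` file (which `save_pretrained` normally writes for a model). The helper name `resolve_model_source` is hypothetical and not part of any library.

```python
import os
import tempfile

def resolve_model_source(local_dir, hub_id):
    """Return local_dir if it already holds a saved model config, else hub_id."""
    config_path = os.path.join(local_dir, "config.json")
    return local_dir if os.path.isfile(config_path) else hub_id

# Demonstration with a temporary directory standing in for the saved model folder
with tempfile.TemporaryDirectory() as tmp:
    before = resolve_model_source(tmp, "meta-llama/Llama-3.2-1B")
    # Simulate a previous save by creating an empty config file
    with open(os.path.join(tmp, "config.json"), "w") as fh:
        fh.write("{}")
    after = resolve_model_source(tmp, "meta-llama/Llama-3.2-1B")

print(before)        # hub identifier, because nothing was saved yet
print(after == tmp)  # True once a config file exists locally
```

The returned value could then be passed to `from_pretrained`, so the Hub is contacted only when no local copy exists.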


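## Create Vectors

This section loads the PDF, splits its text into overlapping chunks, embeds each chunk, and stores the vectors in Chroma.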

In [6]:
# Step 1: Load PDF Document
print("📄 Loading PDF document...")

# Load the PDF file using PyPDFLoader
pdf_path = "./Data/Dynamic_Resource_Scheduler_for_Distributed_Deep_Learning_Training_in_Kubernetes.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load()

# Extract text content from all pages
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

print(f"✅ Loaded {len(pages)} pages")
print(f"📊 Total characters: {len(joined_page_text):,}")
print(f"📝 First 200 characters: {joined_page_text[:200]}...")

📄 Loading PDF document...
✅ Loaded 6 pages
📊 Total characters: 24,094
📝 First 200 characters: 978-1-7281-8038-0/20/$31.00 ©2020 IEEE 
 
Dynamic Resource Scheduler for Distributed Deep 
Learning Training in Kubernetes 
Muhammad Fadhriga Bestari 
School of Electrical Engineering and Informatics,...


In [7]:
# Step 2: Split text into chunks
print("✂️ Splitting text into chunks...")

# Configure the text splitter
# chunk_size: Maximum characters per chunk
# chunk_overlap: Characters to overlap between chunks (maintains context)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,    # Max characters per chunk; tune for your embedding model's input limit
    chunk_overlap=150   # Overlap to maintain context between chunks
)

splits = text_splitter.split_text(joined_page_text)

print(f"✅ Created {len(splits)} text chunks")
print(f"📏 Average chunk size: {sum(len(chunk) for chunk in splits) // len(splits)} characters")
print(f"\n📖 First chunk preview:\n{splits[0][:300]}...")

# Display the first chunk as output
splits[0]

✂️ Splitting text into chunks...
✅ Created 18 text chunks
📏 Average chunk size: 1450 characters

📖 First chunk preview:
978-1-7281-8038-0/20/$31.00 ©2020 IEEE 
 
Dynamic Resource Scheduler for Distributed Deep 
Learning Training in Kubernetes 
Muhammad Fadhriga Bestari 
School of Electrical Engineering and Informatics, ITB, 
Indonesia 
fadhriga.bestari@gmail.com 
 
Achmad Imam Kistijantoro1,2 
1School of Electrical E...


'978-1-7281-8038-0/20/$31.00 ©2020 IEEE \n \nDynamic Resource Scheduler for Distributed Deep \nLearning Training in Kubernetes \nMuhammad Fadhriga Bestari \nSchool of Electrical Engineering and Informatics, ITB, \nIndonesia \nfadhriga.bestari@gmail.com \n \nAchmad Imam Kistijantoro1,2 \n1School of Electrical Engineering and Informatics, ITB, \nIndonesia \n2University Center of Excellence on Artificial Intelligence \nfor Vision, Natural Language Processing & Big Data \nAnalytics (U-CoE AI-VLB), Indonesia \nimam@stei.itb.ac.id\n \nAnggrahita Bayu Sasmita \nSchool of Electrical Engineering and Informatics, ITB, Indonesia \nangga@stei.itb.ac.id \n \n \nAbstract—Distributed deep learning is a method of machine \nlearning that is used today due to its many advantages. One of the \nmany tools used to train distributed deep learning model is \nKubeflow, which runs on top of Kubernetes. Kubernetes is a \ncontainerized application orchestrator that ease the deploy ment \nprocess of applications.
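To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and sentences. The function `chunk_with_overlap` is a hypothetical helper written only for this illustration.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk begins with the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both chunks.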

In [8]:
# Step 3: Create embeddings and vector store
print("🔢 Creating embeddings and vector store...")

# Set up the embedding model and storage directory
persist_directory = 'basic_langchain/chroma_storage'
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

print(f"🤖 Using embedding model: {embedding_model}")
print(f"💾 Vector store location: {persist_directory}")

# Initialize the embedding model
embedding = HuggingFaceEmbeddings(model_name=embedding_model)

# Create vector store from text chunks
# This process converts each text chunk into a numerical vector
vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

# Persist the vector store to disk for future use
vectordb.persist()

# Load the persisted vector store (demonstrates how to reload)
vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

print(f"✅ Vector store created with {len(splits)} documents")
print("🎯 Ready for similarity search and retrieval!")

🔢 Creating embeddings and vector store...
🤖 Using embedding model: sentence-transformers/all-MiniLM-L6-v2
💾 Vector store location: basic_langchain/chroma_storage




✅ Vector store created with 18 documents
🎯 Ready for similarity search and retrieval!
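Similarity search over a vector store boils down to comparing a query vector with the stored vectors and keeping the closest ones. The sketch below uses cosine similarity on made-up three-dimensional vectors. Real embeddings from `all-MiniLM-L6-v2` have many more dimensions, and Chroma's own index and default distance metric may differ, so treat this as an illustration of the idea rather than of Chroma's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Made-up vectors standing in for chunk embeddings
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.0, 0.0]
nearest = top_k(query_vector, doc_vectors, k=2)
print(nearest)
```

The retriever configured later with `k=3` performs the same kind of ranking, only over the 18 stored chunk vectors.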




# Usage: Run the Chain

In [9]:
# Define the RAG prompt template
print("📝 Creating prompt template...")

custom_prompt_template = """Use the following pieces of information to answer the user's question. Explain the answer clearly.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else. Give an answer in 1000 characters at maximum please.
Helpful answer:
"""

# Create the prompt template object
prompt = PromptTemplate(
    template=custom_prompt_template,
    input_variables=['context', 'question']
)

print("✅ Prompt template created successfully!")
print("\n📋 Template structure:")
print("- Context: Retrieved relevant documents")
print("- Question: User's query")
print("- Instructions: How to format the response")



📝 Creating prompt template...
✅ Prompt template created successfully!

📋 Template structure:
- Context: Retrieved relevant documents
- Question: User's query
- Instructions: How to format the response
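To see what the model eventually receives, the sketch below fills a shortened version of the template by hand. It assumes the retrieved chunks are joined with a blank line before being substituted for `{context}`. That separator, the two example chunks, and the helper `build_stuffed_prompt` are assumptions made for this illustration.

```python
# Shortened template with the same two placeholders as the notebook's prompt
TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Helpful answer:\n"
)

def build_stuffed_prompt(template, retrieved_chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context string and fill the template."""
    context = separator.join(retrieved_chunks)
    return template.format(context=context, question=question)

# Placeholder chunks; in the notebook these come from the retriever
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]
stuffed = build_stuffed_prompt(TEMPLATE, example_chunks, "Tell me about DRAGON")
print(stuffed)
```

Because every retrieved chunk lands in a single prompt, the number of chunks and their size together determine how much of the model's context window is consumed.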


## Loading the Language Model

Now we'll load our pre-downloaded language model and create a text generation pipeline.


In [10]:
# Load the pre-downloaded model from local directory
print("🤖 Loading language model...")

model_dir = "./meta-llama/Llama-3.2-1B/"  # Path to our downloaded model
print(f"📂 Loading from: {model_dir}")

# Load the tokenizer and model from local storage
tokenizer = AutoTokenizer.from_pretrained(model_dir)
# ignore_mismatched_sizes is omitted so that a corrupted or mismatched checkpoint raises an error
model = AutoModelForCausalLM.from_pretrained(model_dir)

print("✅ Model loaded successfully!")
print(f"🔧 Model type: {type(model).__name__}")
print(f"📊 Model parameters: ~{sum(p.numel() for p in model.parameters()) / 1e9:.1f}B")

🤖 Loading language model...
📂 Loading from: ./meta-llama/Llama-3.2-1B/
✅ Model loaded successfully!
🔧 Model type: LlamaForCausalLM
📊 Model parameters: ~1.2B


## Creating the Complete RAG Pipeline

Now we'll combine everything into a working RAG system:
1. **Text Generation Pipeline**: Configures how the model generates text
2. **LangChain Integration**: Wraps the pipeline for use with LangChain
3. **Retrieval QA Chain**: Combines retrieval and generation


In [11]:
# Create the text generation pipeline
print("⚙️ Setting up text generation pipeline...")

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    max_new_tokens=100,           # Maximum tokens to generate
    repetition_penalty=1.1,       # Reduce repetitive text
    # Generation settings are passed directly: model_kwargs only applies when
    # the pipeline loads the model itself from a name or path
    do_sample=True,               # Sampling must be enabled for temperature to take effect
    temperature=0.01              # Low temperature for more focused responses
)

# Wrap the pipeline for LangChain compatibility
llm_pipeline = HuggingFacePipeline(pipeline=pipe)

print("🔗 Creating RAG retrieval chain...")

# Create the complete RAG pipeline
rag_retrieval = RetrievalQA.from_chain_type(
    llm=llm_pipeline,
    chain_type='stuff',                                    # How to combine retrieved docs
    retriever=vectordb.as_retriever(search_kwargs={'k': 3}),  # Retrieve top 3 similar docs
    chain_type_kwargs={'prompt': prompt}                   # Use our custom prompt
)

print("✅ RAG pipeline created successfully!")
print("🎯 Configuration:")
print("  - Retrieval: Top 3 most similar documents")
print("  - Generation: Llama 3.2 1B with low temperature")
print("  - Chain type: Stuff (concatenate all retrieved docs)")


Device set to use mps:0


⚙️ Setting up text generation pipeline...
🔗 Creating RAG retrieval chain...
✅ RAG pipeline created successfully!
🎯 Configuration:
  - Retrieval: Top 3 most similar documents
  - Generation: Llama 3.2 1B with low temperature
  - Chain type: Stuff (concatenate all retrieved docs)
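The effect of a low temperature can be shown with a small worked example. Temperature divides the model's scores (logits) before they are turned into probabilities, so a value near zero concentrates almost all probability on the top-scoring token. The logits below are made up. Note also that in the `transformers` library, temperature only influences generation when sampling is enabled.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, the first token's probability approaches 1, which is why very low temperatures behave almost like always picking the most likely token.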




# Testing and Comparing RAG vs Base Model

Now let's test our RAG system and compare it with the base model to see the difference!

## The Test Case: "DRAGON"
We'll ask about "DRAGON" - a specific method mentioned in our PDF document. This is a perfect example of why RAG is useful:

- **Base Model**: Will likely give generic information about dragons (mythical creatures)
- **RAG Model**: Should provide specific information about the DRAGON method from the research paper

This comparison will clearly demonstrate RAG's ability to provide contextually relevant, document-specific information.

In [14]:
# Test 1: Base Model (without RAG context)
print("🔍 Testing Base Model (without context)")
print("=" * 50)

# Create a simple prompt template for the base model
fm_template = """Use the following pieces of information to answer the user's question. Explain the answer clearly.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Question: {question}

Only return the helpful answer below and nothing else. Give an answer in 1000 characters at maximum please.
Helpful answer:
"""

fm_prompt = PromptTemplate.from_template(fm_template)
user_question = 'Tell me about DRAGON'

print(f"❓ Question: {user_question}")
print("\n🤖 Base Model Response:")

# Create chain and get response
chain_fm = fm_prompt | llm_pipeline
response = chain_fm.invoke({"question": user_question})

# Extract and display the answer
answer_only = response.split("answer:")[-1].strip()
print(f"📝 {answer_only}")
print("\n" + "=" * 50)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


🔍 Testing Base Model (without context)
❓ Question: Tell me about DRAGON

🤖 Base Model Response:
📝 Dragon is a game where you can play with your friends by playing together online or locally.
You can also play other games like "PewDiePie", "Fortnite",and many more.
There are different modes such as Free for All, Team Deathmatch, Capture the Flag etc. You can also buy in-game items using real money (in-app purchases).
The game has various gameplay features such as auto-aim, infinite ammo, unlimited health, no recoil, no ADS etc.
It



In [15]:
# Test 2: RAG Model (with retrieved context)
print("🔍 Testing RAG Model (with context)")
print("=" * 50)

print(f"❓ Question: {user_question}")
print("\n🎯 RAG Model Response:")

# Get response from RAG system
response = rag_retrieval.invoke({"query": user_question})
rag_answer = response["result"].split("answer:")[-1].strip()
print(f"📝 {rag_answer}")

print("\n" + "=" * 50)
print("🎉 Comparison Complete!")
print("\n📊 Key Differences:")
print("• Base Model: Provides general knowledge about dragons")
print("• RAG Model: Provides specific information from the research paper")
print("• RAG demonstrates how retrieval enhances generation with relevant context")


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


🔍 Testing RAG Model (with context)
❓ Question: Tell me about DRAGON

🎯 RAG Model Response:
📝 DRAGON is a distributed deep learning framework developed by Google Research team. DRAGON provides a general programming model for training and executing large-scale neural networks with thousands of parameters, called DAG network. This paper is about the DAG network in detail.

🎉 Comparison Complete!

📊 Key Differences:
• Base Model: Provides general knowledge about dragons
• RAG Model: Provides specific information from the research paper
• RAG demonstrates how retrieval enhances generation with relevant context
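Both test cells isolate the answer with `split("answer:")[-1]`, which matters whenever the generated text echoes the prompt. A slightly more defensive variant is sketched below. The helper `extract_answer` and the sample strings are hypothetical and are not taken from the notebook's outputs.

```python
def extract_answer(generated_text, marker="Helpful answer:"):
    """Return the text after the last occurrence of marker, or all text if the marker is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()

# Sample output that echoes the end of the prompt before the answer
echoed = "Question: Tell me about DRAGON\nHelpful answer:\nDRAGON is described in the paper."
print(extract_answer(echoed))
print(extract_answer("  No marker here.  "))
```

Matching the full marker text rather than the bare word `answer:` reduces the chance of cutting the response at an unrelated occurrence of that word inside the retrieved context.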


## Summary and Next Steps

Congratulations! You've successfully built and tested a complete RAG system. Here's what we accomplished:

### ✅ What We Built
1. **Document Processing Pipeline**: Loaded and chunked a PDF document
2. **Vector Store**: Created embeddings and stored them in Chroma
3. **Language Model Integration**: Set up Llama 3.2 1B for text generation
4. **RAG Pipeline**: Combined retrieval and generation for context-aware responses

### 🎯 Key Benefits Demonstrated
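- **Contextual Accuracy**: The RAG answer draws on the research paper instead of generic associations with the word "dragon"
- **Knowledge Grounding**: Responses are conditioned on the top 3 chunks retrieved from the document
- **Reduced Hallucination**: Grounding makes fabricated answers less likely, but a 1B model can still misstate details, so check answers against the retrieved chunks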

### 🚀 Potential Improvements
1. **Multiple Documents**: Add more PDFs to expand the knowledge base
2. **Better Chunking**: Experiment with different chunk sizes and overlap
3. **Advanced Retrieval**: Try different similarity search methods
4. **Larger Models**: Use more powerful language models for better responses
5. **Evaluation Metrics**: Add quantitative evaluation of response quality
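As one possible starting point for quantitative evaluation, the sketch below computes a retrieval hit rate: the fraction of test questions for which the known relevant chunk appears among the top `k` retrieved chunks. The chunk IDs and the labeled data are invented for illustration, and building such a labeled set for your own documents is assumed.

```python
def hit_rate_at_k(retrieved_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose relevant chunk ID appears in the top-k retrieved IDs."""
    hits = 0
    for retrieved, relevant in zip(retrieved_ids_per_query, relevant_id_per_query):
        if relevant in retrieved[:k]:
            hits += 1
    return hits / len(relevant_id_per_query)

# Invented example: three queries, each with a ranked list of retrieved chunk IDs
retrieved = [[4, 2, 9], [1, 7, 3], [5, 0, 8]]
relevant = [2, 3, 6]  # the chunk ID a human judged relevant for each query

print(hit_rate_at_k(retrieved, relevant, k=3))
print(hit_rate_at_k(retrieved, relevant, k=2))
```

This measures only the retrieval step. Judging the generated answers themselves requires separate checks, such as comparing them with reference answers.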

### 💡 Use Cases
This RAG system can be adapted for:
- **Research Assistant**: Query academic papers and documents
- **Customer Support**: Answer questions based on product documentation
- **Legal Research**: Search through legal documents and cases
- **Technical Documentation**: Query API docs, manuals, and guides
