# RAG (Retrieval-Augmented Generation) Demo

This notebook demonstrates how to build a RAG system using LangChain, Hugging Face models, and the Chroma vector database. RAG combines retrieval with generation, so the model answers from retrieved documents instead of relying only on what it learned during training.

## What is RAG?
RAG is a technique that:
1. **Retrieves** relevant documents from a knowledge base
2. **Augments** the input prompt with this retrieved context
3. **Generates** a response using a language model

This approach helps overcome the limitations of pure language models by providing them with specific, relevant information to work with.
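
The three steps can be sketched with nothing but the standard library. This is a toy illustration, not the pipeline built later in this notebook: the knowledge base, the word-overlap scoring rule, and the `generate` stub are all hypothetical stand-ins for a vector store and a real language model.

```python
# Toy retrieve -> augment -> generate loop (standard library only).
# KNOWLEDGE_BASE, the overlap score, and generate() are illustrative assumptions.

KNOWLEDGE_BASE = [
    "DRAGON is a resource scheduler for distributed deep learning jobs.",
    "Kubernetes is a container orchestrator.",
    "Chroma is a vector database for embeddings.",
]

def tokenize(text):
    """Lowercase the text and return its set of words without basic punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def retrieve(question, documents, k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:k]

def augment(question, context_docs):
    """Build a prompt that places the retrieved text before the question."""
    context = "\n".join(context_docs)
    return f"Context: {context}\n\nQuestion: {question}\n\nAnswer:"

def generate(prompt):
    """Placeholder for a language model call."""
    return f"[model output for a {len(prompt)}-character prompt]"

question = "What is DRAGON?"
top_docs = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = augment(question, top_docs)
answer = generate(prompt)
print(prompt)
print(answer)
```

In the real pipeline, word overlap is replaced by embedding similarity and `generate` by the Llama model.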

## Prerequisites and Setup 

## Download model / Requirements

In [None]:
# Install the dependencies listed in requirements.txt
!pip install -r requirements.txt

# Login to Hugging Face Hub (required for accessing some models)
# SECURITY: Use environment variables for tokens in production
from huggingface_hub import login
import os

# Option 1: Use environment variable (recommended)
# Set your token in terminal: export HUGGINGFACE_TOKEN=your_token_here
hf_token = os.getenv('HUGGINGFACE_TOKEN')

# Option 2: Manual token input (for development only)
if not hf_token:
    print("⚠️  No HUGGINGFACE_TOKEN environment variable found.")
    print("📝 You can either:")
    print("   1. Set it as environment variable: export HUGGINGFACE_TOKEN=your_token")
    print("   2. Enter it manually below (not recommended for production)")
    # hf_token = input("Enter your Hugging Face token: ")  # Uncomment if needed
    
if hf_token:
    login(token=hf_token)
    print("✅ Successfully logged in to Hugging Face Hub")
else:
    print("⚠️  Skipping Hugging Face login - some models may not be accessible")



## Import Required Libraries

We'll import all necessary components for our RAG system:
- **Document loaders and text splitters** for processing PDFs
- **Embedding models** for vector representations
- **Vector store** for storing and retrieving documents
- **LLM components** for generating responses


In [2]:
# Import all required modules for the RAG system
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline, LlamaCpp
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from transformers import AutoTokenizer, AutoModelForCausalLM
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

print(" LlamaCpp support for GGUF models")

 LlamaCpp support for GGUF models


## Download and Setup Language Model

We'll download and set up our language model. In this example, we're using **Llama 3.1 8B Instruct** as a 4-bit quantized GGUF file (about 4.9 GB), which is:
- Small enough to run on a single machine for demonstration purposes
- A reasonable balance between response quality and resource requirements
- Suitable for local execution with llama-cpp-python

The model file is saved locally, so later runs skip the download.


In [3]:
## Download Llama 8B in GGUF format for llama-cpp-python

# Configuration for GGUF model download
import os
from huggingface_hub import hf_hub_download

# Using Llama 3.1 8B in GGUF format (optimized for llama-cpp-python)
model_name = "bartowski/Meta-Llama-3.1-8B-Instruct-GGUF"  # Pre-quantized GGUF model
model_file = "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"    # 4-bit quantized version (~4.9GB)
local_dir = "./llama-models/"
access_token = os.getenv('HUGGINGFACE_TOKEN')  # Uses environment variable for security

print(f"📥 Downloading Llama 8B model in GGUF format...")
print(f"🤖 Model: {model_name}")
print(f"📁 File: {model_file}")
print(f"💾 Saving to: {local_dir}")

# Create directory if it doesn't exist
os.makedirs(local_dir, exist_ok=True)

# Check if model already exists
model_path = os.path.join(local_dir, model_file)
if os.path.exists(model_path):
    print(f"✅ Model already exists at: {model_path}")
else:
    # Download the GGUF model file
    print("⬇️ Downloading model (this may take a while - ~4.9GB)...")
    model_path = hf_hub_download(
        repo_id=model_name,
        filename=model_file,
        local_dir=local_dir,
        token=access_token
    )
    print(f"✅ Model downloaded successfully!")

print(f"📍 Model path: {model_path}")
print(f"📊 Model size: ~4.9GB (4-bit quantized)")
print("🚀 Ready for llama-cpp-python!")


📥 Downloading Llama 8B model in GGUF format...
🤖 Model: bartowski/Meta-Llama-3.1-8B-Instruct-GGUF
📁 File: Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
💾 Saving to: ./llama-models/
✅ Model already exists at: ./llama-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
📍 Model path: ./llama-models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
📊 Model size: ~4.9GB (4-bit quantized)
🚀 Ready for llama-cpp-python!


## Create vectors 

In [4]:
# Step 1: Load PDF Document
print("📄 Loading PDF document...")

# Load the PDF file using PyPDFLoader
pdf_path = "./Data/Dynamic_Resource_Scheduler_for_Distributed_Deep_Learning_Training_in_Kubernetes.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load()

# Extract text content from all pages
all_page_text = [p.page_content for p in pages]
joined_page_text = " ".join(all_page_text)

print(f"✅ Loaded {len(pages)} pages")
print(f"📊 Total characters: {len(joined_page_text):,}")
print(f"📝 First 200 characters: {joined_page_text[:200]}...")

📄 Loading PDF document...
✅ Loaded 6 pages
📊 Total characters: 24,094
📝 First 200 characters: 978-1-7281-8038-0/20/$31.00 ©2020 IEEE 
 
Dynamic Resource Scheduler for Distributed Deep 
Learning Training in Kubernetes 
Muhammad Fadhriga Bestari 
School of Electrical Engineering and Informatics,...


In [5]:
# Step 2: Split text into chunks
print("✂️ Splitting text into chunks...")

# Configure the text splitter
# chunk_size: Maximum characters per chunk
# chunk_overlap: Characters to overlap between chunks (maintains context)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # A common starting point; tune it for your embedding model's input limit
    chunk_overlap=150   # Overlap to maintain context between chunks
)

splits = text_splitter.split_text(joined_page_text)

print(f"✅ Created {len(splits)} text chunks")
print(f"📏 Average chunk size: {sum(len(chunk) for chunk in splits) // len(splits)} characters")
print(f"\n📖 First chunk preview:\n{splits[0][:300]}...")

# Display the first chunk as output
splits[0]

✂️ Splitting text into chunks...
✅ Created 29 text chunks
📏 Average chunk size: 947 characters

📖 First chunk preview:
978-1-7281-8038-0/20/$31.00 ©2020 IEEE 
 
Dynamic Resource Scheduler for Distributed Deep 
Learning Training in Kubernetes 
Muhammad Fadhriga Bestari 
School of Electrical Engineering and Informatics, ITB, 
Indonesia 
fadhriga.bestari@gmail.com 
 
Achmad Imam Kistijantoro1,2 
1School of Electrical E...


'978-1-7281-8038-0/20/$31.00 ©2020 IEEE \n \nDynamic Resource Scheduler for Distributed Deep \nLearning Training in Kubernetes \nMuhammad Fadhriga Bestari \nSchool of Electrical Engineering and Informatics, ITB, \nIndonesia \nfadhriga.bestari@gmail.com \n \nAchmad Imam Kistijantoro1,2 \n1School of Electrical Engineering and Informatics, ITB, \nIndonesia \n2University Center of Excellence on Artificial Intelligence \nfor Vision, Natural Language Processing & Big Data \nAnalytics (U-CoE AI-VLB), Indonesia \nimam@stei.itb.ac.id\n \nAnggrahita Bayu Sasmita \nSchool of Electrical Engineering and Informatics, ITB, Indonesia \nangga@stei.itb.ac.id \n \n \nAbstract—Distributed deep learning is a method of machine \nlearning that is used today due to its many advantages. One of the \nmany tools used to train distributed deep learning model is \nKubeflow, which runs on top of Kubernetes. Kubernetes is a \ncontainerized application orchestrator that ease the deploy ment'
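
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width splitter. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will not match this one exactly. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.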

In [6]:
# Step 3: Create embeddings and vector store
print("🔢 Creating embeddings and vector store...")

import os
import shutil

# Set up the embedding model and storage directory
persist_directory = 'chroma_vectorstore'
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

print(f"🤖 Using embedding model: {embedding_model}")
print(f"💾 Vector store location: {persist_directory}")

# Clean up any existing vector store to avoid conflicts
if os.path.exists(persist_directory):
    print(f"🧹 Cleaning up existing vector store...")
    shutil.rmtree(persist_directory)

# Initialize the embedding model
embedding = HuggingFaceEmbeddings(model_name=embedding_model)

# Create vector store from text chunks
# This process converts each text chunk into a numerical vector
vectordb = Chroma.from_texts(
    texts=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

# Persist the vector store to disk for future use
vectordb.persist()

# Load the persisted vector store (demonstrates how to reload)
vectordb_loaded = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

print(f"✅ Vector store created with {len(splits)} documents")
print("🎯 Ready for similarity search and retrieval!")

🔢 Creating embeddings and vector store...
🤖 Using embedding model: sentence-transformers/all-MiniLM-L6-v2
💾 Vector store location: chroma_vectorstore



✅ Vector store created with 29 documents
🎯 Ready for similarity search and retrieval!
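
Conceptually, retrieval from the vector store is a nearest-neighbour search over embedding vectors. The sketch below uses hand-written 3-dimensional vectors and cosine similarity purely for illustration; the real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the distance metric Chroma applies depends on its configuration, so treat cosine here as an assumption. `toy_store` and `top_k` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical chunk texts mapped to made-up embedding vectors.
toy_store = {
    "chunk about scheduling": [0.9, 0.1, 0.0],
    "chunk about authors": [0.0, 0.2, 0.9],
    "chunk about autoscaling": [0.7, 0.6, 0.1],
}

def top_k(query_vector, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:k]]

query_vector = [1.0, 0.2, 0.0]  # stand-in for an embedded question
results = top_k(query_vector, toy_store, k=2)
print(results)
# ['chunk about scheduling', 'chunk about autoscaling']
```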



## Create Prompts

In [7]:
# Create a simple prompt template for direct LLM queries
direct_llm_template = """You are a helpful AI assistant. Answer the following question to the best of your knowledge in 1000 characters at maximum.

Question: {question}

Answer:"""

# Create prompt template
direct_prompt = PromptTemplate(
    template=direct_llm_template,
    input_variables=['question']
)

# Create a simple function to query LLM directly
def query_llm_directly(question):
    """Query the LLM directly without any document context"""
    formatted_prompt = direct_prompt.format(question=question)
    response = llm_cpp(formatted_prompt)
    return response

print("Direct LLM prompt created successfully!")


Direct LLM prompt created successfully!


In [12]:
# Define the RAG prompt template

prompt_template = """You are a helpful AI assistant. Answer the question using ONLY the provided context. Give a concise answer in maximum 100 words. Do NOT repeat information.

Context: {context}

Question: {question}

Direct answer:"""

# Create the improved prompt template object
RAG_prompt_template = PromptTemplate(
    template=prompt_template,
    input_variables=['context', 'question']
)

print("RAG prompt created successfully!")


RAG prompt created successfully!
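
To make the template concrete, the sketch below fills it with plain `str.format`, which is roughly what happens to the placeholders at query time. The two chunk strings are hypothetical, and joining chunks with a blank line is an assumption about how the `stuff` chain concatenates documents.

```python
# Same template text as the RAG prompt defined above.
rag_template = """You are a helpful AI assistant. Answer the question using ONLY the provided context. Give a concise answer in maximum 100 words. Do NOT repeat information.

Context: {context}

Question: {question}

Direct answer:"""

# Hypothetical retrieved chunks; in the notebook these come from the vector store.
retrieved_chunks = [
    "First retrieved chunk of the paper.",
    "Second retrieved chunk of the paper.",
]

# Assumption: chunks are concatenated with a blank line between them.
context = "\n\n".join(retrieved_chunks)
filled_prompt = rag_template.format(context=context, question="What is DRAGON?")
print(filled_prompt)
```

The model only ever sees `filled_prompt`, so everything it knows about the document has to fit inside the `Context:` section.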


## Loading the Language Model

Now we'll load the downloaded GGUF model with LangChain's `LlamaCpp` wrapper, which runs it through llama-cpp-python.


In [None]:
## Load Llama 8B with llama-cpp-python

print("🤖 Loading Llama 8B with llama-cpp-python...")

# Set up callback manager for streaming output (optional)
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])


# Configure LlamaCpp with sampling parameters chosen to limit repetition
llm_cpp = LlamaCpp(
    model_path=model_path,           
    n_ctx=2048,                      # Context window size
    n_batch=512,                     # Batch size for processing
    n_threads=8,                     # Number of CPU threads
    max_tokens=150,                  # 150 for complete responses
    temperature=0.1,                 # 0.1 for more focused responses
    top_p=0.8,                       # 0.8 for less randomness
    repeat_penalty=2.0,              # 2.0 to prevent repetition
    callback_manager=callback_manager,
    verbose=False,                   
    streaming=True,                  
    stop=["Question:", "Context:", "\n\n", "Note:", "Answer:", "Human:", "Assistant:", "fig.", "design is"]  # More stop sequences
)


print("✅ Llama 8B loaded successfully with llama-cpp-python!")


🤖 Loading Llama 8B with llama-cpp-python...


llama_context: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96

✅ Llama 8B loaded successfully with llama-cpp-python!
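
The `stop` list above ends generation as soon as one of its strings appears. The sketch below imitates that effect after the fact on a finished string; it is a simplification, since the backend stops during decoding, and `truncate_at_stop` is a hypothetical helper.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# A model that keeps going after its answer gets cut at the blank line.
raw = " DRAGON is a resource scheduler.\n\nQuestion: What is Kubernetes?"
clean = truncate_at_stop(raw, ["Question:", "Context:", "\n\n"])
print(repr(clean))
# ' DRAGON is a resource scheduler.'

# Broad stop strings can also cut a legitimate answer short.
clipped = truncate_at_stop("The design is modular.", ["fig.", "design is"])
print(repr(clipped))
# 'The '
```

The second call shows the trade-off: stop strings such as `"design is"` suppress one unwanted pattern but will also truncate any answer that happens to contain them.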


## Creating the Complete RAG Pipeline

Now we'll combine everything into a working RAG system:
1. **Retriever**: Returns the 3 chunks most similar to the question from the Chroma vector store
2. **Prompt**: Inserts the retrieved chunks and the question into our RAG prompt template
3. **Retrieval QA Chain**: Passes the filled prompt to the `LlamaCpp` model to generate the answer


In [10]:
## Create RAG Pipeline with Llama 8B (llama-cpp-python)

print("🔗 Creating RAG retrieval chain with Llama 8B...")

# Create the complete RAG pipeline using llama-cpp-python with streaming support
rag_retrieval_cpp = RetrievalQA.from_chain_type(
    llm=llm_cpp,                                           # Using LlamaCpp instead of HuggingFacePipeline
    chain_type='stuff',                                    # How to combine retrieved docs
    retriever=vectordb.as_retriever(search_kwargs={'k': 3}),  # Retrieve top 3 similar docs
    chain_type_kwargs={'prompt': RAG_prompt_template},                  # Use our custom prompt
    return_source_documents=False                          # Don't return source docs for cleaner output
)

# Create a streaming function for RAG
def query_rag_streaming(question):
    """Query RAG with streaming support"""
    print("📝 RAG Answer: ", end="", flush=True)
    # Use invoke instead of run for better control
    result = rag_retrieval_cpp.invoke({"query": question})
    return result['result'] if 'result' in result else result

print("✅ RAG pipeline with Llama 8B created successfully!")
print("🎯 Configuration:")
print("  - Model: Llama 3.1 8B Instruct (4-bit quantized)")
print("  - Backend: llama-cpp-python (optimized)")
print("  - Retrieval: Top 3 most similar documents")
print("  - Context: 2048 tokens")
print("  - Chain type: Stuff (concatenate all retrieved docs)")
print("  - ✨ Streaming: Enabled for both direct LLM and RAG!")
print("💡 This should be much faster and use less memory than transformers!")


🔗 Creating RAG retrieval chain with Llama 8B...
✅ RAG pipeline with Llama 8B created successfully!
🎯 Configuration:
  - Model: Llama 3.1 8B Instruct (4-bit quantized)
  - Backend: llama-cpp-python (optimized)
  - Retrieval: Top 3 most similar documents
  - Context: 2048 tokens
  - Chain type: Stuff (concatenate all retrieved docs)
  - ✨ Streaming: Enabled for both direct LLM and RAG!
💡 This should be much faster and use less memory than transformers!
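
Because the `stuff` chain places every retrieved chunk into one prompt, it is worth checking that the prompt fits the context window. The estimate below uses the notebook's settings (`k=3`, `chunk_size=1000`, `n_ctx=2048`, `max_tokens=150`). The 4-characters-per-token ratio and the 250-character template size are rough assumptions, not measurements; real token counts depend on the tokenizer and the text.

```python
import math

CHARS_PER_TOKEN = 4        # assumption: rough rule of thumb for English text
TEMPLATE_CHARS = 250       # assumption: approximate size of the prompt template

def estimate_tokens(num_chars):
    """Rough token estimate from a character count."""
    return math.ceil(num_chars / CHARS_PER_TOKEN)

k = 3                      # retrieved chunks per query
chunk_size = 1000          # maximum characters per chunk
n_ctx = 2048               # context window configured for LlamaCpp
max_tokens = 150           # tokens reserved for the generated answer

context_tokens = estimate_tokens(k * chunk_size)
template_tokens = estimate_tokens(TEMPLATE_CHARS)
used = context_tokens + template_tokens + max_tokens
headroom = n_ctx - used

print(f"context ~{context_tokens}, template ~{template_tokens}, answer {max_tokens}")
print(f"estimated use: {used} of {n_ctx} tokens, headroom {headroom}")
```

Under these assumptions the prompt uses roughly half of the window, so raising `k` or `chunk_size` is possible but should be re-checked the same way.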


# Testing and Comparing RAG vs Base Model

Now let's test our RAG system and compare it with the base model to see the difference!

## The Test Case: "DRAGON"
We'll ask about "DRAGON" - a specific method mentioned in our PDF document. This is a perfect example of why RAG is useful:

- **Base Model**: Has no access to the paper, so it will likely describe something unrelated (such as mythical creatures) or invent a definition
- **RAG Model**: Should provide specific information about the DRAGON method from the research paper

This comparison will clearly demonstrate RAG's ability to provide contextually relevant, document-specific information.

In [13]:
# Test Direct LLM vs RAG Comparison
print("🔄 Testing Direct LLM vs RAG Comparison")
print("=" * 60)

# Test question
test_question = "What is DRAGON?"

print(f"❓ Test Question: {test_question}")
print("\n" + "="*30)

# 1. Test Direct LLM (No Context)
print("🤖 1. DIRECT LLM RESPONSE (No Document Context):")
print("-" * 40)
print("💭 Model will use only its training knowledge...")

direct_response = query_llm_directly(test_question)

# 2. Test RAG LLM (With Context)
print("\n")
print("🧠 2. RAG LLM RESPONSE (With Document Context):")
print("-" * 40)
print("📚 Model will use retrieved document context...")

# Use the new streaming RAG function
rag_response = query_rag_streaming(test_question)

🔄 Testing Direct LLM vs RAG Comparison
❓ Test Question: What is DRAGON?

🤖 1. DIRECT LLM RESPONSE (No Document Context):
----------------------------------------
💭 Model will use only its training knowledge...
 DRAGON is a type of astronomical object, specifically an active galactic nucleus (AGN). It's characterized by extremely high energy output and strong emission lines. The name "DRAGO" comes from the acronym for its spectral characteristics: D = Dust; R= Radio continuum ; A=GALACTIC NUCLEUS EMISSION LINES.

🧠 2. RAG LLM RESPONSE (With Document Context):
----------------------------------------
📚 Model will use retrieved document context...
📝 RAG Answer:  DRAGON is a resource scheduler that schedules distributed jobs using gang scheduling and autoscaling.  (Maximum of around ~70 words) 

## Summary and Next Steps

Congratulations! You've successfully built and tested a complete RAG system. Here's what we accomplished:

### ✅ What We Built
1. **Document Processing Pipeline**: Loaded and chunked a PDF document
2. **Vector Store**: Created embeddings and stored them in Chroma
3. **Language Model Integration**: Set up Llama 3.1 8B Instruct (4-bit GGUF) with llama-cpp-python for text generation
4. **RAG Pipeline**: Combined retrieval and generation for context-aware responses

### 🎯 Key Benefits Demonstrated
- **Contextual Accuracy**: RAG provides specific, document-relevant information
- **Knowledge Grounding**: Responses are based on actual document content
- **Reduced Hallucination**: Less likely to generate incorrect information

### 🚀 Potential Improvements
1. **Multiple Documents**: Add more PDFs to expand the knowledge base
2. **Better Chunking**: Experiment with different chunk sizes and overlap
3. **Advanced Retrieval**: Try different similarity search methods
4. **Larger Models**: Use more powerful language models for better responses
5. **Evaluation Metrics**: Add quantitative evaluation of response quality
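
As one possible starting point for the evaluation idea, the sketch below computes a token-level F1 score between a generated answer and a reference answer. The metric choice and the example strings are assumptions for illustration; this notebook does not include reference answers or any measured scores.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens that appear in both, respecting multiplicity.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical prediction and reference, not taken from a real evaluation run.
score = token_f1("dragon is a scheduler", "dragon is a resource scheduler")
print(f"{score:.3f}")
```

Word overlap rewards matching phrasing and not correctness, so a score like this is best read alongside a manual check of the answers.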

### 💡 Use Cases
This RAG system can be adapted for:
- **Research Assistant**: Query academic papers and documents
- **Customer Support**: Answer questions based on product documentation
- **Legal Research**: Search through legal documents and cases
- **Technical Documentation**: Query API docs, manuals, and guides
