# RAG Chatbot with Local HuggingFace Model

A complete Retrieval-Augmented Generation (RAG) system for students - **NO API TOKEN NEEDED!**

## What This Notebook Does

1. **Loads PDF documents** from the `documents/` folder
2. **Chunks text** into smaller pieces with overlap
3. **Creates embeddings** (vector representations) locally
4. **Builds a search index** using FAISS
5. **Retrieves relevant chunks** for your questions
6. **Generates answers** using a local HuggingFace model (runs on your computer)

## Requirements

- **NO API token needed** - everything runs locally!
- PDF files in the `documents/` folder
- Required packages: `pypdf`, `sentence-transformers`, `faiss-cpu`, `transformers`, `gradio`

## Step 1: Configuration (Global Variables)

**IMPORTANT**: This cell configures the local model - no API token needed!

In [None]:
# ============================================
# GLOBAL CONFIGURATION
# ============================================

# Local Model Configuration (No API token needed!)
MODEL_NAME = "google/flan-t5-large"  # You can also use "google/flan-t5-base" for faster, smaller model

# Folder containing your PDF documents
PDF_FOLDER = "documents"

print("‚úÖ Configuration loaded!")
print(f"   Local Model: {MODEL_NAME}")
print(f"   PDF Folder: {PDF_FOLDER}")
print(f"   üéâ No API token needed - everything runs locally!")

## Step 2: Import Required Libraries

In [None]:
import os
from typing import List, Dict
from pathlib import Path

# PDF processing
from pypdf import PdfReader

# Embeddings
from sentence_transformers import SentenceTransformer

# Vector store
import faiss
import numpy as np

# Local LLM
from transformers import pipeline

# UI (optional)
import gradio as gr

print("‚úÖ All libraries imported successfully!")

## Step 3: Initialize the RAG System

This sets up:
- Local HuggingFace model (downloads once, then cached)
- Local embedding model (runs on your computer)

In [None]:
print("=" * 60)
print("üöÄ INITIALIZING RAG CHATBOT")
print("=" * 60)

# Load embedding model (this runs locally)
print("\n‚úì Loading embedding model (local)...")
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print("  ‚úì Embedding model loaded")

# Load local LLM for answer generation
print(f"\n‚úì Loading language model: {MODEL_NAME}")
print("  (This may take a minute on first run - model will be downloaded and cached)")
llm = pipeline(
    "text2text-generation",
    model=MODEL_NAME,
    max_length=512,
    device=-1  # Use CPU (-1) or GPU (0) if available
)
print("  ‚úì Language model loaded and ready!")

# Initialize storage
chunks = []
chunk_embeddings = None
index = None

print("\n" + "=" * 60)
print("‚úÖ RAG CHATBOT READY - 100% LOCAL!")
print("=" * 60)

## Step 4: Load PDF Documents

This function extracts text from all PDFs in your `documents/` folder.

In [4]:
def load_pdfs(pdf_folder: str = PDF_FOLDER) -> List[str]:
    """
    Load and extract text from all PDFs in folder
    
    Args:
        pdf_folder: Path to folder containing PDFs
        
    Returns:
        List of text strings, one per PDF
    """
    print(f"\nüìÑ Loading PDFs from '{pdf_folder}/'...")

    all_text = []
    pdf_folder_path = Path(pdf_folder)
    pdf_folder_path.mkdir(exist_ok=True)

    pdf_files = list(pdf_folder_path.glob("*.pdf"))

    if not pdf_files:
        print(f"  ‚ö†Ô∏è  No PDF files found in '{pdf_folder}/'")
        return []

    print(f"  Found {len(pdf_files)} PDF file(s)")

    for pdf_path in pdf_files:
        print(f"  Processing: {pdf_path.name}")
        try:
            reader = PdfReader(pdf_path)
            text = ""
            for page in reader.pages:
                text += page.extract_text() + "\n"
            all_text.append(text)
            print(f"    ‚úì Extracted {len(reader.pages)} pages")
        except Exception as e:
            print(f"    ‚úó Error: {e}")

    return all_text

print("‚úÖ Function defined: load_pdfs()")

‚úÖ Function defined: load_pdfs()


## Step 5: Chunk Text with Overlap

Breaking documents into chunks helps with:
- More precise retrieval
- Staying within model limits
- Better matching with queries

**Overlap** ensures we don't lose context at chunk boundaries.

In [5]:
def chunk_text(texts: List[str], chunk_size: int = 500, overlap: int = 100) -> List[str]:
    """
    Split texts into smaller chunks with overlap
    
    Args:
        texts: List of text strings to chunk
        chunk_size: Target size of each chunk in characters
        overlap: Number of characters to overlap between chunks
        
    Returns:
        List of text chunks
    """
    print(f"\n‚úÇÔ∏è  Chunking text (size={chunk_size}, overlap={overlap})...")

    chunks = []

    for text in texts:
        sentences = text.replace('\n', ' ').split('. ')

        current_chunk = []
        current_length = 0

        for sentence in sentences:
            sentence = sentence.strip()
            if not sentence:
                continue

            sentence_length = len(sentence) + 2

            if current_length + sentence_length > chunk_size and current_chunk:
                chunk_text = '. '.join(current_chunk) + '.'
                chunks.append(chunk_text)

                if overlap > 0:
                    overlap_text = chunk_text[-overlap:]
                    current_chunk = [overlap_text + sentence]
                    current_length = len('. '.join(current_chunk))
                else:
                    current_chunk = [sentence]
                    current_length = sentence_length
            else:
                current_chunk.append(sentence)
                current_length += sentence_length

        if current_chunk:
            chunk_text = '. '.join(current_chunk)
            if not chunk_text.endswith('.'):
                chunk_text += '.'
            chunks.append(chunk_text)

    print(f"  ‚úì Created {len(chunks)} chunks")
    if chunks:
        print(f"  ‚úì Avg length: {sum(len(c) for c in chunks) / len(chunks):.0f} chars")

    return chunks

print("‚úÖ Function defined: chunk_text()")

‚úÖ Function defined: chunk_text()


## Step 6: Create Vector Store

This creates:
1. **Embeddings**: Vector representations of text chunks
2. **FAISS Index**: Fast similarity search structure

In [6]:
def create_vector_store(text_chunks: List[str]):
    """
    Create embeddings and FAISS index
    
    Args:
        text_chunks: List of text chunks
        
    Returns:
        Tuple of (embeddings, index)
    """
    global chunk_embeddings, index
    
    print(f"\nüî¢ Creating vector store...")

    print(f"  Encoding {len(text_chunks)} chunks...")
    chunk_embeddings = embedder.encode(
        text_chunks,
        show_progress_bar=True,
        batch_size=32
    )

    print(f"  Building FAISS index...")
    dimension = chunk_embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(chunk_embeddings.astype('float32'))

    print(f"  ‚úì Vector store ready ({index.ntotal} vectors)")
    
    return chunk_embeddings, index

print("‚úÖ Function defined: create_vector_store()")

‚úÖ Function defined: create_vector_store()


## Step 7: Setup Knowledge Base

**Run this cell to process your PDFs!**

This will:
1. Load all PDFs from `documents/` folder
2. Chunk the text
3. Create embeddings and search index

In [7]:
# Load PDFs
documents = load_pdfs(PDF_FOLDER)

if not documents:
    print("\n‚ùå No PDFs loaded. Please add PDFs to the 'documents/' folder.")
else:
    # Chunk texts
    chunks = chunk_text(documents)

    # Create vector store
    chunk_embeddings, index = create_vector_store(chunks)

    print("\n" + "=" * 60)
    print("‚úÖ KNOWLEDGE BASE READY")
    print(f"   Total chunks: {len(chunks)}")
    print(f"   Embedding dim: {chunk_embeddings.shape[1]}")
    print("=" * 60)


üìÑ Loading PDFs from 'documents/'...
  Found 1 PDF file(s)
  Processing: intro-to-econometrics.pdf
    ‚úì Extracted 801 pages

‚úÇÔ∏è  Chunking text (size=500, overlap=100)...
  ‚úì Created 6431 chunks
  ‚úì Avg length: 434 chars

üî¢ Creating vector store...
  Encoding 6431 chunks...


Batches: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 201/201 [03:53<00:00,  1.16s/it]

  Building FAISS index...
  ‚úì Vector store ready (6431 vectors)

‚úÖ KNOWLEDGE BASE READY
   Total chunks: 6431
   Embedding dim: 384





## Step 8: Retrieval Function

This finds the most relevant chunks for a given question.

In [8]:
def retrieve(query: str, top_k: int = 3) -> tuple:
    """
    Retrieve most relevant chunks for a query
    
    Args:
        query: Search query
        top_k: Number of chunks to retrieve
        
    Returns:
        Tuple of (relevant_chunks, similarities)
    """
    # Encode query
    query_embedding = embedder.encode([query])

    # Search
    distances, indices = index.search(
        query_embedding.astype('float32'),
        min(top_k, len(chunks))
    )

    # Convert distances to similarities
    similarities = 1 / (1 + distances[0])

    # Get chunks
    relevant_chunks = [chunks[i] for i in indices[0]]

    return relevant_chunks, similarities

print("‚úÖ Function defined: retrieve()")

‚úÖ Function defined: retrieve()


## Step 9: Answer Generation with Local Model

This uses the local HuggingFace model to generate answers based on retrieved context.

**Note**: Uses the global `llm` pipeline configured in Step 3.

In [None]:
def generate_answer(query: str, context_chunks: List[str]) -> str:
    """
    Generate answer using local LLM
    
    Args:
        query: User's question
        context_chunks: Relevant context chunks
        
    Returns:
        Generated answer
    """
    # Combine context (keep it reasonable for the model)
    context_text = "\n".join(context_chunks)
    
    # Truncate if too long (FLAN-T5 has token limits)
    max_context_length = 1000
    if len(context_text) > max_context_length:
        context_text = context_text[:max_context_length] + "..."
    
    # Create prompt optimized for FLAN-T5
    prompt = f"""Answer the question based only on the context below.

Context: {context_text}

Question: {query}

Answer:"""
    
    try:
        # Generate answer using local model
        response = llm(prompt, max_length=150, do_sample=False)[0]['generated_text']
        return response.strip()
    except Exception as e:
        return f"Error generating answer: {str(e)}"

print("‚úÖ Function defined: generate_answer()")

## Step 10: Complete RAG Pipeline

This combines retrieval + generation into one simple function.

In [10]:
def ask(question: str, top_k: int = 3) -> Dict:
    """
    Complete RAG pipeline: retrieve + generate
    
    Args:
        question: User's question
        top_k: Number of chunks to retrieve
        
    Returns:
        Dictionary with answer, sources, and similarities
    """
    if not index:
        return {
            "answer": "‚ùå Knowledge base not set up.",
            "sources": [],
            "similarities": []
        }

    # Retrieve relevant chunks
    relevant_chunks, similarities = retrieve(question, top_k)

    # Generate answer
    answer = generate_answer(question, relevant_chunks)

    return {
        "answer": answer,
        "sources": relevant_chunks,
        "similarities": similarities
    }

print("‚úÖ Function defined: ask()")
print("\nüéâ RAG pipeline ready! Try asking questions in the next cell.")

‚úÖ Function defined: ask()

üéâ RAG pipeline ready! Try asking questions in the next cell.


## Step 11: Test the RAG System!

Ask questions about your documents here.

In [11]:
# Test with sample questions
test_questions = [
    "What is regression analysis?",
    "What is econometrics?",
    "What are the main topics in this book?"
]

print("=" * 60)
print("TESTING RAG SYSTEM")
print("=" * 60)

for q in test_questions:
    print(f"\n‚ùì Q: {q}")
    result = ask(q, top_k=3)
    print(f"üí¨ A: {result['answer']}")
    print(f"üìä Similarities: {[f'{s:.2f}' for s in result['similarities']]}")
    print("-" * 60)

TESTING RAG SYSTEM

‚ùì Q: What is regression analysis?
üí¨ A: ‚ùå Invalid HuggingFace token. Please check your token.
üìä Similarities: ['0.54', '0.54', '0.53']
------------------------------------------------------------

‚ùì Q: What is econometrics?
üí¨ A: ‚ùå Invalid HuggingFace token. Please check your token.
üìä Similarities: ['0.76', '0.70', '0.65']
------------------------------------------------------------

‚ùì Q: What are the main topics in this book?
üí¨ A: ‚ùå Invalid HuggingFace token. Please check your token.
üìä Similarities: ['0.51', '0.51', '0.50']
------------------------------------------------------------


## Step 12: Interactive Q&A

Ask your own questions!

In [12]:
# Ask your own question here
my_question = "What is regression analysis?"  # Change this!

print(f"\n‚ùì Your Question: {my_question}\n")

result = ask(my_question, top_k=3)

print(f"üí¨ Answer:\n{result['answer']}\n")

print("üìö Source Chunks:")
for i, (source, sim) in enumerate(zip(result['sources'], result['similarities']), 1):
    print(f"\n{i}. (Similarity: {sim:.2f})")
    print(f"   {source[:200]}...")


‚ùì Your Question: What is regression analysis?

üí¨ Answer:
‚ùå Invalid HuggingFace token. Please check your token.

üìö Source Chunks:

1. (Similarity: 0.54)
   ceptual framework used in this text is the multiple regression model, the  mainstay of econometrics.This model, introduced in Part II, provides a mathematical way to quantify how a change in one varia...

2. (Similarity: 0.54)
   OLS algorithm. Regression software typically computes  the¬†OLS fixed effects estimator in two steps.In the first step, the entity-specific average is subtracted from each variable. In the second step,...

3. (Similarity: 0.53)
   s explain how to use multiple regression to analyze the  relationship among variables in a data set.In this chapter, we step back and ask,  What makes a study that uses multiple regression reliable or...


## Step 13: Launch Gradio Interface (Optional)

Create a web interface for easier interaction.

In [None]:
def create_gradio_interface():
    """
    Create Gradio web interface
    """
    def chatbot_response(message, history):
        result = ask(message, top_k=3)

        response = f"{result['answer']}\n\n"

        if result['sources']:
            response += "---\n**üìö Sources:**\n"
            for i, (source, sim) in enumerate(
                zip(result['sources'][:2], result['similarities'][:2]),
                1
            ):
                source_preview = source[:100].replace('\n', ' ')
                response += f"\n{i}. (Similarity: {sim:.2f}) {source_preview}...\n"

        return response

    interface = gr.ChatInterface(
        chatbot_response,
        title="üìö RAG Chatbot - 100% Local!",
        description=f"Ask questions about your PDF documents! Uses {MODEL_NAME} running locally - no API needed!",
        examples=[
            "What is this document about?",
            "Explain the main concept",
            "Summarize the key points",
        ],
        theme="soft",
    )

    return interface

# Launch the interface
print("üåê Launching Gradio interface...")
print("   Press the stop button in Jupyter to stop the server.\n")

demo = create_gradio_interface()
demo.launch(share=False)  # Set share=True for a public link

## Summary

You've built a complete RAG system! Here's what you learned:

### Components
1. **PDF Processing** - Extract text from documents
2. **Text Chunking** - Split into manageable pieces with overlap
3. **Embeddings** - Create vector representations (local)
4. **FAISS Index** - Fast similarity search
5. **Retrieval** - Find relevant chunks
6. **Local LLM** - Generate answers using HuggingFace transformers

### Key Features
- ‚úÖ **100% Free** - No API tokens or costs
- ‚úÖ **100% Local** - Everything runs on your computer
- ‚úÖ **Privacy** - Your documents never leave your machine
- ‚úÖ **Production-ready** - Same patterns used in real applications
- ‚úÖ **Works with HuggingFace Hub** - Uses transformers library

### Models Used
- **Embeddings**: `all-MiniLM-L6-v2` (sentence-transformers)
- **Generation**: `google/flan-t5-large` (can change to `flan-t5-base` for speed)

### Next Steps
- Try different questions
- Add more PDF documents
- Experiment with `top_k` parameter
- Try the Gradio interface
- Switch to `google/flan-t5-base` for faster responses (change in Step 1)

### Troubleshooting
- **Slow responses**: Try using `google/flan-t5-base` instead of `flan-t5-large`
- **Out of memory**: Reduce chunk size or use smaller model
- **No PDFs found**: Add PDFs to `documents/` folder
- **Model download issues**: Check internet connection - models download once and are cached