# RAG From Scratch — Step-by-Step Tutorial

This notebook demonstrates how to implement a Retrieval-Augmented Generation (RAG) pipeline from scratch, **without relying on high-level RAG frameworks**.

We will build:
1. **Document Text Extraction** — Read text from PDF files.
2. **Chunking with Overlap** — Split text into manageable pieces.
3. **Embeddings** — Convert text chunks into vector representations.
4. **Retrieval** — Find the most relevant chunks for a given query.
5. **Answer Generation (Basic)** — Use retrieved chunks to form answers.

**Goal:** Learn and understand each step of RAG by coding it ourselves.


## Step 0 — Environment & Setup
Before starting, let's check our Python version and install required libraries.


In [2]:
import sys, platform
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())


Python: 3.11.13
Platform: Linux-6.1.123+-x86_64-with-glibc2.35


### Install dependencies
We will use:
- `pdfplumber` for PDF text extraction
- `sentence-transformers` for embeddings
- `numpy` for similarity calculations


In [4]:
!pip install pdfplumber sentence-transformers numpy

Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downlo
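
Once the install cell has finished, it can be useful to confirm which versions were actually picked up. The snippet below is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not something the tutorial defines.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Report the packages this notebook depends on.
for name in ("pdfplumber", "sentence-transformers", "numpy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```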

## Step 1 — Document Text Extraction
We start by extracting text from a PDF file using `pdfplumber`.

**Inputs:**
- `pdf_path` — Path to the PDF file.

**Outputs:**
- `text` — Extracted text from all pages.


In [6]:
import pdfplumber

pdf_path = "/content/AI Agent Project Report_ Voice-Based Ride Booking System.pdf"
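with pdfplumber.open(pdf_path) as pdf:
    # extract_text() can yield nothing for pages without a text layer,
    # so fall back to an empty string before joining.
    text = "".join(page.extract_text() or "" for page in pdf.pages)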

print(text[:500])  # preview first 500 characters




AI Agent Project Report: Voice-Based Ride
Booking System
Executive Summary
This project presents an innovative AI agent solution that addresses digital accessibility
barriers in transportation services through voice-only ride booking capabilities. The
system leverages conversational AI and function-calling architecture to serve
underserved populations including illiterate users, visually impaired individuals, and
elderly citizens.
1. Problem Statement
Core Challenge: Digital Exclusion in Transpo
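
Joining pages with an empty string fuses the last word of one page with the first word of the next (the retrieval output in Step 5 shows `usage8. Demo Implementation`, for example). One possible refinement, sketched here with a hypothetical helper `join_pages`, is to insert a separator and skip empty pages. The outputs shown in this notebook were produced without it, so chunk boundaries would shift slightly if you adopt it.

```python
def join_pages(page_texts, separator="\n"):
    """Join per-page text with a separator, skipping pages that produced no text."""
    cleaned = [t.strip() for t in page_texts if t and t.strip()]
    return separator.join(cleaned)

# Toy input standing in for the per-page results of extract_text().
pages = ["Intro ends here.", None, "   ", "2. Next section starts."]
print(join_pages(pages))
```

With `pdfplumber`, this could be called as `join_pages(page.extract_text() for page in pdf.pages)`.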


## Step 2 — Generating Unique IDs
Each chunk needs a unique key so that its text and its embedding can be matched up later.
Here we use `uuid-utils` to generate a random UUID; the chunking function in Step 3 uses these IDs as dictionary keys.


In [5]:
!pip install uuid-utils


Collecting uuid-utils
  Downloading uuid_utils-0.11.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.7 kB)
Downloading uuid_utils-0.11.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (332 kB)
Installing collected packages: uuid-utils
Successfully installed uuid-utils-0.11.0


In [7]:
import uuid_utils as uuid
chunk_id = uuid.uuid4()  # avoid the name `id`, which shadows Python's built-in
print(str(chunk_id))


3cfbf7e4-6400-4cdf-93e3-233678138da2
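
Random UUIDs change on every run. If repeatable IDs are preferred (for example, to avoid re-embedding unchanged chunks), one alternative is to derive the ID from the chunk text. This is a sketch using the standard-library `uuid` module, imported under a different alias so it does not clash with the `uuid` name used above. `content_id` and the choice of `NAMESPACE_URL` are illustrative assumptions, not part of the original pipeline.

```python
import uuid as std_uuid

def content_id(text, namespace=std_uuid.NAMESPACE_URL):
    """Derive a repeatable version-5 UUID string from the chunk text itself."""
    return str(std_uuid.uuid5(namespace, text))

# The same text always maps to the same ID; different text gives a different ID.
print(content_id("first chunk") == content_id("first chunk"))
print(content_id("first chunk") == content_id("second chunk"))
```

The trade-off is that two chunks with identical text receive the same ID, which may or may not be what you want.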


## Step 3 — Chunking with Overlap
We split the document text into smaller chunks, allowing overlaps for context preservation.

**Function:** `chunk_overlap(text, chunk_size, overlap)`
- `text` — Full text
- `chunk_size` — Length of each chunk, in characters
- `overlap` — Number of overlapping characters between chunks
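
To see how these parameters interact, it helps to look at the character offsets alone. Each chunk starts `chunk_size - overlap` characters after the previous one. The sketch below uses a hypothetical helper `chunk_spans` that mirrors the stepping logic of `chunk_overlap` but returns `(start, end)` pairs instead of text.

```python
def chunk_spans(text_length, chunk_size, overlap):
    """Return the (start, end) character offsets of each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    spans = []
    start = 0
    while start < text_length:
        # The final chunk is clipped to the end of the text.
        spans.append((start, min(start + chunk_size, text_length)))
        start += step
    return spans

print(chunk_spans(1200, chunk_size=500, overlap=50))
# [(0, 500), (450, 950), (900, 1200)]
```

Each span begins 50 characters before the previous one ends, which is the overlap.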


In [8]:
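def chunk_overlap(text, chunk_size, overlap):
    """
    Splits text into fixed-size character chunks that overlap.

    Args:
        text (str): Full document text.
        chunk_size (int): Size of each chunk in characters.
        overlap (int): Number of characters overlapping between chunks.

    Returns:
        dict: Mapping from chunk ID (UUID string) to chunk text.

    Raises:
        ValueError: If overlap is negative or not smaller than chunk_size.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise `start` would never advance and the loop would not terminate.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = {}
    start = 0
    while start < len(text):
        chunk_id = str(uuid.uuid4())
        end = start + chunk_size
        chunks[chunk_id] = text[start:end]
        start = end - overlap  # advance by chunk_size - overlap
    return chunks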


### Apply chunking
We apply our chunking function to the extracted text.


In [9]:
chunks = chunk_overlap(text, chunk_size=500, overlap=50)
len(chunks), list(chunks.items())[:3]


(15,
 [('46311a04-6989-4a29-b9e1-d19b3bd1786d',
   'AI Agent Project Report: Voice-Based Ride\nBooking System\nExecutive Summary\nThis project presents an innovative AI agent solution that addresses digital accessibility\nbarriers in transportation services through voice-only ride booking capabilities. The\nsystem leverages conversational AI and function-calling architecture to serve\nunderserved populations including illiterate users, visually impaired individuals, and\nelderly citizens.\n1. Problem Statement\nCore Challenge: Digital Exclusion in Transpo'),
  ('efe8d474-ea85-4846-bcc8-d871ae756863',
   "ement\nCore Challenge: Digital Exclusion in Transportation\n● Target Population: 72% of India's population with limited digital literacy\n● Global Impact: 2.2 billion visually impaired individuals worldwide\n● Pain Points: Complex app interfaces, visual dependency, smartphone barriers\nCurrent System Limitations\n● Heavy reliance on visual interfaces and text input\n● Multi-step naviga

## Step 4 — Load Embedding Model
We use `SentenceTransformer` to convert each chunk into a dense vector embedding.


In [10]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Generate embeddings for all chunks
We pass each chunk to the embedding model and store the resulting vectors.


In [11]:
def embedd_chunks(chunks):
    """
    Converts text chunks into embeddings using SentenceTransformer.

    Args:
        chunks (dict): Mapping from chunk ID to chunk text.

    Returns:
        dict: Mapping from chunk ID to embedding vector (NumPy array).
    """
    chunk_ids = list(chunks.keys())
    # Encode all chunks in one call so the model can batch them.
    vectors = model.encode(list(chunks.values()))
    return dict(zip(chunk_ids, vectors))

embeddings = embedd_chunks(chunks)


## Step 5 — Retrieve Relevant Chunks
Given a query, we find the top-k most relevant chunks using cosine similarity.
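
For reference, the score computed for each chunk is the cosine similarity between the query embedding $q$ and the chunk embedding $c$:

$$
\operatorname{sim}(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}
$$

The value lies between -1 and 1. A higher score means the two vectors point in a more similar direction, regardless of their lengths.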


In [12]:
import numpy as np

def retrieve_chunks(query, k):
    """
    Retrieves the top-k most relevant chunks for a query.

    Relies on the global `model`, `embeddings`, and `chunks` defined above.

    Args:
        query (str): The user query.
        k (int): Number of chunks to retrieve.

    Returns:
        list[str]: Texts of the k most similar chunks, best match first.
    """
    query_embedd = model.encode([query])[0]
    similarity = {}
    for idx, emb in embeddings.items():
        # Cosine similarity: dot product divided by the product of the norms.
        sim = np.dot(query_embedd, emb) / (np.linalg.norm(query_embedd) * np.linalg.norm(emb))
        similarity[idx] = sim
    sorted_similarity = sorted(similarity.items(), key=lambda x: x[1], reverse=True)
    top_chunks = [chunks[chunk_id] for chunk_id, _ in sorted_similarity[:k]]
    return top_chunks
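
The loop above scores one chunk at a time, which keeps the logic easy to follow. For larger collections, the same ranking can be computed in one matrix operation. The following is a sketch under stated assumptions: `top_k_cosine` is a hypothetical helper, the toy 2-D vectors stand in for real embeddings, and no vector is all zeros (which would cause a division by zero).

```python
import numpy as np

def top_k_cosine(query_vec, chunk_matrix, k):
    """Return (indices, scores) of the k rows of chunk_matrix most similar to query_vec."""
    query_vec = np.asarray(query_vec, dtype=float)
    chunk_matrix = np.asarray(chunk_matrix, dtype=float)
    # Normalise once so that a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_matrix / np.linalg.norm(chunk_matrix, axis=1, keepdims=True)
    scores = m @ q
    # Sort descending by score and keep the first k positions.
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

# Toy 2-D vectors standing in for real embeddings.
toy_chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, scores = top_k_cosine([1.0, 0.1], toy_chunks, k=2)
print(idx, scores)
```

With the notebook's data, the matrix could be built as `np.stack(list(embeddings.values()))`, with `list(embeddings.keys())` used to map row indices back to chunk IDs.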


### Test retrieval with a sample query
We query the system and print the top-k chunks.


In [14]:
query = "how to book ride?"
k = 3
retrieve_chunks(query, k)


['on: Access to 30-40% additional population\n● Cost Reduction: 80% decrease in customer service costs\n● User Experience: Zero-training required for system usage8. Demo Implementation\nExperience our AI-powered ride booking assistant through both phone and web\ninterfaces.\n● 📞 Phone Call Demo: Watch a real 3-minute call showcasing end-to-end ride\nbooking with natural language processing, error handling, and address validation.\n👉 Drive Video Link – Ride_booking_video.mp4\n● 🌐 Streamlit Web Demo: Intera',
 '● Lack of accessibility features for disabled users2. Solution Overview\nApproach\nA zero-interface transportation booking system operating entirely through natural voice\nconversations, eliminating visual and technical barriers while maintaining commercial\nviability.\nKey Features\n● Natural language processing for ride booking\n● Real-time address validation\n● Database integration for booking persistence\n● Error handling and conversation state management\n● Phone and web inte

## Step 6 — LLM Answer Generation with Groq API

Now that we can retrieve the most relevant chunks, let's use an LLM to generate an answer.  
We'll use Groq's `llama-3.3-70b-versatile` model, passing in both the **query** and the **retrieved context**.

**Process:**
1. Retrieve top-k relevant chunks from our knowledge base.
2. Combine them into a single context string.
3. Pass query + context to Groq LLM for final answer.
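
Step 2 of this process, combining chunks into a context string, can be isolated into a small function so the prompt is easy to inspect before it is sent. This is a minimal sketch: `build_prompt` and the numbered `[1]`, `[2]` labels are illustrative choices, not part of the Groq API.

```python
def build_prompt(query, retrieved_chunks):
    """Assemble a prompt with numbered context passages followed by the question."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n"
        "If the context does not contain the answer, say so politely.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

print(build_prompt("how to book ride?", ["Call the booking line.", "Use the web demo."]))
```

Numbering the passages makes it easier to trace which chunk an answer drew on when you read the model's response.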


In [15]:
!pip install groq

Collecting groq
  Downloading groq-0.31.0-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.31.0-py3-none-any.whl (131 kB)
Installing collected packages: groq
Successfully installed groq-0.31.0


In [17]:
from groq import Groq

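import os

# Initialize the Groq client. Prefer reading the key from the GROQ_API_KEY
# environment variable over hard-coding it in the notebook.
api_key = os.environ.get("GROQ_API_KEY", "Your Groq Api Key")  # fallback is a placeholder
client = Groq(api_key=api_key)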

def generate_answer(query, k=3):
    """
    Generates an answer using Groq LLM based on retrieved chunks.

    Args:
        query (str): The user's question.
        k (int): Number of top chunks to use as context.

    Returns:
        str: The LLM-generated answer.
    """

    # Step 1: Retrieve top-k chunks
    retrieve_chunk = retrieve_chunks(query, k=k)  # use the caller's k, not a fixed 3

    # Step 2: Build context from retrieved chunks
    retrieve_context = "\n".join(retrieve_chunk)

    # Step 3: Create prompt for the LLM
    prompt = f"""
    You are a helpful assistant who answers the query using only the given context.
    query: {query}
    context: {retrieve_context}

    If you don't find the answer in the context, politely say so.
    """

    # Step 4: Get completion from Groq LLM
    chat_completion = client.chat.completions.create(
        messages=[
            # Set an optional system message. This sets the behavior of the
            # assistant and can be used to provide specific instructions for
            # how it should behave throughout the conversation.
            {
                "role": "system",
                "content": "You are a helpful assistant."
            },
            # Set a user message for the assistant to respond to.
            {
                "role": "user",
                "content": prompt,
            }
        ],

        # The language model which will generate the completion.
        model="llama-3.3-70b-versatile"
    )
    return chat_completion.choices[0].message.content

In [20]:
# Example test
query = "how to book ride?"
print(generate_answer(query, k=3))

To book a ride, you can use the AI-powered ride booking assistant through either a phone call or a web interface. The system uses natural language processing, allowing you to book a ride by having a conversation. 

Here's a step-by-step guide based on the context provided:

1. **Phone Interface**: You can book a ride by making a phone call. The system will guide you through the process using natural language processing. You can watch a demo of this process through the provided video link, "Ride_booking_video.mp4".

2. **Web Interface**: The system also supports a web interface, although the details on how to use it for booking a ride are not fully specified in the provided context. It mentions a "Streamlit Web Demo" but does not outline the steps for using it to book a ride.

For more detailed, step-by-step instructions on how to book a ride using either the phone or web interface, I would recommend referring to the specific demo implementations or user guides provided by the system, a

## 📌 Summary & Next Steps

In this notebook, we built a **Retrieval-Augmented Generation (RAG)** system from scratch:

1. **Document Loading** — Extracted text from a PDF file with `pdfplumber` to create the knowledge base.  
2. **Chunking** — Split the text into overlapping chunks for better retrieval.  
3. **Embedding** — Converted each chunk into a vector representation using a sentence transformer model.  
4. **Similarity Search** — Found the most relevant chunks for a query using cosine similarity.  
5. **Retrieval Pipeline** — Built a function to fetch top-k chunks as context.  
6. **LLM Integration** — Used Groq's `llama-3.3-70b-versatile` model to generate final answers from retrieved context.  

---

### ✅ What We Achieved
- Created a **minimal but complete** RAG pipeline without heavy frameworks.
- Kept the process transparent so each step is easy to understand and modify.
- Prompted the LLM to **say so politely** when the answer is not found in the retrieved context.

---

### 🚀 Next Steps
- **Switch to a Vector Database** (FAISS, ChromaDB, Weaviate) for faster, more scalable retrieval.
- Experiment with **different embedding models** for better semantic matching.
- Add **multi-document support** to handle a larger knowledge base.
- Fine-tune or prompt-engineer the LLM for domain-specific tasks.
- Deploy as an **API or web app** for interactive querying.

---

💡 *By understanding each step, you now have the foundation to build more advanced RAG systems for real-world applications.*
