# Notebook 3: Foundational RAG Pipeline

**Retrieval-Augmented Generation**

## Learning Objectives
- Understand the RAG pipeline and why it matters
- Implement document chunking with different strategies
- Create embeddings and store them in a vector database
- Build a simple retriever to find relevant context


## 1. Setup

In [1]:
# Install required packages
!pip install langchain==1.2.7 langchain-community langchain-groq langchain-huggingface langchain-text-splitters faiss-cpu sentence-transformers python-dotenv



In [2]:
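import os
from getpass import getpass

from dotenv import load_dotenv

# Load environment variables from a local .env file, if one exists
load_dotenv()

# Prompt for the Groq API key only if it is not already set.
# getpass hides the input so the key is not echoed into the notebook output.
if not os.getenv('GROQ_API_KEY'):
    os.environ['GROQ_API_KEY'] = getpass('Enter your Groq API key: ')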

## 2. What is RAG?

**Retrieval-Augmented Generation (RAG)** addresses two key limitations of LLMs:

1. **Knowledge**: LLMs only know what was in their training data, so they miss private or recent information
2. **Hallucination**: LLMs can make up facts when they lack the relevant information

**Solution**: Before generating, retrieve relevant information from a knowledge base and include it in the prompt.

### The RAG Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                     INDEXING (one-time)                         │
│        Document → Chunk → Embed → Store in Vector DB            │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                     RETRIEVAL (per query)                       │
│     Query → Embed → Search Vector DB → Get Relevant Chunks      │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                        GENERATION                               │
│       Query + Retrieved Context → LLM → Answer                  │
└─────────────────────────────────────────────────────────────────┘
```
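To make the three stages concrete before introducing real libraries, here is a minimal, dependency-free sketch. It is an illustration only: `embed` is a toy bag-of-words stand-in for a real embedding model, the index is a plain list rather than a vector database, and all function names and sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(tokenize(text))


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """INDEXING: embed every chunk once and keep (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """RETRIEVAL: score each chunk by word overlap with the query."""
    q = embed(query)
    scored = [(sum((q & vec).values()), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """GENERATION (input side): combine retrieved context with the query."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"


# Hypothetical sample chunks, used only for this illustration
toy_chunks = [
    "The honors program requires a capstone research project.",
    "Transfer students need a GPA of at least 2.5.",
    "The college was formed in 2000.",
]
toy_index = build_index(toy_chunks)
toy_query = "What GPA do transfer students need?"
top_chunks = retrieve(toy_index, toy_query, k=1)
toy_prompt = build_prompt(toy_query, top_chunks)
print(toy_prompt)
```

The final string is what would be sent to an LLM. The rest of this notebook replaces each toy piece with a real component.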

## 3. Document Loading

First, let's load our sample document.

In [3]:
from langchain_community.document_loaders import TextLoader

# Load the CCI undergraduate catalog document
loader = TextLoader("data/CCI_2022-2023-Undergraduate-Catalog.txt")
documents = loader.load()

# Check what we loaded
print(f"Loaded {len(documents)} document(s)")
print(f"Document length: {len(documents[0].page_content)} characters")
print(f"\nFirst 500 characters:")
print(documents[0].page_content[:500])



Loaded 1 document(s)
Document length: 91946 characters

First 500 characters:
College of
Computing and Informatics
2022-2023 UNC CHARLOTTE UNDERGRADUATE CATALOG College of Computing and Informatics | 165
College of
Computing and Informatics
cci.charlotte.edu
The University of North Carolina at Charlotte's College of Computing and Informatics (CCI) is part of a dynamic and exciting educational and research
institution that combines the knowledge and expertise of multidisciplinary faculty, industry professionals, and students. The CCI was formed in 2000 as the
College of In


## 4. Chunking

Documents are often too long to fit in an LLM's context window, and we only need relevant parts anyway. **Chunking** splits documents into smaller pieces.

### Key Parameters
- **chunk_size**: Maximum characters per chunk
- **chunk_overlap**: Characters shared between consecutive chunks (prevents cutting off context)
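The interaction between these two parameters is easiest to see with a simplified fixed-window splitter. This is a sketch under a loud assumption: it cuts at exact character offsets, whereas `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks. The function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
sample_pieces = chunk_fixed(sample, chunk_size=10, chunk_overlap=3)
for piece in sample_pieces:
    print(piece)
```

Each printed piece starts with the last three characters of the previous one, which is the shared context that `chunk_overlap` preserves across a cut.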

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # Maximum characters per chunk
    chunk_overlap=50,      # Overlap between chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]  # Try to split at these boundaries first
)

# Split the documents
chunks = text_splitter.split_documents(documents)

print(f"Created {len(chunks)} chunks from the document")
print(f"\n--- Chunk 1 ---")
print(chunks[0].page_content)
print(f"\n--- Chunk 10 ---")
print(chunks[9].page_content)

Created 205 chunks from the document

--- Chunk 1 ---
College of
Computing and Informatics
2022-2023 UNC CHARLOTTE UNDERGRADUATE CATALOG College of Computing and Informatics | 165
College of
Computing and Informatics
cci.charlotte.edu
The University of North Carolina at Charlotte's College of Computing and Informatics (CCI) is part of a dynamic and exciting educational and research
institution that combines the knowledge and expertise of multidisciplinary faculty, industry professionals, and students. The CCI was formed in 2000 as the

--- Chunk 10 ---
• Software Systems
Undergraduate Certificates
• Game Design and Development
Honors Program
The Computing and Informatics Honors Program (CCI Honors) is a research-based experience designed to provide mentoring to high-achieving students to
better prepare them for post-graduate success. CCI Honors students must complete a capstone research project under the supervision of a faculty


### Experiment: Different Chunk Sizes

Let's see how chunk size affects the number and content of chunks.

In [5]:
# Try different chunk sizes
for size in [200, 500, 1000]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=size, 
        chunk_overlap=50,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]  # Try to split at these boundaries first
    )
    
    test_chunks = splitter.split_documents(documents)
    avg_len = sum(len(c.page_content) for c in test_chunks) / len(test_chunks)

    print(f"Chunk size {size}: {len(test_chunks)} chunks, avg length: {avg_len:.0f} chars")

Chunk size 200: 616 chunks, avg length: 158 chars
Chunk size 500: 205 chunks, avg length: 458 chars
Chunk size 1000: 98 chunks, avg length: 947 chars


**Trade-offs**:
- **Smaller chunks**: More precise retrieval, but may lose context
- **Larger chunks**: More context, but may include irrelevant information

A common starting point is **500-1000 characters** with **10-20% overlap**.
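A rough way to anticipate how many chunks a setting will produce is to count fixed-size windows. The helper below is a back-of-the-envelope sketch, assuming every chunk is filled to `chunk_size` and shares exactly `chunk_overlap` characters with its neighbor. A separator-aware splitter usually produces slightly more chunks than this. The function name is hypothetical.

```python
import math


def estimate_chunk_count(doc_length: int, chunk_size: int, chunk_overlap: int) -> int:
    """Estimate the number of fixed-size windows needed to cover a document."""
    if doc_length <= chunk_size:
        return 1 if doc_length > 0 else 0
    step = chunk_size - chunk_overlap  # new characters contributed per chunk
    return math.ceil((doc_length - chunk_overlap) / step)


# 91946 is the character count of the catalog loaded earlier
for size in (200, 500, 1000):
    print(size, estimate_chunk_count(91946, size, 50))
```

For the 91,946-character catalog this gives 613, 205, and 97 chunks, close to the 616, 205, and 98 chunks measured in the experiment above.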

## 5. Embeddings

**Embeddings** convert text into numerical vectors that capture meaning. Similar texts have similar vectors.

- a) "Machine learning is AI"  →  [0.2, -0.5, 0.8, ...]
- b) "AI and ML are related"   →  [0.3, -0.4, 0.7, ...]
- c) "I like pizza"            →  [-0.8, 0.1, 0.2, ...]

### Libraries:
**sentence-transformers**
- Originally developed at UKP Lab (TU Darmstadt) and now maintained by Hugging Face, for semantic text embeddings
- Provides pre-trained models that can convert text into dense vector representations (embeddings)

**langchain-huggingface**
- LangChain integration package that wraps sentence-transformers
- Provides LangChain-compatible interfaces to use HuggingFace models in LangChain workflows

**all-MiniLM-L6-v2 embedding model**
- https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2

In [6]:
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize embedding model (downloads on first run, ~90MB)
print("Loading embedding model...")
embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",  # Fast and good quality
    model_kwargs={'device': 'cpu'}   # Use 'cuda' if you have a GPU
)
print("Embedding model loaded!")

Loading embedding model...
Embedding model loaded!


In [7]:
# Let's see what embeddings look like
test_text = "Machine learning is a type of artificial intelligence."
test_embedding = embeddings.embed_query(test_text)

# We will only print the first 10 entries out of 384.
print(f"Text: '{test_text}'")
print(f"Embedding dimensions: {len(test_embedding)}")
print(f"First 10 values: {test_embedding[:10]}")

Text: 'Machine learning is a type of artificial intelligence.'
Embedding dimensions: 384
First 10 values: [0.003782775951549411, -0.026872709393501282, 0.051296573132276535, 0.027737408876419067, -0.010244319215416908, -0.028220683336257935, -0.015101945959031582, -0.016157962381839752, -0.04108556732535362, 0.015193924307823181]


### How Similarity is Measured: Cosine Similarity

**Cosine similarity** measures the angle between two vectors, ranging from -1 to 1:
- **1.0**: Identical meaning (0° angle)
- **0.0**: No relationship (90° angle)
- **-1.0**: Opposite meaning (180° angle)
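As a worked example, the formula can be applied by hand to the illustrative vectors (a), (b), and (c) from the start of this section. Truncating them to their first three components is an assumption made purely for illustration, since real embeddings have hundreds of dimensions.

```latex
\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}

% (a) vs (b): a = (0.2, -0.5, 0.8), b = (0.3, -0.4, 0.7)
\mathbf{a}\cdot\mathbf{b} = 0.06 + 0.20 + 0.56 = 0.82, \qquad
\lVert\mathbf{a}\rVert = \sqrt{0.93} \approx 0.964, \qquad
\lVert\mathbf{b}\rVert = \sqrt{0.74} \approx 0.860

\cos\theta_{ab} \approx \frac{0.82}{0.964 \times 0.860} \approx 0.988

% (a) vs (c): c = (-0.8, 0.1, 0.2)
\mathbf{a}\cdot\mathbf{c} = -0.16 - 0.05 + 0.16 = -0.05, \qquad
\lVert\mathbf{c}\rVert = \sqrt{0.69} \approx 0.831

\cos\theta_{ac} \approx \frac{-0.05}{0.964 \times 0.831} \approx -0.062
```

The two related sentences score close to 1, while the unrelated one scores close to 0.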

In [8]:
# Demonstrate similarity - similar texts have similar embeddings
import numpy as np

texts = [
    "Machine learning is a type of AI",
    "AI and machine learning are closely related",
    "I like pizza"
]

embs = [embeddings.embed_query(t) for t in texts]

# Calculate cosine similarity between first text and others
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("Similarity to 'Machine learning is a type of AI':")
for i, text in enumerate(texts):
    sim = cosine_similarity(embs[0], embs[i])
    print(f"  {sim:.3f} - '{text}'")

Similarity to 'Machine learning is a type of AI':
  1.000 - 'Machine learning is a type of AI'
  0.729 - 'AI and machine learning are closely related'
  0.081 - 'I like pizza'


## 6. Vector Store (FAISS)

### Vector Store
A specialized database optimized for:
- **Storing** high-dimensional vectors (embeddings)
- **Indexing** vectors for fast retrieval
- **Searching** for similar vectors using a similarity or distance metric (e.g., cosine similarity or Euclidean distance)
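Conceptually, the simplest vector store compares the query against every stored vector and keeps the closest matches. The sketch below illustrates that idea with Euclidean distance. It is not how FAISS is implemented: the class name `FlatVectorStore`, its methods, and the three-dimensional toy vectors are all hypothetical.

```python
import heapq
import math


class FlatVectorStore:
    """Toy in-memory store that compares a query against every stored vector."""

    def __init__(self) -> None:
        self._vectors: list[list[float]] = []
        self._payloads: list[str] = []

    def add(self, vector: list[float], payload: str) -> None:
        """Store a vector together with the text it represents."""
        self._vectors.append(vector)
        self._payloads.append(payload)

    def search(self, query: list[float], k: int = 3) -> list[tuple[float, str]]:
        """Return the k nearest payloads as (distance, payload), closest first."""
        scored = (
            (math.dist(query, vec), payload)
            for vec, payload in zip(self._vectors, self._payloads)
        )
        return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


store = FlatVectorStore()
store.add([0.2, -0.5, 0.8], "Machine learning is AI")
store.add([0.3, -0.4, 0.7], "AI and ML are related")
store.add([-0.8, 0.1, 0.2], "I like pizza")

nearest = store.search([0.2, -0.5, 0.7], k=2)
for distance, text in nearest:
    print(f"{distance:.3f}  {text}")
```

An exhaustive scan like this is exact but takes time proportional to the number of stored vectors. Dedicated libraries add indexing structures so that search stays fast as the collection grows.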

### FAISS
- **Free & Open Source**: Developed by Meta AI Research
- **Runs Locally**: No API calls, no cloud costs
- **Fast**: Optimized for billion-scale similarity searches

**Alternative Vector Stores:**
- **Pinecone**, **Weaviate**, **Qdrant**: Available as managed cloud services (API keys required); Weaviate and Qdrant are open source and can also be self-hosted
- **Chroma**, **LanceDB**: Other local options similar to FAISS

**GitHub**: https://github.com/facebookresearch/faiss

In [9]:
from langchain_community.vectorstores import FAISS

# Create vector store from our chunks
print(f"Creating vector store from {len(chunks)} chunks...")
vectorstore = FAISS.from_documents(chunks, embeddings)
print("Vector store created!")

Creating vector store from 205 chunks...
Vector store created!


## 7. Building a Retriever

A **retriever** wraps the vector store and provides a clean interface for getting relevant documents.

In [10]:
# Create a retriever from the vector store
retriever = vectorstore.as_retriever(
    search_type="similarity",  
    search_kwargs={"k": 3}     # Number of results to return
)

# Use the retriever
query = "What are the graduation requirements for CCI students?"
relevant_docs = retriever.invoke(query)

print(f"Query: '{query}'")
print(f"\nRetrieved {len(relevant_docs)} relevant documents")

for i, doc in enumerate(relevant_docs, 1):
    print(f"--- Result {i} ---")
    print(doc.page_content[:300] + "..." if len(doc.page_content) > 300 else doc.page_content)
    print()

Query: 'What are the graduation requirements for CCI students?'

Retrieved 3 relevant documents
--- Result 1 ---
• A GPA of 3.4 in CCI courses
Students should apply in the semester prior to the semester they plan to graduate. The CCI Honors Committee will formally approve admission.
Course Requirements
ITSC 4750 - Honors Thesis (3)
Certification Requirements
To graduate with Honors in Computing and Informatics...

--- Result 2 ---
member. Upon the successful completion of the honors program in CCI, students receive Honors commendations on their transcript and in the
commencement program.
Admission Requirements
Consideration for admission to the honors program may be initiated by the student or by any faculty member in the Col...

--- Result 3 ---
College Algebra.
• Other Requirements: Transfer students must present an overall • Minor
GPA of at least 2.5 with no grade less than C in Computer Science • Second major
courses. For internal transfer students, participation in a Chang

## 8. Complete RAG Pipeline

Now let's put it all together: retrieve context and generate an answer!

In [11]:
from langchain_groq import ChatGroq
from langchain_core.messages import HumanMessage

# Initialize LLM
llm = ChatGroq(model="openai/gpt-oss-20b", temperature=0.3)

def simple_rag(question: str) -> str:
    """A simple RAG pipeline: retrieve context, then generate answer."""
    
    # Step 1: Retrieve relevant chunks
    relevant_docs = retriever.invoke(question)
    context = "\n\n".join([doc.page_content for doc in relevant_docs])
    
    # Step 2: Create prompt with context
    prompt = f"""Answer the question based ONLY on the following context. 

Context:
{context}

Question: {question}

Answer:"""
    
    # Step 3: Generate answer
    response = llm.invoke([HumanMessage(content=prompt)])
    return response.content


# Test the RAG pipeline
question = "What are the graduation requirements for CCI students?"
answer = simple_rag(question)

print(f"❓ Question: {question}")
print(f"\n💬 Answer: {answer}")

❓ Question: What are the graduation requirements for CCI students?

💬 Answer: **Graduation (Honors) Requirements for CCI Students – as stated in the provided context**

1. **GPA Requirements**
   - Overall cumulative GPA ≥ 3.2.
   - GPA in CCI-specific courses ≥ 3.4.

2. **Course Requirement**
   - Completion of **ITSC 4750 – Honors Thesis (3 credit hours)**.

3. **Honors-Program Certification**
   - Prepare and submit a description of the proposed honors research to the CCI Honors Committee.
   - Obtain formal approval (or recommendation) from the committee.
   - Upon successful completion, receive honors commendations on the transcript and in the commencement program.

4. **Additional Requirements for Transfer Students**
   - Overall GPA ≥ 2.5 with **no grade lower than a C** in any Computer Science course.
   - Internal transfer students must complete the **Change-of-Major Workshop** offered by the CCI Advising Center before becom

In [12]:
# Try more questions!
questions = [
    "What courses are required for computer science majors?",
    "How many credit hours are needed to graduate?",
    "What degree programs are within the College of Computing and Informatics?",
    "What is a recipe for chocolate cake?"  # Not in our document!
]

for q in questions:
    print(f"❓ {q}")
    print(f"💬 {simple_rag(q)}")
    print("-" * 50)

❓ What courses are required for computer science majors?
💬 Based on the passage, a Computer Science major must complete three groups of coursework:

1. **General-Education requirements** – the specific courses are listed in the university’s General Education program (the passage does not name them).

2. **Mathematical & Logical Reasoning** –
   * **MATH 1120 – Calculus (3 credits)** satisfies this requirement.

3. **Concentration Technical Elective Courses** – two credit-hour blocks drawn from upper-level (3000- or 4000-level) courses offered by the College of Computing and Informatics, **excluding any courses already listed** (such as MATH 1120):
   * **12 credit-hour block:** select **four** upper-level electives.
   * **18 credit-hour block:** select **six** upper-level electives.

In total, the major requires the general-education courses, MATH 1120, and a selection of **ten** upper-level (3000-/4000-level) electives from the C
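The last question in the list is deliberately outside the document, and the prompt in `simple_rag` does not tell the model what to do in that case. One possible variation, shown here as a sketch rather than the notebook's method, is to number the retrieved chunks and give the model an explicit fallback reply. The function name `build_grounded_prompt` and the fallback wording are assumptions.

```python
def build_grounded_prompt(question: str, context_chunks: list[str]) -> str:
    """Build a prompt that numbers each chunk and allows an explicit refusal."""
    numbered = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question based ONLY on the following context. "
        "If the context does not contain the answer, reply exactly: "
        "\"I don't know based on the provided documents.\"\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


# Two short snippets taken from the retrieval results shown earlier
example_prompt = build_grounded_prompt(
    "What is a recipe for chocolate cake?",
    ["ITSC 4750 - Honors Thesis (3)", "A GPA of 3.4 in CCI courses"],
)
print(example_prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supports each statement, which helps when checking answers against the source.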

## Summary

In this notebook, you learned the foundational RAG pipeline:

1. **Document Loading**: Load documents from files
2. **Chunking**: Split documents into smaller pieces with `RecursiveCharacterTextSplitter`
3. **Embeddings**: Convert text to vectors with `HuggingFaceEmbeddings`
4. **Vector Store**: Index and search with `FAISS`
5. **Retriever**: Clean interface for getting relevant documents
6. **Generation**: Combine context with query and send to LLM

**Key Parameters to Tune**:
- `chunk_size`: 500-1000 characters is a good starting point
- `chunk_overlap`: 10-20% of chunk size
- `k`: Number of documents to retrieve (3-5 is common)