---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Topic**: You Can Just Build Things

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

## Welcome!

In our first four lectures, we've covered how:
1. We can call LLMs via APIs and get structured responses
2. We can build lexical search with BM25
3. We can build semantic search with embeddings
4. We can combine lexical and semantic search into hybrid search

Today you will put it all together by building a Retrieval Augmented Generation (RAG) system.
- This is a bot that answers questions about Fordham University.
- You will use real data scraped from the Fordham website.


Your RAG pipeline will look like this:

```
User Question
     ↓
1. RETRIEVE: Find relevant documents (search!)
     ↓
2. AUGMENT: Stuff those documents into a prompt
     ↓
3. GENERATE: Ask an LLM to answer using the context
     ↓
Answer
```


---

# 1. Look at your data

In `data/fordham-website.zip` you'll find **~9,500 Markdown files** scraped from Fordham's website. Each file is one page — admissions info, program descriptions, faculty pages, financial aid, campus life, and more.

Your task: **look at the data**
- The first step in any AI engineering or data science project should always be to familiarize yourself with the data.
- I cannot stress this enough: without this step, it's hard to build anything useful.

Tips:
- Unzip the archive and look at some of the files. 
- Open a few in a text editor. 
- Get a feel for what you're working with.
- The first line of every file is always the **URL** of the page it was scraped from. The rest is the page content converted to Markdown. Here's an example — `gabelli-school-of-business_veterans.md`:

```markdown
https://www.fordham.edu/gabelli-school-of-business/veterans

# Military Veterans & Active Duty Members of the Military

## Transform Your Knowledge & Skills Into a Business Career for the Future

As a veteran or an active duty member of the United States Armed Services,
you have gained or are currently acquiring the invaluable organizational,
leadership, analytics, and technical knowledge and skills that hiring
managers seek. These transferrable skills provide a major advantage in
emerging, business-related industries where innovation, a global mind-set,
and the ability to lead individuals and teams in the continuously evolving
work environment, are critical for success.

By completing a graduate or undergraduate business degree at the Gabelli
School of Business, you can prepare for a lifelong career in some of
today's fastest-growing fields. ...

### Study at a Top-Ranked, Military-Friendly University

The Gabelli School of Business is part of Fordham University, the only
New York City university to be among those ranked "Best for Vets" by
Military Times. ...

### Learn How the Yellow Ribbon Program Works

The Yellow Ribbon GI Education Enhancement Program, or the Yellow Ribbon
Program, is a part of the Post-9/11 Veterans Educational Assistance Act
of 2008. ...
```

The filenames mirror the URL structure — underscores replace path separators (e.g. `gabelli-school-of-business_veterans.md` came from `/gabelli-school-of-business/veterans`). Some files are short (a few lines), others are quite long.
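
The naming rule can be sketched as a small helper, which is handy later when you want to map a retrieved chunk back to its page. This is a minimal sketch inferred from the single example above: `url_to_filename` is a hypothetical name, and how the scraper named edge cases (the home page, trailing slashes, query strings) is not stated in this notebook, so check it against the real files.

```python
from urllib.parse import urlparse


def url_to_filename(url):
    """Guess the scraped filename for a page URL (assumed rule: '/' -> '_')."""
    path = urlparse(url).path.strip("/")
    return path.replace("/", "_") + ".md"


print(url_to_filename("https://www.fordham.edu/gabelli-school-of-business/veterans"))
# gabelli-school-of-business_veterans.md
```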

- Once you've looked around, load the files into Python. Python's built-in `zipfile` module can read zip archives without extracting them to disk. Load the files into a list of dictionaries or a DataFrame with at least two fields: the filename (or a clean page name) and the content.

In [None]:
import zipfile
import pandas as pd
import os
import pathlib

def load_fordham_data(source_path):
    data = []
    
    # Check if it's a zip file
    if source_path.endswith('.zip') and os.path.exists(source_path):
        print(f"Loading from zip: {source_path}")
        with zipfile.ZipFile(source_path, 'r') as z:
            file_list = [f for f in z.namelist() if f.endswith('.md')]
            for file_name in file_list:
                with z.open(file_name) as f:
                    try:
                        content = f.read().decode('utf-8')
                    except UnicodeDecodeError:
                        continue
                    
                    lines = content.split('\n', 1)
                    url = lines[0].strip() if lines else ""
                    body = lines[1].strip() if len(lines) > 1 else ""
                    
                    data.append({
                        "filename": file_name,
                        "url": url,
                        "content": body
                    })
                    
    # Check if it's a directory
    elif os.path.exists(source_path) and os.path.isdir(source_path):
        print(f"Loading from directory: {source_path}")
        path = pathlib.Path(source_path)
        files = list(path.glob('*.md'))
        
        for file_path in files:
            try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    lines = f.readlines()
                if not lines:
                    continue  # skip empty files

                # First line is the source URL; the rest is the page body
                url = lines[0].strip()
                content = "".join(lines[1:]).strip()

                data.append({
                    "filename": file_path.name,
                    "url": url,
                    "content": content
                })
            except Exception as e:
                print(f"Error reading {file_path}: {e}")
                
    else:
        print(f"Source not found or invalid: {source_path}")
        # If the given path is missing (e.g. the zip was extracted and removed),
        # fall back to the extracted directory when it exists.
        fallback_dir = 'data/fordham-website'
        if source_path != fallback_dir and os.path.isdir(fallback_dir):
            print(f"Fallback: Loading from {fallback_dir}")
            return load_fordham_data(fallback_dir)

        return pd.DataFrame()

    return pd.DataFrame(data)

# Usage: point at the zip archive. If it is missing, load_fordham_data
# falls back to the extracted directory data/fordham-website.
source_path = 'data/fordham-website.zip'

df = load_fordham_data(source_path)
print(f"Loaded {len(df)} documents.")
print(df.head())
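
Since some pages are a few lines and others are quite long, it may help to look at the length distribution before deciding how to chunk. Below is a minimal sketch; `summarize_lengths` is a hypothetical helper, and the toy strings stand in for `df['content']`.

```python
import statistics


def summarize_lengths(texts):
    """Return basic character-length statistics for a collection of documents."""
    lengths = sorted(len(t) for t in texts)
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": statistics.median(lengths),
        "max": lengths[-1],
    }


# Toy documents; in the notebook you would pass df['content'] instead
toy_docs = ["a" * 10, "b" * 200, "c" * 5000]
summary = summarize_lengths(toy_docs)
print(summary)
```

A large gap between the median and the maximum is a hint that a handful of very long pages will dominate your chunk count.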


---

# 2. Chunk the Documents

Some pages may be too long to embed as a single unit, and later they may be too long to fit into the LLM's prompt during the generation step. For this reason, most RAG systems break large documents into smaller **chunks**.

> 📚 **TERM: Chunking**  
> Splitting documents into smaller, self-contained pieces for embedding and retrieval. The goal is chunks that are small enough to be specific, but large enough to be meaningful.

Your task: **write a function that splits each document into chunks.**

Things to think about:
- What's a reasonable chunk size? (Think about what fits in a prompt vs. what's too vague)
- Should you split on sentences? Paragraphs? A fixed character/word count?
- Should chunks overlap? What happens if an answer spans two chunks?
- How do you keep track of which document each chunk came from? You may need that information down the line.
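
As one possible answer to the "sentences or paragraphs?" question, here is a minimal sketch that packs whole paragraphs into chunks instead of cutting at a fixed character offset. The name `chunk_by_paragraphs` and the `max_chars` default are assumptions for illustration, not a recommended setting.

```python
import re


def chunk_by_paragraphs(text, max_chars=1000):
    """Pack consecutive paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars becomes its own oversized chunk.
    """
    # Paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


toy_text = "# Title\n\nFirst paragraph.\n\nSecond paragraph."
toy_chunks = chunk_by_paragraphs(toy_text, max_chars=30)
print(toy_chunks)  # ['# Title\n\nFirst paragraph.', 'Second paragraph.']
```

The tradeoff: chunks respect paragraph boundaries, but their sizes vary and there is no overlap, so an answer that spans two paragraphs can still be split.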

In [None]:
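def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        # Otherwise the window would never advance and the loop would not terminate
        raise ValueError("overlap must be smaller than chunk_size")

    chunks = []
    if not text:
        return chunks

    start = 0
    text_len = len(text)
    step = chunk_size - overlap

    while start < text_len:
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= text_len:
            # This chunk already reaches the end of the text; stop so we don't
            # emit a trailing chunk that only repeats the overlap
            break
        start += step

    return chunks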

all_chunks = []

# Iterate over the loaded dataframe
for index, row in df.iterrows():
    filename = row['filename']
    url = row['url']
    content = row['content']

    chunks = chunk_text(content)

    for chunk in chunks:
        all_chunks.append({
            'filename': filename,
            'url': url,
            'content': chunk
        })

df_chunks = pd.DataFrame(all_chunks)

print(f"Original documents: {len(df)}")
print(f"Total chunks: {len(df_chunks)}")
print(df_chunks.head())


---

# 3. Embed the Chunks

Now we need to turn each chunk into a vector so we can search over them. You've done this before in Lecture 4.

Your task: **embed all chunks using an embedding model.**

Tips:
- You could use a local model or an API model. What are the tradeoffs?
- This will take a while if you do it serially. You might want to use async/batch.
- Once you've created your embeddings, you may want to save them to disk so you don't have to redo this step every time
- You'll need to embed queries with the **same model** at search time
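
For the "save them to disk" tip, one option is a small compute-once cache. This is a minimal sketch using the standard library's `pickle`; `load_or_compute` and the file name are hypothetical, and the toy function stands in for the real embedding call.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute(cache_path, compute_fn):
    """Return the cached result at cache_path, computing and saving it if missing."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


# Toy demonstration: the expensive function runs only on the first call
calls = []

def expensive_embedding_step():
    calls.append(1)
    return [0.1, 0.2]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.pkl"
    first = load_or_compute(path, expensive_embedding_step)
    second = load_or_compute(path, expensive_embedding_step)

print(len(calls))  # 1
```

In the notebook you could wrap the embedding call the same way, e.g. `load_or_compute("data/embeddings.pkl", lambda: model.encode(chunk_texts))`. Remember to delete the cache file whenever you change the chunking or the embedding model, and only unpickle files you created yourself.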

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Get the list of texts to embed
chunk_texts = df_chunks['content'].tolist()

print(f"Embedding {len(chunk_texts)} chunks...")

# Embed the chunks (the progress bar is handy in a notebook)
embeddings = model.encode(chunk_texts, show_progress_bar=True)

# Add embeddings to the dataframe 
# We can store them as a list/array in a new column
df_chunks['embedding'] = list(embeddings)

print("Embedding complete.")
print(df_chunks.head())


---

# 4. Retrieve

Now build the **R** in RAG. Given a user's question, find the most relevant chunks.

Your task: **write a retrieval function that takes a question and returns the most relevant chunks.**

Tips:
- You can use lexical or semantic search or both!
- How many chunks should you retrieve? Too few and you might miss the answer; too many and you'll overwhelm the LLM (and pay more tokens)
- Try a few test questions and eyeball whether the retrieved chunks are relevant. For example:
  - "What programs does the Gabelli School of Business offer?"
  - "How do I apply for financial aid?"
  - "Where is Fordham's campus?"
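
If you go the "both" route, you need a way to merge the lexical and semantic result lists. One common option is reciprocal rank fusion (RRF), sketched below. It is not necessarily the method used in lecture; the function name is hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each id earns 1 / (k + rank) in every list it appears in (rank starts at 1),
    and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Toy example: chunk ids as returned by a lexical ranker and a semantic ranker
lexical_ranking = ["a", "b", "c"]
semantic_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print(fused)  # ['b', 'a', 'c']
```

RRF only uses ranks, so you don't have to put BM25 scores and cosine similarities on the same scale.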

In [None]:
import numpy as np

def retrieve(query, df, top_k=5):
    # Embed the query
    query_embedding = model.encode([query])[0]

    # Calculate cosine similarity
    # Stack the embeddings from the dataframe
    embeddings = np.stack(df['embedding'].values)

    # Normalize query and embeddings for cosine similarity
    # (SentenceTransformer embeddings are usually normalized, but good practice)
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1)[:, np.newaxis]

    # Dot product of unit vectors = cosine similarity
    scores = np.dot(embeddings_norm, query_norm)

    # Get the top-k indices: argpartition selects the k best without a full
    # sort, then we sort just those k by descending score
    top_k = min(top_k, len(scores))  # argpartition fails if top_k > number of chunks
    top_k_indices = np.argpartition(scores, -top_k)[-top_k:]
    top_k_indices = top_k_indices[np.argsort(scores[top_k_indices])][::-1]

    # Return the top k rows
    return df.iloc[top_k_indices]

# Test it
results = retrieve("financial aid", df_chunks)
for i, row in results.iterrows():
    print(f"URL: {row['url']}")
    print(f"Content: {row['content'][:150]}...")
    print("-" * 20)


---

# 5. Generate

Now build the **G** in RAG. Take the retrieved chunks and pass them to an LLM along with the user's question.

Your task: **write a function that takes a question and the retrieved chunks, builds a prompt, and calls an LLM to generate an answer.**

Tips:
- How should you structure the prompt? The LLM needs to know: (1) what is the context of the application, (2) what is the question, (3) what it should include in its answer
- What should the LLM do if the context doesn't contain the answer?
- Start with a cheap model; try a better one when you've figured out the pipeline
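
The three ingredients above (application context, question, answer instructions) can be sketched as a prompt builder. This is a minimal sketch, not the one true prompt: `build_prompt` and `generate` are hypothetical names, chunks are assumed to be dicts with `url` and `content` keys (the columns of `df_chunks`), and `llm` is any function you supply that maps a prompt string to a reply string.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt from a question and a list of chunk dicts."""
    context = "\n\n".join(
        f"[Source {i}] {chunk['url']}\n{chunk['content']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant answering questions about Fordham University.\n"
        "Answer using ONLY the context below. If the context does not contain "
        "the answer, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def generate(question, chunks, llm):
    """Call `llm` (any function mapping a prompt string to a reply string)."""
    return llm(build_prompt(question, chunks))


# Toy run with a stand-in for the LLM, so the sketch runs without an API key
example_chunks = [{
    "url": "https://www.fordham.edu/gabelli-school-of-business/veterans",
    "content": "The Gabelli School of Business is part of Fordham University.",
}]
fake_llm = lambda prompt: f"(stub reply to a {len(prompt)}-character prompt)"
prompt = build_prompt("Which university is Gabelli part of?", example_chunks)
answer = generate("Which university is Gabelli part of?", example_chunks, fake_llm)
print(prompt)
```

To feed it the output of a DataFrame-based retriever, `results.to_dict("records")` gives you the list of dicts this sketch expects.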

In [None]:


---

# 6. Wire everything together

Combine the previous steps into a simple function that takes in a question and returns an answer.

Your task: **write a `rag(question)` function that retrieves relevant chunks and generates an answer.**
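
One way to structure the wiring is to build `rag` from two interchangeable stages, so you can swap retrievers or models later without touching the glue. This is a minimal sketch under assumptions: `make_rag` is a hypothetical helper, and the toy stages below are stand-ins for your own retrieval and generation functions.

```python
def make_rag(retrieve_fn, generate_fn, top_k=5):
    """Return a rag(question) function built from the two pipeline stages.

    retrieve_fn(question, top_k) -> list of chunks
    generate_fn(question, chunks) -> answer string
    """
    def rag(question):
        chunks = retrieve_fn(question, top_k)   # 1. RETRIEVE
        return generate_fn(question, chunks)    # 2-3. AUGMENT + GENERATE
    return rag


# Stand-ins so the sketch runs on its own; swap in your real functions
def toy_retrieve(question, top_k):
    corpus = ["toy chunk about financial aid", "toy chunk about Gabelli"]
    return corpus[:top_k]

def toy_generate(question, chunks):
    return f"Answered {question!r} using {len(chunks)} chunk(s)."

rag = make_rag(toy_retrieve, toy_generate, top_k=1)
print(rag("How do I apply for financial aid?"))
```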

In [None]:


---

# 7. Evaluate, experiment and improve

Your RAG system works — but there's always room to make it better. 

Your task: **evaluate, experiment, and improve your system**

Tips:
- How do you know that your system is working or that your changes are improving it?
- Try different questions — where does it do well? Where does it struggle?
- Adjust the number of retrieved chunks — what happens with more or fewer?
- Try different chunking strategies — bigger chunks? Smaller? Overlap?
- Try a different embedding model — does it change retrieval quality?
- Improve the prompt — can you get better, more concise answers?
- Add source attribution — can the system tell the user which pages the answer came from?
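
To answer "how do you know your changes are improving it?", one simple starting point is a small hand-labeled test set and a retrieval hit rate. This is a minimal sketch: `hit_rate_at_k` is a hypothetical helper, the second test entry is a made-up placeholder, and the stand-in retriever should be replaced with a wrapper around your own retrieval function.

```python
def hit_rate_at_k(test_set, retrieve_urls, k=5):
    """Fraction of questions whose expected URL appears in the top-k results.

    test_set: list of (question, expected_url) pairs you label by hand
    retrieve_urls(question, k): returns the URLs of the top-k retrieved chunks
    """
    hits = 0
    for question, expected_url in test_set:
        if expected_url in retrieve_urls(question, k):
            hits += 1
    return hits / len(test_set)


# Toy run with a stand-in retriever; the second entry is a made-up placeholder
veterans_url = "https://www.fordham.edu/gabelli-school-of-business/veterans"
test_set = [
    ("What support does Gabelli offer veterans?", veterans_url),
    ("Placeholder question with no matching page", "https://example.com/missing"),
]
fake_retrieve_urls = lambda question, k: [veterans_url][:k]
score = hit_rate_at_k(test_set, fake_retrieve_urls, k=5)
print(f"hit rate@5: {score:.2f}")  # hit rate@5: 0.50
```

Rerun the same test set after every change (chunk size, embedding model, `top_k`) so you compare numbers instead of impressions. This only measures retrieval; judging the generated answers needs a separate check.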

In [None]:


---

# 8. (Optional) Make it an app

So far your RAG system lives inside a notebook. That's great for development — but nobody is going to use your Jupyter notebook to ask questions about Fordham. Let's turn it into a real web app.

> 📚 **TERM: Streamlit**  
> A Python library that turns plain Python scripts into interactive web apps. You write Python — no HTML, CSS, or JavaScript — and Streamlit renders it as a web page with inputs, buttons, and formatted output. It's the fastest way to go from "I have a function" to "I have a web app."

Your task: **create a Streamlit app that lets a user type a question about Fordham and get an answer from your RAG system.**

To get started:
- Install it: `uv pip install streamlit` 
- A Streamlit app is just a `.py` file (not a notebook). Create something like `scripts/fordham_rag_app.py`
- Run it: `streamlit run scripts/fordham_rag_app.py` — this opens a browser tab with your app

Tips:
- Check out the [Streamlit docs](https://docs.streamlit.io/) — the "Get started" tutorial is very short
- Your best bet is to vibecode your way to this. You'll be surprised how fast you can get it up and running

---

# Summary

## What You Built

| Step | What You Did | What It Does |
|------|-------------|-------------|
| **Load** | Read ~9,500 Fordham web pages | Get raw content |
| **Chunk** | Split pages into smaller pieces | Make content searchable and promptable |
| **Embed** | Turn chunks into vectors | Enable semantic search |
| **Retrieve** | Find relevant chunks for a question | The **R** in RAG |
| **Generate** | Ask an LLM to answer using the chunks | The **G** in RAG |
| **RAG** | Wire it all together | Question in, answer out |

## The Big Picture

RAG is one of the most common patterns in AI engineering today. What you built here is the same core architecture behind tools like ChatGPT with search, Perplexity, enterprise Q&A bots, and more. The details get more sophisticated (vector databases, reranking, query rewriting, evaluation) but the pattern is the same:

**Find relevant stuff → give it to an LLM → get an answer.**

You can just build things.