---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: AI Engineering

### 📋 **Topic**: You Can Just Build Things

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

## Welcome!

In our first four lectures, we've covered how:
1. We can call LLMs via APIs and get structured responses
2. We can build lexical search with BM25
3. We can build semantic search with embeddings
4. We can combine lexical and semantic search into hybrid search

Today you will put it all together by building a Retrieval Augmented Generation (RAG) system.
- This is a question-answering bot for Fordham University.
- You will use real data scraped from the Fordham website.


Your RAG pipeline will look like this:

```
User Question
     ↓
1. RETRIEVE: Find relevant documents (search!)
     ↓
2. AUGMENT: Stuff those documents into a prompt
     ↓
3. GENERATE: Ask an LLM to answer using the context
     ↓
Answer
```


---

# 1. Look at your data

In `data/fordham-website.zip` you'll find **~9,500 Markdown files** scraped from Fordham's website. Each file is one page: admissions info, program descriptions, faculty pages, financial aid, campus life, and more.

Your task: **look at the data**
- The first step in any AI engineering or data science project should always be to familiarize yourself with the data.
- I cannot stress this enough: without this step, it's hard to build anything useful.

Tips:
- Unzip the archive and look at some of the files. 
- Open a few in a text editor. 
- Get a feel for what you're working with.
- The first line of every file is always the **URL** of the page it was scraped from. The rest is the page content converted to Markdown. Here's an example, `gabelli-school-of-business_veterans.md`:

```markdown
https://www.fordham.edu/gabelli-school-of-business/veterans

# Military Veterans & Active Duty Members of the Military

## Transform Your Knowledge & Skills Into a Business Career for the Future

As a veteran or an active duty member of the United States Armed Services,
you have gained or are currently acquiring the invaluable organizational,
leadership, analytics, and technical knowledge and skills that hiring
managers seek. These transferrable skills provide a major advantage in
emerging, business-related industries where innovation, a global mind-set,
and the ability to lead individuals and teams in the continuously evolving
work environment, are critical for success.

By completing a graduate or undergraduate business degree at the Gabelli
School of Business, you can prepare for a lifelong career in some of
today's fastest-growing fields. ...

### Study at a Top-Ranked, Military-Friendly University

The Gabelli School of Business is part of Fordham University, the only
New York City university to be among those ranked "Best for Vets" by
Military Times. ...

### Learn How the Yellow Ribbon Program Works

The Yellow Ribbon GI Education Enhancement Program, or the Yellow Ribbon
Program, is a part of the Post-9/11 Veterans Educational Assistance Act
of 2008. ...
```

The filenames mirror the URL structure: underscores replace path separators (e.g. `gabelli-school-of-business_veterans.md` came from `/gabelli-school-of-business/veterans`). Some files are short (a few lines), others are quite long.

- Once you've looked around, load the files into Python. Python's built-in `zipfile` module can read zip archives without extracting them to disk. Load the files into a list of dictionaries or a DataFrame with at least two fields: the filename (or a clean page name) and the content.

In [5]:
import zipfile
import pandas as pd
import os
import pathlib

def load_fordham_data(source_path):
    data = []
    
    # 1. Check if it's a zip file
    if source_path.endswith('.zip') and os.path.exists(source_path):
        print(f"Loading from zip: {source_path}")
        with zipfile.ZipFile(source_path, 'r') as z:
            file_list = [f for f in z.namelist() if f.endswith('.md')]
            for file_name in file_list:
                with z.open(file_name) as f:
                    try:
                        content = f.read().decode('utf-8')
                    except UnicodeDecodeError:
                        continue
                    
                    lines = content.split('\n', 1)
                    url = lines[0].strip() if lines else ""
                    body = lines[1].strip() if len(lines) > 1 else ""
                    
                    data.append({
                        "filename": file_name,
                        "url": url,
                        "content": body
                    })
                    
    # 2. Check if it's a directory (Robust Fallback)
    elif os.path.exists(source_path) and os.path.isdir(source_path):
        print(f"Loading from directory: {source_path}")
        path = pathlib.Path(source_path)
        files = list(path.glob('*.md'))
        print(f"Found {len(files)} markdown files.")
        
        for file_path in files:
             try:
                with open(file_path, 'r', encoding='utf-8') as f:
                    lines = f.readlines()
                    if not lines: continue
                    
                    url = lines[0].strip()
                    content = "".join(lines[1:]).strip()
                    
                    data.append({
                        "filename": file_path.name,
                        "url": url,
                        "content": content
                    })
             except Exception as e:
                print(f"Error reading {file_path}: {e}")
                
    else:
        print(f"Source not found or invalid: {source_path}")
        # Try default locations if provided path fails
        fallback_dir = 'data/fordham-website'
        if source_path != fallback_dir and os.path.exists(fallback_dir) and os.path.isdir(fallback_dir):
             print(f"Fallback: Loading from {fallback_dir}")
             return load_fordham_data(fallback_dir) # Recursive call
             
        return pd.DataFrame()

    return pd.DataFrame(data)

# Usage - Point to the directory since you extracted it
source_path = 'data/fordham-website' 
df = load_fordham_data(source_path)

print(f"Loaded {len(df)} documents.")
print(df.head())

Loading from directory: data/fordham-website
Found 9530 markdown files.
Loaded 9530 documents.
                                            filename  \
0  about_living-the-mission_campus-ministry_catho...   
1  academics_centers-and-institutes_center-for-et...   
2  graduate-school-of-arts-and-sciences_student-r...   
3  academics_departments_psychology_graduate-prog...   
4  academics_departments_african--african-america...   

                                                 url  \
0  https://www.fordham.edu/about/living-the-missi...   
1  https://www.fordham.edu/academics/centers-and-...   
2  https://www.fordham.edu/graduate-school-of-art...   
3  https://www.fordham.edu/academics/departments/...   
4  https://www.fordham.edu/academics/departments/...   

                                             content  
0  # Ministry of Music\n\n\nFordham offers each s...  
1  # Advanced Certificate in Health Care Ethics C...  
2  Skip to Main Content\nGraduate School of Arts ...  
3  # Alumni

---

# 2. Chunk the Documents

Some pages may be too long to embed as a single unit, and later they may be too long to stuff into the LLM's prompt during the generation step. For this reason, most RAG systems break large documents into smaller **chunks**.

> 📚 **TERM: Chunking**  
> Splitting documents into smaller, self-contained pieces for embedding and retrieval. The goal is chunks that are small enough to be specific, but large enough to be meaningful.

Your task: **write a function that splits each document into chunks.**

Things to think about:
- What's a reasonable chunk size? (Think about what fits in a prompt vs. what's too vague)
- Should you split on sentences? Paragraphs? A fixed character/word count?
- Should chunks overlap? What happens if an answer spans two chunks?
- How do you keep track of which document each chunk came from? You may need that information down the line.

In [6]:
def chunk_text(text, chunk_size=800, overlap=150):
    chunks = []
    if not text:
        return chunks
        
    start = 0
    text_len = len(text)
    
    while start < text_len:
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        # Move the start pointer forward
        step = chunk_size - overlap
        if step <= 0: step = 1 # Avoid infinite loop
        start += step
        
    return chunks

def process_to_chunks(df):
    chunked_data = []
    
    for index, row in df.iterrows():
        # Get content, default to empty string if missing
        content = row.get('content', '')
        if not isinstance(content, str): content = ""
            
        # Split the body content into pieces
        text_chunks = chunk_text(content)
        
        for i, chunk_content in enumerate(text_chunks):
            # Only add non-empty chunks if you prefer
            if not chunk_content.strip(): continue
                
            chunked_data.append({
                "chunk_id": f"{row['filename']}_{i}",
                "source_url": row['url'],
                "content": chunk_content.strip(),
                "parent_file": row['filename']
            })
            
    return pd.DataFrame(chunked_data)

# Apply the function to create df_chunks
df_chunks = process_to_chunks(df)

print(f"Generated {len(df_chunks)} chunks from {len(df)} documents.")
print(df_chunks.head())

Generated 66082 chunks from 9530 documents.
                                            chunk_id  \
0  about_living-the-mission_campus-ministry_catho...   
1  about_living-the-mission_campus-ministry_catho...   
2  about_living-the-mission_campus-ministry_catho...   
3  about_living-the-mission_campus-ministry_catho...   
4  about_living-the-mission_campus-ministry_catho...   

                                          source_url  \
0  https://www.fordham.edu/about/living-the-missi...   
1  https://www.fordham.edu/about/living-the-missi...   
2  https://www.fordham.edu/about/living-the-missi...   
3  https://www.fordham.edu/about/living-the-missi...   
4  https://www.fordham.edu/about/living-the-missi...   

                                             content  \
0  # Ministry of Music\n\n\nFordham offers each s...   
1  itan area over the live radio broadcast by WFU...   
2  mble:**Liturgical/Worship - Mixed Voices**Memb...   
3  ers at the 7:00 p.m. Sunday Mass in the Univer...   
4 

In [7]:
# Usage
df_chunks = process_to_chunks(df)
print(f"Generated {len(df_chunks)} chunks from {len(df)} documents.")

Generated 66082 chunks from 9530 documents.
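
The fixed-width character windows above can start a chunk mid-word (the sample output shows chunks beginning with fragments like `itan area`). One alternative raised in the questions at the top of this section is to split on paragraph boundaries instead. The following is a minimal sketch, not the method used in this notebook: `chunk_by_paragraph` and its `max_chars` parameter are hypothetical names, and it assumes paragraphs are separated by blank lines.

```python
import re

def chunk_by_paragraph(text, max_chars=800):
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits (or this is the first paragraph of a new chunk).
            current = candidate
        else:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph is here."
print(chunk_by_paragraph(sample, max_chars=30))
```

This version has no overlap between chunks, so an answer that spans two paragraphs may still be split; whether that matters is something to check against your own test questions.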


---

# 3. Embed the Chunks

Now we need to turn each chunk into a vector so we can search over them. You've done this before in Lecture 4.

Your task: **embed all chunks using an embedding model.**

Tips:
- You could use a local model or an API model. What are the tradeoffs?
- This will take a while if you do it serially. You might want to use async calls or batching.
- Once you've created your embeddings, you may want to save them to disk so you don't have to redo this step every time.
- You'll need to embed queries with the **same model** at search time
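
To illustrate the tip about saving embeddings, here is a minimal caching sketch using NumPy's `.npy` format. The file name `chunk_embeddings.npy` and the helper `load_or_compute` are hypothetical, and `embed_fn` stands in for whatever embedding call you use; a fake embedder is used here so the example runs without downloading a model.

```python
import os
import tempfile
import numpy as np

def load_or_compute(path, texts, embed_fn):
    """Return cached embeddings from `path` if present, else compute and save them."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = np.asarray(embed_fn(texts), dtype=np.float32)
    np.save(path, embeddings)
    return embeddings

# Fake embedder for illustration only: one 4-dimensional vector per text.
calls = []
def fake_embed(texts):
    calls.append(len(texts))
    return [[float(len(t)), 0.0, 0.0, 1.0] for t in texts]

cache_path = os.path.join(tempfile.mkdtemp(), "chunk_embeddings.npy")
first = load_or_compute(cache_path, ["a", "bb"], fake_embed)
second = load_or_compute(cache_path, ["a", "bb"], fake_embed)  # read from disk
print(first.shape, len(calls))
```

A cache like this is only valid while the chunks, their order, and the embedding model stay the same; if you change any of them, delete the file and re-embed.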

In [8]:
from sentence_transformers import SentenceTransformer
import numpy as np

# 1. Initialize the model
# 'all-MiniLM-L6-v2' is a small, fast model great for local use
model = SentenceTransformer('all-MiniLM-L6-v2')

# 2. Sample 10% of the data to speed up processing
print(f"Original dataframe size: {len(df_chunks)}")
# Use a fixed random state for reproducibility
df_chunks = df_chunks.sample(frac=0.1, random_state=42).reset_index(drop=True)
print(f"Sampled dataframe size: {len(df_chunks)}")

# 3. Get the list of texts to embed
# We use the 'content' column of our chunks dataframe
chunk_texts = df_chunks['content'].tolist()

print(f"Embedding {len(chunk_texts)} chunks...")

# 4. Embed the chunks
# The model handles batching automatically
embeddings = model.encode(chunk_texts, show_progress_bar=True)

# 5. Add embeddings to the dataframe 
df_chunks['embedding'] = list(embeddings)

print("Embedding complete.")
print(df_chunks.head())

Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Original dataframe size: 66082
Sampled dataframe size: 6608
Embedding 6608 chunks...


Batches:   0%|          | 0/207 [00:00<?, ?it/s]

Embedding complete.
                                            chunk_id  \
0      info_24004_faculty_7156_steven_j_franks.md_14   
1  information-technology_it-security--assurance_...   
2  academics_departments_english_graduate_current...   
3  about_leadership-and-administration_administra...   
4  about_leadership-and-administration_administra...   

                                          source_url  \
0  https://www.fordham.edu/info/24004/faculty/715...   
1  https://www.fordham.edu/information-technology...   
2  https://www.fordham.edu/academics/departments/...   
3  https://www.fordham.edu/about/leadership-and-a...   
4  https://www.fordham.edu/about/leadership-and-a...   

                                             content  \
0  iming of flowering. American Journal of Botany...   
1  e, or networked.\n\n## Policy Statement\n\n- T...   
2  uage in which Beowulf was composed. Students w...   
3  t ways to share the results, and propose short...   
4  ernal guest for busines

---

# 4. Retrieve

Now build the **R** in RAG. Given a user's question, find the most relevant chunks.

Your task: **write a retrieval function that takes a question and returns the most relevant chunks.**

Tips:
- You can use lexical or semantic search or both!
- How many chunks should you retrieve? Too few and you might miss the answer; too many and you'll overwhelm the LLM (and pay more tokens)
- Try a few test questions and eyeball whether the retrieved chunks are relevant. For example:
  - "What programs does the Gabelli School of Business offer?"
  - "How do I apply for financial aid?"
  - "Where is Fordham's campus?"
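
If you want to combine lexical and semantic results, one common option is reciprocal rank fusion (RRF). It merges ranked lists using only ranks, so BM25 scores and cosine similarities do not need to be on the same scale. This is a sketch rather than the course's implementation: `reciprocal_rank_fusion` is a hypothetical helper, the chunk IDs are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_k=5):
    """Fuse several ranked lists of IDs; each item scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_k]]

# Made-up ranked results from a lexical and a semantic retriever.
lexical = ["chunk_a", "chunk_b", "chunk_c"]
semantic = ["chunk_b", "chunk_d", "chunk_a"]
print(reciprocal_rank_fusion([lexical, semantic], top_k=3))
```

Chunks that appear near the top of both lists (here `chunk_b` and `chunk_a`) rise above chunks that only one retriever found.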

In [9]:
import numpy as np

def retrieve(query, df, top_k=5):
    # 1. Embed the query
    query_embedding = model.encode([query])[0]
    
    # 2. Get embeddings from dataframe as a matrix
    embeddings = np.stack(df['embedding'].values)
    
    # 3. Normalize for cosine similarity
    query_norm = query_embedding / np.linalg.norm(query_embedding)
    embeddings_norm = embeddings / np.linalg.norm(embeddings, axis=1)[:, np.newaxis]
    
    # 4. Calculate similarity scores (Dot Product)
    scores = np.dot(embeddings_norm, query_norm)
    
    # 5. Get top k indices
    top_k_indices = np.argsort(scores)[::-1][:top_k]
    
    # 6. Return the top k rows
    return df.iloc[top_k_indices]

# Test it out!
test_queries = [
    "What programs does the Gabelli School of Business offer?",
    "How do I apply for financial aid?",
    "Where is Fordham's campus?"
]

for q in test_queries:
    print(f"\nQuestion: {q}")
    results = retrieve(q, df_chunks)
    
    for i, row in results.iterrows():
        # Handle different column names (filename vs parent_file)
        filename = row.get('filename', row.get('parent_file', 'unknown_file'))
        print(f"  - [{filename}] {row['content'][:150]}...")


Question: What programs does the Gabelli School of Business offer?
  - [undergraduate-admission_apply_how-to-apply_transfer-students_the-transfer-application-process.md] l-of-business/academic-programs-and-admissions/undergraduate-programs/which-campus/lincoln-center/)) and a corresponding major on their application. A...
  - [undergraduate-admission_transfer-students.md] l-of-business/academic-programs-and-admissions/undergraduate-programs/which-campus/lincoln-center/)) and a corresponding major on their application. A...
  - [gabelli-school-of-business_student-and-career-resources_undergraduate-student-resources_academic-advising_first-year-advising_information-for-new-students.md] the wisdom and experience of two assigned upperclass student mentors—your FAM. They will serve as invaluable resources and guides as you navigate your...
  - [undergraduate-admission_majors-and-minors_accounting-information-systems.md] counting majors are offered through the Gabelli School of Business a

---

# 5. Generate

Now build the **G** in RAG. Take the retrieved chunks and pass them to an LLM along with the user's question.

Your task: **write a function that takes a question and the retrieved chunks, builds a prompt, and calls an LLM to generate an answer.**

Tips:
- How should you structure the prompt? The LLM needs to know: (1) what is the context of the application, (2) what is the question, (3) what it should include in its answer
- What should the LLM do if the context doesn't contain the answer?
- Start with a cheap model; try a better one when you've figured out the pipeline

In [10]:
import openai
import os

# Ensure you have your API key set
# os.environ["OPENAI_API_KEY"] = "sk-..." 

def generate_answer(query, df_chunks):
    # 1. RETRIEVE: Get top k relevant chunks
    # (Assuming you named your retrieval function 'retrieve' from Step 4)
    relevant_docs = retrieve(query, df_chunks, top_k=5)
    
    # 2. AUGMENT: Format the chunks into a context string
    context_list = []
    for i, row in relevant_docs.iterrows():
        # Handle filename/parent_file column naming issue safely
        filename = row.get('filename', row.get('parent_file', 'unknown'))
        context_list.append(f"SOURCE: {filename}\nCONTENT: {row['content']}")
    
    context_block = "\n\n---\n\n".join(context_list)

    # 3. PROMPT: Construct the final message
    system_message = (
        "You are a helpful, professional assistant for Fordham University. "
        "Answer the user's question ONLY using the provided context. "
        "If the answer is not in the context, say 'I'm sorry, I don't have that information in the Fordham records.' "
        "Do not use your own internal knowledge to make up facts."
    )
    
    user_message = (
        f"Use the following Fordham University documents to answer the question:\n\n"
        f"{context_block}\n\n"
        f"Question: {query}\n"
        f"Answer:"
    )

    # 4. GENERATE: Call the LLM (gpt-4o-mini is a great 'cheap' starting model)
    try:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            temperature=0
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Error calling LLM: {e}"

# Usage
answer = generate_answer("How do I apply for financial aid?", df_chunks)
print(answer)

To apply for financial aid at Fordham University, you need to fill out both the Free Application for Federal Student Aid (FAFSA) to be considered for federal aid and the CSS Profile to be considered for institutional aid. Deadlines for these applications vary based on your choice of admission application (Early Action, Early Decision, or Regular Decision).


In [11]:
# List of detailed questions to test the RAG system
test_questions = [
    "What specific programs does the Gabelli School of Business offer for undergraduates?",
    "Can you explain the difference between the Rose Hill and Lincoln Center campuses?",
    "What are the application deadlines for early action vs regular decision?",
    "How does the yellow ribbon program work for veterans at Fordham?",
    "Who is the current president of Fordham University?"
]

print(f"--- Testing RAG Generation with {len(test_questions)} questions ---\n")

for i, q in enumerate(test_questions):
    print(f"Question {i+1}: {q}")
    print("-" * 30)
    
    # Generate the answer
    # Make sure you have df_chunks available from Step 2
    answer = generate_answer(q, df_chunks)
    
    print(f"Answer:\n{answer}\n")
    print("=" * 50 + "\n")

--- Testing RAG Generation with 5 questions ---

Question 1: What specific programs does the Gabelli School of Business offer for undergraduates?
------------------------------
Answer:
The Gabelli School of Business offers a broad selection of undergraduate programs that include a liberal arts core, a business core, a major and concentration, and electives. Specific majors are available, including accounting, and students can also choose from various minors and concentrations. More detailed information about the curriculum can be found in the course bulletin.


Question 2: Can you explain the difference between the Rose Hill and Lincoln Center campuses?
------------------------------
Answer:
The Rose Hill campus features 85 acres with Gothic architecture and tree-lined walkways, while the Lincoln Center campus is an 8-acre site located in Manhattan at 60th Street and Columbus Avenue, spanning two city blocks with a landscaped plaza. The Rose Hill campus is known for its traditional cam

---

# 6. Wire everything together

Combine the previous steps into a simple function that takes in a question and returns an answer.

Your task: **write a `rag(question)` function that retrieves relevant chunks and generates an answer.**

In [13]:
# Placeholder for your implementation
def rag(question):
    # This function is your main entry point
    # It calls generate_answer, which in turn calls retrieve
    return generate_answer(question, df_chunks)

# Test the final system
q = "What is the tuition for the MBA program?"
print(rag(q))


I'm sorry, I don't have that information in the Fordham records.


---

# 7. Evaluate, experiment and improve

Your RAG system works, but there's always room to make it better.

Your task: **evaluate, experiment, and improve your system**

Tips:
- How do you know that your system is working or that your changes are improving it?
- Try different questions: where does it do well? Where does it struggle?
- Adjust the number of retrieved chunks: what happens with more or fewer?
- Try different chunking strategies: bigger chunks? Smaller? Overlap?
- Try a different embedding model: does it change retrieval quality?
- Improve the prompt: can you get better, more concise answers?
- Add source attribution: can the system tell the user which pages the answer came from?
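
One way to approach the first question in this list is to hand-label a few questions with the page you expect to be retrieved, then measure how often that page shows up in the top k results. This sketch assumes a retriever that returns a list of file names; `hit_rate_at_k`, the labeled pairs, and the stub retriever are all hypothetical and should be replaced with your own.

```python
def hit_rate_at_k(labeled_questions, retrieve_files, k=5):
    """Fraction of questions whose expected file appears in the top-k results."""
    if not labeled_questions:
        return 0.0
    hits = 0
    for question, expected_file in labeled_questions:
        if expected_file in retrieve_files(question)[:k]:
            hits += 1
    return hits / len(labeled_questions)

# Made-up labels and a stub retriever, for illustration only.
labeled = [
    ("How does the Yellow Ribbon Program work?", "gabelli-school-of-business_veterans.md"),
    ("Where is the campus?", "some_expected_page.md"),
]

def stub_retrieve_files(question):
    # Always returns the same two files, regardless of the question.
    return ["gabelli-school-of-business_veterans.md", "other_page.md"]

print(hit_rate_at_k(labeled, stub_retrieve_files, k=5))
```

With the pipeline from this notebook, the retriever argument could be something like `lambda q: retrieve(q, df_chunks, top_k=5)["parent_file"].tolist()`. Re-running the same labeled set after each change (chunk size, embedding model, `top_k`) gives you a number to compare instead of an impression.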

In [None]:
# Placeholder for your implementation

---

# 8. (Optional) Make it an app

So far your RAG system lives inside a notebook. That's great for development, but nobody is going to use your Jupyter notebook to ask questions about Fordham. Let's turn it into a real web app.

> 📚 **TERM: Streamlit**  
> A Python library that turns plain Python scripts into interactive web apps. You write Python (no HTML, CSS, or JavaScript) and Streamlit renders it as a web page with inputs, buttons, and formatted output. It's the fastest way to go from "I have a function" to "I have a web app."

Your task: **create a Streamlit app that lets a user type a question about Fordham and get an answer from your RAG system.**

To get started:
- Install it: `uv pip install streamlit` 
- A Streamlit app is just a `.py` file (not a notebook). Create something like `scripts/fordham_rag_app.py`
- Run it: `streamlit run scripts/fordham_rag_app.py` (this opens a browser tab with your app)

Tips:
- Check out the [Streamlit docs](https://docs.streamlit.io/): the "Get started" tutorial is very short
- Your best bet is to vibecode your way to this. You'll be surprised how fast you can get it up and running

---

# Summary

## What You Built

| Step | What You Did | What It Does |
|------|-------------|-------------|
| **Load** | Read 9,500+ Fordham web pages | Get raw content |
| **Chunk** | Split pages into smaller pieces | Make content searchable and promptable |
| **Embed** | Turn chunks into vectors | Enable semantic search |
| **Retrieve** | Find relevant chunks for a question | The **R** in RAG |
| **Generate** | Ask an LLM to answer using the chunks | The **G** in RAG |
| **RAG** | Wire it all together | Question in, answer out |

## The Big Picture

RAG is one of the most common patterns in AI engineering today. What you built here is the same core architecture behind tools like ChatGPT with search, Perplexity, enterprise Q&A bots, and more. The details get more sophisticated (vector databases, reranking, query rewriting, evaluation) but the pattern is the same:

**Find relevant stuff → give it to an LLM → get an answer.**

You can just build things.