<a href="https://colab.research.google.com/github/ben-blake/cs5542-lab01/blob/main/week1_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 5542 — Week 1 Lab
## From Data to Retrieval: GitHub → Colab → Hugging Face → Embeddings

**Learning Goals:**
- Use GitHub for collaborative analytics workflows
- Run notebooks in Google Colab
- Load datasets and models from Hugging Face Hub
- Build an embedding-based retrieval system (mini-RAG)


### GenAI Systems Context (Mini-RAG)
This lab implements a **mini Retrieval-Augmented Generation (RAG)** pipeline:
- A **Transformer encoder** produces semantic embeddings
- A **vector index (FAISS)** enables fast retrieval
- Retrieved context is what a downstream **LLM** would use for grounded generation


## Step 1 — Environment Setup
Install required libraries. This may take ~1 minute.


In [None]:
!pip install -q transformers datasets sentence-transformers faiss-cpu

## Step 2 — Load Dataset & Model from Hugging Face Hub
We load the first 200 training examples of the AG News dataset (`ag_news`) and the `all-MiniLM-L6-v2` sentence-embedding model.


In [None]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

dataset = load_dataset("ag_news", split="train[:200]")
model = SentenceTransformer("all-MiniLM-L6-v2")

texts = list(dataset["text"])  # plain Python list of strings
print(f"Loaded {len(texts)} documents")

## Step 3 — Create Embeddings
Encode each document into a dense vector. Texts with similar meaning map to nearby vectors, which is what lets us retrieve relevant documents before any generation happens.
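
To make "nearby vectors" concrete, here is a minimal sketch that ranks a few texts by cosine similarity. The 3-dimensional vectors are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `cosine_similarity` is a helper written for this example, not part of any library used in the lab.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings": hand-picked numbers, not model output.
toy_vectors = {
    "stocks rally": [0.9, 0.1, 0.0],
    "markets climb": [0.8, 0.2, 0.1],
    "team wins final": [0.0, 0.1, 0.9],
}

query = toy_vectors["stocks rally"]
scores = {name: cosine_similarity(query, vec) for name, vec in toy_vectors.items()}

# Highest similarity first: the finance headline should outrank the sports one.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The two finance-like vectors point in almost the same direction, so they score close to 1, while the sports-like vector scores close to 0.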


In [None]:
embeddings = model.encode(texts, show_progress_bar=True)
print('Embedding shape:', embeddings.shape)

## Step 4 — Build a Vector Index (FAISS)
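`IndexFlatL2` stores every vector and runs an exact (brute-force) nearest-neighbor search by squared L2 distance. It stands in for the retrieval layer of a RAG system.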
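
As a rough picture of what a flat L2 index computes, the sketch below does the same brute-force search in plain Python. It is an illustration only: `l2_search` and the 2-dimensional vectors are hypothetical, and FAISS performs this far more efficiently.

```python
def l2_search(vectors, query, k=3):
    """Return (squared_distances, indices) of the k stored vectors nearest to query."""
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query.
        dist = sum((x - q) ** 2 for x, q in zip(vec, query))
        scored.append((dist, i))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Hypothetical stored vectors and query, chosen so the answer is easy to check by hand.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = l2_search(stored, [0.9, 0.1], k=2)
print(indices, distances)
```

Every stored vector is compared against the query, so the cost grows linearly with the number of documents. That is fine for 200 documents; larger collections usually call for an approximate index.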


In [None]:
import faiss
import numpy as np

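dim = embeddings.shape[1]  # embedding dimensionality
index = faiss.IndexFlatL2(dim)  # exact search over squared L2 distance
index.add(np.asarray(embeddings, dtype="float32"))  # FAISS expects float32 vectors
print('Index size:', index.ntotal)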

## Step 5 — Retrieval Function
Embed the query with the same model, then return the `k` documents whose vectors are closest to it.


In [None]:
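def search(query, k=3):
    """Return the k documents whose embeddings are nearest to the query."""
    q_emb = model.encode([query])  # shape (1, dim)
    distances, indices = index.search(np.asarray(q_emb, dtype="float32"), k)
    return [texts[int(i)] for i in indices[0]]  # row 0 holds the hits for our single query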
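
`index.search` returns two arrays with one row per query, which is why the function reads `indices[0]`. The sketch below shows one way to keep the distances alongside the texts. `collect_hits` is a hypothetical helper, and nested lists with made-up values stand in for the FAISS output so the example runs without the index.

```python
def collect_hits(distances, indices, texts):
    """Pair each retrieved text with its distance for the first (only) query row."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k neighbors exist
            continue
        hits.append((texts[int(idx)], float(dist)))
    return hits

# Made-up stand-ins for the (1, k) arrays returned by index.search.
sample_texts = ["doc a", "doc b", "doc c"]
sample_distances = [[0.12, 0.57, 0.0]]  # last value is a placeholder for a padded slot
sample_indices = [[2, 0, -1]]

hits = collect_hits(sample_distances, sample_indices, sample_texts)
print(hits)
```

Returning the distance with each hit makes it easier to judge how strong a match is, or to drop weak matches before they reach the LLM.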

## Step 6 — Try It!


In [None]:
search("stock market and economy")

In [None]:
print(search("artificial intelligence in healthcare"))

## Reflection
**In 1–2 sentences, explain how embeddings enable retrieval before generation in GenAI systems.**


Embeddings convert text into numerical vectors in which similar meanings end up close together, so the most relevant documents for a query can be found quickly with vector similarity search. RAG systems fetch this relevant context first and pass it to the LLM along with the user's question, so the model can ground its answer in retrieved knowledge instead of relying only on what it learned during training, which is where hallucinations tend to come from.
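
The "fetch context, then feed it to the LLM" step can be sketched as simple prompt assembly. This is a minimal illustration, not part of the lab's pipeline: `build_prompt`, the prompt wording, and the sample passages are all assumptions, and no LLM is called.

```python
def build_prompt(question, retrieved_docs):
    """Combine retrieved passages and the user's question into one prompt string."""
    # Number each passage so the model (or a reader) can refer back to it.
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder passages; in the lab these would come from search(query).
prompt = build_prompt(
    "What moved the markets today?",
    ["Stocks rose after the jobs report.", "Oil prices slipped on supply news."],
)
print(prompt)
```

In the notebook, the list returned by `search(...)` could be passed as `retrieved_docs`, and the resulting string sent to whichever LLM is used downstream.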