<a href="https://colab.research.google.com/github/ango3636/CS5588DSCapstone/blob/update-notebook/CS5588_Week1_HandsOn_MiniRAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 5588 — Week 1 Hands On
## From Data to Retrieval: GitHub → Colab → Hugging Face → Embeddings

**Learning Goals:**
- Use GitHub for collaborative analytics workflows
- Run notebooks in Google Colab
- Load datasets and models from Hugging Face Hub
- Build an embedding-based retrieval system (mini-RAG)


### GenAI Systems Context (Mini-RAG)
This lab implements a **mini Retrieval-Augmented Generation (RAG)** pipeline:
- A **Transformer encoder** produces semantic embeddings
- A **vector index (FAISS)** enables fast retrieval
- Retrieved context is what a downstream **LLM** would use for grounded generation
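
To make the last point concrete, here is a minimal sketch of how retrieved chunks might be packed into a prompt for a downstream LLM. The `build_prompt` helper and its template wording are hypothetical, not part of this lab or of any library; the lab itself stops at retrieval and does not call an LLM.

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt from retrieved chunks (hypothetical template)."""
    # Number each chunk so the model (or a reader) can refer back to its source.
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


# Illustrative input; in the lab these strings would come from the retrieval step.
example_chunks = [
    "Medical researchers said the U.S. needs a system independent of the F.D.A.",
]
prompt = build_prompt("Who should monitor drug safety?", example_chunks)
print(prompt)
```

The string returned here is what would be sent to a generation model; grounding comes from restricting the answer to the numbered context.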


## Step 1 — Environment Setup
Install required libraries. This may take ~1 minute.


In [None]:
!pip install -q transformers datasets sentence-transformers faiss-cpu

## Step 2 — Load Dataset & Model from Hugging Face Hub
We use a lightweight news dataset and a sentence embedding model.


Replace the empty string on the commented line below with your Hugging Face token.

In [None]:
from huggingface_hub import login

HF_TOKEN = ""  # <-- REPLACE THE EMPTY STRING WITH YOUR HF TOKEN

# strip() so that a whitespace-only placeholder is not treated as a real token
if HF_TOKEN.strip() and HF_TOKEN != "YOUR_HF_TOKEN_HERE":
    login(token=HF_TOKEN)
    print("✅ Logged in to Hugging Face")
else:
    print("⚠️ No HF token provided. Public models may still work, but rate limits may apply.")

✅ Logged in to Hugging Face


TASK

1. Find an ag_news dataset on the Hugging Face Hub.
2. On the dataset page, click the "Use this dataset" button and choose the Hugging Face `datasets` library option.
3. Copy the entire code snippet, paste it into the empty cell below, and run it.

In [None]:
# Paste your dataset code from Hugging Face here
from datasets import load_dataset

# Login using e.g. `huggingface-cli login` to access this dataset
dataset = load_dataset("HuyAugie/Smaller_AG_News_Dataset")

In [None]:
texts = dataset["train"].select(range(200))
print(f"Selected {len(texts)} examples from the 'train' split.")

Selected 200 examples from the 'train' split.


In [None]:
texts[1:6]

{'text': ["BBC reporters' log BBC correspondents record events in the Middle East and their thoughts as the funeral of the Palestinian leader Yasser Arafat takes place.",
  'Israel welcomes Rice nomination; Palestinians wary Israel on Tuesday warmly welcomed the naming of Condoleezza Rice as America #39;s top diplomat, but Palestinians were wary, saying the new Bush administration must put more energy into the quest for Middle East peace.',
  'Medical Journal Calls for a New Drug Watchdog Medical researchers said the U.S. needs a system independent of the F.D.A. to detect harmful effects of drugs already on the market.',
  'Militants Kidnap Relatives of Iraqi Minister-TV Militants have kidnapped two relatives of Iraqi Defense Minister Hazim al-Shalaan and demanded US forces leave the holy city of Najaf, Al Jazeera television reported Wednesday.',
  'US to support democracy WASHINGTON, Sept 18: The United States has said that it would reiterate its support for a  #39;fully functioning #

TASK

1. Find the sentence-transformers model `all-MiniLM-L6-v2` on the Hugging Face website.
2. On the model page, click the "Use this model" button and choose the sentence-transformers library option.
3. Copy the first two lines of the code snippet, paste them into the empty cell below, and run it.

In [None]:
# Paste the first 2 lines of the sentence-transformers library code from Hugging Face to load the model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

## Step 3 — Create Embeddings
These vectors represent semantic meaning and enable retrieval before generation.
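
As a rough illustration of what "semantic meaning as vectors" buys us, the sketch below compares made-up 3-dimensional vectors with cosine similarity. The numbers and the `toy_vectors` dictionary are invented for this example; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions and come from the model, not from hand-written values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings": the first two point in a similar direction.
toy_vectors = {
    "doctor": [0.9, 0.1, 0.0],
    "nurse": [0.8, 0.2, 0.1],
    "senate": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["doctor"], toy_vectors["nurse"])
sim_unrelated = cosine_similarity(toy_vectors["doctor"], toy_vectors["senate"])
print(f"doctor vs nurse:  {sim_related:.3f}")
print(f"doctor vs senate: {sim_unrelated:.3f}")
```

Texts with related meaning should end up with vectors that score higher than unrelated ones, which is the property retrieval relies on.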


In [None]:
# Pass the list of strings from the "text" column, not the Dataset object itself
embeddings = model.encode(texts["text"], show_progress_bar=True)
print('Embedding shape:', embeddings.shape)

## Step 4 — Build a Vector Index (FAISS)
This simulates the retrieval layer in RAG systems.
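
`faiss.IndexFlatL2` performs an exact, exhaustive search and reports squared Euclidean (L2) distances. The pure-Python sketch below mimics that behaviour on a handful of invented 2-dimensional vectors; `squared_l2`, `brute_force_search`, and `toy_index` are illustrative names, and this is not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vec, vectors, k=3):
    """Return the k (distance, position) pairs closest to query_vec."""
    scored = [(squared_l2(query_vec, vec), pos) for pos, vec in enumerate(vectors)]
    scored.sort()  # smallest distance first
    return scored[:k]


# Invented toy vectors standing in for stored embeddings.
toy_index = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
nearest = brute_force_search([0.9, 0.1], toy_index, k=2)
for dist, pos in nearest:
    print(f"position {pos}, squared L2 distance {dist:.2f}")
```

Comparing the query against every stored vector is fine for the 200 examples used here; approximate index types trade some accuracy for speed on much larger collections.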


In [None]:
import faiss
import numpy as np

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.asarray(embeddings, dtype="float32"))  # FAISS expects float32 vectors
print('Index size:', index.ntotal)

Index size: 200


## Step 5 — Retrieval Function
Search for documents related to a query.
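
A FAISS search returns two arrays of shape `(n_queries, k)`: distances and positions in the index. When fewer than `k` vectors are available, the unused positions are filled with `-1`. The sketch below shows one way to pair distances with texts and skip those placeholders. The `format_hits` helper is hypothetical, and the inputs are plain lists standing in for the arrays FAISS would return.

```python
def format_hits(distances, indices, corpus):
    """Pair each valid hit for the first query with its distance.

    distances, indices: nested sequences shaped (n_queries, k), FAISS-style.
    corpus: sequence of texts in the same order they were added to the index.
    """
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # -1 marks "no result" when k exceeds the index size
            continue
        hits.append((float(dist), corpus[int(idx)]))
    return hits


# Stand-in values shaped like a search result for one query with k=3.
fake_distances = [[0.2, 0.5, 3.4e38]]
fake_indices = [[4, 0, -1]]
fake_corpus = ["a", "b", "c", "d", "e"]

for dist, text in format_hits(fake_distances, fake_indices, fake_corpus):
    print(f"{dist:.2f}  {text}")
```

Keeping the distance next to each result makes it easier to judge whether a retrieved chunk is actually close to the query or merely the least distant option.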


In [None]:
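def search(query, k=3):
    """Return the k dataset rows whose embeddings are closest to the query."""
    q_emb = model.encode([query])  # shape: (1, dim)
    # FAISS expects float32 input; the distances are not used here
    _, indices = index.search(np.asarray(q_emb, dtype="float32"), k)
    return [texts[int(i)] for i in indices[0]]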

## Step 6 — Try It!


In [None]:
query = "no intelligence in healthcare"
top_chunks = search(query, k=3)

print(f"Top 3 chunks retrieved for query: '{query}'\n")
for i, chunk in enumerate(top_chunks):
    print(f"{i+1}. {chunk['text']}\n")

Top 3 chunks retrieved for query: 'no intelligence in healthcare'

1. Bill Overhauling Intelligence Faces Uncertain Fate in Senate The Senate opened a floor debate today and moved toward a final vote on a bill endorsed by 9/11 commission leaders.

2. Concern grows over ailing Arafat Medics, aides and family rush to the West Bank as fears grow for Yasser Arafat's health.

3. Medical Journal Calls for a New Drug Watchdog Medical researchers said the U.S. needs a system independent of the F.D.A. to detect harmful effects of drugs already on the market.



## Reflection
In 1–2 sentences, explain how embeddings enable retrieval before generation in GenAI systems.

Embeddings enable retrieval by converting text into high-dimensional numerical vectors that represent semantic meaning, so the system can mathematically compare a user's query against a collection of stored documents. By finding the stored vectors closest to the query vector in this "meaning space," the system can quickly pull relevant context to ground the GenAI model's response before it begins generating text.
