# CS 5542 — Week 1 Lab
## From Data to Retrieval: GitHub → Colab → Hugging Face → Embeddings

**Learning Goals:**
- Use GitHub for collaborative analytics workflows
- Run notebooks in Google Colab
- Load datasets and models from Hugging Face Hub
- Build an embedding-based retrieval system (mini-RAG)


### GenAI Systems Context (Mini-RAG)
This lab implements a **mini Retrieval-Augmented Generation (RAG)** pipeline:
- A **Transformer encoder** produces semantic embeddings
- A **vector index (FAISS)** enables fast retrieval
- Retrieved context is what a downstream **LLM** would use for grounded generation
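
The three stages above can be sketched end to end in plain Python. This is a toy illustration only: `toy_encode` (word counts over a tiny vocabulary) stands in for the Transformer encoder, a sorted dot-product ranking stands in for the vector index, and `build_prompt` is a hypothetical helper showing what a downstream LLM might receive. None of these names come from the lab's libraries.

```python
from collections import Counter

def toy_encode(text, vocab):
    """Toy stand-in for the Transformer encoder: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def retrieve(query, docs, vocab, k=1):
    """Rank documents by dot product with the query vector (stand-in for the vector index)."""
    q = toy_encode(query, vocab)

    def score(doc):
        return sum(a * b for a, b in zip(q, toy_encode(doc, vocab)))

    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query, context):
    """Assemble the text a downstream LLM would receive for grounded generation."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"

docs = ["oil prices soar", "nuclear plant closes reactors", "stocks fall on oil worries"]
vocab = ["oil", "nuclear", "stocks", "prices", "plant"]
context = retrieve("nuclear plant safety", docs, vocab, k=1)
print(build_prompt("nuclear plant safety", context))
```

The rest of the lab replaces the toy encoder with a real sentence embedding model and the sorted list with a FAISS index; the data flow stays the same.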


## Step 1 — Environment Setup
Install required libraries. This may take ~1 minute.


In [2]:
!pip install -q transformers datasets sentence-transformers faiss-cpu

## Step 2 — Load Dataset & Model from Hugging Face Hub
We use AG News, a lightweight news dataset, and `all-MiniLM-L6-v2`, a compact sentence embedding model.


Paste your Hugging Face token into the empty string on the commented line below.

In [5]:
from huggingface_hub import login

HF_TOKEN = ""  # <-- REPLACE THE EMPTY STRING WITH YOUR HF TOKEN

if HF_TOKEN and HF_TOKEN != "YOUR_HF_TOKEN_HERE":
    login(token=HF_TOKEN)
    print("✅ Logged in to Hugging Face")
else:
    print("⚠️ No HF token provided. Public models may still work, but rate limits may apply.")

✅ Logged in to Hugging Face
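
A token typed directly into a cell is easy to leak when the notebook is pushed to GitHub. One alternative is to read it from the environment. A minimal sketch, assuming the token was exported beforehand in a variable named `HF_TOKEN`; the `read_hf_token` helper is illustrative and not part of the lab:

```python
import os

def read_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in an environment variable, or "" if it is unset."""
    # env_var is an assumed name; use whichever variable you actually exported.
    return os.environ.get(env_var, "").strip()

HF_TOKEN = read_hf_token()
print("token found" if HF_TOKEN else "no token set")
```

The resulting `HF_TOKEN` can then be passed to `login(token=HF_TOKEN)` exactly as in the cell above.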


TASK

1. Find any dataset whose name ends in `/ag_news` on the Hugging Face Hub.
2. Click the "Use this dataset" button on the left side and choose the Hugging Face library option.
3. Copy the entire code snippet, paste it into the empty cell, and run it.

In [6]:
# Paste your dataset code from Hugging Face here
from datasets import load_dataset

dataset = load_dataset("fancyzhx/ag_news")

In [7]:
texts = dataset["train"].select(range(200))
print(f"Selected {len(texts)} examples from the 'train' split.")

Selected 200 examples from the 'train' split.


In [8]:
texts[1:6]

{'text': ['Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.',
  "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.",
  'Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\\flows from the main pipeline in southern Iraq after\\intelligence showed a rebel militia could strike\\infrastructure, an oil official said on Saturday.',
  'Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the 

TASK

1. Find the sentence-transformer model `all-MiniLM-L6-v2` on the Hugging Face website.
2. Click the "Use this model" button on the left side and choose the sentence-transformers library option.
3. Copy the first two lines of the code, paste them into the empty cell, and run it.

In [9]:
# Paste the first 2 lines of the sentence-transformers library code from Hugging Face to load the model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")



## Step 3 — Create Embeddings
Each text is encoded as a fixed-length vector. Texts with similar meaning land close together, which is what makes retrieval before generation possible.


In [10]:
# Pass the raw strings explicitly rather than the Dataset object
embeddings = model.encode(list(texts["text"]), show_progress_bar=True)
print('Embedding shape:', embeddings.shape)

Embedding shape: (200, 384)
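
Each of the 200 texts is now a 384-dimensional vector, and Step 4 ranks such vectors by L2 distance. When vectors have unit length, squared L2 distance and cosine similarity give the same ranking, because ||a - b||^2 = 2 - 2 cos(a, b). The sketch below checks that identity on two small made-up vectors. Whether this model's output is already unit length is worth verifying yourself (for example with `np.linalg.norm`); `encode` also accepts `normalize_embeddings=True`.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, normalized to unit length
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 0.5, 1.0])

# For unit vectors the two printed values agree up to floating-point error
print(squared_l2(a, b), 2 - 2 * cosine(a, b))
```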


## Step 4 — Build a Vector Index (FAISS)
This simulates the retrieval layer in RAG systems.


In [11]:
import faiss
import numpy as np

dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.array(embeddings))
print('Index size:', index.ntotal)

Index size: 200
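
`IndexFlatL2` is FAISS's exact (brute-force) index: it compares the query against every stored vector and reports squared L2 distances. The pure-Python sketch below shows that idea for intuition only. `flat_l2_search` is a hypothetical helper, not a FAISS function, and the real index is far faster.

```python
def flat_l2_search(vectors, query, k):
    """Exact nearest-neighbor search: score every stored vector, keep the k closest."""
    scored = [
        (sum((x - q) ** 2 for x, q in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scored.sort()  # ascending squared L2 distance; ties are broken by index
    top = scored[:k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices

# Three made-up 2-dimensional vectors standing in for the 200 embeddings
vectors = [[0.0, 0.0], [2.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, query=[1.5, 0.0], k=2)
print(distances, indices)
```

Because every vector is scored, the cost grows linearly with the index size. That is fine for 200 documents; larger collections usually call for an approximate index.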


## Step 5 — Retrieval Function
Search for documents related to a query.


In [12]:
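def search(query, k=3):
    """Return the k examples whose embeddings are nearest to the query embedding."""
    q_emb = model.encode([query])  # shape (1, dim): one row per query
    distances, indices = index.search(np.array(q_emb), k)
    # indices has shape (1, k); row 0 holds the hits for our single query
    return [texts[int(i)] for i in indices[0]]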

## Step 6 — Try It!


In [20]:
query = "nuclear material"
top_chunks = search(query, k=3)

print(f"Top 3 chunks retrieved for query: '{query}'\n")
for i, chunk in enumerate(top_chunks):
    print(f"{i+1}. {chunk['text']}\n")

Top 3 chunks retrieved for query: 'nuclear material'

1. Japan nuclear firm shuts plants The company running the Japanese nuclear plant hit by a fatal accident is to close its reactors for safety checks.

2. Will Russia, the Oil Superpower, Flex Its Muscles? Russia is again emerging as a superpower - but the reason has less to do with nuclear weapons than with oil.

3. Vietnam's Citadel Vulnerable to Weather (AP) AP - Experts from Europe and Asia surveyed 1,400-year-old relics of an ancient citadel in Hanoi Tuesday and said they were concerned the priceless antiquities were at risk from exposure to the elements.
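
The third result is unrelated to the query: a top-k search always returns k items, however distant they are. One common mitigation is a distance cutoff. A minimal sketch, assuming `(distance, text)` pairs such as those `index.search` could supply; the distances and the `max_distance` value here are made-up placeholders that would need tuning on real data:

```python
def filter_by_distance(hits, max_distance):
    """Keep only hits whose distance is at or below the cutoff, preserving order."""
    return [(dist, text) for dist, text in hits if dist <= max_distance]

# Made-up distances for illustration; they are not the values from the run above.
hits = [
    (0.9, "Japan nuclear firm shuts plants"),
    (1.2, "Will Russia, the Oil Superpower, Flex Its Muscles?"),
    (1.7, "Vietnam's Citadel Vulnerable to Weather"),
]
kept = filter_by_distance(hits, max_distance=1.5)
for dist, text in kept:
    print(f"{dist:.2f}  {text}")
```

Returning fewer than k results, or none, is often better for grounded generation than passing irrelevant context to the LLM.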



## Reflection
In 1–2 sentences, explain how embeddings enable retrieval before generation in GenAI systems.
