# Lab 03 — Building the Embedding & Vector Database Pipeline

**Course:** A Practical Guide to Building a GenAI Application

**Duration:** 90–120 minutes

**Objectives:**
- Understand embeddings & vector search
- Build a complete text‑to‑embedding pipeline
- Implement chunking, cleaning, embedding, and storage
- Use a real vector database (FAISS for this lab)
- Evaluate similarity search accuracy


## 1 — Introduction to Embeddings

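Embeddings convert text into numerical vectors that capture semantic meaning: texts with similar meanings map to vectors that lie close together, which is what makes vector search possible.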

Examples of embedding models:
- OpenAI text-embedding-3-large
- BGE Large / Small
- Sentence Transformers models such as all-MiniLM-L6-v2 (used in this lab)
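
To make "vectors representing semantic meaning" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three vectors are hypothetical, hand-picked toy values, not output from any real model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" chosen by hand for illustration.
finance_a = [0.9, 0.1, 0.0]
finance_b = [0.8, 0.2, 0.1]
programming = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(finance_a, finance_b)
sim_unrelated = cosine_similarity(finance_a, programming)

# The two "finance" vectors point in nearly the same direction,
# so their similarity is much higher than the finance/programming pair.
print(sim_related > sim_unrelated)
```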


## 2 — Install dependencies

Run this cell to install required libraries.


In [None]:
!pip install faiss-cpu sentence-transformers

## 3 — Load an embedding model

In [None]:
from sentence_transformers import SentenceTransformer

# Downloads the model on first use, then loads it from the local cache.
model = SentenceTransformer('all-MiniLM-L6-v2')
model

## 4 — Create a sample document set

In [None]:
documents = [
    'The Nigerian Stock Exchange closed higher today.',
    'The Central Bank of Nigeria announced new monetary policies.',
    'Python is a popular programming language for AI.',
    'Lagos is the commercial capital of Nigeria.'
]
documents

## 5 — Chunking & Text Cleaning

### Exercise: Implement a simple chunking function.
Chunk size = 40 words.


In [None]:
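import re

def clean(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

def chunk(text, size=40):
    """Split text into consecutive, non-overlapping chunks of at most `size` words."""
    words = text.split()
    return [' '.join(words[i:i+size]) for i in range(0, len(words), size)]

# Clean each document, then collect its chunks into one flat list.
chunks = []
for doc in documents:
    chunks.extend(chunk(clean(doc)))

chunks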
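
Each sample document is shorter than 40 words, so `chunk` returns exactly one chunk per document here. On longer texts, non-overlapping windows can cut a sentence in half at a chunk boundary. One common mitigation is to let consecutive chunks share some words. The sketch below is an assumption about how that might look, not part of the lab's required solution; `chunk_with_overlap` and its `overlap` parameter are hypothetical names.

```python
def chunk_with_overlap(text, size=40, overlap=10):
    """Split text into windows of `size` words; each window repeats the
    last `overlap` words of the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(' '.join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return windows

# Synthetic 100-word text (w0 ... w99) so the window boundaries are easy to see.
sample = ' '.join(f'w{i}' for i in range(100))
pieces = chunk_with_overlap(sample, size=40, overlap=10)
print(len(pieces), [p.split()[0] for p in pieces])
```

With `overlap=0` the function behaves like the non-overlapping `chunk` above.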

## 6 — Convert chunks to embeddings

In [None]:
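# encode() returns one row per chunk; FAISS expects a float32 array.
embeddings = model.encode(chunks, convert_to_numpy=True)
embeddings.shape  # (number of chunks, embedding dimension)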

## 7 — Build a Vector Database (FAISS)


In [None]:
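import faiss

d = embeddings.shape[1]        # embedding dimension
index = faiss.IndexFlatL2(d)   # exact (brute-force) search using L2 distance
index.add(embeddings)          # vectors get IDs 0..n-1 in insertion order
index.ntotal                   # number of vectors stored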
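
`IndexFlatL2` performs an exhaustive search: it compares the query with every stored vector and returns the closest ones by squared Euclidean distance. The pure-Python sketch below illustrates that idea on hypothetical 2-dimensional vectors. It is a conceptual stand-in, not how FAISS is implemented internally, and `brute_force_search` is a name invented for this example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to `query`."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [dist for dist, _ in top], [i for _, i in top]

# Hypothetical stored vectors; position in the list plays the role of the vector ID.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = brute_force_search(stored, query=[0.9, 0.1], k=2)
print(indices, distances)
```

Exhaustive search is exact but its cost grows linearly with the number of stored vectors, which is fine for a lab-sized dataset.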

## 8 — Perform Similarity Search

Query example: *What does the Central Bank of Nigeria do?*


In [None]:
query = 'What does the Central Bank of Nigeria do?'
query_vec = model.encode([query])      # shape (1, d): one row per query
D, I = index.search(query_vec, k=2)    # D: squared L2 distances, I: indices into chunks
I, D

### Return matching chunks

In [None]:
[chunks[i] for i in I[0]]
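
When inspecting results it helps to see each chunk next to its distance (smaller means closer). FAISS also fills an index slot with `-1` when fewer than `k` neighbours are available, and in Python `chunks[-1]` would silently return the last chunk. The helper below is a small sketch that handles both points; `pair_results` is a hypothetical name, and the demo lists are made-up stand-ins for one row of `D` and `I`.

```python
def pair_results(chunks, distances, indices):
    """Pair each retrieved chunk with its distance for one query row,
    skipping the -1 placeholders FAISS uses for missing neighbours."""
    return [(chunks[i], float(dist))
            for dist, i in zip(distances, indices)
            if i != -1]

# Made-up data shaped like one row of D and I from index.search().
chunks_demo = ['alpha', 'beta', 'gamma']
D_demo = [[0.25, 0.9, 3.4e38]]
I_demo = [[1, 0, -1]]

results = pair_results(chunks_demo, D_demo[0], I_demo[0])
print(results)
```

In the notebook this would be called as `pair_results(chunks, D[0], I[0])`. If the embeddings are unit-normalized, a squared L2 distance `dist` corresponds to a cosine similarity of `1 - dist / 2`.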

## 9 — Lab Exercise: Improve Retrieval Quality

### Task:
- Add five documents of your own
- Rebuild the embeddings and the FAISS index
- Run three queries
- Compare the FAISS results and explain which matches are correct or incorrect
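
One way to turn the comparison into a number is a hit rate at k: the fraction of queries for which at least one hand-labelled relevant chunk appears in the top k results. The sketch below assumes you have recorded the retrieved indices and your own relevance judgements; `hit_rate_at_k` is a hypothetical helper and the data is invented for illustration.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose top-k retrieved indices contain
    at least one index judged relevant."""
    hits = sum(1 for got, want in zip(retrieved, relevant)
               if set(got[:k]) & set(want))
    return hits / len(retrieved)

# Invented results for 3 queries: top-2 chunk indices returned by the index...
retrieved = [[1, 3], [2, 0], [3, 0]]
# ...and the chunk indices judged relevant by hand for each query.
relevant = [{1}, {2}, {0}]

hit_at_1 = hit_rate_at_k(retrieved, relevant, k=1)
hit_at_2 = hit_rate_at_k(retrieved, relevant, k=2)
print(hit_at_1, hit_at_2)
```

Here the third query only finds its relevant chunk in second position, so the hit rate rises when k goes from 1 to 2.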


## 10 — Instructor Key

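**Expected improvements:**
- A larger, more diverse document set gives each query a better chance of finding a relevant match
- Longer chunks reduce context fragmentation, though very long chunks can dilute the topic each embedding represents
- A larger embedding model generally improves semantic recall, at the cost of slower encoding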
