# Introduction to FAISS

**FAISS** (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors.  
It is particularly powerful for:

- **Large-scale vector search**: Can handle millions or even billions of vectors  
- **Multiple index types**: Ranges from brute force to approximate methods  
- **GPU acceleration**: Provides significant speedup for large datasets  
- **Memory efficiency**: Offers various compression techniques to save memory  


In [16]:
# Import essential libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import re
from typing import List, Dict, Tuple
import pickle
import time

### Chunking Your Resume by section


In [24]:
import re
from langchain_community.document_loaders import PyPDFLoader
from langchain.docstore.document import Document

# Load the PDF
pdf_path = "../data/resume.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load()

# Combine text
cv_text = "".join(page.page_content for page in pages)

# Define section-based splitting using regex
sections = re.split(r"(?=Summary|Skills|Experience|Education and Training|Projects|Certifications)", cv_text)

# Create Documents for each section
cv_documents = [
    Document(page_content=section.strip(), metadata={"source": "cv", "section": section.split("\n",1)[0]})
    for section in sections if section.strip()
]

# Print results
print(f"Number of sections: {len(cv_documents)}")
for i, doc in enumerate(cv_documents):
    print(f"\nSection {i+1} ({doc.metadata['section']}):\n{doc.page_content[:300]}...")


Number of sections: 8

Section 1 (CLEAVESTONE ADUNGO ):
CLEAVESTONE ADUNGO 
cleavestone94@gmail.com   |   +254703457427   |   Nairobi Kenya       
GitHub: Link  |   Linkedln: Link | Portfolio: Link...

Section 2 (Summary      ):
Summary      
Data Scientist transitioning from finance with 2+ years of hands-on ML, NLP, and data analytics experience. 
Skilled in Python, SQL, statistical modeling, LLM fine-tuning, RAG systems, and MLOps pipelines. Diverse 
industry background (education, logistics, finance, data annotation) wi...

Section 3 (Skills      ):
Skills      
Programming & Analytics: Python (pandas, NumPy, scikit-learn, matplotlib, seaborn), SQL, Excel 
Machine Learning & AI: Classification, Regression, NLP, Deep Learning, Feature Engineering, Model Evaluation 
LLMs & Conversational AI: LangChain, RAG, Vector Databases (Pinecone, LanceDB), P...

Section 4 (Skills: Analytical Thinking, Problem-Solving, Communication, Self-Directed Learning ):
Skills: Analytical Thinking, Problem-

In [30]:
cv_documents[7].page_content

'Certifications      \n• Certificate in MLOPS from Udemy \n• Certificate in LLM Engineering from Udemy \n• Certificate in Foundational Generative AI from Ineuron'

### Generating Embeddings with Sentence-BERT



In [32]:
# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract plain text from Documents
texts = [doc.page_content for doc in cv_documents]

# Generate embeddings
embeddings = model.encode(texts, convert_to_numpy=True, show_progress_bar=True)

print(f"Embeddings shape: {embeddings.shape}")  # (n_chunks, 384)
print(f"Sample embedding (first 5 dims): {embeddings[0][:5]}")

Batches: 100%|██████████| 1/1 [00:03<00:00,  3.24s/it]

Embeddings shape: (8, 384)
Sample embedding (first 5 dims): [-0.05490848  0.05856662 -0.03859244  0.03439187 -0.01320006]





### Building and Using the FAISS 
IndexFAISS indexes vectors for O(log n) searches vs. O(n) brute-force.



#### Simple Exact Search (IndexFlatL2)



In [33]:
import faiss

d = embeddings.shape[1]  # 384
index = faiss.IndexFlatL2(d)  # Exact L2 search

# Add embeddings
index.add(embeddings.astype('float32'))  # FAISS needs float32
print(f"Total vectors indexed: {index.ntotal}")

Total vectors indexed: 8


#### Approximate Search (IndexIVFFlat)



In [34]:
# Parameters: nlist = sqrt(n_vectors) for balance
nlist = int(np.sqrt(len(texts)))  # e.g., 2 for 6 chunks
quantizer = faiss.IndexFlatL2(d)  # Coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist)

# Train on embeddings (needs > nlist vectors; for small data, use all)
index.train(embeddings.astype('float32'))

# Add vectors
index.add(embeddings.astype('float32'))
print(f"Indexed {index.ntotal} vectors with {nlist} clusters")

Indexed 8 vectors with 2 clusters


In [41]:
import faiss
import numpy as np

# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings)  # Ensures unit length for cosine

# Set number of clusters (nlist)
# For small datasets (<10 chunks), use nlist=1 or min(4, len(texts))
nlist = max(1, min(4, int(np.sqrt(len(texts)))))  # Avoid too few clusters
print(f"Using {nlist} clusters for {len(texts)} chunks")

# Create quantizer with cosine similarity (Inner Product)
d = embeddings.shape[1]  # 384 dimensions
quantizer = faiss.IndexFlatIP(d)  # Use IP for cosine similarity

# Initialize IndexIVFFlat
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)

# Train on embeddings (needs > nlist vectors; for small data, use all)
index.train(embeddings.astype('float32'))

# Add vectors to index
index.add(embeddings.astype('float32'))
print(f"Indexed {index.ntotal} vectors with {nlist} clusters")

# Optional: Set nprobe for search accuracy (higher = more accurate, slower)
index.nprobe = min(nlist, 2)  # For small datasets, search 1-2 clusters

Using 2 clusters for 8 chunks
Indexed 8 vectors with 2 clusters


#### Performing Similarity Search



In [45]:
# Sample query
query= "experience in finance"
query_embedding = model.encode([query]).astype('float32')

# Search: k=3 nearest neighbors, returns distances + indices
k = 3
distances, indices = index.search(query_embedding, k)

# Results
for i in range(k):
    chunk_idx = indices[0][i]
    similarity = 1 - (distances[0][i] / 2)  # Convert L2 to approx cosine (if normalized)
    print(f"Rank {i+1}: Score {similarity:.3f} | Chunk: {chunks[chunk_idx]}")

Rank 1: Score 0.730 | Chunk: page_content='Summary      
Data Scientist transitioning from finance with 2+ years of hands-on ML, NLP, and data analytics experience.' metadata={'source': 'cv'}
Rank 2: Score 0.837 | Chunk: page_content='industry background (education, logistics, finance, data annotation) with a proven track record of turning raw' metadata={'source': 'cv'}
Rank 3: Score 0.840 | Chunk: page_content='Skilled in Python, SQL, statistical modeling, LLM fine-tuning, RAG systems, and MLOps pipelines. Diverse' metadata={'source': 'cv'}
