# Embeddings, RAG and Vector Database

## Objective:
This notebook demonstrates the concepts of text embeddings, how a vector database stores and retrieves them, and how these are combined in a Retrieval Augmented Generation (RAG) pipeline to enhance an LLM's ability to answer questions based on specific knowledge.

### **1. Setup and Installation**

In [2]:
# install the necessary libraries.

In [5]:
!pip install -q sentence-transformers faiss-cpu transformers numpy nltk scikit-learn


In [6]:
# Import required modules

In [7]:
import numpy as np
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import faiss
import random # For simulating document chunks

In [None]:
# Download NLTK data for sentence tokenization

In [19]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### **2. Text Embeddings**

**Concept:**
Text embeddings are numerical representations (vectors) of text that capture semantic meaning. Texts with similar meanings will have vectors that are "closer" to each other in a multi-dimensional space.

We'll use a pre-trained SentenceTransformer model to generate these embeddings.
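To make "closer" concrete, here is a minimal sketch of cosine similarity, the metric used later in this notebook. The three-dimensional vectors and the helper name `cosine_sim` are invented for illustration only; real embeddings from the model have hundreds of dimensions.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made up, not produced by any model).
paris_capital = [0.9, 0.1, 0.0]
eiffel_tower = [0.8, 0.3, 0.1]
elephants = [0.0, 0.2, 0.9]

sim_close = cosine_sim(paris_capital, eiffel_tower)
sim_far = cosine_sim(paris_capital, elephants)
print(f"related pair: {sim_close:.3f}, unrelated pair: {sim_far:.3f}")
```

Vectors that point in nearly the same direction score close to 1, and vectors that are nearly orthogonal score close to 0, regardless of their lengths.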

In [16]:
# Load a pre-trained sentence embedding model
# 'all-MiniLM-L6-v2' is a good balance of performance and speed for demonstration
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample texts
documents = [
    "The capital of France is Paris.",
    "The Eiffel Tower is located in Paris.",
    "Elephants are large mammals native to Africa and Asia.",
    "Mount Everest is the highest mountain in the world.",
    "What is the largest ocean on Earth? The Pacific Ocean."
]

print("Original Documents:")
for i, doc in enumerate(documents):
    print(f"Doc {i+1}: {doc}")

# Generate embeddings for the documents
document_embeddings = embedding_model.encode(documents)

print(f"\nShape of embeddings: {document_embeddings.shape}") # (num_documents, embedding_dimension)
print(f"Sample embedding for Doc 1 (first 5 dimensions): {document_embeddings[0][:5]}")


# Texts with similar meanings should produce embeddings that are closer together.
# Compare one query against a related document and another query against an unrelated document.
query_related = "What is the capital of France?"
query_unrelated = "Tell me about cars."

embedding_query_related = embedding_model.encode([query_related])
embedding_query_unrelated = embedding_model.encode([query_unrelated])

# Calculate cosine similarity (a common metric for vector similarity)
from sklearn.metrics.pairwise import cosine_similarity

# Similarity between query and related document
sim_related = cosine_similarity(embedding_query_related, [document_embeddings[0]])
print(f"\nCosine similarity between '{query_related}' and '{documents[0]}': {sim_related[0][0]:.4f}")

# Similarity between query and unrelated document
sim_unrelated = cosine_similarity(embedding_query_unrelated, [document_embeddings[2]])
print(f"Cosine similarity between '{query_unrelated}' and '{documents[2]}': {sim_unrelated[0][0]:.4f}")

Original Documents:
Doc 1: The capital of France is Paris.
Doc 2: The Eiffel Tower is located in Paris.
Doc 3: Elephants are large mammals native to Africa and Asia.
Doc 4: Mount Everest is the highest mountain in the world.
Doc 5: What is the largest ocean on Earth? The Pacific Ocean.

Shape of embeddings: (5, 384)
Sample embedding for Doc 1 (first 5 dimensions): [ 0.10325699  0.03042011  0.02909581 -0.0373229   0.07867623]

Cosine similarity between 'What is the capital of France?' and 'The capital of France is Paris.': 0.8790
Cosine similarity between 'Tell me about cars.' and 'Elephants are large mammals native to Africa and Asia.': 0.1869


### **3. Vector Database (using FAISS)**

**Concept:**

A vector database is a specialized database designed to store, manage, and query high-dimensional vectors (embeddings). It allows for efficient "similarity search," finding vectors that are most similar to a given query vector.

We'll use FAISS (Facebook AI Similarity Search) as an in-memory vector database for simplicity.
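Conceptually, a flat (exhaustive) index scores the query against every stored vector and keeps the best matches. The sketch below illustrates that idea in plain Python with inner-product scoring; it is not how FAISS is implemented internally, and the function name `flat_ip_search` and the toy vectors are assumptions made for this example.

```python
def flat_ip_search(index_vectors, query, k):
    """Return the k highest inner-product scores and the positions of their vectors."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), position)
        for position, vec in enumerate(index_vectors)
    ]
    # Higher inner product means more similar, so sort in descending order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [score for score, _ in top], [position for _, position in top]

# Hypothetical unit-length 2-D vectors standing in for stored embeddings.
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
top_scores, top_positions = flat_ip_search(toy_vectors, [0.0, 1.0], k=2)
print(top_positions, top_scores)
```

Exhaustive search is exact but scales linearly with the number of stored vectors, which is why larger collections usually rely on approximate indexes instead.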

In [17]:
# Get the dimension of the embeddings
embedding_dimension = document_embeddings.shape[1]

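# Create a flat (exhaustive-search) FAISS index that scores by inner product.
# Inner product equals cosine similarity only when the vectors are L2-normalized.
# all-MiniLM-L6-v2 is expected to return unit-length vectors; for other models,
# pass normalize_embeddings=True to encode() or call faiss.normalize_L2() first.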
index = faiss.IndexFlatIP(embedding_dimension)

# Add the document embeddings to the index
index.add(document_embeddings)

print(f"\nFAISS index created with {index.ntotal} documents.")

# Now, let's perform a similarity search.
# We'll use the embedding of a query to find the most similar document.
query = "What is the highest peak in the world?"
query_embedding = embedding_model.encode([query])

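# Perform k-nearest-neighbor search
k = 2  # Retrieve the top 2 most similar documents
# With IndexFlatIP the returned "distances" are inner-product scores: higher means more similar.
distances, indices = index.search(query_embedding, k)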

print(f"\nQuery: '{query}'")
print(f"Top {k} similar documents:")
for i in range(k):
    doc_index = indices[0][i]
    distance = distances[0][i]
    print(f"  Rank {i+1}: Doc {doc_index+1} (Distance: {distance:.4f}) - '{documents[doc_index]}'")


FAISS index created with 5 documents.

Query: 'What is the highest peak in the world?'
Top 2 similar documents:
  Rank 1: Doc 4 (Distance: 0.6725) - 'Mount Everest is the highest mountain in the world.'
  Rank 2: Doc 5 (Distance: 0.4135) - 'What is the largest ocean on Earth? The Pacific Ocean.'


### **4. Retrieval Augmented Generation (RAG)**

**Concept:**
RAG combines the strengths of retrieval systems (like vector databases) with generative large language models (LLMs). Instead of the LLM generating an answer solely from its pre-trained knowledge, RAG first retrieves relevant information from a knowledge base (using embeddings and a vector database) and then uses this retrieved information as context for the LLM to generate a more accurate and grounded answer.

**Steps in RAG:**



1. **Load Data & Chunking:** Load your knowledge base (e.g., a PDF, website text) and split it into smaller, manageable chunks.
2. **Embed Chunks:** Generate embeddings for each chunk using an embedding model.
3. **Store in Vector DB:** Store these chunk embeddings (and ideally the original chunks) in a vector database.
4. **Query & Retrieve:** When a user asks a question, embed the query and use the vector database to find the most semantically similar chunks.
5. **Augment Prompt:** Take the retrieved chunks and construct a prompt for the LLM that includes the original query and the retrieved context.
6. **Generate Answer:** The LLM generates an answer based on the augmented prompt.
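Step 5 is the piece most specific to RAG, so here is a minimal sketch of how retrieved chunks might be folded into a prompt for a generative LLM. The function name `build_rag_prompt`, the instruction wording, and the sample chunks are assumptions for illustration, not a prescribed template.

```python
def build_rag_prompt(query, retrieved_chunks):
    """Combine retrieved chunks and the user query into a single prompt string."""
    # One bullet per chunk keeps the boundaries between sources visible to the model.
    context = "\n".join(f"- {chunk.strip()}" for chunk in retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical retrieved chunks, as they might come back from step 4.
example_chunks = [
    "Deforestation is a major threat to the Amazon.",
    " Sustainable practices are crucial for its preservation. ",
]
example_prompt = build_rag_prompt("What threatens the Amazon?", example_chunks)
print(example_prompt)
```

Telling the model to rely only on the supplied context, and to admit when the context is insufficient, is one common way to discourage answers that are not grounded in the knowledge base.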

In [26]:
# --- 4.1. Simulate a larger knowledge base (document splitting) ---
long_text = """
The Amazon rainforest is the largest rainforest in the world, covering much of northwestern South America.
It is home to an incredible diversity of wildlife, including jaguars, sloths, and countless species of birds and insects.
The Amazon River flows through the rainforest and is the second-longest river in the world by length, and the largest by discharge volume.
Deforestation is a major threat to the Amazon, impacting climate change and biodiversity.
Sustainable practices are crucial for its preservation.
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
It is often used for web development, data analysis, artificial intelligence, and scientific computing.
"""

# Simple chunking by sentences
# For real-world applications, more sophisticated chunking strategies are needed
# (e.g., recursive character text splitter from LangChain)

# Newer NLTK releases load sentence-tokenizer data from 'punkt_tab' rather than 'punkt',
# so download it if it is missing (nltk is already imported above).
try:
    # Check whether the resource is already available locally
    nltk.data.find('tokenizers/punkt_tab/')
except LookupError:
    # Not found, so download it
    nltk.download('punkt_tab')


text_chunks = sent_tokenize(long_text)

print("\n--- 4.1. Document Chunks ---")
for i, chunk in enumerate(text_chunks):
    print(f"Chunk {i+1}: {chunk}")

# --- 4.2. Embed Chunks & 4.3. Store in Vector DB ---
chunk_embeddings = embedding_model.encode(text_chunks)

# Create a new FAISS index for these chunks
chunk_embedding_dimension = chunk_embeddings.shape[1]
chunk_index = faiss.IndexFlatIP(chunk_embedding_dimension)
chunk_index.add(chunk_embeddings)
print("\nChunk embeddings:")
print(chunk_embeddings)

print(f"\nFAISS index for chunks created with {chunk_index.ntotal} chunks.")

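# --- 4.4. Query & Retrieve ---
# Load a small, CPU-friendly question-answering model for demonstration.
# Note: distilbert-base-cased-distilled-squad is an extractive QA model, not a generative LLM.
# It selects an answer span from the supplied context instead of writing new text.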
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased-distilled-squad")

def rag_pipeline(query, top_k_chunks=3):
    """
    Implements a simplified RAG pipeline.
    """
    print(f"\n--- RAG Pipeline for Query: '{query}' ---")

    # 1. Embed the query
    query_embedding = embedding_model.encode([query])

    # 2. Retrieve relevant chunks from the vector database
    distances, indices = chunk_index.search(query_embedding, top_k_chunks)
    retrieved_chunk_indices = indices[0]

    retrieved_documents = [text_chunks[i] for i in retrieved_chunk_indices]
    print("\nRetrieved Documents (from Vector Database):")
    for i, doc in enumerate(retrieved_documents):
        print(f"  {i+1}. {doc}")

    # 3. Augment the prompt for the LLM
    context = " ".join(retrieved_documents)
    # The LLM will use this context to answer the question.
    # For a QA model, the 'context' parameter is directly used.
    # For a generative LLM, you'd format a prompt like:
    # prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"

    # 4. Generate answer using the LLM with augmented context
    try:
        answer = qa_pipeline(question=query, context=context)
        print("\nGenerated Answer (from LLM):")
        print(answer['answer'])
        print(f"Confidence Score: {answer['score']:.4f}")
    except Exception as e:
        print(f"\nError during LLM generation: {e}")
        print("This might happen if the context is too long or the model struggles.")
        print("Consider a larger model or a more robust RAG framework like LangChain/LlamaIndex.")

# --- Test the RAG pipeline ---
rag_pipeline("What is the Amazon rainforest known for?")
rag_pipeline("What programming paradigms does Python support?")
rag_pipeline("What is the main threat to the Amazon?")
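# The answer to this query is NOT in the knowledge base. The extractive model still
# returns some span from the retrieved context, so check the confidence score.
rag_pipeline("What is the capital of Italy?")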


--- 4.1. Document Chunks ---
Chunk 1: 
The Amazon rainforest is the largest rainforest in the world, covering much of northwestern South America.
Chunk 2: It is home to an incredible diversity of wildlife, including jaguars, sloths, and countless species of birds and insects.
Chunk 3: The Amazon River flows through the rainforest and is the second-longest river in the world by length, and the largest by discharge volume.
Chunk 4: Deforestation is a major threat to the Amazon, impacting climate change and biodiversity.
Chunk 5: Sustainable practices are crucial for its preservation.
Chunk 6: Python is a high-level, general-purpose programming language.
Chunk 7: Its design philosophy emphasizes code readability with the use of significant indentation.
Chunk 8: Python is dynamically typed and garbage-collected.
Chunk 9: It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
Chunk 10: It is often used for web

Device set to use cpu



--- RAG Pipeline for Query: 'What is the Amazon rainforest known for?' ---

Retrieved Documents (from Vector Database):
  1. 
The Amazon rainforest is the largest rainforest in the world, covering much of northwestern South America.
  2. Deforestation is a major threat to the Amazon, impacting climate change and biodiversity.
  3. The Amazon River flows through the rainforest and is the second-longest river in the world by length, and the largest by discharge volume.

Generated Answer (from LLM):
the largest rainforest in the world
Confidence Score: 0.3748

--- RAG Pipeline for Query: 'What programming paradigms does Python support?' ---

Retrieved Documents (from Vector Database):
  1. Python is a high-level, general-purpose programming language.
  2. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
  3. Python is dynamically typed and garbage-collected.

Generated Answer (from LLM):
structured (pa