# Building a FAISS-Based Vector Store: A Journey Through Data Processing and Visualization

In this notebook, you'll learn how to transform raw PDF documents into a searchable vector store using FAISS. We'll go on a journey where we:

1. **Read and extract text from PDF files.**
2. **Split the text into manageable chunks.**
3. **Display tokenization outputs from different tokenizers.**
4. **Generate embeddings from the text using a SentenceTransformer.**
5. **Store the embeddings in a FAISS index.**
6. **Project the embeddings into 2D space using UMAP for visualization.**
7. **Visualize the entire process on a scatter plot.**
8. **Inject your data into a prompt for a large language model.**

In [None]:
import os
import tqdm
import glob
from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings  # For generating embeddings for text chunks
import faiss
import pickle
import matplotlib.pyplot as plt
import umap.umap_ as umap
import numpy as np
from dotenv import load_dotenv
from groq import Groq


## 1. Reading Data from PDFs

First, we load PDF files from a directory, extract their text content, and combine it into one large text string.

In [None]:
### load the pdf from the path
glob_path = "data/*.pdf"
text = ""
for pdf_path in tqdm.tqdm(glob.glob(glob_path)):
    print(pdf_path)
    with open(pdf_path, "rb") as file:
        reader = PdfReader(file)
        # Extract each page once; skip pages without extractable text
        page_texts = [page.extract_text() for page in reader.pages]
        # The trailing space keeps consecutive PDFs from running together
        text += " ".join(t for t in page_texts if t) + " "

text[:50]

## 2. Splitting the Text into Chunks

Large texts can be difficult to work with. We use a text splitter, in this case [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/),  to break the full text into smaller, overlapping chunks. This helps preserve context when we later embed the text.
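
To make the role of the overlap concrete, here is a minimal sketch of fixed-width splitting in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at separators such as paragraphs, line breaks, and spaces); the function name `split_with_overlap` and the toy input are assumptions made for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive splitter: each chunk starts `chunk_size - chunk_overlap` characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.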

In [None]:
# Create a splitter: 2000 characters per chunk with an overlap of 200 characters
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
# Split the extracted text into manageable chunks
chunks = splitter.split_text(text)

In [None]:
print(f"Total chunks: {len(chunks)}")
print("Preview of the first chunk:", chunks[0][:200])

## 3. Tokenizing the Text with Different Tokenizers

Before embedding, it's insightful to see how different tokenizers break up our text. Here, we use the tokenizer from the SentenceTransformer model (see [SentenceTransformersTokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html#sentencetransformerstokentextsplitter)).
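
Why do two tokenizers split the same text differently? Mostly because they carry different vocabularies. The sketch below is a toy greedy longest-match subword tokenizer with two made-up vocabularies (`vocab_a`, `vocab_b`). It is a simplification for intuition only: real tokenizers add continuation markers and special tokens and use learned vocabularies, and this is not the algorithm of either model used in this notebook.

```python
def greedy_subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` by repeatedly taking the longest prefix found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return ["[UNK]"]  # no known piece fits at this position
        tokens.append(word[start:end])
        start = end
    return tokens


# Two hypothetical vocabularies, invented for this example
vocab_a = {"bachelor", "arbeit", "en"}
vocab_b = {"bach", "elor", "arbeit", "e", "n"}

word = "bachelorarbeiten"
print(greedy_subword_tokenize(word, vocab_a))  # ['bachelor', 'arbeit', 'en']
print(greedy_subword_tokenize(word, vocab_b))  # ['bach', 'elor', 'arbeit', 'e', 'n']
```

The same word costs three tokens under one vocabulary and five under the other, which is why a fixed `tokens_per_chunk` budget covers different amounts of text depending on the model.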

In [None]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=128, model_name="paraphrase-multilingual-MiniLM-L12-v2")

In [None]:
token_split_texts = []
for chunk in chunks:  # avoid reusing the name `text`, which holds the full document
    token_split_texts += token_splitter.split_text(chunk)

print(f"\nTotal chunks: {len(token_split_texts)}")
print(token_split_texts[0])

In [None]:
model_name = "paraphrase-multilingual-MiniLM-L12-v2"
model = SentenceTransformer(model_name)
tokenized_chunks = []
for i, text in enumerate(token_split_texts[:10]):
    # Tokenize each chunk
    encoded_input = model.tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors='pt')
    # Convert token IDs back to tokens
    tokens = model.tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0].tolist())
    tokenized_chunks.append(tokens)
    print(f"Chunk {i}: {tokens}")

In [None]:
model_name = "Sahajtomar/German-semantic"
model = SentenceTransformer(model_name)
tokenized_chunks = []
for i, text in enumerate(token_split_texts[:10]):
    # Tokenize each chunk
    encoded_input = model.tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors='pt')
    # Convert token IDs back to tokens
    tokens = model.tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0].tolist())
    tokenized_chunks.append(tokens)
    print(f"Chunk {i}: {tokens}")

## 4. Generating Embeddings for Each Chunk

Now we convert each text chunk into a numerical embedding that captures its semantic meaning. These embeddings will be used for similarity search.
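
As a rough intuition for "captures its semantic meaning": texts with related content should end up with vectors that point in similar directions. The sketch below uses three hand-made 3-dimensional vectors (pure assumptions; real sentence embeddings have hundreds of dimensions and come from the model) and compares them with cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for illustration; the labels are hypothetical chunk topics
toy_embeddings = {
    "thesis length": [0.9, 0.1, 0.0],
    "page count": [0.8, 0.2, 0.1],
    "citation style": [0.1, 0.9, 0.2],
}

query_vec = toy_embeddings["thesis length"]
for label, vec in toy_embeddings.items():
    print(f"{label:15s} {cosine_similarity(query_vec, vec):.3f}")
```

The two length-related vectors score close to 1, while the citation vector scores much lower. Similarity search over real embeddings relies on the same effect.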

In [None]:
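# `model` is the SentenceTransformer loaded in the previous cell ("Sahajtomar/German-semantic").
# Queries must later be embedded with this same model so that they live in the same vector space.
chunk_embeddings = model.encode(token_split_texts, convert_to_numpy=True)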

## 5. Building a FAISS Vector Store

FAISS is a powerful library for efficient similarity search. Here, we build an index from our embeddings. Remember, FAISS only stores the numerical vectors, so we must keep the mapping from index positions to the original text chunks separately.
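
Conceptually, a flat L2 index compares the query against every stored vector and returns the positions of the closest ones, together with their squared L2 distances. The plain-Python sketch below mimics that behaviour on toy data to show why a separate text list is needed. The helper names (`squared_l2`, `flat_search`) and the toy vectors are assumptions for illustration; this is not FAISS code.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Exhaustive search: return (distances, ids) of the k nearest stored vectors."""
    scored = sorted((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    top = scored[:k]
    return [dist for dist, _ in top], [idx for _, idx in top]


# The "index" holds only vectors; the texts live in a parallel list with the same order
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_texts = ["chunk A", "chunk B", "chunk C", "chunk D"]

dists, ids = flat_search(toy_vectors, query=[0.9, 0.1], k=2)
hits = [toy_texts[i] for i in ids]  # map row positions back to text
print(ids, hits)
# [1, 0] ['chunk B', 'chunk A']
```

If the text list is reordered or saved from a different split than the vectors, the positions returned by the search point at the wrong text, so both must always be stored in the same order.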

In [None]:
d = chunk_embeddings.shape[1]
print(d)

In [None]:
index = faiss.IndexFlatL2(d)
index.add(chunk_embeddings)
print("Number of embeddings in FAISS index:", index.ntotal)

In [None]:
os.makedirs("faiss", exist_ok=True)

faiss.write_index(index, "faiss/faiss_index.index")
# Save the token-level chunks: row i of the index was built from token_split_texts[i]
with open("faiss/chunks_mapping.pkl", "wb") as f:
    pickle.dump(token_split_texts, f)

In [None]:
index_2 = faiss.read_index("faiss/faiss_index.index")
with open("faiss/chunks_mapping.pkl", "rb") as f:
    token_split_texts_2 = pickle.load(f)
print(len(token_split_texts_2))
print(len(token_split_texts))

## 6. Projecting Embeddings with UMAP

To visualize high-dimensional embeddings, we use UMAP to project them into 2D space. You can project both the entire dataset and individual query embeddings.

In [None]:
# Fit UMAP on the full dataset embeddings
umap_transform = umap.UMAP(random_state=0, transform_seed=0).fit(chunk_embeddings)

def project_embeddings(embeddings, umap_transform):
    """
    Project a set of embeddings using a pre-fitted UMAP transform.
    """
    umap_embeddings = np.empty((len(embeddings), 2))
    for i, embedding in enumerate(tqdm.tqdm(embeddings, desc="Projecting Embeddings")):
        umap_embeddings[i] = umap_transform.transform([embedding])
    return umap_embeddings


In [None]:
# Project the entire dataset embeddings
projected_dataset_embeddings = project_embeddings(chunk_embeddings, umap_transform)
print("Projected dataset embeddings shape:", projected_dataset_embeddings.shape)

## 7. Querying the Vector Store and Projecting Results

We now define a retrieval function that takes a text query, embeds it, and searches our FAISS index for the most similar chunks. We then project the embeddings of these results with UMAP.

In [None]:
def retrieve(query, k=5):
    """
    Retrieve the top k similar text chunks and their embeddings for a given query.
    """
    query_embedding = model.encode([query], convert_to_numpy=True)
    distances, indices = index.search(query_embedding, k)
    retrieved_texts = [token_split_texts[i] for i in indices[0]]
    retrieved_embeddings = np.array([chunk_embeddings[i] for i in indices[0]])
    return retrieved_texts, retrieved_embeddings, distances[0]

In [None]:
query = "KI während der Bachelorarbeit"
results, result_embeddings, distances = retrieve(query, k=3)
print("Retrieved document preview:")
print(results[0][:300])

In [None]:
# Project the result embeddings
projected_result_embeddings = project_embeddings(result_embeddings, umap_transform)

# Also embed and project the original query for visualization
query_embedding = model.encode([query], convert_to_numpy=True)
project_original_query = project_embeddings(query_embedding, umap_transform)

## 8. Visualizing the Results

Finally, we create a scatter plot to visualize the entire dataset, the retrieved results, and the original query in 2D space.

In [None]:

def shorten_text(text, max_length=15):
    """Shortens text to max_length and adds an ellipsis if shortened."""
    return (text[:max_length] + '...') if len(text) > max_length else text

plt.figure()

# Scatter plots
plt.scatter(projected_dataset_embeddings[:, 0], projected_dataset_embeddings[:, 1],
            s=10, color='gray', label='Dataset')
plt.scatter(projected_result_embeddings[:, 0], projected_result_embeddings[:, 1],
            s=100, facecolors='none', edgecolors='g', label='Results')
plt.scatter(project_original_query[:, 0], project_original_query[:, 1],
            s=150, marker='X', color='r', label='Original Query')

# If results is a list of texts, iterate directly
for i, text in enumerate(results):
    if i < len(projected_result_embeddings):
        plt.annotate(shorten_text(text),
                     (projected_result_embeddings[i, 0], projected_result_embeddings[i, 1]),
                     fontsize=8)

# Annotate the original query point
original_query_text = query  # Label the marker with the query that was actually embedded

plt.annotate(shorten_text(original_query_text),
             (project_original_query[0, 0], project_original_query[0, 1]),
             fontsize=8)

plt.gca().set_aspect('equal', 'datalim')
plt.title('Visualization')
plt.legend()
plt.show()


---

# 📝 Task: Semantic Retrieval-Augmented Question Answering Using Groq LLM

## Objective
Implement a question-answering system that:
1. Retrieves the most semantically relevant text passages to a user query.
2. Constructs a natural language prompt based on the retrieved content.
3. Uses a large language model (LLM) hosted by Groq to generate an answer.

---

## Task Breakdown

### 1. Embedding-Based Semantic Retrieval
- Use the `SentenceTransformer` model `"Sahajtomar/German-semantic"` to encode a user query into a dense vector embedding.
- Perform a nearest-neighbor search in a prebuilt FAISS index to retrieve the top-**k** similar text chunks. You can **use the FAISS index built above**.
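
One detail worth handling when turning search output into text: FAISS returns `-1` as the id for any of the **k** slots it cannot fill (for example when the index holds fewer than **k** vectors). The sketch below shows one way to map a single row of search output back to text while skipping such slots. The helper name `lookup_chunks` and the demo values are hypothetical, and the placeholder distance used for the empty slot is an assumption.

```python
def lookup_chunks(indices_row, distances_row, texts):
    """Map one row of search output to (text, distance) pairs, skipping empty slots."""
    pairs = []
    for idx, dist in zip(indices_row, distances_row):
        if idx < 0:  # -1 marks a slot for which no neighbour was found
            continue
        pairs.append((texts[idx], dist))
    return pairs


# Hypothetical output for k=3 on an index that holds only two vectors
demo_texts = ["chunk A", "chunk B"]
demo_indices = [1, 0, -1]
demo_distances = [0.02, 0.82, float("inf")]  # placeholder value for the empty slot

retrieved = lookup_chunks(demo_indices, demo_distances, demo_texts)
print(retrieved)
# [('chunk B', 0.02), ('chunk A', 0.82)]
```

Without the check, Python would silently interpret `-1` as "the last element" and return an unrelated chunk.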


### 2. LLM Prompt Construction and Query Answering
- Build the prompt and query the model:
  - Concatenate the retrieved text chunks into a context block.
  - Build a **prompt** asking the LLM to answer the question using that context.
  - Send the prompt to the **Groq LLM API** (`llama-3.3-70b-versatile`) and return the response.

### 3. User Query Execution
- An example query (`"What is the most important factor in diagnosing asthma?"`) is used to demonstrate the pipeline.
- The final answer from the LLM is printed.


## Tools & Models Used
- **SentenceTransformers** (`Sahajtomar/German-semantic`) for embedding generation.
- **FAISS** for efficient vector similarity search.
- **Groq LLM API** (`llama-3.3-70b-versatile`) for generating the final response.


In [None]:
load_dotenv()
# Access the API key using the variable name defined in the .env file
groq_api_key = os.getenv("GROQ_API_KEY")

In [1]:
import os
import sys
import numpy as np
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv
from groq import Groq

# This is a minimal RAG implementation using in-memory vector storage instead of FAISS
# It's designed to work with minimal dependencies and be less prone to kernel crashes

# Load environment variables
load_dotenv()
groq_api_key = os.getenv("GROQ_API_KEY")

# Print Python version and environment info for debugging
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")

# Define a minimal test dataset if we can't use existing data
sample_chunks = [
    "KI kann während der Bachelorarbeit als nützliches Tool eingesetzt werden, um Literaturrecherche zu unterstützen.",
    "Die Formatierung der Bachelorarbeit folgt den Vorgaben der Universität und sollte konsistent sein.",
    "Eine Bachelorarbeit umfasst typischerweise zwischen 40 und 60 Seiten, abhängig von den spezifischen Anforderungen.",
    "Beim Schreiben der Bachelorarbeit ist auf wissenschaftliche Sprache und korrekte Zitation zu achten.",
    "Die Gliederung einer Bachelorarbeit umfasst typischerweise: Einleitung, Hauptteil, Fazit und Literaturverzeichnis."
]

class SimpleRAG:
    """
    A simplified RAG implementation that doesn't rely on FAISS or other complex libraries.
    Uses in-memory vector storage and cosine similarity for retrieval.
    """
    
    def __init__(self, model_name="Sahajtomar/German-semantic"):
        """Initialize the RAG system with a SentenceTransformer model."""
        print(f"Loading model: {model_name}")
        self.model = SentenceTransformer(model_name)
        self.chunks = []
        self.embeddings = None
        self.groq_client = None
        if groq_api_key:
            self.groq_client = Groq(api_key=groq_api_key)
            print("Groq client initialized")
        else:
            print("Warning: No GROQ_API_KEY found. LLM functionality will be unavailable.")
    
    def load_chunks(self, text_chunks):
        """Load text chunks and compute embeddings."""
        print(f"Loading {len(text_chunks)} chunks...")
        self.chunks = text_chunks
        # Process chunks in small batches to avoid memory issues
        batch_size = 32
        all_embeddings = []
        
        for i in range(0, len(self.chunks), batch_size):
            batch = self.chunks[i:i+batch_size]
            print(f"Processing batch {i//batch_size + 1}/{(len(self.chunks)-1)//batch_size + 1}...")
            batch_embeddings = self.model.encode(batch, convert_to_numpy=True)
            all_embeddings.append(batch_embeddings)
        
        self.embeddings = np.vstack(all_embeddings)
        print(f"Generated embeddings with shape {self.embeddings.shape}")
    
    def cosine_similarity(self, a, b):
        """Compute cosine similarity between two vectors."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def retrieve(self, query, k=3):
        """Retrieve the most similar chunks to the query."""
        # Encode the query
        query_embedding = self.model.encode([query], convert_to_numpy=True)[0]
        
        # Compute similarities
        similarities = []
        for i, embedding in enumerate(self.embeddings):
            similarity = self.cosine_similarity(query_embedding, embedding)
            similarities.append((i, similarity))
        
        # Sort by similarity (descending)
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        # Return top-k chunks
        results = []
        for i, similarity in similarities[:k]:
            results.append((self.chunks[i], similarity))
        
        return results
    
    def generate_answer(self, query, retrieved_chunks):
        """Generate an answer using the Groq LLM API."""
        if not self.groq_client:
            return "LLM functionality unavailable. Please set GROQ_API_KEY."
        
        # Prepare context from retrieved chunks
        context = "\n\n---\n\n".join([chunk for chunk, _ in retrieved_chunks])
        
        # Build the prompt
        prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:
"""
        
        # Query the LLM
        try:
            completion = self.groq_client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that answers questions based on the provided context."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.2,
                max_tokens=500
            )
            return completion.choices[0].message.content
        except Exception as e:
            print(f"Error calling Groq API: {e}")
            return f"Error generating answer: {str(e)}"
    
    def answer_question(self, query, k=3):
        """End-to-end question answering."""
        print(f"Processing query: {query}")
        
        # Retrieve similar chunks
        retrieved_chunks = self.retrieve(query, k)
        
        print(f"Retrieved {len(retrieved_chunks)} chunks:")
        for i, (chunk, similarity) in enumerate(retrieved_chunks):
            preview = chunk[:50].replace('\n', ' ') + '...'
            print(f"Chunk {i+1} (similarity: {similarity:.4f}): {preview}")
        
        # Generate answer
        print("Generating answer...")
        answer = self.generate_answer(query, retrieved_chunks)
        
        return answer

# Main execution
try:
    print("Initializing SimpleRAG system...")
    rag = SimpleRAG()
    
    # Try to use chunks from the notebook, fall back to sample data if not available
    text_chunks = None
    try:
        # Try to access variables from the notebook
        if 'token_split_texts' in globals():
            text_chunks = token_split_texts
            print(f"Using token_split_texts with {len(text_chunks)} chunks")
        elif 'chunks' in globals():
            text_chunks = chunks
            print(f"Using chunks with {len(text_chunks)} chunks")
    except NameError:
        pass
    
    # If no chunks found, use the sample data
    if not text_chunks:
        print("Using sample text chunks")
        text_chunks = sample_chunks
    
    # Limit the number of chunks for stability
    max_chunks = 1000
    if len(text_chunks) > max_chunks:
        print(f"Too many chunks ({len(text_chunks)}). Using first {max_chunks} chunks.")
        text_chunks = text_chunks[:max_chunks]
    
    # Load the chunks and create embeddings
    rag.load_chunks(text_chunks)
    
    # Example queries
    print("\n" + "="*50)
    print("Running example queries:")
    
    # Example query from the task
    example_query = "What is the most important factor in diagnosing asthma?"
    print(f"\nQuery: {example_query}")
    answer = rag.answer_question(example_query)
    print("\nAnswer:")
    print("-" * 50)
    print(answer)
    print("-" * 50)
    
    # German query
    german_query = "KI während der Bachelorarbeit"
    print(f"\nQuery: {german_query}")
    german_answer = rag.answer_question(german_query)
    print("\nAnswer:")
    print("-" * 50)
    print(german_answer)
    print("-" * 50)
    
except Exception as e:
    print(f"An error occurred: {e}")
    import traceback
    traceback.print_exc()



Python version: 3.12.1 (main, Mar 17 2025, 17:13:06) [GCC 9.4.0]
Python executable: /home/codespace/.python/current/bin/python3
Initializing SimpleRAG system...
Loading model: Sahajtomar/German-semantic
Groq client initialized
Using sample text chunks
Loading 5 chunks...
Processing batch 1/1...
Generated embeddings with shape (5, 1024)

Running example queries:

Query: What is the most important factor in diagnosing asthma?
Processing query: What is the most important factor in diagnosing asthma?
Retrieved 3 chunks:
Chunk 1 (similarity: 0.1472): KI kann während der Bachelorarbeit als nützliches ...
Chunk 2 (similarity: 0.1456): Beim Schreiben der Bachelorarbeit ist auf wissensc...
Chunk 3 (similarity: 0.1408): Die Gliederung einer Bachelorarbeit umfasst typisc...
Generating answer...

Answer:
--------------------------------------------------
Es gibt keine Informationen im Kontext, die auf die Diagnose von Asthma eingehen. Der Kontext behandelt stattdessen die Erstellung einer Bachelor