<a href="https://colab.research.google.com/github/axcel342/NLP_Labs/blob/main/Lab11_LLM_Embeddings_and_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Lab 11: Large Language Model Embeddings and Retrieval-Augmented Generation**

This lab focuses on creating a complete Retrieval-Augmented Generation (RAG) system using modern NLP techniques, embedding models, and vector databases. The system allows users to search through text documents semantically and receive AI-generated answers based on relevant retrieved contexts.

## **Objective**
The main objective of this module is to demonstrate how to:
1. Process and chunk text documents for efficient retrieval
2. Generate high-quality embeddings using pre-trained models
3. Store and query vector embeddings in a vector database ([Qdrant](https://qdrant.tech/))
4. Implement a complete RAG pipeline by connecting retrieval with an LLM (provided by [OpenRouter](https://openrouter.ai/))
5. Create a user-friendly interface for interacting with the RAG system

In [None]:
# Import the display helpers that set_css relies on
from IPython.display import display, HTML

# Define a function that sets custom CSS for the notebook output
# This style makes <pre> blocks wrap text instead of overflowing on a single line
def set_css():
    display(HTML('''
    <style>
      pre {
          white-space: pre-wrap;  /* Enable word-wrapping in code/output blocks */
      }
    </style>
    '''))

# Register the CSS-setting function to run automatically before each code cell runs
# This ensures the styling stays applied throughout the notebook session
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
import markdown
from IPython.display import display, HTML

def render_markdown(md_text):
    # Convert Markdown to HTML
    html = markdown.markdown(md_text)
    # Display the HTML
    display(HTML(html))

## Setup and Dependencies

In [None]:
# Install the necessary libraries
!pip install sentence_transformers openai
!pip install plotly
!pip install matplotlib
!pip install -Uqqq rich openai gradio
!pip install qdrant_client
!pip install transformers

# Import basic libraries
import numpy as np
import os, random
from pathlib import Path
from getpass import getpass
from rich.markdown import Markdown
import torch
import sys
import csv
csv.field_size_limit(sys.maxsize)



131072

## OpenAI-Compatible LLM Client Configuration

In [None]:
# Retrieve API key securely from Colab user data
from google.colab import userdata
OPEN_ROUTER_API_KEY = userdata.get('OPEN_ROUTER_API_KEY')

# Initialize OpenRouter client (OpenAI-compatible API)
from openai import OpenAI
open_router_client = OpenAI(
  base_url="https://openrouter.ai/api/v1",
  api_key=OPEN_ROUTER_API_KEY,
)


## Data Loading and Preparation

In [None]:
# Load data from Google Drive
import pandas as pd
url = 'https://drive.google.com/uc?id=1gl7WAkJr6Nyke7YckzXxdL-iM4UjhLGX'
df = pd.read_csv(url)


In [None]:
df.shape

(3327, 8)

In [None]:
df = df[:5]  # Using only 5 rows for demonstration
df = df.dropna(axis=1)  # Drop columns with null values

# Prepare data with metadata for traceability
data = []
for row_num, row in df.iterrows():
    content = " ".join([f"{col}: {row[col]}" for col in df.columns])
    data.append({
        "page_content": content,
        "metadata": {
            "source": row["title"],
        }
    })
df

| Unnamed: 0 | _id | title | author | publish_date | full_text | url | categories |
|---|---|---|---|---|---|---|---|
| 0 | 63927168098523fa9d5ef047 | Are The Democrats Screwed In The Senate After ... | Nate Silver | 1670516000000.0 | \n\n\n\n2028 Election\nAre The Democrats Screw... | https://fivethirtyeight.com/features/democrats... | ['news'] |
| 1 | 652b21ec754471a9795c291a | Beyond the Narrative | Sam Freedman | 1697272000000.0 | We are told before every party conference that... | https://samf.substack.com/p/beyond-the-narrative | ['news'] |
| 2 | 6529cfaff59ed61396db838c | Ballot Measures: A Preview | Walter Olson | 1697227000000.0 | Walter Olson Voters will go to the polls soon... | https://www.cato.org/blog/ballot-measures-preview | ['news'] |
| 3 | 6529d030f59ed61396db8398 | Trump’s Only Real Worldview Is Pettiness | David A. Graham | 1697214000000.0 | Let no one say that Donald Trump has lost his ... | tag:theatlantic.com,2023:50-675637 | ['news'] |
| 4 | 6529d072f59ed61396db839e | The COVID Bailout of State and Local Governmen... | Eric Boehm | 1697212000000.0 | Two years after Congress authorized a h... | https://reason.com/2023/10/13/the-covid-bailou... | ['news'] |


## Document Chunking

In [None]:
def simple_recursive_split(docs, chunk_size=1000, chunk_overlap=200, separators=None):
    # Extract the main text and its associated metadata
    text = docs["page_content"]
    metadata = docs["metadata"]

    # Set default separators if none are provided
    if separators is None:
        separators = ["\n\n", "\n", " ", ".", ",", "\uff0c", "\u3001", "\uff0e", "\u3002"]

    # Helper function to recursively split text based on the separators
    def split_with_separators(t):
        # If the text is already within the chunk size, return it directly
        if len(t) <= chunk_size:
            return [t]

        # Attempt splitting by each separator in order
        for sep in separators:
            if sep and sep in t:
                parts = t.split(sep)
                chunks = []
                current = ""

                # Build chunks without exceeding the maximum chunk size
                for part in parts:
                    part += sep  # Reattach the separator to preserve structure
                    if len(current + part) <= chunk_size:
                        current += part
                    else:
                        if current:
                            chunks.append(current.strip())
                        current = part  # Start a new chunk

                # Add the final leftover chunk
                if current:
                    chunks.append(current.strip())

                # Recursively re-split chunks that are still too large
                result = []
                for chunk in chunks:
                    if len(chunk) > chunk_size:
                        result.extend(split_with_separators(chunk))
                    else:
                        result.append(chunk)
                return result

        # Fallback: if no separators are effective, split the text by fixed character lengths
        return [t[i:i + chunk_size] for i in range(0, len(t), chunk_size)]

    # Split the original text
    splits = split_with_separators(text)

    # Add overlap between chunks to preserve context between adjacent segments
    overlapped = []
    for i, chunk in enumerate(splits):
        if i == 0:
            # First chunk, no overlap
            overlapped.append({
                "page_content": chunk,
                "metadata": metadata
            })
        else:
            # For subsequent chunks, add overlap from the end of the previous chunk
            overlap = splits[i - 1][-chunk_overlap:]
            overlapped.append({
                "page_content": f"{overlap} {chunk}",
                "metadata": metadata
            })

    return overlapped

# Apply the chunking function to each document in the dataset
# This flattens all chunks into a single list
texts = [chunk for doc in data for chunk in simple_recursive_split(doc, 2048, 50)]

In [None]:
# `texts` holds chunks rather than whole documents; texts[1] is the second chunk
print(f'You now have {len(texts)} document(s) in your data')
print(f'There are {len(texts[1]["page_content"])} characters in your document')

You now have 22 document(s) in your data
There are 1704 characters in your document
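
The overlap step at the end of `simple_recursive_split` can be looked at in isolation. The sketch below is a simplified stand-in, not the notebook's function: the helper name `add_overlap` is hypothetical and it works on plain strings rather than the `page_content`/`metadata` dictionaries. It also guards against `chunk_overlap=0`, because in Python `s[-0:]` returns the whole string rather than an empty one.

```python
def add_overlap(splits, chunk_overlap):
    """Prefix every chunk after the first with the tail of the previous chunk."""
    overlapped = []
    for i, chunk in enumerate(splits):
        if i == 0:
            # The first chunk has no predecessor, so it is kept unchanged
            overlapped.append(chunk)
        else:
            # s[-0:] would return the whole string, so treat zero overlap separately
            tail = splits[i - 1][-chunk_overlap:] if chunk_overlap > 0 else ""
            overlapped.append(f"{tail} {chunk}" if tail else chunk)
    return overlapped

chunks = add_overlap(["The quick brown fox", "jumps over the dog"], chunk_overlap=3)
print(chunks)  # ['The quick brown fox', 'fox jumps over the dog']
```

Because the overlap is added after splitting, a chunk can end up to `chunk_overlap + 1` characters longer than `chunk_size`.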


## Text Embedding Generation

In [None]:
# Load embedding model from HuggingFace
from transformers import AutoTokenizer, AutoModel
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Function to generate embeddings from text
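def get_text_embeddings(text):
    # Tokenize the text; truncation cuts inputs longer than the model's limit
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Inference only, so skip gradient tracking to save memory
    with torch.no_grad():
        outputs = text_model(**inputs)
    # Mean-pool the token embeddings into a single vector for the whole text
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].detach().numpy()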

# Example usage of the function
text = "This is a test sentence."

# Get the embedding vector for the input text
embeddings = get_text_embeddings(text)

# Optionally, get the length of the embedding (number of dimensions)
text_embeddings_size = len(embeddings)

# Print the first 5 values of the embedding vector for inspection
print(embeddings[:5])




[ 1.2799689   0.40158418 -3.5162663  -0.3981327   1.5919126 ]
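
`get_text_embeddings` averages `last_hidden_state` over every token position. For a single string nothing is padded, so this is fine. If a list of strings of different lengths were passed with `padding=True`, the padding positions would also be averaged in. A common remedy is to weight the average by the attention mask. The sketch below shows the idea in plain Python on toy numbers; `masked_mean_pool` is a hypothetical helper, and a real version would apply the same arithmetic to the tensors returned by the tokenizer and model.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for j, value in enumerate(vector):
                totals[j] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / kept for total in totals]

# Two real tokens followed by one padding position
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

A plain mean over all three positions would give roughly `[1.33, 2.0]` instead, which is how padding can distort the embedding.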


In [None]:
# Generate embeddings for all chunks
text_embeded = [get_text_embeddings(document["page_content"]) for document in texts]

## Qdrant Vector Database

In [None]:
# Import necessary modules from the Qdrant client library
# Qdrant is a vector database that allows you to store and search high-dimensional vector embeddings efficiently
from qdrant_client import QdrantClient, models

# Create a new Qdrant client instance using in-memory storage
# ":memory:" means the data will be stored temporarily in RAM (not saved to disk)
# Useful for testing or prototyping — everything is wiped when the program ends
client = QdrantClient(":memory:")

# Display the size (number of dimensions) of the text embeddings we generated earlier
# This is important because Qdrant needs to know the exact size of each vector to create a collection
text_embeddings_size

768

In [None]:
try:
    # Define the name of the collection we want to manage in Qdrant.
    # A collection in Qdrant is similar to a table in traditional databases —
    # it stores a group of vectors and their associated metadata.
    collection_name = "demo_collection"

    # Check whether the collection already exists in Qdrant.
    # This avoids attempting to create a collection with a name that's already taken.
    if client.collection_exists(collection_name):
        # If the collection already exists, delete it to ensure we're starting fresh.
        # This is useful when we want to reset the state (e.g., during development or re-indexing).
        client.delete_collection(collection_name=collection_name)

        # Output a message confirming the collection was deleted successfully.
        print(f"Collection '{collection_name}' deleted successfully.")

    # Proceed to create a new collection regardless of whether it was previously deleted or not.
    # This ensures we always end up with a clean, newly-created collection.
    client.create_collection(
        collection_name=collection_name,  # The name of the new collection being created

        # Configure how vectors will be stored in this collection.
        # This includes the dimensionality (size) and the distance metric used for similarity.
        vectors_config=models.VectorParams(
            size=text_embeddings_size,       # The number of dimensions in each vector.
                                             # Must match the output size of your embedding model.
            distance=models.Distance.COSINE  # The distance function used for comparing vectors.
                                             # COSINE is commonly used for text embeddings as it measures angular similarity.
        ),
    )

    # Print a confirmation that the collection was created successfully.
    print(f"Collection '{collection_name}' created successfully.")

except Exception as e:
    # If any error occurs during the process (e.g., connection issues, invalid parameters),
    # it will be caught here and the error message will be printed.
    print(f"An error occurred while setting up the collection: {e}")


Collection 'demo_collection' created successfully.
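
For reference, the cosine similarity selected by `models.Distance.COSINE` compares the direction of a query vector $q$ and a stored vector $d$ while ignoring their lengths. With $n = 768$ dimensions in this notebook:

$$
\cos(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
= \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2}\,\sqrt{\sum_{i=1}^{n} d_i^2}}
$$

As a small worked case, $q = (1, 0)$ and $d = (1, 1)$ give $\cos(q, d) = 1 / \sqrt{2} \approx 0.707$. Values lie in $[-1, 1]$, and a higher value means the two texts point in a more similar direction in embedding space.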


In [None]:

# Import the `uuid4` function to generate unique IDs for each vector
# These IDs help identify and retrieve individual points later
from uuid import uuid4

# Import NumPy to handle vector data formats (embeddings are stored as NumPy arrays)
import numpy as np

# Upload all our text embeddings to the "demo_collection" in Qdrant
client.upload_points(
    collection_name="demo_collection",  # Target collection where we want to store our vectors

    # Create a list of PointStruct objects, one for each text chunk
    points=[
        models.PointStruct(
            id=str(uuid4()),  # Generate a unique ID for each point (as a string)

            # The embedding is already a NumPy array; np.array() just guarantees the type
            vector=np.array(text_embeded[idx]),

            # Attach payload: additional information stored with each vector
            # This allows us to retrieve the original text and its metadata later
            payload={
                "metadata": doc["metadata"],         # Source title of the article
                "content": doc["page_content"]       # The full text chunk
            }
        )
        for idx, doc in enumerate(texts)  # Loop through all texts and match them to their embeddings
    ]
)
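
`uuid4()` produces a new random ID on every run. That is harmless here because the collection is recreated first, but uploading the same chunks twice into an existing collection would store them twice. One possible alternative, sketched below with the standard library only, is to derive the ID from the chunk's source and position so that a repeated upload reuses the same ID. The helper name `stable_point_id` and the `source#index` key format are assumptions for illustration, not part of the notebook.

```python
from uuid import NAMESPACE_URL, uuid5

def stable_point_id(source, chunk_index):
    """Derive a repeatable UUID string from a chunk's source and position."""
    return str(uuid5(NAMESPACE_URL, f"{source}#{chunk_index}"))

first = stable_point_id("Beyond the Narrative", 0)
again = stable_point_id("Beyond the Narrative", 0)
other = stable_point_id("Beyond the Narrative", 1)
print(first == again, first == other)  # True False
```

Such an ID could be passed as `id=` in `models.PointStruct` in place of `str(uuid4())`, provided the chunk index is counted per source document.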


In [None]:
# Import the Google Drive integration module for Google Colab
from google.colab import drive

# Mount your Google Drive to the Colab environment
# This allows you to read from and write to files stored in your Drive
# After running this, a link will appear asking for permission to access your Drive
# Once authorized, your Drive will be available under '/content/drive'
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
# Import necessary modules from the Qdrant client
# QdrantClient allows us to interact with a Qdrant vector database (e.g., create collections, insert/query vectors)
# models provides configuration classes for vector parameters, point operations, etc.
from qdrant_client import QdrantClient, models

# Import Python's built-in modules
import os        # For file system operations like checking and creating directories
import shutil    # For high-level file operations like removing entire directories

# Define the local directory path where Qdrant should store its data
# This is where vector collections and other database files will be saved
# For Google Colab users, this may point to a mounted Google Drive location
qdrant_data_dir = '/content/drive/MyDrive/Semantic_Search/qdrant_data'

# ----------------------------------------
# Forcefully remove the directory if it exists
# ----------------------------------------

try:
    # Remove the entire directory and its contents recursively
    # This is useful when you want to reset the Qdrant database from scratch
    shutil.rmtree(qdrant_data_dir)
    print(f"Directory '{qdrant_data_dir}' removed successfully.")

except FileNotFoundError:
    # If the directory does not exist, no need to worry — just proceed
    print(f"Directory '{qdrant_data_dir}' not found. Proceeding to create it.")

except OSError as e:
    # If there's a permission issue or the directory is in use, catch and report it
    print(f"Error removing directory '{qdrant_data_dir}': {e}")

# ----------------------------------------
# Recreate the directory
# ----------------------------------------

# Create the directory (and any missing parent directories) if it doesn't exist
# After deletion, this ensures a clean, fresh directory is in place for Qdrant
os.makedirs(qdrant_data_dir, exist_ok=True)
print(f"Directory '{qdrant_data_dir}' created.")

# ----------------------------------------
# Initialize Qdrant Client
# ----------------------------------------

# Initialize the Qdrant client, telling it to use the newly created directory for local storage
# This setup will persist vector data across sessions (e.g., in Google Drive)
client = QdrantClient(path=qdrant_data_dir)
print("Qdrant client initialized with fresh storage directory.")


Directory '/content/drive/MyDrive/Semantic_Search/qdrant_data' removed successfully.
Directory '/content/drive/MyDrive/Semantic_Search/qdrant_data' created.
Qdrant client initialized with fresh storage directory.


In [None]:
try:
    # Define the name of the collection we want to manage in Qdrant.
    # A collection in Qdrant is similar to a table in traditional databases —
    # it stores a group of vectors and their associated metadata.
    collection_name = "demo_collection"

    # Check whether the collection already exists in Qdrant.
    # This avoids attempting to create a collection with a name that's already taken.
    if client.collection_exists(collection_name):
        # If the collection already exists, delete it to ensure we're starting fresh.
        # This is useful when we want to reset the state (e.g., during development or re-indexing).
        client.delete_collection(collection_name=collection_name)

        # Output a message confirming the collection was deleted successfully.
        print(f"Collection '{collection_name}' deleted successfully.")

    # Proceed to create a new collection regardless of whether it was previously deleted or not.
    # This ensures we always end up with a clean, newly-created collection.
    client.create_collection(
        collection_name=collection_name,  # The name of the new collection being created

        # Configure how vectors will be stored in this collection.
        # This includes the dimensionality (size) and the distance metric used for similarity.
        vectors_config=models.VectorParams(
            size=text_embeddings_size,       # The number of dimensions in each vector.
                                             # Must match the output size of your embedding model.
            distance=models.Distance.COSINE  # The distance function used for comparing vectors.
                                             # COSINE is commonly used for text embeddings as it measures angular similarity.
        ),
    )

    # Print a confirmation that the collection was created successfully.
    print(f"Collection '{collection_name}' created successfully.")

except Exception as e:
    # If any error occurs during the process (e.g., connection issues, invalid parameters),
    # it will be caught here and the error message will be printed.
    print(f"An error occurred while setting up the collection: {e}")


Collection 'demo_collection' created successfully.


In [None]:


# Import the `uuid4` function to generate unique IDs for each vector
# These IDs help identify and retrieve individual points later
from uuid import uuid4

# Import NumPy to handle vector data formats (embeddings are stored as NumPy arrays)
import numpy as np

# Upload all our text embeddings to the "demo_collection" in Qdrant
client.upload_points(
    collection_name="demo_collection",  # Target collection where we want to store our vectors

    # Create a list of PointStruct objects, one for each text chunk
    points=[
        models.PointStruct(
            id=str(uuid4()),  # Generate a unique ID for each point (as a string)

            # The embedding is already a NumPy array; np.array() just guarantees the type
            vector=np.array(text_embeded[idx]),

            # Attach payload: additional information stored with each vector
            # This allows us to retrieve the original text and its metadata later
            payload={
                "metadata": doc["metadata"],         # Source title of the article
                "content": doc["page_content"]       # The full text chunk
            }
        )
        for idx, doc in enumerate(texts)  # Loop through all texts and match them to their embeddings
    ]
)


## Run Query

In [None]:
# Create a query vector by embedding a sample search string
# This string represents what you're looking for — in this case, something about "Democrats challenges in Senate"
# The result is a vector in the same format as the ones stored in the Qdrant collection
query = get_text_embeddings('Democrats challenges in Senate')

# Perform a similarity search in Qdrant using the query vector
# This finds the most relevant text chunks (based on vector similarity)
text_hits = client.query_points(
    collection_name="demo_collection",  # The name of the collection where vectors were stored
    query=query,                         # The query vector — what we want to find similar results to
    limit=3,                             # Limit the number of results to 3 most relevant chunks
).points                                 # Extract only the list of matching points (id, score, and payload; vectors are omitted by default)


In [None]:
text_hits

[ScoredPoint(id='03618717-adda-4ebc-a161-248ceaf38040', version=0, score=0.6620313464035947, payload={'metadata': {'source': 'Are The Democrats Screwed In The Senate After 2024?'}, 'content': "airly competitive race against Mike Lee this year. Even with an additional senator going into 2023, the 2024 map is still so bad for Democrats that keeping the Senate for years to come will be a fairly tough order. The party’s prospects might rest more upon limiting the damage in 2024 so that it has a chance to regain the Senate in 2026 or 2028. But a bad 2024 could make it very difficult for Democrats to regain the Senate before 2030 or 2032.\nThat bleak picture may shape the next few years of political maneuvering. When Vox’s Dylan Matthews suggested on Twitter that liberal Justices Sonia Sotomayor (age 68) and Elena Kagan (age 62) should retire while Democrats have their Senate majority and be replaced by younger justices, it didn’t go over well. But it’s a perfectly rational suggestion if Dem
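
Conceptually, a similarity search scores every stored vector against the query and keeps the best few. The sketch below does this by brute force on toy two-dimensional vectors. It only illustrates the idea: the names `cosine` and `top_k` are hypothetical, and Qdrant's own implementation is different and can use an index instead of scanning every point.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, points, k=3):
    """Return the k (score, payload) pairs most similar to the query."""
    scored = [(cosine(query, vector), payload) for vector, payload in points]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

points = [
    ([1.0, 0.0], "chunk A"),
    ([0.0, 1.0], "chunk B"),
    ([1.0, 1.0], "chunk C"),
]
hits = top_k([1.0, 0.1], points, k=2)
print([payload for _, payload in hits])  # ['chunk A', 'chunk C']
```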

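## We can start from here now!


**NOTE:** Disconnect and delete the runtime before continuing. The cells below run in a fresh session and reuse the Qdrant collection that was persisted to Google Drive, so the documents do not need to be chunked and embedded again.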

In [None]:
# Import the display helpers that set_css relies on
from IPython.display import display, HTML

# Define a function that sets custom CSS for the notebook output
# This style makes <pre> blocks wrap text instead of overflowing on a single line
def set_css():
    display(HTML('''
    <style>
      pre {
          white-space: pre-wrap;  /* Enable word-wrapping in code/output blocks */
      }
    </style>
    '''))

# Register the CSS-setting function to run automatically before each code cell runs
# This ensures the styling stays applied throughout the notebook session
get_ipython().events.register('pre_run_cell', set_css)

In [None]:
import markdown
from IPython.display import display, HTML

def render_markdown(md_text):
    # Convert Markdown to HTML
    html = markdown.markdown(md_text)
    # Display the HTML
    display(HTML(html))

In [None]:
!pip install qdrant_client
!pip install openai
!pip install gradio



In [None]:
# Import the Google Drive integration module for Google Colab
from google.colab import drive

# Mount your Google Drive to the Colab environment
# This allows you to read from and write to files stored in your Drive
# After running this, a link will appear asking for permission to access your Drive
# Once authorized, your Drive will be available under '/content/drive'
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import the main Qdrant client class to connect and interact with a Qdrant vector database
from qdrant_client import QdrantClient

# Import specific classes used to configure how vectors are stored and compared in a collection
from qdrant_client.http.models import Distance, VectorParams

# Import the `userdata` module from Google Colab.
from google.colab import userdata

# Import the OpenAI client, used here to send requests to OpenRouter, which gives access to various LLMs (large language models)
from openai import OpenAI

# Import HTML and display tools from IPython
# These allow you to inject custom HTML or CSS into the notebook
from IPython.display import HTML, display

# Import necessary classes from the Hugging Face Transformers library
# AutoTokenizer handles breaking text into tokens
# AutoModel loads the pre-trained model used to compute vector embeddings
from transformers import AutoTokenizer, AutoModel

# Import the top-level openai package
import openai

**1. Define the Qdrant client first to connect to the vector database.**

In [None]:
# Attempt to initialize the Qdrant client
try:
    # Initialize the Qdrant client and set its storage path
    # This stores and retrieves the vector database in the specified directory on disk
    client = QdrantClient(path='/content/drive/MyDrive/Semantic_Search/qdrant_data')

except RuntimeError as e:
    # Catch the specific error that occurs when the Qdrant client is already running with this path
    if "already accessed by another instance" in str(e):
        print("Qdrant is already initialized with this path in the current session.")
        print("You don't need to create the client again — reuse the existing one.")
    else:
        # Re-raise the error if it's something else
        raise


**2. Define the OpenRouter client to serve as the language model (LLM) for the pipeline.**


In [None]:

# Retrieve the API key stored as a Colab secret named 'OPEN_ROUTER_API_KEY'
OPEN_ROUTER_API_KEY = userdata.get('OPEN_ROUTER_API_KEY')

# Initialize the OpenAI-compatible client, but point it to OpenRouter's API instead of OpenAI's
# OpenRouter is a gateway to multiple LLMs like GPT, Claude, Mistral, and others, through one unified API
open_router_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # Set the API endpoint to OpenRouter (not OpenAI)
    api_key=OPEN_ROUTER_API_KEY               # Use your OpenRouter API key for authentication
)


**3. Import the same embedding model used during vector database creation to ensure consistency.**

In [None]:


# Load a pre-trained tokenizer and model designed for generating text embeddings
# "nomic-ai/nomic-embed-text-v1.5" is a model specifically trained to turn text into high-quality vector representations
# trust_remote_code=True allows use of any custom logic included with the model
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Define a function to convert input text into a fixed-size vector (embedding)
def get_text_embeddings(text):
    # Tokenize the input text and return it as PyTorch tensors
    # padding=True: pad shorter sequences to ensure consistent length
    # truncation=True: cut off text that is too long for the model
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # Pass the tokenized input through the model to obtain output embeddings
    outputs = text_model(**inputs)

    # outputs.last_hidden_state contains embeddings for each token
    # We take the mean across all tokens to get a single vector for the entire text
    embeddings = outputs.last_hidden_state.mean(dim=1)

    # Convert the result to a NumPy array and remove it from the computation graph
    return embeddings[0].detach().numpy()



**4. Test the retrieval functions to ensure they're returning relevant results.**

In [None]:
query = """What is CASE Uni?"""

In [None]:
# Convert the query text into an embedding vector
query_em = get_text_embeddings(query)

In [None]:
# Time a single similarity search against the Qdrant collection
import time

# perf_counter() is a monotonic, high-resolution clock intended for measuring short durations
start_time = time.perf_counter()

text_hits = client.query_points(
    collection_name="demo_collection",
    query=query_em,
    limit=5,
).points

elapsed_time = time.perf_counter() - start_time

print(f"Time taken to execute query: {elapsed_time} seconds")


Time taken to execute query: 0.002841472625732422 seconds
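
A single measurement of a call that takes a few milliseconds can be noisy. As a rough sketch, the helper below repeats a call several times and reports the median; `time_call` and the stand-in workload are hypothetical and only illustrate the pattern.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Run fn() `repeats` times and return (median_seconds, last_result)."""
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result


# Stand-in workload so the example runs anywhere. In the notebook, the lambda
# could instead wrap the Qdrant query shown above.
median_seconds, value = time_call(lambda: sum(range(10_000)))
print(f"Median time: {median_seconds * 1000:.3f} ms")
```

The median is used here rather than the mean so that one slow outlier (for example, a cold first call) does not dominate the figure.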


In [None]:
# Run the same search without the timing code
text_hits = client.query_points(
    collection_name="demo_collection",
    query=query_em,
    limit=5,
).points



In [None]:
text_hits

[ScoredPoint(id='433f9b88-ed19-4336-b50a-7bce8c348149', version=0, score=0.44652832199226583, payload={'metadata': {'source': 'Are The Democrats Screwed In The Senate After 2024?'}, 'content': '_id: 63927168098523fa9d5ef047 title: Are The Democrats Screwed In The Senate After 2024? author: Nate Silver publish_date: 1670515535000.0 full_text: \n\n\n\n2028 Election\nAre The Democrats Screwed In The Senate After 2024?\nNo, but the party faces an uphill battle.\n\n\n\nBy Nate Silver\n\n\nDec. 8, 2022, at 11:05 AM\t\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSamuel Corum / Getty Images'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id='6dcee216-1282-49ac-9832-2eab2270d7f0', version=0, score=0.4317332941470319, payload={'metadata': {'source': 'The COVID Bailout of State and Local Governments Was Unnecessary'}, 'content': '_id: 6529d072f59ed61396db839e title: The COVID Bailout of State and Local Governments Was Unnecessary author: Eric Boehm publish_date: 1697211917000.0 full_text:  

In [None]:
# Extract the original text content from each result returned by the similarity search
# `text_hits` is a list of points returned by Qdrant's query
# Each point has a `payload`, which contains metadata and the original text chunk

contents = [point.payload['content'] for point in text_hits]

In [None]:
contents

["airly competitive race against Mike Lee this year. Even with an additional senator going into 2023, the 2024 map is still so bad for Democrats that keeping the Senate for years to come will be a fairly tough order. The party’s prospects might rest more upon limiting the damage in 2024 so that it has a chance to regain the Senate in 2026 or 2028. But a bad 2024 could make it very difficult for Democrats to regain the Senate before 2030 or 2032.\nThat bleak picture may shape the next few years of political maneuvering. When Vox’s Dylan Matthews suggested on Twitter that liberal Justices Sonia Sotomayor (age 68) and Elena Kagan (age 62) should retire while Democrats have their Senate majority and be replaced by younger justices, it didn’t go over well. But it’s a perfectly rational suggestion if Democrats don’t feel like gambling with their judicial future. (Consider how consequential Ruth Bader Ginsburg’s decision not to retire has been for liberals.) Democrats have a narrow path to Se

In [None]:
# Extract the metadata for each point returned by the similarity search
# Each result (point) has a payload dictionary that includes metadata stored when uploading the vectors

meta = [point.payload['metadata'] for point in text_hits]

In [None]:
meta

[{'source': 'The COVID Bailout of State and Local Governments Was Unnecessary'},
 {'source': 'Ballot Measures: A Preview'},
 {'source': 'Are The Democrats Screwed In The Senate After 2024?'},
 {'source': 'Beyond the Narrative'},
 {'source': 'Ballot Measures: A Preview'}]
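
The metadata list above contains the same source more than once, because several chunks of one article can rank among the top hits. A minimal sketch of grouping chunks by source follows; the payloads are made-up placeholders shaped like the ones in the collection, and `group_by_source` is a hypothetical helper.

```python
def group_by_source(payloads):
    """Map each source title to the list of chunk texts retrieved from it."""
    grouped = {}
    for payload in payloads:
        source = payload["metadata"]["source"]
        grouped.setdefault(source, []).append(payload["content"])
    return grouped


# Placeholder payloads (not real retrieval results)
sample_payloads = [
    {"metadata": {"source": "Ballot Measures: A Preview"}, "content": "chunk A"},
    {"metadata": {"source": "Beyond the Narrative"}, "content": "chunk B"},
    {"metadata": {"source": "Ballot Measures: A Preview"}, "content": "chunk C"},
]

grouped = group_by_source(sample_payloads)
for source, chunks in grouped.items():
    print(f"{source}: {len(chunks)} chunk(s)")
```

With real results, the same function could be applied to `[point.payload for point in text_hits]`.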

In [None]:
# Loop through each text chunk in the `contents` list
# These are the top-matching results returned by the Qdrant similarity search
for chunk_text in contents:
    # Print the actual text content
    print(chunk_text)

    # Print a separator line to clearly distinguish between different chunks
    print('###########')


airly competitive race against Mike Lee this year. Even with an additional senator going into 2023, the 2024 map is still so bad for Democrats that keeping the Senate for years to come will be a fairly tough order. The party’s prospects might rest more upon limiting the damage in 2024 so that it has a chance to regain the Senate in 2026 or 2028. But a bad 2024 could make it very difficult for Democrats to regain the Senate before 2030 or 2032.
That bleak picture may shape the next few years of political maneuvering. When Vox’s Dylan Matthews suggested on Twitter that liberal Justices Sonia Sotomayor (age 68) and Elena Kagan (age 62) should retire while Democrats have their Senate majority and be replaced by younger justices, it didn’t go over well. But it’s a perfectly rational suggestion if Democrats don’t feel like gambling with their judicial future. (Consider how consequential Ruth Bader Ginsburg’s decision not to retire has been for liberals.) Democrats have a narrow path to Senat

**5. Create a retriever function to extract relevant chunks from the documents.**

In [None]:
# Define a function to search the Qdrant vector database using a natural language query
def query_qdrant(query, qdrant_client, limit=5):
    # Step 1: Convert the query text into an embedding (vector representation)
    # This embedding will be compared with stored vectors in the collection
    query_em = get_text_embeddings(query)

    # Step 2: Query the Qdrant collection using the embedding
    # This finds the top `limit` most similar text chunks based on vector similarity
    text_hits = qdrant_client.query_points(
        collection_name="demo_collection",  # The name of the Qdrant collection to search
        query=query_em,                     # The embedding of the input query
        limit=limit                         # Number of top results to return
    ).points                                 # Extract the matching points (results)

    # Step 3: Prepare the results in a clean format (text + metadata)
    results = []
    for point in text_hits:
        results.append({
            'content': point.payload['content'],    # The original text content
            'metadata': point.payload['metadata']   # Associated metadata (here, the source article title)
        })

    # Return the list of results
    return results
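
The retriever always returns the top `limit` chunks, even when none of them is close to the query. One possible refinement, sketched below, is to keep the similarity score and drop hits under a cutoff. `FakePoint` is a hypothetical stand-in for Qdrant's `ScoredPoint`, and the threshold of 0.5 is an arbitrary illustration that would need tuning for a real collection. Recent versions of `qdrant-client` also appear to accept a `score_threshold` argument on `query_points`; check the client documentation before relying on it.

```python
from dataclasses import dataclass, field


@dataclass
class FakePoint:
    """Hypothetical stand-in for a Qdrant ScoredPoint (score and payload only)."""
    score: float
    payload: dict = field(default_factory=dict)


def filter_by_score(points, min_score):
    """Keep only hits whose similarity score is at least `min_score`."""
    return [
        {
            "content": point.payload["content"],
            "metadata": point.payload["metadata"],
            "score": point.score,
        }
        for point in points
        if point.score >= min_score
    ]


# Placeholder hits (not real retrieval results)
hits = [
    FakePoint(0.71, {"content": "relevant chunk", "metadata": {"source": "Article A"}}),
    FakePoint(0.44, {"content": "weak match", "metadata": {"source": "Article B"}}),
]

kept = filter_by_score(hits, min_score=0.5)
print([item["metadata"]["source"] for item in kept])  # ['Article A']
```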


In [None]:
query_qdrant(query, client)

[{'content': "airly competitive race against Mike Lee this year. Even with an additional senator going into 2023, the 2024 map is still so bad for Democrats that keeping the Senate for years to come will be a fairly tough order. The party’s prospects might rest more upon limiting the damage in 2024 so that it has a chance to regain the Senate in 2026 or 2028. But a bad 2024 could make it very difficult for Democrats to regain the Senate before 2030 or 2032.\nThat bleak picture may shape the next few years of political maneuvering. When Vox’s Dylan Matthews suggested on Twitter that liberal Justices Sonia Sotomayor (age 68) and Elena Kagan (age 62) should retire while Democrats have their Senate majority and be replaced by younger justices, it didn’t go over well. But it’s a perfectly rational suggestion if Democrats don’t feel like gambling with their judicial future. (Consider how consequential Ruth Bader Ginsburg’s decision not to retire has been for liberals.) Democrats have a narro

**6. Now, let's integrate everything by combining our retrieval function with the language model to complete our RAG (Retrieval-Augmented Generation) pipeline.**

In [None]:
# Define a function that uses a language model to generate an answer based on a user's query
def generate_answer(query):
    # Retrieve the context once, so the prompt and the returned sources are guaranteed to match
    sources = query_qdrant(query, client)

    # Build the prompt that will be sent to the LLM
    # The prompt includes:
    # - Instructions to clean and format the answer
    # - The user's original query
    # - The context retrieved from Qdrant (via semantic search)
    prompt = f"""
    Based on the following query from a user, please generate a small answer
    focusing on the original query and the response given. The answer should be in paragraphs.
    Remove the special characters and newline characters, make the output clean and long.
    Please cite source for each part as [1][2].
    Just start with the answer, no need to give any salutations.
    If the query and context do not match, please say that you don't know. Don't try to make up an answer.
    If the query and context do not match, don't describe the context, just say "I DON'T KNOW".

    ###########
    query:
    "{query}"

    ########

    context:
    "{sources}"
    #####

    Return in Markdown format.
    """

    # Send the prompt to the LLM using streaming mode
    # This allows the response to be received in real-time, piece by piece
    stream = open_router_client.chat.completions.create(
        model="qwen/qwen3-8b:free",  # Model to use (can be any model available on OpenRouter)
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
        stream=True,  # Enable streaming so we get partial output as it generates
    )

    # Initialize a variable to hold the full response
    output_text = ""

    # Iterate through the streaming response chunks
    for chunk in stream:
        # Skip chunks that carry no choices or no new text
        if chunk.choices and chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            output_text += content  # Append new content to the full output
            print(content, end="")  # Print each chunk live as it's received

    # Return both the final answer and the context used (for reference or display)
    return output_text, sources
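
The prompt asks the model to cite sources as `[1][2]`, but the context is inserted as the string form of a Python list, so the numbering is left implicit. A hedged alternative is to number the chunks explicitly before building the prompt. `format_context` below is a hypothetical helper, and the sample sources are placeholders.

```python
def format_context(sources, max_chars=1000):
    """Render retrieved chunks as numbered blocks: '[n] title' followed by the text."""
    blocks = []
    for number, source in enumerate(sources, start=1):
        title = source["metadata"].get("source", "unknown")
        # Collapse runs of whitespace and newlines, then cap the chunk length
        text = " ".join(source["content"].split())[:max_chars]
        blocks.append(f"[{number}] {title}\n{text}")
    return "\n\n".join(blocks)


# Placeholder sources shaped like the output of query_qdrant
sample_sources = [
    {"content": "First   chunk\nwith a line break.", "metadata": {"source": "Article A"}},
    {"content": "Second chunk.", "metadata": {"source": "Article B"}},
]

context = format_context(sample_sources)
print(context)
```

The resulting string could be placed in the `context:` part of the prompt in place of the raw list.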


In [None]:
response, sources = generate_answer('Challenges for Democrats in Senate')

The challenges for Democrats in the Senate are primarily tied to the 2024 election map, which is described as particularly unfavorable for the party. Even with an additional senator entering the Senate in 2023, the 2024 map is considered so difficult that retaining Senate control for years will be a significant challenge [1]. The article suggests that Democrats’ prospects may hinge on minimizing losses in 2024 to create a path for regaining the Senate in 2026 or 2028, though a poor performance in 2024 could extend this timeline to 2030 or 2032 [1]. This uncertainty shapes political strategies, with some liberal voices advocating for retiring Justice Sonia Sotomayor and Elena Kagan to replace them with younger justices, a move framed as a pragmatic effort to secure long-term judicial influence [1].  

The 2024 election is viewed as a critical juncture, with Republicans projected to pick up seats if they secure the presidency. The article notes that a GOP win in 2024 could result in a Se

In [None]:
# Render the response as formatted Markdown
render_markdown(response)

In [None]:
sources

[{'content': "airly competitive race against Mike Lee this year. Even with an additional senator going into 2023, the 2024 map is still so bad for Democrats that keeping the Senate for years to come will be a fairly tough order. The party’s prospects might rest more upon limiting the damage in 2024 so that it has a chance to regain the Senate in 2026 or 2028. But a bad 2024 could make it very difficult for Democrats to regain the Senate before 2030 or 2032.\nThat bleak picture may shape the next few years of political maneuvering. When Vox’s Dylan Matthews suggested on Twitter that liberal Justices Sonia Sotomayor (age 68) and Elena Kagan (age 62) should retire while Democrats have their Senate majority and be replaced by younger justices, it didn’t go over well. But it’s a perfectly rational suggestion if Democrats don’t feel like gambling with their judicial future. (Consider how consequential Ruth Bader Ginsburg’s decision not to retire has been for liberals.) Democrats have a narro

## Build a Functional Gradio Interface for the RAG System

In [None]:
import gradio as gr

**1. Redefine our RAG function**

In [None]:
# Import OpenAI-compatible library (used here with OpenRouter)
import openai

# Define a function to generate a streamed answer to a user's query using an LLM
# This version includes error handling and uses Python's `yield` to stream results back as they're generated
def generate_answer(query):
    # Step 1: Try to get relevant context from Qdrant (vector search)
    try:
        sources = query_qdrant(query, client)
    except Exception as e:
        # If something goes wrong (e.g., Qdrant is not running), return a fallback message
        sources = [{"error": f"Error retrieving sources: {str(e)}"}]

    # Step 2: Prepare the prompt for the language model
    # Includes the user's question and the context retrieved from the vector database
    prompt = f"""
    Based on the following query from a user, please generate a small answer
    focusing on the original query and the response given. The answer should be in paragraphs.
    Remove special characters and newline characters; make the output clean and long.
    Please cite source for each part as [1][2]. Just start with the answer, no salutations.
    If you don't know the answer, please say that you don't know. Don't try to make up an answer.

    ###########
    query:
    "{query}"

    ########

    context:
    "{sources}"
    #####

    Return in Markdown format.
    """

    # Step 3: Send the prompt to the OpenRouter-compatible LLM (Qwen model)
    stream = open_router_client.chat.completions.create(
        model="qwen/qwen3-8b:free",  # A free-to-use large language model hosted on OpenRouter
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
        stream=True,  # Enable streaming response
    )

    # Step 4: Stream and yield the generated content chunk by chunk
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content

            # Yield lets us return partial results as they're received (for real-time feedback)
            yield full_response
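
The function yields the accumulated text rather than only the newest piece, because each yielded value replaces what the output component currently shows. The sketch below isolates that pattern with a fake stream; `fake_stream` and `accumulate` are hypothetical names, and the `None` entry mimics a chunk with no new text.

```python
def fake_stream():
    """Hypothetical stand-in for the LLM stream: text pieces, some of them empty (None)."""
    for piece in ["Retrieval", None, "-augmented ", "generation"]:
        yield piece


def accumulate(pieces):
    """Yield the growing text after each non-empty piece."""
    full_text = ""
    for piece in pieces:
        if piece is not None:
            full_text += piece
            yield full_text


snapshots = list(accumulate(fake_stream()))
for snapshot in snapshots:
    print(repr(snapshot))
```

Each snapshot is a complete, displayable answer so far, which is what lets the interface update smoothly while the model is still generating.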


**2. Create a Demo Interface**

In [None]:
# Define example inputs for the UI; users can click these to try predefined queries
examples = [
    ["Democrats in Senate"],
    ["Climate Change Challenges in Europe"],
    ["Philosophy in the world of Minimalism"],
    ["Hate Speech vs Freedom of Speech"],
    ["Articles by Noam Chomsky on US Politics"],
    ["The importance of values and reflection"]
]

# Set up the Gradio interface
# - fn: the function to call when user enters input (must be a generator if using yield)
# - title: the name shown at the top of the web app
# - inputs: defines the input component (in this case, a text box)
# - outputs: defines what kind of output to display (Textbox with 3 lines labeled "Response")
# - examples: preloaded example queries for users to click and run

import gradio as gr

demo = gr.Interface(
    fn=generate_answer,  # The function that will process user input
    title="The Truth Serum",  # Title for the web app
    inputs="text",  # Single text input from the user
    outputs=gr.components.Textbox(lines=3, label="Response"),  # Output display
    examples=examples,  # List of sample queries for users to try
    live=False,  # Optional: set to True if you want real-time feedback as user types
)

# Launch the interface
# - share=True gives you a public link (useful in Colab or for sharing with others)
# - debug=True (not set here) keeps the cell running and shows errors and logs in the notebook
demo.launch(share=True)






**3. Create a Demo Interface with Sources**

In [None]:
import json

def generate_answer(query):
    # First, get the sources
    try:
        sources = query_qdrant(query, client)
    except Exception as e:
        sources = [{"error": f"Error retrieving sources: {str(e)}"}]

    # The sources list is inserted into the prompt as-is (its Python string representation)

    prompt = f"""
    Based on the following query from a user, please generate a small answer
    focusing on the original query and the response given. The answer should be in paragraphs.
    Remove the special characters and newline characters; make the output clean and long.
    Please cite source for each part as [1][2].
    Just start with the answer, no need to give any salutations.

    ###########
    query:
    "{query}"

    ########

    context:
    "{sources}"
    #####

    Return in Markdown format.
    """

    # Send the prompt to the OpenRouter-compatible LLM (Qwen model)
    stream = open_router_client.chat.completions.create(
        model="qwen/qwen3-8b:free",  # A free-to-use large language model hosted on OpenRouter
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
        stream=True,  # Enable streaming response
    )
    # Convert sources to a proper JSON string for the JSON component
    sources_json = json.dumps(sources)

    # For Gradio streaming with multiple outputs
    full_response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            content = chunk.choices[0].delta.content
            full_response += content
            # Return both the accumulated response and the sources as JSON string
            yield full_response, sources_json

    # In case the stream is empty, yield one final time
    if not full_response:
        yield "No response generated", sources_json

examples = [
    ["Democrats in Senate"],
    ["Climate Change Challenges in Europe"],
    ["Philosophy in the world of Minimalism"],
    ["Hate Speech vs Freedom of Speech"],
    ["Articles by Noam Chomsky on US Politics"],
    ["The importance of values and reflection"]
]

demo = gr.Interface(
    fn=generate_answer,
    title="The Truth Serum",
    inputs="text",
    outputs=[
        gr.components.Textbox(lines=8, label="Response"),
        gr.components.JSON(label="Sources")
    ],
    examples=examples
)

demo.queue()
demo.launch(share=True, debug=True)
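
Since the interface shows the answer and the sources side by side, it can help to check which sources the answer actually cites. The sketch below extracts `[n]` markers with a regular expression and maps them to source titles. It assumes, without confirmation from the model, that `[n]` refers to the n-th item in the sources list; `cited_sources` and the sample data are hypothetical.

```python
import re


def cited_sources(answer, sources):
    """Map each citation number found in `answer` to the matching source title."""
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    return {
        n: sources[n - 1]["metadata"]["source"]
        for n in numbers
        if 1 <= n <= len(sources)  # ignore citations that point outside the list
    }


# Placeholder answer and sources (not real model output)
sample_answer = (
    "The 2024 map is difficult [1]. Ballot measures also matter [3][1]. "
    "Unsupported claim [7]."
)
sample_sources = [
    {"content": "...", "metadata": {"source": "Article A"}},
    {"content": "...", "metadata": {"source": "Article B"}},
    {"content": "...", "metadata": {"source": "Article C"}},
]

print(cited_sources(sample_answer, sample_sources))  # {1: 'Article A', 3: 'Article C'}
```

Citations with no matching source, such as `[7]` here, are silently dropped; flagging them instead would be a reasonable variation.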



