# Lab 2: Vector Databases with Qdrant

Welcome to Lab 2! In this session, you'll learn how to use **vector databases** to enable powerful semantic search capabilities for your AI applications.

**What you'll learn:**
- What vector databases are and why they matter for AI
- How to convert text into embeddings (vector representations)
- How to store and search vectors using Qdrant
- How to filter search results using metadata
- Building blocks for Retrieval Augmented Generation (RAG)

**Prerequisites:**
- Completed Lab 1 (basic LLM concepts)
- Basic Python and Pandas knowledge
- A Qdrant Cloud account (free tier available)

**References:**
- [Qdrant Beginner Tutorial](https://qdrant.tech/documentation/beginner-tutorials/search-beginners/)
- [Qdrant Cloud Quickstart](https://qdrant.tech/documentation/cloud-quickstart/)

---


## Understanding Key Concepts

### What is a Vector Database?

A **vector database** is a specialized database designed to store and search **embeddings** (numerical representations of data). Unlike traditional databases that search for exact matches, vector databases find items that are **semantically similar**.

**Example:**
- Traditional search: "diet soda" only finds items with those exact words
- Vector search: "diet soda" also finds "zero calorie beverage", "sugar-free drink", etc.

### What are Embeddings?

**Embeddings** are numerical representations (arrays of numbers) that capture the **meaning** of text. Similar texts have similar embeddings.

```
"I love cats" → [0.12, -0.34, 0.56, ...]  (384 numbers)
"I adore kittens" → [0.11, -0.33, 0.55, ...]  (similar numbers!)
"I hate Mondays" → [-0.45, 0.22, -0.18, ...]  (different numbers)
```

### What is Similarity Search?

**Similarity search** finds items in the database whose embeddings are closest to a query embedding. This is measured using distance metrics like **cosine similarity**.
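To make the idea of "closest" concrete, here is a minimal sketch of cosine similarity in plain Python. It reuses only the first three numbers of each example embedding shown earlier, so the vectors are toy stand-ins rather than real 384-dimensional embeddings, and `cosine_similarity` is a helper name chosen for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors: just the three values visible in the example above.
cats = [0.12, -0.34, 0.56]       # "I love cats"
kittens = [0.11, -0.33, 0.55]    # "I adore kittens"
mondays = [-0.45, 0.22, -0.18]   # "I hate Mondays"

sim_close = cosine_similarity(cats, kittens)
sim_far = cosine_similarity(cats, mondays)

print(f"cats vs kittens: {sim_close:.3f}")
print(f"cats vs mondays: {sim_far:.3f}")
```

The two sentences with similar meaning point in almost the same direction, so their score is close to 1, while the unrelated sentence scores much lower.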

### Why Qdrant?

**Qdrant** is an open-source vector database that:
- Is fast and scalable
- Supports metadata filtering
- Has a generous free tier on Qdrant Cloud (the hosted platform)
- Works great with Python

---


## Step 1: Install Required Packages

First, let's install all the libraries we need:
- `pandas`: Data manipulation
- `qdrant-client`: Python client for Qdrant
- `sentence-transformers`: Create text embeddings
- `langchain-community` & `langchain-google-vertexai`: LangChain integrations


In [None]:
# Install required packages (run this cell first, only needed once)
%pip install pandas qdrant-client sentence-transformers langchain-community langchain-google-vertexai 

print(" All packages installed successfully!")


Note: you may need to restart the kernel to use updated packages.
 All packages installed successfully!





### Important: Restart Kernel After First Installation

If this is your **first time running** the installation cell above, you need to **restart your kernel** before continuing. This ensures all libraries are properly loaded.

**How to restart the kernel:**
- Jupyter: `Kernel` → `Restart`
- VS Code: Click the restart button in the notebook toolbar
- Google Colab: `Runtime` → `Restart runtime`

Use the option that matches the environment your group is working in.

---


## Step 2: Load the Soft Drinks Data

Now let's load our soft drinks dataset. This CSV file contains product information including:
- Product names and descriptions
- Brand information
- Category/shelf classification
- Other metadata

We'll use this data to build a searchable product database!


In [None]:
# Import all necessary libraries
import pandas as pd
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

# Load the soft drinks data from CSV
# Make sure the file path is correct for your environment
df = pd.read_csv('data/softdrinks.csv')

# Display the dataframe to see what we're working with
print(f" Loaded {len(df)} products")
df




 Loaded 10 products


| | product_id | product_name | brand | shelf | category | description |
|---|---|---|---|---|---|---|
| 0 | 108050729 | Sprite Soda Pop Lemon Lime - 1 Liter | Sprite | Lemon-Lime-Citrus-Soda | Beverages | With its cool, satisfying, and refreshing citr... |
| 1 | 108050873 | Sprite Soda Lemon Lime - 24-12 Fl. Oz. | Sprite | Lemon-Lime-Citrus-Soda | Beverages | Designers, artists, musicians, athletes. Taste... |
| 2 | 108051814 | Canada Dry Zero Sugar Ginger Ale Soda Cans - 1... | Canada Dry | Ginger-Ale | Beverages | Sip into your comfort zone with the crisp good... |
| 3 | 108010020 | Diet Coke Soda Pop Cola 6 Count - 16.9 Fl. Oz. | Diet Coke | Cola | Beverages | Enjoy a break with the one and only Diet Coke.... |
| 4 | 108010060 | Diet Coke Soda Pop Cola Caffeine Free - 6-16.9... | Diet Coke | Cola | Beverages | Crisp, cold and reliable, this is your everyda... |
| 5 | 108010223 | Coca-Cola Soda Pop Caffeine Free - 12.12 Fl. Oz. | Coca-Cola | Cola | Beverages | Soda. Pop. Soft drink. Sparkling beverage. \n\... |
| 6 | 108010130 | Pepsi Soda Diet Caffeine Free - 6-16.9 Fl. Oz. | Pepsi | Cola | Beverages | Enjoy Diet Caffeine Free Pepsi Cola with 0 cal... |
| 7 | 108010222 | Coca-Cola Soda Pop Classic - 12-12 Fl. Oz. | Coca-Cola | Cola | Beverages | There's nothing quite like the crisp, refreshi... |
| 8 | 108050651 | Sprite Soda Pop Lemon Lime Pack In Cans - 12-1... | Sprite | Lemon-Lime-Citrus-Soda | Beverages | Designers, artists, musicians, athletes. Taste... |
| 9 | 108051456 | Sprite Zero Sugar Soda Pop Lemon Lime - 2 Liter | Sprite | Lemon-Lime-Citrus-Soda | Beverages | Designers, artists, musicians, athletes. Taste... |


### Convert DataFrame to Documents

For Qdrant, we need to convert our DataFrame rows into a list of dictionaries (documents). Each document will become a point in our vector database.


In [None]:
# Convert DataFrame to a list of dictionaries
documents = df.to_dict(orient='records')

print(f" Created {len(documents)} documents")
print("\n Example document (first product):")
documents[0]


 Created 10 documents

 Example document (first product):


{'product_id': 108050729,
 'product_name': 'Sprite Soda Pop Lemon Lime - 1 Liter',
 'brand': 'Sprite',
 'shelf': 'Lemon-Lime-Citrus-Soda',
 'category': 'Beverages',
 'description': 'With its cool, satisfying, and refreshing citrus taste, Sprite is the soda that ignites your senses to keep you on your toes. \n  \nAs the OG, the head honcho in the lemon-lime flavored soft drink biz, Sprite was, is, and will always be an innovator. An inventor. Keeping things interesting for you, no matter how you drink it. It’s a big challenge, but one that Sprite’s not afraid of—because innovators aren’t afraid of anything. \n  \nEvery sip of Sprite is refreshing, so it should come as no surprise why Sprite and lemon-lime have become synonymous. As the awe-inspiring trendsetter of the iconic flavor, Sprite never fails to deliver exactly what it promises, every single time. After all, it’s been doing that for 50 years already.\n  \nFor the people who love a little more zing, there’s a variety

---

## Step 3: Connect to Qdrant Cloud

Now we need to connect to Qdrant. You'll need:
1. A **Qdrant Cloud account** (free at [cloud.qdrant.io](https://cloud.qdrant.io))
2. An **API Key** from your cluster
3. The **Cluster URL**

### How to get your Qdrant credentials:

1. Go to [cloud.qdrant.io](https://cloud.qdrant.io) and sign up/login
2. Create a new cluster (free tier is fine)
3. Click on your cluster → **API Keys** → Create a new key
4. Copy your API Key and Cluster URL

**Note: the key is only visible in the window that appears right after you create it, so be sure to copy the key and URL before closing it.**

The site then shows a connection sample like this one (placeholder values shown here; yours will differ):

    qdrant_client = QdrantClient(
        url="https://abc123-xyz.us-east-1-0.aws.cloud.qdrant.io:6333",
        api_key="abc123xyz...",
    )

**Security note for production use**: Never commit API keys to version control!
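One common way to keep keys out of a notebook is to read them from environment variables. The sketch below is only an illustration: the variable names `QDRANT_URL` and `QDRANT_API_KEY` and the helpers `load_qdrant_credentials` and `mask` are assumptions made for this example, not something Qdrant requires.

```python
import os

def load_qdrant_credentials(env=os.environ):
    """Read the cluster URL and API key from environment variables.

    QDRANT_URL and QDRANT_API_KEY are names chosen for this sketch.
    """
    url = env.get("QDRANT_URL")
    api_key = env.get("QDRANT_API_KEY")
    missing = [
        name
        for name, value in (("QDRANT_URL", url), ("QDRANT_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return url, api_key

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing it."""
    return secret[:visible] + "..." if len(secret) > visible else "..."
```

With both variables set in your shell, `url, api_key = load_qdrant_credentials()` returns values you can pass to `QdrantClient(url=url, api_key=api_key)`, and `mask(api_key)` gives a safe string to print.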


In [None]:
# ============================================
# REPLACE THESE WITH YOUR QDRANT CREDENTIALS
# ============================================

# Your Qdrant API Key (from Qdrant Cloud dashboard)
API_KEY = "YOUR_QDRANT_API_KEY"  # Example: "abc123xyz..."

# Your Qdrant Cluster URL (from Qdrant Cloud dashboard)
QDRANT_URL = "YOUR_QDRANT_CLUSTER_URL"  # Example: "https://abc123-xyz.us-east-1-0.aws.cloud.qdrant.io:6333"

# ============================================

# Connect to Qdrant Cloud
qdrant_client = QdrantClient(
    url=QDRANT_URL,
    api_key=API_KEY,
)

print(" Connected to Qdrant Cloud!")
print(f" URL: {QDRANT_URL[:50]}..." if len(QDRANT_URL) > 50 else f" URL: {QDRANT_URL}")


 Connected to Qdrant Cloud!
 URL: https://d78b1147-cde0-4b94-aa1e-9b6b2278050c.us-ea...


---

## Step 4: Initialize the Embedding Model

We'll use **SentenceTransformers** to convert text into embeddings. The model `all-MiniLM-L6-v2` is:
- Fast and lightweight
- Creates 384-dimensional vectors
- Great for semantic similarity tasks

The first time you run this, it will download the model (~90MB).


In [7]:
# Initialize the embedding model
# This model converts text into 384-dimensional vectors
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Let's see how embeddings work with a quick example
example_text = "refreshing diet cola"
example_embedding = encoder.encode(example_text)

print(f" Embedding model loaded!")
print(f" Embedding dimension: {len(example_embedding)}")
print(f"\n Example embedding for '{example_text}':")
print(f"   First 10 values: {example_embedding[:10].round(4)}")


 Embedding model loaded!
 Embedding dimension: 384

 Example embedding for 'refreshing diet cola':
   First 10 values: [-0.1018 -0.0866  0.0263  0.1092  0.0594 -0.0221  0.0306 -0.0043 -0.0521
 -0.0756]


---

## Step 5: Create a Qdrant Collection

A **collection** in Qdrant is like a table in a traditional database. It stores all your vectors and their associated data (payload).

When creating a collection, we specify:
- **Vector size**: Must match our embedding dimension (384)
- **Distance metric**: How similarity is calculated (COSINE is most common)
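To see why the choice of metric matters, here is a small plain-Python sketch that compares cosine similarity, dot product, and Euclidean distance on toy 2-D vectors. The vectors and helper names are made up for illustration; in the lab, Qdrant computes the metric selected in `models.Distance`.

```python
import math

def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: depends only on direction, not on length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Euclidean distance: straight-line distance (lower = closer)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]   # same direction, larger magnitude
slightly_rotated = [0.9, 0.1]        # nearly the same direction, similar length

for name, vec in [("longer", same_direction_longer), ("rotated", slightly_rotated)]:
    print(
        name,
        "cosine:", round(cosine(query, vec), 4),
        "dot:", round(dot(query, vec), 4),
        "euclidean:", round(euclidean(query, vec), 4),
    )
```

Cosine treats the longer vector as a perfect match because only direction counts, while Euclidean distance ranks the slightly rotated vector as closer. For text embeddings, where direction carries the meaning, cosine is the usual default.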


In [8]:
# Define collection name
collection_name = "softdrink"

# Optional: Delete existing collection if you want to start fresh
# Uncomment the lines below if needed:
# try:
#     qdrant_client.delete_collection(collection_name=collection_name)
#     print(f"🗑️ Deleted existing collection: {collection_name}")
# except Exception as e:
#     print(f"ℹ️ No existing collection to delete")

# Create the collection
qdrant_client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # 384 dimensions
        distance=models.Distance.COSINE,  # Use cosine similarity
    ),
)

print(f" Collection '{collection_name}' created successfully!")


 Collection 'softdrink' created successfully!


---

## Step 6: Create Metadata Indexes

**Payload indexes** allow us to efficiently filter search results by metadata fields. Without indexes, filtering would be slow on large datasets.

We'll create indexes for:
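- `bpn` - Product number (this field is not in the sample CSV, whose ID column is `product_id`, so this index stays empty here)
- `shelf` - Shelf classification (e.g., "Cola", "Ginger-Ale")
- `brand` - Brand name
- `category` - Product category
- `product_name` - Product name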
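As a rough mental model of why an index speeds up filtering, the sketch below builds a tiny keyword lookup in plain Python. It is a conceptual illustration only, not how Qdrant implements payload indexes internally; the payloads are a few rows copied from the sample data, and `build_keyword_index` is a hypothetical helper.

```python
from collections import defaultdict

# Point ID -> payload, using a few rows from the soft drinks table.
payloads = {
    0: {"brand": "Sprite", "shelf": "Lemon-Lime-Citrus-Soda"},
    3: {"brand": "Diet Coke", "shelf": "Cola"},
    6: {"brand": "Pepsi", "shelf": "Cola"},
    7: {"brand": "Coca-Cola", "shelf": "Cola"},
}

def build_keyword_index(payloads, field):
    """Map each distinct value of `field` to the set of point IDs that carry it."""
    index = defaultdict(set)
    for point_id, payload in payloads.items():
        if field in payload:
            index[payload[field]].add(point_id)
    return index

shelf_index = build_keyword_index(payloads, "shelf")

# Without an index: inspect every payload.
scan_result = {pid for pid, p in payloads.items() if p.get("shelf") == "Cola"}

# With an index: one dictionary lookup.
lookup_result = shelf_index["Cola"]

print(sorted(lookup_result))
```

Both approaches return the same IDs, but the scan touches every point while the lookup does not. Indexing a field that no payload contains simply produces an empty index.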


In [9]:
# Create payload indexes for faster filtered searches
# These are the metadata fields we'll want to filter by

index_fields = ["bpn", "shelf", "brand", "category", "product_name"]

for field in index_fields:
    qdrant_client.create_payload_index(
        collection_name=collection_name,
        field_name=field,
        field_type="keyword"  # Use "keyword" for string fields
    )
    print(f" Created index for: {field}")

print("\n All metadata indexes created!")




 Created index for: bpn
 Created index for: shelf
 Created index for: brand
 Created index for: category
 Created index for: product_name

 All metadata indexes created!


---

## Step 7: Upload Documents with Embeddings

Now we'll upload our products to Qdrant. For each document:
1. Generate an embedding from the **product description**
2. Store the embedding as a vector
3. Store all product info as the **payload** (metadata)

This is where the magic happens - we're converting text descriptions into searchable vectors!
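One detail worth guarding against before embedding: when a CSV cell is empty, pandas loads it as `NaN` (a float), which is not useful text to embed. The sketch below shows one possible way to choose the text for each document, falling back to the product name. The helper `text_to_embed` and the example products are hypothetical and are not part of the lab's dataset.

```python
import math

def text_to_embed(doc):
    """Pick the text to embed, falling back to the product name.

    Handles None, NaN (what pandas uses for empty CSV cells) and blank strings.
    """
    description = doc.get("description")
    is_nan = isinstance(description, float) and math.isnan(description)
    if description is None or is_nan or not str(description).strip():
        return str(doc.get("product_name", "")).strip()
    return str(description).strip()

# Hypothetical documents covering the three cases.
docs = [
    {"product_name": "Sprite Zero Sugar", "description": "  Crisp lemon-lime taste.  "},
    {"product_name": "Mystery Cola", "description": float("nan")},
    {"product_name": "No Description Soda"},
]

texts = [text_to_embed(d) for d in docs]
print(texts)
```

In the upload cell, the result of a helper like this could be passed to `encoder.encode(...)` in place of `doc.get("description", "")`.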


In [10]:
# Upload all documents to Qdrant
# Each point has: ID, vector (embedding), and payload (metadata)

print(" Uploading documents to Qdrant...")

qdrant_client.upload_points(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=idx,  # Unique ID for each point
            vector=encoder.encode(doc.get("description", "")).tolist(),  # Embed the description
            payload=doc  # Store all document fields as metadata
        )
        for idx, doc in enumerate(documents)
    ],
)

print(f" Successfully uploaded {len(documents)} products to Qdrant!")
print("\n Your vector database is ready for searching!")


 Uploading documents to Qdrant...
 Successfully uploaded 10 products to Qdrant!

 Your vector database is ready for searching!


---

## Step 8: Similarity Search

Now let's search our database! We'll:
1. Convert our search query into an embedding
2. Find the most similar products based on cosine similarity
3. Return the top results

**Example query**: "diet drink" - This should find low-calorie, sugar-free beverages even if they don't contain the exact words "diet drink".


In [11]:
# Basic similarity search
query = "diet drink"

# Search for similar products
hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=encoder.encode(query).tolist(),  # Convert query to embedding
    limit=3,  # Return top 3 results
).points

# Display results
print(f" Search query: '{query}'")
print(f" Found {len(hits)} results:\n")

for i, hit in enumerate(hits, 1):
    print(f"--- Result {i} (Score: {hit.score:.4f}) ---")
    print(f"   Product: {hit.payload.get('product_name', 'N/A')}")
    print(f"   Brand: {hit.payload.get('brand', 'N/A')}")
    print(f"   Shelf: {hit.payload.get('shelf', 'N/A')}")
    print(f"   Description: {hit.payload.get('description', 'N/A')[:100]}...")
    print()


 Search query: 'diet drink'
 Found 3 results:

--- Result 1 (Score: 0.6071) ---
   Product: Diet Coke Soda Pop Cola Caffeine Free - 6-16.9 Fl. Oz.
   Brand: Diet Coke
   Shelf: Cola
   Description: Crisp, cold and reliable, this is your everyday wing (wo)man—Diet Coke. Your deliciously fizzy go-to...

--- Result 2 (Score: 0.5506) ---
   Product: Diet Coke Soda Pop Cola 6 Count - 16.9 Fl. Oz.
   Brand: Diet Coke
   Shelf: Cola
   Description: Enjoy a break with the one and only Diet Coke. This diet soda is your perfect no-sugar, no-calorie c...

--- Result 3 (Score: 0.4822) ---
   Product: Pepsi Soda Diet Caffeine Free - 6-16.9 Fl. Oz.
   Brand: Pepsi
   Shelf: Cola
   Description: Enjoy Diet Caffeine Free Pepsi Cola with 0 calories and no caffeine....



### What Just Happened?

1. We converted "diet drink" into a 384-dimensional embedding
2. Qdrant compared this embedding against all product embeddings
3. It returned the products with the highest cosine similarity scores
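4. Matching is based on meaning rather than exact wording, so a product can rank highly without containing the phrase "diet drink". The Pepsi result, for example, never uses that phrase; it matched because its description talks about a diet cola with 0 calories.

**Similarity Score**: Cosine similarity ranges from -1 to 1 (higher = more similar). For text embeddings like these, scores usually fall between 0 and 1.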
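The steps above can be sketched as a brute-force search in plain Python: score every stored vector against the query, sort, and keep the best few. This is a conceptual sketch with made-up 2-D vectors and a hypothetical `top_k` helper; Qdrant itself uses an approximate nearest-neighbor index (HNSW) so it does not have to compare against every point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vector, points, limit=3):
    """Score every stored point against the query and keep the best `limit`."""
    scored = [(cosine(query_vector, vec), name) for name, vec in points.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]

# Hypothetical 2-D "embeddings"; the real ones in this lab have 384 dimensions.
points = {
    "diet cola": [0.9, 0.1],
    "zero sugar lemon-lime": [0.7, 0.5],
    "classic cola": [0.2, 0.9],
    "ginger ale": [-0.3, 0.8],
}
query_vector = [1.0, 0.0]

results = top_k(query_vector, points, limit=2)
for score, name in results:
    print(f"{score:.4f}  {name}")
```

Raising `limit` above the number of stored points simply returns all of them, ordered from most to least similar, and the last toy vector shows that a cosine score can be negative.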

---

## Step 9: Filtered Similarity Search

Sometimes you want to combine semantic search with exact filters. For example:
- Find "diet drinks" but **only on the Cola shelf**
- Search for "refreshing" but **only Pepsi brand**

This is the power of Qdrant's metadata filtering!
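The matching rule behind a `must` filter can be modeled in a few lines of plain Python: a point is kept only if every condition matches its payload exactly. This sketch only illustrates that rule; the scores are invented, the helper names are hypothetical, and Qdrant applies filters as part of the search rather than as a separate pass like this.

```python
def matches_all(payload, must):
    """Return True when the payload equals every (key, value) pair in `must`."""
    return all(payload.get(key) == value for key, value in must.items())

def filtered_search(scored_points, must, limit=2):
    """Keep points whose payload satisfies all conditions, best scores first."""
    kept = [(score, payload) for score, payload in scored_points if matches_all(payload, must)]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:limit]

# Scores are made up for illustration; in the lab they come from Qdrant.
scored_points = [
    (0.61, {"product_name": "Diet Coke Caffeine Free", "brand": "Diet Coke", "shelf": "Cola"}),
    (0.48, {"product_name": "Pepsi Diet Caffeine Free", "brand": "Pepsi", "shelf": "Cola"}),
    (0.45, {"product_name": "Sprite Zero Sugar", "brand": "Sprite", "shelf": "Lemon-Lime-Citrus-Soda"}),
]

cola_hits = filtered_search(scored_points, {"shelf": "Cola"})
pepsi_cola_hits = filtered_search(scored_points, {"shelf": "Cola", "brand": "Pepsi"})

print([p["brand"] for _, p in cola_hits])
print([p["brand"] for _, p in pepsi_cola_hits])
```

Because keyword matching is exact, a value with different casing (such as `"cola"`) or a brand that is not in the data returns no results.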


In [18]:
# Filtered similarity search
# Search for "diet drink" but ONLY in the Cola shelf

query = "diet drink"
filter_shelf = "Cola"

hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=encoder.encode(query).tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="shelf",
                match=models.MatchValue(value=filter_shelf)
            ),
        ]
    ),
    limit=2,  # Return top 2 results
).points

# Display results
print(f" Search query: '{query}'")
print(f" Filter: shelf = '{filter_shelf}'")
print(f" Found {len(hits)} results:\n")

for i, hit in enumerate(hits, 1):
    print(f"--- Result {i} (Score: {hit.score:.4f}) ---")
    print(f"   Product: {hit.payload.get('product_name', 'N/A')}")
    print(f"   Brand: {hit.payload.get('brand', 'N/A')}")
    print(f"   Shelf: {hit.payload.get('shelf', 'N/A')}")
    print()


 Search query: 'diet drink'
 Filter: shelf = 'Cola'
 Found 2 results:

--- Result 1 (Score: 0.6071) ---
   Product: Diet Coke Soda Pop Cola Caffeine Free - 6-16.9 Fl. Oz.
   Brand: Diet Coke
   Shelf: Cola

--- Result 2 (Score: 0.5506) ---
   Product: Diet Coke Soda Pop Cola 6 Count - 16.9 Fl. Oz.
   Brand: Diet Coke
   Shelf: Cola



As we can see, the semantic search still works, but the results are now restricted to the specific shelf we filtered on, 'Cola'.

---

# 🧪 LAB EXERCISES

Now it's your turn! Complete the following exercises to practice what you've learned.

---

### Exercise 1: Basic Similarity Search

**Task:** Search for "refreshing summer beverage" and return the top 3 results.

Print out the product name, brand, and similarity score for each result.


In [None]:
print("---Lab Exercise 1: Basic Similarity Search---\n")

# Your code here!
query = "refreshing summer beverage"

hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=encoder.encode(query).tolist(),
    limit=3,
).points

print(f" Query: '{query}'\n")
for i, hit in enumerate(hits, 1):
    print(f"{i}. {hit.payload.get('product_name', 'N/A')}")
    print(f"   Brand: {hit.payload.get('brand', 'N/A')}")
    print(f"   Score: {hit.score:.4f}")
    print()


---Lab Exercise 1: Basic Similarity Search---

🔍 Query: 'refreshing summer beverage'

1. Coca-Cola Soda Pop Classic - 12-12 Fl. Oz.
   Brand: Coca-Cola
   Score: 0.5415

2. Coca-Cola Soda Pop Caffeine Free - 12.12 Fl. Oz.
   Brand: Coca-Cola
   Score: 0.5395

3. Sprite Soda Pop Lemon Lime Pack In Cans - 12-12 Fl. Oz.
   Brand: Sprite
   Score: 0.4639



---

### Exercise 2: Filtered Search by Brand

**Task:** Search for "energy boost" but filter to only show products from a specific brand.

**Hints:**
- Use `query_filter` with a `FieldCondition` on the "brand" field
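- The brand must exactly match a value in the dataset, such as "Sprite", "Pepsi", or "Coca-Cola"; a brand that is not in the data (for example "Gatorade") returns no results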


In [None]:
print("---Lab Exercise 2: Filtered Search by Brand---\n")

# Your code here!
query = "energy boost"
filter_brand = "Gatorade"  # Try different brands!

hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=encoder.encode(query).tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="brand",
                match=models.MatchValue(value=filter_brand)
            ),
        ]
    ),
    limit=3,
).points

print(f" Query: '{query}'")
print(f" Filter: brand = '{filter_brand}'\n")

if hits:
    for i, hit in enumerate(hits, 1):
        print(f"{i}. {hit.payload.get('product_name', 'N/A')} (Score: {hit.score:.4f})")
else:
    print("No results found. Try a different brand!")


---Lab Exercise 2: Filtered Search by Brand---

 Query: 'energy boost'
 Filter: brand = 'Gatorade'

No results found. Try a different brand!


---

### Exercise 3: Create Your Own Search! 🎮

**Task:** Write your own similarity search query against the soft drinks database.

**Ideas to try:**
- "healthy sports drink"
- "kids party drink"
- "coffee alternative"
- "natural fruit flavor"
- Filter by multiple conditions (category AND brand)

Have fun exploring, and share future sessions with your teammates!


In [None]:
print("---Lab Exercise 3: Your Own Search!---\n")

# ========================================
# 🎮 YOUR CODE HERE - Be creative!
# ========================================

# Example: plain similarity search (add a query_filter to combine conditions)
query = "natural fruit flavor"

hits = qdrant_client.query_points(
    collection_name=collection_name,
    query=encoder.encode(query).tolist(),
    limit=5,
).points

print(f"🔍 Query: '{query}'\n")
for i, hit in enumerate(hits, 1):
    print(f"{i}. {hit.payload.get('product_name', 'N/A')}")
    print(f"   Brand: {hit.payload.get('brand', 'N/A')} | Shelf: {hit.payload.get('shelf', 'N/A')}")
    print(f"   Score: {hit.score:.4f}")
    print()


---

# 🎉 Congratulations!

You've completed Lab 2! Here's what you learned:

| Concept | What You Learned |
|---------|------------------|
| **Vector Databases** | How to store and search embeddings with Qdrant |
| **Embeddings** | How to convert text to numerical vectors using SentenceTransformers |
| **Similarity Search** | How to find semantically similar items |
| **Metadata Filtering** | How to combine semantic search with exact filters |
| **Payload Indexes** | How to optimize filtered queries |

## How This Connects to RAG

What you learned today is a key building block for **Retrieval Augmented Generation (RAG)**:

1. **Store** your documents in a vector database (what we did today)
2. **Search** for relevant documents based on user queries
3. **Augment** LLM prompts with retrieved context
4. **Generate** accurate, grounded responses
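As a preview of the "Augment" step, here is a minimal sketch of turning retrieved payloads into an LLM prompt. The function name `build_rag_prompt`, the prompt wording, and the `max_chars` cutoff are assumptions made for illustration; the payload is shaped like the ones stored in this lab.

```python
def build_rag_prompt(question, hits, max_chars=200):
    """Assemble a prompt that places retrieved product text before the question."""
    context_lines = []
    for number, payload in enumerate(hits, 1):
        snippet = payload.get("description", "")[:max_chars]
        context_lines.append(f"[{number}] {payload.get('product_name', 'N/A')}: {snippet}")
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# One payload shaped like the search results from Step 8.
hits = [
    {
        "product_name": "Pepsi Soda Diet Caffeine Free - 6-16.9 Fl. Oz.",
        "description": "Enjoy Diet Caffeine Free Pepsi Cola with 0 calories and no caffeine.",
    },
]

prompt = build_rag_prompt("Which cola has no caffeine?", hits)
print(prompt)
```

In a full pipeline, `hits` would be the payloads of the points returned by `query_points`, and the resulting string would be sent to an LLM for the "Generate" step.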

## What's Next?

In the next session, we'll combine vector search with LLMs to build a complete RAG application!

## Additional Resources

- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [SentenceTransformers Models](https://www.sbert.net/docs/pretrained_models.html)
- [Vector Search Explained](https://qdrant.tech/articles/vector-search/)
- [LangChain + Qdrant Integration](https://python.langchain.com/docs/integrations/vectorstores/qdrant)
