# Step 3: Indexing Embeddings in a Vector Database

This notebook demonstrates the final step in preparing our search system: taking our code chunks and their embeddings and storing them in a **vector database**. For this demo, we'll use **Qdrant**.

## Prerequisites: Starting the Database with Docker Compose

1.  **Save the `docker-compose.yml` file** provided in the article to this directory.
2.  **Run the command:** Open a terminal in this directory and run `docker-compose up -d` (or `docker compose up -d` if you use Docker Compose V2).
3.  **Verify:** Check that the Qdrant dashboard is running at [http://localhost:6333/dashboard](http://localhost:6333/dashboard).

In [None]:
# Install necessary libraries
!pip install openai python-dotenv qdrant-client

## 1. Setup Clients (OpenAI and Qdrant)

We'll set up our clients by loading the OpenAI API key from a `.env` file and connecting to our local Qdrant instance.

### Instructions for Using a `.env` File

1.  **Create a file:** In the same directory as this notebook, create a file named `.env`, or rename the provided `.env_example` to `.env`.
2.  **Add your key:** Open it and add your OpenAI API key like this:
    `OPENAI_API_KEY="sk-YourSecretKeyGoesHere"`
3.  **Save the file.** The code below will automatically load it.
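
To make the loading step less of a black box, here is a minimal sketch of what reading a `.env` file amounts to. It is not how `python-dotenv` is implemented, and it handles only plain `KEY=value` lines; the helper names `parse_env_text` and `apply_env` are hypothetical. Like `load_dotenv()` with its default settings, the sketch leaves variables that are already set untouched.

```python
import os


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def apply_env(values: dict, environ=os.environ) -> None:
    """Copy parsed values into the environment; existing variables win."""
    for key, value in values.items():
        environ.setdefault(key, value)


# Illustrative input that mirrors the placeholder key shown above.
sample = 'OPENAI_API_KEY="sk-YourSecretKeyGoesHere"\n# a comment line\n'
parsed = parse_env_text(sample)
```

In practice this means a key exported in your shell takes precedence over the one in `.env`, which is worth remembering if the notebook seems to be using the wrong key.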

In [1]:
import os
import getpass
import uuid
from openai import OpenAI
from dotenv import load_dotenv
from qdrant_client import QdrantClient, models

# --- OpenAI Client Setup ---
load_dotenv()
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    api_key = getpass.getpass("OpenAI API key not found. Please enter your key: ")
    os.environ["OPENAI_API_KEY"] = api_key

try:
    openai_client = OpenAI()
    print("✅ OpenAI client initialized successfully!")
except Exception as e:
    print(f"❌ Error initializing OpenAI client: {e}")

# --- Qdrant Client Setup ---
try:
    qdrant_client = QdrantClient("localhost", port=6333)
    # Check if the client can communicate with the server.
    qdrant_client.get_collections()
    print("✅ Qdrant client connected successfully!")
except Exception as e:
    print(f"❌ Error connecting to Qdrant: {e}")
    print("⚠️ Please ensure the Qdrant Docker container is running via docker-compose.")

✅ OpenAI client initialized successfully!
✅ Qdrant client connected successfully!


## 2. Prepare Enriched Data and Create Embeddings

Here, we'll implement the **hybrid approach**: for each chunk, we build one combined text string containing both the **LLM-generated description** and the **raw code**. We send this combined text to OpenAI to get a single embedding per chunk, so the vector reflects both what the code does and how it is written.
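
As a small illustration of the combined string, the sketch below builds it for one made-up chunk. The helper name `build_embedding_text` and the sample chunk are hypothetical; the layout (description, a `---` separator, then the code) follows the format used in the embedding loop of this notebook.

```python
def build_embedding_text(chunk: dict) -> str:
    """Join the description and the raw code into one string for embedding."""
    return (
        f"Description: {chunk['llm_description']}\n"
        "---\n"
        f"Code:\n{chunk['code']}"
    )


# Hypothetical chunk, used only to show the shape of the result.
example_chunk = {
    "llm_description": "Adds two integers and returns the sum.",
    "code": "def add(a: int, b: int) -> int:\n    return a + b",
}

combined_text = build_embedding_text(example_chunk)
print(combined_text)
```

Keeping the description first means a natural-language query and the stored text start from similar wording, while the code that follows preserves identifiers a developer might search for.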

In [2]:
# This list represents our chunks with AI-generated descriptions and rich metadata
enriched_chunks = [
    {
        'name': 'read_file_content', 'type': 'function', 'llm_description': 'Reads and returns the entire content of a specified file, handling file-not-found errors gracefully.', 
        'code': 'def read_file_content(filepath: str) -> str:\n    """Read and return the content of a file."""\n    try:\n        with open(filepath, \'r\', encoding=\'utf-8\') as file:\n            return file.read()\n    except FileNotFoundError:\n        return ""',
        'file_path': 'src/utils/file_handler.py', 'line_from': 10, 'line_to': 15
    },
    {
        'name': 'validate_email', 'type': 'function', 'llm_description': 'Performs a simple validation of an email address by checking for the presence of “@” and “.” symbols.', 
        'code': 'def validate_email(email: str) -> bool:\n    """Simple email validation function."""\n    return "@" in email and "." in email.split("@")[-1]',
        'file_path': 'src/utils/validators.py', 'line_from': 8, 'line_to': 10
    },
    {
        'name': 'DataProcessor', 'type': 'class', 'llm_description': 'A class designed to process and analyze batches of string data, including cleaning and calculating statistics.', 
        'code': 'class DataProcessor:\n    """A class for processing and analyzing data."""\n\n    def __init__(self, data_source: str):\n        self.data_source = data_source\n        self.processed_count = 0\n\n    def process_batch(self, items: List[str]) -> List[str]:\n        #... (rest of class code)',
        'file_path': 'src/processing/core.py', 'line_from': 20, 'line_to': 45
    }
]

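def get_openai_embedding(text: str, model: str = "text-embedding-3-small"):
    """Return the embedding vector for `text`, or None if the API call fails."""
    # Flatten newlines so the whole chunk is sent as a single line of text.
    text = text.replace("\n", " ")
    try:
        return openai_client.embeddings.create(input=[text], model=model).data[0].embedding
    except Exception as e:
        print(f"❌ Error generating embedding: {e}")
        return None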

# --- Generate embeddings for the combined text ---
print("--- Generating Hybrid Embeddings ---\n")
for chunk in enriched_chunks:
    combined_text = f"Description: {chunk['llm_description']}\n---\nCode:\n{chunk['code']}"
    print(f"Embedding chunk: {chunk['name']}...")
    chunk['embedding'] = get_openai_embedding(combined_text)
    if chunk['embedding']:
        print(f"  ✓ Embedding created (Dimensions: {len(chunk['embedding'])})\n")
    else:
        print(f"  ✗ FAILED to embed chunk: {chunk['name']}\n")

--- Generating Hybrid Embeddings ---

Embedding chunk: read_file_content...
  ✓ Embedding created (Dimensions: 1536)

Embedding chunk: validate_email...
  ✓ Embedding created (Dimensions: 1536)

Embedding chunk: DataProcessor...
  ✓ Embedding created (Dimensions: 1536)



## 3. Create and Populate the Qdrant Collection

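Now we'll create a collection in Qdrant. A collection is roughly analogous to a table in a SQL database: a named set of points that share the same vector configuration. Rather than relying on the deprecated `recreate_collection` helper, we check whether the collection exists, delete it if it does, and then create a fresh one, so the notebook produces the same result every time it is run.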
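
The collection created in the next cell compares vectors with cosine distance. As a rough illustration of what that metric measures, here is a plain-Python sketch; the function name `cosine_similarity` and the toy vectors are illustrative only, and Qdrant computes this internally with its own optimized code.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings here have 1536 dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because the score depends only on direction and not on vector length, two chunks with similar meaning score close to 1 even if their embeddings differ in magnitude.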

In [3]:
COLLECTION_NAME = "semantic_code_search"
VECTOR_SIZE = 1536 # For text-embedding-3-small

# --- Create the Collection (Modern Approach) ---
try:
    # Check if the collection already exists
    collections = qdrant_client.get_collections().collections
    collection_exists = any(collection.name == COLLECTION_NAME for collection in collections)
    
    if collection_exists:
        print(f"🗑️ Collection '{COLLECTION_NAME}' already exists. Deleting it.")
        qdrant_client.delete_collection(collection_name=COLLECTION_NAME)
    
    # Create a new collection
    print(f"✨ Creating new collection '{COLLECTION_NAME}'.")
    qdrant_client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=models.VectorParams(size=VECTOR_SIZE, distance=models.Distance.COSINE),
    )
    print("✅ Collection created successfully.")

except Exception as e:
    print(f"❌ Error during collection setup: {e}")

# --- Prepare and Upsert Points ---
points_to_upsert = []
for chunk in enriched_chunks:
    if chunk.get('embedding'):
        points_to_upsert.append(
            models.PointStruct(
                id=str(uuid.uuid4()), 
                vector=chunk['embedding'],
                payload={
                    "name": chunk.get("name"),
                    "code_type": chunk.get("type"),
                    "llm_description": chunk.get("llm_description"),
                    "snippet": chunk.get("code"),
                    "context": {
                        "file_path": chunk.get("file_path"),
                        "line_from": chunk.get("line_from"),
                        "line_to": chunk.get("line_to")
                    }
                }
            )
        )

if points_to_upsert:
    try:
        qdrant_client.upsert(
            collection_name=COLLECTION_NAME,
            points=points_to_upsert,
            wait=True
        )
        print(f"\n✅ Successfully upserted {len(points_to_upsert)} points into the collection!")
    except Exception as e:
        print(f"❌ Error upserting points: {e}")

✨ Creating new collection 'semantic_code_search'.
✅ Collection created successfully.

✅ Successfully upserted 3 points into the collection!
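
The cell above assigns each point a random `uuid4` ID, which works because the collection is deleted and recreated on every run. If you wanted to keep the collection between runs instead, random IDs would add a new copy of every chunk each time. One possible alternative, sketched here under the assumption that a chunk is uniquely identified by its file path and name, is to derive the ID deterministically with `uuid5`, so that upserting the same chunk again overwrites the existing point. The helper name `stable_point_id` is hypothetical.

```python
import uuid


def stable_point_id(chunk: dict) -> str:
    """Derive a repeatable UUID string from the chunk's file path and name."""
    # Assumption: file_path plus name is unique per chunk in the code base.
    key = f"{chunk['file_path']}::{chunk['name']}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


# Two of the chunks from this notebook, reduced to the fields the ID needs.
chunk_a = {"file_path": "src/utils/validators.py", "name": "validate_email"}
chunk_b = {"file_path": "src/processing/core.py", "name": "DataProcessor"}

id_a_first = stable_point_id(chunk_a)
id_a_again = stable_point_id(chunk_a)
id_b = stable_point_id(chunk_b)
```

With this approach, `id=stable_point_id(chunk)` would replace `id=str(uuid.uuid4())` in the `PointStruct`. The trade-off is that renaming or moving a function produces a new ID, so stale points would still need to be cleaned up separately.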


## 4. Verify the Indexing

Finally, let's ask Qdrant for information about our collection to verify that the points were stored.

In [4]:
try:
    collection_info = qdrant_client.get_collection(collection_name=COLLECTION_NAME)
    print("--- Collection Info ---")
    print(f"Collection: {COLLECTION_NAME}")
    # points_count is the number of stored points (one per chunk).
    print(f"Points stored: {collection_info.points_count}")
    print("\n🎉 Your data is now indexed and ready for searching!")
    print("You can also view the collection in the Qdrant Dashboard: http://localhost:6333/dashboard")
except Exception as e:
    print(f"❌ Could not retrieve collection info: {e}")

--- Collection Info ---
Collection: semantic_code_search
Points stored: 3

🎉 Your data is now indexed and ready for searching!
You can also view the collection in the Qdrant Dashboard: http://localhost:6333/dashboard
