# Step 2: Creating Embeddings with the OpenAI API

This notebook converts the code chunks we extracted in the previous step into **vector embeddings** using OpenAI's API.

## Concept
An embedding is a numerical representation (a vector) of a piece of text or code. We will send each of our code chunks to an OpenAI embedding model, which will return a vector for each one. These vectors capture the semantic meaning of the code, allowing us to perform similarity searches later.
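
To make "similarity search" concrete, here is a minimal sketch of cosine similarity, a common way to compare two embedding vectors. The three-dimensional vectors and their labels below are made up for illustration; real embeddings from the API have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: invented values, not API output)
toy_vectors = {
    "read_file": [0.9, 0.1, 0.0],
    "write_file": [0.8, 0.2, 0.1],
    "validate_email": [0.0, 0.2, 0.9],
}

query = toy_vectors["read_file"]
for name, vector in toy_vectors.items():
    print(f"{name}: {cosine_similarity(query, vector):.3f}")
```

Vectors that point in nearly the same direction score close to 1, while unrelated ones score close to 0, which is what lets us rank code chunks against a query later.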

### Pros:
- **High Performance:** Access to state-of-the-art models without needing to train them yourself.
- **Ease of Use:** Simple API calls abstract away complex infrastructure.
- **Scalability:** OpenAI manages the infrastructure, so the service can handle large volumes of requests, subject to your account's rate limits.

### Cons:
- **Requires API Key & Internet:** You need a valid OpenAI API key and an internet connection.
- **Cost:** Usage is billed per input token, though embedding models are generally inexpensive.
- **Privacy:** Your code chunks are sent to OpenAI's servers for processing.

In [1]:
# Install the necessary libraries from OpenAI and for .env file handling
%pip install openai python-dotenv

Note: you may need to restart the kernel to use updated packages.


## 1. Setup and API Key Configuration

First, we'll set up our OpenAI API key. A common way to do this is with a `.env` file, which keeps the key out of your code.

### Instructions for Using a `.env` File

1.  **Create a file:** In the same directory as this notebook, create a new file named `.env`.
2.  **Add your key:** Open the `.env` file and add your OpenAI API key in the following format:
    ```
    OPENAI_API_KEY="sk-YourSecretKeyGoesHere"
    ```
3.  **Save the file.** The code below will automatically find and load this key.

*(Note: If you're using Git, remember to add `.env` to your `.gitignore` file to prevent accidentally committing your secret key!)*
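
To illustrate the file format, the sketch below parses `KEY="value"` lines from a string. It is a simplified stand-in written for this example, not how `python-dotenv` is implemented; the real library handles more cases. The helper name `parse_env_text` is hypothetical.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines (simplified; not python-dotenv's parser)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue  # Not a KEY=value line
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


sample = '# secrets\nOPENAI_API_KEY="sk-YourSecretKeyGoesHere"\n'
parsed = parse_env_text(sample)
key = parsed["OPENAI_API_KEY"]

# Print only a masked form so the full key never lands in notebook output
print(f"Loaded key: {key[:3]}...{key[-4:]}")
```

Printing a masked form is a cheap way to confirm the key was picked up without exposing it in a saved notebook.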

In [1]:
import os
import getpass
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get the API key
# It will first check for the OPENAI_API_KEY in your .env file or system environment.
# If it's not found, it will securely prompt you to enter it.
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    api_key = getpass.getpass("OpenAI API key not found. Please enter your key: ")
    os.environ["OPENAI_API_KEY"] = api_key

# Initialize the OpenAI client
try:
    client = OpenAI()
    print("✅ OpenAI client initialized successfully!")
except Exception as e:
    print(f"❌ Error initializing OpenAI client: {e}")

✅ OpenAI client initialized successfully!


## 2. Prepare Code Chunks

Next, we'll use the code chunks we extracted in the previous notebook. For this example to be self-contained, we will redefine the list of chunks here.

In [2]:
# This list represents the output from our previous AST chunking notebook
code_chunks = [
    {'name': 'read_file_content', 'type': 'function', 'code': 'def read_file_content(filepath: str) -> str:\n    """Read and return the content of a file."""\n    try:\n        with open(filepath, \'r\', encoding=\'utf-8\') as file:\n            return file.read()\n    except FileNotFoundError:\n        return ""'},
    {'name': 'validate_email', 'type': 'function', 'code': 'def validate_email(email: str) -> bool:\n    """Simple email validation function."""\n    return "@" in email and "." in email.split("@")[-1]'},
    {'name': 'fetch_user_data', 'type': 'async_function', 'code': 'async def fetch_user_data(user_id: int) -> Dict:\n    """Async function to fetch user data from API."""\n    # Simulate API call\n    await asyncio.sleep(0.1)\n    return {"id": user_id, "name": f"User {user_id}"}'},
    {'name': 'DataProcessor', 'type': 'class', 'code': 'class DataProcessor:\n    """A class for processing and analyzing data."""\n\n    def __init__(self, data_source: str):\n        self.data_source = data_source\n        self.processed_count = 0\n\n    def process_batch(self, items: List[str]) -> List[str]:\n        """Process a batch of items."""\n        processed = []\n        for item in items:\n            processed.append(item.strip().upper())\n            self.processed_count += 1\n        return processed\n\n    def get_statistics(self) -> Dict[str, int]:\n        """Get processing statistics."""\n        return {\n            "processed_count": self.processed_count,\n            "data_source_length": len(self.data_source)\n        }'},
    {'name': 'FileManager', 'type': 'class', 'code': 'class FileManager:\n    """Utility class for file operations."""\n\n    def __init__(self, base_directory: str = "."):\n        self.base_directory = base_directory\n\n    def list_files(self, extension: str = None) -> List[str]:\n        """List files in the base directory."""\n        files = os.listdir(self.base_directory)\n        if extension:\n            files = [f for f in files if f.endswith(extension)]\n        return files\n\n    def file_exists(self, filename: str) -> bool:\n        """Check if a file exists."""\n        return os.path.exists(os.path.join(self.base_directory, filename))'}
]

print(f"Loaded {len(code_chunks)} code chunks to be embedded.")

Loaded 5 code chunks to be embedded.


## 3. Create Embeddings

Now we'll define a function to call the OpenAI API and generate an embedding for a given piece of text. We will then loop through our code chunks and create an embedding for each one.

In [3]:
def get_openai_embedding(text: str, model: str = "text-embedding-3-small"):
    """Generate an embedding for a given text using OpenAI's API."""
    # Replace newlines with spaces. OpenAI recommended this for older embedding models;
    # note that it also flattens the code's line structure.
    text = text.replace("\n", " ")
    
    try:
        response = client.embeddings.create(input=[text], model=model)
        return response.data[0].embedding
    except Exception as e:
        print(f"❌ Error generating embedding: {e}")
        return None

# --- Demonstration ---

print("--- Generating Embeddings for Code Chunks ---\n")

embedded_chunks = []

for chunk in code_chunks:
    element_type = chunk['type'].replace('_', ' ').title()
    name = chunk['name']
    code = chunk['code']
    
    print(f"Processing {element_type} '{name}'...")
    
    # Generate embedding
    embedding = get_openai_embedding(code)
    
    if embedding:
        # Store the embedding with the original chunk data
        chunk['embedding'] = embedding
        embedded_chunks.append(chunk)
        
        # Display a preview of the embedding
        print("  ✓ Embedding created successfully!")
        print(f"    Dimensions: {len(embedding)}")
        print(f"    Preview: {str(embedding[:4])[:-1]}...]")
    else:
        print(f"  ✗ Failed to create embedding for '{name}'.")
    print("-" * 20)

print("\n✅ Embedding process completed!")

--- Generating Embeddings for Code Chunks ---

Processing Function 'read_file_content'...
  ✓ Embedding created successfully!
    Dimensions: 1536
    Preview: [0.0472058467566967, 0.02799457125365734, -0.03252619132399559, -0.021800773218274117...]
--------------------
Processing Function 'validate_email'...
  ✓ Embedding created successfully!
    Dimensions: 1536
    Preview: [0.012898148968815804, -0.0068848892115056515, 0.02178092859685421, -0.005763523746281862...]
--------------------
Processing Async Function 'fetch_user_data'...
  ✓ Embedding created successfully!
    Dimensions: 1536
    Preview: [0.010124558582901955, -0.01818905770778656, -0.06986628472805023, -0.025423673912882805...]
--------------------
Processing Class 'DataProcessor'...
  ✓ Embedding created successfully!
    Dimensions: 1536
    Preview: [-0.02076515182852745, 0.01411991287022829, -0.011680398136377335, -0.06018771603703499...]
--------------------
Processing Class 'FileManager'...
  ✓ Embedding created successfully!
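
The loop above sends one request per chunk. The embeddings endpoint also accepts a list of inputs, so for larger codebases the calls could be grouped. The following is a minimal sketch under stated assumptions: `embed_in_batches` is a hypothetical helper, the `batch_size` of 100 is an arbitrary choice, the client is passed in explicitly, and unlike `get_openai_embedding` it does not replace newlines or catch errors.

```python
from typing import Any


def embed_in_batches(
    client: Any,
    texts: list[str],
    model: str = "text-embedding-3-small",
    batch_size: int = 100,  # Assumption: arbitrary value, tune for your workload
) -> list[list[float]]:
    """Embed texts with one API request per batch, preserving input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        # Each returned item carries the index of its input; sort to be safe
        ordered = sorted(response.data, key=lambda item: item.index)
        vectors.extend(item.embedding for item in ordered)
    return vectors
```

With the notebook's `client` and `code_chunks`, a call might look like `embed_in_batches(client, [chunk['code'] for chunk in code_chunks])`, returning one vector per chunk in the same order.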

## 4. Final Output

The `embedded_chunks` list now contains our original code chunks, each with a new `embedding` key that holds its corresponding vector. This data is now ready to be stored in a vector database for searching.

In [4]:
# Inspect the first embedded chunk to see the final structure
if embedded_chunks:
    print("--- Structure of the First Embedded Chunk ---")
    first_chunk = embedded_chunks[0]
    print(f"Name: {first_chunk['name']}")
    print(f"Type: {first_chunk['type']}")
    print(f"Code:\n{first_chunk['code']}\n")
    print(f"Embedding Preview: {str(first_chunk['embedding'][:4])[:-1]}...]")
else:
    print("No chunks were embedded. Please check your API key and network connection.")

--- Structure of the First Embedded Chunk ---
Name: read_file_content
Type: function
Code:
def read_file_content(filepath: str) -> str:
    """Read and return the content of a file."""
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read()
    except FileNotFoundError:
        return ""

Embedding Preview: [0.0472058467566967, 0.02799457125365734, -0.03252619132399559, -0.021800773218274117...]
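
Before reaching for a vector database, it can help to see the store-then-search idea in miniature. The sketch below saves chunks to a JSON file, reloads them, and ranks them by cosine similarity with a brute-force scan. Everything here is an assumption for illustration: the toy three-dimensional embeddings are invented, the helper names (`save_chunks`, `load_chunks`, `search`) are hypothetical, and a JSON file is only a stand-in for a real vector store.

```python
import json
import math
import tempfile
from pathlib import Path


def save_chunks(chunks: list[dict], path: Path) -> None:
    """Write chunks (including their embeddings) to a JSON file."""
    path.write_text(json.dumps(chunks), encoding="utf-8")


def load_chunks(path: Path) -> list[dict]:
    """Read chunks back from a JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query_embedding: list[float], chunks: list[dict], k: int = 3) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_embedding, chunk["embedding"]),
        reverse=True,
    )
    return [chunk["name"] for chunk in ranked[:k]]


# Toy stand-ins for embedded_chunks (assumption: invented 3-d vectors)
toy_chunks = [
    {"name": "read_file_content", "embedding": [0.9, 0.1, 0.0]},
    {"name": "validate_email", "embedding": [0.0, 0.2, 0.9]},
    {"name": "FileManager", "embedding": [0.7, 0.3, 0.1]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    store_path = Path(tmp_dir) / "embedded_chunks.json"
    save_chunks(toy_chunks, store_path)
    restored = load_chunks(store_path)

print(search([1.0, 0.0, 0.0], restored, k=2))
```

In practice the query vector would come from embedding the search text with the same model used for the chunks, and a dedicated vector database would replace the linear scan once the collection grows.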
