# Amazing Logos V4 - Step 4: Categories Embedding Preparation

This notebook prepares embeddings for category matching using OpenAI's `text-embedding-3-large` text embedding model.

## Configuration:
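- **Boolean flags** control which embeddings to compute:
  - `COMPUTE_CONSOLIDATION_EMBEDDINGS`: Set to `True` to compute embeddings for consolidation map keys. When `False`, previously saved embeddings are loaded from disk if the file exists.
  - `COMPUTE_CATEGORY_EMBEDDINGS`: Set to `True` to compute embeddings for unique categories. When `False`, previously saved embeddings are loaded from disk if the file exists.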
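
The compute-or-load behaviour controlled by each flag can be sketched as a single helper. This is only an illustration: `compute_or_load` and its `compute_fn` argument are hypothetical names that do not appear in the notebook, which inlines the same logic in each step.

```python
import json
from pathlib import Path


def compute_or_load(should_compute, path, compute_fn):
    """Compute embeddings and save them, or fall back to a saved JSON file.

    compute_fn is a hypothetical zero-argument callable returning a dict
    that maps each name to its embedding vector (a list of floats).
    """
    path = Path(path)
    if should_compute:
        embeddings = compute_fn()
        # Create the output folder first so the write cannot fail on a missing directory
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "w") as f:
            json.dump(embeddings, f)
        return embeddings
    if path.exists():
        # Flag is off: reuse whatever an earlier run saved
        with open(path) as f:
            return json.load(f)
    # Flag is off and nothing was saved yet
    return {}
```

With a helper like this, each step would reduce to one call per flag, at the cost of hiding the per-item progress output the notebook prints.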

## Steps:
1. **Consolidation Map Embeddings** (optional): Compute one embedding per key in `consolidation_map`, built from the key together with its synonyms, using OpenAI's `text-embedding-3-large` model
2. **Unique Category Embeddings**: Read metadata CSV and compute embeddings for each unique category (excluding 'unclassified')
3. **Save Results**: 
   - Save consolidation map embeddings to JSON (if computed)
   - Save unique category embeddings to JSON (if computed)
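
The steps above send one request per text. With tens of thousands of unique categories, batching several texts per request may be considerably faster. The sketch below assumes that the embeddings endpoint accepts a list of strings as `input` and returns one item per input carrying `index` and `embedding` fields. The names `embed_in_batches` and `openai_embed_batch` are hypothetical, and `batch_size=256` is an arbitrary choice; check the current API limits before relying on it.

```python
def embed_in_batches(texts, embed_batch, batch_size=256):
    """Embed texts in chunks and return a dict mapping each text to its vector.

    embed_batch is any callable that maps a list of strings to a list of
    vectors in the same order.
    """
    embeddings = {}
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError(
                f"Expected {len(batch)} vectors, got {len(vectors)}"
            )
        embeddings.update(zip(batch, vectors))
    return embeddings


def openai_embed_batch(client, model="text-embedding-3-large"):
    """Build an embed_batch callable backed by an OpenAI-style client (assumed API shape)."""
    def embed_batch(batch):
        response = client.embeddings.create(input=batch, model=model)
        # Sort by the index field so vectors line up with the inputs
        ordered = sorted(response.data, key=lambda item: item.index)
        return [item.embedding for item in ordered]
    return embed_batch
```

Under these assumptions, a call such as `embed_in_batches(list(unique_categories), openai_embed_batch(client))` would stand in for the per-category loop. Retry handling is left out here and would still be needed around each batch.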

## Outputs:
- `consolidation_map_embeddings.json`: Embeddings for consolidation map keys (optional)
- `unique_category_embeddings.json`: Embeddings for unique categories only (not per logo row)
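
Both files are written with `json.dump` on a plain dictionary, so each should be a flat JSON object mapping a name to a list of floats. A minimal sketch for reading one back and checking that all vectors share a dimension follows; `load_embeddings` is a hypothetical helper, not part of the notebook.

```python
import json


def load_embeddings(path):
    """Load a {name: vector} JSON file and verify that all vectors have one dimension."""
    with open(path) as f:
        embeddings = json.load(f)
    dims = {len(vector) for vector in embeddings.values()}
    if len(dims) > 1:
        raise ValueError(f"Inconsistent embedding dimensions: {sorted(dims)}")
    return embeddings
```

A dimension check like this is cheap and can catch a file that mixes vectors produced by different models.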

## Next Step:
In a subsequent notebook, these embeddings will be used to:
- Load both embedding JSON files
- Compute cosine similarity between each unique category and consolidation map keys
- Create a mapping from current categories to target categories based on highest similarity
- Apply this mapping to update the metadata
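
The matching step described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the subsequent notebook's code: `cosine_similarity` and `map_to_targets` are hypothetical names, and both arguments are assumed to be dictionaries of the `{name: vector}` form saved by this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def map_to_targets(category_embeddings, target_embeddings):
    """Map each category to the (target key, similarity) pair with the highest similarity."""
    mapping = {}
    for category, vector in category_embeddings.items():
        scores = {
            target: cosine_similarity(vector, target_vector)
            for target, target_vector in target_embeddings.items()
        }
        best = max(scores, key=scores.get)
        mapping[category] = (best, scores[best])
    return mapping
```

For the full dataset, a vectorized implementation (for example a single matrix product over normalized vectors) would likely be much faster than this pure-Python loop. Keeping the similarity score alongside the chosen key makes it possible to review or reject low-confidence matches.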

**Note**: Set your OpenAI API key in the `OPENAI_API_KEY` environment variable before running this notebook.
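
One way to fail early when the key is missing is to check the environment before creating the client. The helper below is a hypothetical sketch and not part of the notebook.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or blank."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

Calling `require_api_key()` in the first cell would surface a clear error before any embedding request is attempted.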

In [1]:
import pandas as pd
from pathlib import Path
import sys

# Add utils folder to path
utils_path = Path('../../utils')
sys.path.append(str(utils_path))

# Import consolidation functions
from consolidation import consolidation_map

In [2]:
import json
import time
from openai import OpenAI

# Initialize OpenAI client (make sure to set your API key)
# You can set it as environment variable: OPENAI_API_KEY
client = OpenAI()

# Paths
input_metadata_csv = Path('../../output/amazing_logos_v4/data/amazing_logos_v4_cleanup/metadata7.csv')
output_consolidation_embeddings = Path('../output/amazing_logos_v4/data/consolidation_map_embeddings.json')
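# Note: this CSV path is only printed below. Step 2 writes the category
# embeddings to unique_category_embeddings.json instead.
output_category_embeddings = Path('../output/amazing_logos_v4/data/category_embeddings.csv')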

print(f"Input metadata CSV: {input_metadata_csv}")
print(f"Output consolidation embeddings: {output_consolidation_embeddings}")
print(f"Output category embeddings: {output_category_embeddings}")

# Check if input file exists
if not input_metadata_csv.exists():
    print(f"ERROR: Input file {input_metadata_csv} does not exist!")
else:
    print(f"Input metadata file exists.")

print(f"Found {len(consolidation_map)} categories in consolidation_map")

Input metadata CSV: ..\output\amazing_logos_v4\data\amazing_logos_v4_metadata7.csv
Output consolidation embeddings: ..\output\amazing_logos_v4\data\consolidation_map_embeddings.json
Output category embeddings: ..\output\amazing_logos_v4\data\category_embeddings.csv
Input metadata file exists.
Found 108 categories in consolidation_map


In [3]:
def get_embedding(text, model="text-embedding-3-large", max_retries=3, delay=1):
    """
    Get embedding for a text using OpenAI's embedding model.
    
    Args:
        text (str): Text to embed
        model (str): OpenAI embedding model to use
        max_retries (int): Maximum number of retries on failure
        delay (float): Base delay in seconds; doubled after each failed attempt
    
    Returns:
        list | None: Embedding vector, or None if the text is empty or all retries fail
    """
    # Clean the text once, before the retry loop
    text = str(text).strip()
    if not text:
        print("Warning: Empty text provided")
        return None

    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(
                input=text,
                model=model
            )
            return response.data[0].embedding
        
        except Exception as e:
            print(f"Attempt {attempt + 1} failed for text '{text[:50]}...': {e}")
            if attempt < max_retries - 1:
                time.sleep(delay * (2 ** attempt))  # Exponential backoff
            else:
                print(f"Failed to get embedding after {max_retries} attempts")
                return None

# Configuration flags to control which embeddings to compute
COMPUTE_CONSOLIDATION_EMBEDDINGS = False  # Set to True if you want to compute consolidation map embeddings
COMPUTE_CATEGORY_EMBEDDINGS = True        # Set to True if you want to compute category embeddings

print("Embedding function defined successfully")
print(f"Configuration:")
print(f"  - Compute consolidation map embeddings: {COMPUTE_CONSOLIDATION_EMBEDDINGS}")
print(f"  - Compute category embeddings: {COMPUTE_CATEGORY_EMBEDDINGS}")

Embedding function defined successfully
Configuration:
  - Compute consolidation map embeddings: False
  - Compute category embeddings: True


In [4]:
# Step 1: Compute embeddings for the keys of consolidation_map
print("\n=== Step 1: Computing embeddings for consolidation_map keys ===")

consolidation_embeddings = {}

if COMPUTE_CONSOLIDATION_EMBEDDINGS:
    total_keys = len(consolidation_map)

    for i, (key, synonyms) in enumerate(consolidation_map.items(), 1):
        print(f"Processing {i}/{total_keys}: {key}")
        # Embed a combined string of the form "Category: <key>, Synonyms: <synonyms>"
        combined_text = f"Category: {key}, Synonyms: {', '.join(synonyms)}"
        embedding = get_embedding(combined_text)

        if embedding is not None:
            consolidation_embeddings[key] = embedding
            print(f"  ✓ Embedded successfully (dim: {len(embedding)})")
        else:
            print(f"  ✗ Failed to embed")
        
        # Small delay to avoid rate limits
        time.sleep(0.1)

    print(f"\nSuccessfully computed {len(consolidation_embeddings)} embeddings out of {total_keys} keys")

    # Save consolidation embeddings to JSON file (create the output folder if needed)
    output_consolidation_embeddings.parent.mkdir(parents=True, exist_ok=True)
    print(f"\nSaving consolidation embeddings to {output_consolidation_embeddings}")
    with open(output_consolidation_embeddings, 'w') as f:
        json.dump(consolidation_embeddings, f, indent=2)
    print("Consolidation embeddings saved successfully")
else:
    print("Skipping consolidation map embeddings (COMPUTE_CONSOLIDATION_EMBEDDINGS = False)")
    # Try to load existing embeddings if available
    if output_consolidation_embeddings.exists():
        print(f"Loading existing consolidation embeddings from {output_consolidation_embeddings}")
        with open(output_consolidation_embeddings, 'r') as f:
            consolidation_embeddings = json.load(f)
        print(f"Loaded {len(consolidation_embeddings)} existing consolidation embeddings")
    else:
        print("No existing consolidation embeddings found")


=== Step 1: Computing embeddings for consolidation_map keys ===
Skipping consolidation map embeddings (COMPUTE_CONSOLIDATION_EMBEDDINGS = False)
Loading existing consolidation embeddings from ..\output\amazing_logos_v4\data\consolidation_map_embeddings.json
Loaded 108 existing consolidation embeddings


In [5]:
# Step 2: Read metadata and compute embeddings for unique categories only
print("\n=== Step 2: Computing embeddings for unique categories ===")

category_embeddings = {}
output_unique_category_embeddings = Path('../output/amazing_logos_v4/data/unique_category_embeddings.json')

if COMPUTE_CATEGORY_EMBEDDINGS:
    # Read metadata
    print("Loading metadata CSV...")
    df_metadata = pd.read_csv(input_metadata_csv)
    print(f"Loaded {len(df_metadata):,} rows")
    print(f"Columns: {list(df_metadata.columns)}")

    # Get unique categories (excluding missing values and 'unclassified')
    # dropna() keeps NaN from being embedded as the literal string "nan"
    categories = df_metadata['category'].dropna()
    unique_categories = categories[categories != 'unclassified'].unique()
    print(f"Found {len(unique_categories)} unique categories (excluding 'unclassified')")

    # Show some examples
    print(f"Example categories: {list(unique_categories[:10])}")

    # Compute embeddings for each unique category
    total_categories = len(unique_categories)

    for i, category in enumerate(unique_categories, 1):
        print(f"Processing {i}/{total_categories}: {category}")
        
        embedding = get_embedding(category)
        
        if embedding is not None:
            category_embeddings[category] = embedding
            print(f"  ✓ Embedded successfully (dim: {len(embedding)})")
        else:
            print(f"  ✗ Failed to embed")
        
        # Small delay to avoid rate limits
        time.sleep(0.1)

    print(f"\nSuccessfully computed {len(category_embeddings)} category embeddings out of {total_categories} categories")
    
    # Save unique category embeddings to JSON file (create the output folder if needed)
    output_unique_category_embeddings.parent.mkdir(parents=True, exist_ok=True)
    print(f"\nSaving unique category embeddings to {output_unique_category_embeddings}")
    with open(output_unique_category_embeddings, 'w') as f:
        json.dump(category_embeddings, f, indent=2)
    print("Unique category embeddings saved successfully")
    
else:
    print("Skipping category embeddings (COMPUTE_CATEGORY_EMBEDDINGS = False)")
    # Try to load existing embeddings if available
    if output_unique_category_embeddings.exists():
        print(f"Loading existing category embeddings from {output_unique_category_embeddings}")
        with open(output_unique_category_embeddings, 'r') as f:
            category_embeddings = json.load(f)
        print(f"Loaded {len(category_embeddings)} existing category embeddings")
    else:
        print("No existing category embeddings found")


=== Step 2: Computing embeddings for unique categories ===
Loading metadata CSV...
Loaded 393,298 rows
Columns: ['id', 'company', 'description', 'category', 'tags']
Found 40499 unique categories (excluding 'unclassified')
Example categories: ['hospitality_services', 'chemical_materials', 'safty_glass', 'park', 'film_video', 'engineering', 'marketing_advertising', 'design_creative', 'jewelry_accessories', 'vacation_recreation']
Processing 1/40499: hospitality_services
  ✓ Embedded successfully (dim: 3072)
Processing 2/40499: chemical_materials
  ✓ Embedded successfully (dim: 3072)

KeyboardInterrupt: 

In [None]:
# Step 3: Summary and final output
print("\n=== Step 3: Summary ===")

# Update paths for the outputs we actually created
output_unique_category_embeddings = Path('../output/amazing_logos_v4/data/unique_category_embeddings.json')

print(f"Configuration used:")
print(f"  - Computed consolidation map embeddings: {COMPUTE_CONSOLIDATION_EMBEDDINGS}")
print(f"  - Computed category embeddings: {COMPUTE_CATEGORY_EMBEDDINGS}")
print()

if COMPUTE_CONSOLIDATION_EMBEDDINGS or output_consolidation_embeddings.exists():
    print(f"Consolidation map embeddings: {len(consolidation_embeddings)} keys")
    if output_consolidation_embeddings.exists():
        print(f"  Saved to: {output_consolidation_embeddings}")
        print(f"  File size: {output_consolidation_embeddings.stat().st_size:,} bytes")

if COMPUTE_CATEGORY_EMBEDDINGS or output_unique_category_embeddings.exists():
    print(f"Unique category embeddings: {len(category_embeddings)} categories")
    if output_unique_category_embeddings.exists():
        print(f"  Saved to: {output_unique_category_embeddings}")
        print(f"  File size: {output_unique_category_embeddings.stat().st_size:,} bytes")

# Show some category embedding examples if available
if category_embeddings:
    print(f"\nExample categories with embeddings:")
    for i, (category, embedding) in enumerate(list(category_embeddings.items())[:5]):
        print(f"  {i+1}. '{category}' (embedding dim: {len(embedding)})")

print(f"\n✓ Embedding preparation completed!")
print(f"\nNext steps:")
print(f"1. Load the unique category embeddings JSON file")
print(f"2. Load the consolidation map embeddings JSON file") 
print(f"3. Compute cosine similarity between each category and consolidation map keys")
print(f"4. Map categories to the most similar consolidation map key")