# LLM-Based Semantic Cluster Analysis

This notebook performs cluster analysis on semantic data using Azure AI Foundry APIs.

## Features
- Handles multi-line semantic data grouped by `sequence_uuid`
- **Three clustering methods**: K-means and DBSCAN (both run on embeddings) plus LLM-based clustering
- **LLM Clustering**: Asks a chat model to propose cluster themes from a sample, then assigns each item to the best-fitting theme
- Uses Azure AI embeddings for semantic understanding
- Automatically generates cluster titles using LLM
- Visualizes clusters and their characteristics

## Prerequisites
Before running this notebook, ensure you have:
1. Azure AI Foundry API credentials
2. Set environment variables:
   - `AZURE_AI_ENDPOINT`: Your Azure AI endpoint URL
   - `AZURE_AI_KEY`: Your Azure AI API key
   - `AZURE_AI_MODEL_NAME`: The model name for embeddings (e.g., 'text-embedding-ada-002')
   - `AZURE_AI_CHAT_MODEL`: The chat model for cluster naming (e.g., 'gpt-4')
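
A quick pre-flight check can catch missing variables before any API call is made. The sketch below is illustrative only: `check_environment` is a hypothetical helper that is not part of the notebook, and it assumes the two model variables fall back to the same defaults the Configuration cell uses.

```python
import os

# Variable names come from the Prerequisites list above.
# check_environment is a hypothetical helper, not part of the notebook.
REQUIRED_VARS = ["AZURE_AI_ENDPOINT", "AZURE_AI_KEY"]
OPTIONAL_VARS = {
    "AZURE_AI_MODEL_NAME": "text-embedding-ada-002",
    "AZURE_AI_CHAT_MODEL": "gpt-4",
}

def check_environment(env=None):
    """Return (missing required names, resolved optional values)."""
    env = os.environ if env is None else env
    # An empty string counts as missing, matching the notebook's own check.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    resolved = {
        name: env.get(name) or default
        for name, default in OPTIONAL_VARS.items()
    }
    return missing, resolved

missing, resolved = check_environment()
if missing:
    print("Missing required variables:", ", ".join(missing))
else:
    print("All required variables are set; models:", resolved)
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try without touching the real environment.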

In [None]:
# Import required libraries
import os
import pandas as pd
import numpy as np
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Azure AI
from azure.ai.inference import ChatCompletionsClient, EmbeddingsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

# Clustering and ML
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Utilities
from dotenv import load_dotenv
from tqdm import tqdm

# Load environment variables
load_dotenv()

print("✓ Libraries imported successfully")

## Configuration

In [None]:
# Azure AI Configuration
AZURE_AI_ENDPOINT = os.getenv('AZURE_AI_ENDPOINT', '')
AZURE_AI_KEY = os.getenv('AZURE_AI_KEY', '')
AZURE_AI_MODEL_NAME = os.getenv('AZURE_AI_MODEL_NAME', 'text-embedding-ada-002')
AZURE_AI_CHAT_MODEL = os.getenv('AZURE_AI_CHAT_MODEL', 'gpt-4')

# Clustering Configuration
NUM_CLUSTERS = 5  # Adjust based on your data (used for kmeans and llm methods)
CLUSTERING_METHOD = 'llm'  # Options: 'kmeans', 'dbscan', 'llm'
# 'llm' asks the chat model to propose themes and then classifies every item with
# its own chat call, so it is slower and uses far more API calls than 'kmeans' or 'dbscan'

# Validate configuration
if not AZURE_AI_ENDPOINT or not AZURE_AI_KEY:
    print("⚠️  Warning: Azure AI credentials not found in environment variables.")
    print("Please set AZURE_AI_ENDPOINT and AZURE_AI_KEY before proceeding.")
else:
    print("✓ Configuration loaded successfully")

## Initialize Azure AI Clients

In [None]:
def initialize_azure_clients():
    """
    Initialize Azure AI clients for embeddings and chat completions.
    
    Returns:
        Tuple[EmbeddingsClient, ChatCompletionsClient]: Initialized clients
    """
    try:
        credential = AzureKeyCredential(AZURE_AI_KEY)
        
        embeddings_client = EmbeddingsClient(
            endpoint=AZURE_AI_ENDPOINT,
            credential=credential
        )
        
        chat_client = ChatCompletionsClient(
            endpoint=AZURE_AI_ENDPOINT,
            credential=credential
        )
        
        print("✓ Azure AI clients initialized successfully")
        return embeddings_client, chat_client
    except Exception as e:
        print(f"❌ Error initializing Azure AI clients: {str(e)}")
        return None, None

# Initialize clients
if AZURE_AI_ENDPOINT and AZURE_AI_KEY:
    embeddings_client, chat_client = initialize_azure_clients()
else:
    embeddings_client, chat_client = None, None
    print("⚠️  Skipping client initialization due to missing credentials")

## Data Loading and Preprocessing

In [None]:
def load_and_group_data(file_path: str) -> pd.DataFrame:
    """
    Load semantic data from CSV and group by sequence_uuid.
    
    Args:
        file_path: Path to the CSV file with columns: sequence_uuid, semantic_data
        
    Returns:
        DataFrame with grouped semantic data
    """
    # Load data
    df = pd.read_csv(file_path)
    print(f"✓ Loaded {len(df)} rows from {file_path}")
    
    # Validate required columns
    required_columns = ['sequence_uuid', 'semantic_data']
    missing_columns = [col for col in required_columns if col not in df.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns: {missing_columns}")
    
    # Group by sequence_uuid and combine semantic data
    grouped_df = df.groupby('sequence_uuid').agg({
        'semantic_data': lambda x: ' '.join(x.astype(str))
    }).reset_index()
    
    grouped_df.rename(columns={'semantic_data': 'combined_semantic_data'}, inplace=True)
    
    print(f"✓ Grouped into {len(grouped_df)} unique sequence_uuids")
    
    return grouped_df

# Load sample data
data_file = 'sample_data.csv'
if os.path.exists(data_file):
    df_grouped = load_and_group_data(data_file)
    print("\nSample grouped data:")
    print(df_grouped.head())
else:
    print(f"⚠️  Data file '{data_file}' not found. Please provide your data file.")
    df_grouped = None

## Generate Embeddings using Azure AI

In [None]:
def get_embeddings(texts: List[str], client: EmbeddingsClient, batch_size: int = 10) -> np.ndarray:
    """
    Generate embeddings for a list of texts using Azure AI.
    
    Args:
        texts: List of text strings to embed
        client: Azure AI EmbeddingsClient
        batch_size: Number of texts to process in each batch
        
    Returns:
        NumPy array of embeddings
    """
    all_embeddings = []
    
    for i in tqdm(range(0, len(texts), batch_size), desc="Generating embeddings"):
        batch = texts[i:i + batch_size]
        
        try:
            response = client.embed(
                input=batch,
                model=AZURE_AI_MODEL_NAME
            )
            
            batch_embeddings = [item.embedding for item in response.data]
            all_embeddings.extend(batch_embeddings)
            
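        except Exception as e:
            print(f"❌ Error generating embeddings for batch {i}: {str(e)}")
            # Use zero vectors for the failed batch, matching the dimension of the
            # embeddings received so far (1536, the ada-002 size, if none yet)
            dim = len(all_embeddings[0]) if all_embeddings else 1536
            all_embeddings.extend([np.zeros(dim) for _ in batch])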
    
    return np.array(all_embeddings)

# Generate embeddings
if df_grouped is not None and embeddings_client is not None:
    texts = df_grouped['combined_semantic_data'].tolist()
    embeddings = get_embeddings(texts, embeddings_client)
    print(f"\n✓ Generated embeddings with shape: {embeddings.shape}")
    
    # Add embeddings to dataframe
    df_grouped['embedding'] = list(embeddings)
else:
    print("⚠️  Skipping embedding generation")
    embeddings = None

## Perform Clustering

In [None]:
def perform_clustering(embeddings: np.ndarray, texts: List[str], method: str = 'kmeans', 
                       n_clusters: int = 5, chat_client: ChatCompletionsClient = None) -> Tuple[np.ndarray, object]:
    """
    Perform clustering on embeddings.
    
    Args:
        embeddings: NumPy array of embeddings
        texts: List of text strings (required for LLM clustering)
        method: Clustering method ('kmeans', 'dbscan', or 'llm')
        n_clusters: Number of clusters (for kmeans and llm)
        chat_client: Azure AI ChatCompletionsClient (required for LLM clustering)
        
    Returns:
        Tuple of (cluster labels, clustering model)
    """
    if method == 'llm':
        print("Using LLM-based clustering...")
        labels, model = perform_llm_clustering(texts, chat_client, n_clusters)
        return labels, model
    
    # Traditional clustering methods
    # Standardize embeddings
    scaler = StandardScaler()
    embeddings_scaled = scaler.fit_transform(embeddings)
    
    if method == 'kmeans':
        model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
        labels = model.fit_predict(embeddings_scaled)
        
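    elif method == 'dbscan':
        # eps=0.5 is only a starting point: distances between standardized
        # high-dimensional embeddings are usually much larger, so tune eps if
        # most points come back as noise (label -1)
        model = DBSCAN(eps=0.5, min_samples=2)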
        labels = model.fit_predict(embeddings_scaled)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        
    else:
        raise ValueError(f"Unknown clustering method: {method}")
    
    # Calculate silhouette score (defined only for 2 to n_samples - 1 distinct labels)
    n_found = len(set(labels))
    if 1 < n_found < len(labels):
        score = silhouette_score(embeddings_scaled, labels)
        print(f"✓ Clustering completed with {n_found} clusters")
        print(f"  Silhouette Score: {score:.3f}")
    else:
        print(f"✓ Clustering completed with {n_found} cluster(s); silhouette score skipped")
    
    return labels, model


def perform_llm_clustering(texts: List[str], client: ChatCompletionsClient, n_clusters: int = 5) -> Tuple[np.ndarray, Dict]:
    """
    Perform clustering using LLM to semantically group texts.
    
    This method asks the LLM to propose cluster themes and then classifies each text
    against them. It can capture semantic nuance that distance-based methods miss,
    but it is slower and makes one chat call per text.
    
    Args:
        texts: List of text strings to cluster
        client: Azure AI ChatCompletionsClient
        n_clusters: Desired number of clusters
        
    Returns:
        Tuple of (cluster labels array, dict with 'method' and 'cluster_themes')
    """
    import json as json_lib
    
    if client is None:
        raise ValueError("ChatCompletionsClient is required for LLM clustering")
    
    # Two-step approach: first ask the LLM to propose cluster themes from a
    # sample of the data, then assign every item to one of those themes
    
    print(f"Step 1: Analyzing {len(texts)} items to identify {n_clusters} semantic clusters...")
    
    # Create a sample of texts for initial cluster identification
    sample_size = min(20, len(texts))
    sample_indices = np.linspace(0, len(texts)-1, sample_size, dtype=int)
    sample_texts = [texts[i] for i in sample_indices]
    
    # Ask LLM to identify cluster themes
    sample_formatted = "\n".join([f"{i+1}. {text[:200]}" for i, text in enumerate(sample_texts)])
    
    cluster_identification_prompt = f"""Analyze the following semantic data samples and identify {n_clusters} distinct thematic clusters.

Samples:
{sample_formatted}

Provide exactly {n_clusters} cluster themes as a JSON array. Each theme should be a concise description (3-6 words).

Format your response as valid JSON:
{{"clusters": ["Theme 1", "Theme 2", ...]}}

Provide ONLY the JSON, no other text."""

    try:
        response = client.complete(
            messages=[
                SystemMessage(content="You are a data analyst expert at identifying semantic patterns and themes in text data. Always respond with valid JSON."),
                UserMessage(content=cluster_identification_prompt)
            ],
            model=AZURE_AI_CHAT_MODEL,
            temperature=0.3,
            max_tokens=500
        )
        
        response_text = response.choices[0].message.content.strip()
        # Try to extract JSON from response
        if "```json" in response_text:
            response_text = response_text.split("```json")[1].split("```")[0].strip()
        elif "```" in response_text:
            response_text = response_text.split("```")[1].split("```")[0].strip()
        
        cluster_themes = json_lib.loads(response_text)["clusters"][:n_clusters]
        
        # Ensure we have exactly n_clusters
        while len(cluster_themes) < n_clusters:
            cluster_themes.append(f"Cluster {len(cluster_themes) + 1}")
        cluster_themes = cluster_themes[:n_clusters]
        
        print(f"✓ Identified cluster themes: {cluster_themes}")
        
    except Exception as e:
        print(f"⚠️  Error identifying clusters with LLM: {str(e)}")
        print("  Falling back to generic cluster names")
        cluster_themes = [f"Cluster {i+1}" for i in range(n_clusters)]
    
    # Step 2: Assign each text to a cluster
    print(f"\nStep 2: Assigning items to clusters...")
    labels = []
    
    # Each text is classified in its own request; batch_size only controls
    # how often the progress bar advances
    batch_size = 5
    
    for i in tqdm(range(0, len(texts), batch_size), desc="Assigning clusters"):
        batch_texts = texts[i:i+batch_size]
        batch_labels = []
        
        for text in batch_texts:
            # Truncate very long texts
            text_truncated = text[:500] if len(text) > 500 else text
            
            themes_formatted = "\n".join([f"{j}. {theme}" for j, theme in enumerate(cluster_themes)])
            
            assignment_prompt = f"""Given the following semantic data, assign it to the most appropriate cluster.

Semantic data:
{text_truncated}

Available clusters:
{themes_formatted}

Respond with ONLY the cluster number (0-{n_clusters-1}), nothing else."""

            try:
                response = client.complete(
                    messages=[
                        SystemMessage(content="You are a data classification expert. Respond only with the cluster number."),
                        UserMessage(content=assignment_prompt)
                    ],
                    model=AZURE_AI_CHAT_MODEL,
                    temperature=0.1,
                    max_tokens=10
                )
                
                cluster_num = response.choices[0].message.content.strip()
                # Extract first number found
                import re
                numbers = re.findall(r'\d+', cluster_num)
                if numbers:
                    cluster_id = int(numbers[0])
                    # Ensure valid cluster ID
                    cluster_id = max(0, min(cluster_id, n_clusters - 1))
                else:
                    cluster_id = 0
                
                batch_labels.append(cluster_id)
                
            except Exception as e:
                print(f"⚠️  Error assigning text to cluster: {str(e)}")
                # Assign to cluster 0 as fallback
                batch_labels.append(0)
        
        labels.extend(batch_labels)
    
    labels_array = np.array(labels)
    
    print(f"\n✓ LLM clustering completed with {len(set(labels))} clusters")
    print(f"  Cluster distribution: {np.bincount(labels_array)}")
    
    # Store cluster themes for later use
    llm_model_info = {
        'method': 'llm',
        'cluster_themes': cluster_themes
    }
    
    return labels_array, llm_model_info


# Perform clustering
if embeddings is not None:
    texts_for_clustering = df_grouped['combined_semantic_data'].tolist()
    
    cluster_labels, clustering_model = perform_clustering(
        embeddings, 
        texts_for_clustering,
        method=CLUSTERING_METHOD, 
        n_clusters=NUM_CLUSTERS,
        chat_client=chat_client
    )
    
    # Add cluster labels to dataframe
    df_grouped['cluster'] = cluster_labels
    
    # If LLM clustering was used, store the themes
    if CLUSTERING_METHOD == 'llm' and isinstance(clustering_model, dict):
        llm_cluster_themes = clustering_model.get('cluster_themes', [])
    else:
        llm_cluster_themes = None
    
    # Display cluster distribution
    print("\nCluster distribution:")
    print(df_grouped['cluster'].value_counts().sort_index())
else:
    print("⚠️  Skipping clustering")
    cluster_labels = None
    llm_cluster_themes = None

## Generate Cluster Titles using LLM

In [None]:
def generate_cluster_title(cluster_texts: List[str], client: ChatCompletionsClient) -> str:
    """
    Generate a descriptive title for a cluster using LLM.
    
    Args:
        cluster_texts: List of texts in the cluster
        client: Azure AI ChatCompletionsClient
        
    Returns:
        Generated cluster title
    """
    # Prepare sample texts (limit to avoid token limits)
    sample_size = min(10, len(cluster_texts))
    sample_texts = cluster_texts[:sample_size]
    
    # Create prompt
    texts_formatted = "\n".join([f"{i+1}. {text}" for i, text in enumerate(sample_texts)])
    
    prompt = f"""Analyze the following semantic data entries and provide a concise, descriptive title (3-6 words) that captures the main theme or topic:

{texts_formatted}

Provide only the title, nothing else."""
    
    try:
        response = client.complete(
            messages=[
                SystemMessage(content="You are a helpful assistant that analyzes text data and provides concise, descriptive titles."),
                UserMessage(content=prompt)
            ],
            model=AZURE_AI_CHAT_MODEL,
            temperature=0.3,
            max_tokens=50
        )
        
        title = response.choices[0].message.content.strip()
        # Remove quotes if present
        title = title.strip('"\'')
        return title
        
    except Exception as e:
        print(f"❌ Error generating cluster title: {str(e)}")
        return "Cluster (Error generating title)"

def generate_all_cluster_titles(df: pd.DataFrame, client: ChatCompletionsClient, 
                                llm_themes: List[str] = None) -> Dict[int, str]:
    """
    Generate titles for all clusters.
    
    Args:
        df: DataFrame with cluster assignments
        client: Azure AI ChatCompletionsClient
        llm_themes: Pre-generated themes from LLM clustering (optional)
        
    Returns:
        Dictionary mapping cluster ID to title
    """
    cluster_titles = {}
    
    # If LLM themes are available, use them directly
    if llm_themes is not None and len(llm_themes) > 0:
        print("Using LLM-generated cluster themes...")
        for cluster_id in sorted(df['cluster'].unique()):
            if cluster_id < len(llm_themes):
                cluster_titles[cluster_id] = llm_themes[cluster_id]
                print(f"Cluster {cluster_id}: {llm_themes[cluster_id]}")
            else:
                cluster_titles[cluster_id] = f"Cluster {cluster_id}"
        return cluster_titles
    
    # Otherwise, generate titles using LLM
    print("Generating cluster titles using LLM analysis...")
    for cluster_id in sorted(df['cluster'].unique()):
        cluster_data = df[df['cluster'] == cluster_id]
        cluster_texts = cluster_data['combined_semantic_data'].tolist()
        
        print(f"Generating title for Cluster {cluster_id} ({len(cluster_texts)} items)...")
        title = generate_cluster_title(cluster_texts, client)
        cluster_titles[cluster_id] = title
        print(f"  → {title}")
    
    return cluster_titles

# Generate cluster titles
if df_grouped is not None and cluster_labels is not None and chat_client is not None:
    print("\nGenerating cluster titles...")
    cluster_titles = generate_all_cluster_titles(df_grouped, chat_client, llm_cluster_themes)
    
    # Add titles to dataframe
    df_grouped['cluster_title'] = df_grouped['cluster'].map(cluster_titles)
    
    print("\n✓ Cluster titles generated successfully")
else:
    print("⚠️  Skipping cluster title generation")
    cluster_titles = None

## Visualize Clusters

In [None]:
def visualize_clusters(embeddings: np.ndarray, labels: np.ndarray, titles: Dict[int, str] = None):
    """
    Visualize clusters using PCA dimensionality reduction.
    
    Args:
        embeddings: NumPy array of embeddings
        labels: Cluster labels
        titles: Dictionary mapping cluster ID to title
    """
    # Reduce dimensions to 2D using PCA
    pca = PCA(n_components=2, random_state=42)
    embeddings_2d = pca.fit_transform(embeddings)
    
    # Create plot
    plt.figure(figsize=(12, 8))
    
    # Get unique clusters
    unique_labels = sorted(set(labels))
    colors = sns.color_palette('husl', len(unique_labels))
    
    # Plot each cluster
    for i, cluster_id in enumerate(unique_labels):
        mask = labels == cluster_id
        label = titles[cluster_id] if titles else f"Cluster {cluster_id}"
        
        plt.scatter(
            embeddings_2d[mask, 0],
            embeddings_2d[mask, 1],
            c=[colors[i]],
            label=label,
            alpha=0.6,
            s=100,
            edgecolors='black',
            linewidth=0.5
        )
    
    plt.xlabel('First Principal Component', fontsize=12)
    plt.ylabel('Second Principal Component', fontsize=12)
    plt.title('Semantic Data Clusters (PCA Visualization)', fontsize=14, fontweight='bold')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    
    variance_explained = pca.explained_variance_ratio_
    print(f"\nPCA Variance Explained:")
    print(f"  PC1: {variance_explained[0]:.2%}")
    print(f"  PC2: {variance_explained[1]:.2%}")
    print(f"  Total: {sum(variance_explained):.2%}")
    
    plt.show()

# Visualize clusters
if embeddings is not None and cluster_labels is not None:
    visualize_clusters(embeddings, cluster_labels, cluster_titles)
else:
    print("⚠️  Skipping visualization")

## Display Cluster Summary

In [None]:
def display_cluster_summary(df: pd.DataFrame):
    """
    Display a summary of all clusters with sample data.
    
    Args:
        df: DataFrame with cluster assignments and titles
    """
    print("=" * 80)
    print("CLUSTER ANALYSIS SUMMARY")
    print("=" * 80)
    
    for cluster_id in sorted(df['cluster'].unique()):
        cluster_data = df[df['cluster'] == cluster_id]
        title = cluster_data.iloc[0]['cluster_title'] if 'cluster_title' in cluster_data.columns else f"Cluster {cluster_id}"
        
        print(f"\n{'─' * 80}")
        print(f"CLUSTER {cluster_id}: {title}")
        print(f"{'─' * 80}")
        print(f"Size: {len(cluster_data)} items")
        print(f"\nSample entries:")
        
        # Show up to 3 sample entries
        for idx, (_, row) in enumerate(cluster_data.head(3).iterrows(), 1):
            text = row['combined_semantic_data']
            # Truncate long texts
            if len(text) > 150:
                text = text[:150] + "..."
            print(f"  {idx}. [{row['sequence_uuid']}] {text}")
        
        if len(cluster_data) > 3:
            print(f"  ... and {len(cluster_data) - 3} more items")
    
    print(f"\n{'=' * 80}")
    print(f"Total: {len(df)} items across {len(df['cluster'].unique())} clusters")
    print(f"{'=' * 80}")

# Display summary
if df_grouped is not None and cluster_labels is not None:
    display_cluster_summary(df_grouped)
else:
    print("⚠️  No cluster data to display")

## Export Results

In [None]:
def export_results(df: pd.DataFrame, output_file: str = 'clustered_results.csv'):
    """
    Export clustering results to CSV.
    
    Args:
        df: DataFrame with cluster assignments
        output_file: Output file path
    """
    # Prepare export dataframe (exclude embedding column)
    export_df = df.copy()
    if 'embedding' in export_df.columns:
        export_df = export_df.drop(columns=['embedding'])
    
    # Save to CSV
    export_df.to_csv(output_file, index=False)
    print(f"✓ Results exported to {output_file}")
    
    return export_df

# Export results
if df_grouped is not None and cluster_labels is not None:
    results_df = export_results(df_grouped)
    print("\nExported columns:", results_df.columns.tolist())
else:
    print("⚠️  No results to export")

## Usage Instructions

### Step 1: Set up Environment Variables
Create a `.env` file in the same directory with:
```
AZURE_AI_ENDPOINT=https://your-endpoint.cognitiveservices.azure.com/
AZURE_AI_KEY=your-api-key-here
AZURE_AI_MODEL_NAME=text-embedding-ada-002
AZURE_AI_CHAT_MODEL=gpt-4
```

### Step 2: Prepare Your Data
Create a CSV file with the following structure:
- `sequence_uuid`: Unique identifier for grouping related semantic data
- `semantic_data`: The text content to be analyzed

Example:
```csv
sequence_uuid,semantic_data
seq_001,First line of semantic data
seq_001,Second line related to seq_001
seq_002,Different semantic data
```
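
To make the grouping step concrete, the sketch below reproduces what `load_and_group_data()` does to the example above, using only the standard library instead of pandas. `group_rows` is a hypothetical name chosen for this illustration.

```python
import csv
import io
from collections import defaultdict

# The example CSV from Step 2, inlined so the sketch runs without a file.
SAMPLE_CSV = """sequence_uuid,semantic_data
seq_001,First line of semantic data
seq_001,Second line related to seq_001
seq_002,Different semantic data
"""

def group_rows(csv_text):
    """Join semantic_data rows that share a sequence_uuid with single spaces."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["sequence_uuid"]].append(row["semantic_data"])
    return {uuid: " ".join(parts) for uuid, parts in grouped.items()}

for uuid, text in group_rows(SAMPLE_CSV).items():
    print(uuid, "->", text)
```

Both `seq_001` rows end up in a single combined string, which is the text that gets embedded and clustered.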

### Step 3: Configure Clustering
In the Configuration cell, adjust:
- `NUM_CLUSTERS`: Number of clusters (for K-means and LLM methods)
- `CLUSTERING_METHOD`: Choose from:
  - **'llm'** (Recommended): Uses a chat model to group data by meaning
    - Can capture semantic nuance that distance-based methods miss
    - Automatically identifies cluster themes
    - Slower: makes one chat call per item
  - **'kmeans'**: Traditional K-means clustering
    - Fast and efficient
    - Good for well-separated clusters
    - Requires knowing the number of clusters
  - **'dbscan'**: Density-based clustering
    - Automatically detects number of clusters
    - Good for arbitrary-shaped clusters
    - May create noise/outlier category
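
scikit-learn's DBSCAN marks noise points with the label `-1`, so that label should not be counted as a cluster. The sketch below mirrors the counting expression used in `perform_clustering()`; `summarize_labels` is a hypothetical helper and the label list is made-up illustration data.

```python
def summarize_labels(labels):
    """Count real clusters and noise points in a DBSCAN-style label list."""
    labels = list(labels)
    n_noise = sum(1 for label in labels if label == -1)
    # Same expression the notebook uses: drop the noise label from the count.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return n_clusters, n_noise

# Made-up labels for illustration only.
n_clusters, n_noise = summarize_labels([0, 0, 1, -1, 1, -1, 2])
print(f"{n_clusters} clusters, {n_noise} noise points")
```

If nearly every point is reported as noise, that usually suggests `eps` is too small for the data.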

### Step 4: Run the Notebook
Execute all cells in order. The notebook will:
1. Load and group your data by `sequence_uuid`
2. Generate embeddings using Azure AI
3. Perform clustering (using your chosen method)
4. Generate descriptive titles for each cluster
5. Visualize the results
6. Export the clustered data

### Clustering Method Comparison

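| Method | Speed | Accuracy | Chat API Calls | Best For |
|--------|-------|----------|----------------|----------|
| LLM | Slow | High | One per item, plus one for themes | Semantic understanding, complex themes |
| K-means | Fast | Medium | One per cluster (titles only) | Large datasets, known cluster count |
| DBSCAN | Medium | Medium | One per cluster (titles only) | Automatic cluster detection |

All three methods also call the embeddings API once per batch of 10 items.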
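
The gap in API usage can be estimated from the notebook's loops. This is a rough sketch under stated assumptions: `estimate_api_calls` is a hypothetical helper, it ignores retries and failures, and for DBSCAN the number of clusters (including the noise group, which also gets a title) is only known after clustering has run.

```python
import math

def estimate_api_calls(n_items, method, n_clusters, embed_batch_size=10):
    """Rough request counts implied by the notebook's loops (no retries)."""
    # Embeddings are generated for every method, one request per batch.
    embedding_calls = math.ceil(n_items / embed_batch_size)
    if method == "llm":
        # One theme-identification call plus one assignment call per item.
        # Titles reuse the themes, so naming adds no further calls.
        chat_calls = 1 + n_items
    elif method in ("kmeans", "dbscan"):
        # Clustering runs locally; titles need one chat call per cluster.
        chat_calls = n_clusters
    else:
        raise ValueError(f"Unknown clustering method: {method}")
    return {"embedding": embedding_calls, "chat": chat_calls}

for method in ("llm", "kmeans"):
    print(method, estimate_api_calls(1000, method, n_clusters=5))
```

For 1,000 grouped items and 5 clusters this gives 1,001 chat calls for the LLM method against 5 for K-means, with 100 embedding calls in both cases.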

### Customization
- **LLM clustering parameters**: 
  - Adjust `NUM_CLUSTERS` to control how many semantic groups to identify
  - The LLM proposes cluster themes from an evenly spaced sample of up to 20 items
- **Traditional clustering**: 
  - Modify `CLUSTERING_METHOD` to 'kmeans' or 'dbscan'
  - Adjust `NUM_CLUSTERS` for K-means
- **Batch processing**: 
  - Modify `batch_size` in `get_embeddings()` if you hit rate limits
  - LLM clustering sends one chat request per item; the `batch_size` of 5 in `perform_llm_clustering()` only groups progress updates
- **Visualization**: 
  - Customize colors and plot settings in `visualize_clusters()`

### Tips for LLM Clustering
- Works best with 3-10 clusters for optimal semantic grouping
- Larger datasets will take longer due to individual item classification
- Consider using a sample of your data first to test cluster quality
- The LLM identifies themes first, then assigns each item to the best-fit cluster
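
Because the assignment step relies on the model replying with a bare number, the reply has to be parsed defensively. The sketch below isolates the parsing logic used in `perform_llm_clustering()`: take the first integer in the reply, clamp it to a valid cluster ID, and fall back to cluster 0 when no number is present. `parse_cluster_reply` is a hypothetical name for this illustration.

```python
import re

def parse_cluster_reply(reply, n_clusters, fallback=0):
    """Extract the first integer from a model reply and clamp it to a valid ID."""
    numbers = re.findall(r"\d+", reply)
    if not numbers:
        # No digits at all: use the fallback cluster, as the notebook does.
        return fallback
    # Clamp into the range 0 .. n_clusters - 1.
    return max(0, min(int(numbers[0]), n_clusters - 1))

for reply in ["2", "Cluster 3.", "9", "none of these"]:
    print(repr(reply), "->", parse_cluster_reply(reply, n_clusters=5))
```

Clamping and the cluster 0 fallback keep the pipeline running, but they can silently inflate cluster 0 and the highest-numbered cluster, so it may be worth checking the cluster distribution for unexpected spikes.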