# AI Revolution: Unsupervised Clustering Analysis

**Objective:**
The "expert" Dr. Chen has manually labeled companies as **AI Makers** (building the tech) or **AI Users** (applying the tech).
We suspect this binary view is too simple.

**Methodology:**
1.  **Embeddings**: Use our best NLP embeddings (Nomic/MPNet) to capture semantic business descriptions.
2.  **UMAP**: Reduce dimensionality to visualize the "AI Manifold."
3.  **HDBSCAN**: Automatically detect clusters based on density (finding the "real" groups without specifying $k$).
4.  **Tuning Loop**: Optimize parameters to maximize the Adjusted Rand Index (ARI) against the expert labels, then analyze where the model *disagrees* with the expert.
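To make the tuning target concrete, here is a minimal pair-counting sketch of the Adjusted Rand Index using only the standard library. The notebook itself uses `sklearn.metrics.adjusted_rand_score`; the function name `adjusted_rand_index` and the toy labels below are illustrative assumptions, not part of the project code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as trivially identical

    # Pairs that share a label in both labelings, in A only, and in B only.
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # best achievable agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are trivial
    return (sum_joint - expected) / (max_index - expected)


# Toy labels (hypothetical): cluster IDs are arbitrary, only the grouping matters.
expert = ["Maker", "Maker", "User", "User"]
same_grouping = [1, 1, 0, 0]   # same partition, different names
crossed = [0, 1, 0, 1]         # splits every expert group in half

print(adjusted_rand_index(expert, same_grouping))
print(adjusted_rand_index(expert, crossed))
```

The first call scores a perfect match even though the cluster IDs do not resemble the expert's names, which is why ARI suits comparing HDBSCAN's arbitrary cluster numbers with categorical labels. The second call scores below zero, meaning worse agreement than chance.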

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import umap
import hdbscan
import plotly.express as px
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

# Custom modules
from adv_hedging.data.loaders import load_wiki_data
from adv_hedging.constants import AI_MAKERS, AI_USERS, AI_CATEGORIES
from adv_hedging.config import PROCESSED_DATA_DIR

# Set Plotly default size
import plotly.io as pio
pio.templates.default = "plotly_white"

## 1. Prepare the AI Universe
We filter our 650-stock universe down to just the 50+ companies identified by the expert.

In [2]:
df_wiki = load_wiki_data()

# Load embeddings (Nomic preferred)
nomic_path = PROCESSED_DATA_DIR / "nomic_embeddings.parquet"
if nomic_path.exists():
    print("Loading Nomic embeddings...")
    df_nomic = pd.read_parquet(nomic_path)
    if 'embedding_nomic' not in df_wiki.columns:
        df_wiki = df_wiki.merge(df_nomic, on='ticker', how='left')
    EMBEDDING_COL = 'embedding_nomic'
else:
    print("⚠️ Using MPNet embeddings (Nomic not found).")
    EMBEDDING_COL = 'embedding_mpnet'

# Filter for AI Universe
ai_universe = set(AI_MAKERS + AI_USERS)
df_ai = df_wiki[df_wiki['ticker'].isin(ai_universe)].copy()

# Apply Expert Labels
df_ai['Expert_Label'] = df_ai['ticker'].map(AI_CATEGORIES)

# Extract Embeddings Matrix
# Ensure we drop any missing embeddings
df_ai = df_ai.dropna(subset=[EMBEDDING_COL])
embeddings = np.stack(df_ai[EMBEDDING_COL].values)

print(f"AI Universe: {len(df_ai)} companies.")
df_ai[['ticker', 'name', 'Expert_Label']].head()

Loading Nomic embeddings...
AI Universe: 51 companies.


| | ticker | name | Expert_Label |
|---|---|---|---|
| 7 | XYZ | Block Inc | User |
| 19 | WMT | Walmart Inc | User |
| 26 | WFC | Wells Fargo & Co | User |
| 28 | WDC | Western Digital Corp | Maker |
| 29 | WDAY | Workday Inc | User |


## 2. Hyperparameter Tuning Loop
We don't know the best `n_neighbors` and `min_dist` for UMAP or `min_cluster_size` for HDBSCAN.
We run a grid search to find the parameters whose clusters are **most aligned** with the expert labels (highest ARI).

* **Logic**: A high score means the model "agrees" with the expert.
* **Insight**: We actually want a *moderately* high score. A perfect score means we learned nothing new. A moderate score implies we found structure, but perhaps slightly different (better?) structure than the expert's.
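One caveat when reading the scores: the ARI in this loop is computed only on points that HDBSCAN did not mark as noise (`-1`), so a configuration can score well while discarding many companies. The sketch below shows one possible way to track that. The helper names `noise_coverage` and `coverage_weighted_score`, and the example ARI values, are hypothetical and not part of the notebook's pipeline.

```python
def noise_coverage(labels, noise_label=-1):
    """Fraction of points assigned to a real cluster (i.e. not noise)."""
    if len(labels) == 0:
        return 0.0
    kept = sum(1 for lab in labels if lab != noise_label)
    return kept / len(labels)


def coverage_weighted_score(ari_on_kept, labels, noise_label=-1):
    """Hypothetical selection score: masked ARI scaled by coverage."""
    return ari_on_kept * noise_coverage(labels, noise_label)


# Illustrative inputs only: two made-up configurations with made-up ARI values.
labels_sparse = [0, 0, 1, 1, -1, -1, -1, -1]   # half the points are noise
labels_dense = [0, 0, 0, 1, 1, 1, 1, 0]        # every point is clustered

score_sparse = coverage_weighted_score(0.6, labels_sparse)
score_dense = coverage_weighted_score(0.4, labels_dense)
print(noise_coverage(labels_sparse), score_sparse)
print(noise_coverage(labels_dense), score_dense)
```

Under this weighting the dense configuration ranks higher despite its lower masked ARI. Whether that trade-off is desirable depends on how much unclassified "noise" the analysis is willing to tolerate.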

In [3]:
# Parameters to test
umap_neighbors = [5, 10, 15, 20]
umap_dists = [0.0, 0.1, 0.25]
hdbscan_sizes = [3, 4, 5]

best_score = -1
best_params = {}
results_log = []

print("Running Grid Search...")

for n in umap_neighbors:
    for d in umap_dists:
        # 1. Run UMAP
        # We set random_state for reproducibility
        reducer = umap.UMAP(n_neighbors=n, min_dist=d, n_components=5, metric='cosine', random_state=42)
        u_emb = reducer.fit_transform(embeddings)
        
        for s in hdbscan_sizes:
            # 2. Run HDBScan
            clusterer = hdbscan.HDBSCAN(min_cluster_size=s, min_samples=1)
            labels = clusterer.fit_predict(u_emb)
            
            # 3. Calculate Score (ignoring noise points labeled -1)
            mask = labels != -1
            if np.sum(mask) > 10: # Only score if we have enough clustered points
                ari = adjusted_rand_score(df_ai['Expert_Label'][mask], labels[mask])
                
                results_log.append({
                    'n_neighbors': n, 'min_dist': d, 'min_cluster_size': s, 
                    'ARI': ari, 'Clusters': len(set(labels)) - (1 if -1 in labels else 0)
                })
                
                if ari > best_score:
                    best_score = ari
                    best_params = {'n_neighbors': n, 'min_dist': d, 'min_cluster_size': s}

print(f"Best ARI: {best_score:.3f}")
print(f"Best Params: {best_params}")

# Show top 5 configs
pd.DataFrame(results_log).sort_values('ARI', ascending=False).head()

Running Grid Search...



Best ARI: 0.154
Best Params: {'n_neighbors': 5, 'min_dist': 0.0, 'min_cluster_size': 4}




| | n_neighbors | min_dist | min_cluster_size | ARI | Clusters |
|---|---|---|---|---|---|
| 2 | 5 | 0.0 | 5 | 0.154061 | 2 |
| 4 | 5 | 0.1 | 4 | 0.154061 | 2 |
| 5 | 5 | 0.1 | 5 | 0.154061 | 2 |
| 6 | 5 | 0.25 | 3 | 0.154061 | 2 |
| 7 | 5 | 0.25 | 4 | 0.154061 | 2 |


## 3. Visualize the "True" Structure
Now we run the best model and visualize it interactively.
We project down to 2D for the plot.

In [4]:
# 1. Final UMAP (2D for Viz)
reducer_2d = umap.UMAP(
    n_neighbors=best_params['n_neighbors'], 
    min_dist=best_params['min_dist'], 
    n_components=2, 
    metric='cosine', 
    random_state=42
)
coords = reducer_2d.fit_transform(embeddings)

df_ai['x'] = coords[:, 0]
df_ai['y'] = coords[:, 1]

# 2. Final Clustering
# Note: the grid search clustered the 5D embedding. Here we re-cluster the 2D
# projection for simplicity, so the clusters may differ from the tuned run.
clusterer = hdbscan.HDBSCAN(min_cluster_size=best_params['min_cluster_size'], min_samples=1)
df_ai['Cluster_ID'] = clusterer.fit_predict(coords)
df_ai['Cluster_Label'] = df_ai['Cluster_ID'].apply(lambda x: f"Cluster {x}" if x != -1 else "Noise")

# 3. Interactive Plot
fig = px.scatter(
    df_ai, x='x', y='y',
    color='Expert_Label',       # Color = What Dr. Chen thinks
    symbol='Cluster_Label',     # Shape = What the Data says
    hover_data=['ticker', 'name', 'sector', 'Cluster_Label'],
    title=f"AI Landscape: Expert Labels vs. Data-Driven Clusters (ARI: {best_score:.2f})",
    width=1000, height=700
)

# Update layout for readability
fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=1.05))
fig.show()



## 4. Analysis: Where was the Expert Wrong?
Let's look at the clusters to find the insights.

* **Cluster Composition**: Are "Makers" and "Users" mixing?
* **The "Noise"**: Which companies are so unique they defy classification?

In [5]:
# Show composition of each cluster
crosstab = pd.crosstab(df_ai['Cluster_Label'], df_ai['Expert_Label'])
display(crosstab)

print("-" * 50)
print("INSIGHT GENERATOR:")
for cluster in sorted(df_ai['Cluster_Label'].unique()):
    if cluster == "Noise": continue
    
    subset = df_ai[df_ai['Cluster_Label'] == cluster]
    makers = subset[subset['Expert_Label'] == 'Maker']['ticker'].tolist()
    users = subset[subset['Expert_Label'] == 'User']['ticker'].tolist()
    
    print(f"\n{cluster} Analysis:")
    print(f"   Dominant Sector: {subset['sector'].mode()[0]}")
    print(f"   Makers: {makers}")
    print(f"   Users:  {users}")
    
    if len(makers) > 0 and len(users) > 0:
        print("  🚨 HYBRID CLUSTER DETECTED! (Potential Misclassification)")

| Cluster_Label | Both | Maker | User |
|---|---|---|---|
| Cluster 0 | 1 | 8 | 5 |
| Cluster 1 | 1 | 9 | 27 |


--------------------------------------------------
INSIGHT GENERATOR:

Cluster 0 Analysis:
   Dominant Sector: Information Technology
   Makers: ['VRT', 'SNOW', 'SMCI', 'MU', 'MRVL', 'LRCX', 'CLS', 'ANET']
   Users:  ['WDAY', 'NOW', 'NKE', 'ISRG', 'APP']
  🚨 HYBRID CLUSTER DETECTED! (Potential Misclassification)

Cluster 1 Analysis:
   Dominant Sector: Financials
   Makers: ['WDC', 'STX', 'ORCL', 'NVDA', 'MSFT', 'META', 'AVGO', 'AMZN', 'AMD']
   Users:  ['XYZ', 'WMT', 'WFC', 'V', 'USB', 'UNH', 'TSLA', 'SOFI', 'SHOP', 'PYPL', 'PGR', 'MS', 'MA', 'JPM', 'HOOD', 'HD', 'GS', 'GM', 'F', 'CRM', 'COIN', 'COF', 'C', 'BLK', 'BAC', 'ADBE', 'AAPL']
  🚨 HYBRID CLUSTER DETECTED! (Potential Misclassification)
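
Both clusters are flagged as hybrid, so the flag alone does not say *how* mixed they are. A rough way to quantify it is per-cluster purity, the share of the majority expert label. The sketch below hard-codes the counts from the crosstab output above; the `cluster_purity` helper is an illustrative assumption and not part of the notebook.

```python
# Counts copied from the crosstab output above.
crosstab_counts = {
    "Cluster 0": {"Both": 1, "Maker": 8, "User": 5},
    "Cluster 1": {"Both": 1, "Maker": 9, "User": 27},
}


def cluster_purity(counts):
    """Return (majority label, share of the cluster carrying that label)."""
    total = sum(counts.values())
    label, size = max(counts.items(), key=lambda kv: kv[1])
    return label, size / total


for name, counts in crosstab_counts.items():
    label, purity = cluster_purity(counts)
    print(f"{name}: majority={label}, purity={purity:.2f}")

# Overall purity: companies matching their cluster's majority label.
n_total = sum(sum(c.values()) for c in crosstab_counts.values())
n_majority = sum(max(c.values()) for c in crosstab_counts.values())
overall_purity = n_majority / n_total

# Baseline: put everything in one cluster and guess the most common label.
label_totals = {}
for counts in crosstab_counts.values():
    for label, n in counts.items():
        label_totals[label] = label_totals.get(label, 0) + n
baseline = max(label_totals.values()) / n_total

print(f"Overall purity: {overall_purity:.2f} (single-cluster baseline: {baseline:.2f})")
```

If the overall purity sits only slightly above the single-cluster baseline, the two clusters separate Makers from Users only weakly, which is consistent with the low ARI reported by the grid search.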

