
# Goal: Cluster AI agent projects by their short descriptions using dense vector embeddings.
Produces: a cleaned dataset, embeddings, 2D projection plots (PCA in this run, since UMAP is disabled), KMeans and HDBSCAN clusters, a qualitative cluster summary, and a saved CSV with cluster labels.

Notes:
- This notebook is written in a competition style, with explanations, modular code, and reproducibility in mind.
- Change paths or model names as needed. If `sentence-transformers` is not installed in your environment, uncomment the install line in the Setup cell.


# 1) Introduction & Problem Summary

# - Problem
# We have a small curated CSV of AI agent projects (title, industry, description, github link). The task is unsupervised: group similar projects together so that engineers/researchers can discover clusters of agent types (e.g., information-retrieval agents, autonomous copilots, data-collection agents, RL agents, etc.).

# - Motivation
# Clusters help create taxonomies, drive RAG pipelines (select representative repos), and identify gaps or duplicates in the dataset.

# - Expected challenges
# * Short descriptions (noisy / terse) -> embeddings must be robust.
# * Small dataset (N=71) -> prefer compact embedding models and clustering that tolerates few points (HDBSCAN can work at this scale if min_cluster_size is kept small, e.g. 3 as used below).
# * Link stability / license heterogeneity not addressed here (this notebook focuses on descriptions only).


# 2) Setup


In [None]:
# Imports, seed, and optional installs
import os
import gc
import random
from pathlib import Path

import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Text and embeddings
try:
    from sentence_transformers import SentenceTransformer
except Exception:
    SentenceTransformer = None

# Dimensionality reduction & clustering
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Optional clustering
try:
    import hdbscan
except Exception:
    hdbscan = None

# UMAP for visualization
try:
    import umap
except Exception:
    umap = None

# Reproducibility & warnings
import warnings
warnings.filterwarnings('ignore')

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

# Display settings
pd.set_option('display.max_columns', 200)

# Optional: install missing libraries (uncomment if needed)
# Note: in restricted offline environments this might fail. If so, use precomputed embeddings or run locally.

# !pip install -q sentence-transformers umap-learn hdbscan

print('SentenceTransformer available?', SentenceTransformer is not None)
print('UMAP available?', umap is not None)
print('HDBSCAN available?', hdbscan is not None)



# 3) Data Loading

In [None]:


DATA_PATH = Path('/kaggle/input/ai-agents-dataset-github-repositories-use-cases/agents_list.csv')
assert DATA_PATH.exists(), f"Dataset not found at {DATA_PATH}"

df = pd.read_csv(DATA_PATH)
print('Shape:', df.shape)
print('\nColumns:')
print(df.columns.tolist())

# Quick head
print('\nFirst rows:')
print(df.head(10).to_string(index=False))

# Basic summary
print('\nMissing values:')
print(df.isnull().sum())

# Ensure descriptive column names exist; normalize
expected_cols = ['Use Case', 'Industry', 'Description', 'Code Github']
for c in expected_cols:
    if c not in df.columns:
        # try lowercase variants
        for col in df.columns:
            if col.strip().lower() == c.lower():
                df = df.rename(columns={col: c})
                break

# Description is required: stop early if the column is missing
if 'Description' not in df.columns:
    raise ValueError('Description column missing!')

# Create a text column to embed
df['text'] = df['Description'].fillna('')
# Fall back to Use Case when the description is empty
if 'Use Case' in df.columns and df['text'].str.strip().eq('').any():
    df['text'] = df['text'].mask(df['text'].str.strip()=='', df['Use Case'].fillna(''))

print('\nSample texts:')
print(df['text'].head(8).to_string(index=False))
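
The column normalization and the description fallback above can be wrapped into one reusable function. The sketch below assumes the same four column names as `expected_cols`; the function name `prepare_text` and the toy rows are hypothetical and only illustrate the behavior.

```python
import pandas as pd

EXPECTED = ["Use Case", "Industry", "Description", "Code Github"]

def prepare_text(frame: pd.DataFrame, expected=EXPECTED) -> pd.DataFrame:
    """Rename case/whitespace variants of expected columns and build a `text` column."""
    out = frame.copy()
    # Map a normalized name (stripped, lowercased) to the actual column name
    lookup = {col.strip().lower(): col for col in out.columns}
    rename = {}
    for name in expected:
        found = lookup.get(name.lower())
        if found is not None and found != name:
            rename[found] = name
    out = out.rename(columns=rename)
    if "Description" not in out.columns:
        raise ValueError("Description column missing!")
    text = out["Description"].fillna("").astype(str).str.strip()
    # Use the Use Case value wherever the description is empty
    if "Use Case" in out.columns:
        fallback = out["Use Case"].fillna("").astype(str).str.strip()
        text = text.mask(text.eq(""), fallback)
    out["text"] = text
    return out

toy = pd.DataFrame({
    "use case": ["Search helper", "Trading bot"],
    " description ": ["Answers questions over documents", None],
})
print(prepare_text(toy)[["Use Case", "text"]])
```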


# 4) EDA (text)

In [None]:


# Length statistics
df['text_len'] = df['text'].str.len()
print('\nText length stats:')
print(df['text_len'].describe())

plt.figure(figsize=(8,4))
sns.histplot(df['text_len'], bins=20)
plt.title('Distribution of description lengths')
plt.xlabel('Characters')
plt.show()

# Show most common industries
if 'Industry' in df.columns:
    print('\nIndustry counts:')
    print(df['Industry'].value_counts().head(20))


# 5) Feature Engineering: Text cleaning & Embeddings

In [None]:


# Minimal, safe text cleaning function
import re

def clean_text(s):
    if pd.isna(s):
        return ''
    s = str(s)
    # Remove URLs (github links sometimes appear in text)
    s = re.sub(r'http\S+|www\S+', ' ', s)
    # Remove repeated whitespace
    s = re.sub(r'\s+', ' ', s)
    s = s.strip()
    return s

# Apply
df['text_clean'] = df['text'].apply(clean_text)
print('\nSample cleaned texts:')
print(df['text_clean'].head(8).to_string(index=False))

# Embedding function using sentence-transformers
EMBED_MODEL_NAME = 'all-MiniLM-L6-v2'  # compact, good for small datasets

if SentenceTransformer is None:
    raise RuntimeError('sentence-transformers is not installed in the environment. Uncomment the install cell and run again or run locally with internet access.')

model = SentenceTransformer(EMBED_MODEL_NAME)

# Compute embeddings
texts = df['text_clean'].tolist()
print('\nComputing embeddings for', len(texts), 'items...')
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)
print('Embeddings shape:', embeddings.shape)

# Save embeddings to dataframe
emb_df = pd.DataFrame(embeddings)
emb_df.columns = [f'emb_{i}' for i in range(emb_df.shape[1])]
df = pd.concat([df.reset_index(drop=True), emb_df.reset_index(drop=True)], axis=1)
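
Before clustering, it can help to sanity-check the embeddings by looking at each item's nearest neighbors under cosine similarity. This is a minimal sketch with NumPy only; `nearest_neighbors` is a hypothetical helper, and the toy vectors stand in for the `embeddings` array computed above.

```python
import numpy as np

def nearest_neighbors(vectors: np.ndarray, top_k: int = 3) -> np.ndarray:
    """Return, for each row, the indices of the top_k most cosine-similar other rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sim = unit @ unit.T                            # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                 # never return the item itself
    return np.argsort(-sim, axis=1)[:, :top_k]

toy = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
])
print(nearest_neighbors(toy, top_k=1).ravel())  # prints [1 0 3 2]
```

In the notebook, something like `nearest_neighbors(embeddings, top_k=5)` combined with a lookup into `df['Use Case']` would show whether semantically related projects actually sit close together.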


# 6) Dimensionality Reduction for visualization & clustering preprocessing

In [None]:
!pip install --upgrade umap-learn


In [None]:


# PCA to reduce to at most 50 dims (for clustering stability / HDBSCAN)
# n_components cannot exceed min(n_samples, n_features)
N_PCA = min(50, embeddings.shape[0], embeddings.shape[1])
print('Running PCA ->', N_PCA, 'components')


pca = PCA(n_components=N_PCA, random_state=RANDOM_SEED)
emb_pca = pca.fit_transform(embeddings)
print('PCA explained variance ratio (sum):', pca.explained_variance_ratio_.sum())


# 2D projection for visualization
# UMAP is disabled here due to a version conflict, so PCA provides the 2D view.
# The columns below keep the names umap_0 / umap_1 so later cells work unchanged.
pca_2d = PCA(n_components=2, random_state=RANDOM_SEED).fit_transform(emb_pca)
emb_2d = pca_2d
print('2D PCA shape:', emb_2d.shape)


# Append to df
df['umap_0'] = emb_2d[:,0]
df['umap_1'] = emb_2d[:,1]


plt.figure(figsize=(8,6))
plt.scatter(df['umap_0'], df['umap_1'], s=60)
for i, txt in enumerate(df['Use Case'].fillna('').tolist()):
    if i < 30: # label only first 30 points to reduce clutter
        plt.text(df.loc[i,'umap_0']+0.01, df.loc[i,'umap_1']+0.01, str(i)+': '+txt, fontsize=8)
plt.title('2D PCA projection of agent descriptions')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.show()
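
The fixed choice of 50 PCA components can be checked against the cumulative explained variance. The sketch below assumes scikit-learn; the 90% threshold, the helper name `components_for_variance`, and the random toy matrix are illustrative assumptions, not values from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X: np.ndarray, threshold: float = 0.90) -> int:
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    max_components = min(X.shape)   # PCA cannot exceed min(n_samples, n_features)
    pca = PCA(n_components=max_components, random_state=42).fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= threshold
    idx = int(np.searchsorted(cumulative, threshold))
    return min(idx + 1, max_components)

rng = np.random.default_rng(42)
# Toy data: 71 rows, 384 columns, with nearly all variance in 5 latent directions
latent = rng.normal(size=(71, 5)) @ rng.normal(size=(5, 384))
X = latent + 0.01 * rng.normal(size=(71, 384))
print(components_for_variance(X, threshold=0.90))
```

Running `components_for_variance(embeddings)` on the real matrix gives a data-driven reference point to compare with `N_PCA`.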

# 7) Clustering: KMeans (baseline) + HDBSCAN (density-based)


In [None]:



# Helper to evaluate KMeans over a range of k

def kmeans_explore(X, k_min=2, k_max=12, random_state=RANDOM_SEED):
    inertia = []
    sil_scores = []
    K = list(range(k_min, k_max+1))
    for k in K:
        km = KMeans(n_clusters=k, random_state=random_state, n_init=20)
        labels = km.fit_predict(X)
        inertia.append(km.inertia_)
        if len(set(labels))>1:
            sil_scores.append(silhouette_score(X, labels))
        else:
            sil_scores.append(np.nan)
    return K, inertia, sil_scores

K, inertia, sil_scores = kmeans_explore(emb_pca, k_min=2, k_max=12)

plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(K, inertia, '-o')
plt.xlabel('k')
plt.ylabel('Inertia')
plt.title('KMeans elbow')

plt.subplot(1,2,2)
plt.plot(K, sil_scores, '-o')
plt.xlabel('k')
plt.ylabel('Silhouette score')
plt.title('KMeans silhouette')
plt.show()

# Pick the k with the highest silhouette score; use the plots above to sanity-check the choice
best_k_idx = np.nanargmax(sil_scores)
best_k = K[best_k_idx]
print('Best K by silhouette:', best_k)

# Fit final KMeans
kmeans = KMeans(n_clusters=best_k, random_state=RANDOM_SEED, n_init=50)
klabels = kmeans.fit_predict(emb_pca)
df['kmeans_cluster'] = klabels

plt.figure(figsize=(8,6))
palette = sns.color_palette('tab10', best_k)
for c in range(best_k):
    sel = df[df['kmeans_cluster']==c]
    plt.scatter(sel['umap_0'], sel['umap_1'], s=70, color=palette[c], label=f'cluster {c}')
plt.legend()
plt.title(f'KMeans clusters (k={best_k}) on 2D PCA projection')
plt.show()

# HDBSCAN (if available) - often finds meaningful clusters and noise points
if hdbscan is None:
    raise RuntimeError('hdbscan not installed. Uncomment install cell or run locally to enable HDBSCAN.')

clusterer = hdbscan.HDBSCAN(min_cluster_size=3, min_samples=1, metric='euclidean')
hdb_labels = clusterer.fit_predict(emb_pca)

df['hdbscan_cluster'] = hdb_labels

print('HDBSCAN cluster counts (including -1 noise):')
print(pd.Series(hdb_labels).value_counts())

plt.figure(figsize=(8,6))
sns.scatterplot(x='umap_0', y='umap_1', hue='hdbscan_cluster', data=df, palette='tab10', s=80)
plt.title('HDBSCAN clusters on 2D PCA projection')
plt.legend(bbox_to_anchor=(1.05,1), loc='upper left')
plt.show()
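
One way to quantify how much the KMeans and HDBSCAN partitions agree is the adjusted Rand index, computed only on points that HDBSCAN did not mark as noise. The sketch assumes scikit-learn's `adjusted_rand_score`; the helper name `agreement_without_noise` and the toy labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def agreement_without_noise(labels_a, labels_b, noise_label: int = -1) -> float:
    """Adjusted Rand index between two labelings, ignoring points either marks as noise."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    keep = (a != noise_label) & (b != noise_label)
    if keep.sum() < 2:
        return float("nan")   # not enough points to compare
    return float(adjusted_rand_score(a[keep], b[keep]))

kmeans_toy  = [0, 0, 1, 1, 2, 2, 2]
hdbscan_toy = [1, 1, 0, 0, 2, 2, -1]   # same grouping, permuted ids, one noise point
print(agreement_without_noise(kmeans_toy, hdbscan_toy))  # prints 1.0
```

In the notebook this would be called as `agreement_without_noise(df['kmeans_cluster'], df['hdbscan_cluster'])`. Values near 1 mean the two methods found similar groups; values near 0 mean agreement is close to chance.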


# 8) Cluster analysis & qualitative summaries

In [None]:
!pip install --upgrade pydantic


In [None]:




# Function to show representative items per cluster


def cluster_summary(df, cluster_col='hdbscan_cluster', n_top=8):
    """Return {cluster_id: DataFrame of up to n_top representative rows}; -1 is HDBSCAN noise."""
    clusters = sorted(df[cluster_col].unique())
    summary = {}
    for c in clusters:
        sub = df[df[cluster_col]==c]
        # Pick the top n rows by text length (a proxy for richer descriptions)
        rep = sub.sort_values('text_len', ascending=False).head(n_top)[['Use Case','Industry','text_clean','Code Github']]
        summary[c] = rep
    return summary


summ = cluster_summary(df, cluster_col='hdbscan_cluster', n_top=6)


# Print brief summaries for every cluster (-1 is HDBSCAN noise)
for c, rep in summ.items():
    print('\n' + '='*60)
    label = 'noise' if c == -1 else c
    print('Cluster:', label, '| Size:', len(df[df['hdbscan_cluster']==c]))
    print(rep.to_string(index=False))


# Compute cluster-level keywords using simple token frequency (very lightweight)
from collections import Counter
import re


# NLTK is not required: a regex tokenizer avoids the nltk dependency and its data downloads.
# It keeps runs of ASCII letters only, so hyphenated words split into their parts.
TOKEN_RE = re.compile(r'[A-Za-z]+')


# Small stopword list used to filter tokens
STOP = {'the','a','an','and','or','for','to','of','in','on','with','by','from','as','is','are'}


def top_keywords_for_cluster(df, cluster_col='hdbscan_cluster', topn=8):
    """Return {cluster_id: [(token, count), ...]} with the topn most frequent tokens."""
    kc = {}
    for c in sorted(df[cluster_col].unique()):
        texts = df.loc[df[cluster_col]==c, 'text_clean'].tolist()
        tokens = []
        for t in texts:
            toks = [w.lower() for w in TOKEN_RE.findall(t)]
            tokens.extend([w for w in toks if w not in STOP and len(w)>2])
        counts = Counter(tokens)
        most = counts.most_common(topn)
        kc[c] = most
    return kc


kw = top_keywords_for_cluster(df, 'hdbscan_cluster', topn=10)
print('\nTop keywords per HDBSCAN cluster:')
for c,v in kw.items():
    print('Cluster', c, ':', ', '.join([f'{w}({n})' for w,n in v]))



# 9) Save results

In [None]:



OUT_CSV = Path('agents_clusters.csv')
cols_to_save = ['Use Case','Industry','Description','Code Github','text_clean','kmeans_cluster','hdbscan_cluster','umap_0','umap_1']
df[cols_to_save].to_csv(OUT_CSV, index=False)
print('\nSaved cluster results to', OUT_CSV)

# Also save embeddings (optional)
EMB_OUT = Path('agents_embeddings.npy')
np.save(EMB_OUT, embeddings)
print('Saved embeddings to', EMB_OUT)
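
When the saved files are reloaded later, it is worth confirming that the embedding rows still line up with the CSV rows. The sketch below does a toy round trip in a temporary directory; `load_aligned` and the toy file names are hypothetical, and only `pd.read_csv`, `np.save`, and `np.load` are assumed.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

def load_aligned(csv_path: Path, emb_path: Path):
    """Reload the cluster CSV and the embedding matrix, checking that rows line up."""
    frame = pd.read_csv(csv_path)
    emb = np.load(emb_path)
    if len(frame) != emb.shape[0]:
        raise ValueError(f"{len(frame)} CSV rows vs {emb.shape[0]} embedding rows")
    return frame, emb

# Toy round trip in a temporary directory (file names are placeholders)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy_clusters.csv"
    emb_path = Path(tmp) / "toy_embeddings.npy"
    toy = pd.DataFrame({"Use Case": ["a", "b", "c"], "kmeans_cluster": [0, 1, 0]})
    toy.to_csv(csv_path, index=False)
    np.save(emb_path, np.zeros((3, 4), dtype=np.float32))
    frame, emb = load_aligned(csv_path, emb_path)

print(frame.shape, emb.shape)  # prints (3, 2) (3, 4)
```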




# 10) Full Explanation and Next Steps (competition-style notes)


In [None]:

# Why these choices:
# - SentenceTransformer 'all-MiniLM-L6-v2' is compact and effective for semantic similarity with small datasets.
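# - PCA before clustering drops low-variance dimensions and can make distances more stable; 50 dims is a common starting point, not a tuned value.
# - UMAP gives a human-friendly 2D projection for visual inspection, and its parameters can be tuned for tighter clusters. In this run UMAP is disabled (version conflict), so the plots use a 2D PCA projection instead.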
# - KMeans provides a simple baseline and well-understood cluster centroids; silhouette score helps pick k.
# - HDBSCAN is density-based, robust to clusters of different shapes/sizes and returns a -1 label for noise.

# Limitations & advice:
# 1. This notebook uses only the short description. Better results come from combining README, topics, and code comments.
# 2. For production use in RAG systems, fetch the repo README and source, chunk texts, and build an index (FAISS/Chroma).
# 3. Try alternative embedding models: larger SBERT variants, OpenAI embeddings (text-embedding-3-large), or fine-tuned domain models.
# 4. Tune UMAP/HDBSCAN hyperparameters. Use GridSearch-style sweeps for min_cluster_size / min_samples.
# 5. Use cross-checks with stars/README length to validate cluster coherence.

# Quick pointers to improve:
# - Replace `top_keywords_for_cluster` with TF-IDF + top terms from cluster centroid.
# - Add qualitative human-in-the-loop labeling for a few clusters to bootstrap a classifier.
# - If you want an interactive scatter with hover tooltips, export `umap_0, umap_1` and `Use Case` to a small HTML via plotly.

print('\nNotebook complete. Review clusters, inspect the saved CSV, and iterate on model / parameters.')
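
The hyperparameter sweep mentioned under "Limitations & advice" can be sketched as a loop over a parameter grid, scoring each run by silhouette on the non-noise points and recording the noise fraction. This is a sketch under assumptions: the function name `sweep` is hypothetical, and scikit-learn's `DBSCAN` is used as a stand-in density-based clusterer so the example runs without `hdbscan`; it is not the clusterer used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid

def sweep(X, make_clusterer, grid):
    """Fit one clusterer per parameter combination; score non-noise points by silhouette."""
    rows = []
    for params in ParameterGrid(grid):
        labels = make_clusterer(**params).fit_predict(X)
        keep = labels != -1                      # -1 marks noise
        n_clusters = len(set(labels[keep].tolist()))
        # Silhouette needs at least 2 clusters and fewer clusters than points
        if 2 <= n_clusters < keep.sum():
            score = float(silhouette_score(X[keep], labels[keep]))
        else:
            score = float("-inf")
        rows.append({**params, "n_clusters": n_clusters,
                     "noise_frac": float(1 - keep.mean()), "silhouette": score})
    return sorted(rows, key=lambda r: r["silhouette"], reverse=True)

# Toy data: three well-separated blobs
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)
# DBSCAN is a stand-in; for this notebook, pass hdbscan.HDBSCAN with a grid such as
# {"min_cluster_size": [...], "min_samples": [...]} instead.
results = sweep(X, DBSCAN, {"eps": [0.5, 1.0, 2.0], "min_samples": [2, 3]})
print(results[0])
```

Because the silhouette is computed only on non-noise points, a configuration can score well by discarding many points as noise, so `noise_frac` should be read alongside the score.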
