# 04 — Clustering Analysis

**MarketPulse Phase 1 — College Requirement: Unsupervised Learning**

This notebook segments stocks into behavioral groups using:
1. **K-Means** with elbow method and silhouette analysis
2. **DBSCAN** as a density-based alternative
3. **PCA** for 2D visualization
4. Cluster interpretation and profiling

In [None]:
import sys, os
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

from src.data.market_config import load_market_config
from src.data.fetcher import YFinanceFetcher
from src.data.preprocessing import preprocess_ohlcv, preprocess_multiple
from src.analysis.clustering import StockClusterAnalyzer

sns.set_theme(style='whitegrid')
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 1. Fetch Data for Multiple Stocks

We need a diverse universe of stocks to see meaningful clusters. We use the 14 default stocks from our config and fetch the benchmark (SPY) separately; the benchmark is used to compute each stock's beta and is not clustered itself.

In [None]:
config = load_market_config('stocks')
fetcher = YFinanceFetcher(market_config=config)

end_date = datetime.now().strftime('%Y-%m-%d')
start_date = (datetime.now() - timedelta(days=5*365)).strftime('%Y-%m-%d')

# Fetch all default tickers
tickers = config.default_tickers
print(f"Fetching {len(tickers)} stocks: {tickers}")

raw_data = fetcher.fetch_multiple(tickers, start=start_date, end=end_date)
print(f"\nFetched: {len(raw_data)} tickers")

# Also fetch benchmark
benchmark_raw = fetcher.fetch('SPY', start=start_date, end=end_date)

In [None]:
# Preprocess all
processed = preprocess_multiple(raw_data, market_config=config)
benchmark = preprocess_ohlcv(benchmark_raw, market_config=config)

print(f"Preprocessed: {len(processed)} stocks")
for t, df in processed.items():
    print(f"  {t}: {len(df)} rows")

## 2. Compute Behavioral Features

For each stock we compute 10 summary statistics that capture its "personality":
- Return profile (mean return, Sharpe ratio)
- Risk profile (volatility, max drawdown)
- Market relationship (beta)
- Distribution properties (skewness, kurtosis)
- Trading characteristics (volume, daily range, up-day ratio)
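
As a rough illustration of how several of these features can be derived from daily closing prices, the sketch below computes them on synthetic data. The helper name `summarize_returns`, the 252-day annualization factor, and the zero risk-free rate are assumptions made for this example; `StockClusterAnalyzer.compute_stock_features` may define the features differently.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization factor


def summarize_returns(close: pd.Series, bench_close: pd.Series) -> dict:
    """Hypothetical helper: summary statistics from daily closing prices."""
    r = close.pct_change().dropna()
    rb = bench_close.pct_change().dropna()
    r, rb = r.align(rb, join="inner")  # keep only dates present in both series

    mean_return = r.mean() * TRADING_DAYS
    volatility = r.std() * np.sqrt(TRADING_DAYS)
    sharpe = mean_return / volatility  # risk-free rate assumed to be 0

    # Max drawdown: worst decline from a running peak of the price series
    max_drawdown = (close / close.cummax() - 1.0).min()

    # Beta: covariance with the benchmark divided by benchmark variance
    beta = r.cov(rb) / rb.var()

    return {
        "mean_return": mean_return,
        "volatility": volatility,
        "sharpe_ratio": sharpe,
        "max_drawdown": max_drawdown,
        "beta": beta,
        "skewness": r.skew(),
        "kurtosis": r.kurt(),
        "up_day_ratio": (r > 0).mean(),
    }


# Synthetic example: a stock that moves about twice as much as its benchmark
rng = np.random.default_rng(0)
bench_ret = rng.normal(0.0004, 0.01, 500)
stock_ret = 2.0 * bench_ret + rng.normal(0.0, 0.005, 500)
idx = pd.bdate_range("2022-01-03", periods=500)
bench = pd.Series(100 * np.cumprod(1 + bench_ret), index=idx)
stock = pd.Series(100 * np.cumprod(1 + stock_ret), index=idx)

stats = summarize_returns(stock, bench)
print({k: round(float(v), 3) for k, v in stats.items()})
```

By construction the estimated beta should land close to 2, which is a quick sanity check on the formula.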

In [None]:
analyzer = StockClusterAnalyzer(benchmark_ticker='SPY')
features = analyzer.compute_stock_features(processed, benchmark_data=benchmark)

print(f"Feature matrix shape: {features.shape}")
print(f"\nFeatures per stock:")
for col, desc in analyzer.FEATURE_DESCRIPTIONS.items():
    print(f"  {col:20s} — {desc}")

features.round(3)

In [None]:
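# Visualize feature distributions
# Exclude cluster labels so the cell still works if re-run after clustering
feature_cols = [c for c in features.columns
                if c not in ('kmeans_cluster', 'dbscan_cluster')]
fig, axes = plt.subplots(2, 5, figsize=(20, 8))

# zip() stops at the shorter sequence, so the 2x5 grid cannot be overrun
for ax, col in zip(axes.flat, feature_cols):
    features[col].plot(kind='bar', ax=ax, color='steelblue', alpha=0.8)
    ax.set_title(col, fontsize=10)
    ax.tick_params(axis='x', rotation=45, labelsize=7)
    ax.grid(True, alpha=0.3, axis='y')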

plt.suptitle('Stock Behavioral Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 3. K-Means Clustering

### Elbow Method & Silhouette Analysis

We fit K-Means for K = 2 through 7 and select the K with the highest silhouette score. The elbow plot serves as a visual cross-check on that choice.
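
The selection loop can be sketched directly with scikit-learn. This is a minimal illustration on synthetic data, not the internals of `analyzer.run_kmeans`, which may differ (for example in scaling or initialization settings).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix: three well-separated groups
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(10, 2)) for c in centers])

# Standardize so that no single feature dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow curve)
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)  # cluster separation

best_k = max(silhouettes, key=silhouettes.get)
print(f"Best K by silhouette: {best_k}")
```

Inertia always falls as K grows, so it cannot pick K on its own; the silhouette score penalizes splitting a coherent group, which is why it is used as the selection criterion here.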

In [None]:
labels, metrics = analyzer.run_kmeans(k_range=range(2, 8))

print(f"\nOptimal K: {metrics['optimal_k']}")
print(f"Silhouette score: {metrics['final_silhouette']:.3f}")

# Elbow and silhouette plots
fig = analyzer.plot_elbow(metrics)
plt.show()

In [None]:
# Show cluster assignments
cluster_assignments = features[['kmeans_cluster']].copy()
cluster_assignments['kmeans_cluster'] = cluster_assignments['kmeans_cluster'].astype(int)

for cluster_id in sorted(cluster_assignments['kmeans_cluster'].unique()):
    members = cluster_assignments[cluster_assignments['kmeans_cluster'] == cluster_id].index.tolist()
    print(f"Cluster {cluster_id}: {members}")

## 4. PCA Visualization

Project the 10-dimensional feature space down to 2 dimensions for plotting.
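
For reference, the projection can be sketched with scikit-learn on a synthetic matrix of the same shape (14 rows, 10 features). The standardization step is an assumption about what `analyzer.compute_pca` does internally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 14 "stocks" x 10 features driven by two hidden factors
rng = np.random.default_rng(1)
factors = rng.normal(size=(14, 2))
mixing = rng.normal(size=(2, 10))
X = factors @ mixing + rng.normal(0.0, 0.1, size=(14, 10))

# PCA is scale-sensitive, so features are standardized first
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)  # shape (14, 2): one (PC1, PC2) pair per stock

explained = pca.explained_variance_ratio_
print(f"PC1: {explained[0]:.1%}, PC2: {explained[1]:.1%}, total: {explained.sum():.1%}")
```

The total explained variance indicates how much structure the 2D plot preserves; a low total means stocks that look close in the plot may still be far apart in the full feature space.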

In [None]:
pca_coords = analyzer.compute_pca(n_components=2)

print(f"PCA explained variance: {analyzer.pca.explained_variance_ratio_}")
print(f"Total: {sum(analyzer.pca.explained_variance_ratio_):.1%}")

In [None]:
# Interactive Plotly scatter
fig = analyzer.plot_clusters_interactive(cluster_col='kmeans_cluster')
fig.show()

In [None]:
# Static matplotlib version with labels
fig, ax = plt.subplots(figsize=(10, 8))

scatter = ax.scatter(
    pca_coords['PC1'], pca_coords['PC2'],
    c=pca_coords['kmeans_cluster'],
    cmap='Set1', s=150, edgecolors='white', linewidth=1.5, alpha=0.9
)

# Add ticker labels
for ticker, row in pca_coords.iterrows():
    ax.annotate(ticker, (row['PC1'], row['PC2']),
                fontsize=9, ha='center', va='bottom', 
                xytext=(0, 8), textcoords='offset points')

ax.set_xlabel(f"PC1 ({analyzer.pca.explained_variance_ratio_[0]:.1%} variance)")
ax.set_ylabel(f"PC2 ({analyzer.pca.explained_variance_ratio_[1]:.1%} variance)")
ax.set_title('Stock Clusters — PCA Projection', fontsize=14)
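# Cluster IDs are categorical, so use a discrete legend instead of a continuous colorbar
ax.legend(*scatter.legend_elements(), title='Cluster')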
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. DBSCAN Comparison

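DBSCAN doesn't require specifying K. Instead it takes a neighborhood radius (`eps`) and a minimum neighbor count (`min_samples`), groups points that lie in dense regions, and marks points that belong to no dense region as noise (-1). With only 14 stocks, `min_samples=2` lets even a pair of similar stocks form a cluster.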
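
The noise-labeling behavior is easiest to see on a tiny hand-made dataset. The sketch below calls scikit-learn's `DBSCAN` directly with the same parameter values; the points are made up, and `analyzer.run_dbscan` may preprocess the features differently.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point (synthetic, already on a common scale)
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # group A
    [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],   # group B
    [10.0, -10.0],                        # isolated point
])

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(X)

# Noise is labeled -1 and is not counted as a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(labels)
print(f"{n_clusters} clusters, {n_noise} noise point(s)")
```

The isolated point has no neighbor within `eps`, so it receives the label -1 rather than being forced into the nearest group as K-Means would do.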

In [None]:
dbscan_labels = analyzer.run_dbscan(eps=1.5, min_samples=2)

print(f"\nDBSCAN assignments:")
dbscan_df = features[['dbscan_cluster']].copy()
for cluster_id in sorted(dbscan_df['dbscan_cluster'].unique()):
    members = dbscan_df[dbscan_df['dbscan_cluster'] == cluster_id].index.tolist()
    label = 'NOISE' if cluster_id == -1 else f'Cluster {cluster_id}'
    print(f"  {label}: {members}")

In [None]:
# Compare K-Means vs DBSCAN side by side
fig, axes = plt.subplots(1, 2, figsize=(16, 7))

panels = [('kmeans_cluster', 'K-Means Clustering'),
          ('dbscan_cluster', 'DBSCAN Clustering')]

# Draw the same PCA scatter on each axis, colored by the respective labels
for ax, (col, title) in zip(axes, panels):
    ax.scatter(pca_coords['PC1'], pca_coords['PC2'],
               c=pca_coords[col], cmap='Set1',
               s=150, edgecolors='white', linewidth=1.5)
    for ticker, row in pca_coords.iterrows():
        ax.annotate(ticker, (row['PC1'], row['PC2']),
                    fontsize=8, ha='center', va='bottom',
                    xytext=(0, 8), textcoords='offset points')
    ax.set_title(title)
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
    ax.grid(True, alpha=0.3)

plt.suptitle('K-Means vs DBSCAN', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 6. Cluster Interpretation

What makes each cluster unique? We examine the average feature values per cluster and generate human-readable labels.
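
Conceptually, a cluster profile is a group-by average of the feature table. The sketch below shows the idea on a toy table; the tickers, values, and the 0.30 volatility cutoff used for labeling are all made up for illustration and are not the thresholds used by `analyzer.interpret_clusters`.

```python
import pandas as pd

# Toy feature table with cluster assignments (values are illustrative only)
toy = pd.DataFrame(
    {
        "volatility": [0.45, 0.50, 0.18, 0.20, 0.22],
        "beta": [1.6, 1.8, 0.7, 0.8, 0.9],
        "cluster": [0, 0, 1, 1, 1],
    },
    index=["AAA", "BBB", "CCC", "DDD", "EEE"],  # hypothetical tickers
)

# Average each feature within a cluster and count the members
profiles = toy.groupby("cluster").mean()
profiles["n_stocks"] = toy.groupby("cluster").size()

# Hypothetical labeling rule: the 0.30 cutoff is an assumption
profiles["suggested_label"] = profiles["volatility"].apply(
    lambda v: "High-Volatility" if v > 0.30 else "Defensive"
)
print(profiles)
```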

In [None]:
profiles = analyzer.interpret_clusters()

print("Cluster Profiles:")
print("=" * 80)
for cluster_id, row in profiles.iterrows():
    print(f"\nCluster {cluster_id}: {row['suggested_label']} ({int(row['n_stocks'])} stocks)")
    print(f"  Avg Annual Return:  {row['mean_return']:.1%}")
    print(f"  Avg Volatility:     {row['volatility']:.1%}")
    print(f"  Avg Sharpe Ratio:   {row['sharpe_ratio']:.2f}")
    print(f"  Avg Max Drawdown:   {row['max_drawdown']:.1%}")
    print(f"  Avg Beta:           {row['beta']:.2f}")

profiles.round(3)

In [None]:
# Cluster profile visualization
fig = analyzer.plot_cluster_profiles()
plt.show()

## 7. Feature Contribution to Clusters

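Which features are most important in separating the clusters? As a proxy, the next cell plots the PCA loadings, which show how strongly each original feature contributes to the two axes used in the scatter plots above.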
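
PCA loadings describe the plotted axes rather than the cluster labels themselves. A complementary check, sketched here on synthetic data, ranks features by their one-way ANOVA F-statistic across clusters using scikit-learn's `f_classif`. This technique is not part of `StockClusterAnalyzer`; it is swapped in for illustration. Because the clusters were fitted on these same features, the scores are descriptive only and should not be read as significance tests.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(7)
labels = np.repeat([0, 1, 2], 10)  # three synthetic clusters of 10 points each

toy = pd.DataFrame({
    "volatility": labels * 1.0 + rng.normal(0.0, 0.1, 30),  # differs by cluster
    "skewness": rng.normal(0.0, 1.0, 30),                   # unrelated to cluster
})

# Higher F means the feature's cluster means differ more relative to within-cluster spread
f_stat, _ = f_classif(toy.values, labels)
ranking = pd.Series(f_stat, index=toy.columns).sort_values(ascending=False)
print(ranking.round(1))
```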

In [None]:
# PCA loadings — which original features contribute most to the PCA axes
loadings = pd.DataFrame(
    analyzer.pca.components_.T,
    columns=['PC1', 'PC2'],
    index=features.drop(columns=['kmeans_cluster', 'dbscan_cluster'], errors='ignore').columns
)

fig, ax = plt.subplots(figsize=(10, 8))
# Draw the raw loadings so the axis values match the axis labels
for feature, row in loadings.iterrows():
    ax.arrow(0, 0, row['PC1'], row['PC2'],
             head_width=0.02, head_length=0.02, fc='steelblue', ec='steelblue', alpha=0.7)
    ax.text(row['PC1']*1.1, row['PC2']*1.1, feature, fontsize=9, ha='center')
ax.margins(0.2)  # leave room for the text labels beyond the arrow tips

ax.set_xlabel('PC1 Loading')
ax.set_ylabel('PC2 Loading')
ax.set_title('PCA Loadings — Feature Contributions to Principal Components')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.grid(True, alpha=0.3)
ax.set_aspect('equal')
plt.tight_layout()
plt.show()

## Key Takeaways

1. **K-Means** identifies interpretable groups, typically separating high-growth, high-volatility tech stocks from stable, defensive names.
2. **DBSCAN** may flag outlier stocks as noise (e.g., TSLA often stands alone due to extreme volatility).
3. **PCA** typically shows volatility and beta dominating the first principal component, while return and Sharpe ratio separate stocks on the second. Because the data window ends on the day the notebook is run, confirm this against the loadings plot in Section 7.
4. **Cluster labels** (Growth, Defensive, High-Volatility, etc.) are auto-generated based on feature thresholds.

**College requirement satisfied**: We demonstrated K-Means, DBSCAN, and PCA with financial interpretation.

**Practical use**: Clustering can inform:
- Portfolio diversification (don't over-concentrate in one cluster)
- Strategy adaptation (different model parameters per cluster)
- Risk management (monitor cluster shifts over time)