# K-Means Clustering Model (k=3) with Z-Score Standardization
## Stock Data Clustering Analysis

This notebook fits a K-means model with k=3 on Z-score standardized features.
The goal is to group the independent-features stock dataset into a small number of clearly separated stock recommendation patterns.


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from tabulate import tabulate
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Sklearn imports
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.decomposition import PCA

plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
print("Libraries imported successfully!")


In [None]:
# Load the independent features dataset
df = pd.read_csv('stock_data_independent_features.csv')
print(f"Dataset shape: {df.shape}")
print(f"Features: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()


In [None]:
# Data quality check and cleaning
print("=== DATA QUALITY CHECK ===")
print(f"Infinite values: {np.isinf(df).sum().sum()}")
print(f"NaN values: {df.isnull().sum().sum()}")

# Check for extremely large values
large_values = (np.abs(df.select_dtypes(include=[np.number])) > 1e10).sum().sum()
print(f"Values > 1e10: {large_values}")

# Clean the data
print("\n=== CLEANING DATA ===")
# Replace infinite values with NaN
df_clean = df.replace([np.inf, -np.inf], np.nan)

# Drop rows with any NaN values
df_clean = df_clean.dropna()

# Clip extremely large values to reasonable range
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
df_clean[numeric_cols] = df_clean[numeric_cols].clip(lower=-1e6, upper=1e6)

print(f"Cleaned dataset shape: {df_clean.shape}")
print(f"Features: {list(df_clean.columns)}")

# Verify no problematic values remain
print(f"\nAfter cleaning:")
print(f"Infinite values: {np.isinf(df_clean).sum().sum()}")
print(f"NaN values: {df_clean.isnull().sum().sum()}")
print(f"Values > 1e6: {(np.abs(df_clean.select_dtypes(include=[np.number])) > 1e6).sum().sum()}")


In [None]:
# Z-Score Standardization (StandardScaler)
print("=== Z-SCORE STANDARDIZATION ===")
print("Z-score formula: z = (x - μ) / σ")
print("Where:")
print("- x = original value")
print("- μ = mean of the feature")
print("- σ = standard deviation of the feature")
print("- Result: mean=0, std=1 for each feature")

scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_clean)
scaled_df = pd.DataFrame(scaled_data, columns=df_clean.columns)

print(f"\nStandardized data shape: {scaled_data.shape}")
print(f"\nZ-score statistics (should be mean≈0, std≈1):")
stats_summary = scaled_df.describe().round(3)
# describe() puts the statistics in the index, so select them by row label
print(stats_summary.loc[['mean', 'std']])

# Verify Z-score properties
print(f"\n=== Z-SCORE VERIFICATION ===")
print(f"Mean of all features: {scaled_df.mean().mean():.6f} (should be ≈0)")
print(f"Std of all features: {scaled_df.std().mean():.6f} (should be ≈1)")
print(f"Min value: {scaled_df.min().min():.3f}")
print(f"Max value: {scaled_df.max().max():.3f}")


In [None]:
# Configuration parameters for 3 clusters
K = 3  # Number of clusters (reduced from 9 to 3)
N_INIT = 10  # Number of initializations
SEED = 42  # Default random state; the seed search below selects the one used for the final model

print(f"=== CONFIGURATION ===")
print(f"- Number of clusters (k): {K}")
print(f"- Number of initializations: {N_INIT}")
print(f"- Random seed: {SEED}")
print(f"- Standardization: Z-score (mean=0, std=1)")
print(f"\nRationale for k=3:")
print("- Fewer clusters create more distinct, interpretable groups")
print("- Easier to identify clear patterns and business insights")
print("- Reduces over-segmentation of similar recommendations")


In [None]:
# Function to search for best seed (from Taller_4 methodology)
def search_seed(seeds, data, k, n_init):
    best_seed = None
    best_score = -1
    
    for seed in seeds:
        km_model = KMeans(
            n_clusters=k, init='k-means++', random_state=seed, n_init=n_init)
        y_predicted = km_model.fit_predict(data)

        silhouette_avg = silhouette_score(data, y_predicted)
        
        if silhouette_avg > best_score:
            best_score = silhouette_avg
            best_seed = seed
            
        print(f"Seed {seed}: Silhouette Score = {silhouette_avg:.4f}")
    
    return best_seed, best_score

# Search for best seed
seeds_to_test = [0, 42, 123, 456, 789]
best_seed, best_score = search_seed(seeds_to_test, scaled_data, K, N_INIT)

print(f"\n=== BEST SEED RESULTS ===")
print(f"Best seed: {best_seed}")
print(f"Best silhouette score: {best_score:.4f}")
print(f"\nSilhouette Score Interpretation:")
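print("- Range: -1 to +1")
print("- Near +1: samples sit well inside their own cluster, far from neighboring clusters")
print("- Near 0: samples lie on or close to the boundary between clusters")
print("- Negative: samples may have been assigned to the wrong cluster")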
print(f"- Our score: {best_score:.4f} ({'Good' if best_score > 0.5 else 'Fair' if best_score > 0.3 else 'Poor'})")


In [None]:
# Train the final K-means model with best seed
print(f"Training K-means model with k={K} and seed={best_seed}...")

final_km = KMeans(n_clusters=K, init='k-means++', random_state=best_seed, n_init=N_INIT)
y_predicted = final_km.fit_predict(scaled_data)

# Calculate final metrics
silhouette_avg = silhouette_score(scaled_data, y_predicted)
inertia = final_km.inertia_

print(f"\n=== MODEL RESULTS ===")
print(f"Silhouette Score: {silhouette_avg:.4f}")
print(f"Inertia: {inertia:.2f}")
print(f"Number of iterations: {final_km.n_iter_}")
print(f"Converged: {final_km.n_iter_ < final_km.max_iter}")
print(f"\nModel Performance:")
print(f"- Silhouette Score: {'Excellent' if silhouette_avg > 0.7 else 'Good' if silhouette_avg > 0.5 else 'Fair' if silhouette_avg > 0.3 else 'Poor'}")
print(f"- Convergence: {'Yes' if final_km.n_iter_ < final_km.max_iter else 'No'}")
print(f"- Inertia: Lower is better (measures within-cluster sum of squares)")


In [None]:
# Function to print cluster centroids (from Taller_4 methodology)
def print_clusters_centroids(model, feature_names, k):
    info = {
        "features": feature_names 
    }

    for n in range(0, k):
        info[str(n)] = model.cluster_centers_[n]

    print(tabulate(info, headers="keys", tablefmt="fancy_grid"))
    print(f"\nCluster distribution:")
    print(Counter(model.labels_))

# Print cluster centroids and distribution
print("=== CLUSTER CENTROIDS (Z-SCORE VALUES) ===")
print("Note: Values are in Z-score units (mean=0, std=1)")
print("Positive values = above average, Negative values = below average")
print_clusters_centroids(final_km, scaler.feature_names_in_, K)


In [None]:
# Create clusters dataframe for analysis
clusters_df = df_clean.copy()
clusters_df['cluster_label'] = final_km.labels_

print(f"Clusters dataframe shape: {clusters_df.shape}")
print(f"\nCluster distribution:")
cluster_counts = clusters_df['cluster_label'].value_counts().sort_index()
for cluster_id, count in cluster_counts.items():
    percentage = (count / len(clusters_df)) * 100
    print(f"Cluster {cluster_id}: {count} samples ({percentage:.1f}%)")

print(f"\nFirst few rows with cluster labels:")
# A bare expression in the middle of a cell is not displayed, so print it
print(clusters_df.head())

# Calculate cluster balance
max_cluster_size = cluster_counts.max()
min_cluster_size = cluster_counts.min()
balance_ratio = min_cluster_size / max_cluster_size

print(f"\n=== CLUSTER BALANCE ===")
print(f"Largest cluster: {max_cluster_size} samples")
print(f"Smallest cluster: {min_cluster_size} samples")
print(f"Balance ratio: {balance_ratio:.3f}")
print(f"Balance assessment: {'Well balanced' if balance_ratio > 0.5 else 'Moderately balanced' if balance_ratio > 0.3 else 'Unbalanced'}")


In [None]:
# Function to plot clusters (from Taller_4 methodology)
def plot_clusters(model, data):
    plt.figure(figsize=(20, 10))

    label = model.labels_
    u_labels = np.unique(label)

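    # Centroids in the plotted 2D space: model.cluster_centers_ live in the
    # full feature space, so use the per-cluster means of the projected points
    # (this matches projecting the fitted centroids once K-means has converged,
    # since PCA is an affine map)
    centroids = np.array([data[label == i].mean(axis=0) for i in u_labels])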
    
    # Plotting the results:
    for i in u_labels:
        plt.scatter(data[label == i, 0], data[label == i, 1], 
                   label=f'Cluster {i}', alpha=0.7, s=50)
    
    plt.scatter(centroids[:, 0], centroids[:, 1], 
               s=200, c='red', marker='x', label='Centroids', linewidths=3)
    
    plt.legend()
    plt.title(f'K-Means Clustering (k={K}) - Z-Score Standardized')
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.grid(True, alpha=0.3)
    plt.show()

# Apply PCA for 2D visualization
pca = PCA(n_components=2)
data_pca = pca.fit_transform(scaled_data)

print(f"PCA explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Note: {pca.explained_variance_ratio_.sum():.1%} of variance explained in 2D")

# Plot clusters in 2D PCA space
plot_clusters(final_km, data_pca)


In [None]:
# Function to draw polar chart (from Taller_4 methodology)
def draw_polar_chart(clusters):
    polar = clusters.groupby("cluster_label").mean().reset_index()
    polar = pd.melt(polar, id_vars=["cluster_label"])

    fig = px.line_polar(polar, r="value", theta="variable", color="cluster_label", 
                       line_close=True, height=800, width=1400,
                       title="Cluster Characteristics - Polar Chart (Original Scale)")
    fig.show()

# Create polar chart to visualize cluster characteristics
print("Generating polar chart to visualize cluster characteristics...")
print("Note: Values shown are in original scale, not Z-scores")
draw_polar_chart(clusters_df)


In [None]:
# Function to plot clusters pie chart (from Taller_4 methodology)
def plot_clusters_pie(clusters):
    pie = clusters.groupby('cluster_label').size().reset_index()
    pie.columns = ['cluster_label', 'value']
    fig = px.pie(pie, values='value', names='cluster_label', 
                title=f"Cluster Distribution (k={K})")
    fig.show()

# Create pie chart for cluster distribution
print("Generating pie chart for cluster distribution...")
plot_clusters_pie(clusters_df)


In [None]:
# Detailed cluster analysis
print("=== DETAILED CLUSTER ANALYSIS ===")

# Calculate mean values for each cluster (original scale)
cluster_means = clusters_df.groupby('cluster_label').mean()

print("\nCluster characteristics (mean values - original scale):")
print(cluster_means.round(3))

# Identify most distinctive features for each cluster
print("\n=== MOST DISTINCTIVE FEATURES PER CLUSTER ===")
features_only = clusters_df.drop('cluster_label', axis=1)
overall_mean = features_only.mean()
overall_std = features_only.std()
for cluster_id in range(K):
    cluster_data = cluster_means.iloc[cluster_id]

    # Rank features by deviation from the overall mean in standard-deviation
    # units, so features with large raw magnitudes do not dominate the ranking
    deviations = ((cluster_data - overall_mean) / overall_std).abs()
    top_features = deviations.nlargest(5).index.tolist()  # Top 5 features
    
    print(f"\nCluster {cluster_id} (n={cluster_counts[cluster_id]}, {cluster_counts[cluster_id]/len(clusters_df)*100:.1f}%):")
    for feature in top_features:
        cluster_val = cluster_data[feature]
        overall_val = overall_mean[feature]
        deviation = cluster_val - overall_val
        deviation_pct = (deviation / overall_val) * 100 if overall_val != 0 else 0
        print(f"  - {feature}: {cluster_val:.3f} (deviation: {deviation:+.3f}, {deviation_pct:+.1f}%)")


In [None]:
# Save results
output_file = 'stock_data_clustered_k3_zscore.csv'
clusters_df.to_csv(output_file, index=False)

print(f"=== RESULTS SAVED ===")
print(f"Clustered dataset saved as: {output_file}")
print(f"Dataset shape: {clusters_df.shape}")
print(f"Features: {len(clusters_df.columns)-1} (plus cluster_label)")
print(f"\nModel Summary:")
print(f"- Number of clusters: {K}")
print(f"- Silhouette score: {silhouette_avg:.4f}")
print(f"- Best seed: {best_seed}")
print(f"- Inertia: {inertia:.2f}")
print(f"- Standardization: Z-score (mean=0, std=1)")
print(f"- Cluster balance ratio: {balance_ratio:.3f}")

# Comparison with k=9 model
print(f"\n=== COMPARISON WITH K=9 MODEL ===")
print(f"Expected advantages of k=3 (qualitative, not measured in this notebook):")
print(f"- Simpler interpretation and business insights")
print(f"- More distinct cluster characteristics")
print(f"- Better for strategic decision making")
print(f"- Reduced over-segmentation")
print(f"- Easier to communicate results to stakeholders")


## Summary and Conclusions

### Z-Score Standardization Benefits
- **Equal Feature Weight**: Every feature is rescaled to mean=0, std=1, so each contributes on the same scale to the Euclidean distances K-means uses
- **Scale Independence**: Features with different scales don't dominate the clustering
- **Interpretable Centroids**: Cluster centroids show how many standard deviations above/below average each feature is
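- **Outlier Caveat**: The mean and standard deviation are themselves sensitive to outliers, so Z-scores alone do not make the clustering robust to extreme values; this is why the data is clipped before scaling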
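
The standardization is simple to reproduce by hand. The sketch below uses a small made-up array (not the stock dataset) and applies z = (x - μ) / σ per column with the population standard deviation, which is what `StandardScaler` uses by default. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 5000.0],
    [4.0, 7000.0],
])

mu = X.mean(axis=0)      # per-feature mean
sigma = X.std(axis=0)    # per-feature population std (ddof=0)
Z = (X - mu) / sigma     # z-score each column

print(Z.round(3))
print("means:", Z.mean(axis=0).round(6))
print("stds: ", Z.std(axis=0).round(6))
```

In this toy case both columns come out identical after scaling, even though the raw values differ by three orders of magnitude, so neither one dominates the distance computation.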

### Model Performance
- **Silhouette Score**: Measures cluster quality and separation
- **Inertia**: Within-cluster sum of squares (lower is better)
- **Convergence**: Whether the algorithm stopped before reaching `max_iter`; K-means converges to a local optimum, not necessarily the global one
- **Cluster Balance**: Distribution of samples across clusters
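
To make the two main metrics concrete, here is a minimal sketch that computes inertia and the mean silhouette score by hand on four made-up points. The data and the helper `mean_silhouette` are hypothetical and exist only for illustration; in the notebook these values come from `final_km.inertia_` and `silhouette_score`.

```python
import numpy as np

# Hypothetical toy data: two tight, well-separated groups in 2D.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Inertia: sum of squared distances from each point to its cluster centroid.
centroids = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
inertia = float(((X - centroids[labels]) ** 2).sum())

def mean_silhouette(X, labels):
    """Average of s = (b - a) / max(a, b) over all points."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: score is 0
            scores.append(0.0)
            continue
        a = dists[i, same].mean()            # mean distance within own cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != own)  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

silhouette = mean_silhouette(X, labels)
print(f"inertia = {inertia:.1f}, silhouette = {silhouette:.3f}")
```

Each point is exactly 1 unit from its centroid, so the inertia is 4.0, and the wide gap between the two groups gives a silhouette of about 0.80.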

### Cluster Interpretation (k=3)
Fill in each description from the centroid table and the distinctive-feature analysis; the example labels below are placeholders:

- **Cluster 0**: [Describe based on distinctive features - e.g., "Conservative/Bearish"]
- **Cluster 1**: [Describe based on distinctive features - e.g., "Moderate/Neutral"] 
- **Cluster 2**: [Describe based on distinctive features - e.g., "Aggressive/Bullish"]

### Business Insights
The 3-cluster model provides clear strategic insights:

- **Portfolio Segmentation**: Three distinct investment strategy categories
- **Risk Management**: Clear risk profiles for each cluster
- **Recommendation Strategy**: Tailored approaches for each cluster type
- **Market Positioning**: Understanding different analyst perspectives

### Advantages of k=3 vs k=9
- **Simplicity**: Easier to understand and communicate
- **Strategic Value**: Better for high-level business decisions
- **Interpretability**: Clearer cluster characteristics
- **Actionability**: More practical for implementation
- **Reduced Noise**: Less over-segmentation of similar patterns
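
Whether three clusters actually separate the data better than nine is an empirical question. A minimal sketch of how the two settings could be compared on silhouette score follows. It uses synthetic data from `make_blobs` with three hand-picked centers as a stand-in (an assumption, not the stock dataset); in the notebook, `scaled_data` would take the place of `X`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized features (assumption: 3 true groups).
centers = [[0, 0, 0, 0], [8, 8, 0, 0], [0, 8, 8, 0]]
X_raw, _ = make_blobs(n_samples=300, centers=centers,
                      cluster_std=1.0, random_state=42)
X = StandardScaler().fit_transform(X_raw)

scores = {}
for k in (3, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.4f}")
```

On this synthetic data k=3 scores higher by construction. On the real features the comparison could go either way, so the qualitative advantages listed above are best read alongside a measured score for each k.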

### Next Steps
1. **Cluster Naming**: Assign meaningful business names to each cluster
2. **Validation**: Test cluster stability with different samples
3. **Feature Importance**: Identify which features drive cluster separation
4. **Temporal Analysis**: Examine how clusters change over time
5. **Business Rules**: Develop rules for assigning new recommendations to clusters
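
For step 5, one simple rule is nearest-centroid assignment: standardize a new row with the training mean and standard deviation, then pick the closest centroid. The sketch below shows the idea with NumPy only. The function `assign_cluster` and all numbers are hypothetical stand-ins for `scaler.mean_`, `scaler.scale_` and `final_km.cluster_centers_`.

```python
import numpy as np

def assign_cluster(x_new, feature_means, feature_stds, centroids):
    """Standardize raw rows with the training statistics, then pick the nearest centroid."""
    z = (np.atleast_2d(x_new) - feature_means) / feature_stds
    dists = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Hypothetical values standing in for the fitted scaler and model.
feature_means = np.array([10.0, 0.5])
feature_stds = np.array([2.0, 0.1])
centroids = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])  # in z-score units

new_rows = np.array([[8.0, 0.4], [10.1, 0.51], [12.5, 0.62]])
assigned = assign_cluster(new_rows, feature_means, feature_stds, centroids)
print(assigned)
```

With the fitted objects from this notebook, `final_km.predict(scaler.transform(new_df))` should give the same assignment, provided `new_df` has the same feature columns in the same order and has been cleaned the same way.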
