# Week 6: Unsupervised Learning

## Learning Objectives:
- Understand clustering algorithms
- Learn dimensionality reduction techniques
- Apply unsupervised learning to real data
- Understand when to use unsupervised methods

## Topics Covered:
- K-means clustering
- Principal Component Analysis (PCA)
- Hierarchical clustering
- DBSCAN clustering
- Cluster evaluation metrics

In [None]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, adjusted_rand_score
from sklearn.datasets import make_blobs, make_circles
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")

## 1. Introduction to Unsupervised Learning

Unsupervised learning finds hidden patterns in data without labeled examples. Unlike supervised learning, we don't have target variables to predict.

### Main Types:
1. **Clustering**: Grouping similar data points together
2. **Dimensionality Reduction**: Reducing the number of features while preserving information
3. **Association Rules**: Finding relationships between variables
4. **Anomaly Detection**: Identifying unusual data points

### Applications:
- Customer segmentation
- Market basket analysis
- Data compression
- Feature engineering
- Exploratory data analysis

In [None]:
# Create synthetic customer segmentation dataset
np.random.seed(42)
n_samples = 500

# Generate customer features
age = np.random.normal(40, 15, n_samples)
age = np.clip(age, 18, 80)
income = np.random.normal(50000, 25000, n_samples)
income = np.clip(income, 20000, 200000)
spending_score = np.random.normal(50, 20, n_samples)
spending_score = np.clip(spending_score, 1, 100)
annual_spending = np.random.normal(2000, 1000, n_samples)
annual_spending = np.clip(annual_spending, 200, 10000)

# Create some realistic relationships
# Higher income generally leads to higher spending
annual_spending = annual_spending + (income / 50000) * 1000 + np.random.normal(0, 500, n_samples)
# Younger people might have different spending patterns
spending_score = spending_score + (80 - age) * 0.3 + np.random.normal(0, 5, n_samples)
spending_score = np.clip(spending_score, 1, 100)

# Create DataFrame
customer_data = pd.DataFrame({
    'Age': age,
    'Income': income,
    'Spending_Score': spending_score,
    'Annual_Spending': annual_spending
})

print("Customer Segmentation Dataset:")
print(customer_data.head())
print(f"\nDataset shape: {customer_data.shape}")
print(f"\nBasic statistics:")
print(customer_data.describe())

In [None]:
# Visualize the data
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Distribution plots
customer_data['Age'].hist(bins=30, ax=axes[0, 0], alpha=0.7, color='skyblue')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].grid(True, alpha=0.3)

customer_data['Income'].hist(bins=30, ax=axes[0, 1], alpha=0.7, color='lightgreen')
axes[0, 1].set_title('Income Distribution')
axes[0, 1].set_xlabel('Income')
axes[0, 1].grid(True, alpha=0.3)

customer_data['Spending_Score'].hist(bins=30, ax=axes[1, 0], alpha=0.7, color='salmon')
axes[1, 0].set_title('Spending Score Distribution')
axes[1, 0].set_xlabel('Spending Score')
axes[1, 0].grid(True, alpha=0.3)

customer_data['Annual_Spending'].hist(bins=30, ax=axes[1, 1], alpha=0.7, color='gold')
axes[1, 1].set_title('Annual Spending Distribution')
axes[1, 1].set_xlabel('Annual Spending')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
correlation_matrix = customer_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
           square=True, fmt='.2f', cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

## 2. K-means Clustering

K-means is one of the most popular clustering algorithms. It partitions data into k clusters by minimizing the within-cluster sum of squares.

### How it works:
1. Choose number of clusters (k)
2. Initialize k cluster centroids (randomly, or with k-means++, which is scikit-learn's default)
3. Assign each point to nearest centroid
4. Update centroids to mean of assigned points
5. Repeat steps 3-4 until convergence
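
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the assign/update loop, not scikit-learn's implementation; the function name `kmeans_sketch` and the toy data are made up for the demo.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-means loop (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs (hypothetical data)
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
demo_labels, demo_centroids = kmeans_sketch(X_demo, k=2)
print(demo_centroids.round(2))
```

On data this cleanly separated, the loop should converge in a handful of iterations, with one centroid near each blob.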

### Key Properties:
- Requires specifying k in advance
- Works best when clusters are roughly spherical and similar in size
- Sensitive to initialization
- Scales well to large datasets
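
To see what the spherical-cluster assumption means in practice, here is a small sketch that runs K-means on two concentric rings from `make_circles`. The ring parameters are arbitrary choices for the demo. Because K-means separates clusters with straight boundaries, it tends to cut the rings in half rather than recover them.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: a shape K-means is not designed for
X_rings, y_rings = make_circles(n_samples=400, factor=0.4, noise=0.04, random_state=0)

ring_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_rings)

# ARI near 0 means the clustering agrees with the true rings no better than chance
ring_ari = adjusted_rand_score(y_rings, ring_labels)
print(f"ARI vs. true rings: {ring_ari:.3f}")
```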

In [None]:
# Prepare data for clustering
print("=== K-MEANS CLUSTERING ===")

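# Scale the features (important for K-means, which relies on Euclidean distance)
# Select the feature columns explicitly so that re-running this cell after the
# 'Cluster' column has been added does not scale the labels too
feature_cols = ['Age', 'Income', 'Spending_Score', 'Annual_Spending']
scaler = StandardScaler()
customer_scaled = scaler.fit_transform(customer_data[feature_cols])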

# Find optimal number of clusters using elbow method
k_range = range(1, 11)
inertias = []
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(customer_scaled)
    inertias.append(kmeans.inertia_)
    
    if k > 1:  # Silhouette score requires at least 2 clusters
        silhouette_avg = silhouette_score(customer_scaled, kmeans.labels_)
        silhouette_scores.append(silhouette_avg)
    else:
        silhouette_scores.append(0)  # Placeholder: silhouette is undefined for k=1

# Plot elbow curve and silhouette scores
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Elbow method
axes[0].plot(k_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia (WCSS)')
axes[0].set_title('Elbow Method for Optimal k')
axes[0].grid(True, alpha=0.3)

# Silhouette analysis (skip k=1, where the score is undefined)
axes[1].plot(list(k_range)[1:], silhouette_scores[1:], 'ro-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis for Optimal k')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find best k based on silhouette score (ignore the k=1 placeholder at index 0)
best_idx = int(np.argmax(silhouette_scores[1:])) + 1
best_k = k_range[best_idx]
print(f"Optimal k based on silhouette score: {best_k}")
print(f"Best silhouette score: {silhouette_scores[best_idx]:.4f}")

In [None]:
# Apply K-means with optimal k
optimal_kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=10)
cluster_labels = optimal_kmeans.fit_predict(customer_scaled)

# Add cluster labels to original data
customer_data['Cluster'] = cluster_labels

# Analyze clusters
print(f"\n=== CLUSTER ANALYSIS (K={best_k}) ===")
print("Cluster sizes:")
print(customer_data['Cluster'].value_counts().sort_index())

print("\nCluster centroids (original scale):")
cluster_centers = pd.DataFrame(
    scaler.inverse_transform(optimal_kmeans.cluster_centers_),
    columns=customer_data.columns[:-1]
)
cluster_centers['Cluster'] = range(best_k)
print(cluster_centers.round(2))

print("\nCluster statistics:")
cluster_stats = customer_data.groupby('Cluster').agg({
    'Age': ['mean', 'std'],
    'Income': ['mean', 'std'],
    'Spending_Score': ['mean', 'std'],
    'Annual_Spending': ['mean', 'std']
})
print(cluster_stats.round(2))

In [None]:
# Visualize clusters
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Different feature combinations
feature_pairs = [
    ('Age', 'Income'),
    ('Age', 'Spending_Score'),
    ('Income', 'Annual_Spending'),
    ('Spending_Score', 'Annual_Spending')
]

# One color per possible cluster; best_k can be as large as 10
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown',
          'pink', 'gray', 'olive', 'cyan']

for i, (x_feat, y_feat) in enumerate(feature_pairs):
    row = i // 2
    col = i % 2
    
    # Plot each cluster
    for cluster in range(best_k):
        cluster_data = customer_data[customer_data['Cluster'] == cluster]
        axes[row, col].scatter(cluster_data[x_feat], cluster_data[y_feat], 
                              c=colors[cluster], alpha=0.6, s=50, 
                              label=f'Cluster {cluster}')
    
    # Plot centroids
    for cluster in range(best_k):
        centroid = cluster_centers.iloc[cluster]
        axes[row, col].scatter(centroid[x_feat], centroid[y_feat], 
                              c='black', marker='x', s=200, linewidths=3)
    
    axes[row, col].set_xlabel(x_feat)
    axes[row, col].set_ylabel(y_feat)
    axes[row, col].set_title(f'Clusters: {x_feat} vs {y_feat}')
    axes[row, col].legend()
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 3. Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique that transforms data to a lower-dimensional space while preserving as much variance as possible.

### How it works:
1. Standardize the data
2. Calculate covariance matrix
3. Find eigenvectors and eigenvalues
4. Sort by eigenvalues (descending)
5. Transform data using selected components
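
The five steps can be followed by hand with NumPy and checked against scikit-learn. This is a sketch on made-up correlated data (the array `X_toy` is hypothetical); it only shows that the eigenvalues of the covariance matrix give the same explained variance ratios as `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: two features driven by one latent factor, plus one noise feature
latent = rng.normal(size=(200, 1))
X_toy = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Standardize the data
X_std = StandardScaler().fit_transform(X_toy)
# 2. Covariance matrix (features are in columns)
cov = np.cov(X_std, rowvar=False)
# 3. Eigenvectors and eigenvalues (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Sort by eigenvalue, descending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project the data onto the top two components
X_proj = X_std @ eigvecs[:, :2]

manual_ratio = eigvals / eigvals.sum()
sklearn_ratio = PCA().fit(X_std).explained_variance_ratio_
print("Manual: ", manual_ratio.round(4))
print("sklearn:", sklearn_ratio.round(4))
```

The projected coordinates may differ from scikit-learn's by a sign flip per component, since an eigenvector's direction is only defined up to sign.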

### Key Properties:
- Linear transformation
- Components are orthogonal
- Each component captures the maximum remaining variance
- Useful for visualization and feature reduction

In [None]:
# Principal Component Analysis
print("=== PRINCIPAL COMPONENT ANALYSIS ===")

# Apply PCA to the customer data
pca = PCA()
customer_pca = pca.fit_transform(customer_scaled)

# Analyze explained variance
explained_variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance_ratio)

print("Explained variance ratio for each component:")
for i, ratio in enumerate(explained_variance_ratio):
    print(f"PC{i+1}: {ratio:.4f} ({ratio*100:.2f}%)")

print(f"\nCumulative explained variance:")
for i, cum_var in enumerate(cumulative_variance):
    print(f"PC1-PC{i+1}: {cum_var:.4f} ({cum_var*100:.2f}%)")

# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Individual explained variance
axes[0].bar(range(1, len(explained_variance_ratio) + 1), explained_variance_ratio)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Explained Variance by Component')
axes[0].grid(True, alpha=0.3)

# Cumulative explained variance
axes[1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance, 'bo-')
axes[1].axhline(y=0.8, color='r', linestyle='--', alpha=0.7, label='80% variance')
axes[1].axhline(y=0.95, color='g', linestyle='--', alpha=0.7, label='95% variance')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Visualize PCA components
print("\n=== PCA COMPONENT ANALYSIS ===")

# Component loadings (weights)
components_df = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(len(pca.components_))],
    index=customer_data.columns[:-1]
)

print("Component loadings:")
print(components_df.round(3))

# Visualize first two components
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Component loadings heatmap
sns.heatmap(components_df.iloc[:, :2], annot=True, cmap='RdBu_r', center=0,
           ax=axes[0], cbar_kws={'shrink': 0.8})
axes[0].set_title('Component Loadings (PC1 & PC2)')

# 2D PCA visualization with clusters
for cluster in range(best_k):
    cluster_mask = cluster_labels == cluster
    axes[1].scatter(customer_pca[cluster_mask, 0], customer_pca[cluster_mask, 1], 
                   c=colors[cluster], alpha=0.6, s=50, label=f'Cluster {cluster}')

axes[1].set_xlabel(f'PC1 ({explained_variance_ratio[0]*100:.1f}% variance)')
axes[1].set_ylabel(f'PC2 ({explained_variance_ratio[1]*100:.1f}% variance)')
axes[1].set_title('Data in PCA Space (Colored by K-means Clusters)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Hierarchical Clustering

Hierarchical clustering creates a tree-like structure of clusters. It can be agglomerative (bottom-up) or divisive (top-down).

### Agglomerative Clustering:
1. Start with each point as its own cluster
2. Merge closest clusters
3. Repeat until one cluster remains
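
As a rough sketch of how the merge tree is used, the example below builds a linkage matrix with SciPy and then cuts the same tree at several cluster counts using `fcluster`. The three toy blobs are made-up data for the demo.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three hypothetical blobs of 20 points each
X_h = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0, 4, 8)])

# Each row of Z records one merge, so there are n_samples - 1 rows
Z = linkage(X_h, method='ward')
print("Linkage matrix shape:", Z.shape)

# The tree is built once; different cluster counts are just different cuts
for k in (2, 3, 4):
    labels_k = fcluster(Z, t=k, criterion='maxclust')  # labels start at 1
    print(f"k={k}: cluster sizes {np.bincount(labels_k)[1:]}")
```

This is why hierarchical clustering does not need k up front: the cut can be chosen after inspecting the dendrogram.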

### Key Properties:
- No need to specify number of clusters in advance
- Creates dendrogram showing cluster hierarchy
- Different linkage methods (single, complete, average, ward)
- Can handle non-spherical clusters with some linkage methods (e.g. single); Ward linkage favors compact clusters

In [None]:
# Hierarchical Clustering
print("=== HIERARCHICAL CLUSTERING ===")

# For visualization, use a subset of data
n_sample = 100
sample_indices = np.random.choice(len(customer_scaled), n_sample, replace=False)
customer_sample = customer_scaled[sample_indices]

# Create linkage matrix
linkage_matrix = linkage(customer_sample, method='ward')

# Plot dendrogram
plt.figure(figsize=(15, 8))
dendrogram(linkage_matrix, truncate_mode='level', p=10)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index or (Cluster Size)')
plt.ylabel('Distance')
plt.show()

# Apply hierarchical clustering to full dataset
n_clusters = best_k
hierarchical = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
hierarchical_labels = hierarchical.fit_predict(customer_scaled)

# Compare with K-means
print(f"\nCluster comparison (K-means vs Hierarchical):")
print(f"K-means cluster sizes: {np.bincount(cluster_labels)}")
print(f"Hierarchical cluster sizes: {np.bincount(hierarchical_labels)}")

# Calculate similarity between clusterings
ari_score = adjusted_rand_score(cluster_labels, hierarchical_labels)
print(f"\nAdjusted Rand Index (similarity): {ari_score:.4f}")

# Silhouette score for hierarchical clustering
hierarchical_silhouette = silhouette_score(customer_scaled, hierarchical_labels)
print(f"Hierarchical clustering silhouette score: {hierarchical_silhouette:.4f}")
print(f"K-means silhouette score: {silhouette_score(customer_scaled, cluster_labels):.4f}")

## 5. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks outliers.

### Key Concepts:
- **Core points**: Points with at least min_samples points within eps distance (scikit-learn counts the point itself)
- **Border points**: Non-core points that lie within eps distance of a core point
- **Noise points**: Points that are neither core nor border
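
The three point types can be recovered from a fitted scikit-learn `DBSCAN` model: `core_sample_indices_` lists the core points, label `-1` marks noise, and everything else is a border point. The sketch below uses a made-up dense blob plus one isolated point; the `eps` and `min_samples` values are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical data: one dense blob and a single far-away point
X_d = np.vstack([rng.normal(0, 0.3, (60, 2)), [[5.0, 5.0]]])

db = DBSCAN(eps=0.5, min_samples=5).fit(X_d)

is_core = np.zeros(len(X_d), dtype=bool)
is_core[db.core_sample_indices_] = True
is_noise = db.labels_ == -1
# Border points belong to a cluster but are not core points themselves
is_border = ~is_core & ~is_noise

print(f"core: {is_core.sum()}, border: {is_border.sum()}, noise: {is_noise.sum()}")
```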

### Key Properties:
- Determines the number of clusters from the data (given eps and min_samples)
- Can find arbitrarily shaped clusters
- Identifies outliers as noise
- Sensitive to hyperparameters (eps, min_samples)
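
A small sketch of two of these properties, using concentric rings from `make_circles`: with a suitable `eps`, DBSCAN can recover the ring shapes, while an `eps` that is too large merges everything into one cluster. The specific values (0.2 and 0.8) are picked by hand for this toy data and are not general recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X_rings, y_rings = make_circles(n_samples=400, factor=0.4, noise=0.04, random_state=0)

ring_results = {}
for eps in (0.2, 0.8):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X_rings)
    n_found = len(set(labels)) - (1 if -1 in labels else 0)
    ring_results[eps] = adjusted_rand_score(y_rings, labels)
    print(f"eps={eps}: {n_found} clusters, ARI={ring_results[eps]:.3f}")
```

The gap between the rings is about 0.6, so any `eps` larger than that lets the density chain jump from one ring to the other.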

In [None]:
# DBSCAN Clustering
print("=== DBSCAN CLUSTERING ===")

# Try different eps values
eps_values = [0.3, 0.5, 0.7, 1.0, 1.5]
min_samples = 5

dbscan_results = []

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    dbscan_labels = dbscan.fit_predict(customer_scaled)
    
    n_clusters = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
    n_noise = list(dbscan_labels).count(-1)
    
    if n_clusters > 1:
        # Caveat: noise points (label -1) are scored as if they formed one cluster
        silhouette = silhouette_score(customer_scaled, dbscan_labels)
    else:
        silhouette = -1
    
    dbscan_results.append({
        'eps': eps,
        'n_clusters': n_clusters,
        'n_noise': n_noise,
        'silhouette': silhouette,
        'labels': dbscan_labels
    })
    
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points, silhouette={silhouette:.4f}")

# Choose best eps based on silhouette score
best_dbscan = max(dbscan_results, key=lambda x: x['silhouette'])
print(f"\nBest DBSCAN: eps={best_dbscan['eps']}, silhouette={best_dbscan['silhouette']:.4f}")

# Visualize DBSCAN results
dbscan_labels = best_dbscan['labels']
unique_labels = set(dbscan_labels)
n_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)

plt.figure(figsize=(12, 8))

# Mask of points assigned to a cluster (everything except noise)
clustered_mask = dbscan_labels != -1
print(f"Points assigned to a cluster: {clustered_mask.sum()} of {len(dbscan_labels)}")

# Use first two PCA components for visualization
pca_2d = PCA(n_components=2)
customer_pca_2d = pca_2d.fit_transform(customer_scaled)

# Plot clusters
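unique_labels = sorted(set(dbscan_labels))
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown',
          'pink', 'gray', 'olive', 'cyan']

for k in unique_labels:
    # Cycle through the palette so every cluster gets plotted
    col = colors[k % len(colors)]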
    if k == -1:
        # Noise points
        col = 'black'
        marker = 'x'
        alpha = 0.3
        label = 'Noise'
    else:
        marker = 'o'
        alpha = 0.6
        label = f'Cluster {k}'
    
    class_member_mask = (dbscan_labels == k)
    xy = customer_pca_2d[class_member_mask]
    plt.scatter(xy[:, 0], xy[:, 1], c=col, marker=marker, alpha=alpha, s=50, label=label)

plt.title(f'DBSCAN Clustering (eps={best_dbscan["eps"]}, min_samples={min_samples})')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 6. Clustering Comparison

Let's compare all the clustering methods we've implemented.
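
The comparison relies on the silhouette score, so it helps to recall what it measures. For a point, let a be the mean distance to the other points in its own cluster and b the mean distance to the points in the nearest other cluster; the silhouette is (b - a) / max(a, b). The sketch below computes this by hand on four made-up points and checks it against scikit-learn. The helper `silhouette_by_hand` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Four hypothetical points in two obvious clusters
X_s = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels_s = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette of point i (assumes each cluster has at least 2 points)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # exclude the point itself
    a = dists[same].mean()  # mean distance within its own cluster
    # mean distance to the nearest other cluster
    b = min(dists[labels == other].mean()
            for other in set(labels) if other != labels[i])
    return (b - a) / max(a, b)

by_hand = np.array([silhouette_by_hand(X_s, labels_s, i) for i in range(len(X_s))])
print("By hand:", by_hand.round(4))
print("sklearn:", silhouette_samples(X_s, labels_s).round(4))
print("Mean matches silhouette_score:",
      np.isclose(by_hand.mean(), silhouette_score(X_s, labels_s)))
```

Values close to 1 indicate tight, well-separated clusters; values near 0 indicate overlapping clusters.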

In [None]:
# Compare all clustering methods
print("=== CLUSTERING METHODS COMPARISON ===")

# Prepare results
clustering_results = {
    'K-means': {
        'labels': cluster_labels,
        'silhouette': silhouette_score(customer_scaled, cluster_labels),
        'n_clusters': best_k,
        'n_noise': 0
    },
    'Hierarchical': {
        'labels': hierarchical_labels,
        'silhouette': hierarchical_silhouette,
        'n_clusters': best_k,
        'n_noise': 0
    },
    'DBSCAN': {
        'labels': dbscan_labels,
        'silhouette': best_dbscan['silhouette'],
        'n_clusters': best_dbscan['n_clusters'],
        'n_noise': best_dbscan['n_noise']
    }
}

# Print comparison
print(f"{'Method':<12} {'Clusters':<10} {'Noise':<8} {'Silhouette':<12}")
print("-" * 45)
for method, results in clustering_results.items():
    print(f"{method:<12} {results['n_clusters']:<10} {results['n_noise']:<8} {results['silhouette']:<12.4f}")

# Find best method
best_method = max(clustering_results.items(), key=lambda x: x[1]['silhouette'])
print(f"\nBest method based on silhouette score: {best_method[0]}")

## 7. Customer Segmentation Insights

Let's interpret our clustering results for business insights.

In [None]:
# Business insights from clustering
print("=== CUSTOMER SEGMENTATION INSIGHTS ===")

# Use K-means results for interpretation
cluster_profiles = customer_data.groupby('Cluster').agg({
    'Age': 'mean',
    'Income': 'mean',
    'Spending_Score': 'mean',
    'Annual_Spending': 'mean'
}).round(0)

print("\nCluster Profiles:")
print(cluster_profiles)

# Create business-friendly cluster names
cluster_names = {}
for cluster in range(best_k):
    profile = cluster_profiles.iloc[cluster]
    
    # Simple heuristic for naming
    if profile['Income'] > 60000 and profile['Spending_Score'] > 60:
        name = 'High-Value Customers'
    elif profile['Income'] > 60000 and profile['Spending_Score'] <= 60:
        name = 'Conservative Spenders'
    elif profile['Income'] <= 60000 and profile['Spending_Score'] > 60:
        name = 'Enthusiastic Spenders'
    elif profile['Age'] > 50:
        name = 'Mature Customers'
    else:
        name = 'Budget-Conscious'
    
    cluster_names[cluster] = name

print("\nBusiness Segment Names:")
for cluster, name in cluster_names.items():
    size = (customer_data['Cluster'] == cluster).sum()
    percentage = (size / len(customer_data)) * 100
    print(f"Cluster {cluster}: {name} ({size} customers, {percentage:.1f}%)")

# Marketing recommendations
print("\n=== MARKETING RECOMMENDATIONS ===")
for cluster, name in cluster_names.items():
    profile = cluster_profiles.iloc[cluster]
    print(f"\n{name}:")
    print(f"  - Average Age: {profile['Age']:.0f}")
    print(f"  - Average Income: ${profile['Income']:,.0f}")
    print(f"  - Spending Score: {profile['Spending_Score']:.0f}/100")
    print(f"  - Annual Spending: ${profile['Annual_Spending']:,.0f}")
    
    # Simple recommendations
    if 'High-Value' in name:
        print(f"  - Strategy: Premium products, VIP treatment, loyalty programs")
    elif 'Conservative' in name:
        print(f"  - Strategy: Value propositions, quality assurance, long-term benefits")
    elif 'Enthusiastic' in name:
        print(f"  - Strategy: Trendy products, social media engagement, exclusive access")
    elif 'Mature' in name:
        print(f"  - Strategy: Traditional channels, customer service, reliability")
    else:
        print(f"  - Strategy: Discounts, value products, price promotions")

## 8. Summary

Congratulations! You've mastered the fundamentals of unsupervised learning. Here's what you learned:

### Key Concepts Mastered:
1. **Clustering**: Grouping similar data points without labels
2. **K-means**: Partitioning data into k spherical clusters
3. **Hierarchical Clustering**: Creating tree-like cluster structures
4. **DBSCAN**: Density-based clustering that finds arbitrary shapes and outliers
5. **PCA**: Dimensionality reduction while preserving variance
6. **Cluster Evaluation**: Using silhouette score and other metrics

### Key Skills Acquired:
- Implementing various clustering algorithms
- Determining optimal number of clusters
- Evaluating clustering performance
- Applying dimensionality reduction techniques
- Interpreting clusters for business insights
- Choosing appropriate algorithms for different data types

### When to Use Each Algorithm:
- **K-means**: When you know the number of clusters and expect spherical clusters
- **Hierarchical**: When you want to explore different numbers of clusters
- **DBSCAN**: When you have irregular cluster shapes and want to identify outliers
- **PCA**: When you need dimensionality reduction for visualization or feature engineering

### Best Practices:
- Scale your features before distance-based clustering (K-means, hierarchical, DBSCAN) so that no single feature dominates
- Use multiple methods to validate results
- Consider domain knowledge when interpreting clusters
- Evaluate clustering quality with appropriate metrics
- Visualize results to understand cluster structure
- Test different hyperparameters systematically
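
To illustrate the scaling practice, here is a sketch with made-up data: an income-like feature on a large scale that carries no group structure, and a small-scale score that separates two groups cleanly. Without scaling, K-means is expected to split on the large-scale feature and miss the real groups. All names and values here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_group = np.repeat([0, 1], 100)
income = rng.normal(50_000, 15_000, 200)        # large scale, unrelated to the groups
score = true_group + rng.normal(0, 0.05, 200)   # small scale, separates the groups
X_raw = np.column_stack([income, score])

def ari_for(X):
    """Cluster X into 2 groups and compare with the true grouping."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    return adjusted_rand_score(true_group, labels)

ari_raw = ari_for(X_raw)
ari_scaled = ari_for(StandardScaler().fit_transform(X_raw))
print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling:    {ari_scaled:.3f}")
```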

### Real-world Applications:
- Customer segmentation for targeted marketing
- Market research and consumer behavior analysis
- Image segmentation and computer vision
- Gene expression analysis in bioinformatics
- Social network analysis
- Anomaly detection in cybersecurity

### Next Steps:
Next week, we'll explore general machine learning techniques, including ensemble methods, hyperparameter tuning, and model interpretability.