# Assignment 12: Clustering Analysis

## Dataset: EastWest Airlines Customer Segmentation

**Objective:** Implement K-Means, Hierarchical, and DBSCAN clustering.

**Topics Covered:**
- K-Means Clustering (Elbow Method)
- Hierarchical Clustering (Dendrogram)
- DBSCAN Clustering
- Silhouette Score

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

# Load data
# sheet_name=1 selects the second sheet of the workbook (0-based index)
df = pd.read_excel('EastWestAirlines.xlsx', sheet_name=1)
print("Dataset loaded! Shape:", df.shape)
df.head()

## Data Preprocessing

In [None]:
# Check for missing values
print("Missing values:", df.isnull().sum().sum())

# Drop ID column if exists
if 'ID#' in df.columns:
    df = df.drop('ID#', axis=1)

# Statistical summary
df.describe()

In [None]:
# Scale features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
print("Features scaled!")
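
To see what the scaling step does, the sketch below applies `StandardScaler` to a small synthetic array and checks that each column ends up with mean 0 and standard deviation 1. The data and the feature meanings in the comments are hypothetical and are not taken from the airline dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-feature data on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 20_000, size=500),  # a large-valued, balance-like feature
    rng.integers(0, 5, size=500),          # a small count-like feature
]).astype(float)

X_scaled = StandardScaler().fit_transform(X)  # returns a NumPy array

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)  # population std (ddof=0), as StandardScaler uses

print("Column means:", np.round(col_means, 6))
print("Column stds: ", np.round(col_stds, 6))
```

Without this step, distance-based methods such as K-Means would be dominated by whichever feature has the largest numeric range.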

## K-Means Clustering

In [None]:
# Elbow Method to find optimal K
print("=== Elbow Method ===")

inertias = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(df_scaled)
    inertias.append(kmeans.inertia_)

# Plot Elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.savefig('elbow_curve.png')
plt.show()
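
The elbow can be hard to read by eye, so one common cross-check is to compute the silhouette score for each candidate K and see where it peaks. The sketch below does this on synthetic blobs with four known centres; the data, the centre positions, and the K range are assumptions made for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with four well-separated groups (illustrative only)
X, _ = make_blobs(
    n_samples=600,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.8,
    random_state=42,
)
X = StandardScaler().fit_transform(X)

# Silhouette score for each candidate K
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"K={k}: silhouette={s:.4f}")
print("K with highest silhouette:", best_k)
```

On real customer data the peak is usually less sharp than on synthetic blobs, so the silhouette curve is best read alongside the elbow plot rather than instead of it.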

In [None]:
# Apply K-Means with optimal K
optimal_k = 4  # Based on elbow curve

kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['KMeans_Cluster'] = kmeans.fit_predict(df_scaled)

# Silhouette Score
silhouette_kmeans = silhouette_score(df_scaled, df['KMeans_Cluster'])
print("K-Means Silhouette Score:", round(silhouette_kmeans, 4))

# Cluster distribution
print("\nCluster Distribution:")
print(df['KMeans_Cluster'].value_counts().sort_index())

## Hierarchical Clustering

In [None]:
# Dendrogram
print("=== Hierarchical Clustering ===")

plt.figure(figsize=(15, 8))
# Only the first 200 rows are used so that the dendrogram stays readable
linkage_matrix = linkage(df_scaled[:200], method='ward')
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.savefig('dendrogram.png')
plt.show()
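
A dendrogram is read by cutting it at some height, and SciPy's `fcluster` can perform that cut on a linkage matrix. The sketch below builds a Ward linkage on synthetic data and asks for at most four flat clusters; the data is an assumption and stands in for the scaled airline features.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled feature matrix (illustrative only)
X, _ = make_blobs(
    n_samples=200,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.8,
    random_state=0,
)

# Ward linkage: each row of Z records one merge (n_samples - 1 merges in total)
Z = linkage(X, method='ward')

# Cut the tree so that at most 4 flat clusters remain; labels start at 1
flat_labels = fcluster(Z, t=4, criterion='maxclust')

sizes = np.bincount(flat_labels)[1:]  # drop the unused label 0
print("Linkage matrix shape:", Z.shape)
print("Cluster sizes:", sizes)
```

Note that `fcluster` labels start at 1, whereas scikit-learn's cluster labels start at 0.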

In [None]:
# Apply Hierarchical Clustering
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['Hierarchical_Cluster'] = hierarchical.fit_predict(df_scaled)

# Silhouette Score
silhouette_hier = silhouette_score(df_scaled, df['Hierarchical_Cluster'])
print("Hierarchical Silhouette Score:", round(silhouette_hier, 4))

print("\nCluster Distribution:")
print(df['Hierarchical_Cluster'].value_counts().sort_index())
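
Cluster IDs from two algorithms are arbitrary, so comparing value counts alone does not show whether K-Means and hierarchical clustering agree. One way to quantify agreement is the adjusted Rand index, which ignores label permutations. The sketch below uses synthetic data as an assumption; on the notebook's data the two label columns could be passed in the same way.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data (illustrative only)
X, _ = make_blobs(
    n_samples=400,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.8,
    random_state=7,
)

km_labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=4, linkage='ward').fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(km_labels, hc_labels)
print("Adjusted Rand index (K-Means vs hierarchical):", round(ari, 4))
```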

## DBSCAN Clustering

In [None]:
# Apply DBSCAN
print("=== DBSCAN Clustering ===")

dbscan = DBSCAN(eps=2.0, min_samples=5)
df['DBSCAN_Cluster'] = dbscan.fit_predict(df_scaled)

# Number of clusters (excluding noise points labeled as -1)
n_clusters = len(set(df['DBSCAN_Cluster'])) - (1 if -1 in df['DBSCAN_Cluster'].values else 0)
n_noise = list(df['DBSCAN_Cluster']).count(-1)

print("Number of clusters:", n_clusters)
print("Number of noise points:", n_noise)

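# Silhouette Score (excluding noise); defined only when at least 2 clusters exist
mask = (df['DBSCAN_Cluster'] != -1).to_numpy()
silhouette_dbscan = None
if n_clusters >= 2:
    silhouette_dbscan = silhouette_score(df_scaled[mask], df['DBSCAN_Cluster'][mask])
    print("DBSCAN Silhouette Score:", round(silhouette_dbscan, 4))
else:
    print("DBSCAN Silhouette Score: not defined (fewer than 2 clusters)")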
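
DBSCAN results depend strongly on `eps`. A common heuristic for choosing it is the k-distance curve: sort each point's distance to its `min_samples`-th nearest neighbour and look for the knee. The sketch below computes that curve on synthetic data; taking the 95th percentile as `eps` is a crude stand-in for reading the knee off a plot and is an assumption, not a recommended rule.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

min_samples = 5

# Synthetic data (illustrative only)
X, _ = make_blobs(
    n_samples=400,
    centers=[(-5, -5), (5, 5), (5, -5)],
    cluster_std=0.7,
    random_state=1,
)

# Distance from each point to its min_samples-th nearest neighbour.
# Querying the training data counts the point itself as the first neighbour,
# which matches how DBSCAN counts min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Crude stand-in for the knee of the curve (assumption)
eps_guess = float(np.percentile(k_dist, 95))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("eps guess:", round(eps_guess, 4))
print("Clusters:", n_clusters, "| Noise points:", n_noise)
```

Plotting `k_dist` against its index gives the usual k-distance plot; the fixed `eps=2.0` used in this notebook can be checked against it.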

## Visualization

In [None]:
# Compare clustering results using first 2 features
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# K-Means
axes[0].scatter(df_scaled[:, 0], df_scaled[:, 1], c=df['KMeans_Cluster'], cmap='viridis')
axes[0].set_title('K-Means Clustering')

# Hierarchical
axes[1].scatter(df_scaled[:, 0], df_scaled[:, 1], c=df['Hierarchical_Cluster'], cmap='viridis')
axes[1].set_title('Hierarchical Clustering')

# DBSCAN
axes[2].scatter(df_scaled[:, 0], df_scaled[:, 1], c=df['DBSCAN_Cluster'], cmap='viridis')
axes[2].set_title('DBSCAN Clustering')

plt.tight_layout()
plt.savefig('clustering_comparison.png')
plt.show()
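
Plotting only the first two features hides whatever structure lives in the remaining columns. One alternative is to project the scaled data onto its first two principal components and scatter those instead. The sketch below shows the projection on synthetic six-feature data, which is an assumption; the resulting `coords` could be used in place of `df_scaled[:, 0]` and `df_scaled[:, 1]`.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic six-feature data (illustrative only)
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)

# Share of total variance retained by the 2-D view
explained = float(pca.explained_variance_ratio_.sum())
print("Projected shape:", coords.shape)
print("Variance explained by 2 components:", round(explained, 4))
```

The explained-variance figure indicates how much of the data's spread the 2-D picture actually represents.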

In [None]:
# Compare silhouette scores
# (DBSCAN's score is computed on non-noise points only, so it is not directly comparable)
print("=== Algorithm Comparison ===")
print("K-Means Silhouette:", round(silhouette_kmeans, 4))
print("Hierarchical Silhouette:", round(silhouette_hier, 4))

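# Cluster characteristics for K-Means (feature means per cluster)
print("\n=== K-Means Cluster Characteristics ===")
other_labels = ['Hierarchical_Cluster', 'DBSCAN_Cluster']  # label columns, not features
cluster_summary = df.drop(columns=other_labels, errors='ignore').groupby('KMeans_Cluster').mean()
cluster_summary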
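
Cluster means are easier to interpret next to cluster sizes, and label columns from other algorithms should not be averaged as if they were features. The sketch below builds such a profile on a tiny hand-made table; the column names and values are hypothetical and only mimic the shape of the notebook's DataFrame.

```python
import pandas as pd

# Tiny hypothetical table: two features plus two label columns
df_demo = pd.DataFrame({
    "Balance": [1000, 1200, 50000, 52000, 900, 51000],
    "Bonus_miles": [0, 100, 20000, 22000, 50, 21000],
    "KMeans_Cluster": [0, 0, 1, 1, 0, 1],
    "Hierarchical_Cluster": [1, 1, 0, 0, 1, 0],
})

# Separate label columns from feature columns
label_cols = [c for c in df_demo.columns if c.endswith("_Cluster")]
feature_cols = [c for c in df_demo.columns if c not in label_cols]

# Mean of each feature per K-Means cluster, plus the cluster size
profile = df_demo.groupby("KMeans_Cluster")[feature_cols].mean()
profile["n_customers"] = df_demo["KMeans_Cluster"].value_counts().sort_index()

print(profile)
```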

## Summary

**Key Findings:**
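- K-Means with K = 4 (chosen from the elbow curve) partitions customers into four segments; the silhouette score indicates how well separated they are
- Ward-linkage hierarchical clustering with four clusters gives a second partition to compare against K-Means, and the dendrogram (built on the first 200 rows) shows how clusters merge
- DBSCAN (eps = 2.0, min_samples = 5) labels low-density points as noise (-1), which flags potential outliers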

**Insights:**
- Customers can be segmented by their flight frequency and bonus miles
- Different marketing strategies can target each cluster