# K-Means Clustering Assignment - Easy Level
## Mall Customers Dataset Analysis

**Objective**: Understand and apply basic K-Means clustering using key features.

**Dataset**: Mall Customers, with features such as Age, Annual Income, and Spending Score.

## Task 1: Load the Dataset

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

# Set style for better plots
plt.style.use('default')
sns.set_palette("husl")

In [None]:
# Load the dataset
df = pd.read_csv('Mall_Customers.csv')

# Display first few rows
print("First 5 rows of the dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

if df.isnull().sum().sum() == 0:
    print("✓ No missing values found!")
else:
    print("⚠ Missing values detected - handling required")
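
If the check above does report missing values, two common options are dropping the affected rows or filling numeric gaps. The sketch below shows both on a small made-up frame; the values and the choice of median imputation are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; column names mirror the Mall Customers file
toy = pd.DataFrame({
    'Age': [19, 21, np.nan, 23],
    'Annual Income (k$)': [15, np.nan, 16, 16],
    'Spending Score (1-100)': [39, 81, 6, 77],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill each numeric column with its own median
filled = toy.fillna(toy.median(numeric_only=True))

print(f"Rows before: {len(toy)}, after dropna: {len(dropped)}")
print(filled)
```

Dropping rows is simplest when only a few are affected; imputation keeps the sample size but changes the distribution slightly.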

## Task 2: Exploratory Data Analysis (EDA)

In [None]:
# Basic statistics
print("Dataset Info:")
df.info()
print("\nBasic Statistics:")
df.describe()

In [None]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Exploratory Data Analysis - Mall Customers', fontsize=16)

# Histogram for Age
axes[0, 0].hist(df['Age'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Distribution of Age')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Histogram for Annual Income
axes[0, 1].hist(df['Annual Income (k$)'], bins=20, alpha=0.7, color='lightgreen', edgecolor='black')
axes[0, 1].set_title('Distribution of Annual Income')
axes[0, 1].set_xlabel('Annual Income (k$)')
axes[0, 1].set_ylabel('Frequency')

# Box plot for Age
axes[1, 0].boxplot(df['Age'])
axes[1, 0].set_title('Box Plot - Age')
axes[1, 0].set_ylabel('Age')

# Box plot for Annual Income
axes[1, 1].boxplot(df['Annual Income (k$)'])
axes[1, 1].set_title('Box Plot - Annual Income')
axes[1, 1].set_ylabel('Annual Income (k$)')

plt.tight_layout()
plt.show()

## Task 3: Feature Selection

In [None]:
# Select 2 features for clustering: Annual Income and Spending Score
features = ['Annual Income (k$)', 'Spending Score (1-100)']
X = df[features].copy()

print(f"Selected features for clustering: {features}")
print(f"Feature matrix shape: {X.shape}")
print("\nFirst 5 rows of selected features:")
print(X.head())

In [None]:
# Visualize the selected features
plt.figure(figsize=(10, 6))
plt.scatter(X['Annual Income (k$)'], X['Spending Score (1-100)'], alpha=0.6, s=50)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Scatter Plot: Annual Income vs Spending Score (Before Clustering)')
plt.grid(True, alpha=0.3)
plt.show()

## Task 4: Implement K-Means Clustering

In [None]:
# Apply K-Means with k=3
k = 3
print(f"Applying K-Means clustering with k = {k}")

# Initialize and fit K-Means
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)

# Add cluster labels to the original dataframe
df['Cluster'] = clusters

print("✓ K-Means clustering completed!")
print("Cluster centers:")
print(kmeans.cluster_centers_)
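
To make it clearer what `fit_predict` does, here is a minimal NumPy sketch of the standard assign-then-update loop (Lloyd's algorithm). It is an illustration only: the function name `simple_kmeans`, the toy points, and the plain random initialization are assumptions, and scikit-learn's implementation differs (for example, it uses k-means++ initialization by default).

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Illustrative K-Means loop: assign points to the nearest centroid, then move centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up values)
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                       [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
toy_labels, toy_centroids = simple_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

Which group receives label 0 depends on the random starting points, which is why cluster numbers carry no meaning on their own.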

In [None]:
# Display cluster statistics
print("Cluster Distribution:")
cluster_counts = pd.Series(clusters).value_counts().sort_index()
for i in range(k):
    count = cluster_counts[i]
    percentage = (count / len(df)) * 100
    print(f"Cluster {i}: {count} customers ({percentage:.1f}%)")

# Calculate cluster statistics
print("\nCluster Statistics:")
cluster_stats = df.groupby('Cluster')[features].mean()
print(cluster_stats)
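
The value k = 3 is fixed by the assignment. A common way to sanity-check such a choice is the elbow method: fit K-Means for several values of k and look for the point where inertia (the within-cluster sum of squares) stops dropping sharply. The sketch below uses synthetic blobs from `make_blobs` as a stand-in for `X`, so the numbers it prints say nothing about the Mall Customers data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X: 200 points around 3 centers (assumption for illustration)
X_demo, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)

inertias = {}
for n_clusters in range(1, 7):
    model = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    model.fit(X_demo)
    # inertia_ is the sum of squared distances of samples to their closest centroid
    inertias[n_clusters] = model.inertia_

for n_clusters, inertia in inertias.items():
    print(f"k = {n_clusters}: inertia = {inertia:.1f}")
```

To apply this to the real data, replace `X_demo` with `X` and plot the inertia values against k.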

## Task 5: Visualize Clusters

In [None]:
# Create a 2D scatter plot with clusters
plt.figure(figsize=(12, 8))

# Define colors for clusters
colors = ['red', 'blue', 'green', 'purple', 'orange']

# Plot each cluster with different colors
for i in range(k):
    cluster_data = X[clusters == i]
    plt.scatter(cluster_data['Annual Income (k$)'], 
               cluster_data['Spending Score (1-100)'], 
               c=colors[i], 
               label=f'Cluster {i}', 
               alpha=0.6, 
               s=50)

# Plot cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], 
           c='black', marker='x', s=200, linewidths=3, 
           label='Centroids')

plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title(f'K-Means Clustering Results (k={k})\nMall Customers Segmentation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Cluster Interpretation and Analysis

In [None]:
# Cluster interpretation
print("Cluster Interpretation:")
print("-" * 20)

for i in range(k):
    cluster_data = df[df['Cluster'] == i]
    avg_income = cluster_data['Annual Income (k$)'].mean()
    avg_spending = cluster_data['Spending Score (1-100)'].mean()
    avg_age = cluster_data['Age'].mean()
    
    print(f"\nCluster {i} ({len(cluster_data)} customers):")
    print(f"  - Average Income: ${avg_income:.1f}k")
    print(f"  - Average Spending Score: {avg_spending:.1f}")
    print(f"  - Average Age: {avg_age:.1f} years")
    
    # Simple interpretation
    if avg_income < 40 and avg_spending < 40:
        interpretation = "Low Income, Low Spending"
    elif avg_income < 40 and avg_spending > 60:
        interpretation = "Low Income, High Spending"
    elif avg_income > 60 and avg_spending < 40:
        interpretation = "High Income, Low Spending"
    elif avg_income > 60 and avg_spending > 60:
        interpretation = "High Income, High Spending"
    else:
        interpretation = "Moderate Income, Moderate Spending"
    
    print(f"  - Profile: {interpretation}")
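
The if/elif chain above can also be pulled into a small function so the rule is easy to test and reuse. The sketch below keeps the same 40/60 thresholds; the function name `label_profile` and its `low`/`high` parameters are hypothetical. Note that any cluster whose averages fall outside the four corner regions, including one with low income but mid-range spending, is labeled moderate.

```python
def label_profile(avg_income, avg_spending, low=40, high=60):
    """Map average income (k$) and spending score to a coarse profile label."""
    if avg_income < low and avg_spending < low:
        return "Low Income, Low Spending"
    if avg_income < low and avg_spending > high:
        return "Low Income, High Spending"
    if avg_income > high and avg_spending < low:
        return "High Income, Low Spending"
    if avg_income > high and avg_spending > high:
        return "High Income, High Spending"
    # Everything else, e.g. mid-range income or mid-range spending
    return "Moderate Income, Moderate Spending"

# Made-up averages, not results from the dataset
print(label_profile(85.0, 20.0))
print(label_profile(30.0, 50.0))
```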

## Summary

### Assignment Completed Successfully! ✅

**Key Results:**
- Successfully loaded and analyzed 200 mall customers
- No missing values found in the dataset
- Selected Annual Income and Spending Score as clustering features
- Applied K-Means clustering with k=3
- Identified 3 customer segments (cluster numbers are arbitrary labels and can change with a different `random_state`):
  - **Cluster 0**: High Income, Low Spending customers
  - **Cluster 1**: High Income, High Spending customers
  - **Cluster 2**: Moderate Income, Moderate Spending customers

**Business Value:**
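These segments give the mall a basis for targeted marketing. For example, high-income, low-spending customers could be offered promotions designed to raise their spending, while high-income, high-spending customers are natural candidates for loyalty programs and premium product offerings.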