<img src="../images/cover.jpg" width="1920"/>

# Unsupervised Learning

## Introduction
Unsupervised learning is a type of machine learning in which algorithms find patterns in data that has no labeled responses. Supervised learning trains on input-output pairs (e.g., emails labeled as spam/not spam); unsupervised learning receives only the inputs and must discover structure in them on its own.

### Understanding Unsupervised Datasets: A Customer Segmentation Example

Customer segmentation divides a company's customer base into distinct groups, or segments, based on shared characteristics such as purchasing behavior, browsing patterns, and demographic information. The goal is to understand the traits and needs of each group so that the company can tailor its marketing, product offerings, and customer service to each one.

In unsupervised learning, customer segmentation is achieved without predefined labels or categories. For example:

- **Input Data**: Customer purchase history, browsing patterns, and demographic data.
- **No Labels**: We don't have predefined groups or labels for each customer.
- **Goal**: To discover natural clusters within the customer data that correspond to meaningful segments.

## K-means Clustering

### Introduction
K-means clustering is one of the simplest and most popular unsupervised learning algorithms. It partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean (cluster center).

<img src="../images/k_means_clusters.jpg" width="1920"/>
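
To make the "nearest mean" idea concrete, here is a minimal NumPy sketch of the usual iterative procedure (often called Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points, and repeat. The function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for illustration; scikit-learn's `KMeans` is more sophisticated (smarter initialization, multiple restarts).

```python
import numpy as np


def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's-algorithm sketch: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points keeps its previous position)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Two well-separated toy blobs (made-up data for illustration)
toy_rng = np.random.default_rng(0)
toy = np.vstack([
    toy_rng.normal(0.0, 0.5, size=(50, 2)),
    toy_rng.normal(5.0, 0.5, size=(50, 2)),
])
toy_labels, toy_centers = kmeans_sketch(toy, k=2)
print(toy_centers)
```

On data like this, the two recovered centers should sit near the two blob means. On less clearly separated data the result can depend on the random starting centers.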

### When to Use K-means
- When you want to segment data into distinct, non-overlapping groups
- When your data has roughly spherical clusters of similar size
- When the number of clusters is known or can be estimated
- When dealing with numerical features on comparable scales (K-means relies on Euclidean distance)
- When you need a simple, fast clustering solution
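
When the number of clusters is not known in advance, one common way to estimate it is to fit K-means for several values of k and compare the results. The sketch below records the inertia (within-cluster sum of squares, used for the "elbow" heuristic) and the silhouette score for each k. The synthetic `make_blobs` data and the range of k values are assumptions chosen for illustration, not part of the customer dataset used below.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data generated around 4 centers (illustration only)
X_demo, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.8, random_state=42)

results = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    # inertia_: sum of squared distances from each point to its cluster center
    # silhouette_score: ranges from -1 to 1, higher means better-separated clusters
    results[k] = (model.inertia_, silhouette_score(X_demo, model.labels_))
    print(f"k={k}: inertia={results[k][0]:.1f}, silhouette={results[k][1]:.3f}")

# Pick the k with the highest silhouette score
best_k = max(results, key=lambda k: results[k][1])
print("Highest silhouette at k =", best_k)
```

Inertia generally keeps falling as k grows, so it is read by looking for the point where the decrease levels off, whereas the silhouette score can be compared directly across values of k. Both are heuristics and may disagree.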

### Implementation

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [None]:
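# Example dataset: synthetic customers drawn from independent normal distributions
rng = np.random.default_rng(42)  # fixed seed so the notebook is reproducible
n_customers = 1000

data = {
    "customer_id": range(n_customers),
    "annual_income": rng.normal(50000, 15000, n_customers),
    "spending_score": rng.normal(50, 25, n_customers),
    "age": rng.normal(35, 12, n_customers),
}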

df = pd.DataFrame(data)
df.head()

In [None]:
# Standardize the features (zero mean, unit variance) so that no single feature dominates the distance computation
scaler = StandardScaler()
features_scaled = scaler.fit_transform(df[["annual_income", "spending_score", "age"]])

print(features_scaled[:5])

In [None]:
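# Visualize the standardized data before clustering
plt.figure(figsize=(10, 6))
plt.scatter(features_scaled[:, 0], features_scaled[:, 1], alpha=0.6)
plt.xlabel("Annual Income (standardized)")
plt.ylabel("Spending Score (standardized)")
plt.title("Customers Before Clustering")
plt.show()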

In [None]:
# Use the first two features (annual_income and spending_score) so the clusters can be plotted in 2D
X = features_scaled[:, :2]

In [None]:
# Define the number of clusters
n_clusters = 3

In [None]:
# Initialize and fit K-means (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

print("Cluster Centers:\n", centers)

In [None]:
# Show the cluster assignment for the first five customers
for x, label in zip(X[:5], cluster_labels[:5]):
    print(f"Input: {x}, Belongs to cluster: {label}")

In [None]:
# Plot the clusters and their centers (red crosses)
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap="viridis")
plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=200, linewidth=3)
plt.xlabel("Annual Income (standardized)")
plt.ylabel("Spending Score (standardized)")
plt.title("K-means Clustering Results")
plt.colorbar(scatter, label="Cluster")
plt.show()
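
Cluster centers computed on standardized features are hard to interpret directly. One way to read them as customer segments is to map them back to the original units with the scaler's `inverse_transform`. The sketch below is self-contained and uses its own small synthetic dataset and a scaler fitted on only the two clustered columns; the names `demo`, `demo_scaler`, and `demo_kmeans` are assumptions for illustration. In the notebook above the scaler was fitted on three columns, so it could not be applied as-is to the two-column centers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "annual_income": rng.normal(50000, 15000, 500),
    "spending_score": rng.normal(50, 25, 500),
})

# Fit the scaler on exactly the columns that will be clustered
demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo)

demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(demo_scaled)

# Convert the centers from standardized units back to income and spending score
centers_original = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=demo.columns,
)
# Count how many customers fall into each cluster
centers_original["n_customers"] = np.bincount(demo_kmeans.labels_, minlength=3)
print(centers_original.round(1))
```

Each row then describes one segment by its typical income and spending score, together with the number of customers assigned to it.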