# Class 4 Notebook – Unsupervised Learning: Overview and Applications

This notebook provides an **overview of Unsupervised Learning**, exploring its key concepts, types, and real-world applications.

Unlike supervised learning (Classes 2–3), where we have labeled data and predict targets, **unsupervised learning** discovers patterns in data **without labels**. This notebook introduces the main categories and use cases.

**Objective**: Understand what unsupervised learning is, when to use it, and explore its main categories:
- Clustering (grouping similar data)
- Dimensionality Reduction (reducing features)
- Anomaly Detection (finding outliers)

**Key idea**: Unsupervised learning helps us discover hidden structures, reduce complexity, and find patterns in unlabeled data.

We'll explore:

1. What is unsupervised learning?
2. Types of unsupervised learning
3. Clustering examples
4. Dimensionality reduction concepts
5. Real-world applications

Run the first code cell to confirm your environment works.

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/main/class-4-unsupervised-learning/03_class_4_unsupervised_learning_overview.ipynb)

> Tip: This notebook assumes you're comfortable with basic Python, NumPy, Pandas, and Matplotlib from Classes 2 and 3.

## What is Unsupervised Learning?

**Supervised Learning** (Classes 2–3):
- We have **labeled data** (features + target)
- Goal: Learn to predict the target from features
- Examples: Predict house price (regression), predict pass/fail (classification)

**Unsupervised Learning** (Class 4):
- We have **unlabeled data** (only features, no target)
- Goal: Discover hidden patterns, groups, or structures
- Examples: Customer segmentation, anomaly detection, dimensionality reduction

**Key difference**: In unsupervised learning, there is no "right answer" to learn from. Instead, we explore the data to find interesting patterns.
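
To make the contrast concrete, here is a minimal sketch using scikit-learn. The toy values, the feature meanings, and the choice of `LogisticRegression` and `KMeans` are illustrative assumptions, not part of the class material. The point is only the difference in what gets passed to `fit`: features plus labels versus features alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy feature matrix (made-up values): hours studied, hours slept
X = np.array([
    [1.0, 5.0], [2.0, 6.0], [1.5, 5.5],
    [8.0, 7.0], [9.0, 8.0], [8.5, 7.5],
])

# Supervised: labels y (0 = fail, 1 = pass) are supplied with the features
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
supervised_pred = clf.predict(X)

# Unsupervised: only X is supplied; the model proposes its own grouping
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
unsupervised_groups = km.labels_

print("Supervised predictions:  ", supervised_pred)
print("Unsupervised cluster ids:", unsupervised_groups)
```

Note that the cluster ids are arbitrary: cluster `0` does not mean "fail". Interpreting what each group represents is left to the analyst.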

## STEP 1: Install Required Libraries

If running locally, install the required packages. In Colab, these are already available.

In [None]:
# Install required libraries (run this if needed)
# Uncomment the line below if running locally and packages aren't installed
# !pip install numpy pandas matplotlib scikit-learn

## STEP 2: Import Libraries

Import NumPy, Pandas, and Matplotlib for data manipulation and visualization.

In [None]:
# Environment sanity check + imports
import platform

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

try:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    print("NumPy:", np.__version__, "| Pandas:", pd.__version__)
    print("All libraries imported successfully!")
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy pandas matplotlib scikit-learn")
    raise

## Types of Unsupervised Learning

Unsupervised learning can be divided into three main categories:

### 1. Clustering
- **Goal**: Group similar data points together
- **Examples**: Customer segmentation, image compression, document grouping
- **Algorithms**: K-Means, Hierarchical Clustering, DBSCAN

### 2. Dimensionality Reduction
- **Goal**: Reduce the number of features while preserving important information
- **Examples**: Data visualization, noise reduction, feature extraction
- **Algorithms**: PCA (Principal Component Analysis), t-SNE, Autoencoders

### 3. Anomaly Detection
- **Goal**: Identify unusual or outlier data points
- **Examples**: Fraud detection, network intrusion detection, quality control
- **Algorithms**: Isolation Forest, One-Class SVM, Local Outlier Factor
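
As a rough illustration of anomaly detection, the sketch below runs scikit-learn's `IsolationForest` on synthetic data. The data, the three planted outliers, and the `contamination=0.03` setting are assumptions chosen for this example; in practice the expected share of anomalies is rarely known in advance.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# 100 "normal" points around (50, 50) plus three planted outliers (assumed data)
normal_points = rng.normal(loc=[50, 50], scale=5, size=(100, 2))
odd_points = np.array([[5.0, 95.0], [95.0, 5.0], [100.0, 100.0]])
X = np.vstack([normal_points, odd_points])

# contamination is our guess at the fraction of anomalies in the data
iso = IsolationForest(contamination=0.03, random_state=42)
labels = iso.fit_predict(X)  # +1 = inlier, -1 = flagged as anomaly

print("Points flagged as anomalies:", int((labels == -1).sum()))
print(X[labels == -1])
```

Because `contamination` sets the score threshold, the model flags roughly that share of points whether or not they are truly unusual, so the flagged points are candidates for review rather than confirmed anomalies.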

## Clustering Example: Customer Segmentation

Let's create a simple example to illustrate clustering. We'll generate sample customer data and visualize how clustering groups similar customers.

In [None]:
# Concept: Create sample customer data for clustering demonstration
np.random.seed(42)

# Generate three distinct customer groups
group1 = np.random.normal([20, 30], 5, (20, 2))  # Low income, low spending
group2 = np.random.normal([50, 50], 5, (20, 2))  # Medium income, medium spending
group3 = np.random.normal([80, 70], 5, (20, 2))  # High income, high spending

# Combine all groups
customers = np.vstack([group1, group2, group3])

# Create DataFrame
df = pd.DataFrame(customers, columns=["Income", "Spending"])

print("Sample Customer Data:")
print(df.head(10))
print(f"\nTotal customers: {len(df)}")

In [None]:
# Concept: Visualize customer data (before clustering)
plt.figure(figsize=(8, 6))
plt.scatter(df["Income"], df["Spending"], s=100, alpha=0.6, c='blue')
plt.xlabel("Annual Income (thousands)")
plt.ylabel("Spending Score")
plt.title("Customer Data (Before Clustering)")
plt.grid(True, alpha=0.3)
plt.show()

print("Can you spot natural groups in this data?")

In [None]:
# Concept: Apply K-Means clustering to discover customer segments
from sklearn.cluster import KMeans

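# Apply K-Means with 3 clusters
# n_init is set explicitly because its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
# Fit on the feature columns only, so re-running this cell does not
# accidentally include the "Cluster" column as a feature
clusters = kmeans.fit_predict(df[["Income", "Spending"]])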

df["Cluster"] = clusters

print("Clustering complete!")
print("\nCluster assignments:")
print(df["Cluster"].value_counts().sort_index())

In [None]:
# Concept: Visualize clustered customer data
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    df["Income"],
    df["Spending"],
    c=df["Cluster"],
    cmap="viridis",
    s=100,
    alpha=0.7,
    edgecolor="k"
)
plt.scatter(
    kmeans.cluster_centers_[:, 0],
    kmeans.cluster_centers_[:, 1],
    c='red',
    marker='X',
    s=200,
    label='Centroids',
    linewidths=2
)
plt.xlabel("Annual Income (thousands)")
plt.ylabel("Spending Score")
plt.title("Customer Segments (K-Means Clustering)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.colorbar(scatter, label='Cluster', ticks=[0, 1, 2])
plt.show()

print("K-Means partitioned the customers into the 3 segments we asked for (n_clusters=3).")
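
K-Means does not choose the number of clusters itself; `n_clusters=3` was supplied by us. One common way to check that choice is to compare silhouette scores across several values of `k`. The sketch below is one possible approach, not the notebook's prescribed method. It regenerates the same synthetic customer data so that it runs on its own, and the range of `k` values tried is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Recreate the same three synthetic customer groups used above
np.random.seed(42)
X = np.vstack([
    np.random.normal([20, 30], 5, (20, 2)),
    np.random.normal([50, 50], 5, (20, 2)),
    np.random.normal([80, 70], 5, (20, 2)),
])

# Silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Highest silhouette score at k =", best_k)
```

On real data the scores are usually less clear-cut than on this well-separated synthetic example, so the silhouette score is best treated as one piece of evidence alongside domain knowledge.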

## Dimensionality Reduction: Concept

**Dimensionality reduction** reduces the number of features while preserving important information. This is useful for:
- Visualizing high-dimensional data
- Reducing noise
- Speeding up machine learning algorithms
- Feature extraction (building a few new features from many original ones)

**Principal Component Analysis (PCA)** is a common technique that finds orthogonal directions (principal components) along which the data varies the most, ordered from most to least variance.
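
To show what "directions where the data varies the most" means, the sketch below computes principal components by hand from the covariance matrix and compares the result with scikit-learn's `PCA`. The two-feature dataset and its coefficients (`0.8`, `0.2`) are assumptions made up for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two correlated features: x2 is mostly a scaled copy of x1 (assumed toy data)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Manual PCA: center the data, compute the covariance matrix,
# then take its eigenvectors (directions) and eigenvalues (variances)
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
manual_ratio = eigvals / eigvals.sum()

# scikit-learn's PCA should report the same variance split
pca = PCA(n_components=2).fit(X)

print("Manual explained variance ratio: ", manual_ratio)
print("sklearn explained variance ratio:", pca.explained_variance_ratio_)
print("First principal direction:", eigvecs[:, 0])
```

Because the two features are strongly correlated, nearly all of the variance lies along a single direction, which is why one component can stand in for both features here.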

In [None]:
# Concept: Demonstrate PCA for dimensionality reduction
from sklearn.decomposition import PCA

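# Create sample high-dimensional data (5 features)
# Note: these features are independent random noise, so no 2D projection can
# capture most of the variance; expect well under 100% explained below
np.random.seed(42)
high_dim_data = np.random.randn(100, 5)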

print(f"Original data shape: {high_dim_data.shape}")
print(f"Original dimensions: {high_dim_data.shape[1]}")

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(high_dim_data)

print(f"\nReduced data shape: {reduced_data.shape}")
print(f"Reduced dimensions: {reduced_data.shape[1]}")
print(f"\nVariance explained by first 2 components: {pca.explained_variance_ratio_.sum():.2%}")

In [None]:
# Concept: Visualize reduced-dimensional data
plt.figure(figsize=(8, 6))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], s=100, alpha=0.6)
plt.xlabel(f"First Principal Component ({pca.explained_variance_ratio_[0]:.1%} variance)")
plt.ylabel(f"Second Principal Component ({pca.explained_variance_ratio_[1]:.1%} variance)")
plt.title("Data Reduced from 5D to 2D using PCA")
plt.grid(True, alpha=0.3)
plt.show()

print("PCA reduced 5-dimensional data to 2 dimensions for visualization!")
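
How much information two components preserve depends on how much structure the data has. The sketch below contrasts pure noise, like the demo data above, with 5-feature data that is secretly driven by two hidden factors. The latent-factor setup, the random mixing weights, and the noise level `0.1` are all assumptions invented for this comparison.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Case 1: five independent noise features (no shared structure)
noise_data = rng.standard_normal((100, 5))

# Case 2: five features built from two hidden factors plus a little noise
latent = rng.standard_normal((100, 2))
mixing = rng.standard_normal((2, 5))  # hypothetical mixing weights
structured_data = latent @ mixing + 0.1 * rng.standard_normal((100, 5))

# Share of total variance kept by the first two principal components
noise_ratio = PCA(n_components=2).fit(noise_data).explained_variance_ratio_.sum()
structured_ratio = PCA(n_components=2).fit(structured_data).explained_variance_ratio_.sum()

print(f"Variance kept by 2 components (pure noise):      {noise_ratio:.1%}")
print(f"Variance kept by 2 components (structured data): {structured_ratio:.1%}")
```

PCA is most useful in the second situation, where features are correlated; on uncorrelated noise, dropping components simply discards variance.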

## Real-World Applications

### Clustering Applications
- **Customer Segmentation**: Group customers by purchasing behavior
- **Image Compression**: Reduce image file size by grouping similar pixels
- **Document Clustering**: Organize documents by topic
- **Gene Analysis**: Group genes with similar expression patterns
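
As an example of the image compression use case, the sketch below applies K-Means to pixel colors (often called color quantization). A random synthetic array stands in for a real image, and the palette size of 8 is an arbitrary assumption; a real photo would be loaded with an image library instead.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 32x32 RGB "image" (stand-in for a real photo)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(np.uint8)

# Treat every pixel as a point in 3D color space
pixels = image.reshape(-1, 3).astype(float)

# Cluster the colors; each centroid becomes one entry in a small palette
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pixels)
palette = km.cluster_centers_.round().astype(np.uint8)

# Replace each pixel with its cluster's palette color
quantized = palette[km.labels_].reshape(image.shape)

n_before = len(np.unique(image.reshape(-1, 3), axis=0))
n_after = len(np.unique(quantized.reshape(-1, 3), axis=0))
print(f"Distinct colors before: {n_before}, after: {n_after}")
```

Storing a small palette plus one palette index per pixel takes less space than storing a full RGB triple per pixel, at the cost of some color fidelity.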

### Dimensionality Reduction Applications
- **Data Visualization**: Visualize high-dimensional data in 2D/3D
- **Feature Engineering**: Reduce noise and improve model performance
- **Image Processing**: Compress images while preserving important features
- **Recommendation Systems**: Reduce feature space for faster recommendations

### Anomaly Detection Applications
- **Fraud Detection**: Identify unusual credit card transactions
- **Network Security**: Detect network intrusions
- **Quality Control**: Find defective products in manufacturing
- **Medical Diagnosis**: Identify unusual patient patterns

## Key Learning

**Unsupervised Learning** is a powerful approach for discovering patterns in unlabeled data:

- **Clustering** groups similar data points together
- **Dimensionality Reduction** simplifies data while preserving important information
- **Anomaly Detection** finds unusual patterns

**When to use unsupervised learning**:
- You have unlabeled data
- You want to explore and discover patterns
- You need to reduce data complexity
- You want to find outliers or anomalies

**Next steps**: Explore specific algorithms in detail:
- K-Means clustering (see `01_class_4_kmeans_basics.ipynb`)
- Hierarchical clustering (see `02_class_4_hierarchical_clustering_basics.ipynb`)
- PCA and other dimensionality reduction techniques