
# Overview of Unsupervised Learning

**Unsupervised learning** is a type of machine learning where the model learns patterns from unlabeled data. Unlike supervised learning, where the model is trained on labeled examples, unsupervised learning algorithms try to infer the underlying structure of the data without any explicit labels. This makes it particularly useful for exploratory data analysis, clustering, and dimensionality reduction.

## Types of Unsupervised Learning

Unsupervised learning can be broadly categorized into two main types:

-   **Dimensionality Reduction**: Reducing the number of features while preserving the essential structure of the data.
-   **Clustering**: Grouping similar data points together based on their features.

## Dimensionality Reduction

**Dimensionality reduction** techniques aim to reduce the number of features in a dataset while retaining its essential characteristics. This is particularly useful for visualizing high-dimensional data or improving the performance of machine learning models by reducing noise and computational complexity. Common dimensionality reduction techniques include:

-   **Principal Component Analysis (PCA)**: Projects the data onto a lower-dimensional space spanned by orthogonal axes (principal components), chosen so that each successive axis captures as much of the remaining variance as possible.
-   **t-Distributed Stochastic Neighbor Embedding (t-SNE)**: A non-linear technique, used mainly for visualization, that embeds high-dimensional data in two or three dimensions by minimizing the Kullback-Leibler divergence between the probability distributions of pairwise similarities in the high- and low-dimensional spaces.
-   **Autoencoders**: Neural networks that learn to encode the input data into a lower-dimensional representation and then decode it back to the original space. They are particularly useful for learning complex, non-linear mappings.
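
To make the PCA description above concrete, here is a minimal NumPy sketch of the usual recipe: center the data, take a singular value decomposition, and project onto the leading directions. The synthetic data and every name in it (`X_demo`, `mixing`, `X_proj`) are illustrative assumptions, not part of the demonstration later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples with 3 features that really
# live close to a 2-dimensional plane, plus a little noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -0.5]])
X_demo = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# 1. Center each feature so the components describe variance, not the mean.
X_centered = X_demo - X_demo.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions,
#    already sorted by decreasing singular value.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Keep the first two directions and project the data onto them.
components = Vt[:2]
X_proj = X_centered @ components.T

# Variance captured by each direction, and its share of the total.
explained_variance = S**2 / (len(X_demo) - 1)
explained_ratio = explained_variance / explained_variance.sum()
print(explained_ratio)
```

Because the synthetic data was built from two latent factors, almost all of the variance should fall on the first two components, which is the situation in which dropping the remaining axes loses little information.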

## Clustering

**Clustering** is often used to discover inherent groupings in data, such as customer segmentation in marketing or grouping similar documents in text analysis. Common clustering algorithms include:

-   **K-Means**: Partitions data into K clusters by minimizing the within-cluster sum of squared distances between each point and its cluster centroid.
-   **Hierarchical Clustering**: Builds a tree of clusters by either merging (agglomerative) or splitting (divisive) clusters based on a distance metric.
-   **DBSCAN**: Groups points that are closely packed while marking points that lie alone in low-density regions as outliers.
-   **Gaussian Mixture Models (GMM)**: Assumes that the data is generated from a mixture of several Gaussian distributions and uses the Expectation-Maximization algorithm to estimate the parameters of these distributions.
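
As an illustration of the K-Means entry above, the following is a rough sketch of Lloyd's algorithm, the classic iteration of alternating an assignment step and a centroid update. The function `kmeans_lloyd` and the synthetic blobs are hypothetical names made up for this example; scikit-learn's `KMeans`, used later in this notebook, adds better initialization and multiple restarts.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iter=50, seed=0):
    """Plain Lloyd iteration; returns labels, centroids, and inertia per step."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct samples chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    inertia_history = []
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        inertia_history.append(dists[np.arange(len(X)), labels].sum())
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids, inertia_history

# Three synthetic blobs (assumption, for illustration only).
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_blobs = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

labels, centroids, inertia_history = kmeans_lloyd(X_blobs, k=3)
print(f'Stopped after {len(inertia_history)} iterations')
print(f'Final within-cluster sum of squares: {inertia_history[-1]:.2f}')
```

The within-cluster sum of squares never increases from one iteration to the next, but the result depends on the initial centroids, which is why library implementations usually run several initializations and keep the best one.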

## Advantages and Disadvantages

-   **Advantages**:
    -   Can discover hidden patterns and structures in data.
    -   Useful for exploratory data analysis and feature engineering.
    -   Can handle large datasets without the need for labeled data.
-   **Disadvantages**:
    -   Results can be difficult to interpret, especially in clustering.
    -   No guarantee of finding meaningful patterns; results can be sensitive to the choice of algorithm and parameters.
    -   Evaluation of clustering results can be challenging without ground truth labels.

## Practical Demonstration

To illustrate the concepts of unsupervised learning, we will use the Iris dataset and apply PCA for dimensionality reduction and K-Means for clustering.

-   Load the iris dataset

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset (seaborn downloads it on first use)
df = sns.load_dataset('iris')
X = df.drop(columns=['species'])                # the four numeric features
y = df['species'].astype('category').cat.codes  # integer codes, used only to color plots

# Pairwise scatter plots of all features, colored by species.
# The pairplot labels each subplot's axes with the feature names itself.
sns.pairplot(df, hue='species', palette='viridis', markers=["o", "s", "D"])
plt.suptitle('Iris Dataset Pairplot', y=1.02)
plt.show()

-   Preprocess the data

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

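# Rescale each sample (row) to unit L2 norm. Note that Normalizer works per sample;
# per-feature standardization (zero mean, unit variance) would use StandardScaler instead.
scaler = Normalizer()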
X = scaler.fit_transform(X)
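
The scalers imported in this step behave quite differently, which is easy to overlook. The short sketch below contrasts `Normalizer` and `StandardScaler` on a tiny hand-made array (`X_small` is an illustrative assumption, not the Iris data).

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Tiny made-up array: 3 samples (rows) and 2 features (columns).
X_small = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 60.0]])

# Normalizer rescales every ROW to unit L2 norm (a per-sample operation).
X_rows = Normalizer().fit_transform(X_small)

# StandardScaler rescales every COLUMN to zero mean and unit variance
# (a per-feature operation).
X_cols = StandardScaler().fit_transform(X_small)

print('Row norms after Normalizer:      ', np.linalg.norm(X_rows, axis=1))
print('Column means after StandardScaler:', X_cols.mean(axis=0))
print('Column stds after StandardScaler: ', X_cols.std(axis=0))
```

Which of the two is appropriate depends on the data: row normalization keeps only the direction of each sample, while standardization puts features measured in different units on a comparable scale.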

-   Apply PCA for dimensionality reduction to 2 dimensions and visualize the results

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Apply PCA to reduce to 2 dimensions
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X)

# Plot the PCA results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Species')
plt.show()
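
A useful check after projecting to two dimensions is how much of the variance the two components retain. The sketch below reads this from the fitted model's `explained_variance_ratio_` attribute. To stay self-contained it reloads Iris through `sklearn.datasets.load_iris` and repeats the preprocessing; the names `X_iris`, `pca_check`, and `ratios` are assumptions introduced for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer

# load_iris ships with scikit-learn, so no download is needed.
X_iris = load_iris().data
X_iris_scaled = Normalizer().fit_transform(X_iris)

pca_check = PCA(n_components=2, random_state=42).fit(X_iris_scaled)

# Fraction of the total variance captured by each retained component.
ratios = pca_check.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f'PC{i}: {r:.1%} of the variance')
print(f'Total retained: {ratios.sum():.1%}')
```

If the total is low, the 2D scatter plot may hide structure that is present in the full feature space.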

-   Apply K-Means clustering on the PCA-reduced data and visualize the clusters

In [None]:
from sklearn.cluster import KMeans

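# Apply K-Means clustering. n_init=10 restarts from 10 random initializations and keeps
# the best run; the default max_iter is used so that each run has room to converge.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X_pca)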

# Plot the K-Means clustering results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, cmap='viridis', edgecolor='k', s=50)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=150, c='red', marker='X', label='Centroids')
plt.title('K-Means Clustering on PCA-reduced Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.colorbar(label='Cluster')
plt.show()
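
Because Iris happens to come with species labels, the clusters can be compared with them after the fact. This is a luxury that real unsupervised problems usually lack. The sketch below is one possible way to do it, using a contingency table and the adjusted Rand index; it rebuilds the pipeline from scratch with `load_iris`, and the names `X_2d`, `cluster_ids`, and `table` are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix
from sklearn.preprocessing import Normalizer

iris = load_iris()
X_2d = PCA(n_components=2, random_state=42).fit_transform(
    Normalizer().fit_transform(iris.data)
)
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_2d)

# Rows are the true species, columns are the cluster ids. Cluster ids are
# arbitrary, so look for one dominant entry per row rather than a diagonal.
table = contingency_matrix(iris.target, cluster_ids)
print(table)

# The adjusted Rand index ignores how clusters are numbered; it is 1.0 for a
# perfect match and close to 0 for a random assignment.
ari = adjusted_rand_score(iris.target, cluster_ids)
print(f'Adjusted Rand index: {ari:.2f}')
```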

-   Evaluate the clustering results using silhouette score

In [None]:
import numpy as np
from sklearn.metrics import silhouette_score

# Calculate silhouette score
silhouette_avg = silhouette_score(X_pca, y_kmeans)
print(f'Silhouette Score: {silhouette_avg:.2f}')
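
For each sample, the silhouette coefficient is `(b - a) / max(a, b)`, where `a` is the mean distance to the other points in its own cluster and `b` is the mean distance to the points in the nearest other cluster. The average ranges from -1 to 1, with higher values indicating tighter, better separated clusters. One common use is to compare several values of K, as in the sketch below (self-contained via `load_iris`; the names `X_2d` and `scores` are assumptions for this example).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import Normalizer

X_iris = Normalizer().fit_transform(load_iris().data)
X_2d = PCA(n_components=2, random_state=42).fit_transform(X_iris)

# Fit K-Means for several candidate K values and record the mean silhouette.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_2d)
    scores[k] = silhouette_score(X_2d, labels)
    print(f'k={k}: silhouette = {scores[k]:.3f}')

best_k = max(scores, key=scores.get)
print(f'Highest silhouette at k={best_k}')
```

The K with the highest silhouette is only a heuristic: it need not match the number of known classes, so it is best read together with the plots and domain knowledge.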

-   Visualize the decision boundary of K-Means clustering

In [None]:
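from sklearn.inspection import DecisionBoundaryDisplay
import matplotlib.pyplot as plt

# Color the plane by the cluster K-Means would assign to each location.
# eps pads the grid slightly beyond the data, and a few hundred grid points per axis
# give a smooth boundary at a fraction of the cost of a 5000 x 5000 grid.
disp = DecisionBoundaryDisplay.from_estimator(
    kmeans, X_pca, response_method="predict", cmap='viridis', alpha=0.8,
    eps=0.05, grid_resolution=500,
    xlabel='Principal Component 1', ylabel='Principal Component 2'
)
# Overlay the samples on the same axes, colored by their assigned cluster
disp.ax_.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, cmap='viridis', edgecolor='k', s=50)
disp.ax_.set_title('Decision Boundary of K-Means Clustering')
plt.show()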

## Hands-on Exercises

Apply PCA and K-Means clustering on a different dataset, such as the Wine dataset or the Breast Cancer dataset from scikit-learn. Visualize the results and interpret the clusters formed.

-   Load the dataset
-   Explore the dataset
-   Visualize the dataset
-   Preprocess the data
-   Apply PCA to reduce to 2 dimensions
-   Apply K-Means clustering and plot the results