# Unsupervised Learning
<!--<img src="https://cdn-images-1.medium.com/v2/resize:fit:1440/1*YUl_BcqFPgX49sSb5yrk3A.jpeg" style="width: 1000px;">-->

## Introduction

This notebook covers some techniques for unsupervised learning. In this setting, neither class labels (for classification) nor target values (for regression) are available, and any structure in the data is left for the model to discover. We will use modules and classes from the **sklearn** library; more advanced methods based on neural networks are described here:
- https://www.analyticsvidhya.com/blog/2018/05/essentials-of-deep-learning-trudging-into-unsupervised-deep-learning/

## Requirements

1. Python (preferably version > 3.7): https://www.python.org/downloads/
2. Numpy, Scipy and Matplotlib: https://www.scipy.org/install.html
3. Scikit-learn: http://scikit-learn.org/stable/install.html
4. Pandas: https://pandas.pydata.org/docs/getting_started/index.html

## Quick-start Setup
```bash
conda create --name ml_labs python=3.10
conda activate ml_labs
conda install -c conda-forge jupyterlab scikit-learn pandas
pip install matplotlib
jupyter lab
```

## References

- https://docs.scipy.org/doc/numpy/
- https://docs.scipy.org/doc/scipy/reference/
- https://matplotlib.org/users/index.html
- http://scikit-learn.org/stable/documentation.html



## Table of contents:
- ### Class discovery:
    - #### K-means clustering
    - #### Gaussian Mixture model
- ### Dimensionality reduction:
    - #### Principal Component Analysis 

We first silence warnings and import some helper functions that will be useful for plotting the data and the decision function of a trained ML model.

In [None]:
# Disable warnings within the notebook
import warnings
warnings.filterwarnings('ignore')

In [None]:
from utils.plot_utils import *

## Unsupervised Classification

<img src="img/kmean.png" style="width: 500px;"/>

Given some data ($X \subset \mathbb{R}^d$) and $K \in \mathbb{N}^+$ (the number of clusters), we want to split $X$ into $K$ partitions: $S_1, S_2, \dots, S_K$.

In [None]:
import numpy as np

from sklearn.datasets import make_blobs

np.random.seed(0)

centers = [[1, 1], [-1, -1], [1, -1]] # Centroids of the clusters
n_clusters = len(centers)
X, labels_true = ...

In [None]:
plot_blobs(X, labels_true)

### K-means Clustering


Each partition $S_i \subset X$ has a center point (centroid $c_i$). We would like to solve the following optimization problem:
$$ 
    \arg\min_S \sum_{i=1}^K \sum_{x \in S_i} dist(x, c_i)
$$
where $c_i$ is the mean of the points in $S_i$. 
We use the squared Euclidean (2-norm) distance $dist(x,c_i) = || x - c_i||^2$.
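
The objective above is commonly minimised with an alternating procedure (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. The block below is a minimal NumPy sketch of that idea on made-up data. The function name `lloyd_kmeans`, the random initialisation and the toy data are assumptions made for illustration; this is not scikit-learn's implementation.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Toy k-means (Lloyd's algorithm), for illustration only."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance of every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and value of the objective
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return centroids, labels, inertia

# Two well-separated toy groups (made-up data)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
toy_centroids, toy_labels, toy_inertia = lloyd_kmeans(X_toy, k=2)
print(toy_centroids.round(2), round(float(toy_inertia), 2))
```

The result depends on the initial centroids, which is why practical implementations usually restart from several initialisations and keep the solution with the lowest objective value.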

<img src="kmean_steps/1.png" style="width: 300px;"/> \


<img src="kmean_steps/2.png" style="width: 300px;"/> \


<img src="kmean_steps/3.png" style="width: 300px;"/> \


<img src="kmean_steps/4.png" style="width: 300px;"/> \


<img src="kmean_steps/5.png" style="width: 300px;"/> \


<img src="kmean_steps/6.png" style="width: 300px;"/> \


<img src="kmean_steps/7.png" style="width: 300px;"/> \


<img src="kmean_steps/8.png" style="width: 300px;"/>

In [None]:
import time
from sklearn.cluster import KMeans

# Create the KMeans estimator, choosing the initialization method, the number of clusters, and the number of restarts
k_means = ...

# start counting time
t0 = time.time()

# We fit K-means using the data
k_means.fit(X)

# end of training
t_batch = time.time() - t0

print('Required training time:', '{:.3f}'.format(t_batch), 'sec.' )

In [None]:
from sklearn.cluster import MiniBatchKMeans

batch_size = 45

# This variant implements KMeans by selecting mini-batches 
mbk = ...

# start counting time
t0 = time.time()

mbk.fit(X)

# end of training
t_mini_batch = time.time() - t0

print('Required training time:', '{:.3f}'.format(t_mini_batch), 'sec.' )

In [None]:
from sklearn.metrics.pairwise import pairwise_distances_argmin

# evaluate centers for two variants
k_means_cluster_centers = k_means.cluster_centers_
mbk_means_cluster_centers = mbk.cluster_centers_

# make sure to reorder cluster labels
order = pairwise_distances_argmin(k_means.cluster_centers_, mbk.cluster_centers_)
mbk_means_cluster_centers = mbk_means_cluster_centers[order]

# assign unsupervised labels based on minimum distance from cluster centers
k_means_labels = pairwise_distances_argmin(X, k_means_cluster_centers)
mbk_means_labels = pairwise_distances_argmin(X, mbk_means_cluster_centers)
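
Two clustering runs may number the same clusters differently, which is why the centers are reordered before comparing labels. The toy example below sketches how `pairwise_distances_argmin` can be used for that alignment; the arrays `centers_a` and `centers_b` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin

# Two hypothetical sets of centers describing the same clusters in a different order
centers_a = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
centers_b = np.array([[5.1, 4.9], [9.8, 0.2], [0.1, -0.1]])

# For each row of centers_a, index of the closest row in centers_b
order = pairwise_distances_argmin(centers_a, centers_b)
print(order)  # [2 0 1]

# After reordering, row i of centers_b_aligned matches row i of centers_a
centers_b_aligned = centers_b[order]
```

Note that this nearest-neighbour matching is only a valid permutation when the two sets of centers are close to each other; if two centers of the first set share the same nearest neighbour, one index appears twice.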

In [None]:
plot_K_means(X, 3, 
             k_means, k_means_labels, k_means_cluster_centers,
             mbk, mbk_means_labels, mbk_means_cluster_centers,
             t_batch, t_mini_batch)

### Decision surface

In [None]:
from utils.plot_utils import plot_kmeans_decision_boundaries

# Plot decision boundaries
plot_kmeans_decision_boundaries(X, k_means)
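
A decision surface of this kind can be obtained by predicting the cluster of every point on a regular grid. The sketch below shows the idea on made-up data; it is an assumption about how such a plot could be built, not the actual code of `plot_kmeans_decision_boundaries`, and all names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2D data: three groups around hand-picked locations
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc, 0.4, (60, 2)) for loc in ([0, 0], [4, 0], [2, 3])])

km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Evaluate the fitted model on a regular grid covering the data
x_min, y_min = X_demo.min(axis=0) - 1
x_max, y_max = X_demo.max(axis=0) + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
grid_labels = km_demo.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# grid_labels can then be drawn with, e.g., plt.contourf(xx, yy, grid_labels, alpha=0.3)
```

Because each grid point is assigned to its nearest centroid, the resulting regions are separated by straight lines (a Voronoi partition of the plane).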


### How many clusters?

Suppose we do not know how many clusters should be used in the problem. How can we choose a suitable number?

In [None]:
centers = [[1, 1], [-1, -1], [1, -1], [-1,1]]
n_clusters = len(centers)

X, labels_true = ...

plot_blobs(X, np.zeros(X.shape[0], dtype=int))

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
wcss, silhuoettes = [], [] 
models, labels, centers, = [], [], []
for i in range(1, 11): 
    kmeans = ...
    kmeans.fit(X) 
    models.append(kmeans)
    labels.append(kmeans.predict(X))
    centers.append(kmeans.cluster_centers_)
    
    wcss.append(kmeans.inertia_)
    if i > 1:
        silhuoettes.append(silhouette_score(X, labels[-1]) )
    else:
        silhuoettes.append(0)

In [None]:
plot_kmeans_clusters(X, labels, centers)


### Which one should we choose?

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 5))
fig.add_subplot(131)

plt.plot(range(1, 11), wcss, marker='*')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia') 

plt.xticks(range(1,11), list(range(1,11)))
plt.title('Elbow method')

fig.add_subplot(132)
plt.plot(range(1, 11), wcss, marker='*')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia') 

plt.xscale('log')
plt.yscale('log')
plt.xticks(range(1,11), list(range(1,11)))
plt.title('Elbow method (log/log scale)')

fig.add_subplot(133)
plt.plot(range(1, 11), silhuoettes, marker='*', color='orange')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette score') 
plt.title('Silhouette method')

plt.show()
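
The silhouette curve can also be read programmatically: a common heuristic is to keep the number of clusters with the highest silhouette score. The sketch below applies it to made-up data with four well-separated groups; the data, the candidate range and the `KMeans` settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up data: four tight groups at the corners of a square
rng = np.random.default_rng(0)
corners = [[3, 3], [-3, -3], [3, -3], [-3, 3]]
X_demo = np.vstack([rng.normal(c, 0.5, (100, 2)) for c in corners])

candidate_ks = range(2, 8)  # the silhouette score needs at least 2 clusters
scores = []
for k in candidate_ks:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels_k))

# Keep the candidate with the highest average silhouette
best_k = list(candidate_ks)[int(np.argmax(scores))]
print(best_k)
```

On data with overlapping clusters the maximum is usually less pronounced, so it is worth comparing it with the elbow of the inertia curve.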

## Gaussian Mixture Models

K-means works well but has some limitations. Consider the case where the clusters are not circular. K-means will struggle to find that there are actually three elliptic clusters:


<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2019/10/kmeans-fail-1.png" style="width: 500px;"/>

So instead of using a distance-based model, we will now use a distribution-based model. And that is where Gaussian Mixture Models come into play!
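
As a minimal sketch of what "distribution-based" means in practice, the block below fits a two-component mixture to made-up elongated data. Each component has its own mean and full covariance matrix, and every point receives a probability of belonging to each component. The data and all names ending in `_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data: two elongated (elliptic) Gaussian clouds
rng = np.random.default_rng(0)
cov = [[4.0, 0.0], [0.0, 0.1]]  # much more variance along the first axis
X_demo = np.vstack([
    rng.multivariate_normal([0, 0], cov, 200),
    rng.multivariate_normal([0, 2], cov, 200),
])

gmm_demo = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X_demo)

hard_labels = gmm_demo.predict(X_demo)        # most likely component for each point
soft_labels = gmm_demo.predict_proba(X_demo)  # one probability per component

print(gmm_demo.means_.round(2))     # estimated component means
print(gmm_demo.covariances_.shape)  # (2, 2, 2): one full covariance matrix per component
```

The soft assignments in `soft_labels` are what distinguishes a mixture model from K-means, which only returns a single hard label per point.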

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Generate synthetic data with clusters having increased variance along the first component
X, y = ...

plot_blobs(X, y)

In [None]:
# Create a Gaussian Mixture Model with a specified number of components (clusters)
n_components = 4
gmm = ...
kmeans = ...

# Fit the model to the data
kmeans.fit(X)
gmm.fit(X)

# Predict the cluster labels for each data point
k_labels = kmeans.predict(X)
labels = gmm.predict(X)

fig = plt.figure(figsize=(10,5))

fig.add_subplot(121)
plot_kmeans_decision_boundaries(X, kmeans)
plt.title('K-Means clustering')


fig.add_subplot(122)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=0.5, cmap='viridis')

# Plot ellipses for each Gaussian component
for i in range(n_components):
    plot_cov_ellipse(gmm.covariances_[i], gmm.means_[i], ax=plt.gca(), color='red', alpha=0.2)

plt.title('Gaussian Mixture Model Clustering with Ellipses')
plt.show()


### Solving classification problems with GMMs

In [None]:
# Author: Ron Weiss <ronweiss@gmail.com>, Gael Varoquaux
# Modified by Thierry Guillemot <thierry.guillemot.work@gmail.com>
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.mixture import GaussianMixture

from utils.plot_utils import estimate_GMMS

# Load the iris dataset.
iris = datasets.load_iris()

# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
X_train, X_test, y_train, y_test = ...

# Get the number of classes
n_classes = len(np.unique(y_train))



In [None]:
# Try GMMs using different types of covariances.
estimators = {
    cov_type: GaussianMixture(
        n_components=n_classes, covariance_type=cov_type, max_iter=20, random_state=0
    )
    for cov_type in ["spherical", "diag", "tied", "full"]
}

n_estimators = len(estimators)

for index, (name, estimator) in enumerate(estimators.items()):
    # Since we have class labels for the training data, we can
    # initialize the GMM parameters in a supervised manner.
    estimator.means_init = np.array(
        [X_train[y_train == i].mean(axis=0) for i in range(n_classes)]
    )

    # Train the other parameters using the EM algorithm.
    estimator.fit(X_train)

estimate_GMMS(estimators, X_train, X_test, y_train, y_test)
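
Because each component is initialised at the mean of one class, component index `i` can be read as class `i`, so the predictions of a fitted mixture can be compared directly with the true labels. The sketch below illustrates this for a single covariance type. It is only an assumption about the kind of evaluation a helper such as `estimate_GMMS` might perform; the split parameters and the names `Xtr`, `Xte`, `ytr`, `yte` and `gmm_clf` are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

iris_demo = datasets.load_iris()
# The split parameters below are assumptions made for this sketch
Xtr, Xte, ytr, yte = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.25,
    stratify=iris_demo.target, random_state=0,
)

n_cls = len(np.unique(ytr))
gmm_clf = GaussianMixture(n_components=n_cls, covariance_type="full",
                          max_iter=20, random_state=0)
# Component i starts at the mean of class i, so component indices line up with class labels
gmm_clf.means_init = np.array([Xtr[ytr == i].mean(axis=0) for i in range(n_cls)])
gmm_clf.fit(Xtr)

# Fraction of points whose predicted component equals the true class
train_acc = np.mean(gmm_clf.predict(Xtr) == ytr)
test_acc = np.mean(gmm_clf.predict(Xte) == yte)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

Without the supervised initialisation the component indices would be arbitrary, and a label-matching step would be needed before computing an accuracy.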

## Dimensionality Reduction

### Real Dataset: Wine Dataset

These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines [1].

[1] https://archive.ics.uci.edu/ml/datasets/wine
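
If the CSV file is not at hand, scikit-learn bundles a copy of the same UCI dataset. The sketch below loads it and checks the dimensions described above. Note that the column names in the bundled version may differ from those in `wine.csv`, and the variable names `wine`, `X_wine` and `y_wine` are chosen for illustration.

```python
from sklearn.datasets import load_wine

# Copy of the UCI Wine data bundled with scikit-learn, returned as pandas objects
wine = load_wine(as_frame=True)
X_wine, y_wine = wine.data, wine.target

print(X_wine.shape)      # (178, 13): 178 wines, 13 constituents
print(y_wine.nunique())  # 3: the three cultivars
```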

In [None]:
import pandas as pd

df_wine = pd.read_csv("../sklearn/data/wine.csv")
df_wine.head()

In [None]:
# Separate the data into target and features
y = df_wine["type"]
X = df_wine.drop(columns=["type"])

In [None]:
# Standardize the data
scaler = ...
X = scaler.fit_transform(X)

In [None]:
# Initialize a PCA object and transform the data
pca = ...
X_pca = pca.fit_transform(X)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Plot the explained variance ratio for each component
percent_variance = pca.explained_variance_ratio_

fig = plt.figure(figsize=(10,5))
fig.add_subplot(121)
plt.bar(np.arange(len(percent_variance)), height=percent_variance)
plt.ylabel("Explained variance ratio")
plt.xlabel("Principal Component")

fig.add_subplot(122)

sum_variance = [np.sum(percent_variance[:i+1]) for i in range(len(percent_variance))]

plt.plot(np.arange(len(percent_variance)), sum_variance)
plt.ylabel("Cumulative Variance")
plt.xlabel("Principal Component")


plt.show()

The elbow of the curve (the point where the line starts to bend) indicates how many components should be retained. In this case, that number is 3.
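
Reading the bend off a plot is subjective, so a complementary rule is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below illustrates this on the copy of the Wine data bundled with scikit-learn; the threshold of 0.80 is an arbitrary value chosen for illustration, and the names `X_demo`, `pca_demo` and `n_keep` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit a PCA that keeps all 13 components
X_demo = StandardScaler().fit_transform(load_wine().data)
pca_demo = PCA().fit(X_demo)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
threshold = 0.80  # arbitrary value chosen for illustration

# Index of the first component at which the cumulative ratio reaches the threshold, plus one
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative.round(3))
```

A variance threshold and the visual criterion can give different answers, so the choice ultimately depends on how much information the downstream task can afford to lose. scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which selects the number of components by the same kind of threshold.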

### PCA (2-dimensions)

In [None]:
# Initialize a PCA object (2 components) and transform data
pca = ...
X_pca = pca.fit_transform(X)

In [None]:
# Plot the data by using the utils.plot_utils.plot_pca_clusters function
from utils.plot_utils import plot_pca_clusters
plot_pca_clusters(X_pca, y)

### PCA (3-dimensions)

In [None]:
# Initialize a PCA object (3 components) and transform data
pca = ...
X_pca = pca.fit_transform(X)

In [None]:
# Plot the data by using the utils.plot_utils.plot_pca_clusters function (3D plot)
plot_pca_clusters(X_pca, y, True)