
In [None]:
# Load and process data
from scipy import linalg
import numpy as np
import pandas as pd

# Graphics/plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

# Clustering algorithms
from sklearn.cluster import KMeans
from sklearn import mixture # <- this contains the GMM

import itertools
color_iter = itertools.cycle(["navy", "c", "cornflowerblue", "gold", "darkorange"])

## Mickey Mouse Dataset

Each small ear should be its own cluster, separate from the big face.

In [None]:
n_samples = 500 # Samples in each initial Gaussian

# Generate random sample, three components (one face, two ears)
np.random.seed(0)

X = np.r_[
    1.5 * np.random.randn(n_samples, 2) + np.array([0, 0]),
    0.3 * np.random.randn(n_samples, 2) + np.array([-4, 3]),
    0.3 * np.random.randn(n_samples, 2) + np.array([4, 3]),
]
X_df = pd.DataFrame(data=X, columns=["x","y"])

sns.scatterplot(data=X_df, x="x", y="y")

### K-Means Clustering


In [None]:
# Fit k-means clustering with 3 clusters
km = KMeans(n_clusters=3, n_init=10).fit(X_df)

In [None]:
plt.axis('equal')
sns.scatterplot(data=X_df, x="x", y="y", hue=km.labels_)

K-means does okay, but it does not cleanly separate the "ear" clusters on their own.

### GMM

In [None]:
# Fit a Gaussian mixture with EM, using 3 components
gmm = mixture.GaussianMixture(n_components=3, covariance_type="full").fit(X)

In [None]:
gmm_labels = gmm.predict(X)

In [None]:
gmm_labels

In [None]:
plt.axis('equal')
sns.scatterplot(data=X_df, x="x", y="y", hue=gmm_labels)

GMM works very well here. It correctly identifies that the two clusters in the top corners have small variances, so it shrinks them while enlarging the central cluster.

K-Means cannot achieve this. It assigns each point to the nearest centroid, so the boundary between two clusters always lies halfway between their centroids.
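The nearest-centroid behavior can be checked directly. The following is a minimal sketch, assuming the same Mickey Mouse layout as above; `n_init=10` and `random_state=0` are illustrative choices, not settings taken from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same "Mickey Mouse" layout as above: one wide face, two tight ears
rng = np.random.RandomState(0)
n_samples = 500
X = np.r_[
    1.5 * rng.randn(n_samples, 2) + np.array([0, 0]),
    0.3 * rng.randn(n_samples, 2) + np.array([-4, 3]),
    0.3 * rng.randn(n_samples, 2) + np.array([4, 3]),
]

# n_init and random_state are assumptions made for reproducibility
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Euclidean distance from every point to every centroid, shape (1500, 3)
dists = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

# Fraction of points whose k-means label is simply the nearest centroid
agreement = np.mean(nearest == km.labels_)
print(f"labels matching nearest centroid: {agreement:.3f}")
```

Because the labels depend only on distance to the centroids, K-Means has no way to give the wide face more territory than the tight ears.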

## Anisotropic Clusters

Let's move to a case with two clear clusters, one of which is anisotropic: it is stretched out and rotated.

In [None]:
n_samples = 500 # Samples in each initial Gaussian

# Generate random sample, two components
np.random.seed(0)

rot1 = np.array([[0.0, -0.1], [1.7, 0.4]])
rot2 = np.array([[0.3, -0.1], [0.3, 1.4]])

X = np.r_[
    1.5 * np.random.randn(n_samples, 2) @ rot1 + np.array([0, 0]),
    1.0 * np.random.randn(n_samples, 2) + np.array([-2, 3]),
]
X_df = pd.DataFrame(data=X, columns=["x","y"])

sns.scatterplot(data=X_df, x="x", y="y")

### K-Means

In [None]:
# Fit k-means clustering with 2 clusters
km = KMeans(n_clusters=2, n_init=10).fit(X_df)
sns.scatterplot(data=X_df, x="x", y="y", hue=km.labels_)

K-Means struggles with anisotropic clusters. Its Euclidean distance measure builds in a circular (isotropic) assumption about cluster shape, and its decision regions are Voronoi cells around the centroids.

### GMM

In [None]:
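# Fit a Gaussian mixture with EM, using 2 Gaussians
gmm = mixture.GaussianMixture(n_components=2, covariance_type="full").fit(X)
sns.scatterplot(data=X_df, x="x", y="y", hue=gmm.predict(X))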

GMM does great if you get the number of clusters right.
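One way to see why a full-covariance GMM can cope with the stretched cluster is to inspect the covariance matrix it fits for each component. The sketch below regenerates the anisotropic data and compares the largest and smallest eigenvalue of each covariance; `n_init=5` and `random_state=0` are assumptions added for reproducibility, not settings from this notebook.

```python
import numpy as np
from sklearn import mixture

# Same anisotropic layout as above: one stretched cluster, one round cluster
rng = np.random.RandomState(0)
n_samples = 500
rot1 = np.array([[0.0, -0.1], [1.7, 0.4]])
X = np.r_[
    1.5 * rng.randn(n_samples, 2) @ rot1 + np.array([0, 0]),
    1.0 * rng.randn(n_samples, 2) + np.array([-2, 3]),
]

# n_init and random_state are illustrative assumptions
gmm = mixture.GaussianMixture(
    n_components=2, covariance_type="full", n_init=5, random_state=0
).fit(X)

# eigvalsh returns the eigenvalues of each 2x2 covariance in ascending order
eigvals = np.linalg.eigvalsh(gmm.covariances_)  # shape (2, 2)
ratios = eigvals[:, 1] / eigvals[:, 0]

for k, ratio in enumerate(ratios):
    print(f"component {k}: largest/smallest variance ratio = {ratio:.1f}")
```

A ratio near 1 indicates a roughly circular component, while a large ratio indicates a stretched one. K-Means has no comparable per-cluster shape parameter.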

If we try with too many clusters, it can still go wrong:

In [None]:
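# Fit a Gaussian mixture with EM, using 5 Gaussians
gmm = mixture.GaussianMixture(n_components=5, covariance_type="full").fit(X)
sns.scatterplot(data=X_df, x="x", y="y", hue=gmm.predict(X))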

To help choose the number of clusters, we can use a Bayesian Gaussian Mixture Model. This type of GMM can shrink the weights of components that the data do not support toward zero, effectively removing them. This means you don't have to know the exact number of clusters in advance, since the model can reduce the number on its own. You simply make a guess (erring on the side of overestimating), and the Bayesian GMM will usually get you at least close to the right number.
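The pruning effect shows up in the fitted mixture weights. The following is a rough sketch on the anisotropic data, assuming a deliberate overestimate of 10 components; `max_iter=500` and `random_state=0` are illustrative choices, and the exact weights will vary with the settings used.

```python
import numpy as np
from sklearn import mixture

# Same anisotropic layout as above: one stretched cluster, one round cluster
rng = np.random.RandomState(0)
n_samples = 500
rot1 = np.array([[0.0, -0.1], [1.7, 0.4]])
X = np.r_[
    1.5 * rng.randn(n_samples, 2) @ rot1 + np.array([0, 0]),
    1.0 * rng.randn(n_samples, 2) + np.array([-2, 3]),
]

# Deliberately overestimate the number of components (assumed value: 10)
bgmm = mixture.BayesianGaussianMixture(
    n_components=10,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

weights = bgmm.weights_                 # one mixture weight per component
labels = bgmm.predict(X)
n_used = len(np.unique(labels))         # components that receive any points

print("weights:", np.round(weights, 3))
print("components actually used:", n_used)
```

Components whose weight is close to zero contribute almost nothing to the mixture, which is why `predict` may return fewer distinct labels than `n_components`.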

In [None]:
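# Fit a Dirichlet process Gaussian mixture with 3 components
dp_gmm = mixture.BayesianGaussianMixture(n_components=3, covariance_type="full").fit(X)
dp_gmm_labels = dp_gmm.predict(X)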

In [None]:
dp_gmm_labels

In [None]:
set(dp_gmm_labels)

{0, 2}

In [None]:
sns.scatterplot(data=X_df, x="x", y="y", hue=dp_gmm_labels)

In [None]:
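# Fit a Dirichlet process Gaussian mixture with 10 components
dp_gmm = mixture.BayesianGaussianMixture(n_components=10, covariance_type="full").fit(X)
dp_gmm_labels = dp_gmm.predict(X)
sns.scatterplot(data=X_df, x="x", y="y", hue=dp_gmm_labels)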

## Overlapping Clusters

As a final challenge, let's consider what happens when our clusters overlap each other.

In [None]:
n_samples = 500 # Samples in each initial Gaussian

# Generate random sample, two components
np.random.seed(0)

rot1 = np.array([[0.0, -0.1], [1.7, 0.4]])
rot2 = np.array([[0.3, -0.1], [0.3, 1.4]])

X = np.r_[
    1.5 * np.random.randn(n_samples, 2) @ rot1 + np.array([0, 0]),
    1.0 * np.random.randn(n_samples, 2) + np.array([-2, 0]),
]
X_df = pd.DataFrame(data=X, columns=["x","y"])

sns.scatterplot(data=X_df, x="x", y="y")

### K-Means

In [None]:
# Fit k-means clustering
num_clusters = 2
km = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)
km.fit(X_df)
sns.scatterplot(data=X_df, x="x", y="y", hue=km.labels_)

K-Means splits off one part of one of the clusters. There is clear room for improvement.

### GMM

In [None]:
# Fit a Gaussian mixture with EM
gmm = mixture.GaussianMixture(n_components=2, covariance_type="full").fit(X)
gmm_labels = gmm.predict(X) # Assign each point to its most likely component
sns.scatterplot(data=X_df, x="x", y="y", hue=gmm_labels)

GMM does very well here. It correctly produces a non-contiguous cluster, and ambiguous points are given very reasonable labels.
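For the ambiguous points, a GMM can also report how confident each assignment is, via `predict_proba`. The sketch below regenerates the overlapping data and counts low-confidence points; the 0.9 cutoff and `random_state=0` are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn import mixture

# Same overlapping layout as above: the round cluster sits on the stretched one
rng = np.random.RandomState(0)
n_samples = 500
rot1 = np.array([[0.0, -0.1], [1.7, 0.4]])
X = np.r_[
    1.5 * rng.randn(n_samples, 2) @ rot1 + np.array([0, 0]),
    1.0 * rng.randn(n_samples, 2) + np.array([-2, 0]),
]

# random_state is an assumption added for reproducibility
gmm = mixture.GaussianMixture(
    n_components=2, covariance_type="full", random_state=0
).fit(X)

# Posterior probability of each component for each point, shape (1000, 2)
proba = gmm.predict_proba(X)
confidence = proba.max(axis=1)

# Flag points where neither component is clearly favored (arbitrary cutoff)
ambiguous = confidence < 0.9
print(f"ambiguous points: {ambiguous.sum()} of {len(X)}")
```

These soft assignments are something K-Means does not provide, since it only returns a single hard label per point.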