# Mixtures of Gaussians

Mixtures of Gaussians offer a more robust way to model data than K-Means clustering. We can also use them to perform outlier detection.

## Objectives

1. Model digits data using a Gaussian Mixture Model.
2. Use the model as a guide to help young students learn how to write digits.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from torchvision.datasets import MNIST
from sklearn.model_selection import train_test_split

np.random.seed(0)

In [3]:
# Download the data and split into training and testing
train = MNIST(root='data', train=True, download=True)
test = MNIST(root='data', train=False, download=True)

# Extract the images and labels
X_train, y_train = train.data.numpy(), train.targets.numpy()
X_test, y_test = test.data.numpy(), test.targets.numpy()

In [39]:
# Fit a Gaussian mixture model to the training data
n_components = 10
gmm = GaussianMixture(n_components=n_components, random_state=0)
gmm.fit(X_train.reshape(-1, 28*28))

# Validate the model by predicting the test data
# (the test images must be flattened to 784-vectors, just like the training images)
y_pred = gmm.predict(X_test.reshape(-1, 28*28))

# Report the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy}')

The above cell will take quite a while to train (5 minutes on my system). Each component must estimate a $784\times 784$ covariance matrix as well as a 784-dimensional mean vector. This is a total of $784^2 + 784 = 615{,}440$ parameters per component (roughly half of the covariance entries are redundant, since the matrix is symmetric). With 10 components, this is over 6 million parameters.
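The per-component count can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the helper name `params_per_component` is made up here, and the `"full"` and `"diag"` rows are added for comparison (symmetric covariance, and one variance per dimension as in scikit-learn's `covariance_type='diag'`).

```python
def params_per_component(d, covariance="full_raw"):
    """Mean + covariance parameter count for one d-dimensional Gaussian."""
    if covariance == "full_raw":
        cov = d * d                # every entry of the d x d matrix
    elif covariance == "full":
        cov = d * (d + 1) // 2     # free entries of a symmetric matrix
    elif covariance == "diag":
        cov = d                    # one variance per dimension
    else:
        raise ValueError(f"unknown covariance kind: {covariance}")
    return cov + d                 # add the d entries of the mean vector


d, k = 784, 10  # flattened 28x28 image, 10 mixture components
for kind in ("full_raw", "full", "diag"):
    per = params_per_component(d, kind)
    print(f"{kind:>8}: {per:,} per component, {k * per:,} for {k} components")
```

The gap between the full and diagonal rows is one reason a lower-dimensional feature representation makes the model much cheaper to fit.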

## Histograms of Oriented Gradients

We can reduce the number of parameters by using a feature extractor. This could be a pre-trained neural network or a hand-crafted feature extractor. We will use the Histogram of Oriented Gradients (HOG) feature extractor.

In [5]:
from skimage.feature import hog

# Extract HOG features from the training and test images
X_train_hog = np.array([hog(x, pixels_per_cell=(7, 7), cells_per_block=(2, 2)) for x in X_train])
X_test_hog = np.array([hog(x, pixels_per_cell=(7, 7), cells_per_block=(2, 2)) for x in X_test])
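To see how much HOG shrinks the problem, the length of the feature vector can be worked out by hand. The sketch below assumes the default of 9 orientation bins and that blocks slide one cell at a time, which is my understanding of how `skimage.feature.hog` lays out its descriptor; the helper `hog_feature_length` is hypothetical and not part of scikit-image. Comparing the result with `X_train_hog.shape[1]` is a quick way to confirm the assumption.

```python
def hog_feature_length(image_shape, pixels_per_cell, cells_per_block, orientations=9):
    """Expected length of a HOG descriptor under the assumptions stated above."""
    # Number of whole cells that fit along each image axis
    n_cells_row = image_shape[0] // pixels_per_cell[0]
    n_cells_col = image_shape[1] // pixels_per_cell[1]
    # Blocks overlap and move one cell at a time
    n_blocks_row = n_cells_row - cells_per_block[0] + 1
    n_blocks_col = n_cells_col - cells_per_block[1] + 1
    # Each block holds one orientation histogram per cell
    per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return n_blocks_row * n_blocks_col * per_block


n_features = hog_feature_length((28, 28), (7, 7), (2, 2))
full_cov_params = n_features ** 2 + n_features  # same counting as for raw pixels

print(f"HOG features per image: {n_features}")
print(f"Parameters per component (full covariance): {full_cov_params:,}")
```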

## Initializing the Means

By default, scikit-learn initializes the means with k-means (`init_params='kmeans'`), so the order of the components is arbitrary: we cannot guarantee that cluster 1 will coincide with digit 1, and so on. If we want to use the model as a classifier, we can initialize the means with the average sample of each class.

In [18]:
# Compute the mean HOG feature vector of each digit class
means = np.zeros((10, X_train_hog.shape[1]))
for i in range(10):
    means[i] = np.mean(X_train_hog[y_train == i], axis=0)

In [19]:
# Fit a Gaussian mixture model to the training data
gmm = GaussianMixture(n_components=10, means_init=means, random_state=0)
gmm.fit(X_train_hog)

# Validate the model by predicting the test data
y_pred = gmm.predict(X_test_hog)

# Report the accuracy
accuracy = np.mean(y_pred == y_test)
print(f'Accuracy: {accuracy}')

Accuracy: 0.1392
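Note that `means_init` only sets the starting point of EM, so components are free to drift away from the digit they were seeded with. One way to check whether the components still carry class information is to relabel each component with the most common true label among the samples assigned to it. The following is a minimal sketch on toy arrays, not a result from this notebook; `majority_vote_mapping` is a hypothetical helper.

```python
import numpy as np


def majority_vote_mapping(components, labels, n_components, n_classes):
    """Map each component index to the most frequent true label inside it."""
    # counts[c, k] = number of samples in component c whose true label is k
    counts = np.zeros((n_components, n_classes), dtype=int)
    np.add.at(counts, (components, labels), 1)
    return counts.argmax(axis=1)


# Toy stand-ins for component assignments and true labels
components = np.array([0, 0, 1, 1, 2, 2, 2])
labels = np.array([2, 2, 0, 0, 1, 1, 0])

mapping = majority_vote_mapping(components, labels, n_components=3, n_classes=3)
relabelled = mapping[components]          # translate component ids into labels
accuracy = np.mean(relabelled == labels)

print("component -> label:", mapping)
print(f"accuracy after relabelling: {accuracy:.3f}")
```

In this notebook, the mapping would be built from `gmm.predict(X_train_hog)` and `y_train`, then applied to `gmm.predict(X_test_hog)` before comparing with `y_test`, so that the test labels are never used to choose the mapping.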
