# Gaussian Mixture Models

*k*-means clustering suffers from a major problem: because it assigns each data point to the cluster centre that is closest in Euclidean distance, it implicitly assumes that the clusters are roughly circular (spherical in higher dimensions).
We can see this in the example below, where *k*-means is applied to skewed data.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans

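n_samples = 200

# Generate three isotropic blobs, then skew them with a linear transformation
X, y = datasets.make_blobs(n_samples=n_samples, random_state=170)
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(X, transformation)
data = pd.DataFrame(X_aniso, columns=['x1', 'x2'])

# Fit k-means with three clusters and store the label assigned to each point
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
data['kmeans'] = kmeans.labels_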

fig, ax = plt.subplots()
sns.scatterplot(x='x1', y='x2', hue='kmeans', data=data, ax=ax)
plt.show()

It is clear to us that there are three clusters, and which data points belong to each.
However, because *k*-means assigns points purely by Euclidean distance to the nearest cluster centre, the *wrong* clusters are identified.
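
To make the distance-based assignment concrete, the following is a minimal sketch that recomputes the cluster labels by hand from the fitted centres. It rebuilds the same skewed data so that it can run on its own; setting `n_init=10` explicitly and the names `distances`, `manual_labels` and `agreement` are assumptions made for this illustration, not part of the example above.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

# Recreate the skewed data (same seed and transformation as above).
X, _ = datasets.make_blobs(n_samples=200, random_state=170)
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# n_init is set explicitly here (an assumption for this sketch).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_aniso)

# Euclidean distance from every point to every cluster centre: shape (200, 3).
distances = np.linalg.norm(
    X_aniso[:, None, :] - kmeans.cluster_centers_[None, :, :], axis=2
)

# Each point is given the label of its nearest centre.
manual_labels = distances.argmin(axis=1)

# Fraction of points where the hand-computed label agrees with scikit-learn.
agreement = np.mean(manual_labels == kmeans.predict(X_aniso))
print(agreement)
```

Since the only information used is the distance to each centre, the direction in which a cluster is stretched plays no role in the assignment.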

Gaussian mixture models (GMMs) are another clustering method that also follows an expectation-maximisation algorithm.
GMMs are able to overcome the spherical-cluster limitation of *k*-means.
Let's see them in action.

````{margin}
```{note}
The `'kmeans'` label is dropped from `data` for running the GMM, so that it does not influence the outcome.
```
````

In [None]:
from sklearn.mixture import GaussianMixture

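# Remove the k-means labels so that only x1 and x2 are used to fit the GMM
gmm_data = data.drop('kmeans', axis=1)
# Fix random_state so that the fit is reproducible
gmm = GaussianMixture(n_components=3, random_state=0).fit(gmm_data)
data['gmm'] = gmm.predict(gmm_data)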

fig, ax = plt.subplots()
sns.scatterplot(x='x1', y='x2', hue='gmm', data=data, ax=ax)
plt.show()
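
One way to check numerically that the fitted components are elongated rather than circular is to inspect the covariance matrix of each component. The sketch below is self-contained and rebuilds the same data; `random_state=0`, the explicit `covariance_type="full"` (the scikit-learn default) and the name `ratios` are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

# Recreate the skewed data (same seed and transformation as above).
X, _ = datasets.make_blobs(n_samples=200, random_state=170)
X_aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# "full" gives every component its own unrestricted covariance matrix.
gmm = GaussianMixture(
    n_components=3, covariance_type="full", random_state=0
).fit(X_aniso)

# One 2x2 covariance matrix per component.
print(gmm.covariances_.shape)

# Ratio of the largest to the smallest eigenvalue of each covariance matrix.
# A ratio of 1 corresponds to a circular component; larger values correspond
# to a more stretched ellipse.
ratios = []
for cov in gmm.covariances_:
    eigvals = np.linalg.eigvalsh(cov)
    ratios.append(eigvals.max() / eigvals.min())
print(ratios)
```

Ratios well above 1 would indicate that each component has been fitted as a stretched ellipse, which is the shape that a purely distance-based assignment cannot represent.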

Clearly, the Gaussian mixture model approach can capture this skew, which is not possible with *k*-means.
We can now have a look at the algorithm to understand *why* this is possible. 