# LSE ST451: Bayesian Machine Learning
## Author: Kostas Kalogeropoulos

## Week 7: Mixture Models

Topics covered 
 - Fitting Gaussian Mixture models using the EM algorithm
 - Obtaining information on soft allocation of individuals
 - Model Choice within the family of Gaussian Mixtures
 - Bayesian approach with overfitted mixtures

Standard libraries will be used, with the addition of two classes from sklearn for the EM and Variational Bayes approaches to Gaussian mixtures.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import numpy as np

from sklearn import datasets
#The next two lines import the classes for the two methods we will look into today
from sklearn.mixture import GaussianMixture
from sklearn.mixture import BayesianGaussianMixture

### Load the Iris dataset

The Iris dataset consists of sepal and petal measurements (length and width) for 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray. The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width. The plots below use pairs of these features.

In [None]:
iris = datasets.load_iris()
#next we import it into a pandas data frame for convenience (not necessary)
pdiris = pd.DataFrame(iris.data, columns=iris.feature_names)
print(pdiris.shape)
pdiris.head()

### Plots 

Below we will see some 2-d plots just to get a feel for the data. There appears to be some clustering, but it is hard to infer the number of clusters in the 4-d dataset from 2-d plots.

In [None]:
plt.plot(pdiris['sepal length (cm)'], pdiris['petal length (cm)'], 'o')
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')

In [None]:
plt.plot(pdiris['sepal width (cm)'], pdiris['petal width (cm)'], 'o')
plt.xlabel('sepal width (cm)')
plt.ylabel('petal width (cm)')

### Fitting GMMs using the EM algorithm

The code for doing so is given below. The '.fit' call obtains the MLEs of the means and covariances, which can be viewed using '.means_' and '.covariances_'.

We start by inspecting and visualising a 2-d dataset with only the 'sepal width (cm)' and 'petal width (cm)' variables. The full dataset is analysed afterwards.

In [None]:
vars = ['sepal width (cm)','petal width (cm)']
gmm = GaussianMixture(n_components=2)
gmm.fit(pdiris[vars])
print(gmm.means_)
print('\n')
print(gmm.covariances_)
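
Besides the means and covariances, the fitted object also stores the estimated mixing proportions and some convergence diagnostics. The sketch below refits the same 2-d model on its own; the names `X2` and `gmm2` and the fixed `random_state` are assumptions added here for reproducibility, not part of the original analysis.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

# Same two variables as above: sepal width (column 1) and petal width (column 3)
X2 = datasets.load_iris().data[:, [1, 3]]

# random_state is fixed only so that this sketch is reproducible
gmm2 = GaussianMixture(n_components=2, random_state=0).fit(X2)

print(gmm2.converged_)          # whether EM reached the convergence tolerance
print(gmm2.n_iter_)             # number of EM iterations used
print(gmm2.weights_)            # estimated mixing proportions, one per component
print(gmm2.means_.shape)        # (2, 2): components x features
print(gmm2.covariances_.shape)  # (2, 2, 2) for the default 'full' covariance type
```

If `converged_` is False, increasing `max_iter` or trying several initialisations with `n_init` is a reasonable first step.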

### Soft allocation of individuals to clusters

The GMM method does not necessarily allocate individuals to clusters with certainty, but with probabilities.

Adding up the probabilities can give us an idea of how many individuals each cluster has.

In [None]:
probs = gmm.predict_proba(pdiris[vars])
print(np.sum(probs,axis=0))
print(probs[21:50].round(3))
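
To make the link between soft and hard allocation concrete, the sketch below checks that each row of probabilities sums to one, that the hard labels are the most probable cluster, and flags individuals whose allocation is uncertain. The 0.9 cut-off and the names `X2`, `gmm2`, `resp`, `hard` and `uncertain` are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X2 = datasets.load_iris().data[:, [1, 3]]  # sepal width, petal width
gmm2 = GaussianMixture(n_components=2, random_state=0).fit(X2)

# Membership probabilities: one row per individual, one column per cluster
resp = gmm2.predict_proba(X2)

# Each row is a probability distribution over the clusters
print(np.allclose(resp.sum(axis=1), 1.0))

# Hard allocation picks the most probable cluster for each individual
hard = resp.argmax(axis=1)
print(np.array_equal(hard, gmm2.predict(X2)))

# Individuals whose largest membership probability is below 0.9 (arbitrary cut-off)
uncertain = np.where(resp.max(axis=1) < 0.9)[0]
print(uncertain)

# Soft counts (sums of probabilities) next to hard counts
print(resp.sum(axis=0).round(1), np.bincount(hard, minlength=2))
```

The soft and hard counts differ only through the individuals that sit between clusters.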

### Model Search

We need to fit models with different numbers of clusters and different types of covariance matrices to identify the best one. This is done via the BIC (the smaller the better in this case).

Types of covariance matrices:
 - spherical: each cluster k has covariance $\sigma^2_k I$
 - tied: full covariance matrix but the same across clusters
 - diag: diagonal covariance matrix, different for each cluster
 - full: full covariance matrix, different for each cluster
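
The four options differ in how many covariance parameters are estimated, which shows up in the shape of `.covariances_`. The following sketch fits each type to the full Iris data with an arbitrary choice of 3 components (an assumption made only for illustration) and prints the shapes.

```python
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X_iris = datasets.load_iris().data  # 150 x 4
k = 3  # number of components, chosen only for illustration

shapes = {}
for cv_type in ['spherical', 'tied', 'diag', 'full']:
    gmm_cv = GaussianMixture(n_components=k, covariance_type=cv_type,
                             random_state=0).fit(X_iris)
    shapes[cv_type] = gmm_cv.covariances_.shape
    print(cv_type, shapes[cv_type])

# spherical: (3,)      one variance per cluster
# tied:      (4, 4)    one full matrix shared by all clusters
# diag:      (3, 4)    one vector of variances per cluster
# full:      (3, 4, 4) one full matrix per cluster
```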
 

In [None]:
lowest_bic = np.inf  # np.infty was removed in NumPy 2.0

#Consider k=1,...,8 and four types of covariance matrix
n_components_range = range(1, 9)
cv_types = ['spherical', 'tied', 'diag', 'full']
bic = np.zeros((len(n_components_range),len(cv_types))) #matrix to store the BICs
j = -1
for cv_type in cv_types:
    j = j+1
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(pdiris)
        bicij = gmm.bic(pdiris)  #get the BIC 
        bic[n_components-1,j] = bicij
        #the code below keeps track of the model with the lowest BIC
        if bicij < lowest_bic:
            lowest_bic = bicij
            best_gmm = gmm
print(lowest_bic)
bic = pd.DataFrame(bic,columns = cv_types,index=n_components_range)
bic

In [None]:
print(best_gmm.means_)
print('\n')
print(best_gmm.covariances_)
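
For reference, the BIC used above has the form $\mathrm{BIC} = -2\log\hat{L} + p\log n$, where $\hat{L}$ is the maximised likelihood, $p$ the number of free parameters and $n$ the sample size. The sketch below recomputes it by hand for a 2-component model with 'full' covariances and compares it with `.bic`; the parameter count is written out for this covariance type only, and the variable names are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.mixture import GaussianMixture

X_iris = datasets.load_iris().data
n, d = X_iris.shape
k = 2  # illustrative choice

gmm_full = GaussianMixture(n_components=k, covariance_type='full',
                           random_state=0).fit(X_iris)

# Free parameters for 'full' covariances:
# k-1 weights, k*d means and k*d*(d+1)/2 covariance entries
n_params = (k - 1) + k * d + k * d * (d + 1) // 2

# score() returns the average log-likelihood per observation
log_lik = gmm_full.score(X_iris) * n
bic_manual = -2 * log_lik + n_params * np.log(n)

print(bic_manual, gmm_full.bic(X_iris))
```

The penalty term $p\log n$ is what makes richer covariance structures and extra components pay for their added flexibility.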

### Activity 1

Repeat the analysis using only two of the four variables. Do we get a different conclusion on the number of clusters? 

Put your code below

### Simulate data to test the method

So far we have been looking at a dataset where we are not sure about the 'true' number of clusters and type of covariance matrix.

In what follows we will simulate data from a Gaussian mixture with three components and spherical covariance matrix.

In [None]:
# Number of samples per component
n_samples = 200
np.random.seed(5)

# Generate random sample, three components 1. mean (0,0) cov=1I, 2. mean (-6, 3) cov=.16I and
# 3. mean (3, -4) cov=9I (the multipliers .4 and 3 below are standard deviations)
X = np.r_[np.random.randn(n_samples, 2), 
          .4 * np.random.randn(n_samples, 2) + np.array([-6, 3]), 
         3* np.random.randn(n_samples, 2) + np.array([3, -4])] 
print(X.shape)
gmm = GaussianMixture(n_components=3,covariance_type='spherical')
gmm.fit(X)
print(gmm.means_)
print(gmm.covariances_)
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)                                                     
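
Because the data are simulated, the true component of every point is known, so the recovered clustering can be checked numerically as well as visually. The sketch below regenerates the same data and compares the predicted labels with the true ones using the adjusted Rand index, which is invariant to label permutations; `true_labels` and the use of `adjusted_rand_score` are additions made here, not part of the original analysis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

n_samples = 200
np.random.seed(5)
X_sim = np.r_[np.random.randn(n_samples, 2),
              .4 * np.random.randn(n_samples, 2) + np.array([-6, 3]),
              3 * np.random.randn(n_samples, 2) + np.array([3, -4])]

# True component of each row, in the order in which the blocks were stacked
true_labels = np.repeat([0, 1, 2], n_samples)

gmm_sim = GaussianMixture(n_components=3, covariance_type='spherical',
                          random_state=0).fit(X_sim)
labels_sim = gmm_sim.predict(X_sim)

# 1 means perfect agreement, values near 0 mean agreement no better than chance
ari = adjusted_rand_score(true_labels, labels_sim)
print(round(ari, 3))

# Estimated standard deviations, to compare with the multipliers 1, .4 and 3
print(np.sqrt(gmm_sim.covariances_).round(2))
```

The order of the estimated components is arbitrary, so the standard deviations need not appear in the same order as in the simulation.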

### Test the method

Below we apply the previous model search procedure to the data contained in X. We would like to test whether the optimal model will indeed be the one with three components and spherical covariance.

In [None]:
lowest_bic = np.inf  # np.infty was removed in NumPy 2.0
n_components_range = range(1, 7)
cv_types = ['spherical', 'tied', 'diag', 'full']
bic = np.zeros((len(n_components_range),len(cv_types)))
j = -1
for cv_type in cv_types:
    j = j+1
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(X)
        bicij = gmm.bic(X)
        bic[n_components-1,j] = bicij
        if bicij < lowest_bic:
            lowest_bic = bicij
            best_gmm = gmm
print(lowest_bic)
bic = pd.DataFrame(bic,columns = cv_types,index=n_components_range)
bic
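
Rather than reading the minimum off the table by eye, the winning combination can be extracted from the data frame directly. The sketch below rebuilds the BIC table on the same simulated data in a compact form and then locates its smallest entry; the names `bic_table`, `best_cv` and `best_k` and the fixed `random_state` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

n_samples = 200
np.random.seed(5)
X_sim = np.r_[np.random.randn(n_samples, 2),
              .4 * np.random.randn(n_samples, 2) + np.array([-6, 3]),
              3 * np.random.randn(n_samples, 2) + np.array([3, -4])]

cv_types = ['spherical', 'tied', 'diag', 'full']
ks = range(1, 7)

# One column per covariance type, one row per number of components
bic_table = pd.DataFrame(
    {cv: [GaussianMixture(n_components=k, covariance_type=cv,
                          random_state=0).fit(X_sim).bic(X_sim) for k in ks]
     for cv in cv_types},
    index=ks)

best_cv = bic_table.min().idxmin()    # column containing the overall minimum
best_k = bic_table[best_cv].idxmin()  # row of the minimum within that column
print(best_k, best_cv, bic_table.loc[best_k, best_cv])
```

Since EM depends on its initialisation, the selected model can change between runs unless `random_state` is fixed or `n_init` is increased.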

### Activity 2

Conduct another simulation experiment generating data from a Gaussian mixture. Choose your own number of components, means and covariances.

Put your code below

### Overfitted Mixtures

Now we will explore what happens when we fit a model with more components than were used to generate the data.

In [None]:
gmm = GaussianMixture(n_components=6,covariance_type='spherical')
gmm.fit(X)
probs = gmm.predict_proba(X)
results = np.sum(probs,axis=0)
results = pd.DataFrame(results.round(0), columns = ['# of individuals'], index=range(1,7))
results

In [None]:
labels = gmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels) 

## Bayesian Gaussian Mixture models

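We will also apply the fully Bayesian model to the same data, with a Dirichlet prior on the cluster probabilities and a low hyperparameter (weight_concentration_prior) of 0.01. This choice penalises redundant clusters by not allocating individuals to them unless it is necessary. Note that sklearn fits this model with variational inference, and its default weight_concentration_prior_type is 'dirichlet_process'; set it to 'dirichlet_distribution' for a finite Dirichlet prior.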

In [None]:
Bgmm = BayesianGaussianMixture(n_components=6,covariance_type='full',
                               weight_concentration_prior=0.01, max_iter = 200)
Bgmm.fit(X)
probs = Bgmm.predict_proba(X)
results = np.sum(probs,axis=0)
results = pd.DataFrame(results.round(0), columns = ['# of individuals'], index=range(1,7))
results

In [None]:
labels = Bgmm.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
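
Another way to see which clusters were effectively switched off is to look at the estimated mixing weights. The sketch below refits the same model on the regenerated simulated data and counts the components whose weight exceeds a small threshold; the 0.01 threshold, the fixed `random_state` and the names `bgmm_sim` and `n_active` are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

n_samples = 200
np.random.seed(5)
X_sim = np.r_[np.random.randn(n_samples, 2),
              .4 * np.random.randn(n_samples, 2) + np.array([-6, 3]),
              3 * np.random.randn(n_samples, 2) + np.array([3, -4])]

bgmm_sim = BayesianGaussianMixture(n_components=6, covariance_type='full',
                                   weight_concentration_prior=0.01,
                                   max_iter=200, random_state=0).fit(X_sim)

# Estimated mixing weights; redundant components should receive weights near zero
print(bgmm_sim.weights_.round(3))

# Components with non-negligible weight (the 0.01 threshold is arbitrary)
n_active = int(np.sum(bgmm_sim.weights_ > 0.01))
print(n_active)
```

Repeating this with a larger `weight_concentration_prior` shows how the prior controls the number of clusters that stay active.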

### Activity 3

Fit the fully Bayesian approach to the Iris dataset and check the resulting number of clusters.

Put your code below