# Mixture of Gaussians

Let π = Pr(y = C1) and 1 − π = Pr(y = C2). Let Pr(x|C1) = N(x|µ1, Σ) and Pr(x|C2) = N(x|µ2, Σ). Learn the parameters π, µ1, µ2 and Σ by likelihood maximization. Use Bayes' theorem to compute the probability of each class given an input x: Pr(Cj|x) = k Pr(Cj) Pr(x|Cj), where k = 1 / Pr(x) is the normalization constant.
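
For reference, under the shared-covariance assumption the maximum-likelihood estimates and the posterior have a closed form. The block below is a sketch of that standard result. The symbols N1 and N2 (number of training points in each class) and σ (the logistic sigmoid) are notation introduced here, not part of the original statement.

```latex
\begin{aligned}
\pi &= \frac{N_1}{N_1 + N_2}, \qquad
\mu_j = \frac{1}{N_j} \sum_{n:\, y_n = C_j} x_n, \qquad
\Sigma = \frac{1}{N_1 + N_2} \sum_{j=1}^{2} \sum_{n:\, y_n = C_j} (x_n - \mu_j)(x_n - \mu_j)^\top, \\
\Pr(C_1 \mid x) &= \sigma(w^\top x + w_0), \qquad
w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
w_0 = -\tfrac{1}{2}\mu_1^\top \Sigma^{-1} \mu_1 + \tfrac{1}{2}\mu_2^\top \Sigma^{-1} \mu_2 + \ln\frac{\pi}{1 - \pi}.
\end{aligned}
```

The posterior is a sigmoid of a linear function of x because the quadratic terms in x cancel when both classes share Σ, which is why the decision boundary is linear.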

## TODO
- Report the accuracy of mixtures of Gaussians on the test set. Measure the accuracy as the fraction of correctly labeled images. An image is correctly labeled when the probability of the correct label is greater than 0.5.
- Print the parameters π, µ1, µ2, Σ found for the mixture of Gaussians. Since the covariance Σ is quite big, print only the diagonal of Σ.
- Briefly discuss the results:
    - Mixture of Gaussians and logistic regression both find a linear separator, but they use different parameterizations and different objectives. Compare the number of parameters in each model and the amount of computation needed to find a solution with each model. Compare the results for each model.
    - Mixture of Gaussians and logistic regression find a linear separator, whereas k-Nearest Neighbours (in assignment 1) finds a non-linear separator. Compare the expressivity of the separators. Discuss under what circumstances each type of separator is expected to perform best. What could explain the results obtained with KNN in comparison to the results obtained with mixtures of Gaussians and logistic regression?
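
As a starting point for the parameter-count comparison, here is a small sketch that counts free parameters for `d` input features. The function names are illustrative, and `d = 64` is taken from the data shape reported later in this notebook. The count for Σ assumes only the upper triangle of the symmetric matrix is free.

```python
def gaussian_mixture_param_count(d):
    """Free parameters of the two-class, shared-covariance Gaussian model."""
    prior = 1                      # pi (1 - pi is determined by pi)
    means = 2 * d                  # mu1 and mu2
    covariance = d * (d + 1) // 2  # Sigma is symmetric
    return prior + means + covariance


def logistic_regression_param_count(d):
    """Weights plus a bias term."""
    return d + 1


d = 64  # number of features per image (assumed from the data shape)
print(gaussian_mixture_param_count(d))     # 2209
print(logistic_regression_param_count(d))  # 65
```

Both models end up with a separator described by `d + 1` numbers (w and w0). The difference is in how many quantities are estimated to get there.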

In [10]:
import numpy as np
import scipy as sp
import pandas as pd
import sklearn as skl

from sklearn.decomposition import PCA

import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Util

In [11]:
def read_csv(file_name_prefix, partition=''):
  # np.genfromtxt('trainData1.csv', dtype=int, delimiter=',')
  return pd.read_csv(f'{file_name_prefix}{partition}.csv', header=None).values

def read_train_data(num_partitions):
  train_data = [read_csv('trainData', i + 1) for i in range(num_partitions)]
  train_labels = [read_csv('trainLabels', i + 1) for i in range(num_partitions)]
  # y is a scalar
  return train_data, train_labels

def read_test_data():
  test_data = read_csv('testData')
  test_labels = read_csv('testLabels')
  # y is a scalar
  return test_data, test_labels

def norm(v):
    norm_v = np.linalg.norm(v, axis=1)[:, np.newaxis]
    norm_v[norm_v == 0] = 1
    return v / norm_v

def prepend_row(matrix, value):
    r, c = matrix.shape
    new_matrix = np.zeros([r + 1, c])
    new_matrix[0, :] = value
    new_matrix[1:, :] = matrix
    return new_matrix

def prepend_col(matrix, value):
    r, c = matrix.shape
    new_matrix = np.zeros([r, c + 1])
    new_matrix[:, 0] = value
    new_matrix[:, 1:] = matrix
    return new_matrix

def split_train_validation(x, y, fold, folds):
    """
    Splits input data into train and validation sets.
    Used in k-fold cross validation.

    For cross validation, when splitting into train and validation sets, consider using stratified sampling:
    https://danilzherebtsov.medium.com/continuous-data-stratification-c121fc91964b
    """
    n = x.shape[0]
    fold_size = n // folds
    fold_start = fold * fold_size
    fold_end = (fold + 1) * fold_size
    xtr = np.concatenate([x[:fold_start], x[fold_end:]])
    ytr = np.concatenate([y[:fold_start], y[fold_end:]])
    xvl, yvl = x[fold_start:fold_end], y[fold_start:fold_end]
    return xtr, ytr, xvl, yvl
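
The docstring above suggests stratified sampling. A minimal sketch for discrete class labels could look like the following. `stratified_fold_indices` is a hypothetical helper, not part of this notebook's utilities, and it handles class labels rather than the continuous targets discussed in the linked article.

```python
import numpy as np


def stratified_fold_indices(y, folds, seed=0):
    """Assign each sample to a fold so every fold gets a similar class mix.

    Hypothetical helper, not part of the notebook's utilities.
    """
    rng = np.random.default_rng(seed)
    fold_of = np.empty(y.shape[0], dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Deal the shuffled samples of this class round-robin across folds
        fold_of[idx] = np.arange(idx.size) % folds
    return fold_of


# Toy labels: six samples of class 5 and four of class 6
y = np.array([5] * 6 + [6] * 4)
fold_of = stratified_fold_indices(y, folds=2)
for fold in range(2):
    val_mask = fold_of == fold
    # Count how many samples of each class land in this validation fold
    print(fold, np.bincount(y[val_mask])[[5, 6]])
```

The validation set for a fold is then `x[fold_of == fold]`, and the training set is `x[fold_of != fold]`.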

## Model Util

In [12]:
def empirical_covariance(X, y, classes, mu):
    x = np.copy(X)
    for class_k, mu_k in zip(classes, mu):
        x[y == class_k] -= mu_k
    return (x.T @ x) / x.shape[0]

def logistic_sigmoid(x):
    return 1. / (1. + np.exp(-x))

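def softmax(x):
    # Subtract the row-wise max before exponentiating to avoid overflow;
    # the result is unchanged because softmax is shift-invariant.
    p_softmax = np.exp(x - x.max(axis=1, keepdims=True))
    p_softmax /= p_softmax.sum(axis=1, keepdims=True)
    return p_softmax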

def to_binary_classes(y):
    """
    Used to build labels vector (y) in binary classification models.
    """
    classes = np.unique(y.ravel())
    assert classes.size == 2, f'Should contain only 2 classes but found {classes.size}: {classes}.'
    yb = np.zeros_like(y, dtype=np.int8)
    yb[y == classes[0]] = 1
    return yb, classes

## Plot

In [13]:
def plot_pca(x, y, n_components=None, marker_size=6):
    """
    - If ``n_components == 'mle'`` and ``svd_solver == 'full'``, Minka's MLE is used to guess the dimension.
    - If ``0 < n_components < 1`` and ``svd_solver == 'full'``, select the number of components such that the 
      amount of variance that needs to be explained is greater than the percentage specified by n_components.
    """
    pca = PCA(n_components=n_components, svd_solver='full')
    x_pca = pca.fit_transform(x)
    labels = {}
    print(f'Number of components: {pca.n_components_}')
    print(f'Total variance: {pca.explained_variance_ratio_.sum()}')
    print(f'x_pca.shape: {x_pca.shape}')
    plot_scatter_matrix(x_pca, y, labels=labels, marker_size=marker_size)
    return pca, x_pca


def plot_scatter_matrix(x, y, dim=None, labels=None, height=1700, width=1700, marker_size=6):
    # Default to all columns when no dimensions are given
    dim = dim or range(x.shape[1])
    fig = px.scatter_matrix(
        x,
        labels=labels,
        dimensions=dim,
        color=y,
    )
    fig.update_traces(diagonal_visible=False, showupperhalf = False, marker=dict(size=marker_size, colorscale='Rainbow'))
    fig.update_layout(height=height, width=width)
    fig.show()


def plot_scatter_x_pairs(x, y, pairs, rows, cols, title=None, height=None, width=None, marker_size=4):
    subplot_titles = [f'X[{i}] x X[{j}]' for i, j in pairs]
    fig = make_subplots(rows=rows, cols=cols, subplot_titles=subplot_titles)

    fig_args = dict(mode='markers', marker=dict(color=y, size=marker_size))
    for k, pair in enumerate(pairs):
        fig.add_trace(go.Scatter(x=x[:, pair[0]], y=x[:, pair[1]], **fig_args), row=(k // cols) + 1, col=(k % cols) + 1)
    fig.update_layout(height=height, width=width, title_text=title)

    fig.show()


def plot_alpha_scores(alphas, scores, title='Alpha Scores'):
    max_score = scores.argmax()
    alpha_scores_fig = go.Figure()
    alpha_scores_fig.add_trace(go.Scatter(x=alphas, y=scores, mode='lines', name='Alpha Scores'))
    alpha_scores_fig.add_trace(go.Scatter(x=[alphas[max_score]], y=[scores[max_score]], mode='markers', name='Best Alpha'))
    alpha_scores_fig.update_layout(title=title, autosize=True, width=500, height=500,)
    alpha_scores_fig.update_xaxes(title_text='Alpha')
    alpha_scores_fig.update_yaxes(title_text='Score')
    alpha_scores_fig.show()

# Data

In [17]:
# read data
xtrp, ytrp = read_train_data(num_partitions=10)
xtr, ytr = np.concatenate(xtrp), np.concatenate(ytrp)
xte, yte = read_test_data()

# normalize
xtr, xte = norm(xtr), norm(xte)

# yi will be used as a scalar
ytr, yte = ytr[:, 0], yte[:, 0]

In [18]:
xtr.shape, ytr.shape, xte.shape, yte.shape

((1000, 64), (1000,), (110, 64), (110,))

In [19]:
ytrdf = pd.DataFrame(data={'ytr': ytr}, dtype=int)
ytedf = pd.DataFrame(data={'yte': yte}, dtype=int)

In [20]:
ytrdf.groupby(['ytr']).size().reset_index(name='counts')

Unnamed: 0,ytr,counts
0,5,500
1,6,500


In [21]:
ytedf.groupby(['yte']).size().reset_index(name='counts')

Unnamed: 0,yte,counts
0,5,51
1,6,59


In [22]:
pca = PCA(n_components=0.999, svd_solver='full')
x_pca = pca.fit_transform(xtr)

In [23]:
component_pairs = [(1, 0), (1, 2), (1, 3), (0, 63), (0, 2), (0, 3)]

plot_scatter_x_pairs(x_pca, ytr, component_pairs, rows=2, cols=3, title='PCA Analysis', height=800, width=1200)

# Mixture of Gaussians

In [24]:
class GaussianMixtureClassifier:
    def __init__(self):
        self.pi_ = None
        self.mu_ = None
        self.covariance_ = None
        self.classes_ = None
        self.w0_ = None
        self.w_ = None
    
    def fit(self, X, y):
        """
        Estimates pi, mu and the shared covariance by maximum likelihood,
        then derives the linear parameters w and w0 of the posterior.
        Assumes the same covariance matrix for all classes.

        TODO: add support for different covariance matrices
        """
        classes = np.unique(y.ravel())
        assert classes.shape[0] >= 2, f'At least 2 classes must be provided but found {classes.shape[0]}:\n{classes}'

        n = X.shape[0]
        x = [X[y == label] for label in classes]
        pi = np.array([xk.shape[0] / n for xk in x])
        mu = np.array([xk.mean(axis=0) for xk in x])
        covariance = empirical_covariance(X, y, classes, mu)
        cov_inv = np.linalg.inv(covariance)

        # w and w0 will be used to calculate the posterior P(ck|x)
        if classes.shape[0] == 2:
            # Binary classification
            w = (mu[0] - mu[1]) @ cov_inv
            w0 = -.5 * mu[0] @ cov_inv @ mu[0].T + .5 * mu[1] @ cov_inv @ mu[1].T + np.log(pi[0] / pi[1])
        else:
            # Multi class classification
            w = mu @ cov_inv
            w0 = np.sum(-.5 * w * mu, axis=1) + np.log(pi)

        self.classes_, self.pi_, self.mu_, self.covariance_ = classes, pi, mu, covariance
        # learned parameters
        self.w_, self.w0_ = w, w0

    def predict_proba(self, X):
        w0, w, classes = self.w0_, self.w_, self.classes_
        
        x_wt = X @ w.T + w0

        if self.classes_.shape[0] == 2:            
            # For binary classification, the posterior P(ck|x) is a logistic sigmoid
            prediction_proba = logistic_sigmoid(x_wt)
        else:
            # For more than 2 classes, the posterior P(ck|x) is the softmax function (generalization of logistic sigmoid)
            prediction_proba = softmax(x_wt)

        return prediction_proba

    def predict(self, X):
        prediction_proba = self.predict_proba(X)
        classes = self.classes_

        if classes.shape[0] == 2:            
            # For binary classification, prediction_proba is the probability of classes[0]
            predictions = np.full(X.shape[0], classes[0])
            predictions[prediction_proba < .5] = classes[1]
        else:
            predictions = classes[prediction_proba.argmax(axis=1)]

        return predictions

    def score(self, X, y):
        """
        The score used is the accuracy of the model.
        """
        predictions = self.predict(X)
        tp_tn = (predictions == y).sum()
        n = y.shape[0]
        return tp_tn / n
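
As a sanity check on the binary branch, the sketch below compares the sigmoid of `x @ w + w0` with the posterior computed directly from Bayes' theorem. It is self-contained and uses made-up toy parameters. `gaussian_log_density` is a helper written for this example and is not part of the class.

```python
import numpy as np


def gaussian_log_density(x, mu, cov):
    """Log of N(x | mu, cov) for each row of x."""
    d = mu.shape[0]
    diff = x - mu
    cov_inv = np.linalg.inv(cov)
    _, log_det = np.linalg.slogdet(cov)
    mahalanobis = np.einsum('ni,ij,nj->n', diff, cov_inv, diff)
    return -0.5 * (d * np.log(2 * np.pi) + log_det + mahalanobis)


# Toy parameters, chosen only for illustration
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 1.0], [2.0, -1.0]])
cov = np.array([[1.0, 0.2], [0.2, 0.5]])
x = np.array([[0.5, 0.5], [2.0, 0.0], [-1.0, 3.0]])

# Posterior of class 0 straight from Bayes' theorem
joint = np.stack(
    [pi[k] * np.exp(gaussian_log_density(x, mu[k], cov)) for k in range(2)],
    axis=1,
)
posterior_bayes = joint[:, 0] / joint.sum(axis=1)

# Same posterior through the linear form used in fit / predict_proba
cov_inv = np.linalg.inv(cov)
w = (mu[0] - mu[1]) @ cov_inv
w0 = -.5 * mu[0] @ cov_inv @ mu[0] + .5 * mu[1] @ cov_inv @ mu[1] + np.log(pi[0] / pi[1])
posterior_linear = 1. / (1. + np.exp(-(x @ w + w0)))

print(np.allclose(posterior_bayes, posterior_linear))  # True
```

The two agree because the terms that are quadratic in x cancel in the log-odds when both classes share one covariance matrix.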

## Binary Classification

In [25]:
gmc_bin = GaussianMixtureClassifier()

In [26]:
gmc_bin.fit(xtr, ytr)

In [27]:
gmc_bin.score(xte, yte)

0.8545454545454545

In [28]:
gmc_bin.w0_, gmc_bin.w_.shape

(0.013660666163854529, (64,))

In [29]:
gmc_bin.pi_, gmc_bin.mu_.shape, gmc_bin.covariance_.shape

(array([0.5, 0.5]), (2, 64), (64, 64))

In [30]:
%timeit gmc_bin.fit(xtr, ytr)

The slowest run took 8.78 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 1.4 ms per loop


In [31]:
%timeit gmc_bin_y_pred = gmc_bin.predict(xte)

The slowest run took 8.48 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 18 µs per loop


In [32]:
%timeit gmc_bin.score(xte, yte)

The slowest run took 8.71 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 23.6 µs per loop


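## Multiple Classes

Note: `xtr` and `ytr` contain only two classes (5 and 6), so `fit` takes the binary branch here as well. The results below therefore match the binary section, and the softmax branch is not exercised.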

In [None]:
gmc_multi = GaussianMixtureClassifier()

In [None]:
gmc_multi.fit(xtr, ytr)

In [None]:
gmc_multi.score(xte, yte)

0.8545454545454545

In [None]:
gmc_multi.w0_, gmc_multi.w_.shape

(0.013660666163854529, (64,))

In [None]:
gmc_multi.pi_, gmc_multi.mu_.shape, gmc_multi.covariance_.shape

(array([0.5, 0.5]), (2, 64), (64, 64))

In [None]:
%timeit gmc_multi.fit(xtr, ytr)

1000 loops, best of 5: 1.4 ms per loop


In [None]:
%timeit gmc_multi.predict(xte)

The slowest run took 9.34 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 17.9 µs per loop


In [None]:
%timeit gmc_multi.score(xte, yte)

The slowest run took 7.71 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 24.2 µs per loop


# scikit-learn comparison

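- Comparison is not apples-to-apples: skl's `GaussianMixture` is an unsupervised model. It uses EM (expectation maximization) to maximize likelihood and ignores the labels passed to `fit`.
- The closest we can get is the model that uses the same covariance matrix for all classes (`covariance_type='tied'`).
- `GaussianMixture.score` returns the average per-sample log-likelihood, not accuracy. The values printed below are therefore not directly comparable to the accuracy reported by `gmc_bin.score`.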
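
`GaussianMixture` produces cluster ids rather than class labels, and its `score` is a log-likelihood. One way to get an accuracy from it is to map each cluster to the majority training label inside it. The sketch below shows that mapping on toy arrays. `majority_label_map` and `clustering_accuracy` are hypothetical helpers, and the toy arrays stand in for the output of `gmm.predict(...)`.

```python
import numpy as np


def majority_label_map(cluster_ids, y):
    """Map each cluster id to the most frequent true label inside it."""
    mapping = {}
    for c in np.unique(cluster_ids):
        labels, counts = np.unique(y[cluster_ids == c], return_counts=True)
        mapping[c] = labels[counts.argmax()]
    return mapping


def clustering_accuracy(cluster_ids, y, mapping):
    """Accuracy after translating cluster ids into class labels."""
    predicted = np.array([mapping[c] for c in cluster_ids])
    return (predicted == y).mean()


# Toy cluster assignments standing in for gmm.predict(...) output
train_clusters = np.array([0, 0, 0, 1, 1, 1])
train_labels = np.array([6, 6, 5, 5, 5, 5])
test_clusters = np.array([0, 1, 1, 0])
test_labels = np.array([6, 5, 6, 6])

mapping = majority_label_map(train_clusters, train_labels)
print(clustering_accuracy(test_clusters, test_labels, mapping))  # 0.75
```

With the fitted models in this section, this would be roughly `mapping = majority_label_map(gmm.predict(xtr), ytr)` followed by `clustering_accuracy(gmm.predict(xte), yte, mapping)`. For a supervised baseline with the same shared-covariance assumption, scikit-learn's `LinearDiscriminantAnalysis` may be a closer match.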




In [33]:
from sklearn.mixture import GaussianMixture

In [51]:
gmm_skl = {
    cov_type: GaussianMixture(n_components=2, covariance_type=cov_type, max_iter=20).fit(xtr, ytr)
    for cov_type in ['tied', 'spherical', 'diag', 'full']
}


Initialization 1 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data.



In [54]:
for cov_type, gmm in gmm_skl.items():
    # score() is the average per-sample log-likelihood (yte is ignored), not accuracy
    print(cov_type, gmm.score(xte, yte))

tied 82.29268377252518
spherical 78.6967066568826
diag 83.80923928314306
full 87.8961325072206


In [56]:
gmc_bin.score(xte, yte)

0.8545454545454545

In [55]:
%timeit gmm_skl['full'].fit(xtr, ytr)

10 loops, best of 5: 79.9 ms per loop
