# Mixture of Gaussians

Let π = Pr(y = C1) and 1 − π = Pr(y = C2). Let Pr(x|C1) = N(x|µ1, Σ) and Pr(x|C2) = N(x|µ2, Σ). Learn the parameters π, µ1, µ2 and Σ by likelihood maximization. Use Bayes' theorem to compute the probability of each class given an input x:

Pr(Cj|x) = k Pr(Cj) Pr(x|Cj),

where k = 1 / Pr(x) is the normalization constant.
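
Because both classes share the covariance Σ, the terms that are quadratic in x cancel in the log-odds, so the posterior reduces to a logistic sigmoid of a linear function of x. The derivation below is the standard textbook result, stated here for reference. The symbols $w$, $w_0$, $\sigma$, $N$ and $N_j$ are introduced for illustration; $w$ and $w_0$ play the role of the `w_` and `w0_` attributes computed in the class further down.

```latex
\begin{aligned}
\Pr(C_1 \mid x)
  &= \frac{\pi\,\mathcal{N}(x \mid \mu_1, \Sigma)}
          {\pi\,\mathcal{N}(x \mid \mu_1, \Sigma) + (1-\pi)\,\mathcal{N}(x \mid \mu_2, \Sigma)}
   = \sigma(w^\top x + w_0),
  \qquad \sigma(a) = \frac{1}{1 + e^{-a}}, \\
w   &= \Sigma^{-1}(\mu_1 - \mu_2), \\
w_0 &= -\tfrac{1}{2}\,\mu_1^\top \Sigma^{-1} \mu_1
       + \tfrac{1}{2}\,\mu_2^\top \Sigma^{-1} \mu_2
       + \ln\frac{\pi}{1-\pi}. \\[6pt]
\text{Maximum likelihood estimates:}\quad
\pi &= \frac{N_1}{N}, \qquad
\mu_j = \frac{1}{N_j} \sum_{n\,:\,y_n = C_j} x_n, \qquad
\Sigma = \frac{1}{N} \sum_{j=1}^{2} \sum_{n\,:\,y_n = C_j} (x_n - \mu_j)(x_n - \mu_j)^\top .
\end{aligned}
```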

## TODO
- Report the accuracy of the mixture of Gaussians on the test set. Measure the accuracy as the fraction of correctly labeled images. An image is correctly labeled when the probability of the correct label is greater than 0.5.
- Print the parameters π, µ1, µ2, Σ found for the mixture of Gaussians. Since the covariance Σ is quite big, print only the diagonal of Σ.
- Briefly discuss the results:
    - Mixture of Gaussians and logistic regression both find a linear separator, but they use different parameterizations and different objectives. Compare the number of parameters in each model and the amount of computation needed to find a solution with each model. Compare the results for each model.
    - Mixture of Gaussians and logistic regression find a linear separator, whereas k-Nearest Neighbours (in assignment 1) finds a non-linear separator. Compare the expressivity of the separators. Discuss under what circumstances each type of separator is expected to perform best. What could explain the results obtained with KNN in comparison to the results obtained with mixtures of Gaussians and logistic regression?
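
For the parameter comparison in the first discussion point, a quick count of free parameters may help. The sketch below assumes binary classification with d input features (d = 64 for this data set, per the array shapes printed in the Data section), one symmetric covariance shared by both classes, and a logistic regression with one weight per feature plus a bias. `count_parameters` is a hypothetical helper, not part of the notebook's library.

```python
def count_parameters(d):
    """Free parameters of each model for d-dimensional inputs (binary case)."""
    # Gaussian model: prior pi (1) + two class means (2d)
    # + symmetric shared covariance (d(d+1)/2 distinct entries)
    gaussian = 1 + 2 * d + d * (d + 1) // 2
    # Logistic regression: weight vector (d) + bias (1)
    logistic = d + 1
    return gaussian, logistic


gaussian_params, logistic_params = count_parameters(64)
print(gaussian_params, logistic_params)  # 2209 65
```

On the computation side, the Gaussian model is fit in closed form (class counts, class means, one covariance matrix and one matrix inverse), while logistic regression generally needs an iterative optimizer.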

In [11]:
! git clone -q https://github.com/abnercorrea/machine-learning.git

[]

In [1]:
import sys
sys.path += ['/content/machine-learning/src']

In [2]:
import numpy as np

import tensorflow as tf

from abnercorrea.numpy.util.data_prep import read_train_data, read_test_data, norm, prepend_col, to_binary_classes
from abnercorrea.numpy.util.data_vis import plot_alpha_scores

from abnercorrea.tensorflow.util.stat import logistic_sigmoid, softmax, empirical_covariance
from abnercorrea.tensorflow.util.data_prep import split_train_validation_tf

# from abnercorrea.tensorflow.linear.logistic_regression import LogisticRegressionClassifierTF


# Data

In [4]:
# read data
xtrp, ytrp = read_train_data(num_partitions=10)
xtr, ytr = np.concatenate(xtrp), np.concatenate(ytrp)
xte, yte = read_test_data()

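# Center the data and rescale it with norm().
# Note: .mean() without an axis is the global mean over all entries (not a per-feature mean),
# and the test set is centered with its own mean rather than the training mean.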
xtr, xte = norm(xtr - xtr.mean()), norm(xte - xte.mean())

# yi will be used as a scalar
ytr, yte = ytr[:, 0], yte[:, 0]

In [5]:
xtr.shape, xte.shape, ytr.shape, yte.shape

((1000, 64), (110, 64), (1000,), (110,))

# Mixture of Gaussians

In [None]:
class GaussianMixtureClassifierTF:
    def __init__(self):
        self.pi_ = None
        self.mu_ = None
        self.covariance_ = None
        self.classes_ = None
        self.w0_ = None
        self.w_ = None
    
    def fit(self, X, y):
        # learned parameters
        
        self.w_, self.w0_, self.classes_, self.pi_, self.mu_, self.covariance_ = self.fit_tf(X, y)

    def fit_tf(self, X, y):
        """
        Uses maximum likelihood to learn the class priors, class means and shared covariance,
        then derives w and w0 from them in closed form.
        """
        classes = np.unique(y.ravel())
        assert classes.shape[0] >= 2, f'At least 2 classes must be provided but found {classes.shape[0]}:\n{classes}'

        n = X.shape[0]
        x = [X[y == label] for label in classes]
        pi = np.array([xk.shape[0] / n for xk in x])
        mu = np.array([xk.mean(axis=0) for xk in x])
        covariance = empirical_covariance(X, y, classes, mu)
        cov_inv = np.linalg.inv(covariance)

        # w and w0 will be used to calculate the posterior P(ck|x)
        if classes.shape[0] == 2:
            # Binary classification
            w = (mu[0] - mu[1]) @ cov_inv
            w0 = -.5 * mu[0] @ cov_inv @ mu[0].T + .5 * mu[1] @ cov_inv @ mu[1].T + np.log(pi[0] / pi[1])
        else:
            # Multi class classification
            w = mu @ cov_inv
            w0 = np.sum(-.5 * w * mu, axis=1) + np.log(pi)

        return w, w0, classes, pi, mu, covariance

    def predict_proba(self, X):
        w0, w, classes = self.w0_, self.w_, self.classes_
        
        x_wt = X @ w.T + w0

        if self.classes_.shape[0] == 2:            
            # For binary classification, the posterior P(ck|x) is a logistic sigmoid
            prediction_proba = logistic_sigmoid(x_wt)
        else:
            # For more than 2 classes, the posterior P(ck|x) is the softmax function (generalization of logistic sigmoid)
            prediction_proba = softmax(x_wt)

        return prediction_proba

    def predict(self, X):
        prediction_proba = self.predict_proba(X)
        classes = self.classes_

        if classes.shape[0] == 2:            
            # For binary classification, prediction_proba is the probability of classes[0]
            predictions = np.full(X.shape[0], classes[0])
            predictions[prediction_proba < .5] = classes[1]
        else:
            # Map each argmax index back to its class label
            predictions = classes[prediction_proba.argmax(axis=1)]

        return predictions

    def score(self, X, y):
        """
        The score used is the accuracy of the model.
        """
        predictions = self.predict(X)
        tp_tn = (predictions == y).sum()
        n = y.shape[0]
        return tp_tn / n
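
The class above relies on the imported `empirical_covariance`, whose implementation is not shown in this notebook. As a sanity check on the closed-form expressions for `w` and `w0`, the following self-contained NumPy sketch assumes that helper computes the pooled maximum likelihood covariance, fits the binary model on synthetic data, and compares the sigmoid posterior with Bayes' rule applied directly. `pooled_covariance`, `log_gaussian` and the synthetic data are illustrative assumptions, not the author's code.

```python
import numpy as np


def pooled_covariance(X, y, classes, mu):
    """Shared covariance: average outer product of deviations from each sample's class mean."""
    n, d = X.shape
    cov = np.zeros((d, d))
    for k, label in enumerate(classes):
        centered = X[y == label] - mu[k]
        cov += centered.T @ centered
    return cov / n


def log_gaussian(X, mean, cov):
    """Log density of N(x | mean, cov) for each row of X."""
    d = X.shape[1]
    diff = X - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum('ij,ij->i', diff @ np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


rng = np.random.default_rng(0)
# Synthetic two-class data (illustrative only)
X = np.vstack([rng.normal(0.0, 1.0, (200, 3)), rng.normal(1.5, 1.0, (100, 3))])
y = np.array([0] * 200 + [1] * 100)

classes = np.unique(y)
pi = np.array([(y == c).mean() for c in classes])
mu = np.array([X[y == c].mean(axis=0) for c in classes])
cov = pooled_covariance(X, y, classes, mu)
cov_inv = np.linalg.inv(cov)

# Linear form, using the same expressions as fit_tf
w = (mu[0] - mu[1]) @ cov_inv
w0 = -0.5 * mu[0] @ cov_inv @ mu[0] + 0.5 * mu[1] @ cov_inv @ mu[1] + np.log(pi[0] / pi[1])
posterior_linear = 1.0 / (1.0 + np.exp(-(X @ w + w0)))

# Direct Bayes rule: P(C0 | x) = pi0 N0(x) / (pi0 N0(x) + pi1 N1(x))
log_joint = np.stack(
    [np.log(pi[k]) + log_gaussian(X, mu[k], cov) for k in range(2)], axis=1
)
joint = np.exp(log_joint)
posterior_bayes = joint[:, 0] / joint.sum(axis=1)

print(np.allclose(posterior_linear, posterior_bayes))
```

The two posteriors should agree up to floating-point error, because the quadratic term in x is identical for both classes and cancels in the ratio.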

## Binary Classification

In [None]:
gmc_bin = GaussianMixtureClassifierTF()

In [None]:
gmc_bin.fit(xtr, ytr)

In [None]:
gmc_bin.score(xte, yte)

0.8545454545454545
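
The assignment defines an image as correctly labeled when the probability of the correct label is greater than 0.5. A minimal sketch of that definition, computed from the posterior of the first class, might look as follows. `accuracy_from_proba` and the toy labels 5 and 6 are hypothetical and only serve the illustration.

```python
import numpy as np


def accuracy_from_proba(proba_first_class, y_true, classes):
    """Fraction of samples whose true label receives probability > 0.5.

    proba_first_class holds P(classes[0] | x) for each sample.
    """
    proba_first_class = np.asarray(proba_first_class, dtype=float)
    y_true = np.asarray(y_true)
    # Probability that the model assigns to the true label of each sample
    proba_true = np.where(
        y_true == classes[0], proba_first_class, 1.0 - proba_first_class
    )
    return float(np.mean(proba_true > 0.5))


# Toy example with hypothetical labels 5 and 6
classes = np.array([5, 6])
proba = np.array([0.9, 0.2, 0.6, 0.5])
labels = np.array([5, 6, 6, 5])
print(accuracy_from_proba(proba, labels, classes))  # 0.5
```

Under this strict definition a posterior of exactly 0.5 counts as incorrect, whereas `predict` assigns `classes[0]` in that case, so the two measures can differ on ties.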

In [None]:
gmc_bin.w0_, gmc_bin.w_.shape

(0.013660666163854529, (64,))

In [None]:
gmc_bin.pi_, gmc_bin.mu_.shape, gmc_bin.covariance_.shape

(array([0.5, 0.5]), (2, 64), (64, 64))
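
The cell above shows only shapes, while the TODO asks for the parameter values with Σ reduced to its diagonal. One possible way to print them is sketched below. `summarize_parameters` is a hypothetical helper, and the arrays passed to it here are placeholders rather than the fitted values.

```python
import numpy as np


def summarize_parameters(pi, mu, covariance):
    """Build a printable summary: priors, class means and the diagonal of the covariance."""
    lines = [f"pi = {np.asarray(pi)}"]
    for k, mu_k in enumerate(np.asarray(mu), start=1):
        lines.append(f"mu{k} = {mu_k}")
    # Only the diagonal of the covariance is reported, as the full matrix is large
    lines.append(f"diag(Sigma) = {np.diag(covariance)}")
    return "\n".join(lines)


# Placeholder inputs; arrays are formatted inside the printoptions context
with np.printoptions(precision=4, suppress=True):
    print(summarize_parameters([0.5, 0.5], np.zeros((2, 3)), np.eye(3)))
```

In this notebook it could be called as `summarize_parameters(gmc_bin.pi_, gmc_bin.mu_, gmc_bin.covariance_)`.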

In [None]:
%timeit gmc_bin.fit(xtr, ytr)

The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 1.41 ms per loop


In [None]:
%timeit gmc_bin_y_pred = gmc_bin.predict(xte)

The slowest run took 7.60 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 18.1 µs per loop


In [None]:
%timeit gmc_bin.score(xte, yte)

The slowest run took 8.87 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 24.2 µs per loop


## Multiple Classes

In [None]:
gmc_multi = GaussianMixtureClassifierTF()

In [None]:
gmc_multi.fit(xtr, ytr)

In [None]:
gmc_multi.score(xte, yte)

0.8545454545454545

In [None]:
gmc_multi.w0_, gmc_multi.w_.shape

(0.013660666163854529, (64,))

In [None]:
gmc_multi.pi_, gmc_multi.mu_.shape, gmc_multi.covariance_.shape

(array([0.5, 0.5]), (2, 64), (64, 64))

In [None]:
%timeit gmc_multi.fit(xtr, ytr)

1000 loops, best of 5: 1.4 ms per loop


In [None]:
%timeit gmc_multi.predict(xte)

The slowest run took 9.34 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 17.9 µs per loop


In [None]:
%timeit gmc_multi.score(xte, yte)

The slowest run took 7.71 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 24.2 µs per loop
