## Gaussian Discriminant Analysis Implementation Using NumPy

When the input features x of a classification problem are continuous-valued random variables, we can use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) with a multivariate normal distribution.
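
Written out, the model can be sketched as follows. The symbols are assumptions of this sketch rather than notation from the notebook: φ is the prior probability of class 1, μ_0 and μ_1 are the class means, n is the number of features, and Σ is a single covariance matrix shared by both classes (which is what the code in this notebook estimates).

```latex
y \sim \mathrm{Bernoulli}(\phi), \qquad
x \mid y = k \sim \mathcal{N}(\mu_k, \Sigma), \quad k \in \{0, 1\}

p(x \mid y = k) = \frac{1}{(2\pi)^{n/2} \lvert \Sigma \rvert^{1/2}}
  \exp\!\left( -\tfrac{1}{2} (x - \mu_k)^{\top} \Sigma^{-1} (x - \mu_k) \right)
```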

GDA belongs to the family of generative learning algorithms, which model p(x|y) and p(y) rather than p(y|x) directly. After modeling p(y) (the class priors) and p(x|y), the algorithm uses Bayes' rule to derive the posterior distribution of y given x:


![Bayes' theorem](figures/bayes-theorem.png)
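
In symbols, and assuming the figure shows the standard form of Bayes' rule, the posterior and the resulting decision rule can be written as below. Because p(x) does not depend on y, prediction only needs to compare the numerators p(x|y) p(y) across classes.

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}, \qquad
p(x) = \sum_{k \in \{0, 1\}} p(x \mid y = k)\, p(y = k)

\hat{y} = \arg\max_{y} \, p(y \mid x) = \arg\max_{y} \, p(x \mid y)\, p(y)
```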

Compared with logistic regression, GDA makes stronger modeling assumptions and is more data efficient (i.e., it requires less training data to learn “well”) when those assumptions are correct or at least approximately correct. Logistic regression makes weaker assumptions and is significantly more robust to deviations from them.
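
One way to see the link between the two models: with a shared covariance, the GDA posterior p(y=1|x) is itself a logistic function of x. The sketch below checks this numerically. All parameter values and names (`phi`, `mu0`, `mu1`, `sigma`, `unnormalized_joint`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D GDA parameters, chosen only for illustration.
phi = 0.3                                    # p(y = 1)
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])   # covariance shared by both classes
sigma_inv = np.linalg.inv(sigma)

def unnormalized_joint(x, mu, prior):
    # prior * exp(-0.5 * squared Mahalanobis distance); the Gaussian
    # normalizer is identical for both classes and cancels in the posterior.
    d = x - mu
    return prior * np.exp(-0.5 * np.sum(d @ sigma_inv * d, axis=1))

x = rng.normal(size=(5, 2))

# Posterior p(y = 1 | x) via Bayes' rule.
j0 = unnormalized_joint(x, mu0, 1 - phi)
j1 = unnormalized_joint(x, mu1, phi)
posterior_bayes = j1 / (j0 + j1)

# The same posterior written as a logistic function of x.
theta = sigma_inv @ (mu1 - mu0)
theta0 = (0.5 * (mu0 @ sigma_inv @ mu0 - mu1 @ sigma_inv @ mu1)
          + np.log(phi / (1 - phi)))
posterior_logistic = 1.0 / (1.0 + np.exp(-(x @ theta + theta0)))

print(np.allclose(posterior_bayes, posterior_logistic))
```

The converse does not hold: a logistic posterior does not imply Gaussian class-conditionals, which is why logistic regression is the weaker assumption of the two.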

### Our motivation
To gain a better understanding of how the algorithm works.

In [4]:
# %load gda.py
import numpy as np

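class GDABinaryClassifier:
    """Binary GDA with one covariance matrix shared by both classes."""

    def fit(self, X, y):
        # Work in floating point so the in-place centering below is safe.
        X = np.asarray(X, dtype=float)
        # Class prior p(y = 1).
        self.fi = y.mean()
        # Per-class feature means, shape (2, n_features).
        self.u = np.array([X[y == k].mean(axis=0) for k in [0, 1]])
        # Center every sample on the mean of its own class.
        X_u = X.copy()
        for k in [0, 1]:
            X_u[y == k] -= self.u[k]
        # Pooled covariance and its pseudo-inverse.
        self.E = X_u.T.dot(X_u) / len(y)
        self.invE = np.linalg.pinv(self.E)
        return self

    def predict(self, X):
        # Pick the class whose score from compute_prob is larger.
        return np.argmax([self.compute_prob(X, i) for i in range(len(self.u))], axis=0)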
    
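    def compute_prob(self, X, i):
        # p(x | y = i) * p(y = i), up to the Gaussian normalizer shared by
        # both classes; the Gaussian exponent carries a factor of 1/2.
        u, phi = self.u[i], ((self.fi)**i * (1 - self.fi)**(1 - i))
        return np.exp(-0.5 * np.sum((X-u).dot(self.invE)*(X-u), axis=1)) * phi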
    
    def score(self, X, y):
        return (self.predict(X) == y).mean()
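
For reference, the quantities computed in `fit` correspond to the usual maximum-likelihood estimates for this model. In the sketch below, m is the number of training samples (`len(y)`), and φ, μ_k and Σ map to `self.fi`, `self.u[k]` and `self.E`; the symbol names are assumptions of this sketch.

```latex
\phi = \frac{1}{m} \sum_{i=1}^{m} 1\{ y^{(i)} = 1 \}, \qquad
\mu_k = \frac{\sum_{i=1}^{m} 1\{ y^{(i)} = k \}\, x^{(i)}}
             {\sum_{i=1}^{m} 1\{ y^{(i)} = k \}}, \quad k \in \{0, 1\}

\Sigma = \frac{1}{m} \sum_{i=1}^{m}
  \left( x^{(i)} - \mu_{y^{(i)}} \right) \left( x^{(i)} - \mu_{y^{(i)}} \right)^{\top}
```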

In [5]:
from sklearn.datasets import load_breast_cancer
X,y = load_breast_cancer(return_X_y=True)
model = GDABinaryClassifier().fit(X,y)
pre = model.predict(X)   # predicted labels for the training samples
model.score(X,y)         # accuracy on the same data used for fitting

0.9666080843585237
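
The score above is measured on the same data the model was fitted on, and the class scores are exponentiated, which can underflow to zero for points far from both class means. A minimal sketch of one possible variant follows: it compares log-scores instead and evaluates on a held-out split. The helper names (`fit_gda`, `log_scores`, `predict_gda`) and the synthetic data are hypothetical and not part of the notebook.

```python
import numpy as np

def fit_gda(X, y):
    """Estimate the class prior, the class means and the shared covariance."""
    X = np.asarray(X, dtype=float)
    phi = y.mean()                                   # p(y = 1)
    means = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
    centered = X - means[y]                          # subtract each row's own class mean
    cov = centered.T @ centered / len(y)
    return phi, means, np.linalg.pinv(cov)

def log_scores(X, phi, means, cov_inv):
    """log p(x | y = k) + log p(y = k) for k = 0, 1, up to a shared constant."""
    priors = np.array([1 - phi, phi])
    scores = []
    for k in (0, 1):
        d = X - means[k]
        maha = np.sum(d @ cov_inv * d, axis=1)       # squared Mahalanobis distance
        scores.append(-0.5 * maha + np.log(priors[k]))
    return np.stack(scores, axis=1)                  # shape (n_samples, 2)

def predict_gda(X, params):
    return np.argmax(log_scores(np.asarray(X, dtype=float), *params), axis=1)

# Synthetic two-class data (illustrative only).
rng = np.random.default_rng(0)
n = 400
y = (rng.random(n) < 0.4).astype(int)
class_means = np.array([[0.0, 0.0], [3.0, 3.0]])
X = class_means[y] + rng.normal(size=(n, 2))

# Fit on 300 samples, evaluate on the remaining 100.
idx = rng.permutation(n)
train, test = idx[:300], idx[300:]
params = fit_gda(X[train], y[train])
accuracy = (predict_gda(X[test], params) == y[test]).mean()
print(f"held-out accuracy: {accuracy:.3f}")
```

Working in log space leaves the argmax unchanged, since the logarithm is monotonic, but keeps the scores finite even when the exponentiated values would round to zero.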