# Naïve Bayes Classifier From Scratch 
## (Gaussian Probabilistic Generative model)

- Linear Classifier.
- Probabilistic Generative model.
- Gaussian Assumption for Continuous Inputs.

### Probabilistic Generative Models
Consider the case of two classes. The posterior probability for class $C_1$ can be written as

$$ p(C_1|x) = \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)} $$

Dividing the numerator and the denominator by $p(x|C_1)p(C_1)$ gives

$$ p(C_1|x) = \frac{1}{1 + \exp(-a)} = \sigma(a) $$

where

$$ a = \ln \frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}$$

and $\sigma(a)$ is the logistic sigmoid function.
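
As a quick numerical check of this identity, the sketch below computes the posterior twice: once directly from Bayes' theorem and once as $\sigma(a)$. The likelihood and prior values are made-up numbers chosen only for illustration; they do not come from any dataset in this notebook.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical values of p(x|C_1), p(x|C_2), p(C_1), p(C_2) (illustration only).
lik1, lik2 = 0.30, 0.05
prior1, prior2 = 0.4, 0.6

# Posterior straight from Bayes' theorem.
posterior_bayes = lik1 * prior1 / (lik1 * prior1 + lik2 * prior2)

# Posterior via the log-odds a and the sigmoid.
a = math.log((lik1 * prior1) / (lik2 * prior2))
posterior_sigmoid = sigmoid(a)

print(posterior_bayes, posterior_sigmoid)
```

Both numbers should agree up to floating-point rounding, since the sigmoid form is only a rearrangement of Bayes' theorem.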

### Gaussian Assumption for Continuous Inputs

Assume that the class-conditional densities are Gaussian and that all classes share the same covariance matrix $\Sigma$:

$$ p(x|C_k) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp \{ -\frac{1}{2}(x-\mu_k)^T \Sigma^{-1} (x-\mu_k) \} $$

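For two classes, substituting these densities into the expression for $a$ cancels the terms that are quadratic in $\mathbf{x}$ (because the covariance is shared), so the posterior is a sigmoid of a linear function of $\mathbf{x}$: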

$$ p(C_1|x) = \sigma(\mathbf{w}^T \mathbf{x} + w_0) $$

where we have defined

$$ \mathbf{w} = \Sigma^{-1}(\mu_1 - \mu_2) $$

and

$$ w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1}\mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1}\mu_2 + \ln{\frac{p(C_1)}{p(C_2)}} $$
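
The sketch below checks these expressions numerically: it builds $\mathbf{w}$ and $w_0$ from two Gaussians with a shared covariance and compares $\sigma(\mathbf{w}^T \mathbf{x} + w_0)$ against the posterior obtained directly from Bayes' theorem. All parameter values and the helper `gaussian_pdf` are assumptions introduced for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x | mu, cov) for a single point x."""
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))

# Hypothetical parameters (illustration only).
mu1 = np.array([1.0, 2.0])
mu2 = np.array([-1.0, 0.5])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])  # shared by both classes
p1, p2 = 0.3, 0.7                          # priors p(C_1), p(C_2)
x = np.array([0.2, 1.0])

# Linear-discriminant parameters from the formulas above.
cov_inv = np.linalg.inv(cov)
w = cov_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ cov_inv @ mu1 + 0.5 * mu2 @ cov_inv @ mu2 + np.log(p1 / p2)

# Posterior as a sigmoid of a linear function of x.
posterior_linear = 1.0 / (1.0 + np.exp(-(w @ x + w0)))

# Posterior computed directly from Bayes' theorem.
joint1 = gaussian_pdf(x, mu1, cov) * p1
joint2 = gaussian_pdf(x, mu2, cov) * p2
posterior_bayes = joint1 / (joint1 + joint2)

print(posterior_linear, posterior_bayes)
```

The two values match only because the covariance is shared; with class-specific covariances the quadratic terms would not cancel and the decision boundary would no longer be linear.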

### Naïve Bayes approximation

For high-dimensional data it can be difficult to estimate the likelihood $p(\mathbf{x}|C_k)$ reliably, since a full covariance matrix has $D(D+1)/2$ free parameters per class.

- Naïve Bayes approximation:

    $$ p(\mathbf{x}|C_k) \approx \prod_{j=1}^D p(x_j|C_k) $$
    
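- For Gaussian class-conditional densities, this amounts to treating the covariance matrix of each class as diagonal:

    $$ p(\mathbf{x}|C_k) = \mathscr{N}(\mathbf{x}|\mu_k,\Sigma_k) \approx \prod_{j=1}^D p(x_j|C_k) = \prod_{j=1}^D \mathscr{N}(x_j|\mu_{kj},\sigma_{kj}^2) $$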
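
To make the factorisation concrete, the sketch below checks that a product of $D$ univariate Gaussian densities equals a multivariate Gaussian density whose covariance matrix is diagonal. The means, variances and the point `x` are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical per-feature parameters for one class (illustration only).
mu = np.array([0.5, -1.0, 2.0])
var = np.array([1.0, 0.25, 4.0])
x = np.array([0.0, -0.5, 3.0])

# Product of D univariate Gaussian densities N(x_j | mu_j, var_j).
univariate = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
product_of_marginals = np.prod(univariate)

# Multivariate Gaussian density with covariance diag(var).
cov = np.diag(var)
diff = x - mu
d = len(mu)
joint = np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / (
    (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(cov)))

print(product_of_marginals, joint)
```

The equality is exact for a diagonal covariance; the naïve Bayes approximation lies in ignoring any off-diagonal terms the true covariance may have.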


### Reference
*Pattern Recognition and Machine Learning, Christopher M. Bishop, Springer, 2006*

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
class gaussian_naive_bayes:
    def __mean(self, c):
        return np.mean(self.X[self.t == c], axis=0)
        
    def __variance(self, c):
        return np.var(self.X[self.t == c], axis=0)
        
    def predict(self, X):
        # fit() must have been called first; it sets these three attributes.
        if hasattr(self, 'classes_data') and hasattr(self, 'classes') and hasattr(self, 'N'):
            g = np.zeros((len(X), len(self.classes)))
            i = 0
            for c in self.classes:
                class_data = self.classes_data[c]
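                # Prior p(C_k): fraction of training samples that belong to class c.
                portion = class_data['n'] / self.N
                # Per-feature Gaussian density: exponential term and normalising constant.
                exponent = np.exp(-(X - class_data['mean'])**2 / (2 * class_data['var']))
                denominator = np.sqrt(2 * np.pi) * np.sqrt(class_data['var'])
                # Naïve Bayes score: product of per-feature densities times the prior.
                g[:, i] = np.prod((1 / denominator) * exponent, axis=1) * portion
                i = i + 1
            # Map the winning column index back to its class label.
            a = self.classes[np.argmax(g, axis=1)]
            return a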
        else:
            # Fail loudly instead of silently returning None.
            raise RuntimeError('Please run fit in order to be able to use predict')
            
    def accuracy(self, y_actual, y_predicted):
        return np.mean(y_actual == y_predicted) * 100
    
    def fit(self, X, t):
        self.X = X
        self.t = t
        
        self.N = len(t)

        self.classes = np.sort(np.unique(t))
        
        self.classes_data = {}
        
        for c in self.classes:
            self.classes_data[c] = {'mean': self.__mean(c),
                                    'var': self.__variance(c),
                                    'n': len(self.X[self.t == c])}
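
The `predict` method multiplies one density per feature, and such a product can underflow to zero when the number of features is large. A common alternative, which is not what the class above does, is to compare log scores $\ln p(C_k) + \sum_j \ln \mathscr{N}(x_j|\mu_{kj},\sigma_{kj}^2)$ instead. The sketch below illustrates this under stated assumptions: the function name `log_scores`, its argument layout, and the demo parameters are all hypothetical.

```python
import numpy as np

def log_scores(X, means, variances, priors):
    """Return an (N, K) array of ln p(C_k) + sum_j ln N(x_j | mean_kj, var_kj).

    X: (N, D) samples; means and variances: (K, D); priors: (K,).
    """
    X = np.asarray(X, dtype=float)[:, None, :]              # shape (N, 1, D)
    log_density = -0.5 * (np.log(2 * np.pi * variances)
                          + (X - means) ** 2 / variances)   # shape (N, K, D)
    return np.log(priors) + log_density.sum(axis=2)         # shape (N, K)

# Hypothetical two-class, two-feature parameters (illustration only).
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
priors = np.array([0.5, 0.5])
X_demo = np.array([[0.1, -0.2], [2.9, 3.1]])

scores = log_scores(X_demo, means, variances, priors)
pred = np.argmax(scores, axis=1)  # column index of the best-scoring class
print(pred)
```

Because the logarithm is monotonic, the argmax over log scores selects the same class as the argmax over the products, as long as the products themselves have not underflowed.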
            

# Dataset 1

### Training

In [3]:
Data = np.genfromtxt('synth.tr.csv', delimiter=',', skip_header=True)
X = Data[:, 1:3]
t = Data[:, 3]

gnb = gaussian_naive_bayes()
gnb.fit(X, t)

### Testing

In [4]:
Data_test = np.genfromtxt('synth.te.csv', delimiter=',', skip_header=True)
X_test = Data_test[:, 1:3]
y_actual = Data_test[:, 3]
y_predicted = gnb.predict(X_test)
acc = gnb.accuracy(y_actual, y_predicted)
print('Accuracy', acc, '%')

Accuracy 89.9 %


# Dataset 2

### Training

In [5]:
Data2 = np.genfromtxt('Data1.txt')
# Data2 = Data2[Data2[:,0] < 40] # Without outliers
X2 = Data2[:, 0:2]
t2 = np.array([1 if i >= 0 else 0 for i in Data2[:, 2]])

gnb2 = gaussian_naive_bayes()
gnb2.fit(X2, t2)

### Testing

In [6]:
Data2_test = np.genfromtxt('Test1.txt')
X2_test = Data2_test[:, 0:2]
y2_actual = np.array([1 if i >= 0 else 0 for i in Data2_test[:, 2]])
y2_predicted = gnb2.predict(X2_test)
acc2 = gnb2.accuracy(y2_actual, y2_predicted)
print('Accuracy', acc2, '%')

Accuracy 97.5 %


# Dataset 3

### Training

In [7]:
from sklearn.datasets import make_classification


In [8]:
X3, t3 = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_informative=1,
                             n_clusters_per_class=1, random_state=14, class_sep=1, n_classes=2)

gnb3 = gaussian_naive_bayes()
gnb3.fit(X3, t3)

### Testing

In [9]:
X3_test, y3_actual = make_classification(n_samples=300, n_features=2, n_redundant=0, n_informative=1,
                             n_clusters_per_class=1, random_state=14, class_sep=1, n_classes=2)

y3_predicted = gnb3.predict(X3_test)
acc3 = gnb3.accuracy(y3_actual, y3_predicted)
print('Accuracy', acc3, '%')

Accuracy 98.66666666666667 %
