# BAYES CLASSIFIERS

For any classifier $f: X \to Y$, its prediction error is:

$$P(f(X) \ne Y) = \mathbb{E}[ \mathbb{1}(f(X) \ne Y)] = \mathbb{E}[\mathbb{E}[ \mathbb{1}(f(X) \ne Y)|X]]$$

For each $x \in X$,

$$\mathbb{E}[ \mathbb{1}(f(X) \ne Y)|X = x]  = \sum\limits_{y \in Y}  P(Y = y|X = x) \cdot \mathbb{1}(f(x) \ne y)$$

The above quantity is minimized for this particular $x \in X$ when,

$$f(x) = \underset{y \in Y}{argmax} \space P(Y = y|X = x) \space \star$$
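
To see why, here is the intermediate step written out. The indicator is zero only for $y = f(x)$, and the conditional probabilities sum to one, so:

$$\sum\limits_{y \in Y}  P(Y = y|X = x) \cdot \mathbb{1}(f(x) \ne y) = 1 - P(Y = f(x)|X = x)$$

Minimizing the conditional error is therefore the same as choosing the label with the largest conditional probability.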

A classifier $f$ with property $\star$ for all $x \in X$ is called the `Bayes Classifier`.


Under the assumption $(X,Y) \overset{iid}{\sim} P$, the optimal classifier is:
$$f^{\star}(x) = \underset{y \in Y}{argmax} \space P(Y = y|X = x)$$

From _Bayes Rule_, and because the denominator $P(X = x)$ does not depend on $y$, we equivalently have:

$$f^{\star}(x) = \underset{y \in Y}{argmax} \space P(Y = y) \space P(X = x|Y = y)$$

Where
- $P(Y =y)$ is called _the class prior_
- $P(X = x|Y= y)$ is called _the class conditional distribution_ of $X$

Assume $X = \mathbb{R}$, $Y = \{ 0,1 \}$, and that the distribution $P$ of $(X,Y)$ is as follows:
- _Class prior_: $P(Y = y) = \pi_y$, $y \in \{ 0,1 \}$
- _Class conditional density_ for class $y \in \{ 0,1 \}$: $p_y (x) = N(x|\mu_y,\sigma^2_y)$

The Bayes classifier then becomes:

$$f^{\star}(x) = \underset{y \in \{ 0,1 \}}{argmax} \space P(Y = y) \space P(X = x|Y = y) = 
    \begin{cases}
      1 & \text{if} \space \frac{\pi_1}{\sigma_1}\space exp[- \frac{(x - \mu_1)^2}{2 \sigma^2_1}] > \frac{\pi_0}{\sigma_0}\space exp[- \frac{(x - \mu_0)^2}{2 \sigma^2_0}]\\
      0 & \text{otherwise}
    \end{cases}$$
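
The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function names `gaussian_density` and `bayes_classifier`, and the parameter values, are assumptions made for this example and do not come from the notebook.

```python
import math

def gaussian_density(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_classifier(x, priors, mus, sigmas):
    """Return 1 if pi_1 * p_1(x) > pi_0 * p_0(x), otherwise 0."""
    score0 = priors[0] * gaussian_density(x, mus[0], sigmas[0])
    score1 = priors[1] * gaussian_density(x, mus[1], sigmas[1])
    return 1 if score1 > score0 else 0

# Hypothetical parameters, chosen only for illustration
priors = (0.5, 0.5)
mus = (0.0, 2.0)
sigmas = (1.0, 1.0)

print(bayes_classifier(0.2, priors, mus, sigmas))
print(bayes_classifier(1.8, priors, mus, sigmas))
```

With equal priors and equal variances, as assumed here, the decision boundary lies halfway between the two means.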

### _Bayes Classifier_
![Bayes Classifier](.\image\BayesClassifier.png)

The `Bayes Classifier` has the smallest prediction error of all classifiers. The problem is that we need to know the distribution $P$ in order to construct the `Bayes Classifier`, and in practice $P$ is unknown.

# NAIVE BAYES CLASSIFIER

A simplifying assumption is that the feature values are conditionally independent given the label. Under this assumption, the probability of observing the conjunction $x_1, x_2, x_3, ..., x_d$ given the label is the product of the probabilities of the individual features:

$$ p(x_1, x_2, x_3, ..., x_d|y) = \prod \limits_j \space p(x_j|y)$$

Then the `Naive Bayes Classifier` is defined as:

$$f^{\star}(x) = \underset{y \in Y}{argmax} \space p(y) \space \prod \limits_j \space p(x_j|y)$$

We can estimate these two terms from the **frequency counts** in the dataset. If the features are real-valued, Naive Bayes can be extended by assuming that each feature follows a Gaussian distribution within each class. This extension is called `Gaussian Naive Bayes`. Other distributions can be used, but the Gaussian is the easiest to work with because we only need to estimate the mean and the standard deviation of each feature from the dataset.
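
For categorical features, the frequency-count estimate can be sketched as follows. This is a minimal illustration, not part of the notebook's implementation: the helper names `fit_counts` and `predict_counts` and the toy rows are hypothetical, and no smoothing is applied, so a feature value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

def fit_counts(rows):
    """rows: list of (features_tuple, label) pairs. Returns the count tables."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature index) -> Counter of values
    for features, label in rows:
        label_counts[label] += 1
        for j, value in enumerate(features):
            value_counts[(label, j)][value] += 1
    return label_counts, value_counts

def predict_counts(features, label_counts, value_counts):
    """Pick the label that maximizes p(y) * prod_j p(x_j | y), estimated by counts."""
    total = sum(label_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in label_counts.items():
        score = n_label / total  # estimate of p(y)
        for j, value in enumerate(features):
            score *= value_counts[(label, j)][value] / n_label  # estimate of p(x_j | y)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only
rows = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("sunny", "cool"), "yes"),
]
label_counts, value_counts = fit_counts(rows)
print(predict_counts(("rainy", "cool"), label_counts, value_counts))
```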

Ok, let's start with the implementation of `Gaussian Naive Bayes` from scratch.

In [1]:
##IMPORTING ALL NECESSARY SUPPORT LIBRARIES
import math as mt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
def separate_by_label(dataset):
    '''Group the rows of the dataset by class label (the last element of each row).'''
    separate = dict()
    for row in dataset:
        label = row[-1]
        if label not in separate:
            separate[label] = list()
        separate[label].append(row)

    return separate

In [3]:
def mean(list_num):
    return sum(list_num)/len(list_num)

In [4]:
def stdv(list_num):
    '''Sample standard deviation (divides by n - 1), so it needs at least two values.'''
    mu = mean(list_num)
    var = sum([(x - mu)**2 for x in list_num])/(len(list_num) - 1)

    return mt.sqrt(var)

In [5]:
def stats_per_feature(ds):
    '''
    argument:
        > ds: list of rows belonging to a single class (features followed by the label)
    returns:
        > stats: list with one (mean, stdv, count) tuple per feature
    '''
    stats = [(mean(col), stdv(col), len(col)) for col in zip(*ds)]
    del stats[-1]  # drop the statistics of the label column

    return stats

In [6]:
def summary_by_class(dataset):
    sep_label = separate_by_label(dataset)
    summary = dict()
    for label, rows in sep_label.items():
        summary[label] = stats_per_feature(rows)
    
    return summary

In [7]:
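def gaussian_pdf(x, mean, stdv):
    # Argument order matches the call in class_probabilities: value first, then mean and standard deviation
    _exp = mt.exp(-1*((x - mean)**2/(2*stdv**2)))
    return (1/(mt.sqrt(2 * mt.pi)*stdv)) * _exp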

Now it is time to use the statistics calculated from the data to calculate probabilities for new data.

Probabilities are calculated separately for each class, so we calculate the probability that a new piece of data belongs to the first class, then calculate the probability that it belongs to the second class, and so on for all the classes.

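For example, if we have two inputs $x_1$ and $x_2$, the score for class _y_ is calculated as shown below. It is proportional to the posterior probability, because the normalizing term $P(x_1,x_2)$ is the same for every class and can be omitted:

$$P(class = y|x_1,x_2) \propto P(x_1|class = y) \cdot P(x_2|class = y) \cdot P(class = y)$$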
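
A product of many small densities can underflow to zero, so a common variation (not what this notebook's functions do) is to compare sums of logarithms instead; the class with the highest log-score is the same as the class with the highest product. The sketch below assumes hypothetical per-class statistics, and the names `log_gaussian_pdf` and `log_class_score` are introduced only for this example.

```python
import math

def log_gaussian_pdf(x, mean, stdv):
    """Logarithm of the N(mean, stdv^2) density evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * stdv ** 2) - (x - mean) ** 2 / (2.0 * stdv ** 2)

def log_class_score(row, prior, feature_stats):
    """log P(class) + sum_i log p(x_i | class); feature_stats is a list of (mean, stdv)."""
    score = math.log(prior)
    for x, (mean, stdv) in zip(row, feature_stats):
        score += log_gaussian_pdf(x, mean, stdv)
    return score

# Hypothetical statistics for two classes and two features
stats = {0: [(2.0, 1.0), (3.0, 1.0)], 1: [(7.0, 1.0), (3.0, 1.5)]}
priors = {0: 0.5, 1: 0.5}
row = [3.0, 2.5]

scores = {c: log_class_score(row, priors[c], stats[c]) for c in stats}
print(max(scores, key=scores.get))
```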

In [8]:
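def class_probabilities(summary, row):
    # Total number of training rows, taken from the count stored with the first feature of each class
    total = sum([summary[label][0][2] for label in summary])
    probabilities = dict()
    
    for class_, class_summary in summary.items():
        # Start from the class prior P(class)
        probabilities[class_] = class_summary[0][2]/total
        for i in range(len(class_summary)):
            mean, stdev, count = class_summary[i]
            # Multiply by the Gaussian likelihood of feature i; keyword arguments guard against argument-order mistakes
            probabilities[class_] *= gaussian_pdf(x=row[i], mean=mean, stdv=stdev)
            
    # The values are unnormalized scores: they are not divided by P(x)
    return probabilities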

In [9]:
def predict(summary, row):
    cls_prob = class_probabilities(summary, row)
    _label, _prob = None, -1.0
    for class_, probability in cls_prob.items():
        if _label is None or probability > _prob:
            _prob = probability
            _label = class_
            
    return _label    

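To verify the implementation, the algorithm is evaluated on a **toy dataset**. The predictions are made on the same rows used to compute the summaries, so this is a sanity check rather than a measure of generalization.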

In [10]:
dataset = [[3.393533211,2.331273381,0],
[3.110073483,1.781539638,0],
[1.343808831,3.368360954,0],
[3.582294042,4.67917911,0],
[2.280362439,2.866990263,0],
[7.423436942,4.696522875,1],
[5.745051997,3.533989803,1],
[9.172168622,2.511101045,1],
[7.792783481,3.424088941,1],
[7.939820817,0.791637231,1]]

summaries = summary_by_class(dataset)
for row in dataset:
    y_pred = predict(summaries, row)
    y_real = row[-1]
    print("Expected={0}, Predicted={1}".format(y_real, y_pred))

Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=0, Predicted=0
Expected=1, Predicted=1
Expected=1, Predicted=1
Expected=1, Predicted=1
Expected=1, Predicted=1
Expected=1, Predicted=1


# _GAUSSIAN NAIVE BAYES APPLICATION_

We will train our `Gaussian Naive Bayes` model on the Iris dataset from the `UCI Machine Learning Repository`. The Iris dataset is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day.

The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are not linearly separable from each other.

The dataset has 150 instances and the following attributes:

   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. class:
      - Iris Setosa
      - Iris Versicolour
      - Iris Virginica

To assess the performance of our _Classifier_ on the **Iris** dataset, a Gaussian Naive Bayes model from `sklearn` is fit on the same training data, and a classification report is generated for both models.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

In [12]:
##LOADING 'IRIS' DATASET
columns = ['sepal-len','sepal-wid','petal-len','petal-wid','class']
df = pd.read_csv('./data/Iris.csv', names = columns)
df.head()

Unnamed: 0,sepal-len,sepal-wid,petal-len,petal-wid,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal-len    150 non-null float64
sepal-wid    150 non-null float64
petal-len    150 non-null float64
petal-wid    150 non-null float64
class        150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


Because the class variable is `categorical`, we first need to encode it as a numeric type so that it can be fed into our models.

In [14]:
def encoder(df, class_value_pair):
    '''Replace each class name in the 'class' column with its numeric code.'''
    for class_name, value in class_value_pair.items():
        df['class'] = df['class'].replace(class_name, value)
        
    return df

class_encoder = {'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2}
df = encoder(df, class_encoder)
df.head()

Unnamed: 0,sepal-len,sepal-wid,petal-len,petal-wid,class
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [15]:
df['class'].value_counts().sort_index()

0    50
1    50
2    50
Name: class, dtype: int64

Once the preprocessing is complete, the dataset is split into a `Training` and a `Test` dataset (70% and 30% of the rows, respectively).

In [16]:
X_ = df.drop(['class'],axis = 1)
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(X_, y, test_size = 0.30, random_state = 5)

Now we can `train` our custom model. Notice that our _Gaussian Naive Bayes_ model expects a complete dataset (attributes and labels) in order to calculate the summaries.

In [17]:
ds_train = pd.concat([X_train, y_train], axis = 1)
GNB_custom = summary_by_class(ds_train.values.tolist())

In [18]:
ds_test = pd.concat([X_test, y_test], axis = 1)
cust_pred = [predict(GNB_custom, row) for row in ds_test.values.tolist()]
cust_pred = np.array(cust_pred, dtype = 'int64')

In [19]:
cust_pred

array([1, 1, 2, 0, 2, 1, 0, 2, 0, 1, 1, 1, 2, 2, 0, 0, 2, 2, 0, 0, 1, 2,
       0, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 2, 0, 2, 2, 1, 0, 0,
       1], dtype=int64)

Now an instance of the `sklearn` _Gaussian Naive Bayes_ model is created and fit on the training data, and an array of predictions is obtained for the performance comparison.

In [20]:
##GET AN INSTANCE OF GAUSSIAN NAIVE BAYES MODEL
GNB_skln = GaussianNB()
GNB_skln.fit(X_train, y_train)

##CREATE SKLEARN PREDICTIONS ARRAY
sk_pred = GNB_skln.predict(X_test)

In [21]:
sk_pred

array([1, 1, 2, 0, 2, 1, 0, 2, 0, 1, 1, 1, 2, 2, 0, 0, 2, 2, 0, 0, 1, 2,
       0, 1, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 2, 0, 2, 2, 1, 0, 0,
       1], dtype=int64)

Finally, both models are compared through a _Classification Report_.

In [22]:
print("Sklearn:")
print(classification_report(y_test, sk_pred))
print("Custom:")
print(classification_report(y_test, cust_pred))

Sklearn:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.88      0.94      0.91        16
           2       0.92      0.86      0.89        14

   micro avg       0.93      0.93      0.93        45
   macro avg       0.94      0.93      0.93        45
weighted avg       0.93      0.93      0.93        45

Custom:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       0.88      0.94      0.91        16
           2       0.92      0.86      0.89        14

   micro avg       0.93      0.93      0.93        45
   macro avg       0.94      0.93      0.93        45
weighted avg       0.93      0.93      0.93        45
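
For reference, the per-class precision and recall columns of such a report can be computed by hand. The following is a minimal sketch on hypothetical labels (not the Iris predictions above); `precision_recall` is a helper name introduced here for illustration and is not part of `sklearn`.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall of one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 1]

for label in sorted(set(y_true)):
    p, r = precision_recall(y_true, y_pred, label)
    print(label, round(p, 2), round(r, 2))
```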

