# A Naive Bayes classifier

## Introduction

In this notebook, we will create a *probabilistic graphical model* (PGM) that classifies a data vector $\mathbf{x}$ into one of two classes, $y \in \{0,1\}$. We will use a *generative* approach and assume that the features are *conditionally independent* given the class label. This allows us to write the class-conditional density as a product of one-dimensional densities: $p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \prod_{j=1}^{D} p(x_{j}|y=c,\boldsymbol{\theta}_{j,c})$, where $D$ is the number of features. This model is commonly referred to as a *Naive Bayes classifier*. The model is "naive" because it assumes that the features of $\mathbf{x}$ are independent given the class label, an assumption that rarely holds exactly in practice.

These models typically have a small number of parameters, on the order of $O(CD)$, where $C$ is the number of classes. This can make them robust against overfitting and computationally cheap to fit. Features can be discrete-valued, real-valued, or a mix of both.
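
As a rough illustration of the $O(CD)$ claim, the sketch below counts free parameters for a Gaussian Naive Bayes model and, for comparison, for a Gaussian model with a full covariance matrix per class. Both function names are hypothetical, and the counts assume one mean and one variance per feature per class plus $C-1$ free class weights.

```python
def naive_bayes_param_count(num_classes: int, num_features: int) -> int:
    """Free parameters of a Gaussian Naive Bayes model (hypothetical helper)."""
    per_class = 2 * num_features      # one mean and one variance per feature
    priors = num_classes - 1          # class weights must sum to one
    return num_classes * per_class + priors


def full_covariance_param_count(num_classes: int, num_features: int) -> int:
    """Free parameters when each class has a full covariance matrix."""
    d = num_features
    per_class = d + d * (d + 1) // 2  # mean vector + symmetric covariance matrix
    return num_classes * per_class + (num_classes - 1)


# The Naive Bayes count grows linearly in D, the full-covariance count quadratically.
for d in (2, 10, 100):
    print(d, naive_bayes_param_count(2, d), full_covariance_param_count(2, d))
```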

In our example, we will use real-valued features and model each one with a Gaussian distribution, $p(\mathbf{x}|y=c,\boldsymbol{\theta}) = \prod_{j=1}^{D}\mathcal{N}(x_{j}|\mu_{j,c},\sigma_{j,c}^{2})$, where $\mu_{j,c}$ is the mean of feature $j$ for class $c$, and $\sigma_{j,c}^{2}$ is its variance.
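
To make the factorisation concrete, here is a minimal sketch that evaluates the class-conditional log-density as a sum of one-dimensional Gaussian log-densities and compares it with a multivariate Gaussian that has a diagonal covariance. The helper name `gaussian_nb_log_density` and the parameter values are illustrative assumptions, not part of the notebook's model code.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


def gaussian_nb_log_density(x, means, variances):
    """log p(x | y=c) as a sum over features of log N(x_j | mu_jc, sigma_jc^2).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    return norm.logpdf(x, loc=means, scale=np.sqrt(variances)).sum(axis=-1)


x = np.array([4.0, 6.0])
mu_c = np.array([5.0, 5.0])    # illustrative per-feature means for one class
var_c = np.array([1.0, 1.0])   # illustrative per-feature variances

# The factorised density should agree with a diagonal-covariance Gaussian.
log_p_factorised = gaussian_nb_log_density(x, mu_c, var_c)
log_p_joint = multivariate_normal(mean=mu_c, cov=np.diag(var_c)).logpdf(x)
print(log_p_factorised, log_p_joint)
```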

Another way to think about this approach is that it breaks the classification task down into a number of smaller sub-tasks, each of which is handled by a separate model. The rationale is that a simple model may be sufficient to interpret a single feature. Each model yields a posterior probability over the classes, and these can be combined using the rules of probability. In contrast, a model without the independence assumption would use the entire input vector at once and could share information among the features.
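
Concretely, under the conditional independence assumption the per-feature densities combine through Bayes' rule as follows, where $\pi_c = p(y=c)$ is shorthand introduced here for the class prior:

$$p(y=c|\mathbf{x},\boldsymbol{\theta}) = \frac{\pi_c \prod_{j=1}^{D} p(x_{j}|y=c,\boldsymbol{\theta}_{j,c})}{\sum_{c'} \pi_{c'} \prod_{j=1}^{D} p(x_{j}|y=c',\boldsymbol{\theta}_{j,c'})}$$

In practice the products are usually evaluated as sums of log-densities, which avoids numerical underflow when $D$ is large.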

Please see the resources below for more information:

- Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.
- Murphy, Kevin P. *Machine Learning: A Probabilistic Perspective*. MIT Press, 2012.







### Setup

In [None]:
import os
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

### Config

In [None]:
model_dir = Path('../models/model/')
data_dir = Path('../data/')

### Generate data

In [None]:
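def sample_component(component, means, covars):
    """Draw one sample, as a column vector, from the Gaussian of the given class."""
    if component not in (0, 1):
        # previously an unknown component silently returned None
        raise ValueError(f"unknown component: {component}")
    return np.random.multivariate_normal(means[component], covars[component], 1).T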

In [None]:
# specify class distributions
class0_weight = 0.5
class1_weight = 0.5

class0_means = np.array([5, 5])
class1_means = np.array([3, 7])

class0_covar = np.array([[1, 0],
                         [0, 1]])
class1_covar = np.array([[1, 0],
                         [0, 1]])

In [None]:
N = 100

means = [class0_means, class1_means]
covars = [class0_covar, class1_covar]
    
mask = np.random.choice([0, 1], N, p=[class0_weight, class1_weight])
data = [sample_component(i, means, covars) for i in mask]
data = np.array(data).reshape(N, 2)
df_data = pd.DataFrame(data, columns=['x0', 'x1'])
df_data['class'] = mask

# store dataset
df_data.to_csv(data_dir/'data.csv', sep='|', header=False, index=False)

In [None]:
# peek at our data set, coloured by class label
plt.scatter(df_data['x0'],
            df_data['x1'],
            c=df_data['class'])
plt.title("Data")
plt.xlabel(r"$x_0$")
plt.ylabel(r"$x_1$")
plt.grid()
plt.show()

### Running model

In [None]:
# build the command that runs the C# Infer.NET model
cmd = f'dotnet run --project {model_dir} {data_dir}/ data.csv'
cmd

In [None]:
!{cmd}

### Results

In [None]:
# load results from file
df_result = pd.read_csv(data_dir/'results.csv', sep='|')

meanClass0 = np.array([df_result.loc[0, 'meanPost0'], df_result.loc[0, 'meanPost1']])
meanClass1 = np.array([df_result.loc[1, 'meanPost0'], df_result.loc[1, 'meanPost1']])

class0Prob = df_result.loc[0, 'classPost']
class1Prob = df_result.loc[1, 'classPost']

print("Class0: ", class0Prob, meanClass0)
print("Class1: ", class1Prob, meanClass1)

In [None]:
# visualise results
fig, ax = plt.subplots()

ax.scatter(df_data[df_data['class'] == 0]['x0'],
           df_data[df_data['class'] == 0]['x1'],
           color='red',
           label='class0',
           marker='.',
           alpha=0.3)

ax.scatter(df_data[df_data['class'] == 1]['x0'],
           df_data[df_data['class'] == 1]['x1'],
           color='blue',
           label='class1',
           marker='.',
           alpha=0.3)

x, y = np.mgrid[-1:10:.2, -1:10:.2]
pos = np.dstack((x, y))

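# contours of the class densities centred on the inferred means;
# the covariance is fixed to the identity, as in the generating distributions
# (contour() does not accept a `label` keyword, so it is omitted here)
rv = multivariate_normal(meanClass0, [[1, 0], [0, 1]])
ax.contour(x, y, rv.pdf(pos), colors='red', alpha=0.7)

rv = multivariate_normal(meanClass1, [[1, 0], [0, 1]])
ax.contour(x, y, rv.pdf(pos), colors='blue', alpha=0.7)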

ax.set_title("Results")
ax.set_xlabel(r"$x_0$")
ax.set_ylabel(r"$x_1$")
ax.grid()
ax.legend()
plt.show()
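
As a possible follow-up, the inferred means and class probabilities could be used to classify new points. The sketch below is an illustration under stated assumptions rather than part of the Infer.NET model: it uses placeholder values in place of `meanClass0`, `meanClass1`, `class0Prob` and `class1Prob`, assumes unit variances (as the contour plot does), and the helper `class_posterior` is hypothetical.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Placeholder values standing in for the quantities loaded from results.csv.
class_probs = np.array([0.5, 0.5])
class_means = np.array([[5.0, 5.0],
                        [3.0, 7.0]])
class_stds = np.ones_like(class_means)  # assumption: unit variance per feature


def class_posterior(points):
    """Return p(y=c | x) for each row of `points` as an array of shape (N, C)."""
    points = np.atleast_2d(np.asarray(points, dtype=float))
    # log p(x | y=c): sum of per-feature Gaussian log-densities, shape (N, C)
    log_lik = norm.logpdf(points[:, None, :],
                          loc=class_means,
                          scale=class_stds).sum(axis=-1)
    log_joint = np.log(class_probs) + log_lik
    # normalise in log space for numerical stability
    return np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))


posterior = class_posterior([[5.0, 5.0], [3.0, 7.0], [4.0, 6.0]])
print(posterior.round(3))
print(posterior.argmax(axis=1))
```

Normalising with `logsumexp` keeps the computation stable for points far from both class means, where the raw densities would underflow to zero.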