<font size=20><b>Bayesian Classification</b></font>

<h1><b>Content</b></h1>

<ul>
  <li>Basic Mathematics</li>
  <li>Variety of implementations in Python</li>
  <li>Preprocessing</li>
  <li>GaussianNB testing</li>
  <ul>
  <li>sample_weight parameter effects</li>
  <li>var_smoothing parameter effects</li>
  </ul>
  <li>Comparison of speed and accuracy with SVM</li>
</ul>

<h1><b>Basic mathematics</b></h1>

Naive Bayesian classifiers assume that the effect of a feature value on a given class is independent of the values of the other features. This assumption is called class-conditional independence, and it simplifies the calculations.[1]

In Bayes' theorem, $P(H|X)$ denotes the probability that hypothesis $H$ holds given the observed tuple $X$. Here $H$ is the hypothesis that tuple $X$ belongs to class $C_i$, so we want to calculate P(tuple $X$ belongs to class $C_i$ | observed values of $X$). To simplify the notation, we denote this expression by $P(C_i|X)$, and:<br/><br/>

(1) - $P(C_i|X)=\frac{P(C_i\cap X)}{P(X)}=\frac{P(X|C_i)P(C_i)}{P(X)}$

We can estimate $P(X)$, $P(C_i)$ and $P(X|C_i)$ from the data set. Suppose the number of features is $k$ and $X=(x_1,x_2,...,x_k)$, the number of samples is $n$, and the number of classes is $m$. Then:

$P(C_i) =\frac{|C_i|}{n}$ , $P(X) =\frac{|X|}{n}$

where $|C_i|$ is the number of samples in class $C_i$ and $|X|$ is the number of samples whose feature values equal $X$.

Because $P(X)$ is the same for all classes, in practice we can skip its calculation, which reduces the computation time and speeds up the model. On a large data set, calculating $P(X|C_i)$ directly is expensive. To reduce this cost, we use class-conditional independence and assume that there is no dependence between the features given the class, and therefore:

$P(X|C_i)=P(x_1|C_i)\times P(x_2|C_i)\times...\times P(x_k|C_i)=\prod_{j=1}^{k}{P(x_j|C_i)}$

If feature $j$ is discrete or categorical, $P(x_j|C_i)$ is the number of tuples in class $C_i$ whose $j$-th feature equals $x_j$, divided by $|C_i|$.
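As a minimal sketch of this counting estimate, assuming a hypothetical toy data set (the `rows` list, its values, and the helper names `prior` and `conditional` are made up for illustration and are not the dementia data):

```python
from collections import Counter

# Hypothetical toy data: (feature value, class label) pairs, for illustration only.
rows = [
    ("R", "Demented"), ("R", "Demented"), ("L", "Demented"),
    ("R", "Nondemented"), ("R", "Nondemented"),
    ("R", "Nondemented"), ("L", "Nondemented"), ("R", "Nondemented"),
]

n = len(rows)
class_counts = Counter(label for _, label in rows)  # |C_i| for every class
pair_counts = Counter(rows)                         # tuples in C_i with a given value

def prior(label):
    """P(C_i) = |C_i| / n."""
    return class_counts[label] / n

def conditional(value, label):
    """P(x_j | C_i): matching tuples in C_i divided by |C_i|."""
    return pair_counts[(value, label)] / class_counts[label]

print(prior("Demented"))             # 3 of the 8 tuples are Demented
print(conditional("L", "Demented"))  # 1 of the 3 Demented tuples has value "L"
```

A value that never occurs in a class gets probability 0 under this plain count, which is why practical implementations usually add a smoothing term.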

If feature $j$ is continuous, it is usually assumed to have a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, defined as follows: $g(x,\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

So, $g(x_j,\mu_{C_i},\sigma_{C_i})=P(x_j|C_i)$
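The Gaussian likelihood can be sketched in a few lines of Python. This version uses the standard normalized form (negative exponent, normalizing constant $\sqrt{2\pi}\,\sigma$); the class statistics `mu_c` and `sigma_c` are illustrative values, not estimates from the data set.

```python
import math

def gaussian_density(x, mu, sigma):
    """g(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) * sigma)."""
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical class statistics for one feature (illustrative values only).
mu_c, sigma_c = 77.0, 8.0

print(gaussian_density(77.0, mu_c, sigma_c))  # density at the class mean
print(gaussian_density(93.0, mu_c, sigma_c))  # two standard deviations away
```

Note that this is a density, not a probability, so values above 1 are possible when $\sigma$ is small.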

<b>Predicting class of X</b>

To determine the class of $X$, the goal is to find the class that maximizes $P(C_i|X)$ from (1),<br/>
so that $P(C_i|X)>P(C_j|X) \text{ for } 1\le j \le m, j \neq i$

For this, we do not need to calculate $P(X)$ because it is the same for all classes, which also reduces the amount of computation. So the Bayes classifier assigns $X$ to class $C_i$ if and only if:

$P(X|C_i)P(C_i)>P(X|C_j)P(C_j) \text{ for } 1\le j \le m, j\neq i$
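The decision rule can be sketched as follows for continuous features. The `model` dictionary with its priors, means, and standard deviations is entirely hypothetical. The comparison is done on logarithms, which preserves the ordering of the products and avoids multiplying many small numbers.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log of the Gaussian density g(x, mu, sigma)."""
    return -math.log(math.sqrt(2.0 * math.pi) * sigma) - (x - mu) ** 2 / (2.0 * sigma ** 2)

# Hypothetical model: per-class prior and (mean, std) for each of two features.
model = {
    "Demented":    {"prior": 0.4, "stats": [(24.0, 4.0), (0.72, 0.03)]},
    "Nondemented": {"prior": 0.6, "stats": [(29.0, 1.0), (0.74, 0.04)]},
}

def predict(x):
    """Return the class C_i maximizing log P(C_i) + sum_j log P(x_j | C_i)."""
    scores = {}
    for label, params in model.items():
        score = math.log(params["prior"])
        for x_j, (mu, sigma) in zip(x, params["stats"]):
            score += log_gaussian(x_j, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

print(predict([22.0, 0.70]))
print(predict([29.5, 0.75]))
```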

<h1><b>Variety of implementations in Python</b></h1>
scikit-learn provides several naive Bayes classifiers for Python, such as Gaussian Naive Bayes, Bernoulli Naive Bayes, and Categorical Naive Bayes. As their names suggest, each of them is suited to a different kind of data.


  * GaussianNB implements the Gaussian Naive Bayes algorithm for classification. The likelihood of the features is assumed to be Gaussian.

  * BernoulliNB implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions; i.e., there may be multiple features but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.

  * CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, which is described by the index $i$, has its own categorical distribution.[2]
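To illustrate how the three estimators map to feature types, here is a minimal sketch on synthetic data. The arrays `X_cont`, `X_bin`, and `X_cat` are generated only for this example and have nothing to do with the dementia data set.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)  # two synthetic classes

# Continuous features with a class-dependent mean: GaussianNB
X_cont = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(200, 3))
# Binary features with a class-dependent probability of a 1: BernoulliNB
X_bin = rng.binomial(1, 0.2 + 0.6 * y[:, None], size=(200, 3))
# Categorical features encoded as integers 0..2: CategoricalNB
X_cat = rng.integers(0, 2, size=(200, 3)) + y[:, None]

scores = {}
for name, model, X in [
    ("GaussianNB", GaussianNB(), X_cont),
    ("BernoulliNB", BernoulliNB(), X_bin),
    ("CategoricalNB", CategoricalNB(), X_cat),
]:
    # Training accuracy only; enough to show that each model fits its data type.
    scores[name] = model.fit(X, y).score(X, y)
    print(name, scores[name])
```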


<b>According to the dataset we have, we use Gaussian Naive Bayes in this notebook.</b>

* Preprocessing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.utils import class_weight
from sklearn.svm import SVC
import time
print("Reading archived data from google colab")

#url ="https://www.kaggle.com/datasets/shashwatwork/dementia-prediction-dataset?select=dementia_dataset.csv"
dementias = pd.read_csv("/content/drive/MyDrive/DATA/dementia_dataset.csv")

print("Size before dropping the records with missed values:",dementias.shape)

# Since machine learning algorithms cannot work with missing data, we have to drop these records.
# Dropping the records with missing value
dementias = dementias.dropna()
print("Size after dropping the records with missed values:",dementias.shape)
# The dataset has two categorical columns
# Mapping categorical columns to 0 and 1
dementias['M/F'] = dementias['M/F'].map({'M': 0, 'F': 1})
dementias['Hand'] = dementias['Hand'].map({'R': 0, 'L': 1})

# Splitting our data
# By default, train_test_split reserves 25% of the dataset for testing.
# The 'Subject ID' and 'MRI ID' columns are identifiers, so we leave them out of the features
X = dementias[['Visit','MR Delay','M/F','Hand','Age','EDUC','SES','MMSE','CDR','eTIV','nWBV','ASF']]

# Target
y = dementias['Group']

X_train, X_test, y_train, y_test = train_test_split(X, y)

dementias.head()

Reading archived data from google colab
Size before dropping the records with missed values: (373, 15)
Size after dropping the records with missed values: (354, 15)


Unnamed: 0,Subject ID,MRI ID,Group,Visit,MR Delay,M/F,Hand,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
0,OAS2_0001,OAS2_0001_MR1,Nondemented,1,0,0,0,87,14,2.0,27.0,0.0,1987,0.696,0.883
1,OAS2_0001,OAS2_0001_MR2,Nondemented,2,457,0,0,88,14,2.0,30.0,0.0,2004,0.681,0.876
5,OAS2_0004,OAS2_0004_MR1,Nondemented,1,0,1,0,88,18,3.0,28.0,0.0,1215,0.71,1.444
6,OAS2_0004,OAS2_0004_MR2,Nondemented,2,538,1,0,90,18,3.0,27.0,0.0,1200,0.718,1.462
7,OAS2_0005,OAS2_0005_MR1,Nondemented,1,0,0,0,80,12,4.0,28.0,0.0,1689,0.712,1.039
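Because `train_test_split(X, y)` is called without `random_state`, every run of the cell produces a different split, so accuracies from separate runs are not directly comparable. A minimal sketch of a reproducible, stratified split follows; the arrays `X_demo` and `y_demo` are synthetic stand-ins with placeholder class labels, and the seed 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and an imbalanced 3-class target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = np.array(["Nondemented"] * 100 + ["Demented"] * 80 + ["Converted"] * 20)

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
labels, counts = np.unique(y_te, return_counts=True)
print(dict(zip(labels, counts)))
```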


* GaussianNB testing<br/>
As mentioned above, the likelihood of the features is assumed to be Gaussian:  $P(x_j|C_i)=\frac{1}{\sqrt{2\pi}\sigma_{C_i}}e^{-\frac{(x_j-\mu_{C_i})^2}{2\sigma_{C_i}^2}}$

In [None]:
gnb = GaussianNB()

#Unweighted fit
#gnb.partial_fit(X_train, y_train,np.unique(y_train))
gnb.fit(X_train, y_train)
#gnb.fit(X_train, y_train,sample_weight=5)
y_pred = gnb.predict(X_test)
acc = accuracy_score(y_test,y_pred)
print("Accuracy" , acc)


Accuracy 0.9213483146067416


One of the parameters of 'fit' is 'sample_weight', which we left out in the code above. In the following cells we assign different values to it and observe the result.

In [None]:

gnb = GaussianNB()
# Higher weights force the classifier to put more emphasis on the points.
# With class_weight=None every sample gets weight 1, so this fit is still unweighted.
weight = class_weight.compute_sample_weight(None, y_train)

y_pred = gnb.fit(X_train, y_train,sample_weight=weight).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)


Accuracy 0.9101123595505618


In [None]:
gnb = GaussianNB()
# A scalar sample_weight gives every sample the same weight, so no point is emphasized over another
y_pred = gnb.fit(X_train, y_train,sample_weight=5).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)


Accuracy 0.8651685393258427


In [None]:
gnb = GaussianNB()
# A scalar sample_weight gives every sample the same weight, so no point is emphasized over another
y_pred = gnb.fit(X_train, y_train,sample_weight=10).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)


Accuracy 0.9325842696629213


In [None]:
gnb = GaussianNB()
# A scalar sample_weight gives every sample the same weight, so no point is emphasized over another
y_pred = gnb.fit(X_train, y_train,sample_weight=15).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)


Accuracy 0.9213483146067416


In [None]:
gnb = GaussianNB()
# A scalar sample_weight gives every sample the same weight, so no point is emphasized over another
y_pred = gnb.fit(X_train, y_train,sample_weight=25).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)


Accuracy 0.9213483146067416
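The cells above pass a single number as `sample_weight`. A uniform weight cancels out in the weighted means, variances, and priors, so on a fixed training set it should leave the fitted model unchanged. The sketch below checks this on synthetic data (`X_demo` and `y_demo` are made up) and contrasts it with per-sample weights from `compute_sample_weight("balanced", ...)`, which do shift the class priors.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import class_weight

# Synthetic, imbalanced two-class data (illustrative only).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(90, 2)),
                    rng.normal(2.0, 1.0, size=(10, 2))])
y_demo = np.array([0] * 90 + [1] * 10)

plain = GaussianNB().fit(X_demo, y_demo)
scalar = GaussianNB().fit(X_demo, y_demo, sample_weight=5)

# Same class means and priors: the uniform weight cancels out.
print(np.allclose(plain.theta_, scalar.theta_))
print(np.allclose(plain.class_prior_, scalar.class_prior_))

# Per-sample weights that up-weight the rare class change the priors.
weights = class_weight.compute_sample_weight("balanced", y_demo)
balanced = GaussianNB().fit(X_demo, y_demo, sample_weight=weights)
print(plain.class_prior_, balanced.class_prior_)
```

If the goal is to emphasize particular samples or classes, an array of per-sample weights is therefore needed rather than a scalar.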


In [None]:
gnb = GaussianNB()

y_pred = gnb.partial_fit(X_train, y_train,np.unique(y_train)).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)

Accuracy 0.898876404494382


In [None]:
gnb = GaussianNB()
exec_start = time.time()

y_pred = gnb.partial_fit(X_train, y_train,np.unique(y_train),sample_weight=5).predict(X_test)

exec_end = time.time()

exec_time = exec_end - exec_start

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)
print("Time:", exec_time)

Accuracy 0.8876404494382022
Time: 0.03380084037780762


In [None]:
gnb = GaussianNB()
exec_start = time.time()

y_pred = gnb.partial_fit(X_train, y_train,np.unique(y_train),sample_weight=10).predict(X_test)
exec_end = time.time()

exec_time = exec_end - exec_start

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)
print("Time:", exec_time)

Accuracy 0.9101123595505618


In [None]:
gnb = GaussianNB()

y_pred = gnb.partial_fit(X_train, y_train,np.unique(y_train),sample_weight=15).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)

Accuracy 0.9213483146067416


In [None]:
gnb = GaussianNB()

y_pred = gnb.partial_fit(X_train, y_train,np.unique(y_train),sample_weight=25).predict(X_test)

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)

Accuracy 0.8426966292134831


var_smoothing parameter effects

In [None]:
for i in range(12):
  s = 1/(10**(1+i))
  gnb = GaussianNB(priors=None, var_smoothing = s)
  y_pred = gnb.fit(X_train, y_train).predict(X_test)
  acc = accuracy_score(y_test,y_pred)
  print("Accuracy" , acc)

Accuracy 0.4943820224719101
Accuracy 0.48314606741573035
Accuracy 0.5393258426966292
Accuracy 0.6741573033707865
Accuracy 0.7078651685393258
Accuracy 0.7528089887640449
Accuracy 0.8764044943820225
Accuracy 0.9438202247191011
Accuracy 0.9325842696629213
Accuracy 0.9325842696629213
Accuracy 0.9325842696629213
Accuracy 0.9325842696629213
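According to the scikit-learn documentation, `var_smoothing` is the portion of the largest feature variance that is added to all variances. When features live on very different scales, a large value can swamp the variance of small-scale features. The sketch below illustrates the arithmetic; the feature names come from this data set, but the variance values are hypothetical.

```python
# Hypothetical per-feature variances on very different scales (illustrative only).
variances = {"MR Delay": 400000.0, "eTIV": 30000.0, "Age": 58.0, "nWBV": 0.0014}

largest = max(variances.values())

for exponent in (1, 5, 9):
    var_smoothing = 10.0 ** -exponent
    epsilon = var_smoothing * largest  # amount added to every variance
    # Factor by which each variance grows once epsilon is added.
    inflation = {name: (v + epsilon) / v for name, v in variances.items()}
    print(f"var_smoothing=1e-{exponent}: epsilon={epsilon:g}, "
          f"nWBV variance inflated x{inflation['nWBV']:.3g}")
```

With a large `var_smoothing`, the small-scale feature ends up with almost the same inflated variance in every class and contributes little to the decision, which is one plausible reading of the low accuracies at the top of the list above.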


Comparison of speed and accuracy with SVM

In [None]:

exec_start = time.time()
# Building and training our model
classifier = SVC(kernel='linear',C=0.6)
classifier.fit(X_train, y_train)  
predictions = classifier.predict(X_test)
        
score = accuracy_score(y_test, predictions) 

exec_end = time.time()

exec_time = exec_end - exec_start

print("Time:",exec_time)
print("Accuracy:",score)
print('kernel: Linear')

Time: 38.619956731796265
Accuracy: 0.8876404494382022
kernel: Linear


In [None]:
exec_start = time.time()
# Building and training our model
classifier = SVC(kernel='rbf',C=1.0)
classifier.fit(X_train, y_train)  
predictions = classifier.predict(X_test)
        
score = accuracy_score(y_test, predictions) 

exec_end = time.time()

exec_time = exec_end - exec_start

print("Time:",exec_time)
print("Accuracy:",score)
print('kernel: rbf')

Time: 0.026592254638671875
Accuracy: 0.5168539325842697
kernel: rbf


In [None]:
exec_start = time.time()
# Building and training our model
classifier = SVC(kernel='poly',C=0.6,degree=2)
classifier.fit(X_train, y_train)  
predictions = classifier.predict(X_test)
        
score = accuracy_score(y_test, predictions) 

exec_end = time.time()

exec_time = exec_end - exec_start

print("Time:",exec_time)
print("Accuracy:",score)
print('kernel: poly')

Time: 0.02133774757385254
Accuracy: 0.43820224719101125
kernel: poly
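SVMs with `rbf` or `poly` kernels are sensitive to feature scale, and the features here range from fractions (nWBV) to thousands (eTIV). One common remedy, not used in this notebook, is to standardize the features in a pipeline. The following sketch uses synthetic data with one informative small-scale feature and one large-scale noise feature; all names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: one informative small-scale feature, one large-scale noise feature.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
informative = rng.normal(loc=y_demo, scale=0.1)        # class means 0 and 1
noise = rng.normal(loc=1500.0, scale=200.0, size=400)  # dominates raw distances
X_demo = np.column_stack([informative, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Same kernel and C, with and without standardization.
raw_svc = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
scaled_svc = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_tr, y_tr)

print("raw   :", raw_svc.score(X_te, y_te))
print("scaled:", scaled_svc.score(X_te, y_te))
```

Fitting the scaler inside the pipeline keeps the test data out of the scaling statistics.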


In [None]:
gnb = GaussianNB()
exec_start = time.time()

y_pred = gnb.partial_fit(X_train, y_train,np.unique(y_train),sample_weight=5).predict(X_test)

exec_end = time.time()

exec_time = exec_end - exec_start

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)
print("Time:", exec_time)

Accuracy 0.8876404494382022
Time: 0.013085126876831055


In [None]:
gnb = GaussianNB()
exec_start = time.time()

y_pred = gnb.fit(X_train, y_train,sample_weight=10).predict(X_test)

exec_end = time.time()

exec_time = exec_end - exec_start

acc = accuracy_score(y_test,y_pred)

print("Accuracy" , acc)
print("Time:", exec_time)

Accuracy 0.8876404494382022
Time: 0.0076389312744140625
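Single `time.time()` measurements of a few milliseconds are noisy, so small differences between the runs above should be read with care. A minimal sketch of a steadier measurement uses `time.perf_counter` and the median of several repetitions; the helper `time_call` and the stand-in workload are hypothetical.

```python
import statistics
import time

def time_call(fn, repeats=20):
    """Run fn() `repeats` times and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; in the notebook this could be, for example,
# lambda: GaussianNB().fit(X_train, y_train).predict(X_test)
median_seconds = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"median of 20 runs: {median_seconds:.6f} s")
```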


In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")

class MyBayesClassifier():
    '''
    Bayes Theorem form
    P(y|X) = P(X|y) * P(y) / P(X)
    '''
    def calc_prior(self, features, target):
        '''
        prior probability P(y)
        calculate prior probabilities
        '''
        self.prior = (features.groupby(target).apply(lambda x: len(x)) / self.rows).to_numpy()

        return self.prior
    
    def calc_statistics(self, features, target):
        '''
        calculate mean, variance for each column and convert to numpy array
        ''' 
        self.mean = features.groupby(target).apply(np.mean).to_numpy()
        self.var = features.groupby(target).apply(np.var).to_numpy()
              
        return self.mean, self.var
    
    def gaussian_density(self, class_idx, x):
        '''
        calculate probability from gaussian density function (normally distributed)
        we will assume that the value of each feature given a specific class is normally distributed

        probability density function derived from wikipedia:
        (1/(σ*√(2π))) * exp(-((x-μ)^2)/(2*σ²)), where μ is mean, σ² is variance, σ is square root of variance (standard deviation)
        '''
        mean = self.mean[class_idx]
        var = self.var[class_idx]
        # NOTE: a feature whose variance is zero within a class makes the density 0/0 = NaN
        numerator = np.exp(-((x-mean)**2) / (2 * var))
        denominator = np.sqrt(2 * np.pi * var)
        prob = numerator / denominator
        return prob
    
    def calc_posterior(self, x):
        posteriors = []

        # calculate posterior probability for each class
        for i in range(self.count):
            prior = np.log(self.prior[i]) ## use the log to make it more numerically stable
            conditional = np.sum(np.log(self.gaussian_density(i, x))) # use the log to make it more numerically stable
            posterior = prior + conditional
            posteriors.append(posterior)
        # return class with highest posterior probability
        return self.classes[np.argmax(posteriors)]
     

    def fit(self, features, target):
        self.classes = np.unique(target)
        self.count = len(self.classes)
        self.feature_nums = features.shape[1]
        self.rows = features.shape[0]
        
        self.calc_statistics(features, target)
        self.calc_prior(features, target)
        
    def predict(self, features):
        preds = [self.calc_posterior(f) for f in features.to_numpy()]
        return preds

    def accuracy(self, y_test, y_pred):
        accuracy = np.sum(y_test == y_pred) / len(y_test)
        return accuracy

    def visualize(self, y_true, y_pred, target):
        
        tr = pd.DataFrame(data=y_true, columns=[target])
        pr = pd.DataFrame(data=y_pred, columns=[target])
        
        
        fig, ax = plt.subplots(1, 2, sharex='col', sharey='row', figsize=(15,6))
        
        sns.countplot(x=target, data=tr, ax=ax[0], palette='viridis', alpha=0.7, hue=target, dodge=False)
        sns.countplot(x=target, data=pr, ax=ax[1], palette='viridis', alpha=0.7, hue=target, dodge=False)
        

        fig.suptitle('True vs Predicted Comparison', fontsize=20)

        ax[0].tick_params(labelsize=12)
        ax[1].tick_params(labelsize=12)
        ax[0].set_title("True values", fontsize=18)
        ax[1].set_title("Predicted values", fontsize=18)
        plt.show()

In [None]:
mygclf = MyBayesClassifier()

mygclf.fit(X_train,y_train)
predictions = mygclf.predict(X_test)
acc = mygclf.accuracy(y_test, predictions)
print("Accuracy" , acc)


Accuracy 0.0898876404494382


  return mean(axis=axis, dtype=dtype, out=out, **kwargs)
  numerator = np.exp((-1/2)*((x-mean)**2) / (2 * var))
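An accuracy this low, together with the warning pointing at the `numerator` line, is consistent with NaN densities. If a feature has zero variance within a class (a constant column would be one candidate, such as the mapped `Hand` values in the preview, which are all 0 in the rows shown), the density becomes 0/0, every class score is NaN, and `np.argmax` then returns index 0 for every sample. The following is a minimal sketch of that failure mode and of one possible remedy, using made-up statistics. The smoothing constant mirrors the default of GaussianNB's `var_smoothing`, but applying it this way is an assumption, not the notebook's method.

```python
import numpy as np

def gaussian_log_density(x, mean, var):
    """Element-wise log of the Gaussian density with variance `var`."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

# Hypothetical per-class statistics: the first feature is constant in the class.
x = np.array([0.0, 75.0])
mean = np.array([0.0, 77.0])
var = np.array([0.0, 58.0])

# Zero variance gives log(0) and 0/0, so the summed log-likelihood is NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    raw = gaussian_log_density(x, mean, var).sum()

# Smoothing in the spirit of GaussianNB's var_smoothing (default 1e-9):
# add a small fraction of the largest variance to every variance.
epsilon = 1e-9 * var.max()
smoothed = gaussian_log_density(x, mean, var + epsilon).sum()

print(raw)                                  # nan
print(np.isfinite(smoothed))                # True
print(np.argmax([np.nan, np.nan, np.nan]))  # 0: all-NaN scores select the first class
```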


<b>References:</b><br/>
1 - Han, J., Kamber, M., Pei, J. Data Mining: Concepts and Techniques, 3rd ed.<br/>
2 - https://scikit-learn.org/stable/modules/naive_bayes.html

