# Naive Bayes classifier (NBC)

### <span style='color:yellow'>The term "naive" in NBC refers to the naive assumption that the components of the feature vector are independent of one another (given the class label). </span>

### <span style='color:yellow'>The NBC is one of the simplest probabilistic classifiers based on Bayes' theorem.</span>

### <span style='color:yellow'>In Bayes' theorem, if we have two events A and B, then the probability that event A occurs given that B has occurred is:</span>

$$
    \LARGE{P(A|B)} = {\frac{P(B|A) \cdot P(A)}{P(B)}} 
$$

### <span style='color:yellow'>P(A|B) is called posterior probability. </span>

### <span style='color:yellow'>P(B|A) is called the class conditional probability; it can be obtained from a Gaussian distribution. </span>


### <span style='color:yellow'>P(A) is called the prior of event A; in practice, we count how many times event A occurs and use that as the prior. </span>

### <span style='color:yellow'>P(B) is called the prior of event B (also known as the evidence); in practice, we count how many times event B occurs and use that as the prior. </span>
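
### <span style='color:yellow'>As a small worked example of Bayes' theorem, the sketch below computes P(A|B) from a prior and two class conditional probabilities. All numbers and the function name `bayes_posterior` are hypothetical and chosen only for illustration; P(B) is expanded with the law of total probability.</span>

```python
def bayes_posterior(prior_a, b_given_a, b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = b_given_a * prior_a + b_given_not_a * (1.0 - prior_a)
    # Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
    return b_given_a * prior_a / evidence


# Hypothetical values: P(A)=0.2, P(B|A)=0.6, P(B|not A)=0.05
posterior = bayes_posterior(0.2, 0.6, 0.05)
print(round(posterior, 4))  # 0.75
```

### <span style='color:yellow'>Although the prior P(A) is only 0.2 here, observing B raises the probability of A to 0.75, because B is far more likely under A than otherwise.</span>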

# Naive Bayes classifier (NBC) for machine learning

### <span style='color:yellow'> To integrate the NBC with other machine learning models, we should unify the mathematical notation and conventions; thus, we will use the following mathematical model: </span>  

$$
    \LARGE{P(y|X)} = {\frac{P(X|y) \cdot P(y)}{P(X)}} 
$$

### <span style='color:yellow'> The above equation reads as the probability of class label y given the feature vector $X = \{x_1, x_2, x_3, \dots, x_n\}$. </span>  

### <span style='color:yellow'> Remember: the feature vector components should be mutually independent.</span>  


### <span style='color:yellow'>An example of feature component independence: predicting a hotel class (3* or 5*) from its location and room size, where location and room size are assumed to be independent of each other.</span>  



# How does the independence property affect the NBC?


### <span style='color:yellow'> In real-life applications, independence among the feature vector components may be hard to guarantee. </span>  


### <span style='color:yellow'> We will consider a dataset of N samples, where each sample comprises different and mutually independent feature vector components.</span>  


### <span style='color:yellow'>With the independence assumption, we are able to factorize (split) $P(X|y)$, which is the main portion of the NBC. Applying the chain rule of probability, every factor reduces to a per-feature term that depends only on y: </span>  

$$
{P(X|y)=P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y)}
$$

### <span style='color:yellow'>Thus, the NBC formula becomes:</span>

$$
P(y|X)= \frac{P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y) \cdot P(y)}{P(X)}
$$

### <span style='color:yellow'> Because the model is used for classification, we only need to compare the classes with each other. P(X) does not contain y, so it is the same for every class; accordingly, we remove P(X) from the denominator and keep only the numerator, to which the posterior is proportional:</span>
    
$$
P(y|X) \propto P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y) \cdot P(y)
$$
    
### <span style='color:yellow'>Remember: P(y|X) is called the posterior probability, $P(x_i|y)$ is called the class conditional probability, and P(y) is called the prior of the labels (the occurrence of each class: we count the labels of each class and express that count, relative to the number of samples, as the prior).</span>
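
### <span style='color:yellow'>To see why dropping P(X) is harmless, here is a minimal sketch with hypothetical priors and class conditional probabilities for two classes and two features. It compares the winning class before and after normalization; the variable names are assumptions made for this example.</span>

```python
import numpy as np

# Hypothetical values for a single sample with two features
prior = np.array([0.6, 0.4])            # P(y=0), P(y=1)
likelihoods = np.array([[0.2, 0.5],     # P(x_1|y=0), P(x_2|y=0)
                        [0.7, 0.4]])    # P(x_1|y=1), P(x_2|y=1)

# Numerator of the NBC formula: product over the features times the prior
scores = likelihoods.prod(axis=1) * prior   # approximately [0.06, 0.112]

# P(X) is the sum of the numerators over all classes; dividing by it
# rescales every score by the same constant
posterior = scores / scores.sum()

print(scores.argmax() == posterior.argmax())  # True
```

### <span style='color:yellow'>The normalized values sum to one, but the ranking of the classes, and hence the predicted label, is the same with or without the division.</span>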

# Classification/ class selection

###  <span style='color:yellow'>To perform the classification, we want to predict the class label based on probabilistic values. In practice, the model produces a vector of probabilities whose length equals the number of unique classes.</span>

###  <span style='color:yellow'>Given this vector of probabilities, we use argmax to obtain the index of the highest probability and retrieve the class label at that index:</span>

$$
y=\mathrm{argmax}_y \; P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y) \cdot P(y)
$$
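
### <span style='color:yellow'>The index-then-label lookup described above can be sketched as follows. The class names and scores are hypothetical; the point is that argmax returns a position in the score vector, which is then used to look up the label.</span>

```python
import numpy as np

classes = np.array(["3-star", "5-star"])   # hypothetical class labels
scores = np.array([0.06, 0.112])           # hypothetical score per class

best_index = np.argmax(scores)             # position of the highest score
predicted_label = classes[best_index]      # label stored at that position

print(predicted_label)  # 5-star
```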


# Log-trick 

### <span style='color:yellow'> Because we already factorized the class conditional probability $P(X|y)$ into $P(x_1|y) \cdot P(x_2|y) \cdots P(x_n|y)$, the computation multiplies many values together, and we might face numerical underflow.</span>


### <span style='color:yellow'>Generally, the multiplication of several small numbers leads to a very small number that floating-point arithmetic may round to zero; that is the underflow problem in the multiplication.</span>


### <span style='color:yellow'>To avoid underflow, we turn the multiplication into addition by applying the logarithm to the NBC model. The logarithm is monotonically increasing, so the argmax is unchanged:</span>


$$
y=\mathrm{argmax}_y \; \mathrm{log}(P(x_1|y))+\mathrm{log}(P(x_2|y))+\cdots+\mathrm{log}(P(x_n|y))+\mathrm{log}(P(y))
$$
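
### <span style='color:yellow'>A minimal sketch of why the logarithm helps: the hypothetical values below stand in for 100 small class conditional probabilities. Their direct product cannot be represented as a 64-bit float, while the sum of their logarithms is an ordinary number.</span>

```python
import math

probabilities = [1e-5] * 100   # 100 hypothetical small probabilities

# Direct multiplication: the true result is 1e-500, far below the
# smallest positive float, so the running product collapses to zero
product = 1.0
for p in probabilities:
    product *= p

# Log-trick: add the logarithms instead of multiplying the values
log_sum = sum(math.log(p) for p in probabilities)

print(product)  # 0.0
```

### <span style='color:yellow'>Here `log_sum` is roughly -1151.3, a value that can still be compared across classes, whereas a product of 0.0 for every class would make the argmax meaningless.</span>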


# For practical implementations:

### <span style='color:yellow'>The occurrence or frequency of each class label is used as a prior for prediction.</span>
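
### <span style='color:yellow'>As a sketch of how class frequencies become priors, the example below counts hypothetical labels with `np.unique` and divides by the number of samples. The label array is made up for illustration.</span>

```python
import numpy as np

y = np.array([0, 1, 1, 0, 1, 1, 1, 0])   # hypothetical labels

# Unique class labels and how often each one occurs
classes, counts = np.unique(y, return_counts=True)

# Prior of each class: its relative frequency in the data
priors = counts / y.size

print(priors)  # [0.375 0.625]
```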


### <span style='color:yellow'>$P(x_i|y)$ is the class conditional probability and is estimated from the Gaussian distribution:</span>

$$
     P(x_i|y)=\frac{1}{\sqrt{2 \pi \sigma^2_y}} \cdot \mathrm{exp}\left(-\frac{(x_i-\mu_y)^2}{2 \sigma^2_y}\right)
$$

### <span style='color:yellow'> The Gaussian distribution is illustrated in the following figure:</span>

<img src='gdis.png' width=350>
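
### <span style='color:yellow'>The Gaussian formula above can be written as a small standalone function. This is a sketch for a single scalar feature; the name `gaussian_pdf` is chosen for this example and is not part of the class built in this notebook.</span>

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density of x for a given mean and variance."""
    numerator = math.exp(-((x - mean) ** 2) / (2.0 * var))
    denominator = math.sqrt(2.0 * math.pi * var)
    return numerator / denominator


# Density of the standard Gaussian (mean 0, variance 1) at its peak
print(round(gaussian_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
```

### <span style='color:yellow'>The density is highest at the mean and symmetric around it, so `gaussian_pdf(1.0, 0.0, 1.0)` and `gaussian_pdf(-1.0, 0.0, 1.0)` return the same value.</span>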



In [3]:
import numpy as np
class NaiveBayes:
    # No __init__ constructor is needed
    def fit(self, X, y):
        # We need the prior P(y), plus the mean and variance for the class conditional
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        # The number of classes is needed to obtain P(y)
        n_classes = len(self._classes)
        self.prior_y = np.zeros(n_classes)

    
    


In [4]:
'Initialize the mean and variance for each class'
class NaiveBayes:
    # No __init__ constructor is needed
    def fit(self, X, y):
        # We need the prior P(y), plus the mean and variance for the class conditional
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        # The number of classes is needed to obtain P(y)
        n_classes = len(self._classes)

        self.prior_y = np.zeros(n_classes, dtype=np.float64)
        # np.zeros expects the shape as a single tuple
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)


In [7]:
'Compute the mean and variance for each class'
class NaiveBayes:
    # No __init__ constructor is needed
    def fit(self, X, y):
        # We need the prior P(y), plus the mean and variance for the class conditional
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        # The number of classes is needed to obtain P(y)
        n_classes = len(self._classes)

        self.prior_y = np.zeros(n_classes, dtype=np.float64)
        # np.zeros expects the shape as a single tuple
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)

        # idx is the row for this class; class_ is the label itself
        for idx, class_ in enumerate(self._classes):
            # Select the samples of this class with a boolean mask on the labels y
            X_class = X[y == class_]
            # Separate mean for each class
            self._mean[idx, :] = X_class.mean(axis=0)
            # Separate variance for each class
            self._var[idx, :] = X_class.var(axis=0)
            # Separate prior for each class
            self.prior_y[idx] = X_class.shape[0] / float(n_samples)
        # Next: the predict method
            
    

In [1]:
'Building the predict method'
class NaiveBayes:
    # No __init__ constructor is needed
    def fit(self, X, y):
        # We need the prior P(y), plus the mean and variance for the class conditional
        n_samples, n_features = X.shape
        self._classes = np.unique(y)
        # The number of classes is needed to obtain P(y)
        n_classes = len(self._classes)

        self.prior_y = np.zeros(n_classes, dtype=np.float64)
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)

        # Use the position idx (not the label) as the row index, so the labels
        # do not have to be the integers 0..n_classes-1
        for idx, class_ in enumerate(self._classes):
            # Select the samples of this class with a boolean mask on the labels y
            X_class = X[y == class_]
            # Separate mean for each class
            self._mean[idx, :] = X_class.mean(axis=0)
            # Separate variance for each class
            self._var[idx, :] = X_class.var(axis=0)
            # Separate prior for each class
            self.prior_y[idx] = X_class.shape[0] / float(n_samples)

    def predict(self, X):
        # Predict one label per sample (row) of X
        y_predicted = [self._predict(x) for x in X]
        return np.array(y_predicted)

    def _predict(self, x):
        posteriors_prob = []  # Here we store one log-posterior score per class
        # Iterate over the classes to compute the log prior and the log class conditional probability
        for idx, class_ in enumerate(self._classes):
            # Computing the log of the class prior
            prior = np.log(self.prior_y[idx])
            # Summing the log of the Gaussian density over all features
            class_conditional_prob = np.sum(np.log(self._Gaussian_dist(idx, x)))
            posterior = prior + class_conditional_prob
            posteriors_prob.append(posterior)
        # Return the label of the class with the highest score
        return self._classes[np.argmax(posteriors_prob)]

    def _Gaussian_dist(self, idx, x):
        # idx is the class index, passed in when _predict calls this private method
        # Retrieving the mean for the class based on its index
        mean = self._mean[idx]
        # Retrieving the variance for the class based on its index
        var = self._var[idx]
        # The numerator of the Gaussian distribution
        numerator = np.exp(-(x - mean) ** 2 / (2 * var))
        # The denominator of the Gaussian distribution
        denominator = np.sqrt(2 * np.pi * var)
        pdf = numerator / denominator
        return pdf
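
### <span style='color:yellow'>One possible refinement, not part of the class above: taking `np.log` of the Gaussian density can return `-inf` when a feature value lies so far from the class mean that the density underflows to zero. The sketch below evaluates the logarithm of the density directly instead; the function name `gaussian_log_pdf` and the input values are hypothetical.</span>

```python
import numpy as np


def gaussian_log_pdf(x, mean, var):
    """Logarithm of the Gaussian density, computed without calling exp."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)


x = np.array([0.0, 50.0])      # the second value is 50 standard deviations away
mean = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

# Log of the density: exp(-1250) underflows to 0.0, and log(0.0) is -inf
with np.errstate(divide="ignore"):
    naive = np.log(np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

# Direct log-density: stays finite for the same inputs
stable = gaussian_log_pdf(x, mean, var)

print(np.isinf(naive[1]), np.isfinite(stable[1]))  # True True
```

### <span style='color:yellow'>Both approaches agree wherever the density is representable, so this only matters for extreme feature values.</span>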
            
    

In [2]:
'Let us test the NaiveBayes classifier on a classification dataset'
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split


In [3]:
X,y=datasets.make_classification(n_samples=1000, n_features=10,n_classes=2,random_state=1234)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,shuffle=True,random_state=1234)

In [4]:
%matplotlib qt5
fig,ax=plt.subplots(1,1)
ax.scatter(X[:,5],X[:,6],c=y,cmap='viridis',marker='o',s=20)
ax.axis('square')
ax.axes.get_xaxis().set_ticks([])
ax.axes.get_yaxis().set_ticks([])
# The axis labels match the plotted columns X[:,5] and X[:,6]
ax.set_xlabel('Feature 5',size=14,weight='bold')
ax.set_ylabel('Feature 6',size=14,weight='bold')
plt.show(block=False)
plt.pause(5)
plt.close()

In [5]:
clf=NaiveBayes()

In [6]:
clf.fit(X_train,y_train)
prediction=clf.predict(X_test)

In [7]:
print(list(prediction))

[1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1]


In [8]:
'Let us define the accuracy metric'
def prediction_accuracy(y_true,y_predicted):
    return (np.sum(y_true==y_predicted))/len(y_true)

In [9]:
accuracy= prediction_accuracy(y_test,prediction)
accuracy

0.93