# Bayes Classifier on Iris Dataset
<br>张栋玮 19373703 <a href="http://github.com/ZhangDw529/PythonProjects">Open in GitHub</a></br>

# Abstract
<br>Naive Bayes is a generative classification algorithm based on Bayes' theorem and the conditional independence assumption. Using Bayes' theorem, we can compute the posterior probability of each class from the prior probabilities and the observations. Bayes' theorem is shown below.</br>
<img src='./pic/bayes.jpg' width=400 height=100 >
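
For readers who cannot view the image, a text rendition may help. It assumes the image shows the standard form of Bayes' theorem. The symbols $\omega_i$ (class), $x$ (observation with attributes $x_1,\dots,x_d$), and $c$ (number of classes) are notation chosen here, not taken from the image. The second line states the conditional independence assumption that makes the classifier "naive".

$$
P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j)\,P(\omega_j)}
$$

$$
p(x \mid \omega_i) = \prod_{k=1}^{d} p(x_k \mid \omega_i)
$$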
<br>Let the set of decision actions be $\{a_1, a_2, \dots, a_c\}$. The conditional risk of decision action $a_i$ can be computed by</br>
<img src='./pic/risk.jpg' width=400 height=100>
<br>Thus the minimum-risk Bayesian decision is given by</br>
<img src='./pic/arg.jpg' width=400 height=100>
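
The minimum-risk rule can be illustrated with a small sketch. It assumes the conditional risk has the usual form $R(a_i \mid x) = \sum_j \lambda(a_i \mid \omega_j) P(\omega_j \mid x)$, where $\lambda(a_i \mid \omega_j)$ is the loss of taking action $a_i$ when the true class is $\omega_j$. The function names, the posterior values, and both loss matrices below are hypothetical and chosen only for illustration.

```python
def conditional_risks(loss, posteriors):
    """Return R(a_i | x) = sum_j loss[i][j] * P(class j | x) for every action i."""
    return [sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors)) for row in loss]


def min_risk_action(loss, posteriors):
    """Return the index of the action with the smallest conditional risk."""
    risks = conditional_risks(loss, posteriors)
    return min(range(len(risks)), key=risks.__getitem__)


# Hypothetical posteriors P(class j | x) for three classes
posteriors = [0.5, 0.3, 0.2]

# 0-1 loss: every mistake costs 1, a correct decision costs 0
zero_one = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]

# Asymmetric loss: choosing action 0 when the true class is 2 costs 5
asymmetric = [[0, 1, 5],
              [1, 0, 1],
              [1, 1, 0]]

print(min_risk_action(zero_one, posteriors))    # picks the most probable class
print(min_risk_action(asymmetric, posteriors))  # the costly mistake shifts the decision
```

Under 0-1 loss the minimum-risk action is the class with the largest posterior, which is why a classifier that takes the `argmax` of the posterior is a special case of this rule.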

# 1. Algorithm
## 1.1 Data Preprocessing
Here I load the Iris dataset from sklearn.datasets, then split it into a training set and a test set with $test\_size=0.2$ and $random\_state=3$.

In [3]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load data
def load_data():
    iris = datasets.load_iris()
    x = iris.data
    y = iris.target
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2, random_state=3)
    return x_train,x_test,y_train,y_test

## 1.2 Bayes Class

<br>This is the Python class named Bayes. It has four parts, which are described in the code comments.</br>
- Initialization and training
- Predictions
- Computation of accuracy
- Data print
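
The training step in the first bullet amounts to estimating, for each class, a prior plus a per-attribute mean and variance. The sketch below shows that idea with the standard library only. The function name `fit_gaussian_nb` and the toy data are hypothetical, and `pvariance` is used because it computes the population variance, which matches the default behaviour of `np.var`.

```python
from statistics import fmean, pvariance


def fit_gaussian_nb(samples, labels):
    """Estimate the prior and the per-attribute mean and variance of each class."""
    params = {}
    for c in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == c]
        columns = list(zip(*rows))  # one tuple of values per attribute
        params[c] = {
            "prior": len(rows) / len(samples),
            "mean": [fmean(col) for col in columns],
            "var": [pvariance(col) for col in columns],  # population variance
        }
    return params


# Hypothetical toy data: two attributes, two classes
samples = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0], [14.0, 2.0]]
labels = [0, 0, 1, 1]

params = fit_gaussian_nb(samples, labels)
for c, p in params.items():
    print(c, p)
```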

In [4]:
class Bayes():

    # Initialize and train
    def __init__(self, x_train, y_train):
        self.categories = len(np.unique(y_train))  # 3 classes in the dataset
        self.total_col = x_train.shape[1]   # Number of Attributes used
        self.partial = []   # Prior of each class (its share of the training samples)
        self.mean = np.zeros([self.categories,self.total_col]) # Initialization
        self.var = np.zeros([self.categories,self.total_col]) 

        # Compute the proportion, variance and mean of each class       
        for i in range(self.categories):
            temp = x_train[np.nonzero(i==y_train)] # Select i_th class
            self.partial.append(len(temp)/len(x_train))
            self.mean[i,:] = np.mean(temp,axis=0,keepdims=True)
            self.var[i,:] = np.var(temp,axis=0,keepdims=True)

    
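    # Make predictions (y_test is unused; it is kept so existing calls still work)
    def predict(self, x_test, y_test=None):
        result = []
        eps = 1e-10  # Guards against division by zero
        for i in x_test:
            # Repeat the sample once per class so it lines up with self.mean
            x = np.tile(i, (self.categories, 1))

            # Exponential term of the Gaussian pdf
            num = -(x - self.mean)**2
            den = 2*self.var + eps
            _exp = np.exp(num/den)

            # Class-conditional likelihood of each attribute.
            # Note: the textbook Gaussian normaliser is sqrt(2*pi*var); this line
            # divides by sqrt(2*pi)*var, which is what produced the results in section 2.
            p = _exp/(np.sqrt(2*np.pi)*self.var + eps)
            # Sum the log-likelihoods over attributes (conditional independence)
            log_p = np.sum(np.log(p), axis=1)
            # Unnormalised log-posterior: log prior + log-likelihood
            prob = np.log(self.partial) + log_p
            result.append(np.argmax(prob))
        return result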
    
    # Compute the accuracy
    def acc(self, y_test, y_pred):
        acc = np.count_nonzero(y_test==y_pred)
        return acc/len(y_pred)
    
    # Print parameters
    def printPara(self):
        print(f"The dataset has {self.categories} categories and {self.total_col} attributes.")
        print("Proportion of each class:")
        print(self.partial)
        print("Mean:")
        print(self.mean)
        print("Var:")
        print(self.var)
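
For comparison, the scoring rule can be written directly in the log domain with the textbook Gaussian normaliser $\sqrt{2\pi\sigma^2}$. This is a minimal standard-library sketch, not the class above: the function names are hypothetical, the parameters are copied from the two-attribute run recorded in section 2, and the three query samples are made up for illustration.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log of the univariate Gaussian density N(x | mean, var)."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def predict_one(sample, priors, means, variances):
    """Return the class index with the largest log prior + summed log-likelihood."""
    scores = []
    for prior, mu, var in zip(priors, means, variances):
        log_likelihood = sum(
            gaussian_log_pdf(x, m, v) for x, m, v in zip(sample, mu, var)
        )
        scores.append(math.log(prior) + log_likelihood)
    return max(range(len(scores)), key=scores.__getitem__)


# Parameters copied from the two-attribute run in section 2
priors = [1 / 3, 1 / 3, 1 / 3]
means = [[1.465, 0.2375], [4.2675, 1.335], [5.5325, 2.01]]
variances = [[0.032775, 0.01134375], [0.21819375, 0.040275], [0.28519375, 0.0649]]

# Hypothetical query samples (attributes 3 and 4)
for sample in ([1.4, 0.2], [4.5, 1.4], [6.0, 2.2]):
    print(sample, predict_one(sample, priors, means, variances))
```

Working with log-densities avoids taking the exponential and then the logarithm of very small numbers, so no `eps` term is needed as long as every variance is positive.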


# 2. Test
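<br>In this part, I first train and evaluate the classifier on all four attributes, then repeat the experiment using only attributes 3 and 4 (petal length and petal width).</br>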

In [5]:
# Use all four attributes
x_train,x_test,y_train,y_test=load_data()
TotalAttr = Bayes(x_train,y_train)
TotalAttr.printPara()
y_pred = TotalAttr.predict(x_test,y_test)
acc = TotalAttr.acc(y_test,y_pred)
print(f'Accuracy: {acc:.2f}')


The dataset has 3 categories and 4 attributes.
Proportion of each class:
[0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
Mean:
[[5.03   3.4325 1.465  0.2375]
 [5.93   2.7875 4.2675 1.335 ]
 [6.5525 2.9875 5.5325 2.01  ]]
Var:
[[0.1011     0.13219375 0.032775   0.01134375]
 [0.2351     0.10009375 0.21819375 0.040275  ]
 [0.38749375 0.10309375 0.28519375 0.0649    ]]
Accuracy: 0.97


In [2]:
x_train,x_test,y_train,y_test=load_data()
# Keep only attributes 3 and 4 (petal length and petal width)
x_train = x_train[:,2:]
x_test = x_test[:,2:]
Attr34 = Bayes(x_train,y_train)
Attr34.printPara()
y_pred = Attr34.predict(x_test,y_test)
acc = Attr34.acc(y_test,y_pred)
print(f'Accuracy: {acc:.2f}')

The dataset has 3 categories and 2 attributes.
Proportion of each class:
[0.3333333333333333, 0.3333333333333333, 0.3333333333333333]
Mean:
[[1.465  0.2375]
 [4.2675 1.335 ]
 [5.5325 2.01  ]]
Var:
[[0.032775   0.01134375]
 [0.21819375 0.040275  ]
 [0.28519375 0.0649    ]]
Accuracy: 1.00
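
To put the two accuracies in perspective, it helps to translate them back into counts. The sketch below assumes the test set holds 30 samples (20% of the 150 samples in Iris) and checks which numbers of correct predictions would be printed as `0.97` and `1.00` with the `:.2f` format used above.

```python
n_test = 30  # assumed: 20% of the 150 Iris samples

# Which counts of correct predictions print as "0.97" and "1.00"?
matches = {
    shown: [k for k in range(n_test + 1) if f"{k / n_test:.2f}" == shown]
    for shown in ("0.97", "1.00")
}
print(matches)
```

Under that assumption the two runs differ by a single test sample, so the gap between them should be read with caution.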


# Reference
<br>[1] Statistical Pattern Recognition Lab, SASEE</br>
<br>[2] <a href="https://blog.csdn.net/Happy_change/article/details/117226036">Naive Bayes Algorithm (朴素贝叶斯算法)</a></br>
<br>[3] <a href="https://blog.csdn.net/weixin_46302044/article/details/117399359">Classifying the Iris Dataset with Naive Bayes (朴素贝叶斯处理鸢尾花数据集分类)</a></br>