# Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes' theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}
                                 {P(x_1, \dots, x_n)}$$

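Under the naive conditional independence assumption, the likelihood factorizes as $P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$. Since $P(x_1, \dots, x_n)$ is constant given the input, we have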

$$\begin{gathered}P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)\\ \Downarrow\\ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y).\end{gathered}$$

Then, we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
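
A product of many small probabilities can underflow in floating point, so the same decision rule is usually evaluated in log space. Because the logarithm is monotonic, the maximizing class is unchanged:

$$\hat{y} = \arg\max_y \left[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \right]$$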

*References*:
H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS.

# 1 Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification.   
The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$

The parameters $\sigma_y$ and $\mu_y$  are estimated using maximum likelihood.
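
To make the estimation step concrete, here is a minimal from-scratch sketch, not the `GaussianNB` implementation itself. It assumes one mean and one variance per class and per feature, uses the six-point toy data from the example that follows, and omits the small variance-smoothing term that scikit-learn adds for numerical stability. The names `priors`, `means`, `variances` and `log_posterior` are illustrative.

```python
import numpy as np

# Toy data: two classes, two features.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])

classes = np.unique(y)
# Maximum-likelihood estimates per class: prior, feature means, feature variances.
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])  # ddof=0 is the ML estimate


def log_posterior(x):
    """Unnormalized log posterior of each class: log P(y) + sum_i log P(x_i | y)."""
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=1
    )
    return np.log(priors) + log_lik


x_new = np.array([-0.8, -1.0])
pred = classes[np.argmax(log_posterior(x_new))]
print(pred)
```

Working in log space avoids multiplying many small densities together, and the `argmax` over classes is the decision rule from the introduction.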

**Example** - The training data is generated as follows:

In [1]:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

**Q1**: Training a GaussianNB model:

In [3]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X,Y)

GaussianNB(priors=None)

**Q2**: Predict the label of the data point [-0.8, -1]:

In [5]:
print(model.predict([[-0.8,-1]]))

[1]


In [6]:
print(model.predict([[5,5]]))

[2]


In [7]:
print(model.predict([[-1,5]]))

[2]


In [8]:
print(model.predict([[-5,1]]))

[1]


In [9]:
print(model.predict([[0,0]]))

[1]
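
The training data is symmetric about the origin, so the point [0, 0] lies midway between the two classes and the label returned for it looks more like a tie-break than a confident prediction. A minimal sketch, assuming the same toy data, that inspects the posterior probabilities with `predict_proba` instead of only the hard labels:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
model = GaussianNB().fit(X, Y)

points = np.array([[-0.8, -1], [0, 0], [5, 5]])
proba = model.predict_proba(points)  # one column per class, ordered as model.classes_
for point, p in zip(points, proba):
    print(point, dict(zip(model.classes_, np.round(p, 3))))
```

Each row of `proba` sums to 1; a row close to an even split signals that the predicted label should be treated with caution.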


# 2 MultinomialNB

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). 

*References*   
C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html

**Example** - The training data is generated as follows:

In [17]:
import numpy as np
x = np.random.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])

**Q3**: Training a MultinomialNB model:

In [18]:
from sklearn.naive_bayes import MultinomialNB
m = MultinomialNB()
m.fit(x,y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
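
The `alpha=1.0` shown in the output above is the additive (Laplace) smoothing parameter. In the formulation used by the scikit-learn documentation, the probability of feature $i$ in class $y$ is estimated as follows, where the symbols $N_{yi}$, $N_y$ and $\hat{\theta}_{yi}$ are notation introduced here rather than taken from this notebook:

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

Here $N_{yi}$ is the total count of feature $i$ over the training samples of class $y$, $N_y = \sum_{i=1}^{n} N_{yi}$, and $n$ is the number of features. With $\alpha > 0$, a feature that never occurs in a class still receives a non-zero probability.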

**Q4**: Predict the label of the sample `x[2:3]`:

In [19]:
m.predict(x[2:3])

array([3])
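
The random counts above exercise the API but carry no signal. As a sketch of the word-count use case mentioned at the start of this section, the example below builds count features with `CountVectorizer` and fits `MultinomialNB` on them. The corpus and the `spam`/`ham` labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; documents and labels are invented for this sketch.
docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting schedule tomorrow",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn each document into a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

clf = MultinomialNB(alpha=1.0).fit(counts, labels)

# New text must be transformed with the same fitted vocabulary.
pred = clf.predict(vectorizer.transform(["cheap pills"]))
print(pred[0])
```

Both query words occur only in the `spam` documents, so the smoothed likelihoods favor that class.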

# 3 Process on 'Iris' Data

In Week 9, we studied how to use the KNN algorithm for a classification task on the 'iris' data. Here, we employ GaussianNB for the same task.

In [21]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'], random_state=142)

**Q5**: Report the accuracy on the test data:

In [23]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

GaussianNB(priors=None)

In [36]:
y_train_pred = gnb.predict(X_train)
print('Accuracy score on training set: ', accuracy_score(y_train, y_train_pred))

Accuracy score on training set:  0.9821428571428571


In [37]:
y_test_pred = gnb.predict(X_test)
print('Accuracy score on testing set: ', accuracy_score(y_test, y_test_pred))

Accuracy score on testing set:  0.8947368421052632
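
With 150 samples and the default 25% test split, the test set holds only 38 flowers, so each misclassification moves the test accuracy by about 2.6 percentage points. A single split is therefore a noisy estimate. One common alternative, sketched here as a suggestion rather than as part of the original exercise, is k-fold cross-validation with `cross_val_score`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

iris = load_iris()

# 5-fold cross-validation: every sample is used for testing exactly once.
scores = cross_val_score(GaussianNB(), iris.data, iris.target, cv=5)
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

The spread of the fold accuracies gives a rough sense of how much the score depends on which samples end up in the test set.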


# 4 Predict Human Activity Recognition (HAR)

The objective of this practice exercise is to predict the current human activity from physiological activity measurements in the [HAR dataset](http://groupware.les.inf.puc-rio.br/har#sbia_paper_section). Each file has 53 columns, including the `classe` label. The training (`har_train.csv`) and test (`har_validate.csv`) datasets are provided.

**Q6**: Build a Naive Bayes model, predict on the test dataset, and compute the [confusion matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62). Note: Please refer to the [`sklearn.metrics.confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) documentation.
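
As a quick orientation before the real data, here is a minimal sketch of `confusion_matrix` on made-up labels. Rows correspond to the true class and columns to the predicted class, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]

# Entry [i, j] counts samples whose true class is labels[i] and predicted class is labels[j].
cm = confusion_matrix(y_true, y_pred, labels=["A", "B", "C"])
print(cm)
```

The diagonal holds the correctly classified samples; the off-diagonal 1 in the first row is the one true `A` that was predicted as `B`.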

In [50]:
import pandas as pd
from sklearn.metrics import confusion_matrix
train = pd.read_csv("data/har_train.csv")
test = pd.read_csv("data/har_validate.csv")

In [51]:
train.shape

(13737, 53)

In [52]:
test.shape

(5885, 53)

In [48]:
train.head()

Unnamed: 0,classe,roll_belt,pitch_belt,yaw_belt,total_accel_belt,gyros_belt_x,gyros_belt_y,gyros_belt_z,accel_belt_x,accel_belt_y,...,total_accel_forearm,gyros_forearm_x,gyros_forearm_y,gyros_forearm_z,accel_forearm_x,accel_forearm_y,accel_forearm_z,magnet_forearm_x,magnet_forearm_y,magnet_forearm_z
0,A,1.41,8.07,-94.4,3,0.0,0.0,-0.02,-21,4,...,36,0.03,0.0,-0.02,192,203,-215,-17,654,476
1,A,1.41,8.07,-94.4,3,0.02,0.0,-0.02,-22,4,...,36,0.02,0.0,-0.02,192,203,-216,-18,661,473
2,A,1.42,8.07,-94.4,3,0.0,0.0,-0.02,-20,5,...,36,0.03,-0.02,0.0,196,204,-213,-18,658,469
3,A,1.48,8.05,-94.4,3,0.02,0.0,-0.03,-22,3,...,36,0.02,-0.02,0.0,189,206,-214,-16,658,469
4,A,1.45,8.06,-94.4,3,0.02,0.0,-0.02,-21,4,...,36,0.02,-0.02,-0.03,193,203,-215,-9,660,478


In [53]:
test.head()

Unnamed: 0,classe,roll_belt,pitch_belt,yaw_belt,total_accel_belt,gyros_belt_x,gyros_belt_y,gyros_belt_z,accel_belt_x,accel_belt_y,...,total_accel_forearm,gyros_forearm_x,gyros_forearm_y,gyros_forearm_z,accel_forearm_x,accel_forearm_y,accel_forearm_z,magnet_forearm_x,magnet_forearm_y,magnet_forearm_z
0,A,1.48,8.07,-94.4,3,0.02,0.02,-0.02,-21,2,...,36,0.02,0.0,-0.02,189,206,-214,-17,655.0,473.0
1,A,1.45,8.17,-94.4,3,0.03,0.0,0.0,-21,4,...,36,0.02,0.0,-0.02,190,205,-215,-22,656.0,473.0
2,A,1.42,8.21,-94.4,3,0.02,0.0,-0.02,-22,4,...,36,0.0,-0.02,-0.03,193,202,-214,-14,659.0,478.0
3,A,1.48,8.15,-94.4,3,0.0,0.0,0.0,-21,4,...,36,0.02,0.0,0.0,194,204,-215,-13,656.0,471.0
4,A,1.51,8.12,-94.4,3,0.0,0.0,-0.02,-21,4,...,36,0.02,-0.02,0.0,192,204,-213,-13,653.0,481.0


In [67]:
# Separate the feature columns from the `classe` label column.
x_train = train.drop(['classe'], axis=1)
y_train = train['classe']

x_test = test.drop(['classe'], axis=1)
y_test = test['classe']

print('x train shape:', x_train.shape)
print('y train shape:', y_train.shape)
print('x test shape:', x_test.shape)
print('y test shape:', y_test.shape)

x train shape: (13737, 52)
y train shape: (13737,)
x test shape: (5885, 52)
y test shape: (5885,)
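
Note that the 52 remaining columns still include `Unnamed: 0`, which in the `head()` output looks like a saved row index rather than a sensor measurement. All results reported in this section were produced with that column included. If it is indeed an index (an assumption, not verified against the files), a variant that excludes it could look like the sketch below, shown on a hypothetical miniature frame with the same column layout:

```python
import pandas as pd

# Hypothetical miniature frame mimicking the columns shown by train.head().
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "classe": ["A", "A", "B"],
    "roll_belt": [1.41, 1.41, 1.42],
    "pitch_belt": [8.07, 8.07, 8.07],
})

# Drop both the label and the suspected row-index column from the features.
features = df.drop(columns=["Unnamed: 0", "classe"])
target = df["classe"]
print(features.columns.tolist())
```

Keeping a row index as a feature can leak information when the rows are ordered by class, so it is worth checking before comparing models.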


In [57]:
model = GaussianNB()
model.fit(x_train, y_train)

GaussianNB(priors=None)

In [74]:
y_train_p = model.predict(x_train)
print('Accuracy score on train set: ', accuracy_score(y_train_p, y_train)) 
print('Confusion Matrix on train set:')
print(confusion_matrix(y_train, y_train_p))

Accuracy score on train set:  0.5580548882579893
Confusion Matrix on train set:
[[2491  164  667  502   82]
 [ 306 1608  358  191  195]
 [ 519  252 1197  309  119]
 [ 240   70  603 1075  264]
 [ 143  518  252  317 1295]]


In [73]:
y_test_p = model.predict(x_test)
print('Accuracy score on test set: ', accuracy_score(y_test_p, y_test)) 
print('Confusion Matrix on test set:')
print(confusion_matrix(y_test, y_test_p))

Accuracy score on test set:  0.5542905692438402
Confusion Matrix on test set:
[[1070   95  262  212   35]
 [ 127  685  145   76  106]
 [ 223  106  512  136   49]
 [ 102   35  271  441  115]
 [  51  239   95  143  554]]


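Gaussian NB assumes that all features are conditionally independent of each other given the class, and that each feature is normally distributed within a class. Sensor readings such as the x, y and z axes of one accelerometer are probably correlated, which may explain why the accuracy stays near 0.55 on both the training and the test set.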
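
Overall accuracy hides which classes are confused with each other. The sketch below recomputes the accuracy from the test confusion matrix printed above and derives per-class recall and precision from it. It uses only the numbers already shown; the variable names are illustrative.

```python
import numpy as np

# Test-set confusion matrix for GaussianNB, copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [1070,  95, 262, 212,  35],
    [ 127, 685, 145,  76, 106],
    [ 223, 106, 512, 136,  49],
    [ 102,  35, 271, 441, 115],
    [  51, 239,  95, 143, 554],
])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class (row-wise)
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class (column-wise)

print('Accuracy:', accuracy)
for k, (r, p) in enumerate(zip(recall, precision)):
    print(f'row {k}: recall={r:.3f}, precision={p:.3f}')
```

The recomputed accuracy matches the value reported by `accuracy_score`, which confirms that the diagonal of the matrix holds the correct predictions.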

### KNeighborsClassifier

In [76]:
from sklearn.neighbors import KNeighborsClassifier

In [77]:
knn = KNeighborsClassifier()

knn.fit(x_train, y_train)
y_train_p = knn.predict(x_train)

print('Accuracy score on train set: ', accuracy_score(y_train_p, y_train)) 
print('Confusion Matrix on train set:')
print(confusion_matrix(y_train, y_train_p))

Accuracy score on train set:  0.9552303996505788
Confusion Matrix on train set:
[[3850   15   11   26    4]
 [  81 2475   52   24   26]
 [  15   44 2289   28   20]
 [  18    7   86 2131   10]
 [  19   45   30   54 2377]]


In [78]:
y_test_p = knn.predict(x_test)
print('Accuracy score on test set: ', accuracy_score(y_test_p, y_test)) 
print('Confusion Matrix on test set:')
print(confusion_matrix(y_test, y_test_p))

Accuracy score on test set:  0.9097706032285472
Confusion Matrix on test set:
[[1612   12   14   27    9]
 [  77  982   39   28   13]
 [  12   34  940   23   17]
 [  18    3   61  878    4]
 [  28   34   40   38  942]]


### Decision Tree Classifier

In [79]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train, y_train)
y_train_p = dt.predict(x_train)

print('Accuracy score on train set: ', accuracy_score(y_train_p, y_train)) 
print('Confusion Matrix on train set:')
print(confusion_matrix(y_train, y_train_p))

Accuracy score on train set:  1.0
Confusion Matrix on train set:
[[3906    0    0    0    0]
 [   0 2658    0    0    0]
 [   0    0 2396    0    0]
 [   0    0    0 2252    0]
 [   0    0    0    0 2525]]


In [80]:
y_test_p = dt.predict(x_test)
print('Accuracy score on test set: ', accuracy_score(y_test_p, y_test)) 
print('Confusion Matrix on test set:')
print(confusion_matrix(y_test, y_test_p))

Accuracy score on test set:  0.956159728122345
Confusion Matrix on test set:
[[1630   21    8    9    6]
 [  21 1082   20    9    7]
 [   7   25  963   19   12]
 [   7    9   20  916   12]
 [   2   11   19   14 1036]]


### Neural Networks MLP Classifier

In [95]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(random_state=0)

mlp.fit(x_train, y_train)
y_train_p = mlp.predict(x_train)

print('Accuracy score on train set: ', accuracy_score(y_train_p, y_train)) 
print('Confusion Matrix on train set:')
print(confusion_matrix(y_train, y_train_p))

Accuracy score on train set:  0.901361287035015
Confusion Matrix on train set:
[[3783   99   12   12    0]
 [ 142 2375   66   18   57]
 [  55  143 2059   92   47]
 [  40   57  219 1847   89]
 [  17   84   54   52 2318]]


In [96]:
y_test_p = mlp.predict(x_test)
print('Accuracy score on test set: ', accuracy_score(y_test_p, y_test)) 
print('Confusion Matrix on test set:')
print(confusion_matrix(y_test, y_test_p))

Accuracy score on test set:  0.8756159728122345
Confusion Matrix on test set:
[[1617   43    4    8    2]
 [  87  977   37    4   34]
 [  21   63  848   67   27]
 [  17   33  103  762   49]
 [  15   47   34   37  949]]


### Support Vector Classifier

In [97]:
from sklearn.svm import SVC
svc = SVC(random_state=0)

svc.fit(x_train, y_train)
y_train_p = svc.predict(x_train)

print('Accuracy score on train set: ', accuracy_score(y_train_p, y_train)) 
print('Confusion Matrix on train set:')
print(confusion_matrix(y_train, y_train_p))

Accuracy score on train set:  1.0
Confusion Matrix on train set:
[[3906    0    0    0    0]
 [   0 2658    0    0    0]
 [   0    0 2396    0    0]
 [   0    0    0 2252    0]
 [   0    0    0    0 2525]]


In [98]:
y_test_p = svc.predict(x_test)
print('Accuracy score on test set: ', accuracy_score(y_test_p, y_test)) 
print('Confusion Matrix on test set:')
print(confusion_matrix(y_test, y_test_p))

Accuracy score on test set:  0.2844519966015293
Confusion Matrix on test set:
[[1674    0    0    0    0]
 [1139    0    0    0    0]
 [1026    0    0    0    0]
 [ 964    0    0    0    0]
 [1082    0    0    0    0]]


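The SVC overfits severely: it reaches an accuracy of 1.0 on the training set but only about 0.284 on the test set, where it assigns every sample to the first class.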
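
One plausible cause, stated here as an assumption rather than something verified on the HAR files, is an RBF kernel applied to unscaled features with large magnitudes. With `gamma` fixed at `1 / n_features`, the kernel value between any two distinct samples is then close to zero, so the model memorizes the training set and falls back to a constant prediction elsewhere. The sketch below reproduces that behavior on synthetic stand-in data from `make_classification` and compares it with the same model after `StandardScaler`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the HAR files), inflated to large magnitudes.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X = X * 100.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# RBF kernel with gamma = 1 / n_features on raw, large-valued features.
raw = SVC(gamma="auto", random_state=0).fit(X_tr, y_tr)

# Same model after standardizing each feature to zero mean and unit variance.
scaled = make_pipeline(StandardScaler(), SVC(gamma="auto", random_state=0))
scaled.fit(X_tr, y_tr)

raw_train, raw_test = raw.score(X_tr, y_tr), raw.score(X_te, y_te)
scaled_train, scaled_test = scaled.score(X_tr, y_tr), scaled.score(X_te, y_te)
print('raw:    train', raw_train, 'test', raw_test)
print('scaled: train', scaled_train, 'test', scaled_test)
```

Putting the scaler inside the pipeline ensures that the scaling statistics are learned from the training split only. KNN and the MLP are also sensitive to feature scale, so the same preprocessing may be worth trying for them.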