# Lecture 11a. Naive Bayes Classification

[Naive Bayes User Guide](https://scikit-learn.org/stable/modules/naive_bayes.html)

[Gaussian Naive Bayes implementation](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB)

## Dataset: iris flowers
Load it from seaborn or from scikit-learn.   

**Summary of the algorithm:**    
Naive Bayes is a "supervised learning" model:   
we know the labels, i.e. we have a target variable "y". 

Goal:  
Classify observations into known categories.

Important note:   
Naive Bayes assumes "conditional independence" between every pair of features, given the class.  
This means that, once the class of an observation is known, the value of one feature tells us nothing more about the value of any other feature.   
Can you think of examples where this applies?
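
As a sketch of what this assumption buys us (the notation below is the standard textbook form, not something defined in this notebook): for a class $y$ and features $x_1, \dots, x_n$, Bayes' theorem combined with conditional independence gives

$$
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

and the predicted class is $\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$. Gaussian Naive Bayes additionally assumes that each $P(x_i \mid y)$ is a normal distribution with a class-specific mean and variance.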

# A. Import the necessary modules.   
Imports should go in the first code cell.

In [1]:
# data management libraries
import pandas as pd
import numpy as np
np.set_printoptions(suppress=True, precision=3)

# visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# interactive visualization libraries
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# scikit-learn: machine learning library (datasets and algorithms)
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB, MultinomialNB
# data preprocessing 
from sklearn.model_selection import train_test_split
# algo metric
from sklearn.metrics import accuracy_score

# B. Load the data and understand its features.   
Check which variables are present, what their types are, and what values they take.  
At this step, think about the possible relations between variables that you should examine.   
Which ones do you think might be more important?

In [2]:
# load iris from scikit-learn
iris = datasets.load_iris(as_frame=True)
df = iris.frame  # already a pandas DataFrame: the four features plus "target"

In [3]:
df.columns  # I don't like the headers for the features, but I like the "target" name for y.

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'target'],
      dtype='object')

In [4]:
# load iris from seaborn, which loads training datasets and returns a pandas dataframe
df = sns.load_dataset('iris')

In [5]:
df.columns  # I don't like the "species" header. I prefer "target".

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [6]:
# use .loc to select every column except "species"; the "~" operator negates the boolean mask
X = df.loc[:, ~df.columns.isin(['species'])]

In [7]:
X.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width'], dtype='object')
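
To make the column selection above less opaque, here is a minimal sketch on a tiny, made-up DataFrame (`toy` and its values are hypothetical, not the real iris data). It shows the boolean mask that `columns.isin` returns, what `~` does to it, and that `DataFrame.drop` gives the same result.

```python
import pandas as pd

# Hypothetical two-row frame with the same kind of layout as the iris data.
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "petal_length": [1.4, 4.7],
    "species": ["setosa", "versicolor"],
})

# True where the column name is in the list, False elsewhere.
mask = toy.columns.isin(["species"])
print(mask)

# "~" flips every boolean, so we keep all columns that are NOT "species".
X_mask = toy.loc[:, ~mask]

# Equivalent and arguably more readable alternative.
X_drop = toy.drop(columns=["species"])

print(list(X_mask.columns))
print(X_mask.equals(X_drop))
```

Either form works; `drop` states the intent more directly, while the mask form generalizes to excluding several columns chosen by a rule.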

In [8]:
y = df["species"]
y.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: species, dtype: object

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

## Apply Naive Bayes on Iris dataset

In [10]:
clf = GaussianNB()

fit_clf = clf.fit(X_train, y_train)

y_pred = fit_clf.predict(X_test)

print(f"{(y_test != y_pred).sum()} mislabeled points out of a total {len(X_test)}  observations.")

4 mislabeled points out of a total 75  observations.
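
To see roughly what a Gaussian Naive Bayes classifier does when it fits and predicts, here is a from-scratch sketch in NumPy. It is an illustration under stated assumptions, not scikit-learn's implementation: the function names `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data, and the small constant added to the variances are all hypothetical choices made for this example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant (an assumption) keeps the variances strictly positive.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class with the largest log prior + summed log likelihoods."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]          # (samples, classes, features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_post = np.log(priors)[None, :] + log_lik      # (samples, classes)
    return classes[log_post.argmax(axis=1)]

# Hypothetical toy data: two well-separated classes, two features.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

model = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(model, np.array([[1.1, 2.1], [3.1, 0.4]]))
print(pred)
```

Summing log likelihoods over features is where the conditional independence assumption enters: each feature contributes its own one-dimensional Gaussian term, and no covariance between features is modelled.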


In [11]:
# accuracy_score??

In [12]:
fit_clf.score?

Signature: fit_clf.score(X, y, sample_weight=None)
Docstring:
Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.

Parameters
----------
X : array-like of shape (n_samples, n_features)
    Test samples.

y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for `X`.

sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.

Returns
-------
score : float
    Mean accuracy of ``self.predict(X)`` w.r.t. `y`.
File:      ~/venv_projects/uoa_py_course/course_venv/lib/python3.11/site-packages/sklearn/base.py
Type:      method

In [13]:
fit_clf.score(X_train, y_train)

0.9733333333333334

In [14]:
fit_clf.score(X_test, y_test)

0.9466666666666667

In [15]:
fit_clf.predict_proba(X_test)

array([[0.   , 0.   , 1.   ],
       [0.   , 1.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.   , 1.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.94 , 0.06 ],
       [0.   , 0.992, 0.008],
       [0.   , 1.   , 0.   ],
       [0.   , 0.956, 0.044],
       [0.   , 0.985, 0.015],
       [0.   , 1.   , 0.   ],
       [0.   , 0.994, 0.006],
       [0.   , 0.999, 0.001],
       [1.   , 0.   , 0.   ],
       [0.   , 0.999, 0.001],
       [0.   , 1.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.048, 0.952],
       [0.   , 0.999, 0.001],
       [1.   , 0.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.44 , 0.56 ],
       [1.   , 0.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 1.   , 0.   ],
       [0.   , 1.   , 0.   ],
       [1.   , 0.   , 0.   ],
       [0.   , 0.002, 0.998],
       [0.   , 0.999, 0.001],
       [1.   , 0.   , 0.   ],
       [0.
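
Each row of the `predict_proba` output holds one probability per class, in the order of the classifier's `classes_` attribute (sorted labels, so setosa, versicolor, virginica here). The sketch below copies four rows from the output above into a plain NumPy array and shows that each row sums to 1 and that the predicted label is simply the column with the highest probability. The variable names are illustrative only.

```python
import numpy as np

# Class order as scikit-learn stores it in `classes_` (sorted labels).
classes = np.array(["setosa", "versicolor", "virginica"])

# A few rows copied from the predict_proba output above.
proba = np.array([
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.44, 0.56],
])

row_sums = proba.sum(axis=1)               # every row is a probability distribution
labels = classes[proba.argmax(axis=1)]     # most probable class per row

print(row_sums)
print(labels)
```

The last row (0.44 versus 0.56) is a borderline case: the classifier still returns a single label, but the probabilities show how uncertain that decision is.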

In [16]:
accuracy_score(y_test, y_pred)

0.9466666666666667
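
Both `score` and `accuracy_score` report the same quantity: the fraction of predictions that match the true labels. As a small sanity check in plain Python, the sketch below computes that fraction for a made-up list of labels and then reproduces the notebook's value from the counts printed earlier (4 mislabeled points out of 75). The toy labels are hypothetical.

```python
# Hypothetical labels, only to illustrate the definition of accuracy.
y_true = ["setosa", "versicolor", "virginica", "virginica"]
y_hat = ["setosa", "versicolor", "versicolor", "virginica"]

matches = [t == p for t, p in zip(y_true, y_hat)]
toy_accuracy = sum(matches) / len(matches)   # 3 correct out of 4

# Counts reported by the notebook for the iris test set.
n_test = 75
n_mislabeled = 4
notebook_accuracy = (n_test - n_mislabeled) / n_test

print(toy_accuracy)
print(notebook_accuracy)
```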

In [17]:
print(fit_clf.predict([[5.1, 3.5, 1.4, 0.2]]))  # note the double brackets: predict expects a 2D input, one row per sample

['setosa']
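
The double brackets matter because scikit-learn estimators expect a 2D input of shape `(n_samples, n_features)`, even for a single observation. A minimal sketch of the difference, using NumPy only (the variable names are illustrative):

```python
import numpy as np

single = np.array([5.1, 3.5, 1.4, 0.2])   # 1D: four numbers, no notion of "rows"
as_row = single.reshape(1, -1)            # 2D: one sample with four features

print(single.shape)
print(as_row.shape)
```

If a measurement arrives as a flat list or 1D array, `reshape(1, -1)` is one way to turn it into a single-row 2D array before calling `predict`.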




In [18]:
y_pred = MultinomialNB().fit(X_train, y_train).predict(X_test)

In [19]:
accuracy_score(y_test, y_pred)

0.6

## Digits dataset

In [20]:
X, y = datasets.load_digits(as_frame=True, return_X_y=True)

In [21]:
X.head(2)

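| | pixel_0_0 | pixel_0_1 | pixel_0_2 | pixel_0_3 | pixel_0_4 | pixel_0_5 | pixel_0_6 | pixel_0_7 | pixel_1_0 | pixel_1_1 | ... | pixel_6_6 | pixel_6_7 | pixel_7_0 | pixel_7_1 | pixel_7_2 | pixel_7_3 | pixel_7_4 | pixel_7_5 | pixel_7_6 | pixel_7_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 |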


In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [23]:
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

In [24]:
accuracy_score(y_test, y_pred)

0.8333333333333334

In [25]:
y_pred = MultinomialNB().fit(X_train, y_train).predict(X_test)

In [26]:
accuracy_score(y_test, y_pred)

0.9088888888888889