# Bayesian Machine Learning: Naive Bayes Classifier


## The Iris dataset

We will work with Ronald Fisher's dataset, which dates from 1936 and contains measurements of iris flowers.

The first four columns correspond, respectively, to the sepal length, the sepal width, the petal length and the petal width (all measured in centimetres). These four variables constitute the input. The fifth column is the species of the iris (`Iris-setosa`, `Iris-versicolor` or `Iris-virginica`) and constitutes the output. This is therefore a classification problem: we want to build a predictor that predicts the species from the other characteristics.

### Question 1: Prepare the dataset

- Import the [Iris dataset](https://curiousml.github.io/teaching/epita-python/Iris.csv)
- Rename the column `Species` to `y`
- Replace the species labels:
    - `Iris-setosa` by $0$,
    - `Iris-versicolor` by $1$ and
    - `Iris-virginica` by $2$
- Shuffle the dataset
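
The steps above could be sketched as follows. This is only an illustration under stated assumptions: a tiny hand-written DataFrame stands in for the real file so that the snippet runs offline, and the feature column names (`PetalLengthCm`, `PetalWidthCm`) are assumed rather than taken from the CSV. In the notebook, one would presumably load the data with `pd.read_csv` on the URL given above.

```python
import pandas as pd

# Tiny stand-in for the real file (assumption). In the notebook, replace with:
# iris = pd.read_csv("https://curiousml.github.io/teaching/epita-python/Iris.csv")
iris = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0, 1.3, 4.5, 5.1],  # assumed column name
    "PetalWidthCm": [0.2, 1.4, 2.5, 0.2, 1.5, 1.8],   # assumed column name
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica",
                "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Rename the target column, then encode the three species as integers.
iris = iris.rename(columns={"Species": "y"})
iris["y"] = iris["y"].map(
    {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
)

# Shuffle the rows; random_state only makes the shuffle reproducible.
iris = iris.sample(frac=1, random_state=0).reset_index(drop=True)
```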

First, we want to simplify the dataset by reducing the dimension of the input space from 4 to 2. We therefore need to determine which two variables seem the most promising for predicting the species. To do this, we will use the `sns.pairplot` function from the Seaborn library.

```
import seaborn as sns
sns.pairplot(iris, hue="y")
```

### Question 2: simplify the dataset

Create an array `X` containing the columns of the two selected variables, and an array `y` containing the species column (`iris.values` returns all the data of the `iris` DataFrame as a NumPy array).
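
A minimal sketch of this step, assuming the two petal measurements were the variables retained after looking at the pairplot (your own choice may differ). The column names and the small stand-in DataFrame are assumptions made so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the prepared DataFrame (assumed column names).
iris = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0, 6.3, 4.9, 6.4, 5.8],
    "PetalLengthCm": [1.4, 4.7, 6.0, 1.3, 4.5, 5.1],
    "PetalWidthCm": [0.2, 1.4, 2.5, 0.2, 1.5, 1.8],
    "y": [0, 1, 2, 0, 1, 2],
})

# Keep only the two selected input variables, as a NumPy array.
X = iris[["PetalLengthCm", "PetalWidthCm"]].values

# The target is the encoded species column.
y = iris["y"].values
```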

### Question 3: split the dataset

Split the dataset into two samples: a training sample of size 90 (called `X_train` and `y_train`), and a test sample of size 60 (called `X_test` and `y_test`).
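
Because the dataset was shuffled in Question 1, plain slicing is one simple way to obtain the two samples. The sketch below uses random stand-in arrays of the right shape (an assumption, so that it runs without the CSV file).

```python
import numpy as np

# Stand-in data with the same shape as the simplified Iris dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = rng.integers(0, 3, size=150)

# The first 90 rows form the training sample, the remaining 60 the test sample.
n_train = 90
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]
```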

## Naive Bayes Classifier (from scratch)

Let us fit a [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier), more particularly a **Gaussian Naive Bayes Classifier**, where $P(X_i|Y)$ is a Gaussian probability distribution.

![image.png](attachment:image.png)

We are interested in the conditional probability of each input variable. This means we need:
1. one distribution for each of the input variables, and 
2. one set of distributions for each of the class labels.
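
As a reminder of why these distributions are needed, here is the factorization behind the model, written for the two input variables $X_1, X_2$ kept in Question 2. The naive assumption is that the inputs are conditionally independent given the class, so that

$$P(Y=y \mid X_1, X_2) \;\propto\; P(Y=y, X_1, X_2) = P(Y=y)\, P(X_1 \mid Y=y)\, P(X_2 \mid Y=y).$$

In the Gaussian case, each factor is a normal density. The symbols $\mu_{i,y}$ and $\sigma_{i,y}$ are notation assumed here for the mean and standard deviation of feature $i$ within class $y$:

$$P(X_i = x \mid Y=y) = \frac{1}{\sigma_{i,y}\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_{i,y})^2}{2\sigma_{i,y}^2}\right).$$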

As a first step let us model the input variables using a Gaussian probability distribution.

### Question 4: fit a Gaussian probability distribution

Let us fit a Gaussian probability distribution to each feature. Create a function `fit_gaussian(feature)` which takes one input feature and fits a Gaussian probability distribution.

**Remark:**
- you can use `norm` from `scipy.stats` to construct the distribution;
- you can use the `mean` and `std` functions of `numpy` to estimate the parameters of the distribution.
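
One possible sketch of `fit_gaussian`, following the remark above. The sample values are made up for illustration, and `np.std` is used with its default (population) normalization, which is a choice rather than a requirement.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian(feature):
    """Return a frozen Gaussian whose parameters are estimated from `feature`."""
    mu = np.mean(feature)    # estimated mean
    sigma = np.std(feature)  # estimated standard deviation
    return norm(loc=mu, scale=sigma)

# Made-up sample of one feature.
feature = np.array([4.9, 5.1, 5.0, 5.3, 4.7])
dist = fit_gaussian(feature)

# The frozen distribution exposes the density through .pdf().
density_at_mean = dist.pdf(5.0)
```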

### Question 5: split the training set

Split the training set into one group of samples per class label $0, 1, 2$. Name them respectively `X_train0`, `X_train1` and `X_train2`. These groups will be used to compute the prior probability $P(Y=y)$ that a data sample belongs to each class.
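
Boolean masks on `y_train` are one straightforward way to form the groups. The small arrays below are stand-ins (assumed values) so that the snippet runs on its own.

```python
import numpy as np

# Stand-in training data: 6 samples, 2 features.
X_train = np.array([[1.4, 0.2], [4.7, 1.4], [6.0, 2.5],
                    [1.3, 0.2], [4.5, 1.5], [5.1, 1.8]])
y_train = np.array([0, 1, 2, 0, 1, 2])

# Keep the rows whose label matches each class.
X_train0 = X_train[y_train == 0]
X_train1 = X_train[y_train == 1]
X_train2 = X_train[y_train == 2]
```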

### Question 6: generate the priors $P(Y=y)$

Use `X_train0`, `X_train1` and `X_train2` to calculate the prior probabilities $P(Y=y)$ for $y\in\{0, 1, 2\}$. Name them respectively `prior0`, `prior1` and `prior2`.
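
Each prior can be estimated as the fraction of training samples that fall in the class. A minimal sketch, with placeholder groups of sizes 3, 5 and 2 (assumed, for illustration only):

```python
import numpy as np

# Placeholder groups; only their sizes matter for the priors.
X_train0 = np.zeros((3, 2))
X_train1 = np.zeros((5, 2))
X_train2 = np.zeros((2, 2))

n_total = len(X_train0) + len(X_train1) + len(X_train2)

# Empirical class frequencies.
prior0 = len(X_train0) / n_total
prior1 = len(X_train1) / n_total
prior2 = len(X_train2) / n_total
```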

Finally, we can call the `fit_gaussian` function that we defined to prepare a probability distribution for each variable, for each class label.

### Question 7: generate the PDFs $P(X_i|Y=y)$ 

Generate the PDFs $P(X_i|Y=y)$ for $y\in\{0, 1, 2\}$.
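
A possible sketch: fit one Gaussian per feature and per class, which gives six distributions in total. The dictionary `dists` is a hypothetical container introduced here for compactness, `fit_gaussian` is redefined so that the snippet is self-contained, and the group values are made up.

```python
import numpy as np
from scipy.stats import norm

def fit_gaussian(feature):
    # Same idea as in Question 4: estimate mean and standard deviation.
    return norm(loc=np.mean(feature), scale=np.std(feature))

# Made-up class groups (3 samples each, 2 features).
X_train0 = np.array([[1.4, 0.2], [1.3, 0.3], [1.5, 0.2]])
X_train1 = np.array([[4.7, 1.4], [4.5, 1.5], [4.0, 1.3]])
X_train2 = np.array([[6.0, 2.5], [5.1, 1.8], [5.9, 2.1]])

# dists[y] = (distribution of X_1 given Y=y, distribution of X_2 given Y=y)
groups = {0: X_train0, 1: X_train1, 2: X_train2}
dists = {
    label: (fit_gaussian(Xg[:, 0]), fit_gaussian(Xg[:, 1]))
    for label, Xg in groups.items()
}
```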

### Question 8: generate the joint probabilities $P(Y=y, X_1, X_2)$


Create a function `joint_proba(X, prior, distX1, distX2)` which computes $P(Y=y, X_1, X_2)$ for a given class $y$ (for example, $y=0$), where:
- `prior` corresponds to the prior probability $P(Y=y)$,
- and `distXi`, $i=1, 2$, corresponds to $P(X_i|Y=y)$.
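
Under the naive independence assumption, the joint probability is the prior multiplied by the two conditional densities. The sketch below assumes `X` is a 2D array with one row per sample and the two features as columns; the distributions and the prior in the usage example are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import norm

def joint_proba(X, prior, distX1, distX2):
    """P(Y=y) * P(X_1 | Y=y) * P(X_2 | Y=y), evaluated for every row of X."""
    return prior * distX1.pdf(X[:, 0]) * distX2.pdf(X[:, 1])

# Usage with arbitrary illustrative distributions.
X = np.array([[0.0, 1.0], [2.0, 3.0]])
p = joint_proba(X, 0.5, norm(loc=0.0, scale=1.0), norm(loc=1.0, scale=2.0))
```

With many features, summing log-densities instead of multiplying densities is a common way to avoid numerical underflow; with two features the direct product is fine.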

### Question 9: performance on the test set

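- Compute the joint probabilities $P(Y=y, X_1, X_2)$ for $y\in\{0, 1, 2\}$ on each sample of the test set.
- Predict, for each sample, the class with the highest joint probability.
- Evaluate the `Accuracy` of the model on the test set `X_test`.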
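
The predicted class is the one with the largest joint probability, and the accuracy is the fraction of correct predictions. In the sketch below, the arrays `p0`, `p1`, `p2` are made-up joint probabilities for four test samples (an assumption, standing in for the output of `joint_proba`).

```python
import numpy as np

# Made-up joint probabilities P(Y=y, X_1, X_2) for 4 test samples.
p0 = np.array([0.30, 0.01, 0.02, 0.10])
p1 = np.array([0.05, 0.20, 0.03, 0.12])
p2 = np.array([0.01, 0.04, 0.25, 0.02])
y_test = np.array([0, 1, 2, 0])

# Stack into shape (3, n_samples) and pick the most probable class per sample.
y_pred = np.argmax(np.vstack([p0, p1, p2]), axis=0)

# Accuracy: proportion of test samples that are correctly classified.
accuracy = np.mean(y_pred == y_test)
```

Here the last sample is misclassified (class 1 is predicted instead of 0), so the accuracy is 0.75.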

## Naive Bayes Classifier (with external package)

### Question 10: GaussianNB of sklearn

- Create and train a Gaussian Naive Bayes classifier using `GaussianNB` from the `sklearn.naive_bayes` module.
```
from sklearn.naive_bayes import GaussianNB
```


- Evaluate the `Accuracy` of the model on the test set `X_test`.
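
A minimal sketch of this question. Synthetic, well-separated Gaussian clusters stand in for the two Iris features (an assumption, so that the snippet runs without the CSV file); in the notebook, `X_train`, `y_train`, `X_test` and `y_test` from Question 3 would be used instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 3 classes, 50 samples each, 2 features.
rng = np.random.default_rng(0)
centers = np.array([[1.5, 0.2], [4.3, 1.3], [5.5, 2.0]])
y = np.repeat([0, 1, 2], 50)
X = centers[y] + rng.normal(scale=0.2, size=(150, 2))

# Shuffle, then split 90 / 60 as in Question 3.
perm = rng.permutation(150)
X, y = X[perm], y[perm]
X_train, y_train = X[:90], y[:90]
X_test, y_test = X[90:], y[90:]

# Fit the classifier and evaluate it on the test sample.
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
```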

**Remark:** You should obtain (approximately) the same accuracy as in **Question 9**.

## Some advantages

- Prediction is fast, and the accuracy is often good on simple problems.
- Naive Bayes has a very low computational cost: in the Gaussian case, training only requires estimating a mean and a standard deviation per feature and per class.
- It scales efficiently to large datasets.
- It handles multi-class tasks directly.
- When the independence assumption holds, a Naive Bayes classifier can perform better than other models such as logistic regression.

## Disadvantages

- It assumes that the features are independent given the class, which rarely holds exactly in practice.