# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It lets users access machine learning algorithms via **object-oriented programming**.

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterium is labeled with its resistance to the antibiotic azithromycin.
- Additionally, each bacterial sample is labeled with whether its genome contains certain strands of DNA.

We would like to learn antibiotic resistance from the bacterial genome.

- Our predictors are whether strands of DNA are present.
- Our response is the resistance class.

## 0.C Data Preprocessing

We did a bit of data preprocessing:

- encoded the resistance label as 0 = "resistant", 1 = "susceptible"
- encoded each DNA strand feature as 0 if the genome does not contain the strand of DNA, 1 if it does
- did a 70:30 training-test split
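
The preprocessing steps above could look roughly like the sketch below. The raw table, its column names (`resistance`, `strand_A`, `strand_B`), and its values are hypothetical stand-ins, since the original raw data and preprocessing code are not shown here; `random_state` is fixed only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw data: one row per bacterial sample (values are made up).
raw = pd.DataFrame({
    "resistance": ["resistant", "susceptible"] * 5,
    "strand_A": [True, False, True, True, False, False, True, False, True, False],
    "strand_B": [False, False, True, False, True, True, False, False, True, True],
})

# Encode the response: 0 = resistant, 1 = susceptible.
labels = raw["resistance"].map({"resistant": 0, "susceptible": 1})

# Encode the predictors: 0 = strand absent, 1 = strand present.
features = raw[["strand_A", "strand_B"]].astype(int)

# 70:30 training-test split.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)
```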

## 0.D Load Data
Now, we load our training set. Run the code below to load

- the dataset ```labels_training_set```, containing the antibiotic resistance phenotype of each bacterium in the training split
- and the dataset ```DNA_training_set```, containing the genome of each bacterium in the training split.

In [None]:
import pandas as pd

# load the training labels and genome features; the first column is used as the index
labels_training_set = pd.read_csv('datasets/labels_training_set', index_col=0)
DNA_training_set = pd.read_csv('datasets/DNA_training_set', index_col=0)

**In this section, we will learn a classifier from our training data. Using a bacterium's genome, this classifier will determine whether the bacterium is resistant or susceptible to a particular antibiotic.**

# 2. Train model: Naive Bayes

2. Then, train a **machine learning model** using **labeled data**

    - "Labeled data" has been labeled with the outcome
    - "Machine learning model" learns the relationship between the attributes of the data and its outcome

Naive Bayes (NB) computes the probability that an observation falls into each class based on the observation's features.

In our case, Naive Bayes computes the probability that a bacterium is susceptible or resistant to azithromycin given segments of the bacterium's genome.

Naive Bayes assumes that, within each class, every feature is independent of every other feature. In our case, this means we assume that segments of the genome do not interact with one another in affecting a bacterium's resistance.
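
To make the independence assumption concrete, here is a minimal sketch of how a Naive Bayes posterior can be computed by hand: the class prior is multiplied by one likelihood term per feature, and the results are normalized. All probabilities and strand names below are hypothetical, and the sketch uses simple presence probabilities rather than the Gaussian likelihood discussed in section 2.B.

```python
# Hypothetical class priors, chosen only for illustration.
prior = {"resistant": 0.4, "susceptible": 0.6}

# Hypothetical P(strand present | class) for two strands.
p_present = {
    "resistant": {"strand_A": 0.9, "strand_B": 0.2},
    "susceptible": {"strand_A": 0.3, "strand_B": 0.5},
}

# One observed bacterium: strand_A is present (1), strand_B is absent (0).
observation = {"strand_A": 1, "strand_B": 0}

# Independence assumption: multiply the per-feature likelihoods together.
unnormalized = {}
for cls in prior:
    score = prior[cls]
    for strand, value in observation.items():
        p = p_present[cls][strand]
        score *= p if value == 1 else 1 - p
    unnormalized[cls] = score

# Normalize so the class probabilities sum to 1.
total = sum(unnormalized.values())
posterior = {cls: score / total for cls, score in unnormalized.items()}
print(posterior)
```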

    
## 2.A Check Training Set

Let's check if we loaded the correct dataset.

In [None]:
#print head of labels_training_set
labels_training_set.head()

In [None]:
#print head of  DNA_training_set

DNA_training_set.head()

## 2.B Build Model: Gaussian Naive Bayes

### 2.B.1 What is Gaussian Naive Bayes?

First, we will consider a variant of Naive Bayes called Gaussian Naive Bayes. Being a variant means that Gaussian Naive Bayes still computes the probability that a bacterium is susceptible or resistant based on its genome, and still assumes that the presence of one feature does not affect the presence of another.

Gaussian Naive Bayes makes an additional assumption about how the features determine whether a bacterium is susceptible or resistant. It assumes that, within each class, the values of a feature follow a Gaussian distribution. **The probability that a feature is 1 or 0 is used to calculate the probability that the bacterium is susceptible or resistant.**

A Gaussian distribution is identified by its mean and variance. Therefore, Gaussian Naive Bayes calculates the mean and variance of each feature for each class.
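
As a rough illustration of what "mean and variance of each feature for each class" means, the sketch below computes them by hand on a tiny made-up dataset and compares the means with what `GaussianNB` stores in `theta_`. The arrays `X` and `y` are hypothetical. Note that scikit-learn adds a small smoothing term to its stored variances, so those can differ very slightly from the hand-computed values.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 6 samples, 2 binary features; labels 0 and 1.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Per-class mean and (population) variance of each feature, computed by hand.
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
variances = np.array([X[y == c].var(axis=0) for c in (0, 1)])

# The fitted classifier stores the same per-class means in theta_.
model = GaussianNB().fit(X, y)
print(means)
print(model.theta_)
print(variances)
```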

### 2.B.2 Diagram of Gaussian Naive Bayes

**First, Gaussian Naive Bayes divides the training data into its categories.**

<img src="images/02_gNB_01.png" alt="Drawing" style="width: 600px;"/>

**From each feature in the susceptible category, the classifier learns a mean and variance.**
<img src="images/02_gNB_02.png" alt="Drawing" style="width: 600px;"/>

**From each feature in the resistant category, the classifier likewise learns a mean and variance.**
<img src="images/02_gNB_05.png" alt="Drawing" style="width: 600px;"/>

**Again, the probability that a feature is 1 or 0 is used to calculate the probability that the bacterium is susceptible or resistant.**
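
The calculation summarized above can be sketched for a single feature as follows. The priors, means, and variances are hypothetical values rather than parameters learned from the dataset, and `gaussian_pdf` is a helper written for this illustration, not a scikit-learn function.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a Gaussian with the given mean and variance, evaluated at x."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coefficient * math.exp(-((x - mean) ** 2) / (2.0 * variance))

# Hypothetical learned parameters for one DNA strand (illustrative values only).
params = {
    "resistant": {"prior": 0.5, "mean": 0.9, "variance": 0.09},
    "susceptible": {"prior": 0.5, "mean": 0.2, "variance": 0.16},
}

x = 1  # the strand is present in the new sample

# Prior times Gaussian likelihood for each class, then normalize.
scores = {
    cls: p["prior"] * gaussian_pdf(x, p["mean"], p["variance"])
    for cls, p in params.items()
}
total = sum(scores.values())
posterior = {cls: score / total for cls, score in scores.items()}
print(posterior)
```

With several features, the per-feature densities would be multiplied together for each class before normalizing, following the independence assumption.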

### 2.B.3 Implementation of Gaussian Naive Bayes

In [None]:
#import naive bayes

from sklearn.naive_bayes import GaussianNB

#instantiate a Naive Bayes classifier 
gNB = GaussianNB()

#fit classifier

gNB.fit(DNA_training_set, labels_training_set.values.ravel())
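
The call to `.values.ravel()` in the cell above is there because the labels are stored as a one-column DataFrame, while `fit` expects a one-dimensional array of labels. A small sketch with a hypothetical label table shows the change in shape:

```python
import pandas as pd

# Hypothetical one-column label table, standing in for labels_training_set.
labels = pd.DataFrame({"resistance": [0, 1, 1, 0]})

# As a 2-D array: one row per sample, one column.
print(labels.values.shape)

# Flattened to 1-D: one entry per sample.
flat = labels.values.ravel()
print(flat.shape)
```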

We can grab the mean, ```theta_```, of each feature per class.

In [None]:
#mean of each feature per class
gNB.theta_

#rows follow the sorted class labels:
#first row is the mean feature value if the bacterium is resistant (class 0)
#second row is the mean feature value if the bacterium is susceptible (class 1)

Let's visualize using a Pandas DataFrame. 

In [None]:
pd.DataFrame(gNB.theta_, 
             columns=DNA_training_set.columns,
             index=['Resistant', 'Susceptible'])
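
The row labels in the cell above are typed by hand, so they are only correct if they match the order of the classes inside the classifier. One way to check, sketched here on a small hypothetical dataset, is to derive the labels from the fitted model's `classes_` attribute. The mapping `name_for_code` is an assumption based on the encoding 0 = resistant, 1 = susceptible.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data standing in for DNA_training_set and its labels.
X = pd.DataFrame({"strand_A": [1, 1, 0, 0], "strand_B": [0, 1, 0, 1]})
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)

# classes_ lists the class labels in the same order as the rows of theta_.
print(clf.classes_)

# Build the row names from classes_ instead of typing them by hand.
name_for_code = {0: "Resistant", 1: "Susceptible"}
row_names = [name_for_code[c] for c in clf.classes_]
table = pd.DataFrame(clf.theta_, columns=X.columns, index=row_names)
print(table)
```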

### Exercise 2.B.3: Retrieving the learned variances

The variance of each feature is stored in the ```sigma_``` attribute of the fitted ```GaussianNB``` object. (In scikit-learn 1.0 this attribute was renamed ```var_```, and ```sigma_``` was removed in version 1.2.) To get the variance of each feature per class, use ```gNB.sigma_```, or ```gNB.var_``` on newer versions.

Now, by editing the code ```pd.DataFrame(gNB.theta_, columns=DNA_training_set.columns, index=['Resistant', 'Susceptible'])```, print the variance of each feature per class.

In [None]:
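# enter solution here
# sigma_ was renamed var_ in scikit-learn 1.0, so use whichever attribute exists
variances = gNB.var_ if hasattr(gNB, 'var_') else gNB.sigma_
pd.DataFrame(variances, columns=DNA_training_set.columns, index=['Resistant', 'Susceptible'])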