# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It gives users access to machine learning algorithms through **object-oriented programming**.
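As a minimal sketch of what this object-oriented pattern looks like in practice: a model is an object that you instantiate, fit, and then query. The toy arrays below are made up for illustration and are not part of this tutorial's dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 4 samples, 3 binary features (not the tutorial's dataset)
X_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([1, 1, 0, 0])

model = GaussianNB()       # 1. instantiate the estimator object
model.fit(X_toy, y_toy)    # 2. learn parameters; they are stored on the object

# 3. query the fitted object with new samples
predictions = model.predict(np.array([[1, 0, 0], [0, 0, 1]]))
print(predictions)
```

Other Scikit-Learn classifiers follow the same instantiate, `fit`, `predict` sequence, so swapping in a different model usually changes only the first line.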

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterium is labelled with its resistance to the antibiotic azithromycin.

- Additionally, each bacterial sample is labelled with whether its genome contains certain strands of DNA.

We would like to learn antibiotic resistance from the bacterial genome.

## 0.C Data Processing

We did a bit of data preprocessing: 
- encoded the resistance label as 1 for "susceptible" and 0 for "resistant"

- encoded each DNA strand feature as 1 if the genome contains the strand of DNA and 0 if it does not

- did a 70:30 training-test split
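The three preprocessing steps above could be sketched roughly as follows. This is an illustration only: the raw table, the column names `phenotype` and `genome`, the two short strands, and the use of `train_test_split` are all assumptions, since the original preprocessing code is not shown.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw table: 10 bacteria with a text phenotype and a genome string
raw = pd.DataFrame(
    {
        "phenotype": ["susceptible", "resistant"] * 5,
        "genome": ["acgtacgt", "ttttacgg"] * 5,
    },
    index=[f"Bacteria {i}" for i in range(10)],
)

# Encode the label: 1 = susceptible, 0 = resistant
labels = raw["phenotype"].map({"susceptible": 1, "resistant": 0})

# Encode each DNA strand as 1 if the genome contains it, 0 otherwise
strands = ["acgt", "tttt"]  # hypothetical strands, far shorter than the real ones
features = pd.DataFrame(
    {s: raw["genome"].str.contains(s, regex=False).astype(int) for s in strands},
    index=raw.index,
)

# 70:30 training-test split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0
)
print(X_train.shape, X_test.shape)
```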

## 0.D Load Data
Now, we load our training set. Run the code below to load

- the dataset, ```labels_training_set```, containing the antibiotic resistance phenotype of each bacterium in the training split
- and the dataset, ```DNA_training_set```, containing the genome of each bacterium in the training split.

In [1]:
import pandas as pd
labels_training_set = pd.read_csv('datasets/labels_training_set',index_col=0)
DNA_training_set = pd.read_csv('datasets/DNA_training_set',index_col=0)

**In this section, we will learn a classifier from our training data. Using a bacterium's genome, this classifier will determine whether the bacterium is resistant or susceptible to a particular antibiotic.**

# 2. Train model: Naive Bayes

2. Then, train a **machine learning model** using **labeled data**

    - "Labeled data" has been labeled with the outcome
    - "Machine learning model" learns the relationship between the attributes of the data and its outcome

Naive Bayes (NB) computes the probability that an observation falls into each class based on the observation's features.

In our case, Naive Bayes computes the probability that a bacterium is susceptible or resistant to azithromycin given segments of the bacterium's genome.

Naive Bayes assumes that, within a class, the value of each feature is independent of the value of every other feature. In our case, the presence of one segment of the genome does not affect the presence of another segment.
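Written as a formula, the independence assumption lets the classifier multiply one term per feature. The symbols are introduced here for illustration: $y$ is the class (susceptible or resistant) and $x_1, \dots, x_n$ are the feature values of one bacterium.

$$
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The predicted class is the value of $y$ for which the right-hand side is largest.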

    
## 2.A Check Training Set

Let's check that we loaded the correct dataset.

In [2]:
labels_training_set.head()

Unnamed: 0,resistance class
Bacteria 0,1
Bacteria 1,0
Bacteria 2,0
Bacteria 3,0
Bacteria 4,1


In [3]:
DNA_training_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Bacteria 1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
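Beyond looking at the first rows, a few programmatic checks can confirm the two tables line up. The sketch below uses small hypothetical stand-ins (`labels_toy`, `dna_toy`) rather than the real files, so the column names other than `resistance class` are made up.

```python
import pandas as pd

# Hypothetical stand-ins for labels_training_set and DNA_training_set
labels_toy = pd.DataFrame({"resistance class": [1, 0, 0]},
                          index=["Bacteria 0", "Bacteria 1", "Bacteria 2"])
dna_toy = pd.DataFrame({"acgt": [1, 0, 0], "cgta": [1, 0, 1]},
                       index=["Bacteria 0", "Bacteria 1", "Bacteria 2"])

# Same bacteria, in the same order, in both tables
same_rows = labels_toy.index.equals(dna_toy.index)

# Every entry is 0 or 1, as described in the preprocessing step
labels_binary = bool(labels_toy.isin([0, 1]).all().all())
features_binary = bool(dna_toy.isin([0, 1]).all().all())

# How many bacteria fall into each class
class_counts = labels_toy["resistance class"].value_counts()

print(same_rows, labels_binary, features_binary)
print(class_counts)
```

The same expressions should work on `labels_training_set` and `DNA_training_set` once they are loaded.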


## 2.B Build Model: Gaussian Naive Bayes

### 2.B.1 What is Gaussian Naive Bayes?

First, we will consider a variant of Naive Bayes called Gaussian Naive Bayes. Being a variant means that Gaussian Naive Bayes still computes the probability that a bacterium is susceptible or resistant based on its genome, and still assumes that the presence of one feature does not affect the presence of another.

Gaussian Naive Bayes makes an additional assumption about how the features relate to the class. It assumes that, within each class, the values of each feature follow a Gaussian distribution. **The probability of observing each feature value (1 or 0) under that distribution is used to calculate the probability that the bacterium is susceptible or resistant.**

A Gaussian distribution is fully described by its mean and variance. Therefore, Gaussian Naive Bayes calculates the mean and variance of each feature for each class.
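To make the per-class mean and variance concrete, here is a minimal NumPy sketch that computes them by hand. The toy arrays are hypothetical, and this is not how Scikit-Learn is implemented internally, only the same idea in a few lines.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 binary features
X_toy = np.array([[1, 0],
                  [1, 1],
                  [0, 0],
                  [0, 1],
                  [0, 1],
                  [1, 1]])
y_toy = np.array([1, 1, 1, 0, 0, 0])  # 1 = susceptible, 0 = resistant

class_stats = {}
for label in np.unique(y_toy):
    rows = X_toy[y_toy == label]        # samples belonging to this class
    class_stats[label] = {
        "mean": rows.mean(axis=0),      # one mean per feature
        "variance": rows.var(axis=0),   # one (population) variance per feature
    }

for label, stats in class_stats.items():
    print(label, stats["mean"], stats["variance"])
```

For a 0/1 feature, the per-class mean `p` is the fraction of samples in that class that contain the strand, and the per-class variance equals `p * (1 - p)`.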

### 2.B.2 Diagram of Gaussian Naive Bayes

**First, Gaussian Naive Bayes divides the training data into its categories.**

<img src="images/02_gNB_01.png" alt="Drawing" style="width: 600px;"/>

**From each feature in the susceptible category, the classifier learns a mean and variance.**
<img src="images/02_gNB_02.png" alt="Drawing" style="width: 600px;"/>

**From each feature in the resistant category, the classifier also learns a mean and variance.**
<img src="images/02_gNB_05.png" alt="Drawing" style="width: 600px;"/>

**Again, the probability of observing each feature value (1 or 0) is used to calculate the probability that the bacterium is susceptible or resistant.**
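The final calculation can be sketched in plain Python. All numbers below (priors, means, variances) are hypothetical and chosen only for illustration. The sketch combines a prior with one Gaussian density per feature, then normalises the result across the two classes.

```python
import math

def gaussian_log_density(x, mean, variance):
    """Log of the Gaussian probability density at x."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)

# Hypothetical learned parameters for two features (illustrative numbers only)
params = {
    "susceptible": {"prior": 0.5, "means": [0.8, 0.1], "variances": [0.16, 0.09]},
    "resistant":   {"prior": 0.5, "means": [0.2, 0.7], "variances": [0.16, 0.21]},
}

sample = [1, 0]  # feature values of a new bacterium

# Log prior plus the sum of per-feature log densities (the independence assumption)
log_scores = {}
for label, p in params.items():
    log_scores[label] = math.log(p["prior"]) + sum(
        gaussian_log_density(x, m, v)
        for x, m, v in zip(sample, p["means"], p["variances"])
    )

# Normalise the scores so they sum to 1 across classes
max_score = max(log_scores.values())
unnormalised = {k: math.exp(v - max_score) for k, v in log_scores.items()}
total = sum(unnormalised.values())
posterior = {k: v / total for k, v in unnormalised.items()}
print(posterior)
```

Working with logarithms avoids multiplying many very small numbers together, which matters when there are thousands of features.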

### 2.B.3 Implementation of Gaussian Naive Bayes

In [4]:
#import naive bayes

from sklearn.naive_bayes import GaussianNB

#instantiate a Naive Bayes classifier 
gNB = GaussianNB()

#fit classifier
#.values.ravel() flattens the single-column label table into a 1-D array

gNB.fit(DNA_training_set,labels_training_set.values.ravel())

GaussianNB(priors=None)

We can grab the mean of each feature per class.

In [5]:
#mean of each feature per class
gNB.theta_
#rows follow gNB.classes_: the first row is the mean feature value for class 0 (resistant),
#the second row is the mean feature value for class 1 (susceptible)

array([[0.        , 0.        , 0.        , ..., 0.00666667, 0.00666667,
        0.00666667],
       [0.02419355, 0.02419355, 0.02419355, ..., 0.        , 0.        ,
        0.        ]])

Let's visualize using a Pandas DataFrame. 

In [6]:
pd.DataFrame(gNB.theta_, 
             columns=DNA_training_set.columns,
             index=['Resistant', 'Susceptible'])

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Resistant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667,0.006667
Susceptible,0.024194,0.024194,0.024194,0.024194,0.024194,0.024194,0.024194,0.024194,0.024194,0.024194,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
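The row labels above are typed by hand, so it is worth confirming which class each row belongs to. A minimal sketch on hypothetical toy data: the fitted classifier's `classes_` attribute lists the labels in the same order as the rows of `theta_`, so the row names can be derived from it. The `label_names` mapping is an assumption based on the encoding described in section 0.C.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data; labels follow the encoding 1 = susceptible, 0 = resistant
X_toy = pd.DataFrame({"strand_a": [1, 1, 0, 0], "strand_b": [0, 1, 0, 1]})
y_toy = np.array([1, 1, 0, 0])

model = GaussianNB().fit(X_toy, y_toy)

# classes_ lists the labels in the same order as the rows of theta_
label_names = {0: "Resistant", 1: "Susceptible"}
row_names = [label_names[c] for c in model.classes_]

means = pd.DataFrame(model.theta_, columns=X_toy.columns, index=row_names)
print(means)
```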


### Exercise 2.B.3: Retrieving the learned variances

The variance of each feature is stored in the ```sigma_``` attribute of the fitted ```GaussianNB``` object, so ```gNB.sigma_``` returns the variance of each feature per class. (In scikit-learn 1.0 and later the attribute is named ```var_```, and ```sigma_``` was removed in version 1.2.)

Now, by editing the code above, ```pd.DataFrame(gNB.theta_, columns=DNA_training_set.columns,index=['Resistant', 'Susceptible'])```, print the variance of each feature per class.

In [7]:
# enter solution here

#variance of each feature per class
pd.DataFrame(gNB.sigma_, columns=DNA_training_set.columns,
             index=['Resistant', 'Susceptible'])
#rows follow gNB.classes_: the first row is the variance of each feature for class 0 (resistant),
#the second row is the variance of each feature for class 1 (susceptible)

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Resistant,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,...,0.006622222,0.006622222,0.006622222,0.006622222,0.006622222,0.006622222,0.006622222,0.006622222,0.006622222,0.006622222
Susceptible,0.02360822,0.02360822,0.02360822,0.02360822,0.02360822,0.02360822,0.02360822,0.02360822,0.02360822,0.02360822,...,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10,2.5e-10
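The `2.5e-10` entries appear consistent with Scikit-Learn adding a small smoothing term to every variance, so that a feature that is constant within a class never gets a variance of exactly zero. The sketch below illustrates this on hypothetical toy data, and reads the variances in a way that should work whether the installed version names the attribute `var_` or `sigma_`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: feature 0 is constant within each class
X_toy = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_toy = np.array([1, 1, 0, 0])

model = GaussianNB().fit(X_toy, y_toy)

# Newer scikit-learn releases expose the variances as var_, older ones as sigma_
variances = model.var_ if hasattr(model, "var_") else model.sigma_

raw_variance = X_toy[y_toy == 0].var(axis=0)  # plain per-class variance of class 0
print(raw_variance)
print(variances[0])  # fitted values for class 0: slightly larger, never exactly zero
```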
