# 0. Review 

## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It gives users access to machine learning algorithms via **object-oriented programming**.
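
As a minimal sketch of what this object-oriented style looks like, the example below instantiates an estimator object, fits it, and asks it for predictions. The arrays `X` and `y` are made-up toy data for illustration only, not the dataset used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made up for illustration): 4 samples, 2 binary features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])  # the label follows the first feature

model = GaussianNB()     # 1. instantiate an estimator object
model.fit(X, y)          # 2. learn from labelled data
pred = model.predict(X)  # 3. predict labels for observations
```

Most Scikit-Learn estimators follow this same instantiate, `fit`, `predict` pattern.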

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labelled with its resistance to the antibiotic azithromycin.

- Additionally, each bacterial sample is labelled according to whether its genome contains certain strands of DNA.

We would like to learn to predict antibiotic resistance from the bacterial genome.


## 0.C Data Processing

We did a bit of data preprocessing:
- encoded the resistance label as 1 for "susceptible" and 0 for "resistant";

- encoded each DNA strand feature as 1 if the sample's genome contains the strand and 0 if it does not.
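
The encoding described above could be sketched with pandas as follows. The table `raw`, the column name `strand_A`, and the raw values are hypothetical and only illustrate the idea; the actual preprocessing code is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw table; column names and values are made up for illustration
raw = pd.DataFrame({
    "resistance phenotype": ["susceptible", "resistant", "susceptible"],
    "strand_A": [True, False, True],
})

# Encode the label: 1 = susceptible, 0 = resistant
raw["resistance phenotype"] = raw["resistance phenotype"].map(
    {"susceptible": 1, "resistant": 0}
)

# Encode strand presence: 1 = genome contains the strand, 0 = it does not
raw["strand_A"] = raw["strand_A"].astype(int)
```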

## 0.D Trained Model: Gaussian Naive Bayes

Previously, we used a Gaussian Naive Bayes algorithm to learn a classifier of antibiotic resistance in bacteria. Below, I run the code again to recreate the model.

**In this section, we will be evaluating the accuracy of the trained model on the test data.**

In [1]:
import pandas as pd

#load training data
Y_training_set = pd.read_csv('datasets/Y_training_set')
training_set = pd.read_csv('datasets/training_set')

#import naive bayes

from sklearn.naive_bayes import GaussianNB

#instantiate a Naive Bayes classifier 
gNB = GaussianNB()

#learn classifier from data 
gNB.fit(training_set,Y_training_set.values.ravel())

GaussianNB(priors=None)

# 3. Model Evaluation: Gaussian Naive Bayes

*3. Finally, make **predictions** on **new (unseen) data** for which the label is unknown*

The **unseen data** is the test data. The model was trained only on the training set, so it has learnt nothing from the test data.
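
This notebook does not show how the training and testing splits were created. As an assumption, a split like this is often made with Scikit-Learn's `train_test_split`, sketched here on made-up arrays `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 30% of the samples; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```

Every sample ends up in exactly one of the two splits, which is what makes the test split "unseen".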


## 3.A Load Testing Set
1a. We must first load the testing data. Run the code below to load

- the dataset ```Y_test_set```, containing the antibiotic resistance phenotype of each bacterial sample in the testing split,
- and the dataset ```test_set```, containing the genome features of each bacterial sample in the testing split.

In [2]:
import pandas as pd
Y_test_set = pd.read_csv('datasets/Y_test_set')
test_set = pd.read_csv('datasets/test_set')

## 3.B Evaluate the test set

The ```GaussianNB``` class has a method called ```predict```. ```predict``` predicts the antimicrobial resistance label of any observation that has the same features as the observations in the training set.

Let's consider the first observation in our test set.

In [3]:
test_set.iloc[[0]]

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, let's use our learned model to predict the label of the first observation in our test set.

In [78]:
gNB.predict(test_set.iloc[[0]])

array([0])

Our model predicted the label $0$ for the first observation in our testing set. This means that the model classifies this observation as resistant.

How does the predicted resistance compare to the actual antimicrobial resistance of the first observation in the test set?

In [79]:
Y_test_set.iloc[[0]]

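| | resistance phenotype |
|---|---|
| 0 | 0 |

The actual label is also $0$ (resistant), so the model's prediction for this observation is correct.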


Now, using ```predict``` and our learned model, let's make predictions for the entire test set.

In [48]:
gNB.predict(test_set)

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0])

How do we interpret these results?
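
One simple first step is to compare the predictions with the true labels element by element. The sketch below uses made-up arrays `y_true` and `y_pred` (hypothetical names and values, not the notebook's data) to show the idea.

```python
import numpy as np

# Made-up labels for illustration (not the notebook's data)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

matches = y_pred == y_true         # True where the prediction is correct
n_correct = int(matches.sum())     # number of correct predictions: 4
fraction_correct = matches.mean()  # share of correct predictions: 0.8
```

This tells us how often the model is right, but not which kinds of mistakes it makes.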

## 3.C Confusion Matrix

A confusion matrix tabulates, for each actual class, how many observations were predicted to be in each class, so that correct and incorrect predictions can be read off directly.
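
As a small sketch on made-up labels (not the notebook's data), Scikit-Learn's `confusion_matrix` function builds this kind of table directly. In its convention, rows correspond to the true labels and columns to the predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# cm[i, j] = number of samples with true label i and predicted label j
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
```

The diagonal entries count correct predictions; the off-diagonal entries count the two kinds of mistakes.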

In [80]:
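#true labels from the test set
actual = pd.Series(Y_test_set.values.ravel(), name='Actual')
#labels predicted by the model
predicted = pd.Series(gNB.predict(test_set), name='Predicted')
df_confusion = pd.crosstab(actual, predicted, margins=True)
df_confusion

| Actual \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 56 | 8 | 64 |
| 1 | 6 | 48 | 54 |
| All | 62 | 56 | 118 |

Rows give the actual label and columns give the predicted label. The model classified 56 resistant and 48 susceptible samples correctly, predicted 8 resistant samples to be susceptible, and predicted 6 susceptible samples to be resistant.

In [82]:
#divide every count by the total number of test observations
df_confusion/df_confusion.iloc[2,2]

| Actual \ Predicted | 0 | 1 | All |
|---|---|---|---|
| 0 | 0.474576 | 0.067797 | 0.542373 |
| 1 | 0.050847 | 0.406780 | 0.457627 |
| All | 0.525424 | 0.474576 | 1.0 |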


## 3.D Quantifying Quality of Model

In [84]:
gNB.score(test_set,Y_test_set)

0.8813559322033898
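
For a classifier, ```score``` returns the mean accuracy: the fraction of test observations whose predicted label matches the actual label. As a quick check, the same number can be recovered from the counts in the confusion matrix of section 3.C, where 56 resistant and 48 susceptible samples were classified correctly out of 118.

```python
# Counts taken from the confusion matrix in section 3.C
correct_resistant = 56    # predicted 0 and actually 0
correct_susceptible = 48  # predicted 1 and actually 1
total = 118               # number of test observations

accuracy = (correct_resistant + correct_susceptible) / total
print(accuracy)
```

So the model labels about 88% of the test observations correctly.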