# 0. Review 

## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It gives users access to machine learning algorithms through **object-oriented programming**.
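As a rough illustration of that object-oriented pattern, the sketch below creates an estimator object, fits it, and then calls its methods. The arrays `X_toy` and `y_toy` are made-up values for illustration only, not the antibiotic resistance data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 4 observations with 2 binary features each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1])

model = GaussianNB()                 # instantiate the estimator object
model.fit(X_toy, y_toy)              # learn from the data (stored on the object)
predictions = model.predict(X_toy)   # call a method of the fitted object

print(model.classes_)                # class labels seen during fitting
print(predictions.shape)             # one prediction per observation
```

Most Scikit-Learn estimators follow this same instantiate, `fit`, `predict` sequence, which is why the rest of this notebook only needs a handful of method calls.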

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labelled with its resistance to the antibiotic azithromycin.

- Additionally, each bacterial sample is labelled with whether its genome contains certain strands of DNA.

We would like to learn to predict antibiotic resistance from the bacterial genome.


## 0.C Data Processing

We did a bit of data preprocessing:

- encoded the resistance label as 1 for "susceptible" and 0 for "resistant"

- encoded each DNA strand feature as 1 if the genome contains the strand of DNA and 0 if it does not

- did a 70:30 training-test split
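The preprocessing steps above could look roughly like the sketch below. The table `raw`, its column names, and the text values in it are hypothetical stand-ins, since the original raw data is not shown here; only the 1/0 encoding and the 70:30 split follow the description.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw table (assumed column names and values, for illustration only)
raw = pd.DataFrame({
    "strand_a": ["present", "absent"] * 5,
    "resistance": ["susceptible", "resistant"] * 5,
})

# Encode: 1 = susceptible / strand present, 0 = resistant / strand absent
encoded = pd.DataFrame({
    "strand_a": (raw["strand_a"] == "present").astype(int),
    "resistance": (raw["resistance"] == "susceptible").astype(int),
})

X = encoded[["strand_a"]]    # features
y = encoded["resistance"]    # labels

# 70:30 training-test split; random_state only makes the sketch reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))
```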

## 0.D Trained Model: Gaussian Naive Bayes

Previously, we used a Gaussian Naive Bayes algorithm to learn a classifier of antibiotic resistance in bacteria. I run the code below to create the model again.

**In this section, we will be evaluating the accuracy of the trained model on the test data.**

In [1]:
import pandas as pd

#load training data
Y_training_set = pd.read_csv('datasets/Y_training_set')
training_set = pd.read_csv('datasets/training_set')

In [2]:
#import naive bayes

from sklearn.naive_bayes import GaussianNB

#instantiate a Naive Bayes classifier 
gNB = GaussianNB()

#learn classifier from data 
gNB.fit(training_set,Y_training_set.values.ravel())

GaussianNB(priors=None)

# 3. Model Evaluation: Gaussian Naive Bayes

*3. Finally, make **predictions** on **new (unseen) data** for which the label is unknown*

The **unseen data** is the test data. The model was trained only on the training set, so it has learnt nothing from the test data.


## 3.A Load Testing Set
1a. We must first load the testing data. Run the code below to load

- the dataset ```Y_test_set```, containing the antibiotic resistance phenotype of each bacterial sample in the testing split
- and the dataset ```test_set```, containing the genome features of each bacterial sample in the testing split.

In [3]:
import pandas as pd
Y_test_set = pd.read_csv('datasets/Y_test_set')
test_set = pd.read_csv('datasets/test_set')

## 3.B Evaluate the test set

The ```GaussianNB()``` class has a method called ```predict```. ```predict``` determines the antimicrobial resistance of any observation with the same features as observations in the training set.

### 3.B.1 Predicting a single observation

Let's consider the first observation in our test set.

In [4]:
first_observation = test_set.iloc[[0]]
first_observation

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, let's use our learned model to evaluate the first observation of our test set. 

In [5]:
gNB.predict(first_observation)

array([0])

Our model determined that the first observation in our testing set is $0$. This means that the model thinks that this observation is resistant.

How does the predicted resistance compare to the actual antimicrobial resistance of the first observation in the test set?

In [6]:
Y_test_set.iloc[[0]]

Unnamed: 0,resistance phenotype
0,0


### Exercise 3.B.1: Predicting the resistance of the tenth observation

Using the trained Gaussian Naive Bayes model, ```gNB```, predict the antibiotic resistance of the tenth observation in our test set. Like above, you can retrieve the tenth observation using the command ```test_set.iloc[[9]]```. How does this compare to the actual antimicrobial resistance of the tenth observation?



In [7]:
tenth_observation=test_set.iloc[[9]]

#actual class that the tenth observation falls in
Y_test_set.iloc[[9]]

Unnamed: 0,resistance phenotype
1,1


In [8]:
# enter solution here

gNB.predict(tenth_observation)

array([1])

The model correctly predicts that the tenth observation will be susceptible. 

### 3.B.2 Predicting the entire dataset
Now, using ```predict``` and our learned model, let's predict the resistance class of every observation in the test set.

In [9]:
gNB.predict(test_set)

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0])

How does this compare with the actual antimicrobial resistance values in the test set?

In [10]:
pd.DataFrame([gNB.predict(test_set),Y_test_set.values.ravel()], 
             columns=range(0,len(Y_test_set)),index=['Predicted','Actual'])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,108,109,110,111,112,113,114,115,116,117
Predicted,0,1,1,0,1,0,0,0,1,0,...,0,0,1,0,0,1,0,1,1,0
Actual,0,1,1,1,1,0,0,0,1,1,...,0,0,1,0,0,0,0,1,1,0


## 3.C Predict Probabilities 

How does the classifier decide which category an observation belongs to? 


The classifier considers the possibility that an observation belongs to either the susceptible or the resistant class. It then computes the probability of being in each class from the features of the observation.


<img src="images/03_predict_prob.png" alt="Drawing" style="width: 600px;"/>


The classifier then assigns the observation to the class with the highest probability. With only two classes, this is the class whose probability is above a threshold of $0.5$.

We can retrieve this probability using the ```predict_proba``` method.
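To make the link between ```predict_proba``` and ```predict``` concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy` and `y_toy` are assumptions, not the resistance dataset). It checks that picking the most probable class from ```predict_proba``` gives the same labels as ```predict```.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 6 observations with 2 binary features each
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
y_toy = np.array([0, 0, 1, 1, 0, 1])

clf = GaussianNB().fit(X_toy, y_toy)

# One row per observation, one column per class (columns follow clf.classes_)
proba = clf.predict_proba(X_toy)

# Pick the most probable class for each observation
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(np.allclose(proba.sum(axis=1), 1.0))                   # each row sums to 1
print(np.array_equal(labels_from_proba, clf.predict(X_toy)))
```

Because the columns follow the order of `clf.classes_`, the first column corresponds to class 0 (resistant in our encoding) and the second to class 1 (susceptible).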

### 3.C.1 Probability values of a single observation


In [11]:
gNB.predict_proba(first_observation)

array([[1., 0.]])

The model predicts a $100\%$ chance that the observation will be resistant and a $0\%$ chance that the observation will be susceptible.

### Exercise 3.C.1: Probability of the tenth observation

Using the trained Gaussian Naive Bayes model, ```gNB```, compute and interpret the probability values of the tenth observation in our test set.

In [12]:
tenth_observation

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
# enter solution here
gNB.predict_proba(tenth_observation)

array([[0., 1.]])

The model predicts a $0\%$ chance that the observation will be resistant and a $100\%$ chance that the observation will be susceptible.

### 3.C.2 Predicting the entire dataset

Now using ```predict_proba``` and our learned model, let's compute the probabilities of all observations in the test set.

In [14]:
probabilities_test_set = gNB.predict_proba(test_set)
probabilities_test_set

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.

In [15]:
pd.DataFrame(probabilities_test_set,
             index=test_set.index,
             columns=['Probability of being resistant','Probability of being susceptible'])

Unnamed: 0,Probability of being resistant,Probability of being susceptible
0,1.0,0.0
1,0.0,1.0
2,0.0,1.0
3,1.0,0.0
4,0.0,1.0
5,1.0,0.0
6,1.0,0.0
7,1.0,0.0
8,0.0,1.0
9,1.0,0.0


## 3.D Confusion Matrix

Let's go back to the predicted classes of the test observations.

In [16]:
pd.DataFrame([gNB.predict(test_set),Y_test_set.values.ravel()], 
             columns=range(0,len(Y_test_set)),index=['Predicted','Actual'])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,108,109,110,111,112,113,114,115,116,117
Predicted,0,1,1,0,1,0,0,0,1,0,...,0,0,1,0,0,1,0,1,1,0
Actual,0,1,1,1,1,0,0,0,1,1,...,0,0,1,0,0,0,0,1,1,0


It's quite difficult to compare the lists of predicted and actual resistance categories by eye.

Is there a more quantitative way of evaluating the quality of the classifier? Yes: one popular way is the **confusion matrix**.

A **confusion matrix** tabulates how many observations of each actual class were assigned to each predicted class, so it shows the counts of correctly and incorrectly classified observations.
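To see what such a table counts, the sketch below builds a confusion matrix by hand from two short made-up label lists (`actual` and `predicted` are illustrative values, not our test set) and compares it with Scikit-Learn's result. It assumes the usual Scikit-Learn convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for 8 observations (0 = resistant, 1 = susceptible)
actual    = np.array([0, 1, 1, 1, 0, 0, 1, 0])
predicted = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Count each (actual, predicted) pair: rows = actual, columns = predicted
manual = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    manual[a, p] += 1

print(manual)

# Scikit-Learn expects the true labels first, then the predictions
print(np.array_equal(manual, confusion_matrix(actual, predicted)))
```

The diagonal cells hold the correctly classified observations, and the off-diagonal cells hold the two kinds of mistakes.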

<img src="images/03_confusion_matrix.png" alt="Drawing" style="width: 500px;"/>


The ```confusion_matrix``` function in the ```sklearn.metrics``` module generates the confusion matrix.

In [17]:
from sklearn.metrics import confusion_matrix
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
confusionMatrix = confusion_matrix(Y_test_set.values.ravel(), gNB.predict(test_set))
confusionMatrix

array([[56,  8],
       [ 6, 48]])

In [18]:
confusionMatrix_df = pd.DataFrame(confusionMatrix,
                                  columns=['Predicted: 0' , 'Predicted: 1'],
                                  index=['Actual: 0' , 'Actual: 1'])
confusionMatrix_df

Unnamed: 0,Predicted: 0,Predicted: 1
Actual: 0,56,8
Actual: 1,6,48


To determine the fractions of observations correctly and incorrectly classified, we can divide the values in the table by the total number of observations.

In [19]:
confusionMatrix_df/confusionMatrix_df.sum().sum()

Unnamed: 0,Predicted: 0,Predicted: 1
Actual: 0,0.474576,0.067797
Actual: 1,0.050847,0.40678


## 3.E Quality of Model

One common measure of a classifier's quality is its accuracy: the fraction of all observations that are correctly classified. The ```score``` method of ```GaussianNB()``` computes this fraction.

In [20]:
gNB.score(test_set,Y_test_set)

0.8813559322033898

Note that this score is the sum of the diagonal of the normalized confusion matrix: $0.474576 + 0.40678 \approx 0.8814$.
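As a quick check of that relationship, the sketch below recomputes the accuracy from the confusion matrix counts. The helper name `accuracy_from_confusion` is my own, not part of Scikit-Learn, and the counts are the ones reported in section 3.D.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Fraction of observations on the diagonal (correctly classified)."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Counts from section 3.D: 56 and 48 on the diagonal, 6 and 8 off the diagonal.
# The result does not depend on which off-diagonal cell holds 6 and which holds 8.
cm = np.array([[56, 6],
               [8, 48]])

accuracy = accuracy_from_confusion(cm)
print(round(accuracy, 4))   # about 0.8814, matching the score above
```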