# 0. Review 

## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It exposes machine learning algorithms through **object-oriented programming**.

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labeled with its resistance to the antibiotic azithromycin.
- Additionally, each sample is labeled with whether its genome contains certain strands of DNA.

We would like to learn antibiotic resistance from the bacterial genome.

- Our predictors are whether strands of DNA are present.
- Our response is the resistance class.


## 0.C Data Preprocessing

We did a bit of data preprocessing:

- encoded the resistance class as 0 for "resistant" and 1 for "susceptible"
- encoded each DNA strand feature as 0 if the genome does not contain the strand and 1 if it does
- did a 70:30 training-test split
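The preprocessing steps above might look roughly like the following sketch. The toy table, its column names, and the `random_state` value are assumptions made for illustration; they are not the original dataset or the original preprocessing code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 bacteria, 2 DNA strands, and a text label.
raw = pd.DataFrame(
    {
        "strand_a": ["present", "absent"] * 5,
        "strand_b": ["absent"] * 5 + ["present"] * 5,
        "resistance": ["resistant", "susceptible"] * 5,
    },
    index=[f"Bacteria {i}" for i in range(10)],
)

# Encode the response: 0 = resistant, 1 = susceptible.
labels = raw["resistance"].map({"resistant": 0, "susceptible": 1})

# Encode the predictors: 0 = strand absent, 1 = strand present.
DNA = (raw[["strand_a", "strand_b"]] == "present").astype(int)

# 70:30 training-test split (random_state is fixed only for reproducibility).
DNA_train, DNA_test, labels_train, labels_test = train_test_split(
    DNA, labels, test_size=0.3, random_state=0
)
print(DNA_train.shape, DNA_test.shape)
```

With 10 observations, the split leaves 7 rows for training and 3 for testing.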


## 0.D Trained Model: Gaussian Naive Bayes

Previously, we used the Gaussian Naive Bayes algorithm to learn a classifier of antibiotic resistance in the bacteria. Here I rerun the code that creates the model.

In [1]:
import pandas as pd

# load training data (first column is used as the row index)
labels_training_set = pd.read_csv('datasets/labels_training_set',index_col=0)
DNA_training_set = pd.read_csv('datasets/DNA_training_set',index_col=0)

# load test data
labels_test_set = pd.read_csv('datasets/labels_test_set',index_col=0)
DNA_test_set = pd.read_csv('datasets/DNA_test_set',index_col=0)

In [2]:
#import naive bayes

from sklearn.naive_bayes import GaussianNB

#instantiate a Naive Bayes classifier 
gNB = GaussianNB()

#learn classifier from data 
gNB.fit(DNA_training_set,labels_training_set.values.ravel())

GaussianNB(priors=None)

**In this section, we will be evaluating the accuracy of the trained model on the test data.**

# 3. Model Evaluation: Gaussian Naive Bayes

*3. Finally, make **predictions** on **new (unseen) data** for which the label is unknown*

The **unseen data** is the test data. The model was trained only on the training set, so it has learned nothing from the test data.


## 3.A Check Testing Set

First, let's check if our data has loaded correctly.

In [3]:
labels_test_set.head()

Unnamed: 0,resistance class
Bacteria 274,0
Bacteria 275,1
Bacteria 276,1
Bacteria 277,1
Bacteria 278,1


In [4]:
DNA_test_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 274,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 275,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 276,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 277,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 278,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Everything looks good!

## 3.B Evaluate the test set

The ```GaussianNB``` class has a method called ```predict```. ```predict``` returns the predicted resistance class of any observation that has the same features as the observations in the training set.

### 3.B.1 Predicting a single observation

Let's consider the first observation in our test set.

In [5]:
first_observation = DNA_test_set.iloc[[0]]
first_observation

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 274,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, let's use our learned model to predict the resistance class of the first observation of our test set.

In [6]:
gNB.predict(first_observation)

array([0])

Our model assigned class $0$ to the first observation in our test set. This means that the model thinks that this observation is resistant.

How does the predicted resistance compare to the actual antimicrobial resistance of the first observation in the test set?

In [7]:
labels_test_set.iloc[[0]]

Unnamed: 0,resistance class
Bacteria 274,0


The model predicts the correct resistance class!
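A note on the double brackets in ```DNA_test_set.iloc[[0]]```: scikit-learn estimators expect a two-dimensional input (observations by features), and passing a list of positions to ```iloc``` keeps the result as a one-row DataFrame. The sketch below illustrates the difference on a small made-up table; the table and its names are hypothetical.

```python
import pandas as pd

# Hypothetical 3-by-2 table standing in for DNA_test_set.
toy = pd.DataFrame(
    {"strand_a": [0, 1, 0], "strand_b": [1, 1, 0]},
    index=["Bacteria 1", "Bacteria 2", "Bacteria 3"],
)

row_as_frame = toy.iloc[[0]]   # list of positions -> DataFrame with one row
row_as_series = toy.iloc[0]    # single position   -> one-dimensional Series

print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_as_series).__name__, row_as_series.shape)
```

The one-row DataFrame has shape `(1, 2)` and can be passed to ```predict``` directly, while the Series has shape `(2,)` and would first need to be reshaped.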

### Exercise 3.B.1: Predicting the resistance of the tenth observation

Using the trained Gaussian Naive Bayes model, ```gNB```, predict the antibiotic resistance of the tenth observation in our test set. As above, you can retrieve the tenth observation using the command ```DNA_test_set.iloc[[9]]```. How does the predicted resistance compare to the actual antimicrobial resistance of the tenth observation?


In [8]:
tenth_observation=DNA_test_set.iloc[[9]]

#actual class that the tenth observation falls in
labels_test_set.iloc[[9]]

Unnamed: 0,resistance class
Bacteria 283,1


In [9]:
# enter solution here

gNB.predict(tenth_observation)

array([0])

The model incorrectly predicts that the tenth observation will be resistant. 

### 3.B.2 Predicting the entire test set
Now, using ```predict``` and our learned model, let's predict the resistance class of every observation in the test set.

In [10]:
gNB.predict(DNA_test_set)

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0])

How do the predicted classes compare to the actual antimicrobial resistance classes?

In [11]:
pd.DataFrame([gNB.predict(DNA_test_set),labels_test_set.values.ravel()], 
             columns=DNA_test_set.index,index=['Predicted','Actual'])

Unnamed: 0,Bacteria 274,Bacteria 275,Bacteria 276,Bacteria 277,Bacteria 278,Bacteria 279,Bacteria 280,Bacteria 281,Bacteria 282,Bacteria 283,...,Bacteria 382,Bacteria 383,Bacteria 384,Bacteria 385,Bacteria 386,Bacteria 387,Bacteria 388,Bacteria 389,Bacteria 390,Bacteria 391
Predicted,0,1,1,0,1,0,0,0,1,0,...,0,0,1,0,0,1,0,1,1,0
Actual,0,1,1,1,1,0,0,0,1,1,...,0,0,1,0,0,0,0,1,1,0
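To make the comparison concrete, disagreements between predicted and actual classes can be counted directly. The sketch below uses only the first ten columns shown in the table above (Bacteria 274 to 283); the variable names are illustrative.

```python
import numpy as np

# First ten predicted and actual classes, copied from the table above.
predicted = np.array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0])
actual = np.array([0, 1, 1, 1, 1, 0, 0, 0, 1, 1])

# Boolean mask that is True wherever the prediction disagrees with the label.
mismatch = predicted != actual

print(mismatch.sum())            # number of misclassified observations
print(np.flatnonzero(mismatch))  # positions of the misclassified observations
```

Among these ten observations, two are misclassified: positions 3 and 9, which correspond to Bacteria 277 and Bacteria 283.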


## 3.C Predict Probabilities 

How does the classifier decide which category an observation belongs to? 


The classifier considers the possibility that an observation can fall into either the susceptible or the resistant class. It then computes the probability of each class from the features of the observation.

<img src="images/03_predict_prob.png" alt="Drawing" style="width: 600px;"/>


The classifier then assigns the observation to the class with the highest probability. With two classes, this is the class whose probability is above $0.5$.

We can retrieve this probability using the ```predict_proba``` method.
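The relationship between ```predict_proba``` and ```predict``` can be checked on a small example. The sketch below is a minimal illustration on made-up continuous data, not the bacteria dataset; the arrays `X` and `y` and the name `model` are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 8 observations, 2 features, 2 well-separated classes.
X = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [0.0, 0.3],
              [0.9, 0.8], [0.7, 1.0], [1.0, 0.6], [0.8, 0.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = GaussianNB().fit(X, y)
proba = model.predict_proba(X)

# Each row of proba sums to 1; the columns follow the order of model.classes_.
print(proba.sum(axis=1))

# Picking the class with the largest probability reproduces predict.
by_argmax = model.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(by_argmax, model.predict(X)))
```

Because the columns follow `model.classes_`, the first column corresponds to class 0 and the second to class 1 in this example.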

### 3.C.1 Probability values of a single observation


In [12]:
gNB.predict_proba(first_observation)

array([[1., 0.]])

The model predicts a $100\%$ chance that the observation will be resistant and a $0\%$ chance that the observation will be susceptible.

### Exercise 3.C.1: Probability of the tenth observation

Using the trained Gaussian Naive Bayes model, ```gNB```, and the ```predict_proba``` method, compute and interpret the probability values of the tenth observation in our test set.

In [13]:
tenth_observation

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 283,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
# enter solution here
gNB.predict_proba(tenth_observation)

array([[1., 0.]])

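The model predicts a $100\%$ chance that the observation is resistant and a $0\%$ chance that it is susceptible. The model is fully confident here even though the prediction is wrong: the actual class of the tenth observation is 1 (susceptible).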

### 3.C.2 Probability values of the entire test set

Now, using ```predict_proba``` and our learned model, let's compute the class probabilities of all observations in the test set.

In [15]:
probabilities_test_set = gNB.predict_proba(DNA_test_set)
probabilities_test_set

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.

In [16]:
pd.DataFrame(probabilities_test_set,
             index=DNA_test_set.index,
             columns=['Probability of being resistant','Probability of being susceptible'])

Unnamed: 0,Probability of being resistant,Probability of being susceptible
Bacteria 274,1.0,0.0
Bacteria 275,0.0,1.0
Bacteria 276,0.0,1.0
Bacteria 277,1.0,0.0
Bacteria 278,0.0,1.0
Bacteria 279,1.0,0.0
Bacteria 280,1.0,0.0
Bacteria 281,1.0,0.0
Bacteria 282,0.0,1.0
Bacteria 283,1.0,0.0


## 3.D Confusion Matrix

Let's go back to the predicted classes of the test observations.

In [17]:
pd.DataFrame([gNB.predict(DNA_test_set),labels_test_set.values.ravel()], 
             columns=DNA_test_set.index,index=['Predicted','Actual'])

Unnamed: 0,Bacteria 274,Bacteria 275,Bacteria 276,Bacteria 277,Bacteria 278,Bacteria 279,Bacteria 280,Bacteria 281,Bacteria 282,Bacteria 283,...,Bacteria 382,Bacteria 383,Bacteria 384,Bacteria 385,Bacteria 386,Bacteria 387,Bacteria 388,Bacteria 389,Bacteria 390,Bacteria 391
Predicted,0,1,1,0,1,0,0,0,1,0,...,0,0,1,0,0,1,0,1,1,0
Actual,0,1,1,1,1,0,0,0,1,1,...,0,0,1,0,0,0,0,1,1,0


It's quite difficult to compare lists of predicted and actual resistance classes by eye.

Is there a more quantitative way of evaluating the quality of the classifier? Yes! One popular tool is the **confusion matrix**.

A **confusion matrix** tabulates the number of observations that are correctly and incorrectly classified, broken down by actual and predicted class.

<img src="images/03_confusion_matrix.png" alt="Drawing" style="width: 500px;"/>


The ```confusion_matrix``` function in the ```sklearn.metrics``` module generates the confusion matrix. It takes the actual labels first and the predicted labels second, so that rows correspond to actual classes and columns to predicted classes.

In [18]:
from sklearn.metrics import confusion_matrix
# actual labels first, predicted labels second
confusionMatrix = confusion_matrix(labels_test_set.values.ravel(),gNB.predict(DNA_test_set))
confusionMatrix

array([[56,  8],
       [ 6, 48]])

In [19]:
confusionMatrix_df = pd.DataFrame(confusionMatrix,
                                  columns=['Predicted: 0' , 'Predicted: 1'],
                                  index=['Actual: 0' , 'Actual: 1'])
confusionMatrix_df

Unnamed: 0,Predicted: 0,Predicted: 1
Actual: 0,56,8
Actual: 1,6,48


### Exercise 3.D.1: Naming the cells of the confusion matrix

The cells of the confusion matrix have special names:

| . | Predicted: 0 | Predicted: 1 |
| --- | --- | --- |
| **Actual: 0** | True Negative (TN) | False Positive (FP) |
| **Actual: 1** | False Negative (FN) | True Positive (TP) |

Given these names and the confusion matrix we generated above, what are the values for
- True Negative
- False Positive
- False Negative
- True Positive?

With your neighbor, discuss the definitions of these terms.

enter solution here

- True Negative - 56
- False Positive - 8
- False Negative - 6
- True Positive - 48

### Exercise 3.D.2: True and false positive rates

We have two important formulas:
$$ \text{true positive rate} = \frac{\text{TP}}{\text{TP} + \text{FN}}= \text{fraction of susceptible bacteria correctly classified} $$

$$ \text{false positive rate} = \frac{\text{FP}}{\text{FP} + \text{TN}} = \text{fraction of resistant bacteria incorrectly classified}.$$

Calculate these values for the confusion matrix above.

In [20]:
# enter solution here: true positive rate = TP / (TP + FN)
confusionMatrix_df.iloc[1,1]/(confusionMatrix_df.iloc[1,1]+confusionMatrix_df.iloc[1,0])

0.8888888888888888

In [21]:
# enter solution here: false positive rate = FP / (FP + TN)
confusionMatrix_df.iloc[0,1]/(confusionMatrix_df.iloc[0,1]+confusionMatrix_df.iloc[0,0])

0.125
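As an alternative to indexing individual cells, the four counts can be unpacked in one step. The sketch below runs on made-up labels rather than the bacteria data; `actual` and `predicted` are hypothetical arrays. It relies on ```confusion_matrix``` taking the true labels as its first argument and the predictions as its second.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = resistant (negative), 1 = susceptible (positive).
actual = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
predicted = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# True labels go first, predictions second. For binary labels,
# flattening the 2-by-2 matrix row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

true_positive_rate = tp / (tp + fn)    # fraction of actual 1s predicted as 1
false_positive_rate = fp / (fp + tn)   # fraction of actual 0s predicted as 1

print(tn, fp, fn, tp)
print(true_positive_rate, false_positive_rate)
```

Swapping the two arguments transposes the matrix, which exchanges the false positive and false negative counts, so the argument order matters.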

## 3.E Quality of Model

A common measure of the quality of a classifier is its accuracy: the fraction of all observations that are correctly classified. The ```score``` method of the ```GaussianNB``` class computes the accuracy on the data it is given.

In [22]:
gNB.score(DNA_test_set,labels_test_set)

0.8813559322033898
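This score can be cross-checked against the confusion matrix from section 3.D. The diagonal counts (56 and 48) are the correctly classified observations and the off-diagonal counts (6 and 8) are the misclassified ones. A short arithmetic sketch, with illustrative variable names:

```python
# Counts copied from the confusion matrix in section 3.D.
correct = 56 + 48    # diagonal entries: predictions that match the actual class
incorrect = 6 + 8    # off-diagonal entries: misclassified observations

# Accuracy is the fraction of all observations that are classified correctly.
accuracy = correct / (correct + incorrect)
print(accuracy)
```

The result, 104 out of 118, agrees with the value returned by ```score```.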