# Scikit-Learn

Scikit-Learn is a machine learning package for Python. It gives users access to machine learning algorithms through **object-oriented programming**.
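As a rough illustration of the object-oriented style, a model is an object that is first instantiated, then fitted, then asked for predictions. The toy arrays `X_toy` and `y_toy` below are made up for this sketch and are not part of the dataset used later.

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: one numeric feature, two classes
X_toy = [[0.0], [0.1], [0.9], [1.0]]
y_toy = [0, 0, 1, 1]

model = GaussianNB()        # 1. instantiate the estimator object
model.fit(X_toy, y_toy)     # 2. learn parameters from labeled data
toy_predictions = model.predict([[0.05], [0.95]])  # 3. predict for new observations
```

The same instantiate, `fit`, `predict` pattern applies to the other scikit-learn classifiers used in this notebook.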

# Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labelled with its resistance to the antibiotic azithromycin.

- Additionally, each bacterial sample is labelled with whether its genome contains certain strands of DNA.

We would like to learn to predict antibiotic resistance from the bacterial genome.

# 2. Train model: Gaussian Naive Bayes

2. Then, train a **machine learning model** using **labeled data**:

    - "Labeled data" is data in which each observation has been labeled with its outcome
    - A "machine learning model" learns the relationship between the attributes of the data and the outcome
    
    
## 2a. Load Training Set
We must first load the training data. Run the code below to load

- the dataset, ```Y_training_set```, containing the antibiotic resistance phenotype of each bacterial sample in the training split
- and the dataset, ```training_set```, containing the genome of each bacterial sample in the training split.

In [5]:
import pandas as pd
Y_training_set = pd.read_csv('datasets/Y_training_set')
training_set = pd.read_csv('datasets/training_set')

Let's check if we loaded the correct dataset.

In [6]:
Y_training_set.head()

Unnamed: 0,resistance phenotype
0,1
1,0
2,0
3,0
4,1


In [7]:
training_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 2b. Build Model: Naive Bayes

- Naive Bayes (NB) computes the probability that an observation belongs to each class, based on the observation's features.

- Gaussian NB assumes that the features are independent of one another given the class of an observation and that, given the class, each feature follows a normal distribution with mean $\mu$ and variance $\sigma^2$.

**TODO: Draw a diagram that can aid in understanding Gaussian NB**

- For each class and each feature, Gaussian NB learns $\mu$ and $\sigma^2$ from the data.
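To make the bullets above concrete, here is a minimal from-scratch sketch of Gaussian NB using NumPy. It is not scikit-learn's implementation: the function names, the toy data, and the small constant `eps` that keeps variances positive are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate class priors plus a mean and variance per (class, feature)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # eps (an assumption of this sketch) avoids division by zero for constant features
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the largest log prior + summed Gaussian log-likelihood."""
    log_post = []
    for k in range(len(classes)):
        log_lik = -0.5 * (np.log(2 * np.pi * variances[k])
                          + (X - means[k]) ** 2 / variances[k])
        # Independence assumption: per-feature log-likelihoods simply add up
        log_post.append(np.log(priors[k]) + log_lik.sum(axis=1))
    return classes[np.argmax(np.array(log_post), axis=0)]

# Hypothetical toy data: two features, two classes
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]])
y_toy = np.array([0, 0, 1, 1])

params = fit_gaussian_nb(X_toy, y_toy)
pred_toy = predict_gaussian_nb(np.array([[0.05, 0.95], [0.95, 0.1]]), *params)
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together.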

In [10]:
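# import the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# instantiate a Gaussian Naive Bayes classifier
gNB = GaussianNB()

# fit on the training data; ravel() flattens the label column into a 1-D array
gNB.fit(training_set, Y_training_set.values.ravel())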

GaussianNB(priors=None)

In [11]:
gNB.class_count_

array([150., 124.])

In [15]:
gNB.theta_

array([[0.        , 0.        , 0.        , ..., 0.00666667, 0.00666667,
        0.00666667],
       [0.02419355, 0.02419355, 0.02419355, ..., 0.        , 0.        ,
        0.        ]])

In [16]:
gNB.sigma_  # renamed to var_ in scikit-learn 1.0; sigma_ was removed in 1.2

array([[2.50000000e-10, 2.50000000e-10, 2.50000000e-10, ...,
        6.62222247e-03, 6.62222247e-03, 6.62222247e-03],
       [2.36082209e-02, 2.36082209e-02, 2.36082209e-02, ...,
        2.50000000e-10, 2.50000000e-10, 2.50000000e-10]])
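The printed means and variances can be reproduced by hand, which helps in reading them. The sketch below assumes that the last k-mer columns appear in 1 of the 150 class-0 samples and the first k-mer columns in 3 of the 124 class-1 samples (consistent with the means shown), and that the floor of `2.5e-10` comes from scikit-learn's variance smoothing of `1e-9` times the largest feature variance, taken here to be 0.25.

```python
# Class sizes taken from gNB.class_count_ above
n0, n1 = 150, 124

# The mean of a 0/1 feature is the fraction of samples in which it equals 1
mean0 = 1 / n0   # assumed: 1 of 150 class-0 samples has the k-mer
mean1 = 3 / n1   # assumed: 3 of 124 class-1 samples have the k-mer

# Assumed smoothing term: 1e-9 * largest feature variance (0.25 for a 0/1 feature)
epsilon = 1e-9 * 0.25

# The variance of a 0/1 feature with mean p is p * (1 - p); smoothing is added on top
var0 = mean0 * (1 - mean0) + epsilon
var1 = mean1 * (1 - mean1) + epsilon
```

Under these assumptions, a feature that never occurs in a class has mean 0 and a variance equal to the smoothing term alone, which matches the `2.5e-10` entries.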

# 3. Model Evaluation: Gaussian Naive Bayes

## 3a. Load Test Set
We must first load the test data. Run the code below to load

- the dataset, ```Y_test_set```, containing the antibiotic resistance phenotype of each bacterial sample in the test split
- and the dataset, ```test_set```, containing the genome of each bacterial sample in the test split.

In [18]:
import pandas as pd
Y_test_set = pd.read_csv('datasets/Y_test_set')
test_set = pd.read_csv('datasets/test_set')

In [19]:
gNB.predict(test_set)

array([0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0])

In [20]:
gNB.score(test_set,Y_test_set)

0.8813559322033898
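For a classifier, `score` returns the mean accuracy, that is, the fraction of predictions that match the true labels. The sketch below computes this by hand on made-up label arrays (`y_true` and `y_pred` are hypothetical, not the test set).

```python
import numpy as np

# Hypothetical labels and predictions
y_true = np.array([1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0])

# Accuracy = number of matching entries / number of entries
accuracy = np.mean(y_true == y_pred)

# The prediction array printed above has 118 entries, so the reported
# score would correspond to 104 correct predictions
reported = 104 / 118
```

Accuracy alone can hide which class the errors fall in, so it may be worth also counting errors separately for resistant and non-resistant samples.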

# 4. Train and Evaluate Model: Bernoulli Naive Bayes

In [21]:
from sklearn.naive_bayes import BernoulliNB

# instantiate a Bernoulli Naive Bayes classifier
bNB = BernoulliNB()

bNB.fit(training_set,Y_training_set.values.ravel())

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [22]:
bNB.class_log_prior_

array([-0.60249281, -0.79284654])
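These values are the natural logarithms of the class proportions in the training set. A quick check, assuming the class counts of 150 and 124 reported by `gNB.class_count_` earlier (both models were fitted on the same training data):

```python
import math

# Class counts taken from the Gaussian NB output earlier in the notebook
class_counts = [150, 124]
total = sum(class_counts)

# Log prior of each class = log(fraction of training samples in that class)
log_priors = [math.log(count / total) for count in class_counts]
```

The first class is slightly more common, so its log prior is closer to zero.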

In [33]:
bNB.feature_log_prob_

array([[-5.02388052, -5.02388052, -5.02388052, ..., -4.33073334,
        -4.33073334, -4.33073334],
       [-3.44998755, -3.44998755, -3.44998755, ..., -4.83628191,
        -4.83628191, -4.83628191]])

In [24]:
bNB.feature_count_

array([[0., 0., 0., ..., 1., 1., 1.],
       [3., 3., 3., ..., 0., 0., 0.]])
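The `feature_log_prob_` values shown earlier can be recovered from `feature_count_` and the class counts. The sketch below assumes the default smoothing `alpha=1.0` and class counts of 150 and 124, and uses only the first and last k-mer columns from the printed arrays.

```python
import numpy as np

alpha = 1.0                                # Laplace smoothing (BernoulliNB default)
class_count = np.array([150.0, 124.0])     # samples per class
feature_count = np.array([[0.0, 1.0],      # class 0: first and last k-mer columns
                          [3.0, 0.0]])     # class 1: first and last k-mer columns

# Smoothed estimate of P(feature = 1 | class); the 2 * alpha in the denominator
# accounts for the two possible values (0 or 1) of each feature
smoothed_prob = (feature_count + alpha) / (class_count[:, None] + 2 * alpha)
feature_log_prob = np.log(smoothed_prob)
```

Smoothing is why a k-mer that never occurs in a class still gets a small non-zero probability rather than a log probability of minus infinity.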

In [26]:
bNB.score(test_set,Y_test_set)

0.9152542372881356

# Possible Improvements

- Allow for interactions: 

Naive Bayes treats the features as independent, so methods that can model interactions between features are worth considering: logistic regression, linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). However, LDA and QDA assume that the features follow a Gaussian distribution within each class. Logistic regression does not assume a probability distribution for the features, but you have to tell it which interactions to consider. Logistic regression with all interaction terms and a penalty such as LASSO might be suitable.
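One possible way to try this idea is sketched below: pairwise interaction terms are generated with `PolynomialFeatures` and fed to an L1-penalised (LASSO-style) logistic regression. The toy data, the choice of `C=10.0`, and the XOR-style label are assumptions for illustration, not results from the resistance dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical binary features; the label depends on an interaction (XOR) of two of them
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(400, 4))
y_toy = X_toy[:, 0] ^ X_toy[:, 1]

interaction_model = make_pipeline(
    # add all pairwise products x_i * x_j alongside the original features
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    # the L1 penalty shrinks the weights of unhelpful terms to exactly zero
    LogisticRegression(penalty="l1", solver="liblinear", C=10.0),
)
interaction_model.fit(X_toy, y_toy)
train_accuracy = interaction_model.score(X_toy, y_toy)
```

The number of pairwise terms grows quadratically with the number of features, so with many k-mer columns some feature selection beforehand would probably be needed.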

- Parametric methods might be unsuitable:

Tree-based methods or k-nearest neighbors (KNN) might be better suited to such a complex problem.
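As a starting point, both alternatives follow the same fit and score pattern as the Naive Bayes models. The sketch below uses made-up binary data with an XOR-style label. The hyperparameters are untuned defaults chosen for illustration, and the accuracies are measured on the training data only, so they say nothing about generalisation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical binary features with a label that depends on a feature interaction
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(400, 4))
y_toy = X_toy[:, 0] ^ X_toy[:, 1]

# Decision tree: splits on one feature at a time, no distributional assumption
tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# KNN: predicts the majority label among the 5 closest training samples
knn = KNeighborsClassifier(n_neighbors=5).fit(X_toy, y_toy)

tree_accuracy = tree.score(X_toy, y_toy)
knn_accuracy = knn.score(X_toy, y_toy)
```

On the real data, these models would be fitted on `training_set` and scored on `test_set` in the same way as `gNB` and `bNB`.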