# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a Python machine learning package. It gives users access to machine learning algorithms through **object-oriented programming**.
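
As a rough sketch of what this object-oriented style looks like: a model is an object that you instantiate and then call methods on, such as `fit` and `predict`. The toy data and the choice of `DecisionTreeClassifier` are assumptions for illustration only; they are not part of this tutorial's pipeline.

```python
from sklearn.tree import DecisionTreeClassifier

# toy data (hypothetical): one feature, two samples, two classes
toy_X = [[0], [1]]
toy_y = [0, 1]

clf = DecisionTreeClassifier()             # 1. instantiate the estimator object
clf.fit(toy_X, toy_y)                      # 2. learn from data; the result is stored on the object
toy_predictions = clf.predict([[1], [0]])  # 3. use the fitted object on new samples
print(toy_predictions.tolist())
```

The same instantiate, fit, then use pattern applies to the `LabelEncoder` used later in this notebook.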

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labelled with its resistance to the antibiotic azithromycin.

- Additionally, each bacterial sample is labelled according to whether its genome contains certain strands of DNA.
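
To make the second kind of label concrete, here is a small sketch of how "Y"/"N" presence features could be derived from a genome string. This is an assumption about how such features might be built, not a description of how this dataset was actually produced; `kmer_presence`, the toy genome, and the toy strands are all hypothetical.

```python
def kmer_presence(genome, kmers):
    """Return 'Y' for each strand that occurs in the genome string, else 'N'."""
    return {kmer: ("Y" if kmer in genome else "N") for kmer in kmers}

toy_genome = "acgtacgtta"               # hypothetical, far shorter than a real genome
toy_kmers = ["acgt", "gtta", "tttt"]    # hypothetical strands of DNA

features = kmer_presence(toy_genome, toy_kmers)
print(features)  # {'acgt': 'Y', 'gtta': 'Y', 'tttt': 'N'}
```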

We would like to learn antibiotic resistance from the bacterial genome. First, we have to clean up our data. **This section will focus on data preprocessing.**

# 1. Preprocessing

1. **First, preprocess the data - make sure that data is ready to enter the ML pipeline**

## 1.A Load Data
1a. We must first load the data. Run the code below to load
- the dataset *antibiotic_resistance_labels*, containing the antibiotic resistance phenotype of each bacterial sample,
- and the dataset *kmer_csv*, containing the genome of each bacterial sample.

In [1]:
import pandas as pd
kmer_df = pd.read_csv('datasets/kmer_csv')

In [2]:
kmer_df.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
0,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,...,N,N,N,N,N,N,N,N,N,N
1,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N
2,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N
3,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N
4,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N


In [3]:
antibiotic_resistance_labels = pd.read_csv('datasets/antibiotic_resistance_labels')
antibiotic_resistance_labels.head()

Unnamed: 0,resistance phenotype
0,susceptible
1,resistant
2,resistant
3,resistant
4,susceptible
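
Before going further, it can help to check that the features and the labels line up. The sketch below uses small hypothetical stand-ins for `kmer_df` and `antibiotic_resistance_labels`; the same two checks could be run on the real DataFrames.

```python
import pandas as pd

# toy stand-ins (hypothetical) for kmer_df and antibiotic_resistance_labels
toy_kmer_df = pd.DataFrame({"kmer_a": ["Y", "N", "N"], "kmer_b": ["N", "N", "Y"]})
toy_labels = pd.DataFrame({"resistance phenotype": ["susceptible", "resistant", "resistant"]})

# each row of features should correspond to exactly one label
same_length = len(toy_kmer_df) == len(toy_labels)
print(same_length)

# count how many samples fall into each class
class_counts = toy_labels["resistance phenotype"].value_counts().to_dict()
print(class_counts)
```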


## 1.B Label Encoding

1b. Sklearn classifiers can only process numeric values, so we need to encode each feature with discrete integer values $(0, 1, 2, \cdots)$.

- For example, if we asked study participants for their preferred level of spiciness and offered the options "very spicy", "spicy", "mild", and "no spice", we could encode these categories as "3-very spicy", "2-spicy", "1-mild", "0-no spice".

- We encode the resistance feature as 1 - "susceptible", 0 - "resistant".

- We encode each DNA-strand feature as 1 - "the genome contains the strand of DNA", 0 - "the genome does not contain the strand of DNA".
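
The spiciness example can be sketched with a plain dictionary. The survey responses below are made up for illustration. An explicit mapping is used here because the order of the categories matters; `LabelEncoder` assigns integers in sorted (alphabetical) order, which would not match the spiciness scale.

```python
# explicit mapping from category to integer, ordered by spiciness
spice_levels = {"no spice": 0, "mild": 1, "spicy": 2, "very spicy": 3}

# hypothetical survey responses
responses = ["mild", "very spicy", "no spice", "spicy", "mild"]

encoded = [spice_levels[response] for response in responses]
print(encoded)  # [1, 3, 0, 2, 1]
```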

### Encoding Resistance Labels
First, we encode the *antibiotic_resistance_labels* array into 0's and 1's.

In [5]:
import numpy as np

# get the label values as a NumPy array
antibiotic_resistance_labels_values = antibiotic_resistance_labels.values

# get the sorted unique values from antibiotic_resistance_labels
phenotypes = np.unique(antibiotic_resistance_labels_values)
print(phenotypes)

['resistant' 'susceptible']


In [6]:
#load label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Let's have a look at the class encoding learnt from the *phenotypes* array.

In [7]:
le.fit(phenotypes)
print(le.classes_)

['resistant' 'susceptible']
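
The position of each label in `classes_` is the integer it will be encoded as. A small sketch on a separate, hypothetical encoder (`demo_le`) shows how to read off that mapping and how to decode integers back into labels with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()  # hypothetical encoder, separate from le above
demo_le.fit(["susceptible", "resistant"])

# classes_ is sorted, and each label's position is its integer code
mapping = {label: code for code, label in enumerate(demo_le.classes_.tolist())}
print(mapping)  # {'resistant': 0, 'susceptible': 1}

# inverse_transform turns integer codes back into the original labels
decoded = demo_le.inverse_transform([1, 0, 0]).tolist()
print(decoded)  # ['susceptible', 'resistant', 'resistant']
```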


In [8]:
# flatten the single-column DataFrame to a 1-D array, the shape LabelEncoder expects
phenotypes_encoded = le.transform(antibiotic_resistance_labels.values.ravel())
print(phenotypes_encoded)

[1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 0 0 1 0 0 1 0 0 0
 0 0 0 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 0 0 0 0
 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0
 1 1 1 1 1 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 1 1 1 1 0 0 1 0 0 0 1 1
 0 0 1 0 1 1 1 1 0 0 0 1 1 0 1 1 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1
 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 1 1 0 0 1 1 1 0 1 0 0 0 0 1 0
 0 0 0 1 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1
 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 0 1
 1 0 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 1 1
 0 0 0 1 1 0 1 1 1 0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0
 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0]


I like to keep everything consistent, so I'll convert *phenotypes_encoded* into a *Pandas DataFrame*.

In [9]:
phenotypes_encoded_df = pd.DataFrame(phenotypes_encoded,columns=antibiotic_resistance_labels.columns) 
phenotypes_encoded_df.head()

Unnamed: 0,resistance phenotype
0,1
1,0
2,0
3,0
4,1


### Exercise 1.1: Encoding the First Column of ```kmer_df```

Now, it's your turn! Following the steps above, encode the first column of *kmer_df* using the list *genome_values*. 

**Note: there is no need to import LabelEncoder from the preprocessing module or to instantiate LabelEncoder again. That is, you don't have to run**

```
#load label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
```

**Bonus: Why don't you need to run the code above again?**

In [10]:
first_column = kmer_df.iloc[:,0]
first_column.head()

0    Y
1    N
2    N
3    N
4    N
Name: tggagcgccgggcggatcggttccgtactat, dtype: object

In [11]:
genome_values = ["Y", "N"]

In [12]:
le.fit(genome_values)
first_column_encoded = le.transform(first_column)
first_column_encoded_df = pd.DataFrame(first_column_encoded,columns=[first_column.name]) 
first_column_encoded_df.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat
0,1
1,0
2,0
3,0
4,0


The ```le.fit_transform(data)``` method of a LabelEncoder instance runs the ```fit``` method to learn the encoding from the ```data``` argument and then immediately applies the learnt encoding to ```data```.

In [13]:
# how the fit_transform method works

# there is no need to grab a list of unique elements to learn the encoding first:
# fit_transform learns the encoding and encodes the column in one call

first_column_encoded = le.fit_transform(first_column)
first_column_encoded_df = pd.DataFrame(first_column_encoded,columns=[first_column.name]) 
first_column_encoded_df.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat
0,1
1,0
2,0
3,0
4,0
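
As a quick check of the claim above, the sketch below compares the two-step version (`fit`, then `transform`) with the one-step `fit_transform` on a small hypothetical column. Fresh encoders are used so that `le` is left untouched.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

toy_column = ["Y", "N", "N", "Y"]  # hypothetical column values

# two-step version: learn the encoding, then apply it
two_step_le = LabelEncoder()
two_step_le.fit(toy_column)
two_step = two_step_le.transform(toy_column)

# one-step version: learn and apply in a single call
one_step = LabelEncoder().fit_transform(toy_column)

print(np.array_equal(two_step, one_step))  # True
print(one_step.tolist())                   # [1, 0, 0, 1]
```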


Rather than running a ```for``` loop to encode each column of ```kmer_df```, we can use the ```apply``` method of a pandas DataFrame to run the ```fit_transform``` function on each column.

In [14]:
# apply calls le.fit_transform on each column in turn, refitting the encoder every time
kmer_df_encoded = kmer_df.apply(le.fit_transform)
kmer_df_encoded.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
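
One thing to keep in mind: because `fit_transform` relearns the encoding for every column, a column that happened to contain only "Y" would have "Y" encoded as 0 rather than 1. Whether any column in this dataset looks like that is not something I have checked; the sketch below uses a hypothetical toy DataFrame to show the effect, together with a fixed mapping as one possible alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy DataFrame (hypothetical): kmer_b contains only "Y"
toy_df = pd.DataFrame({"kmer_a": ["Y", "N", "Y"], "kmer_b": ["Y", "Y", "Y"]})

# refitting per column: in kmer_b the only class is "Y", so it becomes 0
le_demo = LabelEncoder()
refit_encoded = toy_df.apply(le_demo.fit_transform)
print(refit_encoded["kmer_b"].tolist())  # [0, 0, 0]

# fixed mapping: "Y" is always 1 and "N" is always 0, whatever the column holds
fixed_mapping = {"N": 0, "Y": 1}
fixed_encoded = toy_df.apply(lambda column: column.map(fixed_mapping))
print(fixed_encoded["kmer_b"].tolist())  # [1, 1, 1]
```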


## 1.C Training-Test Split

The next preprocessing step is to generate a training set and a test set. When building a machine learning model, we need to both train the model and evaluate its accuracy. The training set is used to train the model. The test set is used to evaluate the model's accuracy on data it did not see during training.

In [27]:
from sklearn.model_selection import train_test_split

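# shuffle=False keeps the original row order: the first 80% of the rows form
# the training set and the remaining 20% form the test set
training_set, test_set, Y_training_set, Y_test_set = train_test_split(
    kmer_df_encoded, phenotypes_encoded_df,
    train_size=0.8, test_size=0.2,
    random_state=None, shuffle=False)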

Let's have a look at our dataset!

In [30]:
training_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
test_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
313,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
314,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
315,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
316,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
317,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
Y_training_set.head()

Unnamed: 0,resistance phenotype
0,1
1,0
2,0
3,0
4,1


In [33]:
Y_test_set.head()

Unnamed: 0,resistance phenotype
313,0
314,1
315,1
316,1
317,0
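
The outputs above show that with `shuffle=False` the split simply cuts the data in order: the test set starts where the training set ends. The sketch below reproduces this on hypothetical toy lists, and also shows a shuffled alternative that uses a fixed `random_state` for reproducibility and `stratify` to keep the class proportions similar in both sets. The shuffled variant is not what this notebook does; it is included only for comparison.

```python
from sklearn.model_selection import train_test_split

toy_X = list(range(10))  # hypothetical samples, in their original order
toy_y = [0, 1] * 5       # hypothetical labels, alternating between two classes

# shuffle=False: the first 80% of the rows become the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, shuffle=False)
print(X_tr, X_te)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]

# shuffled, reproducible, and stratified by class label
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=True, random_state=0, stratify=toy_y)
print(len(X_tr_s), len(X_te_s))  # 8 2
```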


### Exercise 1.2: 70%:30% Training-Test Set Split

Following the process above, create a $70\%$:$30\%$ training-test split of ```kmer_df_encoded, phenotypes_encoded_df```. As above, store the output in ```training_set, test_set, Y_training_set, Y_test_set```.

In [34]:
training_set, test_set, Y_training_set, Y_test_set = train_test_split(
    kmer_df_encoded, phenotypes_encoded_df,
    train_size=0.7, test_size=0.3,
    random_state=None, shuffle=False)

Next, we will store the 70%:30% split so that this output is accessible across all the notebooks we will be using.

In [40]:
# index=False leaves the row index out of the saved files
training_set.to_csv("datasets/training_set", columns=training_set.columns, index=False)
test_set.to_csv("datasets/test_set", columns=test_set.columns, index=False)
Y_training_set.to_csv("datasets/Y_training_set", columns=['resistance phenotype'], index=False)
Y_test_set.to_csv("datasets/Y_test_set", columns=['resistance phenotype'], index=False)
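
When these files are loaded again in another notebook, the values come back unchanged, but because the index was not saved, the rows are renumbered from 0. A small round-trip sketch, using a hypothetical toy DataFrame and a temporary directory rather than the real `datasets/` folder:

```python
import os
import tempfile

import pandas as pd

# toy stand-in (hypothetical) for test_set, keeping its original row numbers
toy_test_set = pd.DataFrame({"kmer_a": [0, 1], "kmer_b": [1, 1]}, index=[313, 314])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_test_set")
    toy_test_set.to_csv(path, index=False)  # save without the index
    reloaded = pd.read_csv(path)            # load as another notebook would

print(reloaded.index.tolist())                                  # [0, 1]
print(reloaded.values.tolist() == toy_test_set.values.tolist())  # True
```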