# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It gives users access to machine learning algorithms via **object-oriented programming**: each algorithm is a class that you instantiate and then call methods on.
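As a rough illustration of the object-oriented style, the sketch below creates an estimator object, calls its methods, and reads a learned attribute. The toy data and the choice of `DecisionTreeClassifier` are assumptions made only for this example; they are not part of the dataset used in this notebook.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): two samples with two binary features each
X = [[0, 0], [1, 1]]
y = [0, 1]

# 1. Create an estimator object; settings are passed to the constructor
clf = DecisionTreeClassifier(random_state=0)

# 2. Call methods on the object: fit learns from data, predict applies it
clf.fit(X, y)
prediction = clf.predict([[1, 1]])

# 3. Learned state lives on the object, in attributes ending with "_"
print(clf.classes_)
print(prediction)
```

The same create, `fit`, then use pattern appears with `LabelEncoder` later in this section.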

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labeled with its resistance to the antibiotic azithromycin.
- Additionally, each sample is labeled with whether its genome contains certain strands of DNA.

We would like to learn antibiotic resistance from the bacterial genome.

- Our predictors are whether strands of DNA are present.
- Our response is the resistance class.

First, we have to clean our data up. **This section will focus on data preprocessing.**


## 0.C Load Data

We must load two datasets:

- the dataset *antibiotic_resistance_labels*, containing the antibiotic resistance class of each bacterial sample. This contains our response variable.
- the dataset *DNA_slices_csv*, containing the genome of each bacterial sample. This contains our predictors.

In [1]:
import pandas as pd
antibiotic_resistance_labels = pd.read_csv('datasets/antibiotic_resistance_labels',index_col=0)
DNA_slices_df = pd.read_csv('datasets/DNA_slices_csv',index_col=0)
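To make the role of `index_col=0` concrete, here is a minimal sketch that reads a tiny in-memory CSV. The CSV text is a hypothetical stand-in; the real files are not reproduced here.

```python
import io
import pandas as pd

# Hypothetical stand-in for a labels file (not the real file contents)
csv_text = """,resistance class
Bacteria 0,susceptible
Bacteria 1,resistant
"""

# index_col=0 turns the first column (the sample names) into the row index
toy_labels = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy_labels.index.tolist())    # row labels
print(toy_labels.columns.tolist())  # remaining data columns
```

Using the sample names as the index means the labels and the predictors can later be matched by name rather than by row position.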

# 1. Preprocessing

**First, preprocess the data: make sure that the data is ready to enter the ML pipeline.**

## 1.A Analyzing Data

First, we must get a glimpse of our data. 

In [2]:
# print shape of antibiotic_resistance_labels
print(antibiotic_resistance_labels.shape)
# print head of antibiotic_resistance_labels
antibiotic_resistance_labels.head()


Unnamed: 0,resistance class
Bacteria 0,susceptible
Bacteria 1,resistant
Bacteria 2,resistant
Bacteria 3,resistant
Bacteria 4,susceptible


In [3]:
# print shape of DNA_slices_df
print(DNA_slices_df.shape)
# print head of DNA_slices_df
DNA_slices_df.head()

(392, 73016)


Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 0,Y,Y,Y,Y,Y,Y,Y,Y,Y,Y,...,N,N,N,N,N,N,N,N,N,N
Bacteria 1,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N
Bacteria 2,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N
Bacteria 3,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N
Bacteria 4,N,N,N,N,N,N,N,N,N,N,...,N,N,N,N,N,N,N,N,N,N


## 1.B Label Encoding

Most Sklearn classifiers expect numeric values, so we need to encode each categorical feature with discrete integer values $(0, 1, 2, \cdots)$.

- For example, if we asked study participants for their preferred level of spiciness and offered the options "very spicy", "spicy", "mild", and "no spice", we could encode these categories as 3 - "very spicy", 2 - "spicy", 1 - "mild", and 0 - "no spice".

- We encode the resistance feature as 0 - "resistant" and 1 - "susceptible."

- We encode each DNA strand feature as 0 if the genome does not contain the strand of DNA and 1 if the genome contains it.
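The spiciness example can be sketched with an explicit mapping. The survey answers below are hypothetical. An explicit dictionary is used here because `LabelEncoder` assigns codes in sorted label order, which would not follow the spiciness order; for a two-class label such as resistant/susceptible this distinction does not matter.

```python
import pandas as pd

# Hypothetical survey answers
answers = pd.Series(["mild", "very spicy", "no spice", "spicy", "mild"])

# Explicit mapping that preserves the natural order of the categories
spice_codes = {"no spice": 0, "mild": 1, "spicy": 2, "very spicy": 3}

# Replace every answer with its integer code
encoded = answers.map(spice_codes)
print(encoded.tolist())
```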

### 1.B.1 Encoding Resistance Labels
First, we encode the *antibiotic_resistance_labels* data into 0's and 1's.

#### I. Get unique labels

In [4]:
antibiotic_resistance_labels.head()

Unnamed: 0,resistance class
Bacteria 0,susceptible
Bacteria 1,resistant
Bacteria 2,resistant
Bacteria 3,resistant
Bacteria 4,susceptible


In [5]:
resistance_classes = ['resistant', 'susceptible']

#### II. Initialize LabelEncoder()

In [6]:
#initialize label encoder as le
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#### III. Train Label Encoder

Using the ```fit``` method, we train our ```LabelEncoder()``` instance on how to encode our resistance data. 

In [7]:
# train label encoder
le.fit(resistance_classes)

LabelEncoder()

Let's have a look at the class encoding we learned from the *resistance_classes* list.

In [8]:
# print classes in le
le.classes_

array(['resistant', 'susceptible'], dtype='<U11')

In [9]:
resistance_classes

['resistant', 'susceptible']
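One way to read `classes_`: the integer code of a label is its position in that array, and the array is stored in sorted order regardless of the order passed to `fit`. The sketch below uses a separate, hypothetical instance named `le_demo` so that the notebook's `le` is left untouched.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()  # separate instance, for illustration only

# The order of the input list does not matter: classes_ is stored sorted
le_demo.fit(["susceptible", "resistant"])

# Each label's integer code is its position in classes_
code_for = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(code_for)
```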

#### IV. Transform Data

We now use the ```transform``` method of the LabelEncoder class to encode lists of strings as integer values.

In [10]:
# transform the list ['resistant']

le.transform(['resistant'])

array([0])

In [11]:
# transform the list ['susceptible']
le.transform(['susceptible'])

array([1])

In [12]:
#transform the list ['susceptible','resistant','susceptible']

le.transform(['susceptible','resistant','susceptible'])

array([1, 0, 1])
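Two related behaviours are worth a quick sketch: `inverse_transform` maps codes back to strings, and `transform` rejects labels that were not seen during `fit`. The instance `le_demo` and the label `'intermediate'` are hypothetical and only serve this example.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder().fit(["resistant", "susceptible"])

# inverse_transform maps integer codes back to the original strings
decoded = le_demo.inverse_transform([1, 0, 1])
print(list(decoded))

# A label that was not seen during fit cannot be encoded
try:
    le_demo.transform(["intermediate"])  # hypothetical third class
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("ValueError:", err)
```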

### Exercise 1.B.1: Encoding ```['susceptible','resistant','susceptible','susceptible']```

Encode the list, ```['susceptible','resistant','susceptible','susceptible']```, using the ```transform``` method.

In [13]:
# enter solution here

test_list = ['susceptible','resistant','susceptible','susceptible']
le.transform(test_list)

array([1, 0, 1, 1])

### Exercise 1.B.2: Encoding ```antibiotic_resistance_labels.values.ravel()```

Encode ```antibiotic_resistance_labels.values.ravel()``` using the ```transform``` method. Print and store the output of the ```transform``` method as ```labels_encoded```.

In [14]:
# enter solution here
ar_labels = antibiotic_resistance_labels.values.ravel()
labels_encoded = le.transform(ar_labels)
labels_encoded

array([1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,

I really like keeping everything consistent so I'll convert *labels_encoded* into a *Pandas DataFrame*.

In [15]:
labels_encoded_df = pd.DataFrame(labels_encoded,
                                 columns=antibiotic_resistance_labels.columns,
                                 index=antibiotic_resistance_labels.index) 
labels_encoded_df.head()

Unnamed: 0,resistance class
Bacteria 0,1
Bacteria 1,0
Bacteria 2,0
Bacteria 3,0
Bacteria 4,1


### Exercise 1.B.3: Encoding First Column ```DNA_slices_df```

This is a challenge exercise. Following the steps above, encode the first column of *DNA_slices_df* using the list *genome_values*. I grabbed the first column of *DNA_slices_df* and stored it in *first_column*.

**Note: there is no need to import LabelEncoder from the preprocessing module and instantiate it again. That is, you don't have to run the following code:**

```
#load label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
```

**Bonus: Why don't you need to run the code above again?**

In [16]:
# get the first column
first_column = DNA_slices_df.iloc[:,0]
# get head of the data
print(first_column.head())

# unique values in our dataframe
genome_values = ["N","Y"]

Bacteria 0    Y
Bacteria 1    N
Bacteria 2    N
Bacteria 3    N
Bacteria 4    N
Name: tggagcgccgggcggatcggttccgtactat, dtype: object


In [17]:
# enter solution here
le.fit(genome_values)
le.transform(first_column)

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
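A possible way to see why the existing encoder object can be reused: calling `fit` again on the same object simply replaces the classes it learned before. The sketch below uses a separate, hypothetical instance `le_demo`.

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()

# First fit: learn the resistance classes
le_demo.fit(["resistant", "susceptible"])
classes_before = [str(c) for c in le_demo.classes_]

# Fitting the same object again discards the old encoding and learns a new one
le_demo.fit(["N", "Y"])
classes_after = [str(c) for c in le_demo.classes_]

print(classes_before)
print(classes_after)
```

A consequence is that, after refitting on the genome values, the same object can no longer encode the resistance labels unless it is fitted on them again.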

The ```le.fit_transform(data)``` method of a ```LabelEncoder``` instance runs the ```fit``` method to learn the encoding from the ```data``` argument and then immediately applies the learned encoding to ```data```.

In [18]:
# how the fit_transform method works

# no need to build a list of unique elements to learn the encoding:
# fit_transform learns the encoding and encodes the column in one call

first_column_encoded = le.fit_transform(first_column)
first_column_encoded
first_column_encoded_df = pd.DataFrame(first_column_encoded,columns=[first_column.name]) 
first_column_encoded_df.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat
0,1
1,0
2,0
3,0
4,0
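The sketch below checks, on a hypothetical toy column, that `fit_transform` gives the same result as `fit` followed by `transform`. It also shows one optional variation that is not used in the cell above: passing `index=` when building the DataFrame keeps the sample names instead of a fresh 0, 1, 2, ... index.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for one column of DNA_slices_df
toy_column = pd.Series(["Y", "N", "N"],
                       index=["Bacteria 0", "Bacteria 1", "Bacteria 2"],
                       name="toy_strand")

# Two-step and one-step encoding of the same column
two_step = LabelEncoder().fit(toy_column).transform(toy_column)
one_step = LabelEncoder().fit_transform(toy_column)
same_result = np.array_equal(two_step, one_step)

# Passing index= keeps the sample names as row labels
toy_encoded_df = pd.DataFrame(one_step,
                              columns=[toy_column.name],
                              index=toy_column.index)
print(same_result)
print(toy_encoded_df)
```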


Rather than running a ```for``` loop to encode each column of ```DNA_slices_df```, we can use the ```apply``` method of a pandas DataFrame to run the ```transform``` method on each column.

In [19]:
genome_values = ["Y", "N"]
le.fit(genome_values)
DNA_slices_df_encoded = DNA_slices_df.apply(le.transform)
DNA_slices_df_encoded.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Bacteria 1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
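The column-wise encoding above can be tried on a small scale. The toy DataFrame and its column names are hypothetical. The comparison-based shortcut at the end is an alternative that happens to work for Y/N data; it is shown only as a cross-check and is not the approach used in this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature version of DNA_slices_df
toy_df = pd.DataFrame({"strand_a": ["Y", "N", "N"],
                       "strand_b": ["N", "N", "Y"]},
                      index=["Bacteria 0", "Bacteria 1", "Bacteria 2"])

le_demo = LabelEncoder().fit(["N", "Y"])

# apply calls le_demo.transform once per column
encoded_apply = toy_df.apply(le_demo.transform)

# Cross-check for this specific Y/N data: compare and cast to integers
encoded_compare = (toy_df == "Y").astype(int)
values_match = bool((encoded_apply.to_numpy() == encoded_compare.to_numpy()).all())

print(encoded_apply)
print(values_match)
```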


## 1.C Training-Test Split

The next preprocessing step is to generate a training set and a test set. When building a machine learning model, we need to both train the model and evaluate its accuracy. The training set is used to train the model. The test set, which the model does not see during training, is used to evaluate model accuracy.

In [20]:
from sklearn.model_selection import train_test_split

# do 80:20 split
# shuffle=False keeps the original row order: the first 80% of rows
# form the training set and the last 20% form the test set
DNA_training_set, DNA_test_set, labels_training_set, labels_test_set = train_test_split(
    DNA_slices_df_encoded, labels_encoded,
    train_size=0.8, test_size=0.2, shuffle=False)

In [21]:
DNA_slices_df_encoded.shape

(392, 73016)
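To see how an 80:20 split of 392 rows works out, the sketch below repeats the split on hypothetical stand-in arrays of the same length. Only the row count (392) comes from the notebook; the array contents are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 392  # number of rows reported by DNA_slices_df_encoded.shape
toy_X = np.arange(n_samples).reshape(-1, 1)  # hypothetical predictors
toy_y = np.arange(n_samples) % 2             # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, train_size=0.8, test_size=0.2, shuffle=False
)

# 0.8 * 392 = 313.6 is not a whole number, so the sizes are rounded
print(X_train.shape, X_test.shape)

# With shuffle=False the first rows go to training and the last rows to test
print(X_train[-1, 0], X_test[0, 0])
```

Because nothing is shuffled, the test set starts exactly where the training set ends, which is why the test rows in this notebook begin at `Bacteria 313`.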

Let's have a look at our dataset!

In [22]:
# shape and contents of the response variable training set
print(labels_training_set.shape)
labels_training_set

(313,)


array([1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,

In [23]:
# shape of the response variable test set
print(labels_test_set.shape)


In [24]:
# shape and head of the predictors training set
print(DNA_training_set.shape)
DNA_training_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Bacteria 1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
# shape and head of the predictors test set
print(DNA_test_set.shape)
DNA_test_set.head()

Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 313,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 314,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 315,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 316,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 317,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Exercise 1.C.1: 70%:30% Training-Test Set Split

Following the process above, produce a $70\%$:$30\%$ training-test split of ```DNA_slices_df_encoded``` and ```labels_encoded```. As above, store the output in ```DNA_training_set, DNA_test_set, labels_training_set, labels_test_set```.

Also run the code below to print the shape and first rows of each piece. Did the split occur? Does every piece have the correct dimensions?

In [26]:
# enter solution here

DNA_training_set, DNA_test_set, labels_training_set, labels_test_set = train_test_split(
    DNA_slices_df_encoded, labels_encoded,
    train_size=0.7, test_size=0.3, shuffle=False)

In [27]:
print(DNA_training_set.shape)
DNA_training_set.head()

(274, 73016)


Unnamed: 0,tggagcgccgggcggatcggttccgtactat,ggagcgccgggcggatcggttccgtactatc,gagcgccgggcggatcggttccgtactatcc,agcgccgggcggatcggttccgtactatccg,gcgccgggcggatcggttccgtactatccgt,cgccgggcggatcggttccgtactatccgta,gccgggcggatcggttccgtactatccgtac,ccgggcggatcggttccgtactatccgtact,cgggcggatcggttccgtactatccgtactg,gggcggatcggttccgtactatccgtactgc,...,cttttggtctttcctgttaggtggaacgtta,ttttggtctttcctgttaggtggaacgttac,tttggtctttcctgttaggtggaacgttacc,ttggtctttcctgttaggtggaacgttacct,tggtctttcctgttaggtggaacgttaccta,ggtctttcctgttaggtggaacgttacctac,gtctttcctgttaggtggaacgttacctact,tctttcctgttaggtggaacgttacctactt,ctttcctgttaggtggaacgttacctacttc,tttcctgttaggtggaacgttacctacttct
Bacteria 0,1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
Bacteria 1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bacteria 4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
print(labels_training_set.shape)
# labels_training_set is a NumPy array, which has no head() method; slice it instead
labels_training_set[:5]

(274,)


array([1, 0, 0, 0, 1])

In [None]:
print(DNA_test_set.shape)
DNA_test_set.head()

In [None]:
print(labels_test_set.shape)
# NumPy arrays have no head() method; slice the first five entries instead
labels_test_set[:5]

The dimensions look great! It is a 70-30 split.
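As a quick arithmetic check of that claim, the fractions can be computed from the row counts printed above (392 rows in total, 274 in the training set). The variable names are just for this sketch.

```python
n_total = 392   # rows in DNA_slices_df_encoded
n_train = 274   # rows reported for DNA_training_set
n_test = n_total - n_train

# Fractions of the data that ended up in each set
train_fraction = n_train / n_total
test_fraction = n_test / n_total
print(f"train: {train_fraction:.3f}, test: {test_fraction:.3f}")
```

The split is not exactly 70% because 0.7 * 392 = 274.4 is not a whole number of samples.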