# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a Python machine learning package. It gives users access to machine learning algorithms through **object-oriented programming**: each algorithm is an object that you create, fit to data, and then use.
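As a minimal sketch of this object-oriented pattern, the cell below creates an estimator, fits it, and calls a method on it. The toy arrays `X_toy` and `y_toy` are made-up values for illustration only, and `LogisticRegression` is just one example of a Scikit-Learn estimator; it is not the model used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data (hypothetical): 6 samples, 2 binary features
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [1, 1], [0, 0]])
y_toy = np.array([0, 0, 1, 1, 1, 0])

# 1. create the object; hyperparameters are set in the constructor
model = LogisticRegression()

# 2. fit the object; learned parameters are stored as attributes ending in "_"
model.fit(X_toy, y_toy)

# 3. use the fitted object through its methods
predictions = model.predict(X_toy)
print(model.coef_.shape)   # one row of coefficients, one column per feature
print(predictions.shape)   # one prediction per sample
```

The same create, fit, use sequence applies to the `StandardScaler` and `PCA` objects that appear later.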

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labeled with its resistance to the antibiotic azithromycin.
- Additionally, each bacterial sample is labeled by whether its genome contains certain strands of DNA.

We would like to learn antibiotic resistance from the bacterial genome. 

- Our predictors are whether strands of DNA are present.
- Our response is the resistance class.

First, we have to clean our data up. **This section will focus on data preprocessing.**


## 0.C Data Preprocessing

We did a bit of data preprocessing:

- encoded the resistance label as 0 for "resistant" and 1 for "susceptible"
- encoded each DNA strand feature as 0 if the genome does not contain the strand and 1 if it does
- standardized the dataset of DNA strand presence
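The cell below is a small sketch of these three steps on a made-up table. The frame `raw_toy`, its column names, and the string values "present"/"absent" are all hypothetical; the real dataset is already encoded when we load it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical raw table: one row per bacterial sample
raw_toy = pd.DataFrame(
    {
        "resistance": ["resistant", "susceptible", "susceptible", "resistant"],
        "strand_A": ["absent", "present", "present", "absent"],
        "strand_B": ["present", "present", "absent", "absent"],
    },
    index=["sample_1", "sample_2", "sample_3", "sample_4"],
)

# encode the label: 0 = resistant, 1 = susceptible
labels_toy = raw_toy["resistance"].map({"resistant": 0, "susceptible": 1})

# encode each strand: 1 if the genome contains it, 0 otherwise
features_toy = (raw_toy[["strand_A", "strand_B"]] == "present").astype(int)

# standardize each column to mean 0 and standard deviation 1
scaler_toy = StandardScaler()
standardized_toy = pd.DataFrame(
    scaler_toy.fit_transform(features_toy),
    columns=features_toy.columns,
    index=features_toy.index,
)
print(standardized_toy)
```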

## 0.D Load Data
Now, we load our dataset. Run the code below to load 

- the dataset, ```antibiotic_resistance_all_labels```, containing the antibiotic resistance phenotype for each bacterial sample
- the dataset, ```standardized_DNA_data_df```, containing the standardized DNA strand presence data for each bacterial sample
- and the dataset, ```DNA_slices_all_df```, containing the encoded DNA strand presence in the genome of each bacterial sample

In [None]:
import pandas as pd

antibiotic_resistance_all_labels = pd.read_csv('datasets/antibiotic_resistance_encoded_labels',index_col=0)
DNA_slices_all_df = pd.read_csv('datasets/DNA_slices_encoded_csv',index_col=0)

In [None]:
#create standardized data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_DNA_data = scaler.fit_transform(DNA_slices_all_df)
standardized_DNA_data_df = pd.DataFrame(standardized_DNA_data,
                                        columns=DNA_slices_all_df.columns,
                                        index=DNA_slices_all_df.index)

**In this section, we will be covering unsupervised learning.**

Recall that **unsupervised learning** extracts structure from data without pre-existing labels. It is self-organized learning that finds previously unknown patterns in a data set.

# 7. Dimensionality Reduction: PCA

It is difficult to visualize data with many features. The human mind is mostly limited to three dimensions. 

One popular technique for reducing dimensions and grouping features together is principal component analysis (PCA).

PCA projects high-dimensional data onto the directions along which the data varies the most.
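To make "direction of highest variance" concrete, here is a small sketch on synthetic two-feature data. The array `X_demo` is randomly generated for illustration, and the comparison with the top eigenvector of the covariance matrix is one standard way to describe what PCA finds, not necessarily how Scikit-Learn computes it internally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# synthetic data: two features that vary together along a diagonal
t = rng.normal(size=500)
X_demo = np.column_stack([t, t + 0.1 * rng.normal(size=500)])

# first principal direction found by PCA (a unit vector)
pca_demo = PCA(n_components=1).fit(X_demo)
direction = pca_demo.components_[0]

# top eigenvector of the covariance matrix (eigh sorts eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
top_eigvec = eigvecs[:, -1]

# projecting the centered data onto the direction gives the reduced data
projected = (X_demo - X_demo.mean(axis=0)) @ direction

print(abs(direction @ top_eigvec))        # close to 1: same direction up to sign
print(projected.var(ddof=1), eigvals[-1]) # variance along it is the top eigenvalue
```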

## 7.A PCA for data with two features
### 7.A.1 Reduction to one dimension

<img src="images/07_PCA_01.png" alt="Drawing" style="width: 1000px;"/>


## 7.B PCA for data with three features

### 7.B.1 Reduction to one dimension

<img src="images/07_PCA_03.png" alt="Drawing" style="width: 1000px;"/>

### 7.B.2 Reduction to two dimensions

<img src="images/07_PCA_04.png" alt="Drawing" style="width: 1000px;"/>


## 7.C PCA on the DNA slices dataframe

```DNA_slices_all_df``` is very high dimensional. To get some idea of what the data looks like, we will use PCA to project the data into two dimensions.
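Before working on the real data, here is a hedged sketch of the initialize, fit, and transform steps on a synthetic stand-in. The array `toy_presence` is random 0/1 data invented for this example, and all `toy_` names are hypothetical so they do not overwrite the variables used in the cells of this section.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic stand-in: 50 samples x 200 binary presence/absence features
toy_presence = rng.integers(0, 2, size=(50, 200))
toy_standardized = StandardScaler().fit_transform(toy_presence)

# initialize PCA with two components
toy_pca = PCA(n_components=2)

# fit: learn the two directions of highest variance
toy_pca.fit(toy_standardized)

# transform: project every sample onto those two directions
toy_transformed = toy_pca.transform(toy_standardized)

print(toy_pca.components_.shape)  # (2, 200): one direction per component
print(toy_transformed.shape)      # (50, 2): two coordinates per sample
```

Fitting and transforming can also be done in one call with `fit_transform`.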

### I. Initialize ```PCA``` object

In [None]:
from sklearn.decomposition import PCA
# initialize PCA with n_components = 2
# initialize PCA as pca


### II. Fit PCA object

In [None]:
# fit pca to standardized_DNA_data_df


### III. Transform Data

In [None]:
# transform standardized_DNA_data_df using pca.transform
# store as transformed_data 


###  IV. Plot reduced data

In [None]:
# plot of two dimensional data

import matplotlib.pyplot as plt


%matplotlib inline
# plot resistant strains (label 0)
plt.figure(figsize=(10,6))
presence_0 = [element == 0 for element in antibiotic_resistance_all_labels.values.ravel()]

plt.scatter(transformed_data[presence_0, 0],
            transformed_data[presence_0, 1],
            label='label = 0 (Resistant)',
            c='r')

# plot susceptible strains (label 1)

presence_1 = [element == 1 for element in antibiotic_resistance_all_labels.values.ravel()]

plt.scatter(transformed_data[presence_1, 0],
            transformed_data[presence_1, 1],
            label='label = 1 (Susceptible)',
            c='b')


plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.title('PCA plot of k-mer test data')
plt.legend()
plt.show()

In [None]:
# zoomed in plot of two dimensional data

import matplotlib.pyplot as plt

%matplotlib inline
# plot resistant strains (label 0)
plt.figure(figsize=(10,6))
presence_0 = [element == 0 for element in antibiotic_resistance_all_labels.values.ravel()]

plt.scatter(transformed_data[presence_0, 0],
            transformed_data[presence_0, 1],
            label='label = 0 (Resistant)',
            c='r')

# plot susceptible strains (label 1)

presence_1 = [element == 1 for element in antibiotic_resistance_all_labels.values.ravel()]

plt.scatter(transformed_data[presence_1, 0],
            transformed_data[presence_1, 1],
            label='label = 1 (Susceptible)',
            c='b')

plt.xlim([-25,25])
plt.ylim([-25,25])
plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.title('PCA plot of k-mer test data')
plt.legend()
plt.show()

## 7.C.2 Exercise: Reduction to 3-D

We can also reduce the data to three dimensions using PCA. We need three components to reduce the data to three dimensions.

Following the steps above, redo the PCA to reduce the data, ```standardized_DNA_data_df```, to three dimensions. Use ```transformed_3d``` to store your final transformed data.

In [None]:
# enter solution here
from sklearn.decomposition import PCA



The code below plots the three dimensional reduced data.

In [None]:
import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D

%matplotlib notebook
%matplotlib notebook

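fig = plt.figure()
# add 3D axes to the figure; in recent Matplotlib versions Axes3D(fig) is not attached to the figure automatically
ax = fig.add_subplot(projection='3d')


# plot resistant strains (label 0)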

presence_0 = [element == 0 for element in antibiotic_resistance_all_labels.values.ravel()]

ax.scatter(transformed_3d[presence_0, 0],
           transformed_3d[presence_0, 1],
           transformed_3d[presence_0, 2],
           label='label = 0 (Resistant)',
           c='r')

# plot susceptible strains (label 1)

presence_1 = [element == 1 for element in antibiotic_resistance_all_labels.values.ravel()]

ax.scatter(transformed_3d[presence_1, 0],
           transformed_3d[presence_1, 1],
           transformed_3d[presence_1, 2],
           label='label = 1 (Susceptible)',
           c='b')

ax.set_xlabel('First Component')
ax.set_ylabel('Second Component')
ax.set_zlabel('Third Component')
plt.title('PCA plot of k-mer data')
plt.legend()
plt.show()

In [None]:
#### zoomed in plot ####

import matplotlib.pyplot as plt 
from mpl_toolkits.mplot3d import Axes3D

%matplotlib notebook
%matplotlib notebook

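fig = plt.figure()
# add 3D axes to the figure; in recent Matplotlib versions Axes3D(fig) is not attached to the figure automatically
ax = fig.add_subplot(projection='3d')


# plot resistant strains (label 0)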

presence_0 = [element == 0 for element in antibiotic_resistance_all_labels.values.ravel()]

ax.scatter(transformed_3d[presence_0, 0],
           transformed_3d[presence_0, 1],
           transformed_3d[presence_0, 2],
           label='label = 0 (Resistant)',
           c='r')

# plot susceptible strains (label 1)

presence_1 = [element == 1 for element in antibiotic_resistance_all_labels.values.ravel()]

ax.scatter(transformed_3d[presence_1, 0],
           transformed_3d[presence_1, 1],
           transformed_3d[presence_1, 2],
           label='label = 1 (Susceptible)',
           c='b')

ax.set_xlim([-10,10])
ax.set_ylim([-5,20])
ax.set_zlim([-10,10])
ax.set_xlabel('First Component')
ax.set_ylabel('Second Component')
ax.set_zlabel('Third Component')
plt.title('PCA plot of k-mer data')
plt.legend()
plt.show()

## 7.C.3 PCA: Explained Variance

```PCA``` also calculates the variance of the data along each direction (component). This is a measure of how much information that direction carries.

To compare directions, it is common to analyze the explained variance ratio rather than the explained variance itself. The explained variance ratio is the explained variance in each direction divided by the total variance in the data, so it measures the fraction of the "information" captured by each direction.

```PCA``` stores explained variance as ```explained_variance_```. It also stores the explained variance ratio as ```explained_variance_ratio_```.
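As a quick check of this definition, the sketch below recomputes the ratio by hand on synthetic data. The array `X_var` is randomly generated for illustration, and the total variance is taken as the sum of the per-feature sample variances.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# synthetic data: four features with very different spreads
X_var = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca_var = PCA(n_components=3).fit(X_var)

# total variance in the data: sum of the sample variances of all features
total_variance = X_var.var(axis=0, ddof=1).sum()

# explained variance ratio = explained variance / total variance
ratio_by_hand = pca_var.explained_variance_ / total_variance

print(pca_var.explained_variance_)
print(ratio_by_hand)
print(pca_var.explained_variance_ratio_)  # matches ratio_by_hand
```

Because only three of the four possible components are kept here, the ratios sum to less than 1; the remainder is the variance lost in the reduction.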

In [None]:
# learn the three dimensional PCA
from sklearn.decomposition import PCA

# initialize PCA with n_components=3

# fit_transform standardized_DNA_data_df and store in transformed_3d


I calculate the explained variance below.

In [None]:
#get explained variance, explained_variance_, store in explained_variance

# print explained_variance


In [None]:
# plot of explained variance

n = len(explained_variance)

x = range(0,n)

%matplotlib inline
fig, ax = plt.subplots()
ax.scatter(x,explained_variance)
ax.set_ylim(min(explained_variance)- 5,max(explained_variance)+100)
for i in x:
    ax.annotate("%.3f" %explained_variance[i],
                (i, explained_variance[i]), 
                (i-0.1, explained_variance[i]+5))
ax.set_xlabel('component index')
ax.set_ylabel('Explained Variance')
ax.set_title('Scatter plot of Explained Variance')
ax.set_xticks(range(0, n, 1))

plt.show()

I calculate and plot the explained variance ratio below.

In [None]:
#get explained variance ratio, explained_variance_ratio_
# store as explained_variance_ratio

# print explained_variance_ratio


In [None]:
## show plot ##

n = len(explained_variance_ratio)

x = range(0,n)

fig, ax = plt.subplots()
ax.scatter(x,explained_variance_ratio)
ax.set_ylim([-0.05,1.10])
for i in x:
    ax.annotate("%.3f" %explained_variance_ratio[i],
                (i, explained_variance_ratio[i]), 
                (i-0.1, explained_variance_ratio[i]+0.05))
ax.set_xlabel('component index')
ax.set_ylabel('Explained Variance Ratio')
ax.set_title('Scatter plot of Explained Variance Ratio')
ax.set_xticks(range(0, n, 1))

plt.show()

## 7.C.4 Exercise: The Drop-off

When reducing the data, there is a loss in variance (and thus information). If there is a significant drop-off in the explained variance, as in the plot below,

<img src="images/07_PCA_drop_off_.png" alt="Drawing" style="width: 500px;"/>

many researchers have argued that it is possible to cut off the lower-variance components without losing much information.

Play around with the number of components and determine if it is possible to find a drop-off in the explained variance. Please store your instance of ```PCA``` as ```pca```.
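One way to look for a drop-off is to inspect the cumulative explained variance ratio. The sketch below does this on synthetic data built to have three strong directions; `X_drop` and the 90% threshold are assumptions chosen for illustration, and the names differ from ```pca``` so they do not replace your own solution.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# synthetic data: 3 strong latent directions plus small noise, in 20 features
latent = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 20))
X_drop = latent + 0.05 * rng.normal(size=(300, 20))

# keep all components so the full variance spectrum is visible
pca_full = PCA().fit(X_drop)
ratios = pca_full.explained_variance_ratio_

# running total of the fraction of variance explained
cumulative = np.cumsum(ratios)

# smallest number of components whose cumulative ratio reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(ratios[:5])  # the ratios fall sharply after the third component
print(n_for_90)
```

On real data the drop-off is usually less clean than in this constructed example, so the cut-off remains a judgment call.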

In [None]:
#enter solution here


In [None]:
# PCA with all components, pca = PCA()
# fit_transform(standardized_DNA_data_df)


# get pca.explained_variance_, store as explained_variance


In [None]:
# get pca.explained_variance_ratio_, store as explained_variance_ratio



In [None]:
# compute the sum of the first 245 entries of explained_variance_ratio


# the first 245 components contain 90% of the variance, a lot fewer than 73016



In [None]:

### plot of explained variance,  explained variance ratio, log(explained variance ratio) ###

n = len(explained_variance_ratio)

x = range(0,n)

fig, ax = plt.subplots(1,3)

fig.set_size_inches(20, 5)

ax[0].scatter(x,explained_variance)

ax[0].set_xlabel('component index')
ax[0].set_ylabel('Explained Variance')
ax[0].set_title('Scatter plot of Explained Variance')
ax[0].set_xticks(range(0, n, 100))

ax[1].scatter(x,explained_variance_ratio)
ax[1].set_xlabel('component index')
ax[1].set_ylabel('Explained Variance Ratio')
ax[1].set_title('Scatter plot of Explained Variance Ratio')
ax[1].set_xticks(range(0, n, 100))

import numpy as np
ax[2].scatter(x,np.log(explained_variance_ratio))
ax[2].set_xlabel('component index')
ax[2].set_ylabel('log(Explained Variance Ratio)')
ax[2].set_title('Scatter plot of log(Explained Variance Ratio)')
ax[2].set_xticks(range(0, n, 100))

plt.show()