In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment
from sklearn.model_selection import ShuffleSplit, train_test_split, cross_val_score
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error
from util.isoelectric_point import isoelectric_points

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We will train a model to predict drug resistance values from sequence.

This is the other general variant of supervised learning: instead of predicting a class "label" (classification), we predict a "number" or a "value". This is called "regression", as in linear regression. (Despite its name, logistic regression is a classification method.)
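
To make the distinction concrete, here is a minimal sketch on made-up data (not the HIV protease dataset). The arrays `X`, `y_label` and `y_value` are hypothetical; the point is only that a classifier returns class labels while a regressor returns real numbers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up data for illustration only.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y_label = (X[:, 0] > 0.5).astype(int)                      # a class label (0 or 1)
y_value = 10 * X[:, 0] + rng.normal(scale=0.1, size=200)   # a continuous value

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_label)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_value)

print(clf.predict(X[:5]))  # class labels
print(reg.predict(X[:5]))  # real-valued predictions
```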

In [None]:
# Load the sequence data as a Pandas dataframe.

# Parse the FASTA file once and reuse the records for both the IDs and the alignment.
sequences = list(SeqIO.parse('data/hiv-protease-sequences-expanded.fasta', 'fasta'))
seqids = [s.id for s in sequences]
sequences = MultipleSeqAlignment(sequences)
sequences = pd.DataFrame(np.array(sequences))
sequences.index = seqids
# Ensure that all of the letters are upper-case, otherwise the replace function in the next cell won't work.
for col in sequences.columns:
    sequences[col] = sequences[col].apply(lambda x: x.upper())
    sequences[col] = sequences[col].replace('*', np.nan)
sequences.head()

In [None]:
# Replace each amino acid letter with its isoelectric point, using the dictionary as a letter-to-value mapping.
seqdf = sequences.replace(isoelectric_points)
seqdf.head()

In [None]:
# Load the drug resistance values
dr_vals = pd.read_csv('data/hiv-protease-data-expanded.csv', index_col=0)
dr_vals.set_index('seqid', inplace=True)
dr_vals.head()

In [None]:
# Join the sequence data together with that of one drug of interest.
drug_name = 'FPV'

# Drop rows with NaN values, because scikit-learn algorithms are not designed to accept them.
data_matrix = seqdf.join(dr_vals[drug_name]).dropna()
data_matrix.head()

## Exercise

Practice what you've learned! Split the data into features, response variable, and then do a train/test split.
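
If you get stuck, the pattern can be sketched on a toy dataframe. Everything below is an assumption made for illustration: `toy` is random data shaped like the data matrix (99 feature columns plus one response column), and the 80/20 split is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the data matrix: columns 0..98 are features, 'FPV' is the response.
rng = np.random.RandomState(42)
toy = pd.DataFrame(rng.uniform(3, 11, size=(50, 99)))
toy['FPV'] = rng.lognormal(size=50)

feature_cols = [i for i in range(99)]
X = toy[feature_cols]   # features
Y = toy['FPV']          # response variable

# Hold out 20% of the rows for testing.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```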

In [None]:
# Your Answer
# Hint: to select a set of columns from a dataframe, use: dataframe[[columns]]
# Hint: the columns 0 to 98 can be expressed as a list comprehension: [i for i in range(99)]


## Exercise

Now, let's train the Random Forest Regressor on the data.
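
As a hint, the fit/predict pattern looks roughly like this. The data here are made up, and `n_estimators=100` is an assumed setting rather than a recommendation; swap in your own train/test split from the previous exercise.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Made-up data: only the first feature carries signal.
rng = np.random.RandomState(0)
X = rng.uniform(3, 11, size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)      # learn from the training set only
preds = model.predict(X_test)    # predict on held-out data
print(preds[:5])
```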

In [None]:
# Answer


Make a plot of the predictions (y-axis) against the actual values (x-axis).
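
One possible way to draw such a plot is sketched below. The `actual` and `predicted` arrays are made up for illustration; in the exercise they would be your test-set responses and model predictions. The dashed diagonal marks where a perfect prediction would fall.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values standing in for the test-set responses and predictions.
rng = np.random.RandomState(0)
actual = rng.lognormal(size=100)
predicted = actual + rng.normal(scale=0.3, size=100)

fig, ax = plt.subplots()
ax.scatter(actual, predicted, alpha=0.5)

# Reference line y = x: points on it are predicted exactly.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, color='gray', linestyle='--')

ax.set_xlabel('actual')
ax.set_ylabel('predicted')
```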

## Evaluating the Model

Just as with classification tasks, we also need metrics to help evaluate how good a trained model is, given the input features.

## Exercise

Look through the [`sklearn.metrics`][1] module. What might be a suitable metric to use?

Justify the use of two of them, and write the code that computes the evaluation metric.

[1]: http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics
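
For reference, two regression metrics from `sklearn.metrics` can be computed as in the sketch below, on four made-up values. Note that `r2_score` is the coefficient of determination, which is not the same quantity as the Pearson correlation coefficient; the latter is computed here with `np.corrcoef` for comparison.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Made-up true values and predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

r2 = r2_score(y_true, y_pred)                   # 1 - SS_res / SS_tot = 1 - 0.5 / 5.0 = 0.9
mse = mean_squared_error(y_true, y_pred)        # mean of squared residuals = 0.5 / 4 = 0.125
pearson_r = np.corrcoef(y_true, y_pred)[0, 1]   # Pearson correlation coefficient

print(r2, mse, pearson_r)
```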

In [None]:
# Metric 1: correlation coefficient.



In [None]:
# Metric 2: mean squared error



### Discussion

What does the distribution of values look like? Where is its skew? How could we tell?

What would be a better way of transforming the data prior to doing ML?
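
One way to check for skew, sketched on made-up log-normal values rather than the real resistance data: compare the mean with the median, and compute the sample skewness with `pandas.Series.skew`. A histogram (`values.hist()`) would show the same thing visually.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values standing in for the resistance values.
rng = np.random.RandomState(0)
values = pd.Series(rng.lognormal(mean=0, sigma=1.5, size=1000))

print(values.mean(), values.median())   # mean well above median suggests right skew
print(values.skew())                    # large positive skewness
print(np.log10(values).skew())          # close to zero after a log10 transform
```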

## Live Coding

Let's try log10-transforming the values to be predicted. 

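Note: the MSE goes down after the `log10` transform. However, that is exactly what we would expect by definition: the MSE is measured in squared units of the response, and the log-transformed values span a much smaller range, so MSE values computed on the two scales are not directly comparable.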
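
A small made-up example illustrates why. Here every prediction is off by exactly a factor of two, yet the MSE differs by orders of magnitude depending on the scale on which it is computed.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values: each prediction is twice the true value.
y_true = np.array([1.0, 10.0, 100.0, 1000.0])
y_pred = np.array([2.0, 20.0, 200.0, 2000.0])

mse_raw = mean_squared_error(y_true, y_pred)                      # dominated by the largest value
mse_log = mean_squared_error(np.log10(y_true), np.log10(y_pred))  # equals log10(2) ** 2

print(mse_raw, mse_log)
```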

## Challenge Exercise

Can you compare the following algorithms to see which one performs best?

- `RandomForestRegressor`
- `GradientBoostingRegressor`
- `AdaBoostRegressor`
- `ExtraTreesRegressor`
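
A possible starting point, assuming made-up data and default hyperparameters (both are placeholders): fit each regressor on the same training set and score it on the same test set, so the comparison is like for like.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              AdaBoostRegressor, ExtraTreesRegressor)
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: only the first feature carries signal.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    'RandomForestRegressor': RandomForestRegressor(random_state=0),
    'GradientBoostingRegressor': GradientBoostingRegressor(random_state=0),
    'AdaBoostRegressor': AdaBoostRegressor(random_state=0),
    'ExtraTreesRegressor': ExtraTreesRegressor(random_state=0),
}

# Same split for every model; lower MSE is better.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))

for name, mse in sorted(scores.items(), key=lambda item: item[1]):
    print(name, mse)
```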

## Live Coding: Statistical Practices

Feel free to type along!

Statistical good practices:

1. Train/test split
2. Shuffle your data to break up any inadvertent structure in the dataset.
3. Compare different models in a systematic way.
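
The three practices above can be combined with `ShuffleSplit` and `cross_val_score`. The sketch below uses made-up data, and the five splits, 20% test size, and choice of two models are assumptions for illustration. Passing the same `cv` object with a fixed `random_state` gives every model the same shuffled train/test splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Made-up data: only the first feature carries signal.
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Five shuffled train/test splits, each holding out 20% of the rows.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

models = {
    'RandomForestRegressor': RandomForestRegressor(n_estimators=50, random_state=0),
    'ExtraTreesRegressor': ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# One R^2 score per split and per model.
cv_scores = {name: cross_val_score(model, X, y, cv=cv, scoring='r2')
             for name, model in models.items()}

for name, s in cv_scores.items():
    print(name, s.mean(), s.std())
```

Reporting the mean and spread across splits, rather than a single score, makes it easier to judge whether a difference between models is larger than the split-to-split variation.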