# Python Machine Learning in Biology
# Preprocessing

"Garbage in, garbage out" applies to machine learning models as well. The quality of the data we have and the amount of useful data it contains determines how much the machine learning algorithm can tell us about patterns in the data. So, before we feed it into our model, we need to examine and preprocess the dataset.

We'll cover:
* Dealing with missing data (removing and imputing missing values)
* Converting categorical data to a format a machine learning model can understand
* Scaling features

## Dealing with missing data

Why might our dataset be missing data?  

Most of our computational tools won't be able to handle missing data, so we'll need to deal with it.

Missing data is usually represented in the dataset as a blank space or as a NaN (not a number) placeholder.

#### Let's create a fake dataset so we can learn how to deal with missing values
`StringIO` lets us read in a string as a dataframe as if it were a regular csv we imported. 

In [2]:
missing_data = '''1.0, 2.0, 3.0, 4.0
5.0, 6.0,,8.0
10.0, 11.0, 12.0,'''
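One way to turn this string into a DataFrame is sketched below. It assumes the data has no header row, so the column names `A` to `D` are made up for illustration, and it uses `skipinitialspace=True` so the spaces after the commas are ignored.

```python
from io import StringIO

import pandas as pd

missing_data = '''1.0, 2.0, 3.0, 4.0
5.0, 6.0,,8.0
10.0, 11.0, 12.0,'''

# StringIO wraps the string in a file-like object that read_csv can consume.
# The column names are hypothetical; the string itself has no header row.
df = pd.read_csv(
    StringIO(missing_data),
    header=None,
    names=['A', 'B', 'C', 'D'],
    skipinitialspace=True,
)
print(df)
```

The empty fields in the second and third rows should come through as `NaN`.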

Even though we can see our missing values here, searching manually through a larger dataset would take a long time. 

#### Let's figure out how many missing values each column has
We can use the `.isnull()` method to get a DataFrame of Booleans indicating whether each value is missing. Then we can use the `.sum()` method to count the missing values in each column.
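A minimal sketch of that count, using a small stand-in DataFrame with the same missing cells as our fake dataset (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Stand-in for the fake dataset: one NaN in column C, one in column D.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

missing_mask = df.isnull()               # True wherever a value is missing
missing_per_column = missing_mask.sum()  # True counts as 1, so this counts NaNs per column
print(missing_per_column)
```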

### Removing missing samples

An easy way to handle missing data is to just remove it. We can remove the column (feature) containing the missing value, or we can remove the row (sample) from the dataset.

#### Drop rows with any missing values using `.dropna()`

#### Drop columns with any missing values using `.dropna()`
"axis = 0" means row and "axis = 1" means column. For this method, row is the default. I usually remember that columns are vertical, and so is the number "1".

Notice we didn't actually affect the original dataframe. (We would need to save it as a new variable or add an "inplace=True" argument)
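The two calls can be sketched like this, again on a small stand-in DataFrame with hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

rows_dropped = df.dropna()        # axis=0 is the default: drop rows containing any NaN
cols_dropped = df.dropna(axis=1)  # drop columns containing any NaN

print(rows_dropped)   # only the first row is complete
print(cols_dropped)   # only columns A and B are complete
print(df.shape)       # the original DataFrame is unchanged
```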

#### `dropna` can drop rows where all columns are NaN

#### Drop rows that have fewer than 4 non-NaN values (`thresh`)

#### Only drop rows where NaN appears in specific columns (`subset`)
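A short sketch of these three `dropna` variants on a stand-in DataFrame (hypothetical column names; choosing column `C` for `subset` is just an example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# Drop rows only if every column is NaN (no such row here, so nothing is dropped).
all_nan_dropped = df.dropna(how='all')

# Keep only rows with at least 4 non-NaN values.
thresh_dropped = df.dropna(thresh=4)

# Drop rows only where NaN appears in column C.
subset_dropped = df.dropna(subset=['C'])
```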

Dropping missing data isn't always the best idea. Why? 

### Imputing missing values

A commonly used alternative to dropping missing data is imputing the missing values. This means using the other values in that same column to estimate the missing value.   

A common type of imputation is **mean imputation**, where we use the mean of the other values in that column (same feature) to fill in the blank.  

There are other types of imputation (like using clustering methods). We won't go into them here, but know that they exist and that each has its own pros and cons. 

#### Use scikit-learn's Imputer class to do mean imputation

*Note: `Imputer` was removed in scikit-learn 0.22. Its replacement is `SimpleImputer` in `sklearn.impute`.*

The basic steps in using the `Imputer` class (a transformer class; we'll see other transformers that we'll use for data transformation) are:
1. instantiate the class
2. fit the data (learn the parameters from the training data; only use `fit` on training data)
3. transform the data (use those parameters to transform the data)

*For some reason, `axis=0` for the `Imputer` class means columns. CONFUSING. (`SimpleImputer` has no `axis` argument and always imputes column by column.)*  
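The three steps can be sketched as follows. This assumes a recent scikit-learn, where the class is `SimpleImputer` from `sklearn.impute` rather than the older `Imputer`; the stand-in data and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# 1. Instantiate the transformer.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# 2. Fit: learn the mean of each column from the (training) data.
imputer.fit(df.values)

# 3. Transform: replace each NaN with the learned mean of its column.
imputed = imputer.transform(df.values)
print(imputed)
```

Here the missing value in column `C` becomes the mean of 3 and 12, and the one in column `D` becomes the mean of 4 and 8.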

<img src="assets/transform.png"/>

*Side note: scikit-learn can usually handle DataFrames, but it's built on `NumPy` (a numerical computing library). `dataframe.values` gives us the NumPy array representation of our DataFrame.*

## Handling categorical data

What are some examples of categorical data?  
What is the difference between ordinal and nominal data?

#### Let's create some more dataframes to learn to deal with categorical data

In [30]:
category = pd.DataFrame([
    ['blue', 'S', 13.2, 'class1'],
    ['red', 'XL', 3.4, 'class2'],
    ['green', 'M', 8.7, 'class1']
])

In [31]:
category.columns = ['color', 'size', 'price', 'class_label']

Which of our features are nominal? numerical? ordinal?

### Mapping ordinal features

Our learning algorithms will need ordinal features to be numerical to interpret them correctly, so we'll need to convert these into integers. There's not a convenient built-in feature to do this for us (like there is for nominal features), so we'll have to do it manually. 

In [33]:
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1,
    'S': 0
}
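One way to apply this mapping is the pandas `.map` method, sketched here on the same example DataFrame. The reverse mapping is an extra, in case we want the original labels back.

```python
import pandas as pd

category = pd.DataFrame(
    [['blue', 'S', 13.2, 'class1'],
     ['red', 'XL', 3.4, 'class2'],
     ['green', 'M', 8.7, 'class1']],
    columns=['color', 'size', 'price', 'class_label'],
)

size_mapping = {'XL': 3, 'L': 2, 'M': 1, 'S': 0}

# Replace each size string with its integer rank.
category['size'] = category['size'].map(size_mapping)

# Build the reverse dictionary to recover the original strings.
inverse_size_mapping = {v: k for k, v in size_mapping.items()}
original_sizes = category['size'].map(inverse_size_mapping)
```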

### Encoding class labels

Many algorithms want class labels to be encoded as integer labels. Many of them will do the work for you, but it's always a good idea to handle it yourself before feeding them in.   

We can use the same mapping technique that we did for the ordinal variables, but scikit-learn has a helpful `LabelEncoder` class that can do it for us.

#### Use `LabelEncoder` to convert class labels to integers

Remember fit and transform? The `fit_transform` method combines them. (When might you want to separate them?)

#### Inverse transform
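A minimal sketch of both directions with `LabelEncoder`, using the class labels from our example DataFrame:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

class_labels = pd.Series(['class1', 'class2', 'class1'], name='class_label')

class_le = LabelEncoder()

# fit_transform learns the unique labels and converts them to integers in one call.
y = class_le.fit_transform(class_labels.values)

# inverse_transform maps the integers back to the original strings.
recovered = class_le.inverse_transform(y)

print(class_le.classes_)  # the learned labels, in the order of their integer codes
```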

### Perform one-hot encoding on nominal features

We can use our mapping technique or our `LabelEncoder` technique on the color feature, but what problems do we run into? (hint: how is nominal different from ordinal?)

#### Use the pandas `get_dummies` function to do one-hot encoding 
We are going to create dummy binary features for each unique value.
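A sketch of one-hot encoding the `color` column with `pd.get_dummies`. Passing `columns=['color']` is a choice made here so that only that column is encoded and the others are left alone.

```python
import pandas as pd

category = pd.DataFrame(
    [['blue', 'S', 13.2, 'class1'],
     ['red', 'XL', 3.4, 'class2'],
     ['green', 'M', 8.7, 'class1']],
    columns=['color', 'size', 'price', 'class_label'],
)

# Each unique color becomes its own binary column: color_blue, color_green, color_red.
dummies = pd.get_dummies(category, columns=['color'])
print(dummies)
```

Depending on the pandas version, the dummy columns hold Booleans or 0/1 integers.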

### Partition dataset into training and test sets

Why do we split our data into training and test sets?

#### Read in the wine dataset

#### Get features and store as a NumPy array (X)

#### Get response (target) and store as a NumPy array (y)

#### Randomly split X and y into training and test datasets
We'll have 30% as the test set, 70% for the training set.
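The steps above can be sketched as follows. We don't have the workshop's wine file here, so this assumes scikit-learn's bundled wine dataset (`load_wine`) as a stand-in; `random_state=0` is an arbitrary choice to make the split repeatable.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Stand-in for reading the wine file: X holds the features, y the class labels,
# both as NumPy arrays.
X, y = load_wine(return_X_y=True)

# Hold out 30% of the samples as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```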

Why is our test set smaller than our training set? (Common train:test splits are 60:40, 70:30, or 80:20. With bigger datasets you can get away with 90:10 or 99:1.)

### Scaling features

Decision trees and Random Forests are some of the few algorithms where you don't need to worry about feature scaling. But most algorithms will perform better when our features are on the same scale.  

For example, if we have one feature that goes from 1 to 10 and another that goes from 1 to 100,000, the gradient descent algorithm will spend most of its time working on the larger errors of the feature with the larger scale. 

There are two common approaches to scaling: 
* **normalization**: rescaling features to the range [0, 1] (a special case of min-max scaling) 
  <img src="assets/normalization.png" alt="normalization" style="width: 200px;"/>
* **standardization**: centering features at 0 with a standard deviation of 1  
  <img src="assets/standardization.png" alt="standardization" style="width: 200px;"/>



#### Normalizing features
Useful for algorithms that need a bounded interval.
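A sketch of normalization with scikit-learn's `MinMaxScaler`, assuming the bundled wine dataset as a stand-in for our wine data:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

mms = MinMaxScaler()

# Learn each feature's min and max from the training data, then rescale it to [0, 1].
X_train_norm = mms.fit_transform(X_train)

# Reuse the training min and max on the test data; we do not refit here.
X_test_norm = mms.transform(X_test)
```

Because the test data is scaled with the training minimum and maximum, some test values can fall slightly outside [0, 1].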

Why did we only call `transform` (and not `fit`) on the test dataset?

#### Standardizing features
Useful for many linear models (like SVM and logistic regression) because it makes it easier for them to learn the weights. Standardization is also less sensitive to outliers than normalization.
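A sketch of standardization with `StandardScaler`, again assuming the bundled wine dataset as a stand-in:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

stdsc = StandardScaler()

# Learn each feature's mean and standard deviation from the training data,
# then subtract the mean and divide by the standard deviation.
X_train_std = stdsc.fit_transform(X_train)

# Apply the training mean and standard deviation to the test data.
X_test_std = stdsc.transform(X_test)
```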

# Python Machine Learning In Biology:
# Support Vector Machines

We'll build an SVM using the `cancer.csv` dataset. 

#### Import modules

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder

#### Read in dataset 

#### Store features as "X"

#### Store response as "y" and encode them as numbers

#### Split dataset into training set and test set

#### Create an SVM Classifier with a linear kernel

#### Train the model using the training sets

#### Predict the response for test dataset

#### Evaluate the Model

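Model Accuracy: how often is the classifier correct?

Model Precision (AKA positive predictive value): of the samples predicted positive, how many are actually positive?  
Model Recall (AKA sensitivity): of the samples that are actually positive, how many did the classifier find? 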

*We'll talk more in depth about these when we talk about evaluation metrics*
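The whole workflow can be sketched as below. Since `cancer.csv` isn't included here, this assumes scikit-learn's bundled breast cancer dataset as a stand-in and rebuilds string labels from it so that the encoding step has something to do. The 70:30 split and `random_state=0` are arbitrary choices.

```python
from sklearn import metrics, svm
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Stand-in for reading cancer.csv.
data = load_breast_cancer()
X = data.data
labels = data.target_names[data.target]  # strings: 'malignant' or 'benign'

# Encode the response as numbers (alphabetical: benign -> 0, malignant -> 1).
le = LabelEncoder()
y = le.fit_transform(labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = svm.SVC(kernel='linear')  # SVM classifier with a linear kernel
clf.fit(X_train, y_train)       # train on the training set
y_pred = clf.predict(X_test)    # predict the response for the test set

accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)  # positive class is 1 (malignant)
recall = metrics.recall_score(y_test, y_pred)
print(accuracy, precision, recall)
```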

<img src = "assets/precisionrecall.png"/>

# Python Machine Learning for Biology
# Hyperparameter Tuning

What is a hyperparameter?    

We'll go over some best practices for building machine learning models by fine-tuning hyperparameters and evaluating model performance.  

We'll cover:  
* Cross-Validation: Getting unbiased estimates of model performance
* Learning and Validation Curves: Diagnosing common problems
* GridSearch: Fine-tuning machine learning algorithms
* Evaluating models using different performance metrics

### Perform a logistic regression on the cancer dataset
1. import the cancer dataset
2. create X and y variables
3. encode categorical variables
4. split data into testing and training datasets (80:20)
5. standardize the data
6. perform a logistic regression
7. report the accuracy score

In [210]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

(Side note: we can figure out what it labeled each class of tumor)
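Here is one possible sketch of the seven steps. It assumes scikit-learn's bundled breast cancer dataset as a stand-in for the cancer file, and `random_state=1` is an arbitrary choice. `le.classes_` shows which integer each tumor class was given.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Steps 1-2: load the data and create X and y (stand-in dataset).
data = load_breast_cancer()
X = data.data
labels = data.target_names[data.target]

# Step 3: encode the class labels as integers.
le = LabelEncoder()
y = le.fit_transform(labels)
print(le.classes_)  # position in this array = integer code

# Step 4: 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Step 5: standardize, fitting the scaler on the training data only.
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

# Steps 6-7: fit the logistic regression and report test accuracy.
lr = LogisticRegression()
lr.fit(X_train_std, y_train)
accuracy = metrics.accuracy_score(y_test, lr.predict(X_test_std))
print(accuracy)
```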

## Cross Validation

*Let's review*
* Why don't we evaluate using our training data?
* What is overfitting? 
* What is underfitting?  
* What are the drawbacks of train/test/split?

Two techniques to try to figure out our model's generalization error are **holdout validation** and **k-fold cross validation.** 

### Holdout validation (AKA Train/Test/Split)

We've been doing holdout validation, where we separate the dataset into training and testing datasets. But if we do lots of **model selection**, that is, tune our hyperparameters to see which give us the best model, we end up reusing that same test dataset over and over again. The test set then effectively becomes part of the training data, and the model is likely to overfit to it.  

A better way of using the holdout method is to divide the dataset into three parts: a training set, a validation set, and a test set. Use the training set to fit the model, use the validation set to compare performance among different models, and use the test set to test model generalizability. This gives a much less biased estimate because the model has never seen the test data before.  

<img src="assets/traintestsplit.png"/>

A disadvantage of this method is that it is sensitive to how we divide up the data. 

*But what if we created a bunch of train/test/splits, calculated the test accuracy for each, and averaged these?* That is the essence of **k-fold cross validation.**

### K-fold Cross Validation

1. Split the data into *k* sets (folds) without replacement. 
2. Use *k-1* sets on model training and use 1 for model testing. 
3. Repeat *k* times, using a different set for the testing set each time. We'll have *k* models and *k* performance estimates.  

Then we can calculate the average performance of the model based on the *k* folds so we have a performance estimate that is less sensitive to how we sliced and diced the data. 

The standard value of *k* that people use is 10 (it has been shown in experiments to give a good estimate of out-of-sample accuracy). It's a good idea to use a larger *k* if you are working with a smaller dataset (the higher your *k*, the lower the bias of the generalization estimate). Larger values of *k* will have a slower runtime.  

<img src="assets/kfolds.png"/>

**Stratified k-fold cross validation** gives even better bias and variance estimates, especially if you have really unequal class proportions. This method preserves the class proportions in each fold. `cross_val_score` does this by default when the estimator is a classifier and `cv` is an integer.

***Train/test/split may still be the better option if you need speed***

#### Perform a stratified k-fold cross validation on the cancer dataset
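A sketch of stratified 10-fold cross-validation, assuming the bundled breast cancer dataset as a stand-in and a standardize-then-logistic-regression pipeline as the model. Putting the scaler inside the pipeline means it is refit on the training folds each time.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe_lr = make_pipeline(StandardScaler(), LogisticRegression())

# Explicit stratified splitter; each fold keeps the overall class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

scores = cross_val_score(pipe_lr, X, y, cv=cv, scoring='accuracy')
print(scores)
print(scores.mean(), scores.std())
```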

## Grid Search: fine-tuning models

*Review:* 
Which parameters does the machine learning model "learn"? Which are parameters we have to tune?  

Validation curves help us figure out an optimal value of one hyperparameter. Grid search helps us find optimal combinations of hyperparameters.

This will be more efficient than the for loops we were using when trying to find the best K for KNN.

#### Perform a grid search on an SVM of our cancer data

In [147]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC 
from sklearn.pipeline import Pipeline

In [150]:
param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]
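A fuller sketch of the grid search around this parameter grid. The values in `param_range`, the pipeline step name `scl`, the bundled breast cancer stand-in dataset, and `cv=5` are all assumptions made for illustration; the `clf__` prefix has to match the name of the classifier step in the pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Scale, then classify; the step names become prefixes in the parameter grid.
pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

param_range = [0.01, 0.1, 1.0, 10.0]  # hypothetical values to try
param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring='accuracy', cv=5)
gs.fit(X_train, y_train)

print(gs.best_score_)   # mean cross-validated accuracy of the best combination
print(gs.best_params_)  # the winning hyperparameters

# The best model is refit on all training data and can be scored on the test set.
test_accuracy = gs.best_estimator_.score(X_test, y_test)
```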

### Nested cross-validation

Earlier we combined k-fold cross validation and grid search to fine-tune our hyperparameters. A better way to do this is with **nested cross-validation.**  

**Nested cross-validation** is when we have an outer k-fold cross-validation loop to split the data into training and testing folds and an inner loop used to select a model using k-fold cross-validation on the training fold. After model selection, we evaluate model performance on our test fold. 

<img src="assets/nestedcv.png"/>

#### Nested cross-validation on our cancer dataset with an SVM (This is a 5x2 cross-validation)
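A sketch of 5x2 nested cross-validation: `GridSearchCV` with `cv=2` is the inner loop, and `cross_val_score` with `cv=5` is the outer loop. The dataset stand-in, `param_range` values, and pipeline step names are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

param_range = [0.01, 0.1, 1.0, 10.0]  # hypothetical values to try
param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']},
              {'clf__C': param_range, 'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

# Inner loop: 2-fold grid search selects hyperparameters on each training fold.
gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,
                  scoring='accuracy', cv=2)

# Outer loop: 5-fold cross-validation scores the tuned model on each test fold.
svm_scores = cross_val_score(gs, X, y, scoring='accuracy', cv=5)
print(svm_scores.mean(), svm_scores.std())
```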

#### Use nested cross-validation to compare SVM to another algorithm

In [171]:
from sklearn.tree import DecisionTreeClassifier

gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                  scoring='accuracy',
                  cv=5)
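To finish the comparison, this grid search can be wrapped in an outer cross-validation loop, sketched below. The outer `cv=5` and the bundled breast cancer stand-in dataset are assumptions; the mean score can then be set beside the SVM's nested cross-validation score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: grid search over tree depth.
gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
                  scoring='accuracy',
                  cv=5)

# Outer loop: score the tuned tree on folds it was not tuned on.
tree_scores = cross_val_score(gs, X, y, scoring='accuracy', cv=5)
print(tree_scores.mean(), tree_scores.std())
```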

# Independent Exercises

### Preprocessing Data
Find one of the datasets in the dirty folder and run through cleaning it up. OR find some of your own data to clean up.
* Deal with missing data (removing and imputing missing values)
* Convert categorical data to a format a machine learning model can understand
* Scale features

### Iris Dataset SVM Practice

Compare SVMs with different kernels on the iris data.
* Gaussian
* Linear

### Hyperparameter tuning for Iris Dataset
Select the best hyperparameters (K) for a KNN of the iris dataset using stratified cross validation scores.

**Bonus** Compare the best K of KNN to a Logistic Regression for the iris dataset to see which model performs better (with stratified cross validation).

### Grid Search Independent Exercise
Perform a gridsearch to find the best K for a KNN of the iris dataset