# Python Machine Learning in Biology
# Preprocessing

"Garbage in, garbage out" applies to machine learning models as well. The quality of our data and the amount of useful information it contains determine how much a machine learning algorithm can tell us about patterns in the data. So, before we feed a dataset into our model, we need to examine and preprocess it.

We'll cover:
* Dealing with missing data (removing and imputing missing values)
* Converting categorical data to a format a machine learning model can understand
* Scaling features

## Dealing with missing data

Why might our dataset be missing data?  

Most of our computational tools won't be able to handle missing data, so we'll need to deal with it.

Missing data is usually represented in the dataset as a blank space or as a placeholder such as NaN (not a number).

#### Let's create a fake dataset so we can learn how to deal with missing values
`StringIO` lets us read a string into a DataFrame as if it were a regular CSV file we imported.

In [1]:
import pandas as pd
from io import StringIO

In [2]:
missing_data = '''1.0, 2.0, 3.0, 4.0
5.0, 6.0,,8.0
10.0, 11.0, 12.0,'''

In [3]:
missing = pd.read_csv(StringIO(missing_data), header = None)

In [4]:
missing

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [5]:
missing.columns = ['A', 'B', 'C', 'D']

Even though we can see our missing values here, searching through a larger dataset manually would take a long time.

#### Let's figure out how many missing values each column has
We can use the `.isnull()` method to get a DataFrame of Booleans indicating whether each value is missing. Then we can use the `.sum()` method to count how many missing values are in each column.

In [8]:
missing.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
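For larger datasets it can also help to look at the fraction of missing values per column and to pull out the rows that contain any gap. Here is a minimal sketch that rebuilds the same toy DataFrame so it runs on its own; the names `frac_missing` and `rows_with_gaps` are just illustrative.

```python
import pandas as pd
from io import StringIO

# Same toy data as above, rebuilt so this block is self-contained
missing_data = '''1.0, 2.0, 3.0, 4.0
5.0, 6.0,,8.0
10.0, 11.0, 12.0,'''
missing = pd.read_csv(StringIO(missing_data), header=None)
missing.columns = ['A', 'B', 'C', 'D']

# The mean of a Boolean column is the fraction of True values,
# i.e. the fraction of missing entries in that column
frac_missing = missing.isnull().mean()

# Keep only the rows that have at least one missing value
rows_with_gaps = missing[missing.isnull().any(axis=1)]

print(frac_missing)
print(rows_with_gaps)
```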

### Removing missing samples

An easy way to handle missing data is to just remove it. We can remove the column (feature) containing the missing value, or we can remove the row (sample) from the dataset.

#### Drop rows with any missing values using `.dropna()`

In [9]:
missing.dropna()

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


#### Drop columns with any missing values using `.dropna()`
"axis = 0" means row and "axis = 1" means column. For this method, row is the default, so we pass "axis = 1" to drop columns. I usually remember that columns are vertical, and so is the number "1".

In [12]:
missing.dropna(axis = 1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [13]:
missing

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


Notice we didn't actually affect the original DataFrame. (We would need to save the result as a new variable or pass the `inplace=True` argument.)

In [17]:
missing

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,
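To keep the result of `.dropna()`, we can assign it to a new variable. This is a small sketch on the same toy values, built directly with NumPy's `np.nan` so it runs on its own; `complete_rows` and `complete_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

missing = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# dropna returns a new DataFrame; the original is left untouched
complete_rows = missing.dropna()        # axis=0 (default): drop rows with any NaN
complete_cols = missing.dropna(axis=1)  # axis=1: drop columns with any NaN

print(complete_rows.shape)
print(complete_cols.shape)
print(missing.shape)
```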


#### `dropna` can drop rows where all columns are NaN

In [14]:
missing.dropna(how = 'all')

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


#### drop rows that have fewer than 3 non-NaN values (threshold)

In [19]:
missing.dropna(thresh = 3)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


#### only drop rows where NaN appears in specific columns

In [16]:
missing.dropna(subset = ['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


Dropping missing data isn't always the best idea. Why? 

### Imputing missing values

A commonly used alternative to dropping missing data is imputing the missing values. This means using the other values in that same column to estimate the missing value.   

A common type of imputation is **mean imputation**, where we use the mean of the other values in that column (same feature) to fill in the blank.  

There are other types of imputation (like using clustering methods), but we won't go into them here. Just know that they exist and that each has its own pros and cons. 
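To make the idea concrete, mean imputation can be sketched by hand with pandas before reaching for scikit-learn. This assumes the same toy values used earlier; `filled` is an illustrative name.

```python
import numpy as np
import pandas as pd

missing = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# missing.mean() computes each column's mean while skipping NaN;
# fillna then replaces every NaN with the mean of its own column
filled = missing.fillna(missing.mean())

print(filled)
```

Column C is filled with (3 + 12) / 2 = 7.5 and column D with (4 + 8) / 2 = 6.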

#### Use scikit-learn's Imputer class to do mean imputation

The basic steps in using the `Imputer` class (which is a transformer class; we'll see some other ones that we'll use for data transformation) are:
1. instantiate the class
2. fit the data (learn the parameters from the training data; only fit on training data)
3. transform the data (use those parameters to transform the data)

*for some reason axis = 0 for this class means columns. CONFUSING*  

Check out the slides for how fit and transform work.

In [20]:
from sklearn.preprocessing import Imputer

In [21]:
imr = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)

In [22]:
imr = imr.fit(missing)

In [23]:
imputed_data = imr.transform(missing)

In [24]:
imputed_data

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

*Side note: scikit-learn can usually handle DataFrames, but it's built on `NumPy` (an array and linear algebra library). `dataframe.values` gives us the NumPy array representation of our DataFrame.*
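In newer scikit-learn releases the `Imputer` class is no longer available (it was deprecated in version 0.20 and removed in 0.22). If the import above fails, `SimpleImputer` from `sklearn.impute` covers the same use case. It always works column by column, so there is no `axis` argument. Here is a sketch of the same mean imputation, assuming a recent scikit-learn and the same toy values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

missing = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [10.0, 11.0, 12.0, np.nan]],
    columns=['A', 'B', 'C', 'D'],
)

# 1. instantiate: NaN marks the gaps, fill with the column mean
imr = SimpleImputer(missing_values=np.nan, strategy='mean')

# 2. fit: learn each column's mean
imr.fit(missing)

# 3. transform: fill the gaps with the learned means (returns a NumPy array)
imputed_data = imr.transform(missing)

print(imr.statistics_)  # the per-column means learned in step 2
print(imputed_data)
```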

## Handling categorical data

What are some examples of categorical data?  
What is the difference between ordinal and nominal data?

#### Let's create some more dataframes to learn to deal with categorical data

In [25]:
category = pd.DataFrame([
    ['blue', 'S', 13.2, 'class1'],
    ['red', 'XL', 3.4, 'class2'],
    ['green', 'M', 8.7, 'class1']
])

In [26]:
category.columns = ['color', 'size', 'price', 'class_label']

In [27]:
category

Unnamed: 0,color,size,price,class_label
0,blue,S,13.2,class1
1,red,XL,3.4,class2
2,green,M,8.7,class1


Which of our features are nominal? numerical? ordinal?

### Mapping ordinal features

Our learning algorithms will need ordinal features to be numerical to interpret them correctly, so we'll need to convert these into integers. There's not a convenient built-in feature to do this for us (like there is for nominal features), so we'll have to do it manually. 

In [28]:
size_mapping = {
    'XL': 3,
    'L' : 2,
    'M' : 1,
    'S' : 0
}

In [29]:
category['size'] = category['size'].map(size_mapping)

In [30]:
category

Unnamed: 0,color,size,price,class_label
0,blue,0,13.2,class1
1,red,3,3.4,class2
2,green,1,8.7,class1
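If we later want the original size labels back, one option is to invert the dictionary and map again. This is a small sketch on a standalone Series; `inv_size_mapping`, `encoded` and `decoded` are illustrative names.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1, 'S': 0}

# Swap keys and values to get an integer -> label lookup
inv_size_mapping = {v: k for k, v in size_mapping.items()}

sizes = pd.Series(['S', 'XL', 'M'])
encoded = sizes.map(size_mapping)        # labels -> integers
decoded = encoded.map(inv_size_mapping)  # integers -> labels

print(encoded.tolist())
print(decoded.tolist())
```

Note that `.map` returns NaN for any value that is not a key in the dictionary, so it is worth checking for missing values after mapping.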


### Encoding class labels

Many algorithms want class labels to be encoded as integer labels. Many of them will do the work for you, but it's always a good idea to handle it yourself before feeding them in.   

We can use the same mapping technique that we did for the ordinal variables, but scikit-learn has a helpful `LabelEncoder` class that can do it for us.

#### Use `LabelEncoder` to convert class labels to integers

In [31]:
from sklearn.preprocessing import LabelEncoder

In [32]:
class_le = LabelEncoder()

In [33]:
y = class_le.fit_transform(category['class_label'])

In [34]:
y

array([0, 1, 0])

Remember fit and transform? This method combines them. (When might you want to separate them?)
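One situation where the two steps are kept separate is when labels arrive in two batches: we fit on the first batch and later reuse the same encoding on the second. Here is a hedged sketch with made-up label lists (`train_labels` and `new_labels` are hypothetical).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label lists, for illustration only
train_labels = ['class1', 'class2', 'class1']
new_labels = ['class2', 'class2', 'class1']

le = LabelEncoder()
le.fit(train_labels)                      # learn the label -> integer mapping once
encoded_new = le.transform(new_labels)    # reuse that mapping on new labels

print(le.classes_)   # the labels in sorted order; the position is the integer code
print(encoded_new)
```

Calling `transform` with a label that was not seen during `fit` raises an error, which is a useful signal that the new data contains an unexpected class.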

#### Inverse transform

In [35]:
class_le.inverse_transform(y)

array(['class1', 'class2', 'class1'], dtype=object)

### Perform one-hot encoding on nominal features

We can use our mapping technique or our `LabelEncoder` technique on the color feature, but what problems do we run into? (hint: how is nominal different than ordinal?)

#### Use the pandas `get_dummies` function to do one-hot encoding
We are going to create a dummy binary feature for each unique value. Only the string column (color) is converted; the numerical columns are left as they are.

In [37]:
pd.get_dummies(category[['color', 'size', 'price']])

Unnamed: 0,size,price,color_blue,color_green,color_red
0,0,13.2,1,0,0
1,3,3.4,0,0,1
2,1,8.7,0,1,0


### Partition dataset into training and test sets

Why do we split our data into training and test sets?

#### Read in the wine dataset

In [38]:
wine = pd.read_csv("wine.csv")

In [39]:
wine.head()

Unnamed: 0,cultivar,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


#### Get features and store as a NumPy array (X)

In [46]:
X = wine.iloc[:, 1:].values
X

array([[  1.42300000e+01,   1.71000000e+00,   2.43000000e+00, ...,
          1.04000000e+00,   3.92000000e+00,   1.06500000e+03],
       [  1.32000000e+01,   1.78000000e+00,   2.14000000e+00, ...,
          1.05000000e+00,   3.40000000e+00,   1.05000000e+03],
       [  1.31600000e+01,   2.36000000e+00,   2.67000000e+00, ...,
          1.03000000e+00,   3.17000000e+00,   1.18500000e+03],
       ..., 
       [  1.32700000e+01,   4.28000000e+00,   2.26000000e+00, ...,
          5.90000000e-01,   1.56000000e+00,   8.35000000e+02],
       [  1.31700000e+01,   2.59000000e+00,   2.37000000e+00, ...,
          6.00000000e-01,   1.62000000e+00,   8.40000000e+02],
       [  1.41300000e+01,   4.10000000e+00,   2.74000000e+00, ...,
          6.10000000e-01,   1.60000000e+00,   5.60000000e+02]])

#### Get response (target) and store as a NumPy array (y)

In [43]:
y = wine.iloc[:, 0].values
y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

#### Randomly split X and y into training and test datasets
We'll have 30% as the test set, 70% for the training set.

In [47]:
from sklearn.model_selection import train_test_split

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)

Why is our test set smaller than our training set? (Common train:test splits are 60:40, 70:30 or 80:20. With bigger datasets you can get away with 90:10 or 99:1.)
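To check what a 70:30 split does to the shapes, here is a sketch using random placeholder data with the same dimensions as the wine features (178 samples, 13 features), so it runs without `wine.csv`. The placeholder arrays are an assumption made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the wine dataset's shape (not the real values)
rng = np.random.RandomState(0)
X = rng.rand(178, 13)
y = np.repeat([1, 2, 3], [59, 71, 48])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Rows are split between the two sets; the number of columns is unchanged
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

Setting `random_state` makes the shuffle reproducible, so rerunning the cell gives the same split.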

### Scaling features

Decision trees and Random Forests are some of the few algorithms where you don't need to worry about feature scaling. But most algorithms will perform better when our features are on the same scale.  

For example, if we have one feature that goes from 1 to 10 and another that goes from 1 to 100,000, the gradient descent algorithm will spend most of its time working on the larger errors of the feature with the larger scale. 

There are two common approaches to scaling: 
* **normalization**: rescaling features to the range [0, 1] (a special case of min-max scaling) 
* **standardization**: centering features at 0 with a standard deviation of 1  

Check out the formulas for these on the slides.
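As a rough sketch of what the two approaches compute, here they are written out by hand with NumPy for a single made-up feature `x`. The scikit-learn scalers used below apply the same arithmetic to every column.

```python
import numpy as np

# A single made-up feature, for illustration only
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Normalization (min-max scaling): subtract the minimum, divide by the range
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation
# (np.std uses the population standard deviation by default)
x_std = (x - x.mean()) / x.std()

print(x_norm)
print(x_std)
```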

#### Normalizing features
Useful for algorithms that need a bounded interval.

In [49]:
from sklearn.preprocessing import MinMaxScaler

In [50]:
mms = MinMaxScaler()

In [51]:
X_train_norm = mms.fit_transform(X_train)

In [52]:
X_test_norm = mms.transform(X_test)

In [53]:
X_train_norm

array([[ 0.72043011,  0.20378151,  0.53763441, ...,  0.48717949,
         1.        ,  0.5854251 ],
       [ 0.31989247,  0.08403361,  0.31182796, ...,  0.27350427,
         0.64102564,  0.        ],
       [ 0.60215054,  0.71218487,  0.48387097, ...,  0.04273504,
         0.10622711,  0.42348178],
       ..., 
       [ 0.37365591,  0.1512605 ,  0.44623656, ...,  0.44444444,
         0.61904762,  0.02672065],
       [ 0.77150538,  0.16596639,  0.40860215, ...,  0.31623932,
         0.75457875,  0.54493927],
       [ 0.84139785,  0.34033613,  0.60215054, ...,  0.06837607,
         0.16117216,  0.28178138]])

Why did we only call `transform` (and not `fit_transform`) on the test dataset?
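A tiny made-up example can help when thinking about this. The scaler learns the minimum and maximum from the training data only, then applies those same numbers to the test data. The arrays below are hypothetical and chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature datasets
X_train = np.array([[1.0], [5.0], [9.0]])
X_test = np.array([[0.0], [11.0]])

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)  # learns min = 1 and max = 9 from training data
X_test_norm = mms.transform(X_test)        # reuses the training min and max

print(X_train_norm.ravel())
print(X_test_norm.ravel())
```

The training values land in [0, 1], while test values outside the training range fall outside it: (0 - 1) / 8 = -0.125 and (11 - 1) / 8 = 1.25. The test set is treated like new, unseen data, scaled with parameters learned only from the training set.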

#### Standardizing features
Useful for many linear models (like SVM and logistic regression) because it makes it easier for them to learn the weights. It also makes the algorithm less sensitive to outliers than normalization does.

In [54]:
from sklearn.preprocessing import StandardScaler

In [55]:
stdsc = StandardScaler()

In [56]:
X_train_std = stdsc.fit_transform(X_train)

In [57]:
X_test_std = stdsc.transform(X_test)

In [58]:
X_train_std

array([[ 0.91083058, -0.46259897, -0.01142613, ...,  0.65706596,
         1.94354495,  0.93700997],
       [-0.95609928, -0.96608672, -1.53725357, ..., -0.40859506,
         0.58118003, -1.41336684],
       [ 0.35952243,  1.67501572, -0.37471838, ..., -1.55950896,
        -1.44846566,  0.28683658],
       ..., 
       [-0.70550467, -0.68342693, -0.62902295, ...,  0.44393375,
         0.49776993, -1.30608823],
       [ 1.14889546, -0.6215951 , -0.88332752, ..., -0.19546286,
         1.0121322 ,  0.77446662],
       [ 1.47466845,  0.11155374,  0.42452457, ..., -1.43162964,
        -1.23994042, -0.28206514]])