# Python Machine Learning in Biology
# Preprocessing

"Garbage in, garbage out" applies to machine learning models as well. The quality of our data and the amount of useful information it contains determine how much a machine learning algorithm can tell us about patterns in the data. So, before we feed a dataset into our model, we need to examine and preprocess it.

We'll cover:
* Dealing with missing data (removing and imputing missing values)
* Converting categorical data to a format a machine learning model can understand
* Standardizing data
* Feature selection for model construction

## Dealing with missing data

Why might our dataset be missing data?  

Most of our computational tools won't be able to handle missing data, so we'll need to deal with it.

Missing data is usually represented in the dataset as a blank space or as a NaN (not a number) placeholder string.

#### Let's create a fake dataset so we can learn how to deal with missing values
`StringIO` lets us read a string into a dataframe as if it were a regular CSV file we imported.

In [2]:
import pandas as pd
from io import StringIO

In [3]:
missing_data = '''A, B, C, D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,,8.0
10.0, 11.0, 12.0,'''

In [4]:
missing = pd.read_csv(StringIO(missing_data))

In [5]:
missing

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


Even though we can spot the missing values here, searching through a larger dataset by hand would take a long time.

#### Let's figure out how many missing values each column has
We can use the `.isnull()` method to get a DataFrame of Booleans indicating whether each value is missing. Then we can use the `.sum()` method to count the missing values in each column.

In [9]:
missing.isnull().sum()

A     0
 B    0
 C    1
 D    1
dtype: int64
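
Beyond per-column counts, it can help to see what fraction of each column is missing and which rows are affected. The sketch below rebuilds the toy dataset so it runs on its own; the header is written without spaces after the commas so the column names are exactly `A` to `D`, and the names `csv_text`, `frac_missing` and `rows_with_missing` are choices made for this example only.

```python
import pandas as pd
from io import StringIO

# Same toy data as above, written without spaces after the commas
csv_text = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
missing = pd.read_csv(StringIO(csv_text))

# Fraction of missing values per column (the mean of the Boolean mask)
frac_missing = missing.isnull().mean()

# Rows that contain at least one missing value
rows_with_missing = missing[missing.isnull().any(axis=1)]

print(frac_missing)
print(rows_with_missing)
```

On a real dataset, the per-column fraction is often a quick way to decide whether a feature is worth keeping at all.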

### Removing missing samples

An easy way to handle missing data is to just remove it. We can remove the column (feature) containing the missing value, or we can remove the row (sample) from the dataset.

#### Drop rows with any missing values using `.dropna()`

In [10]:
missing.dropna()

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


#### Drop columns with any missing values using `.dropna()`
"axis = 0" refers to rows and "axis = 1" refers to columns. For this method, rows are the default. I usually remember that columns are vertical, and so is the number "1".

In [11]:
missing.dropna(axis = 1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


Notice we didn't actually affect the original dataframe. (We would need to save it as a new variable or add an "inplace=True" argument)

In [12]:
missing

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,
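
The two ways of keeping the result can be sketched as follows. This is a self-contained example under the assumption that the toy data is rebuilt from a string without spaces after the commas; `csv_text`, `complete_rows` and `scratch` are names introduced here for illustration.

```python
import pandas as pd
from io import StringIO

csv_text = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
missing = pd.read_csv(StringIO(csv_text))

# Option 1: dropna() returns a new DataFrame, so assign it to a name
complete_rows = missing.dropna()

# Option 2: modify a DataFrame in place (done on a copy here so that
# the original stays available for the rest of the example)
scratch = missing.copy()
scratch.dropna(inplace=True)

print(missing.shape)        # the original still has all of its rows
print(complete_rows.shape)  # only the fully observed row remains
print(scratch.shape)        # same result as option 1
```

Assigning to a new variable is usually the safer choice in a notebook, because re-running a cell cannot silently change the original data.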


#### `dropna` can drop rows where all columns are NaN

In [13]:
missing.dropna(how='all')

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,
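
Nothing was dropped above because no row in our dataset is entirely NaN. A small sketch with hypothetical data (the DataFrame `df` below is invented for this example) shows the difference between `how='all'` and the default `how='any'`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row has no observed values at all
df = pd.DataFrame({
    'A': [1.0, 5.0, np.nan],
    'B': [2.0, np.nan, np.nan],
})

# how='all' drops a row only if every value in it is missing
dropped_all = df.dropna(how='all')

# how='any' (the default) drops a row if at least one value is missing
dropped_any = df.dropna(how='any')

print(dropped_all)
print(dropped_any)
```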


#### Drop rows that have fewer than 4 non-NaN values (`thresh`)

In [14]:
missing.dropna(thresh=4)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
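
To see how the threshold interacts with the data, it helps to count the observed values in each row first. A minimal sketch, assuming the toy data is rebuilt from a string without spaces after the commas (`csv_text`, `non_missing_per_row`, `at_least_3` and `at_least_4` are names chosen for this example):

```python
import pandas as pd
from io import StringIO

csv_text = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
missing = pd.read_csv(StringIO(csv_text))

# Number of observed (non-NaN) values in each row
non_missing_per_row = missing.notnull().sum(axis=1)

# A row is kept when it has at least `thresh` observed values
at_least_3 = missing.dropna(thresh=3)
at_least_4 = missing.dropna(thresh=4)

print(non_missing_per_row)
print(len(at_least_3), len(at_least_4))
```

Lowering the threshold to 3 keeps the two rows that are each missing a single value, while 4 keeps only the complete row.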


#### Only drop rows where NaN appears in specific columns

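In [15]:
# the header was parsed with leading spaces (' C'), so strip the names before using subset
missing.rename(columns=str.strip).dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,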

Dropping missing data isn't always the best idea. Why? (We might lose too much valuable data.)
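
One way to make the cost concrete is to count how many observed values survive each dropping strategy. This is only an illustrative sketch on the toy data, rebuilt here without spaces after the commas; the names `n_observed`, `kept_by_rows` and `kept_by_cols` are introduced for this example.

```python
import pandas as pd
from io import StringIO

csv_text = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
missing = pd.read_csv(StringIO(csv_text))

# Total number of values that were actually observed
n_observed = int(missing.notnull().sum().sum())

# Fraction of observed values that remain after dropping rows or columns
kept_by_rows = missing.dropna().size / n_observed
kept_by_cols = missing.dropna(axis=1).size / n_observed

print(n_observed, kept_by_rows, kept_by_cols)
```

Here only two values are missing, yet dropping rows discards more than half of the observed values, which is why imputation is often preferred.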

### Imputing missing values

A commonly used alternative to dropping missing data is imputing the missing values, that is, estimating them from the data we do have. Here, this means using the other values in the same column to estimate the missing value.

A common type of imputation is **mean imputation**, where we use the mean of the other values in that column (same feature) to fill in the blank.

There are other types of imputation (like using clustering methods), but we won't go into them here. Just know that they exist and that each has its own pros and cons.
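
To make the idea of mean imputation concrete, here is a minimal sketch that does it directly in pandas with `fillna`. It is meant as an illustration of the arithmetic rather than a recommended workflow; the toy data is rebuilt without spaces after the commas, and `csv_text`, `column_means` and `filled` are names chosen for this example.

```python
import pandas as pd
from io import StringIO

csv_text = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
missing = pd.read_csv(StringIO(csv_text))

# Mean of each column, computed from the observed values only
column_means = missing.mean()

# Replace every NaN with the mean of its own column
filled = missing.fillna(column_means)

print(column_means)
print(filled)
```

The missing value in column `C` becomes the mean of 3.0 and 12.0, and the missing value in column `D` becomes the mean of 4.0 and 8.0.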

#### Use scikit-learn's `SimpleImputer` class to do mean imputation

In [23]:
import numpy as np
from sklearn.impute import SimpleImputer

The basic steps in using the `SimpleImputer` class (which is a transformer class; we'll see some other ones that we'll use for data transformation):
1. instantiate the class
2. fit the data (learn the parameters from the training data; only fit on training data)
3. transform the data (use those parameters to transform the data)

*Older scikit-learn releases provided this as `sklearn.preprocessing.Imputer`, which took an `axis` argument where, confusingly, `axis = 0` meant columns. That class was removed in scikit-learn 0.22; `SimpleImputer` always imputes column by column.*

In [26]:
imr = SimpleImputer(missing_values=np.nan, strategy='mean')

In [27]:
imr = imr.fit(missing.values)

In [28]:
imputed_data = imr.transform(missing.values)

In [29]:
imputed_data

array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [ 10. ,  11. ,  12. ,   6. ]])

*Side note: scikit-learn can usually handle dataframes, but it's built on `NumPy` (a numerical array library). `dataframe.values` gives us the NumPy array representation of our dataframe.*
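
The reason fitting and transforming are separate steps is that the parameters learned from the training data can then be reused on data the model has not seen. The sketch below illustrates this with `sklearn.impute.SimpleImputer`, the imputer class available in current scikit-learn releases; the arrays `X_train` and `X_new` are hypothetical data invented for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training data and new data, two features each
X_train = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])
X_new = np.array([[np.nan, 10.0],
                  [7.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit: learn one mean per column from the training data only
imputer.fit(X_train)

# transform: fill the gaps in the new data with the training means
X_new_imputed = imputer.transform(X_new)

print(imputer.statistics_)
print(X_new_imputed)
```

The gaps in `X_new` are filled with the column means learned from `X_train`, not with means computed from `X_new` itself.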

## Handling categorical data