# Main topics
* Removing and imputing missing values from the dataset
* Getting categorical data into shape for machine learning algorithms
* Selecting relevant features for the model construction

# Dealing with missing data
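* Missing values can result from an error in the data collection process
* Or from fields left empty, for example in a survey
* They are typically represented by NaN or NULL
* We have to take care of this missing data before further analysis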

In [9]:
import pandas as pd
from io import StringIO
csv_data = '''
A,B,C,D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,, 8.0
10.0, 11.0, 12.0,
'''
df = pd.read_csv(StringIO(csv_data))
df

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


In [11]:
# isnull() returns a DataFrame of Boolean values: True where data is missing,
# False where a cell contains a value; sum() then counts the missing values per column
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64
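
The counting step above can be broken down as follows. This is a minimal sketch that rebuilds the same example DataFrame; the variable names `mask`, `missing_per_column`, `missing_per_row`, and `missing_fraction` are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
"""
df = pd.read_csv(StringIO(csv_data))

# Boolean mask: True where a cell is missing, False otherwise
mask = df.isnull()

# Summing the mask counts the True values,
# per column (axis=0) or per row (axis=1)
missing_per_column = mask.sum(axis=0)
missing_per_row = mask.sum(axis=1)

# The mean of a Boolean column is the fraction of missing cells in it
missing_fraction = mask.mean()

print(missing_per_column)
print(missing_per_row)
print(missing_fraction)
```

Looking at the per-row counts and the per-column fractions can help decide whether dropping rows or dropping columns would lose less data.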

## Eliminating samples or features with missing values
The easiest approach is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely.

In [13]:
# drop rows with missing values
df.dropna(axis=0)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [14]:
# drop columns that contain at least one NaN
df.dropna(axis=1)

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0


In [15]:
# drop only rows where all columns are NaN (no such row here, so nothing is dropped)
df.dropna(how='all')

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


In [16]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [17]:
# drop rows where NaN appears in specific columns (here: 'C')
df.dropna(subset=['C'])

      A     B     C    D
0   1.0   2.0   3.0  4.0
2  10.0  11.0  12.0  NaN


Removing missing data seems to be a convenient approach, but it comes with disadvantages. For example, we may end up removing too many samples, which makes a reliable analysis impossible. Or, if we remove too many feature columns, we run the risk of losing valuable information that our classifier needs to discriminate between classes.
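
To make that trade-off concrete, one option is to measure how much data survives each removal strategy before committing to it. The following is a small sketch on the same example data; `rows_kept` and `cols_kept` are illustrative names, not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,
"""
df = pd.read_csv(StringIO(csv_data))

# Fraction of rows that remain after dropping every row with a NaN
rows_kept = len(df.dropna(axis=0)) / len(df)

# Fraction of columns that remain after dropping every column with a NaN
cols_kept = df.dropna(axis=1).shape[1] / df.shape[1]

print(f"rows kept: {rows_kept:.0%}, columns kept: {cols_kept:.0%}")
```

On this tiny dataset, dropping rows keeps only one of three samples and dropping columns keeps two of four features, which illustrates how quickly removal can discard data.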
## Imputing missing values
Mean imputation: we simply replace each missing value with the mean value of its entire feature column.

In [19]:
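import numpy as np
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute is its replacement
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])

`SimpleImputer` always imputes column-wise. The older `Imputer` class also accepted `axis=1` to use the row means instead.
Other strategies: `median`, `most_frequent`.
`most_frequent` is useful for imputing categorical feature values.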
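
The other two strategies can be sketched as follows. This example assumes scikit-learn 0.20 or later, where `SimpleImputer` lives in `sklearn.impute`. The `toy` DataFrame and its column names `size_cm` and `color` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric and one categorical column
toy = pd.DataFrame({
    "size_cm": [10.0, 12.0, np.nan, 50.0],
    "color": ["red", "blue", "red", np.nan],
})

# Median imputation is less sensitive to outliers (such as 50.0) than the mean
median_imputer = SimpleImputer(strategy="median")
size_filled = median_imputer.fit_transform(toy[["size_cm"]])

# most_frequent also works on string columns, where a mean is undefined
mode_imputer = SimpleImputer(strategy="most_frequent")
color_filled = mode_imputer.fit_transform(toy[["color"]])

print(size_filled.ravel())
print(color_filled.ravel())
```

Here the missing size is replaced by the median of the observed sizes (12.0) and the missing color by the most common observed color (`red`).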