# [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values)

There are many ways data can end up with missing values.    
* A 2-bedroom house doesn't have a value for a third bedroom.
* Someone being surveyed may choose not to share their income.

Python libraries represent missing numbers as `nan`, which is short for "not a number".  
You can detect which cells have missing values with `isnull()`, and then count how many there are in each column with `sum()`.
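
As a minimal sketch of that detect-and-count step, assuming a small made-up pandas DataFrame (the `data` frame and its column names are illustrative only, not taken from the Melbourne dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks the missing cells.
data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bedroom3Area": [np.nan, 12.0, 10.5],
    "OwnerIncome": [np.nan, np.nan, 85000.0],
})

# isnull() flags each missing cell; sum() counts the flags per column.
missing_val_count_by_column = data.isnull().sum()

# Show only the columns that actually have missing values.
print(missing_val_count_by_column[missing_val_count_by_column > 0])
```

Filtering on `> 0` keeps the report short when most columns are complete.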

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.  
Let's figure out how to deal with them.

### 1. You can drop columns with missing values:

If you want to drop the same columns from both your training and test DataFrames, build the list of columns with missing values from the training data and drop that list from each.
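
A minimal sketch of both cases, assuming two made-up DataFrames named `train_data` and `test_data` (hypothetical names and values). The list of columns to drop is built once, from the training data, so both frames end up with the same columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for real training and test data.
train_data = pd.DataFrame({"Rooms": [2, 3, 4],
                           "Landsize": [np.nan, 450.0, 600.0]})
test_data = pd.DataFrame({"Rooms": [3, 5],
                          "Landsize": [500.0, 700.0]})

# Single DataFrame: drop every column that contains a missing value.
train_without_missing = train_data.dropna(axis=1)

# Two DataFrames: find the affected columns in the training data,
# then drop that same list from both frames.
cols_with_missing = [col for col in train_data.columns
                     if train_data[col].isnull().any()]
reduced_train_data = train_data.drop(cols_with_missing, axis=1)
reduced_test_data = test_data.drop(cols_with_missing, axis=1)

print(list(reduced_train_data.columns), list(reduced_test_data.columns))
```

In this toy example, calling `dropna(axis=1)` on each frame separately would keep `Landsize` in the test frame but not in the training frame, which is why the shared list is used.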

This method discards all the information in the column, so it is mainly useful when most of a column's values are missing.

### 2. You can impute missing values:

Imputation replaces the missing value with some number (the mean, for example), which usually gives more accurate models than dropping the column entirely.
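
A minimal sketch of mean imputation, assuming scikit-learn's `SimpleImputer` and a made-up `original_data` frame (the name and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
original_data = pd.DataFrame({
    "Rooms": [2.0, np.nan, 4.0],
    "Landsize": [400.0, 600.0, np.nan],
})

# SimpleImputer defaults to strategy="mean": each missing value is
# replaced by the mean of the observed values in its column.
my_imputer = SimpleImputer()
data_with_imputed_values = my_imputer.fit_transform(original_data)

print(data_with_imputed_values)
```

By default `fit_transform` returns a NumPy array rather than a DataFrame, so the column names need to be reattached if they are used later.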

Imputation can also be included in a scikit-learn Pipeline, which simplifies model building, validation, and deployment.
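
One way this could look, as a sketch under assumptions: the toy arrays `X` and `y`, the choice of `RandomForestRegressor`, and its `n_estimators` and `random_state` settings are illustrative, not part of the original tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: two numeric features with some missing values.
X = np.array([[2.0, 400.0], [3.0, np.nan], [np.nan, 650.0], [4.0, 700.0]])
y = np.array([500000.0, 620000.0, 580000.0, 810000.0])

# The pipeline imputes first, then fits the model on the imputed data.
my_pipeline = make_pipeline(
    SimpleImputer(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
my_pipeline.fit(X, y)

# At prediction time the already-fitted imputer is applied automatically.
X_new = np.array([[3.0, np.nan]])
preds = my_pipeline.predict(X_new)
print(preds)
```

Because the imputer is fitted inside the pipeline, its column means come only from the data passed to `fit`, which helps keep test data out of the preprocessing step.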

### 3. You can extend imputation to consider which values were originally missing:

Imputation is the standard approach, and it usually works well.  
However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset).  
Or rows with missing values may be unique in some other way.  
In that case, your model would make better predictions by considering which values were originally missing.  
To do that, add a boolean column for each affected column recording which rows were missing, then impute as before.
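
A minimal sketch of this extension, assuming a made-up `original_data` frame and a hypothetical `_was_missing` suffix for the indicator columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only "Landsize" has a missing value.
original_data = pd.DataFrame({
    "Rooms": [2.0, 3.0, 4.0],
    "Landsize": [400.0, np.nan, 600.0],
})

# Work on a copy so the original data is left untouched.
new_data = original_data.copy()

# For each column with missing values, add a boolean column recording
# which rows were missing before imputation.
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + "_was_missing"] = new_data[col].isnull()

# Impute; the indicator columns have no gaps, so they come out as 0.0/1.0.
my_imputer = SimpleImputer()
imputed = pd.DataFrame(my_imputer.fit_transform(new_data),
                       columns=new_data.columns)

print(imputed)
```

The indicator columns must be created before imputing, since afterwards the missing cells can no longer be told apart from observed ones.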

This approach may or may not improve the results compared to simply imputing values.

# An example comparing the solutions using the Melbourne Housing data.

We will see an example predicting housing prices from the Melbourne Housing data.  

In [7]:
import pandas as pd

mb_data = pd.read_csv('input/melbourne_data.csv')

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

mb_target = mb_data.Price
mb_predictors = mb_data.drop(['Price'], axis=1)

# In order to simplify this example, only numeric predictors are used.
mb_numeric_predictors = mb_predictors.select_dtypes(exclude=['object'])

### Create a function to measure how well each approach performs:

We divide our data into training and test sets.  
We define a function `score_dataset(X_train, X_test, y_train, y_test)` to compare the quality of different approaches to missing values.  
This function reports the out-of-sample mean absolute error (MAE) of a random forest model.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(mb_numeric_predictors, mb_target, train_size=0.7,
                                                   test_size=0.3, random_state=0)

def score_dataset(X_train, X_test, y_train, y_test):
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return mae(y_test, preds)

### Dropping columns with missing values:

In [10]:
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_test = X_test.drop(cols_with_missing, axis=1)
print("Mean Absolute Error after dropping columns with missing values:")
print(score_dataset(reduced_X_train, reduced_X_test, y_train, y_test))

Mean Absolute Error after dropping columns with missing values:
347575.785631455


### Get model score from imputation:

In [11]:
# sklearn.preprocessing.Imputer was removed from scikit-learn;
# SimpleImputer is its replacement.
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_X_train = my_imputer.fit_transform(X_train)
imputed_X_test = my_imputer.transform(X_test)
print("Mean Absolute Error after imputing missing values:")
print(score_dataset(imputed_X_train, imputed_X_test, y_train, y_test))

Mean Absolute Error after imputing missing values:
204966.73475267258
204966.73475267258


### Get score from imputation with extra columns tracking what was imputed:

In [12]:
imputed_X_train_plus = X_train.copy()
imputed_X_test_plus = X_test.copy()

# https://www.python.org/dev/peps/pep-0289/
cols_with_missing = (col for col in X_train.columns if X_train[col].isnull().any())

for col in cols_with_missing:
    imputed_X_train_plus[col + ' was missing'] = imputed_X_train_plus[col].isnull()
    imputed_X_test_plus[col + ' was missing'] = imputed_X_test_plus[col].isnull()
    
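# Imputation (SimpleImputer replaces the removed sklearn.preprocessing.Imputer)
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()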
imputed_X_train_plus = my_imputer.fit_transform(imputed_X_train_plus)
imputed_X_test_plus = my_imputer.transform(imputed_X_test_plus)

print("Mean Absolute Error while tracking imputed values:")
print(score_dataset(imputed_X_train_plus, imputed_X_test_plus, y_train, y_test))

Mean Absolute Error while tracking imputed values:
201584.8448994383


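The difference between plain imputation and imputation with the extra columns is small (an MAE of about 204,967 versus 201,585, under 2%) compared with the cost of dropping entire columns (about 347,576). Because `RandomForestRegressor()` is created without a `random_state`, a gap that small may partly reflect run-to-run variation.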

# [Categorical Data and One-Hot Encoding](https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding)