# [Handling Missing Values](https://www.kaggle.com/dansbecker/handling-missing-values)

There are many ways data can end up with missing values.    
* A 2-bedroom house doesn't have a value for a third bedroom.
* Someone being surveyed may choose not to share their income.

Python libraries represent missing numbers as `nan`, which is short for "not a number".  
You can detect which cells have missing values, and then count how many there are in each column with the command:
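
As a minimal sketch of that check, assuming your data is in a pandas DataFrame (the toy DataFrame `data` and its column names are made up for illustration), `isnull()` flags the missing cells and `sum()` counts them per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
data = pd.DataFrame({
    "Bedroom3Area": [np.nan, 12.0, np.nan, 9.5],
    "Income": [52000.0, np.nan, 61000.0, 48000.0],
    "Rooms": [2, 3, 2, 4],
})

# isnull() marks each missing cell as True;
# sum() then counts the True values in each column.
missing_counts = data.isnull().sum()
print(missing_counts)
```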

Most libraries (including scikit-learn) will give you an error if you try to build a model using data with missing values.  
Let's figure out how to deal with them.

### 1. You can drop columns with missing values:
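
If your data is in a pandas DataFrame, one way to do this is `dropna(axis=1)`. The sketch below uses a made-up DataFrame called `original_data`; the names are placeholders, not part of any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only "Rooms" has no missing values.
original_data = pd.DataFrame({
    "Rooms": [2, 3, 2, 4],
    "Landsize": [100.0, np.nan, 300.0, 200.0],
    "YearBuilt": [np.nan, 1990.0, 2005.0, np.nan],
})

# axis=1 drops columns (rather than rows) that contain at least one missing value.
data_without_missing_values = original_data.dropna(axis=1)
print(data_without_missing_values.columns.tolist())
```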

If you want to drop the same columns from the DataFrames in both your training dataset and test dataset:
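
One possible approach is to list the columns with missing values in the training data, then drop that same list from both DataFrames. In this sketch, `train_data` and `test_data` are hypothetical stand-ins for your own DataFrames.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test DataFrames with the same columns.
train_data = pd.DataFrame({
    "Rooms": [2, 3, 2, 4],
    "Landsize": [100.0, np.nan, 300.0, 200.0],
    "YearBuilt": [np.nan, 1990.0, 2005.0, np.nan],
})
test_data = pd.DataFrame({
    "Rooms": [3, 5],
    "Landsize": [150.0, 400.0],
    "YearBuilt": [2001.0, np.nan],
})

# Find the columns that have any missing value in the training data.
cols_with_missing = [col for col in train_data.columns
                     if train_data[col].isnull().any()]

# Drop the same list of columns from both DataFrames so their columns still match.
reduced_train_data = train_data.drop(columns=cols_with_missing)
reduced_test_data = test_data.drop(columns=cols_with_missing)
```

Note that the list here is built from the training data only. A column that is complete in training but has gaps in the test data would not be dropped, so you may still need another strategy for it.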

This method discards all information in the dropped columns, so it is mainly useful when most values in a column are missing.

### 2. You can impute missing values:

Imputation replaces the missing value with some number (the mean, for example), which usually gives more accurate models than dropping the column entirely.
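
Here is a minimal sketch of mean imputation, assuming scikit-learn's `SimpleImputer` and an all-numeric DataFrame. The toy DataFrame `original_data` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
original_data = pd.DataFrame({
    "Rooms": [2.0, 3.0, np.nan, 4.0],
    "Landsize": [100.0, np.nan, 300.0, 200.0],
})

# strategy="mean" fills each gap with the mean of the observed values in that column.
my_imputer = SimpleImputer(strategy="mean")
imputed_array = my_imputer.fit_transform(original_data)

# fit_transform returns a NumPy array here, so restore the column names and index.
data_with_imputed_values = pd.DataFrame(
    imputed_array, columns=original_data.columns, index=original_data.index
)
print(data_with_imputed_values)
```

When you have separate training and test data, you would typically call `fit_transform` on the training data and only `transform` on the test data, so both are filled with the same values.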

Imputation can also be included in a scikit-learn Pipeline, which simplifies model building, validation, and deployment.
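
As a rough illustration, a pipeline might chain an imputer and a model like this. The choice of `LinearRegression` and the toy arrays are assumptions made for the sketch, and any scikit-learn estimator could take the model's place.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data with missing values in the features.
X_train = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 40.0],
])
y_train = np.array([11.0, 22.0, 33.0, 44.0])

# The pipeline imputes first, then passes the filled-in data to the model.
my_pipeline = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
my_pipeline.fit(X_train, y_train)

# New data goes through the same imputation (using the training means) before prediction.
X_new = np.array([[3.0, np.nan]])
predictions = my_pipeline.predict(X_new)
print(predictions)
```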

### 3. You can extend imputation to consider which values were originally missing:

Imputation is the standard approach, and it usually works well.  
However, imputed values may be systematically above or below their actual values (which weren't collected in the dataset).  
Or rows with missing values may be unique in some other way.  
In that case, your model would make better predictions by considering which values were originally missing.  
Here's how it might look:
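
The sketch below is one way to do it, assuming pandas and scikit-learn's `SimpleImputer`. The `_was_missing` suffix and the toy DataFrame are illustrative choices, not a fixed convention.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data; only "Landsize" has a missing value.
original_data = pd.DataFrame({
    "Rooms": [2.0, 3.0, 4.0, 5.0],
    "Landsize": [100.0, np.nan, 300.0, 200.0],
})

# Make a copy to avoid changing the original data when adding columns.
new_data = original_data.copy()

# For each column with gaps, add a 0/1 column recording what was missing.
cols_with_missing = [col for col in new_data.columns
                     if new_data[col].isnull().any()]
for col in cols_with_missing:
    new_data[col + "_was_missing"] = new_data[col].isnull().astype(int)

# Impute the original gaps; the indicator columns have no gaps and pass through unchanged.
my_imputer = SimpleImputer(strategy="mean")
new_data = pd.DataFrame(
    my_imputer.fit_transform(new_data),
    columns=new_data.columns,
    index=new_data.index,
)
print(new_data)
```

`SimpleImputer` also accepts an `add_indicator=True` argument that appends similar indicator columns for you, which may be more convenient inside a pipeline.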

This approach may or may not improve the results compared to simply imputing values.

# An example comparing the solutions using the Melbourne Housing data.