# Handling Missing Values

## Summary

#### Treating values that are not present in the train data
1. The choice of method to fill `NaN` depends on the situation
2. The usual way to deal with missing values is to replace them with `-999`, the `mean` or the `median`
3. Missing values may already have been replaced with something by the organizers
4. A binary `isnull` feature indicating which rows have missing values can be beneficial
5. In general, avoid replacing `NaN`s with real values before feature generation
6. XGBoost can handle `NaN`

## Missing data, numeric
Example data from the `Springleaf competition`

#### `Hidden NaN` - a `NaN` that has already been replaced with an actual number, string, etc.
* How to find it? **Draw a histogram** and look for an isolated spike away from the rest of the values.

![hidden-nan](img/hidden-nan.png)
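
When a plot is not at hand, the same spike can be looked for numerically. The snippet below is a minimal sketch, assuming pandas and NumPy; the data is made up, and both the helper name `suspicious_values` and the `min_share` threshold are hypothetical choices rather than an established rule.

```python
import numpy as np
import pandas as pd


def suspicious_values(series, min_share=0.05):
    """Return values repeated unusually often (hypothetical heuristic).

    A continuous feature rarely repeats one exact value in many rows,
    so a value covering more than `min_share` of the rows is worth a look.
    """
    shares = series.value_counts(normalize=True)
    return shares[shares > min_share]


rng = np.random.default_rng(0)
values = rng.uniform(0.0, 1.0, size=1000)  # made-up feature in [0, 1)
values[:150] = -1.0                        # placeholder that hides the NaNs
feature = pd.Series(values, name="feature")

spikes = suspicious_values(feature)
print(spikes)  # the placeholder -1.0 stands out with a share of 0.15
```

Calling `feature.hist(bins=50)` (which needs matplotlib) would show the same thing visually: a tall bar at `-1` separated from the rest of the distribution.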

## `Fillna` approaches

1. -999, -1, etc.
2. mean, median
   * can be useful for linear models with numeric features
3. reconstruct value
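
The first two approaches can be sketched with pandas `fillna`. This is only an illustration on made-up values (the same ones as in the `isnull` table further down), not a recommendation of one fill value over another.

```python
import numpy as np
import pandas as pd

feature = pd.Series([0.1, 0.95, np.nan, -3.0, np.nan])

# 1. A constant outside the feature range
filled_const = feature.fillna(-999)

# 2. Mean or median of the observed values (both skip NaN by default)
filled_mean = feature.fillna(feature.mean())
filled_median = feature.fillna(feature.median())

print(filled_const.tolist())
print(filled_mean.tolist())
print(filled_median.tolist())
```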

## `Isnull` feature

feature|isnull
---|---
.1|False
.95|False
NaN|True
-3|False
NaN|True

* The `isnull` column indicates which rows have missing values.
* **Combined with mean or median filling, this can solve problems for both trees and neural networks, because the model can still tell which values were imputed.**
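
A minimal sketch of building the indicator with pandas, using the values from the table above. The column name `feature_filled` and the choice of the median are assumptions made for the example. The important detail is that the indicator is computed before the `NaN`s are filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature": [0.1, 0.95, np.nan, -3.0, np.nan]})

# Record where the values were missing BEFORE filling them
df["isnull"] = df["feature"].isnull()

# Fill afterwards; the indicator keeps the information about imputed rows
df["feature_filled"] = df["feature"].fillna(df["feature"].median())

print(df)
```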

## Missing values reconstruction

![mvts1](img/missing-value-time-series1.png)

* One example: `missing values in a time series`
  * We can approximate the missing values from nearby values, but usually the rows in a dataset are independent of each other, so this is a fairly rare case.
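
For the time series case, reconstruction from nearby values can be sketched with pandas `interpolate`. The series is made up and assumes equally spaced observations; linear interpolation is just one possible choice.

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, np.nan, 7.0])

# Linear interpolation between the nearest observed neighbours
reconstructed = ts.interpolate(method="linear")

print(reconstructed.tolist())
```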
  
  
#### Dealing with feature generation
* The `isnull` feature indicates which rows contain missing values.

## Feature generation with missing values

![mvts2](img/missing-value-time-series2.png)

* If we generate new features from original features that contain imputed missing values, the imputed values can have a massive, unwanted impact on the model.
* We already know that missing values can sometimes be approximated from nearby points, but that is very rare.

**Another example!**
* What happens if we fill the missing values in `numeric_feature` with some value outside the feature range, like `-999`, and then encode `categorical_feature` with the mean of the numeric feature?
* The category means are driven closer to `-999`.
* The more rows of a particular category have missing values, the closer that category's mean will be to `-999`.
* **The same is true when filling missing values with the mean or median.**

categorical_feature|numeric_feature|numeric_feature_filled|categorical_encoded
---|---|---|---
A|1|1|1.5
A|4|4|1.5
A|2|2|1.5
A|-1|-1|1.5
B|9|9|-495
B|NaN|-999|-495

### This kind of missing value imputation can definitely ruin the feature we are constructing.

<br>

### The way to handle this particular case is to simply ignore missing values while calculating the mean for each category.

<br>

### Be very careful with early `NaN` imputation if you want to generate new features.
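
The table above, and the effect of ignoring missing values instead, can be reproduced with a short pandas sketch. The column `encoded_ignoring_nan` is a name introduced here for illustration; it relies on the fact that a pandas group mean skips `NaN`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "categorical_feature": ["A", "A", "A", "A", "B", "B"],
    "numeric_feature": [1, 4, 2, -1, 9, np.nan],
})

# Early imputation with -999, then mean encoding: category B is distorted
df["numeric_feature_filled"] = df["numeric_feature"].fillna(-999)
df["categorical_encoded"] = (
    df.groupby("categorical_feature")["numeric_feature_filled"].transform("mean")
)

# Mean encoding on the unfilled column: NaN is skipped inside each group
df["encoded_ignoring_nan"] = (
    df.groupby("categorical_feature")["numeric_feature"].transform("mean")
)

print(df)
```

With the imputed column, category B is encoded as `-495`; when the `NaN` is ignored it is encoded as `9`, the mean of its only observed value.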

### XGBoost can handle missing values!
* Sometimes leaving the `NaN`s in place and letting XGBoost handle them can change the score drastically.

### Treating outliers as missing values
* This applies when the outlier is a value that the feature cannot possibly take.
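
A minimal sketch of this idea, assuming a hypothetical `age` column and an assumed plausible range of 0 to 120; both are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 35, -1, 41, 999, 58]})  # hypothetical column

# Values outside the assumed plausible range are treated as missing
valid = df["age"].between(0, 120)
df["age_clean"] = df["age"].where(valid)  # out-of-range values become NaN

print(df)
```

After this step the impossible values can be handled by any of the missing value approaches described above.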

## Treating values that are not present in the train data

![do-not-present-in-train](img/do-not-present-in-train.png)

If we have categorical features, **it can sometimes be beneficial to change the missing values or the categories that are present in the test data but not in the train data.**

![do-not-present-in-train2](img/do-not-present-in-train2.png)
* The values of the new feature (`_encoded`, encoded by number of occurrences) for D and C are equal to each other.
* If there is some dependency between the target and the number of occurrences of each category, our model will be able to utilize it.
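
Encoding by number of occurrences over train and test together can be sketched as follows. The data is made up (it is not the data from the figures), and the column names `cat` and `cat_encoded` are assumptions; category D appears only in the test set.

```python
import pandas as pd

train = pd.DataFrame({"cat": ["A", "A", "A", "B", "B", "C"]})  # no D here
test = pd.DataFrame({"cat": ["A", "B", "C", "D", "D"]})

# Count occurrences over train and test together
counts = pd.concat([train["cat"], test["cat"]]).value_counts()

train["cat_encoded"] = train["cat"].map(counts)
test["cat_encoded"] = test["cat"].map(counts)

print(test)
```

In this example C and D both occur twice overall, so they get the same encoded value even though the model never saw D during training.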