https://github.com/raytroop/FeatureEngineering/blob/master/Course-Notebooks/03.1_Missing_values.ipynb

## Missing values
Missing data, or missing values, occur when no value is stored for a certain observation within a variable.

Missing data are a common occurrence both in data science competitions and in business settings, and can have a significant effect on the conclusions that can be drawn from the data. Incomplete data is an unavoidable problem with most data sources.

### Why is data missing?
Data can be missing for many different reasons. Here are just a few examples:

- A value is missing because it was forgotten or lost or not stored properly
- For a certain observation, the value of the variable does not exist
- The value can't be known or identified

Imagine, for example, that the data come from a survey and are entered manually into an online form. The person entering the data could easily forget to complete a field in the form, and the value for that field would then be missing.

The person being asked may not want to disclose the answer to one of the questions, for example, their income. That would then be a missing value for that person.

Sometimes, a certain feature can't be calculated for a specific individual. For example, for the variable 'total debt as percentage of total income', if the person has no income, the ratio would require dividing by zero and therefore does not exist. It will be a missing value.
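
As a minimal sketch of the last case, the snippet below computes the debt percentage only where income is non-zero and leaves the value missing otherwise. The dataframe `people` and its column names are hypothetical toy data made up for illustration, not part of any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: the third person has no income
people = pd.DataFrame({
    "total_debt": [5000.0, 12000.0, 3000.0],
    "total_income": [40000.0, 60000.0, 0.0],
})

# Keep income only where it is non-zero; zero income becomes NaN
valid_income = people["total_income"].where(people["total_income"] != 0)

# Dividing by NaN yields NaN, so the undefined ratio is stored as missing
people["debt_pct_income"] = 100 * people["total_debt"] / valid_income

print(people)
```

Without the `where` step, pandas would return `inf` for that row rather than a missing value, which is easy to overlook later on.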

Together with understanding the source of missing data, it is important to understand the mechanisms by which missing fields are introduced in a dataset. Depending on the mechanism, we may choose to process the missing values differently. In addition, by knowing the source of missing data, we may choose to take action to control that source and reduce the amount of missing data in future data collection.

### Missing Data Mechanisms
There are 3 mechanisms that lead to missing data. Two of them involve data going missing randomly or almost randomly, and the third involves a systematic loss of data.

#### Missing Completely at Random, MCAR:
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data are MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others.

If values for observations are missing completely at random, then disregarding those cases would not bias the inferences made.
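
To make this concrete, here is a small simulation sketch. The numbers (a normal variable with mean 50, a 20% missing rate, the random seed) are arbitrary assumptions chosen for illustration. Every value has the same chance of being dropped, and the mean of the remaining values stays close to the mean of the full data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated variable (arbitrary parameters, for illustration only)
values = rng.normal(loc=50.0, scale=10.0, size=100_000)

# MCAR: every observation has the same 20% chance of being missing
is_missing = rng.random(values.size) < 0.2
observed = values[~is_missing]

full_mean = values.mean()
observed_mean = observed.mean()

print(f"mean of all values:      {full_mean:.2f}")
print(f"mean of observed values: {observed_mean:.2f}")
```

Dropping the missing cases costs sample size, but under MCAR it should not shift the estimate in any particular direction.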

#### Missing at Random, MAR:
MAR occurs when there is a systematic relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends only on available information (other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women who decided not to disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.
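
The weight example can be sketched with simulated data. Everything here is an assumption made for illustration: the column names, the average weights per sex, and the missing rates (40% for women, 10% for men). The point is that the overall mean of the observed weights is biased, while the means within each sex are not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Simulated population (all parameters are illustrative assumptions)
sex = rng.choice(["female", "male"], size=n)
true_weight = np.where(sex == "male",
                       rng.normal(80, 10, n),
                       rng.normal(65, 10, n))

# MAR: the chance of weight being missing depends only on sex, which is observed
p_missing = np.where(sex == "female", 0.4, 0.1)
weight = np.where(rng.random(n) < p_missing, np.nan, true_weight)

df = pd.DataFrame({"sex": sex, "true_weight": true_weight, "weight": weight})

# Fraction of missing weights per sex
missing_rate_by_sex = df["weight"].isnull().groupby(df["sex"]).mean()
print(missing_rate_by_sex)

# Overall, the observed mean is pulled towards men, who disclose more often
print(df[["true_weight", "weight"]].mean())

# Within each sex, the observed mean stays close to the true mean
by_sex = df.groupby("sex")[["true_weight", "weight"]].mean()
print(by_sex)
```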

#### Missing Not at Random, MNAR:
Values are missing not at random (MNAR) if their being missing depends on information not recorded in the dataset, such as the missing value itself. In other words, there is a mechanism or a reason why missing values are introduced in the dataset.

Examples:

MNAR would occur if people failed to fill in a depression survey because of their level of depression. Here, the missingness is related to the outcome, depression.

When a financial company asks for bank and identity documents from customers in order to prevent identity fraud, typically, fraudsters impersonating someone else will not upload documents, because they don't have them, precisely because they are fraudsters. Therefore, there is a systematic relationship between the missing documents and the target we want to predict: fraud.
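
The depression survey example can be sketched as a simulation. The score scale, the logistic link between score and non-response, and the seed are all assumptions made for illustration. Because the probability of not responding grows with the score itself, the mean of the observed responses underestimates the true mean, and no other recorded variable could be used to correct for it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulated depression scores on a hypothetical scale
depression_score = rng.normal(loc=10.0, scale=3.0, size=n)

# MNAR: the higher the score, the more likely the survey is not completed
p_missing = 1.0 / (1.0 + np.exp(-(depression_score - 10.0)))
responded = rng.random(n) >= p_missing

true_mean = depression_score.mean()
observed_mean = depression_score[responded].mean()

print(f"true mean score:     {true_mean:.2f}")
print(f"observed mean score: {observed_mean:.2f}")
```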

Understanding the mechanism by which data can be missing is important to decide which methods to use to handle the missing values. I will cover how to handle missing values in detail in sections 5 and 6.

Key takeaway code:

data.groupby(['Survived'])['cabin_null'].mean()

In [15]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

# display all the columns of the dataframe, without truncation
pd.set_option('display.max_columns', None)

In [2]:
# let's load the titanic dataset

data = pd.read_csv('../datasets/titanic.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [4]:
# alternatively, get the fraction of missing values per column
data.isnull().mean()

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

## Missing data Not At Random (MNAR): Systematic missing values
In this dataset, the missing values of both Cabin and Age were introduced systematically. For many of the people who did not survive, their age or the cabin they were staying in could not be established. The people who survived could be asked for that information.

Can we infer this by looking at the data?

In a situation like this, we would expect a higher proportion of missing values among the people who did not survive.

Let's have a look.

In [10]:
data['cabin_null'] = np.where(data.Cabin.isnull(), 1, 0)
data["cabin_null"].mean()

0.7710437710437711

In [11]:
# cabin_null is 1 when Cabin is missing and 0 otherwise, so its mean
# within each group is the fraction of missing Cabin values

# group data by Survived (1) vs non-survived (0)
# and compare the fraction of missing Cabin values

data.groupby(['Survived'])["cabin_null"].mean()

Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

Among the people who did not survive, about 88% have a missing value for Cabin.
Among the people who survived, about 60% have a missing value for Cabin.

This finding is consistent with our hypothesis that the data are missing because the information could not be retrieved after the people died.

In [12]:
data['age_null'] = np.where(data.Age.isnull(), 1, 0)

In [13]:
data.groupby(['Survived'])['age_null'].mean()

Survived
0    0.227687
1    0.152047
Name: age_null, dtype: float64

There is a systematic loss of data: people who did not survive tend to have more information missing. Presumably, the method chosen to gather the information contributes to the generation of these missing data.
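
The same check can be run on several variables at once without creating an indicator column for each. The helper below is a sketch: the function name `missing_rate_by_group` is my own, and the small `toy` dataframe is made-up data used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def missing_rate_by_group(df, columns, group):
    """Fraction of missing values in each of `columns`, per level of `group`."""
    return df[columns].isnull().groupby(df[group]).mean()


# Made-up stand-in for the titanic dataframe (illustrative values only)
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Age": [22.0, np.nan, np.nan, 35.0, 38.0, 26.0, np.nan, 54.0],
    "Cabin": [np.nan, np.nan, np.nan, "C123", "C85", np.nan, "E46", np.nan],
})

rates = missing_rate_by_group(toy, ["Age", "Cabin"], "Survived")
print(rates)
```

On the dataframe loaded in this notebook, the equivalent call would be `missing_rate_by_group(data, ['Age', 'Cabin'], 'Survived')`.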

## Missing data Completely At Random (MCAR)

In [17]:
# slice the dataframe to show only those observations
# with missing value for Embarked

data[data.Embarked.isnull()]

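| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cabin_null | age_null |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 | 0 |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 | 0 |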


## Missing data At Random (MAR)
For this example, I will use the Lending Club loan book. I will look specifically at the variables employer name (emp_title) and years in employment (emp_length), declared by the borrowers at the time of applying for a loan. The former refers to the name of the company for which the borrower works, the latter to how many years the borrower has worked for that company.

Here I will show an example in which a data point missing in one variable (emp_title) depends on the value entered in the other variable (emp_length).