# Deal with missing data

Calculate the following three KPIs:
- % and absolute number of missing data per feature (variable)
- % and absolute number of missing data per observation
- % and absolute number of missing data overall

The overall objective is to reach 0% missing data, because most algorithms and statistical methods cannot handle missing values.

## Less than 10% missing data for each feature and each observation
NUMERIC data
- Analyze whether deleting the respective features and/or observations would significantly reduce the overall missing data. Check whether a collinear feature could stand in for the one with missing values. Verify that the sample size remains large enough.
- If not deleted, impute the missing values with a suitable imputation method. With 10% or less missing data this can usually be done without analyzing patterns in the missing data, since the imputation is unlikely to introduce much bias.
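
A minimal sketch of such a simple imputation, assuming a column statistic (median or mean) is an acceptable fill for numeric features with little missing data. The helper name `impute_numeric` and the toy data are made up for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd


def impute_numeric(df, strategy="median"):
    """Return a copy of df with NaNs in numeric columns replaced by a column statistic."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    if strategy == "median":
        fill_values = out[numeric_cols].median()
    elif strategy == "mean":
        fill_values = out[numeric_cols].mean()
    else:
        raise ValueError("strategy must be 'median' or 'mean'")
    # fillna with a Series fills each column with its own statistic
    out[numeric_cols] = out[numeric_cols].fillna(fill_values)
    return out


# toy data (hypothetical values): one numeric and one non-numeric column
toy = pd.DataFrame({"area": [10.0, None, 30.0, 50.0], "kind": ["a", None, "b", "a"]})
filled = impute_numeric(toy)
```

Non-numeric columns are left untouched on purpose, since they are handled with a dummy variable instead.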

NON-NUMERIC data
- add a dummy variable for missing values

## 10% up to 20% missing data for each feature and each observation
NUMERIC data
- Analyze whether deleting the respective features and/or observations would significantly reduce the overall missing data. Check whether a collinear feature could stand in for the one with missing values. Verify that the sample size remains large enough.
- If not deleted, analyze whether the data is missing randomly or follows a pattern. Depending on the outcome, impute with methods suited to MAR (patterns found) or to MCAR (data missing completely at random). A t-test, for example, can help to find out whether the data is missing randomly or not.
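
One way to sketch the t-test idea: split the rows by whether a feature is missing and compare the mean of another numeric feature between the two groups. The code below computes Welch's t statistic by hand with pandas only. The helper name `welch_t_for_missingness` and the toy values are assumptions for illustration; `scipy.stats.ttest_ind(..., equal_var=False)` would additionally return a p-value.

```python
import math

import pandas as pd


def welch_t_for_missingness(df, feature, other):
    """Welch t statistic for `other`, rows with `feature` missing vs. rows with it observed."""
    is_missing = df[feature].isnull()
    a = df.loc[is_missing, other].dropna()
    b = df.loc[~is_missing, other].dropna()
    # standard error without assuming equal variances in both groups
    standard_error = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / standard_error


# toy data (hypothetical): frontage is missing mainly for large lots
toy = pd.DataFrame({
    "frontage": [None, None, None, 60.0, 70.0, 80.0, 90.0],
    "lot_area": [9000, 9500, 10000, 5000, 5200, 4800, 5100],
})
t_stat = welch_t_for_missingness(toy, "frontage", "lot_area")
```

A large absolute value of the statistic suggests that the missingness is related to the other feature, which speaks against MCAR. A small value does not prove randomness; it only fails to show a pattern for this one pair of features.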

NON-NUMERIC data
- add a dummy variable for missing values

## More than 20% missing data for each feature and each observation
- Candidates for deletion. Check whether a collinear feature could stand in for the one with missing values. Verify that the sample size remains large enough. If imputation is really needed, use regression methods for MCAR and model-based techniques for MAR.
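
The three bands above can be sketched as a small helper that labels each feature by its share of missing values. The names `missing_band` and `features_by_band` are hypothetical, and treating exactly 20% as part of the middle band is an assumption, since the headings leave that boundary open.

```python
import pandas as pd


def missing_band(percent):
    """Map a missing-data percentage to one of the three bands."""
    if percent < 10:
        return "<10%"
    if percent <= 20:  # assumption: exactly 20% still counts as the middle band
        return "10-20%"
    return ">20%"


def features_by_band(df):
    """Percentage of missing values per feature together with its band."""
    percent = df.isnull().mean() * 100
    bands = percent.apply(missing_band)
    summary = pd.DataFrame({"Percent": percent, "Band": bands})
    return summary.sort_values("Percent", ascending=False)


# toy data (hypothetical) with 8 rows: 87.5%, 12.5% and 0% missing
toy = pd.DataFrame({
    "mostly_missing": [None, None, None, 1.0, None, None, None, None],
    "some_missing": [1.0, None, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "complete": list(range(8)),
})
summary = features_by_band(toy)
```

The same helper works per observation when it is applied to the transposed frame (`features_by_band(df.T)`).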

### Relevant imports and data load

In [1]:
import pandas as pd

%matplotlib inline

In [2]:
# load data which is stored in the /data folder of the project
train_data = pd.read_csv('../data/train.csv', sep=',', header=0)
test_data = pd.read_csv('../data/test.csv', sep=',', header=0)

### first glimpse at overall situation of missing data

In [3]:
# overall missing data
def overall_missing_data(df):
    overall_missing = df.isnull().sum().sum()
    overall_values = df.shape[0]*df.shape[1]
    missing_perc = overall_missing * 100 / overall_values
    print("Missing values overall: ", overall_missing)
    print("From total values overall: ", overall_values)
    print("Resulting in: {0:.2f}% missing data overall".format(missing_perc))

In [4]:
# missing data per feature
def missing_data_per_feature(df):
    total_features = df.isnull().sum().sort_values(ascending=False)
    percent_features = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending=False)
    missing_data_features = pd.concat([total_features, percent_features], axis=1, keys=['TotalMissing', 'Percent'])
    print(missing_data_features.head(30))

In [5]:
# missing data per observation
def missing_data_per_observation(df):
    
    observations_with_missing_data = df.isnull().replace(to_replace=[False, True], value=['','M'])
    
    total_observations = df.isnull().sum(axis=1).sort_values(ascending=False)
    percent_observations = (df.isnull().sum(axis=1)/df.isnull().count(axis=1)*100).sort_values(ascending=False)
    missing_data_observations = pd.concat([total_observations, percent_observations], axis=1, keys=['TotalMissing', 'Percent'])
    
    return missing_data_observations, observations_with_missing_data

In [6]:
overall_missing_data(train_data)

Missing values overall:  6965
From total values overall:  118260
Resulting in: 5.89% missing data overall


In [7]:
missing_data_per_feature(train_data)

              TotalMissing    Percent
PoolQC                1453  99.520548
MiscFeature           1406  96.301370
Alley                 1369  93.767123
Fence                 1179  80.753425
FireplaceQu            690  47.260274
LotFrontage            259  17.739726
GarageCond              81   5.547945
GarageType              81   5.547945
GarageYrBlt             81   5.547945
GarageFinish            81   5.547945
GarageQual              81   5.547945
BsmtExposure            38   2.602740
BsmtFinType2            38   2.602740
BsmtFinType1            37   2.534247
BsmtCond                37   2.534247
BsmtQual                37   2.534247
MasVnrArea               8   0.547945
MasVnrType               8   0.547945
Electrical               1   0.068493
Utilities                0   0.000000
YearRemodAdd             0   0.000000
MSSubClass               0   0.000000
Foundation               0   0.000000
ExterCond                0   0.000000
ExterQual                0   0.000000
Exterior2nd 

In [8]:
missing_absolutely, missing_patterns = missing_data_per_observation(train_data)

In [9]:
missing_patterns.to_csv("../data/missing_values.csv", index=False)

Go and check whether there are 'visual' patterns in the exported missing-data sheet (missing_values.csv).
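
As a complement to the visual check, the patterns can also be counted. The sketch below labels every observation with the combination of columns that are missing in it and counts how often each combination occurs. The helper name `missing_pattern_counts` is hypothetical, and the toy rows only borrow column names from the data set; the values are made up.

```python
import pandas as pd


def missing_pattern_counts(df):
    """Count how often each combination of missing columns occurs across rows."""
    mask = df.isnull()
    cols_with_missing = mask.columns[mask.any()]
    labels = [
        ", ".join(cols_with_missing[row]) or "(none missing)"
        for row in mask[cols_with_missing].to_numpy(dtype=bool)
    ]
    return pd.Series(labels, index=df.index).value_counts()


# toy data (hypothetical values)
toy = pd.DataFrame({
    "GarageType": ["Attchd", None, "Detchd", None],
    "GarageQual": ["TA", None, "TA", None],
    "Electrical": ["SBrkr", "SBrkr", None, "SBrkr"],
    "LotArea": [8450, 9600, 11250, 9550],
})
counts = missing_pattern_counts(toy)
```

Columns that are always missing together, like the two garage columns in the toy data, show up as one frequent combined pattern.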

### eliminate features with more than 20% missing data

In [10]:
train_data = train_data.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'], axis=1)

In [11]:
overall_missing_data(train_data)

Missing values overall:  868
From total values overall:  110960
Resulting in: 0.78% missing data overall


### analyze features with less than 20% missing data
MasVnrArea
- option 1) delete 8 observations
- option 2) find imputing values
- option 3) delete feature 

GarageYrBlt
- option 1) possibly correlates with YearBuilt, so that GarageYrBlt can be deleted
- option 2) find an imputing value; this will be a random guess
- option 3) least preferred: delete 81 observations

LotFrontage
- option 1) find a correlating feature, so that LotFrontage can be deleted
- option 2) impute missing values

In [12]:
train_data.LotFrontage.describe()

count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64

At first glance the data seems to be missing randomly. Therefore, it makes sense to impute the mean value into the missing fields.

In [13]:
train_data['LotFrontage'] = train_data['LotFrontage'].fillna(train_data['LotFrontage'].mean())
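
If the test set should receive the same treatment, one option is to compute the fill value on the training rows only and reuse it for the test rows. This is a sketch under that assumption; the frames `train_toy` and `test_toy` are made-up stand-ins, not the notebook's data.

```python
import pandas as pd

# toy stand-ins (hypothetical values) for the train and test frames
train_toy = pd.DataFrame({"LotFrontage": [60.0, None, 80.0]})
test_toy = pd.DataFrame({"LotFrontage": [None, 50.0]})

# 70.0, computed on the training rows only
fill_value = train_toy["LotFrontage"].mean()

train_toy["LotFrontage"] = train_toy["LotFrontage"].fillna(fill_value)
test_toy["LotFrontage"] = test_toy["LotFrontage"].fillna(fill_value)
```

Reusing the training statistic keeps information from the test set out of the imputation.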

Assumption: GarageYrBlt correlates highly with YearBuilt. Manual data inspection confirms that picture. A formal comparison of the two variables (both can be treated as ordinal) is still outstanding; for now GarageYrBlt is dropped.
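
A comparison of the two year columns could be sketched as follows: the Pearson correlation plus the share of rows where both years are identical, using only rows where both values are present. The helper name `compare_year_columns` and the toy values are assumptions for illustration; on the real data the call would have to run before GarageYrBlt is dropped.

```python
import pandas as pd


def compare_year_columns(df, first, second):
    """Pearson correlation and share of identical values for rows with both values present."""
    both = df[[first, second]].dropna()
    correlation = both[first].corr(both[second])
    share_equal = (both[first] == both[second]).mean()
    return correlation, share_equal


# toy data (hypothetical values)
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "GarageYrBlt": [2003.0, 1976.0, 2001.0, 1998.0, None],
})
correlation, share_equal = compare_year_columns(toy, "GarageYrBlt", "YearBuilt")
```

A high correlation together with a high share of identical years would support dropping GarageYrBlt.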

In [14]:
train_data = train_data.drop(['GarageYrBlt'], axis=1)

In [15]:
train_data.MasVnrArea.describe()

count    1452.000000
mean      103.685262
std       181.066207
min         0.000000
25%         0.000000
50%         0.000000
75%       166.000000
max      1600.000000
Name: MasVnrArea, dtype: float64

More than 50% of the values are 0 (the median is 0), which in this case is similar to missing.
Checked against MasVnrType: same picture here. MasVnrType is also None in more than 50% of the observations.
Decision: delete both of these variables.
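
The "more than 50%" check can be sketched with a small helper that returns the percentage of observed entries equal to a given value. The name `share_of_value` is hypothetical, and the toy data assumes that "no veneer" is stored as the literal string "None" in MasVnrType.

```python
import pandas as pd


def share_of_value(series, value):
    """Percentage of non-missing entries that are equal to `value`."""
    observed = series.dropna()
    return (observed == value).mean() * 100


# toy data (hypothetical values)
toy = pd.DataFrame({
    "MasVnrArea": [0.0, 0.0, 196.0, 0.0, None],
    "MasVnrType": ["None", "None", "BrkFace", "None", None],
})
zero_share = share_of_value(toy["MasVnrArea"], 0)
none_share = share_of_value(toy["MasVnrType"], "None")
```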

In [16]:
train_data = train_data.drop(['MasVnrType', 'MasVnrArea'], axis=1) 

In [17]:
overall_missing_data(train_data)

Missing values overall:  512
From total values overall:  106580
Resulting in: 0.48% missing data overall


In [18]:
missing_data_per_feature(train_data)

              TotalMissing   Percent
GarageCond              81  5.547945
GarageType              81  5.547945
GarageFinish            81  5.547945
GarageQual              81  5.547945
BsmtFinType2            38  2.602740
BsmtExposure            38  2.602740
BsmtFinType1            37  2.534247
BsmtCond                37  2.534247
BsmtQual                37  2.534247
Electrical               1  0.068493
YearRemodAdd             0  0.000000
RoofStyle                0  0.000000
RoofMatl                 0  0.000000
ExterQual                0  0.000000
Exterior1st              0  0.000000
Exterior2nd              0  0.000000
OverallCond              0  0.000000
ExterCond                0  0.000000
Foundation               0  0.000000
BsmtFinSF1               0  0.000000
YearBuilt                0  0.000000
SalePrice                0  0.000000
OverallQual              0  0.000000
LandContour              0  0.000000
MSSubClass               0  0.000000
MSZoning                 0  0.000000
L

In [20]:
train_data = train_data.drop(train_data.loc[train_data['Electrical'].isnull()].index)

### convert non-numeric features into dummy variables

In [21]:
test_interim = test_data.drop(['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu','MasVnrType', 'MasVnrArea', 'GarageYrBlt'], axis=1)

In [31]:
# drop SalePrice and add it after get_dummies
target_variable = train_data['SalePrice']
train_interim = train_data.drop('SalePrice', axis=1)

# concat test and train data. List all train records first, attach the test data second
all_data = pd.concat((train_interim, test_interim), axis=0)

# convert categorical variables into dummy/indicator variables.
# dummy_na=True adds an extra indicator column for missing values
# drop_first=True drops the first level of each feature, so k levels become k-1 columns
all_dummies = pd.get_dummies(all_data, dummy_na=True, drop_first=True)

# split test and train sets again
train_dummies = all_dummies.iloc[:train_interim.shape[0],:]
test_dummies = all_dummies.iloc[train_interim.shape[0]:,:]
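
To illustrate what `dummy_na` and `drop_first` do, here is a small sketch on made-up data. The column names are borrowed from the data set, the values are hypothetical.

```python
import pandas as pd

# toy data (hypothetical values): one categorical column with a missing entry
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", None],
    "LotArea": [8450, 9600, 11250],
})

encoded = pd.get_dummies(toy, dummy_na=True, drop_first=True)
print(list(encoded.columns))
# ['LotArea', 'Street_Pave', 'Street_nan']
```

The first level (`Grvl`) is dropped, the missing value gets its own `Street_nan` indicator, and the numeric column passes through unchanged. Missing values in numeric columns are therefore not touched by `get_dummies`.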

Get the overall missing-data statistics again. Only numerical values should still be missing, if any are missing at all.

In [32]:
overall_missing_data(train_dummies)

Missing values overall:  0
From total values overall:  386635
Resulting in: 0.00% missing data overall


In [33]:
# add SalePrice again
train_dummies = train_dummies.assign(SalePrice=target_variable)

In [34]:
train_dummies.head(3)

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,...,SaleType_Oth,SaleType_WD,SaleType_nan,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,SaleCondition_nan,SalePrice
0,1,60,65.0,8450,7,5,2003,2003,706.0,0.0,...,0,1,0,0,0,0,1,0,0,208500
1,2,20,80.0,9600,6,8,1976,1976,978.0,0.0,...,0,1,0,0,0,0,1,0,0,181500
2,3,60,68.0,11250,7,5,2001,2002,486.0,0.0,...,0,1,0,0,0,0,1,0,0,223500


In [35]:
train_dummies.to_csv("../data/train_filled_up.csv", index=False)

In [None]:
test_dummies.to_csv("../data/test_filled_up.csv", index=False)