# Deal with missing data

Calculate the following three KPIs
- % and absolute number of missing data per feature (variable)
- % and absolute number of missing data per observation
- % and absolute number of missing data overall

Overall objective is to achieve 0% missing data, as algorithms/statistics cannot deal with missing values.

## < than 10% missing data for each feature and each observation
NUMERIC data
- analyze if a deletion of the respective features and/or observations would significantly reduce the overall missing data. Check if a collinear feature could take over for the one with missing values. Verify if sample size remains big enough. 
- If not deleted, use a respective imputation method to impute missing values. This should be possible without analyzing possible patterns in the missing data, as with 10% or less missing data, the imputation should not be biased.

NON-NUMERIC data
- add a dummy variable for missing values

## 10% up to 20% missing data for each feature and each observation
NUMERIC data
- analyze if a deletion of the respective features and/or observations would significantly reduce the overall missing data. Check if a collinear feature could take over for the one with missing values. Verify if sample size remains big enough. 
- If not deleted, analyze if there are patterns in the missing data or is the data missed randomly? Based on this outcome use respective MAR methods (patterns found) or respective MCAR (randomly missed data) to impute missing values. T-Test etc. can be used to find out if the data is missed randomly or not. 

NON-NUMERIC data
- add a dummy variable for missing values

## > 20% missing data for each feature and each observation
- candidates for deletion. Check if a collinear feature could take over for the one with missing values. Verify if sample size remains big enough. If imputation is really needed, go with regression methods for MCAR and model based techniques for MAR.

In [39]:
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.svm import NuSVR
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from math import sqrt

In [40]:
# load data which is stored in the /data folder of the project
train_data = pd.read_csv('../data/train.csv', sep=',', header=0)
test_data = pd.read_csv('../data/test.csv', sep=',', header=0)

In [41]:
target_variable = train_data["SalePrice"]
train_features = train_data.drop(["SalePrice"], axis=1)

## first glimps at overall situation

In [42]:
# overall missing data
def overall_missing_data(train_features):
    overall_missing = train_features.isnull().sum().sum()
    overall_values = train_features.shape[0]*train_features.shape[1]
    missing_perc = overall_missing * 100 / overall_values
    print("Missing values overall: ", overall_missing)
    print("From total values overall: ", overall_values)
    print("Resulting in: {0:.2f}% missing data overall".format(missing_perc))

In [43]:
# missing data per feature
def missing_data_per_feature(train_features):
    total_features = train_features.isnull().sum().sort_values(ascending=False)
    percent_features = (train_features.isnull().sum()/train_features.isnull().count()*100).sort_values(ascending=False)
    missing_data_features = pd.concat([total_features, percent_features], axis=1, keys=['TotalMissing', 'Percent'])
    print(missing_data_features.head(30))

In [44]:
# missing data per observation
def missing_data_per_observation(train_features):
    
    observations_with_missing_data = train_features.isnull().replace(to_replace=[False, True], value=['','M'])
    
    total_observations = train_features.isnull().sum(axis=1).sort_values(ascending=False)
    percent_observations = (train_features.isnull().sum(axis=1)/train_features.isnull().count(axis=1)*100).sort_values(ascending=False)
    missing_data_observations = pd.concat([total_observations, percent_observations], axis=1, keys=['TotalMissing', 'Percent'])
    
    return missing_data_observations, observations_with_missing_data

In [45]:
overall_missing_data(train_features)

Missing values overall:  6965
From total values overall:  116800
Resulting in: 5.96% missing data overall


In [46]:
missing_data_per_feature(train_features)

              TotalMissing    Percent
PoolQC                1453  99.520548
MiscFeature           1406  96.301370
Alley                 1369  93.767123
Fence                 1179  80.753425
FireplaceQu            690  47.260274
LotFrontage            259  17.739726
GarageCond              81   5.547945
GarageType              81   5.547945
GarageYrBlt             81   5.547945
GarageFinish            81   5.547945
GarageQual              81   5.547945
BsmtExposure            38   2.602740
BsmtFinType2            38   2.602740
BsmtCond                37   2.534247
BsmtQual                37   2.534247
BsmtFinType1            37   2.534247
MasVnrArea               8   0.547945
MasVnrType               8   0.547945
Electrical               1   0.068493
Utilities                0   0.000000
YearRemodAdd             0   0.000000
MSSubClass               0   0.000000
Foundation               0   0.000000
ExterCond                0   0.000000
ExterQual                0   0.000000
Exterior2nd 

In [47]:
numbers, sheet = missing_data_per_observation(train_features)

In [48]:
sheet.to_csv("../data/missing_values.csv", index=False)

go and check if there are 'visual' patterns in the missing data sheet.

### eliminate features with over 20% missing data

In [49]:
# drop all features with more than 20% of missing data
def delete_features(df, features):
    return df.drop(features, axis=1)

In [50]:
train_features = delete_features(train_features, ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'])

In [51]:
overall_missing_data(train_features)

Missing values overall:  868
From total values overall:  109500
Resulting in: 0.79% missing data overall


### impute, delete observation, find correlating alternative?
MasVnrArea
- option 1) delete 8 observations
- option 2) find imputing values

GarageYrBlt
- option 1) possibly correlating with YearBlt, so that GarageYrBlt can be deleted
- option 2) find imputing value. will be a random guess
- option 2) least preferred: delete 81 observations

LotFrontage
- option 1) find a correlating feature, so that LotFrontage can be deleted
- option 2) impute missing values

In [52]:
train_features.LotFrontage.describe()

count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64

On a first glimps the missing data seems to be missing randomly. Therefore, it would make sense to impute mean value into missing fields

In [None]:
train_features = delete_features(train_features, ['GarageYrBlt'])

In [53]:
train_features.MasVnrArea.describe()

count    1452.000000
mean      103.685262
std       181.066207
min         0.000000
25%         0.000000
50%         0.000000
75%       166.000000
max      1600.000000
Name: MasVnrArea, dtype: float64

More than 50% of values seem to be on 0. Which is similar to missing in this case?
Checked against the MasVnrType. Same picture here. It seems that also MasVnrType is in more than 50% on None. 
Decision to delete both of these variables

In [55]:
train_features = delete_features(train_features, ['MasVnrType', 'MasVnrArea']) 

### convert non-numeric features into dummy variables

In [15]:
test_interim = delete_features(test_data)

In [16]:
# concat test and train data. List all train records first, attach the test data second
all_data = pd.concat((train_features, test_interim), axis=0)

# convert categorical variables into dummy/indicator variable. 
# For missing values an additional column will be created - dummy_na
# The original feature will be dropped - drop_first 
all_dummies = pd.get_dummies(all_data, dummy_na=True, drop_first=True)

# split test and train sets again
train_dummies = all_dummies.iloc[:train_features.shape[0],:]

get overall statistics of missing again. Only numerical values should be missing, if any is missing

In [17]:
overall_missing_data(train_dummies)

Missing values overall:  348
From total values overall:  395660
Resulting in: 0.09% missing data overall


In [18]:
missing_data_per_feature(train_dummies)

                      TotalMissing    Percent
LotFrontage                    259  17.739726
GarageYrBlt                     81   5.547945
MasVnrArea                       8   0.547945
Condition1_RRNe                  0   0.000000
Condition1_Feedr                 0   0.000000
Condition1_Norm                  0   0.000000
Condition1_PosA                  0   0.000000
Condition1_PosN                  0   0.000000
Condition1_RRAe                  0   0.000000
Condition1_RRAn                  0   0.000000
SaleCondition_nan                0   0.000000
Neighborhood_Veenker             0   0.000000
Condition1_RRNn                  0   0.000000
Condition1_nan                   0   0.000000
Condition2_Feedr                 0   0.000000
Condition2_Norm                  0   0.000000
Condition2_PosA                  0   0.000000
Condition2_PosN                  0   0.000000
Neighborhood_nan                 0   0.000000
Neighborhood_Timber              0   0.000000
Condition2_RRAn                  0

In [19]:
numbers, sheet = missing_data_per_observation(train_dummies)

In [20]:
print(numbers)

      TotalMissing   Percent
1407             2  0.738007
287              2  0.738007
1030             2  0.738007
393              2  0.738007
234              2  0.738007
1143             2  0.738007
529              2  0.738007
375              2  0.738007
307              2  0.738007
826              1  0.369004
237              1  0.369004
490              1  0.369004
1148             1  0.369004
269              1  0.369004
1153             1  0.369004
1154             1  0.369004
495              1  0.369004
496              1  0.369004
1161             1  0.369004
465              1  0.369004
1097             1  0.369004
1164             1  0.369004
221              1  0.369004
794              1  0.369004
218              1  0.369004
791              1  0.369004
1173             1  0.369004
214              1  0.369004
789              1  0.369004
1177             1  0.369004
...            ...       ...
916              0  0.000000
919              0  0.000000
885           

In [21]:
sheet.to_csv("../data/missing_values_afterdeletion.csv", index=False)

In [7]:
dummies_train = dummies_train.drop(['Id'], axis=1)

In [8]:
print(dummies_train.shape)

(1460, 288)


In [7]:
target_variable = np.log(target_variable)

In [8]:
dummies_train['GrLivArea'] = np.log(dummies_train['GrLivArea'])

In [9]:
dummies_train['HasBsmt'] = pd.Series(len(dummies_train['TotalBsmtSF']), index=dummies_train.index)
dummies_train['HasBsmt'] = 0 
dummies_train.loc[dummies_train['TotalBsmtSF']>0,'HasBsmt'] = 1

In [10]:
dummies_train.loc[dummies_train['HasBsmt']==1,'TotalBsmtSF'] = np.log(dummies_train['TotalBsmtSF'])

  if __name__ == '__main__':


In [9]:
X_train, X_valid, y_train, y_valid = train_test_split(dummies_train, target_variable, test_size=0.2, random_state=0)

In [10]:
reg_models = [RandomForestRegressor(n_estimators=300),
              DecisionTreeRegressor(),
              LinearRegression(),
              Lasso(),
              ElasticNet(),
              Ridge(alpha=2.5),
              SVR(),
              NuSVR(),
              LinearSVR()]

log_cols = ["RegressionModel", "RMSE", "Score"]
log = pd.DataFrame(columns=log_cols)

for reg in reg_models:
    reg.fit(X_train, y_train)

    name = reg.__class__.__name__

    print("=" * 30)
    print(name)

    train_predictions = reg.predict(X_valid)
    rmse = sqrt(mean_squared_error(y_valid, train_predictions))
    print("Root mean squared error: {}".format(rmse))
    
    score = reg.score(X_valid, y_valid)
    print("Score: {}".format(score))

    log_entry = pd.DataFrame([[name, rmse, score]], columns=log_cols)
    log = log.append(log_entry)

print("="*30)

RandomForestRegressor
Root mean squared error: 33444.6661426284
Score: 0.8380292506755513
DecisionTreeRegressor
Root mean squared error: 51957.36056264777
Score: 0.6090897817490446
LinearRegression
Root mean squared error: 58152.52972878381
Score: 0.510311296728986




Lasso
Root mean squared error: 56954.42649401919
Score: 0.5302813246492708
ElasticNet
Root mean squared error: 50378.19181864712
Score: 0.6324909715465268
Ridge
Root mean squared error: 46849.71131664417
Score: 0.6821686741105454
SVR
Root mean squared error: 85107.83090233327
Score: -0.04887058344096018
NuSVR
Root mean squared error: 83573.53347654763
Score: -0.011394040438498898
LinearSVR
Root mean squared error: 61672.08727027875
Score: 0.4492428349735119


In [13]:
# train best model
rf_reg = RandomForestRegressor(n_estimators=200)
rf_reg.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

In [14]:
test_ids = dummies_test['Id']
dummies_test = dummies_test.drop(['Id'], axis=1)

In [15]:
dummies_test['GrLivArea'] = np.log(dummies_test['GrLivArea'])

In [16]:
dummies_test['HasBsmt'] = pd.Series(len(dummies_test['TotalBsmtSF']), index=dummies_test.index)
dummies_test['HasBsmt'] = 0 
dummies_test.loc[dummies_test['TotalBsmtSF']>0,'HasBsmt'] = 1

In [17]:
dummies_test.loc[dummies_test['HasBsmt']==1,'TotalBsmtSF'] = np.log(dummies_test['TotalBsmtSF'])

  if __name__ == '__main__':


In [18]:
predictions = rf_reg.predict(dummies_test)

# prepare submission as outlined in the submission_sample from Kaggle
submission = pd.DataFrame({"Id": test_ids,"SalePrice": predictions})

In [19]:
submission.to_csv("../data/submission.csv", index=False)