# 2. Preprocessing and Feature Engineering

* [2.1 Feature Selection](#2.1-Feature-Selection)
    * [2.1.1 Skewed Distribution](#2.1.1-Skewed-Distribution)
    * [2.1.2 Correlation](#2.1.2-Correlation)
    * [2.1.3 P-Value](#2.1.3-P-Value)    
    * [2.1.4 Saving Final Dataset](#2.1.4-Saving-Final-Dataset)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
import itertools
import pingouin as pg
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import re
import statsmodels.api as sm
from scipy import stats

import warnings
warnings.filterwarnings('ignore')

In [2]:
train = pd.read_csv('../datasets/train_clean.csv') 
test = pd.read_csv('../datasets/test_clean.csv')

In [3]:
pd.options.display.max_rows = train.shape[1]
pd.options.display.max_columns = train.shape[1]

## 2.1 Feature Selection
Because of the many categorical values and their encoding, the dataset ended up with 222 columns. We would like to remove redundant columns that have a skewed distribution or are strongly correlated with another column.

### 2.1.1 Skewed Distribution
After listing the columns with low variance (the working threshold is 0.005, although the cells below print everything under 0.01), we can compare the low-variance features in the train and test sets. We only drop a feature if doing so does not affect the other set, so we proceed to drop features that have low variance in both sets. For dummy features that come from the same original feature, we can consider combining them if that brings the variance above 0.005.
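
The comparison described above can be sketched as a small set operation. This is only an illustration on toy data: `low_variance_columns` is a hypothetical helper, and the toy frames stand in for `train` and `test`.

```python
import pandas as pd

def low_variance_columns(df: pd.DataFrame, threshold: float) -> set:
    """Return the names of numeric columns whose sample variance is below threshold."""
    variances = df.var(numeric_only=True)
    return set(variances[variances < threshold].index)

# Toy frames standing in for train and test (hypothetical data).
toy_train = pd.DataFrame({
    "rare": [1] + [0] * 199,   # a single 1 in 200 rows
    "common": [0, 1] * 100,    # evenly split dummy
})
toy_test = pd.DataFrame({
    "rare": [0] * 200,         # category never appears in this set
    "common": [1, 0] * 100,
})

low_train = low_variance_columns(toy_train, threshold=0.01)
low_test = low_variance_columns(toy_test, threshold=0.01)

both = sorted(low_train & low_test)       # low variance in both sets
only_one = sorted(low_train ^ low_test)   # low variance in one set only
print(both, only_one)
```

Features in `both` are the candidates for dropping or merging, while features in `only_one` need a closer look before anything is removed.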

In [4]:
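# numeric_only=True keeps the call working if any non-numeric column remains
train_skew_var = train.var(numeric_only=True)
test_skew_var = test.var(numeric_only=True)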

In [5]:
train_skew_var[train_skew_var.values < 0.01]

GarageType_2Types       0.009205
GarageType_CarPort      0.005350
MasVnrType_BrkCmn       0.005834
MasVnrType_CBlock       0.000000
MSZoning_A_agr          0.000977
MSZoning_C_all          0.009205
MSZoning_I_all          0.000489
MSZoning_RH             0.006799
Street_Grvl             0.003411
Street_Pave             0.003411
LotShape_IR3            0.003897
LotConfig_FR3           0.004382
LandSlope_Sev           0.003897
Utilities_AllPub        0.000977
Utilities_NoSeWa        0.000489
Utilities_NoSewr        0.000489
Heating_Floor           0.000000
Heating_GasW            0.009684
Heating_Grav            0.002439
Heating_OthW            0.000977
Heating_Wall            0.002925
HouseStyle_2.5Fin       0.002925
RoofStyle_Flat          0.005834
RoofStyle_Gambrel       0.005834
RoofStyle_Mansard       0.003411
RoofStyle_Shed          0.001465
RoofMatl_Membran        0.000489
RoofMatl_Metal          0.000000
RoofMatl_Roll           0.000000
RoofMatl_Tar&Grv        0.006799
RoofMatl_W

In [6]:
test_skew_var[test_skew_var.values < 0.01]

MiscFeature_Othr        0.000000
MiscFeature_TenC        0.000000
GarageType_2Types       0.004540
GarageType_CarPort      0.004540
MasVnrType_CBlock       0.001139
MSZoning_A_agr          0.000000
MSZoning_C_all          0.006795
MSZoning_I_all          0.001139
Street_Grvl             0.005669
Street_Pave             0.005669
LotShape_IR3            0.007918
LotConfig_FR3           0.005669
LandSlope_Sev           0.006795
Utilities_AllPub        0.001139
Utilities_NoSeWa        0.000000
Utilities_NoSewr        0.001139
Heating_Floor           0.001139
Heating_GasW            0.007918
Heating_Grav            0.004540
Heating_OthW            0.000000
Heating_Wall            0.000000
HouseStyle_2.5Fin       0.002275
RoofStyle_Flat          0.007918
RoofStyle_Mansard       0.004540
RoofStyle_Shed          0.002275
RoofMatl_Membran        0.000000
RoofMatl_Metal          0.001139
RoofMatl_Roll           0.001139
RoofMatl_Tar&Grv        0.009039
RoofMatl_WdShake        0.005669
RoofMatl_W

#### MasVnrType
Train:
- MasVnrType_BrkCmn       0.005834
- MasVnrType_CBlock       0.000000

Test:
- MasVnrType_CBlock       0.001139


In [7]:
train.drop(['MasVnrType_CBlock', 'MasVnrType_BrkCmn'], axis=1, inplace=True)
test.drop(['MasVnrType_CBlock', 'MasVnrType_BrkCmn'], axis=1, inplace=True)

#### MSZoning
Train:
- MSZoning_A_agr          0.000977
- MSZoning_C_all          0.009205
- MSZoning_I_all          0.000489

Test:
- MSZoning_A_agr          0.000000
- MSZoning_C_all          0.006795
- MSZoning_I_all          0.001139

In [8]:
train['MSZoning_ACI'] = train['MSZoning_A_agr'] + train['MSZoning_C_all'] + train['MSZoning_I_all']
test['MSZoning_ACI'] = test['MSZoning_A_agr'] + test['MSZoning_C_all'] + test['MSZoning_I_all']

In [9]:
train.drop(['MSZoning_A_agr', 'MSZoning_C_all', 'MSZoning_I_all'], axis=1, inplace=True)
test.drop(['MSZoning_A_agr', 'MSZoning_C_all', 'MSZoning_I_all'], axis=1, inplace=True)

#### Utilities
All dummy columns derived from the original Utilities feature have a variance below 0.005 in both sets, so we can proceed to drop them all.
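
To get a feel for what these variances mean, it helps to translate them into row counts. For a 0/1 dummy with `k` ones in `n` rows, the sample variance that pandas reports (with its default `ddof=1`) is `k * (n - k) / (n * (n - 1))`. The sketch below assumes the train set has 2046 rows, which is the `No. Observations` value reported by the OLS summary in 2.1.3; `dummy_variance` is a hypothetical helper.

```python
def dummy_variance(k: int, n: int) -> float:
    """Sample variance (ddof=1, the pandas default) of a 0/1 column with k ones in n rows."""
    return k * (n - k) / (n * (n - 1))

n_rows = 2046  # assumption: taken from "No. Observations" in the OLS summary

for k in (1, 2, 10, 11):
    print(k, round(dummy_variance(k, n_rows), 6))
```

Under that assumption, a variance of 0.000489 corresponds to a category that appears in a single row, 0.000977 to two rows, and the 0.005 threshold sits between 10 rows (0.004866) and 11 rows (0.005350).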


In [10]:
train.drop(['Utilities_AllPub', 'Utilities_NoSeWa', 'Utilities_NoSewr'], axis=1, inplace=True)
test.drop(['Utilities_AllPub', 'Utilities_NoSeWa', 'Utilities_NoSewr'], axis=1, inplace=True)

#### Heating
We can see that if we combine the low-variance heating features, the combined variance will be more than 0.005.

Train:
- Heating_Floor           0.000000
- Heating_GasW            0.009684
- Heating_Grav            0.002439
- Heating_OthW            0.000977
- Heating_Wall            0.002925

Test:
- Heating_Floor           0.001139
- Heating_GasW            0.007918
- Heating_Grav            0.004540
- Heating_OthW            0.000000
- Heating_Wall            0.000000

In [11]:
train['Heating_OthW'] = train['Heating_Floor'] + train['Heating_GasW'] + train['Heating_Grav'] + train['Heating_OthW'] \
                        + train['Heating_Wall']
test['Heating_OthW'] = test['Heating_Floor'] + test['Heating_GasW'] + test['Heating_Grav'] + test['Heating_OthW'] + \
                        test['Heating_Wall']

In [12]:
train.drop(['Heating_Floor', 'Heating_GasW', 'Heating_Grav', 'Heating_Wall'], axis=1, inplace=True)
test.drop(['Heating_Floor', 'Heating_GasW', 'Heating_Grav', 'Heating_Wall'], axis=1, inplace=True)
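
The combine-then-drop pattern used for MSZoning and Heating repeats for several features below. A minimal sketch of how it could be wrapped in one function, assuming the dummies are mutually exclusive 0/1 columns so that their sum is again 0/1; `combine_dummies` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def combine_dummies(df: pd.DataFrame, cols: list, new_col: str) -> pd.DataFrame:
    """Sum the one-hot columns in `cols` into `new_col` and drop the originals.

    `new_col` may itself be one of `cols` (as with Heating_OthW); it is kept.
    """
    out = df.copy()
    out[new_col] = out[cols].sum(axis=1)
    return out.drop(columns=[c for c in cols if c != new_col])

# Toy frame with one row per heating type (hypothetical data).
toy = pd.DataFrame({
    "Heating_GasA": [1, 1, 0, 0, 0],
    "Heating_GasW": [0, 0, 1, 0, 0],
    "Heating_Grav": [0, 0, 0, 1, 0],
    "Heating_OthW": [0, 0, 0, 0, 1],
})

merged = combine_dummies(
    toy, ["Heating_GasW", "Heating_Grav", "Heating_OthW"], "Heating_OthW"
)
print(merged)
```

Calling the same function on both `train` and `test` with identical arguments would keep the two sets aligned.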

#### RoofStyle
Train:
- RoofStyle_Flat          0.005834
- RoofStyle_Gambrel       0.005834
- RoofStyle_Mansard       0.003411
- RoofStyle_Shed          0.001465

Test:
- RoofStyle_Flat          0.007918
- RoofStyle_Mansard       0.004540
- RoofStyle_Shed          0.002275

In [13]:
train['RoofStyle_Othr'] = train['RoofStyle_Flat'] + train['RoofStyle_Gambrel'] + train['RoofStyle_Mansard'] + \
                            train['RoofStyle_Shed'] 
test['RoofStyle_Othr'] = test['RoofStyle_Flat'] + test['RoofStyle_Gambrel'] + test['RoofStyle_Mansard'] + \
                            test['RoofStyle_Shed'] 

In [14]:
train.drop(['RoofStyle_Flat', 'RoofStyle_Gambrel', 'RoofStyle_Mansard', 'RoofStyle_Shed'], axis=1, inplace=True)
test.drop(['RoofStyle_Flat', 'RoofStyle_Gambrel', 'RoofStyle_Mansard', 'RoofStyle_Shed'], axis=1, inplace=True)

#### RoofMatl
Train:
- RoofMatl_Membran        0.000489
- RoofMatl_Metal          0.000000
- RoofMatl_Roll           0.000000
- RoofMatl_Tar&Grv        0.006799
- RoofMatl_WdShake        0.001952
- RoofMatl_WdShngl        0.002439

Test:
- RoofMatl_Membran        0.000000
- RoofMatl_Metal          0.001139
- RoofMatl_Roll           0.001139
- RoofMatl_Tar&Grv        0.009039
- RoofMatl_WdShake        0.005669
- RoofMatl_WdShngl        0.002275

In [15]:
train['RoofMatl_Othr'] = train['RoofMatl_Membran'] + train['RoofMatl_Metal'] + train['RoofMatl_Roll'] + \
                        train['RoofMatl_Tar&Grv'] + train['RoofMatl_WdShake'] + train['RoofMatl_WdShngl']
test['RoofMatl_Othr'] = test['RoofMatl_Membran'] + test['RoofMatl_Metal'] + test['RoofMatl_Roll'] + \
                        test['RoofMatl_Tar&Grv'] + test['RoofMatl_WdShake'] + test['RoofMatl_WdShngl']

In [16]:
train.drop(['RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 
            'RoofMatl_WdShngl'], axis=1, inplace=True)
test.drop(['RoofMatl_Membran', 'RoofMatl_Metal', 'RoofMatl_Roll', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 
           'RoofMatl_WdShngl'], axis=1, inplace=True)

#### Exterior1st
Train:
- Exterior1st_AsphShn     0.001139
- Exterior1st_BrkComm     0.003409
- Exterior1st_CBlock      0.000000
- Exterior1st_ImStucc     0.000000
- Exterior1st_PreCast     0.001139
- Exterior1st_Stone       0.000000

Test:
- Exterior1st_AsphShn     0.001139
- Exterior1st_BrkComm     0.003409
- Exterior1st_CBlock      0.000000
- Exterior1st_ImStucc     0.000000
- Exterior1st_PreCast     0.001139
- Exterior1st_Stone       0.000000

In [17]:
train['Exterior1st_Other'] = train['Exterior1st_AsphShn'] + train['Exterior1st_BrkComm'] + train['Exterior1st_CBlock']\
                            + train['Exterior1st_ImStucc'] + train['Exterior1st_PreCast'] + train['Exterior1st_Stone']
test['Exterior1st_Other'] = test['Exterior1st_AsphShn'] + test['Exterior1st_BrkComm'] + test['Exterior1st_CBlock'] + \
                            test['Exterior1st_ImStucc'] + test['Exterior1st_PreCast'] + test['Exterior1st_Stone']

In [18]:
train.drop(['Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_CBlock', 'Exterior1st_ImStucc', 
            'Exterior1st_PreCast', 'Exterior1st_Stone'], axis=1, inplace=True)
test.drop(['Exterior1st_AsphShn', 'Exterior1st_BrkComm', 'Exterior1st_CBlock', 'Exterior1st_ImStucc', 
           'Exterior1st_PreCast', 'Exterior1st_Stone'], axis=1, inplace=True)

#### Exterior2nd
Train:
- Exterior2nd_AsphShn     0.001465
- Exterior2nd_BrkCmn      0.008244
- Exterior2nd_CBlock      0.000977
- Exterior2nd_ImStucc     0.005350
- Exterior2nd_Other       0.000000
- Exterior2nd_PreCast     0.000000
- Exterior2nd_Stone       0.002925

Test:
- Exterior2nd_AsphShn     0.001139
- Exterior2nd_BrkCmn      0.005669
- Exterior2nd_CBlock      0.001139
- Exterior2nd_ImStucc     0.004540
- Exterior2nd_Other       0.001139
- Exterior2nd_PreCast     0.001139
- Exterior2nd_Stone       0.000000

In [19]:
train['Exterior2nd_Other'] = train['Exterior2nd_AsphShn'] + train['Exterior2nd_BrkCmn'] + train['Exterior2nd_CBlock'] \
                            + train['Exterior2nd_ImStucc'] + train['Exterior2nd_Other'] + train['Exterior2nd_PreCast'] \
                            + train['Exterior2nd_Stone']
test['Exterior2nd_Other'] = test['Exterior2nd_AsphShn'] + test['Exterior2nd_BrkCmn'] + test['Exterior2nd_CBlock'] + \
                            test['Exterior2nd_ImStucc'] + test['Exterior2nd_Other'] + test['Exterior2nd_PreCast'] + \
                            test['Exterior2nd_Stone']

In [20]:
train.drop(['Exterior2nd_AsphShn', 'Exterior2nd_BrkCmn', 'Exterior2nd_CBlock', 'Exterior2nd_ImStucc', 
            'Exterior2nd_PreCast', 'Exterior2nd_Stone'], axis=1, inplace=True)
test.drop(['Exterior2nd_AsphShn', 'Exterior2nd_BrkCmn', 'Exterior2nd_CBlock', 'Exterior2nd_ImStucc', 
           'Exterior2nd_PreCast', 'Exterior2nd_Stone'], axis=1, inplace=True)

#### Foundation
Train:
- Foundation_Slab      0.016350
- Foundation_Stone     0.002439
- Foundation_Wood      0.000977

Test:
- Foundation_Slab      0.016812
- Foundation_Stone     0.006795
- Foundation_Wood      0.003409

In [21]:
train['Foundation_Other'] = train['Foundation_Slab'] + train['Foundation_Stone'] + train['Foundation_Wood']
test['Foundation_Other'] = test['Foundation_Slab'] + test['Foundation_Stone'] + test['Foundation_Wood']

In [22]:
train.drop(['Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood'], axis=1, inplace=True)
test.drop(['Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood'], axis=1, inplace=True)

#### SaleType
Train:
- SaleType_CWD            0.004866
- SaleType_Con            0.001952
- SaleType_ConLD          0.008244
- SaleType_ConLI          0.003411
- SaleType_ConLw          0.002439
- SaleType_Oth            0.001952
- SaleType_VWD            0.000000

Test:
- SaleType_CWD            0.002275
- SaleType_Con            0.001139
- SaleType_ConLI          0.002275
- SaleType_ConLw          0.003409
- SaleType_Oth            0.003409
- SaleType_VWD            0.001139

In [23]:
train['SaleType_Oth'] = train['SaleType_CWD'] + train['SaleType_Con'] + train['SaleType_ConLD'] + \
                        train['SaleType_ConLI'] + train['SaleType_ConLw'] + train['SaleType_Oth'] + \
                        train['SaleType_VWD']
test['SaleType_Oth'] = test['SaleType_CWD'] + test['SaleType_Con'] + test['SaleType_ConLD'] + test['SaleType_ConLI'] \
                        + test['SaleType_ConLw'] + test['SaleType_Oth'] + test['SaleType_VWD']

In [24]:
train.drop(['SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_VWD'], 
           axis=1, inplace=True)
test.drop(['SaleType_CWD', 'SaleType_Con', 'SaleType_ConLD', 'SaleType_ConLI', 'SaleType_ConLw', 'SaleType_VWD'], 
          axis=1, inplace=True)
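
Every drop and merge in this subsection is applied to both sets, so it can be worth confirming that the columns still line up. A minimal sketch of such a check, assuming `SalePrice` is the only column expected to exist in train but not in test; `column_mismatch` is a hypothetical helper.

```python
import pandas as pd

def column_mismatch(train_df: pd.DataFrame, test_df: pd.DataFrame, target: str = "SalePrice"):
    """Return (only_in_train, only_in_test), ignoring the target column."""
    train_cols = set(train_df.columns) - {target}
    test_cols = set(test_df.columns) - {target}
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)

# Toy frames with one deliberately mismatched column each (hypothetical data).
toy_train = pd.DataFrame({"LotArea": [1, 2], "Heating_OthW": [0, 1], "SalePrice": [100, 200]})
toy_test = pd.DataFrame({"LotArea": [3, 4], "Heating_Wall": [0, 0]})

only_train, only_test = column_mismatch(toy_train, toy_test)
print(only_train, only_test)
```

Two empty lists would indicate that the train and test sets share the same features.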

### 2.1.2 Correlation
We use Pingouin to check pairwise correlations (`r` is the correlation coefficient) and look for strong positive and negative correlations: `r` of 0.8 or higher, or -0.8 or lower.
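
For readers without Pingouin, the same kind of screen can be sketched with pandas and NumPy alone. This swaps in `DataFrame.corr` (Pearson by default) for `pg.pairwise_corr`; it is an illustration on toy data, and `strong_pairs` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """List each column pair once with its Pearson r, keeping only |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the upper triangle only, so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().reset_index()
    pairs.columns = ["X", "Y", "r"]
    strong = pairs[pairs["r"].abs() >= threshold]
    return strong.sort_values("r", ascending=False).reset_index(drop=True)

# Toy data: b rises with a, c falls with a, d is uncorrelated with a (hypothetical).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
    "d": [1, 0, 1, 0, 1],
})

result = strong_pairs(toy)
print(result)
```

Filtering on the absolute value covers the positive and negative thresholds in one pass.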

In [25]:
columns = train.columns
pair_corr = pg.pairwise_corr(train, columns)

In [26]:
pair_corr.loc[pair_corr['r'] >= 0.8, ['X', 'Y', 'r']].sort_values(by='r', ascending = False).head(20)

| |X|Y|r|
|----|----|----|----|
|12997|BldgType_Duplex|MSSubClass_90|1.0|
|14070|Exterior1st_CemntBd|Exterior2nd_CmentBd|0.988253|
|12914|BldgType_2fmCon|MSSubClass_190|0.977761|
|14431|Exterior1st_VinylSd|Exterior2nd_VinylSd|0.977539|
|14217|Exterior1st_MetalSd|Exterior2nd_MetalSd|0.97645|
|13663|HouseStyle_SLvl|MSSubClass_80|0.954547|
|13249|HouseStyle_1.5Story|MSSubClass_50|0.919442|
|5810|GarageCars|GarageArea|0.897599|
|14144|Exterior1st_HdBoard|Exterior2nd_HdBoard|0.887791|
|6105|GarageQual|GarageCond|0.884649|


In [27]:
pair_corr.loc[pair_corr['r'] <= -0.8, ['X', 'Y', 'r']].sort_values(by='r', ascending = False).head(20)

| |X|Y|r|
|----|----|----|----|
|10395|MSZoning_RL|MSZoning_RM|-0.801455|
|9815|MasVnrType_BrkFace|MasVnrType_None|-0.82649|
|6273|GarageCond|GarageType_None|-0.867613|
|6128|GarageQual|GarageType_None|-0.874681|
|10843|LotShape_IR1|LotShape_Reg|-0.937744|
|13676|RoofStyle_Gable|RoofStyle_Hip|-0.949623|
|12180|LandSlope_Gtl|LandSlope_Mod|-0.955013|
|48|Id|YrSold|-0.975766|
|13907|RoofMatl_CompShg|RoofMatl_Othr|-1.0|
|10620|Street_Grvl|Street_Pave|-1.0|


#### Exterior
Notice that many Exterior1st and Exterior2nd features are strongly correlated. Rather than choosing which one to drop, we will use the mean of each pair.

In [28]:
pair_corr.loc[pair_corr['X'].str.contains(r'Exterior(?!$)') & pair_corr['Y'].str.contains(r'Exterior(?!$)'), 
              ['X', 'Y', 'r']].sort_values('r', ascending=False).head(15)

| |X|Y|r|
|----|----|----|----|
|14070|Exterior1st_CemntBd|Exterior2nd_CmentBd|0.988253|
|14431|Exterior1st_VinylSd|Exterior2nd_VinylSd|0.977539|
|14217|Exterior1st_MetalSd|Exterior2nd_MetalSd|0.97645|
|14144|Exterior1st_HdBoard|Exterior2nd_HdBoard|0.887791|
|14500|Exterior1st_WdSdng|Exterior2nd_WdSdng|0.860579|
|13919|Exterior1st_AsbShng|Exterior2nd_AsbShng|0.819804|
|14290|Exterior1st_Plywood|Exterior2nd_Plywood|0.71501|
|14361|Exterior1st_Stucco|Exterior2nd_Stucco|0.687763|
|13995|Exterior1st_BrkFace|Exterior2nd_BrkFace|0.679487|
|14568|Exterior1st_WdShing|Exterior2nd_WdShng|0.629149|


In [29]:
train.columns[train.columns.str.contains('Exterior1st')]

Index(['Exterior1st_AsbShng', 'Exterior1st_BrkFace', 'Exterior1st_CemntBd',
       'Exterior1st_HdBoard', 'Exterior1st_MetalSd', 'Exterior1st_Plywood',
       'Exterior1st_Stucco', 'Exterior1st_VinylSd', 'Exterior1st_WdSdng',
       'Exterior1st_WdShing', 'Exterior1st_Other'],
      dtype='object')

In [30]:
train.columns[train.columns.str.contains('Exterior2nd')]

Index(['Exterior2nd_AsbShng', 'Exterior2nd_BrkFace', 'Exterior2nd_CmentBd',
       'Exterior2nd_HdBoard', 'Exterior2nd_MetalSd', 'Exterior2nd_Other',
       'Exterior2nd_Plywood', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd',
       'Exterior2nd_WdSdng', 'Exterior2nd_WdShng'],
      dtype='object')

In [31]:
train['Exterior_AsbShng'] = (train['Exterior1st_AsbShng'] + train['Exterior2nd_AsbShng']) / 2 
train['Exterior_BrkFace'] = (train['Exterior1st_BrkFace'] + train['Exterior2nd_BrkFace']) / 2 
train['Exterior_CmentBd'] = (train['Exterior1st_CemntBd'] + train['Exterior2nd_CmentBd']) / 2 
train['Exterior_HdBoard'] = (train['Exterior1st_HdBoard'] + train['Exterior2nd_HdBoard']) / 2 
train['Exterior_MetalSd'] = (train['Exterior1st_MetalSd'] + train['Exterior2nd_MetalSd']) / 2 
train['Exterior_Other'] = (train['Exterior1st_Other'] + train['Exterior2nd_Other']) / 2 
train['Exterior_Plywood'] = (train['Exterior1st_Plywood'] + train['Exterior2nd_Plywood']) / 2 
train['Exterior_Stucco'] = (train['Exterior1st_Stucco'] + train['Exterior2nd_Stucco']) / 2 
train['Exterior_VinylSd'] = (train['Exterior1st_VinylSd'] + train['Exterior2nd_VinylSd']) / 2 
train['Exterior_WdSdng'] = (train['Exterior1st_WdSdng'] + train['Exterior2nd_WdSdng']) / 2 
train['Exterior_WdShing'] = (train['Exterior1st_WdShing'] + train['Exterior2nd_WdShng']) / 2 

In [32]:
test['Exterior_AsbShng'] = (test['Exterior1st_AsbShng'] + test['Exterior2nd_AsbShng']) / 2 
test['Exterior_BrkFace'] = (test['Exterior1st_BrkFace'] + test['Exterior2nd_BrkFace']) / 2 
test['Exterior_CmentBd'] = (test['Exterior1st_CemntBd'] + test['Exterior2nd_CmentBd']) / 2 
test['Exterior_HdBoard'] = (test['Exterior1st_HdBoard'] + test['Exterior2nd_HdBoard']) / 2 
test['Exterior_MetalSd'] = (test['Exterior1st_MetalSd'] + test['Exterior2nd_MetalSd']) / 2 
test['Exterior_Other'] = (test['Exterior1st_Other'] + test['Exterior2nd_Other']) / 2 
test['Exterior_Plywood'] = (test['Exterior1st_Plywood'] + test['Exterior2nd_Plywood']) / 2 
test['Exterior_Stucco'] = (test['Exterior1st_Stucco'] + test['Exterior2nd_Stucco']) / 2 
test['Exterior_VinylSd'] = (test['Exterior1st_VinylSd'] + test['Exterior2nd_VinylSd']) / 2 
test['Exterior_WdSdng'] = (test['Exterior1st_WdSdng'] + test['Exterior2nd_WdSdng']) / 2 
test['Exterior_WdShing'] = (test['Exterior1st_WdShing'] + test['Exterior2nd_WdShng']) / 2 

In [33]:
train.drop(['Exterior1st_AsbShng', 'Exterior1st_BrkFace', 'Exterior1st_CemntBd', 'Exterior1st_HdBoard', 
            'Exterior1st_MetalSd', 'Exterior1st_Plywood', 'Exterior1st_Stucco', 'Exterior1st_VinylSd', 
            'Exterior1st_WdSdng', 'Exterior1st_WdShing', 'Exterior1st_Other', 'Exterior2nd_AsbShng', 
            'Exterior2nd_BrkFace', 'Exterior2nd_CmentBd', 'Exterior2nd_HdBoard', 'Exterior2nd_MetalSd', 
            'Exterior2nd_Other', 'Exterior2nd_Plywood', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 
            'Exterior2nd_WdSdng', 'Exterior2nd_WdShng'], axis=1, inplace=True)
test.drop(['Exterior1st_AsbShng', 'Exterior1st_BrkFace', 'Exterior1st_CemntBd', 'Exterior1st_HdBoard', 
            'Exterior1st_MetalSd', 'Exterior1st_Plywood', 'Exterior1st_Stucco', 'Exterior1st_VinylSd', 
            'Exterior1st_WdSdng', 'Exterior1st_WdShing', 'Exterior1st_Other', 'Exterior2nd_AsbShng', 
            'Exterior2nd_BrkFace', 'Exterior2nd_CmentBd', 'Exterior2nd_HdBoard', 'Exterior2nd_MetalSd', 
            'Exterior2nd_Other', 'Exterior2nd_Plywood', 'Exterior2nd_Stucco', 'Exterior2nd_VinylSd', 
            'Exterior2nd_WdSdng', 'Exterior2nd_WdShng'], axis=1, inplace=True)
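
The pairwise averaging above could also be driven by a mapping, which makes the spelling differences between the two column sets (for example `CemntBd` vs `CmentBd`, `WdShing` vs `WdShng`) explicit in one place. A minimal sketch on toy data with only two of the pairs; `EXTERIOR_PAIRS` and `average_pairs` are hypothetical names.

```python
import pandas as pd

# New column -> (Exterior1st column, Exterior2nd column). Only two pairs shown.
EXTERIOR_PAIRS = {
    "Exterior_CmentBd": ("Exterior1st_CemntBd", "Exterior2nd_CmentBd"),
    "Exterior_WdShing": ("Exterior1st_WdShing", "Exterior2nd_WdShng"),
}

def average_pairs(df: pd.DataFrame, pairs: dict) -> pd.DataFrame:
    """Replace each (first, second) column pair with the mean of the two columns."""
    out = df.copy()
    for new_col, (first, second) in pairs.items():
        out[new_col] = (out[first] + out[second]) / 2
    to_drop = [col for pair in pairs.values() for col in pair]
    return out.drop(columns=to_drop)

# Toy frame (hypothetical data).
toy = pd.DataFrame({
    "Exterior1st_CemntBd": [1, 0, 0],
    "Exterior2nd_CmentBd": [1, 0, 0],
    "Exterior1st_WdShing": [0, 1, 0],
    "Exterior2nd_WdShng": [0, 0, 1],
    "LotArea": [8450, 9600, 11250],
})

averaged = average_pairs(toy, EXTERIOR_PAIRS)
print(averaged)
```

A value of 0.5 in the result marks a house where the material appears as only one of the two exterior coverings.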

#### BldgType, MSSubClass
We drop the MSSubClass columns in favour of BldgType, since the information in them can be derived from BldgType.

|X|Y|r|
|----|----|----|
|BldgType_Duplex|MSSubClass_90|1.000000|
|BldgType_2fmCon|MSSubClass_190|0.977761|

In [34]:
train.drop(['MSSubClass_90', 'MSSubClass_190'], axis=1, inplace=True)
test.drop(['MSSubClass_90', 'MSSubClass_190'], axis=1, inplace=True)

#### HouseStyle, MSSubClass
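We drop the MSSubClass columns in favour of HouseStyle, since the information in them can be derived from HouseStyle. HouseStyle_2Story is also dropped because it is strongly correlated with 2ndFlrSF.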

|X|Y|r|
|----|----|----|
|HouseStyle_SLvl|MSSubClass_80|0.954547|
|HouseStyle_1.5Story|MSSubClass_50|0.919442|
|2ndFlrSF|HouseStyle_2Story|0.824436|

In [35]:
train.drop(['MSSubClass_80', 'MSSubClass_50', 'HouseStyle_2Story'], axis=1, inplace=True)
test.drop(['MSSubClass_80', 'MSSubClass_50', 'HouseStyle_2Story'], axis=1, inplace=True)

#### Garage

|X|Y|r|
|----|----|----|
|GarageCars|GarageArea|0.897599|
|GarageQual|GarageCond|0.884649|
|GarageCond|GarageType_None|-0.867613|
|GarageQual|GarageType_None|-0.874681|

In [36]:
train.drop(['GarageArea', 'GarageQual', 'GarageType_None'], axis=1, inplace=True)
test.drop(['GarageArea', 'GarageQual', 'GarageType_None'], axis=1, inplace=True)

#### MSZoning, Neighborhood

|X|Y|r|
|----|----|----|
|MSZoning_FV|Neighborhood_Somerst|0.874837|
|MSZoning_RL|MSZoning_RM|-0.801455|

In [37]:
train.drop(['Neighborhood_Somerst', 'MSZoning_RM'], axis=1, inplace=True)
test.drop(['Neighborhood_Somerst', 'MSZoning_RM'], axis=1, inplace=True)

#### The Rest

|X|Y|r|
|----|----|----|
|PoolArea|PoolQC|0.857544|
|GrLivArea|TotRmsAbvGrd|0.812701|
|Fireplaces|FireplaceQu|0.805170|
|MasVnrType_BrkFace|MasVnrType_None|-0.826490|
|LotShape_IR1|LotShape_Reg|-0.937744|
|RoofStyle_Gable|RoofStyle_Hip|-0.949623|
|LandSlope_Gtl|LandSlope_Mod|-0.955013|
|RoofMatl_CompShg|RoofMatl_Othr|-1.000000|
|Street_Grvl|Street_Pave|-1.000000|
|Heating_GasA|Heating_OthW|-1.000000|

In [38]:
train.drop(['PoolQC', 'TotRmsAbvGrd', 'FireplaceQu', 'MasVnrType_None', 'LotShape_IR1', 'RoofStyle_Gable', 
            'LandSlope_Gtl', 'Street_Grvl', 'RoofMatl_Othr', 'Heating_OthW'], axis=1, inplace=True)
test.drop(['PoolQC', 'TotRmsAbvGrd', 'FireplaceQu', 'MasVnrType_None', 'LotShape_IR1', 'RoofStyle_Gable', 
           'LandSlope_Gtl', 'Street_Grvl', 'RoofMatl_Othr', 'Heating_OthW'], axis=1, inplace=True)
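
The pairs with `r` equal to -1 are complementary dummies: when a feature has only two categories left (as with Street, and presumably with RoofMatl and Heating after the merges in 2.1.1), one column is exactly one minus the other. The sketch below shows this on toy data. It also shows `drop_first=True` in `pd.get_dummies` as one way to avoid creating the redundant column at encoding time; that option is an assumption for illustration, not something this notebook's upstream encoding is known to use.

```python
import pandas as pd

# Toy feature with two categories (hypothetical data).
street = pd.Series(["Pave", "Pave", "Grvl", "Pave", "Grvl", "Pave"], name="Street")

# Full one-hot encoding: the two columns always sum to 1.
dummies = pd.get_dummies(street, prefix="Street", dtype=int)
r = dummies["Street_Grvl"].corr(dummies["Street_Pave"])
print(round(r, 6))

# Dropping the first level keeps a single column that carries the same information.
reduced = pd.get_dummies(street, prefix="Street", drop_first=True, dtype=int)
print(list(reduced.columns))
```

Keeping only one column of such a pair loses no information, which is why one side of each pair is dropped above.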

### 2.1.3 P-Value
We fit an OLS model and check for features with a p-value higher than 0.05.

In [39]:
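# Numeric (and boolean) columns only, skipping the first two columns and the target
numeric_cols = train.select_dtypes(include=['number', 'bool']).columns
features = [col for col in numeric_cols[2:] if col != 'SalePrice']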
X = train[features]
y = train['SalePrice']

In [40]:
X = sm.add_constant(X)
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     169.8
Date:                Sat, 29 Oct 2022   Prob (F-statistic):               0.00
Time:                        08:13:34   Log-Likelihood:                -23300.
No. Observations:                2046   AIC:                         4.689e+04
Df Residuals:                    1902   BIC:                         4.770e+04
Df Model:                         143                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -1.497e+04 

We can drop features one at a time and see how each removal affects the R-squared and Adj. R-squared.

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|SaleType_Oth|1066.4849|9.1e+04|0.012|0.991|-1.77e+05|1.8e+05|
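
The two statistics tracked here are linked by the usual adjustment for the number of predictors: Adj. R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. A small check using the rounded figures from the summary above (n = 2046, p = 143, R-squared = 0.927); `adjusted_r2` is a hypothetical helper, and because the inputs are rounded the result is approximate.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared from R-squared, observation count and predictor count."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Figures read from the OLS summary: No. Observations, Df Model, R-squared.
print(round(adjusted_r2(0.927, 2046, 143), 3))  # 0.922
```

With about 2000 observations, removing a single uninformative predictor changes the penalty term only slightly, which is consistent with Adj. R-squared staying at 0.922 across the first drops.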


In [41]:
X.drop(['SaleType_Oth'], axis=1, inplace=True)
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     169.8
Date:                Sat, 29 Oct 2022   Prob (F-statistic):               0.00
Time:                        08:13:34   Log-Likelihood:                -23300.
No. Observations:                2046   AIC:                         4.689e+04
Df Residuals:                    1902   BIC:                         4.770e+04
Df Model:                         143                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -1.444e+04 

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Exterior_AsbShng|-240.3111|3.74e+04|-0.006|0.995|-7.36e+04|7.31e+04|


In [42]:
X.drop(['Exterior_AsbShng'], axis=1, inplace=True)
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     169.8
Date:                Sat, 29 Oct 2022   Prob (F-statistic):               0.00
Time:                        08:13:34   Log-Likelihood:                -23300.
No. Observations:                2046   AIC:                         4.689e+04
Df Residuals:                    1902   BIC:                         4.770e+04
Df Model:                         143                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -1.456e+04 

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BldgType_Twnhs|728.8205|8.61e+04|0.008|0.993|-1.68e+05|1.7e+05|


In [43]:
X.drop(['BldgType_Twnhs'], axis=1, inplace=True)
model = sm.OLS(y,X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.922
Method:                 Least Squares   F-statistic:                     169.8
Date:                Sat, 29 Oct 2022   Prob (F-statistic):               0.00
Time:                        08:13:34   Log-Likelihood:                -23300.
No. Observations:                2046   AIC:                         4.689e+04
Df Residuals:                    1902   BIC:                         4.770e+04
Df Model:                         143                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -1.414e+04 

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LotConfig_Inside|-440.2248|9.58e+04|-0.005|0.996|-1.88e+05|1.87e+05|


In [44]:
X.drop(['LotConfig_Inside'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Foundation_PConc|-4838.3100|1.36e+05|-0.036|0.972|-2.71e+05|2.61e+05|


In [45]:
X.drop(['Foundation_PConc'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LandContour_Lvl|-6973.7136|1.63e+05|-0.043|0.966|-3.26e+05|3.12e+05|


In [46]:
X.drop(['LandContour_Lvl'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_IDOTRR|-433.8724|7118.952|-0.061|0.951|-1.44e+04|1.35e+04|

In [47]:
X.drop(['Neighborhood_IDOTRR'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_85|-547.7170|8579.544|-0.064|0.949|-1.74e+04|1.63e+04|


In [48]:
X.drop(['MSSubClass_85'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|HouseStyle_SFoyer|-1134.2850|7322.134|-0.155|0.877|-1.55e+04|1.32e+04|


In [49]:
X.drop(['HouseStyle_SFoyer'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|HouseStyle_SLvl|-380.4831|4109.134|-0.093|0.926|-8439.358|7678.391|


In [50]:
X.drop(['HouseStyle_SLvl'], axis=1, inplace=True)

R-squared and Adj. R-squared are both unchanged so far.

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BldgType_1Fam|-2098.6388|1.23e+04|-0.171|0.864|-2.62e+04|2.2e+04|


In [51]:
X.drop(['BldgType_1Fam'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BsmtHalfBath|479.2982|2224.436|0.215|0.829|-3883.284|4841.881|


In [52]:
X.drop(['BsmtHalfBath'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|EnclosedPorch|1.9756|9.612|0.206|0.837|-16.876|20.827|


In [53]:
X.drop(['EnclosedPorch'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Landmrk|6403.2259|2.41e+04|0.266|0.790|-4.08e+04|5.36e+04|


In [54]:
X.drop(['Neighborhood_Landmrk'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Exterior_MetalSd|976.5510|4722.461|0.207|0.836|-8285.171|1.02e+04|


In [55]:
X.drop(['Exterior_MetalSd'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MiscFeature_Gar2|0.3457|1.435|0.241|0.810|-2.470|3.161|


In [56]:
X.drop(['MiscFeature_Gar2'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|HouseStyle_2.5Fin|-4086.2191|1.14e+04|-0.359|0.720|-2.64e+04|1.82e+04|


In [57]:
X.drop(['HouseStyle_2.5Fin'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_20|-1893.0300|4687.685|-0.404|0.686|-1.11e+04|7300.481|


In [58]:
X.drop(['MSSubClass_20'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_40|-4089.9792|1.16e+04|-0.351|0.725|-2.69e+04|1.87e+04|


In [59]:
X.drop(['MSSubClass_40'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_ClearCr|-2264.0848|5979.781|-0.379|0.705|-1.4e+04|9463.482|


In [60]:
X.drop(['Neighborhood_ClearCr'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LandSlope_Mod|1373.8991|3150.076|0.436|0.663|-4804.040|7551.838|


In [61]:
X.drop(['LandSlope_Mod'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MiscFeature_Shed|1.0991|2.755|0.399|0.690|-4.304|6.502|


In [62]:
X.drop(['MiscFeature_Shed'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LandContour_Low|-1718.9346|4019.496|-0.428|0.669|-9601.977|6164.108|


In [63]:
X.drop(['LandContour_Low'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|GarageYrBltDiff|21.0025|43.362|0.484|0.628|-64.038|106.043|


In [64]:
X.drop(['GarageYrBltDiff'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LandContour_Bnk|-1376.2155|2827.660|-0.487|0.627|-6921.823|4169.392|
|MSZoning_ACI|-2920.4879|6007.840|-0.486|0.627|-1.47e+04|8862.089|


In [65]:
X.drop(['LandContour_Bnk','MSZoning_ACI'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LowQualFinSF|3.7914|7.822|0.485|0.628|-11.548|19.131|
|3SsnPorch|9.6961|19.933|0.486|0.627|-29.397|48.789|


In [66]:
X.drop(['LowQualFinSF','3SsnPorch'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|PoolArea|7.6253|15.043|0.507|0.612|-21.878|37.128|


In [67]:
X.drop(['PoolArea'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Greens|8274.8493|1.39e+04|0.596|0.551|-1.9e+04|3.55e+04|


In [68]:
X.drop(['Neighborhood_Greens'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LotConfig_FR2|-1941.0527|3109.942|-0.624|0.533|-8040.262|4158.156|


In [69]:
X.drop(['LotConfig_FR2'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Fence|-429.7675|624.257|-0.688|0.491|-1654.058|794.523|


In [70]:
X.drop(['Fence'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_OldTown|-2026.2887|3132.699|-0.647|0.518|-8170.124|4117.547|


In [71]:
X.drop(['Neighborhood_OldTown'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Sawyer|-1984.1224|3241.737|-0.612|0.541|-8341.801|4373.556|


In [72]:
X.drop(['Neighborhood_Sawyer'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|RoofMatl_CompShg|-4386.5327|5776.467|-0.759|0.448|-1.57e+04|6942.244|
|Exterior_Other|-5064.2642|6654.130|-0.761|0.447|-1.81e+04|7985.779|


In [73]:
X.drop(['RoofMatl_CompShg','Exterior_Other'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Mitchel|-2368.0905|3065.567|-0.772|0.440|-8380.261|3644.079|


In [74]:
X.drop(['Neighborhood_Mitchel'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Alley|1095.7645|1381.803|0.793|0.428|-1614.217|3805.746|


In [75]:
X.drop(['Alley'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Exterior_Stucco|-4337.8285|5091.643|-0.852|0.394|-1.43e+04|5647.862|


In [76]:
X.drop(['Exterior_Stucco'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|PavedDrive|1106.8519|1219.913|0.907|0.364|-1285.630|3499.334|
|Foundation_BrkTil|-2352.8664|2592.059|-0.908|0.364|-7436.389|2730.656|


In [77]:
X.drop(['PavedDrive','Foundation_BrkTil'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|GarageFinish|834.8107|918.678|0.909|0.364|-966.892|2636.514|


In [78]:
X.drop(['GarageFinish'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_60|-3020.5241|3143.322|-0.961|0.337|-9185.174|3144.126|


In [79]:
X.drop(['MSSubClass_60'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|FullBath|1538.2079|1583.777|0.971|0.332|-1567.878|4644.294|

In [80]:
X.drop(['FullBath'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|HalfBath|1042.7321|1434.977|0.727|0.468|-1771.529|3856.993|


In [81]:
X.drop(['HalfBath'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_45|6681.3195|7050.741|0.948|0.343|-7146.506|2.05e+04|


In [82]:
X.drop(['MSSubClass_45'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|YrSold|-389.7046|391.989|-0.994|0.320|-1158.469|379.060|


In [83]:
X.drop(['YrSold'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MoSold|-167.0432|184.259|-0.907|0.365|-528.409|194.322|


In [84]:
X.drop(['MoSold'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LotShape_IR2|3660.9113|3407.691|1.074|0.283|-3022.203|1.03e+04|


In [85]:
X.drop(['LotShape_IR2'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_SWISU|-4705.4113|4576.132|-1.028|0.304|-1.37e+04|4269.231|


In [86]:
X.drop(['Neighborhood_SWISU'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LandSlope_Sev|-9295.0124|8243.550|-1.128|0.260|-2.55e+04|6872.109|


In [87]:
X.drop(['LandSlope_Sev'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Exterior_CmentBd|3866.0083|3379.624|1.144|0.253|-2762.055|1.05e+04|


In [88]:
X.drop(['Exterior_CmentBd'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Street_Pave|9859.9091|8637.371|1.142|0.254|-7079.558|2.68e+04|


In [89]:
X.drop(['Street_Pave'], axis=1, inplace=True)

- R-squared:                       0.927
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|ExterCond|-1847.6881|1510.013|-1.224|0.221|-4809.100|1113.724|
|OpenPorchSF|10.5074|8.589|1.223|0.221|-6.337|27.352|


In [90]:
X.drop(['ExterCond','OpenPorchSF'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BsmtFin|-457.5821|341.752|-1.339|0.181|-1127.821|212.656|


In [91]:
X.drop(['BsmtFin'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|GarageCond|2276.9721|2036.851|1.118|0.264|-1717.661|6271.605|


In [92]:
X.drop(['GarageCond'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|1stFlrSF|14.2073|10.723|1.325|0.185|-6.823|35.237|


In [93]:
X.drop(['1stFlrSF'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|2ndFlrSF|1.9130|3.529|0.542|0.588|-5.009|8.835|


In [94]:
X.drop(['2ndFlrSF'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|GarageType_CarPort|-9900.9959|7431.156|-1.332|0.183|-2.45e+04|4672.830|


In [95]:
X.drop(['GarageType_CarPort'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|YearRemod/Add|57.8054|41.308|1.399|0.162|-23.208|138.818|


In [96]:
X.drop(['YearRemod/Add'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Exterior_WdShing|-4850.6820|3612.038|-1.343|0.179|-1.19e+04|2233.166|


In [97]:
X.drop(['Exterior_WdShing'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Electrical|-2645.7197|1970.905|-1.342|0.180|-6511.013|1219.574|


In [98]:
X.drop(['Electrical'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|GarageType_BuiltIn|-5140.6343|3710.635|-1.385|0.166|-1.24e+04|2136.575|


In [99]:
X.drop(['GarageType_BuiltIn'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|GarageType_Detchd|-2604.0739|2182.358|-1.193|0.233|-6884.061|1675.913|


In [100]:
X.drop(['GarageType_Detchd'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BldgType_Duplex|-5727.4221|4636.846|-1.235|0.217|-1.48e+04|3366.245|


In [101]:
X.drop(['BldgType_Duplex'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BldgType_2fmCon|-3491.6668|3725.791|-0.937|0.349|-1.08e+04|3815.259|


In [102]:
X.drop(['BldgType_2fmCon'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_70|5372.5563|3631.957|1.479|0.139|-1750.342|1.25e+04|
|GarageType_Basment|-7040.8667|4746.254|-1.483|0.138|-1.63e+04|2267.363|


In [103]:
X.drop(['MSSubClass_70','GarageType_Basment'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_75|7822.7066|6061.351|1.291|0.197|-4064.649|1.97e+04|


In [104]:
X.drop(['MSSubClass_75'], axis=1, inplace=True)

- R-squared:                       0.926
- Adj. R-squared:                  0.923

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|HouseStyle_1.5Story|2594.3248|2146.041|1.209|0.227|-1614.432|6803.081|


In [105]:
X.drop(['HouseStyle_1.5Story'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|BldgType_TwnhsE|5211.6996|3825.207|1.362|0.173|-2290.186|1.27e+04|


In [106]:
X.drop(['BldgType_TwnhsE'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Veenker|-8578.1574|5901.276|-1.454|0.146|-2.02e+04|2995.252|


In [107]:
X.drop(['Neighborhood_Veenker'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Timber|-5511.9817|3710.534|-1.485|0.138|-1.28e+04|1765.006|


In [108]:
X.drop(['Neighborhood_Timber'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Edwards|-3408.0887|2283.369|-1.493|0.136|-7886.162|1069.984|
|LotShape_IR3|-1.2e+04|8005.895|-1.499|0.134|-2.77e+04|3701.020|


In [109]:
X.drop(['Neighborhood_Edwards','LotShape_IR3'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_NAmes|-1936.0019|1792.428|-1.080|0.280|-5451.255|1579.252|


In [110]:
X.drop(['Neighborhood_NAmes'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_CollgCr|-3392.9617|2266.609|-1.497|0.135|-7838.163|1052.240|
|Exterior_VinylSd|-2460.1315|1640.310|-1.500|0.134|-5677.054|756.791|


In [111]:
X.drop(['Neighborhood_CollgCr','Exterior_VinylSd'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Gilbert|-2371.2304|2542.668|-0.933|0.351|-7357.827|2615.366|


In [112]:
X.drop(['Neighborhood_Gilbert'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSSubClass_30|4442.4347|2832.899|1.568|0.117|-1113.350|9998.220|


In [113]:
X.drop(['MSSubClass_30'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_SawyerW|-4110.7582|2604.694|-1.578|0.115|-9218.993|997.477|


In [114]:
X.drop(['Neighborhood_SawyerW'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSZoning_RH|9615.3664|6223.383|1.545|0.122|-2589.713|2.18e+04|


In [115]:
X.drop(['MSZoning_RH'], axis=1, inplace=True)

- R-squared:                       0.925
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MSZoning_RL|2358.1486|1821.760|1.294|0.196|-1214.622|5930.919|


In [116]:
X.drop(['MSZoning_RL'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|HeatingQC|1142.1106|703.316|1.624|0.105|-237.207|2521.428|


In [117]:
X.drop(['HeatingQC'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MasVnrType_Stone|3725.7971|2260.772|1.648|0.100|-707.946|8159.540|


In [118]:
X.drop(['MasVnrType_Stone'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|Neighborhood_Blueste|1.564e+04|9675.665|1.616|0.106|-3335.208|3.46e+04|


In [119]:
X.drop(['Neighborhood_Blueste'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LotShape_Reg|2144.6124|1215.768|1.764|0.078|-239.705|4528.930|


In [120]:
X.drop(['LotShape_Reg'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MiscFeature_Othr|4.9934|2.813|1.775|0.076|-0.523|10.509|


In [121]:
X.drop(['MiscFeature_Othr'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.922

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LotConfig_FR3|-1.371e+04|7602.440|-1.803|0.071|-2.86e+04|1199.616|


In [122]:
X.drop(['LotConfig_FR3'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.921

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|MiscFeature_TenC|-21.8941|11.392|-1.922|0.055|-44.235|0.447|
|Foundation_CBlock|-2725.8636|1420.992|-1.918|0.055|-5512.656|60.929|


In [123]:
X.drop(['MiscFeature_TenC','Foundation_CBlock'], axis=1, inplace=True)

- R-squared:                       0.924
- Adj. R-squared:                  0.921

|Column Names|coef|std err|t|p-value|[0.025|0.975]|
|---|---|---|---|---|---|---|
|LotConfig_Corner|-2561.6516|1439.148|-1.780|0.075|-5384.048|260.745|


In [124]:
X.drop(['LotConfig_Corner'], axis=1, inplace=True)

In [125]:
# Refit OLS on the remaining features and print the final summary
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              SalePrice   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     420.4
Date:                Sat, 29 Oct 2022   Prob (F-statistic):               0.00
Time:                        08:13:35   Log-Likelihood:                -23355.
No. Observations:                2046   AIC:                         4.683e+04
Df Residuals:                    1988   BIC:                         4.715e+04
Df Model:                          57                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                -6.937e+05 

In [131]:
X.columns

Index(['const', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'MasVnrArea', 'ExterQual', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'TotalBsmtSF', 'CentralAir', 'GrLivArea',
       'BsmtFullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'Functional', 'Fireplaces', 'GarageCars', 'WoodDeckSF', 'ScreenPorch',
       'GarageType_2Types', 'GarageType_Attchd', 'BsmtLivArea',
       'MasVnrType_BrkFace', 'MSZoning_FV', 'LotConfig_CulDSac',
       'LandContour_HLS', 'Heating_GasA', 'Condition', 'HouseStyle_1Story',
       'RoofStyle_Hip', 'SaleType_COD', 'SaleType_New', 'SaleType_WD',
       'Neighborhood_Blmngtn', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
       'Neighborhood_Crawfor', 'Neighborhood_GrnHill', 'Neighborhood_MeadowV',
       'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
       'Neighborhood_NridgHt', 'Neighborhood_StoneBr', 'MSSubClass_120',
       'MSSubClass_150', 'MSSubClass_160', 'MSSubClass_180',

After dropping the features with high p-values, the following 57 features remain for predicting sale price:

- LotFrontage
- LotArea
- OverallQual
- OverallCond  
- YearBuilt 
- MasVnrArea      
- ExterQual 
- BsmtQual         
- BsmtCond
- BsmtExposure
- TotalBsmtSF
- CentralAir
- GrLivArea 
- BsmtFullBath 
- BedroomAbvGr
- KitchenAbvGr
- KitchenQual
- Functional
- Fireplaces   
- GarageCars        
- WoodDeckSF
- ScreenPorch 
- GarageType_2Types
- GarageType_Attchd
- BsmtLivArea 
- MasVnrType_BrkFace 
- MSZoning_FV  
- LotConfig_CulDSac 
- LandContour_HLS 
- Heating_GasA 
- Condition    
- HouseStyle_1Story  
- RoofStyle_Hip 
- SaleType_COD        
- SaleType_New
- SaleType_WD 
- Neighborhood_Blmngtn 
- Neighborhood_BrDale
- Neighborhood_BrkSide 
- Neighborhood_Crawfor
- Neighborhood_GrnHill 
- Neighborhood_MeadowV
- Neighborhood_NPkVill
- Neighborhood_NWAmes
- Neighborhood_NoRidge
- Neighborhood_NridgHt
- Neighborhood_StoneBr
- MSSubClass_120
- MSSubClass_150
- MSSubClass_160
- MSSubClass_180
- RoofStyle_Othr
- Foundation_Other
- Exterior_BrkFace
- Exterior_HdBoard
- Exterior_Plywood
- Exterior_WdSdng

### 2.1.4 Saving Final Dataset

In [127]:
columns = ['Id','PID','LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
            'YearBuilt', 'MasVnrArea', 'ExterQual', 'BsmtQual', 'BsmtCond',
            'BsmtExposure', 'TotalBsmtSF', 'CentralAir', 'GrLivArea',
            'BsmtFullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
            'Functional', 'Fireplaces', 'GarageCars', 'WoodDeckSF', 'ScreenPorch',
            'GarageType_2Types', 'GarageType_Attchd', 'BsmtLivArea',
            'MasVnrType_BrkFace', 'MSZoning_FV', 'LotConfig_CulDSac',
            'LandContour_HLS', 'Heating_GasA', 'Condition', 'HouseStyle_1Story',
            'RoofStyle_Hip', 'SaleType_COD', 'SaleType_New', 'SaleType_WD',
            'Neighborhood_Blmngtn', 'Neighborhood_BrDale', 'Neighborhood_BrkSide',
            'Neighborhood_Crawfor', 'Neighborhood_GrnHill', 'Neighborhood_MeadowV',
            'Neighborhood_NPkVill', 'Neighborhood_NWAmes', 'Neighborhood_NoRidge',
            'Neighborhood_NridgHt', 'Neighborhood_StoneBr', 'MSSubClass_120',
            'MSSubClass_150', 'MSSubClass_160', 'MSSubClass_180', 'RoofStyle_Othr',
            'Foundation_Other', 'Exterior_BrkFace', 'Exterior_HdBoard',
            'Exterior_Plywood', 'Exterior_WdSdng', 'SalePrice']
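
The hard-coded list above is the content of `X.columns` without `const`, with the identifier columns `Id` and `PID` in front and the target `SalePrice` at the end. As a sketch, it could be derived rather than typed out, which avoids the list drifting out of sync if the elimination step is rerun. The helper name `build_final_columns` is hypothetical and the short column list in the demo is a stand-in for the real `X.columns`.

```python
def build_final_columns(design_columns, id_columns=("Id", "PID"),
                        target="SalePrice", intercept="const"):
    """Return identifier columns + model features (no intercept) + target."""
    features = [col for col in design_columns if col != intercept]
    return list(id_columns) + features + [target]


# Stand-in for list(X.columns); the real list has 58 entries including const.
demo_design_columns = ["const", "LotFrontage", "LotArea", "OverallQual"]
demo_columns = build_final_columns(demo_design_columns)
print(demo_columns)
```

In the notebook this would read `columns = build_final_columns(X.columns)`.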

In [128]:
train = train[columns]

In [132]:
# Same columns as train, minus the target SalePrice
columns_t = [col for col in columns if col != 'SalePrice']
test = test[columns_t]
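
Selecting with `df[columns]` raises a `KeyError` if a column is absent, which can happen when an encoded category exists in only one of the two sets. Below is a minimal sketch of a selection helper that names the missing columns and the frame they are missing from. The helper `select_columns` and the toy frames with made-up values are hypothetical and only illustrate the check.

```python
import pandas as pd


def select_columns(df, columns, name="frame"):
    """Return a copy of df restricted to `columns`, failing loudly if any are missing."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"{name} is missing columns: {missing}")
    return df.loc[:, columns].copy()


# Toy stand-ins for the real train/test frames (made-up values).
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8450, 9600],
    "Unused": [0, 1],
    "SalePrice": [208500, 181500],
})
toy_test = pd.DataFrame({"Id": [3], "LotArea": [11250], "Unused": [1]})

toy_columns = ["Id", "LotArea", "SalePrice"]
toy_columns_t = [col for col in toy_columns if col != "SalePrice"]

toy_train_final = select_columns(toy_train, toy_columns, name="train")
toy_test_final = select_columns(toy_test, toy_columns_t, name="test")

# Train should carry exactly one extra column: the target.
assert toy_train_final.shape[1] == toy_test_final.shape[1] + 1
```

The same one-column difference is what the final `train.shape` and `test.shape` outputs should show.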

In [135]:
train.to_csv('../datasets/train_clean2.csv', index=False)

In [136]:
test.to_csv('../datasets/test_clean2.csv', index=False)

In [133]:
train.shape

(2046, 60)

In [134]:
test.shape

(878, 59)