<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2(a): Ames Housing Data and Kaggle Challenge
# (Train Dataset Cleaning)

## Problem Statement:

Using a data science approach, this project aims to identify the areas that contribute to high transacted prices and the areas where the highest transacted volume occurs, in order to help the realtors of Skywalker Property Advisors gain a competitive advantage in the Ames housing market.

### Contents:
- [Background](#Background)
- [Data Import & Cleaning - Train Dataset](#Data-Import-&-Cleaning---Train-Dataset)
- [Data Dummifying - Train Dataset](#Data-Dummifying---Train-Dataset)

## Background

Ames is a city in the state of Iowa, US.

The Ames housing market was stable from 2006 to 2010: the number of houses sold per year stayed relatively consistent at ~400+, even though the US was going through the subprime financial crisis in that period.
Several real estate agencies operate in Ames, and Skywalker Property Advisors is one of them.

As a data scientist engaged in 2010 to advise the realtors of Skywalker Property Advisors, I analysed data on houses sold in Ames from January 2006 to July 2010 and drew out some key observations. Based on these observations, recommendations were made to the realtors of Skywalker Property Advisors to help them improve their sales and gain a competitive advantage in the Ames housing market.

## Data Import & Cleaning - Train Dataset

In [1]:
# import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import scipy.stats as stats
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge

%matplotlib inline

### Data Import and Cleaning for 1st Dataset - train.csv

In [2]:
train = pd.read_csv('../data/train.csv')

In [3]:
# print first 5 rows of train.csv dataframe
train.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


In [4]:
# to allow all missing values from all columns to be displayed
pd.set_option('display.max_rows', 100)

In [5]:
# check for missing values in all columns of train
train.isnull().sum()

Id                    0
PID                   0
MS SubClass           0
MS Zoning             0
Lot Frontage        330
Lot Area              0
Street                0
Alley              1911
Lot Shape             0
Land Contour          0
Utilities             0
Lot Config            0
Land Slope            0
Neighborhood          0
Condition 1           0
Condition 2           0
Bldg Type             0
House Style           0
Overall Qual          0
Overall Cond          0
Year Built            0
Year Remod/Add        0
Roof Style            0
Roof Matl             0
Exterior 1st          0
Exterior 2nd          0
Mas Vnr Type         22
Mas Vnr Area         22
Exter Qual            0
Exter Cond            0
Foundation            0
Bsmt Qual            55
Bsmt Cond            55
Bsmt Exposure        58
BsmtFin Type 1       55
BsmtFin SF 1          1
BsmtFin Type 2       56
BsmtFin SF 2          1
Bsmt Unf SF           1
Total Bsmt SF         1
Heating               0
Heating QC      
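
The full `isnull().sum()` listing is long and mostly zeros. As a minimal sketch (the `toy` dataframe and the `missing_summary` helper are illustrative names and made-up values, not part of the original notebook), the same check can be narrowed to only the columns that actually have missing values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real `train` dataframe (values are made up).
toy = pd.DataFrame({
    'Lot Frontage': [np.nan, 43.0, 68.0, np.nan],
    'Alley': [np.nan, np.nan, 'Pave', np.nan],
    'Lot Area': [13517, 11492, 7922, 9802],
})

def missing_summary(df):
    """Return missing-value counts for columns that have any, largest first."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(train)`, which should list only columns such as `Alley` and `Lot Frontage` that appear with non-zero counts above.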

### Change all NA and NaN values to 0 or 'No'

In [6]:
# lowercase all column headings and replace spaces with '_'
train.columns = train.columns.str.lower().str.replace(' ', '_')
pd.set_option('display.max_columns', 100)
train.head()

Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,gr_liv_area,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_yr_blt,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,Gd,TA,CBlock,TA,TA,No,GLQ,533.0,Unf,0.0,192.0,725.0,GasA,Ex,Y,SBrkr,725,754,0,1479,0.0,0.0,2,1,3,1,Gd,6,Typ,0,,Attchd,1976.0,RFn,2.0,475.0,TA,TA,Y,0,44,0,0,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,Gd,TA,PConc,Gd,TA,No,GLQ,637.0,Unf,0.0,276.0,913.0,GasA,Ex,Y,SBrkr,913,1209,0,2122,1.0,0.0,2,1,4,1,Gd,8,Typ,1,TA,Attchd,1997.0,RFn,2.0,559.0,TA,TA,Y,0,74,0,0,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,Gd,CBlock,TA,TA,No,GLQ,731.0,Unf,0.0,326.0,1057.0,GasA,TA,Y,SBrkr,1057,0,0,1057,1.0,0.0,1,0,3,1,Gd,5,Typ,0,,Detchd,1953.0,Unf,1.0,246.0,TA,TA,Y,0,52,0,0,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,,0.0,TA,TA,PConc,Gd,TA,No,Unf,0.0,Unf,0.0,384.0,384.0,GasA,Gd,Y,SBrkr,744,700,0,1444,0.0,0.0,2,1,3,1,TA,7,Typ,0,,BuiltIn,2007.0,Fin,2.0,400.0,TA,TA,Y,100,0,0,0,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd Sdng,Plywood,,0.0,TA,TA,PConc,Fa,Gd,No,Unf,0.0,Unf,0.0,676.0,676.0,GasA,TA,Y,SBrkr,831,614,0,1445,0.0,0.0,2,0,3,1,TA,6,Typ,0,,Detchd,1957.0,Unf,2.0,484.0,TA,TA,N,0,59,0,0,0,0,,,,0,3,2010,WD,138500


In [7]:
# replace 'None' cells in 'Mas_Vnr_Type' column with 'No'
# this is to standardise all non-numerical null values to 'No'
train['mas_vnr_type'] = train['mas_vnr_type'].replace('None','No')

In [8]:
# replace bsmt_exposure 'No' (No Exposure) with 'NE'
# this is to differentiate between 'No' (No Exposure) and 'No' (non-numerical null values) by changing No Exposure to 'NE'
train[['bsmt_exposure']] = train[['bsmt_exposure']].replace('No', 'NE')

In [9]:
# replace empty/NA cells in these categorical columns with 'No'
no_fill_cols = ['alley', 'mas_vnr_type', 'bsmt_qual', 'bsmt_cond', 'bsmt_exposure',
                'bsmtfin_type_1', 'bsmtfin_type_2', 'fireplace_qu', 'garage_type',
                'garage_finish', 'garage_qual', 'garage_cond', 'pool_qc', 'fence',
                'misc_feature']
train[no_fill_cols] = train[no_fill_cols].fillna('No')

In [10]:
# replace empty/NA cells in these numerical columns with 0
zero_fill_cols = ['lot_frontage', 'mas_vnr_area', 'bsmtfin_sf_1', 'bsmtfin_sf_2',
                  'bsmt_unf_sf', 'total_bsmt_sf', 'bsmt_full_bath', 'bsmt_half_bath',
                  'garage_cars', 'garage_area']
train[zero_fill_cols] = train[zero_fill_cols].fillna(0)
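
The two cells above name every column by hand. A possible alternative, sketched here on a made-up `toy` dataframe, is to pick the columns by dtype. The `fill_missing` helper is a hypothetical name, and it assumes that every non-numeric column should receive `'No'` and every numeric column `0`, which is a broader rule than the explicit lists used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` after the column names were lowercased (made-up values).
toy = pd.DataFrame({
    'alley': [np.nan, 'Pave', np.nan],
    'lot_frontage': [np.nan, 43.0, 68.0],
    'lot_area': [13517, 11492, 7922],
})

def fill_missing(df, text_fill='No', num_fill=0):
    """Fill missing values by dtype: text_fill for non-numeric, num_fill for numeric."""
    out = df.copy()
    num_cols = out.select_dtypes(include='number').columns
    text_cols = out.select_dtypes(exclude='number').columns
    out[text_cols] = out[text_cols].fillna(text_fill)
    out[num_cols] = out[num_cols].fillna(num_fill)
    return out

filled = fill_missing(toy)
print(filled)
```

The explicit lists are safer when a missing value in some column should not be read as "feature absent", so this shortcut only fits once that has been checked against the data dictionary.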

In [11]:
# drop the 'pid', 'gr_liv_area' and 'garage_yr_blt' columns
train = train.drop(columns=['pid', 'gr_liv_area', 'garage_yr_blt'])

### Cross reference each column against Data Dictionary to check for abnormalities, and resolve them.

In [12]:
# checking each column to check for abnormalities
np.unique(train['mas_vnr_type'])

array(['BrkCmn', 'BrkFace', 'No', 'Stone'], dtype=object)

In [13]:
# checking each column to check for abnormalities
np.unique(train['bsmt_exposure'])

array(['Av', 'Gd', 'Mn', 'NE', 'No'], dtype=object)

In [14]:
# checking each column to check for abnormalities
np.unique(train['ms_zoning'])

array(['A (agr)', 'C (all)', 'FV', 'I (all)', 'RH', 'RL', 'RM'],
      dtype=object)

In [15]:
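# strip the bracketed suffixes so the zoning codes match the Data Dictionary
# (assign the result back; calling replace(..., inplace=True) on a selected column may not update train in newer pandas)
train['ms_zoning'] = train['ms_zoning'].replace({'A (agr)': 'A', 'C (all)': 'C', 'I (all)': 'I'})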

In [16]:
# checking each column to check for abnormalities
np.unique(train['ms_zoning'])

array(['A', 'C', 'FV', 'I', 'RH', 'RL', 'RM'], dtype=object)

In [17]:
# checking each column to check for abnormalities
np.unique(train['exterior_1st'])

array(['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd',
       'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco',
       'VinylSd', 'Wd Sdng', 'WdShing'], dtype=object)

In [18]:
# in 'exterior_1st', replace all values that have space with '_'
train['exterior_1st'] = train['exterior_1st'].str.replace(' ', '_')

In [19]:
# checking each column to check for abnormalities
np.unique(train['exterior_2nd'])

array(['AsbShng', 'AsphShn', 'Brk Cmn', 'BrkFace', 'CBlock', 'CmentBd',
       'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco',
       'VinylSd', 'Wd Sdng', 'Wd Shng'], dtype=object)

In [20]:
# use the same spelling as 'exterior_1st' ('BrkComm')
train['exterior_2nd'] = train['exterior_2nd'].replace("Brk Cmn", "BrkComm")

In [21]:
# use the same spelling as 'exterior_1st' ('WdShing')
train['exterior_2nd'] = train['exterior_2nd'].replace("Wd Shng", "WdShing")

In [22]:
# in 'exterior_2nd', replace all values that have space with '_'
train['exterior_2nd'] = train['exterior_2nd'].str.replace(' ', '_')

In [23]:
# checking each column to check for abnormalities
np.unique(train['central_air'])

array(['N', 'Y'], dtype=object)

In [24]:
# encode central air conditioning as 1 (Y) / 0 (N)
train['central_air'] = train['central_air'].replace({'Y': 1, 'N': 0})
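
The column-by-column `np.unique` checks above can also be run in one pass. The sketch below is illustrative only: the `toy` dataframe is made up, `unexpected_values` is a hypothetical helper, and the `expected` sets are a hand-typed excerpt that would need to be verified against the Data Dictionary before use.

```python
import pandas as pd

# Toy stand-in for `train` before the zoning codes were cleaned (made-up rows).
toy = pd.DataFrame({
    'ms_zoning': ['RL', 'C (all)', 'RM', 'A (agr)'],
    'central_air': ['Y', 'N', 'Y', 'Y'],
})

# Hypothetical excerpt of allowed codes per column; verify against the Data Dictionary.
expected = {
    'ms_zoning': {'A', 'C', 'FV', 'I', 'RH', 'RL', 'RP', 'RM'},
    'central_air': {'N', 'Y'},
}

def unexpected_values(df, allowed):
    """Map each column to the sorted values that are not in its allowed set."""
    report = {}
    for col, codes in allowed.items():
        extra = sorted(set(df[col].dropna().unique()) - codes)
        if extra:
            report[col] = extra
    return report

report = unexpected_values(toy, expected)
print(report)  # {'ms_zoning': ['A (agr)', 'C (all)']}
```

Columns that match the dictionary are left out of the report, so an empty dict would mean that no abnormalities were found for the columns listed in `expected`.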

### Change string based gradings to integer based gradings

In [25]:
# map quality/condition grades to integers; 'No' (feature absent) maps to 0
# results are assigned back because replace(..., inplace=True) on a selected column may not update train in newer pandas
train["exter_qual"] = train["exter_qual"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1})
train["exter_cond"] = train["exter_cond"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1})
train["bsmt_qual"] = train["bsmt_qual"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "No": 0})
train["bsmt_cond"] = train["bsmt_cond"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "No": 0})
train["heating_qc"] = train["heating_qc"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1})
train["kitchen_qual"] = train["kitchen_qual"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1})
train["fireplace_qu"] = train["fireplace_qu"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "No": 0})
train["garage_qual"] = train["garage_qual"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "No": 0})
train["garage_cond"] = train["garage_cond"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "No": 0})
train["pool_qc"] = train["pool_qc"].replace({"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "No": 0})
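
Since the quality columns share one grading scale, the mapping can be written once and applied in a loop. This is a minimal sketch on made-up data; `QUALITY_SCALE` and `encode_quality` are hypothetical names, and the check for unmapped values is an addition that the cell above does not perform.

```python
import pandas as pd

# Shared grading scale; 'No' stands for "feature absent", as used in this notebook.
QUALITY_SCALE = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1, 'No': 0}

# Toy stand-in for two of the graded columns (made-up values).
toy = pd.DataFrame({
    'exter_qual': ['Gd', 'TA', 'Ex'],
    'fireplace_qu': ['No', 'TA', 'Gd'],
})

def encode_quality(df, columns, scale=QUALITY_SCALE):
    """Return a copy of df with the given columns mapped to integer grades."""
    out = df.copy()
    for col in columns:
        # Fail loudly on any grade the scale does not cover.
        unknown = set(out[col].unique()) - set(scale)
        if unknown:
            raise ValueError(f"{col} has unmapped values: {sorted(unknown)}")
        out[col] = out[col].map(scale).astype(int)
    return out

encoded = encode_quality(toy, ['exter_qual', 'fireplace_qu'])
print(encoded)
```

Using `map` with an explicit check means a misspelt grade raises an error instead of staying in the column as a string.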

In [26]:
train['lot_shape'] = train['lot_shape'].replace({
    'Reg': 4,
    'IR1': 3,
    'IR2': 2,
    'IR3': 1
})
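# per the Data Dictionary, 'Gd' (good exposure) ranks above 'Av' (average exposure)
# 'NE' (no exposure) and 'No' (no basement) both map to 0
train['bsmt_exposure'] = train['bsmt_exposure'].replace({
    'Gd': 3,
    'Av': 2,
    'Mn': 1,
    'NE': 0,
    'No': 0
})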
train['utilities'] = train['utilities'].replace({
    'AllPub': 4,
    'NoSewr': 3,
    'NoSeWa': 2,
    'ELO': 1
})
train['bsmtfin_type_1'] = train['bsmtfin_type_1'].replace({
    'GLQ': 6,
    'ALQ': 5,
    'BLQ': 4,
    'Rec': 3,
    'LwQ': 2,
    'Unf': 1,
    'No': 0
})
train['bsmtfin_type_2'] = train['bsmtfin_type_2'].replace({
    'GLQ': 6,
    'ALQ': 5,
    'BLQ': 4,
    'Rec': 3,
    'LwQ': 2,
    'Unf': 1,
    'No': 0
})
train['functional'] = train['functional'].replace({
    'Typ': 8,
    'Min1': 7,
    'Min2': 6,
    'Mod': 5,
    'Maj1': 4,
    'Maj2': 3,
    'Sev': 2,
    'Sal': 1
})
train['garage_finish'] = train['garage_finish'].replace({
    'Fin': 3,
    'RFn': 2,
    'Unf': 1,
    'No': 0
})
train['paved_drive'] = train['paved_drive'].replace({
    'Y': 2,
    'P': 1,
    'N': 0
})

In [27]:
pd.set_option('display.max_columns', 100)
train.head()

Unnamed: 0,id,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,land_slope,neighborhood,condition_1,condition_2,bldg_type,house_style,overall_qual,overall_cond,year_built,year_remod/add,roof_style,roof_matl,exterior_1st,exterior_2nd,mas_vnr_type,mas_vnr_area,exter_qual,exter_cond,foundation,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating,heating_qc,central_air,electrical,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_type,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,109,60,RL,0.0,13517,Pave,No,3,Lvl,4,CulDSac,Gtl,Sawyer,RRAe,Norm,1Fam,2Story,6,8,1976,2005,Gable,CompShg,HdBoard,Plywood,BrkFace,289.0,4,3,CBlock,3,3,0,6,533.0,1,0.0,192.0,725.0,GasA,5,1,SBrkr,725,754,0,0.0,0.0,2,1,3,1,4,6,8,0,0,Attchd,2,2.0,475.0,3,3,2,0,44,0,0,0,0,0,No,No,0,3,2010,WD,130500
1,544,60,RL,43.0,11492,Pave,No,3,Lvl,4,CulDSac,Gtl,SawyerW,Norm,Norm,1Fam,2Story,7,5,1996,1997,Gable,CompShg,VinylSd,VinylSd,BrkFace,132.0,4,3,PConc,4,3,0,6,637.0,1,0.0,276.0,913.0,GasA,5,1,SBrkr,913,1209,0,1.0,0.0,2,1,4,1,4,8,8,1,3,Attchd,2,2.0,559.0,3,3,2,0,74,0,0,0,0,0,No,No,0,4,2009,WD,220000
2,153,20,RL,68.0,7922,Pave,No,4,Lvl,4,Inside,Gtl,NAmes,Norm,Norm,1Fam,1Story,5,7,1953,2007,Gable,CompShg,VinylSd,VinylSd,No,0.0,3,4,CBlock,3,3,0,6,731.0,1,0.0,326.0,1057.0,GasA,3,1,SBrkr,1057,0,0,1.0,0.0,1,0,3,1,4,5,8,0,0,Detchd,1,1.0,246.0,3,3,2,0,52,0,0,0,0,0,No,No,0,1,2010,WD,109000
3,318,60,RL,73.0,9802,Pave,No,4,Lvl,4,Inside,Gtl,Timber,Norm,Norm,1Fam,2Story,5,5,2006,2007,Gable,CompShg,VinylSd,VinylSd,No,0.0,3,3,PConc,4,3,0,1,0.0,1,0.0,384.0,384.0,GasA,4,1,SBrkr,744,700,0,0.0,0.0,2,1,3,1,3,7,8,0,0,BuiltIn,3,2.0,400.0,3,3,2,100,0,0,0,0,0,0,No,No,0,4,2010,WD,174000
4,255,50,RL,82.0,14235,Pave,No,3,Lvl,4,Inside,Gtl,SawyerW,Norm,Norm,1Fam,1.5Fin,6,8,1900,1993,Gable,CompShg,Wd_Sdng,Plywood,No,0.0,3,3,PConc,2,4,0,1,0.0,1,0.0,676.0,676.0,GasA,3,1,SBrkr,831,614,0,0.0,0.0,2,0,3,1,3,6,8,0,0,Detchd,1,2.0,484.0,3,3,0,0,59,0,0,0,0,0,No,No,0,3,2010,WD,138500


In [28]:
# export cleaned data into new csv file
train.to_csv('../data/train_cleaned.csv', index=False)
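
Because the cleaned data is written to CSV and read back in the next section, it helps that the placeholder used for absent features is `'No'` and not `'NA'`. The sketch below uses a made-up `toy` dataframe and an in-memory buffer in place of the real file to show that, with the default `pd.read_csv` settings, the string `'NA'` comes back as a missing value while `'No'` survives the round trip:

```python
import io
import pandas as pd

# Toy column holding both placeholders (made-up values).
toy = pd.DataFrame({'alley': ['No', 'Pave', 'NA']})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Default settings treat the string 'NA' as a missing value on read.
reloaded = pd.read_csv(buffer)
print(reloaded['alley'].isnull().tolist())  # [False, False, True]
```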

## Data Dummifying - Train Dataset

In [29]:
train_cleaned = pd.read_csv('../data/train_cleaned.csv')

In [30]:
# one-hot encode the nominal columns of the reloaded dataframe
# drop_first=True drops one level per column; dtype=int keeps the dummies as 0/1 integers
train_cleaned = pd.get_dummies(train_cleaned,
                               columns=['ms_subclass','ms_zoning','street','alley','land_contour','lot_config','land_slope',
                                        'neighborhood','condition_1','condition_2','bldg_type','house_style','roof_style', 'roof_matl',
                                        'exterior_1st','exterior_2nd','mas_vnr_type','foundation','heating','electrical','garage_type',
                                        'fence','misc_feature','sale_type'],
                               drop_first=True, dtype=int)

In [31]:
train_cleaned.shape

(2051, 217)
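
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up `toy` dataframe with a single nominal column. The first level in sorted order (`'Grvl'`) is dropped and acts as the baseline, which is why only a `street_Pave` column appears for `street` in the dummified data:

```python
import pandas as pd

# Toy stand-in with one nominal column (made-up rows).
toy = pd.DataFrame({
    'street': ['Pave', 'Grvl', 'Pave'],
    'saleprice': [130500, 220000, 109000],
})

# 'Grvl' sorts first, so it is dropped and becomes the baseline level.
dummies = pd.get_dummies(toy, columns=['street'], drop_first=True, dtype=int)
print(list(dummies.columns))            # ['saleprice', 'street_Pave']
print(dummies['street_Pave'].tolist())  # [1, 0, 1]
```

Dropping one level per column avoids perfectly collinear dummy columns, which matters for the linear regression models imported at the top of this notebook.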

In [32]:
pd.set_option('display.max_columns', 250)
train_cleaned.head()

Unnamed: 0,id,lot_frontage,lot_area,lot_shape,utilities,overall_qual,overall_cond,year_built,year_remod/add,mas_vnr_area,exter_qual,exter_cond,bsmt_qual,bsmt_cond,bsmt_exposure,bsmtfin_type_1,bsmtfin_sf_1,bsmtfin_type_2,bsmtfin_sf_2,bsmt_unf_sf,total_bsmt_sf,heating_qc,central_air,1st_flr_sf,2nd_flr_sf,low_qual_fin_sf,bsmt_full_bath,bsmt_half_bath,full_bath,half_bath,bedroom_abvgr,kitchen_abvgr,kitchen_qual,totrms_abvgrd,functional,fireplaces,fireplace_qu,garage_finish,garage_cars,garage_area,garage_qual,garage_cond,paved_drive,wood_deck_sf,open_porch_sf,enclosed_porch,3ssn_porch,screen_porch,pool_area,pool_qc,misc_val,mo_sold,yr_sold,saleprice,ms_subclass_30,ms_subclass_40,ms_subclass_45,ms_subclass_50,ms_subclass_60,ms_subclass_70,ms_subclass_75,ms_subclass_80,ms_subclass_85,ms_subclass_90,ms_subclass_120,ms_subclass_150,ms_subclass_160,ms_subclass_180,ms_subclass_190,ms_zoning_C,ms_zoning_FV,ms_zoning_I,ms_zoning_RH,ms_zoning_RL,ms_zoning_RM,street_Pave,alley_No,alley_Pave,land_contour_HLS,land_contour_Low,land_contour_Lvl,lot_config_CulDSac,lot_config_FR2,lot_config_FR3,lot_config_Inside,land_slope_Mod,land_slope_Sev,neighborhood_Blueste,neighborhood_BrDale,neighborhood_BrkSide,neighborhood_ClearCr,neighborhood_CollgCr,neighborhood_Crawfor,neighborhood_Edwards,neighborhood_Gilbert,neighborhood_Greens,neighborhood_GrnHill,neighborhood_IDOTRR,neighborhood_Landmrk,neighborhood_MeadowV,neighborhood_Mitchel,neighborhood_NAmes,neighborhood_NPkVill,neighborhood_NWAmes,neighborhood_NoRidge,neighborhood_NridgHt,neighborhood_OldTown,neighborhood_SWISU,neighborhood_Sawyer,neighborhood_SawyerW,neighborhood_Somerst,neighborhood_StoneBr,neighborhood_Timber,neighborhood_Veenker,condition_1_Feedr,condition_1_Norm,condition_1_PosA,condition_1_PosN,condition_1_RRAe,condition_1_RRAn,condition_1_RRNe,condition_1_RRNn,condition_2_Feedr,condition_2_Norm,condition_2_PosA,condition_2_PosN,condition_2_RRAe,condition_2_RRAn,condition_2_RRNn,bldg_type_2fmCon,bldg_type_Duplex,bldg_type_Twnhs,bldg_type_TwnhsE,house_style_1.5Unf,house_style_1Story,house_style_2.5Fin,house_style_2.5Unf,house_style_2Story,house_style_SFoyer,house_style_SLvl,roof_style_Gable,roof_style_Gambrel,roof_style_Hip,roof_style_Mansard,roof_style_Shed,roof_matl_CompShg,roof_matl_Membran,roof_matl_Tar&Grv,roof_matl_WdShake,roof_matl_WdShngl,exterior_1st_AsphShn,exterior_1st_BrkComm,exterior_1st_BrkFace,exterior_1st_CBlock,exterior_1st_CemntBd,exterior_1st_HdBoard,exterior_1st_ImStucc,exterior_1st_MetalSd,exterior_1st_Plywood,exterior_1st_Stone,exterior_1st_Stucco,exterior_1st_VinylSd,exterior_1st_WdShing,exterior_1st_Wd_Sdng,exterior_2nd_AsphShn,exterior_2nd_BrkComm,exterior_2nd_BrkFace,exterior_2nd_CBlock,exterior_2nd_CmentBd,exterior_2nd_HdBoard,exterior_2nd_ImStucc,exterior_2nd_MetalSd,exterior_2nd_Plywood,exterior_2nd_Stone,exterior_2nd_Stucco,exterior_2nd_VinylSd,exterior_2nd_WdShing,exterior_2nd_Wd_Sdng,mas_vnr_type_BrkFace,mas_vnr_type_No,mas_vnr_type_Stone,foundation_CBlock,foundation_PConc,foundation_Slab,foundation_Stone,foundation_Wood,heating_GasW,heating_Grav,heating_OthW,heating_Wall,electrical_FuseF,electrical_FuseP,electrical_Mix,electrical_SBrkr,garage_type_Attchd,garage_type_Basment,garage_type_BuiltIn,garage_type_CarPort,garage_type_Detchd,garage_type_No,fence_GdWo,fence_MnPrv,fence_MnWw,fence_No,misc_feature_Gar2,misc_feature_No,misc_feature_Othr,misc_feature_Shed,misc_feature_TenC,sale_type_CWD,sale_type_Con,sale_type_ConLD,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD
0,109,0.0,13517,3,4,6,8,1976,2005,289.0,4,3,3,3,0,6,533.0,1,0.0,192.0,725.0,5,1,725,754,0,0.0,0.0,2,1,3,1,4,6,8,0,0,2,2.0,475.0,3,3,2,0,44,0,0,0,0,0,0,3,2010,130500,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1
1,544,43.0,11492,3,4,7,5,1996,1997,132.0,4,3,4,3,0,6,637.0,1,0.0,276.0,913.0,5,1,913,1209,0,1.0,0.0,2,1,4,1,4,8,8,1,3,2,2.0,559.0,3,3,2,0,74,0,0,0,0,0,0,4,2009,220000,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1
2,153,68.0,7922,4,4,5,7,1953,2007,0.0,3,4,3,3,0,6,731.0,1,0.0,326.0,1057.0,3,1,1057,0,0,1.0,0.0,1,0,3,1,4,5,8,0,0,1,1.0,246.0,3,3,2,0,52,0,0,0,0,0,0,1,2010,109000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1
3,318,73.0,9802,4,4,5,5,2006,2007,0.0,3,3,4,3,0,1,0.0,1,0.0,384.0,384.0,4,1,744,700,0,0.0,0.0,2,1,3,1,3,7,8,0,0,3,2.0,400.0,3,3,2,100,0,0,0,0,0,0,0,4,2010,174000,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1
4,255,82.0,14235,3,4,6,8,1900,1993,0.0,3,3,2,4,0,1,0.0,1,0.0,676.0,676.0,3,1,831,614,0,0.0,0.0,2,0,3,1,3,6,8,0,0,1,2.0,484.0,3,3,0,0,59,0,0,0,0,0,0,3,2010,138500,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1


In [33]:
list(train_cleaned)

['id',
 'lot_frontage',
 'lot_area',
 'lot_shape',
 'utilities',
 'overall_qual',
 'overall_cond',
 'year_built',
 'year_remod/add',
 'mas_vnr_area',
 'exter_qual',
 'exter_cond',
 'bsmt_qual',
 'bsmt_cond',
 'bsmt_exposure',
 'bsmtfin_type_1',
 'bsmtfin_sf_1',
 'bsmtfin_type_2',
 'bsmtfin_sf_2',
 'bsmt_unf_sf',
 'total_bsmt_sf',
 'heating_qc',
 'central_air',
 '1st_flr_sf',
 '2nd_flr_sf',
 'low_qual_fin_sf',
 'bsmt_full_bath',
 'bsmt_half_bath',
 'full_bath',
 'half_bath',
 'bedroom_abvgr',
 'kitchen_abvgr',
 'kitchen_qual',
 'totrms_abvgrd',
 'functional',
 'fireplaces',
 'fireplace_qu',
 'garage_finish',
 'garage_cars',
 'garage_area',
 'garage_qual',
 'garage_cond',
 'paved_drive',
 'wood_deck_sf',
 'open_porch_sf',
 'enclosed_porch',
 '3ssn_porch',
 'screen_porch',
 'pool_area',
 'pool_qc',
 'misc_val',
 'mo_sold',
 'yr_sold',
 'saleprice',
 'ms_subclass_30',
 'ms_subclass_40',
 'ms_subclass_45',
 'ms_subclass_50',
 'ms_subclass_60',
 'ms_subclass_70',
 'ms_subclass_75',
 'ms_subc

In [34]:
# export dummified data into new csv file
train_cleaned.to_csv('../data/train_dummified.csv', index=False)