## Project Description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

#### Data source

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

#### Acknowledgement

The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Load the data

In [4]:
train = pd.read_csv('train.csv')
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### Cleaning and Wrangling of The Data

In [86]:
train.loc[train['LotFrontage'].isnull(), ['LotFrontage','LotArea', 'Street', 'LotShape', 'Alley']]

Unnamed: 0,LotFrontage,LotArea,Street,LotShape,Alley
7,,10382,Pave,IR1,No_Alley_Access
12,,12968,Pave,IR2,No_Alley_Access
14,,10920,Pave,IR1,No_Alley_Access
16,,11241,Pave,IR1,No_Alley_Access
24,,8246,Pave,IR1,No_Alley_Access
...,...,...,...,...,...
1429,,12546,Pave,IR1,No_Alley_Access
1431,,4928,Pave,IR1,No_Alley_Access
1441,,4426,Pave,Reg,No_Alley_Access
1443,,8854,Pave,Reg,No_Alley_Access


In [90]:
train[['LotFrontage', 'LotArea']].describe()

Unnamed: 0,LotFrontage,LotArea
count,1201.0,1460.0
mean,70.049958,10516.828082
std,24.284752,9981.264932
min,21.0,1300.0
25%,59.0,7553.5
50%,69.0,9478.5
75%,80.0,11601.5
max,313.0,215245.0


In [91]:
train.loc[train['LotFrontage'].isnull(), 'LotArea'].describe()

count       259.000000
mean      13137.370656
std       16215.264451
min        1974.000000
25%        8065.500000
50%       10624.000000
75%       13018.500000
max      164660.000000
Name: LotArea, dtype: float64

In [87]:
train.loc[train['LotFrontage'].isnull(), 'Alley'].value_counts()

No_Alley_Access    254
Grvl                 3
Pave                 2
Name: Alley, dtype: int64

In [88]:
train.loc[train['LotFrontage'].isnull(), 'LotShape'].value_counts()

IR1    167
Reg     74
IR2     15
IR3      3
Name: LotShape, dtype: int64

__NOTE__: Deal with the NaNs in the other categorical variables first, then revisit LotFrontage to form a strategy for its NaNs.
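To decide which columns to handle first, it can help to list every column that still contains NaNs. The snippet below is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper (not part of the original notebook), and the toy frame stands in for `train` with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.Series:
    """Return NaN counts for columns with at least one NaN, largest first."""
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train)` after each cleaning step would show which columns remain to be dealt with.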

In [12]:
# Per the data description, NaN in "Alley" means no alley access, so label it explicitly
train.loc[train['Alley'].isnull(),'Alley'] = 'No_Alley_Access'

In [14]:
train['Alley'].value_counts()

No_Alley_Access    1369
Grvl                 50
Pave                 41
Name: Alley, dtype: int64

In [62]:
# Fill NaNs in MasVnrType with 'None' and NaNs in MasVnrArea with 0.0
train.loc[train['MasVnrType'].isnull(), 'MasVnrType'] = 'None'
train.loc[train['MasVnrArea'].isnull(), 'MasVnrArea'] = 0.0

In [63]:
train['MasVnrType'].value_counts()

None       872
BrkFace    445
Stone      128
BrkCmn      15
Name: MasVnrType, dtype: int64

__NOTE__: Revisit this after dealing with NaNs in other major columns

In [25]:
# PoolQC (pool quality): per the data description, NaNs in this column mean "No Pool"
train['PoolQC'].value_counts()

Gd    3
Ex    2
Fa    2
Name: PoolQC, dtype: int64

In [27]:
train.loc[train['PoolQC'].isnull(), 'PoolQC'] = 'No_Pool'

In [28]:
train['PoolQC'].value_counts()

No_Pool    1453
Gd            3
Ex            2
Fa            2
Name: PoolQC, dtype: int64

In [31]:
# NA in column "Fence" refers to "No Fence" according to the data description
train.loc[train['Fence'].isnull(), 'Fence'] = 'No_Fence'

In [34]:
# MiscFeature: Miscellaneous feature not covered in other categories - NA refers to "None"
train.loc[train['MiscFeature'].isnull(), 'MiscFeature'] = 'None'

In [35]:
train['MiscFeature'].value_counts()

None    1406
Shed      49
Gar2       2
Othr       2
TenC       1
Name: MiscFeature, dtype: int64

In [38]:
# FireplaceQu: Fireplace quality - NAs refer to "No Fireplace"
train.loc[train['FireplaceQu'].isnull(), 'FireplaceQu'] = 'No_Fireplace'

In [39]:
train['FireplaceQu'].value_counts()

No_Fireplace    690
Gd              380
TA              313
Fa               33
Ex               24
Po               20
Name: FireplaceQu, dtype: int64

In [41]:
# NAs in GarageType refer to "No Garage"
train.loc[train['GarageType'].isnull(), 'GarageType'] = 'No_Garage'

In [43]:
# No garage, so there is no build year; 0.0 is used as a placeholder
train.loc[train['GarageYrBlt'].isnull(), 'GarageYrBlt'] = 0.0

In [44]:
# No garage as per data description
train.loc[train['GarageFinish'].isnull(), 'GarageFinish'] = 'No_Garage' 

In [45]:
# No garage as per data description
train.loc[train['GarageQual'].isnull(), 'GarageQual'] = 'No_Garage'

In [46]:
# No garage as per data description
train.loc[train['GarageCond'].isnull(), 'GarageCond'] = 'No_Garage'
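The label fills above all follow one pattern: a NaN that the data description defines as "feature absent" is replaced with an explicit label. As a rough sketch, the same fills could be expressed as a single column-to-label mapping passed to `DataFrame.fillna`. The names `FILL_LABELS` and `fill_absent_labels` are hypothetical, and the toy frame uses made-up values. `MasVnrType`, `MasVnrArea`, and `GarageYrBlt` are left out here because the notebook fills them by a separate choice rather than by a documented label.

```python
import numpy as np
import pandas as pd

# Labels copied from the cells above; each NaN means "feature absent"
# according to the data description.
FILL_LABELS = {
    "Alley": "No_Alley_Access",
    "PoolQC": "No_Pool",
    "Fence": "No_Fence",
    "MiscFeature": "None",
    "FireplaceQu": "No_Fireplace",
    "GarageType": "No_Garage",
    "GarageFinish": "No_Garage",
    "GarageQual": "No_Garage",
    "GarageCond": "No_Garage",
}


def fill_absent_labels(df: pd.DataFrame, labels: dict = FILL_LABELS) -> pd.DataFrame:
    """Return a copy of df with NaNs replaced per column; other columns are untouched."""
    # Keep only the columns that actually exist in this frame.
    present = {col: val for col, val in labels.items() if col in df.columns}
    return df.fillna(value=present)


# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "Alley": [np.nan, "Grvl"],
    "Fence": ["MnPrv", np.nan],
    "LotFrontage": [np.nan, 80.0],
})
filled = fill_absent_labels(toy)
print(filled)
```

Because `fillna` returns a new frame, the original is left unchanged, which makes it easier to re-run cells out of order without double-applying a fill.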

In [52]:
# No basement as per data description
train.loc[train['BsmtQual'].isnull(), ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']] = 'No_Basement'

In [69]:
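# Remaining NaNs in BsmtExposure belong to houses that do have a basement, hence 'No' (no exposure)
train.loc[train['BsmtExposure'].isnull(), 'BsmtExposure'] = 'No'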

In [77]:
# The other basement-related columns hold valid data here, so fill with 'GLQ' to be consistent with "BsmtFinType1"
train.loc[train['BsmtFinType2'].isnull(), 'BsmtFinType2'] = 'GLQ'
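The last two fills rely on the observation that some rows have a basement but are missing a single basement field. One way to surface such rows is to look for records where some, but not all, of the basement columns are NaN. The sketch below assumes that approach; `partially_missing_rows` is a hypothetical helper and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd

BSMT_COLS = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]


def partially_missing_rows(df: pd.DataFrame, cols: list = BSMT_COLS) -> pd.DataFrame:
    """Return rows where some, but not all, of the given columns are NaN."""
    n_missing = df[cols].isnull().sum(axis=1)
    return df.loc[(n_missing > 0) & (n_missing < len(cols)), cols]


# Toy stand-in for `train`: row 0 is complete, row 1 has no basement,
# row 2 has a basement but lacks BsmtFinType2.
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "Gd"],
    "BsmtCond": ["TA", np.nan, "TA"],
    "BsmtExposure": ["No", np.nan, "Av"],
    "BsmtFinType1": ["GLQ", np.nan, "GLQ"],
    "BsmtFinType2": ["Unf", np.nan, np.nan],
})
odd = partially_missing_rows(toy)
print(odd)
```

Inspecting the returned rows makes it possible to check that the neighbouring basement columns really do hold valid data before choosing a fill value.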

In [82]:
# Filling it with the most frequently observed value as there is only one missing value
train.loc[train['Electrical'].isnull(), 'Electrical'] = 'SBrkr'
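Rather than hard-coding `'SBrkr'`, the most frequent value could be looked up from the column itself. The following is a minimal sketch under that assumption; `fill_with_mode` is a hypothetical helper and the toy series uses made-up values. If several values tie for most frequent, `Series.mode` returns them sorted and this sketch simply takes the first.

```python
import numpy as np
import pandas as pd


def fill_with_mode(series: pd.Series) -> pd.Series:
    """Return a copy with NaNs replaced by the most frequent non-null value."""
    most_common = series.mode(dropna=True).iloc[0]
    return series.fillna(most_common)


# Toy stand-in for train['Electrical']; the values are made up for illustration.
toy = pd.Series(["SBrkr", "SBrkr", "FuseA", np.nan], name="Electrical")
filled = fill_with_mode(toy)
print(filled)
```

On the real data this would be applied as `train['Electrical'] = fill_with_mode(train['Electrical'])`.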

### Exploratory Data Analysis

### Feature Engineering

### Pre-processing and Modeling

### Model Evaluation