# Advanced House Price Prediction: EDA & Modelling

![](https://storage.googleapis.com/kaggle-competitions/kaggle/5407/logos/front_page.png)


I have recently started working on the Ames Housing Prices regression dataset. This notebook showcases the exploratory analysis, data visualization, data processing, missing value treatment, tree-based models and model blending that I have applied to the dataset. Please provide your feedback in the comments.

# Table of Contents
* [Exploratory Analysis & Data Transformation](#DataTransformations)
* [Feature Engineering](#feature)
* [Model Development](#model)



# Exploratory Analysis


## Importing the Raw Datasets<a name="DataTransformations"></a>

The competition provides two datasets: a training dataset and a testing dataset. The data is split almost evenly between them, with 1460 records in the training dataset and 1459 in the testing dataset. Each record corresponds to a single house sale. The dependent variable is **SalePrice**, the price at which the house was sold. There are 79 independent variables capturing different pieces of information about the house being sold: area, location, number of rooms etc.

Let's start off by importing the datasets and having a quick glance at the data. 



In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#Viz libraries used in the notebook
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns


#Importing sklearn libraries
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor #Random forest libraries
from sklearn.model_selection import cross_validate #cross validation
from sklearn.impute import SimpleImputer           #Treatment of missing values
from sklearn.preprocessing import OrdinalEncoder   #Ordinal Encoder package
from sklearn.preprocessing import LabelEncoder     #For Label Encoding
from sklearn.metrics import mean_squared_log_error #Mean Squared Log Error metric from sklearn
import xgboost as xgb                              #XGboost package
from sklearn.model_selection import GridSearchCV   #Grid search for finding the best hyperparameters


import os

# Load the competition's training and testing datasets
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

train

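| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |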


In [5]:
train.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


Some things immediately stand out from the outputs above:

* The dataset is very rich in terms of the number of properties recorded for a house: construction material, house type, garage, basement, neighborhood properties etc. are all covered.
* The **Id** column is the primary key of the dataset; it uniquely identifies every house. There are 1460 records in the training dataset, each corresponding to a single house.
* The **LotArea** variable corresponds to the area of the property. There is a huge difference between its minimum and maximum values, which range from 1,300 sq. ft. to 215,245 sq. ft.
* There are several quality variables covering the overall house, exterior, kitchen, garage etc. **OverallQual** and **OverallCond** are rated on a 1-10 scale, while component ratings such as ExterQual, KitchenQual and GarageQual use ordered categories (Ex, Gd, TA, Fa, Po).
* **YearBuilt** has values from 1872 to 2010, showing that there is a lot of variation in the age of the house at the time of sale.
* The **YrSold** variable refers to the year in which the house was sold. All sales took place between **2006 and 2010**, a fairly short time period. With such a short time range we can discount the impact of inflation or any other external factors on house prices.
* There are several categorical variables as well, like Neighborhood, LotShape, Street, Condition etc. We will analyze these variables separately.
* **SalePrice** shows the price of the house sold, in dollars. Its value ranges from 34,900 to 755,000. This is the dependent variable in this competition. We will soon take a more detailed look at this variable.
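The observations above can be double-checked programmatically. The snippet below is only a minimal sketch: `basic_checks` is a hypothetical helper (it is not part of the original notebook), and `sample` is a five-row miniature built from the first rows of the preview shown earlier, so that the example runs without the competition files. On the real data you would pass `train` instead.

```python
import pandas as pd

# Five-row miniature copied from the first rows of the `train` preview above.
# On Kaggle, use the full `train` DataFrame instead.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})


def basic_checks(df: pd.DataFrame) -> dict:
    """Hypothetical helper: a few sanity checks mirroring the bullet points."""
    return {
        # A primary key must not contain duplicates
        "id_is_unique": bool(df["Id"].is_unique),
        "n_records": len(df),
        # (min, max) ranges for the columns discussed above
        "lot_area_range": (int(df["LotArea"].min()), int(df["LotArea"].max())),
        "years_sold": (int(df["YrSold"].min()), int(df["YrSold"].max())),
        "sale_price_range": (int(df["SalePrice"].min()), int(df["SalePrice"].max())),
    }


checks = basic_checks(sample)
print(checks)
```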



## Dependent Variable - SalePrice

In [6]:
# Log-transform the target to reduce its right skew
train['logSalePrice'] = np.log(train.SalePrice)

# Histogram of the raw target, written to a standalone HTML file
fig = px.histogram(train, x="SalePrice", title='Distribution of SalePrice', height=400)
#fig.show()
fig.write_html("myfile.html")

# Violin plot (with box and mean line) for the raw target
fig1 = px.violin(train, x="SalePrice", title='Violin Plot for SalePrice', height=300)
fig1.update_traces(box_visible=True, meanline_visible=True)
fig1.show()

# Same plot for the log-transformed target
fig2 = px.violin(train, x="logSalePrice", title='Violin Plot for Log(SalePrice)', height=300)
fig2.update_traces(box_visible=True, meanline_visible=True)
fig2.show()


As the plots above show, the SalePrice values are not symmetrically distributed around the mean; the distribution has a **right skew**. The summary statistics above put the mean at about 181k, the median at 163k and the third quartile at 214k, yet quite a few houses in the long tail sold for more than 300k. These values pull the mean above the median.
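To put numbers on the tail, a small summary like the one below can help. This is a sketch under assumptions: `tail_summary` is a hypothetical helper, the 300k threshold is simply the value mentioned above, and `toy_prices` is made-up data rather than the competition's SalePrice column.

```python
import pandas as pd


def tail_summary(prices: pd.Series, threshold: float = 300_000) -> dict:
    """Hypothetical helper: location statistics plus the share of values in the tail."""
    return {
        "mean": float(prices.mean()),
        "median": float(prices.median()),
        "q3": float(prices.quantile(0.75)),
        # Fraction of sales above the threshold
        "share_above_threshold": float((prices > threshold).mean()),
    }


# Made-up, right-skewed prices (NOT the competition data)
toy_prices = pd.Series(
    [100_000, 120_000, 150_000, 160_000, 180_000, 200_000, 220_000, 400_000, 700_000]
)
summary = tail_summary(toy_prices)
print(summary)
```

On the notebook's data the call would be `tail_summary(train["SalePrice"])`. A mean that sits well above the median is the usual signature of a right-skewed distribution.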

We will be training linear regression and tree-based models to predict SalePrice later in the notebook, and these extreme values can destabilise training and add a lot of variance to the model predictions. A common way to deal with this issue is to log-transform the SalePrice values to reduce the skew of the distribution. We did this and plotted the distribution of log(SalePrice); it is centered around its mean and looks approximately normal.
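The effect of the log transform on skewness can also be measured rather than only eyeballed. The sketch below uses synthetic log-normal "prices" (an assumption made so the example is self-contained; the parameters are arbitrary) and compares the sample skewness before and after the transform with pandas' `Series.skew()`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" drawn from a log-normal distribution.
# These are NOT the competition data; the parameters are arbitrary.
toy = pd.DataFrame({"SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=2000)})
toy["logSalePrice"] = np.log(toy["SalePrice"])

raw_skew = toy["SalePrice"].skew()     # clearly positive: long right tail
log_skew = toy["logSalePrice"].skew()  # close to zero: roughly symmetric

# Anything predicted on the log scale is mapped back to dollars with np.exp
recovered = np.exp(toy["logSalePrice"])

print(f"skew(SalePrice)      = {raw_skew:.2f}")
print(f"skew(log(SalePrice)) = {log_skew:.2f}")
```

On the real data the same comparison would be `train["SalePrice"].skew()` against `train["logSalePrice"].skew()`.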

If the target variable is skewed, high values inflate the variance and push split points toward higher values, forcing the decision tree to make less balanced splits that try to "isolate" the tail from the rest of the points. The link below gives more detail on the impact of outlier values on tree-based models:

https://stats.stackexchange.com/questions/447863/log-transforming-target-var-for-training-a-random-forest-regressor
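One way to train a tree-based model on the log scale while still getting predictions in dollars is scikit-learn's `TransformedTargetRegressor`. The sketch below is an illustration only and not necessarily the approach used later in this notebook: the single feature `X` and the target `y` are synthetic, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): one area-like feature
# and a right-skewed, strictly positive target.
X = rng.uniform(500, 4000, size=(300, 1))
y = np.exp(11.0 + 0.0004 * X[:, 0] + rng.normal(0, 0.2, size=300))

# The wrapper fits the forest on log(y) and applies exp() to its predictions,
# so the caller never has to transform the target by hand.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)

preds = model.predict(X[:5])  # predictions are back on the original scale
print(preds)
```

Keeping the transform inside the estimator avoids forgetting the inverse step when scoring or writing a submission.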
 

## Univariate Analysis - Relationship between Independent Variables and SalePrice

Having understood the basic structure of the training dataset and the target variable SalePrice, let's now look at the relationship between individual variables and the target. We will break this analysis into two parts: first we will look at the relationship between the continuous variables and the target variable, and then we will perform the same analysis for the categorical variables. Let's start by looking at the count of missing values for all the columns in the training dataset, and then we will look at the distributions.
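The two-part plan above can be sketched as follows. This is a minimal illustration on a toy DataFrame: the column names come from the dataset, but the values (including the missing ones) are made up, and splitting by dtype with `select_dtypes` is an assumption about how to separate continuous from categorical columns.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; on Kaggle use the full `train` DataFrame.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [None, None, "Grvl", None],
    "MSZoning": ["RL", "RL", "RL", "RL"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

# Part 1: numeric (continuous) predictors, excluding the target itself
numeric_cols = toy.select_dtypes(include="number").columns.drop("SalePrice").tolist()

# Part 2: everything that is not numeric is treated as categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Count of missing values per column, keeping only columns with gaps
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(numeric_cols)
print(categorical_cols)
print(missing)
```

Note that a dtype-based split treats integer-coded categories such as MSSubClass as numeric, so a few columns may need to be reassigned by hand.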
