### Advanced Regression Techniques - House Prices

![](https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png)

**Author:** Bráulio Vieira  
**Last Updated:** 17/02/2022



**The notebook:**<br>
In this notebook, the sale price of houses is predicted using regression techniques. The predictions are submitted as part of a Kaggle competition.

**The data:**<br>
The data consist of a train and test table with 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.

First, we import the libraries:
* NumPy - for numerical operations.
* Pandas - for data manipulation.
* Matplotlib - for data visualization.
* Seaborn - for data visualization.
* Scikit-learn (sklearn) - for machine learning.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

Next, we read the train and test data using the pandas read_csv function.  
Then we merge both datasets and drop the Id column, which will not be used for the predictions.

In [None]:
train= pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
test= pd.read_csv("../input/house-prices-advanced-regression-techniques/test.csv")

train['train_test']=0
test['train_test']=1
test['SalePrice'] = np.nan
df = pd.concat([train,test])
df=df.drop(columns=['Id'])
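
The combine-then-split pattern used above can be tried out on a toy table. This is only a sketch: the two small frames and their values are made up for illustration, and `ignore_index=True` is an optional extra that gives the combined table a unique row index.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-ins for train.csv and test.csv.
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
toy_test = pd.DataFrame({"Id": [3], "LotArea": [11250]})

# Flag the origin of each row so the tables can be separated again later.
toy_train["train_test"] = 0
toy_test["train_test"] = 1
toy_test["SalePrice"] = np.nan

# Stack the rows; columns are aligned by name, not by position.
toy_df = pd.concat([toy_train, toy_test], ignore_index=True).drop(columns=["Id"])

# Recover the two parts from the flag.
back_train = toy_df[toy_df["train_test"] == 0]
back_test = toy_df[toy_df["train_test"] == 1]
print(len(back_train), len(back_test))
```

Because every preprocessing step is applied to the combined table, both parts end up with exactly the same columns.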

Next, we inspect the columns and create a list of the numeric and categorical variables.

In [None]:
df.dtypes.value_counts()

In [None]:
numericvar=df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numericvar.remove('train_test')
numericvar.remove('SalePrice')
categvar=df.select_dtypes(include=['object']).columns.tolist()
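
As a small illustration of how select_dtypes separates the columns, here is a sketch on a made-up three-column table. Selecting by the generic "number" group instead of exact dtype names is an alternative that also picks up other integer and float widths; the column values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600],        # integer column
    "LotFrontage": [65.0, 80.0],    # float column
    "Street": ["Pave", "Grvl"],     # text column
})

# "number" matches every numeric dtype; everything else is treated as categorical.
numeric_cols = toy.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()
print(numeric_cols, categorical_cols)
```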

## Numeric Variables

We will analyse and transform the numeric variables first.  
Let's look at the value distribution of each variable using histogram plots.

In [None]:
fig, ax = plt.subplots(ncols=3, nrows=12,figsize=(17,30))
j=0
k=0
for i in range(len(numericvar)):
    if i==12:
        j=0
        k=1
    elif i==24:
        j=0
        k=2
    ax[j,k].hist(train[numericvar[i]])
    ax[j,k].set_title(numericvar[i])
    j+=1
fig.tight_layout()
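
The loop above fills the grid column by column: the first 12 variables go into the first column, the next 12 into the second, and so on. The same mapping can be written without the if/elif branches. The helper name `grid_position` is hypothetical; it is only a sketch of the index arithmetic.

```python
def grid_position(i, nrows):
    """Map a flat plot index to (row, column), filling each column top to bottom."""
    col, row = divmod(i, nrows)
    return row, col

# Positions for 36 plots in a grid with 12 rows and 3 columns.
positions = [grid_position(i, nrows=12) for i in range(36)]
print(positions[0], positions[11], positions[12], positions[35])
```

With this helper, the plotting line could index the axes as ax[row, col] for any number of rows.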

### Missing numeric values

Next, let's look at which numeric variables have missing values and how many are missing.
Then the missing values will be filled.

In [None]:
print('Missing numeric values:')
for i in numericvar:
    nval=df[i].isnull().sum()
    if nval>0:
        print(str(i)+': '+str(nval))
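
The same count can be obtained without an explicit loop. The sketch below uses a made-up table with a few gaps and also reports the share of missing rows, which helps when deciding between filling and dropping a column.

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing entries.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "GarageArea": [548.0, 460.0, np.nan, 608.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and keep only columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Fraction of rows that are missing in each of those columns.
share = (missing / len(toy)).round(2)
print(missing)
print(share)
```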

In [None]:
#LotFrontage
df['LotFrontage']=df['LotFrontage'].fillna(df['LotFrontage'].mean())
#MasVnrArea
df['MasVnrArea']=df['MasVnrArea'].fillna(0)
#BsmtFinSF1
df['BsmtFinSF1']=df['BsmtFinSF1'].fillna(0)
#BsmtFinSF2
df['BsmtFinSF2']=df['BsmtFinSF2'].fillna(0)
#BsmtUnfSF
df['BsmtUnfSF']=df['BsmtUnfSF'].fillna(df['BsmtUnfSF'].median())
#TotalBsmtSF
df['TotalBsmtSF']=df['TotalBsmtSF'].fillna(df['TotalBsmtSF'].mean())
#BsmtFullBath
df['BsmtFullBath']=df['BsmtFullBath'].fillna(0)
#BsmtHalfBath
df['BsmtHalfBath']=df['BsmtHalfBath'].fillna(0)
#GarageYrBlt
df['GarageYrBlt']=df['GarageYrBlt'].fillna(df['GarageYrBlt'].mean())
#GarageCars
df['GarageCars']=df['GarageCars'].fillna(df['GarageCars'].mode()[0])
#GarageArea
df['GarageArea']=df['GarageArea'].fillna(df['GarageArea'].mean())
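
The cell above mixes several fill strategies (mean, median, mode and zero). To see how they differ, here is a sketch on a small made-up series with one large value: the mean is pulled towards it, while the median and mode are not.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed values with one missing entry at the end.
s = pd.Series([10.0, 20.0, 20.0, 90.0, np.nan])

filled_mean = s.fillna(s.mean())        # mean of the known values is 35.0
filled_median = s.fillna(s.median())    # median of the known values is 20.0
filled_mode = s.fillna(s.mode()[0])     # most frequent known value is 20.0
filled_zero = s.fillna(0)               # treats "missing" as "not present"

print(filled_mean.iloc[-1], filled_median.iloc[-1], filled_mode.iloc[-1], filled_zero.iloc[-1])
```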

* MSSubClass will be treated as a categorical variable.
* LowQualFinSF will be dropped.
* A new variable, DateSold, will be created by combining YrSold and MoSold.

In [None]:
#MSSubClass
numericvar.remove('MSSubClass')
categvar.append('MSSubClass')
#LowQualFinSF
numericvar.remove('LowQualFinSF')
df=df.drop(columns='LowQualFinSF')
#DateSold, YrSold and MoSold
df['DateSold']=df['YrSold']-df['YrSold'].min()+((df['MoSold']-1)/12)
df=df.drop(columns=['YrSold', 'MoSold'])
numericvar.remove('YrSold')
numericvar.remove('MoSold')
numericvar.append('DateSold')
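
DateSold measures the time of sale in years since January of the earliest sale year, with each month counting as 1/12 of a year. A worked example on three made-up sales shows the arithmetic:

```python
import pandas as pd

# Hypothetical sale dates: January 2006, July 2008 and December 2010.
toy = pd.DataFrame({"YrSold": [2006, 2008, 2010], "MoSold": [1, 7, 12]})

# Years since the first sale year, plus the elapsed fraction of the year.
toy["DateSold"] = toy["YrSold"] - toy["YrSold"].min() + (toy["MoSold"] - 1) / 12
print(toy)
```

January 2006 maps to 0.0, July 2008 to 2 + 6/12 = 2.5, and December 2010 to 4 + 11/12.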

The following variables will be transformed into indicator variables, only having 0 and 1 values.

In [None]:
#MasVnrArea
df['MasVnrArea']=df.MasVnrArea.apply(lambda x: 1 if x>0 else 0)
#BsmtFinSF1
df['BsmtFinSF1']=df['BsmtFinSF1'].apply(lambda x: 1 if x>0 else 0)
#BsmtFinSF2
df['BsmtFinSF2']=df['BsmtFinSF2'].apply(lambda x: 1 if x>0 else 0)
#BsmtFullBath
df['BsmtFullBath']=df['BsmtFullBath'].apply(lambda x: 1 if x>0 else 0)
#BsmtHalfBath
df['BsmtHalfBath']=df['BsmtHalfBath'].apply(lambda x: 1 if x>0 else 0)
#HalfBath
df['HalfBath']=df['HalfBath'].apply(lambda x: 1 if x>0 else 0)
#3SsnPorch
df['3SsnPorch']=df['3SsnPorch'].apply(lambda x: 0 if x==0 else 1)
#KitchenAbvGr
df['KitchenAbvGr']=df['KitchenAbvGr'].apply(lambda x: 0 if x<2 else 1)
#ScreenPorch
df['ScreenPorch']=df['ScreenPorch'].apply(lambda x: 0 if x==0 else 1)
#PoolArea
df['PoolArea']=df['PoolArea'].apply(lambda x: 0 if x==0 else 1)
#MiscVal
df['MiscVal']=df['MiscVal'].apply(lambda x: 0 if x==0 else 1)
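
The apply/lambda calls above can also be written as vectorized comparisons, which is usually shorter and faster on large tables. This is a sketch on made-up values; it checks that both forms give the same result, including for a missing value.

```python
import numpy as np
import pandas as pd

area = pd.Series([0.0, 196.0, np.nan, 350.0])   # hypothetical MasVnrArea-like values
kitchens = pd.Series([1, 2, 1, 3])              # hypothetical KitchenAbvGr-like values

# Vectorized versions of "1 if x>0 else 0" and "0 if x<2 else 1".
has_area = (area > 0).astype(int)
many_kitchens = (kitchens >= 2).astype(int)

# Compare against the element-wise lambda versions.
same_area = has_area.tolist() == area.apply(lambda x: 1 if x > 0 else 0).tolist()
same_kitchens = many_kitchens.tolist() == kitchens.apply(lambda x: 0 if x < 2 else 1).tolist()
print(same_area, same_kitchens)
```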

### Normalization

Next, some variables will be normalized using log scaling and z-score normalization.

##### Log Scaling

In [None]:
df['LotArea']=np.log(df['LotArea']+1)
df['BsmtUnfSF']=np.log(df['BsmtUnfSF']+1)
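
Adding 1 before taking the logarithm keeps zero values at zero and compresses large values. NumPy offers np.log1p for the same computation; the sketch below compares the two on made-up values.

```python
import numpy as np

# Hypothetical right-skewed values, including a zero.
x = np.array([0.0, 9.0, 99.0, 999.0])

log_plus_one = np.log(x + 1)   # form used in the cell above
log1p = np.log1p(x)            # equivalent built-in

print(np.allclose(log_plus_one, log1p))
print(log1p)
```

A range of 0 to 999 shrinks to roughly 0 to 6.9 after the transformation.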

##### Z-score normalization

In [None]:
df['LotFrontage']=scaler.fit_transform(df[['LotFrontage']])
df['LotArea']=scaler.fit_transform(df[['LotArea']])
df['OverallQual']=scaler.fit_transform(df[['OverallQual']])
df['OverallCond']=scaler.fit_transform(df[['OverallCond']])
df['YearBuilt']=scaler.fit_transform(df[['YearBuilt']])
df['YearRemodAdd']=scaler.fit_transform(df[['YearRemodAdd']])
df['BsmtUnfSF']=scaler.fit_transform(df[['BsmtUnfSF']])
df['TotalBsmtSF']=scaler.fit_transform(df[['TotalBsmtSF']])
df['1stFlrSF']=scaler.fit_transform(df[['1stFlrSF']])
df['2ndFlrSF']=scaler.fit_transform(df[['2ndFlrSF']])
df['GrLivArea']=scaler.fit_transform(df[['GrLivArea']])
df['FullBath']=scaler.fit_transform(df[['FullBath']])
df['BedroomAbvGr']=scaler.fit_transform(df[['BedroomAbvGr']])
df['TotRmsAbvGrd']=scaler.fit_transform(df[['TotRmsAbvGrd']])
df['Fireplaces']=scaler.fit_transform(df[['Fireplaces']])
df['GarageYrBlt']=scaler.fit_transform(df[['GarageYrBlt']])
df['GarageCars']=scaler.fit_transform(df[['GarageCars']])
df['GarageArea']=scaler.fit_transform(df[['GarageArea']])
df['WoodDeckSF']=scaler.fit_transform(df[['WoodDeckSF']])
df['OpenPorchSF']=scaler.fit_transform(df[['OpenPorchSF']])
df['EnclosedPorch']=scaler.fit_transform(df[['EnclosedPorch']])
df['DateSold']=scaler.fit_transform(df[['DateSold']])
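
Z-score normalization subtracts the column mean and divides by the column standard deviation. StandardScaler uses the population standard deviation (ddof=0), so the same result can be reproduced by hand for several columns at once. The table below is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GrLivArea": [1000.0, 1500.0, 2000.0],
    "OverallQual": [5.0, 7.0, 9.0],
})
cols = ["GrLivArea", "OverallQual"]

# z = (x - mean) / std, computed column by column.
z = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)
print(z)
```

After the transformation each column has mean 0 and standard deviation 1, so variables measured on very different scales become comparable.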

## Categorical Variables

After finishing with the numeric variables, let's look at the categorical variables using bar plots of the category counts.

In [None]:
fig, ax = plt.subplots(ncols=3, nrows=15,figsize=(17,35))
j=0
k=0
for i in range(len(categvar)):
    if i==15:
        j=0
        k=1
    elif i==30:
        j=0
        k=2
    ax[j,k].bar(x=df[categvar[i]].value_counts().index, height=df[categvar[i]].value_counts())
    ax[j,k].set_title(categvar[i])
    j+=1
ax[14,2].set_axis_off()
fig.tight_layout()

### Missing categorical values

Next, let's look at which categorical variables have missing values and how many are missing.
Then the missing values will be filled.

In [None]:
print('Missing categorical values:')
for i in categvar:
    nval=df[i].isnull().sum()
    if nval>0:
        print(str(i)+': '+str(nval))

In [None]:
#MSZoning
df['MSZoning']=df['MSZoning'].fillna(df['MSZoning'].mode()[0])
#Exterior1st
df['Exterior1st']=df['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
#Exterior2nd
df['Exterior2nd']=df['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])
#MasVnrType
df['MasVnrType']=df['MasVnrType'].fillna(df['MasVnrType'].mode()[0])
#BsmtQual
df['BsmtQual']=df['BsmtQual'].fillna(df['BsmtQual'].mode()[0])
#BsmtCond
df['BsmtCond']=df['BsmtCond'].fillna(df['BsmtCond'].mode()[0])
#BsmtExposure
df['BsmtExposure']=df['BsmtExposure'].fillna(df['BsmtExposure'].mode()[0])
#BsmtFinType1
df['BsmtFinType1']=df['BsmtFinType1'].fillna(df['BsmtFinType1'].mode()[0])
#BsmtFinType2
df['BsmtFinType2']=df['BsmtFinType2'].fillna(df['BsmtFinType2'].mode()[0])
#Electrical
df['Electrical']=df['Electrical'].fillna(df['Electrical'].mode()[0])
#KitchenQual
df['KitchenQual']=df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])
#Functional
df['Functional']=df['Functional'].fillna(df['Functional'].mode()[0])
#FireplaceQu
df['FireplaceQu']=df['FireplaceQu'].fillna('None')
#GarageType
df['GarageType']=df['GarageType'].fillna(df['GarageType'].mode()[0])
#GarageFinish
df['GarageFinish']=df['GarageFinish'].fillna(df['GarageFinish'].mode()[0])
#GarageQual
df['GarageQual']=df['GarageQual'].fillna(df['GarageQual'].mode()[0])
#GarageCond
df['GarageCond']=df['GarageCond'].fillna(df['GarageCond'].mode()[0])
#Fence
df['Fence']=df['Fence'].fillna('None')
#SaleType
df['SaleType']=df['SaleType'].fillna(df['SaleType'].mode()[0])
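
Since most columns above are filled with their most frequent value and only a few with the label 'None', the same logic can be expressed with two short lists and a loop. This is a sketch on a made-up table; the list names `mode_cols` and `none_cols` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", None],
    "FireplaceQu": ["Gd", None, None, "TA"],
})

mode_cols = ["MSZoning"]      # missing means "unknown": use the most frequent value
none_cols = ["FireplaceQu"]   # missing means "feature absent": use an explicit label

for col in mode_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])
for col in none_cols:
    toy[col] = toy[col].fillna("None")

print(toy)
```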

The following variables are dropped because they have too many missing values.

In [None]:
df=df.drop(columns=['Alley'])
df=df.drop(columns=['Utilities'])
df=df.drop(columns=['PoolQC'])
df=df.drop(columns=['MiscFeature'])

The following variables will be transformed into indicator variables, only having 0 and 1 values.

In [None]:
df['Street']=df['Street'].apply(lambda x: 0 if x=='Pave' else 1)
df['CentralAir']=df['CentralAir'].apply(lambda x: 1 if x=='Y' else 0)

To finish, let's convert the remaining categorical variables into dummy/indicator variables.

In [None]:
df=pd.get_dummies(df, columns=['MSSubClass'])
df=pd.get_dummies(df, columns=['MSZoning'])
df=pd.get_dummies(df, columns=['LotShape'])
df=pd.get_dummies(df, columns=['LandContour'])
df=pd.get_dummies(df, columns=['LotConfig'])
df=pd.get_dummies(df, columns=['LandSlope'])
df=pd.get_dummies(df, columns=['Neighborhood'])
df=pd.get_dummies(df, columns=['Condition1'])
df=pd.get_dummies(df, columns=['Condition2'])
df=pd.get_dummies(df, columns=['BldgType'])
df=pd.get_dummies(df, columns=['HouseStyle'])
df=pd.get_dummies(df, columns=['RoofStyle'])
df=pd.get_dummies(df, columns=['RoofMatl'])
df=pd.get_dummies(df, columns=['Exterior1st'])
df=pd.get_dummies(df, columns=['Exterior2nd'])
df=pd.get_dummies(df, columns=['MasVnrType'])
df=pd.get_dummies(df, columns=['ExterQual'])
df=pd.get_dummies(df, columns=['ExterCond'])
df=pd.get_dummies(df, columns=['Foundation'])
df=pd.get_dummies(df, columns=['BsmtQual'])
df=pd.get_dummies(df, columns=['BsmtCond'])
df=pd.get_dummies(df, columns=['BsmtExposure'])
df=pd.get_dummies(df, columns=['BsmtFinType1'])
df=pd.get_dummies(df, columns=['BsmtFinType2'])
df=pd.get_dummies(df, columns=['Heating'])
df=pd.get_dummies(df, columns=['HeatingQC'])
df=pd.get_dummies(df, columns=['Electrical'])
df=pd.get_dummies(df, columns=['KitchenQual'])
df=pd.get_dummies(df, columns=['Functional'])
df=pd.get_dummies(df, columns=['FireplaceQu'])
df=pd.get_dummies(df, columns=['GarageType'])
df=pd.get_dummies(df, columns=['GarageFinish'])
df=pd.get_dummies(df, columns=['GarageQual'])
df=pd.get_dummies(df, columns=['GarageCond'])
df=pd.get_dummies(df, columns=['PavedDrive'])
df=pd.get_dummies(df, columns=['Fence'])
df=pd.get_dummies(df, columns=['SaleType'])
df=pd.get_dummies(df, columns=['SaleCondition'])
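
pd.get_dummies also accepts several column names at once, so the calls above could be collapsed into one. The sketch below uses a made-up table and checks that a single call gives the same result as encoding one column at a time.

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotShape": ["Reg", "IR1", "Reg"],
    "PavedDrive": ["Y", "Y", "N"],
})

# One call for all listed columns.
encoded = pd.get_dummies(toy, columns=["LotShape", "PavedDrive"])

# Same encoding, one column per call, as in the cell above.
step = pd.get_dummies(toy, columns=["LotShape"])
step = pd.get_dummies(step, columns=["PavedDrive"])

print(list(encoded.columns))
print(encoded.equals(step))
```

Each category becomes its own column named after the original column and the category value.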

## Correlation between variables

To improve the regression model, we can remove variables that are highly correlated with another variable. To do this, let's find the pairs of variables with the highest absolute correlation coefficients.

In [None]:
cols=df.columns.tolist()
cols.remove('SalePrice')
corr=df[cols].corr()
corlist=[]
variaveis=corr.columns
for j in variaveis:
    linha=0
    for k in variaveis:
        corlist.append([(str(j)+' '+str(k)),corr[j].iloc[linha]])
        linha+=1
corlist=pd.DataFrame(corlist)
corlist=corlist[corlist[1]<1]
corlist[1]=np.absolute(corlist[1])
corlist=corlist.sort_values(by=[1], ascending=False)
corlist=corlist.drop_duplicates(subset=[1])
corlist=corlist.rename(columns={0: "Variables", 1: "Corr"})
corlist[corlist['Corr']>0.8]
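
An alternative way to list each pair once is to keep only the part of the correlation matrix above the diagonal. This is a sketch on a made-up table, not the method used in the cell above; it avoids the nested loop and does not rely on dropping duplicate correlation values.

```python
import numpy as np
import pandas as pd

# Hypothetical table where two columns move together.
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3],
    "GarageArea": [250, 480, 500, 720, 760],
    "Fireplaces": [1, 0, 2, 0, 1],
})

corr = toy.corr().abs()

# True only strictly above the diagonal, so each pair appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

high = pairs[pairs > 0.8]
print(high)
```

The result is indexed by the two variable names, which makes it easy to pick one variable of each pair to drop.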

For each pair with a correlation above 0.9, one variable is removed. The list for the 0.8 threshold is kept for comparison but is not applied here.

In [None]:
above09=['BsmtFinType1_Unf', 'BsmtFinType2_Unf','MasVnrType_None','SaleCondition_Partial','Exterior2nd_CmentBd',
                    'Exterior2nd_VinylSd','BldgType_2fmCon','Exterior2nd_MetalSd','HouseStyle_SLvl','HouseStyle_1.5Fin']
above08=['FireplaceQu_None','GarageCars','Exterior2nd_HdBoard','Neighborhood_Somerst','HouseStyle_1.5Unf',
                    'Exterior2nd_Wd Sdng','Exterior2nd_AsbShng','HouseStyle_2Story','TotRmsAbvGrd','MasVnrType_BrkFace','TotalBsmtSF']
df=df.drop(columns=above09)

In [None]:
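# Split the combined table back into train and test rows.
Xtrain=df[df['train_test']==0]
Xtest=df[df['train_test']==1]
Ytrain=Xtrain['SalePrice']
# Drop the target and the helper flag so neither is used as a feature.
Xtrain=Xtrain.drop(columns=['SalePrice','train_test'])
Xtest=Xtest.drop(columns=['SalePrice','train_test'])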

## Predictions

#### Submission scores:  
**XGBoost**
* without dropping columns due to high correlation: 0.14155
* dropping all above 0.8: 0.14353
* dropping all above 0.9: 0.14133

**RandomForestRegressor**
* dropping all above 0.9: 0.14725

**LinearRegression**
* dropping all above 0.9: 22.37148
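
For reference, the competition scores submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. A minimal sketch of that metric follows; the function name `rmse_log` and the example prices are made up.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root mean squared error between log prices (all prices must be positive)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Perfect predictions give 0; being off by a factor of e on every house gives 1.
perfect = rmse_log([100000, 200000], [100000, 200000])
off_by_e = rmse_log([100000, 200000], [100000 * np.e, 200000 * np.e])
print(perfect, off_by_e)
```

Because the error is measured on the log scale, it reflects relative rather than absolute price differences.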

#### XGBoost

In [None]:
import xgboost
classifier=xgboost.XGBRegressor()
classifier.fit(Xtrain, Ytrain)
ypred=classifier.predict(Xtest)

#### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
randfor = RandomForestRegressor()
randfor.fit(Xtrain, Ytrain)
ypred=randfor.predict(Xtest)

#### Linear Regression

In [None]:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(Xtrain, Ytrain)
ypred=regr.predict(Xtest)
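
Besides comparing submission scores, the error of a model can be estimated locally with cross-validation. The sketch below is an assumption about how this might be done, not part of the original workflow: it uses synthetic data in place of Xtrain and Ytrain and scikit-learn's cross_val_score with 5 folds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data: a linear signal plus small noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn reports errors as negative scores, so the sign is flipped afterwards.
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5,
                         scoring="neg_root_mean_squared_error")
cv_rmse = -scores.mean()
print(cv_rmse)
```

The same call works with any of the regressors used above, which allows them to be compared without making a submission for each one.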

In [None]:
# Note: ypred holds the predictions of whichever model cell was run last.
submission=pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice']=ypred
submission.to_csv('submission.csv', index=False)
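
Before uploading, it can be worth checking that the file has the expected two columns and that no prediction is missing or non-positive. The sketch below builds a small submission table from made-up ids and predictions and writes it to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical test ids and predicted prices.
toy_ids = pd.Series([1461, 1462, 1463], name="Id")
toy_pred = np.array([121000.5, 158300.0, 182450.25])

toy_submission = pd.DataFrame({"Id": toy_ids, "SalePrice": toy_pred})

# Basic sanity checks on the predictions.
no_missing = not toy_submission["SalePrice"].isnull().any()
all_positive = bool((toy_submission["SalePrice"] > 0).all())

# index=False keeps the row index out of the CSV, leaving only Id and SalePrice.
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
lines = buffer.getvalue().splitlines()
print(lines[0], no_missing, all_positive)
```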

### Parameter tuning