# Project 2: Ames Housing Data and Kaggle Challenge

# Problem Statement

Given a dataset of houses and their features, we must find the best model to predict the sale price of these houses and understand which features influence the sale price the most. The goal is to help property owners maximize the value of their houses.

# Libraries and imports

In [50]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import PolynomialFeatures
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge

In [3]:
house_train_set = pd.read_csv('datasets/house_train_set.csv')

In [4]:
house_train_set.head(2)

Unnamed: 0.1,Unnamed: 0,id,pid,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,...,screen_porch,pool_area,pool_qc,fence,misc_feature,misc_val,mo_sold,yr_sold,sale_type,saleprice
0,0,109,533352170,60,RL,69.0552,13517,Pave,No alley,IR1,...,0,0,No Pool,No fence,No extra features,0,3,2010,WD,130500
1,1,544,531379050,60,RL,43.0,11492,Pave,No alley,IR1,...,0,0,No Pool,No fence,No extra features,0,4,2009,WD,220000


In [5]:
house_train_set.drop('Unnamed: 0', axis=1, inplace=True)

# Initial linear regression model 

For the first model, I use only the numerical features. Let's establish our feature matrix X and our target vector y:

In [12]:
numerical_features = house_train_set[['pid', 'ms_subclass', 'lot_frontage', 'lot_area','overall_qual', 'overall_cond', 'year_built',\
                        'year_remod/add', 'mas_vnr_area', 'bsmtfin_sf_1', 'bsmtfin_sf_2', 'bsmt_unf_sf', 'total_bsmt_sf',\
                          '1st_flr_sf', '2nd_flr_sf', 'low_qual_fin_sf', 'gr_liv_area', 'bsmt_full_bath', 'full_bath',\
                          'half_bath', 'bedroom_abvgr', 'kitchen_abvgr', 'totrms_abvgrd', 'fireplaces', 'garage_yr_blt',\
                          'garage_cars', 'garage_area', 'wood_deck_sf', 'open_porch_sf', 'enclosed_porch', '3ssn_porch',\
                          'screen_porch', 'pool_area', 'misc_val', 'mo_sold', 'yr_sold']]

In [13]:
X_num = numerical_features
y = house_train_set['saleprice']

I then split the data into a training set and a test set, so that I can compare how the model performs on data it was trained on and on data it has not seen:

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X_num, y, test_size = 0.2, 
                                                   random_state = 2020)

We can now start fitting our model

In [15]:
lr = LinearRegression()

In [16]:
lr.fit(X_train, y_train)

LinearRegression()

In [17]:
lr.score(X_train, y_train)

0.8403795594441867

The value above is the R2 score of the model on the training data: the model explains 84% of the variance in sale price.
This is not bad, but we need the score on unseen data to truly evaluate how the model performs:

In [19]:
lr.score(X_test, y_test)

0.8341132856390501

The R2 score on the unseen data is almost the same, which is very good. It tells us that the model is barely overfitting the training data (the gap is less than one percentage point). 83% of the variance is explained.
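For reference, the R2 value returned by `score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. Here is a minimal sketch with made-up prices; the numbers and the helper name `r2_score_manual` are illustrative and not part of the notebook:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up sale prices and predictions, for illustration only
y_true_demo = np.array([100_000, 150_000, 200_000, 250_000])
y_pred_demo = np.array([110_000, 140_000, 210_000, 240_000])

r2_demo = r2_score_manual(y_true_demo, y_pred_demo)
print(round(r2_demo, 3))
```

A model that always predicts the mean of `y_true` gets a score of 0 with this formula, which is why R2 can be read as the improvement over that naive prediction.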

How does this model compare to the baseline model? I know that the RMSE (Root Mean Squared Error) of the baseline model is
80389.4, based on my first submission. Let's compute the RMSE of our initial model to compare the two:

In [20]:
Initial_preds = lr.predict(X_test)

In [23]:
np.sqrt(mean_squared_error(y_test, Initial_preds))

33568.627650793016

The RMSE of this model is 33568.6, which is far better than the baseline model. RMSE measures the typical size of the prediction error (with large errors weighted more heavily), so our predictions are off by roughly $33,568.6.

I use this model for my Kaggle submission.
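The RMSE comparison above can be reproduced by hand. The sketch below assumes the baseline is a model that always predicts the mean training price (the notebook does not show how the 80389.4 figure was obtained), and uses made-up prices and a hypothetical helper `rmse`:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up prices, for illustration only
y_train_demo = np.array([120_000, 180_000, 240_000])
y_test_demo = np.array([150_000, 210_000])

# Assumed baseline: predict the mean training price for every house
baseline_pred = np.full(len(y_test_demo), y_train_demo.mean())
baseline_rmse = rmse(y_test_demo, baseline_pred)
print("baseline RMSE:", baseline_rmse)

# RMSE is not the plain average error: errors of 10,000 and 50,000
# average to 30,000, but squaring makes the larger error count more.
errors = np.array([10_000.0, 50_000.0])
mae_demo = float(np.mean(np.abs(errors)))
rmse_demo = rmse(errors, np.zeros_like(errors))
print("MAE:", mae_demo, "RMSE:", round(rmse_demo, 1))
```

RMSE is always at least as large as the mean absolute error, so a dollar figure read from RMSE is a somewhat pessimistic estimate of the typical miss.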

# Linear regression model with categorical features

During the EDA and feature selection process, I concluded that 4 categorical features were the most important for our model:

1. foundation 
2. bsmtfin_type_1 
3. neighborhood 
4. exter_qual 


Let's create dummy variables and add them to numerical_features:

In [26]:
numerical_features = numerical_features.join(pd.get_dummies(house_train_set['foundation'], drop_first=True))

In [28]:
numerical_features = numerical_features.join(pd.get_dummies(house_train_set['bsmtfin_type_1'], drop_first=True))

In [29]:
numerical_features = numerical_features.join(pd.get_dummies(house_train_set['neighborhood'], drop_first=True))

In [30]:
numerical_features = numerical_features.join(pd.get_dummies(house_train_set['exter_qual'], drop_first=True))
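As an alternative to joining one set of dummies at a time, `pd.get_dummies` can encode several columns in one call through its `columns` argument, and it then prefixes each dummy with the source column name. A minimal sketch on a tiny made-up frame (the frame `demo` and its values are illustrative, not the Ames data):

```python
import pandas as pd

# Tiny made-up frame standing in for house_train_set
demo = pd.DataFrame({
    "gr_liv_area": [1500, 2000, 1200],
    "foundation": ["PConc", "CBlock", "PConc"],
    "exter_qual": ["Gd", "TA", "TA"],
})

# Encode both categorical columns at once; drop_first removes one level
# per feature to avoid perfectly collinear dummy columns.
encoded = pd.get_dummies(demo, columns=["foundation", "exter_qual"], drop_first=True)
print(encoded.columns.tolist())
# → ['gr_liv_area', 'foundation_PConc', 'exter_qual_TA']
```

The prefixes make it easier to trace a coefficient back to its original feature and prevent name clashes if two features share a level name.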

Now that I have added the categorical features, I can fit another regression model. I will also scale the data this time:

In [35]:
X_cat = numerical_features
y = house_train_set['saleprice']

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X_cat, y, test_size = 0.2, 
                                                   random_state = 2020)

In [38]:
scaler = StandardScaler()

In [39]:
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
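The scaler is fitted on the training set only and then applied unchanged to the test set, so no information from the test set leaks into training. The arithmetic behind this is sketched below with NumPy on made-up numbers; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default:

```python
import numpy as np

# Made-up feature matrices, for illustration only
X_train_demo = np.array([[1000.0, 1.0], [2000.0, 3.0], [3000.0, 5.0]])
X_test_demo = np.array([[2500.0, 2.0]])

# "fit": learn column means and standard deviations from the training data only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # ddof=0, the same convention as StandardScaler

# "transform": apply the training statistics to both sets
X_train_demo_sc = (X_train_demo - mu) / sigma
X_test_demo_sc = (X_test_demo - mu) / sigma

print(X_train_demo_sc.mean(axis=0), X_train_demo_sc.std(axis=0))
print(X_test_demo_sc)
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns generally do not, and that is expected.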

In [40]:
lr.fit(X_train_sc, y_train)

LinearRegression()

In [41]:
lr.score(X_train_sc, y_train)

0.8800202597615105

We can see that our R2 score is already better than that of the first model (.88 > .84). But this is only on the training data. Let's see what happens with the unseen data:

In [42]:
lr.score(X_test_sc, y_test)

0.8873168908790605

The model performs slightly better on unseen data, with 88.7% of the variance explained. This model performs better than the initial model. Let's look at the RMSE and compare it with both the initial model and the baseline model:

In [43]:
categorical_preds = lr.predict(X_test_sc)

In [44]:
np.sqrt(mean_squared_error(y_test, categorical_preds))

27666.68942161585

The RMSE is lower than that of both the baseline model and the initial regression model. Our predictions are off by about $27,666.7.

# Lasso model 

Set up a list of Lasso alphas:

In [48]:
l_alphas = np.logspace(0, 5, 100)
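`np.logspace(0, 5, 100)` produces 100 candidate alphas between 10^0 and 10^5 that are evenly spaced on a log scale, so each value is a constant multiple of the previous one. A quick check (the variable names here are illustrative):

```python
import numpy as np

alphas_demo = np.logspace(0, 5, 100)

# Consecutive values differ by a constant factor of 10 ** (5 / 99)
ratios = alphas_demo[1:] / alphas_demo[:-1]

print(len(alphas_demo), alphas_demo[0], alphas_demo[-1])
print(round(float(ratios[0]), 5))
```

A log-spaced grid is a common choice for regularization strengths because the useful range often spans several orders of magnitude.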

Cross-validation:

In [51]:
lasso_cv = LassoCV(alphas=l_alphas, cv=5, max_iter=5000)

Now let's fit the model using X_train_sc:

In [52]:
lasso_cv.fit(X_train_sc, y_train)

LassoCV(alphas=array([1.00000000e+00, 1.12332403e+00, 1.26185688e+00, 1.41747416e+00,
       1.59228279e+00, 1.78864953e+00, 2.00923300e+00, 2.25701972e+00,
       2.53536449e+00, 2.84803587e+00, 3.19926714e+00, 3.59381366e+00,
       4.03701726e+00, 4.53487851e+00, 5.09413801e+00, 5.72236766e+00,
       6.42807312e+00, 7.22080902e+00, 8.11130831e+00, 9.11162756e+00,
       1.02353102e+01, 1.14975700e+0...
       6.89261210e+03, 7.74263683e+03, 8.69749003e+03, 9.77009957e+03,
       1.09749877e+04, 1.23284674e+04, 1.38488637e+04, 1.55567614e+04,
       1.74752840e+04, 1.96304065e+04, 2.20513074e+04, 2.47707636e+04,
       2.78255940e+04, 3.12571585e+04, 3.51119173e+04, 3.94420606e+04,
       4.43062146e+04, 4.97702356e+04, 5.59081018e+04, 6.28029144e+04,
       7.05480231e+04, 7.92482898e+04, 8.90215085e+04, 1.00000000e+05]),
        cv=5, max_iter=5000)
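Once a `LassoCV` model is fitted, its `alpha_` attribute holds the regularization strength chosen by cross-validation and `coef_` holds the coefficients, some of which Lasso may have shrunk exactly to zero. The notebook does not show these values, so the sketch below uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `demo_cv`) purely to illustrate how they could be inspected:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only the first three of ten features drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = (5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 2.0 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=200))

alpha_grid = np.logspace(-3, 1, 30)  # grid chosen for this toy problem only
demo_cv = LassoCV(alphas=alpha_grid, cv=5, max_iter=5000).fit(X_demo, y_demo)

print("selected alpha:", demo_cv.alpha_)
print("coefficients set to zero:", int(np.sum(demo_cv.coef_ == 0)))
```

On the real data, the same two lines applied to `lasso_cv` would show whether the selected alpha sits at an edge of the grid, which would suggest widening the search range.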

So how does this model perform compared to the previous ones?

In [53]:
lasso_cv.score(X_train_sc, y_train)

0.8727159171831226

The R2 score is lower than that of the linear regression with categorical features. Let's see the score on unseen data:

In [54]:
lasso_cv.score(X_test_sc, y_test)

0.8754369698902998

The R2 score on unseen data is slightly higher than on the training data (0.875 vs 0.873), so there is no sign of overfitting. However, the model performs worse than the previous one (linear regression with categorical features). Let's check the RMSE:

In [56]:
lasso_pred = lasso_cv.predict(X_test_sc)

In [57]:
np.sqrt(mean_squared_error(y_test, lasso_pred))

29088.56963339312

Even though this model performs better than the baseline and the initial regression model, it does not perform better than the second linear regression model. The predictions of the lasso model are off by about $29,088.6.

Let's try a ridge model

# Ridge Model

First we set up a list of ridge alphas:

In [60]:
r_alphas = np.logspace(0, 5, 100)

Cross-validation:

In [62]:
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)

Now we can fit the model 

In [64]:
ridge_cv = ridge_cv.fit(X_train_sc, y_train)

We can now observe how our model performs:

In [65]:
ridge_cv.score(X_train_sc, y_train)

0.8798991149401728

The ridge model performs better than the lasso model on the training data, but not better than the second linear regression model. Let's see how it performs on unseen data:

In [66]:
ridge_cv.score(X_test_sc, y_test)

0.8867398584406757

The score on unseen data is slightly higher than on the training data (0.887 vs 0.880), so the model is not overfitting. Overall the ridge model performs better than the lasso and is very close to the second linear regression model.
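The ridge model is compared on R2 only. To put it on the same RMSE footing as the other models, the same split, scale, fit, and predict steps can be followed by an RMSE computation. The sketch below runs on synthetic data with hypothetical names (`X_demo`, `ridge_demo`, and so on), so its numbers say nothing about the Ames results:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(2020)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([4.0, -2.0, 0.0, 1.0, 3.0]) + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2020)

# Fit the scaler on the training split only, then reuse it on the test split
scaler_demo = StandardScaler()
X_tr_sc = scaler_demo.fit_transform(X_tr)
X_te_sc = scaler_demo.transform(X_te)

ridge_demo = RidgeCV(alphas=np.logspace(0, 5, 100), scoring='r2', cv=5).fit(X_tr_sc, y_tr)

# Same RMSE computation as used for the other models
ridge_rmse = np.sqrt(mean_squared_error(y_te, ridge_demo.predict(X_te_sc)))
print("selected alpha:", ridge_demo.alpha_)
print("test RMSE:", ridge_rmse)
```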

# Conclusion

Based on my analysis and the data provided, it is possible to approximate the sale price of a house given its features. The best model I came up with is a linear regression model that explains 88.7% of the variance in sale price on unseen data.

Some features are more influential than others, such as the overall quality of a house, the square footage of the living area, the square footage of the basement, and the garage. The foundation is also a factor, as are the neighborhood and the exterior material quality.

While those features increase the house price, they can also negatively affect it if they are out of the norm. Very large square footage does not necessarily result in higher prices. The same can apply to additional features that do not add much to the value of a house.

My recommendation to property owners looking to maximize the value of their house, apart from what they cannot control, is to remodel the house if possible. The most influential factor for sale price is the overall quality of the house, so renovations are key.