## Modeling and Conclusion

Having processed our data for machine learning modeling, we will now train and evaluate several models that predict housing sale prices in Ames, Iowa.

### Import the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math

from sklearn.linear_model import (LinearRegression, 
                                  LassoCV, 
                                  RidgeCV, 
                                  ElasticNetCV, 
                                  Lasso)

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

In [2]:
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

### Import the cleaned and processed train and test dataset

In [3]:
df_train = pd.read_csv('../data/train_processed.csv')
df_test = pd.read_csv('../data/test_processed.csv')

### Split the train data into train and validation sets

In [4]:
# Separate the features from the target: drop the identifier columns and 'saleprice' from the training features
X = df_train.drop(columns=['id', 'pid', 'saleprice'])
y = df_train['saleprice']

In [5]:
#Number of features

X.shape[1]

72

In [6]:
X_train, X_test, y_train, y_test = train_test_split(
                                    X,
                                    y,
                                    test_size=0.3,
                                    random_state= 89
)

### Establish the baseline score

To establish the baseline RMSE, we assume that the predicted target value `y_base` is the mean sale price for every observation.

In [7]:
# Create an array with all predictions being the mean sale price
sales_mean = np.repeat(np.mean(df_train['saleprice']), df_train.shape[0])
y_base = pd.Series(sales_mean)

In [8]:
# Calculate the baseline RMSE
round((mean_squared_error(y, y_base))**0.5, 2)

78632.44
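As a sanity check on this baseline, predicting the mean for every row gives an RMSE equal to the population standard deviation of the target. The sketch below illustrates this on a small made-up array; the prices are hypothetical and not taken from the Ames data.

```python
import numpy as np

# Hypothetical sale prices, for illustration only
prices = np.array([120000.0, 150000.0, 180000.0, 210000.0, 340000.0])

# Baseline: predict the mean sale price for every house
baseline_pred = np.full_like(prices, prices.mean())

# RMSE of the baseline predictions
baseline_rmse = np.sqrt(np.mean((prices - baseline_pred) ** 2))

# The same number is the population standard deviation (ddof=0)
print(baseline_rmse, np.std(prices))
```

Any model worth keeping should therefore have an RMSE well below the spread of `saleprice` itself.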

### Standardize X_train and X_test

We will standardize the continuous variables in X_train and X_test, fitting the scaler on X_train only.

In [9]:
# List of continuous variables for scaling

con_var = ['1st_flr_sf', 'bsmt_exposure', 'bsmt_qual', 'bsmtfin_sf_1', 
           'bsmtfin_type_1', 'exter_qual', 'fence', 'fireplace_qu', 'full_bath',
           'garage_area', 'garage_cars', 'garage_finish', 'gr_liv_area', 'heating_qc',
           'kitchen_qual', 'lot_area', 'lot_frontage_reg_imputed', 'lot_shape', 'mas_vnr_area',
           'neigh_group', 'open_porch_sf', 'overall_qual', 'total_bsmt_sf', 'totrms_abvgrd', 
           'wood_deck_sf', 'year_built', 'year_remod_add']

In [10]:
# Isolate continuous variables for scaling
X_train_cont = X_train[[var for var in con_var if var in df_train.columns]]
X_test_cont = X_test[[var for var in con_var if var in df_train.columns]]

In [11]:
# Instantiate the StandardScaler
ss = StandardScaler()

X_train_ss_cont = ss.fit_transform(X_train_cont)
X_test_ss_cont = ss.transform(X_test_cont)

In [12]:
# Drop the original continuous variables from X_train and join back the scaled ones
X_train_ss = X_train.drop(
                 columns=[var for var in con_var if var in df_train.columns]
             ).reset_index(drop=True
                          ).join(pd.DataFrame(X_train_ss_cont, 
                                              columns=[var for var in con_var if var in df_train.columns]))

In [13]:
# Drop the original continuous variables from X_test and join back the scaled ones
X_test_ss = X_test.drop(
                 columns=[var for var in con_var if var in df_train.columns]
             ).reset_index(drop=True
                          ).join(pd.DataFrame(X_test_ss_cont, columns=[var for var in con_var if var in df_train.columns]))
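The fit/transform split used above can be sketched with plain NumPy: the mean and standard deviation are learned from the training rows only and then reused for the holdout rows. The arrays below are hypothetical, and the names `X_train_demo` and `X_holdout_demo` are illustrative only.

```python
import numpy as np

# Hypothetical continuous feature matrices (rows = houses, columns = features)
X_train_demo = np.array([[1000.0, 5.0], [1500.0, 6.0], [2000.0, 7.0], [2500.0, 8.0]])
X_holdout_demo = np.array([[1200.0, 6.0], [3000.0, 9.0]])

# "Fit": learn the mean and standard deviation from the training rows only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "Transform": apply the training statistics to both splits
X_train_scaled = (X_train_demo - mu) / sigma
X_holdout_scaled = (X_holdout_demo - mu) / sigma

print(X_train_scaled.mean(axis=0))    # approximately 0 for each column
print(X_holdout_scaled.mean(axis=0))  # generally not 0, which is expected
```

Reusing the training statistics keeps information from the holdout rows out of the preprocessing step.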

### Perform Linear Regression

The first model we will train is Linear Regression

In [14]:
# Instantiate and fit the Linear Regression
lr = LinearRegression()
lr.fit(X_train_ss, y_train)

LinearRegression()

We will evaluate the cross-validated RMSE score against that of the holdout set

In [15]:
# Cross-validated RMSE on the training set
round(((-cross_val_score(
    lr,
    X_train_ss,
    y_train,
    cv=10,
    scoring='neg_mean_squared_error'
).mean())**0.5), 2)

25941.91

In [16]:
round((mean_squared_error(y_test, lr.predict(X_test_ss)))**0.5, 2)

28752.18

The cross-validated RMSE on the training set is lower than the RMSE on the holdout set, which is expected.
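The cross-validated figure is the square root of the mean fold MSE, which is not the same as the mean of the per-fold RMSEs. The sketch below reproduces the idea with a hand-rolled 10-fold loop and a least squares fit on synthetic data; the data and the helper `kfold_mse` are hypothetical and stand in for `cross_val_score`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical, for illustration only)
n, p = 100, 3
X_demo = rng.normal(size=(n, p))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=n)

def kfold_mse(X, y, k=10):
    """Return the held-out MSE of a least squares fit for each of k folds."""
    folds = np.array_split(np.arange(len(y)), k)  # contiguous folds, no shuffling
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        # Add an intercept column and fit by least squares on the training folds
        A_train = np.column_stack([np.ones(train_mask.sum()), X[train_mask]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_mask], rcond=None)
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        scores.append(np.mean((y[test_idx] - A_test @ coef) ** 2))
    return np.array(scores)

fold_mse = kfold_mse(X_demo, y_demo)

# What the cell above reports: square root of the mean fold MSE
rmse_of_mean_mse = np.sqrt(fold_mse.mean())
# A different summary: mean of the per-fold RMSEs (never larger, by Jensen's inequality)
mean_of_fold_rmse = np.sqrt(fold_mse).mean()
print(rmse_of_mean_mse, mean_of_fold_rmse)
```

Either summary is reasonable, as long as the same one is used for every model being compared.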

Two thresholds were used to tune and evaluate the performance of each model in the previous notebook:

1. Cardinality Threshold (`cardinal` in the table below): controls the number of categories in nominal variables. Categories are sorted by normalized value counts, and any categories beyond the threshold are deemed too small and grouped into a single category called 'Others'. This reduces the number of features for modeling.

2. Correlation Threshold: for each pair of variables whose correlation exceeds the threshold, one of the two is dropped. This reduces multicollinearity as well as the number of features.
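The two thresholds can be sketched as follows. The actual implementation lives in the previous notebook and is not shown here, so both helpers are hypothetical: `group_rare_categories` assumes the threshold is applied to the cumulative share of the most frequent categories, and `drop_correlated` assumes the later column of each correlated pair is the one dropped.

```python
from collections import Counter
import numpy as np

def group_rare_categories(values, threshold=0.8, other_label="Others"):
    """Keep the most frequent categories until their cumulative share
    reaches `threshold`; relabel the rest as `other_label`."""
    counts = Counter(values)
    total = len(values)
    keep, cumulative = set(), 0.0
    for category, count in counts.most_common():
        if cumulative >= threshold:
            break
        keep.add(category)
        cumulative += count / total
    return [v if v in keep else other_label for v in values]

def drop_correlated(X, names, threshold=0.9):
    """Drop the later column of every pair whose absolute Pearson
    correlation exceeds `threshold`; return the surviving column names."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    for i in range(len(names)):
        if names[i] in dropped:
            continue
        for j in range(i + 1, len(names)):
            if names[j] not in dropped and corr[i, j] > threshold:
                dropped.add(names[j])
    return [name for name in names if name not in dropped]

# Hypothetical data
styles = ["Gable"] * 7 + ["Hip"] * 2 + ["Flat"]
print(group_rare_categories(styles, threshold=0.8))

X_demo = np.array([[1.0, 2.0, 5.0], [2.0, 4.1, 3.0], [3.0, 6.0, 4.0], [4.0, 8.2, 1.0]])
print(drop_correlated(X_demo, ["area", "area_copy", "other"], threshold=0.9))
```

A lower cardinality threshold or a lower correlation threshold both lead to fewer features, which matches the feature counts in the table.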

The following is the evaluated performance of each model. The best is model 4, with a holdout RMSE of 28717.78.

| Model | Thresholds                       | Features | CV RMSE  | Holdout RMSE |
|-------|----------------------------------|----------|----------|--------------|
| 1     | cardinal: 0.85, correlation: 0.8 | 71       | 28739.24 | 31313.84     |
| 2     | cardinal: 0.9, correlation: 0.8  | 77       | 28695.65 | 31128.54     |
| 3     | cardinal: 0.8, correlation: 0.8  | 68       | 28966.67 | 31264.25     |
| 4     | cardinal: 0.8, correlation: 0.9  | 72       | 25928.04 | 28717.78     |

In [17]:
lr_coef = pd.DataFrame(lr.coef_, index = X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False)

lr_coef

Unnamed: 0,Coefficient
bsmt_qual,1.237077e+17
1st_flr_sf,1.237077e+17
bsmtfin_type_1,1.237077e+17
exter_qual,1.237077e+17
bsmt_exposure,1.237077e+17
exterior_2nd_VinylSd,5.835209e+16
exterior_2nd_PreCast,5.835209e+16
lot_config_Others,1.570802e+16
exterior_1st_MetalSd,1.278228e+16
exterior_2nd_HdBoard,1.278228e+16


In [18]:
# Number of non-zero coefficients after fitting
lr_coef[lr_coef!=0].notna().sum()

Coefficient    72
dtype: int64

However, the Linear Regression coefficients are wildly inflated, on the order of 1e17. The count above also shows that fitting eliminates no features: all 72 coefficients remain non-zero. We will therefore use regression with regularization to penalize these coefficients.
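One plausible cause of such coefficients, stated here as an assumption rather than something verified on this dataset, is that some columns are nearly linear combinations of others. The sketch below builds two almost identical columns from synthetic data and compares an unpenalized least squares fit with a closed-form ridge fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two nearly identical columns (hypothetical data): x2 is x1 plus tiny noise
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = x1 + 0.1 * rng.normal(size=n)

# Ordinary least squares: the near-collinearity makes the split between
# the two columns almost arbitrary, so coefficients can become very large
ols_coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Ridge (closed form, no intercept): the L2 penalty stabilizes the solution
alpha = 1.0
ridge_coef = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(2), X_demo.T @ y_demo)

print("OLS:  ", ols_coef)
print("Ridge:", ridge_coef)
```

In the ridge fit the two columns share the effect roughly equally, which is the kind of stabilization the regularized models in the following sections rely on.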

### Perform Ridge Regression

We will now train a Ridge Regression based on the best Cardinality Threshold and Correlation Threshold obtained for Linear Regression 

In [19]:
ridge = RidgeCV(alphas=np.logspace(0, 5, 100))
ridge.fit(X_train_ss, y_train)

round(ridge.alpha_, 2)

29.15
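`RidgeCV` can only choose among the values it is given, so it helps to look at the grid itself. The short sketch below rebuilds the same `np.logspace` grid and shows that the reported alpha is one of its points.

```python
import numpy as np

# Same grid as passed to RidgeCV above: 100 points from 10**0 to 10**5
alphas = np.logspace(0, 5, 100)

print(alphas[0], alphas[-1])        # endpoints of the grid
print(round(float(alphas[29]), 2))  # 29.15, the value reported by ridge.alpha_
```

Because the selected value sits well inside the grid rather than at either end, the search range does not appear to have constrained the choice.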

In [20]:
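# Cross-validated RMSE; note that this cell scores the holdout split (X_test_ss, y_test), whereas the other models are cross-validated on the training split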
round((-cross_val_score(
    ridge,
    X_test_ss,
    y_test,
    cv=10,
    scoring='neg_mean_squared_error'
).mean())**0.5, 2)

29336.05

In [21]:
round((mean_squared_error(y_test, ridge.predict(X_test_ss)))**0.5, 2)

28669.68

In [22]:
ridge_coef = pd.DataFrame(ridge.coef_, 
                          index = X.columns, 
                          columns=['Coefficient']
                         ).sort_values(by='Coefficient', 
                                       ascending=False)
ridge_coef

Unnamed: 0,Coefficient
roof_style_Others,17458.215965
ms_subclass_85,17057.979821
ms_subclass_20,10703.064477
open_porch_sf,9603.545389
ms_subclass_40,7764.085021
ms_subclass_Others,7045.903158
total_bsmt_sf,6529.13251
ms_zoning_Others,6112.364167
overall_qual,5748.414206
foundation_PConc,5563.234079


In [23]:
# Number of non-zero coefficients after fitting
ridge_coef[ridge_coef!=0].notna().sum()

Coefficient    58
dtype: int64

We see that the ridge regularization has reduced the previously inflated coefficients. The number of non-zero coefficients is also lower, at 58 of the 72 features.

### Perform Lasso Regression

We will now train a Lasso Regression based on the best Cardinality Threshold and Correlation Threshold obtained for Linear Regression 

In [24]:
lasso = LassoCV(n_alphas=1000)
lasso.fit(X_train_ss, y_train)

round(lasso.alpha_, 2)

92.43

In [25]:
round((-cross_val_score(
    lasso,
    X_train_ss,
    y_train,
    cv=10,
    scoring='neg_mean_squared_error'
).mean())**0.5, 2)

25831.85

In [26]:
round((mean_squared_error(y_test, lasso.predict(X_test_ss)))**0.5, 2)

28636.59

In [27]:
lasso_coef = pd.DataFrame(lasso.coef_, 
                          index = X.columns, 
                          columns=['Coefficient']
                         ).sort_values(by='Coefficient', ascending=False)

lasso_coef

Unnamed: 0,Coefficient
ms_subclass_85,19527.38
roof_style_Others,18242.89
foundation_PConc,11441.7
ms_subclass_20,10859.21
open_porch_sf,9646.022
ms_subclass_40,7780.165
ms_subclass_Others,7026.999
total_bsmt_sf,6777.101
ms_zoning_Others,6186.262
overall_qual,5712.288


In [28]:
# Number of non-zero coefficients after fitting
lasso_coef[lasso_coef!=0].notna().sum()

Coefficient    43
dtype: int64

We see that the lasso regularization has reduced the previously inflated coefficients. The number of non-zero coefficients also falls more sharply, to 43 of the 72 features, because the L1 penalty sets some coefficients exactly to zero.
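Why lasso produces exact zeros while ridge does not can be illustrated in the idealized case of orthonormal features, an assumption that does not hold for this dataset. In that case the lasso solution is a soft-thresholded copy of the unpenalized coefficients, whereas ridge only rescales them. The input values below are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso-style shrinkage: move z toward zero by t and clip at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, alpha):
    """Ridge-style shrinkage: rescale z by a constant factor."""
    return z / (1.0 + alpha)

# Hypothetical unpenalized coefficients
z = np.array([5.0, 1.5, -0.4, 0.2])

lasso_like = soft_threshold(z, t=0.5)
ridge_like = ridge_shrink(z, alpha=0.5)

print(lasso_like)  # small entries become exactly zero
print(ridge_like)  # all entries shrink but stay non-zero
```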

### Perform Elastic Net

Lastly, we will train and evaluate two models of ElasticNet

#### Elastic Net with l1_ratio manually selected at 0.2

In [29]:
enet_l120 = ElasticNetCV(l1_ratio=0.2, n_alphas=1000)

enet_l120.fit(X_train_ss, y_train)

print(round(enet_l120.alpha_, 2))
print(enet_l120.l1_ratio_)

322.57
0.2


In [30]:
round((-cross_val_score(
    enet_l120,
    X_train_ss,
    y_train,
    cv=10,
    scoring='neg_mean_squared_error'
).mean())**0.5, 2)

77045.76

In [31]:
round((mean_squared_error(y_test, enet_l120.predict(X_test_ss)))**0.5, 2)

73818.45

In [32]:
enet_l120_coef = pd.DataFrame(enet_l120.coef_, 
                          index = X.columns, 
                          columns=['Coefficient']
                         ).sort_values(by='Coefficient', ascending=False)

enet_l120_coef

Unnamed: 0,Coefficient
roof_style_Others,240.621067
ms_subclass_40,214.047119
ms_subclass_85,213.34885
ms_subclass_Others,205.465081
total_bsmt_sf,201.863347
overall_qual,197.066895
ms_subclass_70,195.641275
ms_subclass_160,194.360187
ms_subclass_75,191.784121
ms_subclass_190,182.678557


In [33]:
# Number of non-zero coefficients after fitting
enet_l120_coef[enet_l120_coef!=0].notna().sum()

Coefficient    58
dtype: int64

The ElasticNet model with l1_ratio at 0.2 exhibited more significant regularization of coefficients, but this comes at the expense of a much poorer RMSE. After fitting, 58 of the 72 coefficients are non-zero.

#### Elastic Net with l1_ratio decided by ElasticNetCV

In [34]:
enet_ratio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

enet_model = ElasticNetCV(l1_ratio=enet_ratio, n_alphas=1000)

enet_model.fit(X_train_ss, y_train)

print(round(enet_model.alpha_, 2))
print(enet_model.l1_ratio_)

71.68
0.9


In [35]:
round((-cross_val_score(
    enet_model,
    X_train_ss,
    y_train,
    cv=10,
    scoring='neg_mean_squared_error'
).mean())**0.5, 2)

43088.42

In [36]:
round((mean_squared_error(y_test, enet_model.predict(X_test_ss)))**0.5, 2)

42049.49

In [37]:
enet_coef = pd.DataFrame(enet_model.coef_, 
                          index = X.columns, 
                          columns=['Coefficient']
                         ).sort_values(by='Coefficient', ascending=False)

enet_coef

Unnamed: 0,Coefficient
roof_style_Others,3972.095799
ms_subclass_85,3732.780016
ms_subclass_40,3387.799795
ms_subclass_Others,3273.620254
total_bsmt_sf,3260.334607
ms_subclass_160,3194.416891
ms_subclass_70,2930.473459
overall_qual,2898.443707
ms_subclass_75,2726.186114
open_porch_sf,2660.234397


In [38]:
# Number of non-zero coefficients after fitting
enet_coef[enet_coef!=0].notna().sum()

Coefficient    58
dtype: int64

The ElasticNet model with hyperparameters chosen by ElasticNetCV exhibited less regularization of coefficients than the one where l1_ratio was set manually, and it performs better in terms of RMSE. However, it still performs worse than the Linear, Lasso and Ridge Regression. After fitting, 58 of the 72 coefficients are non-zero.
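The difference between the two Elastic Net fits is easier to see once `alpha` and `l1_ratio` are split into separate L1 and L2 weights. The sketch below uses the penalty form documented for scikit-learn's `ElasticNet`, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`; the helper `enet_penalty_weights` is hypothetical and only does the arithmetic.

```python
def enet_penalty_weights(alpha, l1_ratio):
    """Split an elastic net `alpha` into its L1 and L2 penalty weights,
    following alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2."""
    l1_weight = alpha * l1_ratio
    l2_weight = 0.5 * alpha * (1.0 - l1_ratio)
    return l1_weight, l2_weight

# Hyperparameters reported for the two Elastic Net fits above
manual = enet_penalty_weights(alpha=322.57, l1_ratio=0.2)
tuned = enet_penalty_weights(alpha=71.68, l1_ratio=0.9)

print("manual l1_ratio=0.2:", manual)
print("tuned  l1_ratio=0.9:", tuned)
```

Under this parameterization both fits apply almost the same L1 weight (about 64.5), while the manual setting applies an L2 weight of about 129 against about 3.6 for the tuned one. That is consistent with the much smaller coefficients of the manual fit.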

### Model Selection

The following summarizes the models trained and their performance:

| Model | Description       | Hyperparams               | Features | CV RMSE  | Holdout RMSE |
|-------|-------------------|---------------------------|----------|----------|--------------|
| 1     | Elastic Net       | alpha 322.57 l1 ratio 0.2 | 58       | 77045.76 | 73818.45     |
| 2     | Elastic Net       | alpha 71.68 l1 ratio 0.9  | 58       | 43088.42 | 42049.49     |
| 3     | Lasso Regression  | alpha 92.43               | 43       | 25831.85 | 28636.59     |
| 4     | Ridge Regression  | alpha 29.15               | 58       | 29336.05 | 28669.68     |
| 5     | Linear Regression | -                         | 72       | 25928.04 | 28717.78     |

The best performing model is the Lasso Regression with alpha at 92.43, which has the lowest holdout RMSE. We will instantiate `Lasso` with alpha = 92.43, retrain it on the entire training dataset and use it to predict the Kaggle test set for submission.

In [39]:
# Create X_test from df_test
X_test = df_test.drop(columns=['id', 'pid'])

In [40]:
# Isolate continuous variables for scaling on both the full train dataset X and X_test
X_cont = X[con_var]
X_test_cont = X_test[con_var]

In [41]:
#Scale the continuous variables
X_ss_cont = ss.fit_transform(X_cont)
X_test_ss_cont = ss.transform(X_test_cont)

In [42]:
# Drop the original continuous variables from X and join back the scaled ones
X_ss = X.drop(
    columns=con_var
    ).reset_index(drop=True
                 ).join(pd.DataFrame(X_ss_cont, columns=con_var))

In [43]:
X_ss

Unnamed: 0,exterior_1st_HdBoard,exterior_1st_MetalSd,exterior_1st_Others,exterior_1st_PreCast,exterior_1st_VinylSd,exterior_1st_Wd Sdng,exterior_2nd_HdBoard,exterior_2nd_MetalSd,exterior_2nd_Other,exterior_2nd_Others,exterior_2nd_PreCast,exterior_2nd_VinylSd,exterior_2nd_Wd Sdng,fireplaces,foundation_Others,foundation_PConc,garage_type_Attchd,garage_type_Others,house_style_1Story,house_style_Others,lot_config_Inside,lot_config_Others,mas_vnr_type_CBlock,mas_vnr_type_None,mas_vnr_type_Others,ms_subclass_120,ms_subclass_160,ms_subclass_180,ms_subclass_190,ms_subclass_20,ms_subclass_30,ms_subclass_40,ms_subclass_45,ms_subclass_50,ms_subclass_60,ms_subclass_70,ms_subclass_75,ms_subclass_80,ms_subclass_85,ms_subclass_90,ms_subclass_Others,ms_zoning_Others,ms_zoning_RL,roof_style_Gable,roof_style_Others,1st_flr_sf,bsmt_exposure,bsmt_qual,bsmtfin_sf_1,bsmtfin_type_1,exter_qual,fence,fireplace_qu,full_bath,garage_area,garage_cars,garage_finish,gr_liv_area,heating_qc,kitchen_qual,lot_area,lot_frontage_reg_imputed,lot_shape,mas_vnr_area,neigh_group,open_porch_sf,overall_qual,total_bsmt_sf,totrms_abvgrd,wood_deck_sf,year_built,year_remod_add
0,1.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0,1.0,0,0.0,0.0,0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0,0.0,1.0,0.0,0,0,0,0.0,0.0,0,0,0.0,1.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,-1.147138,-0.578420,-0.539454,0.217429,1.147731,1.018687,-0.476237,-0.972833,0.775143,0.015043,0.297246,0.311306,-0.029372,0.875693,0.732992,0.693651,0.705562,-1.069878,1.119975,-0.846779,-0.049213,-0.072433,-0.768178,-0.275249,-0.733632,0.145724,0.994258
1,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0,0.0,0,1.0,0.0,1,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0,0.0,1.0,0.0,0,0,0,0.0,0.0,0,0,0.0,1.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,-0.649884,-0.578420,0.569907,0.454006,1.147731,1.018687,-0.476237,0.687903,0.775143,0.407043,0.297246,0.311306,1.300688,0.875693,0.732992,0.306416,-1.263589,-1.069878,0.203334,-0.135317,0.405819,0.631200,-0.324621,1.021536,-0.733632,0.808327,0.613931
2,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0,0.0,0,1.0,0.0,0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0,1.0,0.0,0.0,0,0,0,1.0,0.0,0,0,0.0,0.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,-0.269009,-0.578420,-0.539454,0.667836,1.147731,-0.687488,-0.476237,-0.972833,-1.046622,-1.053624,-1.009613,-0.800891,-0.902288,-1.197915,0.732992,-0.376265,-0.096405,0.713252,-0.567346,-0.135317,0.072129,-0.776066,0.015124,-0.923641,-0.733632,-0.616270,1.089340
3,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0,0.0,0,1.0,0.0,0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0,1.0,0.0,0.0,0,0,0,0.0,0.0,0,0,0.0,1.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,-1.096884,-0.578420,0.569907,-0.995030,-1.205406,-0.687488,-0.476237,-0.972833,0.775143,-0.334957,0.297246,1.423503,-0.101770,-0.161111,-0.772852,-0.016758,0.137031,0.713252,-0.567346,1.287606,-0.716594,-0.776066,-1.572713,0.373143,0.053634,1.139628,1.089340
4,0.0,0.0,0.0,0,0.0,1.0,0.0,0.0,0,1.0,0,0.0,0.0,0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0,1.0,0.0,0.0,0,0,0,0.0,0.0,0,0,1.0,0.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,-0.866772,-0.578420,-1.648816,-0.995030,-1.205406,-0.687488,-0.476237,-0.972833,0.775143,0.057043,0.297246,-0.800891,-0.099702,-1.197915,-0.772852,0.830952,0.557218,-1.069878,-0.567346,-0.135317,0.178303,-0.072433,-0.883785,-0.275249,-0.733632,-2.372167,0.423767
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2035,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0,0.0,0,1.0,0.0,1,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0,1.0,0.0,0.0,0,0,0,1.0,0.0,0,0,0.0,0.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,1.505765,1.287156,0.569907,1.304775,1.147731,1.018687,-0.476237,1.241482,0.775143,0.225043,0.297246,1.423503,0.485690,0.875693,0.732992,0.298193,0.417156,-1.069878,-0.567346,1.287606,3.469705,1.334834,1.966300,0.373143,-0.733632,1.172759,1.089340
2036,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0,0.0,0,1.0,0.0,0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0,1.0,0.0,0.0,0,0,0,0.0,1.0,0,0,0.0,0.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,-0.787423,-0.578420,-0.539454,-0.399037,0.206476,-0.687488,-0.476237,-0.972833,-1.046622,0.313710,0.297246,-0.800891,-1.307719,0.875693,-0.772852,0.468959,0.541189,-1.069878,-0.567346,-0.846779,-0.716594,-1.479699,-0.447307,-1.572034,0.510249,-1.046961,-1.620493
2037,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0,1.0,0,0.0,0.0,1,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0,1.0,0.0,0.0,0,0,0,0.0,0.0,0,0,1.0,0.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,0.035163,-0.578420,-0.539454,-0.995030,-1.205406,-0.687488,-0.476237,0.687903,-1.046622,-0.605624,0.297246,-0.800891,0.868367,-0.161111,-0.772852,-0.445872,-0.609966,0.713252,-0.567346,1.287606,-0.716594,-0.072433,-0.364730,1.669928,-0.733632,-1.444523,-1.620493
2038,0.0,0.0,1.0,0,0.0,0.0,0.0,0.0,0,1.0,0,0.0,0.0,2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0,1.0,0.0,0.0,0,0,0,1.0,0.0,0,0,0.0,0.0,0,0,0,0,0,0.0,0.0,1.0,1.0,0.0,0.109222,-0.578420,-0.539454,-0.642439,-0.264151,-0.687488,-0.476237,1.241482,-1.046622,-0.829624,-1.009613,-0.800891,-0.606490,-1.197915,-0.772852,0.097596,0.463843,0.713252,-0.567346,-0.135317,2.150110,-1.479699,0.352510,-0.275249,-0.733632,-0.516879,-1.335248


In [44]:
# Drop the original continuous variables from X_test and join back the scaled ones
X_test_ss = X_test.drop(
    columns=con_var
    ).reset_index(drop=True
                 ).join(pd.DataFrame(X_test_ss_cont, columns=con_var))

In [45]:
# Instantiate Lasso with alpha = 92.43 and fit on X_ss and y
lasso_final = Lasso(alpha = 92.43)

lasso_final.fit(X_ss,y)

Lasso(alpha=92.43)

In [46]:
# Predict on the Kaggle test set
y_pred = lasso_final.predict(X_test_ss)

#### Coefficient of Continuous Variables

In [47]:
lasso_final_coef = pd.DataFrame(lasso_final.coef_, index = X.columns, columns=['Coefficient']).sort_values(by='Coefficient', ascending=False)
coef_cont_var = lasso_final_coef.loc[lasso_final_coef.index.isin(con_var), :]
coef_cont_var.loc[coef_cont_var['Coefficient']!=0, :]

Unnamed: 0,Coefficient
open_porch_sf,8229.412
total_bsmt_sf,5316.694
overall_qual,5093.659
bsmt_qual,3827.485
gr_liv_area,3460.017
year_remod_add,3278.613
year_built,1797.624
totrms_abvgrd,1007.716
wood_deck_sf,880.7518
fireplace_qu,522.0919


#### Coefficient of Categorical Variables

In [48]:
coef_cat_var = lasso_final_coef.loc[~lasso_final_coef.index.isin(con_var), :]
coef_cat_var.loc[coef_cat_var['Coefficient']!=0, :]

Unnamed: 0,Coefficient
ms_subclass_85,21771.32
roof_style_Others,17052.38
ms_subclass_20,11060.5
foundation_PConc,9322.009
ms_subclass_40,8342.41
ms_subclass_Others,6839.128
ms_zoning_Others,5547.894
ms_subclass_180,5217.167
ms_subclass_70,4967.097
ms_subclass_50,3480.368


In [49]:
# Preparation for submission to Kaggle
df_test['saleprice'] = y_pred

In [50]:
# Keep only the identifier and the predicted sale price
submit = df_test[['id', 'saleprice']].copy()

In [51]:
submit.columns = ['Id', 'SalePrice']

In [52]:
submit.to_csv('../data/kaggle_submit.csv', index=False)

### Conclusion and Recommendation

The submitted Lasso Regression model with alpha = 92.43 has an RMSE of about \\$30,430 on Kaggle, compared with a holdout RMSE of about \\$28,637. This is in line with expectation, as a model usually does slightly worse on data it has not seen.

We may infer the influence of each feature from the coefficients of this model. It is necessary to discuss the influence of features according to whether they are categorical or continuous in nature.

**Categorical Features**

1. The model shows that houses in MS SubClass 20, 40, 85 and 180 command a premium of roughly \\$5,000 to \\$22,000. These four classes are 1-STORY 1946 & NEWER ALL STYLES, 1-STORY W/FINISHED ATTIC ALL AGES, SPLIT FOYER and PUD - MULTILEVEL - INCL SPLIT LEV/FOYER respectively.

2. Houses with a poured concrete foundation also command higher prices, by about \\$9,000.

3. The presence of precast coverings lowers the sale price of a house by at least \\$5,000.

**Continuous Features**

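1. Larger open porch, total basement and above-ground living areas have a positive influence on house price. Because these features were standardized before fitting, the coefficients of roughly \\$3,000 to \\$9,000 are the predicted price increase per one-standard-deviation increase in area, not per square foot.

2. Quite intuitively, quality-related features such as overall finish quality and basement height are positive predictors of house price.

3. A larger garage area is associated with a lower predicted price, by almost \\$7,000 for each one-standard-deviation increase in garage area. As with the other standardized features, this is not a per-square-foot effect.

4. As expected, house price decreases with the age of the house and with the number of years since it was last remodeled.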
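Because the continuous features were standardized before fitting, each coefficient describes the effect of a one-standard-deviation change. Dividing by the feature's standard deviation gives a per-unit effect. In the sketch below the coefficient comes from the final Lasso output above, while the standard deviation is a hypothetical placeholder; the real value would come from the fitted scaler, for example its `scale_` attribute.

```python
def per_unit_effect(standardized_coef, feature_std):
    """Convert a coefficient on a standardized feature into the
    predicted change in sale price per one original unit."""
    return standardized_coef / feature_std

# Coefficient taken from the final Lasso output above
open_porch_coef = 8229.412

# Hypothetical standard deviation of open_porch_sf in square feet;
# the real value would come from the fitted scaler (e.g. ss.scale_)
open_porch_std = 66.0

print(round(per_unit_effect(open_porch_coef, open_porch_std), 2))
```

The same conversion applies to every feature in `con_var`, including ordinal ones such as `overall_qual`, where one unit is one quality grade.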

**Recommendation and Further Study**

1. Homeowners in Ames should consider building an open porch, or expanding an existing one, to enhance their property value. The cost of building a 200-square-foot porch is between \\$4,600 and \\$22,000 ([HomeAdvisor](https://www.homeadvisor.com/cost/outdoor-living/build-a-porch/)). Based on the model's coefficient for `open_porch_sf`, the benefit may outweigh the cost.

2. Homeseekers would be advised to avoid houses with a huge garage or precast coverings, because these features drag on the property value.

3. While location is a known predictor of property price, the Lasso Regression model has penalised the coefficient of the `neigh_group` feature (whereby neighborhoods were grouped into 5 groups according to house sale price) to zero, therefore eliminating its influence. More specific information on the exact addresses of the properties, instead of neighborhoods, would help make the Lasso Regression model more predictive.

4. Most of the findings in this project relate to the features of the property itself, but economic conditions and interest rates are also known to influence property prices. The boxplot of `yr_sold` vs `saleprice` did not surface any significant change in the distribution of `saleprice` from 2006 to 2010, even though this period spans the Global Financial Crisis. Data on the prevailing interest rates could be retrieved to examine their correlation with housing prices.