# Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import mean_squared_error, r2_score
# import statsmodels.api as sm


%matplotlib inline

np.random.seed(42)

## Load in training data

In [2]:
train_clean = pd.read_csv('../data/train_clean.csv', index_col='id',na_values='', keep_default_na=False)

In [3]:
train_clean.head()

id,ms_subclass,ms_zoning,lot_frontage,lot_area,street,alley,lot_shape,land_contour,utilities,lot_config,...,3ssn_porch,screen_porch,pool_area,pool_qc,fence,misc_feature,mo_sold,yr_sold,sale_type,saleprice
109,60,RL,69.0552,13517,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,0,0,,,,3,2010,WD,130500
544,60,RL,43.0,11492,Pave,,IR1,Lvl,AllPub,CulDSac,...,0,0,0,,,,4,2009,WD,220000
153,20,RL,68.0,7922,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,0,,,,1,2010,WD,109000
318,60,RL,73.0,9802,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,0,,,,4,2010,WD,174000
255,50,RL,82.0,14235,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,0,,,,3,2010,WD,138500


## Confirming No Null Values

In [4]:
train_clean.isnull().sum().sum()

0

## Create Dummy Variables

We'll create dummy variables for all the categorical variables in the data set.

This is also referred to as one-hot encoding. A categorical feature with k levels produces k new features, each taking the value 0 or 1 to denote the presence of that level.

This is essential to the modeling process, as most machine learning algorithms require data to be represented numerically.
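To make the "k levels become k columns" idea concrete, here is a minimal sketch on a hypothetical toy frame (the column values, including the `Dirt` level, are made up for illustration and are not taken from the housing data):

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column with k = 3 levels.
toy = pd.DataFrame({
    "lot_area": [8000, 9500, 7200, 11000],
    "street": ["Pave", "Grvl", "Pave", "Dirt"],
})

toy_dummies = pd.get_dummies(toy)

# The numeric column passes through unchanged; the categorical column becomes
# one indicator column per level.
dummy_cols = [c for c in toy_dummies.columns if c.startswith("street_")]
print(dummy_cols)
print(toy_dummies)
```

Each row has exactly one of the `street_*` indicators switched on. For linear models it is also common to pass `drop_first=True` so that one level acts as the baseline, though that is not what we do in this notebook.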

In [5]:
train_clean_dummies = pd.get_dummies(train_clean)

In [6]:
train_clean_dummies.shape

(2051, 299)

In [7]:
train_clean_dummies.head()

id,ms_subclass,lot_frontage,lot_area,overall_qual,overall_cond,year_built,year_remod/add,mas_vnr_area,bsmtfin_sf_1,bsmtfin_sf_2,...,misc_feature_Shed,sale_type_COD,sale_type_CWD,sale_type_Con,sale_type_ConLD,sale_type_ConLI,sale_type_ConLw,sale_type_New,sale_type_Oth,sale_type_WD
109,60,69.0552,13517,6,8,1976,2005,289.0,533.0,0.0,...,0,0,0,0,0,0,0,0,0,1
544,60,43.0,11492,7,5,1996,1997,132.0,637.0,0.0,...,0,0,0,0,0,0,0,0,0,1
153,20,68.0,7922,5,7,1953,2007,0.0,731.0,0.0,...,0,0,0,0,0,0,0,0,0,1
318,60,73.0,9802,5,5,2006,2007,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1
255,50,82.0,14235,6,8,1900,1993,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,1


## Set up `X` and `y`

Next we'll create a DataFrame with all our predictor variables and a Series holding the dependent variable (sale price).

In [9]:
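# Drop the target column by name; the positional axis argument is not accepted by newer pandas versions.
X = train_clean_dummies.drop(columns='saleprice')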
y = train_clean_dummies.saleprice

## Create training and validation sets

Next we'll use `train_test_split` to split our data into a training set and a test set. We'll fit our model on the training data and evaluate it on the test data to measure how well it generalizes. By default, `train_test_split` puts 75% of the observations in the training set and holds out the remaining 25% for testing.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42)
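As a quick sanity check of the default split, the sketch below runs `train_test_split` without a `test_size` on hypothetical data (the `*_demo` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features.
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# No test_size is given, so scikit-learn falls back to its default of 0.25.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

Applied to our 2,051 rows, the same default should leave 1,538 observations for training and 513 for testing.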

## Scale the data

Scaling the data means that we will transform the data so that each feature will have a mean of 0 and a standard deviation of 1.

In [11]:
# Instantiate StandardScaler
ss = StandardScaler()
# Fit on the training data only, so no information from the test set leaks into the scaling
ss.fit(X_train)
X_train_ss = ss.transform(X_train)
X_test_ss = ss.transform(X_test)
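The following sketch shows the same fit-on-train, transform-both pattern on hypothetical data (the `*_demo` arrays and the random values are assumptions for illustration), and checks that the scaled training features end up with mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features on very different scales.
X_tr_demo = rng.normal(loc=[10.0, 5000.0], scale=[2.0, 800.0], size=(200, 2))
X_te_demo = rng.normal(loc=[10.0, 5000.0], scale=[2.0, 800.0], size=(50, 2))

scaler = StandardScaler().fit(X_tr_demo)    # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr_demo)
X_te_scaled = scaler.transform(X_te_demo)   # reuses the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_tr_scaled.std(axis=0).round(6))
print(X_te_scaled.mean(axis=0).round(3), X_te_scaled.std(axis=0).round(3))
```

The test features will be close to, but not exactly, mean 0 and standard deviation 1, because they are scaled with the training statistics.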



### Lasso Regression

The Lasso is a type of linear regression model that performs both regularization and feature selection. It adds a penalty on the absolute size of the coefficients to the loss function, which encourages the coefficients of less important features to shrink to exactly zero, effectively dropping those features from the model.

We set an alpha value, which controls the strength of the penalty. The higher the alpha, the larger the penalty applied to our coefficients. An alpha of zero would recover the ordinary least squares loss function.
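To see the effect of alpha on feature selection, here is a minimal sketch on synthetic data (the data set from `make_regression` and the three alpha values are assumptions chosen for illustration, not values used in this notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Count how many coefficients survive at each penalty strength.
nonzero_counts = {}
for alpha in [0.1, 10.0, 1e4]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_counts)
```

With a small alpha most coefficients remain in the model, and with a very large alpha every coefficient is pushed to zero, leaving only the intercept.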

Next we'll test 85 different alpha values between 0.15 and 1 and use 5-fold cross-validation (cv=5) to select the best one. The training data is split into 5 folds; for each alpha, the model is fit on 4 of the folds and evaluated on the held-out fold, rotating through all 5. `LassoCV` then selects the alpha with the lowest mean squared error averaged across the folds.

In [12]:
# Set up a list of Lasso alphas to check.
l_alphas = np.linspace(0.15, 1, 85)
# Generates 85 equally spaced values between 0.15 and 1.

# Cross-validate over our list of Lasso alphas.
lasso_model = LassoCV(alphas=l_alphas, cv=5)

# Fit the model; LassoCV refits on all the training data using the best Lasso alpha.
lasso_model = lasso_model.fit(X_train_ss, y_train)
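The selection rule can be reproduced by hand from the fitted object's `mse_path_` and `alphas_` attributes. The sketch below does this on synthetic data (the `make_regression` data set is an assumption for illustration; the alpha grid matches the one above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

alphas = np.linspace(0.15, 1, 85)
cv_model = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

# mse_path_ has one row per alpha and one column per fold.
mean_mse = cv_model.mse_path_.mean(axis=1)

# The chosen alpha is the one with the lowest mean MSE across the folds.
best_by_hand = cv_model.alphas_[np.argmin(mean_mse)]

print(cv_model.mse_path_.shape)
print(cv_model.alpha_, best_by_hand)
```

Plotting `mean_mse` against `cv_model.alphas_` is a simple way to see whether the error curve has a clear minimum inside the grid.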



We can check to see which alpha value was selected by the model

In [31]:
# Here is the optimal value of alpha
lasso_optimal_alpha = lasso_model.alpha_
lasso_optimal_alpha

0.15

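An alpha of 0.15 was selected, which is the smallest value in our list. Because the selection sits on the edge of the grid, a better value may lie outside the range we searched.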
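A small helper can flag when the selected alpha lands on the boundary of the search grid. This is a sketch only: `grid_edge` is a hypothetical helper, and the wider log-spaced grid is one possible follow-up rather than something this notebook ran.

```python
import numpy as np

def grid_edge(selected_alpha, alpha_grid):
    """Return 'lower', 'upper', or None depending on where the selected alpha sits."""
    grid = np.asarray(alpha_grid, dtype=float)
    if np.isclose(selected_alpha, grid.min()):
        return "lower"
    if np.isclose(selected_alpha, grid.max()):
        return "upper"
    return None

l_alphas = np.linspace(0.15, 1, 85)
edge = grid_edge(0.15, l_alphas)   # 0.15 is the value reported above
print(edge)

# One possible follow-up (an assumption): a log-spaced grid covering several
# orders of magnitude on both sides of the original range.
wider_alphas = np.logspace(-3, 3, 50)
```

If the selection still lands on an edge with the wider grid, the range can be extended again in that direction.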

Next we'll check our training R2 score:

In [32]:
lasso_model.score(X_train_ss, y_train)

0.915626686604713

This indicates that the model explains 91.6% of the variation in sale price in the training data.

In [33]:
lasso_model.score(X_test_ss, y_test)

0.8996233518851791

The test R2 score is about 0.90, slightly lower than our training R2 score. The small gap suggests the model is not badly overfit.
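For reference, the R2 score returned by `.score` is one minus the ratio of unexplained to total variation. The sketch below computes it by hand and compares it with `r2_score`; the `y_true` values are the five sale prices shown earlier, while the `y_pred` values are hypothetical predictions made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Sale prices from the first five rows; predictions are hypothetical.
y_true = np.array([130500, 220000, 109000, 174000, 138500], dtype=float)
y_pred = np.array([128000, 210000, 118000, 170000, 145000], dtype=float)

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_by_hand = 1 - ss_res / ss_tot

print(r2_by_hand, r2_score(y_true, y_pred))
```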

Again, we'll create a DataFrame of our coefficient weights, indexed by the actual column names.

In [35]:
columns = X.columns

In [36]:
betas = pd.DataFrame(lasso_model.coef_, index=columns)

In [37]:
betas.head()

Unnamed: 0,0
ms_subclass,-3962.133887
lot_frontage,-791.882342
lot_area,2308.656061
overall_qual,9552.983859
overall_cond,5000.430263


In [38]:
betas.columns = ['weights']
betas['abs_w'] = betas['weights'].abs()

In [39]:
betas.head()

Unnamed: 0,weights,abs_w
ms_subclass,-3962.133887,3962.133887
lot_frontage,-791.882342,791.882342
lot_area,2308.656061,2308.656061
overall_qual,9552.983859,9552.983859
overall_cond,5000.430263,5000.430263


Below we can look at the features with the largest absolute weights, which have the greatest influence on the model's predictions.

In [42]:
weights_top10 = betas.sort_values('abs_w', ascending=False)['weights'].head(10)

In [43]:
weights_top10

pool_area               24532.604870
pool_qc_Gd             -18586.641934
2nd_flr_sf              14731.360578
pool_qc_Fa             -10492.871976
1st_flr_sf               9724.499892
overall_qual             9552.983859
neighborhood_NridgHt     7981.109287
neighborhood_StoneBr     7517.270716
pool_qc_TA              -7301.384154
exter_qual_Ex            6624.518840
Name: weights, dtype: float64

The first feature is the pool area. Because the features were standardized, the coefficient of 24,532 tells us that, holding the other features fixed, the model predicts an increase of about $24.5K in sale price when pool_area increases by 1 standard deviation. We can check the standard deviation of pool_area below:

In [44]:
np.std(train_clean.pool_area)

37.773358297290855

So if the size of the pool increases by about 38 sq ft, the model predicts that the sale price increases by roughly $24.5K.
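Dividing a standardized weight by the feature's standard deviation converts it back to original units. The sketch below does this with the two numbers printed above. It is an approximation: the standard deviation here was computed on all of `train_clean`, while the scaler was fit on `X_train` only, so the exact per-feature value would come from the fitted scaler's `scale_` attribute.

```python
# Values copied from the outputs above.
coef_per_std = 24532.604870          # Lasso weight for pool_area (per 1 standard deviation)
pool_area_std = 37.773358297290855   # np.std(train_clean.pool_area)

# Dividing by the standard deviation gives dollars per square foot of pool.
coef_per_sqft = coef_per_std / pool_area_std
print(round(coef_per_sqft, 2))
```

Read this way, the weight corresponds to several hundred dollars per additional square foot of pool area.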

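Three of the pool quality dummies (pool_qc_Gd, pool_qc_Fa, and pool_qc_TA) have large negative weights, which offset the positive weight on pool_area. This suggests that, in this model, a pool that is not in excellent condition adds little or no value to the home.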

We see that 1st_flr_sf and 2nd_flr_sf both have a positive impact on sale price, as expected.

Next we can look at the features that were eliminated from our model, that is, those whose coefficients were set to zero.

In [49]:
weights_bot50 = betas.sort_values('abs_w', ascending=True)['weights'].head(50)

In [50]:
weights_bot50

ms_zoning_I (all)       0.000000e+00
exter_cond_TA           0.000000e+00
heating_GasA           -0.000000e+00
bsmtfin_type_2_NA       0.000000e+00
alley_Pave              0.000000e+00
condition_2_RRAe        0.000000e+00
garage_qual_NA          0.000000e+00
garage_qual_TA         -0.000000e+00
land_contour_Lvl        0.000000e+00
garage_cond_NA          0.000000e+00
garage_cond_TA          0.000000e+00
utilities_AllPub        0.000000e+00
pool_qc_Ex              0.000000e+00
utilities_NoSeWa        0.000000e+00
utilities_NoSewr        0.000000e+00
fence_NA               -0.000000e+00
bsmt_cond_Gd            0.000000e+00
lot_shape_Reg           0.000000e+00
central_air_Y          -0.000000e+00
street_Pave             2.727586e-11
neighborhood_Timber    -1.222753e+00
bsmtfin_type_2_Unf      7.393454e+00
condition_2_Artery      1.105677e+01
bsmt_cond_NA           -1.231797e+01
condition_2_RRAn       -1.726343e+01
condition_2_RRNn        2.425547e+01
misc_feature_Shed      -2.533484e+01

We see that some of the categorical fields we identified during EDA as having high class imbalances, such as the `utilities` dummies and `heating_GasA`, were eliminated from the model.
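Sorting by absolute weight shows the eliminated features, but some weights (such as the 2.7e-11 for `street_Pave`) are not exactly zero. One way to separate dropped features from kept ones is a tolerance check, sketched below on a handful of weights copied from the output above (the tolerance of 1e-8 is an assumption):

```python
import numpy as np
import pandas as pd

# A few weights copied from the output above.
weights = pd.Series({
    "utilities_AllPub": 0.0,
    "heating_GasA": -0.0,
    "street_Pave": 2.727586e-11,
    "neighborhood_Timber": -1.222753,
    "misc_feature_Shed": -25.33484,
})

# Treat anything within a small tolerance of zero as dropped by the Lasso.
is_zero = np.isclose(weights.to_numpy(), 0.0, atol=1e-8)
dropped = weights[is_zero]
kept = weights[~is_zero]

print(list(dropped.index))
print(len(kept))
```

The same mask applied to the full `betas['weights']` column would give the total number of features the Lasso removed.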

Next we'll fit our data using the Ridge regression model.