### Overview
1) First, I will create baseline predictions using the null model, which predicts the mean of the target variable for every observation. These metrics give a reference point for judging how much better or worse each later model performs.

2) Second, I will create a basic model by running some of the original features through a linear regression and evaluating its performance. I will use the features I think are best suited to predicting Sale Price, based on domain knowledge of housing prices and the correlations and linear relationships between features and target that I saw in my EDA notebook.

3) Lastly, I will iterate on the basic model by creating my own features, dummifying categorical features, standardizing the features, and running the data through ridge, lasso, and linear regression to identify the best-performing model.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn import metrics

import warnings
warnings.filterwarnings("ignore")

In [2]:
%store -r traindf
%store -r testdf

In [3]:
traindf.shape

(1883, 225)

In [4]:
testdf.shape

(878, 221)

### Baseline Metrics
To calculate the baseline metrics I will use the mean of the training target as the prediction for every house and evaluate those predictions against the actual sale prices. Because this null model explains none of the variance in the data it was computed from, its R2 is 0 by definition.

In [5]:
# baseline metrics

mean_price= traindf['SalePrice'].mean()
residuals= traindf['SalePrice'] - mean_price

sse= (residuals**2).sum()
mse= (residuals**2).sum() / (len(residuals))
rmse= mse**0.5
mae= np.abs(residuals).mean()
r2= 1 - (sse/sse)  # for the null model, SSE equals the total sum of squares, so R2 is 0 by definition
print(f'RMSE: {rmse}, MAE: {mae}, R2: {r2}')

RMSE: 79012.56150617845, MAE: 58054.939752471, R2: 0.0
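
As a cross-check, the same kind of baseline can also be scored on a held-out validation split. The sketch below is only an illustration: it uses a small synthetic frame (`demo`) in place of `traindf`, and scikit-learn's `DummyRegressor`, which is not used elsewhere in this notebook. On unseen data the null model's R2 is usually slightly below 0, because the training mean differs from the validation mean.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Hypothetical stand-in for traindf; the numbers are synthetic, not the housing data.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"Gr Liv Area": rng.normal(1500, 400, 200)})
demo["SalePrice"] = 100 * demo["Gr Liv Area"] + rng.normal(0, 20000, 200)

X_demo = demo[["Gr Liv Area"]]
y_demo = demo["SalePrice"]
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=42, test_size=0.3)

# strategy="mean" ignores the features and always predicts the training-set mean.
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_tr, y_tr)
baseline_pred_val = baseline.predict(X_va)

# Score the constant prediction against the held-out targets.
baseline_rmse_val = metrics.mean_squared_error(y_va, baseline_pred_val) ** 0.5
baseline_mae_val = metrics.mean_absolute_error(y_va, baseline_pred_val)
baseline_r2_val = metrics.r2_score(y_va, baseline_pred_val)
print(f"RMSE: {baseline_rmse_val}, MAE: {baseline_mae_val}, R2: {baseline_r2_val}")
```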


### Basic Linear Regression Model
Here I will make predictions using a very simple multiple linear regression model that uses only a handful of the features most strongly correlated with Sale Price.

In [6]:
#Model 1 - Basic Model - Simple Multiple Linear Regression using highly correlated features
X= traindf[[ 'Overall Qual',
            'Gr Liv Area', 
            'Garage Area', 
            'Lot Area',  
            'TotRms AbvGrd', 
            'Bedroom AbvGr'
           ]]
y= traindf['SalePrice']
basic_model= LinearRegression()

In [7]:
X_train, X_val, y_train, y_val= train_test_split(X, y, random_state = 42, test_size= 0.3)

In [8]:
basic_model.fit(X_train, y_train)
basic_model.score(X_train, y_train), basic_model.score(X_val, y_val)

(0.8023510634702022, 0.8145583966046375)

In [9]:
#predictions on train and validation datasets
basic_pred_train= basic_model.predict(X_train)
basic_pred_val= basic_model.predict(X_val)

#residuals on train and validation
basic_resid_train= y_train - basic_pred_train
basic_resid_val= y_val - basic_pred_val

#train and validation scores
basic_mse_train= metrics.mean_squared_error(y_train, basic_pred_train)
basic_rmse_train= basic_mse_train**0.5

basic_mse_val= metrics.mean_squared_error(y_val, basic_pred_val)
basic_rmse_val= basic_mse_val**0.5

basic_rmse_train, basic_rmse_val

(35243.641266755796, 33741.140208725934)
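
A single train/validation split can be sensitive to which rows land in each set. One way to check that is k-fold cross-validation with `cross_val_score`, which is imported at the top of this notebook. The sketch below is a minimal illustration on synthetic data; `X_cv` and `y_cv` are hypothetical stand-ins and the coefficients used to generate them are made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data; only the column names echo the basic model.
rng = np.random.default_rng(0)
n = 300
X_cv = pd.DataFrame({
    "Overall Qual": rng.integers(1, 11, n),
    "Gr Liv Area": rng.normal(1500, 400, n),
})
y_cv = 20000 * X_cv["Overall Qual"] + 60 * X_cv["Gr Liv Area"] + rng.normal(0, 15000, n)

# The default scoring for a regressor is R2; one score is returned per fold.
r2_folds = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5)

# scikit-learn reports error metrics as negative values, so flip the sign to get RMSE.
neg_rmse_folds = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=5, scoring="neg_root_mean_squared_error"
)
rmse_folds = -neg_rmse_folds

print(f"R2 per fold: {r2_folds}, mean: {r2_folds.mean()}")
print(f"RMSE per fold: {rmse_folds}, mean: {rmse_folds.mean()}")
```

If the fold scores are close to each other, the single-split scores above are probably representative; a wide spread would suggest they depend on the particular split.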

In [10]:
#make predictions on the test data

X_test= testdf[[ 'Overall Qual',
            'Gr Liv Area', 
            'Garage Area', 
            'Lot Area',  
            'TotRms AbvGrd', 
            'Bedroom AbvGr'
           ]]

basic_model_preds= basic_model.predict(X_test)
basic_model_kaggle = pd.DataFrame(testdf['Id'])
basic_model_kaggle.reset_index(drop=True, inplace=True)
basic_model_kaggle['SalePrice'] = basic_model_preds
basic_model_kaggle.shape

(878, 2)

In [11]:
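#index=False keeps the DataFrame index out of the file so it only contains Id and SalePrice
basic_model_kaggle.to_csv('datasets/basic_model_kaggle.csv', index=False)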

### Linear Regression with Engineered Features
The basic model showed a large improvement over the baseline metrics; however, the R2 score suggests there is room to increase the complexity of the model to drive more accurate predictions. Here I will add complexity by expanding the feature set and introducing some engineered features.

In [12]:
#Before I can fit the model I need to ensure that the train and test data have the same columns
#Here I am simply dropping categorical dummy columns that aren't present in both datasets
train_temp= traindf.drop(columns= [('Utilities', 'NoSeWa'),('Neighborhood', 'GrnHill'),('Neighborhood', 'Landmrk'),('Condition 2', 'Feedr'),('Condition 2', 'PosN'),('Condition 2', 'RRAe'),('Condition 2', 'RRAn'),('Condition 2', 'RRNn'),('Roof Matl', 'Membran'),('Exterior 1st', 'CBlock'),('Exterior 1st', 'ImStucc'),('Exterior 1st', 'Stone'),('Exterior 2nd', 'Stone'),('Bsmt Cond', 'Fa'),('Bsmt Cond', 'Po'),('Heating', 'OthW'),('Heating QC', 'Po'),('Electrical', 'Mix'),('Functional', 'Sal'),('Functional', 'Sev')])

test_temp= testdf.drop(columns= [('MS Zoning', 'I (all)'),('Utilities', 'NoSewr'),('Roof Matl', 'Metal'),('Roof Matl', 'Roll'),('Exterior 1st', 'PreCast'),('Exterior 2nd', 'Other'),('Exterior 2nd', 'PreCast'),('Mas Vnr Type', 'CBlock'),('Foundation', 'Slab'),('Bsmt Qual', 'NA'),('Bsmt Cond', 'NA'),('Bsmt Exposure', 'NA'),('BsmtFin Type 1', 'NA'),('BsmtFin Type 2', 'NA'),('Heating', 'GasA'),('Kitchen Qual', 'Po'),('Sale Type', 'VWD')])
test_temp.shape, train_temp.shape

((878, 204), (1883, 205))
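
Listing the mismatched dummy columns by hand works, but it has to be redone whenever the dummies change. A possible alternative is to compute the shared columns programmatically. The sketch below assumes two tiny hypothetical frames (`train_demo`, `test_demo`) with plain string column names, not the real `traindf` and `testdf`.

```python
import pandas as pd

# Hypothetical frames: each has one dummy column the other lacks.
train_demo = pd.DataFrame({
    "Id": [1, 2],
    "Lot Area": [8000, 9500],
    "Heating_OthW": [0, 1],
    "SalePrice": [150000, 210000],
})
test_demo = pd.DataFrame({
    "Id": [3, 4],
    "Lot Area": [7000, 12000],
    "Roof Matl_Metal": [1, 0],
})

# Keep only the columns present in both frames, preserving the train column order.
feature_cols = [c for c in train_demo.columns if c in test_demo.columns]

# The target exists only in train, so it is added back explicitly.
train_aligned = train_demo[feature_cols + ["SalePrice"]]
test_aligned = test_demo[feature_cols]

print(train_aligned.shape, test_aligned.shape)
```

This also guarantees that both frames list their features in the same order, which matters because the fitted model matches features by position. If I wanted to keep train-only dummy columns instead of dropping them, reindexing the test frame to the train feature columns with `fill_value=0` would be another option.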

In [13]:
#model2 engineered features
X2= train_temp.drop(columns=['SalePrice'])
y2= traindf['SalePrice']
model2= LinearRegression()

In [14]:
X2_train, X2_val, y2_train, y2_val = train_test_split (X2, y2, random_state = 42, test_size = 0.3)

In [15]:
##fit the model, predict y, and score the model
model2.fit(X2_train, y2_train)
model2_train_preds= model2.predict(X2_train)
model2_val_preds= model2.predict(X2_val)
r2_train= model2.score(X2_train, y2_train)
r2_val= model2.score(X2_val, y2_val)

model2_mse_train= metrics.mean_squared_error(y2_train, model2_train_preds)
model2_rmse_train= model2_mse_train ** 0.5

model2_mse_val= metrics.mean_squared_error(y2_val, model2_val_preds)
model2_rmse_val= model2_mse_val ** 0.5

print(f'Train Scores Model: 2 r2: {r2_train}, RMSE: {model2_rmse_train}')
print(f'Validation Scores Model: 2 r2: {r2_val}, RMSE: {model2_rmse_val}')

Train Scores Model: 2 r2: 0.9419030556880365, RMSE: 19107.777540676234
Validation Scores Model: 2 r2: 0.9159486061276553, RMSE: 22715.830932102275


In [16]:
test_temp.shape, train_temp.shape

((878, 204), (1883, 205))

In [17]:
#get predictions to submit to Kaggle
# X2_val['Id']
model2_preds= model2.predict(test_temp)
kaggle_model2 = pd.DataFrame(test_temp['Id'])
kaggle_model2.reset_index(drop=True, inplace=True)
kaggle_model2['SalePrice'] = model2_preds
kaggle_model2.shape

(878, 2)

In [18]:
kaggle_model2.to_csv('datasets/kaggle_model2.csv', index = False)

### Model Complexity Continued
Now that I have established a fairly accurate model, I want to tweak it by adding complexity and applying different estimators to see if I can drive any further improvement in the predictions. Here I will scale the features and run the same feature set from the previous model through Lasso and Ridge regression.

In [19]:
#Standard Scaler
sc = StandardScaler()

#fit the scaler on the training data only, then apply the same transform to the validation and test data
X_scaled= sc.fit_transform(X2_train)
X_val_scaled= sc.transform(X2_val)
X_test_scaled= sc.transform(test_temp)

#fit a new model using the scaled data
#model_scaled= LinearRegression()
#model_scaled.fit(X_scaled, y2_train)

In [20]:
#Ridge Regression
ridge = Ridge()
ridge.fit(X_scaled, y2_train)
ridge.score(X_scaled, y2_train), ridge.score(X_val_scaled, y2_val)

(0.9418890872278884, 0.9162711103244894)

In [21]:
#Lasso Regression
lasso= Lasso()
lasso.fit(X_scaled, y2_train)
lasso.score(X_scaled, y2_train), lasso.score(X_val_scaled, y2_val)

(0.9419019424391437, 0.9160748216552557)

In [22]:
# Set up a list of Lasso alphas to check.
l_alphas = np.logspace(-4, 1, 100)
# Cross-validate over our list of Lasso alphas.
lasso_cv= LassoCV(alphas= l_alphas)
# Fit the model; LassoCV selects the best lasso alpha by cross-validation.
lasso_cv.fit(X_scaled, y2_train)

lasso_cv.score(X_scaled, y2_train), lasso_cv.score(X_val_scaled, y2_val)

(0.9418179911925115, 0.9169685248963709)
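
After fitting `LassoCV` it can help to look at which alpha was chosen and how many coefficients the penalty shrank to exactly zero. The sketch below shows one way to do that on synthetic data from `make_regression`; the names `lasso_demo`, `best_alpha`, and `n_zeroed` are hypothetical, and `max_iter=10000` is an assumption added to reduce convergence warnings at small alphas.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 5 actually drive the target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same alpha grid as the cell above.
alphas_demo = np.logspace(-4, 1, 100)
lasso_demo = LassoCV(alphas=alphas_demo, cv=5, max_iter=10000)
lasso_demo.fit(X_demo_scaled, y_demo)

# alpha_ is the penalty strength selected by cross-validation.
best_alpha = lasso_demo.alpha_
# Lasso sets some coefficients to exactly zero, which acts as feature selection.
n_zeroed = int(np.sum(lasso_demo.coef_ == 0))

print(f"best alpha: {best_alpha}, zeroed coefficients: {n_zeroed} of {lasso_demo.coef_.size}")
```

If the selected alpha sits at either end of the grid, that is usually a hint to widen the range passed to `np.logspace`.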

In [23]:
#RMSE scores for validation and train
y_preds_train= lasso_cv.predict(X_scaled)
y_preds_val= lasso_cv.predict(X_val_scaled)

rmse_lassocv_train= (metrics.mean_squared_error(y2_train, y_preds_train))**0.5
rmse_lassocv_val= (metrics.mean_squared_error(y2_val, y_preds_val))**0.5

print(f' RMSE train {rmse_lassocv_train}, RMSE val {rmse_lassocv_val}')

 RMSE train 19121.761055158677, RMSE val 22577.588038814592


In [24]:
lassocv_preds= lasso_cv.predict(X_test_scaled)
kaggle_lassocv = pd.DataFrame(test_temp['Id'])
kaggle_lassocv.reset_index(drop=True, inplace=True)
kaggle_lassocv['SalePrice'] = lassocv_preds
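#index=False keeps the DataFrame index out of the file so it only contains Id and SalePrice
kaggle_lassocv.to_csv('datasets/kaggle_lassocv.csv', index=False)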

In [25]:
#optimize the hyperparameters with a data pipeline and gridsearch

pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lass', LassoCV())
])

pipe.fit(X2_train, y2_train)
pipe.score(X2_val, y2_val)

0.916078570884413

In [26]:
pipe_params = {
    'ss__with_mean': [True, False],
    'lass__alphas': [np.logspace(-3, 0, 10)]
}

pipe_gridsearch = GridSearchCV(pipe, # What is the model we want to fit?
                                 pipe_params, # What is the dictionary of hyperparameters?
                                 cv=5, # What number of folds in CV will we use?
                                 verbose=1)

In [27]:
# Fit the GridSearchCV object to the data.
pipe_gridsearch.fit(X2_train, y2_train);

Fitting 5 folds for each of 2 candidates, totalling 10 fits


In [28]:
#I ran a grid search over a pipeline that combines a standard scaler with a lasso regression model
#Because the score is no better than that of the less complex model, I am sticking with the simpler approach here
pipe_gridsearch.score(X2_val, y2_val)

0.9160748216552557
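
The fit log above reports only 2 candidates because `lass__alphas` is given as a single array, so the grid varies only `with_mean` while `LassoCV` chooses alpha internally. A minimal sketch of letting the grid search pick alpha directly, using a plain `Lasso` step, is shown below. It runs on synthetic data from `make_regression`, and `demo_pipe`, `demo_params`, and `demo_search` are hypothetical names, so the scores will not match the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature set.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=42, test_size=0.3)

# Scaling happens inside the pipeline, so each CV fold is scaled using only its own training part.
demo_pipe = Pipeline([
    ("ss", StandardScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

# Each alpha is a separate candidate, giving 10 candidates instead of 2.
demo_params = {"lasso__alpha": np.logspace(-3, 0, 10)}

demo_search = GridSearchCV(demo_pipe, demo_params, cv=5)
demo_search.fit(X_tr, y_tr)

print(demo_search.best_params_, demo_search.score(X_va, y_va))
```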

In [39]:
# I am going to export these model coefficients to a CSV file so I can visualize them in Excel
model2_coefficients= pd.DataFrame(model2.coef_, X2.columns).sort_values(by= 0, ascending= False)
model2_coefficients.to_csv('datasets/model2_coefs.csv')
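
For visualization it may be more useful to rank coefficients by absolute size, since large negative coefficients matter as much as large positive ones. The sketch below shows one way to build such a table; it fits a small hypothetical model (`demo_model`) on synthetic data rather than reusing `model2`. Note that `model2` was fitted on unscaled features, so its raw coefficient sizes depend on each feature's units and are not directly comparable across features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with known effects: b is strongest (negative), then a, then c.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.normal(size=100),
    "c": rng.normal(size=100),
})
y_demo = 3 * X_demo["a"] - 5 * X_demo["b"] + 0.1 * X_demo["c"] + rng.normal(scale=0.1, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)

# One row per feature, ranked by the magnitude of its coefficient.
coef_table = pd.DataFrame({"feature": X_demo.columns, "coefficient": demo_model.coef_})
coef_table["abs_coefficient"] = coef_table["coefficient"].abs()
coef_table = coef_table.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)

print(coef_table)
```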