# Ames Housing Data: Modeling
---

In this notebook, we build models iteratively, testing different feature sets to find which features are most strongly associated with sale price after accounting for the interplay between variables. We run our features through four linear regression approaches:

1. Linear Regression <u>**Without**</u> Log-Transformed Target Variable
2. Linear Regression <u>**With**</u> Log-Transformed Target Variable
3. Linear Regression <u>**With**</u> Log-Transformed Target Variable and RidgeCV Regularization
4. Linear Regression <u>**With**</u> Log-Transformed Target Variable and LassoCV Regularization
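
Methods 2 through 4 fit the regression on the natural log of Sale Price and exponentiate the predictions to get back to dollars. The sketch below illustrates that round trip with made-up prices (not the Ames data). It also shows that exponentiating the mean of the logs gives the geometric mean, which sits at or below the arithmetic mean, so a "predict the mean" baseline differs slightly between the raw and log scales.

```python
import math

# Hypothetical sale prices for illustration only (not taken from the Ames data)
prices = [120_000, 180_000, 250_000, 410_000]

# Forward transform: model the natural log of price
log_prices = [math.log(p) for p in prices]

# Back-transform: exponentiating recovers the original dollars
recovered = [math.exp(lp) for lp in log_prices]

# The mean on the log scale, mapped back to dollars, is the geometric mean
geometric_mean = math.exp(sum(log_prices) / len(log_prices))
arithmetic_mean = sum(prices) / len(prices)

print([round(r) for r in recovered])
print(round(geometric_mean), round(arithmetic_mean))
```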

For more information on the initial data cleaning, exploration, and visualization see the [initial notebook](../code/01_EDA_and_Cleaning.ipynb) of this analysis. For transforming our variables into model-ready form, see the [second notebook](../code/02_Feature_Engineering.ipynb) of this analysis.

For more information on the background, [data](https://jse.amstat.org/v19n3/decock/DataDocumentation.txt), and a summary of methods and findings, please see the associated [README](../Farah_Malik_Proj2_README.md) for this analysis.

### Contents:
- [I. Model Building and Testing](#I.-Model-Building-and-Testing)
    - [Sales Price - Not Log Transformed](#Modeling-Sale-Price-(Not-Log-Transformed))
    - [Sales Price - Log Transformed](#Modeling-Sale-Price-(Log-Transformed))
- [II. Regularization](#II.-Regularization)
    - [Ridge CV](#Ridge-CV)
    - [Lasso CV](#Lasso-CV)

## I. Model Building and Testing
---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# import missingno as msno

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn import metrics 
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder #, PolynomialFeatures
# from sklearn.compose import ColumnTransformer
# from sklearn.neighbors import KNeighborsClassifier

import datetime

In [2]:
hs = pd.read_csv('../datasets/Clean/train.csv', na_values=['NaN', '', 'Missing'], keep_default_na=False)
hs_test = pd.read_csv('../datasets/Clean/test.csv', na_values=['NaN', '', 'Missing'], keep_default_na=False)

In [3]:
pd.set_option('display.max_columns', 1000)
print(hs.columns.tolist(), end=" ")

['Id', 'ms_subclass', 'ms_zoning', 'lot_frontage', 'lot_area', 'street', 'lot_shape', 'land_contour', 'utilities', 'lot_config', 'land_slope', 'neighborhood', 'condition_1', 'condition_2', 'bldg_type', 'house_style', 'overall_qual', 'overall_cond', 'year_built', 'year_remod', 'roof_style', 'roof_matl', 'exterior_1st', 'exterior_2nd', 'mas_vnr_type', 'mas_vnr_area', 'exter_qual', 'exter_cond', 'foundation', 'bsmt_qual', 'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1', 'bsmtfin_sf_1', 'bsmtfin_type_2', 'bsmtfin_sf_2', 'bsmt_unf_sf', 'total_bsmt_sf', 'heating', 'heating_qc', 'central_air', 'electrical', '1st_flr_sf', '2nd_flr_sf', 'low_qual_fin_sf', 'gr_liv_area', 'bsmt_full_bath', 'bsmt_half_bath', 'full_bath', 'half_bath', 'bedroom_abvgr', 'kitchen_abvgr', 'kitchen_qual', 'totrms_abvgrd', 'functional', 'fireplaces', 'fireplace_qu', 'garage_type', 'garage_yr_blt', 'garage_finish', 'garage_cars', 'garage_area', 'garage_qual', 'garage_cond', 'paved_drive', 'wood_deck_sf', 'open_porch_sf', '

In [4]:
# UPDATE FEATURES FOR TESTING HERE
feats_updated = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'age', 'garage_area', 'kitchen_qual_Fa',
 'kitchen_qual_Gd', 'kitchen_qual_TA', 'was_remod', 'bsmt_cat_finished','bsmt_cat_unfinished', 'grg_qual_num', 'garage_cat_finished', 'garage_cat_unfinished', 'garage_cat_rough_finished', 'cond12_feeder_st',
 'cond12_near_park', 'cond12_near_rr', 'cond12_norm', 'lotconfig_culdsac', 'lotconfig_inside', 'hi_bsmt_exposure', 'nbr_rank']

### Modeling Sale Price (Not Log Transformed)

In [5]:
def mod_iteration(feats):
    
    # Fit regression to X_train and y_train (75% of training.csv)
    X = hs[feats]
    y = hs['SalePrice']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 531)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    
    # Predict SalePrice for 25% testing data within train.csv and compare to truth to get residuals
    y_preds = lr.predict(X_test)
    MSE = metrics.mean_squared_error(y_test, y_preds)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
        
    for i, coef in zip(X.columns, lr.coef_):
        print(f"{i}: {coef}")
    print(f"intercept: {lr.intercept_}")
    
    return f"Training R2: {lr.score(X_train, y_train)}, Testing R2: {lr.score(X_test, y_test)}, MSE: {MSE}, RMSE: {RMSE}"
    
mod_iteration(feats_updated)

overall_qual: 11804.824274025836
year_built: -410.5353268900927
year_remod: 172.00976309426602
total_bsmt_sf: 34.30426812411277
gr_liv_area: 52.17362331117928
full_bath: -8145.517803792539
fireplaces: 6412.174583419767
age: -553.5199622532209
garage_area: 43.71183061537619
kitchen_qual_Fa: -53701.51691238236
kitchen_qual_Gd: -51089.166041880686
kitchen_qual_TA: -56445.04567581338
was_remod: 8142.983106854677
bsmt_cat_finished: -21644.16920072713
bsmt_cat_unfinished: -31213.474983927346
grg_qual_num: 9866.08968427968
garage_cat_finished: -31071.90266447083
garage_cat_unfinished: -36015.396668255395
garage_cat_rough_finished: -37774.44812588643
cond12_feeder_st: 8112.763353697516
cond12_near_park: 28556.98071340498
cond12_near_rr: 9932.878439017215
cond12_norm: 12892.289059668099
lotconfig_culdsac: 10641.477926450098
lotconfig_inside: 1367.2451261571337
hi_bsmt_exposure: 10308.998256744513
nbr_rank: 1612.944007931598
intercept: 510191.0751264385


'Training R2: 0.8809534470072595, Testing R2: 0.8710097545600572, MSE: 764785529.9624298, RMSE: 27654.75600981556'

In [6]:
def mod_runon_all(feats):
    
    # Fit regression to entire data
    X = hs[feats]
    y = hs['SalePrice']
    lr_all = LinearRegression()
    lr_all.fit(X, y)
    
    # Predict SalePrice for entire data and compare to truth to get residuals
    y_preds_all = lr_all.predict(hs[feats])
    y_true = hs['SalePrice'] #Can use var from entire dataset
    MSE = metrics.mean_squared_error(y_true, y_preds_all)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
    
    # Use regression to predict SalePrice on Test.csv (unseen) data
    y_preds_all_test = lr_all.predict(hs_test[feats])
    hs_test['SalePrice'] = y_preds_all_test
    
    # Null model for comparison
    hs['null_pred'] = np.mean(y)
    null_pred = hs['null_pred']
    null_MSE = metrics.mean_squared_error(y_true, null_pred)
    null_RMSE = np.sqrt(null_MSE)
    
    # Submit Predictions to Kaggle
    #submit = hs_test[['Id', 'SalePrice']]
    #submit.set_index('Id', inplace=True)
    #dt = datetime.datetime.now().strftime("%m%d%Y%H")
    #submit.to_csv(f'../datasets/Submissions/Features_Submission-{dt}.csv')
        
    for i, coef in zip(X.columns, lr_all.coef_):
        print(f"{i}: {coef}")
    print(f"intercept: {lr_all.intercept_}")
    print(f"null_MSE: {null_MSE}, null_RMSE: {null_RMSE}")
    
    return f"Full Data R2: {lr_all.score(X, y)}, MSE = {MSE}, RMSE = {RMSE}"

mod_runon_all(feats_updated)

overall_qual: 12151.581022899345
year_built: -585.8462202756798
year_remod: 196.31828856779129
total_bsmt_sf: 34.31428826277168
gr_liv_area: 52.842223083098915
full_bath: -7841.326090856938
fireplaces: 6596.682174842231
age: -717.3638658946714
garage_area: 41.75916216027622
kitchen_qual_Fa: -51807.77565446007
kitchen_qual_Gd: -50797.037977397646
kitchen_qual_TA: -54888.96195427756
was_remod: 7724.500327376971
bsmt_cat_finished: -18042.387453871354
bsmt_cat_unfinished: -28049.355525139814
grg_qual_num: 6232.194797085448
garage_cat_finished: -21465.53690297896
garage_cat_unfinished: -26616.58590005632
garage_cat_rough_finished: -27280.76094591148
cond12_feeder_st: 4549.895677239171
cond12_near_park: 22312.175512042628
cond12_near_rr: 5579.578959396819
cond12_norm: 10417.619381909264
lotconfig_culdsac: 10031.469434333692
lotconfig_inside: -85.38044764889618
hi_bsmt_exposure: 11113.917100889583
nbr_rank: 1548.2313114093179
intercept: 811928.219295017
null_MSE: 6209949084.095235, null_RMSE:

'Full Data R2: 0.8790536004573352, MSE = 751070983.064588, RMSE = 27405.67428589539'

<span style= 'color:blue'>**Under this linear model, where the target variable (Sale Price) <u>was not</u> log-transformed, the training and testing R^2 were similar: 88% and 87%, respectively. This suggests that the model is not overfit and that ~87% of the variability in Sale Price is explained by the features in our model. With an RMSE of 27,405 (fit on the full data), this model predicts Sale Price far better than the null model (null_RMSE: 78,803).**</span>

<span style= 'color:blue'>**Some notable model features included:**</span>
- <span style= 'color:blue'>**_Fireplaces_**: For every additional fireplace in the property, the predicted Sale Price increased by ~\\$6,596, holding all other variables constant.</span>
- <span style= 'color:blue'>**_Age_**: For every one-year increase in the age of the property, the predicted Sale Price decreased by \\$717, holding all other variables constant.</span>
- <span style= 'color:blue'>**_Kitchen Quality_**: Holding all other variables constant, having a kitchen of Fair quality vs. Excellent quality decreased the predicted Sale Price by \\$51,807. Even having a Good kitchen vs. an Excellent one decreased the predicted Sale Price by \\$50,797.</span>
- <span style= 'color:blue'>**_Remodeling_**: Holding all other variables constant, properties that had been remodeled had a predicted Sale Price that was \\$7,724 higher on average.</span>
- <span style= 'color:blue'>**_Basement Exposure_**: Having high basement exposure (e.g., walk-out basement, garden-level walls, natural light) was associated with a predicted Sale Price \\$11,113 higher than not having high basement exposure, holding all other variables constant.</span>
- <span style= 'color:blue'>**_Lot Configuration_**: Holding all other variables constant, the predicted Sale Price of a cul-de-sac lot was \\$10,031 higher than that of a corner lot. A corner lot, in turn, had a slightly higher predicted Sale Price than an inside lot: for inside lots, the predicted Sale Price was \\$85 lower than for a corner lot.</span>
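
The comparison with the null model above can be checked by hand. The sketch below copies the two MSE values from the full-data output and derives both RMSEs from them. It also illustrates that, when the null model predicts the mean and both MSEs are computed on the same rows, R^2 is one minus the ratio of the two MSEs. The variable names are chosen for this illustration only.

```python
import math

# Values copied from the full-data output above (target not log-transformed)
model_mse = 751070983.064588
null_mse = 6209949084.095235

# RMSE is the square root of MSE, in the same units as Sale Price (dollars)
model_rmse = math.sqrt(model_mse)
null_rmse = math.sqrt(null_mse)

# With a mean-only null model scored on the same rows,
# R^2 equals 1 minus the ratio of the two MSEs
r2_from_mse = 1 - model_mse / null_mse

print(f"model RMSE: {model_rmse:,.0f}")
print(f"null RMSE:  {null_rmse:,.0f}")
print(f"R^2 implied by the MSE ratio: {r2_from_mse:.4f}")
```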

### Modeling Sale Price (Log Transformed)

In [7]:
def log_mod_iteration(feats):
    
    # Fit regression to X_train and y_train (75% of training.csv)
    X = hs[feats]
    y = hs['log_price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 531)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
        
    # Predict SalePrice for 25% testing data within train.csv and compare to truth to get residuals
    y_preds = np.exp(lr.predict(X_test)) # Undoing the logged price
    MSE = metrics.mean_squared_error(np.exp(y_test), y_preds)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
        
    for i, coef in zip(X.columns, np.exp(lr.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(lr.intercept_)}")
    
    return f"Training R2: {lr.score(X_train, y_train)}, Testing R2: {lr.score(X_test, y_test)}, MSE: {MSE}, RMSE: {RMSE}"
    
log_mod_iteration(feats_updated)

overall_qual: 1.0649930257040596
year_built: 0.997328326731778
year_remod: 1.002028242037629
total_bsmt_sf: 1.0001454538552237
gr_liv_area: 1.0002564586510467
full_bath: 0.9855419914410289
fireplaces: 1.0515311608010953
age: 0.9964379435669756
garage_area: 1.0001368435489209
kitchen_qual_Fa: 0.8337583509783775
kitchen_qual_Gd: 0.9083965913977291
kitchen_qual_TA: 0.8769744321001228
was_remod: 1.009637896122197
bsmt_cat_finished: 1.0115078130951194
bsmt_cat_unfinished: 0.9432869011987438
grg_qual_num: 1.0663610999927287
garage_cat_finished: 0.9275337260902414
garage_cat_unfinished: 0.903813073448707
garage_cat_rough_finished: 0.9170257092128509
cond12_feeder_st: 1.0683746335567368
cond12_near_park: 1.1614864510465106
cond12_near_rr: 1.0803118340479656
cond12_norm: 1.0972223758833322
lotconfig_culdsac: 1.029965060606294
lotconfig_inside: 0.9941061409721528
hi_bsmt_exposure: 1.0352303887843572
nbr_rank: 1.0065833584647488
intercept: 210777.0370283581


'Training R2: 0.9028044824964966, Testing R2: 0.8469805826747016, MSE: 646845076.6517401, RMSE: 25433.14916898299'

In [8]:
def log_mod_runon_all(feats):
    
    # Fit regression to entire data
    X = hs[feats]
    y = hs['log_price']
    lr_all = LinearRegression()
    lr_all.fit(X, y)
    
    # Predict SalePrice for entire data and compare to truth to get residuals
    y_preds_all = np.exp(lr_all.predict(hs[feats]))
    y_true = hs['SalePrice'] #Can use var from entire dataset
    MSE = metrics.mean_squared_error(y_true, y_preds_all)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
    
    # Use regression to predict SalePrice on Test.csv (unseen) data
    y_preds_all_test = np.exp(lr_all.predict(hs_test[feats]))
    hs_test['SalePrice'] = y_preds_all_test
    
    # Null model for comparison
    hs['null_pred'] = np.exp(np.mean(y))
    null_pred = hs['null_pred']
    null_MSE = metrics.mean_squared_error(y_true, null_pred)
    null_RMSE = np.sqrt(null_MSE)
    
    # Submit Predictions to Kaggle
    #submit = hs_test[['Id', 'SalePrice']]
    #submit.set_index('Id', inplace=True)
    #dt = datetime.datetime.now().strftime("%m%d%Y%H")
    #submit.to_csv(f'../datasets/Submissions/Features_Submission_logy-{dt}.csv')
        
    for i, coef in zip(X.columns, np.exp(lr_all.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(lr_all.intercept_)}")
    print(f"null_MSE: {null_MSE}, null_RMSE: {null_RMSE}")
    
    return f"Full Data R2: {lr_all.score(X, y)}, MSE = {MSE}, RMSE = {RMSE}"

log_mod_runon_all(feats_updated)

overall_qual: 1.0696865535392057
year_built: 0.9959033162525016
year_remod: 1.002016109483604
total_bsmt_sf: 1.000145819116516
gr_liv_area: 1.0002620830250117
full_bath: 0.9827221661600672
fireplaces: 1.0485701527856872
age: 0.995148746389361
garage_area: 1.000112803949039
kitchen_qual_Fa: 0.8362399349796291
kitchen_qual_Gd: 0.91334882606626
kitchen_qual_TA: 0.8893959546874888
was_remod: 1.0111401480164193
bsmt_cat_finished: 1.0343143331644078
bsmt_cat_unfinished: 0.9604765127312241
grg_qual_num: 1.069397456003843
garage_cat_finished: 0.9177955327307376
garage_cat_unfinished: 0.8893246786037156
garage_cat_rough_finished: 0.9081651198899384
cond12_feeder_st: 1.0556430807336048
cond12_near_park: 1.1352560495699646
cond12_near_rr: 1.0581557791699316
cond12_norm: 1.0817343297621271
lotconfig_culdsac: 1.0227249064369153
lotconfig_inside: 0.9867104691697841
hi_bsmt_exposure: 1.0402720139384496
nbr_rank: 1.0070666696308654
intercept: 3668821.8353902274
null_MSE: 6410615844.843487, null_RMSE: 

'Full Data R2: 0.8892826104265844, MSE = 545503324.5779072, RMSE = 23356.012600140188'

<span style= 'color:blue'>**Under this linear model, where the target variable (Sale Price) <u>was</u> log-transformed, the training and testing R^2 differed by roughly 5 to 6 percentage points: 90% and 85%, respectively. This gap may indicate some overfitting. The testing R^2 means that ~85% of the variability in log Sale Price is explained by the features in our model. With an RMSE of 23,356 (fit on the full data), this model predicts Sale Price far better than the null model (null_RMSE: 80,066).**</span>

<span style= 'color:blue'>**Some notable model features included:**</span>
- <span style= 'color:blue'>**_Fireplaces_**: For every additional fireplace in the property, the predicted Sale Price increased by ~5%, holding all other variables constant.</span>
- <span style= 'color:blue'>**_Kitchen Quality_**: Holding all other variables constant, having a kitchen of Fair quality vs. Excellent quality decreased the predicted Sale Price by 16%. Even having a Good kitchen vs. an Excellent one decreased the predicted Sale Price by 9%.</span>
- <span style= 'color:blue'>**_Basement Exposure_**: Having high basement exposure (e.g., walk-out basement, garden-level walls, natural light) was associated with a predicted Sale Price ~4% higher than not having high basement exposure, holding all other variables constant.</span>
- <span style= 'color:blue'>**_Proximity to Certain Conditions_**: Holding all other variables constant, the Sale Price of properties near or adjacent to amenities such as a park or greenbelt was predicted to be 13.5% higher than if the property was adjacent to a large arterial street. Properties near a smaller feeder street or a railroad had predicted Sale Prices that were 5-6% higher than those of properties adjacent to arterial streets.</span>
- <span style= 'color:blue'>**_Garage Quality vs. Finished Status_**: Holding all other variables constant, for every one-unit increase in garage quality (ordered None, Poor, Fair, Average, Good, Excellent), the predicted Sale Price increased by 7%. However, having a finished, rough-finished, or unfinished garage, compared to no garage, appeared to decrease the predicted Sale Price by 8-11%, holding all else equal. These results are unexpected and may be due to multicollinearity between these features.</span>
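
The percentages quoted above come from the printed multipliers, which are the exponentiated coefficients of the log-price model. A minimal sketch of the conversion, using a few multipliers copied (rounded) from the full-data output; `pct_change` is a helper name introduced here for illustration and is not part of the notebook.

```python
# Multipliers copied (rounded) from the full-data log-model output above;
# each is exp(coefficient) as printed by log_mod_runon_all
multipliers = {
    "fireplaces": 1.0486,
    "kitchen_qual_Fa": 0.8362,
    "kitchen_qual_Gd": 0.9133,
    "hi_bsmt_exposure": 1.0403,
    "cond12_near_park": 1.1353,
}

def pct_change(multiplier):
    """Percent change in predicted price for a one-unit increase in the feature."""
    return (multiplier - 1) * 100

for name, m in multipliers.items():
    print(f"{name}: {pct_change(m):+.1f}%")

# Effects compound multiplicatively: two extra fireplaces scale price by m squared
two_fireplaces = pct_change(multipliers["fireplaces"] ** 2)
print(f"two fireplaces: {two_fireplaces:+.1f}%")
```

Because the effects multiply rather than add, two additional fireplaces imply slightly more than twice the one-fireplace percentage.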

---
## II. Regularization
Our log-transformed model appears overfit: the R^2 on the training data is noticeably higher than the R^2 on the testing data, which indicates that the model does not perform as well on unseen data.

#### Ridge CV

In [9]:
def log_mod_iteration_rreg(feats):
    
    # Fit regression to X_train and y_train (75% of training.csv)
    X = hs[feats]
    y = hs['log_price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 602)
    
    # Scale features
    sc = StandardScaler()
    Z_train = sc.fit_transform(X_train)
    Z_test = sc.transform(X_test)
    
    # Run Ridge Regression
    r_alphas = np.logspace(0, 5, 150)
    ridge_cv = RidgeCV(alphas = r_alphas, scoring = 'r2', cv = 10)
    ridge_cv.fit(Z_train, y_train)
            
    # Predict SalePrice for 25% testing data within train.csv and compare to truth to get residuals
    y_preds = np.exp(ridge_cv.predict(Z_test)) # Undoing the logged price
    MSE = metrics.mean_squared_error(np.exp(y_test), y_preds)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
        
    for i, coef in zip(X.columns, np.exp(ridge_cv.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(ridge_cv.intercept_)}")
    
    return f"Training R2: {ridge_cv.score(Z_train, y_train)}, Testing R2: {ridge_cv.score(Z_test, y_test)}, MSE: {MSE}, RMSE: {RMSE}"
    
log_mod_iteration_rreg(feats_updated)

overall_qual: 1.0929868252001564
year_built: 1.003262771814451
year_remod: 1.0411226928538944
total_bsmt_sf: 1.0694591125867974
gr_liv_area: 1.129637652218369
full_bath: 0.9991057228617918
fireplaces: 1.0322852082795653
age: 0.9823141844380772
garage_area: 1.0291470631894004
kitchen_qual_Fa: 0.9746680412736631
kitchen_qual_Gd: 0.9647569783781011
kitchen_qual_TA: 0.9530320364108376
was_remod: 1.0068839246749552
bsmt_cat_finished: 1.012127930693443
bsmt_cat_unfinished: 0.9811754889764924
grg_qual_num: 1.0205010605422733
garage_cat_finished: 1.010191943874786
garage_cat_unfinished: 0.9914120521285621
garage_cat_rough_finished: 1.0032891744958643
cond12_feeder_st: 1.0064821214970217
cond12_near_park: 1.0176827774784885
cond12_near_rr: 1.0104551324378386
cond12_norm: 1.023147186400762
lotconfig_culdsac: 1.0042727178148587
lotconfig_inside: 0.9941119244263893
hi_bsmt_exposure: 1.018112668745832
nbr_rank: 1.0418035451136785
intercept: 168613.18234487195


'Training R2: 0.9028823969984556, Testing R2: 0.8391637960587384, MSE: 602610072.3548315, RMSE: 24548.117491058892'

In [10]:
def log_mod_runon_all_rreg(feats):
    
    # Fit regression to entire data
    X = hs[feats]
    y = hs['log_price']
    
    # Scale features
    sc_all = StandardScaler()
    Z_all = sc_all.fit_transform(X)
    
    # Run Ridge Regression
    r_alphas = np.logspace(0, 5, 150)
    ridge_cv_all = RidgeCV(alphas = r_alphas, scoring = 'r2', cv = 10)
    ridge_cv_all.fit(Z_all, y)
    
    # Predict SalePrice for entire data and compare to truth to get residuals
    y_preds_all = np.exp(ridge_cv_all.predict(Z_all)) # Undoing the logged price
    y_true = hs['SalePrice'] #Can use var from entire dataset
    MSE = metrics.mean_squared_error(y_true, y_preds_all)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
    
    # Use regression to predict SalePrice on Test.csv (unseen) data
    # first standard scale
    Z_all_test = sc_all.transform(hs_test[feats])
    y_preds_all_test = np.exp(ridge_cv_all.predict(Z_all_test))
    hs_test['SalePrice'] = y_preds_all_test
   
    # Null model for comparison
    hs['null_pred'] = np.exp(np.mean(y))
    null_pred = hs['null_pred']
    null_MSE = metrics.mean_squared_error(y_true, null_pred)
    null_RMSE = np.sqrt(null_MSE)
    
    # Submit Predictions to Kaggle
    #submit = hs_test[['Id', 'SalePrice']]
    #submit.set_index('Id', inplace=True)
    #dt = datetime.datetime.now().strftime("%m%d%Y%H")
    #submit.to_csv(f'../datasets/Submissions/Features_Submission_logy_ridreg-{dt}.csv')
        
    for i, coef in zip(X.columns, np.exp(ridge_cv_all.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(ridge_cv_all.intercept_)}")
    print(f"null_MSE: {null_MSE}, null_RMSE: {null_RMSE}")

    return f"Full Data R2: {ridge_cv_all.score(Z_all, y)}, MSE = {MSE}, RMSE = {RMSE}"

log_mod_runon_all_rreg(feats_updated)

overall_qual: 1.0987762905628955
year_built: 1.0003232174529473
year_remod: 1.0426519416109088
total_bsmt_sf: 1.0636634366546767
gr_liv_area: 1.133627098687911
full_bath: 0.9918178161597343
fireplaces: 1.0311686879537127
age: 0.9771391861000637
garage_area: 1.0248598334939023
kitchen_qual_Fa: 0.9759800528263719
kitchen_qual_Gd: 0.9604162465012543
kitchen_qual_TA: 0.9472563392886116
was_remod: 1.0051102781331958
bsmt_cat_finished: 1.016300332978322
bsmt_cat_unfinished: 0.9824121556123594
grg_qual_num: 1.0340336658674736
garage_cat_finished: 0.9901853463766699
garage_cat_unfinished: 0.9725520594303667
garage_cat_rough_finished: 0.984752979750718
cond12_feeder_st: 1.0106924261971812
cond12_near_park: 1.0167444188794361
cond12_near_rr: 1.0091297136376525
cond12_norm: 1.025201914161518
lotconfig_culdsac: 1.0058482789282661
lotconfig_inside: 0.9944630576817284
hi_bsmt_exposure: 1.0175037760192966
nbr_rank: 1.0383325509573256
intercept: 168371.1380929812
null_MSE: 6410615844.843487, null_RMSE

'Full Data R2: 0.8889332665661638, MSE = 548721470.0822326, RMSE = 23424.804590054377'

<span style= 'color:blue'>**Under this linear model using Ridge regularization, where the target variable (Sale Price) <u>was</u> log-transformed, the training and testing R^2 differed by roughly 6 percentage points: 90% and 84%, respectively. This gap may indicate some overfitting. The testing R^2 means that ~84% of the variability in log Sale Price is explained by the features in our model. With an RMSE of 23,424 (fit on the full data), this model predicts Sale Price far better than the null model (null_RMSE: 80,066).**</span>

<span style= 'color:blue'>**Some notable model features included (because the features were standardized before fitting, each effect is per one-standard-deviation increase in the feature):**</span>
- <span style= 'color:blue'>**_Kitchen Quality_**: Holding all other variables constant, the Fair, Good, and Average kitchen-quality indicators (relative to Excellent) were each associated with a decrease in the predicted Sale Price of up to 5%. Regularization shrank the beta coefficients toward zero, reducing their impact on our target variable.</span>
- <span style= 'color:blue'>**_Living Area Square-Footage (Above and Below Ground)_**: Holding all other variables constant, for every one-standard-deviation increase in above-ground square footage, Sale Price was predicted to increase by 13%. Likewise, for every one-standard-deviation increase in basement area, the predicted Sale Price increased by 6%, holding all else constant.</span>
- <span style= 'color:blue'>**_Year Remodeled_**: Holding all other variables constant, for every one-standard-deviation increase in remodeling year (i.e., the more recently the property was remodeled), the Sale Price was predicted to increase by 4%.</span>
- <span style= 'color:blue'>**_Neighborhood Rank_**: Holding all other variables constant, for every one-standard-deviation increase in neighborhood rank*, the Sale Price was predicted to increase by 4%.</span>

<span style= 'color:blue'>*Neighborhood rank = a variable created by scoring neighborhoods based on their rankings in median Sale Price (based on training data), average Overall Quality, average Exterior Quality, average functionality score, and average access/amenities score.</span>
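
Since the Ridge model was fit on standardized features, its multipliers describe a one-standard-deviation change rather than a one-unit change. The sketch below shows how such a multiplier could be rescaled to a per-square-foot effect. The standard deviation used here is an assumption picked for illustration; in the notebook the fitted values would be available from the scaler (`sc_all.scale_`).

```python
import math

# Multiplier printed for gr_liv_area by the full-data Ridge model: the effect of a
# one-standard-deviation increase, because the features were standardized first
multiplier_per_sd = 1.1336

# ASSUMPTION: a standard deviation of 500 sq ft, chosen only for illustration;
# it is not taken from the Ames data
assumed_sd_sqft = 500.0

# Go back to the log scale, rescale to one square foot, then re-exponentiate
coef_per_sd = math.log(multiplier_per_sd)
coef_per_sqft = coef_per_sd / assumed_sd_sqft
multiplier_per_sqft = math.exp(coef_per_sqft)
multiplier_per_100_sqft = math.exp(100 * coef_per_sqft)

print(f"per SD:        {(multiplier_per_sd - 1) * 100:.1f}%")
print(f"per sq ft:     {(multiplier_per_sqft - 1) * 100:.3f}%")
print(f"per 100 sq ft: {(multiplier_per_100_sqft - 1) * 100:.1f}%")
```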

#### Lasso CV

In [11]:
def log_mod_iteration_lreg(feats):
    
    # Fit regression to X_train and y_train (75% of training.csv)
    X = hs[feats]
    y = hs['log_price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 602)
    
    # Scale features
    sc = StandardScaler()
    Z_train = sc.fit_transform(X_train)
    Z_test = sc.transform(X_test)
    
    # Run Lasso Regression
    l_alphas = np.logspace(-5, 0, 150)
    lasso_cv = LassoCV(alphas = l_alphas, cv = 10, max_iter=75_000)
    lasso_cv.fit(Z_train, y_train)
            
    # Predict SalePrice for 25% testing data within train.csv and compare to truth to get residuals
    y_preds = np.exp(lasso_cv.predict(Z_test)) # Undoing the logged price
    MSE = metrics.mean_squared_error(np.exp(y_test), y_preds)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
        
    for i, coef in zip(X.columns, np.exp(lasso_cv.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(lasso_cv.intercept_)}")
    
    return f"Training R2: {lasso_cv.score(Z_train, y_train)}, Testing R2: {lasso_cv.score(Z_test, y_test)}, MSE: {MSE}, RMSE: {RMSE}"
    
log_mod_iteration_lreg(feats_updated)

overall_qual: 1.0944725716213615
year_built: 1.0
year_remod: 1.0419231822235242
total_bsmt_sf: 1.070583287073525
gr_liv_area: 1.1337837450897468
full_bath: 0.9970236964304473
fireplaces: 1.0311751234703737
age: 0.9793896642784763
garage_area: 1.0270427918308345
kitchen_qual_Fa: 0.973797894451564
kitchen_qual_Gd: 0.9618578971161179
kitchen_qual_TA: 0.950343515974906
was_remod: 1.0063322691423373
bsmt_cat_finished: 1.008814646012088
bsmt_cat_unfinished: 0.9779581242193113
grg_qual_num: 1.022571427564969
garage_cat_finished: 1.0063610103092122
garage_cat_unfinished: 0.9884523377177753
garage_cat_rough_finished: 1.0
cond12_feeder_st: 1.0065444299238067
cond12_near_park: 1.0175998749335589
cond12_near_rr: 1.0107084865270726
cond12_norm: 1.0234732942574682
lotconfig_culdsac: 1.003900989655572
lotconfig_inside: 0.9941880223427605
hi_bsmt_exposure: 1.017883327936698
nbr_rank: 1.0415731551130674
intercept: 168613.18234487195


'Training R2: 0.9030200359412649, Testing R2: 0.8397330711550361, MSE: 597795364.0840014, RMSE: 24449.854070811987'

In [12]:
def log_mod_runon_all_lreg(feats):
    
    # Fit regression to entire data
    X = hs[feats]
    y = hs['log_price']
    
    # Scale features
    sc_all = StandardScaler()
    Z_all = sc_all.fit_transform(X)
    
    # Run Lasso Regression
    l_alphas = np.logspace(-5, 0, 150)
    lasso_cv_all = LassoCV(alphas = l_alphas, cv = 10, max_iter=75_000)
    lasso_cv_all.fit(Z_all, y)
    
    # Predict SalePrice for entire data and compare to truth to get residuals
    y_preds_all = np.exp(lasso_cv_all.predict(Z_all)) # Undoing the logged price
    y_true = hs['SalePrice'] #Can use var from entire dataset
    MSE = metrics.mean_squared_error(y_true, y_preds_all)
    RMSE = np.sqrt(MSE)  # root of the MSE above; newer scikit-learn releases no longer accept squared=
    
    # Use regression to predict SalePrice on Test.csv (unseen) data
    # first standard scale
    Z_all_test = sc_all.transform(hs_test[feats])
    y_preds_all_test = np.exp(lasso_cv_all.predict(Z_all_test))
    hs_test['SalePrice'] = y_preds_all_test

    # Null model for comparison
    hs['null_pred'] = np.exp(np.mean(y))
    null_pred = hs['null_pred']
    null_MSE = metrics.mean_squared_error(y_true, null_pred)
    null_RMSE = np.sqrt(null_MSE)
    
    # Submit Predictions to Kaggle
    # Build the indexed frame in one step instead of mutating a slice of hs_test in place
    submit = hs_test[['Id', 'SalePrice']].set_index('Id')
    dt = datetime.datetime.now().strftime("%m%d%Y%H")
    submit.to_csv(f'../datasets/Submissions/Features_Submission_logy_lreg-{dt}.csv')
        
    for i, coef in zip(X.columns, np.exp(lasso_cv_all.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(lasso_cv_all.intercept_)}")
    print(f"null_MSE: {null_MSE}, null_RMSE: {null_RMSE}")
    
    return f"Full Data R2: {lasso_cv_all.score(Z_all, y)}, MSE = {MSE}, RMSE = {RMSE}"

log_mod_runon_all_lreg(feats_updated)

overall_qual: 1.0988114390483708
year_built: 0.8943931103927437
year_remod: 1.04293912332863
total_bsmt_sf: 1.0637934595213592
gr_liv_area: 1.135516634981402
full_bath: 0.9904980219939052
fireplaces: 1.0305650899065377
age: 0.8742009619170019
garage_area: 1.0242794015299088
kitchen_qual_Fa: 0.974600197611383
kitchen_qual_Gd: 0.9567468470703837
kitchen_qual_TA: 0.943228153619837
was_remod: 1.0054942246382392
bsmt_cat_finished: 1.015806163839149
bsmt_cat_unfinished: 0.981852822046318
grg_qual_num: 1.0473964000332932
garage_cat_finished: 0.9648834748558818
garage_cat_unfinished: 0.9453320083927481
garage_cat_rough_finished: 0.9588618976096734
cond12_feeder_st: 1.0119043954238789
cond12_near_park: 1.017532614057134
cond12_near_rr: 1.0100450190896277
cond12_norm: 1.0272554980925512
lotconfig_culdsac: 1.0055167741816744
lotconfig_inside: 0.9941302362239561
hi_bsmt_exposure: 1.0170259557842876
nbr_rank: 1.0384953023857268
intercept: 168371.1380929812
null_MSE: 6410615844.843487, null_RMSE: 80

'Full Data R2: 0.8892808391126586, MSE = 545560034.3673917, RMSE = 23357.226598365476'

<span style= 'color:blue'>**Under this linear model using Lasso regularization, where the target variable (Sale Price) <u>was</u> log-transformed, the training and testing R^2 still differed by roughly 6 percentage points: 90% and 84%, respectively. This gap may indicate some overfitting. The testing R^2 means that ~84% of the variability in log Sale Price is explained by the features in our model. With an RMSE of 23,357 (fit on the full data), this model predicts Sale Price far better than the null model (null_RMSE: 80,066).**</span>

<span style= 'color:blue'>**Some noteable model features included:</span>
- <span style= 'color:blue'>**_Kitchen Quality_**: Holding all other variables constant, having a kitchen in Fair/Good/Average quality vs. Excellent quality decreased the predicted Sale Price by up to 6%. Regularization decreased the impact of beta coefficients on our target variable.</span>
- <span style= 'color:blue'>**_Living Area Square-Footage (Above and Below Ground)_**: Holding all other variables constant, per every one unit increase in the above ground square-footage, Sale Price was predicted to increase by 13.5%. Also, per every square-footage increase in basement area, the predicted Sale Price was expected to increase by 6%, holding all else constant.</span>
- <span style= 'color:blue'>**_Year Remodeled_**: Holding all other variables constant, per every year increase in remodeling year (i.e., the more recently the property was remodeled), the Sale Price was predicted to increase by 4%.</span>
- <span style= 'color:blue'>**_Garage Quality vs. Finished Status_**: Holding all other variables constant, per every one unit increase in the garage quality (ordered from None, Poor, Fair, Average, Good, Excellent), the predicted Sale price was expected to increase by 5%. However, having a finished, rough finished, or unfinished garage, compared to no garage, appeared to be decrease the predicted Sale Price from 4-5.5%, holding all else equal. These results are unexpected and could potentially be due to multicollinearity taking place in the model between these features.</span>
- <span style= 'color:blue'>**_Basement Finished Status_**: Holding all other variables constant, having a finished basement predicted a 1.6% increase in Sale Price, compared to no basement. However, having an unfinished basement actually predicted a 2% decrease in Sale Price, holding all else equal.</span>
- <span style= 'color:blue'>**_Age_**: For every one year increase in age of property, the Sale Price was predicted to decrease by 13%, holding all other variables constant.</span>
- <span style= 'color:blue'>**_Neighborhood Rank_**: Holding all other variables constant, for every one-unit increase in neighborhood rank*, the Sale Price was predicted to increase by 4%.</span>

<span style= 'color:blue'>*Neighborhood rank = a variable created by scoring neighborhoods based on their rankings in median Sale Price (based on training data), average Overall Quality, average Exterior Quality, average functionality score, and average access/amenities score</span>
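
The percentage interpretations above follow from the log-transformed target: a coefficient `beta` on log(Sale Price) corresponds to a multiplicative change of `exp(beta)` in Sale Price per one-unit increase in the feature. Below is a minimal sketch of that conversion. The function name `pct_change_from_log_coef`, the feature names, and the coefficient values are hypothetical and are not the notebook's fitted values.

```python
import math


def pct_change_from_log_coef(beta, delta=1.0):
    """Percent change in the untransformed target for a `delta`-unit
    increase in a feature, given its coefficient in a log-target model."""
    return (math.exp(beta * delta) - 1.0) * 100.0


# Hypothetical coefficients for illustration only (not fitted values).
example_coefs = {"feature_a": 0.05, "feature_b": -0.02, "feature_c": 0.10}

for name, beta in example_coefs.items():
    print(f"{name}: {pct_change_from_log_coef(beta):+.2f}% per one-unit increase")
```

If the features were standardized before fitting (as the `Z_train` naming suggests), a "one-unit increase" would correspond to one standard deviation of the original feature rather than one square foot or one year.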

<span style= 'color: blue'> **The training R^2 (from Z_train) remains larger than the testing R^2 (from Z_test), and neither RidgeCV nor LassoCV regularization appears to address this overfitting. This has been the case across several iterations of changing the strength of the regularization (alpha), removing variables to make the model simpler, and increasing/decreasing the number of cross-validation folds. Each scenario resulted in the RMSE increasing and R^2 either remaining constant or decreasing slightly.**</span> 
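
The kind of check described above (varying alpha and the number of folds, then comparing training and testing R^2) can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual cell: the data, the alpha grid, and the fold count are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 5 features, log-scale target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y_log = 12.0 + X @ np.array([0.3, 0.2, 0.0, 0.0, 0.1]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y_log, random_state=42)

# Scale using statistics from the training split only.
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

# Candidate regularization strengths and fold count are illustrative choices.
lasso = LassoCV(alphas=np.logspace(-3, 0, 50), cv=5, max_iter=10_000)
lasso.fit(Z_train, y_train)

train_r2 = lasso.score(Z_train, y_train)
test_r2 = lasso.score(Z_test, y_test)
print(f"chosen alpha: {lasso.alpha_:.4f}")
print(f"train R^2: {train_r2:.3f} | test R^2: {test_r2:.3f} | gap: {train_r2 - test_r2:.3f}")
```

A gap that persists across different alpha grids and fold counts would suggest that the difference comes from something other than coefficient size, such as the train/test split itself or the feature set.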

<span style= 'color: blue'> **Our final model will be the Lasso-regularized model using log-transformed Sale Price. This model achieved nearly the same metrics as the non-Lasso version. However, because the model does appear to be overfit, the regularized version may perform better on unseen data.**</span> 

In [13]:
# SUBMISSION HISTORY

# 6/3 10P submission = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'age', 'garage_area', 'kitchen_qual_Fa', 'kitchen_qual_Gd', 'kitchen_qual_TA', 'was_remod', 'bsmt_cat_finished','bsmt_cat_unfinished', 'grg_qual_num', 'garage_cat_finished', 'garage_cat_unfinished', 'cond12_feeder_st', 'cond12_near_park', 'cond12_near_rr', 'cond12_norm', 'lotconfig_culdsac', 'lotconfig_inside', 'hi_bsmt_exposure', 'nbr_rank']
# 6/3 8:30P submission = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'garage_area', 'age', 'was_remod', 'bsmt_cat_finished','bsmt_cat_unfinished']
# 6/2 submission = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'garage_area']
# first submission = ['Overall Qual', 'Year Built', 'Year Remod/Add', 'BsmtFin SF 1', 'Total Bsmt SF', 'Gr Liv Area', 'Full Bath', 'Fireplaces', 'Garage Area']
# in-class submission = ['Overall Qual']

In [14]:
# Without Lasso regularization

# 'Full Data R2: 0.8891974819498465, MSE = 546453367.2372748, RMSE = 23376.342041416035'
# feats_updated = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'age', 'garage_area', 'kitchen_qual_Fa', 'kitchen_qual_Gd', 'kitchen_qual_TA', 'was_remod', 'bsmt_cat_finished','bsmt_cat_unfinished', 'grg_qual_num', 'garage_cat_finished', 'garage_cat_unfinished', 'cond12_feeder_st', 'cond12_near_park', 'cond12_near_rr', 'cond12_norm', 'lotconfig_culdsac', 'lotconfig_inside', 'hi_bsmt_exposure', 'nbr_rank']

In [15]:
# With Lasso regularization
# 'Full Data R2: 0.8891958486538678, MSE = 546493452.084716, RMSE = 23377.19940635995'
# feats_updated = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'age', 'garage_area', 'kitchen_qual_Fa', 'kitchen_qual_Gd', 'kitchen_qual_TA', 'was_remod', 'bsmt_cat_finished','bsmt_cat_unfinished', 'grg_qual_num', 'garage_cat_finished', 'garage_cat_unfinished', 'cond12_feeder_st',  'cond12_near_park', 'cond12_near_rr', 'cond12_norm', 'lotconfig_culdsac', 'lotconfig_inside', 'hi_bsmt_exposure', 'nbr_rank']

In [16]:
# Factors that ended up not being important:

# - Any condition indicators
# - Fireplace and Garage Quality
# - Pool (1/0)
# - Amenities Score

In [17]:
# Interpreting Log transformations in a linear model: https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/