# Modeling and Evaluation

**Introduction**

In this notebook we develop four models for predicting SalePrice:

1. **Model 1:** a multiple linear regression.
2. **Model 2:** a linear regression with second-degree polynomial features, to capture non-linear relationships and interactions between predictors.
3. **Model 3:** a lasso regression with a cross-validated alpha of ~210.
4. **Model 4:** a ridge regression with a cross-validated alpha of ~66.

Two feature sets have been chosen based on the correlation table in the EDA portion of Notebook One and refined through significant trial and error. The first feature set (variable name: features) is a short set used for the two linear regression models (Models 1 and 2). The second feature set (variable name: features_l) is a longer set used for the lasso and ridge models.

Baseline figures have been established for each feature set. Each model is evaluated and compared to baseline in an evaluation section at the end of its modeling process.

This notebook, and project, ends with a discussion about the findings from the modeling process and conclusions that are applicable to our clients (see Notebook 1 - Cleaning and EDA). We also give further steps that we would take to refine models and clean data in the future.

# Final Dataset Preparations and Setup

### Imports and Reading in Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from scipy import stats
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, LassoCV, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

In [2]:
#Reading in cleaned data files.
ctrain = pd.read_csv('./datasets/train_clean.csv')

ctest = pd.read_csv('./datasets/test_clean.csv')

In [3]:
ctrain.head()

Unnamed: 0,Id,PID,MS SubClass,lot_frontage,lot_area,Street,Alley,Lot Shape,Overall Qual,Overall Cond,...,Bldg Type_2fmCon,Bldg Type_Duplex,Bldg Type_Twnhs,Bldg Type_TwnhsE,Roof Style_Gable,Roof Style_Gambrel,Roof Style_Hip,Roof Style_Mansard,Roof Style_Shed,Central Air_Y
0,109,533352170,60,68.0,13517,Pave,,IR1,6,8,...,0,0,0,0,1,0,0,0,0,1
1,544,531379050,60,43.0,11492,Pave,,IR1,7,5,...,0,0,0,0,1,0,0,0,0,1
2,153,535304180,20,68.0,7922,Pave,,Reg,5,7,...,0,0,0,0,1,0,0,0,0,1
3,318,916386060,60,73.0,9802,Pave,,Reg,5,5,...,0,0,0,0,1,0,0,0,0,1
4,255,906425045,50,82.0,14235,Pave,,IR1,6,8,...,0,0,0,0,1,0,0,0,0,1


In [4]:
ctest.head()

Unnamed: 0,Id,PID,MS SubClass,lot_frontage,lot_area,Street,Alley,Lot Shape,Overall Qual,Overall Cond,...,Bldg Type_2fmCon,Bldg Type_Duplex,Bldg Type_Twnhs,Bldg Type_TwnhsE,Roof Style_Gable,Roof Style_Gambrel,Roof Style_Hip,Roof Style_Mansard,Roof Style_Shed,Central Air_Y
0,2658,902301120,190,69.0,9142,Pave,Grvl,Reg,6,8,...,1,0,0,0,1,0,0,0,0,0
1,2718,905108090,90,68.0,9662,Pave,,IR1,5,4,...,0,1,0,0,1,0,0,0,0,1
2,2414,528218130,60,58.0,17104,Pave,,IR1,7,5,...,0,0,0,0,1,0,0,0,0,1
3,1989,902207150,30,60.0,8520,Pave,,Reg,5,6,...,0,0,0,0,1,0,0,0,0,1
4,625,535105100,20,68.0,9500,Pave,,IR1,6,5,...,0,0,0,0,1,0,0,0,0,1


### Dropping incongruent columns

In [6]:
ctest.shape

(878, 114)

In [7]:
## Check for columns that are in train but missing from test. Help from Alanna and a chat with Andy, as well as https://stackoverflow.com/questions/46335121/add-missing-columns-to-the-dataframe
missing = set(ctrain.columns) - set(ctest.columns)
missing

{'MS Zoning_C (all)',
 'Neighborhood_GrnHill',
 'Neighborhood_Landmrk',
 'SalePrice'}

In [8]:
#Dropping columns that are not in test from train.
ctrain.drop(['MS Zoning_C (all)', 'Neighborhood_GrnHill', 'Neighborhood_Landmrk'], axis=1, inplace=True)

ctrain.shape
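
The same check can be run in both directions and applied without hard-coding column names. The sketch below uses two tiny hypothetical frames (toy_train and toy_test are illustrative stand-ins, not the real data) and assumes the target column is named SalePrice, as it is here.

```python
import pandas as pd

# Hypothetical toy frames standing in for ctrain and ctest.
toy_train = pd.DataFrame({'Id': [1, 2], 'lot_area': [9000, 8000],
                          'Neighborhood_GrnHill': [0, 1],
                          'SalePrice': [200000, 150000]})
toy_test = pd.DataFrame({'Id': [3, 4], 'lot_area': [7000, 9500]})

target = 'SalePrice'

# Columns present in train but absent from test (ignoring the target).
train_only = sorted(set(toy_train.columns) - set(toy_test.columns) - {target})
# Columns present in test but absent from train.
test_only = sorted(set(toy_test.columns) - set(toy_train.columns))

# Drop the train-only columns so both frames share the same predictors.
aligned_train = toy_train.drop(columns=train_only)

print(train_only)
print(test_only)
print(list(aligned_train.columns))
```

Checking both directions matters because a dummy column that exists only in test would break `ctest[features]` just as surely as one that exists only in train.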

# Model 1: Multiple Linear Regression Model

### Establishing features, variables and train/test

In [10]:
# Features list
features = ['Overall Qual',
            'gr_liv_area', 
            'Exter Qual',
            'Kitchen Qual',
            'Total Bsmt SF', 
            'Garage Area', 
            'Fireplace Qu',
           'Mas Vnr Area',
           'Heating QC',
           'Neighborhood_NridgHt',
           'Central Air_Y',
           'Neighborhood_NoRidge',
           'Neighborhood_StoneBr',
           'MS Zoning_RL']

In [11]:
#List variables
X = ctrain[features]
y = ctrain['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

### Baseline for Lin Reg

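As expected, we see an R2 close to 0, because predictions based solely on a mean cannot explain any variance in the data. The R2 is slightly negative rather than exactly 0 because the constant prediction is the mean of y_test, while it is scored against every row of y.

The baseline RMSE of about 78,297 is a useful benchmark against which we can compare the RMSE of the two linear regression models.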

In [12]:
### Mean of y_test
y_base = y_test.mean()

### Repeating that mean once for every row in y
y_base_preds = [y_base for row in y]

In [13]:
print(y_base)

177767.47337278107


In [14]:
# Establishing r2 and RMSE for y_test mean v. y actual values
print(f'Baseline R2: {r2_score(y, y_base_preds)}')
print(f'Baseline Rmse: {mean_squared_error(y, y_base_preds, squared = False)}')

Baseline R2: -0.0013317025984138642
Baseline Rmse: 78296.82887296818
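
A more conventional way to build this kind of baseline is to take the mean from the training split and score it on the held-out split only. The sketch below shows that variant on synthetic data (X_demo and y_demo are made-up stand-ins, not the Ames data) using scikit-learn's DummyRegressor; it is an alternative for comparison, not the calculation used above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the Ames data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 180_000 + 50_000 * X_demo[:, 0] + rng.normal(scale=10_000, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The dummy model always predicts the mean of the training target.
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_tr, y_tr)
base_preds = baseline.predict(X_te)

base_r2 = r2_score(y_te, base_preds)
# Taking the square root by hand avoids depending on the `squared` argument.
base_rmse = np.sqrt(mean_squared_error(y_te, base_preds))

print(f'Baseline R2: {base_r2:.4f}')
print(f'Baseline RMSE: {base_rmse:.2f}')
```

With this setup the baseline R2 on the test split can never be above 0, since any constant other than the test mean adds squared error.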


#### Lin Reg Model

In [15]:
#Instantiate Model
ln = LinearRegression()

#Model fit
ln.fit(X_train,y_train)

#Make preds
ln_preds_train = ln.predict(X_train)
ln_preds_test = ln.predict(X_test)

#### Lin Reg R2

In [16]:
print(f'Training R2: {ln.score(X_train, y_train)}')
print(f'Testing R2: {ln.score(X_test, y_test)}')
print(f'Cross Val Score: {cross_val_score(ln, X_train, y_train, cv=5).mean()}')

Training R2: 0.8686063194638227
Testing R2: 0.8514119728672772
Cross Val Score: 0.8642912547347805


#### Linear Reg RMSE

In [17]:
# squared = False returns the root mean squared error (RMSE), not the MSE
print(f'Training RMSE: {mean_squared_error(y_train, ln_preds_train, squared = False)}')
print(f'Testing RMSE: {mean_squared_error(y_test, ln_preds_test, squared = False)}')

Training RMSE: 28388.743129034465
Testing RMSE: 30050.019938988746


#### Lin Reg for Kaggle

In [18]:
#Defining x
# X_ctest = ctest[features]

#Predicting model on test
# ln_preds_ctest = ln.predict(X_ctest)

#Fitting ID and Predicted SalePrice
# kgl_one = pd.DataFrame(ctest['Id'])
# kgl_one['SalePrice'] = ln_preds_ctest

#Exporting
# kgl_one.to_csv('./datasets/kgl_ln_one.csv', index=False)

## Evaluating the Multiple Lin Reg Model

In [19]:
coef_ln = (pd.DataFrame({'Features': features, 
                        'Coefficients' : ln.coef_
                        })).sort_values(by = 'Coefficients', ascending = False)

#Hide index function from https://stackoverflow.com/questions/21256013/pandas-dataframe-hide-index-functionality
coef_ln.style.hide_index()

Features,Coefficients
Neighborhood_StoneBr,44457.227407
Neighborhood_NridgHt,28379.369655
Neighborhood_NoRidge,23324.774226
Exter Qual,14566.07502
Kitchen Qual,13308.101193
Overall Qual,10313.786241
MS Zoning_RL,8852.892174
Central Air_Y,3019.582074
Heating QC,2918.766749
Fireplace Qu,2704.33152


The multiple linear regression model uses several independent variables to build a linear model, that is, it assumes a linear relationship between each predictor and the variable of interest. The model is fit by ordinary least squares: it chooses the intercept and coefficients that minimize the sum of squared residuals, where a residual is the vertical distance between an observed sale price and the price the model predicts.
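
To make the least-squares idea concrete, the sketch below solves the same minimization directly with NumPy and compares the result with LinearRegression. The data is synthetic and the names X_demo, y_demo and X_design are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 5.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Add a column of ones so the first solved parameter is the intercept.
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# lstsq returns the parameters that minimize the sum of squared residuals.
params, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

ols = LinearRegression().fit(X_demo, y_demo)

print(params)
print(ols.intercept_, ols.coef_)
```

Both routes should agree to floating-point precision, which is a quick way to confirm what the model is optimizing.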

Overall the multiple linear regression model performed significantly better than baseline, with a testing RMSE of 30,050 against a baseline RMSE of about 78,297. Furthermore, the training and testing R2 values were very close to each other (.869 and .851 respectively), which does not raise concerns about how the model generalizes to new data.

The 10 features with the largest coefficients in this model are:
- Neighborhood_StoneBr	44457.227407
- Neighborhood_NridgHt	28379.369655
- Neighborhood_NoRidge	23324.774226
- Exter Qual	14566.075020
- Kitchen Qual	13308.101193
- Overall Qual	10313.786241
- MS Zoning_RL	8852.892174
- Central Air_Y	3019.582074
- Heating QC	2918.766749
- Fireplace Qu	2704.331520

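So the linear regression model gives a lot of weight to the location of the house, especially in comparison to the later models. Note that these features are unscaled, so each coefficient is the change in predicted price per one unit of that feature; a neighborhood dummy (0 or 1) and a square-footage feature are therefore not directly comparable by coefficient size alone.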
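
How much a coefficient's size depends on the feature's units can be seen in a small sketch. The data here is synthetic, and the names sqft and price are hypothetical; the same relationship is fit twice, once per square foot and once per hundred square feet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic house sizes and prices (assumption: not the Ames data).
rng = np.random.default_rng(1)
sqft = rng.uniform(800, 3000, size=150)
price = 50_000 + 90 * sqft + rng.normal(scale=5_000, size=150)

# Same data, two different units for the single predictor.
per_sqft = LinearRegression().fit(sqft.reshape(-1, 1), price)
per_100_sqft = LinearRegression().fit((sqft / 100).reshape(-1, 1), price)

print(per_sqft.coef_[0], per_100_sqft.coef_[0])
```

The second coefficient is 100 times the first even though the two models make identical predictions, which is why coefficient rankings are easier to interpret once features are standardized, as in Models 3 and 4.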

# Model 2: Polynomial Reg Model

### Create Polynomial Features

In [20]:
#Instantiate PF and fit it to the X_train generated above
poly = PolynomialFeatures(include_bias=False)
poly.fit(X_train)

PolynomialFeatures(include_bias=False)

In [21]:
#Transform X_train and X_test
poly_X_train = poly.transform(X_train)
poly_X_test = poly.transform(X_test)

In [22]:
#Create dataframe
pd.DataFrame(poly_X_train, columns = poly.get_feature_names(X.columns))

Unnamed: 0,Overall Qual,gr_liv_area,Exter Qual,Kitchen Qual,Total Bsmt SF,Garage Area,Fireplace Qu,Mas Vnr Area,Heating QC,Neighborhood_NridgHt,...,Central Air_Y^2,Central Air_Y Neighborhood_NoRidge,Central Air_Y Neighborhood_StoneBr,Central Air_Y MS Zoning_RL,Neighborhood_NoRidge^2,Neighborhood_NoRidge Neighborhood_StoneBr,Neighborhood_NoRidge MS Zoning_RL,Neighborhood_StoneBr^2,Neighborhood_StoneBr MS Zoning_RL,MS Zoning_RL^2
0,6.0,1686.0,3.0,3.0,1686.0,612.0,3.0,157.0,3.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,5.0,1630.0,3.0,3.0,1073.0,649.0,3.0,0.0,3.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,7.0,2312.0,4.0,4.0,1177.0,658.0,3.0,210.0,5.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
3,8.0,2232.0,4.0,4.0,1173.0,623.0,3.0,372.0,5.0,0.0,...,1.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
4,5.0,835.0,3.0,3.0,458.0,366.0,0.0,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1514,5.0,616.0,3.0,3.0,616.0,205.0,0.0,0.0,4.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1515,5.0,912.0,3.0,3.0,912.0,288.0,0.0,0.0,3.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1516,5.0,1373.0,3.0,3.0,1319.0,591.0,3.0,0.0,3.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1517,5.0,984.0,3.0,3.0,984.0,310.0,0.0,0.0,3.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


### Poly Reg Model

In [23]:
#Instantiate Lin Reg Model
p_ln = LinearRegression()
p_ln.fit(poly_X_train, y_train)

#Make preds
pln_preds_train = p_ln.predict(poly_X_train)
pln_preds_test = p_ln.predict(poly_X_test)

#### Poly Reg R2

In [24]:
print(f'Training R2: {p_ln.score(poly_X_train, y_train)}')
print(f'Testing R2: {p_ln.score(poly_X_test, y_test)}')
print(f'Cross Val Score: {cross_val_score(p_ln, poly_X_train, y_train, cv=5).mean()}')

Training R2: 0.9150417391021959
Testing R2: 0.8843918346340929
Cross Val Score: 0.884797233141884


#### Poly Reg RMSE

In [25]:
# squared = False returns the root mean squared error (RMSE), not the MSE
print(f'Training RMSE: {mean_squared_error(y_train, pln_preds_train, squared = False)}')
print(f'Testing RMSE: {mean_squared_error(y_test, pln_preds_test, squared = False)}')

Training RMSE: 22827.677745188816
Testing RMSE: 26506.17804896374


#### Poly Reg for Kaggle

In [26]:
#Defining x
# poly_X_ctest = poly.transform(ctest[features])

#Predicting model on test
# p_ln_preds_ctest = p_ln.predict(poly_X_ctest)

#Fitting ID and Predicted SalePrice
# kgl_p_ln = pd.DataFrame(ctest['Id'])
# kgl_p_ln['SalePrice'] = p_ln_preds_ctest

#Exporting
# kgl_p_ln.to_csv('./datasets/kgl_p_ln.csv', index=False)

## Evaluating the Poly Reg Model

In [53]:
coef_p_ln = (pd.DataFrame({'Features': poly.get_feature_names(X.columns), 
                           'Coefficients' : p_ln.coef_
                          })).sort_values(by = 'Coefficients', ascending = False)

coef_p_ln.style.hide_index();

The polynomial regression model, like the multiple linear regression described above, fits a function of several predictors by minimizing the sum of squared residuals. In this case, however, PolynomialFeatures first expands the original feature set to the nth degree (here the 2nd), adding the square of each feature to allow for non-linear relationships in the data. It also generates interaction terms (the product of each pair of features) that capture how the effect of one predictor depends on the value of another.
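
A minimal sketch of what the expansion produces, using a made-up two-feature row (a = 2, b = 3) rather than the real data. It also counts how many columns a 14-feature input, the length of the features list, becomes at degree 2.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One illustrative row with two features: a = 2, b = 3.
row = np.array([[2.0, 3.0]])

poly_demo = PolynomialFeatures(degree=2, include_bias=False)
expanded = poly_demo.fit_transform(row)

# Column order: a, b, a^2, a*b, b^2
print(expanded)

# Number of output columns for 14 input features:
# 14 originals + 14 squares + 91 pairwise interactions.
n = 14
wide = PolynomialFeatures(degree=2, include_bias=False).fit(np.zeros((1, n)))
print(wide.n_output_features_)
```

The jump from 14 to 119 columns is the main reason this model has more room to overfit than Model 1.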

Overall the polynomial regression performed significantly better than baseline, with a testing RMSE of 26,506. The training and testing R2 values (.915 and .884 respectively) are about 3 percentage points apart, which indicates some overfitting; this can probably be attributed to the variance introduced by the numerous polynomial features.

The model outperforms the linear regression model handily, improving the testing RMSE by about 3,500. However, it has a larger spread between the training and testing R2 values than Model 1, indicating it is potentially more overfit and less generalizable.

The coefficients in the model indicate several areas where some collinearity might be creeping in, which gives us the opportunity to develop a different feature list.

# Model 3: Lasso (Production Model)

### New Features list and Train Test

In [28]:
features_l = ['Overall Qual',
            'gr_liv_area',
               'total_sqft',
            'Exter Qual',
            'Kitchen Qual',
            'Total Bsmt SF', 
            'Garage Area', 
               'Garage Cars',
               'Total Baths',
               'Bsmt Qual',
            'Fireplace Qu',
               'TotRms AbvGrd',
           'Mas Vnr Area',
               'Fireplaces',
           'Heating QC',
           'Neighborhood_NridgHt',
               'BasmtFin Sqft',
               'Deck and Porch Sqft',
               'lot_area',
               'lot_frontage',
               'Garage Qual',
           'Central Air_Y',
               'Garage Cond',
           'Neighborhood_NoRidge',
               'Roof Style_Hip',
               'Garage Yr Blt',
           'Neighborhood_StoneBr',
           'MS Zoning_RL',
               'Bsmt Cond',
               'Land Contour_HLS',
           'House Style_2Story'
             ]

In [29]:
X2 = ctrain[features_l]
y2 = ctrain['SalePrice']

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state = 42)

### Setting Up Standard Scaler transformations

In [30]:
## First we have to fit standard scaler on the X_train
ss = StandardScaler()
ss.fit(X2_train)

##Now we use it to transform train and test X values
ss_X_train = ss.transform(X2_train)
ss_X_test = ss.transform(X2_test)

ss_X_test.shape

(507, 31)
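
One design point worth noting: the cross-validation scores reported for the lasso and ridge models are computed on data that was already scaled using the whole training split, so each validation fold has had a small influence on the scaler. A possible alternative, sketched here on synthetic data with hypothetical names and an arbitrary alpha, wraps the scaler and the model in a pipeline so that scaling is refit inside each fold.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption: not the Ames data).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100, scale=[1, 10, 1000], size=(300, 3))
y_demo = (3 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + 0.01 * X_demo[:, 2]
          + rng.normal(size=300))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The alpha here is arbitrary and only for illustration.
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

# The scaler is refit on the training part of every fold.
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)

# Fit once on the full training split, then score on the held-out split.
pipe.fit(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

print(cv_scores.mean(), test_r2)
```

On a dataset of this size the difference from pre-scaling is usually small, but the pipeline also keeps the scaler and model together for later prediction on new data.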

### Establishing Baseline for Lasso and Ridge Models

In [51]:
### Mean of y2_test
y_base_lasr = y2_test.mean()

### Repeating that mean once for every row in y2
y_base_preds_lasr = [y_base_lasr for row in y2]

In [32]:
# Establishing r2 and RMSE for y_test mean v. y actual values
print(f'Baseline R2: {r2_score(y2, y_base_preds_lasr)}')
print(f'Baseline Rmse: {mean_squared_error(y2, y_base_preds_lasr, squared = False)}')

Baseline R2: -0.0013317025984138642
Baseline Rmse: 78296.82887296818


In [52]:
print(y_base_lasr)

177767.47337278107


### Finding optimum lasso alpha

In [33]:
## Credit to lessons 4.02 for the code, modified to fit needs
l_alphas = np.logspace(0, 5, 100)
lasso_cv_test = LassoCV(alphas=l_alphas, cv=5)

#Fitting lasso CV model onto data to find best lasso alpha
lasso_cv_test.fit(ss_X_train, y2_train);

#Optimum lasso alpha
lasso_cv_test.alpha_

210.49041445120199
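
The reported alpha is not a free-floating optimum; it is the best of the 100 candidates in the grid. The short sketch below rebuilds the same grid and locates the reported value in it (the variable names grid, lasso_alpha and lasso_idx are illustrative).

```python
import numpy as np

# Same grid as above: 100 values spaced evenly on a log scale from 1 to 100,000.
grid = np.logspace(0, 5, 100)
print(grid[0], grid[-1])

# Alpha reported by LassoCV in this notebook.
lasso_alpha = 210.49041445120199

# Position of the closest grid value.
lasso_idx = int(np.argmin(np.abs(grid - lasso_alpha)))
print(lasso_idx, grid[lasso_idx])

# Ratio between neighbouring candidates (about 1.12, i.e. roughly 12% steps).
print(grid[1] / grid[0])
```

Because neighbouring candidates differ by roughly 12%, a finer grid around the selected value could refine alpha slightly, though the effect on the scores is likely to be small.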

### Modeling with Lasso and optimum Lasso value

In [34]:
#Instantiate Lasso model
lasso = Lasso(alpha=lasso_cv_test.alpha_)
lasso.fit(ss_X_train, y2_train)

## Generating train and test preds
lasso_preds_train = lasso.predict(ss_X_train)
lasso_preds_test = lasso.predict(ss_X_test)

### Lasso R2

In [35]:
print(f'Training R2: {lasso.score(ss_X_train, y2_train)}')
print(f'Testing R2: {lasso.score(ss_X_test, y2_test)}')
print(f'Cross Val Score: {cross_val_score(lasso, ss_X_train, y2_train, cv=5).mean()}')

Training R2: 0.8951916356251562
Testing R2: 0.8838039798668246
Cross Val Score: 0.8894650793174737


### Lasso RMSE

In [36]:
# squared = False returns the root mean squared error (RMSE), not the MSE
print(f'Training RMSE: {mean_squared_error(y2_train, lasso_preds_train, squared = False)}')
print(f'Testing RMSE: {mean_squared_error(y2_test, lasso_preds_test, squared = False)}')

Training RMSE: 25354.607852154724
Testing RMSE: 26573.483093412233


### Lasso for Kaggle

In [37]:
# #Defining x
# X_ctest2 = ctest[features_l]

# #Scaling the ctest data
# ss_X_ctest2 = ss.transform(X_ctest2)

# #Predicting model on test
# las_preds_ctest = lasso.predict(ss_X_ctest2)

# #Fitting ID and Predicted SalePrice
# kgl_las = pd.DataFrame(ctest['Id'])
# kgl_las['SalePrice'] = las_preds_ctest

# #Exporting
# kgl_las.to_csv('./datasets/kgl_las.csv', index=False)

## Evaluating the Lasso Model

In [38]:
coef_lasso = (pd.DataFrame({'Features': features_l,
                            'Coefficients' : lasso.coef_
                           })).sort_values(by = 'Coefficients', ascending = False)
coef_lasso.style.hide_index()

Features,Coefficients
gr_liv_area,18849.272639
Overall Qual,15229.251278
Exter Qual,9100.97927
BasmtFin Sqft,8551.729808
Kitchen Qual,7277.844457
Neighborhood_NridgHt,7050.274077
lot_area,6792.809039
Total Bsmt SF,6682.241935
Garage Area,6285.23939
Neighborhood_StoneBr,5458.968152


Lasso regression uses shrinkage regularization to prevent the overfitting that can occur in regression analysis. It adds a penalty to the loss function equal to alpha times the sum of the absolute values of the coefficients. This shrinks the coefficients of less relevant features and helps with multicollinearity by shrinking the coefficient on one of the collinear features, sometimes all the way to zero. [Source.](https://dataaspirant.com/lasso-regression/)
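
The shrink-to-zero behaviour can be seen in a small sketch on synthetic data, where only the first of five features actually drives the target. The alpha of 1.0 is chosen purely for illustration and is unrelated to the alpha found above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption: not the Ames data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first column actually drives the target.
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# OLS gives every feature a small non-zero coefficient.
print(np.round(ols.coef_, 3))
# Lasso shrinks the relevant coefficient and zeroes out the rest.
print(np.round(lasso_demo.coef_, 3))
```

This is the sense in which lasso performs some feature selection automatically: irrelevant features drop out of the model entirely rather than keeping small noisy weights.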

Overall the lasso model performed significantly better than baseline, with a testing RMSE of 26,573. Furthermore, the training and testing R2 values were very close to each other (.895 and .884 respectively), which does not raise concerns about how the model generalizes to new data.

The model is roughly equivalent to Model 2 in terms of testing RMSE (M2 RMSE: 26,506). However, it has a smaller spread between the training and testing R2 values (about 1 percentage point, against about 3 for Model 2), indicating it is potentially less overfit and more generalizable.

The 10 most important features indicated by this model are: 
- gr_liv_area	18849.272639
- Overall Qual	15229.251278
- Exter Qual	9100.979270
- BasmtFin Sqft	8551.729808
- Kitchen Qual	7277.844457
- Neighborhood_NridgHt	7050.274077
- lot_area	6792.809039
- Total Bsmt SF	6682.241935
- Garage Area	6285.239390
- Neighborhood_StoneBr	5458.968152

# Model 4: Ridge

### Finding optimum alpha using RidgeCV

In [39]:
## Credit to lessons 4.02 for the code, modified to fit needs
r_alphas = np.logspace(0, 5, 100)
ridge_cv_test = RidgeCV(alphas=r_alphas, scoring='r2', cv=5)

In [40]:
#Fitting ridge test model onto data to find best ridge
ridge_cv_test.fit(ss_X_train, y2_train);

In [41]:
#Optimum ridge alpha
ridge_cv_test.alpha_

65.79332246575679

### Modeling with Ridge given optimum alpha

In [42]:
#Instantiate actual ridge with optimum alpha
ridge = Ridge(alpha = ridge_cv_test.alpha_)

In [43]:
#Fitting to the data
ridge.fit(ss_X_train, y2_train)

Ridge(alpha=65.79332246575679)

In [44]:
## Generating train and test preds
ridge_preds_train = ridge.predict(ss_X_train)
ridge_preds_test = ridge.predict(ss_X_test)

### Ridge R2

In [45]:
print(f'Training R2: {ridge.score(ss_X_train, y2_train)}')

print(f'Testing R2: {ridge.score(ss_X_test, y2_test)}')

print(f'Cross Val Score: {cross_val_score(ridge, ss_X_train, y2_train, cv=5).mean()}')

Training R2: 0.8951782484656949
Testing R2: 0.8832310448216906
Cross Val Score: 0.8892770010034143


### Ridge RMSE

In [46]:
# squared = False returns the root mean squared error (RMSE), not the MSE
print(f'Training RMSE: {mean_squared_error(y2_train, ridge_preds_train, squared = False)}')

print(f'Testing RMSE: {mean_squared_error(y2_test, ridge_preds_test, squared = False)}')

Training RMSE: 25356.227070918405
Testing RMSE: 26638.916308235297


### Ridge for Kaggle

In [47]:
# #Defining x
# X_ctest3 = ctest[features_l]

# #Scaling the ctest data
# ss_X_ctest3 = ss.transform(X_ctest3)

# #Predicting model on test
# rid_preds_ctest = ridge.predict(ss_X_ctest3)

# #Fitting ID and Predicted SalePrice
# kgl_rid = pd.DataFrame(ctest['Id'])
# kgl_rid['SalePrice'] = rid_preds_ctest

# #Exporting
# kgl_rid.to_csv('./datasets/kgl_rid.csv', index=False)

## Evaluating the Ridge Model

In [54]:
coef_ridge = (pd.DataFrame({'Features': features_l,
                            'Coefficients' : ridge.coef_
                           })).sort_values(by = 'Coefficients', ascending = False)

coef_ridge.style.hide_index();

Ridge regression uses its own regularization method (different from lasso) to prevent the overfitting that can occur in regression analysis. It is essentially a modification of OLS that adds a penalty to the loss function equal to alpha times the sum of the squared coefficients. Unlike lasso, this shrinks coefficients toward zero, generally without setting any of them exactly to zero. Like lasso, it can help to reduce overfitting if the alpha (lambda) term is well chosen. [Source.](https://dataaspirant.com/ridge-regression/)
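
Because the ridge penalty is quadratic, the solution has a closed form. The sketch below computes it by hand on synthetic data and checks it against scikit-learn's Ridge; the intercept is switched off to keep the algebra short, and the alpha of 10 is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption: not the Ames data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([4.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)
alpha = 10.0

# Closed form: solve (X'X + alpha * I) w = X'y. No intercept, for simplicity.
identity = np.eye(X_demo.shape[1])
w_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * identity, X_demo.T @ y_demo)

ridge_demo = Ridge(alpha=alpha, fit_intercept=False, solver='cholesky').fit(X_demo, y_demo)

# Plain least squares for comparison (alpha = 0).
ols_w, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(w_closed)
print(ridge_demo.coef_)
print(np.linalg.norm(w_closed), np.linalg.norm(ols_w))
```

The ridge coefficient vector is always shorter than the least-squares one, which is the shrinkage described above.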

Overall the ridge model performed significantly better than baseline, with a testing RMSE of 26,639. Furthermore, the training and testing R2 values were very close to each other (.895 and .883 respectively), which does not raise concerns about how the model generalizes to new data.

Its performance is almost exactly the same as the lasso model's, with an almost negligible difference in testing RMSE (ridge is about 65 higher) and an almost identical R2 spread. This means it outperforms the multiple linear regression model and slightly underperforms the linear regression model with polynomial features, but is potentially less overfit than the latter.

The 10 most important features indicated by this model are: 
- Overall Qual	13854.222537
- Exter Qual	9092.907869
- gr_liv_area	9016.467822
- total_sqft	9016.467822
- BasmtFin Sqft	8014.327998
- Kitchen Qual	7449.568864
- Neighborhood_NridgHt	7021.861669
- Total Bsmt SF	6691.077089
- lot_area	6513.181417
- Garage Area	5689.112358
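
In the list above, gr_liv_area and total_sqft have exactly the same coefficient. One possible explanation, which has not been verified against the data here, is that the two columns hold identical values, since ridge splits weight evenly between exact duplicates. The sketch below shows that behaviour on synthetic data with a deliberately duplicated column.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (assumption: not the Ames data).
rng = np.random.default_rng(0)
area = rng.normal(size=200)
other = rng.normal(size=200)

# Columns 0 and 1 are exact copies of each other.
X_demo = np.column_stack([area, area, other])
y_demo = 10.0 * area + 2.0 * other + rng.normal(scale=0.5, size=200)

ridge_demo = Ridge(alpha=1.0).fit(X_demo, y_demo)

# The two duplicated columns share the effect equally.
print(ridge_demo.coef_)
```

If the two Ames columns do turn out to be duplicates, dropping one of them would leave the predictions essentially unchanged and make the coefficient table easier to read.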

# Conclusion and Next Steps

Over the course of this notebook I have developed four models, as well as a baseline, to model the data that was cleaned and explored in Notebook 1 (Notebook 2 was used for cleaning the test data). Each model is evaluated at the end of its own section.

Overall, the production model I chose was the lasso model, given its good RMSE, solid R2 scores, limited overfitting, significant improvement over baseline, and ability to control for multicollinearity and do some of the feature selection automatically. Results are below. They generate several interesting recommendations that will be presented to our target audience in the presentation with accompanying visualizations.
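
For a side-by-side view, the reported metrics can be collected into one table. The sketch below copies the rounded figures from the outputs in this notebook and derives the R2 spread and the RMSE improvement over baseline; the column names are illustrative, and the baseline RMSE is the one computed earlier against all of y.

```python
import pandas as pd

# Metrics copied from the outputs reported in this notebook (rounded).
results = pd.DataFrame({
    'model': ['Linear', 'Polynomial', 'Lasso', 'Ridge'],
    'train_r2': [0.8686, 0.9150, 0.8952, 0.8952],
    'test_r2': [0.8514, 0.8844, 0.8838, 0.8832],
    'test_rmse': [30050.02, 26506.18, 26573.48, 26638.92],
}).set_index('model')

baseline_rmse = 78296.83

# Gap between training and testing R2, a rough indicator of overfitting.
results['r2_spread'] = results['train_r2'] - results['test_r2']
# How far each model's testing RMSE sits below the baseline.
results['rmse_vs_baseline'] = baseline_rmse - results['test_rmse']

print(results.round(4))
print(results['test_rmse'].idxmin())
print(results['r2_spread'].idxmin())
```

The polynomial model has the lowest testing RMSE by a small margin, while the lasso model has the smallest R2 spread.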

The 10 most important features indicated by this model are: 
- gr_liv_area	18849.272639
- Overall Qual	15229.251278
- Exter Qual	9100.979270
- BasmtFin Sqft	8551.729808
- Kitchen Qual	7277.844457
- Neighborhood_NridgHt	7050.274077
- lot_area	6792.809039
- Total Bsmt SF	6682.241935
- Garage Area	6285.239390
- Neighborhood_StoneBr	5458.968152

Given that we are making recommendations to local developers in Ames, Iowa we want to give specific and actionable recommendations and not a list of coefficients! As such, I've broken down the interesting coefficients into four groups.

1. **House Characteristics** - Features innate to the house like square footage, kitchen size and quality, fireplaces, storeys etc.
2. **House Add-ons** - Features not directly related to the main living area like decks, basement size, garage space etc.
3. **Plot Characteristics** - Features related to the land on which the house sits like land size, frontage onto the street and gradient of land.
4. **Location** - Where the house sits within the greater city and area - specifically neighborhoods of interest.

### Recommendations
Given our analysis we would recommend that developers do the following for each category:

1. **House Characteristics:** **Total square footage** above ground level as well as **internal and external quality** are key areas of focus. Furthermore, aim to design a **high quality kitchen.**

2. **Add-Ons:** Include a **large finished basement**, plenty of **garage area**, and a **large basement** in general to attract a higher sale price.

3. **Plot Characteristics:** **Total lot area**, or the size of the plot of land as a whole, is of importance to consumers, even more than the size of the garage and total basement area.

4. **Location:** The **Northridge Heights** and **Stone Brook** neighborhoods tend to correlate with higher sale prices. Developing an understanding of why this is the case is important.

Although these recommendations are directly relevant to our primary stakeholders (local developers), we think they are also valuable for secondary stakeholders such as individuals who wish to flip houses, appraisers, tax assessors etc.

### Next Steps

Given additional time and data, there are a few areas I would like to analyze more deeply:
1. **Locational Analysis:** With additional data we’d like to explore why some areas in Ames have a larger effect on sale price than others (correlation or causation?) and use this insight to identify up-and-coming areas in the Greater Ames area.

2. **Details:** With the current dataset we’d like to better understand how different house characteristics and types of features interact, and model these interactions to better understand customer choices.
