# Executive Summary

ABC network intends to launch a new docu-series on property flipping called "house hunters". The show's target audience is Americans who are passionate about transforming homes and reselling them at a profit. Its unique selling point is a data-driven approach to property flipping. The first season of the series will feature houses in Ames, Iowa. I have been hired as a data scientist to identify the features that are the best predictors of sale prices.

By analysing the housing dataset from the Ames, Iowa Assessor's Office, which contains individual property sales from 2006 to 2010, I have created a predictive model using linear regression with ridge regularization. A baseline model that predicts the average sale price of the entire dataset was used as the benchmark to beat when building the model. To verify that our model works well, we have been provided with a test sample, with features similar to those of the training dataset, to evaluate on Kaggle.

This analysis is split into two parts. The first part covers exploratory data analysis and the second covers model construction and regularization. *This notebook covers the second part, on model construction and regularization.*

The model includes a total of 14 features: 
- 4 continuous
- 3 discrete
- 3 nominal
- 2 ordinal
- 2 polynomial

## Problem Statement

The primary stakeholders of this analysis are the team in charge of "house hunters" at ABC, and the secondary stakeholders are American TV viewers who are keen on property flipping. Hence the analysis aims to distill the dataset into digestible information and reduce the features to a handful which most strongly predict sale prices, while uncovering interesting relationships between features and sale prices that would help in developing strategies for property flipping.


# Additional Comments

This notebook contains the final models which were tested after multiple model iterations using different features and hyperparameter tuning.

# Contents

- [Cleaning the test Dataset](#cleaning)
- [Baseline model](#baseline)
- [Train/test split](#tts)
- [Linear Regression modeling](#lr)
- [Ridge regularization](#ridge)
- [Lasso regularization](#lasso)
- [Elasticnet regularization](#elastic)
- [Model comparison](#compare)
- [Kaggle Evaluation](#kaggle)
- [Results and Conclusions](#results)

In [None]:
# import relevant libraries
import numpy as np
import pandas as pd
from numpy import linspace

from sklearn.linear_model import (Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV)
from sklearn.model_selection import (train_test_split, cross_val_score, GridSearchCV)
from sklearn.preprocessing import (StandardScaler)

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#import train dataset
df = pd.read_csv('./datasets/train_model_features.csv')

#import test dataset
df_test = pd.read_csv('./datasets/test.csv')
df_test_raw = df_test


In [None]:
df.head()

In [None]:
df_test.head()

<a id='cleaning'></a>
# Cleaning the Test dataset


## Data dictionary

| Feature                  | Datatype | Description                                                                              |
|--------------------------|----------|------------------------------------------------------------------------------------------|
| gr_liv_area              | int64    | Above grade (ground) living area square feet                                             |
| total_bsmt_sf            | int64    | Total square feet of basement area                                                       |
| garage_area              | int64    | Size of garage in square feet                                                            |
| mas_vnr_area             | float64  | Masonry veneer area in square feet                                                       |
| overall_qual             | int64    | Rating for overall material and finish of the house                                      |
| fireplace_qu             | int64    | Fireplace quality                                                                        |
| age                      | int64    | Age of property since original construction date                                         |
| since_reno               | int64    | Number of years since last remodeled                                                     |
| tot_baths                | float64  | Total number of bathrooms including basement bathrooms                                   |
| neighbor_h               | int64    | Binary feature, 1 indicating that the property is located in an upper class neighborhood |
| neighbor_l               | int64    | Binary feature, 1 indicating that the property is located in a lower class neighborhood  |
| garage_type_a            | int64    | Binary feature, 1 indicating that the garage is attached or built in to the house.       |
| gr_liv_area overall_qual | int64    | Polynomial feature                                                                       |
| garage_area overall_qual | int64    | Polynomial feature                                                                       |
| saleprice                | int64    | Sale price of property in $                                                              |

The changes made on the training dataset, including feature engineering, will need to be performed on the test dataset such that it will contain the same column names.
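
The requirement that both dataframes end up with the same column names can be checked programmatically. The following is a minimal sketch on made-up toy frames; `normalize_columns` and `column_mismatch` are hypothetical helper names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lower-case names and spaces/slashes replaced by underscores."""
    out = frame.copy()
    out.columns = (
        out.columns.str.replace(" ", "_", regex=False)
        .str.replace("/", "_", regex=False)
        .str.lower()
    )
    return out

def column_mismatch(train: pd.DataFrame, test: pd.DataFrame, ignore=()):
    """List columns found in only one of the two frames, skipping names in `ignore`."""
    train_cols = set(train.columns) - set(ignore)
    test_cols = set(test.columns) - set(ignore)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)

# Toy frames standing in for the real data (all values are made up).
toy_train = pd.DataFrame({"gr_liv_area": [1500], "tot_baths": [2.5], "saleprice": [200000]})
toy_test = normalize_columns(
    pd.DataFrame({"Id": [1], "Gr Liv Area": [1200], "Tot Baths": [2.0]})
)

# The target and the identifier are expected to differ, so they are ignored.
only_train, only_test = column_mismatch(toy_train, toy_test, ignore=("saleprice", "id"))
print(only_train, only_test)
```

Two empty lists indicate that, apart from the ignored columns, the feature names agree.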

In [None]:
df_test.columns = df_test.columns.map(lambda x: (str(x.replace(' ','_'))).lower())
df_test.columns = df_test.columns.map(lambda x: (str(x.replace('/','_'))).lower())

In [None]:
features = ['gr_liv_area',
             'total_bsmt_sf',
             'garage_area',
             'mas_vnr_area',
             'overall_qual',
             'fireplace_qu',
             'neighborhood',
             'garage_type',
             'year_built',
             'year_remod_add',
             'full_bath',
             'bsmt_full_bath',
             'half_bath',
             'bsmt_half_bath',
             'id']

In [None]:
ordinal_dict = {'lot_shape': {'Reg':0, 'IR1':1, 'IR2':2, 'IR3':3}, #mapping lot_shape. Ranking represents measure of irregularity
               
               'utilities': {'AllPub':0, 'NoSewr':1, 'NoSeWa':2, 'ELO':3}, #mapping utilities. Ranking is based on 0 = having all public utilities, 3 = minimum utilities(electricity only) indicating a reduction in quality of life.
               
               'land_slope': {'Gtl':0, 'Mod':1, 'Sev':2}, #mapping land_slope. Ranking is based on slope severity.
                
               'exter_qual': {'Po':-4, 'Fa':-2, 'TA':0, 'Gd':2, 'Ex':4}, #mapping exter_qual, kitchen_qual, exter_cond and heating_qc
               'kitchen_qual':{'Po':-4, 'Fa':-2, 'TA':0, 'Gd':2, 'Ex':4}, #these 4 variables take a similar ranking where 0 represents typical/average quality
               'exter_cond': {'Po':-4, 'Fa':-2, 'TA':0, 'Gd':2, 'Ex':4}, #-ve values indicate quality below average and +ve values indicate quality above average
               'heating_qc': {'Po':-4, 'Fa':-2, 'TA':0, 'Gd':2, 'Ex':4}, #magnitude is set as 2 to amplify effects of 'good' and 'fair' quality
                
               'pool_qc': {'None':0, 'Fa':1, 'TA':2, 'Gd':3, 'Ex':4}, #mapping pool_qc, bsmt_qual, bsmt_cond, fireplace_qu, garage_qual, garage_cond
               'bsmt_qual': {'None':0, 'Po':1,'Fa':2,'TA':3, 'Gd':4, 'Ex':5}, #these variables also take a similar ranking
               'bsmt_cond': {'None':0, 'Po':1,'Fa':2,'TA':3, 'Gd':4, 'Ex':5}, # 0 represents having no pool/basement/fireplace/garage
               'fireplace_qu': {'None':0, 'Po':1,'Fa':2,'TA':3, 'Gd':4, 'Ex':5}, #ranking is in increasing order of quality
               'garage_qual': {'None':0, 'Po':1,'Fa':2,'TA':3, 'Gd':4, 'Ex':5},
               'garage_cond': {'None':0, 'Po':1,'Fa':2,'TA':3, 'Gd':4, 'Ex':5},
                
               'bsmt_exposure': {'No':0, 'None':0, 'Mn':1, 'Av':2, 'Gd':3}, #mapping bsmt_exposure. Having no basement and no exposure are considered to be of the same value of 0.
                
               'bsmtfin_type_1': {'None':0, 'Unf':1,'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}, #mapping bsmtfin_type_1 bsmtfin_type_2
               'bsmtfin_type_2': {'None':0, 'Unf':1,'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6}, #Ranking is based on quality of finish
                
               'electrical': {'SBrkr':0, 'FuseA':1, 'FuseF':2, 'FuseP':3, 'Mix':2.5}, #mapping electrical. 0 represents standard circuit breakers; ranking is in order of decreasing electrical system sophistication.
                #Mixed electrical system is considered to be between FuseF and FuseP
                
               'functional':{'Typ':0,'Min1':1,'Min2':2, 'Mod':3, 'Maj1':4, 'Maj2':5, 'Sev':6, 'Sal':7}, #mapping functional. Ranking is based on 0 having typical home functionality and in decreasing order of functionality
                
               'garage_finish':{'None':0, 'Unf':1, 'RFn':2, 'Fin':3}, #mapping garage_finish. 0 represents having no garage and ranking is in order of quality of finish.
                
               'paved_drive':{'N':0, 'P':1, 'Y':2}, #mapping paved_drive. Ranking is in order of amount of paving on driveway.
                
               'fence':{'None':0, 'MnWw':1, 'GdWo':2, 'MnPrv':3, 'GdPrv':4} #mapping fence. 0 represents having no fence and ranking is in increasing levels of privacy
               }

In [None]:
#fill null values for columns of interest
df_test['fireplace_qu'] = df_test['fireplace_qu'].fillna('None')
df_test['mas_vnr_area'] = df_test['mas_vnr_area'].fillna(0)

#Ordinal features encoding
for i in ordinal_dict:
    df_test[i] = df_test[i].map(ordinal_dict[i])

#select only the necessary columns
df_test = df_test[features].copy() #copy so that the new columns below are added to an independent dataframe

#Discrete feature engineering
df_test['age'] = 2010 - df_test['year_built']
df_test['since_reno'] = 2010 - df_test['year_remod_add']
df_test['tot_baths'] = df_test['full_bath'] + df_test['bsmt_full_bath'] + 0.5*(df_test['half_bath'] + df_test['bsmt_half_bath'])

#Drop the original columns
df_test.drop(columns = ['year_built',
                        'year_remod_add',
                        'full_bath',
                        'bsmt_full_bath',
                        'half_bath',
                        'bsmt_half_bath'
                       ], inplace = True)

#Nominal features mapping

df_test['garage_type_a'] = df_test['garage_type'].map(lambda x: 1 if x == 'Attchd' or x == 'BuiltIn' else 0)
df_test['neighbor_h'] = df_test['neighborhood'].map(lambda x: 1 if x == 'StoneBr' or x =='NridgHt' or x == 'NoRidge' or x == 'GrnHill' or x == 'Veenker' else 0)
df_test['neighbor_l'] = df_test['neighborhood'].map(lambda x: 1 if x == 'BrDale' or x == 'IDOTRR' or x == 'MeadowV' else 0)


#Adding polynomial interactions
df_test['gr_liv_area overall_qual'] = df_test['gr_liv_area'] * df_test['overall_qual']
df_test['garage_area overall_qual'] = df_test['garage_area'] * df_test['overall_qual']

#Drop the original columns
df_test.drop(columns = ['garage_type',
                        'neighborhood'
                       ], inplace = True)

#Display the resulting model features
df_test

In [None]:
#check if null values still present
df_test.isnull().sum().sum()

In [None]:
#check all dtypes are numerical
df_test.dtypes

In [None]:
df.head()

Check that the dimensions of both dataframes match. 

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df_test.shape

In [None]:
df_test.columns

Note that the train dataframe has an extra column `saleprice` and the test dataframe has an extra column `id`. Now our features match and we can proceed to modeling.

<a id='baseline'></a>
# Baseline model

The baseline model is the simplest possible prediction of `saleprice`: the overall mean of all the houses in the dataset. This is the model we need to 'beat' to know our model works.
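
As an illustration of how such a baseline is scored, here is a minimal sketch on made-up prices. The `rmse` helper and the toy arrays are assumptions for illustration only; the real baseline in this notebook is scored by Kaggle on the hidden test prices.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Made-up prices standing in for the training and unseen data.
toy_train_prices = np.array([100_000.0, 150_000.0, 200_000.0, 350_000.0])
toy_unseen_prices = np.array([120_000.0, 280_000.0])

# The baseline predicts the training mean for every unseen house.
baseline_price = toy_train_prices.mean()
baseline_preds = np.full(toy_unseen_prices.shape, baseline_price)

baseline_rmse = rmse(toy_unseen_prices, baseline_preds)
print(baseline_price, baseline_rmse)
```

Any candidate model should reach a lower RMSE than this constant prediction on the same unseen data.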

In [None]:
baseline = df['saleprice'].mean()
baseline = [baseline] * len(df_test) #one prediction per house in the test set

In [None]:
baseline_model = pd.DataFrame({'Id': df_test_raw['id'],
                                 'SalePrice': baseline})

In [None]:
baseline_model.to_csv('./datasets/baseline_model.csv', index = False)

The baseline prediction is the mean sale price of all the houses in our dataset, which is 181,511. Since we are using Kaggle to verify our model performance, we use the baseline model's score as the score to beat. Upon submission to Kaggle, we received a root mean squared error of 83,689.36!

<a id='tts'></a>
# Train/test split


In [None]:
target = 'saleprice'
model_features = [x for x in df.columns if x != target]

X = df[model_features]
y = df[target]
X_test = df_test

X_train, X_holdout, y_train, y_holdout  = train_test_split(X, y, random_state = 26, train_size = 0.8)

### Scaling features

Scaling features is essential when their magnitudes are significantly different from one another, especially for regularized models, whose penalty depends on the size of the coefficients. However, we must be careful to apply scaling only to non-binary features. Therefore the features `garage_type_a`, `neighbor_h` and `neighbor_l` are left out of the scaling process.
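
A minimal sketch of this idea on toy data: the scaler is fitted on the training rows only, applied unchanged to the holdout rows, and the binary column is left untouched. The toy frames and the name `numeric_cols` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: one continuous feature and one binary feature (values are made up).
toy_train = pd.DataFrame({"gr_liv_area": [1000.0, 1500.0, 2000.0], "neighbor_h": [0, 1, 0]})
toy_holdout = pd.DataFrame({"gr_liv_area": [1500.0, 2500.0], "neighbor_h": [1, 0]})
numeric_cols = ["gr_liv_area"]  # only these columns are standardized

scaler = StandardScaler()
train_scaled = toy_train.copy()
holdout_scaled = toy_holdout.copy()

# Learn mean and standard deviation from the training rows only.
train_scaled[numeric_cols] = scaler.fit_transform(toy_train[numeric_cols])
# Reuse the training statistics for the holdout rows.
holdout_scaled[numeric_cols] = scaler.transform(toy_holdout[numeric_cols])

print(train_scaled)
print(holdout_scaled)
```

Fitting on the training rows alone keeps information from the holdout rows out of the scaling step.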

In [None]:
df_test.columns

In [None]:
cols = ['gr_liv_area', 
        'total_bsmt_sf', 
        'garage_area', 
        'mas_vnr_area', 
        'overall_qual', 
        'fireplace_qu',
        'age',
        'since_reno',  
        'tot_baths',
        'gr_liv_area overall_qual',
        'garage_area overall_qual'
       ]

In [None]:
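#Standardizing predictors
ss_X = StandardScaler()

#work on copies so the assignments below do not trigger SettingWithCopyWarning
X_train = X_train.copy()
X_holdout = X_holdout.copy()

#fit on the training split only, then apply the same transformation to holdout and test
X_train[cols] = ss_X.fit_transform(X_train[cols])
X_holdout[cols] = ss_X.transform(X_holdout[cols])
X_test[cols] = ss_X.transform(X_test[cols])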

In [None]:
y_train_log = np.log(y_train)
y_holdout_log = np.log(y_holdout)

<a id='lr'></a>
# Linear Regression Model


In [None]:
# linear regression model
lr = LinearRegression()
lr.fit(X_train, y_train_log)

lr_train_rmse = -cross_val_score(
                LinearRegression(),
                X_train,
                y_train_log,
                cv = 10,
                scoring = 'neg_root_mean_squared_error').mean()
lr_train_rmse

In [None]:
#generate predicted values from linear regression model
lr_preds = np.exp(lr.predict(X_train))

In [None]:
# Plot the model
plt.figure(figsize=(12,9))

# Generate a scatterplot of predicted values versus actual values.
plt.scatter(y_train, lr_preds, s=5, color='red', alpha = 0.5)

# Plot a line.
plt.plot([0, np.max(y)],
         [0, np.max(y)],
         color = 'black')

# Tweak title and axis labels.
plt.ylabel(r"Predicted Values: $\hat{y}$", fontsize = 20)
plt.xlabel("Actual Values: $y$", fontsize = 20)
plt.title('Predicted Values vs. Actual Values', fontsize = 24);

The basic linear regression model is able to predict houses of lower sale prices much more accurately than houses of higher sale prices. The model tends to overpredict houses with high sale prices. We also note that there is much more data available for houses of low sale prices. Having more data on houses sold at higher prices might help in constructing a better model.

In [None]:
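#store holdout RMSE for comparison later
#note: this cross-validates within the holdout set (refitting on holdout folds) rather than scoring the model fitted on X_train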
lr_holdout_rmse = -cross_val_score(
                LinearRegression(),
                X_holdout,
                y_holdout_log,
                cv = 10,
                scoring = 'neg_root_mean_squared_error').mean()
lr_holdout_rmse

In [None]:
#store coefficients for comparison later
lr_coefs = pd.DataFrame({'lr_coefs' :lr.coef_})

<a id='ridge'></a>
# Ridge Model


In [None]:
ridgecv = RidgeCV(alphas = np.logspace(0,3,1000))
ridgecv.fit(X_train,y_train_log)

In [None]:
#Optimal alpha for ridge regression model is
ridgecv.alpha_

In [None]:
#testing the ridge regression model on the train data
ridge_train_rmse = -cross_val_score(
                    Ridge(alpha = ridgecv.alpha_),
                    X_train,
                    y_train_log,
                    cv = 10,
                    scoring = 'neg_root_mean_squared_error',).mean()
ridge_train_rmse

In [None]:
ridge = Ridge(alpha = ridgecv.alpha_)
ridge.fit(X_train, y_train_log)
ridge_preds = np.exp(ridge.predict(X_train)) #predict with the ridge model fitted on the line above

In [None]:
# Plot the model
plt.figure(figsize=(12,9))

# Generate a scatterplot of predicted values versus actual values.
plt.scatter(y_train, ridge_preds, s=5, color='blue', alpha = 0.5)

# Plot a line.
plt.plot([0, np.max(y)],
         [0, np.max(y)],
         color = 'black')

# Tweak title and axis labels.
plt.ylabel(r"Predicted Values: $\hat{y}$", fontsize = 20)
plt.xlabel("Actual Values: $y$", fontsize = 20)
plt.title('Predicted Values vs. Actual Values', fontsize = 24);

The model with ridge regularization behaves similarly to the normal linear regression model in that it overpredicts houses at higher sale prices.

In [None]:
#store holdout rmse for comparison later
ridge_holdout_rmse = -cross_val_score(
                    Ridge(alpha = ridgecv.alpha_),
                    X_holdout,
                    y_holdout_log,
                    cv = 10,
                    scoring = 'neg_root_mean_squared_error',).mean()

In [None]:
#store coefficients for comparison later
ridge_coefs = pd.DataFrame({'ridge_coefs' :ridgecv.coef_})

<a id='lasso'></a>
# Lasso Model


In [None]:
lassocv = LassoCV(n_alphas = 500)
lassocv.fit(X_train, y_train_log)

In [None]:
# Optimum alpha
lassocv.alpha_

In [None]:
#test rmse for lasso on train dataset
lasso_train_rmse = -cross_val_score(
                    Lasso(alpha = lassocv.alpha_),
                    X_train,
                    y_train_log,
                    cv = 10,
                    scoring = 'neg_root_mean_squared_error',).mean()
lasso_train_rmse

In [None]:
#store coefficients for comparison later
lasso_coefs = pd.DataFrame({'lasso_coefs' :lassocv.coef_})

## Lasso Coefficients

It is important to check whether any of the coefficients were shrunk to exactly 0 by lasso regularization. This helps to eliminate features which do not have much predictive value for the saleprice.
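
For intuition on why lasso can set coefficients to exactly 0 while ridge only shrinks them, the sketch below uses a simplified textbook special case (orthonormal features), in which lasso acts as a soft threshold on the unpenalized coefficients and ridge scales them all by a constant factor. The function names and the toy coefficients are assumptions for illustration; this is not how scikit-learn solves the general problem.

```python
import numpy as np

def soft_threshold(coefs, threshold):
    """Lasso-style shrinkage: move each value toward 0 by `threshold`, stopping at 0."""
    coefs = np.asarray(coefs, dtype=float)
    return np.sign(coefs) * np.maximum(np.abs(coefs) - threshold, 0.0)

def proportional_shrink(coefs, penalty):
    """Ridge-style shrinkage: scale every value by the same factor 1 / (1 + penalty)."""
    return np.asarray(coefs, dtype=float) / (1.0 + penalty)

# Made-up unpenalized coefficients: one strong feature and two weak ones.
toy_ols_coefs = np.array([0.30, -0.04, 0.02])

lasso_like = soft_threshold(toy_ols_coefs, threshold=0.05)
ridge_like = proportional_shrink(toy_ols_coefs, penalty=0.05)

# Count coefficients that the lasso-style rule removed entirely.
n_zeroed = int(np.sum(lasso_like == 0))
print(lasso_like, ridge_like, n_zeroed)
```

In this toy case the two weak coefficients fall below the threshold and are zeroed, while the ridge-style rule keeps all three non-zero.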

In [None]:
#Check whether any of the coefficients tended to 0
lasso_coefs['zeros'] = lasso_coefs['lasso_coefs'].map(lambda x: 1 if x == 0 else 0)
lasso_coefs['zeros'].sum()

In [None]:
lasso = Lasso(alpha = lassocv.alpha_)
lasso.fit(X_train, y_train_log)
lasso_preds = np.exp(lasso.predict(X_train))

In [None]:
# Plot the model
plt.figure(figsize=(12,9))

# Generate a scatterplot of predicted values versus actual values.
plt.scatter(y_train, lasso_preds, s=5, color='orange', alpha = 0.5)

# Plot a line.
plt.plot([0, np.max(y)],
         [0, np.max(y)],
         color = 'black')

# Tweak title and axis labels.
plt.ylabel(r"Predicted Values: $\hat{y}$", fontsize = 20)
plt.xlabel("Actual Values: $y$", fontsize = 20)
plt.title('Predicted Values vs. Actual Values', fontsize = 24);

The model with lasso regularization behaves similarly to the normal linear regression model in that it overpredicts houses at higher sale prices.

In [None]:
#store rmse for comparison later
lasso_holdout_rmse = -cross_val_score(
                    Lasso(alpha = lassocv.alpha_),
                    X_holdout,
                    y_holdout_log,
                    cv = 10,
                    scoring = 'neg_root_mean_squared_error',).mean()

<a id='elastic'></a>
# Elastic Net


Elasticnet regularization is a combination of both ridge and lasso regularization. The l1_ratio determines which regularization is more heavily weighted.

In [None]:
elastic = ElasticNet()
elasticcv = ElasticNetCV(n_alphas = 500,l1_ratio = list(linspace(0.01,0.5,200)))
elasticcv.fit(X_train, y_train_log)

In [None]:
#elasticnet optimum alpha
elasticcv.alpha_

In [None]:
#elasticnet optimum l1 ratio
elasticcv.l1_ratio_

The l1 ratio for elasticnet regularization tells us how much the model leans towards lasso regularization: a ratio of 1 is pure lasso and a ratio of 0 is pure ridge. As shown by the optimum l1 ratio, the elasticnet model uses mostly ridge regularization and has minimal penalties from lasso regularization.
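
To make the role of the l1 ratio concrete, the sketch below splits the elastic net penalty into its lasso (L1) and ridge (L2) parts, following the objective documented for scikit-learn's `ElasticNet`: `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The helper name `elastic_net_penalty` and the toy coefficients are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Return the (L1 part, L2 part) of the elastic net penalty for given coefficients."""
    coefs = np.asarray(coefs, dtype=float)
    l1_part = alpha * l1_ratio * np.sum(np.abs(coefs))
    l2_part = 0.5 * alpha * (1.0 - l1_ratio) * np.sum(coefs ** 2)
    return float(l1_part), float(l2_part)

toy_coefs = [0.5, -0.5]  # made-up coefficients

# l1_ratio = 1 keeps only the lasso term; l1_ratio = 0 keeps only the ridge term.
pure_lasso = elastic_net_penalty(toy_coefs, alpha=1.0, l1_ratio=1.0)
pure_ridge = elastic_net_penalty(toy_coefs, alpha=1.0, l1_ratio=0.0)
print(pure_lasso, pure_ridge)
```

A small optimum l1 ratio therefore means that most of the penalty comes from the squared (ridge) term.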

In [None]:
elastic = ElasticNet(alpha = elasticcv.alpha_, l1_ratio = elasticcv.l1_ratio_)
elastic_train_rmse = -cross_val_score(
                    elastic,
                    X_train,
                    y_train_log,
                    cv = 10,
                    scoring = 'neg_root_mean_squared_error',).mean()
elastic_train_rmse

In [None]:
elastic = ElasticNet(alpha = elasticcv.alpha_, l1_ratio = elasticcv.l1_ratio_)
elastic.fit(X_train, y_train_log)
elastic_preds = np.exp(elastic.predict(X_train))

In [None]:
# Plot the model
plt.figure(figsize=(12,9))

# Generate a scatterplot of predicted values versus actual values.
plt.scatter(y_train, elastic_preds, s=5, color='green', alpha = 0.5)

# Plot a line.
plt.plot([0, np.max(y)],
         [0, np.max(y)],
         color = 'black')

# Tweak title and axis labels.
plt.ylabel(r"Predicted Values: $\hat{y}$", fontsize = 20)
plt.xlabel("Actual Values: $y$", fontsize = 20)
plt.title('Predicted Values vs. Actual Values', fontsize = 24);

The model with elasticnet regularization behaves similarly to the normal linear regression model in that it overpredicts houses at higher sale prices.

In [None]:
#store rmse for comparison later
elastic_holdout_rmse = -cross_val_score(
                        elastic,
                        X_holdout,
                        y_holdout_log,
                        cv = 10,
                        scoring = 'neg_root_mean_squared_error',).mean()

In [None]:
#store coefficients for comparison later
elastic_coefs = pd.DataFrame({'elastic_coefs' :elasticcv.coef_})

<a id='compare'></a>
# Comparisons between models


In [None]:
rmse = pd.DataFrame ( [lr_train_rmse,
                     lr_holdout_rmse,
                     ridge_train_rmse,
                     ridge_holdout_rmse,
                     lasso_train_rmse,
                     lasso_holdout_rmse,
                     elastic_train_rmse,
                     elastic_holdout_rmse], 
                     index = ['lr_train_rmse',
                     'lr_holdout_rmse',
                     'ridge_train_rmse',
                     'ridge_holdout_rmse',
                     'lasso_train_rmse',
                     'lasso_holdout_rmse',
                     'elastic_train_rmse',
                     'elastic_holdout_rmse'])

rmse

The root mean squared error for the predicted log sale prices is the smallest for ridge regularization, indicating that our model with ridge regularization is the best predictor for saleprices. We can explore the coefficients for our models for more insights.
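
The holdout scores in this notebook come from cross-validating within the holdout set. A common alternative is to score the model that was fitted on the training split directly on the holdout rows, which measures how that specific fitted model generalizes. The sketch below shows this on synthetic data; the arrays, the fixed `alpha=1.0` and the split sizes are assumptions for illustration, not values from the analysis.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in for standardized features and a log-transformed target.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 3))
toy_y_log = 12.0 + toy_X @ np.array([0.3, -0.1, 0.05]) + rng.normal(scale=0.05, size=200)

# Simple 80/20 split by position.
toy_X_train, toy_X_holdout = toy_X[:160], toy_X[160:]
toy_y_train, toy_y_holdout = toy_y_log[:160], toy_y_log[160:]

# Fit once on the training rows, then predict the unseen holdout rows.
model = Ridge(alpha=1.0).fit(toy_X_train, toy_y_train)
holdout_residuals = toy_y_holdout - model.predict(toy_X_holdout)

# RMSE on the log scale, comparable to the cross-validated training RMSE.
holdout_rmse = float(np.sqrt(np.mean(holdout_residuals ** 2)))
print(holdout_rmse)
```

A holdout RMSE close to the cross-validated training RMSE suggests the model is not overfitting.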

## Interpreting intercept and coefficients

The intercept and coefficients obtained are for a linear regression on the log-transformed sale price, so they do not represent a direct linear relationship to the sale price. Instead, they are interpreted as follows. The exponentiated intercept is the predicted sale price for a house with all standardized features at their mean and all binary features equal to 0, which is roughly the geometric mean of `saleprice`. Each coefficient β corresponds to a change of 100·(e^β − 1)% in the sale price for a 1-unit increase in feature X, which is roughly 100·β% when β is small; for the standardized features, 1 unit is one standard deviation. Positive coefficients give a % increase and negative coefficients a % decrease in saleprice.
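
The interpretation of a log-scale model can be checked with a short calculation. This is a minimal sketch; the coefficient and intercept values are made up, and `pct_change_per_unit` is a hypothetical helper name.

```python
import math

def pct_change_per_unit(coef):
    """Percentage change in the predicted price for a 1-unit increase in a feature."""
    return (math.exp(coef) - 1.0) * 100.0

# Made-up coefficients on the log-price scale.
pct_up = pct_change_per_unit(0.10)     # about +10.5%
pct_down = pct_change_per_unit(-0.10)  # about -9.5%

# Made-up intercept: exponentiating gives the predicted price at the reference point.
toy_intercept = 12.0
price_at_reference = math.exp(toy_intercept)  # about 162,755

print(round(pct_up, 2), round(pct_down, 2), round(price_at_reference, 2))
```

Note that equal and opposite coefficients do not give equal and opposite percentage changes, because the effect is multiplicative on the original price scale.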

In [None]:
columns = pd.DataFrame(X_train.columns, columns = ['features'])

In [None]:
coefficients = pd.concat([columns, lr_coefs, ridge_coefs, lasso_coefs.drop(columns = 'zeros'), elastic_coefs], axis = 1)

In [None]:
coefficients.set_index('features')

In [None]:
intercepts = pd.DataFrame({'lr_intercept': [lr.intercept_],
                          'ridge_intercept': [ridge.intercept_],
                          'lasso_intercept': [lasso.intercept_],
                          'elasticnet_intercept': [elastic.intercept_]})
intercepts

In [None]:
plt.figure(figsize =(8,5))
sns.barplot(data = coefficients, x = 'lr_coefs', y = 'features');

In [None]:
plt.figure(figsize =(8,5))
sns.barplot(data = coefficients, x = 'ridge_coefs', y = 'features');

The coefficients for the lasso and regular regression model do not change by much, indicating that the model is not penalized much for variance.

As seen from the results, the strongest predictors of saleprice are `overall_qual`, `gr_liv_area`, `neighbor_l` and `since_reno`. The weakest predictors are `mas_vnr_area` and `age`.

Ridge regularization amplifies the predictive value of the polynomial features as compared to the regular regression model. It also amplifies `garage_area`, `total_bsmt_sf`, `since_reno` and `garage_type_a`.

<a id='kaggle'></a>
# Kaggle Evaluation


We will retrain the model on the full training set using ridge regularization to ensure that the model utilises the maximum amount of data available.

In [None]:
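ss_X = StandardScaler()

#fit the scaler on the full training set (on a copy, so X itself stays unscaled)
X_full_train = X.copy()
X_full_train[cols] = ss_X.fit_transform(X[cols])
#note: X_test was already standardized earlier with the scaler fitted on X_train (the 80% split)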

In [None]:
y_log = np.log(y)

In [None]:
X_test.drop(columns = 'id', inplace = True)

In [None]:
ridgecv = RidgeCV(alphas = np.logspace(0,3,1000))
ridgecv.fit(X_full_train,y_log)

In [None]:
ridgecv.alpha_

In [None]:
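ridge = Ridge(alpha = ridgecv.alpha_)
ridge.fit(X_full_train, y_log)
#select the test columns in the same order as the training columns before predicting
test_preds = np.exp(ridge.predict(X_test[X_full_train.columns]))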

In [None]:
test_preds

In [None]:
kaggle_submission = pd.DataFrame({'Id': df_test_raw['id'],
                                 'SalePrice': test_preds})

In [None]:
kaggle_submission

In [None]:
kaggle_submission.to_csv('./datasets/kaggle_submission_final.csv', index = False)

<a id='results'></a>
# Results and conclusion


Upon testing our model against the test dataset of 879 houses, we received a root mean squared error of 26,502.87. This is a great improvement from our baseline score of 83,689.36!

The features we selected from the original dataset prove to be valuable predictors of house sale prices. The features with the strongest influence on sale prices are `overall_qual`, `gr_liv_area`, `neighbor_l` and `since_reno`. Other interesting features include `fireplace_qu`, `tot_baths`, `garage_area` and `garage_type_a`.

House flipping often involves renovating a property and selling it at a higher price. As seen from our model, it is logical that an increase in the property's overall quality (`overall_qual`) is associated with a higher sale price. Newly renovated houses also tend to fetch a higher selling price (`since_reno`).

However, some interesting trends that were picked up highlight the potential impact of improving the quality of fireplaces (`fireplace_qu`) and of having a garage built in or attached to the house (`garage_type_a`).

The strong negative coefficient for `neighbor_l` suggests that Iowa DOT & Rail Road, Briardale and Meadow Village are areas that do not tend to fetch high house prices.

## Recommendations to stakeholders

I would recommend that "house hunters" avoid the neighborhoods in `neighbor_l` when searching for potential houses to flip. When selecting houses, look out for the potential to add a built-in or attached garage. During the renovation, it would be beneficial to improve the quality of the fireplace, or add one if there is none. Winters can get really cold in Ames!

## Future work

Our model showed lower predictive accuracy for the more expensive properties than for the lower-priced ones. This could be improved by including more data on expensive houses sold.