# Challenge

In this module, we learned how to approach and solve regression problems using linear regression models. Throughout the module, you worked on a house price dataset from Kaggle. In this challenge, you will keep working on this dataset.

The scenario

The housing market is one of the most crucial parts of the economy for every country. Purchasing a home is one of the primary ways to build wealth and savings for people. In this respect, predicting prices in the housing market is a very central topic in economic and financial circles.

The house price dataset from Kaggle includes several features of the houses along with their sale prices at the time they are sold. So far, in this module, you built and implemented some models using this dataset.

In this challenge, you are required to improve your model with respect to its prediction performance.

To complete this challenge, submit a Jupyter notebook containing your solutions to the following tasks.

Steps:
1. Load the houseprices data from Thinkful's database.

2. Do data cleaning, exploratory data analysis, and feature engineering. You can use your previous work in this module. But make sure that your work is satisfactory.

3. Now, split your data into train and test sets where 20% of the data resides in the test set.

4. Build several linear regression models including Lasso, Ridge, or ElasticNet and train them in the training set. Use k-fold cross-validation to select the best hyperparameters if your models include one!

5. Evaluate your best model on the test set.

6. So far, you have only used the features in the dataset. However, house prices can be affected by many factors like economic activity and the interest rates at the time they are sold. So, try to find some useful factors that are not included in the dataset. Integrate these factors into your model and assess the prediction performance of your model. Discuss the implications of adding these external variables into your model.

## 1. Load the houseprices data from Thinkful's database.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time

from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error

from statsmodels.tools.eval_measures import mse, rmse
from sqlalchemy import create_engine

import statsmodels.stats.api as sms
import statsmodels.api as sm

# Display preferences.
%matplotlib inline
pd.options.display.float_format = "{:.3f}".format

import warnings

warnings.filterwarnings(action="ignore")

%load_ext nb_black


In [2]:
postgres_user = "dsbc_student"
postgres_pw = "7*.8G9QH21"
postgres_host = "142.93.121.174"
postgres_port = "5432"
postgres_db = "houseprices"

engine = create_engine(
    "postgresql://{}:{}@{}:{}/{}".format(
        postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db
    )
)
houseprices_df = pd.read_sql_query("select * from houseprices", con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


houseprices_df.head(5)

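| | id | mssubclass | mszoning | lotfrontage | lotarea | street | alley | lotshape | landcontour | utilities | ... | poolarea | poolqc | fence | miscfeature | miscval | mosold | yrsold | saletype | salecondition | saleprice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |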



## 2. Do data cleaning, exploratory data analysis, and feature engineering. You can use your previous work in this module. But make sure that your work is satisfactory.

### NOTE: I performed these steps in the other notebook I submitted. Please refer to it for the detailed EDA and data cleaning. Below are the steps needed to rebuild the model I judged to be the best; the EDA that leads to these steps is in the other notebook.

In [3]:
houseprices2_df = houseprices_df.copy()

houseprices2_df["log_price"] = np.log(houseprices2_df["saleprice"])

houseprices2_df = houseprices2_df[
    [
        "log_price",
        "overallqual",
        "bsmtqual",
        "kitchenqual",
        "grlivarea",
        "garagearea",
        "fullbath",
        "yearbuilt",
        "exterqual",
        "garagefinish",
        "centralair",
    ]
]

hp2_df = houseprices2_df.dropna()

drop_ids = hp2_df.sort_values("grlivarea", ascending=False).iloc[:4].index
print(f"Dropping house ids {list(drop_ids)}")
hp2_df = hp2_df.drop(index=drop_ids)

hp2_df["centralair"] = (hp2_df["centralair"] == "Y").astype(int)

quality_map = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
hp2_df[["exterqual"]] = hp2_df[["exterqual"]].replace(quality_map)
hp2_df[["kitchenqual"]] = hp2_df[["kitchenqual"]].replace(quality_map)

quality_map2 = {None: 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
hp2_df[["bsmtqual"]] = hp2_df[["bsmtqual"]].replace(quality_map2)

garage2_df = pd.get_dummies(hp2_df["garagefinish"], drop_first=True)

hp2_df = pd.concat([hp2_df, garage2_df], axis=1)
hp2_df.drop(columns=["garagefinish"], inplace=True)

Dropping house ids [1298, 523, 1182, 691]
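
The quality columns above are encoded with `replace`, which silently leaves any label that is missing from the mapping unchanged. Below is a minimal sketch of a stricter variant that fails loudly instead. The toy data and the helper `encode_ordinal` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the quality columns; values are illustrative only.
toy_df = pd.DataFrame(
    {
        "exterqual": ["TA", "Gd", "Ex", "Fa"],
        "kitchenqual": ["Gd", "TA", "Po", "Ex"],  # "Po" is not in the map
    }
)

quality_map = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}


def encode_ordinal(series, mapping):
    """Map category labels to integers and raise on labels missing from the map."""
    unknown = set(series.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unmapped labels in {series.name!r}: {sorted(unknown)}")
    return series.map(mapping)


encoded_exterqual = encode_ordinal(toy_df["exterqual"], quality_map)
print(encoded_exterqual.tolist())

try:
    encode_ordinal(toy_df["kitchenqual"], quality_map)
except ValueError as err:
    print(err)
```

A check like this makes it obvious when a column still contains strings after encoding, which would otherwise only surface when the model is fitted.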



## 3. Now, split your data into train and test sets where 20% of the data resides in the test set.

In [4]:
X = hp2_df.drop(columns=["log_price", "RFn"])
y = hp2_df["log_price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=32
)


In [5]:
lrm = LinearRegression()

lrm.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print(
    "R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train))
)
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print(
    "Mean absolute error of the prediction is: {}".format(
        mean_absolute_error(y_test, y_preds_test)
    )
)
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print(
    "Root mean squared error of the prediction is: {}".format(
        rmse(y_test, y_preds_test)
    )
)
print(
    "Mean absolute percentage error of the prediction is: {}".format(
        np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100
    )
)

R-squared of the model in training set is: 0.8380568201689231
-----Test set statistics-----
R-squared of the model in test set is: 0.835899234076841
Mean absolute error of the prediction is: 0.11747184712690383
Mean squared error of the prediction is: 0.025045671844994687
Root mean squared error of the prediction is: 0.15825824416122747
Mean absolute percentage error of the prediction is: 0.9730359339615233
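
The metrics above are computed on `log_price`, so the mean absolute percentage error of about 0.97% is relative to the log of the price, not to the price in dollars. The sketch below shows one way to report both scales by back-transforming with `np.exp`. The helper `regression_report` and the demo values are hypothetical and only illustrate the idea.

```python
import numpy as np


def regression_report(y_true_log, y_pred_log):
    """Return error metrics on the log scale and on the original price scale."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    resid = y_true_log - y_pred_log

    # Log-scale metrics, the same quantities the cell above prints.
    report = {
        "mae_log": np.mean(np.abs(resid)),
        "mse_log": np.mean(resid ** 2),
        "rmse_log": np.sqrt(np.mean(resid ** 2)),
        "mape_log_pct": np.mean(np.abs(resid / y_true_log)) * 100,
    }

    # Back-transform with exp to express the errors in price units.
    price_true = np.exp(y_true_log)
    price_pred = np.exp(y_pred_log)
    price_err = np.abs(price_true - price_pred)
    report["mae_price"] = np.mean(price_err)
    report["mape_price_pct"] = np.mean(price_err / price_true) * 100
    return report


# Illustrative values only: three made-up prices and predictions.
y_true_demo = np.log([100_000.0, 200_000.0, 300_000.0])
y_pred_demo = np.log([110_000.0, 190_000.0, 300_000.0])

demo_report = regression_report(y_true_demo, y_pred_demo)
for name, value in demo_report.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's variables, the call would be `regression_report(y_test, y_preds_test)`. The percentage error on the price scale is typically much larger than the one on the log scale, which is worth keeping in mind when reading the outputs in this notebook.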



## 4. Build several linear regression models including Lasso, Ridge, or ElasticNet and train them in the training set. Use k-fold cross-validation to select the best hyperparameters if your models include one!

In [6]:
alphas = [np.power(10.0, p) for p in np.arange(-40, 40, 1)]

# fit lasso using a grid search for the best alpha
lasso_cv = LassoCV(alphas=alphas, verbose=1, cv=5)
lasso_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lasso_cv.predict(X_train)
y_preds_test = lasso_cv.predict(X_test)

print("Best alpha value is: {}".format(lasso_cv.alpha_))
print(
    "R-squared of the model in training set is: {}".format(
        lasso_cv.score(X_train, y_train)
    )
)
print("-----Test set statistics-----")
print(
    "R-squared of the model in test set is: {}".format(lasso_cv.score(X_test, y_test))
)
print(
    "Mean absolute error of the prediction is: {}".format(
        mean_absolute_error(y_test, y_preds_test)
    )
)
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print(
    "Root mean squared error of the prediction is: {}".format(
        rmse(y_test, y_preds_test)
    )
)
print(
    "Mean absolute percentage error of the prediction is: {}".format(
        np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100
    )
)

Best alpha value is: 1e-36
R-squared of the model in training set is: 0.8380568201689231
-----Test set statistics-----
R-squared of the model in test set is: 0.8358992340768416
Mean absolute error of the prediction is: 0.11747184712690449
Mean squared error of the prediction is: 0.02504567184499459
Root mean squared error of the prediction is: 0.15825824416122716
Mean absolute percentage error of the prediction is: 0.9730359339615293





In [7]:
alphas = [np.power(10.0, p) for p in np.arange(-40, 40, 1)]

# fit ridge using cross-validation to select the best alpha
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = ridge_cv.predict(X_train)
y_preds_test = ridge_cv.predict(X_test)

print("Best alpha value is: {}".format(ridge_cv.alpha_))
print(
    "R-squared of the model in training set is: {}".format(
        ridge_cv.score(X_train, y_train)
    )
)
print("-----Test set statistics-----")
print(
    "R-squared of the model in test set is: {}".format(ridge_cv.score(X_test, y_test))
)
print(
    "Mean absolute error of the prediction is: {}".format(
        mean_absolute_error(y_test, y_preds_test)
    )
)
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print(
    "Root mean squared error of the prediction is: {}".format(
        rmse(y_test, y_preds_test)
    )
)
print(
    "Mean absolute percentage error of the prediction is: {}".format(
        np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100
    )
)

Best alpha value is: 1.0
R-squared of the model in training set is: 0.838051559108952
-----Test set statistics-----
R-squared of the model in test set is: 0.8356737924891101
Mean absolute error of the prediction is: 0.11758309610608941
Mean squared error of the prediction is: 0.025080079582184438
Root mean squared error of the prediction is: 0.15836691441770417
Mean absolute percentage error of the prediction is: 0.973980993704107



In [8]:
alphas = [np.power(10.0, p) for p in np.arange(-40, 40, 1)]

# fit elastic net using cross-validation to select the best alpha
elasticnet_cv = ElasticNetCV(alphas=alphas, cv=5)
elasticnet_cv.fit(X_train, y_train)

# We are making predictions here
y_preds_train = elasticnet_cv.predict(X_train)
y_preds_test = elasticnet_cv.predict(X_test)

print("Best alpha value is: {}".format(elasticnet_cv.alpha_))
print(
    "R-squared of the model in training set is: {}".format(
        elasticnet_cv.score(X_train, y_train)
    )
)
print("-----Test set statistics-----")
print(
    "R-squared of the model in test set is: {}".format(
        elasticnet_cv.score(X_test, y_test)
    )
)
print(
    "Mean absolute error of the prediction is: {}".format(
        mean_absolute_error(y_test, y_preds_test)
    )
)
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print(
    "Root mean squared error of the prediction is: {}".format(
        rmse(y_test, y_preds_test)
    )
)
print(
    "Mean absolute percentage error of the prediction is: {}".format(
        np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100
    )
)

Best alpha value is: 1e-16
R-squared of the model in training set is: 0.8380568201689231
-----Test set statistics-----
R-squared of the model in test set is: 0.8358992340768413
Mean absolute error of the prediction is: 0.11747184712690452
Mean squared error of the prediction is: 0.025045671844994632
Root mean squared error of the prediction is: 0.1582582441612273
Mean absolute percentage error of the prediction is: 0.9730359339615304



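### The OLS regression did just as well as Lasso, Ridge, and ElasticNet (test R-squared of about 0.836 for all four), but with lower computational cost. The alphas selected for Lasso (1e-36) and ElasticNet (1e-16) are so close to zero that those models apply almost no regularization.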
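
One way to see why a near-zero alpha reproduces the OLS results is ridge's closed-form solution, where the penalty enters as `alpha * I`. The sketch below uses synthetic data and a hypothetical helper `ridge_coef`, not the notebook's scikit-learn estimators, and fits no intercept to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated without an intercept (an assumption for brevity).
n_samples, n_features = 200, 3
X_demo = rng.normal(size=(n_samples, n_features))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_coef + rng.normal(scale=0.1, size=n_samples)


def ridge_coef(X, y, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ y)


ols_coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
tiny_alpha_coef = ridge_coef(X_demo, y_demo, alpha=1e-16)
large_alpha_coef = ridge_coef(X_demo, y_demo, alpha=1e3)

print("OLS:          ", ols_coef)
print("alpha = 1e-16:", tiny_alpha_coef)
print("alpha = 1e3:  ", large_alpha_coef)
```

The tiny penalty leaves the coefficients numerically equal to OLS, while the large one shrinks them toward zero. When cross-validation picks an alpha near zero, it suggests the unregularized model was not overfitting in the first place.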

## 5. Evaluate your best model on the test set.

### I already did this above in order to evaluate the goodness of fit of each model. I don't understand why I need to do it again here.
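
One possible reading of this step is that the test set is meant for a single final check, made after cross-validation on the training set has already picked the model. The sketch below illustrates that workflow under stated assumptions: synthetic data, a closed-form ridge fit without an intercept, and hypothetical helpers `fit_ridge` and `cv_mse`. It is not the notebook's scikit-learn pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data standing in for the features and log prices.
X_all = rng.normal(size=(300, 4))
y_all = X_all @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# Hold out the last 20% as a test set that model selection never sees.
n_train = int(0.8 * len(X_all))
X_tr, X_te = X_all[:n_train], X_all[n_train:]
y_tr, y_te = y_all[:n_train], y_all[n_train:]


def fit_ridge(X, y, alpha):
    """Closed-form ridge fit without an intercept (kept simple on purpose)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)


def cv_mse(X, y, alpha, k=5):
    """Average validation MSE over k contiguous folds."""
    folds = np.array_split(np.arange(len(X)), k)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(X), dtype=bool)
        train_mask[val_idx] = False
        coef = fit_ridge(X[train_mask], y[train_mask], alpha)
        scores.append(np.mean((y[val_idx] - X[val_idx] @ coef) ** 2))
    return float(np.mean(scores))


# Model selection uses the training data only.
alpha_grid = [10.0 ** p for p in range(-4, 5)]
cv_scores = {alpha: cv_mse(X_tr, y_tr, alpha) for alpha in alpha_grid}
best_alpha = min(cv_scores, key=cv_scores.get)

# Refit on the full training set, then touch the test set exactly once.
final_coef = fit_ridge(X_tr, y_tr, best_alpha)
test_mse = float(np.mean((y_te - X_te @ final_coef) ** 2))
print(f"best alpha: {best_alpha}, test MSE: {test_mse:.4f}")
```

Because the test set plays no part in choosing `best_alpha`, the final score is a fairer estimate of performance on unseen data than a score that was also used to compare models.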

## 6. So far, you have only used the features in the dataset. However, house prices can be affected by many factors like economic activity and the interest rates at the time they are sold. So, try to find some useful factors that are not included in the dataset. Integrate these factors into your model and assess the prediction performance of your model. Discuss the implications of adding these external variables into your model.

In [9]:
hp2_df.head()

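| | log_price | overallqual | bsmtqual | kitchenqual | grlivarea | garagearea | fullbath | yearbuilt | exterqual | centralair | RFn | Unf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.248 | 7 | 3 | 3 | 1710 | 548 | 2 | 2003 | 3 | 1 | 1 | 0 |
| 1 | 12.109 | 6 | 3 | 2 | 1262 | 460 | 2 | 1976 | 2 | 1 | 1 | 0 |
| 2 | 12.317 | 7 | 3 | 3 | 1786 | 608 | 2 | 2001 | 3 | 1 | 1 | 0 |
| 3 | 11.849 | 7 | 2 | 3 | 1717 | 642 | 1 | 1915 | 2 | 1 | 0 | 1 |
| 4 | 12.429 | 8 | 3 | 3 | 2198 | 836 | 2 | 2000 | 3 | 1 | 1 | 0 |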



Because many of the factors that influence the real estate market depend on time, I'll use the sale date columns from the original dataframe to merge in economic factors. The values used here are annual, so they are keyed on `yrsold`; `mosold` would only be needed for monthly data. Then I'll repeat the NA dropping and column selection from above.

Two economic factors that are commonly cited as influencing house prices are interest rates and GDP.

In [11]:
houseprices_df["yrsold"].unique()

array([2008, 2007, 2006, 2009, 2010])


In [12]:
# Annual values keyed by sale year. Units and sources are not recorded in this notebook.
gdp_dict = {2006: 13.815, 2007: 14.452, 2008: 14.713, 2009: 14.449, 2010: 14.992}
houseprices_df["gdp"] = houseprices_df["yrsold"].map(gdp_dict)

# Interest rate per sale year, looked up the same way.
fr_int_dict = {2006: 4.97, 2007: 5.02, 2008: 1.92, 2009: 0.16, 2010: 0.18}
houseprices_df["fr_int_rate"] = houseprices_df["yrsold"].map(fr_int_dict)
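
If monthly indicators were available, they could be joined on both `yrsold` and `mosold` rather than looked up by year. The sketch below assumes a hypothetical table `monthly_df` with a `monthly_rate` column whose values are placeholders, not real economic data. The toy sales rows reuse the sale dates and prices from the first rows shown earlier.

```python
import pandas as pd

# Toy sales frame with the same date columns as the dataset.
sales_df = pd.DataFrame(
    {
        "yrsold": [2008, 2007, 2008, 2006],
        "mosold": [2, 5, 9, 2],
        "saleprice": [208500, 181500, 223500, 140000],
    }
)

# Hypothetical monthly indicator table; the rate values are placeholders,
# not real economic data.
monthly_df = pd.DataFrame(
    {
        "yrsold": [2006, 2007, 2008, 2008],
        "mosold": [2, 5, 2, 9],
        "monthly_rate": [1.0, 2.0, 3.0, 4.0],
    }
)

# A left merge keeps every sale; validate raises if a month appears twice
# in the indicator table.
merged_df = sales_df.merge(
    monthly_df, on=["yrsold", "mosold"], how="left", validate="many_to_one"
)

# Any sale without a matching month would show up as NaN here.
print(merged_df)
print("unmatched sales:", merged_df["monthly_rate"].isna().sum())
```

Monthly values vary across far more distinct dates than annual ones, so they give the model more variation to work with than a lookup on year alone.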


In [13]:
houseprices_df["log_price"] = np.log(houseprices_df["saleprice"])

hp2_df = houseprices_df[
    [
        "log_price",
        "overallqual",
        "bsmtqual",
        "kitchenqual",
        "grlivarea",
        "garagearea",
        "fullbath",
        "yearbuilt",
        "exterqual",
        "garagefinish",
        "centralair",
        "gdp",
        "fr_int_rate",
    ]
]

hp2_df = hp2_df.dropna()

drop_ids = hp2_df.sort_values("grlivarea", ascending=False).iloc[:4].index
print(f"Dropping house ids {list(drop_ids)}")
hp2_df = hp2_df.drop(index=drop_ids)

hp2_df["centralair"] = (hp2_df["centralair"] == "Y").astype(int)

quality_map = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
hp2_df[["exterqual"]] = hp2_df[["exterqual"]].replace(quality_map)
hp2_df[["kitchenqual"]] = hp2_df[["kitchenqual"]].replace(quality_map)

quality_map2 = {None: 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
hp2_df[["bsmtqual"]] = hp2_df[["bsmtqual"]].replace(quality_map2)

garage2_df = pd.get_dummies(hp2_df["garagefinish"], drop_first=True)

hp2_df = pd.concat([hp2_df, garage2_df], axis=1)
hp2_df.drop(columns=["garagefinish"], inplace=True)

Dropping house ids [1298, 523, 1182, 691]



In [14]:
X = hp2_df.drop(columns=["log_price", "RFn"])
y = hp2_df["log_price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=32
)


In [15]:
lrm = LinearRegression()

lrm.fit(X_train, y_train)

# We are making predictions here
y_preds_train = lrm.predict(X_train)
y_preds_test = lrm.predict(X_test)

print(
    "R-squared of the model in training set is: {}".format(lrm.score(X_train, y_train))
)
print("-----Test set statistics-----")
print("R-squared of the model in test set is: {}".format(lrm.score(X_test, y_test)))
print(
    "Mean absolute error of the prediction is: {}".format(
        mean_absolute_error(y_test, y_preds_test)
    )
)
print("Mean squared error of the prediction is: {}".format(mse(y_test, y_preds_test)))
print(
    "Root mean squared error of the prediction is: {}".format(
        rmse(y_test, y_preds_test)
    )
)
print(
    "Mean absolute percentage error of the prediction is: {}".format(
        np.mean(np.abs((y_test - y_preds_test) / y_test)) * 100
    )
)

R-squared of the model in training set is: 0.8380793269446885
-----Test set statistics-----
R-squared of the model in test set is: 0.8364429065465036
Mean absolute error of the prediction is: 0.1173637654865936
Mean squared error of the prediction is: 0.024962694521947302
Root mean squared error of the prediction is: 0.1579958686863277
Mean absolute percentage error of the prediction is: 0.9721036017745278



The table below compares the linear regression model before and after adding the GDP and Federal Reserve interest rate features. Apart from the training R-squared, all metrics are test set statistics.

| Metric | Without external features | With `gdp` and `fr_int_rate` |
|---|---|---|
| R-squared (training set) | 0.8380568201689231 | 0.8380793269446885 |
| R-squared (test set) | 0.835899234076841 | 0.8364429065465036 |
| Mean absolute error | 0.11747184712690383 | 0.1173637654865936 |
| Mean squared error | 0.025045671844994687 | 0.024962694521947302 |
| Root mean squared error | 0.15825824416122747 | 0.1579958686863277 |
| Mean absolute percentage error | 0.9730359339615233 | 0.9721036017745278 |

The two sets of results are almost identical. I would therefore conclude that either I did something wrong, or these are not useful features, or their explanatory power is already captured by the other variables. Both features are defined per sale year, so each takes only five distinct values across the whole dataset and can at most pick up year-to-year differences in the price level.
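
A rough way to gauge how much any year-level feature could contribute is to measure the share of variance in `log_price` that lies between sale years. On its own, a variable that is a function of `yrsold` alone cannot explain more than the year group means do. The sketch below uses synthetic data with a made-up year effect and a hypothetical helper `between_group_share`; it indicates the approach and does not reproduce a result from the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: 1,000 log prices spread over five sale years with a
# small year effect. These numbers are made up for illustration.
years = rng.integers(2006, 2011, size=1000)
year_effect = {2006: 0.00, 2007: 0.02, 2008: 0.01, 2009: -0.02, 2010: -0.01}
year_shift = np.array([year_effect[int(y)] for y in years])
log_price_demo = 12.0 + year_shift + rng.normal(scale=0.4, size=years.size)


def between_group_share(values, groups):
    """Share of total variance explained by group means (eta squared)."""
    grand_mean = values.mean()
    ss_total = np.sum((values - grand_mean) ** 2)
    ss_between = 0.0
    for g in np.unique(groups):
        members = values[groups == g]
        ss_between += members.size * (members.mean() - grand_mean) ** 2
    return ss_between / ss_total


share = between_group_share(log_price_demo, years)
print(f"Variance share attributable to sale year: {share:.4f}")
```

On the notebook's data, the equivalent call would be `between_group_share(houseprices_df["log_price"].to_numpy(), houseprices_df["yrsold"].to_numpy())`. A small share would be consistent with the near-identical metrics above.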