1. Load the houseprices data from Thinkful's database.
2. Do data cleaning, exploratory data analysis, and feature engineering. You can use your previous work in this module. But make sure that your work is satisfactory.
3. Now, split your data into train and test sets where 20% of the data resides in the test set.
4. Build several linear regression models including Lasso, Ridge, or ElasticNet and train them in the training set. Use k-fold cross-validation to select the best hyperparameters if your models include one!
5. Evaluate your best model on the test set.
6. So far, you have only used the features in the dataset. However, house prices can be affected by many factors like economic activity and the interest rates at the time they are sold. So, try to find some useful factors that are not included in the dataset. Integrate these factors into your model and assess the prediction performance of your model. Discuss the implications of adding these external variables into your model.
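
Steps 3 through 5 can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression`, not the notebook's actual pipeline; the names `X_demo`, `y_demo`, and `search`, the alpha grid, and `cv=5` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the engineered features and sale prices (assumption).
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Step 3: hold out 20% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 4: pick alpha by k-fold cross-validation on the training rows only.
search = GridSearchCV(Ridge(), param_grid={"alpha": np.logspace(-4, 2, 7)}, cv=5)
search.fit(X_train, y_train)

# Step 5: score the refit best model once on the held-out test set.
test_r2 = search.score(X_test, y_test)
print("best alpha:", search.best_params_["alpha"])
print("test r^2:", test_r2)
```

Keeping the test rows out of the grid search means the final score is not influenced by the hyperparameter selection.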


In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine

In [2]:
# %load ../utility/overhead.py
#record module versions used in cell 1
#
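def version_recorder():
    '''
    only works if the import cell is the first cell run. returns a dictionary of {module: version}.
    '''
    import pkg_resources
    #keep only import statements; blank lines and comments would break the parsing below
    resources = [line for line in In[1].splitlines() if line.startswith(('import ', 'from '))]
    #the top-level package name is the second token, e.g. 'sklearn' in 'from sklearn import linear_model'
    version_dict = { resource.split()[1].split(".")[0] : pkg_resources.get_distribution(resource.split()[1].split(".")[0]).version for resource in resources }
    return version_dict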
version_recorder()

{'numpy': '1.16.2',
 'pandas': '0.24.2',
 'statsmodels': '0.10.1',
 'sklearn': '0.0',
 'matplotlib': '3.0.3',
 'seaborn': '0.9.0',
 'sqlalchemy': '1.3.5'}

In [3]:
#credentials
user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'houseprices'
dialect = 'postgresql'

db_addr = f'{dialect}://{user}:{pw}@{host}:{port}/{db}'
engine = create_engine(db_addr)

query = '''
SELECT
    *
FROM
    houseprices
'''

raw_df = pd.read_sql(query, con=engine)
engine.dispose()

In [4]:
house_df = raw_df.copy(deep=True)

In [5]:
#create df with categorical variables to select some features from
categorical_feat = house_df.dtypes[house_df.dtypes == 'object'].index
#one-hot encode every categorical column (prefix = column name, first level dropped);
#numerical columns pass through unchanged
new_categories_df = pd.get_dummies(house_df, columns=categorical_feat, drop_first=True)

#create X & y
X = house_df[["overallqual", "grlivarea", "fullbath", "yearbuilt"]].copy()  #.copy() so the new column below is not set on a slice of house_df
#the dummy columns are built above but are not part of X; to include them, use
#X = pd.concat([X, new_categories_df[['exterqual_TA', 'foundation_CBlock']]], axis=1)
X["overallqual_x_year"] = house_df.overallqual * house_df.yearbuilt  #interaction term
y = house_df.saleprice


In [46]:
lasso_model = Lasso(tol=.5)  #increased tol to avoid 'ConvergenceWarning: Objective did not converge'
alphas = np.array([.0001, .001, .01, .1, 1, 10, 100, 1_000, 10_000, 100_000])

grid_las = GridSearchCV(estimator=lasso_model, param_grid=dict(alpha=alphas), cv=12)
grid_las.fit(X, y)
print(grid_las)
print("best r^2 is: {}".format(grid_las.best_score_))
print("associated lambda value is: {}".format(grid_las.best_estimator_.alpha))

GridSearchCV(cv=12, error_score='raise-deprecating',
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=None,
                             selection='cyclic', tol=0.5, warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'alpha': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
       1.e+04, 1.e+05])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
best r^2 is: 0.7272010779987932
associated lambda value is: 10000.0
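
Raising `tol` silences the convergence warning but also loosens the solver's stopping criterion. One alternative, which this notebook does not use, is to standardize the features before the Lasso so the coordinate-descent solver works on comparably scaled columns. The sketch below uses synthetic data; the names `pipe` and `grid_scaled`, the column scales, and the alpha grid are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the house-price features (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
# Put the columns on very different scales, loosely like square footage vs. bathroom counts.
X_demo = X_demo * np.array([1.0, 1_000.0, 1.0, 100.0, 10.0])

# The scaler is fit inside each CV fold, so validation rows never leak into the scaling.
pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10_000))])
param_grid = {"lasso__alpha": [0.01, 0.1, 1, 10, 100]}  # step name + double underscore + parameter

grid_scaled = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid_scaled.fit(X_demo, y_demo)
print("best r^2 is: {}".format(grid_scaled.best_score_))
print("associated lambda value is: {}".format(grid_scaled.best_params_["lasso__alpha"]))
```

With standardized inputs, the penalty also treats every coefficient on the same footing, which makes the selected alpha easier to interpret.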


In [47]:
ridge_model = Ridge()
alphas = np.array([.0001, .001, .01, .1, 1, 10, 100, 1_000, 10_000, 100_000])

grid_ridge = GridSearchCV(estimator=ridge_model, param_grid=dict(alpha=alphas), cv=7)
grid_ridge.fit(X, y)
print(grid_ridge)
print("best r^2 is: {}".format(grid_ridge.best_score_))
print("associated lambda value is: {}".format(grid_ridge.best_estimator_.alpha))

GridSearchCV(cv=7, error_score='raise-deprecating',
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='warn', n_jobs=None,
             param_grid={'alpha': array([1.e-04, 1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03,
       1.e+04, 1.e+05])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
best r^2 is: 0.7420424864751328
associated lambda value is: 0.1


In [42]:
elasticnet_model = ElasticNet(tol=1)   #increased tol to avoid 'ConvergenceWarning: Objective did not converge'
alphas = np.array([.0001, .001, .01, .1, 1, 10, 100])
l1_ratio = np.array([.1, .5, .9])
params = dict(alpha=alphas, l1_ratio=l1_ratio)  #keys must match ElasticNet's parameter names ('alpha', not 'alphas')

grid_enet = GridSearchCV(estimator=elasticnet_model, param_grid=params, cv=7)
grid_enet.fit(X, y)
print(grid_enet)
print("best r^2 is: {}".format(grid_enet.best_score_))
print("associated lambda value is: {}".format(grid_enet.best_estimator_.alpha))
print("associated l1_ratio value is: {}".format(grid_enet.best_estimator_.l1_ratio))

### External Factors

Realtors may cite "location, location, location," but timing is also vital when buying and selling assets. The Great Recession had a huge impact on the housing market (and vice versa). I am including a term for a housing recession, tentatively defined as 2009-2012.

The largest problem with adding this term is that it requires finding a separate data set and integrating that information into the existing data in my notebook. One particular issue is that, while I found several ways to identify the end of the recession (a recession is classically defined as two consecutive quarters of declining GDP), house prices take longer to recover, so there is no clear-cut way to define this term. Because such large-scale macroeconomic events occur rarely, it is not easy to test for overfitting; thus I have tried to make the use of this term conservative.
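
A minimal sketch of such a term, assuming the sale year is available in a column named `yrsold` and using the tentative 2009-2012 window from above. The toy frame `demo_df`, the constants, and the column name `housing_recession` are hypothetical; the window itself is a modeling choice, not an established definition.

```python
import pandas as pd

# Toy rows standing in for house_df; only the assumed sale-year column is needed here.
demo_df = pd.DataFrame({"yrsold": [2006, 2008, 2009, 2010, 2012, 2013]})

# Tentative housing-recession window (assumption; both endpoints inclusive).
RECESSION_START, RECESSION_END = 2009, 2012

# 1 if the house sold inside the window, 0 otherwise.
demo_df["housing_recession"] = (
    demo_df["yrsold"].between(RECESSION_START, RECESSION_END).astype(int)
)
print(demo_df)
```

A single 0/1 indicator adds only one coefficient to the model, which keeps the use of the term conservative.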