## Assignments

As in previous checkpoints, please submit links to two Jupyter notebooks (one for each assignment below).

Please submit links to all your work below. This is not a graded checkpoint, but you should discuss your solutions with your mentor. Also, when you're done, compare your work to [these example solutions](https://github.com/Thinkful-Ed/machine-learning-regression-problems/blob/master/notebooks/5.solution_evaluating_goodness_of_fit.ipynb).



### 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* As in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by *visibility*. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import seaborn as sns
import statsmodels.api as sm

sns.set_style('dark')

import warnings
warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [3]:
weather['target'] = weather.apparenttemperature - weather.temperature

y = weather['target']

X = weather[['humidity', 'windspeed']]

X = sm.add_constant(X)

results = sm.OLS(y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Mon, 16 Mar 2020   Prob (F-statistic):               0.00
Time:                        23:01:54   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.4381      0.021    115.948      0.0

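R-squared and adjusted R-squared are both 0.288, so humidity and wind speed explain less than a third of the variance in the target. That is low. AIC and BIC (both about 3.409e+05) say little on their own; they are only useful for comparing this model with other models fitted to the same data.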
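
R-squared and adjusted R-squared are reported as the same value here because the penalty for extra predictors is tiny when there are 96,453 observations. A minimal sketch of the usual adjusted R-squared formula for a model with an intercept shows this; the function name `adjusted_r_squared` and the 20-observation comparison are illustrative assumptions, not part of the assignment.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for an OLS model that includes an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read from the summary above: R-squared 0.288, 96453 observations,
# 2 predictors (humidity and windspeed).
weather_adj = adjusted_r_squared(0.288, 96453, 2)

# Hypothetical contrast: the same R-squared with only 20 observations.
small_sample_adj = adjusted_r_squared(0.288, 20, 2)

print(round(weather_adj, 3), round(small_sample_adj, 3))
```

With the full dataset the adjustment vanishes at three decimal places, while in the hypothetical small sample it is clearly visible.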

In [4]:
weather['hum_wind'] = weather['humidity'] * weather['windspeed']

y = weather['target']

X = weather[['humidity', 'windspeed', 'hum_wind']]

X = sm.add_constant(X)

results = sm.OLS(y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Mon, 16 Mar 2020   Prob (F-statistic):               0.00
Time:                        23:07:47   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0839      0.033      2.511      0.0

R-squared and adjusted R-squared are still low at 0.341, but better than the first model's 0.288. AIC and BIC both fall from 3.409e+05 to 3.334e+05, which also favors this model.
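
The AIC and BIC values in the two summaries can be reproduced, approximately, from the reported log-likelihoods. The sketch below uses the standard definitions (AIC = 2k - 2 ln L, BIC = k ln n - 2 ln L), where k counts the estimated coefficients including the constant. The helper names are illustrative, and the log-likelihoods are the rounded figures printed in the summaries, so the results are approximate.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


n_obs = 96453

# Rounded log-likelihoods from the two summaries; k includes the constant.
base_llf, base_k = -1.7046e5, 3    # humidity + windspeed
inter_llf, inter_k = -1.6669e5, 4  # ... plus the interaction term

base_aic, inter_aic = aic(base_llf, base_k), aic(inter_llf, inter_k)
base_bic, inter_bic = bic(base_llf, base_k, n_obs), bic(inter_llf, inter_k, n_obs)

# Only the differences between models carry information.
print("AIC drop:", base_aic - inter_aic)
print("BIC drop:", base_bic - inter_bic)
```

The absolute size of either criterion mostly reflects the number of observations, so the drop between the two models is the figure to look at.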

In [5]:
weather['hum_wind'] = weather['humidity'] * weather['windspeed']

y = weather['target']

X = weather[['humidity', 'windspeed', 'hum_wind', 'visibility']]

X = sm.add_constant(X)

results = sm.OLS(y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 target   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                 1.377e+04
Date:                Mon, 16 Mar 2020   Prob (F-statistic):               0.00
Time:                        23:10:35   Log-Likelihood:            -1.6504e+05
No. Observations:               96453   AIC:                         3.301e+05
Df Residuals:                   96448   BIC:                         3.301e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.1006      0.039    -28.459      0.0

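This model is still not great, but it is the best of the three: it has the highest adjusted R-squared (0.363) and the lowest AIC and BIC (both about 3.301e+05). Note that it adds *visibility* to the interaction model rather than to the first model, as the task asks, so these results do not separate the contribution of *visibility* from that of the interaction term.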
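
One way to make the AIC/BIC comparison explicit is to rank the fitted models by the chosen criterion. The helper below is a hypothetical sketch: `rank_models` is not part of statsmodels, and the `SimpleNamespace` objects are stand-ins carrying the rounded AIC and BIC values from the three summaries. Fitted statsmodels OLS results expose `aic` and `bic` attributes, so they could be passed in the same way.

```python
from types import SimpleNamespace


def rank_models(fits, criterion="aic"):
    """Return (name, value) pairs sorted from lowest to highest criterion."""
    scored = [(name, getattr(fit, criterion)) for name, fit in fits.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Stand-ins for the three fitted models, using the rounded values
# printed in the summaries above.
fits = {
    "humidity + windspeed": SimpleNamespace(aic=3.409e5, bic=3.409e5),
    "+ interaction": SimpleNamespace(aic=3.334e5, bic=3.334e5),
    "+ interaction + visibility": SimpleNamespace(aic=3.301e5, bic=3.301e5),
}

best_name, best_aic = rank_models(fits, "aic")[0]
worst_name, worst_bic = rank_models(fits, "bic")[-1]
print("Lowest AIC:", best_name)
print("Highest BIC:", worst_name)
```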

### 2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and assess the goodness of fit of your model using F-test, R-squared, adjusted R-squared, AIC and BIC.
* Do you think your model is satisfactory? If so, why?
* In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
* For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [6]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
hp_df = pd.read_sql_query('select * from houseprices',con=engine, index_col='id')

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [8]:
#exterqual and mszoning seem to have quite a bit of variance. let's use one-hot encoding for these two
hp_df = pd.concat([hp_df, pd.get_dummies(hp_df['mszoning'], prefix='mszoning', drop_first=True)], axis=1)
hp_df = pd.concat([hp_df, pd.get_dummies(hp_df['exterqual'], prefix='exterqual', drop_first=True)], axis=1)

dummy_col_names = list(pd.get_dummies(hp_df['mszoning'], prefix='mszoning', drop_first=True).columns)
dummy_col_names = dummy_col_names + list(pd.get_dummies(hp_df['exterqual'], prefix='exterqual', drop_first=True).columns)

In [10]:
# Linear Regression
Y = hp_df['saleprice']
# X is the feature set: five numeric features plus
# the one-hot encoded mszoning and exterqual dummies
X = hp_df[['overallqual', 'grlivarea', 'garagecars', 'garagearea', 'totalbsmtsf'] + dummy_col_names]

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.789
Model:                            OLS   Adj. R-squared:                  0.787
Method:                 Least Squares   F-statistic:                     450.4
Date:                Mon, 16 Mar 2020   Prob (F-statistic):               0.00
Time:                        23:16:41   Log-Likelihood:                -17409.
No. Observations:                1460   AIC:                         3.484e+04
Df Residuals:                    1447   BIC:                         3.491e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -1.744e+04   1.55e+04     -1.129   

R-squared (0.789) and adjusted R-squared (0.787) are high, though not close to 1. The p-value of the F-statistic is effectively 0, so the features are jointly significant. This might be a satisfactory model, but there is probably room to push R-squared a bit higher.
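
The overall F-statistic is closely tied to R-squared: for a model with an intercept it is the explained variance per predictor divided by the unexplained variance per residual degree of freedom. The sketch below applies that relationship to the rounded figures in the summary above; the function name is illustrative, and because R-squared is rounded to three decimals the result only approximates the reported 450.4.

```python
def f_statistic(r_squared, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from its R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


# From the summary above: R-squared 0.789, 1460 observations, Df Model 12.
f_value = f_statistic(0.789, 1460, 12)
print(round(f_value, 1))
```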

In [11]:
# optimizing the model:

# adding bsmtqual
hp_df = pd.concat([hp_df, pd.get_dummies(hp_df['bsmtqual'], prefix='bsmtqual', drop_first=True)], axis=1)
dummy_col_names = dummy_col_names + list(pd.get_dummies(hp_df['bsmtqual'], prefix='bsmtqual', drop_first=True).columns)

In [13]:
# Re-running Linear Regression
Y = hp_df['saleprice']
# X is the feature set: numeric features (garagearea is left out this time)
# plus the mszoning, exterqual, and bsmtqual dummies
X = hp_df[['overallqual', 'grlivarea', 'garagecars', 'totalbsmtsf'] + dummy_col_names]
# optimizing by dropping features with insignificant p-values;
# reassign instead of using inplace=True to avoid modifying a slice of hp_df
X = X.drop(['mszoning_RH', 'mszoning_RM'], axis=1)

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:              saleprice   R-squared:                       0.799
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     478.4
Date:                Tue, 17 Mar 2020   Prob (F-statistic):               0.00
Time:                        08:38:36   Log-Likelihood:                -17374.
No. Observations:                1460   AIC:                         3.477e+04
Df Residuals:                    1447   BIC:                         3.484e+04
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         1.389e+04   1.07e+04      1.301   

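R-squared and adjusted R-squared both increased (from 0.789 to 0.799 and from 0.787 to 0.797). AIC and BIC are lower than in the previous model (3.477e+04 vs. 3.484e+04, and 3.484e+04 vs. 3.491e+04). All features are significant. This model looks superior to the previous one.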