### 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We've estimated the following model:

$$ expenditure = 873 + 0.0012 \cdot annual\_income + 0.00002 \cdot annual\_income^2 - 223.57 \cdot have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating whether the family has children. Interpret the estimated coefficients. What additional statistics should be given to make sure that your interpretations make sense statistically? Write up your answer.

### Answer 1: 

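Assuming the coefficients are statistically significant (which we cannot verify, because no standard errors, t-statistics, or p-values are reported), we can say the following. The bias term (intercept) is 873 dollars: the predicted recreation spending of a family with zero income and no kids. Expenditure increases quadratically with annual income: the marginal effect of one additional dollar of income is 0.0012 + 2 × 0.00002 × annual_income, so it grows as income rises. Families with kids spend 223.57 dollars less on recreation than families without kids at the same income. To confirm that these interpretations make sense statistically, the standard errors, t-statistics, and p-values (or confidence intervals) of the coefficients should be reported.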
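Because income enters the model with a squared term, its effect on spending is not a single number. The sketch below evaluates the fitted equation and its derivative with respect to income, using only the coefficients from the model above. The function names and the income levels are hypothetical and chosen purely for illustration.

```python
# Coefficients copied from the estimated model in the exercise.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (
        INTERCEPT
        + B_INCOME * annual_income
        + B_INCOME_SQ * annual_income ** 2
        + B_KIDS * have_kids
    )


def marginal_effect_of_income(annual_income):
    """Derivative of predicted expenditure with respect to annual_income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Hypothetical income levels, only to show that the effect grows with income.
for income in (1_000, 10_000, 50_000):
    print(income, marginal_effect_of_income(income))

# The dummy shifts the prediction by the same amount at every income level.
print(predicted_expenditure(10_000, 1) - predicted_expenditure(10_000, 0))
```

The derivative depends on the income level, while the effect of *have_kids* is a constant shift. That is why the income coefficients have to be interpreted together rather than one at a time.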

### 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

### Answer 2:

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf
from sqlalchemy import create_engine

# Display preferences.
%matplotlib inline
pd.options.display.float_format = "{:.3f}".format

%load_ext nb_black

import warnings

warnings.filterwarnings(action="ignore")

In [2]:
postgres_user = "dsbc_student"
postgres_pw = "7*.8G9QH21"
postgres_host = "142.93.121.174"
postgres_port = "5432"
postgres_db = "weatherinszeged"

engine = create_engine(
    "postgresql://{}:{}@{}:{}/{}".format(
        postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db
    )
)
szeged_df = pd.read_sql_query("select * from weatherinszeged", con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()


szeged_df.head(50)

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472,7.389,0.89,14.12,251.0,15.826,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.356,7.228,0.86,14.265,259.0,15.826,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.378,9.378,0.89,3.928,204.0,14.957,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.289,5.944,0.83,14.104,269.0,15.826,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.756,6.978,0.83,11.045,259.0,15.826,0.0,1016.51,Partly cloudy throughout the day.
5,2006-04-01 03:00:00+00:00,Partly Cloudy,rain,9.222,7.111,0.85,13.959,258.0,14.957,0.0,1016.66,Partly cloudy throughout the day.
6,2006-04-01 04:00:00+00:00,Partly Cloudy,rain,7.733,5.522,0.95,12.365,259.0,9.982,0.0,1016.72,Partly cloudy throughout the day.
7,2006-04-01 05:00:00+00:00,Partly Cloudy,rain,8.772,6.528,0.89,14.152,260.0,9.982,0.0,1016.84,Partly cloudy throughout the day.
8,2006-04-01 06:00:00+00:00,Partly Cloudy,rain,10.822,10.822,0.82,11.318,259.0,9.982,0.0,1017.37,Partly cloudy throughout the day.
9,2006-04-01 07:00:00+00:00,Partly Cloudy,rain,13.772,13.772,0.72,12.526,279.0,9.982,0.0,1017.22,Partly cloudy throughout the day.


In [3]:
szeged_df.isna().mean()

date                  0.000
summary               0.000
preciptype            0.000
temperature           0.000
apparenttemperature   0.000
humidity              0.000
windspeed             0.000
windbearing           0.000
visibility            0.000
loudcover             0.000
pressure              0.000
dailysummary          0.000
dtype: float64

In [4]:
szeged_df.describe()

Unnamed: 0,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure
count,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0
mean,11.933,10.855,0.735,10.811,187.509,10.347,0.0,1003.236
std,9.552,10.697,0.195,6.914,107.383,4.192,0.0,116.97
min,-21.822,-27.717,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.689,2.311,0.6,5.828,116.0,8.34,0.0,1011.9
50%,12.0,12.0,0.78,9.966,180.0,10.046,0.0,1016.45
75%,18.839,18.839,0.89,14.136,290.0,14.812,0.0,1021.09
max,39.906,39.344,1.0,63.853,359.0,16.1,0.0,1046.38


In [15]:
szeged_df["target_temp"] = szeged_df["apparenttemperature"] - szeged_df["temperature"]

In [16]:
szeged_df.describe()

Unnamed: 0,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,target_temp
count,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0
mean,11.933,10.855,0.735,10.811,187.509,10.347,0.0,1003.236,-1.078
std,9.552,10.697,0.195,6.914,107.383,4.192,0.0,116.97,1.679
min,-21.822,-27.717,0.0,0.0,0.0,0.0,0.0,0.0,-10.183
25%,4.689,2.311,0.6,5.828,116.0,8.34,0.0,1011.9,-2.217
50%,12.0,12.0,0.78,9.966,180.0,10.046,0.0,1016.45,0.0
75%,18.839,18.839,0.89,14.136,290.0,14.812,0.0,1021.09,0.0
max,39.906,39.344,1.0,63.853,359.0,16.1,0.0,1046.38,4.811


In [17]:
y = szeged_df["target_temp"]
X = szeged_df[["humidity", "windspeed"]]

In [18]:
X = sm.add_constant(X)

results = sm.OLS(y, X).fit()

results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | target_temp | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Tue, 07 Apr 2020 | Prob (F-statistic): | 0.0 |
| Time: | 20:37:48 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| windspeed | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.267 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


The coefficients are statistically significant at p < 0.001. The difference between apparent and actual temperature, i.e. target_temp (possibly a wind chill effect), decreases as humidity or windspeed increases. For a one-unit increase in humidity, target_temp decreases by 3.03 degrees; since humidity ranges from 0 to 1 in this data, a one-unit increase spans its whole range. For a one-unit increase in windspeed, target_temp decreases by 0.119 degrees.
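To make the size of these effects concrete, the fitted equation can be evaluated by hand. The sketch below plugs the coefficients from the regression table into a small helper (the name `predicted_gap` is hypothetical) and uses the sample means reported by `describe()` as a reference point. It is an illustration of the arithmetic, not a replacement for `results.predict`.

```python
# Coefficients copied from the first OLS summary (const, humidity, windspeed).
CONST = 2.4381
B_HUMIDITY = -3.0292
B_WINDSPEED = -0.1193

# Sample means taken from the describe() output above.
MEAN_HUMIDITY = 0.735
MEAN_WINDSPEED = 10.811


def predicted_gap(humidity, windspeed):
    """Predicted apparenttemperature - temperature for the additive model."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


at_means = predicted_gap(MEAN_HUMIDITY, MEAN_WINDSPEED)

# Humidity is a fraction between 0 and 1, so a 0.1 step is a more
# realistic change than a full unit.
humidity_step = predicted_gap(MEAN_HUMIDITY + 0.1, MEAN_WINDSPEED) - at_means

# Effect of one extra unit of windspeed at mean humidity.
windspeed_step = predicted_gap(MEAN_HUMIDITY, MEAN_WINDSPEED + 1) - at_means

print(at_means, humidity_step, windspeed_step)
```

Evaluated at the sample means, the prediction comes out at roughly -1.08, which matches the mean of target_temp in the `describe()` output, as expected for an OLS model with a constant.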

In [19]:
szeged_df["wind_hum_interaction"] = szeged_df["windspeed"] * szeged_df["humidity"]

In [20]:
y = szeged_df["target_temp"]
X2 = szeged_df[["humidity", "windspeed", "wind_hum_interaction"]]

In [21]:
X2 = sm.add_constant(X2)

results = sm.OLS(y, X2).fit()

results.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | target_temp | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Tue, 07 Apr 2020 | Prob (F-statistic): | 0.0 |
| Time: | 20:55:17 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| windspeed | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| wind_hum_interaction | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.265 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


All of the coefficients are still statistically significant at the 5% level: the constant has p = 0.012, and the other three coefficients have p < 0.001.

The signs of the humidity and windspeed coefficients are now positive.  The interaction variable has a negative sign and a larger magnitude than either of the individual coefficients.

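For a given windspeed, a one-unit increase in humidity changes the target by 0.1775 - 0.2971 × windspeed degrees. For a given humidity, a one-unit increase in windspeed changes the target by 0.0905 - 0.2971 × humidity degrees. Both effects are therefore negative over most of the data (whenever windspeed is above about 0.6 and humidity is above about 0.3), even though the standalone humidity and windspeed coefficients are positive.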
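With an interaction term, the effect of each variable depends on the level of the other, so the standalone coefficients cannot be read in isolation. The sketch below computes the two marginal effects from the coefficients in the second regression table and evaluates them at the sample means from `describe()`. The helper names are hypothetical.

```python
# Coefficients copied from the second OLS summary (interaction model).
B_HUMIDITY = 0.1775
B_WINDSPEED = 0.0905
B_INTERACTION = -0.2971

# Sample means taken from the describe() output above.
MEAN_HUMIDITY = 0.735
MEAN_WINDSPEED = 10.811


def humidity_effect(windspeed):
    """Change in target_temp per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in target_temp per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Marginal effects evaluated at the sample means.
print(humidity_effect(MEAN_WINDSPEED))
print(windspeed_effect(MEAN_HUMIDITY))

# Levels of the other variable at which each marginal effect changes sign.
windspeed_threshold = B_HUMIDITY / -B_INTERACTION
humidity_threshold = B_WINDSPEED / -B_INTERACTION
print(windspeed_threshold, humidity_threshold)
```

At the sample means the two effects are roughly -3.03 and -0.13, close to the humidity and windspeed coefficients of the model without the interaction. The sign change in the standalone coefficients therefore does not mean that the direction of the relationships reversed for typical observations.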

### 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* Load the **houseprices** data from Thinkful's database.
* Run your house prices model again and interpret the results. Which features are statistically significant, and which are not?
* Now, exclude the insignificant features from your model. Did anything change?
* Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have a more prominent effect on house prices?
* Do the results sound reasonable to you? If not, try to explain the potential reasons.

### Answer 3:

Will do when I finish the houseprices assignment.  TBD.