### 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous checkpoint. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
from sklearn import linear_model
import statsmodels.api as sm
%matplotlib inline

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'
                       .format(postgres_user, postgres_pw, postgres_host,
                              postgres_port, postgres_db))

df = pd.read_sql_query('SELECT * FROM weatherinszeged', con=engine)

engine.dispose()

In [3]:
df.head()

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


In [4]:
df.describe()

Unnamed: 0,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure
count,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0,96453.0
mean,11.932678,10.855029,0.734899,10.81064,187.509232,10.347325,0.0,1003.235956
std,9.551546,10.696847,0.195473,6.913571,107.383428,4.192123,0.0,116.969906
min,-21.822222,-27.716667,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.688889,2.311111,0.6,5.8282,116.0,8.3398,0.0,1011.9
50%,12.0,12.0,0.78,9.9659,180.0,10.0464,0.0,1016.45
75%,18.838889,18.838889,0.89,14.1358,290.0,14.812,0.0,1021.09
max,39.905556,39.344444,1.0,63.8526,359.0,16.1,0.0,1046.38


In [5]:
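# Target: temperature minus apparent temperature
# (positive when it feels colder than the measured temperature)
Y = df['temperature'] - df['apparenttemperature']
# Features: humidity and windspeed
X = df[['humidity', 'windspeed']]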

#adding constant
X = sm.add_constant(X)

#fitting our model 
results = sm.OLS(Y,X).fit()

results.summary()



Dep. Variable:,y,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Wed, 02 Oct 2019",Prob (F-statistic):,0.0
Time:,10:35:16,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.4381,0.021,-115.948,0.000,-2.479,-2.397
humidity,3.0292,0.024,126.479,0.000,2.982,3.076
windspeed,0.1193,0.001,176.164,0.000,0.118,0.121

Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


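All of the estimated coefficients in this model are statistically significant. Each has a reported p-value of 0.000, which means the p-value rounds to zero at three decimal places, not that it is exactly zero. With 96,453 observations the standard errors are very small, so the t-statistics are very large.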
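To see why the summary prints 0.000, the p-values can be roughly reproduced from the t-statistics. The sketch below is only an approximation: it assumes the normal distribution as a stand-in for the t distribution (reasonable with 96,450 residual degrees of freedom), and the helper `two_sided_p` is a hypothetical name, not part of statsmodels.

```python
import math

# t-statistics copied from the OLS summary above.
t_stats = {"const": -115.948, "humidity": 126.479, "windspeed": 176.164}


def two_sided_p(t):
    """Two-sided p-value under the normal approximation (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))


p_values = {name: two_sided_p(t) for name, t in t_stats.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3e}")

# For contrast, a hypothetical t-statistic of 2.0 sits near the usual 0.05 cutoff.
p_borderline = two_sided_p(2.0)
print(f"t = 2.0: p = {p_borderline:.4f}")
```

The p-values for the three coefficients are so small that they underflow to zero in floating point, which is why the summary table shows 0.000.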

Both explanatory variables are continuous and have positive coefficients, so a higher humidity or a higher windspeed is associated with a larger difference between the temperature and the apparent temperature. Because the target is computed as temperature minus apparent temperature, a larger value means it feels colder than the measured temperature. Apparent temperature is the temperature as people perceive it, so it makes sense that windspeed and humidity influence that perception. The positive sign on windspeed in particular is consistent with wind chill.

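The estimated coefficients tell us that, holding windspeed constant, a one-unit increase in humidity increases the difference between the temperature and the apparent temperature by 3.0292. Because humidity is recorded on a 0 to 1 scale, a one-unit change spans its entire range; a more practical reading is that a 0.10 increase in humidity raises the target by about 0.30. Holding humidity constant, the target increases by 0.1193 for each one-unit increase in windspeed.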
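The interpretation above can be checked by plugging values into the fitted equation. This is a minimal sketch that hard-codes the coefficients from the summary table; the function `predict_gap` and the input conditions are hypothetical and chosen only for illustration.

```python
# Coefficients copied from the OLS summary above (model without interaction).
CONST = -2.4381
B_HUMIDITY = 3.0292
B_WINDSPEED = 0.1193


def predict_gap(humidity, windspeed):
    """Predicted temperature minus apparent temperature."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Hypothetical conditions, not taken from the dataset.
base = predict_gap(humidity=0.70, windspeed=10.0)
more_humid = predict_gap(humidity=0.80, windspeed=10.0)
windier = predict_gap(humidity=0.70, windspeed=11.0)

print(f"baseline gap:          {base:.4f}")
print(f"+0.10 humidity adds:   {more_humid - base:.4f}")
print(f"+1 windspeed adds:     {windier - base:.4f}")
```

Because the model is linear with no interaction, the two differences do not depend on the baseline chosen: they are always 0.10 times the humidity coefficient and 1 times the windspeed coefficient.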



* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [7]:
df['humidity_wind_speed'] = df['humidity']*df['windspeed']


#selecting our target variable and feature set
Y = df['temperature'] - df['apparenttemperature']
X = df[['humidity', 'windspeed', 'humidity_wind_speed']]

#adding constant
X = sm.add_constant(X)

#fitting our model 
results = sm.OLS(Y,X).fit()

results.summary()

Dep. Variable:,y,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Wed, 02 Oct 2019",Prob (F-statistic):,0.0
Time:,10:36:55,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0839,0.033,-2.511,0.012,-0.149,-0.018
humidity,-0.1775,0.043,-4.133,0.000,-0.262,-0.093
windspeed,-0.0905,0.002,-36.797,0.000,-0.095,-0.086
humidity_wind_speed,0.2971,0.003,88.470,0.000,0.291,0.304

Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0


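After adding the interaction term, the coefficients of all three explanatory variables are still statistically significant: each has a reported p-value of 0.000. The coefficients for humidity and windspeed are now negative, but they can no longer be read on their own. With an interaction in the model, -0.1775 is the effect of a one-unit increase in humidity only when windspeed is zero, and -0.0905 is the effect of a one-unit increase in windspeed only when humidity is zero. The interaction coefficient is positive: each one-unit increase in windspeed raises the effect of humidity on the target by 0.2971, and each one-unit increase in humidity raises the effect of windspeed by the same amount. The overall effect of humidity is therefore `-0.1775 + 0.2971 * windspeed`, which turns positive once windspeed exceeds about 0.6, and the overall effect of windspeed is `-0.0905 + 0.2971 * humidity`, which turns positive once humidity exceeds about 0.30. Since the median windspeed is about 10 and the median humidity is 0.78, both effects are positive under typical conditions.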
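The marginal effects implied by the interaction model can be evaluated at the sample means as a quick check. This sketch hard-codes the coefficients from the summary table and the means from the `df.describe()` output; the function names are hypothetical.

```python
# Coefficients copied from the OLS summary above (model with interaction).
B_HUMIDITY = -0.1775
B_WINDSPEED = -0.0905
B_INTERACTION = 0.2971

# Sample means copied from the df.describe() output.
MEAN_HUMIDITY = 0.734899
MEAN_WINDSPEED = 10.81064


def humidity_effect(windspeed):
    """Change in the target per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the target per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


effect_h = humidity_effect(MEAN_WINDSPEED)
effect_w = windspeed_effect(MEAN_HUMIDITY)

print(f"humidity effect at mean windspeed: {effect_h:.4f}")
print(f"windspeed effect at mean humidity: {effect_w:.4f}")
```

At the sample means the effects come out at roughly 3.03 for humidity and 0.13 for windspeed, close to the coefficients of the model without the interaction term. The sign change in the individual coefficients therefore reflects how the effect is split across terms, not a reversal of the relationship.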