# Interpreting Estimated Coefficients - Weather Model
In this exercise, we'll work with the historical temperature data from the previous checkpoint. 

## Load the dataset from Thinkful's database

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import os
import matplotlib.pyplot as plt
%matplotlib inline
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

# use the credentials to start a connection
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

# Use the connection to extract SQL data
weather_df = pd.read_sql_query('SELECT * FROM weatherinszeged', con=engine)

# Close the connection after the query is complete
engine.dispose()

## Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed.

In [12]:
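# Target: temperature minus apparenttemperature, so a positive value
# means it feels colder than the measured temperature
temp_dif = weather_df['temperature'] - weather_df['apparenttemperature']
weather_df['temperaturediff'] = temp_dif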

In [13]:
target_var = 'temperaturediff'
feature_set = ['humidity', 'windspeed']

# X is the feature set 
X = weather_df[feature_set]
# Y is the target variable
Y = weather_df[target_var]

# We add a constant column so the model estimates an intercept
# (sm.OLS does not add one automatically)
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        temperaturediff   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        13:39:33   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4381      0.021   -115.948      0.0

In [24]:
print('The estimated model is:')

# results.params is a labeled pandas Series, so use .iloc for positional access
equation = str(round(results.params.iloc[0], 2))
cols = feature_set
for i in range(len(cols)):
    equation = equation + ' + ' + str(round(results.params.iloc[i+1], 2)) + '(' + cols[i] + ')'

print('Temperature Difference = ', equation)

The estimated model is:
Temperature Difference =  -2.44 + 3.03(humidity) + 0.12(windspeed)


According to the OLS summary, all of the coefficients have a p-value reported as 0, so they are all statistically significant. Because the target here is temperature minus apparenttemperature, a positive coefficient means that an increase in the variable is associated with the apparent temperature falling further below the measured temperature. The positive sign on the windspeed coefficient therefore makes sense, since higher windspeed tends to make people feel cooler. I was surprised by the positive sign on the humidity coefficient, as I would have expected higher humidity to make people feel like the temperature is hotter, not cooler. The windspeed coefficient (0.12) is much smaller than the humidity coefficient (3.03), but the two variables are measured on different scales, so the coefficient sizes alone don't show which one has the larger effect.
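
To make the size of these effects concrete, the sketch below plugs values into the estimated equation using the rounded coefficients printed above. The helper name `predict_diff` and the example inputs are hypothetical, and the sketch assumes that humidity is recorded as a fraction between 0 and 1 and windspeed in km/h.

```python
# Rounded coefficients from the estimated model above
CONST, B_HUMIDITY, B_WINDSPEED = -2.44, 3.03, 0.12


def predict_diff(humidity, windspeed):
    """Predicted temperature - apparenttemperature (hypothetical helper)."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Assumed example inputs: humidity as a 0-1 fraction, windspeed in km/h
base = predict_diff(humidity=0.5, windspeed=10)
more_humid = predict_diff(humidity=0.6, windspeed=10)
windier = predict_diff(humidity=0.5, windspeed=11)

print('Baseline prediction:', round(base, 3))
print('Effect of +0.1 humidity:', round(more_humid - base, 3))
print('Effect of +1 km/h windspeed:', round(windier - base, 3))
```

Under those assumptions, moving humidity across its whole range changes the prediction by 3.03, while windspeed would need to change by about 25 km/h to do the same.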

## Add the interaction of humidity and windspeed to the model above and estimate the model using OLS.

In [27]:
# This is the interaction between humidity and windspeed
weather_df['humidity_windspeed'] = weather_df['humidity'] * weather_df['windspeed']

feature_set = ['humidity', 'windspeed', 'humidity_windspeed']

# X is the feature set 
X = weather_df[feature_set]
# Y is the target variable
Y = weather_df[target_var]

# We add a constant column so the model estimates an intercept
# (sm.OLS does not add one automatically)
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        temperaturediff   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        13:51:50   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -0.0839      0

In [28]:
print('The new estimated model is:')

# results.params is a labeled pandas Series, so use .iloc for positional access
equation = str(round(results.params.iloc[0], 2))
cols = feature_set
for i in range(len(cols)):
    equation = equation + ' + ' + str(round(results.params.iloc[i+1], 2)) + '(' + cols[i] + ')'

print('Temperature Difference = ', equation)

The new estimated model is:
Temperature Difference =  -0.08 + -0.18(humidity) + -0.09(windspeed) + 0.3(humidity_windspeed)


In this version of the model, only the p-value of the constant becomes greater than 0, but it stays below 0.05, so every coefficient is still statistically significant. R-squared also rises from 0.288 to 0.341. Both the humidity and windspeed coefficients have shrunk in magnitude and switched from positive to negative. This does not mean that the individual features have become irrelevant. Once the interaction term is in the model, the humidity coefficient is the effect of humidity when windspeed is 0, and the windspeed coefficient is the effect of windspeed when humidity is 0. The full effect of a one-unit increase in humidity is -0.18 + 0.30 × windspeed, and the full effect of a one-unit increase in windspeed is -0.09 + 0.30 × humidity. So humidity raises the temperature difference whenever windspeed is above about 0.6, windspeed raises it whenever humidity is above about 0.3, and the higher one of them is, the larger the effect of the other.
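
With an interaction term in the model, the effect of each variable depends on the level of the other. As a rough sketch using the rounded coefficients printed above (the helper names `humidity_effect` and `windspeed_effect` and the example values are hypothetical), the marginal effects can be tabulated like this:

```python
# Rounded coefficients from the interaction model above
B_HUMIDITY, B_WINDSPEED, B_INTERACTION = -0.18, -0.09, 0.30


def humidity_effect(windspeed):
    """Change in the predicted difference per unit of humidity at a given windspeed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in the predicted difference per unit of windspeed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Illustrative values only; they are not taken from the dataset
for w in (0, 5, 10, 20):
    print('windspeed =', w, '-> humidity effect:', round(humidity_effect(w), 2))

for h in (0.0, 0.3, 0.5, 1.0):
    print('humidity =', h, '-> windspeed effect:', round(windspeed_effect(h), 2))
```

Because these use rounded coefficients, the results are approximate. For exact values, read the coefficients from `results.params` instead.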