# Evaluating Performance - Weather Model
In this exercise, we'll work with the historical temperature data from the previous checkpoint. 

## Load the dataset from Thinkful's database

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import os
import matplotlib.pyplot as plt
%matplotlib inline
from sqlalchemy import create_engine
import warnings

warnings.filterwarnings('ignore')

In [2]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

# use the credentials to start a connection
engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

# Use the connection to extract SQL data
weather_df = pd.read_sql_query('SELECT * FROM weatherinszeged', con=engine)

# Close the connection after the query is complete
engine.dispose()

## Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed.

In [3]:
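# Note: this is temperature minus apparenttemperature, the reverse of the order
# in the prompt. Only the coefficient signs flip; R-squared, AIC and BIC do not change.
temp_dif = weather_df['temperature'] - weather_df['apparenttemperature']
weather_df['temperaturediff'] = temp_dif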

In [4]:
target_var = 'temperaturediff'
feature_set = ['humidity', 'windspeed']

# X is the feature set 
X = weather_df[feature_set]
# Y is the target variable
Y = weather_df[target_var]

# We add a constant so that the model estimates an intercept;
# sm.OLS does not include one by default.
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results.
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        temperaturediff   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        17:41:16   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4381      0.021   -115.948      0.0

The R-squared value for this model is 0.288, and the adjusted R-squared value is also 0.288. These values are too low to be satisfactory: humidity and windspeed together explain only 28.8% of the variance in the target variable, temperaturediff.
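
The two values coincide because the sample is very large relative to the number of explanatory variables. As a rough check, the sketch below applies the standard adjusted R-squared formula to the figures reported in the summary (96453 observations, 2 explanatory variables). The helper name `adjusted_r_squared` is our own and is not part of statsmodels.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Values read from the first model's summary table (R-squared is already rounded)
adj = adjusted_r_squared(r_squared=0.288, n_obs=96453, n_predictors=2)
print(round(adj, 3))  # prints 0.288
```

With roughly 96,000 observations, the penalty for two explanatory variables is far smaller than the rounding shown in the summary.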

## Include the interaction of humidity and windspeed in the model above and estimate the model using OLS.

In [5]:
# This is the interaction between humidity and windspeed
weather_df['humidity_windspeed'] = weather_df['humidity'] * weather_df['windspeed']

feature_set = ['humidity', 'windspeed', 'humidity_windspeed']

# X is the feature set 
X = weather_df[feature_set]
# Y is the target variable
Y = weather_df[target_var]

# We add a constant so that the model estimates an intercept;
# sm.OLS does not include one by default.
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        temperaturediff   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        17:47:11   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 -0.0839      0

For this version of the model, the R-squared value is 0.341 and the adjusted R-squared value is also 0.341. Although this is higher than the previous 0.288, it's still too low to be satisfactory, as the model explains only 34.1% of the variance in the target variable.
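
Once the interaction term is in the model, the effect of humidity on the target is no longer a single number: it depends on windspeed. A minimal sketch of that idea follows. The coefficient values are hypothetical placeholders chosen for illustration, not the estimates from the summary above, and `humidity_effect` is a name we made up.

```python
def humidity_effect(b_humidity, b_interaction, windspeed):
    """Change in the predicted target per unit of humidity at a given windspeed.

    For y = b0 + b1*humidity + b2*windspeed + b3*humidity*windspeed,
    the slope with respect to humidity is b1 + b3*windspeed.
    """
    return b_humidity + b_interaction * windspeed

# Hypothetical coefficients, for illustration only
b_humidity = 0.5
b_interaction = 0.25

for windspeed in (0, 10, 20):
    print(windspeed, humidity_effect(b_humidity, b_interaction, windspeed))
```

This is why the coefficient on humidity alone should be read as its effect when windspeed is zero, not as an overall effect.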

## Add visibility as an additional explanatory variable to the first model and estimate it.

In [6]:
feature_set = ['humidity', 'windspeed', 'visibility']

# X is the feature set 
X = weather_df[feature_set]
# Y is the target variable
Y = weather_df[target_var]

# We add a constant so that the model estimates an intercept;
# sm.OLS does not include one by default.
X = sm.add_constant(X)

# We fit an OLS model using statsmodels
results = sm.OLS(Y, X).fit()

# We print the summary results
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:        temperaturediff   R-squared:                       0.304
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                 1.401e+04
Date:                Mon, 23 Sep 2019   Prob (F-statistic):               0.00
Time:                        17:59:29   Log-Likelihood:            -1.6938e+05
No. Observations:               96453   AIC:                         3.388e+05
Df Residuals:                   96449   BIC:                         3.388e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.5756      0.028    -56.605      0.0

The new R-squared value is 0.304 and the adjusted R-squared value is 0.303. This is higher than the values from the original model, but lower than the values from the model containing the humidity_windspeed interaction feature. Both of those models have three explanatory variables, so the comparison is like for like: the interaction feature improves the fit more than the visibility feature does.

## Choose the best one from the three models above with respect to their AIC and BIC scores.

For AIC and BIC, the lower the value, the better. The first model has AIC and BIC values of 3.409e+05, the second model (with the interaction term) has values of 3.334e+05, and the third model (with visibility) has values of 3.388e+05. Based on both criteria, the second model is the best choice of the three.
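
To see where these scores come from, the sketch below recomputes them from the log-likelihoods printed in the three summaries, using the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k counts the estimated parameters including the constant. The log-likelihoods in the summaries are rounded, so the results match the reported scores only approximately. The model labels and helper names are our own.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

n_obs = 96453  # "No. Observations" in each summary

# label: (rounded log-likelihood from the summary, parameters including the constant)
models = {
    "humidity + windspeed": (-1.7046e5, 3),
    "with interaction": (-1.6669e5, 4),
    "with visibility": (-1.6938e5, 4),
}

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()}
for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.4g}, BIC={b:.4g}")

# Lower is better for both criteria
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)
```

Within each model the two criteria are close, because the penalty terms are tiny next to the log-likelihood at this sample size. The ranking is therefore driven almost entirely by fit.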