### Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

- First, load the dataset from the weatherinszeged table from Thinkful's database.
- As in the previous checkpoint, build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
- Next, add the interaction of humidity and windspeed to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
- Add visibility as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by visibility. Which one is more useful?
- Choose the best of the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings(action='ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged', con=engine)
engine.dispose()

weather_df.head()

| | date | summary | preciptype | temperature | apparenttemperature | humidity | windspeed | windbearing | visibility | loudcover | pressure | dailysummary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-03-31 22:00:00+00:00 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-03-31 23:00:00+00:00 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 00:00:00+00:00 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 01:00:00+00:00 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 02:00:00+00:00 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


In [9]:
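# Target variable: the gap between temperature and apparenttemperature.
# Note: this is temperature minus apparenttemperature, the reverse of the order in the task.
# Flipping the sign only flips the coefficient signs; R-squared, AIC, and BIC are unchanged.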
weather_df['gap temp'] = weather_df['temperature'] - weather_df['apparenttemperature']

In [10]:
# As explanatory variables, use humidity and windspeed
X = weather_df[['windspeed', 'humidity']]
Y = weather_df['gap temp']

In [11]:
# Estimate the model using OLS. add_constant appends the intercept column to X.

X = sm.add_constant(X)
results = sm.OLS(Y, X).fit()

print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               gap temp   R-squared:                       0.288
Model:                            OLS   Adj. R-squared:                  0.288
Method:                 Least Squares   F-statistic:                 1.949e+04
Date:                Mon, 21 Oct 2019   Prob (F-statistic):               0.00
Time:                        14:36:55   Log-Likelihood:            -1.7046e+05
No. Observations:               96453   AIC:                         3.409e+05
Df Residuals:                   96450   BIC:                         3.409e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -2.4381      0.021   -115.948      0.0

#### What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why? ####

The R-squared value is 0.288 and the adjusted R-squared value is also 0.288.  
R-squared ranges from 0 to 1, so 0.288 is quite low: the model explains only about 29% of the variance in the target. I would hope for a value above 0.5, so I do not consider this satisfactory. The two values agree to three decimal places because, with 96,453 observations and only two explanatory variables, the adjustment for the number of variables is negligible.
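To see why the two values match, the adjusted R-squared can be recomputed from the figures printed in the summary. The sketch below assumes the standard formula for a model with an intercept; the helper name `adjusted_r_squared` is illustrative and is not part of statsmodels.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for an OLS model that includes an intercept.

    Uses 1 - (1 - R^2) * (n - 1) / (n - p - 1), where p counts the
    explanatory variables and excludes the constant.
    """
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Figures taken from the summary: R-squared 0.288 (already rounded),
# 96453 observations, Df Model 2.
adj = adjusted_r_squared(0.288, 96453, 2)
penalty = 0.288 - adj  # how much the adjustment lowers R-squared

print(round(adj, 3))
print(penalty < 1e-4)
```

With this many observations the penalty is far smaller than the third decimal place, so the summary shows the same number for both statistics.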


In [12]:
# Add the interaction of humidity and windspeed (their product) to the model above and estimate it using OLS
weather_df['humi-wind'] = weather_df['windspeed'] * weather_df['humidity']

In [13]:
X1 = weather_df[['windspeed', 'humidity', 'humi-wind']]
Y1 = weather_df['gap temp']

X1 = sm.add_constant(X1)
results1 = sm.OLS(Y1, X1).fit()
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:               gap temp   R-squared:                       0.341
Model:                            OLS   Adj. R-squared:                  0.341
Method:                 Least Squares   F-statistic:                 1.666e+04
Date:                Mon, 21 Oct 2019   Prob (F-statistic):               0.00
Time:                        14:37:03   Log-Likelihood:            -1.6669e+05
No. Observations:               96453   AIC:                         3.334e+05
Df Residuals:                   96449   BIC:                         3.334e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0839      0.033     -2.511      0.0

#### Now, what is the R-squared of this model? Does this model improve upon the previous one? ####

This model, which includes the interaction of humidity and windspeed, improves on the previous one: R-squared rises from 0.288 to 0.341, a gain of 0.053.  
However, that is still not enough to satisfy me; I would like a value above 0.5.
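Written out, the interaction model fitted in this step has the form below. The coefficient symbols $\beta_0, \dots, \beta_3$ and the error term $\varepsilon$ are notation introduced here for illustration; the estimated values are the ones reported in the summary output.

```latex
\begin{aligned}
\text{gap temp} &= \beta_0 + \beta_1\,\text{windspeed} + \beta_2\,\text{humidity}
  + \beta_3\,(\text{windspeed} \times \text{humidity}) + \varepsilon \\
\frac{\partial\,\text{gap temp}}{\partial\,\text{windspeed}} &= \beta_1 + \beta_3\,\text{humidity}
\end{aligned}
```

The second line shows what the interaction term adds: the effect of windspeed on the gap is no longer a single constant but changes with the level of humidity.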

In [14]:
# Add visibility as an additional explanatory variable and estimate the model.
# Note: visibility is added here on top of the interaction model, not to the first model as the task asks.
X2 = weather_df[['windspeed', 'humidity', 'humi-wind', 'visibility']]
Y2 = weather_df['gap temp']

X2 = sm.add_constant(X2)
results2 = sm.OLS(Y2, X2).fit()

print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:               gap temp   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.363
Method:                 Least Squares   F-statistic:                 1.377e+04
Date:                Mon, 21 Oct 2019   Prob (F-statistic):               0.00
Time:                        14:37:44   Log-Likelihood:            -1.6504e+05
No. Observations:               96453   AIC:                         3.301e+05
Df Residuals:                   96448   BIC:                         3.301e+05
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.1006      0.039     28.459      0.0

In [18]:
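# Compare the improvement in adjusted R-squared contributed by the interaction term and by visibility.
# The labels follow the order in which the three models were fitted above.
condition = ['Original', 'Interaction', 'Interaction + visibility']

ad_R_table = pd.DataFrame({'Adj Values':[0.288, 0.341, 0.363]}, index=condition)
ad_R_table

| | Adj Values |
|---|---|
| Original | 0.288 |
| Interaction | 0.341 |
| Interaction + visibility | 0.363 |


The table above shows that adjusted R-squared increases at each step.  
The interaction term raises it by 0.053 (from 0.288 to 0.341), and adding visibility on top of the interaction model raises it by a further 0.022 (to 0.363). Because visibility was added to the interaction model rather than to the first model, the comparison is only indirect, but the interaction term contributes the larger gain. The last model, which includes both the interaction and visibility, has the highest adjusted R-squared of the three.

In [19]:
# AIC and BIC as printed in the summaries; at four significant figures the two scores coincide for each model.
AIC_BIC = {'AIC & BIC':[ 3.409e+05, 3.334e+05, 3.301e+05]}
model_performance = pd.DataFrame(AIC_BIC, index=condition)
model_performance

| | AIC & BIC |
|---|---|
| Original | 340900.0 |
| Interaction | 333400.0 |
| Interaction + visibility | 330100.0 |


AIC and BIC measure how well a model fits while penalizing the number of parameters, and lower values are better.  

**So the model with both the interaction and visibility has the smallest AIC and BIC of the three. By these criteria it is the best model and explains the most about the outcome.**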
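The AIC and BIC figures can be cross-checked against the log-likelihoods printed in the three summaries. The sketch below assumes the usual definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), with k counting the estimated coefficients including the constant. The log-likelihoods are the rounded values from the printed output, so the results are approximate, and the `models` dictionary and its labels are illustrative names for the three models in the order they were fitted.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


N_OBS = 96453  # "No. Observations" in each summary

# label: (rounded log-likelihood from the summary, Df Model + 1 for the constant)
models = {
    "Original": (-1.7046e5, 3),
    "Interaction": (-1.6669e5, 4),
    "Interaction + visibility": (-1.6504e5, 5),
}

scores = {
    label: (aic(llf, k), bic(llf, k, N_OBS))
    for label, (llf, k) in models.items()
}

for label, (a, b) in scores.items():
    print(f"{label}: AIC ~ {a:.0f}, BIC ~ {b:.0f}")

# Lower is better for both criteria.
best = min(scores, key=lambda label: scores[label][0])
print("Lowest AIC:", best)
```

BIC multiplies each parameter by ln(96453), roughly 11.5, instead of 2, so it penalizes the extra terms more heavily. Here the gain in log-likelihood is large enough that both criteria rank the three models in the same order.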