### 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link to a Jupyter notebook containing your solutions to the following tasks:

* First, load the dataset from the **weatherinszeged** table from Thinkful's database.
* As in the previous checkpoint, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
* Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
* Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with the improvement contributed by *visibility*. Which one is more useful?
* Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.


In [4]:
# %load 19.4_interpreting_coefs_drill_2_weather.py
#!/usr/bin/env python

# 1. load the dataset from the weatherinszeged table from Thinkful's database.
# 2. Build a linear regression model where your target variable is the difference between the apparenttemperature and the temperature. As explanatory variables, use humidity and windspeed. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables?
# 3. Next, include the interaction of humidity and windspeed to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for humidity and windspeed change? Interpret the estimated coefficients.

# In[1]:


import numpy as np
import pandas as pd
from sklearn import preprocessing as pre
from sklearn import linear_model
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm


# In[2]:


# Record versions of the modules used, for future reference.
# The installed distribution for the sklearn module is named 'scikit-learn'.
from importlib.metadata import version
resources = ('numpy', 'pandas', 'matplotlib', 'scikit-learn', 'seaborn', 'sqlalchemy', 'statsmodels')
version_dict = {name: version(name) for name in resources}
version_dict


# In[3]:


user = 'dsbc_student'
pw = '7*.8G9QH21'
host = '142.93.121.174'
port = '5432'
db = 'weatherinszeged'
dialect = 'postgresql'

db_location = f"{dialect}://{user}:{pw}@{host}:{port}/{db}"
engine = create_engine(db_location)

sql = '''
SELECT
    *
FROM
    weatherinszeged
'''
raw_df = pd.read_sql(sql, con=engine)
engine.dispose()


# In[4]:


weather_df = raw_df


# In[5]:


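X = weather_df[["humidity", "windspeed"]]
# Target: temperature minus apparent temperature. This is the reverse sign of
# "apparenttemperature minus temperature", so coefficient signs flip relative to that definition.
y = weather_df.temperature - weather_df.apparenttemperature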


# In[6]:


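# Note: sm.add_constant returns a new object and does not modify X in place.
# Its return value is discarded here, so the model is fit without an intercept.
sm.add_constant(X)
model_results = sm.OLS(y, X).fit()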


# In[7]:


model_results.summary()

| Dep. Variable: | y | R-squared (uncentered): | 0.425 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.425 |
| Method: | Least Squares | F-statistic: | 35700.0 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 22:21:32 | Log-Likelihood: | -176750.0 |
| No. Observations: | 96453 | AIC: | 353500.0 |
| Df Residuals: | 96451 | BIC: | 353500.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| humidity | 0.4873 | 0.010 | 47.338 | 0.000 | 0.467 | 0.507 |
| windspeed | 0.0772 | 0.001 | 126.510 | 0.000 | 0.076 | 0.078 |

| Omnibus: | 9577.682 | Durbin-Watson: | 0.228 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 12669.324 |
| Skew: | 0.867 | Prob(JB): | 0.0 |
| Kurtosis: | 3.378 | Cond. No. | 27.2 |


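The R-squared and adjusted R-squared are both 0.425, which is quite poor. Both are reported as uncentered because the fitted model has no intercept.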

In [5]:
# In[8]:


X2 = X.copy()  # copy so that adding the interaction column does not modify X
X2["humidity_x_windspeed"] = weather_df.humidity * weather_df.windspeed


# In[9]:


X2.head()


# In[10]:


sm.add_constant(X2)
model2_results = sm.OLS(y, X2).fit()
model2_results.summary()


# Humidity and windspeed are both statistically significant (p < 0.001), and both coefficients are now negative. The positive relationship with the target comes entirely from the interaction term.

# In[ ]:



| Dep. Variable: | y | R-squared (uncentered): | 0.533 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.533 |
| Method: | Least Squares | F-statistic: | 36770.0 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 22:21:32 | Log-Likelihood: | -166700.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96450 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| humidity | -0.2820 | 0.011 | -26.590 | 0.000 | -0.303 | -0.261 |
| windspeed | -0.0958 | 0.001 | -74.776 | 0.000 | -0.098 | -0.093 |
| humidity_x_windspeed | 0.3038 | 0.002 | 149.513 | 0.000 | 0.300 | 0.308 |

| Omnibus: | 4919.327 | Durbin-Watson: | 0.265 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9471.445 |
| Skew: | 0.381 | Prob(JB): | 0.0 |
| Kurtosis: | 4.333 | Cond. No. | 38.0 |


Only slightly better: adding the interaction term raises R-squared (and adjusted R-squared) from 0.425 to 0.533.
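In both summaries the adjusted R-squared equals the R-squared to three decimals. With 96,453 observations and only two or three regressors, the degrees-of-freedom penalty is too small to show. The sketch below reproduces that arithmetic. It assumes the adjustment statsmodels applies is `1 - (n - k_constant) / df_resid * (1 - R²)`, and it uses R-squared values copied from the summaries, which are already rounded. The helper name `adjusted_r2` is made up for illustration.

```python
def adjusted_r2(r2, n_obs, df_resid, has_constant):
    """Adjusted R-squared from R-squared, sample size, and residual degrees of freedom."""
    k_constant = 1 if has_constant else 0
    return 1.0 - (n_obs - k_constant) / df_resid * (1.0 - r2)

n_obs = 96453
# (label, rounded R-squared, Df Residuals) as reported in the summaries above.
reported = [("model 1", 0.425, 96451), ("model 2", 0.533, 96450)]

adjusted = {
    label: adjusted_r2(r2, n_obs, df_resid, has_constant=False)
    for label, r2, df_resid in reported
}
for label, value in adjusted.items():
    print(label, round(value, 3))
# model 1 0.425
# model 2 0.533
```

Because the penalty is negligible at this sample size, adjusted R-squared and plain R-squared give the same ranking here.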

In [6]:
X3 = weather_df[["humidity", "windspeed", "visibility"]]

In [7]:
sm.add_constant(X3)
model3_results = sm.OLS(y, X3).fit()
model3_results.summary()



| Dep. Variable: | y | R-squared (uncentered): | 0.49 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.49 |
| Method: | Least Squares | F-statistic: | 30940.0 |
| Date: | Tue, 23 Jul 2019 | Prob (F-statistic): | 0.0 |
| Time: | 22:31:43 | Log-Likelihood: | -170960.0 |
| No. Observations: | 96453 | AIC: | 341900.0 |
| Df Residuals: | 96450 | BIC: | 341900.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| humidity | 1.3488 | 0.012 | 108.590 | 0.000 | 1.324 | 1.373 |
| windspeed | 0.1052 | 0.001 | 167.634 | 0.000 | 0.104 | 0.106 |
| visibility | -0.0976 | 0.001 | -110.936 | 0.000 | -0.099 | -0.096 |

| Omnibus: | 5476.521 | Durbin-Watson: | 0.283 |
|---|---|---|---|
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 6619.177 |
| Skew: | 0.587 | Prob(JB): | 0.0 |
| Kurtosis: | 3.519 | Cond. No. | 43.9 |


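Model 3 is better than model 1 (R-squared 0.49 versus 0.425) but worse than model 2 (0.533). Model 2 also has the lowest AIC and BIC (about 333,400, versus 341,900 for model 3 and 353,500 for model 1), so it is the best of the three on every criterion.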
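The AIC and BIC comparison can be checked by hand from the log-likelihoods in the three summaries, using the standard definitions AIC = 2k − 2·lnL and BIC = k·ln(n) − 2·lnL. The sketch below assumes k is the number of estimated coefficients (2, 3, and 3, since no intercept was fit). The log-likelihoods are the rounded values shown above, so the results are approximate. The function and variable names are illustrative only.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n_obs) - 2 * log_likelihood

n_obs = 96453
# (rounded log-likelihood, number of estimated coefficients) from the summaries.
models = {
    "model 1": (-176750.0, 2),
    "model 2": (-166700.0, 3),
    "model 3": (-170960.0, 3),
}

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs))
    for name, (ll, k) in models.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
print(best_by_aic, best_by_bic)  # model 2 model 2
```

Lower is better for both criteria. With this many observations the log-likelihood term dominates the penalty terms, which is why AIC and BIC agree with each other and with the R-squared ranking.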