# <font color=blue>Assignments for "Evaluating Goodness of Fit"</font>

As in previous lessons, please submit a link to a single gist that contains links to two Jupyter notebooks (one for each assignment below).

## 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link in the gist file to the Jupyter notebook containing your solutions to the following tasks:

- Load the **weather** data from Kaggle.
- As in the previous lesson, build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
- Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Now, what is the R-squared of this model? Does this model improve upon the previous one?
- Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared from the interaction term with the improvement from *visibility*. Which one is more useful?
- Choose the best of the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [2]:
import pandas as pd
import statsmodels.api as sm
from IPython.display import display
from pandas.api.types import is_numeric_dtype

In [3]:
df = pd.read_csv("C:/Users/Elif/data/weatherHistory.csv")

In [4]:
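# Target: temperature minus apparent temperature.
# Reversing the order would flip the coefficient signs but leave R-squared unchanged.
df["Temp_diff"] = df["Temperature (C)"] - df["Apparent Temperature (C)"]

Y = df["Temp_diff"]
X = df[["Humidity", "Wind Speed (km/h)"]]

# statsmodels does not add an intercept automatically.
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
results.summary()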

| | | | |
|---|---|---|---|
| Dep. Variable: | Temp_diff | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 21:05:34 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2.4381 | 0.021 | -115.948 | 0.000 | -2.479 | -2.397 |
| Humidity | 3.0292 | 0.024 | 126.479 | 0.000 | 2.982 | 3.076 |
| Wind Speed (km/h) | 0.1193 | 0.001 | 176.164 | 0.000 | 0.118 | 0.121 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.264 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | 0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


**R-squared and adjusted R-squared are both 0.288, so the model explains only about 29% of the variation in the target. That is too low to call the fit satisfactory.**
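
The two values coincide because of the sample size. As a rough check, the standard adjustment formula can be applied to the rounded figures printed in the summary. The helper name `adjusted_r2` is illustrative only and is not a statsmodels function.

```python
def adjusted_r2(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R-squared for an OLS model that includes an intercept.

    n_predictors counts the explanatory variables and excludes the constant.
    """
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Rounded values from the summary: R-squared 0.288, 96453 rows, 2 predictors.
weather_adj_r2 = adjusted_r2(0.288, 96453, 2)
print(round(weather_adj_r2, 3))
```

With almost 100,000 observations and only two predictors, the penalty factor `(n - 1) / (n - k - 1)` is practically 1, so the adjustment is invisible at three decimals.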
    

In [6]:
# Interaction term: humidity multiplied by wind speed.
df["HumWind"] = df["Humidity"] * df["Wind Speed (km/h)"]

X = df[["Humidity", "Wind Speed (km/h)", "HumWind"]]
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Temp_diff | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 21:07:26 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0839 | 0.033 | -2.511 | 0.012 | -0.149 | -0.018 |
| Humidity | -0.1775 | 0.043 | -4.133 | 0.000 | -0.262 | -0.093 |
| Wind Speed (km/h) | -0.0905 | 0.002 | -36.797 | 0.000 | -0.095 | -0.086 |
| HumWind | 0.2971 | 0.003 | 88.470 | 0.000 | 0.291 | 0.304 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.262 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | 0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


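**R-squared rises from 0.288 to 0.341 (adjusted R-squared is also 0.341), and AIC and BIC both fall from 340900 to 333400. The interaction model therefore fits better than the first model.**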
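
For the AIC and BIC comparison, the scores can be reproduced approximately from the log-likelihoods printed in the two weather summaries. This is a minimal sketch using the textbook definitions, with the constant counted as a parameter. The helper `aic_bic` is a hypothetical name, and the inputs are the rounded values shown in the tables, so the results only match the reported scores after rounding.

```python
import math


def aic_bic(log_likelihood: float, n_obs: int, n_params: int):
    """Return (AIC, BIC); n_params counts every estimated coefficient, constant included."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic


# Log-likelihoods as printed in the summaries (already rounded).
n_obs = 96453
base_aic, base_bic = aic_bic(-170460.0, n_obs, n_params=3)    # const, humidity, wind speed
inter_aic, inter_bic = aic_bic(-166690.0, n_obs, n_params=4)  # plus the interaction term

print(f"base:        AIC={base_aic:.0f}  BIC={base_bic:.0f}")
print(f"interaction: AIC={inter_aic:.0f}  BIC={inter_bic:.0f}")
```

Lower is better for both criteria. The extra parameter costs the interaction model only a few points of penalty, while its log-likelihood improves by several thousand, so both criteria favour it.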

## 2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link in the gist file to the Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle.
- Run your house prices model again and assess the goodness of fit of your model using the F-test, R-squared, adjusted R-squared, AIC, and BIC.
- Do you think your model is satisfactory? If so, why?
- To improve the goodness of fit of your model, try different model specifications by adding or removing some variables.
- For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [7]:
data = pd.read_csv("C:/Users/Elif/data/house_prices.csv")

In [8]:
# Drop columns whose values are mostly missing.
data = data.drop(["PoolQC", "MiscFeature", "Fence", "Alley"], axis=1)


def fix_missing(df, col, name):
    """Fill missing values in a numeric column with that column's median."""
    if is_numeric_dtype(col):
        df[name] = col.fillna(col.median())


for n, c in data.items():
    fix_missing(data, c, n)

Y = data["SalePrice"]
X = data[["YearBuilt", "TotalBsmtSF", "1stFlrSF", "FullBath", "GarageArea", "GrLivArea", "OverallQual"]]
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
display(results.summary())

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.768 |
| Model: | OLS | Adj. R-squared: | 0.767 |
| Method: | Least Squares | F-statistic: | 688.2 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 21:10:48 | Log-Likelihood: | -17476.0 |
| No. Observations: | 1460 | AIC: | 34970.0 |
| Df Residuals: | 1452 | BIC: | 35010.0 |
| Df Model: | 7 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -8.112e+05 | 8.97e+04 | -9.039 | 0.000 | -9.87e+05 | -6.35e+05 |
| YearBuilt | 372.5621 | 47.208 | 7.892 | 0.000 | 279.958 | 465.166 |
| TotalBsmtSF | 18.0092 | 4.313 | 4.176 | 0.000 | 9.549 | 26.469 |
| 1stFlrSF | 14.1704 | 4.981 | 2.845 | 0.005 | 4.399 | 23.942 |
| FullBath | -4481.4104 | 2656.489 | -1.687 | 0.092 | -9692.377 | 729.556 |
| GarageArea | 43.3915 | 6.225 | 6.971 | 0.000 | 31.181 | 55.602 |
| GrLivArea | 51.2896 | 3.141 | 16.331 | 0.000 | 45.129 | 57.450 |
| OverallQual | 2.148e+04 | 1154.905 | 18.596 | 0.000 | 1.92e+04 | 2.37e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 525.297 | Durbin-Watson: | 1.99 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 65024.498 |
| Skew: | -0.612 | Prob(JB): | 0.0 |
| Kurtosis: | 35.671 | Cond. No. | 271000.0 |


**R-squared is 0.768 and adjusted R-squared is 0.767, and the F-statistic (688.2, p-value reported as 0.0) rejects the hypothesis that all slope coefficients are zero. The model fits reasonably well, but let's see whether a different specification fits better.**
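
The overall F-statistic is tied directly to R-squared, which makes it easy to sanity-check. The sketch below applies the usual formula for a model with an intercept to the rounded figures from the summary. `overall_f_statistic` is an illustrative helper, not a library function.

```python
def overall_f_statistic(r2: float, n_obs: int, n_predictors: int) -> float:
    """F-statistic for H0: all slope coefficients are zero (model has an intercept)."""
    explained_per_df = r2 / n_predictors
    unexplained_per_df = (1.0 - r2) / (n_obs - n_predictors - 1)
    return explained_per_df / unexplained_per_df


# Rounded inputs from the summary: R-squared 0.768, 1460 rows, 7 predictors.
f_stat = overall_f_statistic(0.768, 1460, 7)
print(round(f_stat, 1))
```

The result lands close to the reported 688.2. The small gap comes from feeding in R-squared rounded to three decimals rather than the full-precision value.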

In [10]:
# Second specification: drop FullBath and use GarageCars instead of GarageArea.
X_ = data[["YearBuilt", "TotalBsmtSF", "1stFlrSF", "GrLivArea", "OverallQual", "GarageCars"]]
X_ = sm.add_constant(X_)

results = sm.OLS(Y, X_).fit()
display(results.summary())

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.769 |
| Model: | OLS | Adj. R-squared: | 0.768 |
| Method: | Least Squares | F-statistic: | 807.2 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 21:13:21 | Log-Likelihood: | -17474.0 |
| No. Observations: | 1460 | AIC: | 34960.0 |
| Df Residuals: | 1453 | BIC: | 35000.0 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6.923e+05 | 8.5e+04 | -8.149 | 0.000 | -8.59e+05 | -5.26e+05 |
| YearBuilt | 309.7555 | 44.583 | 6.948 | 0.000 | 222.301 | 397.210 |
| TotalBsmtSF | 20.9344 | 4.272 | 4.901 | 0.000 | 12.555 | 29.313 |
| 1stFlrSF | 13.9446 | 4.955 | 2.814 | 0.005 | 4.224 | 23.665 |
| GrLivArea | 48.1718 | 2.727 | 17.665 | 0.000 | 42.822 | 53.521 |
| OverallQual | 2.076e+04 | 1161.170 | 17.880 | 0.000 | 1.85e+04 | 2.3e+04 |
| GarageCars | 1.393e+04 | 1830.880 | 7.608 | 0.000 | 1.03e+04 | 1.75e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 432.994 | Durbin-Watson: | 1.979 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 43808.158 |
| Skew: | -0.258 | Prob(JB): | 0.0 |
| Kurtosis: | 29.83 | Cond. No. | 253000.0 |


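I removed `FullBath`, whose coefficient was not significant at the 5% level (p = 0.092), and replaced `GarageArea` with `GarageCars`. The fit barely changed: R-squared rose from 0.768 to 0.769 and adjusted R-squared from 0.767 to 0.768, while AIC fell from 34970 to 34960 and BIC from 35010 to 35000. With one fewer variable and slightly lower AIC and BIC, the second model is marginally preferable.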