# <font color=blue>Assignments for "Evaluating Goodness of Fit"</font>

As in previous lessons, please submit a link to a single gist that contains links to two Jupyter notebooks (one for each assignment below).

## 1. Weather model

For this assignment, you'll revisit the historical temperature dataset. To complete this assignment, submit a link in the gist file to the Jupyter notebook containing your solutions to the following tasks:

- Load the **weather** data from Kaggle.
- As in the previous lesson, build a linear regression model where the target variable is the difference between *apparenttemperature* and *temperature*. As explanatory variables, use *humidity* and *windspeed*. Estimate the model using OLS. What are the R-squared and adjusted R-squared values? Do you think they are satisfactory? Why?
- Next, add the interaction of *humidity* and *windspeed* to the model above and estimate it using OLS. What is the R-squared of this model? Does this model improve upon the previous one?
- Add *visibility* as an additional explanatory variable to the first model and estimate it. Did R-squared increase? What about adjusted R-squared? Compare the improvement in adjusted R-squared contributed by the interaction term with that contributed by *visibility*. Which one is more useful?
- Choose the best one from the three models above with respect to their AIC and BIC scores. Validate your choice by discussing your justification with your mentor.

In [9]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

In [6]:
# Load the data into weather_df.
weather_df = pd.read_csv('weatherHistory.csv')

# Create a new column to hold the target variable (apparent minus actual temperature).
weather_df['DiffTemp'] = weather_df['Apparent Temperature (C)'] - weather_df['Temperature (C)']

In [11]:
# Define the target and the explanatory variables, then add a constant term.
y = weather_df.DiffTemp
X = weather_df[["Humidity", "Wind Speed (km/h)"]]
X = sm.add_constant(X)

In [14]:
# Fit the regression model.
results = sm.OLS(y, X).fit()
# Show the results.
results.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | DiffTemp | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 12:49:47 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| Humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| Wind Speed (km/h) | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.264 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


In [72]:
weather_df["WindSpeedxHumidity"] = weather_df["Humidity"] * weather_df["Wind Speed (km/h)"]

In [76]:
y = weather_df.DiffTemp
X = weather_df[["Humidity", "Wind Speed (km/h)","WindSpeedxHumidity"]]
X = sm.add_constant(X)

In [77]:
# Fit the regression model with the interaction term.
results = sm.OLS(y, X).fit()
# Show the results.
results.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | DiffTemp | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 13:51:49 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| Humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| Wind Speed (km/h) | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| WindSpeedxHumidity | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.262 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


**Comment:** After adding the interaction of humidity and windspeed, the model's R-squared increased from 0.288 to 0.341, so this model improves on the previous one.
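With an interaction term in the model, the effect of humidity on the target is no longer a single number: it depends on the wind speed. The following is a minimal sketch that uses the rounded coefficients from the summary above. The helper name `humidity_effect` is hypothetical, and the wind speeds are arbitrary illustrative values, not taken from the data.

```python
# Rounded coefficients copied from the interaction-model summary above.
B_HUMIDITY = 0.1775
B_INTERACTION = -0.2971  # coefficient on Humidity * Wind Speed (km/h)


def humidity_effect(wind_speed_kmh):
    """Change in DiffTemp per unit increase in Humidity at a given wind speed."""
    return B_HUMIDITY + B_INTERACTION * wind_speed_kmh


# Illustrative wind speeds (assumed values, not taken from the dataset).
for wind in (0, 10, 20):
    print(f"wind = {wind:>2} km/h -> humidity effect = {humidity_effect(wind):.4f}")
```

This also suggests why the Humidity coefficient changes sign between the two models: in the interaction model it describes the effect of humidity at zero wind speed only.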

In [79]:
# Define the target and the explanatory variables (now including visibility), then add a constant term.
y = weather_df.DiffTemp
X = weather_df[["Humidity", "Wind Speed (km/h)", "Visibility (km)"]]
X = sm.add_constant(X)

In [80]:
# Fit the regression model with visibility added.
results = sm.OLS(y, X).fit()
# Show the results.
results.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | DiffTemp | R-squared: | 0.304 |
| Model: | OLS | Adj. R-squared: | 0.303 |
| Method: | Least Squares | F-statistic: | 14010.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 13:53:09 | Log-Likelihood: | -169380.0 |
| No. Observations: | 96453 | AIC: | 338800.0 |
| Df Residuals: | 96449 | BIC: | 338800.0 |
| Df Model: | 3 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 1.5756 | 0.028 | 56.605 | 0.000 | 1.521 | 1.630 |
| Humidity | -2.6066 | 0.025 | -102.784 | 0.000 | -2.656 | -2.557 |
| Wind Speed (km/h) | -0.1199 | 0.001 | -179.014 | 0.000 | -0.121 | -0.119 |
| Visibility (km) | 0.0540 | 0.001 | 46.614 | 0.000 | 0.052 | 0.056 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 3833.895 | Durbin-Watson: | 0.279 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4584.022 |
| Skew: | -0.459 | Prob(JB): | 0.0 |
| Kurtosis: | 3.545 | Cond. No. | 131.0 |


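**Comments:** After adding visibility, R-squared increased from 0.288 to 0.304 and adjusted R-squared from 0.288 to 0.303. The interaction term raised both to 0.341, so the interaction term is the more useful addition.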
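Adjusted R-squared penalizes each extra predictor, but with 96453 observations that penalty is tiny. The sketch below applies the standard formula to the rounded R-squared values reported in the three summaries; the function name `adjusted_r2` and the model labels are illustrative choices, and because the inputs are rounded the results are approximate.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


N_OBS = 96453  # "No. Observations" in the summaries above

# (label, rounded R-squared from the summary, number of predictors)
models = [
    ("humidity + windspeed", 0.288, 2),
    ("+ interaction", 0.341, 3),
    ("+ visibility", 0.304, 3),
]

for label, r2, p in models:
    adj = adjusted_r2(r2, N_OBS, p)
    print(f"{label:<22} R2 = {r2:.3f}  adj R2 = {adj:.5f}  penalty = {r2 - adj:.1e}")
```

The penalty is far smaller than the third decimal place shown in the summaries, so the 0.304 versus 0.303 gap in the visibility model most likely reflects rounding of the underlying values rather than a meaningful penalty.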

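The second model (with the interaction term) is the best of the three. It has the highest R-squared (0.341 versus 0.288 and 0.304) and the lowest AIC and BIC scores (333400 versus 340900 and 338800), so it explains more of the variance and loses less information than the other two.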
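To make the comparison concrete, here is a minimal sketch of how R-squared, adjusted R-squared, AIC, and BIC can be computed for several candidate models and compared side by side. It uses NumPy and synthetic data rather than the weather dataset, and the helper `ols_metrics` is hypothetical. In statsmodels the same quantities are available on a fitted result as `rsquared`, `rsquared_adj`, `aic`, and `bic`.

```python
import numpy as np


def ols_metrics(X, y):
    """Fit OLS with an intercept and return goodness-of-fit metrics."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # add the constant term
    k = design.shape[1]                        # parameters, including the intercept
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    ssr = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    # Gaussian log-likelihood evaluated at the maximum-likelihood variance ssr / n
    llf = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return {"r2": r2, "adj_r2": adj_r2,
            "aic": 2 * k - 2 * llf, "bic": k * np.log(n) - 2 * llf}


# Synthetic data (assumed, for illustration only): y depends on x1, x2, and x1 * x2.
rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 2000))
y = 1.0 - 2.0 * x1 + 0.5 * x2 - 1.5 * x1 * x2 + rng.normal(size=2000)

candidates = {
    "base": np.column_stack([x1, x2]),
    "interaction": np.column_stack([x1, x2, x1 * x2]),
    "extra_variable": np.column_stack([x1, x2, x3]),  # x3 is unrelated to y
}
scores = {name: ols_metrics(X, y) for name, X in candidates.items()}
best_by_aic = min(scores, key=lambda name: scores[name]["aic"])
best_by_bic = min(scores, key=lambda name: scores[name]["bic"])
print(best_by_aic, best_by_bic)
```

Lower AIC and BIC are better. Adding any variable never lowers R-squared, which is why the information criteria and adjusted R-squared are the fairer basis for choosing between models of different sizes.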

**Final Comment:** The R-squared value is low for all of the models (at most 0.341), which is not acceptable. The models cannot explain most of the variance in the temperature difference, so they have to be enhanced.

## 2. House prices model

In this exercise, you'll work on your house prices model. To complete this assignment, submit a link in the gist file to the Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle.
- Run your house prices model again and assess the goodness of fit of your model using the F-test, R-squared, adjusted R-squared, AIC, and BIC.
- Do you think your model is satisfactory? If so, why?
- In order to improve the goodness of fit of your model, try different model specifications by adding or removing some variables. 
- For each model you try, get the goodness of fit metrics and compare your models with each other. Which model is the best and why?

In [17]:
house_df = pd.read_csv('housePrices.csv')
house_df.Alley = house_df.Alley.fillna("No Alley Access")
house_df.FireplaceQu = house_df.FireplaceQu.fillna("No Fireplace")
house_df.PoolQC = house_df.PoolQC.fillna("No pool")
house_df.Fence = house_df.Fence.fillna("No fence")
house_df.MiscFeature = house_df.MiscFeature.fillna("None")

# Fill missing values in numeric columns with the column median.
import pandas.api.types as ptypes
def fix_missing(df, col, name):
    if ptypes.is_numeric_dtype(col):
        df[name] = col.fillna(col.median())

for n, c in house_df.items():
    fix_missing(house_df, c, n)

# Drop any rows that still have missing values in non-numeric columns.
house_df = house_df.dropna()

# Get the categorical (object) columns. Note: pop(0) removes the first of them, which appears to be
# MSZoning in the preview below, not an ID column; Id is numeric and is not part of this list.
categoricColumns = house_df.select_dtypes('object').columns.tolist()
categoricColumns.pop(0)
len(categoricColumns)

# Create an empty dataframe to collect the dummy (one-hot) columns.
numeric_df = pd.DataFrame()
# Loop over the categorical columns and append each column's dummies.
for var in categoricColumns:
    numeric_df = pd.concat([numeric_df, pd.get_dummies(house_df[var], prefix=var)], axis=1)
numeric_df

# Join the original dataframe and the dummy columns into a new dataframe.
new_house_df = pd.concat([house_df, numeric_df], axis=1)
new_house_df.head()

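|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | No Alley Access | Reg | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | No Alley Access | Reg | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | No Alley Access | IR1 | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | No Alley Access | IR1 | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | No Alley Access | IR1 | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |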


In [49]:
# Select the columns whose correlation with the target variable is 0.33 or more.
chosenColumns = ['OverallQual', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'TotalBsmtSF', '1stFlrSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt',
 'GarageCars', 'GarageArea', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'ExterQual_Ex', 'ExterQual_Gd', 'Foundation_PConc', 'BsmtQual_Ex', 'BsmtFinType1_GLQ', 'HeatingQC_Ex',
 'KitchenQual_Ex', 'GarageFinish_Fin', 'SaleType_New', 'SaleCondition_Partial']
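The notebook does not show how the 0.33 correlation filter was applied. One way it could be done is sketched below on a small synthetic dataframe. The column names `strong`, `weak`, and `noise` are made up for illustration, and using the absolute value of the correlation is an assumption, since the original does not say whether negative correlations were considered.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the house data (assumed columns, illustration only).
rng = np.random.default_rng(42)
n = 500
demo_df = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
demo_df["SalePrice"] = (3.0 * demo_df["strong"] + 0.2 * demo_df["weak"]
                        + rng.normal(size=n))

# Pearson correlation of every numeric column with the target.
target_corr = demo_df.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")

# Keep columns whose absolute correlation reaches the 0.33 threshold.
THRESHOLD = 0.33
selected = target_corr[target_corr.abs() >= THRESHOLD].index.tolist()
print(selected)
```

Filtering on pairwise correlation is only a first screen: a feature that passes it can still turn out to be insignificant once the other features are in the model.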

In [50]:
y = new_house_df.SalePrice
X = new_house_df[chosenColumns]

X = sm.add_constant(X)

results3 = sm.OLS(y, X).fit()

results3.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.819 |
| Model: | OLS | Adj. R-squared: | 0.816 |
| Method: | Least Squares | F-statistic: | 228.7 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 13:14:47 | Log-Likelihood: | -15841.0 |
| No. Observations: | 1338 | AIC: | 31740.0 |
| Df Residuals: | 1311 | BIC: | 31880.0 |
| Df Model: | 26 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -6.617e+05 | 1.61e+05 | -4.109 | 0.000 | -9.78e+05 | -3.46e+05 |
| OverallQual | 1.323e+04 | 1306.209 | 10.125 | 0.000 | 1.07e+04 | 1.58e+04 |
| YearBuilt | 201.2914 | 66.633 | 3.021 | 0.003 | 70.573 | 332.010 |
| YearRemodAdd | 268.2792 | 68.391 | 3.923 | 0.000 | 134.112 | 402.447 |
| MasVnrArea | 4.7359 | 6.161 | 0.769 | 0.442 | -7.350 | 16.822 |
| BsmtFinSF1 | 13.3777 | 2.884 | 4.639 | 0.000 | 7.720 | 19.035 |
| TotalBsmtSF | -0.5938 | 5.900 | -0.101 | 0.920 | -12.169 | 10.982 |
| 1stFlrSF | 13.4661 | 6.075 | 2.217 | 0.027 | 1.549 | 25.383 |
| GrLivArea | 35.0392 | 4.322 | 8.106 | 0.000 | 26.560 | 43.519 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 758.662 | Durbin-Watson: | 1.922 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 96412.951 |
| Skew: | -1.635 | Prob(JB): | 0.0 |
| Kurtosis: | 44.457 | Cond. No. | 724000.0 |


**Comment:** 
1) The F-test p-value is lower than 0.05, which means the model is statistically significant at the 95% confidence level, so it is better than the empty (intercept-only) model. The R-squared value is 0.819, which is acceptable but can be improved.

2) The model is not satisfactory since it contains features that are not significant (for example, MasVnrArea and TotalBsmtSF). After removing them, it can be evaluated again.
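The overall F-test compares the fitted model against an intercept-only model, and for OLS with a constant the F-statistic can be written in terms of R-squared. The sketch below applies that relationship to the rounded values from the summary above; the function name `f_statistic` is illustrative, and because R-squared is rounded to three decimals the result only approximately matches the reported 228.7.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of an OLS model with an intercept, from its R-squared."""
    explained = r2 / n_predictors
    unexplained = (1 - r2) / (n_obs - n_predictors - 1)
    return explained / unexplained


# Rounded values from the first house-price summary: R2 = 0.819, n = 1338, Df Model = 26.
f_value = f_statistic(0.819, 1338, 26)
print(f"F = {f_value:.1f}")
```

A large F-statistic with a p-value near zero says only that the features are jointly useful. It does not say that every individual feature is significant, which is why the per-coefficient p-values still need to be checked.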

In [63]:
# Reduced feature list: drops several features from the first model, including the insignificant MasVnrArea and TotalBsmtSF.
chosenColumns2 = ['OverallQual', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1','1stFlrSF', 'GrLivArea', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars','Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'ExterQual_Ex', 'ExterQual_Gd', 'BsmtQual_Ex', 'KitchenQual_Ex']

In [68]:
# Create a new column that multiplies the living area by the garage car capacity, and add it to chosenColumns2.
new_house_df['AreaxGarageCars'] = new_house_df.GrLivArea * new_house_df.GarageCars
chosenColumns2.append('AreaxGarageCars')

y = new_house_df.SalePrice
X = new_house_df[chosenColumns2]

X = sm.add_constant(X)

results3 = sm.OLS(y, X).fit()

results3.summary()

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.827 |
| Model: | OLS | Adj. R-squared: | 0.825 |
| Method: | Least Squares | F-statistic: | 395.5 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 13:45:00 | Log-Likelihood: | -15811.0 |
| No. Observations: | 1338 | AIC: | 31660.0 |
| Df Residuals: | 1321 | BIC: | 31740.0 |
| Df Model: | 16 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -9.442e+05 | 1.26e+05 | -7.469 | 0.000 | -1.19e+06 | -6.96e+05 |
| OverallQual | 1.379e+04 | 1248.479 | 11.047 | 0.000 | 1.13e+04 | 1.62e+04 |
| YearBuilt | 176.2080 | 46.863 | 3.760 | 0.000 | 84.273 | 268.142 |
| YearRemodAdd | 322.0736 | 62.292 | 5.170 | 0.000 | 199.871 | 444.276 |
| BsmtFinSF1 | 14.5669 | 2.346 | 6.208 | 0.000 | 9.964 | 19.170 |
| 1stFlrSF | 16.3112 | 3.268 | 4.991 | 0.000 | 9.901 | 22.722 |
| GrLivArea | -18.8619 | 7.192 | -2.623 | 0.009 | -32.971 | -4.753 |
| TotRmsAbvGrd | 2343.0505 | 1062.009 | 2.206 | 0.028 | 259.642 | 4426.459 |
| Fireplaces | 1.09e+04 | 1659.367 | 6.570 | 0.000 | 7645.948 | 1.42e+04 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 929.475 | Durbin-Watson: | 1.935 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 114210.889 |
| Skew: | -2.332 | Prob(JB): | 0.0 |
| Kurtosis: | 48.021 | Cond. No. | 1.54e+16 |


**Comments:**
1) After dropping some features and adding a new variable created from existing ones, the model's R-squared improved from 0.819 to 0.827.

2) The second model is better since it has a higher R-squared value (and a higher adjusted R-squared, 0.825 versus 0.816), and its AIC and BIC scores are lower than those of the first model.
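As a cross-check on the comparison above, AIC and BIC can be recomputed from the log-likelihood, the number of estimated parameters, and the number of observations. This is a minimal sketch using the rounded log-likelihoods from the two house-price summaries; the function names are illustrative, and the parameter count is taken as Df Model plus one for the constant.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


N_OBS = 1338  # "No. Observations" in both house-price summaries

# (label, rounded log-likelihood, parameters = Df Model + 1 for the constant)
house_models = [("first model", -15841.0, 27), ("second model", -15811.0, 17)]

for label, llf, k in house_models:
    print(f"{label}: AIC = {aic(llf, k):.0f}, BIC = {bic(llf, k, N_OBS):.0f}")
```

After rounding to four significant figures, these values agree with the AIC and BIC shown in the summaries. BIC charges ln(1338), roughly 7.2, per parameter instead of AIC's 2, which is why the gap between the two models is wider in BIC than in AIC.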