# <font color=blue>Assignments for "Understanding The Relationship"</font>

To close out this lesson, you're going to do three assignments. For the first assignment, you'll write up a short answer to a question in a Gist file. For the second and third assignments, you'll do your work in Jupyter notebooks, and you should link to those notebooks in the same Gist file.

Please submit a single Gist file containing the answer to the first assignment, plus links to the notebooks for the second and third.

## 1. Interpretation and significance

Suppose that we would like to know how much families in the US are spending on recreation annually. We estimated the following model:

$$ expenditure = 873 + 0.0012annual\_income + 0.00002annual\_income^2 - 223.57have\_kids $$

*expenditure* is the annual spending on recreation in US dollars, *annual_income* is the annual income in US dollars, and *have_kids* is a dummy variable indicating families with children. Interpret the estimated coefficients. What additional statistics should be given in order to make sure that your interpretations make sense statistically? Write up your answer and save it in a Gist.

**Answer:**

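1) F-statistic and its p-value: they show whether the model as a whole is statistically significant, i.e. whether the explanatory variables jointly explain any of the variation in expenditure.

2) Coefficient p-values (or the t-statistics and standard errors behind them): they show whether each individual coefficient is statistically different from zero, and so whether keeping that variable in the model is meaningful.

3) R-squared / adjusted R-squared: they show how much of the variance in the target the model explains, which tells us how well the model fits.

With these statistics in hand, we can judge whether the interpretations of the coefficients and of the model are statistically meaningful.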
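
The coefficients themselves can be read directly off the estimated equation. The sketch below is only an illustration: the function names and the example income levels are assumptions, not part of the assignment. Because income enters both linearly and squared, the change in expenditure per extra dollar of income is 0.0012 + 2 × 0.00002 × annual_income, so it depends on the income level. The *have_kids* coefficient says that families with children are estimated to spend 223.57 dollars less per year than otherwise identical families without children, and 873 is the predicted spending at zero income with no children.

```python
# Coefficients taken from the estimated equation in the assignment text.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_HAVE_KIDS = -223.57


def predicted_expenditure(annual_income, have_kids):
    """Predicted annual recreation spending in US dollars."""
    return (INTERCEPT
            + B_INCOME * annual_income
            + B_INCOME_SQ * annual_income ** 2
            + B_HAVE_KIDS * have_kids)


def marginal_effect_of_income(annual_income):
    """Derivative of the prediction with respect to annual_income."""
    return B_INCOME + 2 * B_INCOME_SQ * annual_income


# Hypothetical income levels, chosen only for illustration.
for income in (20_000, 50_000, 100_000):
    print(income, marginal_effect_of_income(income))

# Gap between families with and without children at the same income.
print(predicted_expenditure(50_000, 1) - predicted_expenditure(50_000, 0))
```

The gap printed on the last line is the same at every income level, because *have_kids* enters the equation without any interaction.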


## 2. Weather model

In this exercise, you'll work with the historical temperature data from the previous lesson. To complete this assignment, submit a link in the gist file to the Jupyter notebook containing your solutions to the following tasks:

- First, load the dataset from the **weatherinszeged** table from Kaggle.
- Build a linear regression model where your target variable is the difference between the *apparenttemperature* and the *temperature*. As explanatory variables, use *humidity* and *windspeed*. Now, estimate your model using OLS. Are the estimated coefficients statistically significant? Are the signs of the estimated coefficients in line with your previous expectations? Interpret the estimated coefficients. What are the relations between the target and the explanatory variables? 
- Next, add the interaction of *humidity* and *windspeed* to the model above and estimate the model using OLS. Are the coefficients statistically significant? Did the signs of the estimated coefficients for *humidity* and *windspeed* change? Interpret the estimated coefficients.

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression



In [24]:
# Load the weather data into weather_df.
weather_df = pd.read_csv('weatherHistory.csv')

# Target variable: apparent temperature minus actual temperature.
weather_df['DiffTemp'] = weather_df['Apparent Temperature (C)'] - weather_df['Temperature (C)']

In [11]:
y = weather_df.DiffTemp

In [12]:
X = weather_df[["Humidity", "Wind Speed (km/h)"]]

In [13]:
lrm = LinearRegression()

In [14]:
lrm.fit(X, y)

LinearRegression()

In [15]:
lrm.coef_

array([-3.02918594, -0.11929075])

In [20]:
lrm.intercept_

2.4381054151878074

In [22]:
import statsmodels.api as sm



In [23]:
X2 = sm.add_constant(X)

results = sm.OLS(y, X2).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | DiffTemp | R-squared: | 0.288 |
| Model: | OLS | Adj. R-squared: | 0.288 |
| Method: | Least Squares | F-statistic: | 19490.0 |
| Date: | Sun, 13 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 15:27:00 | Log-Likelihood: | -170460.0 |
| No. Observations: | 96453 | AIC: | 340900.0 |
| Df Residuals: | 96450 | BIC: | 340900.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.4381 | 0.021 | 115.948 | 0.000 | 2.397 | 2.479 |
| Humidity | -3.0292 | 0.024 | -126.479 | 0.000 | -3.076 | -2.982 |
| Wind Speed (km/h) | -0.1193 | 0.001 | -176.164 | 0.000 | -0.121 | -0.118 |

| | | | |
|---|---|---|---|
| Omnibus: | 3935.747 | Durbin-Watson: | 0.264 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 4613.311 |
| Skew: | -0.478 | Prob(JB): | 0.0 |
| Kurtosis: | 3.484 | Cond. No. | 88.1 |


**Comments:**

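statsmodels reproduces the scikit-learn estimates (intercept 2.4381, coefficients -3.0292 and -0.1193) and adds the statistics needed for inference.
All three coefficients are significant at the 5% level: their p-values are below 0.05. Both humidity and wind speed have a negative estimated effect on the difference between apparent and actual temperature. Holding the other variable fixed, a one-unit increase in Humidity lowers the difference by about 3.03 °C, and each additional km/h of wind speed lowers it by about 0.12 °C. The negative wind speed coefficient is in line with expectations, but the negative humidity coefficient is not, since higher humidity is expected to make it feel warmer.
On the other hand, the R-squared is low: the model explains only about 29% of the variance in the target (R-squared = 0.288). Therefore, this model should be improved.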
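
As a rough illustration of how the fitted coefficients translate into predictions, the sketch below plugs the rounded estimates from the summary table into the regression equation. The function name and the input values are assumptions made up for the example. The inputs are taken to be in the same units as the dataset's `Humidity` and `Wind Speed (km/h)` columns.

```python
# Rounded estimates copied from the OLS summary table above.
CONST = 2.4381
B_HUMIDITY = -3.0292
B_WIND_SPEED = -0.1193


def predicted_diff(humidity, wind_speed_kmh):
    """Predicted (apparent - actual) temperature difference in degrees C."""
    return CONST + B_HUMIDITY * humidity + B_WIND_SPEED * wind_speed_kmh


# Hypothetical conditions, chosen only to show the direction of each effect.
calm = predicted_diff(humidity=0.5, wind_speed_kmh=0.0)
windy = predicted_diff(humidity=0.5, wind_speed_kmh=10.0)

print(calm, windy)
# Ten extra km/h of wind changes the prediction by 10 * B_WIND_SPEED.
print(windy - calm)
```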

### Interacting wind speed and humidity

In [26]:
weather_df['HumidityxWindSpeed'] = weather_df["Humidity"] * weather_df["Wind Speed (km/h)"]

In [27]:
X3 = weather_df[["Humidity", "Wind Speed (km/h)", "HumidityxWindSpeed"]]

In [30]:
# Add a constant column so that sm.OLS estimates an intercept.
X3 = sm.add_constant(X3)

In [31]:
# Fit OLS on the design matrix that includes the interaction term.
results2 = sm.OLS(y, X3).fit()

# Display the summary of the statistical results.
results2.summary()


| | | | |
|---|---|---|---|
| Dep. Variable: | DiffTemp | R-squared: | 0.341 |
| Model: | OLS | Adj. R-squared: | 0.341 |
| Method: | Least Squares | F-statistic: | 16660.0 |
| Date: | Sun, 13 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:22:20 | Log-Likelihood: | -166690.0 |
| No. Observations: | 96453 | AIC: | 333400.0 |
| Df Residuals: | 96449 | BIC: | 333400.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 0.0839 | 0.033 | 2.511 | 0.012 | 0.018 | 0.149 |
| Humidity | 0.1775 | 0.043 | 4.133 | 0.000 | 0.093 | 0.262 |
| Wind Speed (km/h) | 0.0905 | 0.002 | 36.797 | 0.000 | 0.086 | 0.095 |
| HumidityxWindSpeed | -0.2971 | 0.003 | -88.470 | 0.000 | -0.304 | -0.291 |

| | | | |
|---|---|---|---|
| Omnibus: | 4849.937 | Durbin-Watson: | 0.262 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 9295.404 |
| Skew: | -0.378 | Prob(JB): | 0.0 |
| Kurtosis: | 4.32 | Cond. No. | 193.0 |


**Comments:**

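The model now explains more of the variance: R-squared rises from 0.288 to 0.341. All coefficients are significant at the 5% level, including the interaction term. The signs of the Humidity and Wind Speed coefficients flipped from negative to positive, but with an interaction in the model they only describe the effect of one variable when the other is zero. The interaction coefficient is negative (-0.2971), so the effect of wind speed on the difference is 0.0905 - 0.2971 × humidity, which becomes more negative as humidity rises, and likewise for humidity as wind speed rises. Even so, an R-squared of 0.341 means the model is still not good enough.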
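
With an interaction term, the effect of each variable is no longer a single number. A minimal sketch of the calculation, using the rounded coefficients from the summary table above (the function names and the example humidity values are assumptions for illustration):

```python
# Rounded estimates copied from the interaction model's summary table.
B_HUMIDITY = 0.1775
B_WIND_SPEED = 0.0905
B_INTERACTION = -0.2971


def wind_speed_effect(humidity):
    """Change in DiffTemp per extra km/h of wind at a given humidity."""
    return B_WIND_SPEED + B_INTERACTION * humidity


def humidity_effect(wind_speed_kmh):
    """Change in DiffTemp per unit of humidity at a given wind speed."""
    return B_HUMIDITY + B_INTERACTION * wind_speed_kmh


# Humidity level at which the wind speed effect changes sign.
sign_change_humidity = B_WIND_SPEED / -B_INTERACTION

# Hypothetical humidity values, chosen only for illustration.
for h in (0.0, 0.5, 1.0):
    print(h, wind_speed_effect(h))
print(sign_change_humidity)
```

Above the printed sign-change level, the estimated wind speed effect is negative, which is the direction the first model reported.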

## 3. House prices model

In this exercise, you'll interpret your house prices model. To complete this assignment, submit a link in the gist file to the Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle.
- Run your house prices model again and interpret the results. Which features are statistically significant and which are not?
- Now, exclude the insignificant features from your model. Did anything change?
- Interpret the statistically significant coefficients by quantifying their relations with the house prices. Which features have more prominent effect on the house prices?
- Do the results sound reasonable to you? If not, try to explain the potential reasons.

In [46]:
house_df = pd.read_csv('house_prices.csv')
house_df.Alley = house_df.Alley.fillna("No Alley Access")
house_df.FireplaceQu = house_df.FireplaceQu.fillna("No Fireplace")
house_df.PoolQC = house_df.PoolQC.fillna("No pool")
house_df.Fence = house_df.Fence.fillna("No fence")
house_df.MiscFeature = house_df.MiscFeature.fillna("None")

# Fill missing values in numeric columns with the column median.
import pandas.api.types as ptypes
def fix_missing(df, col, name):
    if ptypes.is_numeric_dtype(col):
        df[name] = col.fillna(col.median())

for n, c in house_df.items():
    fix_missing(house_df, c, n)

# Drop any rows that still have missing values in non-numeric columns.
house_df = house_df.dropna()

# Collect the categorical (object-dtype) columns. Id is numeric, so it is not in this list;
# note that pop(0) therefore drops the first categorical column (MSZoning in the output below).
categoricColumns = house_df.select_dtypes('object').columns.tolist()
categoricColumns.pop(0)
len(categoricColumns)

# Start from an empty dataframe and append the dummy columns to it.
numeric_df = pd.DataFrame()
# One-hot encode each categorical column; dtype=int keeps the dummies as 0/1 for sm.OLS.
for var in categoricColumns:
    numeric_df = pd.concat([numeric_df, pd.get_dummies(house_df[var], prefix=var, dtype=int)], axis=1)
numeric_df

#Adding numerical columns and original dataframe to new df.
new_house_df = pd.concat([house_df, numeric_df], axis=1)
new_house_df.head()

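| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | No Alley Access | Reg | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | No Alley Access | Reg | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | No Alley Access | IR1 | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | No Alley Access | IR1 | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | No Alley Access | IR1 | Lvl | AllPub | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |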


In [48]:
y = new_house_df.SalePrice

X = new_house_df[['OverallQual','YearBuilt','YearRemodAdd','TotalBsmtSF','1stFlrSF','GrLivArea','FullBath','TotRmsAbvGrd','GarageCars','GarageArea','BsmtQual_Ex','KitchenQual_Ex']]


In [49]:
X = sm.add_constant(X)

In [50]:
results3 = sm.OLS(y, X).fit()

results3.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.792 |
| Model: | OLS | Adj. R-squared: | 0.79 |
| Method: | Least Squares | F-statistic: | 420.7 |
| Date: | Sun, 13 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:39:46 | Log-Likelihood: | -15935.0 |
| No. Observations: | 1338 | AIC: | 31900.0 |
| Df Residuals: | 1325 | BIC: | 31960.0 |
| Df Model: | 12 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.077e+06 | 1.32e+05 | -8.150 | 0.000 | -1.34e+06 | -8.18e+05 |
| OverallQual | 1.595e+04 | 1269.064 | 12.572 | 0.000 | 1.35e+04 | 1.84e+04 |
| YearBuilt | 266.1787 | 52.878 | 5.034 | 0.000 | 162.446 | 369.912 |
| YearRemodAdd | 255.8068 | 66.604 | 3.841 | 0.000 | 125.146 | 386.468 |
| TotalBsmtSF | 11.1026 | 5.995 | 1.852 | 0.064 | -0.658 | 22.863 |
| 1stFlrSF | 13.3141 | 6.274 | 2.122 | 0.034 | 1.005 | 25.623 |
| GrLivArea | 53.2180 | 4.194 | 12.688 | 0.000 | 44.989 | 61.447 |
| FullBath | -4367.3438 | 2756.241 | -1.585 | 0.113 | -9774.416 | 1039.729 |
| TotRmsAbvGrd | -452.5890 | 1141.148 | -0.397 | 0.692 | -2691.244 | 1786.066 |

| | | | |
|---|---|---|---|
| Omnibus: | 713.445 | Durbin-Watson: | 1.986 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 83287.818 |
| Skew: | -1.488 | Prob(JB): | 0.0 |
| Kurtosis: | 41.537 | Cond. No. | 487000.0 |


**Comment:** 

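The model's R-squared is 0.792 (adjusted 0.79). TotRmsAbvGrd, GarageArea, and TotalBsmtSF are not significant at the 5% level, so these three features are removed from the next model. FullBath (p = 0.113) is also above the 0.05 threshold; it is kept for now and re-checked after refitting. The other variables are significant.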
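
The significance check can also be done programmatically instead of by eye. The sketch below applies a 0.05 threshold to the p-values of the rows displayed in the summary table above (features that do not appear in the displayed rows, such as GarageArea, are left out). The dictionary and the variable names are assumptions for illustration. With a fitted statsmodels model, the same values are available as `results3.pvalues`.

```python
# p-values copied from the rows displayed in the summary table above.
p_values = {
    "OverallQual": 0.000,
    "YearBuilt": 0.000,
    "YearRemodAdd": 0.000,
    "TotalBsmtSF": 0.064,
    "1stFlrSF": 0.034,
    "GrLivArea": 0.000,
    "FullBath": 0.113,
    "TotRmsAbvGrd": 0.692,
}

ALPHA = 0.05  # 5% significance level

# Split the features by whether their p-value clears the threshold.
insignificant = [name for name, p in p_values.items() if p > ALPHA]
significant = [name for name, p in p_values.items() if p <= ALPHA]

print("not significant:", insignificant)
print("significant:", significant)
```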

In [53]:
y = new_house_df.SalePrice

X = new_house_df[['OverallQual','YearBuilt','YearRemodAdd','1stFlrSF','GrLivArea','FullBath','GarageCars','BsmtQual_Ex','KitchenQual_Ex']]

X = sm.add_constant(X)

results4 = sm.OLS(y, X).fit()

results4.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.791 |
| Model: | OLS | Adj. R-squared: | 0.79 |
| Method: | Least Squares | F-statistic: | 559.9 |
| Date: | Sun, 13 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:43:49 | Log-Likelihood: | -15937.0 |
| No. Observations: | 1338 | AIC: | 31890.0 |
| Df Residuals: | 1328 | BIC: | 31950.0 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.11e+06 | 1.3e+05 | -8.523 | 0.000 | -1.37e+06 | -8.54e+05 |
| OverallQual | 1.611e+04 | 1262.743 | 12.756 | 0.000 | 1.36e+04 | 1.86e+04 |
| YearBuilt | 296.7283 | 50.714 | 5.851 | 0.000 | 197.241 | 396.216 |
| YearRemodAdd | 241.5888 | 65.877 | 3.667 | 0.000 | 112.355 | 370.822 |
| 1stFlrSF | 23.8476 | 3.276 | 7.278 | 0.000 | 17.420 | 30.275 |
| GrLivArea | 51.7038 | 3.081 | 16.781 | 0.000 | 45.660 | 57.748 |
| FullBath | -4905.6223 | 2718.705 | -1.804 | 0.071 | -1.02e+04 | 427.803 |
| GarageCars | 1.385e+04 | 2151.099 | 6.440 | 0.000 | 9633.167 | 1.81e+04 |
| BsmtQual_Ex | 3.457e+04 | 4366.732 | 7.916 | 0.000 | 2.6e+04 | 4.31e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 612.385 | Durbin-Watson: | 1.986 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 62509.908 |
| Skew: | -1.154 | Prob(JB): | 0.0 |
| Kurtosis: | 36.405 | Cond. No. | 451000.0 |


**Comment:** 

After removing them, the R-squared is 0.791, almost the same as before (0.792). FullBath (p = 0.071) is not significant at the 5% level, so it can be removed from the model as well.

In [54]:
y = new_house_df.SalePrice

X = new_house_df[['OverallQual','YearBuilt','YearRemodAdd','1stFlrSF','GrLivArea','GarageCars','BsmtQual_Ex','KitchenQual_Ex']]

X = sm.add_constant(X)

results5 = sm.OLS(y, X).fit()

results5.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.791 |
| Model: | OLS | Adj. R-squared: | 0.79 |
| Method: | Least Squares | F-statistic: | 628.4 |
| Date: | Sun, 13 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 17:45:16 | Log-Likelihood: | -15938.0 |
| No. Observations: | 1338 | AIC: | 31890.0 |
| Df Residuals: | 1329 | BIC: | 31940.0 |
| Df Model: | 8 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.036e+06 | 1.24e+05 | -8.376 | 0.000 | -1.28e+06 | -7.93e+05 |
| OverallQual | 1.59e+04 | 1258.784 | 12.634 | 0.000 | 1.34e+04 | 1.84e+04 |
| YearBuilt | 272.0125 | 48.870 | 5.566 | 0.000 | 176.141 | 367.884 |
| YearRemodAdd | 227.5621 | 65.472 | 3.476 | 0.001 | 99.123 | 356.002 |
| 1stFlrSF | 24.1915 | 3.274 | 7.390 | 0.000 | 17.769 | 30.614 |
| GrLivArea | 49.0273 | 2.703 | 18.140 | 0.000 | 43.725 | 54.329 |
| GarageCars | 1.351e+04 | 2144.595 | 6.300 | 0.000 | 9304.850 | 1.77e+04 |
| BsmtQual_Ex | 3.503e+04 | 4362.856 | 8.029 | 0.000 | 2.65e+04 | 4.36e+04 |
| KitchenQual_Ex | 3.537e+04 | 4672.299 | 7.571 | 0.000 | 2.62e+04 | 4.45e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 581.701 | Durbin-Watson: | 1.983 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 56266.892 |
| Skew: | -1.058 | Prob(JB): | 0.0 |
| Kurtosis: | 34.698 | Cond. No. | 427000.0 |


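The R-squared is 0.791, essentially unchanged from the first model's 0.792, so removing the insignificant features cost almost no explanatory power. All remaining coefficients are significant at the 5% level. The coefficient estimates do shift between models: for example, the 1stFlrSF coefficient rises from 13.31 in the first model to 24.19 here, which likely reflects correlation between the removed features and the ones that remain.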
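
To quantify the relations with the house price, each coefficient of the final model can be multiplied by a change in its feature, holding the other features fixed. The sketch below uses the rounded coefficients from the last summary table. The function name and the example changes are assumptions for illustration. The raw coefficients are in different units (dollars per square foot, per quality point, per dummy switch), so their sizes are not directly comparable without choosing a meaningful change for each feature.

```python
# Rounded coefficients copied from the final model's summary table.
final_coefs = {
    "OverallQual": 15900.0,
    "YearBuilt": 272.0125,
    "YearRemodAdd": 227.5621,
    "1stFlrSF": 24.1915,
    "GrLivArea": 49.0273,
    "GarageCars": 13510.0,
    "BsmtQual_Ex": 35030.0,
    "KitchenQual_Ex": 35370.0,
}


def price_change(feature, delta):
    """Estimated change in SalePrice when `feature` changes by `delta`."""
    return final_coefs[feature] * delta


# Hypothetical changes, chosen only for illustration.
example_changes = {
    "OverallQual": 1,      # one extra quality point
    "GrLivArea": 100,      # 100 extra square feet of living area
    "GarageCars": 1,       # room for one more car
    "YearBuilt": 10,       # built ten years later
    "KitchenQual_Ex": 1,   # excellent kitchen versus not
}

for feature, delta in example_changes.items():
    print(feature, delta, price_change(feature, delta))
```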