# <font color=blue>Assignments for "Assumptions of Linear Regression"</font>

To close out this lesson, you'll do two assignments. Both require you to create Jupyter notebooks. Please submit a link to a single Gist file that contains links to the two notebooks.

## 1. Predicting temperature

In this exercise, you'll work with historical temperature data from the Szeged, Hungary area. You will download the dataset from [Kaggle](https://www.kaggle.com/budincsevity/szeged-weather/home). To complete this assignment, submit a Jupyter notebook containing your solutions to the following tasks:

- First, load the dataset from Kaggle.
- Build a regression model where the target variable is *temperature*. As explanatory variables, use *humidity*, *windspeed*, *windbearing* and *pressure*. Estimate the model using OLS.
- Now, check if your model meets the Gauss-Markov Conditions above. If some of the assumptions are not met, discuss the implications of the violations for the correctness of your model.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.api.types import is_numeric_dtype
from sklearn import metrics
import math

import warnings
warnings.filterwarnings("ignore")

wh = pd.read_csv("C:/Users/Elif/data/weatherHistory.csv")
wh.head()

| | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-04-01 00:00:00.000 +0200 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-04-01 01:00:00.000 +0200 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 02:00:00.000 +0200 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 03:00:00.000 +0200 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 04:00:00.000 +0200 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


In [14]:
from sklearn import linear_model

# Y is the target variable
Y = wh['Temperature (C)']
# X is the feature set: humidity, wind speed, wind bearing and pressure
X = wh[['Humidity','Wind Speed (km/h)','Wind Bearing (degrees)','Pressure (millibars)']]


In [18]:
import statsmodels.api as sm

# statsmodels does not include an intercept by default,
# so we add the constant column manually
X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Temperature (C) | R-squared: | 0.421 |
| Model: | OLS | Adj. R-squared: | 0.421 |
| Method: | Least Squares | F-statistic: | 17500.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 19:25:53 | Log-Likelihood: | -328210.0 |
| No. Observations: | 96453 | AIC: | 656400.0 |
| Df Residuals: | 96448 | BIC: | 656500.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 37.9264 | 0.233 | 162.709 | 0.000 | 37.470 | 38.383 |
| Humidity | -32.4962 | 0.123 | -264.288 | 0.000 | -32.737 | -32.255 |
| Wind Speed (km/h) | -0.2014 | 0.003 | -57.557 | 0.000 | -0.208 | -0.195 |
| Wind Bearing (degrees) | 0.0040 | 0.000 | 18.463 | 0.000 | 0.004 | 0.004 |
| Pressure (millibars) | -0.0007 | 0.000 | -3.452 | 0.001 | -0.001 | -0.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 3375.432 | Durbin-Watson: | 0.057 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 3793.297 |
| Skew: | -0.455 | Prob(JB): | 0.0 |
| Kurtosis: | 3.339 | Cond. No. | 10600.0 |
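
The Durbin-Watson statistic of 0.057 reported in the summary is far below 2, which suggests strong positive autocorrelation in the residuals, as one might expect from hourly weather readings. The sketch below illustrates how the statistic behaves, using NumPy only. The `durbin_watson` helper, the synthetic series and the value of `rho` are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(0)
n = 5000

# Uncorrelated errors: the statistic should sit close to 2.
white_noise = rng.normal(size=n)

# Synthetic AR(1) errors with strong persistence (rho is an assumed value).
rho = 0.97
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = rho * ar1[t - 1] + shocks[t]

dw_white = durbin_watson(white_noise)
dw_ar1 = durbin_watson(ar1)
print(f"white noise DW: {dw_white:.3f}")
print(f"AR(1) DW:       {dw_ar1:.3f}")
```

On the fitted model, the same helper could be applied to `results.resid`; statsmodels also provides `statsmodels.stats.stattools.durbin_watson`. When errors are positively autocorrelated like this, the nonrobust standard errors in the coefficient table are likely to be too small, so the reported t-statistics should be read with caution.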


## 2. House prices

To complete this assignment, submit a Jupyter notebook containing your solutions to the following tasks:

- Load the **houseprices** data from Kaggle. 
- Reimplement the model you built in the previous lesson. 
- Check for all of the assumptions above and discuss the implications if some of the assumptions are not met.

In [8]:
hp = pd.read_csv("C:/Users/Elif/data/house_prices.csv")

In [11]:
from sklearn import linear_model
import statsmodels.api as sm
# Y is the target variable
Y = hp['SalePrice']
# X is the feature set: above-ground living area and overall quality
X = hp[['GrLivArea','OverallQual']]

# We create a LinearRegression model object
# from scikit-learn's linear_model module.
lrm = linear_model.LinearRegression()

# fit method estimates the coefficients using OLS
lrm.fit(X, Y)

X = sm.add_constant(X)

results = sm.OLS(Y, X).fit()
print('\nCoefficients: \n', lrm.coef_)
print('\nIntercept: \n', lrm.intercept_)
results.summary()


Coefficients: 
 [   55.86222591 32849.04744063]

Intercept: 
 -104092.66963598118


| | | | |
|---|---|---|---|
| Dep. Variable: | SalePrice | R-squared: | 0.714 |
| Model: | OLS | Adj. R-squared: | 0.714 |
| Method: | Least Squares | F-statistic: | 1820.0 |
| Date: | Sat, 19 Sep 2020 | Prob (F-statistic): | 0.0 |
| Time: | 19:21:29 | Log-Likelihood: | -17630.0 |
| No. Observations: | 1460 | AIC: | 35270.0 |
| Df Residuals: | 1457 | BIC: | 35280.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.041e+05 | 5045.372 | -20.631 | 0.000 | -1.14e+05 | -9.42e+04 |
| GrLivArea | 55.8622 | 2.630 | 21.242 | 0.000 | 50.704 | 61.021 |
| OverallQual | 3.285e+04 | 999.198 | 32.875 | 0.000 | 3.09e+04 | 3.48e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 341.985 | Durbin-Watson: | 1.985 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 8725.15 |
| Skew: | 0.469 | Prob(JB): | 0.0 |
| Kurtosis: | 14.939 | Cond. No. | 7350.0 |
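
The normality diagnostics in this summary can be checked by hand. The Jarque-Bera statistic is n/6 × (S² + (K − 3)²/4), where S is the skewness and K the kurtosis of the residuals. The sketch below plugs in the rounded values from the summary; the `jarque_bera` helper is an illustrative name and is not part of the original notebook.

```python
def jarque_bera(n, skew, kurtosis):
    """Jarque-Bera statistic from sample size, skewness and (non-excess) kurtosis."""
    return n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

# Values copied from the house-price summary.
n_obs = 1460
skew = 0.469
kurtosis = 14.939

jb = jarque_bera(n_obs, skew, kurtosis)
# The summary reports 8725.15; a small gap is expected because the inputs are rounded.
print(f"Jarque-Bera from rounded inputs: {jb:.1f}")
```

Almost all of the statistic comes from the kurtosis term: a kurtosis near 15 is far from the value of 3 expected under normality, so the residuals appear heavy-tailed and the normality assumption looks doubtful for this model.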


Since we estimated the parameters using OLS, we can write our estimated model:

**SalePrice = -104092.67 + 55.86 × GrLivArea + 32849.05 × OverallQual**
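
To see how the estimated equation is read, the sketch below evaluates it for one house. The coefficients are the ones printed by the scikit-learn fit; the function name and the example house (1,500 square feet, quality rating 6) are hypothetical and chosen only for illustration.

```python
# Coefficients as printed by the scikit-learn fit in the notebook.
INTERCEPT = -104092.66963598118
COEF_GRLIVAREA = 55.86222591
COEF_OVERALLQUAL = 32849.04744063

def predict_sale_price(gr_liv_area, overall_qual):
    """Evaluate the fitted linear model for a single house."""
    return INTERCEPT + COEF_GRLIVAREA * gr_liv_area + COEF_OVERALLQUAL * overall_qual

# Hypothetical house: 1,500 sq ft of living area, overall quality 6.
example_price = predict_sale_price(1500, 6)
print(f"Predicted SalePrice: {example_price:,.2f}")
```

Each extra square foot of living area adds about 55.86 to the predicted price with quality held fixed, and each extra quality point adds about 32,849. The negative intercept has no direct interpretation, since no house has zero living area and zero quality.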
 
 
The assumptions to check for this model are:

- linearity of the model in the coefficients
- the error term should be zero on average
- homoscedasticity
- low multicollinearity
- error terms should be uncorrelated with one another
- features shouldn't be correlated with the errors
- normality of the errors
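
Some of the assumptions listed above can be checked with a few lines of NumPy. The sketch below is a minimal illustration on synthetic stand-in data, not on the house-price data itself; the variable names, the generated features and the `vif` helper are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-ins for two correlated features (not the real data).
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

# OLS with an intercept, solved by least squares.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Error term zero on average: holds by construction once a constant is included.
resid_mean = resid.mean()

# Features vs. residuals: in-sample correlation is also zero by construction,
# so this check cannot reveal omitted variables on its own.
feat_resid_corr = [np.corrcoef(X[:, j], resid)[0, 1] for j in (1, 2)]

def vif(X, j):
    """Variance inflation factor 1 / (1 - R_j^2) for column j of X."""
    others = np.delete(X, j, axis=1)  # still contains the constant column
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    ss_res = np.sum((X[:, j] - others @ coef) ** 2)
    ss_tot = np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return ss_tot / ss_res

vifs = [vif(X, j) for j in (1, 2)]
print("mean residual:", resid_mean)
print("feature/residual correlations:", feat_resid_corr)
print("VIFs:", vifs)
```

On the fitted model, `results.resid` and the columns of `X` would take the place of the synthetic arrays. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` and `het_breuschpagan` in `statsmodels.stats.diagnostic`, the latter being one way to test homoscedasticity.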

