## Question 1

The easiest coefficient to interpret is -223.57, which says that a family with kids will spend $223.57 less per year than a family with the same income and no kids. The constant 873 is tougher to interpret, but could be seen as a baseline or minimum, since that is the expenditure our model would predict if a family had no kids and no income. However, this is likely a rare condition, and the model is better suited to predicting expenditures of families that do have incomes. As such, the constant is best interpreted as the intercept (bias) term of the model. Finally, the coefficients of 0.0012 and 0.00002 for annual income and the square of annual income are difficult to interpret on their own. If the squared term were not included, the 0.0012 term would be simple enough to interpret as the change in expenditure per additional dollar of income. With the squared term present, the effect of one more dollar of income depends on the income level: it is 0.0012 + 2(0.00002)(income). Since both coefficients are positive, the best interpretation is that expenditure increases as income increases, and does so at an increasing rate.

To ensure that these interpretations are statistically meaningful, it would be good to have the t-statistic and p-value for each coefficient.
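
To make the role of the squared term concrete, the sketch below plugs the coefficients quoted above into the fitted equation. The function names, the 0/1 coding of `has_kids`, and the sample income levels are assumptions made for illustration; only the four coefficients come from the question.

```python
# Coefficients quoted in the answer above.
INTERCEPT = 873.0
B_INCOME = 0.0012
B_INCOME_SQ = 0.00002
B_KIDS = -223.57


def predicted_expenditure(income, has_kids):
    """Predicted annual expenditure; has_kids is 1 or 0 (assumed dummy coding)."""
    return (INTERCEPT
            + B_INCOME * income
            + B_INCOME_SQ * income ** 2
            + B_KIDS * has_kids)


def marginal_effect_of_income(income):
    """Derivative of predicted expenditure with respect to income."""
    return B_INCOME + 2 * B_INCOME_SQ * income


# Illustrative income levels: the marginal effect grows with income.
for income in (10_000, 50_000, 100_000):
    print(income, round(marginal_effect_of_income(income), 4))
```

The kids coefficient shifts the whole curve down by the same amount at every income level, while the income effect has to be evaluated at a specific income.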

## Question 2

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import linear_model
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

import warnings
warnings.filterwarnings('ignore')

postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'weatherinszeged'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
weather_df = pd.read_sql_query('select * from weatherinszeged',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [2]:
weather_df.head(10)

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.
5,2006-04-01 03:00:00+00:00,Partly Cloudy,rain,9.222222,7.111111,0.85,13.9587,258.0,14.9569,0.0,1016.66,Partly cloudy throughout the day.
6,2006-04-01 04:00:00+00:00,Partly Cloudy,rain,7.733333,5.522222,0.95,12.3648,259.0,9.982,0.0,1016.72,Partly cloudy throughout the day.
7,2006-04-01 05:00:00+00:00,Partly Cloudy,rain,8.772222,6.527778,0.89,14.1519,260.0,9.982,0.0,1016.84,Partly cloudy throughout the day.
8,2006-04-01 06:00:00+00:00,Partly Cloudy,rain,10.822222,10.822222,0.82,11.3183,259.0,9.982,0.0,1017.37,Partly cloudy throughout the day.
9,2006-04-01 07:00:00+00:00,Partly Cloudy,rain,13.772222,13.772222,0.72,12.5258,279.0,9.982,0.0,1017.22,Partly cloudy throughout the day.


In [3]:
weather_df['temp_diff'] = weather_df['temperature'] - weather_df['apparenttemperature']

In [4]:
weather_df.head(10)

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary,temp_diff
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.,2.083333
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.,2.127778
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.,0.0
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.,2.344444
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.,1.777778
5,2006-04-01 03:00:00+00:00,Partly Cloudy,rain,9.222222,7.111111,0.85,13.9587,258.0,14.9569,0.0,1016.66,Partly cloudy throughout the day.,2.111111
6,2006-04-01 04:00:00+00:00,Partly Cloudy,rain,7.733333,5.522222,0.95,12.3648,259.0,9.982,0.0,1016.72,Partly cloudy throughout the day.,2.211111
7,2006-04-01 05:00:00+00:00,Partly Cloudy,rain,8.772222,6.527778,0.89,14.1519,260.0,9.982,0.0,1016.84,Partly cloudy throughout the day.,2.244444
8,2006-04-01 06:00:00+00:00,Partly Cloudy,rain,10.822222,10.822222,0.82,11.3183,259.0,9.982,0.0,1017.37,Partly cloudy throughout the day.,0.0
9,2006-04-01 07:00:00+00:00,Partly Cloudy,rain,13.772222,13.772222,0.72,12.5258,279.0,9.982,0.0,1017.22,Partly cloudy throughout the day.,0.0


In [5]:
# Y is the target variable
Y = weather_df['temp_diff']
# X is the feature set
X = weather_df[['humidity','windspeed']]

# We create a LinearRegression model object
# from scikit-learn's linear_model module.
lrm = linear_model.LinearRegression()

# fit method estimates the coefficients using OLS
lrm.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', lrm.coef_)
print('\nIntercept: \n', lrm.intercept_)


Coefficients: 
 [3.02918594 0.11929075]

Intercept: 
 -2.4381054151876933


In [6]:
X_two = sm.add_constant(X)

results = sm.OLS(Y, X_two).fit()

results.summary()

0,1,2,3
Dep. Variable:,temp_diff,R-squared:,0.288
Model:,OLS,Adj. R-squared:,0.288
Method:,Least Squares,F-statistic:,19490.0
Date:,"Tue, 11 Jun 2019",Prob (F-statistic):,0.0
Time:,13:57:22,Log-Likelihood:,-170460.0
No. Observations:,96453,AIC:,340900.0
Df Residuals:,96450,BIC:,340900.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-2.4381,0.021,-115.948,0.000,-2.479,-2.397
humidity,3.0292,0.024,126.479,0.000,2.982,3.076
windspeed,0.1193,0.001,176.164,0.000,0.118,0.121

0,1,2,3
Omnibus:,3935.747,Durbin-Watson:,0.267
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4613.311
Skew:,0.478,Prob(JB):,0.0
Kurtosis:,3.484,Cond. No.,88.1


The p-values of all coefficients are reported as 0.000 (that is, they round to zero at three decimal places), so each coefficient is statistically significant. I would have expected humidity to have a negative coefficient on my temp_diff variable as it is computed (actual temp - apparent temp), because in my general experience humidity makes the temperature feel hotter than it actually is. However, I live in Texas, which is much hotter than Szeged, and this helps explain why the humidity coefficient is positive here. Humidity does not always make the temperature feel hotter than it actually is; rather, it tends to make hot feel hotter and cold feel colder. Thus, with Szeged being a colder place, this positive relationship makes sense to me.
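
As a rough check on what these coefficients imply, the following sketch evaluates the fitted equation at a couple of humidity values. The helper name and the sample inputs are hypothetical; only the three coefficients are taken from the summary above. Note that humidity in this dataset is recorded on a 0 to 1 scale, so a full one-unit change spans its entire range.

```python
# Coefficients from the statsmodels summary above.
CONST = -2.4381
B_HUMIDITY = 3.0292
B_WINDSPEED = 0.1193


def predict_temp_diff(humidity, windspeed):
    """Predicted (actual - apparent) temperature from the fitted equation."""
    return CONST + B_HUMIDITY * humidity + B_WINDSPEED * windspeed


# Hold wind speed fixed and raise humidity by 0.10 (sample values are made up).
low = predict_temp_diff(0.50, 10.0)
high = predict_temp_diff(0.60, 10.0)
print(round(high - low, 4))  # roughly one tenth of the humidity coefficient
```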

In [7]:
weather_df['hum_wspeed'] = weather_df['humidity'] * weather_df['windspeed']

In [8]:
weather_df.head(10)

Unnamed: 0,date,summary,preciptype,temperature,apparenttemperature,humidity,windspeed,windbearing,visibility,loudcover,pressure,dailysummary,temp_diff,hum_wspeed
0,2006-03-31 22:00:00+00:00,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.,2.083333,12.566533
1,2006-03-31 23:00:00+00:00,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.,2.127778,12.267556
2,2006-04-01 00:00:00+00:00,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.,0.0,3.496276
3,2006-04-01 01:00:00+00:00,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.,2.344444,11.705988
4,2006-04-01 02:00:00+00:00,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.,1.777778,9.167018
5,2006-04-01 03:00:00+00:00,Partly Cloudy,rain,9.222222,7.111111,0.85,13.9587,258.0,14.9569,0.0,1016.66,Partly cloudy throughout the day.,2.111111,11.864895
6,2006-04-01 04:00:00+00:00,Partly Cloudy,rain,7.733333,5.522222,0.95,12.3648,259.0,9.982,0.0,1016.72,Partly cloudy throughout the day.,2.211111,11.74656
7,2006-04-01 05:00:00+00:00,Partly Cloudy,rain,8.772222,6.527778,0.89,14.1519,260.0,9.982,0.0,1016.84,Partly cloudy throughout the day.,2.244444,12.595191
8,2006-04-01 06:00:00+00:00,Partly Cloudy,rain,10.822222,10.822222,0.82,11.3183,259.0,9.982,0.0,1017.37,Partly cloudy throughout the day.,0.0,9.281006
9,2006-04-01 07:00:00+00:00,Partly Cloudy,rain,13.772222,13.772222,0.72,12.5258,279.0,9.982,0.0,1017.22,Partly cloudy throughout the day.,0.0,9.018576


In [9]:
# Y is the target variable
Y = weather_df['temp_diff']
# X is the feature set
X = weather_df[['humidity','windspeed', 'hum_wspeed']]

# We create a LinearRegression model object
# from scikit-learn's linear_model module.
lrm = linear_model.LinearRegression()

# fit method estimates the coefficients using OLS
lrm.fit(X, Y)

# Inspect the results.
print('\nCoefficients: \n', lrm.coef_)
print('\nIntercept: \n', lrm.intercept_)


Coefficients: 
 [-0.17751219 -0.09048213  0.29711946]

Intercept: 
 -0.08393631009782698


In [10]:
X_two = sm.add_constant(X)

results = sm.OLS(Y, X_two).fit()

results.summary()

0,1,2,3
Dep. Variable:,temp_diff,R-squared:,0.341
Model:,OLS,Adj. R-squared:,0.341
Method:,Least Squares,F-statistic:,16660.0
Date:,"Tue, 11 Jun 2019",Prob (F-statistic):,0.0
Time:,14:07:54,Log-Likelihood:,-166690.0
No. Observations:,96453,AIC:,333400.0
Df Residuals:,96449,BIC:,333400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0839,0.033,-2.511,0.012,-0.149,-0.018
humidity,-0.1775,0.043,-4.133,0.000,-0.262,-0.093
windspeed,-0.0905,0.002,-36.797,0.000,-0.095,-0.086
hum_wspeed,0.2971,0.003,88.470,0.000,0.291,0.304

0,1,2,3
Omnibus:,4849.937,Durbin-Watson:,0.265
Prob(Omnibus):,0.0,Jarque-Bera (JB):,9295.404
Skew:,0.378,Prob(JB):,0.0
Kurtosis:,4.32,Cond. No.,193.0


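The coefficients on humidity and wind speed have now become negative, while the coefficient on the interaction of these two features, named hum_wspeed, is positive. The individual coefficients are difficult to interpret in isolation, since the same input variables contribute to the values of multiple features. The effect of a one-unit change in humidity on temp_diff is now -0.1775 + 0.2971 × windspeed, and the effect of a one-unit change in wind speed is -0.0905 + 0.2971 × humidity, so each effect depends on the level of the other variable.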
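
The dependence of each effect on the other variable can be sketched directly from the fitted coefficients. The function names below are hypothetical, and the break-even points are simple algebra on the reported coefficients rather than anything statsmodels outputs.

```python
# Coefficients from the interaction model summary above.
B_HUMIDITY = -0.1775
B_WINDSPEED = -0.0905
B_INTERACTION = 0.2971


def humidity_effect(windspeed):
    """Change in temp_diff per unit of humidity at a given wind speed."""
    return B_HUMIDITY + B_INTERACTION * windspeed


def windspeed_effect(humidity):
    """Change in temp_diff per unit of wind speed at a given humidity."""
    return B_WINDSPEED + B_INTERACTION * humidity


# Levels at which each effect changes sign.
windspeed_break_even = -B_HUMIDITY / B_INTERACTION
humidity_break_even = -B_WINDSPEED / B_INTERACTION

print(round(windspeed_break_even, 3), round(humidity_break_even, 3))
```

Under this model the humidity effect is negative only at very low wind speeds, and the wind speed effect is negative only at low humidity, which is why the negative main-effect coefficients do not contradict the positive coefficients in the first model.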

## Question 3

In [11]:
postgres_user = 'dsbc_student'
postgres_pw = '7*.8G9QH21'
postgres_host = '142.93.121.174'
postgres_port = '5432'
postgres_db = 'houseprices'

engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
    postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))
hp_df = pd.read_sql_query('select * from houseprices',con=engine)

# no need for an open connection, as we're only doing a single query
engine.dispose()

In [12]:
# Y is the target variable
Y2 = hp_df['saleprice']
# X is the feature set, which includes
# the grlivarea and overallqual variables
X2 = hp_df[['grlivarea','overallqual']]

# We create a LinearRegression model object
# from scikit-learn's linear_model module.
lrm = linear_model.LinearRegression()

# fit method estimates the coefficients using OLS
lrm.fit(X2, Y2)

# Inspect the results.
print('\nCoefficients: \n', lrm.coef_)
print('\nIntercept: \n', lrm.intercept_)


Coefficients: 
 [   55.86222591 32849.04744063]

Intercept: 
 -104092.66963598129


In [13]:
X2_two = sm.add_constant(X2)

results = sm.OLS(Y2, X2_two).fit()

results.summary()

0,1,2,3
Dep. Variable:,saleprice,R-squared:,0.714
Model:,OLS,Adj. R-squared:,0.714
Method:,Least Squares,F-statistic:,1820.0
Date:,"Tue, 11 Jun 2019",Prob (F-statistic):,0.0
Time:,14:13:20,Log-Likelihood:,-17630.0
No. Observations:,1460,AIC:,35270.0
Df Residuals:,1457,BIC:,35280.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.041e+05,5045.372,-20.631,0.000,-1.14e+05,-9.42e+04
grlivarea,55.8622,2.630,21.242,0.000,50.704,61.021
overallqual,3.285e+04,999.198,32.875,0.000,3.09e+04,3.48e+04

0,1,2,3
Omnibus:,341.985,Durbin-Watson:,1.982
Prob(Omnibus):,0.0,Jarque-Bera (JB):,8725.15
Skew:,0.469,Prob(JB):,0.0
Kurtosis:,14.939,Cond. No.,7350.0


All coefficients have a reported p-value of 0.000, meaning they are statistically significant. Per unit change, overallqual has a much larger effect on saleprice than grlivarea, as evidenced by the size of the respective coefficients. However, this comparison is difficult to interpret because the two features are on very different scales: overallqual is an ordinal variable ranging from 1 to 10 that subjectively rates the quality of the house, while grlivarea is a living area measured in square feet. With only these results in front of us, it would be difficult to answer a question such as 'Is saleprice affected more by quality or by the size of the house?'
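
One common way to put features on a comparable footing is to rescale each coefficient by the ratio of the feature's standard deviation to the target's, giving a standardized (beta) coefficient. The sketch below shows only the arithmetic, on a tiny made-up sample; the helper name and the toy numbers are assumptions and are not drawn from the houseprices data.

```python
from statistics import stdev


def standardized_coefficient(coef, x_values, y_values):
    """Change in y, in standard deviations, per one-standard-deviation change in x."""
    return coef * stdev(x_values) / stdev(y_values)


# Toy, perfectly linear data (made up): y = 2 * x, so the raw slope is 2.
toy_x = [1, 2, 3]
toy_y = [2, 4, 6]

# A raw slope of 2 corresponds to a standardized coefficient of 1 here.
print(standardized_coefficient(2.0, toy_x, toy_y))
```

Applied to this notebook, the same helper could be called with each fitted coefficient and the matching `hp_df` column, for example `standardized_coefficient(55.8622, hp_df['grlivarea'], hp_df['saleprice'])`, and the resulting values compared across features.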