### Higher Order Terms
> A model with no higher order terms:
> y = b0 + b1x1 + b2x2

> A model with higher order terms:
> y = b0 + b1x1 + b2x1^2 + b3x2 + b4x1x2
>
> Here we have introduced a quadratic term (b2x1^2) and an interaction term (b4x1x2) into the model.

##### Higher order terms include quadratics, cubics, and interactions

#### When to include higher order terms in the model
> If the plot of the response against a predictor shows a single bend (e.g. a 'U' or an 'n' shaped curve), add a quadratic term.
>
> If the plot shows two bends (such as an 'N' shaped curve), add a cubic term.
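
The rule of thumb above can be checked numerically as well as visually. The following is a minimal sketch on synthetic data (the variables `x` and `y` and the helper `r_squared` are invented for illustration and are not part of the house price data): when the relationship has a single bend, a quadratic fit should explain far more variance than a straight line.

```python
import numpy as np

# Synthetic 'U'-shaped data (an assumption for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 2.0 + 0.5 * x + 1.5 * x**2 + rng.normal(scale=1.0, size=x.size)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Fit a straight line and a quadratic to the same points.
linear_pred = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_pred = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r_squared(y, linear_pred)
r2_quadratic = r_squared(y, quadratic_pred)
print(f"linear R^2: {r2_linear:.3f}, quadratic R^2: {r2_quadratic:.3f}")
```

A large jump in R-squared from the linear to the quadratic fit is consistent with a single bend in the plot; a small jump suggests the extra term is not earning its place.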


In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sb
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

%matplotlib inline

df = pd.read_csv('house_prices.csv')
df.head()
# To add a quadratic (second order) term, square the column
df['bedrooms_squared'] = df['bedrooms']**2
df

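| | house_id | neighborhood | area | bedrooms | bathrooms | style | price | bedrooms_squared |
|---|---|---|---|---|---|---|---|---|
| 0 | 1112 | B | 1188 | 3 | 2 | ranch | 598291 | 9 |
| 1 | 491 | B | 3512 | 5 | 3 | victorian | 1744259 | 25 |
| 2 | 5952 | B | 1134 | 3 | 2 | ranch | 571669 | 9 |
| 3 | 3525 | A | 1940 | 4 | 2 | ranch | 493675 | 16 |
| 4 | 5108 | B | 2208 | 6 | 4 | victorian | 1101539 | 36 |
| 5 | 7507 | C | 1785 | 4 | 2 | lodge | 455235 | 16 |
| 6 | 4964 | B | 2996 | 5 | 3 | victorian | 1489871 | 25 |
| 7 | 7627 | C | 3263 | 5 | 3 | victorian | 821931 | 25 |
| 8 | 6571 | A | 1159 | 3 | 2 | ranch | 299903 | 9 |
| 9 | 5220 | A | 1248 | 3 | 2 | victorian | 321975 | 9 |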


In [6]:
df['intercept'] = 1
lm = sm.OLS(df['price'], df[['intercept', 'bedrooms', 'bedrooms_squared']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.582 |
| Model: | OLS | Adj. R-squared: | 0.582 |
| Method: | Least Squares | F-statistic: | 4199.0 |
| Date: | Fri, 26 Apr 2019 | Prob (F-statistic): | 0.0 |
| Time: | 16:20:27 | Log-Likelihood: | -85302.0 |
| No. Observations: | 6028 | AIC: | 170600.0 |
| Df Residuals: | 6025 | BIC: | 170600.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 1.509e+05 | 1.58e+04 | 9.524 | 0.000 | 1.2e+05 | 1.82e+05 |
| bedrooms | 5.223e+04 | 8906.834 | 5.865 | 0.000 | 3.48e+04 | 6.97e+04 |
| bedrooms_squared | 2.446e+04 | 1184.538 | 20.647 | 0.000 | 2.21e+04 | 2.68e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 626.86 | Durbin-Watson: | 2.019 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 916.588 |
| Skew: | 0.793 | Prob(JB): | 9.23e-200 |
| Kurtosis: | 4.064 | Cond. No. | 87.8 |
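
With a squared term in the model, the `bedrooms` coefficient can no longer be read as "the price change per extra bedroom", because the change now depends on how many bedrooms the house already has. The sketch below plugs in the rounded coefficients from the quadratic summary above; the helper names `predicted_price` and `marginal_effect` are invented for illustration.

```python
# Coefficients copied (rounded, as reported) from the quadratic model summary.
b0, b1, b2 = 1.509e5, 5.223e4, 2.446e4

def predicted_price(bedrooms):
    """Fitted price from price = b0 + b1*bedrooms + b2*bedrooms^2."""
    return b0 + b1 * bedrooms + b2 * bedrooms**2

def marginal_effect(bedrooms):
    """Instantaneous slope d(price)/d(bedrooms) = b1 + 2*b2*bedrooms."""
    return b1 + 2 * b2 * bedrooms

# The predicted change for one more bedroom grows with the starting count.
change_3_to_4 = predicted_price(4) - predicted_price(3)
change_5_to_6 = predicted_price(6) - predicted_price(5)
print(change_3_to_4, change_5_to_6, marginal_effect(3))
```

Under this fit, going from 5 to 6 bedrooms is associated with a larger predicted increase than going from 3 to 4, which is exactly what the positive `bedrooms_squared` coefficient encodes.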


In [10]:
# To add a cubic (third order) term, cube the column
df['bedrooms_cubed'] = df['bedrooms']**3
df

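| | house_id | neighborhood | area | bedrooms | bathrooms | style | price | bedrooms_squared | bedrooms_cubed |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1112 | B | 1188 | 3 | 2 | ranch | 598291 | 9 | 27 |
| 1 | 491 | B | 3512 | 5 | 3 | victorian | 1744259 | 25 | 125 |
| 2 | 5952 | B | 1134 | 3 | 2 | ranch | 571669 | 9 | 27 |
| 3 | 3525 | A | 1940 | 4 | 2 | ranch | 493675 | 16 | 64 |
| 4 | 5108 | B | 2208 | 6 | 4 | victorian | 1101539 | 36 | 216 |
| 5 | 7507 | C | 1785 | 4 | 2 | lodge | 455235 | 16 | 64 |
| 6 | 4964 | B | 2996 | 5 | 3 | victorian | 1489871 | 25 | 125 |
| 7 | 7627 | C | 3263 | 5 | 3 | victorian | 821931 | 25 | 125 |
| 8 | 6571 | A | 1159 | 3 | 2 | ranch | 299903 | 9 | 27 |
| 9 | 5220 | A | 1248 | 3 | 2 | victorian | 321975 | 9 | 27 |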


In [12]:
df['intercept'] = 1
lm = sm.OLS(df['price'], df[['intercept', 'bedrooms', 'bedrooms_squared', 'bedrooms_cubed']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.588 |
| Model: | OLS | Adj. R-squared: | 0.588 |
| Method: | Least Squares | F-statistic: | 2867.0 |
| Date: | Fri, 26 Apr 2019 | Prob (F-statistic): | 0.0 |
| Time: | 16:26:06 | Log-Likelihood: | -85260.0 |
| No. Observations: | 6028 | AIC: | 170500.0 |
| Df Residuals: | 6024 | BIC: | 170600.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 2.492e+05 | 1.9e+04 | 13.134 | 0.000 | 2.12e+05 | 2.86e+05 |
| bedrooms | -1.015e+05 | 1.88e+04 | -5.400 | 0.000 | -1.38e+05 | -6.46e+04 |
| bedrooms_squared | 7.597e+04 | 5680.821 | 13.374 | 0.000 | 6.48e+04 | 8.71e+04 |
| bedrooms_cubed | -4674.8551 | 504.331 | -9.269 | 0.000 | -5663.525 | -3686.186 |

| | | | |
|---|---|---|---|
| Omnibus: | 670.934 | Durbin-Watson: | 2.015 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 996.039 |
| Skew: | 0.831 | Prob(JB): | 5.16e-217 |
| Kurtosis: | 4.097 | Cond. No. | 746.0 |


In [13]:
# To add an interaction term, multiply the two columns being interacted (here, area and bedrooms)
df['area_bed'] = df['area']*df['bedrooms']
df

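| | house_id | neighborhood | area | bedrooms | bathrooms | style | price | bedrooms_squared | bedrooms_cubed | intercept | area_bed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1112 | B | 1188 | 3 | 2 | ranch | 598291 | 9 | 27 | 1 | 3564 |
| 1 | 491 | B | 3512 | 5 | 3 | victorian | 1744259 | 25 | 125 | 1 | 17560 |
| 2 | 5952 | B | 1134 | 3 | 2 | ranch | 571669 | 9 | 27 | 1 | 3402 |
| 3 | 3525 | A | 1940 | 4 | 2 | ranch | 493675 | 16 | 64 | 1 | 7760 |
| 4 | 5108 | B | 2208 | 6 | 4 | victorian | 1101539 | 36 | 216 | 1 | 13248 |
| 5 | 7507 | C | 1785 | 4 | 2 | lodge | 455235 | 16 | 64 | 1 | 7140 |
| 6 | 4964 | B | 2996 | 5 | 3 | victorian | 1489871 | 25 | 125 | 1 | 14980 |
| 7 | 7627 | C | 3263 | 5 | 3 | victorian | 821931 | 25 | 125 | 1 | 16315 |
| 8 | 6571 | A | 1159 | 3 | 2 | ranch | 299903 | 9 | 27 | 1 | 3477 |
| 9 | 5220 | A | 1248 | 3 | 2 | victorian | 321975 | 9 | 27 | 1 | 3744 |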


In [14]:
lm = sm.OLS(df['price'], df[['intercept', 'bedrooms', 'bedrooms_squared', 'bedrooms_cubed', 'area', 'area_bed']])
results = lm.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.678 |
| Model: | OLS | Adj. R-squared: | 0.678 |
| Method: | Least Squares | F-statistic: | 2539.0 |
| Date: | Fri, 26 Apr 2019 | Prob (F-statistic): | 0.0 |
| Time: | 16:28:22 | Log-Likelihood: | -84515.0 |
| No. Observations: | 6028 | AIC: | 169000.0 |
| Df Residuals: | 6022 | BIC: | 169100.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| intercept | 1.235e+04 | 2.15e+04 | 0.574 | 0.566 | -2.99e+04 | 5.46e+04 |
| bedrooms | 7032.0414 | 1.69e+04 | 0.416 | 0.677 | -2.61e+04 | 4.02e+04 |
| bedrooms_squared | -3740.6271 | 5553.642 | -0.674 | 0.501 | -1.46e+04 | 7146.500 |
| bedrooms_cubed | 551.5375 | 567.892 | 0.971 | 0.331 | -561.733 | 1664.808 |
| area | 347.3424 | 22.891 | 15.173 | 0.000 | 302.467 | 392.218 |
| area_bed | -0.9926 | 4.431 | -0.224 | 0.823 | -9.678 | 7.693 |

| | | | |
|---|---|---|---|
| Omnibus: | 368.931 | Durbin-Watson: | 2.008 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 339.359 |
| Skew: | 0.52 | Prob(JB): | 2.04e-74 |
| Kurtosis: | 2.482 | Cond. No. | 92000.0 |


#### How do we know when to add an interaction term
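> y = b0 + b1x1 + b2x2 + b3x1x2, where the last term (b3x1x2) is the interaction.
>
> We add an interaction term when the effect of one variable on the response depends on the value of another, i.e. when the slopes differ. For example, if the slope of price against a predictor such as area in neighborhood A is not equal to the slope in neighborhood B, an interaction with neighborhood is needed. If the two slopes are equal (parallel lines), no interaction term is needed.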
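
To make the role of b3 concrete, here is a minimal sketch on made-up, noise-free data (the prices, the 0/1 indicator `is_b`, and the slopes 100 and 250 are all assumptions for illustration, not values from `house_prices.csv`). Two neighborhoods share an intercept but have different price-per-area slopes, and the fitted interaction coefficient recovers the difference between those slopes.

```python
import numpy as np

# Hypothetical data: four houses in neighborhood A, four in neighborhood B.
area = np.array([1000.0, 1500.0, 2000.0, 2500.0] * 2)
is_b = np.array([0.0] * 4 + [1.0] * 4)  # 1 if the house is in neighborhood B

# Assumed relationships: slope 100 in A, slope 250 in B, same intercept.
price = np.where(is_b == 1, 50_000 + 250 * area, 50_000 + 100 * area)

# Design matrix for y = b0 + b1*area + b2*is_b + b3*(area*is_b).
X = np.column_stack([np.ones_like(area), area, is_b, area * is_b])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
b0, b1, b2, b3 = coef

print(f"slope in A: {b1:.1f}")
print(f"slope in B: {b1 + b3:.1f}")
print(f"interaction (difference in slopes): {b3:.1f}")
```

If the two neighborhoods had the same slope, b3 would come out at zero and the interaction column could be dropped; a b3 clearly different from zero is the numerical counterpart of non-parallel lines in the plot.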