# Multiple Linear Regression in Statsmodels - Lab

## Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!

## Objectives
You will be able to:
* Determine if it is necessary to perform normalization/standardization for a specific model or set of data
* Use standardization/normalization on features of a dataset
* Identify if it is necessary to perform log transformations on a set of features
* Perform log transformations on different features of a dataset
* Use statsmodels to fit a multiple linear regression model
* Evaluate a linear regression model using statistical performance metrics for the overall model and for specific parameters


## The Ames Housing Data

Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
* Split off and one-hot encode the categorical features of interest
* Log transform and scale the selected continuous features

In [1]:
import pandas as pd
import numpy as np

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']


## Continuous Features

In [2]:
# Log transform each continuous feature, then mean-normalize it:
# (x - mean) / (max - min). Note that this overwrites the columns in ames.
cont = pd.DataFrame()
for feat in continuous:
    logged = np.log(ames[feat])
    ames[feat] = (logged - logged.mean()) / (logged.max() - logged.min())
    cont = pd.concat([cont, ames[feat]], axis=1)

display(cont)

,LotArea,1stFlrSF,GrLivArea,SalePrice
0,-0.013488,-0.096588,0.062428,0.072748
1,0.011485,0.050313,-0.045040,0.027636
2,0.042526,-0.069302,0.077811,0.095346
3,0.010463,-0.052802,0.063873,-0.056812
4,0.088929,0.013494,0.151238,0.131794
...,...,...,...,...
1455,-0.026240,-0.055965,0.049149,0.015773
1456,0.073441,0.238129,0.130526,0.075080
1457,-0.000235,0.027446,0.173384,0.152584
1458,0.013856,-0.009324,-0.100788,-0.051912


## Categorical Features

In [3]:
# One-hot encode the categoricals, dropping the first level of each as the reference
cats = pd.DataFrame()
for feat in categoricals:
    # prefix=feat + '_' plus the default separator '_' gives names like 'BldgType__Duplex'
    dummies = pd.get_dummies(ames[feat], prefix=feat + '_', drop_first=True)
    # Replace the original column in ames with its dummy columns
    ames = ames.drop([feat], axis=1)
    ames = pd.concat([ames, dummies], axis=1)
    cats = pd.concat([cats, dummies], axis=1)

In [4]:
cats

,BldgType__2fmCon,BldgType__Duplex,BldgType__Twnhs,BldgType__TwnhsE,KitchenQual__Fa,KitchenQual__Gd,KitchenQual__TA,SaleType__CWD,SaleType__Con,SaleType__ConLD,...,Neighborhood__NoRidge,Neighborhood__NridgHt,Neighborhood__OldTown,Neighborhood__SWISU,Neighborhood__Sawyer,Neighborhood__SawyerW,Neighborhood__Somerst,Neighborhood__StoneBr,Neighborhood__Timber,Neighborhood__Veenker
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1456,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1457,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1458,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
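
To see what `drop_first=True` does, here is a minimal sketch on a made-up four-row frame. The `toy` data is an assumption for illustration only and is not taken from `ames.csv`. The first category in sorted order gets no column of its own and acts as the reference level, which is why `KitchenQual__Ex` and `BldgType__1Fam` do not appear in the output above.

```python
import pandas as pd

# Hypothetical toy data, not taken from ames.csv
toy = pd.DataFrame({'KitchenQual': ['Gd', 'TA', 'Ex', 'Gd']})

# drop_first=True drops the first category in sorted order ('Ex'),
# which then acts as the reference level for the remaining dummies
dummies = pd.get_dummies(toy['KitchenQual'], prefix='KitchenQual_', drop_first=True)

print(list(dummies.columns))
print(dummies)
```

A row belonging to the reference level is encoded as all zeros across the remaining dummy columns.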


## Combine Categorical and Continuous Features

In [5]:
# combine features into a single dataframe called preprocessed
preprocessed = pd.DataFrame()
preprocessed = pd.concat([preprocessed, cont, cats], axis=1)
preprocessed

,LotArea,1stFlrSF,GrLivArea,SalePrice,BldgType__2fmCon,BldgType__Duplex,BldgType__Twnhs,BldgType__TwnhsE,KitchenQual__Fa,KitchenQual__Gd,...,Neighborhood__NoRidge,Neighborhood__NridgHt,Neighborhood__OldTown,Neighborhood__SWISU,Neighborhood__Sawyer,Neighborhood__SawyerW,Neighborhood__Somerst,Neighborhood__StoneBr,Neighborhood__Timber,Neighborhood__Veenker
0,-0.013488,-0.096588,0.062428,0.072748,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,0.011485,0.050313,-0.045040,0.027636,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0.042526,-0.069302,0.077811,0.095346,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0.010463,-0.052802,0.063873,-0.056812,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0.088929,0.013494,0.151238,0.131794,0,0,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,-0.026240,-0.055965,0.049149,0.015773,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1456,0.073441,0.238129,0.130526,0.075080,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1457,-0.000235,0.027446,0.173384,0.152584,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1458,0.013856,-0.009324,-0.100788,-0.051912,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [6]:
# The formula parser cannot handle a column name that starts with a digit, so rename 1stFlrSF
preprocessed.rename(columns = {'1stFlrSF':'FirstFlrSF'}, inplace = True)

## Run a linear model with SalePrice as the target variable in statsmodels

In [7]:
# Build a formula of the form 'SalePrice ~ col1 + col2 + ...' and fit an OLS model
import statsmodels.api as sm
from statsmodels.formula.api import ols

outcome = 'SalePrice'
predictors = preprocessed.drop(outcome, axis=1)
pred_sum = '+'.join(predictors.columns)
formula = outcome + '~' + pred_sum
model = ols(formula=formula, data=preprocessed).fit()
model.summary()

Dep. Variable:,SalePrice,R-squared:,0.839
Model:,OLS,Adj. R-squared:,0.834
Method:,Least Squares,F-statistic:,156.5
Date:,"Mon, 09 Aug 2021",Prob (F-statistic):,0.0
Time:,01:00:02,Log-Likelihood:,2241.3
No. Observations:,1460,AIC:,-4387.0
Df Residuals:,1412,BIC:,-4133.0
Df Model:,47,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-0.0171,0.034,-0.500,0.617,-0.084,0.050
LotArea,0.1325,0.024,5.475,0.000,0.085,0.180
FirstFlrSF,0.1482,0.017,8.584,0.000,0.114,0.182
GrLivArea,0.4150,0.017,24.114,0.000,0.381,0.449
BldgType__2fmCon,-0.0223,0.010,-2.173,0.030,-0.042,-0.002
BldgType__Duplex,-0.0546,0.008,-6.813,0.000,-0.070,-0.039
BldgType__Twnhs,-0.0182,0.012,-1.513,0.130,-0.042,0.005
BldgType__TwnhsE,-0.0067,0.008,-0.858,0.391,-0.022,0.009
KitchenQual__Fa,-0.1299,0.011,-11.315,0.000,-0.152,-0.107

Omnibus:,289.988,Durbin-Watson:,1.967
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1242.992
Skew:,-0.886,Prob(JB):,1.22e-270
Kurtosis:,7.159,Cond. No.,109.0
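
As a quick consistency check on the summary, the adjusted R-squared can be recomputed from the reported R-squared, the number of observations, and the model degrees of freedom with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The inputs below are the rounded values printed in the summary, so the result can only be expected to agree to about three decimal places.

```python
# Values read off the statsmodels summary (rounded as printed)
r_squared = 0.839
n_obs = 1460      # No. Observations
df_model = 47     # Df Model: number of predictors, excluding the intercept

# n_obs - df_model - 1 equals the reported Df Residuals (1412)
df_resid = n_obs - df_model - 1

adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid
print(round(adj_r_squared, 3))
```

The result rounds to the reported Adj. R-squared of 0.834. The small gap between the two metrics reflects the penalty for using 47 predictors on 1460 observations.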


## Run the same model in scikit-learn

In [8]:
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
from sklearn.linear_model import LinearRegression
y = preprocessed[outcome]
linreg = LinearRegression()
linreg.fit(predictors, y)
# coefficients
display(linreg.coef_)
# intercept
display(linreg.intercept_)


array([ 0.13249955,  0.14818669,  0.41495898, -0.02227905, -0.05461696,
       -0.0182353 , -0.00665295, -0.12991735, -0.04963823, -0.08695924,
        0.02968733,  0.07615244,  0.04094343,  0.0043001 ,  0.00209023,
        0.03896161,  0.01531191,  0.02271316,  0.13859388,  0.11392879,
        0.12942767,  0.143226  , -0.0276907 ,  0.00687785, -0.06012281,
       -0.08443505, -0.02731146, -0.00988712, -0.01069842, -0.09891562,
       -0.01273361, -0.12497633, -0.08989276, -0.03317536, -0.05725915,
       -0.00207253, -0.0347746 ,  0.04718371,  0.0471142 , -0.12149619,
       -0.09092412, -0.06177544, -0.03028763,  0.01234773,  0.05581651,
        0.00073964,  0.01658317])

-0.01711216936105247
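
The two libraries should produce the same fit, up to the rounding in the statsmodels summary. The sketch below checks this for the intercept and the first three coefficients, using only the numbers printed above. The variable names are illustrative and not part of the lab.

```python
import numpy as np

# First three coefficients and the intercept as printed by scikit-learn
sk_coefs = np.array([0.13249955, 0.14818669, 0.41495898])
sk_intercept = -0.01711216936105247

# The same quantities as printed (rounded) in the statsmodels summary
sm_coefs = np.array([0.1325, 0.1482, 0.4150])
sm_intercept = -0.0171

# The summary rounds to four decimals, so allow half a unit in the last place
same_coefs = np.allclose(sk_coefs, sm_coefs, atol=5e-5)
same_intercept = np.isclose(sk_intercept, sm_intercept, atol=5e-5)
print(same_coefs, same_intercept)
```

With both fitted objects still in memory, a fuller check along the lines of `np.allclose(linreg.coef_, model.params.drop('Intercept'))` should also work, assuming the column order of `predictors` matches the order of the terms in the formula.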

## Predict the house price given the following characteristics (before manipulation!!)

Make sure to transform your variables as needed!

- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt

In [10]:
vect = [14977, 1976, 1976, 0,0,0,0 ,0,1,0, 0,0,0,0,0,1,0, 0,0,0,1,0, 1, 
       0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]
'''Intercept	-0.0171	0.034	-0.500	0.617	-0.084	0.050
LotArea	0.1325	0.024	5.475	0.000	0.085	0.180
FirstFlrSF	0.1482	0.017	8.584	0.000	0.114	0.182
GrLivArea	0.4150	0.017	24.114	0.000	0.381	0.449
BldgType__2fmCon	-0.0223	0.010	-2.173	0.030	-0.042	-0.002
BldgType__Duplex	-0.0546	0.008	-6.813	0.000	-0.070	-0.039
BldgType__Twnhs	-0.0182	0.012	-1.513	0.130	-0.042	0.005
BldgType__TwnhsE	-0.0067	0.008	-0.858	0.391	-0.022	0.009
KitchenQual__Fa	-0.1299	0.011	-11.315	0.000	-0.152	-0.107
KitchenQual__Gd	-0.0496	0.007	-7.613	0.000	-0.062	-0.037
KitchenQual__TA	-0.0870	0.007	-12.111	0.000	-0.101	-0.073
SaleType__CWD	0.0297	0.028	1.061	0.289	-0.025	0.085
SaleType__Con	0.0762	0.040	1.927	0.054	-0.001	0.154
SaleType__ConLD	0.0409	0.020	2.029	0.043	0.001	0.081
SaleType__ConLI	0.0043	0.025	0.169	0.865	-0.045	0.054
SaleType__ConLw	0.0021	0.025	0.082	0.935	-0.048	0.052
SaleType__New	0.0390	0.010	3.803	0.000	0.019	0.059
SaleType__Oth	0.0153	0.032	0.480	0.631	-0.047	0.078
SaleType__WD	0.0227	0.008	2.676	0.008	0.006	0.039
MSZoning__FV	0.1386	0.025	5.526	0.000	0.089	0.188
MSZoning__RH	0.1139	0.025	4.512	0.000	0.064	0.163
MSZoning__RL	0.1294	0.021	6.151	0.000	0.088	0.171
MSZoning__RM	0.1432	0.020	7.264	0.000	0.105	0.182
Street__Pave	-0.0277	0.023	-1.182	0.237	-0.074	0.018
Neighborhood__Blueste	0.0069	0.041	0.167	0.868	-0.074	0.088
Neighborhood__BrDale	-0.0601	0.022	-2.711	0.007	-0.104	-0.017
Neighborhood__BrkSide	-0.0844	0.018	-4.735	0.000	-0.119	-0.049
Neighborhood__ClearCr	-0.0273	0.019	-1.456	0.146	-0.064	0.009
Neighborhood__CollgCr	-0.0099	0.015	-0.641	0.522	-0.040	0.020
Neighborhood__Crawfor	-0.0107	0.017	-0.638	0.523	-0.044	0.022
Neighborhood__Edwards	-0.0989	0.016	-6.143	0.000	-0.131	-0.067
Neighborhood__Gilbert	-0.0127	0.016	-0.777	0.437	-0.045	0.019
Neighborhood__IDOTRR	-0.1250	0.021	-6.014	0.000	-0.166	-0.084
Neighborhood__MeadowV	-0.0899	0.021	-4.351	0.000	-0.130	-0.049
Neighborhood__Mitchel	-0.0332	0.017	-1.944	0.052	-0.067	0.000
Neighborhood__NAmes	-0.0573	0.016	-3.664	0.000	-0.088	-0.027
Neighborhood__NPkVill	-0.0021	0.023	-0.092	0.927	-0.046	0.042
Neighborhood__NWAmes	-0.0348	0.016	-2.122	0.034	-0.067	-0.003
Neighborhood__NoRidge	0.0472	0.017	2.737	0.006	0.013	0.081
Neighborhood__NridgHt	0.0471	0.016	3.029	0.002	0.017	0.078
Neighborhood__OldTown	-0.1215	0.018	-6.686	0.000	-0.157	-0.086
Neighborhood__SWISU	-0.0909	0.019	-4.845	0.000	-0.128	-0.054
Neighborhood__Sawyer	-0.0618	0.017	-3.727	0.000	-0.094	-0.029
Neighborhood__SawyerW	-0.0303	0.016	-1.860	0.063	-0.062	0.002
Neighborhood__Somerst	0.0123	0.019	0.658	0.511	-0.024	0.049
Neighborhood__StoneBr	0.0558	0.017	3.232	0.001	0.022	0.090
Neighborhood__Timber	0.0007	0.017	0.042	0.966	-0.033	0.035
Neighborhood__Veenker'''

answer = linreg.intercept_ + np.dot(linreg.coef_ , vect)
answer
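# Not a valid price: vect holds the raw LotArea, 1stFlrSF and GrLivArea values, but the
# model was fit on log-transformed, mean-normalized features and a transformed SalePrice.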

3097.3427159042526
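
A prediction in dollars needs the raw continuous inputs to go through the same log and mean-normalization steps as the training columns, and the model output to be mapped back through the inverse of the SalePrice transformation. The sketch below shows only that round trip on made-up numbers. `fit_log_scaler`, `to_scaled`, and `from_scaled` are hypothetical helpers, not part of the lab, and in practice the mean and range would have to be computed from the raw `ames.csv` columns before they are overwritten.

```python
import numpy as np

def fit_log_scaler(values):
    """Return the mean and range of log(values), as used for mean normalization."""
    logged = np.log(np.asarray(values, dtype=float))
    return logged.mean(), logged.max() - logged.min()

def to_scaled(raw, mean, value_range):
    """Raw value -> log -> mean-normalized value."""
    return (np.log(raw) - mean) / value_range

def from_scaled(scaled, mean, value_range):
    """Mean-normalized log value -> raw units (inverse of to_scaled)."""
    return np.exp(scaled * value_range + mean)

# Made-up training values, for illustration only
lot_area_train = [5000, 8000, 10000, 20000]
mean, value_range = fit_log_scaler(lot_area_train)

# Transform a new raw input, then confirm the inverse recovers it
scaled = to_scaled(14977, mean, value_range)
recovered = from_scaled(scaled, mean, value_range)
print(scaled, recovered)
```

Under these assumptions, each of the three continuous entries in `vect` would be replaced by its `to_scaled` value, and the model output would be passed through `from_scaled` with the SalePrice statistics to get a price in dollars.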

## Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, scaling, and one-hot encoding. You also fitted your first multiple linear regression model on the Ames Housing data using both statsmodels and scikit-learn!