# Fitting a Multiple Log-Linear Regression Model

The log-linear regression model you developed earlier explained about 24% of the variability in the log-transformed crime rate per capita. Your task now is to develop a multiple log-linear regression model that explains 80% or more of the variability in the transformed dependent variable. Use independent variables from the Boston Housing dataset whose correlation coefficient with the dependent variable is 0.4 or more.
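
For reference, a multiple log-linear model of this kind can be written as follows, where $y$ is the crime rate per capita and $x_1, \dots, x_p$ are the chosen predictors. The symbols are notation introduced here for illustration, not taken from the exercise text:

$$
\ln(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
$$

The 80% target refers to the $R^2$ of this regression on $\ln(y)$, not on $y$ itself.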

You are also encouraged to include second-order (pairwise) interactions between these variables in your model. In addition, produce graphs and statistics showing that your model satisfies the assumptions of linear regression.
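
As a rough illustration of what second-order interactions mean in practice, the sketch below lists the terms that a formula of the form `(a + b + c + d) ** 2` expands to in the formula syntax used by `statsmodels`: every main effect plus every pairwise interaction. The helper name `second_order_terms` is hypothetical, and the four predictor names are the ones used in the model cell later in this section.

```python
from itertools import combinations


def second_order_terms(predictors):
    """Return main effects plus all pairwise interaction terms (hypothetical helper)."""
    main_effects = list(predictors)
    # Pairwise interactions are written "a:b" in formula notation.
    interactions = [f"{a}:{b}" for a, b in combinations(predictors, 2)]
    return main_effects + interactions


predictors = [
    "pctLowerStatus",
    "radialHighwaysAccess",
    "medianValue_Ks",
    "nitrixOxide_pp10m",
]
terms = second_order_terms(predictors)
for term in terms:
    print(term)
print(len(terms))
```

With four predictors this gives 4 main effects and 6 pairwise interactions, 10 terms in total, which is consistent with the `Df Model: 10` entry in the regression summary.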

In [8]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

In [9]:
rawBostonData = pd.read_csv('https://raw.githubusercontent.com/PacktWorkshops/The-Data-Science-Workshop/master/Chapter02/Dataset/Boston.csv')

In [10]:
# Drop rows with missing values, then remove exact duplicate rows.
rawBostonData = rawBostonData.dropna()
rawBostonData = rawBostonData.drop_duplicates()
renamedBostonData = rawBostonData.rename(columns = {'CRIM':'crimeRatePerCapita',
 ' ZN ':'landOver25K_sqft',
 'INDUS ':'non-retailLandProptn',
 'CHAS':'riverDummy',
 'NOX':'nitrixOxide_pp10m',
 'RM':'AvgNo.RoomsPerDwelling',
 'AGE':'ProptnOwnerOccupied',
 'DIS':'weightedDist',
 'RAD':'radialHighwaysAccess',
 'TAX':'propTaxRate_per10K',
 'PTRATIO':'pupilTeacherRatio',
 'LSTAT':'pctLowerStatus',
 'MEDV':'medianValue_Ks'})
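# Separate the predictors (X) from the dependent variable (y).
X = renamedBostonData.drop('crimeRatePerCapita', axis = 1)
y = renamedBostonData[['crimeRatePerCapita']]
# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible.
seed = 10
test_data_size = 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_data_size, random_state = seed)
# Recombine predictors and target so the formula API can reference all columns by name.
train_data = pd.concat([X_train, y_train], axis = 1)
test_data = pd.concat([X_test, y_test], axis = 1)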
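
The size of the resulting split can be checked with simple arithmetic. The sketch below assumes the cleaned dataset has 506 rows (the usual size of the Boston Housing data; the row count is not printed in this notebook) and that `train_test_split` rounds the test-set size up when `test_size` is a fraction, which is believed to match scikit-learn's behaviour.

```python
import math

n_rows = 506          # assumption: row count after dropna() and drop_duplicates()
test_data_size = 0.3  # same fraction as in the cell above

n_test = math.ceil(test_data_size * n_rows)  # test size rounded up
n_train = n_rows - n_test                    # the remainder goes to training

print(n_train, n_test)
```

Under these assumptions the training set has 354 rows, which agrees with the `No. Observations: 354` line in the regression summary.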

In [11]:
train_data.corr()

| | landOver25K_sqft | non-retailLandProptn | riverDummy | nitrixOxide_pp10m | AvgNo.RoomsPerDwelling | ProptnOwnerOccupied | weightedDist | radialHighwaysAccess | propTaxRate_per10K | pupilTeacherRatio | pctLowerStatus | medianValue_Ks | crimeRatePerCapita |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| landOver25K_sqft | 1.0 | -0.540095 | -0.059189 | -0.520305 | 0.355346 | -0.577457 | 0.65934 | -0.31192 | -0.324172 | -0.424612 | -0.435827 | 0.422574 | -0.198455 |
| non-retailLandProptn | -0.540095 | 1.0 | 0.065271 | 0.758178 | -0.399166 | 0.667887 | -0.728968 | 0.580813 | 0.702973 | 0.398513 | 0.607457 | -0.508338 | 0.387471 |
| riverDummy | -0.059189 | 0.065271 | 1.0 | 0.091469 | 0.107996 | 0.106329 | -0.098551 | 0.022731 | -0.007864 | -0.094255 | -0.04111 | 0.136831 | -0.044587 |
| nitrixOxide_pp10m | -0.520305 | 0.758178 | 0.091469 | 1.0 | -0.30651 | 0.742016 | -0.776311 | 0.606721 | 0.662164 | 0.206809 | 0.603656 | -0.453424 | 0.405813 |
| AvgNo.RoomsPerDwelling | 0.355346 | -0.399166 | 0.107996 | -0.30651 | 1.0 | -0.263085 | 0.215439 | -0.183 | -0.280341 | -0.350828 | -0.586573 | 0.666761 | -0.167258 |
| ProptnOwnerOccupied | -0.577457 | 0.667887 | 0.106329 | 0.742016 | -0.263085 | 1.0 | -0.751059 | 0.458717 | 0.515376 | 0.289976 | 0.639881 | -0.419062 | 0.35573 |
| weightedDist | 0.65934 | -0.728968 | -0.098551 | -0.776311 | 0.215439 | -0.751059 | 1.0 | -0.494932 | -0.543333 | -0.25914 | -0.52212 | 0.289658 | -0.378997 |
| radialHighwaysAccess | -0.31192 | 0.580813 | 0.022731 | 0.606721 | -0.183 | 0.458717 | -0.494932 | 1.0 | 0.908578 | 0.46229 | 0.456592 | -0.383132 | 0.608838 |
| propTaxRate_per10K | -0.324172 | 0.702973 | -0.007864 | 0.662164 | -0.280341 | 0.515376 | -0.543333 | 0.908578 | 1.0 | 0.462556 | 0.528029 | -0.478903 | 0.565035 |
| pupilTeacherRatio | -0.424612 | 0.398513 | -0.094255 | 0.206809 | -0.350828 | 0.289976 | -0.25914 | 0.46229 | 0.462556 | 1.0 | 0.374842 | -0.503692 | 0.27653 |

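One way to apply the 0.4 threshold is to read it off the `crimeRatePerCapita` column of the correlation matrix programmatically. The sketch below uses a small synthetic DataFrame with made-up column names (`x_strong`, `x_weak`, `target`) rather than the Boston data, and it assumes the threshold is meant in absolute value so that strongly negative correlations also qualify. Neither choice comes from the exercise text.

```python
import numpy as np
import pandas as pd

# Synthetic data (illustrative only, not the Boston dataset).
rng = np.random.default_rng(0)
n = 500
target = rng.normal(size=n)
demo = pd.DataFrame({
    "x_strong": -target + 0.5 * rng.normal(size=n),  # strongly negatively related
    "x_weak": rng.normal(size=n),                     # unrelated noise
    "target": target,
})

# Correlation of every column with the target, excluding the target itself.
corr_with_target = demo.corr()["target"].drop("target")

# Keep predictors whose absolute correlation meets the threshold
# (assumption: the 0.4 cut-off applies to |r|).
threshold = 0.4
selected = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
print(selected)
```

On the notebook's data, the same pattern would start from `train_data.corr()['crimeRatePerCapita']`.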

In [12]:
multiLogLinearModel = smf.ols(
    formula='np.log(crimeRatePerCapita) ~ (pctLowerStatus + radialHighwaysAccess + medianValue_Ks + nitrixOxide_pp10m) ** 2',
    data=train_data
)
multiLogLinearModResult = multiLogLinearModel.fit()
print(multiLogLinearModResult.summary())

OLS Regression Results                                
Dep. Variable:     np.log(crimeRatePerCapita)   R-squared:                       0.884
Model:                                    OLS   Adj. R-squared:                  0.881
Method:                         Least Squares   F-statistic:                     261.5
Date:                        Mon, 13 Jul 2020   Prob (F-statistic):          7.79e-154
Time:                                15:57:55   Log-Likelihood:                -394.39
No. Observations:                         354   AIC:                             810.8
Df Residuals:                             343   BIC:                             853.3
Df Model:                                  10                                         
Covariance Type:                    nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------