# Chapter 6: Multiple Linear Regression - Solution

> (c) 2019 Galit Shmueli, Peter C. Bruce, Peter Gedeck 
>
> _Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python_ (First Edition) 
> Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel. 2019.

In [3]:
# !pip install dmba



In [4]:
import dmba

no display found. Using non-interactive Agg backend




In [5]:
# import required functionality for this chapter



import matplotlib as mlt
%matplotlib inline
from pathlib import Path

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, BayesianRidge
from sklearn.preprocessing import StandardScaler
import statsmodels.formula.api as sm
import matplotlib.pylab as plt
# matplotlib sometimes defaults to a backend that needs an X display;
# force the non-interactive Agg backend instead.
plt.switch_backend('agg') 
mlt.use('agg')
import seaborn as sns

from dmba import regressionSummary, exhaustive_search
from dmba import backward_elimination, forward_selection, stepwise_selection
from dmba import adjusted_r2_score, AIC_score, BIC_score

In [12]:
# working directory
# We assume the data are kept in a `data` folder next to the notebook. If you keep
# your data elsewhere, change the path passed to `pd.read_csv`.
# Note: DATA holds the loaded DataFrame, not a directory path.
DATA = pd.read_csv("./data/BostonHousing.csv")

# Problem 6.1 Predicting Boston Housing Prices

The file _BostonHousing.csv_ contains information collected by the US Bureau of the Census concerning housing in the area of
Boston, Massachusetts. The dataset includes information on 506 census housing tracts in the Boston area. The goal is to predict the median house price in new tracts based on information such as crime rate, pollution, and number of rooms. The dataset contains 13 predictors, and the outcome variable is the median house price (MEDV). Table 6.11 describes each of the predictors and the outcome variable.

![TABLE_6.11](TABLE6.11.PNG)


In [13]:
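y = DATA['MEDV']
# Note: X keeps every other column, including CAT. MEDV, which is derived from MEDV
# and should therefore not be used as a predictor of MEDV.
X = DATA.drop(columns=['MEDV'])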


In [16]:
len(X.columns)

13

In [17]:
# Partition the data: 80% training, 20% validation (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



__6.1.a.__ Why should the data be partitioned into training and validation sets? What will the training set be used for? What will the validation set be used for?

__Answer:__ 

In [None]:
# Your answer here
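
One way to see why a held-out partition matters is to compare the error on the records used for fitting with the error on records the model has not seen. The sketch below is illustrative only: it uses synthetic data (not the Boston file) and a deliberately extreme setup in which the model has as many parameters as training records, so the training fit is perfect. The names `X_demo`, `y_demo`, and `rmse` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): only the first
# predictor carries signal, the other 28 are pure noise.
rng = np.random.default_rng(0)
n_records, n_predictors = 60, 29
X_demo = rng.normal(size=(n_records, n_predictors))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n_records)

# 30 training records, 30 validation records
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# 29 slopes + 1 intercept = 30 parameters for 30 training records,
# so the model can reproduce the training outcomes exactly.
demo_model = LinearRegression().fit(X_tr, y_tr)

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

train_rmse = rmse(y_tr, demo_model.predict(X_tr))
valid_rmse = rmse(y_va, demo_model.predict(X_va))
print(f'training RMSE:   {train_rmse:.4f}')
print(f'validation RMSE: {valid_rmse:.4f}')
```

The training error here says nothing about how the model behaves on new tracts; only the validation error does.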

__6.1.b.__ Fit a multiple linear regression model to the median house price (MEDV) as a function of CRIM, CHAS, and RM. Write the equation for predicting the median house price from the predictors in the model.

__Answer:__


In [19]:
# Fit MEDV on CRIM, CHAS, and RM using the training partition
predict_col = ['CRIM', 'CHAS', 'RM']
lr = LinearRegression()

model = lr.fit(X_train[predict_col], y_train)
# Predictions for the held-out partition
result = model.predict(X_test[predict_col])
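
The prediction equation can be assembled from a fitted model's `intercept_` and `coef_` attributes. The following is a minimal sketch on synthetic, noise-free data (the coefficients -30, -0.25, 4.0, and 8.5 are made up for illustration and are not the Boston estimates); `demo` and `demo_model` are hypothetical names. Applying the same last three lines to `model` and `predict_col` from the cell above would give the equation for the actual data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the training data (NOT the Boston values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'CRIM': rng.uniform(0, 10, 200),
    'CHAS': rng.integers(0, 2, 200),
    'RM': rng.uniform(4, 8, 200),
})
# A known, noise-free linear relationship so the recovered
# coefficients can be checked against the inputs.
demo['MEDV'] = -30 - 0.25 * demo['CRIM'] + 4.0 * demo['CHAS'] + 8.5 * demo['RM']

predictors = ['CRIM', 'CHAS', 'RM']
demo_model = LinearRegression().fit(demo[predictors], demo['MEDV'])

# Assemble "MEDV = b0 + b1*CRIM + ..." from the fitted parameters.
terms = [f'{coef:+.2f}*{name}'
         for name, coef in zip(predictors, demo_model.coef_)]
equation = f'MEDV = {demo_model.intercept_:.2f} ' + ' '.join(terms)
print(equation)
```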

__6.1.c.__ Using the estimated regression model, what median house price is predicted for a tract in the Boston area that does not bound the Charles River, has a crime rate of 0.1, and where the average number of rooms per house is 6?

__Answer:__ 

In [None]:
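# New tract: CRIM = 0.1, CHAS = 0 (does not bound the Charles River), RM = 6
# Passing a DataFrame with the training column names keeps the feature order explicit.
new_tract = pd.DataFrame([[0.1, 0, 6]], columns=predict_col)
model.predict(new_tract)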

__6.1.d.i.__ Reduce the number of predictors:
Which predictors are likely to be measuring the same thing among the 13 predictors? Discuss the relationships among INDUS, NOX, and TAX.

__Answer:__ 

In [23]:
# Your answer here
X_train.corr()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | CAT. MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.0 | -0.198855 | 0.400198 | -0.044589 | 0.396406 | -0.200303 | 0.33409 | -0.366487 | 0.615947 | 0.576894 | 0.28897 | 0.414142 | -0.156959 |
| ZN | -0.198855 | 1.0 | -0.533489 | -0.043754 | -0.526414 | 0.274661 | -0.575078 | 0.681817 | -0.31379 | -0.294267 | -0.389163 | -0.396572 | 0.333134 |
| INDUS | 0.400198 | -0.533489 | 1.0 | 0.095158 | 0.770957 | -0.39869 | 0.636569 | -0.707566 | 0.588952 | 0.702353 | 0.348303 | 0.603644 | -0.368196 |
| CHAS | -0.044589 | -0.043754 | 0.095158 | 1.0 | 0.135476 | 0.111272 | 0.096016 | -0.121671 | 0.028685 | 0.007746 | -0.113003 | -0.070652 | 0.118629 |
| NOX | 0.396406 | -0.526414 | 0.770957 | 0.135476 | 1.0 | -0.299615 | 0.720417 | -0.77233 | 0.589061 | 0.650247 | 0.161253 | 0.593862 | -0.232602 |
| RM | -0.200303 | 0.274661 | -0.39869 | 0.111272 | -0.299615 | 1.0 | -0.210863 | 0.198299 | -0.199738 | -0.281127 | -0.342643 | -0.612577 | 0.661326 |
| AGE | 0.33409 | -0.575078 | 0.636569 | 0.096016 | 0.720417 | -0.210863 | 1.0 | -0.756589 | 0.430321 | 0.47167 | 0.240841 | 0.571051 | -0.16166 |
| DIS | -0.366487 | 0.681817 | -0.707566 | -0.121671 | -0.77233 | 0.198299 | -0.756589 | 1.0 | -0.483329 | -0.523577 | -0.217588 | -0.494921 | 0.109359 |
| RAD | 0.615947 | -0.31379 | 0.588952 | 0.028685 | 0.589061 | -0.199738 | 0.430321 | -0.483329 | 1.0 | 0.912527 | 0.472257 | 0.480301 | -0.204728 |
| TAX | 0.576894 | -0.294267 | 0.702353 | 0.007746 | 0.650247 | -0.281127 | 0.47167 | -0.523577 | 0.912527 | 1.0 | 0.444836 | 0.530632 | -0.271656 |


__6.1.d.ii.__ Compute the correlation table for the 11 numerical predictors and search for highly correlated pairs. These have potential redundancy and can cause multicollinearity. Choose which ones to remove based on this table.

__Answer:__

In [None]:
# Your answer here

__6.1.d.iii.__ Use stepwise regression with the three options (backward, forward, both) to reduce the remaining predictors as follows: Run stepwise on the training set. Choose the top model from each stepwise run. Then use each of these models separately to predict the validation set. Compare RMSE, MAPE, and mean error, as well as lift charts. Finally, describe the best model.

__Answer:__

In [None]:
# Your answer here
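
The workflow described above (select variables on the training set, then score the chosen model on the validation set) can be sketched without the data file. The example below is a hand-written backward elimination driven by AIC on synthetic data; it is not the `dmba` implementation of `backward_elimination`, and the helper `aic`, the column names, and the 60/40 split are assumptions for illustration. Lift charts are not covered here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: x1 and x2 drive the outcome, n1..n3 are pure noise.
rng = np.random.default_rng(1)
n = 400
data = pd.DataFrame(rng.normal(size=(n, 5)),
                    columns=['x1', 'x2', 'n1', 'n2', 'n3'])
data['y'] = 20 + 3.0 * data['x1'] - 2.0 * data['x2'] + rng.normal(size=n)

train, valid = train_test_split(data, test_size=0.4, random_state=1)

def aic(variables):
    """AIC (up to an additive constant) of an OLS fit on the training set."""
    m = LinearRegression().fit(train[variables], train['y'])
    sse = np.sum((train['y'] - m.predict(train[variables])) ** 2)
    return len(train) * np.log(sse / len(train)) + 2 * (len(variables) + 1)

# Backward elimination: start with all predictors and repeatedly drop
# the one whose removal gives the lowest AIC, until no removal helps.
selected = ['x1', 'x2', 'n1', 'n2', 'n3']
while len(selected) > 1:
    current = aic(selected)
    candidates = {v: aic([s for s in selected if s != v]) for v in selected}
    best = min(candidates, key=candidates.get)
    if candidates[best] >= current:
        break
    selected.remove(best)

# Score the selected model on the validation partition only.
final = LinearRegression().fit(train[selected], train['y'])
errors = valid['y'] - final.predict(valid[selected])  # actual - predicted
rmse = float(np.sqrt(np.mean(errors ** 2)))
mean_error = float(np.mean(errors))
mape = float(100 * np.mean(np.abs(errors / valid['y'])))

print('selected predictors:', selected)
print(f'RMSE: {rmse:.3f}  mean error: {mean_error:.3f}  MAPE: {mape:.2f}%')
```

Forward and stepwise ("both") selection follow the same pattern with a different search direction; comparing the three then comes down to computing these validation metrics for each selected model.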