# MATH 3350 Course Notes - Module S9 

## Multiple Linear Regression

Simple linear regression uses a single predictor $x$ (independent variable) to predict an outcome $y$ (dependent variable). The "true" relationship $y=\alpha + \beta x$ is estimated by the model $\widehat{y} = a + bx$.

This idea can be extended to use multiple predictors, using a method known as **_multiple regression_**. 


In this case, we are trying to model a potential "true" relationship:
<center>
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$
</center>

where $x_1, x_2, \ldots, x_k$ are the predictors. Note that $\beta_0$ is the intercept, and each remaining $\beta_i$ is analogous to a 'slope' (one for each predictor).

We **estimate** this relationship with our regression model:

<center>
$\widehat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$
</center>
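
To make the link between the $\beta_i$ and the $b_i$ concrete, here is a minimal sketch using simulated data (not the housing data). The "true" coefficients, the sample size, and the noise level are all hypothetical values chosen for illustration; random noise is added so that the fitted $b_i$ only approximate the $\beta_i$.

```r
# Simulated data (NOT the housing data): we choose the "true" coefficients ourselves
set.seed(3350)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 5)
beta <- c(2, 3, -1.5)  # hypothetical beta_0, beta_1, beta_2

# "True" relationship plus random noise
y <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n, sd = 1)

sim_data  <- data.frame(y, x1, x2)
sim_model <- lm(y ~ x1 + x2, data = sim_data)

# b_0, b_1, b_2: the estimates of beta_0, beta_1, beta_2
b <- coef(sim_model)
print(round(b, 3))
```

The estimates should land near the chosen values, but not exactly on them, because the model only sees a noisy sample.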


We will investigate several models using a real estate data set with information about several homes.

In [None]:
#Look at data set
houses <- read.csv("HousingBrief.csv")
head(houses)

We want to predict the sales price of a house using some of the other variables. A scatterplot can only show the relationship between price and one other variable at a time, as shown below.

In [None]:
plot(price2014~bedrooms, data=houses)

In [None]:
plot(price2014~squarefeet, data=houses)

### Simple Models (1 predictor)

In [None]:
hmodel_1 <- lm(price2014 ~ bedrooms, data=houses)
summary(hmodel_1)

In [None]:
hmodel_2 <- lm(price2014 ~ squarefeet, data=houses)
summary(hmodel_2)


### Multiple Regression Models 

In [None]:
hmodel_3 <- lm(price2014 ~ bedrooms + squarefeet, data=houses)
summary(hmodel_3)

#### Interpreting the Model

**_Holding square footage constant_**, number of bedrooms does not play a significant role in predicting the sales price of the house.  

The only significant predictor in this model is _squarefeet_.  

The model accounts for $\sim 63.6$% of the variability in home price (using adjusted $R^2$). 
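
The quantities used in this interpretation (coefficient p-values and adjusted $R^2$) can also be pulled out of the model summary directly. The sketch below uses R's built-in `mtcars` data as a stand-in, so the model and variable names are assumptions and not part of the housing analysis. It also recomputes adjusted $R^2$ by hand from $1 - (1 - R^2)\frac{n-1}{n-k-1}$, where $k$ is the number of predictors.

```r
# mtcars ships with R; it stands in for the housing data here
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)           # number of observations
k <- length(coef(fit)) - 1  # number of predictors (excluding the intercept)

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)
print(c(r2 = s$r.squared, adj_manual = adj_r2_manual, adj_summary = s$adj.r.squared))

# p-value for each coefficient (last column of the coefficient table)
print(s$coefficients[, "Pr(>|t|)"])
```

The same extraction should work on any `lm` fit, for example `summary(hmodel_3)$adj.r.squared`.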


### Models with Other Predictors

In [None]:
hmodel_4 <- lm(price2014 ~ bikescore, data=houses)
summary(hmodel_4)

In [None]:
hmodel_5 <- lm(price2014 ~ walkscore, data=houses)
summary(hmodel_5)

In [None]:
hmodel_6 <- lm(price2014 ~ bikescore+squarefeet, data=houses)
summary(hmodel_6)

In [None]:
hmodel_7 <- lm(price2014 ~ walkscore+squarefeet, data=houses)
summary(hmodel_7)

In [None]:
hmodel_8 <- lm(price2014 ~ bedrooms+total_rooms, data=houses)
summary(hmodel_8)

In [None]:
hmodel_9 <- lm(price2014 ~ bedrooms+full_baths+half_baths+total_rooms, data=houses)
summary(hmodel_9)

In [None]:
hmodel_10 <- lm(price2014 ~ bedrooms+full_baths+half_baths+total_rooms+squarefeet, data=houses)
summary(hmodel_10)

#### Interpreting Coefficients

The coefficient of $30.22$ for the _full_baths_ variable can be interpreted as follows:  

**_Holding all other predictors constant_**, for every additional full bath, the predicted sales price increases by approximately \$30.2K, on average.

The coefficient of $0.11886$ for _squarefeet_ can be scaled to $\frac{11.886}{100}$ and interpreted as follows:

**_Holding all other predictors constant_**, for every additional 100 square feet, the predicted sales price increases by approximately \$11.9K, on average.
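
The "holding all other predictors constant" reading can be checked numerically with `predict()`. This is a sketch on R's built-in `mtcars` data, used as a stand-in; the two rows in `two_cars` are hypothetical inputs, not observations from any data set.

```r
# mtcars ships with R; it stands in for the housing data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only in hp (by 1 unit); wt is held constant
two_cars <- data.frame(wt = c(3, 3), hp = c(150, 151))
pred <- predict(fit, newdata = two_cars)

# The change in the prediction equals the hp coefficient
diff_1  <- unname(pred[2] - pred[1])
coef_hp <- unname(coef(fit)["hp"])
print(c(difference = diff_1, coefficient = coef_hp))

# Rescaling: a 100-unit change in hp corresponds to 100 * coefficient
print(100 * coef_hp)
```

Rescaling a coefficient, as done for _squarefeet_, changes only the unit of the interpretation and leaves the model itself unchanged.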


### Evaluating the Model

The same conditions that we checked in simple linear regression must also be met in multiple regression:
1. **Linearity**: The relationship between the predictors and dependent variable is linear
2. **Normality**: The residuals ($y_i - \widehat{y}_i$ for each observation) are approximately normally distributed
3. **Homoscedasticity**: The variance of residuals remains the same, regardless of the value being predicted
4. **Independence**: All observations in the sample are independent of each other

We can still test these assumptions by examining diagnostic plots that R generates. Most of what we need to verify can be checked in the **_residual plot_** and the **_normal quantile plot_**.  We will restrict our diagnostics below to those two plots.



In [None]:
#Display R's default diagnostic plots for our linear model
plot(hmodel_10, which=c(1,2))

#### Observations from the Diagnostic Plots

1. The first plot (residual plot) shows that residuals for lower predicted home prices tend to have smaller magnitude than residuals for higher predicted prices.  This suggests that our homoscedasticity condition is not fully met.
2. The second plot (normal quantile plot) indicates that residuals may not meet the assumption of being normally distributed, especially in the uppermost quantile, and particularly for house 97. 

The conditions are not perfectly met, but the violations are relatively minor (not extreme). There are occasions where an analyst would choose to use this model anyway, but conclusions drawn from the model should be treated with caution.
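
The plots can be supplemented with a few numeric summaries. The sketch below is one possible approach and not part of the original analysis: it uses R's built-in `mtcars` data as a stand-in, and both the Shapiro-Wilk test and the split at the median fitted value are assumptions chosen for illustration.

```r
# mtcars ships with R; it stands in for the housing data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- resid(fit)   # residuals: y_i minus predicted y_i
fv  <- fitted(fit)  # predicted values

# Shapiro-Wilk test: a small p-value is evidence against normally distributed residuals
sw <- shapiro.test(res)
print(sw$p.value)

# Rough homoscedasticity check: residual spread for lower vs. higher fitted values
low  <- res[fv <= median(fv)]
high <- res[fv >  median(fv)]
print(c(sd_low = sd(low), sd_high = sd(high)))

# Observation with the largest standardized residual (in absolute value)
std_res <- rstandard(fit)
print(names(which.max(abs(std_res))))
```

These numbers are best read alongside the diagnostic plots, not in place of them, since a formal test can be overly sensitive in large samples and weak in small ones.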
