# DATA 5600: Introduction to Regression and Machine Learning for Analytics

## __Koop Chapter 06: Multiple Regression__

<br>

Author:  Tyler J. Brough <br>
Updated: November 15, 2021 <br>

---

<br>

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

plt.rcParams['figure.figsize'] = [10, 8]

---

## __Introduction__

<br>

These notes are based upon Chapter 6: Multiple Regression from the book _Analysis of Economic Data 4th Edition_ by Gary Koop.

<br>

The main objectives of chapters 4 & 5 were the following: 

* development of graphical intuition for regression techniques as the fitting of a straight line through an $XY$-plot

* introduction of the regression coefficient as measuring a marginal effect

* description of the OLS estimate as a best fitting line (minimizing the SSR) through an $XY$-plot

* introduction of $R^{2}$ as a measure of fit of a regression model

* understanding regression as a probability model

<br>

The objectives of multiple regression are no different from those listed above. We now extend them to the case of several regressors (also called explanatory or right-hand side variables).

<br>


### __Explaining House Prices__

<br>

* Applied microeconometric models often try to explain the price of a good in terms of several of its characteristics

* We will use the `HPRICE.XLS` dataset

* $N = 546$ houses sold in Windsor, Canada

* The LHS variable $Y$ is the selling price of the house

* In the simple regression context $X =$ lot size

* In the multiple regression context we use the following four RHS variables:

    1. $X_{1} =$ the lot size of the property (square feet)

    2. $X_{2} =$ the number of bedrooms

    3. $X_{3} =$ the number of bathrooms

    4. $X_{4} =$ the number of storeys (excluding basement)

<br>

---

### Exercise 6.1

* (a) Create $XY$-plots using the four explanatory variables in the house pricing example one at a time

* (b) Perform simple regressions using the explanatory variables one at a time (i.e. regress $Y$ on $X_{1}$, then $Y$ on $X_{2}$, and so on)

* (c) Comment on the relationships you find in parts (a) and (b)
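
As a starting point for part (b), here is a minimal sketch of a simple regression helper. The function name `simple_ols` and the synthetic stand-in data are assumptions made for illustration only; they are not part of the notebook and do not use the `HPRICE.XLS` values.

```python
import numpy as np

def simple_ols(x, y):
    """Regress y on one regressor x (with intercept).

    Returns (alpha_hat, beta_hat, r_squared).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope: sample covariance of x and y divided by sample variance of x
    beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # Intercept: the fitted line passes through the point of means
    alpha_hat = y.mean() - beta_hat * x.mean()
    resid = y - alpha_hat - beta_hat * x
    # R-squared = 1 - SSR / TSS
    r_squared = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return alpha_hat, beta_hat, r_squared

# Synthetic stand-in data (hypothetical, NOT the house price data)
rng = np.random.default_rng(0)
n = 200
regressors = {
    "x1": rng.uniform(2000, 10000, n),
    "x2": rng.integers(1, 6, n).astype(float),
}
y_demo = 10000 + 5.0 * regressors["x1"] + 3000 * regressors["x2"] + rng.normal(0, 5000, n)

# One simple regression per explanatory variable
simple_fits = {name: simple_ols(x, y_demo) for name, x in regressors.items()}
for name, (a, b, r2) in simple_fits.items():
    print(f"{name}: alpha_hat={a:.1f}, beta_hat={b:.3f}, R2={r2:.3f}")
```

With the notebook's DataFrame loaded, the same helper could presumably be called as `simple_ols(df['lot size'], df['sale price'])`, and likewise for the other three columns.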



In [2]:
df = pd.read_excel("HPRICE.XLS")
df.head()

Unnamed: 0,sale price,lot size,#bedroom,#bath,#stories,driveway,rec room,basement,gas,air cond,#garage,desire loc
0,42000,5850,3,1,2,1,0,1,0,0,1,0
1,38500,4000,2,1,1,1,0,0,0,0,0,0
2,49500,3060,3,1,1,1,0,0,0,0,0,0
3,60500,6650,3,1,2,1,1,0,0,0,0,0
4,61000,6360,2,1,1,1,0,0,0,0,0,0


In [3]:
df.tail()

Unnamed: 0,sale price,lot size,#bedroom,#bath,#stories,driveway,rec room,basement,gas,air cond,#garage,desire loc
541,91500,4800,3,2,4,1,1,0,0,1,0,0
542,94000,6000,3,2,4,1,0,0,0,1,0,0
543,103000,6000,3,2,4,1,1,0,0,1,1,0
544,105000,6000,3,2,2,1,1,0,0,1,1,0
545,105000,6000,3,1,2,1,0,0,0,1,1,0


In [None]:
### (a)


In [None]:
### (b)


#### (c) Comments...


<br>

---


<br>

### __Regression as a Best Fitting Line__

<br>

### __OLS Estimation of the Multiple Regression Model__

<br>

The multiple regression model with $k$ explanatory variables is written as:

<br>

$$
Y = \alpha + \beta_{1}X_{1} + \beta_{2}X_{2} + \ldots + \beta_{k}X_{k} + \epsilon
$$

<br>

We now have $k + 1$ coefficients to estimate: $\theta = \{\alpha, \beta_{1}, \beta_{2}, \ldots, \beta_{k}\}$.

We proceed as before by seeking parameter values that minimize the sum of squared residuals:

<br>

$$
SSR = \sum_{i=1}^{N} (Y_{i} - \hat{\alpha} - \hat{\beta}_{1}X_{1i} - \ldots - \hat{\beta}_{k} X_{ki})^{2}
$$

<br>

* where $X_{1i}$ is the $i$th observation of the first RHS variable for $i = 1, \ldots, N$ observations.

* the other RHS variables are likewise defined

* The OLS estimates (still interpreted as a best-fitting line) are found by choosing the values $\hat{\alpha}$ and $\hat{\beta}_{1}$, $\hat{\beta}_{2}$, $\ldots$, $\hat{\beta}_{k}$ that minimize the SSR

* Python libraries such as `statsmodels` calculate these for us
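
To make the "minimize the SSR" idea concrete, the following sketch computes the OLS estimates on synthetic data by solving the normal equations, and then checks that moving one coefficient away from its estimate increases the SSR. The data, the names `X_demo`, `y_demo`, `A`, and `ssr`, and the use of the normal equations are all assumptions for illustration; this is not how `statsmodels` computes its estimates internally.

```python
import numpy as np

# Synthetic data (hypothetical): N observations, k = 3 regressors
rng = np.random.default_rng(42)
N, k = 100, 3
X_demo = rng.normal(size=(N, k))
true_alpha, true_beta = 1.0, np.array([2.0, -1.0, 0.5])
y_demo = true_alpha + X_demo @ true_beta + rng.normal(scale=0.5, size=N)

# Design matrix: a leading column of ones carries the intercept alpha
A = np.column_stack([np.ones(N), X_demo])

def ssr(theta):
    """Sum of squared residuals for theta = (alpha, beta_1, ..., beta_k)."""
    resid = y_demo - A @ theta
    return float(resid @ resid)

# Normal equations: (A'A) theta_hat = A'y
theta_hat = np.linalg.solve(A.T @ A, A.T @ y_demo)

ssr_at_ols = ssr(theta_hat)
# Nudge beta_1 by 0.1 while leaving the other coefficients alone
ssr_perturbed = ssr(theta_hat + np.array([0.0, 0.1, 0.0, 0.0]))

print("OLS estimates:", theta_hat)
print("SSR at OLS estimates:", ssr_at_ols)
print("SSR after perturbing beta_1:", ssr_perturbed)
```

Any perturbation of `theta_hat` raises the SSR here, which is what it means for the OLS estimates to be the minimizers.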

<br>

### __Multiple Regression as a Probability Model__

<br>

* Multiple regression as a probability model is basically the same as simple regression

* $R^{2}$ is still a measure of goodness of fit and is calculated the same way

* Note: it should now be interpreted as a measure of the explanatory power of all the RHS variables together

* We can continue to test each $\beta$ coefficient against the standard null hypothesis: $H_{0}: \quad \beta_{j} = 0$

* We will also introduce the $F$-statistic as a way to test the hypothesis $R^{2} = 0$ for the model as a whole (equivalently, $\beta_{1} = \ldots = \beta_{k} = 0$)

* If we reject $R^{2} = 0$ (i.e. we find that $R^{2} \ne 0$) then we conclude that the explanatory variables in the regression, taken together, help explain the dependent variable

* If we cannot reject $R^{2} = 0$ then we conclude that the explanatory variables are not jointly significant and do not provide explanatory power for the dependent variable

* The general formulas for confidence intervals and hypothesis tests on the regression coefficients are the same as in Chapter 5 for simple regression

* It is important to remember that in this frequentist approach, the multiple regression as a 

### __Interpreting OLS Estimates__

<br>

* The interpretation of the coefficients does introduce subtle differences between simple and multiple regression

* When we speak about a property that holds for any of the coefficients we will write $\beta_{j}$

* When we wish to speak about a specific coefficient we will indicate it with an exact index (e.g. $\beta_{1}$ for $j=1$)

* In simple regression we mentioned that $\beta$ could be interpreted as a marginal effect (i.e. a measure of the effect that a change in $X$ has on $Y$)

* In multiple regression $\beta_{j}$ can still be interpreted as a marginal effect

* In particular, $\beta_{j}$ is the marginal effect of $X_{j}$ on $Y$ when all other explanatory variables are held constant
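
The "held constant" qualifier matters most when the explanatory variables are correlated with each other. The sketch below uses synthetic data (all names and numbers are assumptions for illustration) in which $X_{2}$ moves with $X_{1}$. The multiple regression coefficient on $X_{1}$ recovers its direct effect, while the simple regression slope also absorbs part of the effect of $X_{2}$.

```python
import numpy as np

# Synthetic data (hypothetical): x2 is correlated with x1
rng = np.random.default_rng(7)
N = 2000
x1 = rng.normal(size=N)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=N)
y_demo = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=N)

# Multiple regression: y on a constant, x1 and x2
A_multi = np.column_stack([np.ones(N), x1, x2])
beta_multi = np.linalg.lstsq(A_multi, y_demo, rcond=None)[0]

# Simple regression: y on a constant and x1 only
A_simple = np.column_stack([np.ones(N), x1])
beta_simple = np.linalg.lstsq(A_simple, y_demo, rcond=None)[0]

print("coefficient on x1, multiple regression:", beta_multi[1])
print("coefficient on x1, simple regression:  ", beta_simple[1])

# Marginal effect: raise x1 by one unit while holding x2 fixed
point = np.array([1.0, 0.3, -0.2])            # (constant, x1, x2)
point_plus = point + np.array([0.0, 1.0, 0.0])
change_in_fit = point_plus @ beta_multi - point @ beta_multi
print("change in fitted y:", change_in_fit)
```

The change in the fitted value equals the multiple regression coefficient on `x1`, which is exactly the "all else held constant" reading of $\beta_{j}$. The simple regression slope is larger because a unit change in `x1` is typically accompanied by a change in `x2` in this data.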

<br>

We will look at the house price data next...

<br>

In [4]:
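# Dependent variable: the selling price of the house
y = df['sale price']
# Explanatory variables: lot size, bedrooms, bathrooms, storeys
X = df[['lot size', '#bedroom', '#bath', '#stories']]
# Add a column of ones so that the model includes an intercept
X = sm.add_constant(X)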

In [5]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:             sale price   R-squared:                       0.536
Model:                            OLS   Adj. R-squared:                  0.532
Method:                 Least Squares   F-statistic:                     156.0
Date:                Mon, 15 Nov 2021   Prob (F-statistic):           1.18e-88
Time:                        10:37:56   Log-Likelihood:                -6130.0
No. Observations:                 546   AIC:                         1.227e+04
Df Residuals:                     541   BIC:                         1.229e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -4009.5500   3603.109     -1.113      0.2
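
The $F$-statistic in the summary can be approximately reproduced from the reported $R^{2}$ using the standard formula $F = \dfrac{R^{2}/k}{(1 - R^{2})/(N - k - 1)}$. The sketch below plugs in the rounded values from the output above; because $R^{2}$ is rounded to three decimals, the result should be close to, but not exactly, the reported 156.0.

```python
# Values read from the regression summary above
r_squared = 0.536   # R-squared (rounded)
n_obs = 546         # No. Observations
k = 4               # Df Model (number of explanatory variables)

# F = (R^2 / k) / ((1 - R^2) / (N - k - 1))
f_stat = (r_squared / k) / ((1.0 - r_squared) / (n_obs - k - 1))
print(f"F-statistic from rounded R-squared: {f_stat:.1f}")
```

The denominator degrees of freedom, $N - k - 1 = 541$, match the `Df Residuals` entry in the summary.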

---

#### __Footnote: OLS Estimation as an Optimization Routine__

<br>

One way to understand the process of fitting a regression is as a numerical minimization problem. 

<br>

We can formulate least squares regression in the following way: 

<br>

$$
\arg\min_{(\alpha, \beta)} \left(\frac{1}{2} || y - (\alpha + X\beta)||^{2}\right)
$$

<br>

__NB:__ thinking about the OLS parameter estimates in this way helps us see the mathematical/numerical aspect of OLS regression. Understanding this procedure as an ___estimator___ links the numerical perspective to the probability perspective and helps us understand regression as a ___probability model___.

<br>

__NB:__ For the meaning of the $||\cdot||$ (norm) notation, see: https://mathworld.wolfram.com/Norm.html
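
<br>

To illustrate the numerical-minimization view, the sketch below minimizes the least-squares objective $\frac{1}{2}||y - (\alpha + X\beta)||^{2}$ by plain gradient descent on synthetic data and compares the result with `np.linalg.lstsq`. Gradient descent is used here only as an illustrative optimizer and is an assumption of this example; `lstsq` itself relies on a matrix factorization rather than an iterative descent. The data and variable names are hypothetical.

```python
import numpy as np

# Synthetic data (hypothetical)
rng = np.random.default_rng(1)
N = 200
X_demo = rng.normal(size=(N, 2))
y_demo = 0.5 + X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.3, size=N)

# Stack a column of ones so theta = (alpha, beta_1, beta_2)
A = np.column_stack([np.ones(N), X_demo])

def objective(theta):
    """Half the squared Euclidean norm of the residual vector."""
    r = y_demo - A @ theta
    return 0.5 * float(r @ r)

def gradient(theta):
    """Gradient of the objective with respect to theta."""
    return -A.T @ (y_demo - A @ theta)

# A safe step size: the reciprocal of the largest eigenvalue of A'A
step = 1.0 / np.linalg.eigvalsh(A.T @ A).max()

theta = np.zeros(A.shape[1])
for _ in range(2000):
    theta = theta - step * gradient(theta)

theta_lstsq = np.linalg.lstsq(A, y_demo, rcond=None)[0]
print("gradient descent:", theta)
print("np.linalg.lstsq: ", theta_lstsq)
```

Both routes arrive at the same parameter values, which is the sense in which OLS estimation "is" an optimization routine.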

<br>

In [11]:
np.linalg.lstsq?

Signature: np.linalg.lstsq(a, b, rcond='warn')
Docstring:
Return the least-squares solution to a linear matrix equation.

Computes the vector x that approximatively solves the equation
``a @ x = b``. The equation may be under-, well-, or over-determined
(i.e., the number of linearly independent rows of `a` can be less than,
equal to, or greater than its number of linearly independent columns).
If `a` is square and of full rank, then `x` (but for round-off error)
is the "exact" solution of the equation. Else, `x` minimizes the
Euclidean 2-norm :math:`|| b - a x ||`.

Parameters
----------
a : (M, N) array_like
    "Coefficient" matrix.
b : {(M,), (M, K)} array_like
    Ordinate or "dependent variable" values. If `b` is two-dimensional,
    the least-squares solution is calculated for each of the `K` column

In [8]:
beta_hat, resid, rank, svals = np.linalg.lstsq(X, y, rcond=None)

In [9]:
beta_hat

array([-4.00954998e+03,  5.42917370e+00,  2.82461379e+03,  1.71051745e+04,
        7.63489700e+03])