# Chapter 15. Multiple Regression

```python
from __future__ import division
from collections import Counter
from functools import partial
from linear_algebra import dot, vector_add
from statistics import median, standard_deviation
from probability import normal_cdf
from gradient_descent import minimize_stochastic
from simple_linear_regression import total_sum_of_squares
import math, random
```

The VP is impressed by your simple regression model, but you know you can do better.  
You start by collecting more data: for each user you record how many hours they work each day and whether they have a PhD.  
You can use this additional data to improve your model.  
Accordingly, you hypothesize a linear model with more independent variables:  

$\normalsize \text{minutes} = \alpha + \beta_1 \text{friends} + \beta_2 \text{work hours} + \beta_3 \text{PhD} + \epsilon$  

For the PhD category we can use a dummy variable (see Chapter 11) that equals 1 for users *with* a PhD and 0 for users *without* a PhD.

## The Model

Recall that in Chapter 14 we fit a model of the form:  

$\Large y_i = \alpha + \beta x_i + \epsilon_i$  

Now imagine that each input $\normalsize x_i$ is not a single number, but is instead a vector of $\normalsize k$ numbers $\normalsize \;{x_i}_1, {x_i}_2, \ldots, {x_i}_k$.  
The multiple regression model assumes that:  

$\Large y_i = \alpha + \beta_1{x_i}_1 + \ldots + \beta_k{x_i}_k + \epsilon_i$

In multiple regression the vector of parameters is usually called $\normalsize \beta$.  
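We'll want this to include the constant term as well, which we can achieve by adding a column of ones to our data:

$\normalsize \beta = [\alpha, \beta_1, \ldots, \beta_k]$  

and:

$\normalsize x_i = [1, {x_i}_1, \ldots, {x_i}_k]$  

Then our model is: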

```python
def predict(x_i, beta):
    """ assumes that the first element of each x_i is 1 """
    return dot(x_i, beta)
```

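In this particular case, our independent variable `x` will be a list of vectors, each of which holds four numbers in order: the constant 1, the user's number of friends, the user's work hours per day, and the PhD dummy variable (0 or 1).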
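A minimal sketch of how such vectors might be built and fed to `predict`. The rows and coefficients below are hypothetical values chosen for illustration, not fitted results, and `dot` is redefined locally as a stand-in for the one imported from `linear_algebra`:

```python
def dot(v, w):
    """Sum of componentwise products (stand-in for linear_algebra.dot)."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def predict(x_i, beta):
    """Assumes that the first element of x_i is 1."""
    return dot(x_i, beta)

# Hypothetical raw rows: (number of friends, work hours per day, has a PhD?)
raw_rows = [(49, 4, False), (10, 8, True)]

# Prepend the constant 1 and encode the PhD flag as a 0/1 dummy variable.
x = [[1, friends, hours, 1 if phd else 0]
     for friends, hours, phd in raw_rows]

# Hypothetical coefficients [alpha, beta_1, beta_2, beta_3].
beta = [30.0, 1.0, -2.0, 1.0]

predictions = [predict(x_i, beta) for x_i in x]
```

Because the leading 1 multiplies `alpha`, the constant term needs no special handling: every prediction is a single dot product.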

## Further Assumptions of the Least Squares Model

There are two further assumptions that are required for this model, as well as our solution, to work.

### Assumption the First

The columns of $x$ are [linearly independent](https://en.wikipedia.org/wiki/Linear_independence), meaning that there is no way to write any one as a weighted sum of some of the others.  
If this assumption fails, there is no reliable way to estimate `beta`.  
To illustrate this in an extreme case, imagine that we have an extra field `num_acquaintances` in our data that, for every user, was exactly equal to `num_friends`.  
Then, starting with `beta`, if we add *any* amount to the `num_friends` coefficient and subtract the same amount from the `num_acquaintances` coefficient, the model's predictions will remain unchanged.  
This means that there is no way to find *the* coefficient for `num_friends`.  
Usually violations of this assumption won't be so obvious.
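The `num_acquaintances` thought experiment can be checked numerically. This is a sketch with hypothetical data and coefficients, and with a local `dot` standing in for the one from `linear_algebra`:

```python
def dot(v, w):
    """Sum of componentwise products (stand-in for linear_algebra.dot)."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# Hypothetical rows: [1, num_friends, num_acquaintances],
# where the last two columns are identical for every user.
x = [[1, 10, 10], [1, 25, 25], [1, 49, 49]]

beta = [20.0, 0.9, 0.0]              # hypothetical coefficients
shifted = [20.0, 0.9 + 5, 0.0 - 5]   # add 5 to one coefficient, subtract 5 from the other

preds = [dot(x_i, beta) for x_i in x]
preds_shifted = [dot(x_i, shifted) for x_i in x]

# The two coefficient vectors differ, yet the predictions agree
# (up to floating-point rounding).
same_predictions = all(abs(p - q) < 1e-9
                       for p, q in zip(preds, preds_shifted))
```

Since both coefficient vectors fit the data equally well, no error-minimizing procedure can prefer one over the other.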

### Assumption the Second

The columns of $x$ are all uncorrelated with the errors $\normalsize \epsilon$.  
If this fails to be the case, our estimates of `beta` will be systematically wrong.  
For example, in Chapter 14, we built a model that predicted that each additional friend was associated with an extra 0.90 daily minutes on the site.  
Imagine that it's also the case that:

- people who work more hours spend less time on the site.
- people with more friends tend to work more hours.

In math terms, imagine that the "actual" model is:  

$\large \text{minutes} = \alpha + \beta_1 \text{friends} + \beta_2 \text{work hours} + \epsilon$  

and that work hours and friends are positively correlated.  
In that case, when we minimize the errors of the single-variable model:  

$\large \text{minutes} = \alpha + \beta_1 \text{friends} + \epsilon$  

we will underestimate $\beta_1$.

Think about what would happen if we made predictions using the single-variable model with the "actual" value of $\beta_1$ (the value that arises from minimizing the errors of what we called the "actual" model).  
The predictions would tend to be much too large for users who work many hours and only a little too large for users who work few hours, because $\beta_2 < 0$ and we failed to include it.  
Because work hours is positively correlated with number of friends, this means that the predictions tend to be much too large for users with many friends and only slightly too large for users with few friends.  
The result of this is that we can reduce the errors (in the single-variable model) by decreasing our estimate of $\beta_1$, which means that the error-minimizing $\beta_1$ is smaller than the "actual" value.  
That is, in this case the single-variable least-squares solution is biased to underestimate $\beta_1$.  
And, in general, whenever the independent variables are correlated with the errors like this, our least squares solution will give us a biased estimate of $\beta$.
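A small simulation can illustrate this bias. It is a sketch under stated assumptions: the "actual" coefficients, the noise levels, and the way work hours depend on friends are all hypothetical, and the single-variable slope is computed with the usual covariance-over-variance formula rather than with the chapter's own helpers:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "actual" coefficients; BETA_2 < 0 because more work hours
# means less time on the site.
ALPHA, BETA_1, BETA_2 = 20.0, 1.0, -2.0

friends, minutes = [], []
for _ in range(1000):
    f = random.uniform(0, 50)
    # Work hours rise with friends, so the two are positively correlated.
    hours = 4 + 0.1 * f + random.gauss(0, 1)
    friends.append(f)
    minutes.append(ALPHA + BETA_1 * f + BETA_2 * hours + random.gauss(0, 1))

def mean(xs):
    return sum(xs) / float(len(xs))

def least_squares_slope(xs, ys):
    """Slope of the single-variable least squares fit of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return covariance / variance

# Fit minutes on friends alone, leaving work hours out of the model.
estimated_beta_1 = least_squares_slope(friends, minutes)
```

In this setup each extra friend brings 0.1 extra work hours on average, so the single-variable slope should come out near `BETA_1 + BETA_2 * 0.1`, roughly 0.8, which is below the "actual" value of 1.0.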

## Fitting the Model