# Chapter 15. Multiple Regression

In [24]:
from __future__ import division
from collections import Counter
from functools import partial
from linear_algebra import dot, vector_add
from statistics import median, standard_deviation
from probability import normal_cdf
from gradient_descent import minimize_stochastic
from simple_linear_regression import total_sum_of_squares
import math, random

The VP is impressed by your simple regression model, but you know you can do better.  
You start by collecting more data: for each user you get data on how many hours he works each day and whether he has a PhD.  
You can use this additional data to improve your model.  
Accordingly, you hypothesize a linear model with more independent variables:  

$\normalsize \text{minutes} = \alpha + \beta_1 \text{friends} + \beta_2 \text{work hours} + \beta_3 \text{PhD} + \epsilon$  

For the PhD category we can use a dummy variable (see Chapter 11) that equals 1 for users *with* a PhD and 0 for users *without* a PhD.

## The Model

Recall that in Chapter 14 we fit a model of the form:  

$\Large y_i = \alpha + \beta x_i + \epsilon_i$  

Now imagine that each input $\normalsize x_i$ is not a single number, but is instead a vector of $\normalsize k$ numbers $\normalsize \;{x_i}_1, {x_i}_2, \ldots, {x_i}_k$.  
The multiple regression model assumes that:  

$\Large y_i = \alpha + \beta_1{x_i}_1 + \ldots + \beta_k{x_i}_k + \epsilon_i$

In multiple regression the vector of parameters is usually called $\normalsize \beta$.  
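We'll want this to include the constant term as well, which we can achieve by adding a column of ones to our data:

$\Large \beta = [\alpha, \beta_1, \ldots, \beta_k]$

and:

$\Large x_i = [1, {x_i}_1, \ldots, {x_i}_k]$

Then our model is just: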

In [25]:
def predict(x_i, beta):
    """ assumes that the first element of each x_i is 1 """
    return dot(x_i, beta)
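As a minimal sketch of the "column of ones" step, assume the raw data arrives as rows of `[friends, work_hours, phd]`. The helper `add_constant_column`, the stand-in `dot`, and the coefficient values below are hypothetical and only for illustration:

```python
def dot(v, w):
    """sum of componentwise products (stand-in for linear_algebra.dot)"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def add_constant_column(rows):
    """prepend a 1 to every row so that beta[0] acts as the constant term"""
    return [[1] + list(row) for row in rows]

raw_rows = [[49, 4, 0], [41, 9, 0]]        # friends, work hours, PhD
x_rows = add_constant_column(raw_rows)     # [[1, 49, 4, 0], [1, 41, 9, 0]]

beta_example = [10, 1, -2, 3]              # made-up coefficients, illustration only
prediction = dot(x_rows[0], beta_example)  # 10*1 + 1*49 - 2*4 + 3*0 = 51
```

With the leading 1 in place, the constant term needs no special handling: it is just another coefficient in the dot product.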

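In this particular case, our independent variable `x` will be a list of vectors, each of which looks like `[1, 49, 4, 0]`: a constant term, the number of friends, the work hours per day, and the PhD dummy variable (0 here, meaning no PhD).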

Now for the data that we'll be using:

In [26]:
x = [[1,49,4,0],[1,41,9,0],[1,40,8,0],[1,25,6,0],[1,21,1,0],[1,21,0,0],[1,19,3,0],[1,19,0,0],[1,18,9,0],[1,18,8,0],[1,16,4,0],[1,15,3,0],[1,15,0,0],[1,15,2,0],[1,15,7,0],[1,14,0,0],[1,14,1,0],[1,13,1,0],[1,13,7,0],[1,13,4,0],[1,13,2,0],[1,12,5,0],[1,12,0,0],[1,11,9,0],[1,10,9,0],[1,10,1,0],[1,10,1,0],[1,10,7,0],[1,10,9,0],[1,10,1,0],[1,10,6,0],[1,10,6,0],[1,10,8,0],[1,10,10,0],[1,10,6,0],[1,10,0,0],[1,10,5,0],[1,10,3,0],[1,10,4,0],[1,9,9,0],[1,9,9,0],[1,9,0,0],[1,9,0,0],[1,9,6,0],[1,9,10,0],[1,9,8,0],[1,9,5,0],[1,9,2,0],[1,9,9,0],[1,9,10,0],[1,9,7,0],[1,9,2,0],[1,9,0,0],[1,9,4,0],[1,9,6,0],[1,9,4,0],[1,9,7,0],[1,8,3,0],[1,8,2,0],[1,8,4,0],[1,8,9,0],[1,8,2,0],[1,8,3,0],[1,8,5,0],[1,8,8,0],[1,8,0,0],[1,8,9,0],[1,8,10,0],[1,8,5,0],[1,8,5,0],[1,7,5,0],[1,7,5,0],[1,7,0,0],[1,7,2,0],[1,7,8,0],[1,7,10,0],[1,7,5,0],[1,7,3,0],[1,7,3,0],[1,7,6,0],[1,7,7,0],[1,7,7,0],[1,7,9,0],[1,7,3,0],[1,7,8,0],[1,6,4,0],[1,6,6,0],[1,6,4,0],[1,6,9,0],[1,6,0,0],[1,6,1,0],[1,6,4,0],[1,6,1,0],[1,6,0,0],[1,6,7,0],[1,6,0,0],[1,6,8,0],[1,6,4,0],[1,6,2,1],[1,6,1,1],[1,6,3,1],[1,6,6,1],[1,6,4,1],[1,6,4,1],[1,6,1,1],[1,6,3,1],[1,6,4,1],[1,5,1,1],[1,5,9,1],[1,5,4,1],[1,5,6,1],[1,5,4,1],[1,5,4,1],[1,5,10,1],[1,5,5,1],[1,5,2,1],[1,5,4,1],[1,5,4,1],[1,5,9,1],[1,5,3,1],[1,5,10,1],[1,5,2,1],[1,5,2,1],[1,5,9,1],[1,4,8,1],[1,4,6,1],[1,4,0,1],[1,4,10,1],[1,4,5,1],[1,4,10,1],[1,4,9,1],[1,4,1,1],[1,4,4,1],[1,4,4,1],[1,4,0,1],[1,4,3,1],[1,4,1,1],[1,4,3,1],[1,4,2,1],[1,4,4,1],[1,4,4,1],[1,4,8,1],[1,4,2,1],[1,4,4,1],[1,3,2,1],[1,3,6,1],[1,3,4,1],[1,3,7,1],[1,3,4,1],[1,3,1,1],[1,3,10,1],[1,3,3,1],[1,3,4,1],[1,3,7,1],[1,3,5,1],[1,3,6,1],[1,3,1,1],[1,3,6,1],[1,3,10,1],[1,3,2,1],[1,3,4,1],[1,3,2,1],[1,3,1,1],[1,3,5,1],[1,2,4,1],[1,2,2,1],[1,2,8,1],[1,2,3,1],[1,2,1,1],[1,2,9,1],[1,2,10,1],[1,2,9,1],[1,2,4,1],[1,2,5,1],[1,2,0,1],[1,2,9,1],[1,2,9,1],[1,2,0,1],[1,2,1,1],[1,2,1,1],[1,2,4,1],[1,1,0,1],[1,1,2,1],[1,1,2,1],[1,1,5,1],[1,1,3,1],[1,1,10,1],[1,1,6,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,4,1],[1,1,9,1],[1,1,9,1],[1,
1,4,1],[1,1,2,1],[1,1,9,1],[1,1,0,1],[1,1,8,1],[1,1,6,1],[1,1,1,1],[1,1,1,1],[1,1,5,1]]
daily_minutes_good = [68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]

## Further Assumptions of the Least Squares Model

There are two further assumptions that are required for this model, as well as our solution, to work.

### Assumption the First

The columns of $x$ are [linearly independent](https://en.wikipedia.org/wiki/Linear_independence), meaning that there is no way to write any one as a weighted sum of some of the others.  
If this assumption fails, there is no reliable way to estimate `beta`.  
To illustrate this in an extreme case, imagine that we have an extra field `num_acquaintances` in our data that, for every user, was exactly equal to `num_friends`.  
Then, starting with `beta`, if we add *any* amount to the `num_friends` coefficient and subtract the same amount from the `num_acquaintances` coefficient, the model's predictions will remain unchanged.  
This means that there is no way to find *the* coefficient for `num_friends`.  
Usually violations of this assumption won't be so obvious.
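The `num_acquaintances` thought experiment can be checked directly. In this sketch the row and both coefficient vectors are made up, and `dot` is a stand-in for the function imported from `linear_algebra`:

```python
def dot(v, w):
    """sum of componentwise products (stand-in for linear_algebra.dot)"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# constant, num_friends, num_acquaintances (a copy of num_friends), work hours, PhD
x_i = [1, 49, 49, 4, 0]

beta_a = [30, 1, 0, -2, 1]
shift = 5
# add `shift` to the num_friends coefficient and subtract it from num_acquaintances
beta_b = [30, 1 + shift, 0 - shift, -2, 1]

prediction_a = dot(x_i, beta_a)   # 30 + 49 + 0 - 8 + 0 = 71
prediction_b = dot(x_i, beta_b)   # 30 + 294 - 245 - 8 + 0 = 71
```

Because the two columns are identical, every choice of `shift` gives the same predictions, so the data cannot distinguish between the candidate coefficient vectors.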

### Assumption the Second

The columns of $x$ are all uncorrelated with the errors $\normalsize \epsilon$.  
If this fails to be the case, our estimates of `beta` will be systematically wrong.  
For example, in Chapter 14, we built a model that predicted that each additional friend was associated with an extra 0.90 daily minutes on the site.  
Imagine that it's also the case that:

- people who work more hours spend less time on the site.
- people with more friends tend to work more hours.

In math terms, imagine that the "actual" model is:  

$\large \text{minutes} = \alpha + \beta_1 \text{friends} + \beta_2 \text{work hours} + \epsilon$  

and that work hours and friends are positively correlated.  
In that case, when we minimize the errors of the single variable model:  

$\large \text{minutes} = \alpha + \beta_1 \text{friends} + \epsilon$  

we will underestimate $\beta_1$.

Think about what would happen if we made predictions using the single variable model with the "actual" value of $\beta_1$ (the value that arises from minimizing the errors of what we called the "actual" model).  
The predictions would tend to be much too large for users who work many hours and only a little too large for users who work few hours, because $\beta_2 < 0$ (more work hours means less time on the site) and we failed to include it.  
Because work hours is positively correlated with number of friends, this means that the predictions tend to be much too large for users with many friends and only slightly too large for users with few friends.  
The result of this is that we can reduce the errors (in the single-variable model) by decreasing our estimate of $\beta_1$, which means that the error-minimizing $\beta_1$ is smaller than the "actual" value.  
That is, in this case the single-variable least-squares solution is biased to underestimate $\beta_1$.  
And, in general, whenever the independent variables are correlated with the errors like this, our least squares solution will give us a biased estimate of $\beta$.
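A small simulation can illustrate this bias. Everything in the sketch below is an assumption made for illustration: the "actual" coefficients, the way work hours are tied to friends, and the noise levels. It uses a negative coefficient on work hours, matching "people who work more hours spend less time on the site", and then fits the single-variable model with the closed-form least squares slope:

```python
import random

def mean(values):
    return sum(values) / float(len(values))

def least_squares_slope(xs, ys):
    """slope of the single-variable least squares fit: cov(x, y) / var(x)"""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return covariance / variance

rng = random.Random(0)
true_beta_1, true_beta_2 = 1.0, -2.0     # made-up "actual" coefficients

friends = [rng.uniform(0, 50) for _ in range(1000)]
# work hours rise with friends, so the two are positively correlated
work_hours = [0.1 * f + rng.gauss(4, 1) for f in friends]
minutes = [30 + true_beta_1 * f + true_beta_2 * h + rng.gauss(0, 1)
           for f, h in zip(friends, work_hours)]

# regress minutes on friends alone, leaving work hours out of the model
estimated_beta_1 = least_squares_slope(friends, minutes)
print(estimated_beta_1)   # lands below true_beta_1
```

In this setup each extra friend brings 0.1 extra work hours on average, so the single-variable slope absorbs part of the work-hours effect and settles near $1.0 + (-2.0)(0.1) = 0.8$ rather than the "actual" 1.0.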

## Fitting the Model

As we did in the simple linear model, we'll choose `beta` to minimize the sum of squared errors.  
Finding an exact solution is not simple to do by hand, which means we'll need to use gradient descent.  
We'll start by creating an error function to minimize.  
For stochastic gradient descent, we'll just want the squared error corresponding to a single prediction:

In [27]:
def error(x_i, y_i, beta):
    return y_i - predict(x_i, beta)

def squared_error(x_i, y_i, beta):
    return error(x_i, y_i, beta) ** 2

If you know calculus, you can calculate:

In [28]:
def squared_error_gradient(x_i, y_i, beta):
    """ the gradient (with respect to beta) corresponding to the ith squared error term """
    return [-2 * x_ij * error(x_i, y_i, beta) for x_ij in x_i]
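For readers who want the calculus step spelled out, the gradient comes from applying the chain rule to the $i$th squared error. Here ${x_i}_j$ is the $j$th element of $x_i$ (the `x_ij` of the code) and $x_i \cdot \beta$ is the dot product computed by `predict`:

$\large \frac{\partial}{\partial \beta_j}\left(y_i - x_i \cdot \beta\right)^2 = 2\left(y_i - x_i \cdot \beta\right)\frac{\partial}{\partial \beta_j}\left(y_i - x_i \cdot \beta\right) = -2\,{x_i}_j\left(y_i - x_i \cdot \beta\right)$

The factor $\left(y_i - x_i \cdot \beta\right)$ is exactly what `error(x_i, y_i, beta)` returns, and collecting the partial derivatives for every $j$ gives the list that `squared_error_gradient` builds.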

At this point, we're ready to find the optimal beta using stochastic gradient descent:

In [29]:
def estimate_beta(x, y):
    beta_initial = [random.random() for x_i in x[0]]
    return minimize_stochastic(squared_error,
                               squared_error_gradient,
                               x, 
                               y, 
                               beta_initial,
                               0.001)

random.seed(0)
beta = estimate_beta(x, daily_minutes_good)
beta

[30.625234786488353,
 0.9715448288696535,
 -1.8679272872032218,
 0.911456949921445]

These results mean that our model looks like:  

$\normalsize \text{minutes} = 30.63 + 0.972\;\text{friends} - 1.868\;\text{work hours} + 0.911\;\text{PhD}$  
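The fit above relies on `minimize_stochastic` from the gradient descent chapter, which is not reproduced here. As a rough, simplified sketch of the kind of loop such a function runs, the code below visits the data points in random order and takes a small step against the gradient of each squared error. The function `sgd_least_squares`, its fixed step size and epoch count, and the synthetic data are all assumptions for illustration, not the original implementation:

```python
import random

def dot(v, w):
    """sum of componentwise products (stand-in for linear_algebra.dot)"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def sgd_least_squares(x, y, step_size=0.05, num_epochs=200, seed=0):
    """simplified stochastic gradient descent on the squared errors"""
    rng = random.Random(seed)
    beta = [rng.random() for _ in x[0]]        # random starting point
    data = list(zip(x, y))
    for _ in range(num_epochs):
        rng.shuffle(data)                      # visit the points in random order
        for x_i, y_i in data:
            err = y_i - dot(x_i, beta)
            gradient = [-2 * x_ij * err for x_ij in x_i]
            beta = [b - step_size * g for b, g in zip(beta, gradient)]
    return beta

# synthetic, noiseless data generated from beta = [2, 3, -1]
data_rng = random.Random(1)
x_demo = [[1, data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(50)]
y_demo = [2 + 3 * a - b for _, a, b in x_demo]

beta_demo = sgd_least_squares(x_demo, y_demo)
print(beta_demo)   # close to [2, 3, -1]
```

The synthetic inputs are deliberately kept in a small range so that a fixed step size is stable. Real data with larger values, like the friend counts here, generally needs a smaller step size or rescaled inputs.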

## Interpreting the Model

You should think of the coefficients of the model as representing all-else-being-equal estimates of the impacts of each factor.  
All else being equal, each additional friend corresponds to an extra minute spent on the site each day.  
All else being equal, each additional hour in a user's workday corresponds to about two fewer minutes spent on the site each day.  
All else being equal, having a PhD is associated with spending an extra minute on the site each day.  
What this *doesn't* tell us (directly, at least) is anything about the interactions among the variables.  
It's possible that the effect of work hours is different for people with many friends than it is for people with few friends.  
This model doesn't capture that.  
One way to handle this case is to introduce a new variable that is the *product* of "friends" and "work hours".  
This effectively allows the "work hours" coefficient to increase or decrease as the number of friends increases.
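To see how a product term lets the "work hours" effect depend on the number of friends, here is a small sketch. The helper `add_interaction` and the coefficients are hypothetical, chosen only so the arithmetic is easy to follow:

```python
def dot(v, w):
    """sum of componentwise products (stand-in for linear_algebra.dot)"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def add_interaction(x_i):
    """append friends * work_hours to a row of [1, friends, work_hours, phd]"""
    _, friends, work_hours, _ = x_i
    return list(x_i) + [friends * work_hours]

# made-up coefficients: constant, friends, work hours, PhD, friends * work hours
beta_hypothetical = [30, 1, -2, 1, 0.05]

def effect_of_one_more_work_hour(friends):
    """change in the prediction when work hours go from 4 to 5, all else equal"""
    before = dot(add_interaction([1, friends, 4, 0]), beta_hypothetical)
    after = dot(add_interaction([1, friends, 5, 0]), beta_hypothetical)
    return after - before

print(effect_of_one_more_work_hour(0))    # about -2
print(effect_of_one_more_work_hour(20))   # about -1, that is -2 + 0.05 * 20
```

With the product term included, the effect of an extra work hour is the "work hours" coefficient plus the interaction coefficient times the number of friends, so it is no longer a single fixed number.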

It is also possible that the more friends you have, the more time you spend on the site *up to a point*, after which further friends cause you to spend less time on the site. (Perhaps with too many friends the experience is too overwhelming?)    
We could try to capture this in our model by adding another variable that's the *square* of the number of friends.  
Once we start adding variables, we need to worry about whether their coefficients actually "matter".  
There are no limits to the numbers of products, logs, squares, and higher powers we could add.
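The "up to a point" idea can be sketched the same way by appending the square of the number of friends. The helper `add_squared_friends` and the coefficients below are made up for illustration. The negative coefficient on the squared term is what makes the predictions rise and then fall:

```python
def dot(v, w):
    """sum of componentwise products (stand-in for linear_algebra.dot)"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def add_squared_friends(x_i):
    """append friends ** 2 to a row of [1, friends, work_hours, phd]"""
    return list(x_i) + [x_i[1] ** 2]

# made-up coefficients: constant, friends, work hours, PhD, friends squared
beta_hypothetical = [30, 2, -2, 1, -0.02]

def predicted_minutes(friends):
    """prediction for a user with 4 work hours and no PhD"""
    return dot(add_squared_friends([1, friends, 4, 0]), beta_hypothetical)

# the turning point is where 2 - 2 * 0.02 * friends = 0, i.e. friends = 50
few, peak, many = predicted_minutes(25), predicted_minutes(50), predicted_minutes(75)
```

Under these made-up coefficients, `peak` is larger than both `few` and `many`, which is the shape the paragraph describes.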

## Goodness of Fit

Again we can look at the R-squared, which has now increased to 0.68:

In [30]:
def multiple_r_squared(x, y, beta):
    sum_of_squared_errors = sum(error(x_i, y_i, beta) ** 2 for x_i, y_i in zip(x, y))
    return 1.0 - sum_of_squared_errors / total_sum_of_squares(y)
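As a quick sanity check of what this quantity measures, the self-contained sketch below reimplements the same calculation on a three-point toy data set. The names `r_squared`, `x_toy`, and `y_toy` are hypothetical, and `total_sum_of_squares` is written out as the sum of squared deviations from the mean, which is assumed to match the imported function:

```python
def dot(v, w):
    """sum of componentwise products (stand-in for linear_algebra.dot)"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def total_sum_of_squares(y):
    """sum of squared deviations of y from its mean"""
    y_bar = sum(y) / float(len(y))
    return sum((y_i - y_bar) ** 2 for y_i in y)

def r_squared(x, y, beta):
    """fraction of the variation in y captured by the model"""
    sse = sum((y_i - dot(x_i, beta)) ** 2 for x_i, y_i in zip(x, y))
    return 1.0 - sse / total_sum_of_squares(y)

x_toy = [[1, 0], [1, 1], [1, 2]]
y_toy = [1, 3, 5]

perfect = r_squared(x_toy, y_toy, [1, 2])     # y = 1 + 2x fits exactly -> 1.0
mean_only = r_squared(x_toy, y_toy, [3, 0])   # always predicts the mean -> 0.0
```

A model that fits every point scores 1, and a model that only ever predicts the mean of `y` scores 0.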

Keep in mind, however, that adding new variables to a regression can never decrease the R-squared, and will almost always increase it.  
After all, the simple regression model is just the special case of the multiple regression model where the coefficients on "work hours" and "PhD" both equal 0.  
The optimal multiple regression model will necessarily have an error at least as small as that one.

Because of this, in a multiple regression, we also need to look at the [standard errors](https://en.wikipedia.org/wiki/Standard_error) of the coefficients, which measure how certain we are about our estimates of each $\beta_i$.  
The regression as a whole may fit our data very well, but if some of the independent variables are correlated (or irrelevant), their coefficients might not *mean* much.  

The typical approach to measuring these errors starts with another assumption -- that the errors $\epsilon_i$ are independent normal random variables with mean 0 and some shared (unknown) standard deviation $\sigma$.  
In that case, we (or, more likely, our statistical software) can use some linear algebra to find the standard error of each coefficient.  
The larger it is, the less sure our model is about that coefficient.  
Unfortunately, we're not set up to do that kind of linear algebra from scratch.
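For reference, the standard result that such software relies on can be stated compactly. The notation here is introduced only for this aside: $X$ is the matrix whose rows are the $x_i$ (including the column of ones), $p$ is its number of columns, and $\hat\epsilon_i$ are the residuals of the fitted model. Under the assumptions above:

$\large \hat\sigma^2 = \frac{1}{n - p}\sum_{i=1}^{n} \hat\epsilon_i^2 \qquad \text{se}(\hat\beta_j) = \sqrt{\hat\sigma^2 \left[(X^T X)^{-1}\right]_{jj}}$

The matrix inverse $(X^T X)^{-1}$ is the piece we have not built from scratch.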

## Digression: The Bootstrap

Imagine that we have a sample of $n$ data points, generated by some (unknown to us) distribution:  

`data = get_sample(num_points=n)`  

In Chapter 5, we wrote a function to calculate the `median` of the observed data, which we can use as an estimate of the median of the distribution itself.  
But how confident can we be about our estimate?  
If all of the data in the sample are very close to 100, then it seems likely that the actual median is close to 100.  
If approximately half of the data in the sample is close to 0 and the other half is close to 200, then we can't be nearly as certain about the median.

If we could repeatedly get new samples, we could compute the median of each and look at the distribution of those medians.  
Usually we can't.  
What we can do instead is [bootstrap](https://en.wikipedia.org/wiki/Bootstrapping_(statistics)) new data sets by choosing $n$ data points *with replacement* from our data and then calculate the medians of those synthetic data sets:

In [31]:
def bootstrap_sample(data):
    """ randomly samples len(data) elements with replacement """
    return [random.choice(data) for _ in data]

def bootstrap_statistic(data, stats_fn, num_samples):
    """ evaluates stats_fn on num_samples bootstrap samples from data """
    return [stats_fn(bootstrap_sample(data)) for _ in range(num_samples)]
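A tiny usage sketch shows what "with replacement" means in practice. The five-element list is made up, and `bootstrap_sample` is repeated so the snippet runs on its own:

```python
import random

def bootstrap_sample(data):
    """randomly samples len(data) elements with replacement"""
    return [random.choice(data) for _ in data]

random.seed(0)
data = [10, 20, 30, 40, 50]
sample = bootstrap_sample(data)

# same size as the original; values may repeat and some may be missing
print(sample)
print(len(sample) == len(data))
```

Each synthetic data set reuses only the observed values, but in different proportions, which is what makes the statistic vary from one bootstrap sample to the next.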

For example, consider the two following data sets:

In [32]:
# 101 points all very close to 100
close_to_100 = [99.5 + random.random() for _ in range(101)]
close_to_100

[99.67232348413654,
 99.66789337232244,
 100.48755198519261,
 100.05092001929478,
 99.65667758671324,
 100.23220072672623,
 99.91858666365682,
 100.0227656713815,
 100.23660366112027,
 100.1164465479239,
 100.42194541502072,
 99.74150836873142,
 99.50983790912068,
 99.74437835042211,
 99.78741706803562,
 99.59398218944757,
 99.91917053096665,
 100.18038503075336,
 100.00163894294414,
 99.79919455670316,
 100.11076927840307,
 99.64121932944245,
 100.44445441120023,
 99.67041626825599,
 99.62757797471795,
 99.79085513668522,
 99.91871634648645,
 100.07078532815683,
 99.77176948360089,
 99.83989472198373,
 100.3596932989131,
 100.2521018949548,
 99.9011592314215,
 100.12778552117875,
 99.65014280209458,
 100.32407018238091,
 99.54672792219513,
 99.5170433466829,
 100.47463774250083,
 100.19886179306465,
 99.77952080966868,
 100.23045831494373,
 99.60358713212436,
 99.57683422681276,
 100.35035329379498,
 100.48721391040483,
 99.54927213964649,
 99.88621092049782,
 100.36301848326202,
 100

In [33]:
# 101 points, 50 of them near 0, 50 of them near 200
far_from_100 = ([99.5 + random.random()] + 
                [random.random() for _ in range(50)] + 
                [200 + random.random() for _ in range(50)])
far_from_100

[99.88123934680546,
 0.07338681934354963,
 0.06750332777081425,
 0.09038014271638684,
 0.13555032098946795,
 0.15877205414328543,
 0.09003766654085732,
 0.43257503723419066,
 0.7074066046516575,
 0.49166968165200353,
 0.2424559633954838,
 0.4836258928392764,
 0.531216554944571,
 0.23869991541065028,
 0.3175713118522211,
 0.49244511541480107,
 0.48312920949129423,
 0.37848760329102427,
 0.07212722630772073,
 0.9510078771184474,
 0.2696035016653002,
 0.3098656501775928,
 0.9379520613861619,
 0.41128674425523026,
 0.7340661885033657,
 0.13960513771125882,
 0.6719249710917461,
 0.8133157262876995,
 0.9610729439040515,
 0.46064650262496654,
 0.5994097716334423,
 0.791383888512169,
 0.7929383886152752,
 0.35476637607517914,
 0.6037083473955032,
 0.9251244198229643,
 0.38207224438305765,
 0.6508039786753007,
 0.337311253546157,
 0.9889617627300915,
 0.8479466705917126,
 0.9380136613197754,
 0.890544699972853,
 0.6999668375199836,
 0.5722156351304395,
 0.31457994819943513,
 0.3592733619762747,

In [34]:
# If you calculate the median of each, both will be very close to 100.
print(median(close_to_100))
print(median(far_from_100))

100.022765671
99.8812393468


In [35]:
# However, if you look at:
bootstrap_statistic(close_to_100, median, 100)
# you will mostly see numbers close to 100.

[100.00163894294414,
 99.91917053096665,
 100.04779962722975,
 100.07078532815683,
 100.05092001929478,
 99.75781287223258,
 100.02951482374817,
 99.9011592314215,
 100.05092001929478,
 99.84888107500696,
 100.16134473286708,
 100.02951482374817,
 99.94438370321161,
 99.87854525749272,
 100.04779962722975,
 99.9011592314215,
 99.91858666365682,
 100.11076927840307,
 100.08018406587479,
 99.91917053096665,
 99.84888107500696,
 99.91917053096665,
 100.00163894294414,
 100.00163894294414,
 100.00163894294414,
 100.04219451967033,
 100.04779962722975,
 100.04093770521457,
 100.02951482374817,
 99.91871634648645,
 100.11076927840307,
 100.00163894294414,
 100.04093770521457,
 100.05092001929478,
 100.00163894294414,
 99.94438370321161,
 99.94438370321161,
 100.04219451967033,
 100.04779962722975,
 99.91871634648645,
 100.05092001929478,
 99.9011592314215,
 100.02951482374817,
 100.0227656713815,
 100.16134473286708,
 99.91871634648645,
 100.00163894294414,
 100.04219451967033,
 100.02951482

In [36]:
# Now looking at:
bootstrap_statistic(far_from_100, median, 100)
# you will see a lot of numbers close to 0 and a lot of numbers close to 200 (with a few close to 100 thrown in)

[200.03634068155228,
 200.08178177744344,
 0.8133157262876995,
 0.9510078771184474,
 99.88123934680546,
 0.9510078771184474,
 200.32769342519742,
 200.0220118191685,
 200.04042139830705,
 0.9610729439040515,
 0.9889617627300915,
 200.0220118191685,
 200.03541078390245,
 0.9380136613197754,
 200.0220118191685,
 0.8479466705917126,
 0.8133157262876995,
 0.9379520613861619,
 200.08178177744344,
 0.9610729439040515,
 200.0220118191685,
 200.03634068155228,
 0.9889617627300915,
 99.88123934680546,
 0.9510078771184474,
 99.88123934680546,
 200.14442737011086,
 200.03634068155228,
 200.0220118191685,
 0.791383888512169,
 200.0220118191685,
 200.03634068155228,
 200.08187070628674,
 99.88123934680546,
 0.9251244198229643,
 0.9251244198229643,
 200.08187070628674,
 0.9380136613197754,
 0.9379520613861619,
 0.9889617627300915,
 200.03634068155228,
 200.03541078390245,
 200.0220118191685,
 99.88123934680546,
 0.9510078771184474,
 200.32769342519742,
 200.03634068155228,
 200.08178177744344,
 0.93

Also, the `standard_deviation` of the first set of medians is close to 0, while the `standard_deviation` of the second set of medians is close to 100.

In [37]:
print(standard_deviation(close_to_100))
print(standard_deviation(bootstrap_statistic(close_to_100, median, 100)))
print(standard_deviation(far_from_100))
print(standard_deviation(bootstrap_statistic(far_from_100, median, 100)))

0.303183667387
0.0681205300196
100.021771234
95.4570184486


This extreme case would be easy to figure out by manually inspecting the data, but in real practice that won't usually be true.

## Standard Errors of Regression Coefficients

We can take the same approach to estimating the standard errors of our regression coefficients.  
We repeatedly take a `bootstrap_sample` of our data and estimate `beta` based on that sample.  
If the coefficient corresponding to one of the independent variables (say `num_friends`) doesn't vary much across samples, then we can be confident that our estimate is relatively accurate.  
If the coefficient varies greatly across samples, then we can't be at all confident of our estimate.

Before sampling, it is *very* important that we `zip` the `x` data and `y` data to make sure that the corresponding values of the independent and dependent variables are sampled together.  
This means that `bootstrap_sample` will return a list of pairs (`x_i, y_i`), which must be reassembled into an `x_sample` and a `y_sample`:

In [38]:
def estimate_sample_beta(sample):
    """ sample is a list of pairs (x_i, y_i) """
    x_sample, y_sample = zip(*sample)
    return estimate_beta(x_sample, y_sample)

random.seed(0)
# list(...) keeps this working where zip returns an iterator rather than a list
bootstrap_betas = bootstrap_statistic(list(zip(x, daily_minutes_good)), estimate_sample_beta, 100)
bootstrap_betas[:5]

[[30.126427794849263,
  0.9012617032007041,
  -1.7030126679618276,
  0.33505058469292937],
 [30.739799574784413,
  0.9315551053594878,
  -1.8441042227777509,
  0.8586836390433148],
 [29.412830031142178,
  1.0698124634313169,
  -1.7510563652088023,
  1.151878332200066],
 [30.562978707724948,
  1.2020018751810237,
  -2.008914519534357,
  0.9248718484382588],
 [31.87041480154315,
  0.9571614075937392,
  -1.9366453588430825,
  -0.12061673914648428]]
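The zip-then-unzip step is easy to get wrong, so here is a minimal sketch on made-up toy data, where each `y` value is ten times the second element of its `x` row so that the pairing is easy to verify:

```python
import random

x_toy = [[1, 5], [1, 6], [1, 7]]
y_toy = [50, 60, 70]              # y_toy[i] belongs with x_toy[i]

pairs = list(zip(x_toy, y_toy))   # [([1, 5], 50), ([1, 6], 60), ([1, 7], 70)]

random.seed(0)
resampled = [random.choice(pairs) for _ in pairs]   # resample whole pairs

x_sample, y_sample = zip(*resampled)                # split back into two tuples

# every resampled x still sits next to its own y
still_paired = all(y_i == 10 * x_i[1] for x_i, y_i in zip(x_sample, y_sample))
```

Sampling `x_toy` and `y_toy` separately would break this pairing and destroy the relationship the regression is trying to estimate.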

Now we can estimate the standard error of each coefficient as the standard deviation of its bootstrap estimates:

In [39]:
bootstrap_standard_errors = [standard_deviation([beta[i] for beta in bootstrap_betas]) for i in range(4)]
bootstrap_standard_errors

[1.174097542924062,
 0.07861006463889537,
 0.13138388603694567,
 0.9899022849002838]

In [40]:
# [1.174,    # constant term,  actual error = 1.19
#  0.079,    # num_friends,    actual error = 0.080
#  0.131,    # work_hours,     actual error = 0.127
#  0.990]    # phd,            actual error = 0.998

We can use these results to test hypotheses such as "does $\beta_j$ equal zero?"  
Under the null hypothesis $\beta_j = 0$ (and with our other assumptions about the distribution of $\epsilon_i$) the statistic:  

$\Large t_j = \hat\beta_j \;/\; \hat\sigma_j$  

which is our estimate of $\beta_j$ divided by our estimate of its standard error, follows a [Student's t-distribution](https://en.wikipedia.org/wiki/Student's_t-distribution) with $n - k$ [degrees of freedom](https://en.wikipedia.org/wiki/Degrees_of_freedom_%28statistics%29).

If we had a `students_t_cdf()` function, we could calculate *p*-values for each least squares coefficient to indicate how likely we would be to observe such a value if the actual coefficient were zero.  
Unfortunately, we don't have such a function (although we would if we weren't working from scratch).  
However, as degrees of freedom get large, the *t*-distribution gets closer to a standard normal.  
In a situation like this, where *n* is much larger than *k*, we can use `normal_cdf()` and still feel good about ourselves:

In [41]:
def p_value(beta_hat_j, sigma_hat_j):
    if beta_hat_j > 0:
        # if the coefficient is positive, calculate twice the probability of seeing an even larger value
        return 2 * (1 - normal_cdf(beta_hat_j / sigma_hat_j))
    else:
        # otherwise twice the probability of seeing a smaller value
        return 2 * normal_cdf(beta_hat_j / sigma_hat_j)
    
print(p_value(30.63, 1.174))   # ~0      constant term
print(p_value(0.972, 0.079))   # ~0      num_friends
print(p_value(-1.868, 0.131))  # ~0      work_hours
print(p_value(0.911, 0.990))   #  0.36   PhD

0.0
0.0
0.0
0.357467198817
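For a version that runs without the `probability` module, the sketch below defines `normal_cdf` with `math.erf` (assumed here to be equivalent to the imported function) and prints each $t_j$ alongside its approximate *p*-value. The helper `two_sided_p_value` is a hypothetical name. By the symmetry of the normal distribution, taking the absolute value of $t_j$ gives the same result as the two-branch `p_value` above:

```python
import math

def normal_cdf(x, mu=0, sigma=1):
    """cumulative distribution function of the normal distribution"""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def two_sided_p_value(beta_hat_j, sigma_hat_j):
    """normal approximation: chance of a statistic at least this far from 0"""
    t_j = beta_hat_j / sigma_hat_j
    return 2 * (1 - normal_cdf(abs(t_j)))

# (name, rounded coefficient, rounded bootstrap standard error) from above
estimates = [("constant", 30.63, 1.174), ("num_friends", 0.972, 0.079),
             ("work_hours", -1.868, 0.131), ("phd", 0.911, 0.990)]

for name, beta_hat_j, sigma_hat_j in estimates:
    t_j = beta_hat_j / sigma_hat_j
    p_j = two_sided_p_value(beta_hat_j, sigma_hat_j)
    print("%-12s t = %7.2f   p = %.3f" % (name, t_j, p_j))
```

Seeing the $t_j$ values next to the *p*-values makes the pattern plain: the first three coefficients are many standard errors away from zero, while the PhD coefficient is less than one standard error away.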


In a situation not like this, we would probably be using statistical software that knows how to calculate the *t*-distribution, as well as how to calculate the exact standard errors.  
While most of the coefficients have very small *p*-values (suggesting that they are indeed nonzero), the coefficient for "PhD" is not "significantly" different from zero, so we cannot rule out that its estimated value reflects random variation rather than a meaningful effect.  
In more elaborate regression scenarios, you sometimes want to test more elaborate hypotheses about the data, such as:  
"at least one of the $\beta_j$ is non-zero", or  
"$\beta_1$ equals $\beta_2$ *and* $\beta_3$ equals $\beta_4$",  
which can be done with an [F-test](https://en.wikipedia.org/wiki/F-test), which is, unfortunately, outside the scope of this book.

## Regularization