<img src="../dsi.png" style="height:128px;">

## Lesson 9: Regression

Today, we will get some hands-on practice with linear regression!

In [26]:
import numpy as np
from datascience import *
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

As we have learnt from today's worksheet, the correlation coefficient *r* doesn't just measure how clustered the points in a scatter plot are about a straight line. It also helps identify the straight line about which the points are clustered.

Let's see if we can find a way to identify this line. 

Below we have data on the heights of parents and their adult children. The table has 6 columns. *Family* identifies the family that the child belongs to, labeled by the numbers from 1 to 204. *Father* is the father's height in inches, and *Mother* is the mother's height in inches. *Gender* is the gender of the child, male (M) or female (F), and *Height* is the height of the adult child in inches. *Kids* is the number of children in the child's family.

From the data in the table, the heights of the children and their parents seem to show a linear association. Let's make predictions of the children's heights based on the midparent heights and see if there is indeed linearity.


In [32]:
family_height = Table.read_table('family.csv')
family_height
#Source: http://www.math.uah.edu/stat/data/Galton.html

Family,Father,Mother,Gender,Height,Kids
1,78.5,67.0,M,73.2,4
1,78.5,67.0,F,69.2,4
1,78.5,67.0,F,69.0,4
1,78.5,67.0,F,69.0,4
2,75.5,66.5,M,73.5,4
2,75.5,66.5,M,72.5,4
2,75.5,66.5,F,65.5,4
2,75.5,66.5,F,65.5,4
3,75.0,64.0,M,71.0,2
3,75.0,64.0,F,68.0,2


In [33]:
family_height = family_height.with_columns('Midparentheight', (family_height.column('Father')+family_height.column('Mother'))/2)
heights = Table().with_columns(
    'MidParent', family_height.column('Midparentheight'),
    'Child', family_height.column('Height')
    )
heights

MidParent,Child
72.75,73.2
72.75,69.2
72.75,69.0
72.75,69.0
71.0,73.5
71.0,72.5
71.0,65.5
71.0,65.5
69.5,71.0
69.5,68.0


There does indeed seem to be a linear relation. Let's see if we can find a way to identify the regression line. First, notice that linear association doesn't depend on the units of measurement – we might as well measure both variables in standard units.


In [34]:
def standard_units(xyz):
    "Convert any array of numbers to standard units."
    return (xyz - np.mean(xyz))/np.std(xyz)  
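
As a quick sanity check on the claim that units don't matter, the sketch below converts the ten midparent heights displayed above to standard units twice, once in inches and once in centimetres. It uses plain NumPy rather than the `datascience` table, and the names `midparent_in`, `midparent_cm`, `su_in` and `su_cm` are made up for this illustration. Because it uses only ten rows, the standard-unit values will not match those computed from the full table.

```python
import numpy as np

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)

# The ten midparent heights (in inches) displayed in the table above.
midparent_in = np.array([72.75, 72.75, 72.75, 72.75,
                         71.0, 71.0, 71.0, 71.0, 69.5, 69.5])
midparent_cm = midparent_in * 2.54  # the same heights in centimetres

su_in = standard_units(midparent_in)
su_cm = standard_units(midparent_cm)

# In standard units the mean is (approximately) 0 and the SD is 1.
print(np.mean(su_in), np.std(su_in))

# The conversion factor cancels, so both versions agree.
print(np.allclose(su_in, su_cm))
```

The second check is the reason we can work in standard units without worrying about how the heights were originally measured.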

In [35]:
heights_SU = Table().with_columns(
    'MidParent SU', standard_units(heights.column('MidParent')),
    'Child SU', standard_units(heights.column('Child'))
)
heights_SU


MidParent SU,Child SU
3.48071,1.79823
3.48071,0.681196
3.48071,0.625344
3.48071,0.625344
2.48073,1.882
2.48073,1.60275
2.48073,-0.352057
2.48073,-0.352057
1.62361,1.18386
1.62361,0.346087


On this scale, we can calculate our predictions. But first we have to define which points should be labelled as "close points" on this scale. We can say that midparent heights are "close" if they are within 0.5 inches of each other. Since standard units measure distances in units of SDs, we have to figure out how many SDs of midparent height correspond to 0.5 inches. One SD of midparent heights is about 1.75 inches, so 0.5 inches is about 0.29 SDs.


In [36]:
sd_midparent = np.std(heights.column(0))
sd_midparent

1.7500351938096907

In [37]:
0.5/sd_midparent

0.28570853990172551

We can now make a prediction function.

In [38]:
def predict_child_su(mpht_su):
    """Return a prediction of the height (in standard units) of a child 
    whose parents have a midparent height of mpht_su in standard units.
    """
    close = 0.5/sd_midparent
    close_points = heights_SU.where('MidParent SU', are.between(mpht_su-close, mpht_su + close))
    return close_points.column('Child SU').mean()   

In [39]:
heights_with_su_predictions = heights_SU.with_column(
    'Prediction SU', heights_SU.apply(predict_child_su, 'MidParent SU')
    )

In [40]:
heights_with_su_predictions.scatter('MidParent SU')

As we have already learnt in the worksheet, regression uses the value of one variable (which we will call *x*) to predict the value of another (which we will call *y*). When the variables *x* and *y* are measured in standard units, the regression line for predicting *y* based on *x* has slope *r* and passes through the origin.

The three functions below compute the correlation, slope, and intercept. All of them take three arguments: the table, the label of the column containing *x*, and the label of the column containing *y*.

In [41]:
def correlation(t, label_x, label_y):
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))

def slope(t, label_x, label_y):
    r = correlation(t, label_x, label_y)
    return r*np.std(t.column(label_y))/np.std(t.column(label_x))

def intercept(t, label_x, label_y):
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))
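
To connect the slope and intercept formulas with the standard-units picture, here is a small sketch in plain NumPy. It predicts in two ways: first in standard units (slope *r*, through the origin) and then converted back to inches, and second with the slope and intercept directly. It uses only the ten (MidParent, Child) rows displayed earlier, so its numbers will differ from those for the full table, and the variable names are chosen just for this illustration.

```python
import numpy as np

# The ten (MidParent, Child) rows displayed earlier, in inches.
x = np.array([72.75, 72.75, 72.75, 72.75, 71.0, 71.0, 71.0, 71.0, 69.5, 69.5])
y = np.array([73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 68.0])

def standard_units(values):
    return (values - np.mean(values)) / np.std(values)

# Same formulas as the table-based functions above, applied to arrays.
r = np.mean(standard_units(x) * standard_units(y))
slope = r * np.std(y) / np.std(x)
intercept = np.mean(y) - slope * np.mean(x)

# Route 1: predict in standard units, then convert back to inches.
pred_via_su = r * standard_units(x) * np.std(y) + np.mean(y)

# Route 2: use the slope and intercept in the original units.
pred_direct = slope * x + intercept

print(np.allclose(pred_via_su, pred_direct))    # the two routes agree
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))   # matches NumPy's correlation
```

Both routes give the same predictions because the slope and intercept formulas are what you get when the line "*y* in standard units = *r* times *x* in standard units" is rewritten in the original units.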

The correlation between midparent height and child's height is 0.33:

In [42]:
family_r = correlation(heights, 'MidParent', 'Child')
family_r

0.32707394987259658

We can also find the equation of the regression line for predicting the child's height based on midparent height.

In [43]:
family_slope = slope(heights, 'MidParent', 'Child')
family_intercept = intercept(heights, 'MidParent', 'Child')
family_slope, family_intercept

(0.66925889513254522, 22.14880916454139)

The equation of the regression line is

estimate of child's height = 0.67⋅midparent height + 22.15

This is also known as the regression equation. The principal use of the regression equation is to predict *y* based on *x* .
For example, for a midparent height of 70.48 inches, the regression equation predicts the child's height to be 69.32 inches.

In [44]:
family_slope*70.48 + family_intercept


69.318176093483174

Here are the rows of our heights table, along with the regression predictions of the children's heights.

In [54]:
heights_with_predictions = heights.with_column(
    'Regression Prediction', family_slope*heights.column('MidParent') + family_intercept
)
heights_with_predictions


MidParent,Child,Regression Prediction
72.75,73.2,70.8374
72.75,69.2,70.8374
72.75,69.0,70.8374
72.75,69.0,70.8374
71.0,73.5,69.6662
71.0,72.5,69.6662
71.0,65.5,69.6662
71.0,65.5,69.6662
69.5,71.0,68.6623
69.5,68.0,68.6623


In [57]:
heights_with_predictions.scatter('MidParent')

In this plot, the regression predictions all lie on the regression line. The line follows the same trend as the graph of averages we plotted earlier in standard units. For these data, the regression line does a good job of approximating the centers of the vertical strips.


**The Method of Least Squares**

If you use any arbitrary line as your line of estimates, then some of your errors are likely to be positive and others negative. To avoid cancellation when measuring the rough size of the errors, we take the mean of the squared errors rather than the mean of the errors themselves. This is exactly analogous to our reason for looking at squared deviations from average when we were learning how to calculate the SD.

The mean squared error of estimation using a straight line is a measure of roughly how big the squared errors are; taking the square root yields the root mean square error, which is in the same units as *y*.

Here is a remarkable mathematical fact: the regression line minimizes the mean squared error of estimation (and hence also the root mean squared error) among all straight lines. That is why the regression line is sometimes called the "least squares line."

**Computing the "best" line:**

To get estimates of *y* based on *x*, you can use any line you want.

Every line has a mean squared error of estimation.

"Better" lines have smaller errors.

The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines.
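
The points above can be tried out numerically. The sketch below is an illustration in plain NumPy, using only the ten (MidParent, Child) rows displayed earlier. It computes the mean squared error and root mean squared error of the regression line, then nudges the slope and the intercept to see that the error goes up. The helper names `mse` and `rmse` and the nudge sizes are assumptions made for this example.

```python
import numpy as np

# The ten (MidParent, Child) rows displayed earlier, in inches.
x = np.array([72.75, 72.75, 72.75, 72.75, 71.0, 71.0, 71.0, 71.0, 69.5, 69.5])
y = np.array([73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 68.0])

def mse(slope, intercept):
    """Mean squared error of the estimates slope*x + intercept."""
    return np.mean((y - (slope * x + intercept)) ** 2)

def rmse(slope, intercept):
    """Root mean squared error, in the same units as y."""
    return np.sqrt(mse(slope, intercept))

# Regression line from the slope and intercept formulas.
r = np.corrcoef(x, y)[0, 1]
reg_slope = r * np.std(y) / np.std(x)
reg_intercept = np.mean(y) - reg_slope * np.mean(x)

best = mse(reg_slope, reg_intercept)
print('regression line: mse', best, 'rmse', rmse(reg_slope, reg_intercept))

# Any other line does worse; here are four nearby lines.
nudges = [(0.1, 0), (-0.1, 0), (0, 0.5), (0, -0.5)]
for d_slope, d_intercept in nudges:
    other = mse(reg_slope + d_slope, reg_intercept + d_intercept)
    print(d_slope, d_intercept, other, other > best)
```

Every nudged line prints `True` in the last column, because its mean squared error is larger than that of the regression line.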

Let's look at an example.

Below we have a table of data examining the relation between strength and shot put distance. The population consists of 28 female collegiate athletes. Strength was measured by the largest amount (in kilograms) that the athlete lifted (column *power.clean*). The distance was the athlete's best (column *shot.putt*).


In [7]:
shotput = Table.read_table('shotput.csv')
shotput


power.clean,shot.putt
37.5,6.4
51.5,10.2
61.3,12.4
61.3,13.0
63.6,13.2
66.1,13.0
70.0,12.7
92.7,13.9
90.5,15.5
90.5,15.8


In [8]:
shotput.scatter('power.clean')


That's not a football-shaped scatter plot. In fact, it seems to have a slight non-linear component. But if we insist on using a straight line to make our predictions, there is still one best straight line among all straight lines.

Our formulas for the slope and intercept of the regression line give the following values.

In [14]:
slope(shotput, 'power.clean', 'shot.putt')


0.098343821597819972

In [15]:
intercept(shotput, 'power.clean', 'shot.putt')


5.9596290983739522

Does it still make sense to use these formulas even though the scatter plot isn't football-shaped? We can answer this by finding the slope and intercept of the line that minimizes the mean squared error (MSE).


We will define the function *shotput_linear_mse* to take an arbitrary slope and intercept as arguments and return the corresponding MSE. Then *minimize* applied to *shotput_linear_mse* will return the best slope and intercept.

In [16]:
def shotput_linear_mse(any_slope, any_intercept):
    x = shotput.column('power.clean')
    y = shotput.column('shot.putt')
    fitted = any_slope*x + any_intercept
    return np.mean((y - fitted) ** 2)

In [17]:
minimize(shotput_linear_mse)


array([ 0.09834382,  5.95962911])

These values are the same as those we got by using our formulas. To summarize:


No matter what the shape of the scatter plot, there is a unique line that minimizes the mean squared error of estimation. It is called the regression line, and its slope and intercept are given by


slope of the regression line = r⋅SD of y/SD of x


intercept of the regression line = average of y − slope⋅average of x
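
The agreement between the formulas and the minimizer can also be checked without the `minimize` helper. The sketch below is an illustration in plain NumPy and takes a different route from the one used above: it solves the least-squares problem directly with `np.linalg.lstsq` and compares the answer with the slope and intercept formulas. It uses only the ten shot put rows displayed earlier, so its values will differ from the 0.098 and 5.96 obtained from all 28 athletes. The variable names are chosen for this example.

```python
import numpy as np

# The ten shot put rows displayed earlier.
power_clean = np.array([37.5, 51.5, 61.3, 61.3, 63.6, 66.1, 70.0, 92.7, 90.5, 90.5])
shot_putt = np.array([6.4, 10.2, 12.4, 13.0, 13.2, 13.0, 12.7, 13.9, 15.5, 15.8])

# Route 1: the slope and intercept formulas.
r = np.corrcoef(power_clean, shot_putt)[0, 1]
formula_slope = r * np.std(shot_putt) / np.std(power_clean)
formula_intercept = np.mean(shot_putt) - formula_slope * np.mean(power_clean)

# Route 2: find the [slope, intercept] pair that minimizes the sum of
# squared errors, using a column of x values and a column of ones.
design = np.column_stack([power_clean, np.ones_like(power_clean)])
(ls_slope, ls_intercept), *_ = np.linalg.lstsq(design, shot_putt, rcond=None)

print(formula_slope, formula_intercept)
print(ls_slope, ls_intercept)
print(np.allclose([formula_slope, formula_intercept], [ls_slope, ls_intercept]))
```

Minimizing the sum of squared errors gives the same line as minimizing the mean, since the two differ only by the constant factor of the number of points.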


In the following exercise, you'll work with a small invented data set. Run the next cell to generate the dataset ds and see a scatter plot.

In [23]:
ds = Table().with_columns(
    'x', make_array(0,  1,  2,  3,  4),
    'y', make_array(1, .5, -1,  2, -3))
ds.scatter('x')

Running the cell below will generate sliders that control the slope and intercept of a line through the scatter plot.

When you adjust a slider, the line will move.

By moving the line around, make your best guess at the least-squares regression line. (It's okay if your line isn't exactly right, as long as it's reasonable.)

**Note:** Python will probably take about a second to redraw the plot each time you adjust the slider. We suggest clicking the place on the slider you want to try and waiting for the plot to be drawn; dragging the slider handle around will cause a long lag.

In [28]:
def plot_line(slope, intercept):
    plt.figure(figsize=(5,5))
    
    endpoints = make_array(-2, 7)
    p = plt.plot(endpoints, slope*endpoints + intercept, color='orange', label='Proposed line')
    
    plt.scatter(ds.column('x'), ds.column('y'), color='blue', label='Points')
    
    plt.xlim(-4, 8)
    plt.ylim(-6, 6)
    plt.gca().set_aspect('equal', adjustable='box')
    
    plt.legend(bbox_to_anchor=(1.8, .8))

interact(plot_line, slope=widgets.FloatSlider(min=-4, max=4, step=.1), intercept=widgets.FloatSlider(min=-4, max=4, step=.1));


The next cell produces a more useful plot. Use it to find a line that's closer to the least-squares regression line.

In [48]:
def plot_line_and_errors(slope, intercept):
    plt.figure(figsize=(5,5))
    points = make_array(-2, 7)
    p = plt.plot(points, slope*points + intercept, color='orange', label='Proposed line')
    ax = p[0].axes
    
    predicted_ys = slope*ds.column('x') + intercept
    diffs = predicted_ys - ds.column('y')
    for i in np.arange(ds.num_rows):
        x = ds.column('x').item(i)
        y = ds.column('y').item(i)
        diff = diffs.item(i)
        
        if diff > 0:
            bottom_left_x = x
            bottom_left_y = y
        else:
            bottom_left_x = x + diff
            bottom_left_y = y + diff
        
        ax.add_patch(patches.Rectangle(make_array(bottom_left_x, bottom_left_y), abs(diff), abs(diff), color='red', alpha=.3, label=('Squared error' if i == 0 else None)))
        plt.plot(make_array(x, x), make_array(y, y + diff), color='red', alpha=.6, label=('Error' if i == 0 else None))
    
    plt.scatter(ds.column('x'), ds.column('y'), color='blue', label='Points')
    
    plt.xlim(-4, 8)
    plt.ylim(-6, 6)
    plt.gca().set_aspect('equal', adjustable='box')
    
    plt.legend(bbox_to_anchor=(1.8, .8))

interact(plot_line_and_errors, slope=widgets.FloatSlider(min=-4, max=4, step=.1), intercept=widgets.FloatSlider(min=-4, max=4, step=.1));
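
Once you have settled on a guess with the sliders, you can check it against the least-squares line computed from the slope and intercept formulas. The sketch below does this in plain NumPy for the same five points as `ds`. The names `my_slope` and `my_intercept` are placeholders for your own slider values, and the starting numbers are just an example guess.

```python
import numpy as np

# The same five points as the ds table.
x = np.array([0, 1, 2, 3, 4])
y = np.array([1, .5, -1, 2, -3])

# Least-squares line from the slope and intercept formulas.
r = np.corrcoef(x, y)[0, 1]
best_slope = r * np.std(y) / np.std(x)
best_intercept = np.mean(y) - best_slope * np.mean(x)
print(best_slope, best_intercept)   # approximately -0.65 and 1.2

def total_squared_error(slope, intercept):
    """Total area of the red squares for the line slope*x + intercept."""
    return np.sum((slope * x + intercept - y) ** 2)

# Example guess; replace these with the values from your sliders.
my_slope, my_intercept = -0.5, 1.0
print(total_squared_error(my_slope, my_intercept))
print(total_squared_error(best_slope, best_intercept))
```

No slider setting can give a smaller total than the second number. Because the sliders move in steps of 0.1, the closest you can get to the best slope is -0.6 or -0.7.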