<img src="../dsi.png" style="height:128px;">

## Lesson 9: Regression

Today, we will get some hands-on practice with linear regression!

In [9]:
import numpy as np
from datascience import *
import matplotlib
import matplotlib.pyplot as plots
import pandas as pd
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)

As we have learnt from today's worksheet, the correlation coefficient *r* doesn't just measure how clustered the points in a scatter plot are about a straight line. It also helps identify the straight line about which the points are clustered.

Let's see if we can find a way to identify this line. 

Below we have data on the heights of parents and their adult children, which appear to be linearly associated. Let's predict the children's heights from the midparent heights (the average of the father's and mother's heights) and see whether the relation is indeed linear.


In [40]:
family_height = Table.read_table('family.csv')
family_height

Family,Father,Mother,Gender,Height,Kids
1,78.5,67.0,M,73.2,4
1,78.5,67.0,F,69.2,4
1,78.5,67.0,F,69.0,4
1,78.5,67.0,F,69.0,4
2,75.5,66.5,M,73.5,4
2,75.5,66.5,M,72.5,4
2,75.5,66.5,F,65.5,4
2,75.5,66.5,F,65.5,4
3,75.0,64.0,M,71.0,2
3,75.0,64.0,F,68.0,2


In [62]:
family_height = family_height.with_columns('Midparentheight', (family_height.column('Father')+family_height.column('Mother'))/2)
heights = Table().with_columns(
    'MidParent', family_height.column('Midparentheight'),
    'Child', family_height.column('Height')
    )
heights

MidParent,Child
72.75,73.2
72.75,69.2
72.75,69.0
72.75,69.0
71.0,73.5
71.0,72.5
71.0,65.5
71.0,65.5
69.5,71.0
69.5,68.0


There does indeed seem to be a linear relation. Let's see if we can find a way to identify the regression line. First, notice that linear association doesn't depend on the units of measurement – we might as well measure both variables in standard units.


In [46]:
def standard_units(xyz):
    "Convert any array of numbers to standard units."
    return (xyz - np.mean(xyz))/np.std(xyz)  

In [47]:
heights_SU = Table().with_columns(
    'MidParent SU', standard_units(heights.column('MidParent')),
    'Child SU', standard_units(heights.column('Child'))
)
heights_SU


MidParent SU,Child SU
3.48071,1.79823
3.48071,0.681196
3.48071,0.625344
3.48071,0.625344
2.48073,1.882
2.48073,1.60275
2.48073,-0.352057
2.48073,-0.352057
1.62361,1.18386
1.62361,0.346087


On this scale, we can calculate our predictions. But first we have to define which points should be labelled as "close points" on this scale. We can say that midparent heights are "close" if they are within 0.5 inches of each other. Since standard units measure distances in units of SDs, we have to figure out how many SDs of midparent height correspond to 0.5 inches. One SD of midparent heights is about 1.75 inches, so 0.5 inches is about 0.29 SDs.


In [48]:
sd_midparent = np.std(heights.column(0))
sd_midparent

1.7500351938096907

In [49]:
0.5/sd_midparent

0.28570853990172551

We can now make a prediction function.

In [51]:
def predict_child_su(mpht_su):
    """Return a prediction of the height (in standard units) of a child 
    whose parents have a midparent height of mpht_su in standard units.
    """
    close = 0.5/sd_midparent
    close_points = heights_SU.where('MidParent SU', are.between(mpht_su-close, mpht_su + close))
    return close_points.column('Child SU').mean()   

In [52]:
heights_with_su_predictions = heights_SU.with_column(
    'Prediction SU', heights_SU.apply(predict_child_su, 'MidParent SU')
    )

In [53]:
heights_with_su_predictions.scatter('MidParent SU')
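
To make the idea of predicting from "close points" concrete without the table machinery, here is a minimal NumPy sketch. The function name `nearby_average` and the six data points are hypothetical, chosen only for illustration. Like `are.between`, the window includes its lower end and excludes its upper end.

```python
import numpy as np

def nearby_average(x, y, x0, half_width):
    """Average the y values whose x lies in [x0 - half_width, x0 + half_width)."""
    close = (x >= x0 - half_width) & (x < x0 + half_width)
    return np.mean(y[close])

# Hypothetical data in standard units, for illustration only.
x = np.array([-1.0, -0.9, 0.0, 0.1, 1.0, 1.2])
y = np.array([-0.5, -0.3, 0.1, -0.1, 0.4, 0.6])

# The points near x0 = 0 are x = 0.0 and x = 0.1,
# so the prediction is the mean of their y values, 0.1 and -0.1.
prediction = nearby_average(x, y, 0.0, 0.3)
```

Applying such a function at every *x* in the data traces out the graph of averages.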

As we have already learnt in the worksheet, in regression, we use the value of one variable (which we will call *x*) to predict the value of another (which we will call *y*). When the variables *x* and *y* are measured in standard units, the regression line for predicting *y* based on *x* has slope *r* and passes through the origin.
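
Converting that statement back to the original units gives the slope and intercept of the regression line. As a worked step, writing $\bar{x}$, $\bar{y}$ for the averages and $SD_x$, $SD_y$ for the standard deviations (notation introduced here for illustration), and $\hat{y}$ for the estimate of *y*:

$$
\frac{\hat{y} - \bar{y}}{SD_y} = r \cdot \frac{x - \bar{x}}{SD_x}
\quad\Longrightarrow\quad
\hat{y} = \underbrace{r \cdot \frac{SD_y}{SD_x}}_{\text{slope}} \cdot x \;+\; \underbrace{\left(\bar{y} - \text{slope} \cdot \bar{x}\right)}_{\text{intercept}}
$$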

The three functions below compute the correlation, slope, and intercept. All of them take three arguments: the name of the table, the label of the column containing *x*, and the label of the column containing *y*.

In [54]:
def correlation(t, label_x, label_y):
    return np.mean(standard_units(t.column(label_x))*standard_units(t.column(label_y)))

def slope(t, label_x, label_y):
    r = correlation(t, label_x, label_y)
    return r*np.std(t.column(label_y))/np.std(t.column(label_x))

def intercept(t, label_x, label_y):
    return np.mean(t.column(label_y)) - slope(t, label_x, label_y)*np.mean(t.column(label_x))
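
One way to check formulas like these is to compare them with NumPy's built-in routines. The sketch below rewrites the three functions above for plain arrays and runs them on the ten rows displayed earlier. That is an assumption made for the sake of a self-contained example: the full table has more rows, so the numbers here differ from those for the whole data set. It relies on the regression line coinciding with the least-squares line, which is what a degree-1 `np.polyfit` computes.

```python
import numpy as np

# The ten rows displayed earlier (a small sample, not the full table).
midparent = np.array([72.75, 72.75, 72.75, 72.75, 71.0, 71.0, 71.0, 71.0, 69.5, 69.5])
child = np.array([73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 68.0])

def standard_units(values):
    """Convert an array of numbers to standard units."""
    return (values - np.mean(values)) / np.std(values)

# Same formulas as correlation, slope and intercept, written for plain arrays.
r = np.mean(standard_units(midparent) * standard_units(child))
slope = r * np.std(child) / np.std(midparent)
intercept = np.mean(child) - slope * np.mean(midparent)

# NumPy's equivalents: the Pearson correlation and a degree-1 least-squares fit.
r_numpy = np.corrcoef(midparent, child)[0, 1]
slope_numpy, intercept_numpy = np.polyfit(midparent, child, 1)

print(np.isclose(r, r_numpy), np.isclose(slope, slope_numpy), np.isclose(intercept, intercept_numpy))
```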

The correlation between midparent height and child's height is 0.33:

In [56]:
family_r = correlation(heights, 'MidParent', 'Child')
family_r

0.32707394987259658

We can also find the equation of the regression line for predicting the child's height based on midparent height.

In [57]:
family_slope = slope(heights, 'MidParent', 'Child')
family_intercept = intercept(heights, 'MidParent', 'Child')
family_slope, family_intercept

(0.66925889513254522, 22.14880916454139)

The equation of the regression line is

estimate of child's height = 0.67⋅midparent height + 22.15

 
This is also known as the regression equation. The principal use of the regression equation is to predict *y* based on *x*.
For example, for a midparent height of 70.48 inches, the regression equation predicts the child's height to be 69.32 inches.

In [58]:
family_slope*70.48 + family_intercept


69.318176093483174

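Here are all of the rows in our family table, along with our original nearby-average predictions (converted from standard units back to inches) and the new regression predictions of the children's heights.

In [59]:
# Convert the nearby-average predictions from standard units back to inches,
# then add the regression predictions alongside them.
child = heights.column('Child')
heights_with_predictions = heights.with_columns(
    'Prediction', heights_with_su_predictions.column('Prediction SU')*np.std(child) + np.mean(child),
    'Regression Prediction', family_slope*heights.column('MidParent') + family_intercept
)
heights_with_predictions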


MidParent,Child,Prediction,Regression Prediction
72.75,73.2,70.1,70.8374
72.75,69.2,70.1,70.8374
72.75,69.0,70.1,70.8374
72.75,69.0,70.1,70.8374
71.0,73.5,70.4158,69.6662
71.0,72.5,70.4158,69.6662
71.0,65.5,70.4158,69.6662
71.0,65.5,70.4158,69.6662
69.5,71.0,68.5025,68.6623
69.5,68.0,68.5025,68.6623


In [60]:
heights_with_predictions.scatter('MidParent')

The grey dots show the regression predictions, all on the regression line. Notice how the line is very close to the gold graph of averages. For these data, the regression line does a good job of approximating the centers of the vertical strips.
