# Improvement to Galton's regression

In Galton's time, fitting even a simple linear regression was laborious for lack of computing power, and multiple linear regression was practically impossible. Galton therefore multiplied each female height by 1.08 so that he could fit a single linear regression. Here, we show how multiple regression analysis lets us predict female and male heights separately.
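For comparison, Galton's workaround can be sketched in a few lines. This is only an illustration: the heights below are made-up toy values rather than rows from `galton.csv`, and combining the parents into a single "mid-parent" predictor is an assumption made here so that a simple one-predictor regression can be fitted.

```python
import numpy as np

# Toy heights (illustrative values only, not taken from galton.csv).
father = np.array([70.0, 68.5, 72.0, 66.0, 69.0, 71.5])
mother = np.array([64.0, 63.0, 66.5, 62.0, 65.0, 64.5])
child = np.array([70.5, 63.5, 72.0, 62.0, 69.5, 65.0])
is_female = np.array([False, True, False, True, False, True])

# Galton's adjustment: multiply female heights by 1.08.
child_adj = np.where(is_female, child * 1.08, child)

# Assumed single predictor: average of father and adjusted mother height.
midparent = (father + 1.08 * mother) / 2

# One simple linear regression of adjusted child height on the mid-parent height.
slope, intercept = np.polyfit(midparent, child_adj, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The multiple regression fitted later in this notebook replaces the fixed 1.08 factor with a gender term estimated from the data.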

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('Data/galton.csv')
df

Unnamed: 0,Gender,Family,Height,Father,Mother
0,male,1,73.2,78.5,67.0
1,female,1,69.2,78.5,67.0
2,female,1,69.0,78.5,67.0
3,female,1,69.0,78.5,67.0
4,male,2,73.5,75.5,66.5
...,...,...,...,...,...
928,male,205,68.5,68.5,65.0
929,male,205,67.7,68.5,65.0
930,female,205,64.0,68.5,65.0
931,female,205,63.5,68.5,65.0


Create a dummy variable for gender. Here, we let 1 = female and 0 = male.

In [4]:
def function(row):
    # Encode gender as a dummy variable: male -> 0, anything else (female) -> 1
    if row['Gender'] == 'male':
        return 0
    else:
        return 1

In [6]:
df['Gender1'] = df.apply(function, axis=1)
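The same dummy column can also be built without a row-wise `apply`. The sketch below uses a small stand-in frame (`toy` is a hypothetical name, not part of the notebook) so that it runs on its own, and it keeps the same encoding of 1 = female and 0 = male.

```python
import pandas as pd

# Small stand-in frame with the same 'Gender' labels as galton.csv.
toy = pd.DataFrame({'Gender': ['male', 'female', 'female', 'male']})

# Vectorized equivalent of the row-wise function: 0 for 'male', 1 otherwise.
toy['Gender1'] = (toy['Gender'] != 'male').astype(int)
print(toy)
```

On the full data frame the comparison runs over the whole column at once, which is usually faster than calling a Python function per row.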

In [7]:
df

Unnamed: 0,Gender,Family,Height,Father,Mother,Gender1
0,male,1,73.2,78.5,67.0,0
1,female,1,69.2,78.5,67.0,1
2,female,1,69.0,78.5,67.0,1
3,female,1,69.0,78.5,67.0,1
4,male,2,73.5,75.5,66.5,0
...,...,...,...,...,...,...
928,male,205,68.5,68.5,65.0,0
929,male,205,67.7,68.5,65.0,0
930,female,205,64.0,68.5,65.0,1
931,female,205,63.5,68.5,65.0,1


In [8]:
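from statsmodels.regression.linear_model import OLS
import statsmodels.api as sm

# Predictors: both parents' heights plus the gender dummy (1=female, 0=male)
x_vals = df[['Father','Mother','Gender1']]
y_vals = df['Height']

# add_constant adds the intercept column before fitting
reg_model = OLS(y_vals, sm.add_constant(x_vals)).fit()
display(reg_model.summary())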

Dep. Variable:,Height,R-squared:,0.636
Model:,OLS,Adj. R-squared:,0.635
Method:,Least Squares,F-statistic:,540.5
Date:,"Thu, 03 Dec 2020",Prob (F-statistic):,3.51e-203
Time:,05:54:31,Log-Likelihood:,-2042.4
No. Observations:,933,AIC:,4093.0
Df Residuals:,929,BIC:,4112.0
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,21.6512,2.723,7.951,0.000,16.307,26.995
Father,0.3934,0.029,13.718,0.000,0.337,0.450
Mother,0.3184,0.031,10.263,0.000,0.258,0.379
Gender1,-5.2190,0.142,-36.784,0.000,-5.497,-4.941

Omnibus:,11.308,Durbin-Watson:,1.555
Prob(Omnibus):,0.004,Jarque-Bera (JB):,15.628
Skew:,-0.116,Prob(JB):,0.000404
Kurtosis:,3.59,Cond. No.,3630.0
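
To see what the fitted coefficients mean in practice, they can be plugged into the regression equation by hand. The sketch below copies the coefficients from the table above. The helper name `predict_height` and the parent heights of 70 and 65 are hypothetical choices for illustration.

```python
# Coefficients copied from the regression summary above.
coef = {'const': 21.6512, 'Father': 0.3934, 'Mother': 0.3184, 'Gender1': -5.2190}

def predict_height(father, mother, female):
    """Predicted child height; female is 1 for a daughter, 0 for a son."""
    return (coef['const'] + coef['Father'] * father
            + coef['Mother'] * mother + coef['Gender1'] * female)

# Hypothetical parents: father 70, mother 65 (same units as the data).
son = predict_height(70, 65, 0)
daughter = predict_height(70, 65, 1)
print(round(son, 2), round(daughter, 2))  # 69.89 64.67
```

For the same parents, the two predictions differ by exactly the `Gender1` coefficient, which is how the model predicts female and male heights differently without rescaling either group.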


We might also be interested in whether a child's height is affected more by the mother's height or by the father's. Even though the coefficient for Father is greater than that for Mother, we can't automatically conclude that the father's height matters more, because fathers' and mothers' heights may have different spreads, so a one-unit change does not mean the same thing for each parent.

This is an advanced statistical topic, but I hope to show how we can standardize the predictors to compare the coefficients. After standardizing, each slope is the change in predicted height per one-standard-deviation change in that predictor, so we can directly compare the magnitudes of the coefficients. The intercept can then be interpreted as the predicted value of y when every standardized predictor is at its mean.

In [None]:
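# Standardize: subtract the mean, then divide by the standard deviation
df['Father_s'] = (df['Father'] - np.mean(df['Father'])) / np.std(df['Father'])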
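
A fuller version of the standardization idea might look like the sketch below. It is a minimal illustration under stated assumptions: the data are synthetic (generated here, not the Galton data), the helpers `zscore` and `ols` are hypothetical names, and plain NumPy least squares stands in for the statsmodels fit used earlier so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the notebook's columns (not the Galton data).
father = rng.normal(69.0, 2.5, n)
mother = rng.normal(64.0, 2.3, n)
gender1 = rng.integers(0, 2, n)  # 1 = female, 0 = male
height = 20 + 0.4 * father + 0.3 * mother - 5.2 * gender1 + rng.normal(0, 2, n)

def zscore(x):
    """Center to mean 0 and scale to standard deviation 1."""
    return (x - x.mean()) / x.std()

def ols(columns, y):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

raw = ols([father, mother, gender1], height)
scaled = ols([zscore(father), zscore(mother), gender1], height)

# Each standardized slope equals the raw slope times that predictor's standard deviation.
print(scaled[1], raw[1] * father.std())
print(scaled[2], raw[2] * mother.std())
```

Because only the parents' heights are standardized, the gender coefficient is unchanged, and the intercept becomes the predicted height of a male child whose parents are both of average height. In the notebook itself, the same comparison could be made by passing standardized columns such as `Father_s` to `OLS` in place of the raw ones.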