# 6.2 Regression

1. Load the Galton dataset into a Pandas dataframe?
    *  http://www.randomservices.org/random/data/Galton.html
    
2. Summarize the dataset:
    * Number of rows
    * Average height of male/female kids
    * Std deviation of male/female kids
    
3. Create a training and test dataset. The test dataset should be at least 25%.

4. Create 2 regression models: for predicting the childs height based on (i) father height and (ii) mother's height!

5. Compute the model quality parameters: $R^{2}$ and $MSE$! 

6. Create a multi-variate regression model including both the mother and father height as features! How does the $R^{2}$ change?

7. Create a Spark MLlib model for the same task!

References: 
* http://scikit-learn.org/stable/modules/linear_model.html
* http://scikit-learn.org/stable/model_selection.html
* <http:///pygot.wordpress.com/2017/03/25/simple-linear-regression-with-galton/>
* <https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#linear-regression>

In [1]:
%matplotlib inline
import csv
import requests # pip install requests for easy http request for CSV data
import numpy as np
import pandas as pd

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score
from sklearn import linear_model

In [3]:
df = pd.read_csv("http://www.randomservices.org/random/data/Galton.txt", sep="\t")

In [4]:
df.head(5)

Unnamed: 0,Family,Father,Mother,Gender,Height,Kids
0,1,78.5,67.0,M,73.2,4
1,1,78.5,67.0,F,69.2,4
2,1,78.5,67.0,F,69.0,4
3,1,78.5,67.0,F,69.0,4
4,2,75.5,66.5,M,73.5,4


In [5]:
df.groupby("Gender")["Height"].agg(["mean", "std", "count"])

Unnamed: 0_level_0,mean,std,count
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
F,64.110162,2.37032,433
M,69.228817,2.631594,465


In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,1:3], df.Height, test_size=0.25, random_state=42)

In [7]:
X_train.head()

Unnamed: 0,Father,Mother
377,70.5,62.0
357,70.5,63.0
723,67.0,64.0
306,70.0,64.7
464,69.0,66.0


In [8]:
regr = linear_model.LinearRegression()
father_model = regr.fit(X_train[["Father"]], y_train)
mother_model = regr.fit(X_train[["Mother"]], y_train)

In [9]:
X_test.head()

Unnamed: 0,Father,Mother
331,70.5,64.5
638,68.0,63.0
326,70.5,64.0
848,65.0,64.0
39,74.0,62.0


In [10]:
pred_father = father_model.predict(X_test[["Father"]])
pred_mother = mother_model.predict(X_test[["Mother"]])

In [11]:
pred_father[:5]

array([68.66308609, 67.93415922, 68.66308609, 67.05944697, 69.68358372])

In [12]:
pred_mother[:5]

array([66.9136616 , 66.47630547, 66.76787622, 66.76787622, 66.18473472])

In [18]:
print("Father model mse: %f" % mean_squared_error(y_test, pred_father))
print("Mother model mse: %f" % mean_squared_error(y_test, pred_mother))

Father model mse: 14.209773
Mother model mse: 11.439403


In [20]:
print("Father model r2_score: %f" % r2_score(y_test, pred_father))
print("Mother model r2_score: %f" % r2_score(y_test, pred_mother))

Father model r2_score: -0.177897
Mother model r2_score: 0.051749


In [22]:
model = regr.fit(X_train, y_train)

In [23]:
predictions = model.predict(X_test)

In [24]:
print("MV model r2_score: %f" % r2_score(y_test, predictions))

MV model r2_score: 0.076408
