# 6.2 Regression

1. Load the Galton dataset into a Pandas dataframe?
    *  http://www.randomservices.org/random/data/Galton.html
    
2. Summarize the dataset:
    * Number of rows
    * Average height of male/female kids
    * Std deviation of male/female kids
    
3. Create a training and test dataset. The test dataset should be at least 25%.

4. Create 2 regression models: for predicting the childs height based on (i) father height and (ii) mother's height!

5. Compute the model quality parameters: $R^{2}$ and $MSE$! 

6. Create a multi-variate regression model including both the mother and father height as features! How does the $R^{2}$ change?

7. Create a Spark MLlib model for the same task!

References: 
* http://scikit-learn.org/stable/modules/linear_model.html
* http://scikit-learn.org/stable/model_selection.html
* <http:///pygot.wordpress.com/2017/03/25/simple-linear-regression-with-galton/>
* <https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#linear-regression>

In [1]:
%matplotlib inline
import csv
import requests # pip install requests for easy http request for CSV data
import numpy as np
import pandas as pd

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, mean_squared_error, r2_score
from sklearn import linear_model

In [3]:
df = pd.read_csv("http://www.randomservices.org/random/data/Galton.txt", sep="\t")

In [4]:
df.head(5)

Unnamed: 0,Family,Father,Mother,Gender,Height,Kids
0,1,78.5,67.0,M,73.2,4
1,1,78.5,67.0,F,69.2,4
2,1,78.5,67.0,F,69.0,4
3,1,78.5,67.0,F,69.0,4
4,2,75.5,66.5,M,73.5,4


**Simple summarization:**

In [5]:
df.groupby("Gender")["Height"].agg(["mean", "std", "count"])

Unnamed: 0_level_0,mean,std,count
Gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
F,64.110162,2.37032,433
M,69.228817,2.631594,465


**Split in train test:**

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,1:3], df.Height, test_size=0.25, random_state=42)

In [7]:
X_train.head()

Unnamed: 0,Father,Mother
377,70.5,62.0
357,70.5,63.0
723,67.0,64.0
306,70.0,64.7
464,69.0,66.0


**Create model feature wise:**

In [8]:
regr = linear_model.LinearRegression()
father_model = regr.fit(X_train[["Father"]], y_train)
mother_model = regr.fit(X_train[["Mother"]], y_train)

In [9]:
X_test.head()

Unnamed: 0,Father,Mother
331,70.5,64.5
638,68.0,63.0
326,70.5,64.0
848,65.0,64.0
39,74.0,62.0


**Evaluate both models with $mse$ and $r^2$:**

In [10]:
pred_father = father_model.predict(X_test[["Father"]])
pred_mother = mother_model.predict(X_test[["Mother"]])

In [34]:
print("Father model mse: %f" % mean_squared_error(y_test, pred_father))
print("Mother model mse: %f" % mean_squared_error(y_test, pred_mother))

Father model mse: 14.209773
Mother model mse: 11.439403


In [35]:
print("Father model r2_score: %f" % r2_score(y_test, pred_father))
print("Mother model r2_score: %f" % r2_score(y_test, pred_mother))

Father model r2_score: -0.177897
Mother model r2_score: 0.051749


**Create and evaluate model on all features:**

In [36]:
model = regr.fit(X_train, y_train)

In [37]:
predictions = model.predict(X_test)

In [40]:
print("MV model r2_score: %f" % r2_score(y_test, predictions))
print("MV model mse: %f" % mean_squared_error(y_test, predictions))

MV model r2_score: 0.076408
MV model mse: 11.141921


**MLLib model:**

In [64]:
# Initialize PySpark
import os, sys
APP_NAME = "PySpark Lecture"
SPARK_MASTER="local[1]"
import pyspark
import pyspark.sql
from pyspark.sql import SQLContext
from pyspark.sql import Row
conf=pyspark.SparkConf()
conf=pyspark.SparkConf().setAppName(APP_NAME).set("spark.local.dir", os.path.join(os.getcwd(), "tmp"))
sc = pyspark.SparkContext(master=SPARK_MASTER, conf=conf)
sqlContext = SQLContext(sc)
spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()

print("PySpark initiated...")

PySpark initiated...


In [78]:
galton = spark.createDataFrame(df)

In [79]:
galton.take(1)

[Row(Family='1', Father=78.5, Mother=67.0, Gender='M', Height=73.2, Kids=4)]

In [88]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

In [82]:
vecAssembler = VectorAssembler(inputCols = ['Father', 'Mother'], outputCol = 'features')
vgalton = vecAssembler.transform(galton)
vgalton = vgalton.select(['features', 'Height'])
vgalton.show(3)

+-----------+------+
|   features|Height|
+-----------+------+
|[78.5,67.0]|  73.2|
|[78.5,67.0]|  69.2|
|[78.5,67.0]|  69.0|
+-----------+------+
only showing top 3 rows



In [83]:
splits = vgalton.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]

In [85]:
lr = LinearRegression(featuresCol = 'features', labelCol='Height', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [0.3199594585266382,0.18790265706695397]
Intercept: 32.57387419240611


In [86]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 3.420892
r2: 0.115522


In [87]:
train_df.describe().show()

+-------+-----------------+
|summary|           Height|
+-------+-----------------+
|  count|              622|
|   mean|66.77765273311896|
| stddev|3.640366674076068|
|    min|             56.0|
|    max|             79.0|
+-------+-----------------+



In [89]:
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction","Height","features").show(5)

lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="Height",metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+-----------------+------+-----------+
|       prediction|Height|   features|
+-----------------+------+-----------+
|64.81293598747664|  61.0|[62.0,66.0]|
|64.81293598747664|  64.0|[62.0,66.0]|
|64.32543896212819|  66.0|[64.0,60.0]|
|  65.077049590396|  64.0|[64.0,64.0]|
|  65.077049590396|  64.0|[64.0,64.0]|
+-----------------+------+-----------+
only showing top 5 rows

R Squared (R2) on test data = 0.0716247


In [90]:
test_result = lr_model.evaluate(test_df)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on test data = 3.32402


In [91]:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()

numIterations: 5
objectiveHistory: [0.5000000000000284, 0.49043996158964853, 0.4653683837864802, 0.46536724312109595, 0.4653672190027659]
+--------------------+
|           residuals|
+--------------------+
|  -2.812935987476635|
|  -7.409207745539106|
|  2.0907922544608937|
|  -4.325438962128189|
| -0.3891469333290587|
|  -5.077049590396001|
|  -4.077049590396001|
|  -3.077049590396001|
|  -3.077049590396001|
| -1.0770495903960011|
|-0.07704959039600112|
|  1.9229504096039989|
|   2.922950409603999|
|   2.922950409603999|
|   5.422950409603999|
|   6.422950409603999|
| -2.2091063918556983|
|  -4.897009048922655|
|  -2.897009048922655|
|  -1.397009048922655|
+--------------------+
only showing top 20 rows

