
![Apache Spark Logo](https://s3-us-west-2.amazonaws.com/gmedasani-chicago-escape-conference/setup/images/spark-logo.png) 
![Machine Learning Image](https://s3-us-west-2.amazonaws.com/gmedasani-chicago-escape-conference/setup/images/machine-learning.png)


# Machine Learning - Linear Regression

This section covers a common supervised learning pipeline, using a subset of the [Million Song Dataset](http://labrosa.ee.columbia.edu/millionsong/) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD). Our goal is to train a linear regression model to predict the release year of a song given a set of audio features.


This exercise will cover: 

* Part 1: Explore the dataset
* Part 2: Feature Engineering
* Part 3: Create and evaluate a baseline model
* Part 4: Create a Linear Regression Model
* Part 5: Choose the best model by hyperparameter tuning


For reference, you can look up the details of the relevant Spark methods in [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD) and the relevant NumPy methods in the [NumPy Reference](http://docs.scipy.org/doc/numpy/reference/index.html).

## Part 1: Explore the dataset

### Initial setup

* Check if spark context is available
* Import needed libraries
* Define the dataset path (Use

In [96]:
print sc

<pyspark.context.SparkContext object at 0x7f0c9b219c10>


In [1]:
import os.path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Set studentid to the student ID provided in the instructions
studentid = 'student1'
filepath = '/home/'+studentid+'/2016-escape-conference-chicago/datasets/millionsong.txt'
print filepath

/home/student1/2016-escape-conference-chicago/datasets/millionsong.txt


### Load the dataset

* Define the schema for the dataframe
* Using SqlContext, create a millionsongs dataframe
* Explore the millionsongs dataframe

#### Define the schema for the millionsongs dataset

In [3]:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType,StructField,DoubleType

In [4]:
millionSongSchema = StructType([
        StructField("year", DoubleType(), True),
        StructField("f1", DoubleType(), True),
        StructField("f2", DoubleType(), True),
        StructField("f3", DoubleType(), True),
        StructField("f4", DoubleType(), True),
        StructField("f5", DoubleType(), True),
        StructField("f6", DoubleType(), True),
        StructField("f7", DoubleType(), True),
        StructField("f8", DoubleType(), True),
        StructField("f9", DoubleType(), True),
        StructField("f10", DoubleType(), True),
        StructField("f11", DoubleType(), True),
        StructField("f12", DoubleType(), True)])

#### Using sqlContext, create a million songs dataframe

We are using a Spark package called [spark-csv: CSV Data Source for Spark](http://spark-packages.org/package/databricks/spark-csv).

In [5]:
# The explicit schema is applied on load, so schema inference is not needed
millionsongs_raw_df = (sqlContext.read.format('com.databricks.spark.csv')
                        .options(header='false')
                        .load(filepath, schema=millionSongSchema)
                       )

#### Explore the millionsongs dataframe

In [6]:
millionsongs_raw_df.printSchema()

root
 |-- year: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)



In [7]:
millionsongs_raw_df.head(5)

[Row(year=2001.0, f1=0.884123733793, f2=0.610454259079, f3=0.600498416968, f4=0.474669212493, f5=0.247232680947, f6=0.357306088914, f7=0.344136412234, f8=0.339641227335, f9=0.600858840135, f10=0.425704689024, f11=0.60491501652, f12=0.419193351817),
 Row(year=2001.0, f1=0.854411946129, f2=0.604124786151, f3=0.593634078776, f4=0.495885413963, f5=0.266307830936, f6=0.261472105188, f7=0.506387076327, f8=0.464453565511, f9=0.665798573683, f10=0.542968988766, f11=0.58044428577, f12=0.445219373624),
 Row(year=2001.0, f1=0.908982970575, f2=0.632063159227, f3=0.557428975183, f4=0.498263761394, f5=0.276396052336, f6=0.312809861625, f7=0.448530069406, f8=0.448674249968, f9=0.649791323916, f10=0.489868662682, f11=0.591908113534, f12=0.4500023818),
 Row(year=2001.0, f1=0.842525219898, f2=0.561826888508, f3=0.508715259692, f4=0.443531142139, f5=0.296733836002, f6=0.250213568176, f7=0.488540873206, f8=0.360508747659, f9=0.575435243185, f10=0.361005878554, f11=0.678378718617, f12=0.409036786173),
 Row

In [141]:
millionsongs_raw_df.count()

6724

In [142]:
millionsongs_raw_df.describe(['year','f1','f2','f3']).show()

+-------+------------------+-------------------+-------------------+-------------------+
|summary|              year|                 f1|                 f2|                 f3|
+-------+------------------+-------------------+-------------------+-------------------+
|  count|              6724|               6724|               6724|               6724|
|   mean|1975.7850981558597| 0.6619961470749578| 0.5339408278096173|  0.469804021937418|
| stddev|  21.4198604561239|0.15636089565238775|0.12683383787039013|0.09394144758944296|
|    min|            1922.0|                0.0|                0.0|                0.0|
|    max|            2011.0|                1.0|                1.0|                1.0|
+-------+------------------+-------------------+-------------------+-------------------+



## Part 2: Feature Engineering

As part of the feature engineering, we will perform the following four steps:

* Shift the labels (year)
* Create a dataframe with vectorized features
* Create interaction features and polynomial features
* Create train, test and validation datasets

#### Shift Labels

As we just saw, the labels are years in the 1900s and 2000s. In learning problems, it is often natural to shift labels so that they start from zero.
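A minimal plain-Python sketch of the shift, using a few made-up year values (the list `toy_years` is an assumption for illustration and is not read from the dataset). Subtracting the smallest label makes the labels start at zero, and adding it back recovers a calendar year from a value on the shifted scale.

```python
# Illustrative labels only; the real labels come from the 'year' column.
toy_years = [2001.0, 1922.0, 1975.0, 2011.0]

# sorted(...)[0] is used instead of the built-in min() so the sketch still
# works in a session where Spark's min() has been imported under that name.
toy_min_year = sorted(toy_years)[0]

# Shifted labels start at zero.
toy_shifted = [y - toy_min_year for y in toy_years]

# A value on the shifted scale maps back to a year by adding the minimum.
toy_restored = [s + toy_min_year for s in toy_shifted]

print(toy_shifted)
```

The same subtraction is what the `withColumn` call performs on the whole `year` column.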

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn

In [143]:
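# Note: this import shadows Python's built-in min and max for the rest of the session
from pyspark.sql.functions import mean, min, max
# first() returns the single result Row; index 0 is the aggregated value
minYear = millionsongs_raw_df.select(min('year')).first()[0]
print 'Minimum year is : '+str(minYear)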

Minimum year is : 1922.0


In [144]:
millionsongs_shifted_df = millionsongs_raw_df.withColumn('year', millionsongs_raw_df.year - minYear)

In [145]:
millionsongs_shifted_df.head(5)

[Row(year=79.0, f1=0.884123733793, f2=0.610454259079, f3=0.600498416968, f4=0.474669212493, f5=0.247232680947, f6=0.357306088914, f7=0.344136412234, f8=0.339641227335, f9=0.600858840135, f10=0.425704689024, f11=0.60491501652, f12=0.419193351817),
 Row(year=79.0, f1=0.854411946129, f2=0.604124786151, f3=0.593634078776, f4=0.495885413963, f5=0.266307830936, f6=0.261472105188, f7=0.506387076327, f8=0.464453565511, f9=0.665798573683, f10=0.542968988766, f11=0.58044428577, f12=0.445219373624),
 Row(year=79.0, f1=0.908982970575, f2=0.632063159227, f3=0.557428975183, f4=0.498263761394, f5=0.276396052336, f6=0.312809861625, f7=0.448530069406, f8=0.448674249968, f9=0.649791323916, f10=0.489868662682, f11=0.591908113534, f12=0.4500023818),
 Row(year=79.0, f1=0.842525219898, f2=0.561826888508, f3=0.508715259692, f4=0.443531142139, f5=0.296733836002, f6=0.250213568176, f7=0.488540873206, f8=0.360508747659, f9=0.575435243185, f10=0.361005878554, f11=0.678378718617, f12=0.409036786173),
 Row(year=79

#### Create a dataframe with vectorized features

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler

A feature transformer that merges multiple columns into a vector column.

In [146]:
features_list = ['f1','f2','f3','f4','f5','f6','f7','f8','f9','f10','f11','f12']
target = 'year'

In [147]:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [148]:
assembler = VectorAssembler(
    inputCols=features_list,
    outputCol="features")
millionsongs_df = assembler.transform(millionsongs_shifted_df)

In [149]:
millionsongs_df.printSchema()

root
 |-- year: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)
 |-- features: vector (nullable = true)



In [150]:
millionsongs_df.head(2)

[Row(year=79.0, f1=0.884123733793, f2=0.610454259079, f3=0.600498416968, f4=0.474669212493, f5=0.247232680947, f6=0.357306088914, f7=0.344136412234, f8=0.339641227335, f9=0.600858840135, f10=0.425704689024, f11=0.60491501652, f12=0.419193351817, features=DenseVector([0.8841, 0.6105, 0.6005, 0.4747, 0.2472, 0.3573, 0.3441, 0.3396, 0.6009, 0.4257, 0.6049, 0.4192])),
 Row(year=79.0, f1=0.854411946129, f2=0.604124786151, f3=0.593634078776, f4=0.495885413963, f5=0.266307830936, f6=0.261472105188, f7=0.506387076327, f8=0.464453565511, f9=0.665798573683, f10=0.542968988766, f11=0.58044428577, f12=0.445219373624, features=DenseVector([0.8544, 0.6041, 0.5936, 0.4959, 0.2663, 0.2615, 0.5064, 0.4645, 0.6658, 0.543, 0.5804, 0.4452]))]

#### Create interaction features and Polynomial features

Perform feature expansion in a polynomial space. As the Wikipedia article on [polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) puts it, “In mathematics, an expansion of a product of sums expresses it as a sum of products by using the fact that multiplication distributes over addition”. Take a 2-variable feature vector (x, y) as an example: expanding it with degree 2 gives (x, x * x, y, x * y, y * y).
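The degree-2 ordering in the (x, y) example can be sketched in plain Python. The helper `expand_degree2` below is hypothetical and not part of Spark. It only mirrors the ordering of that example: each feature is followed by its products with every feature up to and including itself.

```python
def expand_degree2(x):
    """Degree-2 expansion in the order (x, x*x, y, x*y, y*y)."""
    out = []
    for i in range(len(x)):
        out.append(x[i])                 # linear term x_i
        for j in range(i + 1):
            out.append(x[j] * x[i])      # second-order terms x_j * x_i, j <= i
    return out

# Two features (x=2, y=3) expand to five terms.
pair_terms = expand_degree2([2.0, 3.0])
print(pair_terms)

# Twelve features expand to 12 linear terms plus 78 second-order terms.
n_terms = len(expand_degree2([1.0] * 12))
print(n_terms)
```

Under this ordering the expanded vector grows quadratically with the number of input features, which is worth keeping in mind before raising the degree.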

http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.PolynomialExpansion

In [151]:
from pyspark.ml.feature import PolynomialExpansion
px = PolynomialExpansion(degree=2, inputCol="features", outputCol="polyFeatures")

In [152]:
millionsongs_poly_df = px.transform(millionsongs_df)

In [153]:
millionsongs_poly_df.printSchema()

root
 |-- year: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- polyFeatures: vector (nullable = true)



In [155]:
millionsongs_poly_df.head(1)

[Row(year=79.0, f1=0.884123733793, f2=0.610454259079, f3=0.600498416968, f4=0.474669212493, f5=0.247232680947, f6=0.357306088914, f7=0.344136412234, f8=0.339641227335, f9=0.600858840135, f10=0.425704689024, f11=0.60491501652, f12=0.419193351817, features=DenseVector([0.8841, 0.6105, 0.6005, 0.4747, 0.2472, 0.3573, 0.3441, 0.3396, 0.6009, 0.4257, 0.6049, 0.4192]), polyFeatures=DenseVector([0.8841, 0.7817, 0.6105, 0.5397, 0.3727, 0.6005, 0.5309, 0.3666, 0.3606, 0.4747, 0.4197, 0.2898, 0.285, 0.2253, 0.2472, 0.2186, 0.1509, 0.1485, 0.1174, 0.0611, 0.3573, 0.3159, 0.2181, 0.2146, 0.1696, 0.0883, 0.1277, 0.3441, 0.3043, 0.2101, 0.2067, 0.1634, 0.0851, 0.123, 0.1184, 0.3396, 0.3003, 0.2073, 0.204, 0.1612, 0.084, 0.1214, 0.1169, 0.1154, 0.6009, 0.5312, 0.3668, 0.3608, 0.2852, 0.1486, 0.2147, 0.2068, 0.2041, 0.361, 0.4257, 0.3764, 0.2599, 0.2556, 0.2021, 0.1052, 0.1521, 0.1465, 0.1446, 0.2558, 0.1812, 0.6049, 0.5348, 0.3693, 0.3633, 0.2871, 0.1496, 0.2161, 0.2082, 0.2055, 0.3635, 0.2575, 0.365

#### Create Train, Test and Validation datasets

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit
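Because rows are assigned to splits at random, the resulting sizes are only approximately proportional to the weights. The following is a rough plain-Python sketch of that idea, not Spark's actual implementation. The hypothetical helper `random_split_sketch` normalizes the weights, draws one uniform number per row, and places the row in the split whose cumulative-weight interval contains the draw.

```python
import random

def random_split_sketch(rows, weights, seed):
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. about [0.7, 0.85, 1.0] for 0.7/0.15/0.15
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for k, upper in enumerate(bounds):
            if draw < upper:
                splits[k].append(row)
                break
    return splits

# 6724 stand-in rows, the same count as the dataset in this section
toy_parts = random_split_sketch(list(range(6724)), [0.7, 0.15, 0.15], seed=10)
toy_sizes = [len(part) for part in toy_parts]
print(toy_sizes)
```

Every row lands in exactly one split, so the sizes always add up to the original row count even though the individual sizes vary with the seed.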

In [167]:
train_df, valid_df, test_df = millionsongs_df.randomSplit(weights=[0.7,0.15,0.15], seed=10)

In [168]:
train_poly_df, valid_poly_df, test_poly_df = millionsongs_poly_df.randomSplit(weights=[0.7,0.15,0.15], seed=10)

In [169]:
print "Training Dataset size: " + str(train_df.count())
print "Valid Dataset size: " + str(valid_df.count())
print "Test Dataset size: " + str(test_df.count())

Training Dataset size: 4765
Valid Dataset size: 981
Test Dataset size: 978


In [170]:
print "Poly Training Dataset size: " + str(train_poly_df.count())
print "Poly Valid Dataset size: " + str(valid_poly_df.count())
print "Poly Test Dataset size: " + str(test_poly_df.count())

Poly Training Dataset size: 4765
Poly Valid Dataset size: 981
Poly Test Dataset size: 978


In [171]:
train_df.printSchema()

root
 |-- year: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)
 |-- features: vector (nullable = true)



## Part 3: Create a Baseline model

In this section, we will create a baseline model and evaluate that model with validation and test datasets. 

* Create an average model
* Calculate the RMSE on the validation and test datasets based on the predictions of the average model

#### Average model

A very simple yet natural baseline model is one where we always make the same prediction regardless of the data point, using the average label in the training set as the constant prediction value. Compute this value, which is the average (shifted) song year for the training set.

In [163]:
averageTrainYear = train_df.select(mean('year')).first()[0]
print averageTrainYear

53.766841553


#### Root mean squared error

Naturally, we would like to see how well this naive baseline performs. We will use root mean squared error (RMSE) for evaluation purposes.
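RMSE is the square root of the average squared difference between labels and predictions. A minimal plain-Python sketch, using made-up toy labels (the names `rmse_sketch`, `toy_train` and `toy_valid` are assumptions for illustration), shows how the constant baseline is scored:

```python
import math

def rmse_sketch(labels, predictions):
    """Square root of the average squared difference."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative shifted labels, not taken from the dataset
toy_train = [10.0, 20.0, 30.0]
toy_valid = [14.0, 26.0]

# Baseline: always predict the training-set average
toy_constant = sum(toy_train) / len(toy_train)
toy_baseline_rmse = rmse_sketch(toy_valid, [toy_constant] * len(toy_valid))
print("constant=%s rmse=%s" % (toy_constant, toy_baseline_rmse))
```

Note that the constant is computed from the training labels only and then reused unchanged on the validation labels.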

`lit` creates a Column of literal value, which we use to attach the constant prediction to every row: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.lit

`RegressionEvaluator` is an evaluator for regression that expects two input columns: prediction and label. It supports the following metrics:
* mse
* rmse
* r2
* mae

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.RegressionEvaluator

In [172]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import lit

In [173]:
# Attach the constant baseline prediction to every row of each dataset
predictions_train_baseModel = train_df.withColumn('year_predicted', lit(averageTrainYear))
predictions_valid_baseModel = valid_df.withColumn('year_predicted', lit(averageTrainYear))
predictions_test_baseModel = test_df.withColumn('year_predicted', lit(averageTrainYear))

We can now see the `year_predicted` column, which is `averageTrainYear` repeated for every record in the training, validation and test datasets.

In [176]:
predictions_train_baseModel.printSchema()

root
 |-- year: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- year_predicted: double (nullable = false)



In [181]:
predictions_train_baseModel.head(2)

[Row(year=8.0, f1=0.169859883053, f2=0.347163029706, f3=0.310357993388, f4=0.183068614542, f5=0.319387467117, f6=0.615347306265, f7=0.234347964383, f8=0.655487922947, f9=0.312231493679, f10=0.622081983507, f11=0.365167752259, f12=0.401388756294, features=DenseVector([0.1699, 0.3472, 0.3104, 0.1831, 0.3194, 0.6153, 0.2343, 0.6555, 0.3122, 0.6221, 0.3652, 0.4014]), year_predicted=53.766841553),
 Row(year=11.0, f1=0.530393989851, f2=0.205627239023, f3=0.534139321378, f4=0.404864005298, f5=0.122195907889, f6=0.660395414619, f7=0.267234514594, f8=0.525681714294, f9=0.458256825104, f10=0.503793687198, f11=0.568304391237, f12=0.395315655212, features=DenseVector([0.5304, 0.2056, 0.5341, 0.4049, 0.1222, 0.6604, 0.2672, 0.5257, 0.4583, 0.5038, 0.5683, 0.3953]), year_predicted=53.766841553)]

Now, let's calculate the training, validation and test RMSE for the average baseline model.

In [182]:
evaluator = RegressionEvaluator(labelCol='year', predictionCol='year_predicted', metricName='rmse')
trainError_baseModel = evaluator.evaluate(predictions_train_baseModel)
validError_baseModel = evaluator.evaluate(predictions_valid_baseModel)
testError_baseModel = evaluator.evaluate(predictions_test_baseModel)

In [183]:
print "Baseline Model Training Error : "+ str(trainError_baseModel)
print "Baseline Model Validation Error : "+ str(validError_baseModel)
print "Baseline Model Test Error is : "+ str(testError_baseModel)

Baseline Model Training Error : 21.3205877722
Baseline Model Validation Error : 21.9055648831
Baseline Model Test Error is : 21.3987891177


## Part 4: Create a Linear Regression Model

We now have some idea about the validation and test errors from the baseline. But we should be able to do better by using a linear regression model.

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression

Here we will use the L-BFGS solver.

http://spark.apache.org/docs/latest/mllib-optimization.html#limited-memory-bfgs-l-bfgs

**Limited-memory BFGS (L-BFGS)**

L-BFGS is an optimization algorithm in the family of quasi-Newton methods for solving optimization problems of the form $\min_{w \in \mathbb{R}^d} f(w)$. The L-BFGS method approximates the objective function locally as a quadratic without evaluating the second partial derivatives of the objective function to construct the Hessian matrix. Instead, the Hessian matrix is approximated from previous gradient evaluations, so L-BFGS avoids the vertical scalability issue (in the number of training features) that arises when Newton's method computes the Hessian matrix explicitly. As a result, L-BFGS often converges faster than first-order optimization methods.
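To make the idea of approximating curvature from previous gradients concrete, here is a small plain-Python sketch of L-BFGS on a toy quadratic. It is an illustration under stated assumptions, not Spark's solver: the names `lbfgs_sketch`, `toy_f` and `toy_grad` are hypothetical, the line search is simple Armijo backtracking, and the history keeps only the last few step and gradient-change pairs.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def lbfgs_sketch(f, grad, w0, memory=5, max_iter=200, tol=1e-8):
    w, g = list(w0), grad(w0)
    history = []  # recent (s, y) pairs: s = step taken, y = change in gradient
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        # Two-loop recursion: apply the approximate inverse Hessian to g
        q, alphas = list(g), []
        for s, y in reversed(history):
            a = dot(s, q) / dot(y, s)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y = history[-1]
            scale = dot(s, y) / dot(y, y)  # initial curvature guess
        else:
            scale = 1.0
        r = [scale * qi for qi in q]
        for (s, y), a in zip(history, reversed(alphas)):
            b = dot(y, r) / dot(y, s)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        d = [-ri for ri in r]  # search direction
        # Backtracking line search (Armijo condition)
        t, fw, slope = 1.0, f(w), dot(g, d)
        while t > 1e-12:
            trial = [wi + t * di for wi, di in zip(w, d)]
            if f(trial) <= fw + 1e-4 * t * slope:
                break
            t *= 0.5
        g_trial = grad(trial)
        s = [tn - wo for tn, wo in zip(trial, w)]
        y = [gn - go for gn, go in zip(g_trial, g)]
        if dot(y, s) > 1e-12:  # keep only pairs with positive curvature
            history.append((s, y))
            history = history[-memory:]
        w, g = trial, g_trial
    return w

# Toy objective with its minimum at (1, -2)
def toy_f(w):
    return (w[0] - 1.0) ** 2 + 10.0 * (w[1] + 2.0) ** 2

def toy_grad(w):
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]

w_opt = lbfgs_sketch(toy_f, toy_grad, [0.0, 0.0])
print(w_opt)
```

No second derivatives are evaluated anywhere in the sketch. All curvature information comes from differences between successive gradients, which is the point made in the paragraph above.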


In [207]:
from pyspark.ml.regression import LinearRegression

reg1=0.0001
lr1 = LinearRegression(featuresCol='features', labelCol='year',
    maxIter=5, regParam=reg1, solver="l-bfgs")

In [208]:
model1 = lr1.fit(train_df)

In [209]:
model1.intercept

54.48121033560951

In [210]:
model1.coefficients

DenseVector([23.7556, 27.1228, -66.3133, 38.2803, -11.8044, -36.3591, 47.4329, -24.3628, 6.0349, -2.5097, 1.2619, -19.0812])

Make predictions on the training, validation and test data.

Once we transform the input DataFrame using the fitted model, we will see a new column called 'prediction' in the result DataFrame
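For a linear regression model, each prediction is the intercept plus the dot product of the coefficients and the feature vector. The sketch below recomputes this by hand in plain Python, using the intercept and coefficients printed above and the first training row shown in Part 3. The printed coefficients are rounded to four decimals, so the result is only approximately equal to the value Spark reports for that row.

```python
# Values copied from the outputs in this section; the coefficients are
# rounded in the printed DenseVector, so the result is approximate.
sketch_intercept = 54.48121033560951
sketch_coefficients = [23.7556, 27.1228, -66.3133, 38.2803, -11.8044, -36.3591,
                       47.4329, -24.3628, 6.0349, -2.5097, 1.2619, -19.0812]

# f1..f12 of the first training row (shifted year = 8.0)
sketch_row = [0.169859883053, 0.347163029706, 0.310357993388, 0.183068614542,
              0.319387467117, 0.615347306265, 0.234347964383, 0.655487922947,
              0.312231493679, 0.622081983507, 0.365167752259, 0.401388756294]

# prediction = intercept + sum of coefficient * feature
manual_prediction = sketch_intercept + sum(
    c * x for c, x in zip(sketch_coefficients, sketch_row))
print(manual_prediction)
```

The result is on the shifted scale, so adding the minimum year (1922) converts it back to a calendar year.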

In [211]:
predictions_train_model1 = model1.transform(train_df)
predictions_valid_model1 = model1.transform(valid_df)
predictions_test_model1 = model1.transform(test_df)

In [212]:
predictions_train_model1.printSchema()

root
 |-- year: double (nullable = true)
 |-- f1: double (nullable = true)
 |-- f2: double (nullable = true)
 |-- f3: double (nullable = true)
 |-- f4: double (nullable = true)
 |-- f5: double (nullable = true)
 |-- f6: double (nullable = true)
 |-- f7: double (nullable = true)
 |-- f8: double (nullable = true)
 |-- f9: double (nullable = true)
 |-- f10: double (nullable = true)
 |-- f11: double (nullable = true)
 |-- f12: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [213]:
predictions_train_model1.select('year','prediction','features').head(2)

[Row(year=8.0, prediction=16.48691352031952, features=DenseVector([0.1699, 0.3472, 0.3104, 0.1831, 0.3194, 0.6153, 0.2343, 0.6555, 0.3122, 0.6221, 0.3652, 0.4014])),
 Row(year=11.0, prediction=21.825975664529885, features=DenseVector([0.5304, 0.2056, 0.5341, 0.4049, 0.1222, 0.6604, 0.2672, 0.5257, 0.4583, 0.5038, 0.5683, 0.3953]))]

In [214]:
evaluator1 = RegressionEvaluator(labelCol='year', predictionCol='prediction', metricName='rmse')
trainError_model1 = evaluator1.evaluate(predictions_train_model1)
validError_model1 = evaluator1.evaluate(predictions_valid_model1)
testError_model1 = evaluator1.evaluate(predictions_test_model1)

In [215]:
print "Linear Regression Model1 Training Error : "+ str(trainError_model1)
print "Linear Regression Model1 Validation Error : "+ str(validError_model1)
print "Linear Regression Model1 Test Error is : "+ str(testError_model1)

Linear Regression Model1 Training Error : 15.4646109534
Linear Regression Model1 Validation Error : 16.3166845653
Linear Regression Model1 Test Error is : 15.6200094335


## Part 5: Choose the best model by hyperparameter tuning


* Perform grid search to find a good regularization parameter
* Report the test dataset error using the best model selected by the hyperparameter optimization

We are already outperforming the baseline on the validation set by about 5.6 years of RMSE (21.9 versus 16.3), but let's see if we can do better by selecting the best model through hyperparameter tuning.

Hyperparameter tuning (also called hyperparameter optimization or model selection) is the problem of choosing a set of hyperparameters for a learning algorithm, usually with the goal of optimizing a measure of the algorithm's performance on an independent dataset such as a validation dataset.
https://en.wikipedia.org/wiki/Hyperparameter_optimization


For this model selection, the hyperparameter we are tuning is the regularization parameter. The regularization type we are using is the L2 norm.
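As a reference point, and as an assumption about the setup rather than something stated in this section: with `elasticNetParam` left at its default of 0.0, the squared-error objective documented for Spark ML's `LinearRegression` reduces to ridge regression. Here $\lambda$ is `regParam`, $n$ is the number of training rows, $w$ is the coefficient vector and $b$ is the intercept. The sketch ignores any internal feature standardization.

$$
\min_{w,\,b}\ \frac{1}{2n}\sum_{i=1}^{n}\left(w^\top x_i + b - y_i\right)^2 + \frac{\lambda}{2}\,\lVert w\rVert_2^2
$$

A larger $\lambda$ pulls the coefficients toward zero, trading a worse fit on the training data for a simpler model. A $\lambda$ close to zero leaves the fit essentially unregularized.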

#### Perform grid search to find a good regularization parameter

We will search the hyperparameter space over the values 1e-100, 1e-10, 1e-5, 0.0001 and 1.0.
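The grid search pattern itself is independent of Spark: fit one model per candidate value, score each on the validation data, and keep the candidate with the lowest RMSE. A minimal plain-Python sketch follows, assuming a single feature and an L2-penalized least-squares objective of the form (1/2n) * sum of squared errors + (reg/2) * w^2, which has a closed-form solution. The helpers `fit_ridge_1d` and `rmse_1d`, the toy data and the candidate grid are all made up for illustration.

```python
import math

def fit_ridge_1d(xs, ys, reg):
    """Closed-form minimizer of (1/2n)*sum((w*x + b - y)^2) + (reg/2)*w^2."""
    n = float(len(xs))
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var = sum((x - x_bar) ** 2 for x in xs) / n
    w = cov / (var + reg)
    return w, y_bar - w * x_bar  # the intercept is not penalized

def rmse_1d(xs, ys, w, b):
    return math.sqrt(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Illustrative data following y = 2x + 1, not taken from the dataset
x_train, y_train = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
x_valid, y_valid = [4.0, 5.0], [9.0, 11.0]

best_reg_1d, best_rmse_1d = None, float('inf')
for candidate in [0.0, 0.01, 0.1, 1.0]:
    w_fit, b_fit = fit_ridge_1d(x_train, y_train, candidate)
    score = rmse_1d(x_valid, y_valid, w_fit, b_fit)
    if score < best_rmse_1d:  # keep the lowest validation RMSE
        best_reg_1d, best_rmse_1d = candidate, score

print("best reg=%s validation rmse=%s" % (best_reg_1d, best_rmse_1d))
```

On this noise-free toy data the unregularized candidate wins, because any shrinkage of the slope can only hurt. On noisy data with many correlated features, a nonzero value may do better.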

In [241]:
bestRMSE = validError_model1
bestRegParam = reg1
bestModel = model1

numIters = 100
evaluator2 = RegressionEvaluator(labelCol='year', predictionCol='prediction', metricName='rmse')
for reg in [1e-100,1e-10,1e-5,0.0001,1.0]:
    
    lr = LinearRegression(featuresCol='features', labelCol='year',
                                maxIter=numIters, regParam=reg, solver="l-bfgs")
    model = lr.fit(train_df)
    predictions_valid_model = model.transform(valid_df)
    
    # Evaluate the validation error using the current model
    rmseValGrid = evaluator2.evaluate(predictions_valid_model)
    print rmseValGrid
    
    # Compare the current model's RMSE vs best model's RMSE
    if rmseValGrid < bestRMSE:
        bestRMSE = rmseValGrid
        bestRegParam = reg
        bestModel = model
rmseValLRGrid = bestRMSE

print '============'
print 'Validation RMSE Baseline Model:'+str(validError_baseModel)
print 'Validation RMSE Model 1: '+str(validError_model1)
print 'Validation RMSE Best Model: '+str(rmseValLRGrid)

16.2709818641
16.2709818641
16.270982259
16.2709858129
16.3117147103
Validation RMSE Baseline Model:21.9055648831
Validation RMSE Model 1: 16.3166845653
Validation RMSE Best Model: 16.2709818641


In [242]:
bestModel.intercept

65.45458727828077

In [243]:
bestModel.coefficients

DenseVector([25.2375, 25.1789, -67.2705, 53.7004, -13.0249, -47.7873, 34.2649, -22.7609, 3.4272, -5.475, -12.3466, -12.7098])

In [244]:
bestRegParam

1e-100

#### Report the test dataset error using the best model selected using the Hyperparameter optimization.

In [246]:
predictions_test_bestModel = bestModel.transform(test_df)
testError_bestModel = evaluator2.evaluate(predictions_test_bestModel)

In [247]:
print "Linear Regression Best Model Test Error is : "+ str(testError_bestModel)

Linear Regression Best Model Test Error is : 15.5800681484


In [248]:
print "Average Baseline Model Test Error is : "+ str(testError_baseModel)
print "Linear Regression Model1 Test Error is : "+ str(testError_model1)
print "Linear Regression Best Model Test Error is : "+ str(testError_bestModel)

Average Baseline Model Test Error is : 21.3987891177
Linear Regression Model1 Test Error is : 15.6200094335
Linear Regression Best Model Test Error is : 15.5800681484


**As we can see from the above results, the linear regression models improve substantially on the average baseline model, cutting the test RMSE from about 21.4 to about 15.6 years.**


**Further, we can build additional models using the polynomial features to try to improve the model's predictions and reduce the test error.**
