The 'YearPredictionMSD' dataset is a subset of the Million Song Dataset, available from: https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD

The objective of this project is to apply several regression models to the dataset using Spark ML, in order to predict the release year of a song from its audio features.

Use the first 463,715 examples as the training dataset and the last 51,630 examples as the test dataset.

In [1]:
import findspark
findspark.init()
import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").getOrCreate()  # the config key is "spark.master", not "spark-master"

In [2]:
columns = ["Timbre_Avg_{:02d}".format(i+1) for i in range(12)] + ["Timbre_Covar_{:02d}".format(i+1) for i in range(78)]

In [3]:
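# Import only the names used below; a wildcard import of pyspark.sql.functions shadows builtins such as sum and max
from pyspark.sql.types import StructType, StructField, FloatType, ShortType
from pyspark.sql.functions import monotonically_increasing_id, desc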

In [4]:
mySchema = StructType([StructField(c, FloatType(), False) for c in columns])

In [5]:
to_prepend = [StructField("Year", ShortType(), False)] 
mySchema = StructType(to_prepend + mySchema.fields)

In [6]:
df = spark.read.format("csv").load("data/YearPredictionMSD.txt", schema=mySchema)
df.printSchema()

root
 |-- Year: short (nullable = true)
 |-- Timbre_Avg_01: float (nullable = true)
 |-- Timbre_Avg_02: float (nullable = true)
 |-- Timbre_Avg_03: float (nullable = true)
 |-- Timbre_Avg_04: float (nullable = true)
 |-- Timbre_Avg_05: float (nullable = true)
 |-- Timbre_Avg_06: float (nullable = true)
 |-- Timbre_Avg_07: float (nullable = true)
 |-- Timbre_Avg_08: float (nullable = true)
 |-- Timbre_Avg_09: float (nullable = true)
 |-- Timbre_Avg_10: float (nullable = true)
 |-- Timbre_Avg_11: float (nullable = true)
 |-- Timbre_Avg_12: float (nullable = true)
 |-- Timbre_Covar_01: float (nullable = true)
 |-- Timbre_Covar_02: float (nullable = true)
 |-- Timbre_Covar_03: float (nullable = true)
 |-- Timbre_Covar_04: float (nullable = true)
 |-- Timbre_Covar_05: float (nullable = true)
 |-- Timbre_Covar_06: float (nullable = true)
 |-- Timbre_Covar_07: float (nullable = true)
 |-- Timbre_Covar_08: float (nullable = true)
 |-- Timbre_Covar_09: float (nullable = true)
 |-- Timbre_Covar_

In [7]:
df.show(1)


In [8]:
from pyspark.ml.feature import RFormula
rForm = RFormula(formula="Year ~ .")

In [9]:
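# Tag rows with an increasing id so the file order can be recovered
df = df.withColumn("id", monotonically_increasing_id())
# Sort explicitly: limit() alone does not guarantee which rows are returned
train = df.sort("id").limit(463715).drop("id")
test = df.sort(desc("id")).limit(51630).drop("id")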

In [10]:
train.count(), test.count()

(463715, 51630)

An id column is used to select the first 463,715 rows as the training data and the last 51,630 rows as the test data.
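The head/tail split can be sketched on a plain Python sequence. This is only an illustration, not the Spark code path: the helper name head_tail_split is hypothetical, and row indices stand in for the actual records.

```python
def head_tail_split(rows, n_train, n_test):
    """Return (first n_train rows, last n_test rows), both in original order."""
    if n_train + n_test > len(rows):
        raise ValueError("train and test slices would overlap")
    return rows[:n_train], rows[len(rows) - n_test:]

# Row indices stand in for the records; the total is the sum of both split sizes.
rows = range(463715 + 51630)
train_rows, test_rows = head_tail_split(rows, 463715, 51630)
print(len(train_rows), len(test_rows))
```

Because the two sizes add up to the full row count in this sketch, the test slice starts exactly where the training slice ends.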

In [11]:
from pyspark.ml.regression import LinearRegression, DecisionTreeRegressor, GBTRegressor

lr = LinearRegression().setLabelCol("label").setFeaturesCol("features")
dt = DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("features")
gbt = GBTRegressor().setLabelCol("label").setFeaturesCol("features")

Initialize instances of the three regression models.

In [12]:
from pyspark.ml import Pipeline
lr_stages = [rForm, lr]
dt_stages = [rForm, dt]
gbt_stages = [rForm, gbt]

lr_pipeline = Pipeline().setStages(lr_stages)
dt_pipeline = Pipeline().setStages(dt_stages)
gbt_pipeline = Pipeline().setStages(gbt_stages)

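Chain each model with the RFormula stage in a pipeline. The formula "Year ~ ." uses Year as the label and assembles all other columns into the features vector; it does not perform feature selection.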
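The effect of the formula "Year ~ ." on a single row can be sketched in plain Python. This is a simplified illustration, not RFormula itself: the helper name apply_year_formula is hypothetical and the row values are made up.

```python
def apply_year_formula(row, label_col="Year"):
    """Mimic "Year ~ .": the label comes from one column, the features from all others."""
    features = [float(value) for name, value in row.items() if name != label_col]
    return {"label": float(row[label_col]), "features": features}

# A made-up row with only two of the 90 feature columns.
row = {"Year": 2001, "Timbre_Avg_01": 49.9, "Timbre_Avg_02": 21.5}
out = apply_year_formula(row)
print(out)
```

The label is converted to a floating-point value, which is why the prediction tables in this notebook show labels such as 2005.0.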

In [13]:
from pyspark.ml.tuning import ParamGridBuilder

lr_paramGrid = ParamGridBuilder()\
.addGrid(lr.regParam, [0.1, 0.01]) \
.addGrid(lr.fitIntercept, [False, True])\
.addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
.build()

dt_paramGrid = ParamGridBuilder()\
.addGrid(dt.maxDepth, [5, 10, 15])\
.build()

gbt_paramGrid = ParamGridBuilder()\
.addGrid(gbt.maxDepth, [5, 10, 15])\
.addGrid(gbt.maxIter, [5, 10])\
.build()

Use ParamGridBuilder to define the grid of hyperparameter values to search for each model.
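Each grid is the Cartesian product of the value lists passed to addGrid, so the number of candidate models grows multiplicatively. A small sketch with plain dictionaries mirroring the grids above (the helper name grid_size is hypothetical):

```python
from itertools import product

def grid_size(grid):
    """Count the parameter combinations in a dict of name -> candidate values."""
    return len(list(product(*grid.values())))

lr_grid = {"regParam": [0.1, 0.01], "fitIntercept": [False, True], "elasticNetParam": [0.0, 0.5, 1.0]}
dt_grid = {"maxDepth": [5, 10, 15]}
gbt_grid = {"maxDepth": [5, 10, 15], "maxIter": [5, 10]}

sizes = {"lr": grid_size(lr_grid), "dt": grid_size(dt_grid), "gbt": grid_size(gbt_grid)}
print(sizes)
```

This gives 12 candidates for linear regression, 3 for the decision tree, and 6 for the gradient-boosted trees.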

In [14]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator()\
.setMetricName("rmse")\
.setPredictionCol("prediction")\
.setLabelCol("label")

The RegressionEvaluator uses root mean squared error (RMSE) as the evaluation metric.
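The metric name "rmse" set above is the square root of the mean squared difference between label and prediction, so it is expressed in years. A minimal plain-Python sketch with made-up values:

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up labels and predictions with errors of 1, 7, 5 and 5 years.
labels = [2000.0, 2000.0, 1990.0, 1990.0]
predictions = [2001.0, 1993.0, 1995.0, 1985.0]
value = rmse(labels, predictions)
print(value)
```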

In [15]:
from pyspark.ml.tuning import TrainValidationSplit

lr_tvs = TrainValidationSplit()\
.setTrainRatio(0.75)\
.setEstimatorParamMaps(lr_paramGrid)\
.setEstimator(lr_pipeline)\
.setEvaluator(evaluator)

dt_tvs = TrainValidationSplit()\
.setTrainRatio(0.75)\
.setEstimatorParamMaps(dt_paramGrid)\
.setEstimator(dt_pipeline)\
.setEvaluator(evaluator)

gbt_tvs = TrainValidationSplit()\
.setTrainRatio(0.75)\
.setEstimatorParamMaps(gbt_paramGrid)\
.setEstimator(gbt_pipeline)\
.setEvaluator(evaluator)

TrainValidationSplit fits each parameter combination on 75% of the training data, evaluates it on the remaining 25%, and keeps the combination with the best validation RMSE.
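The selection procedure can be sketched without Spark. The toy below is an assumption-laden stand-in: a one-feature ridge model replaces the real pipelines, the data is synthetic, and all function names are hypothetical. It follows the same steps: split once, score every candidate on the held-out part, then refit the winner on all of the data.

```python
import math
import random

def fit_ridge_1d(data, lam):
    """Least-squares slope through the origin with an L2 penalty lam."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def rmse(data, w):
    """RMSE of the predictions w * x against the targets y."""
    return math.sqrt(sum((y - w * x) ** 2 for x, y in data) / len(data))

def train_validation_split(data, grid, train_ratio=0.75, seed=0):
    """Pick the penalty with the lowest validation RMSE, then refit on all data."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    fit_part, val_part = shuffled[:cut], shuffled[cut:]
    scores = {lam: rmse(val_part, fit_ridge_1d(fit_part, lam)) for lam in grid}
    best_lam = min(scores, key=scores.get)
    return best_lam, fit_ridge_1d(data, best_lam), scores

# Synthetic data: y = 2x plus Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(-3, 3)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

best_lam, w, scores = train_validation_split(data, grid=[0.0, 1.0, 100.0])
print(best_lam, round(w, 3))
```

Unlike k-fold cross-validation, each candidate is scored on a single held-out split, which is cheaper but gives a noisier estimate.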

In [16]:
#Linear Regression model
lr_model = lr_tvs.fit(train)

In [17]:
#Decision Tree Regression model
dt_model = dt_tvs.fit(train)

In [18]:
#Gradient-Boosted Tree Regression model
gbt_model = gbt_tvs.fit(train)

Fit the 3 models on the training dataset, then call transform on the test dataset to get each model's predictions.

In [19]:
#Linear Regression Predictions
lr_model.transform(test)\
.select("label", "prediction")\
.show()

+------+------------------+
| label|        prediction|
+------+------------------+
|2005.0|1999.7225998533809|
|2006.0| 2001.930661833288|
|2006.0|1996.9653578829723|
|2006.0|1999.5505932332421|
|2006.0|2001.0638159046891|
|2006.0|2000.9223374318403|
|2005.0|2000.4758839806566|
|2006.0|1996.0939362852807|
|2005.0|2000.1890567411447|
|1992.0|1994.8504773567388|
|1992.0|1995.9048953979502|
|1992.0|1997.5337765016818|
|1992.0| 1995.606851655856|
|1992.0|1997.6457792962956|
|1992.0|1995.2427750982056|
|1992.0| 1995.595865120552|
|1992.0|1990.5308852005548|
|1997.0|1996.0636990250678|
|1994.0|1998.6703682904313|
|1992.0|1999.3243119133053|
+------+------------------+
only showing top 20 rows



In [20]:
#Decision Tree Predictions
dt_model.transform(test)\
.select("label", "prediction")\
.show()

+------+------------------+
| label|        prediction|
+------+------------------+
|2005.0|2003.4903934126257|
|2006.0| 2002.994966442953|
|2006.0|         1995.9375|
|2006.0| 2002.245283018868|
|2006.0|2004.4741100323624|
|2006.0| 2002.533659730722|
|2005.0|1997.6642066420663|
|2006.0|1989.7170963364993|
|2005.0|2003.4903934126257|
|1992.0|1997.7636363636364|
|1992.0|2000.1210855949896|
|1992.0|1989.8943584070796|
|1992.0|1992.6595394736842|
|1992.0|2000.1210855949896|
|1992.0|1989.8943584070796|
|1992.0|1999.2581602373887|
|1992.0|1992.6608187134502|
|1997.0|        1990.74375|
|1994.0| 1993.596715328467|
|1992.0|1989.1089630931458|
+------+------------------+
only showing top 20 rows



In [21]:
#Gradient-Boosted Tree Predictions
gbt_model.transform(test)\
.select("label", "prediction")\
.show()

+------+------------------+
| label|        prediction|
+------+------------------+
|2005.0|2004.2185689503192|
|2006.0| 1997.442248248616|
|2006.0|2000.9502121309079|
|2006.0| 2001.928149925323|
|2006.0|2005.3385007848908|
|2006.0|1999.0117181302767|
|2005.0|1993.9580768774374|
|2006.0|1989.1882444915184|
|2005.0|2003.8556669238292|
|1992.0|1999.6211594241215|
|1992.0|1989.6934654399224|
|1992.0|  1994.58014688302|
|1992.0|1994.6878513023805|
|1992.0|1999.1540143697835|
|1992.0|1988.5878850941624|
|1992.0|1999.1974168983206|
|1992.0|1997.6926312401044|
|1997.0|1993.3883113266165|
|1994.0| 1994.612439871668|
|1992.0| 1995.833415782612|
+------+------------------+
only showing top 20 rows



In [22]:
evaluator2 = RegressionEvaluator()\
.setMetricName("r2")\
.setPredictionCol("prediction")\
.setLabelCol("label")

In [23]:
#Root Mean Squared Error (RMSE) of the 3 models
#R-squared of the 3 models

print("Linear Regression Model RMSE: %6.4f" %(evaluator.evaluate(lr_model.transform(test))))
print("Linear Regression Model R2: %6.4f" %(evaluator2.evaluate(lr_model.transform(test))))
print("Decision Tree Regression Model RMSE: %6.4f" %(evaluator.evaluate(dt_model.transform(test))))
print("Decision Tree Regression Model R2: %6.4f" %(evaluator2.evaluate(dt_model.transform(test))))
print("Gradient-Boosted Tree Regression Model RMSE: %6.4f" %(evaluator.evaluate(gbt_model.transform(test))))
print("Gradient-Boosted Tree Regression Model R2: %6.4f" %(evaluator2.evaluate(gbt_model.transform(test))))

Linear Regression Model RMSE: 9.5105
Linear Regression Model R2: 0.2319
Decision Tree Regression Model RMSE: 9.7271
Decision Tree Regression Model R2: 0.1966
Gradient-Boosted Tree Regression Model RMSE: 9.5577
Gradient-Boosted Tree Regression Model R2: 0.2243


The RMSE of all 3 models is high, at roughly 9.5 to 9.7 years, and R2 stays between about 0.20 and 0.23, so each model explains less than a quarter of the variance in release year. None of them is a good regression model. Timbre averages and covariances alone may be insufficient for predicting the release year of a song.
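One way to put these numbers in context is to compare them with a constant predictor that always outputs the mean year of the test set. Assuming the usual definition R2 = 1 - MSE / Var(label), each reported RMSE and R2 pair implies the standard deviation of the test labels, which is the RMSE such a baseline would achieve. The helper name implied_label_std is hypothetical; the input values are the ones printed above.

```python
import math

def implied_label_std(rmse, r2):
    """Solve R2 = 1 - RMSE^2 / Var(label) for the label standard deviation."""
    return math.sqrt(rmse ** 2 / (1.0 - r2))

# (RMSE, R2) pairs as reported on the test set.
reported = {
    "Linear Regression": (9.5105, 0.2319),
    "Decision Tree": (9.7271, 0.1966),
    "Gradient-Boosted Tree": (9.5577, 0.2243),
}
baselines = {name: implied_label_std(rmse, r2) for name, (rmse, r2) in reported.items()}
for name, std in baselines.items():
    print("%s: implied baseline RMSE %.2f" % (name, std))
```

All three pairs imply about the same baseline, roughly 10.85 years, so the best model improves on the constant predictor by a little over one year of RMSE.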

A second attempt at this problem could include additional audio features such as pitch and amplitude.