# Lab Sheet 8: Using ALS with Spark, evaluating CV results 

These tasks are for working on in the lab session and during the week. We will use Spark's ALS (alternating least squares) recommender and explore the outcome of a cross-validation.

## Task 1)
Read the data, split each line into tokens, and create a structured DataFrame. For low-level tasks like splitting strings, it is convenient to use an RDD, to which we can apply a `map` function.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# these imports are used for creating the DataFrame

spark = SparkSession.builder.getOrCreate() # create a SparkSession 
# this gets us an RDD of Rows (could also be done with spark.sparkContext.textFile in this case)
lines = spark.read.text("hdfs://saltdean/data/movielens/sample_movielens_ratings.txt").rdd 
# now split the lines at the '::'
parts = lines.map( ... ) # <<<
ratingsRDD = parts.map( ... ) # <<< create a Row with userId as int, movieId as int, rating as float, timestamp as int
ratings = spark.createDataFrame( ... ) # create a dataframe
ratings.createOrReplaceTempView('ratings') # register the DataFrame so that we can use it with Spark SQL.
(training, test) = ... # <<< create a random split of the DataFrame into training and test sets
print(training.describe()) # just for testing, should show the four columns
print(training.count()) # just for testing, should be around 1188

DataFrame[summary: string, movieId: string, rating: string, timestamp: string, userId: string]
1183
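
The parsing step in Task 1 can be tried out in plain Python, without a cluster. The sketch below assumes each line has the form `userId::movieId::rating::timestamp`; the two sample lines and the `parse_rating` helper are made up for illustration. In Spark, the same logic could be passed to `map`, keeping in mind that `spark.read.text(...).rdd` yields `Row` objects whose text is in the `value` field.

```python
from collections import namedtuple

# Hypothetical record type mirroring the four columns of the DataFrame.
Rating = namedtuple("Rating", ["userId", "movieId", "rating", "timestamp"])

def parse_rating(line):
    """Split one '::'-separated line and convert each field to its type."""
    user_id, movie_id, rating, timestamp = line.split("::")
    return Rating(int(user_id), int(movie_id), float(rating), int(timestamp))

# Made-up sample lines in the assumed format.
sample_lines = ["0::2::3::1424380312", "0::3::1::1424380312"]
parsed = [parse_rating(line) for line in sample_lines]
print(parsed[0])
```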


## Task 2)

Now take a very simple estimate as the baseline: calculate the mean of all ratings.    

The average can be calculated with the SQL `AVG` function within a `SELECT` statement. If you replace the selected column, e.g. `rating`, with `AVG(rating)`, the returned DataFrame contains only one row. You can access the contents of that row by column name, e.g. `row['avg(rating)']` (the `avg` needs to be lower case here).

Then calculate the squared error of each test rating with respect to the average (used as the predictor), and average these squared errors, again with the SQL `AVG` function.


In [3]:
# select the average from the ratings
SQL1 = ' ... '
row = spark.sql(SQL1).collect()[0] # get the single row with the result
print('row', row)
meanRating = ... # access Row as a map 
print('meanRating', meanRating)

se_rdd = test.rdd.map( ... ) #<<< get the squared error (squared difference to the average) using Python pow()
se_df = spark.createDataFrame(se_rdd) # create a DataFrame
se_df.createOrReplaceTempView( ... ) #<<< Register with the SQL system (choose a name)
print('se_df',se_df) 

# get the average squared error
SQL2 = '...'
row = spark.sql(SQL2).collect()[0]
meanSE = ... #<<< access Row as a map 
print('meanSE',meanSE)

meanRating 1.7741505662891406
se_df DataFrame[se: double]
meanSE 1.5408512158621657
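
The arithmetic behind the baseline can be checked on a handful of numbers. This is a minimal plain-Python sketch with made-up ratings (not taken from the MovieLens sample); the variable names are hypothetical. It also takes the square root of the mean squared error, since that is the quantity comparable to an RMSE.

```python
import math

# Made-up ratings for illustration only.
known_ratings = [1.0, 2.0, 3.0, 4.0]   # ratings used to compute the mean
test_ratings = [2.0, 5.0]              # held-out ratings to evaluate on

# Baseline predictor: the mean of the known ratings.
mean_rating = sum(known_ratings) / len(known_ratings)

# Squared error of each test rating with respect to the mean.
squared_errors = [pow(r - mean_rating, 2) for r in test_ratings]

# Mean squared error, and its square root for comparison with an RMSE.
mean_se = sum(squared_errors) / len(squared_errors)
baseline_rmse = math.sqrt(mean_se)

print("meanRating", mean_rating)
print("meanSE", mean_se)
print("baseline RMSE", baseline_rmse)
```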


## Task 3)

Now create an ALS estimator and a parameter grid to explore different values for the `rank` and `regParam` parameters of the ALS. Then build a cross-validator to train the model.

In [19]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, rank=5, regParam=0.1, userCol="userId", itemCol="movieId", ratingCol="rating")

# build a parameter grid for rank and regParam
paramGrid = ParamGridBuilder() # <<<<

# set up a regression evaluator evaluating RMSE
regEval = RegressionEvaluator( ... ) # <<<<

# set up a cross-validator with the als, paramGrid and regEval
crossVal = CrossValidator( ..., numFolds=3) # <<<<

print('starting cross-validation')
cvModel = crossVal.fit(training)
print('finished cross-validation')

starting cross-validation
finished cross-validation
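
A parameter grid is the Cartesian product of the candidate values, so its size grows quickly. The plain-Python sketch below shows this with hypothetical candidate values for `rank` and `regParam`; only `numFolds=3` comes from the lab code. In Spark, `ParamGridBuilder` builds the same kind of product from repeated `addGrid` calls followed by `build()`.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration.
ranks = [5, 10, 20]
reg_params = [0.01, 0.1, 1.0]
num_folds = 3  # as in the CrossValidator call in the lab code

# Every combination of rank and regParam is one grid point.
grid = [{"rank": r, "regParam": reg} for r, reg in product(ranks, reg_params)]

# Each grid point is trained once per fold.
total_fits = len(grid) * num_folds

print(len(grid), "parameter combinations")
print(total_fits, "model fits during cross-validation")
```

The cross-validator then fits the best combination once more on the full training set, so keeping the grid small helps with run time.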


## Task 4)

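Take the trained `cvModel` and find the best parameter values by pairing the parameter maps from `getEstimatorParamMaps()` with the cross-validation metrics in `avgMetrics`. Then compare the RMSE for the different parameter settings with the error of the mean baseline from Task 2. Note that `meanSE` is a mean squared error, so take its square root before comparing it with an RMSE.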

In [15]:
print(cvModel.avgMetrics) # the metrics from the cross-validation
print(cvModel.getEstimatorParamMaps()) # gives you the parameter combinations, print it out, too
# use Python zip and list to create a joint parameter and result map
paramMap = ... 
print(paramMap)
# use Python min to get the best params (for RMSE, lower is better)
paramBest = min(paramMap, key=lambda x: x[1])
print(paramBest)

# Evaluate the cvModel by computing the RMSE on the test data
predictions = ... #<<<
rmse = regEval.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

[({Param(parent='ALS_41f7b157fba36f60120c', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='ALS_41f7b157fba36f60120c', name='rank', doc='rank of the factorization'): 10}, nan)]
({Param(parent='ALS_41f7b157fba36f60120c', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='ALS_41f7b157fba36f60120c', name='rank', doc='rank of the factorization'): 10}, nan)
Root-mean-square error = 1.0302617871748263
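
The `nan` in the output above deserves attention: comparisons with `nan` are always false, so `min` or `max` over metrics that include `nan` cannot be relied on. One likely cause is that a validation fold contains users or movies that were not in the corresponding training fold, for which ALS predicts `NaN`; in Spark 2.2 and later, ALS has a `coldStartStrategy="drop"` option that removes such rows from the predictions. The plain-Python sketch below uses made-up parameter maps and metrics to show pairing with `zip`, filtering out `nan`, and picking the lowest RMSE.

```python
import math

# Made-up parameter maps and metrics, for illustration only.
param_maps = [
    {"rank": 5, "regParam": 0.1},
    {"rank": 10, "regParam": 0.1},
    {"rank": 10, "regParam": 1.0},
]
avg_metrics = [1.05, float("nan"), 1.20]

# Pair each parameter map with its metric.
param_results = list(zip(param_maps, avg_metrics))

# Drop entries whose metric is nan before comparing.
valid = [(p, m) for p, m in param_results if not math.isnan(m)]

# For RMSE, the best setting is the one with the smallest value.
best_params, best_rmse = min(valid, key=lambda pm: pm[1])
print(best_params, best_rmse)
```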


## Task 5) 

Apply the approach above to the larger MovieLens dataset (or part of it). The data is available at `/data/tempstore/movielens/ml-latest-small`.
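
Note that the input format may differ from the sample file. The sketch below assumes the ratings come as a comma-separated file with a header line `userId,movieId,rating,timestamp` (as in the published `ml-latest-small` release; check the files in the directory first), and the two data rows are made up. In Spark, `spark.read.csv(path, header=True, inferSchema=True)` would be one way to load such a file directly into a DataFrame.

```python
import csv
import io

# Made-up sample in the assumed CSV layout (header plus two rows).
sample_csv = (
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "1,1029,3.0,1260759179\n"
)

# DictReader uses the header line to name the fields of each row.
reader = csv.DictReader(io.StringIO(sample_csv))
ratings = [
    (int(r["userId"]), int(r["movieId"]), float(r["rating"]), int(r["timestamp"]))
    for r in reader
]
print(ratings)
```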