# Movie Recommendation Engine Tutorial

This is a notebook adaptation of a MovieLens-based example published on Codementor.

* [Part 1](https://www.codementor.io/@jadianes/building-a-recommender-with-apache-spark-python-example-app-part1-du1083qbw)
* [Part 2](https://www.codementor.io/@jadianes/building-a-web-service-with-apache-spark-flask-example-app-part2-du1083854)

Before running this notebook, make sure that you have run `setup.sh` in the repo root to download movie recommendation data files.

In [15]:
import os
import pyspark
from pyspark.mllib.recommendation import ALS
import math
from pyspark.sql.types import *

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

In [3]:
small_ratings_file = os.path.join('/data', 'ml-latest-small', 'ratings.csv')
complete_ratings_file = os.path.join('/data', 'ml-latest', 'ratings.csv')
complete_movies_file = os.path.join('/data', 'ml-latest', 'movies.csv')

model_path = os.path.join('/model', 'movie_lens_als')
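Because every later cell depends on `setup.sh` having downloaded the data, a quick existence check can save a confusing Spark error. The following is a minimal sketch; the helper name `missing_paths` is hypothetical and not part of the original notebook.

```python
import os


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk, in input order."""
    return [p for p in paths if not os.path.exists(p)]


# Possible use in this notebook (assumes the path variables defined above):
# missing = missing_paths([small_ratings_file, complete_ratings_file, complete_movies_file])
# if missing:
#     raise FileNotFoundError('Run setup.sh first; missing: %s' % missing)
```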

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Recommender") \
    .master("local[*]") \
    .config("spark.driver.memory", "16G") \
    .config('spark.driver.maxResultSize', '10G') \
    .getOrCreate()

## Small Data Files

This section loads the small dataset to tune the model: ALS is trained with several candidate ranks, and the rank with the lowest validation error is kept. The selected parameters are reused in the next section to retrain the model on the larger dataset for release.
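To make the tuned parameters more concrete, here is a minimal pure-Python sketch of alternating least squares restricted to the rank-1 case. It is not Spark's implementation: the function name `rank1_als`, its arguments, and the plain unweighted regularization term are assumptions made for illustration (Spark's ALS, as far as I know, scales the regularization by the number of ratings per user or item). Each sweep fixes one side's factors and solves a one-dimensional ridge regression for the other, so the regularized objective cannot increase from one sweep to the next.

```python
import random


def rank1_als(ratings, n_users, n_items, lam=0.1, iterations=10, seed=5):
    """Rank-1 ALS on a dict {(user, item): rating}.

    Returns (user_factors, item_factors, history), where history holds the
    regularized squared-error objective before training and after each sweep.
    """
    rng = random.Random(seed)
    u = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    v = [rng.uniform(0.5, 1.5) for _ in range(n_items)]

    # Index the observed ratings by user and by item.
    by_user, by_item = {}, {}
    for (i, j), r in ratings.items():
        by_user.setdefault(i, []).append((j, r))
        by_item.setdefault(j, []).append((i, r))

    def objective():
        squared = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
        penalty = lam * (sum(x * x for x in u) + sum(x * x for x in v))
        return squared + penalty

    history = [objective()]
    for _ in range(iterations):
        # Fix item factors; each user factor has a closed-form ridge solution.
        for i, obs in by_user.items():
            u[i] = sum(r * v[j] for j, r in obs) / (lam + sum(v[j] ** 2 for j, _ in obs))
        # Fix user factors; solve for each item factor the same way.
        for j, obs in by_item.items():
            v[j] = sum(r * u[i] for i, r in obs) / (lam + sum(u[i] ** 2 for i, _ in obs))
        history.append(objective())
    return u, v, history
```

With a higher rank, each scalar division becomes a small regularized linear solve per user or item, which is where the `rank` and `lambda_` arguments of `ALS.train` come in.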

In [6]:
small_ratings_raw_data = spark.read.format("csv") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .load(small_ratings_file)
  
small_ratings_raw_data.createOrReplaceTempView("ratings")

small_ratings_raw_data.limit(5).toPandas()

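|   | userId | movieId | rating | timestamp |
|---|--------|---------|--------|-----------|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |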


In [7]:
small_ratings_data = spark.sql("select userId, movieId, rating from ratings")

In [18]:
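# Split roughly 60/20/20 into training, validation, and test sets.
# Despite the _RDD suffix these are DataFrames; .rdd converts them where an RDD is needed.
training_RDD, validation_RDD, test_RDD = small_ratings_data.randomSplit([6.0, 2.0, 2.0], seed=0)
# Keep only the (userId, movieId) pairs as input for predictAll
validation_for_predict_RDD = validation_RDD.rdd.map(lambda x: (x[0], x[1]))
test_for_predict_RDD = test_RDD.rdd.map(lambda x: (x[0], x[1]))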
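The weights `[6.0, 2.0, 2.0]` do not sum to 1; `randomSplit` normalizes them, and the resulting split sizes are only approximately 60/20/20 because each row is assigned at random. The pure-Python sketch below illustrates that idea under simplifying assumptions. The function `weighted_random_split` is hypothetical and does not reproduce Spark's partition-wise sampling.

```python
import random


def weighted_random_split(rows, weights, seed=0):
    """Assign each row to one split with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point rounding

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, upper in enumerate(bounds):
            if x < upper:
                splits[k].append(row)
                break
    return splits
```

Every row lands in exactly one split, so the three sets are disjoint and together cover the input.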

In [21]:
seed = 5
iterations = 10
regularization_parameter = 0.1
ranks = [4, 8, 12]
errors = []

min_error = float('inf')
best_rank = -1
for rank in ranks:
    # Train on the training split with the candidate rank
    model = ALS.train(training_RDD, rank, seed=seed, iterations=iterations,
                      lambda_=regularization_parameter)
    # Predict the validation pairs and key them by (userId, movieId)
    predictions = model.predictAll(validation_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
    # Join actual and predicted ratings on (userId, movieId)
    rates_and_preds = validation_RDD.rdd.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
    # Root mean squared error over the joined pairs
    error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    errors.append(error)
    print('For rank %s the RMSE is %s' % (rank, error))
    if error < min_error:
        min_error = error
        best_rank = rank

print('The best model was trained with rank %s' % best_rank)

For rank 4 the RMSE is 0.914168981326783
For rank 8 the RMSE is 0.9175987533435309
For rank 12 the RMSE is 0.9173928488354329
The best model was trained with rank 4
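The RMSE reported above is the square root of the mean squared difference between actual and predicted ratings, computed over the (userId, movieId) pairs that survive the `join`. The plain-Python sketch below mirrors that computation on dictionaries. The name `rmse_on_common_keys` is hypothetical, and the sketch assumes, as I understand `predictAll` to behave, that pairs whose user or movie never appeared in training produce no prediction and therefore drop out of the inner join.

```python
import math


def rmse_on_common_keys(actual, predicted):
    """RMSE over keys present in both dicts, mimicking an inner join."""
    common = actual.keys() & predicted.keys()
    if not common:
        raise ValueError('no (user, movie) pairs in common')
    squared_errors = [(actual[k] - predicted[k]) ** 2 for k in common]
    return math.sqrt(sum(squared_errors) / len(common))
```

One consequence is that the reported RMSE describes only the pairs the model can score, so it is worth checking how many validation rows were dropped before comparing ranks.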


In [24]:
model = ALS.train(training_RDD, best_rank, seed=seed, iterations=iterations,
                  lambda_=regularization_parameter)
predictions = model.predictAll(test_for_predict_RDD).map(lambda r: ((r[0], r[1]), r[2]))
rates_and_preds = test_RDD.rdd.map(lambda r: ((int(r[0]), int(r[1])), float(r[2]))).join(predictions)
error = math.sqrt(rates_and_preds.map(lambda r: (r[1][0] - r[1][1])**2).mean())
    
print('For testing data the RMSE is %s' % (error))

For testing data the RMSE is 0.9107689846603845


## Large Recommendation Data Files

Using the parameters selected on the small dataset, this section retrains the model on the complete dataset to produce a production-ready version.

In [26]:
complete_ratings_raw_data = spark.read.format("csv") \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .load(complete_ratings_file)
  
complete_ratings_raw_data.createOrReplaceTempView("ratings")

complete_ratings_data = spark.sql("select userId, movieId, rating from ratings").cache()

print('There are %s ratings in the complete dataset' % (complete_ratings_data.count()))

There are 27753444 ratings in the complete dataset


In [28]:
training_RDD, test_RDD = complete_ratings_data.randomSplit([7.0, 3.0], seed=0)

complete_model = ALS.train(training_RDD, best_rank, seed=seed, 
                           iterations=iterations, lambda_=regularization_parameter)
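The cell above holds out 30% of the complete dataset as `test_RDD` but never scores the model on it. A sketch of how that evaluation could look is given below; it repeats the predict, join, and RMSE steps used on the small dataset. The helper name `evaluate_rmse` is hypothetical, and it assumes the DataFrame columns are ordered `userId, movieId, rating` as in the queries above.

```python
import math


def evaluate_rmse(model, ratings_df):
    """RMSE of `model` on a DataFrame with columns (userId, movieId, rating)."""
    # (userId, movieId) pairs to score
    pairs = ratings_df.rdd.map(lambda r: (int(r[0]), int(r[1])))
    # Predicted ratings keyed by (userId, movieId)
    predictions = model.predictAll(pairs).map(lambda r: ((r[0], r[1]), r[2]))
    # Actual ratings keyed the same way, then inner-joined with the predictions
    actuals = ratings_df.rdd.map(lambda r: ((int(r[0]), int(r[1])), float(r[2])))
    joined = actuals.join(predictions)
    return math.sqrt(joined.map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean())


# Possible use in this notebook:
# print('For testing data the RMSE is %s' % evaluate_rmse(complete_model, test_RDD))
```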

In [32]:
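from pyspark.ml import Pipeline

# Caution: complete_model is an RDD-based pyspark.mllib MatrixFactorizationModel, not a
# pyspark.ml Estimator or Transformer, so calling pipeline.fit() would raise a TypeError.
# The pipeline is not used below; the model is saved directly instead.
pipeline = Pipeline(stages=[complete_model])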

In [31]:
complete_model.save(spark.sparkContext, model_path)
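Once saved, the model can be loaded again in another process, for example the web service described in Part 2. The sketch below shows one way to do that with `MatrixFactorizationModel.load` and `recommendProducts`. The wrapper `load_model_and_recommend` and its defaults are hypothetical, and the import is placed inside the function only so the sketch can be defined without a Spark installation.

```python
def load_model_and_recommend(sc, path, user_id, num_movies=10):
    """Load a saved ALS model and return (movieId, predicted rating) pairs for one user."""
    # Imported lazily; requires pyspark when the function is actually called.
    from pyspark.mllib.recommendation import MatrixFactorizationModel

    model = MatrixFactorizationModel.load(sc, path)
    # recommendProducts returns Rating(user, product, rating) tuples
    return [(r.product, r.rating) for r in model.recommendProducts(user_id, num_movies)]


# Possible use in this notebook:
# top_movies = load_model_and_recommend(spark.sparkContext, model_path, user_id=1)
```

Note that `save` fails if the target path already exists, so a rerun of the notebook needs a fresh `model_path` or a cleanup step first.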