#Movie Recommendation with MLlib

In this chapter, we will use MLlib to make personalized movie recommendations for you. We will work with 10 million ratings from 72,000 users on 10,000 movies, collected by MovieLens. This dataset is pre-loaded in HDFS. For quick testing of your code, we use a smaller dataset, which contains 1 million ratings from 6,000 users on 4,000 movies.

#1. Data set

We will use two files from this MovieLens dataset: “ratings.dat” and “movies.dat”. All ratings are contained in the file “ratings.dat” and are in the following format:

`UserID::MovieID::Rating::Timestamp`

Movie information is in the file “movies.dat” and is in the following format:

`MovieID::Title::Genres`
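
As a minimal sketch of how these two formats can be split into fields, the plain-Python snippet below parses one line of each kind. The sample lines are illustrative and are not guaranteed to match the files, the helper names `parse_rating_line` and `parse_movie_line` are hypothetical, and the `|` separator between genres is an assumption that the text above does not state.

```python
# Illustrative sample lines in the two formats described above (assumed values).
sample_rating = "1::1193::5::978300760"
sample_movie = "1::Toy Story (1995)::Animation|Children's|Comedy"

def parse_rating_line(line):
    """Split a ratings.dat line into (user_id, movie_id, rating, timestamp)."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)

def parse_movie_line(line):
    """Split a movies.dat line into (movie_id, title, list_of_genres)."""
    movie_id, title, genres = line.strip().split("::")
    # Assumption: genres are separated by '|'.
    return int(movie_id), title, genres.split("|")

parsed_rating = parse_rating_line(sample_rating)
parsed_movie = parse_movie_line(sample_movie)
print(parsed_rating)
print(parsed_movie)
```

The notebook's own parseRating and parseMovie functions further down do the same splitting but keep only the fields needed for training.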

#2. Collaborative filtering

Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix, in our case, the user-movie rating matrix. MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. In particular, we implement the alternating least squares (ALS) algorithm to learn these latent factors.

<a href="url"><img src="http://ampcamp.berkeley.edu/5/exercises/img/matrix_factorization.png" align="center" height="300" width="500" ></a>
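
To make the idea of alternating least squares more concrete, here is a deliberately simplified sketch in plain Python. It uses a single latent factor per user and per item (rank 1), so each least-squares solve reduces to a scalar division. The toy ratings, the regularization value and all names are assumptions made for illustration; this is not MLlib's implementation.

```python
# Toy ratings: {(user, item): rating}; missing pairs are the entries to fill in.
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 1): 3.0, (1, 2): 2.0,
    (2, 1): 5.0, (2, 2): 3.0,
}
num_users, num_items, lam = 3, 3, 0.1

def objective(u, v):
    """Regularized squared error over the observed entries only."""
    err = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return err + lam * (sum(x * x for x in u) + sum(x * x for x in v))

def solve_side(fixed, solved_index, size):
    """Closed-form regularized least-squares update for one side,
    holding the factors of the other side fixed."""
    num = [0.0] * size
    den = [lam] * size
    for key, r in ratings.items():
        s = key[solved_index]              # index of the factor being solved
        f = fixed[key[1 - solved_index]]   # matching factor on the fixed side
        num[s] += r * f
        den[s] += f * f
    return [n / d for n, d in zip(num, den)]

u = [1.0] * num_users
v = [1.0] * num_items
history = [objective(u, v)]
for _ in range(10):
    u = solve_side(v, 0, num_users)   # fix item factors, solve user factors
    v = solve_side(u, 1, num_items)   # fix user factors, solve item factors
    history.append(objective(u, v))

predicted_missing = u[0] * v[2]  # user 0 never rated item 2
print(history[0], history[-1], predicted_missing)
```

Because each half-step exactly minimizes the objective for one side while the other is held fixed, the values in `history` never increase. With a higher rank, each scalar division becomes a small linear system per user or per item.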

#3. Create training examples

To make recommendations for you, we are going to learn your taste by asking you to rate a few movies. We have selected a small set of movies that have received the most ratings from users in the MovieLens dataset. You can rate those movies by running rateMovies.py (see below; hit 'run' or 'shift-return').

When you run the script, you should see a prompt similar to the one in the cell output below.

After you’re done rating the movies, we save your ratings in personalRatings.txt in the MovieLens format, where a special user id 0 is assigned to you.

rateMovies allows you to re-rate the movies if you’d like to see how your ratings affect your recommendations.

In [2]:
%run rateMovies.py

parentDir = 
/home/ubuntu/notebooks/spark_course/6-MLlib-Example
Looks like you've already rated the movies. Overwrite ratings (y/N)? y
Please rate the following movie (1-5 (best), or 0 if not seen): 
Toy Story (1995): 0
Independence Day (a.k.a. ID4) (1996): 5
Dances with Wolves (1990): 0
Star Wars: Episode VI - Return of the Jedi (1983): 5
Mission: Impossible (1996): 0
Ace Ventura: Pet Detective (1994): 0
Die Hard: With a Vengeance (1995): 0
Batman Forever (1995): 0
Pretty Woman (1990): 0
Men in Black (1997): 5
Dumb & Dumber (1994): 0


#4. Setup

The following are the cells you are going to edit and run.

*Initiate Spark Context - ONLY first time for each notebook. If you get problems with below, see [Help](/notebooks/spark_course/1-Course-Information-and-Links/If-you-get-problems-initiating-spark-context.ipynb)*

In [3]:
import os
from pyspark import SparkContext
sc = SparkContext(appName="search", master=os.environ['MASTER'])

In [4]:
import sys                                                                                                                         
import itertools                                                                                                                   
from math import sqrt                                                                                                              
from operator import add                                                                                                           
from os.path import join, isfile, dirname                                                                                          

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

In [5]:
def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp .
    """
    fields = line.strip().split("::")
    return long(fields[3]) % 10, (int(fields[0]), int(fields[1]), float(fields[2]))

def parseMovie(line):
    """
    Parses a movie record in MovieLens format movieId::movieTitle::genres
    and returns (movieId, movieTitle).
    """
    fields = line.strip().split("::")
    return int(fields[0]), fields[1]

def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error).
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
      .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
      .values()
    return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
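
The RMSE that computeRmse returns can be checked by hand on a tiny example. The sketch below applies the same formula to a short list of (predicted, actual) pairs in plain Python, without Spark. The pairs are hypothetical values chosen to make the arithmetic easy, and `rmse` is an illustrative helper, not part of the notebook.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    squared_errors = [(predicted - actual) ** 2 for predicted, actual in pairs]
    return sqrt(sum(squared_errors) / float(len(squared_errors)))

# Hypothetical (predicted, actual) ratings for four user-movie pairs.
toy_pairs = [(4.0, 5.0), (3.0, 3.0), (2.0, 4.0), (5.0, 4.0)]

# Squared errors are 1, 0, 4 and 1, so the mean squared error is 1.5.
toy_rmse = rmse(toy_pairs)
print(toy_rmse)
```

A lower RMSE means the predicted ratings are, on average, closer to the actual ones.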

In [6]:
# Load and parse the data
# ratings is an RDD of (last digit of timestamp, (userId, movieId, rating))
ratings = sc.textFile("/uuData/movies/ratings.dat").map(parseRating)

# movies is a dict mapping movieId to movieTitle
temp = sc.textFile("/uuData/movies/movies.dat")
movies = dict(temp.map(parseMovie).collect()) 

# load personal ratings
myRatings = sc.textFile("/uuData/movies/personalRatings.txt")
myRatingsRDD = myRatings.map(lambda l: l.split('::')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

#5. Running the program

In [7]:
numRatings = ratings.count()
numUsers = ratings.values().map(lambda r: r[0]).distinct().count()
numMovies = ratings.values().map(lambda r: r[1]).distinct().count()

print "Got %d ratings from %d users on %d movies." % (numRatings, numUsers, numMovies)

Got 1000209 ratings from 6040 users on 3706 movies.


#6. Splitting training data
We will use MLlib’s ALS to train a MatrixFactorizationModel, which takes an RDD[Rating] object as input in Scala and an RDD[(user, product, rating)] in Python. ALS has training parameters such as the rank of the matrix factors and regularization constants. To determine a good combination of the training parameters, we split the data into three non-overlapping subsets, named training, validation, and test, based on the last digit of the timestamp. We will train multiple models on the training set, select the best model on the validation set based on RMSE (Root Mean Squared Error), and finally evaluate the best model on the test set. We also add your ratings to the training set to make recommendations for you. We hold the training, validation, and test sets in memory by calling cache because we need to visit them multiple times.
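
The split rule itself is simple enough to sketch in plain Python. The thresholds below (digits 0-5, 6-7 and 8-9) are taken from the Spark cell that follows; the function name `subset_for` and the toy timestamps are hypothetical and serve only as an illustration.

```python
def subset_for(timestamp):
    """Assign a rating to a subset from the last digit of its timestamp."""
    digit = timestamp % 10
    if digit < 6:
        return "training"      # digits 0-5
    if digit < 8:
        return "validation"    # digits 6-7
    return "test"              # digits 8-9

# Hypothetical timestamps covering each last digit exactly once.
toy_timestamps = [978300760 + d for d in range(10)]

counts = {}
for ts in toy_timestamps:
    name = subset_for(ts)
    counts[name] = counts.get(name, 0) + 1
print(counts)
```

If the last digits are roughly uniformly distributed, this gives a split of about 60% training, 20% validation and 20% test.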

In [8]:
# split ratings into train (60%), validation (20%), and test (20%) based on the
# last digit of the timestamp, add myRatings to train, and cache them

# training, validation, test are all RDDs of (userId, movieId, rating)

numPartitions = 4
training = ratings.filter(lambda x: x[0] < 6) \
  .values() \
  .union(myRatingsRDD) \
  .repartition(numPartitions) \
  .cache()

validation = ratings.filter(lambda x: x[0] >= 6 and x[0] < 8) \
  .values() \
  .repartition(numPartitions) \
  .cache()

test = ratings.filter(lambda x: x[0] >= 8).values().cache()

numTraining = training.count()
numValidation = validation.count()
numTest = test.count()

print "Training: %d, validation: %d, test: %d" % (numTraining, numValidation, numTest)

Training: 602252, validation: 198919, test: 199049


#7. Training using ALS
In this section, we will use ALS.train to train several models, then select and evaluate the best one. Among the training parameters of ALS, the most important ones are rank, lambda (regularization constant), and number of iterations. The train method of ALS we are going to use is defined as follows:

class ALS(object):

    @classmethod
    def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1):
        # ...
        return MatrixFactorizationModel(sc, mod)

# Newer signature:
# def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False,
#           seed=None)

Ideally, we want to try a large number of combinations of these parameters in order to find the best one. Due to time constraints, we will test only 8 combinations resulting from the cross product of two different ranks (8 and 12), two different lambdas (0.1 and 10.0), and two different numbers of iterations (10 and 20). We use the provided method computeRmse to compute the RMSE on the validation set for each model. The model with the smallest RMSE on the validation set is selected, and its RMSE on the test set is used as the final metric.
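
The selection logic can be sketched without Spark. The snippet below enumerates the 8 combinations with itertools.product, using the parameter values from the training cell, and picks the one with the smallest score. The scoring function `toy_validation_error` is a made-up stand-in for "train a model and compute its validation RMSE"; its values are not real RMSEs.

```python
import itertools

ranks = [8, 12]
lambdas = [0.1, 10.0]
num_iters = [10, 20]

def toy_validation_error(rank, lmbda, num_iter):
    """Hypothetical stand-in for training a model and computing validation RMSE."""
    return lmbda + 1.0 / rank + 1.0 / num_iter

# Cross product of the three parameter lists: 2 * 2 * 2 = 8 combinations.
grid = list(itertools.product(ranks, lambdas, num_iters))

# Keep the combination with the smallest (toy) validation error.
best_params = min(grid, key=lambda params: toy_validation_error(*params))
print(len(grid), best_params)
```

In the real cell, each call to the scoring step trains a full ALS model, which is why the number of combinations is kept small.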

*Note: let the step below finish before going on to the next one.*


In [9]:
# train models and evaluate them on the validation set

ranks = [8, 12]
lambdas = [0.1, 10.0]
numIters = [10, 20]
bestModel = None
bestValidationRmse = float("inf")
bestRank = 0
bestLambda = -1.0
bestNumIter = -1

for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
    model = ALS.train(training, rank, numIter, lmbda)
    validationRmse = computeRmse(model, validation, numValidation)
    print "RMSE (validation) = %f for the model trained with " % validationRmse + \
            "rank = %d, lambda = %.1f, and numIter = %d." % (rank, lmbda, numIter)
    if (validationRmse < bestValidationRmse):
        bestModel = model
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lmbda
        bestNumIter = numIter

# evaluate the best model on the test set
testRmse = computeRmse(bestModel, test, numTest)
print "The best model was trained with rank = %d and lambda = %.1f, " % (bestRank, bestLambda) \
    + "and numIter = %d, and its RMSE on the test set is %f." % (bestNumIter, testRmse)

RMSE (validation) = 0.879141 for the model trained with rank = 8, lambda = 0.1, and numIter = 10.
RMSE (validation) = 0.872579 for the model trained with rank = 8, lambda = 0.1, and numIter = 20.
RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda = 10.0, and numIter = 10.
RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda = 10.0, and numIter = 20.
RMSE (validation) = 0.879635 for the model trained with rank = 12, lambda = 0.1, and numIter = 10.
RMSE (validation) = 0.870872 for the model trained with rank = 12, lambda = 0.1, and numIter = 20.
RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda = 10.0, and numIter = 10.
RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda = 10.0, and numIter = 20.
The best model was trained with rank = 12 and lambda = 0.1, and numIter = 20, and its RMSE on the test set is 0.868953.


In [10]:
# compare the best model with a naive baseline that always returns the mean rating
meanRating = training.union(validation).map(lambda x: x[2]).mean()
baselineRmse = sqrt(test.map(lambda x: (meanRating - x[2]) ** 2).reduce(add) / numTest)
improvement = (baselineRmse - testRmse) / baselineRmse * 100
print "The best model improves the baseline by %.2f" % (improvement) + "%."

The best model improves the baseline by 21.96%.
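
The improvement figure is the relative reduction in RMSE compared with always predicting the mean rating. A small worked example in plain Python, using hypothetical ratings and a hypothetical model RMSE of 0.8, shows the arithmetic:

```python
from math import sqrt

# Hypothetical ratings, for illustration only.
train_ratings = [5.0, 3.0, 4.0, 4.0]
test_ratings = [5.0, 3.0]

# The baseline always predicts the mean of the known ratings (4.0 here).
mean_rating = sum(train_ratings) / len(train_ratings)

# Each test rating is off by exactly 1, so the baseline RMSE is 1.0.
baseline_rmse = sqrt(
    sum((mean_rating - r) ** 2 for r in test_ratings) / len(test_ratings)
)

model_rmse = 0.8  # hypothetical test RMSE of a trained model
improvement = (baseline_rmse - model_rmse) / baseline_rmse * 100
print("Improvement over baseline: %.2f%%" % improvement)
```

Here the model's RMSE is about 20 percent lower than the baseline's.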


In [11]:
def loadRatings(ratingsFile):
    """
    Load ratings from file.
    """
    if not isfile(ratingsFile):
        print "File %s does not exist." % ratingsFile
        sys.exit(1)
    f = open(ratingsFile, 'r')
    ratings = filter(lambda r: r[2] > 0, [parseRating(line)[1] for line in f])
    f.close()
    if not ratings:
        print "No ratings provided."
        sys.exit(1)
    else:
        return ratings

In [12]:
# load personal ratings
myRatings = loadRatings("personalRatings.txt")
myRatingsRDD = sc.parallelize(myRatings, 1)

In [13]:
# make personalized recommendations

myRatedMovieIds = set([x[1] for x in myRatings])
candidates = sc.parallelize([m for m in movies if m not in myRatedMovieIds])
predictions = bestModel.predictAll(candidates.map(lambda x: (0, x))).collect()
recommendations = sorted(predictions, key=lambda x: x[2], reverse=True)[:50]

print "Movies recommended for you:"
for i in xrange(len(recommendations)):
    print ("%2d: %s" % (i + 1, movies[recommendations[i][1]])).encode('ascii', 'ignore')

Movies recommended for you:
 1: Anatomy (Anatomie) (2000)
 2: Julien Donkey-Boy (1999)
 3: Across the Sea of Time (1995)
 4: Welcome to Woop-Woop (1997)
 5: Wisdom (1986)
 6: Love Serenade (1996)
 7: Steal Big, Steal Little (1995)
 8: Jakob the Liar (1999)
 9: If Lucy Fell (1996)
10: Mad Dog Time (1996)
11: Isn't She Great? (2000)
12: Committed (2000)
13: Zachariah (1971)
14: In the Mouth of Madness (1995)
15: Fall (1997)
16: Clean Slate (Coup de Torchon) (1981)
17: Sandpiper, The (1965)
18: Mr. Jones (1993)
19: Postman, The (1997)
20: I Confess (1953)
21: Leading Man, The (1996)
22: Dune (1984)
23: Bandits (1997)
24: Shattered Image (1998)
25: Window to Paris (1994)
26: Star Trek V: The Final Frontier (1989)
27: Down to You (2000)
28: Chill Factor (1999)
29: For Love of the Game (1999)
30: Star Wars: Episode I - The Phantom Menace (1999)
31: Hard Core Logo (1996)
32: Belly (1998)
33: Message to Love: The Isle of Wight Festival (1996)
34: Marnie (1964)
35: Goya in Bordeaux (Goya en Bod

### Exercise: rerun the notebook from the beginning with different scores in your ratings and see whether the recommendations change. What is the limitation of the approach above? What would be needed to make it (much) better?

### Extra: make a copy of this notebook and restructure it so that the analysis is easy to rerun (move the parts that only need to run once out of the way). Use the copy to try to get some reasonable results out of this (small) data set: can you, e.g., get the recommender to recommend sci-fi movies?

### For the interested student: see https://www.kaggle.com/, and http://en.wikipedia.org/wiki/Netflix_Prize

Material based on AMPCamp and Databricks training material provided online under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.