#Machine learning tutorial

In this tutorial we will write a script for a movie recommendation system based on your personal ratings and a dataset of about 1 million ratings from about 6000 users on about 4000 movies.

###Dataset

We will use two files from the MovieLens dataset: ratings.dat and movies.dat.  

ratings.dat is in the following format:  
UserID::MovieID::Rating::Timestamp

movies.dat's format is:  
MovieID::Title::Genres
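
As a quick illustration of the two formats, the sketch below splits one sample line of each kind on the `::` separator. The sample lines and the helper names `split_rating` and `split_movie` are made up for this example and are not part of the tutorial code; the `|` separator between genres is also an assumption.

```python
# Illustrative sample lines in the two formats described above.
rating_line = "1::1193::5::978300760"
movie_line = "1::Toy Story (1995)::Animation|Children's|Comedy"

def split_rating(line):
    """Split a UserID::MovieID::Rating::Timestamp line into typed fields."""
    user_id, movie_id, rating, timestamp = line.strip().split("::")
    return int(user_id), int(movie_id), float(rating), int(timestamp)

def split_movie(line):
    """Split a MovieID::Title::Genres line; genres are assumed to be '|'-separated."""
    movie_id, title, genres = line.strip().split("::")
    return int(movie_id), title, genres.split("|")

print(split_rating(rating_line))
print(split_movie(movie_line))
```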

###Collaborative filtering

Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix, in our case the user-movie rating matrix. MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. In particular, we use MLlib's implementation of the alternating least squares (ALS) algorithm to learn these latent factors.
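
To make the idea of latent factors concrete, here is a minimal pure-Python sketch of ALS with a single latent factor per user and per movie. It is a toy written for this explanation, not MLlib's implementation: the function name `als_rank1`, the tiny rating matrix, and the parameter values are all assumptions chosen for illustration.

```python
def als_rank1(ratings, num_users, num_items, lmbda=0.01, num_iters=20):
    """
    Toy rank-1 ALS: each user i gets one factor u[i], each item j one factor v[j],
    and the predicted rating is u[i] * v[j].
    `ratings` is a list of observed (user, item, rating) triples.
    """
    u = [1.0] * num_users
    v = [1.0] * num_items
    for _ in range(num_iters):
        # Hold item factors fixed; each user factor has a closed-form
        # regularized least-squares solution over that user's observed ratings.
        for i in range(num_users):
            obs = [(j, r) for (a, j, r) in ratings if a == i]
            u[i] = sum(r * v[j] for j, r in obs) / (lmbda + sum(v[j] ** 2 for j, r in obs))
        # Hold user factors fixed and solve for each item factor the same way.
        for j in range(num_items):
            obs = [(i, r) for (i, b, r) in ratings if b == j]
            v[j] = sum(r * u[i] for i, r in obs) / (lmbda + sum(u[i] ** 2 for i, r in obs))
    return u, v

# Hypothetical 3x3 rating matrix that is exactly rank 1; the entry for
# user 2 / item 2 (true value 1.0) is withheld and must be predicted.
observed = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 2.0),
    (1, 0, 4.0), (1, 1, 3.2), (1, 2, 1.6),
    (2, 0, 2.5), (2, 1, 2.0),
]

u, v = als_rank1(observed, num_users=3, num_items=3)
missing_prediction = u[2] * v[2]
print("Predicted rating for the missing entry: %.2f" % missing_prediction)
```

Each half-step is an ordinary regularized least-squares problem, which is why the procedure alternates between users and items instead of optimizing both at once.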

---
<img src="http://ampcamp.berkeley.edu/5/exercises/img/matrix_factorization.png" height="300" width="600">


###User rating

To create your own ratings you can use a Python script called rateMovies. It asks you to rate a few movies and generates a file named personalRatings.txt.  
You can launch the script from the Docker shell with the following command:  
python rateMovies

We have pregenerated this file for you; if you prefer, you can generate your own personal ratings file and use it in this tutorial.

###Setup

We start by importing the libraries we need.

In [39]:
import sys
import itertools
from math import sqrt
from operator import add
from os.path import join, isfile, dirname

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS

###User defined functions

We define some helper functions to use later.

In [50]:
def parseRating(line):
    """
    Parses a rating record in MovieLens format userId::movieId::rating::timestamp.
    Returns (last digit of timestamp, (userId, movieId, rating)).
    """
    fields = line.strip().split("::")
    return int(fields[3]) % 10, (int(fields[0]), int(fields[1]), float(fields[2]))

def parseMovie(line):
    """
    Parses a movie record in MovieLens format movieId::movieTitle::genres.
    Returns (movieId, movieTitle); the genres field is ignored.
    """
    fields = line.strip().split("::")
    return int(fields[0]), fields[1]

def loadRatings(ratingsFile):
    """
    Load ratings from file, keeping only records with a rating greater than 0.
    """
    if not isfile(ratingsFile):
        print("File %s does not exist." % ratingsFile)
        sys.exit(1)
    # The with block closes the file even if parsing fails.
    with open(ratingsFile, 'r') as f:
        ratings = [r for r in (parseRating(line)[1] for line in f) if r[2] > 0]
    if not ratings:
        print("No ratings provided.")
        sys.exit(1)
    return ratings

def computeRmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error).
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    predictionsAndRatings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
      .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
      .values()
    return sqrt(predictionsAndRatings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
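
The RMSE computation can be illustrated without Spark. The sketch below mirrors the join on the (user, movie) key used in computeRmse, with plain dictionaries and made-up ratings. The function name `rmse` and the data are assumptions for this example. Unlike computeRmse, which divides by the n passed in, this sketch divides by the number of matched pairs.

```python
from math import sqrt

def rmse(predicted, actual):
    """RMSE over the (user, movie) keys present in both dicts."""
    common = [key for key in actual if key in predicted]
    squared_errors = [(predicted[key] - actual[key]) ** 2 for key in common]
    return sqrt(sum(squared_errors) / float(len(squared_errors)))

# Hypothetical ratings keyed by (userId, movieId).
actual = {(1, 10): 5.0, (1, 20): 3.0, (2, 10): 4.0}
predicted = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0}

print("RMSE = %.4f" % rmse(predicted, actual))
```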

Load your personal ratings from the local file system.

In [51]:
# load personal ratings
FILE_PATH = "/notebooks/cineca/data/movielens/medium/"
myRatings = loadRatings(FILE_PATH + "personalRatings.txt")
myRatingsRDD = sc.parallelize(myRatings, 1)

Copy the dataset into HDFS.

In [53]:
%%bash
hdfs dfs -mkdir /movielens
hdfs dfs -put /notebooks/cineca/data/movielens/medium/ratings.dat /movielens/ratings.dat
hdfs dfs -put /notebooks/cineca/data/movielens/medium/movies.dat /movielens/movies.dat

mkdir: `/movielens': File exists


Load the ratings from HDFS.

In [54]:
# ratings is an RDD of (last digit of timestamp, (userId, movieId, rating))
#ratings = sc.textFile("file://" + FILE_PATH + "ratings.dat").map(parseRating)
ratings = sc.textFile("/movielens/ratings.dat").map(parseRating)

Load the movie dataset from HDFS.

In [55]:
# movies is a dict mapping movieId to movieTitle (collected from the RDD to the driver)
movies = dict(sc.textFile("/movielens/movies.dat").map(parseMovie).collect())

Let's compute some basic statistics on the dataset.

In [56]:
numRatings = ratings.count()
numUsers = ratings.values().map(lambda r: r[0]).distinct().count()
numMovies = ratings.values().map(lambda r: r[1]).distinct().count()

print("Got %d ratings from %d users on %d movies." % (numRatings, numUsers, numMovies))

Got 1000209 ratings from 6040 users on 3706 movies.


###Splitting training data

We will use MLlib’s ALS to train a MatrixFactorizationModel. ALS takes an RDD[Rating] object as input in Scala and an RDD[(user, product, rating)] in Python, and has training parameters such as the rank of the matrix factors and the regularization constant. To determine a good combination of training parameters, we split the data into three non-overlapping subsets, named training, validation, and test, based on the last digit of the timestamp. We will train multiple models on the training set, select the best model on the validation set based on RMSE (Root Mean Squared Error), and finally evaluate the best model on the test set. We also add your ratings to the training set to make recommendations for you. We hold the training, validation, and test sets in memory by calling cache because we need to visit them multiple times.

In [57]:
numPartitions = 4

training = ratings.filter(lambda x: x[0] < 6).values().union(myRatingsRDD).repartition(numPartitions).cache()
validation = ratings.filter(lambda x: x[0] >= 6 and x[0] < 8).values().repartition(numPartitions).cache()
test = ratings.filter(lambda x: x[0] >= 8).values().cache()

numTraining = training.count()
numValidation = validation.count()
numTest = test.count()

print("Training: %d, validation: %d, test: %d" % (numTraining, numValidation, numTest))

Training: 602252, validation: 198919, test: 199049
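
The split rule itself is easy to check in plain Python. In the sketch below, the function name `bucket` and the sample timestamps are made up for illustration. Last digits 0 to 5 go to training, 6 and 7 to validation, and 8 and 9 to test, which is consistent with the roughly 60/20/20 proportions in the counts above.

```python
def bucket(timestamp):
    """Assign a rating to a subset using the last digit of its timestamp."""
    digit = timestamp % 10
    if digit < 6:
        return "training"
    elif digit < 8:
        return "validation"
    return "test"

# Hypothetical timestamps chosen to cover each range of last digits.
timestamps = [978300760, 978300275, 978301967, 978302109]
for ts in timestamps:
    print("%d -> %s" % (ts, bucket(ts)))
```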


###Get the best model

In this tutorial we try different combinations of parameters (rank, lambda, and number of iterations) to find the best model.  
Parameters:  
Ranks: 8, 12  
Lambdas: 1, 10  
Number of iterations: 10, 20

In [17]:
ranks = [8, 12]
lambdas = [1.0, 10.0]
numIters = [10, 20]
bestModel = None
bestValidationRmse = float("inf")
bestRank = 0
bestLambda = -1.0
bestNumIter = -1

for rank, lmbda, numIter in itertools.product(ranks, lambdas, numIters):
    model = ALS.train(training, rank, numIter, lmbda)
    validationRmse = computeRmse(model, validation, numValidation)
    print("RMSE (validation) = %f for the model trained with " % validationRmse +
          "rank = %d, lambda = %.1f, and numIter = %d." % (rank, lmbda, numIter))
    if validationRmse < bestValidationRmse:
        bestModel = model
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lmbda
        bestNumIter = numIter

# evaluate the best model on the test set
testRmse = computeRmse(bestModel, test, numTest)

print("The best model was trained with rank = %d and lambda = %.1f, " % (bestRank, bestLambda)
      + "and numIter = %d, and its RMSE on the test set is %f." % (bestNumIter, testRmse))

RMSE (validation) = 1.361322 for the model trained with rank = 8, lambda = 1.0, and numIter = 10.
RMSE (validation) = 1.361321 for the model trained with rank = 8, lambda = 1.0, and numIter = 20.
RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda = 10.0, and numIter = 10.
RMSE (validation) = 3.755870 for the model trained with rank = 8, lambda = 10.0, and numIter = 20.
RMSE (validation) = 1.361321 for the model trained with rank = 12, lambda = 1.0, and numIter = 10.
RMSE (validation) = 1.361321 for the model trained with rank = 12, lambda = 1.0, and numIter = 20.
RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda = 10.0, and numIter = 10.
RMSE (validation) = 3.755870 for the model trained with rank = 12, lambda = 10.0, and numIter = 20.
The best model was trained with rank = 8 and lambda = 1.0, and numIter = 20, and its RMSE on the test set is 1.357077.
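
One plausible reading of why the larger lambda scores so much worse above is that stronger regularization shrinks the latent factors, and therefore the predictions, toward zero. The sketch below shows that effect on a single scalar factor and picks the best lambda by validation RMSE, following the same selection pattern as the loop above. The names `fit_factor` and `rmse_for_factor`, the data, and the lambda grid are assumptions for this toy example and say nothing definitive about the MLlib run.

```python
from math import sqrt

def fit_factor(train, lmbda):
    """
    Closed-form minimizer of sum((r - u * v) ** 2) + lmbda * u ** 2
    for a single scalar factor u, given (v, r) pairs.
    """
    return sum(r * v for v, r in train) / (lmbda + sum(v * v for v, r in train))

def rmse_for_factor(u, data):
    """RMSE of the predictions u * v against the ratings r."""
    return sqrt(sum((r - u * v) ** 2 for v, r in data) / float(len(data)))

# Hypothetical (item factor, rating) pairs for one user.
train = [(1.0, 4.0), (0.8, 3.0), (0.5, 2.0)]
validation = [(0.9, 3.5), (0.6, 2.5)]

best_lmbda, best_rmse = None, float("inf")
for lmbda in [0.01, 1.0, 10.0]:
    u = fit_factor(train, lmbda)
    score = rmse_for_factor(u, validation)
    print("lambda = %5.2f  u = %.3f  validation RMSE = %.3f" % (lmbda, u, score))
    # Keep the setting with the lowest validation error seen so far.
    if score < best_rmse:
        best_lmbda, best_rmse = lmbda, score

print("Best lambda: %.2f" % best_lmbda)
```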


###How to use the learned model?

Complete the following lines to recommend 50 movies based on your personal ratings and the best learned model.

In [18]:
# Get the candidates (exclude movies already rated in the personal ratings)
myRatedMovieIds = set([x[1] for x in myRatings])
candidates = sc.parallelize(...)
# Make predictions using the best model learned in the previous step
predictions = ...
# Get the first 50 from the list of predictions
recommendations = ...

print("Movies recommended for you:")
for i, rec in enumerate(recommendations):
    # Drop non-ASCII characters so that titles print on any console.
    print(("%2d: %s" % (i + 1, movies[rec[1]])).encode('ascii', 'ignore').decode('ascii'))

Movies recommended for you:
 1: I Am Cuba (Soy Cuba/Ya Kuba) (1964)
 2: Time of the Gypsies (Dom za vesanje) (1989)
 3: Smashing Time (1967)
 4: Gate of Heavenly Peace, The (1995)
 5: Follow the Bitch (1998)
 6: Zachariah (1971)
 7: Bewegte Mann, Der (1994)
 8: Institute Benjamenta, or This Dream People Call Human Life (1995)
 9: For All Mankind (1989)
10: Hour of the Pig, The (1993)
11: Man of the Century (1999)
12: Lamerica (1994)
13: Lured (1947)
14: Apple, The (Sib) (1998)
15: Sanjuro (1962)
16: I Can't Sleep (J'ai pas sommeil) (1994)
17: Bells, The (1926)
18: Shawshank Redemption, The (1994)
19: Collectionneuse, La (1967)
20: Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
21: 24 7: Twenty Four Seven (1997)
22: Usual Suspects, The (1995)
23: Godfather, The (1972)
24: Close Shave, A (1995)
25: Big Trees, The (1952)
26: Wrong Trousers, The (1993)
27: Paths of Glory (1957)
28: Soft Fruit (1999)
29: Schindler's List (1993)
30: Third Man, The (1949)
31: Sunset Blvd.