# Predicting movie ratings

One of the most common uses of big data is to predict what users want. This allows Google to show
you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you
might like. This lab will demonstrate how we can use Apache Spark to recommend movies to a user.
We will start with some basic techniques, and then use the MLlib library's Alternating Least Squares
method to make more sophisticated predictions.

## Tools
This assignment is based on Python 3.7, Spark 1.4.1, the PySpark API, and the MLlib library.


## File
For this lab, we will use the [MovieLens 10M stable benchmark rating dataset](http://grouplens.org/datasets/movielens/), which contains about 10 million ratings. The same code will also work for a smaller subset of it, or for the latest MovieLens dataset of ~25 million ratings.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

movies_filename = '/content/drive/MyDrive/movies.csv'
ratings_filename = '/content/drive/MyDrive/ratings.csv'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Preliminaries

We read in each of the files and create an RDD consisting of parsed lines.
Each line in the ratings dataset (`ratings.csv`) is formatted as:

`UserID::MovieID::Rating::Timestamp`

Each line in the movies (`movies.csv`) dataset is formatted as:

`MovieID::Title::Genres`

The `Genres` field has the format

`Genres1|Genres2|Genres3|...`

The format of these files is uniform and simple, so we can easily parse them using Python:
- For each line in the ratings dataset, we create a tuple of (UserID, MovieID, Rating). We drop
the timestamp because we do not need it for this exercise.
- For each line in the movies dataset, we create a tuple of (MovieID, Title). We drop the Genres
because we do not need them for this exercise.

In [2]:
!pip install pyspark
import pyspark
sc = pyspark.SparkContext('local[*]')



In [3]:
num_partitions = 2
rawRatings = sc.textFile(ratings_filename).repartition(num_partitions)
rawMovies = sc.textFile(movies_filename)

def get_ratings_tuple(entry):
    """ Parse a line in the ratings dataset
    Args:
        entry (str): a line in the ratings dataset in the form of UserID::MovieID::Rating::Timestamp
    Returns:
        tuple: (UserID, MovieID, Rating)
    """
    
    items = entry.split('::')
    return int(items[0]), int(items[1]), float(items[2])


def get_movie_tuple(entry):
    """ Parse a line in the movies dataset
    Args:
        entry (str): a line in the movies dataset in the form of MovieID::Title::Genres
    Returns:
        tuple: (MovieID, Title)
    """
    
    items = entry.split('::')
    return int(items[0]), items[1]


ratingsRDD = rawRatings.map(get_ratings_tuple).cache()
moviesRDD = rawMovies.map(get_movie_tuple).cache()

ratingsCount = ratingsRDD.count()
moviesCount = moviesRDD.count()

print('There are {} ratings and {} movies in the datasets'.format(ratingsCount, moviesCount))
print('Ratings: {}'.format(ratingsRDD.take(3)))
print('Movies: {}'.format(moviesRDD.take(3)))

There are 10000054 ratings and 10681 movies in the datasets
Ratings: [(1, 122, 5.0), (1, 185, 5.0), (1, 231, 5.0)]
Movies: [(1, 'Toy Story (1995)'), (2, 'Jumanji (1995)'), (3, 'Grumpier Old Men (1995)')]


We will be examining subsets of the tuples we create (e.g., the top rated movies by users). Whenever we examine only a subset of a large dataset, there is the potential that the result will depend on the order we perform operations, such as joins, or how the data is partitioned across the workers. What we want to guarantee is that we always see the same results for a subset, independent of how we manipulate or store the data.

We can do that by sorting before we examine a subset. You might think that the most obvious choice when dealing with an RDD of tuples would be to use the `sortByKey()` method. However, this choice is problematic, as we can still end up with different results if the key is not unique.

Note: Movie titles contain Unicode characters. In Python 3 every `str` is a Unicode string, so the `u'...'` prefix in the examples below is optional; under Python 2 the separate [unicode type](https://docs.python.org/2/howto/unicode.html#the-unicode-type) had to be used instead of the plain `str` type.

Consider the following example, and note that while the sets are equal, the printed lists are usually in different order by value, *although they may randomly match up from time to time.*

In [4]:
tmp1 = [(1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'delta')]
tmp2 = [(1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'alpha')]

oneRDD = sc.parallelize(tmp1)
twoRDD = sc.parallelize(tmp2)
oneSorted = oneRDD.sortByKey(True).collect()
twoSorted = twoRDD.sortByKey(True).collect()
print(oneSorted)
print(twoSorted)

[(1, 'alpha'), (1, 'epsilon'), (1, 'delta'), (2, 'alpha'), (2, 'beta'), (3, 'alpha')]
[(1, 'delta'), (1, 'epsilon'), (1, 'alpha'), (2, 'alpha'), (2, 'beta'), (3, 'alpha')]


Even though the two lists contain identical tuples, the difference in input ordering *sometimes* yields a different ordering for the sorted RDD (try running the cell repeatedly and see if the results change). If we only examined the first two elements of the RDD (e.g., using `take(2)`), then we would observe different answers. **That is a really bad outcome, as we want identical input data to always yield identical output.** A better technique is to sort the RDD by *both the key and value*, which we can do by combining the key and value into a single string and then sorting on that string. Since the key is a number and the value is a string, we can use a function to combine them into a single string (e.g., `'{:.3f}'.format(key) + ' ' + value`) before sorting the RDD using [sortBy()][sortby]. Note that the combined strings are compared character by character, which matches numeric order here only because every key in this lab has a single digit before the decimal point.
[sortby]: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortBy
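
The effect can be reproduced without Spark. The sketch below is a plain-Python analogy, not Spark's implementation: Python's `sorted` is stable, so elements with equal keys keep their input order, and two lists that hold the same tuples in a different order come out differently when sorted on the key alone. Sorting on the whole `(key, value)` pair removes the ambiguity. The variable names are chosen for illustration only.

```python
tmp1 = [(1, 'alpha'), (2, 'alpha'), (2, 'beta'), (3, 'alpha'), (1, 'epsilon'), (1, 'delta')]
tmp2 = [(1, 'delta'), (2, 'alpha'), (2, 'beta'), (3, 'alpha'), (1, 'epsilon'), (1, 'alpha')]

# Sorting on the key alone: ties between equal keys are left in input order.
by_key_1 = sorted(tmp1, key=lambda kv: kv[0])
by_key_2 = sorted(tmp2, key=lambda kv: kv[0])
print(by_key_1 == by_key_2)    # False

# Sorting on key and value together: every tie is broken,
# so the input order no longer matters.
by_both_1 = sorted(tmp1, key=lambda kv: (kv[0], kv[1]))
by_both_2 = sorted(tmp2, key=lambda kv: (kv[0], kv[1]))
print(by_both_1 == by_both_2)  # True
```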

In [5]:
'{:.3f}'.format(1.43324543522)

'1.433'

In [6]:
def sortFunction(pair):
    """ Construct the sort string (does not perform actual sorting)
    Args:
        pair: (rating, MovieName)
    Returns:
        sortString: the value to sort with, 'rating MovieName'
    """
    # Format the numeric key with three decimals so it can be joined with the name.
    # The parameter is called `pair` to avoid shadowing the built-in `tuple`.
    key = '{:.3f}'.format(pair[0])
    value = pair[1]
    return key + ' ' + value

print(oneRDD.sortBy(sortFunction, ascending = True).collect())
print(twoRDD.sortBy(sortFunction, ascending = True).collect())

[(1, 'alpha'), (1, 'delta'), (1, 'epsilon'), (2, 'alpha'), (2, 'beta'), (3, 'alpha')]
[(1, 'alpha'), (1, 'delta'), (1, 'epsilon'), (2, 'alpha'), (2, 'beta'), (3, 'alpha')]


If we just want to look at the first few elements of the RDD in sorted order, we can use the [takeOrdered][takeordered] method with the `sortFunction` we defined.
[takeordered]: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.takeOrdered

In [7]:
oneSorted1 = oneRDD.takeOrdered(3,key=sortFunction)
twoSorted1 = twoRDD.takeOrdered(3,key=sortFunction)
print('one is {}'.format(oneSorted1))
print('two is {}'.format(twoSorted1))

one is [(1, 'alpha'), (1, 'delta'), (1, 'epsilon')]
two is [(1, 'alpha'), (1, 'delta'), (1, 'epsilon')]


## 1. Basic Recommendations

One way to recommend movies is to always recommend the movies with the highest average rating. In this part, we will use Spark to find the name, number of ratings, and the average rating of the 20 movies with the highest average rating and more than 500 reviews. We want to filter out movies with high ratings but 500 or fewer reviews because movies with few reviews may not have broad appeal to everyone.

### 1.1 Number of Ratings and Average Ratings for a Movie

We implement a helper function `getCountsAndAverages` that takes a single tuple of (MovieID, (Rating1, Rating2, Rating3, ...)) and returns a tuple of (MovieID, (number of ratings, averageRating)). For example, given the tuple `(100, (10.0, 20.0, 30.0))`, the function should return `(100, (3, 20.0))`.

In [8]:
def getCountsAndAverages(IDandRatingsTuple):
    """ Calculate average rating
    Args:
        IDandRatingsTuple: a single tuple of (MovieID, (Rating1, Rating2, Rating3, ...))
    Returns:
        tuple: a tuple of (MovieID, (number of ratings, averageRating))
    """
    ratings = IDandRatingsTuple[1]
    num_ratings = len(ratings)
    return (IDandRatingsTuple[0], (num_ratings, float(sum(ratings)) / num_ratings))

In [9]:
assert(getCountsAndAverages((1, (1, 2, 3, 4))) == (1, (4, 2.5)))
assert(getCountsAndAverages((100, (10.0, 20.0, 30.0))) == (100, (3, 20.0)))
assert(getCountsAndAverages((110, range(20))) == (110, (20, 9.5)))

### 1.2 Movies with Highest Average Ratings

Now that we have a way to calculate the average ratings, we will use the `getCountsAndAverages` helper function with Spark to determine movies with highest average ratings:

* the `ratingsRDD` contains tuples of the form (UserID, MovieID, Rating). From `ratingsRDD` we create an RDD, `movieIDsWithRatingsRDD`, with tuples of the form (MovieID, Python iterable of Ratings for that MovieID);
* using `movieIDsWithRatingsRDD` and the `getCountsAndAverages()` helper function, we compute the number of ratings and average rating for each movie to yield `movieIDsWithAvgRatingsRDD`, with tuples of the form (MovieID, (number of ratings, average rating));
* we want to see movie names instead of movie IDs, so we join `moviesRDD` with `movieIDsWithAvgRatingsRDD` on MovieID and reshape each result into a tuple of the form (average rating, movie name, number of ratings). This yields an RDD of the form: `[(1.0, u'Autopsy (Macchie Solari) (1975)', 1), (1.0, u'Better Living (1998)', 1), (1.0, u'Big Squeeze, The (1996)', 3)]`.

In [10]:
# From ratingsRDD with tuples of (UserID, MovieID, Rating) create an RDD with tuples of
# the (MovieID, iterable of Ratings for that MovieID)
movieIDsWithRatingsRDD = (ratingsRDD
                          .map(lambda r: (r[1], r[2]))
                          .groupByKey()
                         )
print('movieIDsWithRatingsRDD: {}\n'.format(movieIDsWithRatingsRDD.take(3)))

# Using `movieIDsWithRatingsRDD`, compute the number of ratings and average rating for each movie to
# yield tuples of the form (MovieID, (number of ratings, average rating))
movieIDsWithAvgRatingsRDD = movieIDsWithRatingsRDD.map(getCountsAndAverages)
print('movieIDsWithAvgRatingsRDD: {}\n'.format(movieIDsWithAvgRatingsRDD.take(3)))

# To `movieIDsWithAvgRatingsRDD`, apply RDD transformations that use `moviesRDD` to get the movie
# names for `movieIDsWithAvgRatingsRDD`, yielding tuples of the form
# (average rating, movie name, number of ratings)
movieNameWithAvgRatingsRDD = (moviesRDD
                              .join(movieIDsWithAvgRatingsRDD)
                              .map(lambda m: (m[1][1][1], m[1][0], m[1][1][0]))#*
                             )
print('movieNameWithAvgRatingsRDD: {}\n'.format(movieNameWithAvgRatingsRDD.take(3)))

movieIDsWithRatingsRDD: [(122, <pyspark.resultiterable.ResultIterable object at 0x7f336e6ce9d0>), (292, <pyspark.resultiterable.ResultIterable object at 0x7f336e6cebd0>), (316, <pyspark.resultiterable.ResultIterable object at 0x7f336e6ce690>)]

movieIDsWithAvgRatingsRDD: [(122, (2412, 2.861318407960199)), (292, (16075, 3.4184136858475895)), (316, (18925, 3.3493527080581242))]

movieNameWithAvgRatingsRDD: [(2.860544217687075, 'Waiting to Exhale (1995)', 1764), (3.131256952169077, 'Tom and Huck (1995)', 899), (2.5671971706454464, 'Dracula: Dead and Loving It (1995)', 2262)]



*Example of the join output (pay attention to how the values are nested, since the `map` above indexes into them):

In [11]:
moviesRDD.join(movieIDsWithAvgRatingsRDD).takeOrdered(1)

[(1, ('Toy Story (1995)', (26449, 3.928768573481039)))]
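
To make the nesting concrete, here is a small plain-Python sketch that mimics the shape of the join result and then applies the same indexing as the `map` in the cell above. The helper `inner_join` is hypothetical and only imitates the output structure of `RDD.join` on ordinary lists; it is not how Spark performs the join.

```python
def inner_join(left, right):
    """Imitate the output shape of RDD.join on two lists of (key, value) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # One output element per matching pair: (key, (left_value, right_value))
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

movies = [(1, 'Toy Story (1995)'), (2, 'Jumanji (1995)')]   # (MovieID, Title)
stats = [(1, (26449, 3.928768573481039))]                   # (MovieID, (count, average))

joined = inner_join(movies, stats)
# Each element looks like (MovieID, (Title, (count, average))), so:
#   m[1][0]    -> Title
#   m[1][1][0] -> count
#   m[1][1][1] -> average
reshaped = [(m[1][1][1], m[1][0], m[1][1][0]) for m in joined]
print(joined)
print(reshaped)
```

Movie 2 has no entry in `stats`, so it does not appear in the result, just as an inner join drops keys that are missing on either side.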

### 1.3 Movies with Highest Average Ratings and more than 500 reviews

Now that we have an RDD of the movies with average ratings, we can use Spark to determine the 20 movies with highest average ratings and more than 500 reviews.

We apply a single RDD transformation to `movieNameWithAvgRatingsRDD` to limit the results to movies with ratings from more than 500 people. We then use the `sortFunction()` helper function to sort by the average rating to get the movies in order of their rating (highest rating first). You will end up with an RDD of the form: `[(4.5349264705882355, u'Shawshank Redemption, The (1994)', 1088), (4.515798462852263, u"Schindler's List (1993)", 1171), (4.512893982808023, u'Godfather, The (1972)', 1047)]`

In [12]:
# Apply an RDD transformation to `movieNameWithAvgRatingsRDD` to limit the results to movies with
# ratings from more than 500 people. We then use the `sortFunction()` helper function to sort by the
# average rating to get the movies in order of their rating (highest rating first)
movieLimitedAndSortedByRatingRDD = (movieNameWithAvgRatingsRDD
                                    .filter(lambda m: m[2] > 500)
                                    .sortBy(sortFunction, ascending = False))
print('Movies with highest ratings: {}'.format(movieLimitedAndSortedByRatingRDD.take(20)))

Movies with highest ratings: [(4.457238321660348, 'Shawshank Redemption, The (1994)', 31126), (4.415085293227011, 'Godfather, The (1972)', 19814), (4.367142322253193, 'Usual Suspects, The (1995)', 24037), (4.363482949916592, "Schindler's List (1993)", 25777), (4.321966205837174, 'Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)', 3255), (4.319740945070761, 'Casablanca (1942)', 12507), (4.316543909348442, 'Rear Window (1954)', 8825), (4.315439034540158, 'Double Indemnity (1944)', 2403), (4.313629402756509, 'Third Man, The (1949)', 3265), (4.314119283602851, 'Seven Samurai (Shichinin no samurai) (1954)', 5751), (4.306805399325085, 'Paths of Glory (1957)', 1778), (4.303215119343423, 'Godfather: Part II, The (1974)', 13281), (4.298072023101749, 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)', 11774), (4.297154471544715, 'Lives of Others, The (Das Leben der Anderen) (2006)', 1230), (4.294842186297152, 'Dark Knight, The (2008)', 2598), (4.292379632836855, "One Flew

Using a threshold on the number of reviews is one way to improve the recommendations, but there are many other good ways to improve quality. For example, you could weight ratings by the number of ratings.
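
As one possible illustration of such a weighting (not something this lab implements), the sketch below shrinks each movie's average toward an overall mean, with less shrinkage the more ratings the movie has. This is sometimes called a damped or Bayesian average. The function name `damped_mean`, the `damping` constant of 25, and the global mean of 3.5 are all assumptions made for the example.

```python
def damped_mean(num_ratings, avg_rating, global_mean, damping=25):
    """Blend a movie's average with the global mean.

    Acts as if every movie started with `damping` extra ratings
    equal to `global_mean` (an assumed smoothing scheme).
    """
    return (num_ratings * avg_rating + damping * global_mean) / (num_ratings + damping)

global_mean = 3.5  # hypothetical overall average rating

# A movie with only 3 perfect ratings versus a heavily rated movie.
few = damped_mean(3, 5.0, global_mean)
many = damped_mean(31126, 4.457238321660348, global_mean)
print(few, many)
```

With these assumed values the sparsely rated movie drops well below the heavily rated one, so it no longer outranks it, and no hard cutoff on the number of reviews is needed.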

## 2. Collaborative Filtering

We are going to use a technique called collaborative filtering. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly.

First, people rate different items (like videos, images, games). The system then predicts a user's rating for an item that the user has not rated yet. These predictions are built upon the existing ratings of other users whose ratings are similar to those of the active user.

For movie recommendations, we start with a matrix whose entries are movie ratings by users.  Each column represents a user and each row represents a particular movie.

Since not all users have rated all movies, we do not know all of the entries in this matrix, which is precisely why we need collaborative filtering.  For each user, we have ratings for only a subset of the movies.  With collaborative filtering, the idea is to approximate the ratings matrix by factorizing it as the product of two matrices: one that describes properties of each user, and one that describes properties of each movie.
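
One way to write the factorization down, using notation that is assumed here rather than taken from the lab ($R$ for the ratings matrix, $M$ and $U$ for the movie and user factor matrices, $k$ for the number of latent factors):

$$ R \approx M U, \qquad r_{ij} \approx \sum_{f = 1}^{k} M_{if} \, U_{fj} $$

Here $R$ has one row per movie and one column per user, $M$ has one row per movie and $k$ columns, and $U$ has $k$ rows and one column per user. Only the entries $r_{ij}$ that users actually supplied are known; the product $M U$ provides an estimate for every other entry.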

We want to select these two matrices such that the error for the user/movie pairs where we know the correct ratings is minimized.  The *Alternating Least Squares* algorithm does this by first randomly filling the users matrix with values and then optimizing the values of the movies matrix such that the error is minimized.  Then, it holds the movies matrix constant and optimizes the values of the users matrix.  This alternation between which matrix to optimize is the reason for the "alternating" in the name.
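
The alternation can be illustrated with a deliberately tiny sketch in plain Python. It is not MLlib's implementation: it assumes a rank of 1 (one number per user and one per movie), a plain ridge penalty `lam`, and a hypothetical toy `ratings` dictionary; the function names `rank1_als` and `rmse` are made up for this example. With rank 1, each least-squares step has a closed form: with the user values $u_i$ held fixed, the best value for movie $j$ is $\sum_i u_i r_{ij} / (\sum_i u_i^2 + \lambda)$, where the sums run over the users who rated that movie.

```python
import random

def rank1_als(ratings, num_users, num_movies, iterations=10, lam=0.1, seed=0):
    """Rank-1 alternating least squares on observed entries only (illustrative).

    ratings: dict mapping (user, movie) -> rating.
    Returns (users, movies), two lists of floats.
    """
    rng = random.Random(seed)
    users = [rng.uniform(0.5, 1.5) for _ in range(num_users)]  # random start
    movies = [0.0] * num_movies
    for _ in range(iterations):
        # Hold the user values fixed and solve for each movie value.
        for m in range(num_movies):
            obs = [(u, r) for (u, mm), r in ratings.items() if mm == m]
            numerator = sum(users[u] * r for u, r in obs)
            denominator = sum(users[u] ** 2 for u, _ in obs) + lam
            movies[m] = numerator / denominator
        # Hold the movie values fixed and solve for each user value.
        for u in range(num_users):
            obs = [(m, r) for (uu, m), r in ratings.items() if uu == u]
            numerator = sum(movies[m] * r for m, r in obs)
            denominator = sum(movies[m] ** 2 for m, _ in obs) + lam
            users[u] = numerator / denominator
    return users, movies

def rmse(ratings, users, movies):
    """Root mean square error over the observed entries."""
    squared = sum((users[u] * movies[m] - r) ** 2 for (u, m), r in ratings.items())
    return (squared / len(ratings)) ** 0.5

# Toy data: user 1 rates everything twice as high as user 0;
# the pair (user 1, movie 1) is deliberately left unobserved.
ratings = {(0, 0): 1.0, (0, 1): 2.0, (0, 2): 2.5, (1, 0): 2.0, (1, 2): 5.0}

users, movies = rank1_als(ratings, num_users=2, num_movies=3, iterations=20)
print('RMSE on observed entries:', rmse(ratings, users, movies))
print('Estimate for the unobserved pair (1, 1):', users[1] * movies[1])
```

On this toy data the fitted product reproduces the observed ratings closely, and `users[1] * movies[1]` gives an estimate for the one pair that was never rated. The regularization pulls that estimate somewhat below the value of 4 that the pattern in the data would suggest.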

### 2.1 Creating a Training Set

Before we jump into using machine learning, we need to break up the `ratingsRDD` dataset into three pieces:

* a training set (RDD), which we will use to train models,
* a validation set (RDD), which we will use to choose the best model,
* a test set (RDD), which we will use for estimating the predictive power of the recommender system.

To randomly split the dataset into multiple groups, we can use the PySpark `randomSplit` transformation, which takes a list of split weights and a seed and returns multiple RDDs.

In [13]:
trainingRDD, validationRDD, testRDD = ratingsRDD.randomSplit([6, 2, 2], seed=0)

print('Training: %s, validation: %s, test: %s\n' % (trainingRDD.count(),
                                                    validationRDD.count(),
                                                    testRDD.count()))
print(trainingRDD.take(3))
print(validationRDD.take(3))
print(testRDD.take(3))

Training: 6000368, validation: 2000295, test: 1999391

[(1, 122, 5.0), (1, 185, 5.0), (1, 316, 5.0)]
[(1, 231, 5.0), (1, 292, 5.0), (1, 594, 5.0)]
[(1, 329, 5.0), (1, 355, 5.0), (1, 356, 5.0)]


After splitting the dataset, the training set has about 6,000,000 entries and the validation and test sets each have about 2,000,000 entries (the exact number of entries in each dataset varies slightly due to the random nature of the `randomSplit` transformation).
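
The arithmetic behind these figures can be checked in a few lines. The sketch assumes, as the PySpark documentation describes, that the weights `[6, 2, 2]` are normalized to fractions before sampling. The variable names are illustrative, and the counts are copied from the outputs printed in this lab.

```python
weights = [6, 2, 2]
total_ratings = 10000054                      # count printed when the data was loaded
observed = [6000368, 2000295, 1999391]        # counts printed after randomSplit

# Normalize the weights so they sum to 1, then scale by the dataset size.
fractions = [w / sum(weights) for w in weights]
expected = [round(total_ratings * f) for f in fractions]

# Differences between the observed and expected split sizes.
deviations = [o - e for o, e in zip(observed, expected)]
print(expected)
print(deviations)
```

The deviations are a few hundred entries out of millions, which is consistent with each rating being assigned to a split at random rather than by an exact quota.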

### 2.2 Root Mean Square Error (RMSE)

In the next part, we will generate a few different models, and will need a way to decide which model is best. We will use the *Root Mean Square Error* (RMSE) or Root Mean Square Deviation (RMSD) to compute the error of each model.  RMSE is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and are called prediction errors when computed out-of-sample. The RMSE serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent.

As a first step, we write a function to compute the RMSE given `predictedRDD` and `actualRDD` RDDs. Both RDDs consist of tuples of the form (UserID, MovieID, Rating).

Given two ratings RDDs, $x$ and $y$ of size $n$, we define RMSE as follows:

$$ RMSE = \sqrt{\frac{\sum_{i = 1}^{n} (x_i - y_i)^2}{n}}$$

To calculate RMSE, we perform the following steps.

* Transform `predictedRDD` into the tuples of the form ((UserID, MovieID), Rating). For example, tuples like `[((1, 1), 5), ((1, 2), 3), ((1, 3), 4), ((2, 1), 3), ((2, 2), 2), ((2, 3), 4)]`.
* Transform `actualRDD` into the tuples of the form ((UserID, MovieID), Rating). For example, tuples like `[((1, 2), 3), ((1, 3), 5), ((2, 1), 5), ((2, 2), 1)]`.
* Compute the squared error for each *matching* entry (i.e., the same (UserID, MovieID) in each RDD) in the reformatted RDDs. Note that not every (UserID, MovieID) pair will appear in both RDDs - if a pair does not appear in both RDDs, then it does not contribute to the RMSE. We will end up with an RDD with entries of the form $ (x_i - y_i)^2$.
* Using an RDD action, we compute the total squared error: $ SE = \sum_{i = 1}^{n} (x_i - y_i)^2 $.
* Compute $n$ by using an RDD action, to count the number of pairs for which you computed the total squared error.
* Using the total squared error and the number of pairs, compute the RMSE.

In [14]:
import math

def computeError(predictedRDD, actualRDD):
    """ Compute the root mean squared error between predicted and actual
    Args:
        predictedRDD: predicted ratings for each movie and each user where each entry is in the form
                      (UserID, MovieID, Rating)
        actualRDD: actual ratings where each entry is in the form (UserID, MovieID, Rating)
    Returns:
        RMSE (float): computed RMSE value
    """
    # Transform predictedRDD into the tuples of the form ((UserID, MovieID), Rating)
    predictedReformattedRDD = predictedRDD.map(lambda i: ((i[0], i[1]), i[2]))

    # Transform actualRDD into the tuples of the form ((UserID, MovieID), Rating)
    actualReformattedRDD = actualRDD.map(lambda i: ((i[0], i[1]), i[2]))

    # Compute the squared error for each matching entry (i.e., the same (User ID, Movie ID) in each
    # RDD) in the reformatted RDDs using RDD transformations - do not use collect()
    squaredErrorsRDD = (predictedReformattedRDD
                        .join(actualReformattedRDD)
                        .map(lambda i: math.pow(i[1][0] - i[1][1], 2))
                       )

    # Compute the total squared error - do not use collect()
    totalError = squaredErrorsRDD.reduce(lambda a, b: a+b)

    # Count the number of entries for which you computed the total squared error
    numRatings = squaredErrorsRDD.count()

    # Using the total squared error and the number of entries, compute the RMSE
    return math.pow(float(totalError) / numRatings, 0.5)


# sc.parallelize turns a Python list into a Spark RDD.
testPredicted = sc.parallelize([
    (1, 1, 5),
    (1, 2, 3),
    (1, 3, 4),
    (2, 1, 3),
    (2, 2, 2),
    (2, 3, 4)])
testActual = sc.parallelize([
     (1, 2, 3),
     (1, 3, 5),
     (2, 1, 5),
     (2, 2, 1)])
testPredicted2 = sc.parallelize([
     (2, 2, 5),
     (1, 2, 5)])
testError = computeError(testPredicted, testActual)
print('Error for test dataset (should be 1.22474487139): {}'.format(testError))

testError2 = computeError(testPredicted2, testActual)
print('Error for test dataset2 (should be 3.16227766017): {}'.format(testError2))

testError3 = computeError(testActual, testActual)
print('Error for testActual dataset (should be 0.0): {}'.format(testError3))

Error for test dataset (should be 1.22474487139): 1.224744871391589
Error for test dataset2 (should be 3.16227766017): 3.1622776601683795
Error for testActual dataset (should be 0.0): 0.0


In [15]:
# TEST Root Mean Square Error (part 2.2)
assert(abs(testError - 1.22474487139) < 0.00000001)
assert(abs(testError2 - 3.16227766017) < 0.00000001)
assert(abs(testError3 - 0.0) < 0.00000001)
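
As a cross-check on the expected values used above, the same calculation can be done on ordinary Python lists. This is only a sketch for small data (the function name `rmse_from_lists` is made up here), since the point of the RDD version is that it never collects the data onto one machine.

```python
import math

def rmse_from_lists(predicted, actual):
    """RMSE over the (UserID, MovieID) pairs that appear in both lists."""
    actual_by_pair = {(u, m): r for u, m, r in actual}
    squared_errors = [(r - actual_by_pair[(u, m)]) ** 2
                      for u, m, r in predicted
                      if (u, m) in actual_by_pair]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

predicted = [(1, 1, 5), (1, 2, 3), (1, 3, 4), (2, 1, 3), (2, 2, 2), (2, 3, 4)]
actual = [(1, 2, 3), (1, 3, 5), (2, 1, 5), (2, 2, 1)]

# Matching pairs and their squared errors:
#   (1, 2): (3 - 3)^2 = 0    (1, 3): (4 - 5)^2 = 1
#   (2, 1): (3 - 5)^2 = 4    (2, 2): (2 - 1)^2 = 1
# Total 6 over n = 4 pairs, so RMSE = sqrt(6 / 4) = sqrt(1.5)
result = rmse_from_lists(predicted, actual)
print(result)
```

The pairs (1, 1) and (2, 3) appear only in the predictions, so they are ignored, exactly as the join in `computeError` ignores them.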

### 2.3 Using ALS.train

In this part, we will use the MLlib implementation of Alternating Least Squares, `ALS.train`. ALS takes a training dataset (RDD) and several parameters that control the model creation process. To determine the best values for the parameters, we will use ALS to train several models, and then we will select the best model and use the parameters from that model in the rest of this lab exercise.

The process we will use for determining the best model is as follows:
* Pick a set of model parameters. The most important parameter to `ALS.train` is the *rank*, which is the number of rows in the Users matrix or the number of columns in the Movies matrix. We will train models with ranks of 4, 8, and 12 using the `trainingRDD` dataset.
* Create a model using `ALS.train(trainingRDD, rank, seed=seed, iterations=iterations, lambda_=regularizationParameter)` with the following arguments: an RDD consisting of tuples of the form (UserID, MovieID, rating) used to train the model, an integer rank (4, 8, or 12), a random seed, a number of iterations to execute (we will use 5 for the `iterations` parameter), and a regularization coefficient (we will use 0.1 for the `regularizationParameter`).
* For the prediction step, create an input RDD, `validationForPredictRDD`, consisting of (UserID, MovieID) pairs that you extract from `validationRDD`. You will end up with an RDD of the form: `[(1, 1287), (1, 594), (1, 1270)]`
* Using the model and `validationForPredictRDD`, we can predict rating values by calling `model.predictAll` with the `validationForPredictRDD` dataset, where `model` is the model we generated with `ALS.train`.  `predictAll` accepts an RDD with each entry in the format (userID, movieID) and outputs an RDD with each entry in the format (userID, movieID, rating).
* Evaluate the quality of the model by using the `computeError` function we wrote in part 2.2 to compute the error between the predicted ratings and the actual ratings in `validationRDD`.

In [16]:
from pyspark.mllib.recommendation import ALS

validationForPredictRDD = validationRDD.map(lambda i: (i[0], i[1]))

seed = 5
iterations = 5
regularizationParameter = 0.1
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
tolerance = 0.02

minError = float('inf')
bestRank = -1
bestIteration = -1
for rank in ranks:
    model = ALS.train(trainingRDD, rank, seed=seed, iterations=iterations,
                      lambda_=regularizationParameter)
    predictedRatingsRDD = model.predictAll(validationForPredictRDD)
    error = computeError(predictedRatingsRDD, validationRDD)
    errors[err] = error
    err += 1
    print('For rank {} the RMSE is {}'.format(rank, error))
    if error < minError:
        minError = error
        bestRank = rank

print('The best model was trained with rank {}'.format(bestRank))

For rank 4 the RMSE is 0.8274646853086821
For rank 8 the RMSE is 0.8318119521521181
For rank 12 the RMSE is 0.8151127619270172
The best model was trained with rank 12


We see that rank 12 produces the best model on the validation set.

### 2.4 Testing Your Model

So far, we used the `trainingRDD` and `validationRDD` datasets to select the best model.  Since we used these two datasets to determine what model is best, we cannot use them to test how good the model is - otherwise we would be very vulnerable to overfitting.  To decide how good our model is, we need to use the `testRDD` dataset.  We will use the `bestRank` we determined in part 2.3 to create a model for predicting the ratings for the test dataset and then we will compute the RMSE.

The steps we will perform are:

* Train a model, using the `trainingRDD`, `bestRank`, and the parameters used in part 2.3: `seed=seed`, `iterations=iterations`, and `lambda_=regularizationParameter`.
* For the prediction step, create an input RDD, `testForPredictingRDD`, consisting of (UserID, MovieID) pairs extracted from `testRDD`. We will end up with an RDD of the form: `[(1, 1287), (1, 594), (1, 1270)]`
* Use `myModel.predictAll` to predict rating values for the test dataset.
* Evaluate the quality of the model by using the `computeError` function written in part 2.2 to compute the RMSE between the predicted ratings in `predictedTestRDD` and the actual ratings in `testRDD`.

In [17]:
myModel = ALS.train(trainingRDD, bestRank, seed=seed, iterations=iterations,
                    lambda_=regularizationParameter)
testForPredictingRDD = testRDD.map(lambda i: (i[0], i[1]))
predictedTestRDD = myModel.predictAll(testForPredictingRDD)

testRMSE = computeError(predictedTestRDD, testRDD)

print('The model had a RMSE on the test set of {}'.format(testRMSE))

The model had a RMSE on the test set of 0.8149134493160599


We now have code to predict how users will rate movies.

## 3. Predictions for Ourselves

The ultimate goal of this lab is to predict what movies to recommend.  In order to do that, we will first need to add ratings for ourselves to the `ratingsRDD` dataset.

### 3.1 Our Movie Ratings

The user ID 0 is unassigned, so we will use it for our own ratings: we set the variable `myUserID` to 0. Next, we create a new RDD `myRatingsRDD` containing our ratings for at least 10 movies. Each entry should be formatted as `(myUserID, movieID, rating)` (i.e., in the same way as the entries of `trainingRDD`).  As in the original dataset, ratings should be between 1 and 5 (inclusive). If you have not seen at least 10 of the movies listed in part 1.3, you can increase the parameter passed to `take()` in that cell until there are 10 movies that you have seen (or you can guess what your rating would be for movies you have not seen).

In [18]:
myUserID = 0

# Note that the movie IDs are the *last* number on each line. A common error was to use the number of ratings as the movie ID.
myRatedMovies = [
     # The format of each line is (myUserID, movie ID, your rating)
     # For example, to give the movie "Star Wars: Episode IV - A New Hope (1977)" a five rating, you would add the following line:
     #   (myUserID, 260, 5),
     (myUserID, 527, 5), # u"Schindler's List (1993)"
     (myUserID, 50, 5), # u'Usual Suspects, The (1995)'
     (myUserID, 260, 5), # u'Star Wars: Episode IV - A New Hope (1977)'
     (myUserID, 2762, 5), # u'Sixth Sense, The (1999)'
     (myUserID, 2571, 5), # u'Matrix, The (1999)'
     (myUserID, 1278, 5), # u'Young Frankenstein (1974)'
     (myUserID, 296, 3), # u'Pulp Fiction (1994)'
     (myUserID, 2858, 1), # u'American Beauty (1999)'
     (myUserID, 2028, 1), # u'Saving Private Ryan (1998)'
     (myUserID, 1, 3), # u'Toy Story (1995)'
    ]
myRatingsRDD = sc.parallelize(myRatedMovies)
print('My movie ratings: {}'.format(myRatingsRDD.take(10)))

My movie ratings: [(0, 527, 5), (0, 50, 5), (0, 260, 5), (0, 2762, 5), (0, 2571, 5), (0, 1278, 5), (0, 296, 3), (0, 2858, 1), (0, 2028, 1), (0, 1, 3)]


### 3.2 Add Your Movies to Training Dataset

Now that we have ratings for ourselves, we need to add them to the `trainingRDD` dataset so that the model we train will incorporate our preferences. Spark's `union` transformation combines two RDDs; we use `union()` to create a new training dataset that includes our ratings and the data in the original training dataset.

In [19]:
trainingWithMyRatingsRDD = trainingRDD.union(myRatingsRDD)

print('The training dataset now has {}'
      ' more entries than the original training dataset'.format(
             (trainingWithMyRatingsRDD.count() - trainingRDD.count())))
assert(trainingWithMyRatingsRDD.count() - trainingRDD.count()) == myRatingsRDD.count()

The training dataset now has 10 more entries than the original training dataset


### 3.3 Train a Model with Your Ratings

Now, we train a model with our ratings added and the parameters used in part 2.3: `bestRank`, `seed=seed`, `iterations=iterations`, and `lambda_=regularizationParameter`.

In [20]:
myRatingsModel = ALS.train(trainingWithMyRatingsRDD, bestRank, seed=seed,
                           iterations=iterations,
                           lambda_=regularizationParameter)

### 3.4 Check RMSE for the New Model with Our Ratings

Compute the RMSE for this new model on the test set.

* For the prediction step, we reuse `testForPredictingRDD`, consisting of (UserID, MovieID) pairs that we extracted from `testRDD`. The RDD has the form: `[(1, 1287), (1, 594), (1, 1270)]`
* Use `myRatingsModel.predictAll()` to predict rating values for the `testForPredictingRDD` test dataset, set this as `predictedTestMyRatingsRDD`
* Use `testRDD` and the `computeError` function to compute the RMSE between `testRDD` and the `predictedTestMyRatingsRDD` from the model.

In [21]:
predictedTestMyRatingsRDD = myRatingsModel.predictAll(testForPredictingRDD)
testRMSEMyRatings = computeError(predictedTestMyRatingsRDD, testRDD)
print('The model had a RMSE on the test set of {}'.format(testRMSEMyRatings))

The model had a RMSE on the test set of 0.8210007283089136


### 3.5 Predict Our Ratings

So far, we have only used the `predictAll` method to compute the error of the model.  Here, we use `predictAll` to predict what ratings we would give to the movies that we have not already rated.

The steps we will perform are:
* Use the Python list `myRatedMovies` to transform the `moviesRDD` into an RDD with entries that are pairs of the form (myUserID, Movie ID) and that does not contain any movies that you have rated. This transformation will yield an RDD of the form: `[(0, 1), (0, 2), (0, 3), (0, 4)]`.
* For the prediction step, use the input RDD, `myUnratedMoviesRDD`, with `myRatingsModel.predictAll` to predict your ratings for the movies.
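
The first step above can be sketched in plain Python on a toy dataset. This is an illustration only, not Spark code: the movie titles and the single rating are made up, and the list comprehension stands in for the `filter` and `map` transformations.

```python
# Hypothetical toy data, for illustration only
my_user_id = 0
movies = [(1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C'), (4, 'Movie D')]  # (MovieID, Title)
my_rated_movies = [(my_user_id, 2, 4.0)]  # (UserID, MovieID, Rating)

# Build the set of rated movie IDs once so each membership test is O(1)
rated_ids = {movie_id for _, movie_id, _ in my_rated_movies}

# Keep unrated movies and pair each one with our user ID
my_unrated_movies = [
    (my_user_id, movie_id)
    for movie_id, _ in movies
    if movie_id not in rated_ids
]
```

Here `my_unrated_movies` contains one (UserID, MovieID) pair for every movie except the one already rated, which is the input shape `predictAll` expects.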

In [22]:
# Collect the IDs of the movies we have already rated into a set, so the
# membership test below does not rebuild a list for every movie.
# The movie ID is at index 1 of each entry in myRatedMovies.
myRatedMovieIDs = set(m[1] for m in myRatedMovies)

# Transform moviesRDD into an RDD with entries that are pairs of the form
# (myUserID, Movie ID) and that does not contain any movies that we have rated.
myUnratedMoviesRDD = (moviesRDD
                      .filter(lambda i: i[0] not in myRatedMovieIDs)
                      .map(lambda i: (myUserID, i[0]))
                     )

# Use myUnratedMoviesRDD with myRatingsModel.predictAll() to predict our ratings for the movies
predictedRatingsRDD = myRatingsModel.predictAll(myUnratedMoviesRDD)

### 3.6 List the Highest Predicted Ratings

We have our predicted ratings. Now we can print out 20 movies along with their predicted ratings.

The steps we perform are:
* From Parts 1.2 and 1.3, we know that we should look at movies with a reasonable number of reviews (e.g., more than 75 reviews). We can experiment with a lower threshold, but fewer ratings for a movie may yield higher prediction errors. Transform `movieIDsWithAvgRatingsRDD` from Part 1.2, which has the form (MovieID, (number of ratings, average rating)), into an RDD of the form (MovieID, number of ratings): `[(2, 332), (4, 71), (6, 442)]`
* We want to see movie names, instead of movie IDs. Transform `predictedRatingsRDD` into an RDD with entries that are pairs of the form (Movie ID, Predicted Rating): `[(3456, -0.5501005376936687), (1080, 1.5885892024487962), (320, -3.7952255522487865)]`
* Use RDD transformations with `predictedRDD` and `movieCountsRDD` to yield an RDD with tuples of the form (Movie ID, (Predicted Rating, number of ratings)): `[(2050, (0.6694097486155939, 44)), (10, (5.29762541533513, 418)), (2060, (0.5055259373841172, 97))]`
* Use RDD transformations with `predictedWithCountsRDD` and `moviesRDD` to yield an RDD with tuples of the form (Predicted Rating, Movie Name, number of ratings), _for movies with more than 75 ratings._ For example, showing only the first two fields: `[(7.983121900375243, u'Under Siege (1992)'), (7.9769201864261285, u'Fifth Element, The (1997)')]`
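
The steps above can be sketched end to end in plain Python on toy data. This is an illustration under stated assumptions, not the notebook's Spark code: all IDs, titles, ratings and counts are made up, dictionaries stand in for the RDD `join`, and `sorted` with a slice stands in for `takeOrdered`.

```python
# Hypothetical toy data, for illustration only
predicted = [(10, 4.8), (20, 4.9), (30, 3.1)]                    # (MovieID, Predicted Rating)
movie_counts = [(10, 418), (20, 44), (30, 97)]                   # (MovieID, number of ratings)
movies = [(10, 'Movie A'), (20, 'Movie B'), (30, 'Movie C')]     # (MovieID, Title)

counts_by_id = dict(movie_counts)
titles_by_id = dict(movies)

# Inner join on MovieID: (MovieID, (Predicted Rating, number of ratings))
predicted_with_counts = [
    (movie_id, (rating, counts_by_id[movie_id]))
    for movie_id, rating in predicted
    if movie_id in counts_by_id
]

# Keep movies with more than 75 ratings, then attach the title:
# (Predicted Rating, Movie Name, number of ratings)
ratings_with_names = [
    (rating, titles_by_id[movie_id], count)
    for movie_id, (rating, count) in predicted_with_counts
    if count > 75 and movie_id in titles_by_id
]

# Highest predicted rating first; the sort key is the rating, not the MovieID
top_movies = sorted(ratings_with_names, key=lambda x: -x[0])[:20]
```

In this toy run, 'Movie B' has the highest predicted rating but is dropped because it has only 44 ratings, which shows why the threshold is applied before ranking.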

In [23]:
movieIDsWithAvgRatingsRDD.take(1)[0]

(122, (2412, 2.861318407960199))

In [27]:
# Transform movieIDsWithAvgRatingsRDD from Part 1.2,
# which has the form (MovieID, (number of ratings, average rating)),
# into an RDD of the form (MovieID, number of ratings)
movieCountsRDD = movieIDsWithAvgRatingsRDD.map(lambda m: (m[0], m[1][0]))

# Transform predictedRatingsRDD into an RDD with entries that are pairs of the form
# (Movie ID, Predicted Rating)
predictedRDD = predictedRatingsRDD.map(lambda r: (r.product, r.rating))

# Join predictedRDD with movieCountsRDD to yield an RDD
# with tuples of the form (Movie ID, (Predicted Rating, number of ratings))
predictedWithCountsRDD = predictedRDD.join(movieCountsRDD)

# Keep only movies with more than 75 ratings. The entries keep the form
# (Movie ID, (Predicted Rating, number of ratings)); movie names are looked up
# from moviesRDD in the next cell.
ratingsWithNamesRDD = (predictedWithCountsRDD
                       .filter(lambda p: p[1][1] > 75))

# Take 20 entries. The key -x[0] orders by the first tuple element, the Movie ID,
# in descending order; a key of -x[1][0] would order by predicted rating instead.
predictedHighestRatedMovies = ratingsWithNamesRDD.takeOrdered(20, key=lambda x: -x[0])
print('My highest rated movies as predicted '
      '(for movies with more than 75 reviews):\n{}'.format(
             '\n'.join(map(str, predictedHighestRatedMovies))))

My highest rated movies as predicted (for movies with more than 75 reviews):
(63131, (3.2590244625136977, 88))
(63113, (3.4312300021011986, 389))
(63082, (3.4157228376449336, 108))
(62956, (3.683582291082152, 141))
(62434, (2.302120453469999, 158))
(62394, (2.744556506833817, 106))
(62374, (3.2972070381575227, 109))
(62081, (3.547863059309362, 136))
(61729, (3.676086898344186, 94))
(61352, (3.1897657165034095, 86))
(61350, (3.233978342226265, 83))
(61323, (2.864362333153166, 572))
(61248, (2.7018498082292535, 96))
(61240, (3.595194271399606, 95))
(61160, (3.606771369694213, 127))
(61132, (3.2745751965751855, 588))
(61024, (2.9699920388153656, 264))
(60950, (3.1627092013856353, 161))
(60937, (2.9323536774730625, 141))
(60766, (3.6118529723213046, 80))


In [28]:
moviesRDD.filter(lambda m: m[0] in [p[0] for p in predictedHighestRatedMovies]).collect()

[(60766, 'Man on Wire (2008)'),
 (60937, 'Mummy: Tomb of the Dragon Emperor, The (2008)'),
 (60950, 'Vicky Cristina Barcelona (2008)'),
 (61024, 'Pineapple Express (2008)'),
 (61132, 'Tropic Thunder (2008)'),
 (61160, 'Star Wars: The Clone Wars (2008)'),
 (61240, 'Let the Right One In (Låt den rätte komma in) (2008)'),
 (61248, 'Death Race (2008)'),
 (61323, 'Burn After Reading (2008)'),
 (61350, 'Babylon A.D. (2008)'),
 (61352, 'Traitor (2008)'),
 (61729, 'Ghost Town (2008)'),
 (62081, 'Eagle Eye (2008)'),
 (62374, 'Body of Lies (2008)'),
 (62394, 'Max Payne (2008)'),
 (62434, 'Zack and Miri Make a Porno (2008)'),
 (62956, "Futurama: Bender's Game (2008)"),
 (63082, 'Slumdog Millionaire (2008)'),
 (63113, 'Quantum of Solace (2008)'),
 (63131, 'Role Models (2008)')]