# Movie Recommendation

In this project, our goal is to build a movie recommendation system using collaborative filtering. Collaborative filtering is a technique for making predictions (filtering) about the interests of a user by collecting preference or taste information from many other users (collaboration). The underlying assumption is that if person A has the same opinion as person B on one issue, A is more likely to share B's opinion on a different issue than the opinion of a randomly chosen person.

For this task, the [MovieLens dataset](https://grouplens.org/datasets/movielens/) was used. It contains over 1 million anonymous ratings of 3883 movies by 6040 users.

The Apache Spark framework (and its Machine Learning library) was used to process the data.

## Spark setup

As some functionalities of the Spark SQL component will be used, we need to initialize a `SparkSession` object.

The `SparkContext` object (required for other basic functionalities) is automatically created by PySpark and it is defined in the `sc` variable.

In [1]:
from pyspark.sql import SparkSession

In [2]:
# Create the (singleton) SparkSession object.
ss = SparkSession.builder.appName("Movie Recommendation").getOrCreate()

## Data load

The MovieLens dataset is composed of three text files:

- `users.dat`: contains information about the gender, age, occupation and zipcode of each user.
- `movies.dat`: contains information about the title, year and genres of each movie.
- `ratings.dat`: contains the ratings that users gave to the movies they watched.

They can be opened as plain text RDDs.

In [3]:
import os

In [4]:
# Read the users data from a text file.
# The records are read as strings.
usersRdd0 = sc.textFile(os.path.join("ml-1m", "users.dat"))
usersRdd0.take(5)

['1::F::1::10::48067',
 '2::M::56::16::70072',
 '3::M::25::15::55117',
 '4::M::45::7::02460',
 '5::M::25::20::55455']

In [5]:
# Read the movies data from a text file.
# The records are read as strings.
moviesRdd0 = sc.textFile(os.path.join("ml-1m", "movies.dat"))
moviesRdd0.take(5)

["1::Toy Story (1995)::Animation|Children's|Comedy",
 "2::Jumanji (1995)::Adventure|Children's|Fantasy",
 '3::Grumpier Old Men (1995)::Comedy|Romance',
 '4::Waiting to Exhale (1995)::Comedy|Drama',
 '5::Father of the Bride Part II (1995)::Comedy']

In [6]:
# Read the ratings data from a text file.
# The records are read as strings.
ratingsRdd0 = sc.textFile(os.path.join("ml-1m", "ratings.dat"))
ratingsRdd0.take(5)

['1::1193::5::978300760',
 '1::661::3::978302109',
 '1::914::3::978301968',
 '1::3408::4::978300275',
 '1::2355::5::978824291']

## Data preparation

The plain text RDDs must be converted to structured dataframes so we can use SQL queries and other high level operations to manipulate them. This is done for each RDD separately as each dataset has its own schema.

In [7]:
from pyspark import StorageLevel
from pyspark.sql import Row
import datetime as dt

### Users

According to the MovieLens dataset's documentation, each user record is composed of 5 fields:

- userID: User's unique identification.
- gender: User's gender ("F" for females, "M" for males).
- age: User's age range encoded as a number (see below).
- occupation: User's occupation encoded as a number (see below).
- zipcode: User's zipcode.

We usually need to encode categorical features as numbers (or sets of binary features) in order to employ them as features for predictive models. Here, the `age` and `occupation` field values are already encoded; however, in a preliminary analysis it might be useful to know their actual values for the sake of interpretation. So the default encoding is undone and the codes are translated to the actual values they represent. Later, when needed, we may encode them again using Spark's `StringIndexer` class.

Thus, in the following cells the plain textual data for each user is split, the field values are computed, and the result is put into a structured `DataFrame`.

In [8]:
# Map age codes to their true meanings.
agesDict = {
    "1": "Under 18",
    "18": "18-24",
    "25": "25-34",
    "35": "35-44",
    "45": "45-49",
    "50": "50-55",
    "56": "Over 56",
}

In [9]:
# Map occupation codes to their true meanings.
occupationsDict = {
    "0": "other or unspecified",
    "1": "academic/educator",
    "2": "artist",
    "3": "clerical/admin",
    "4": "college/grad student",
    "5": "customer service",
    "6": "doctor/health care",
    "7": "executive/managerial",
    "8": "farmer",
    "9": "homemaker",
    "10": "K-12 student",
    "11": "lawyer",
    "12": "programmer",
    "13": "retired",
    "14": "sales/marketing",
    "15": "scientist",
    "16": "self-employed",
    "17": "technician/engineer",
    "18": "tradesman/craftsman",
    "19": "unemployed",
    "20": "writer",
}

In [10]:
# Transform the RDD of strings into an RDD of structured Rows.
usersRdd1 = usersRdd0.map(lambda line: line.split("::")) \
                     .map(lambda t: Row(userID=int(t[0]),
                                        gender=t[1],
                                        age=agesDict[t[2]],
                                        occupation=occupationsDict[t[3]],
                                        zipcode=t[4]))
usersRdd1.take(5)

[Row(age='Under 18', gender='F', occupation='K-12 student', userID=1, zipcode='48067'),
 Row(age='Over 56', gender='M', occupation='self-employed', userID=2, zipcode='70072'),
 Row(age='25-34', gender='M', occupation='scientist', userID=3, zipcode='55117'),
 Row(age='45-49', gender='M', occupation='executive/managerial', userID=4, zipcode='02460'),
 Row(age='25-34', gender='M', occupation='writer', userID=5, zipcode='55455')]

In [11]:
# Get a DataFrame from the RDD.
usersDf1 = ss.createDataFrame(usersRdd1)
usersDf1.show(5, truncate=False)

+--------+------+--------------------+------+-------+
|age     |gender|occupation          |userID|zipcode|
+--------+------+--------------------+------+-------+
|Under 18|F     |K-12 student        |1     |48067  |
|Over 56 |M     |self-employed       |2     |70072  |
|25-34   |M     |scientist           |3     |55117  |
|45-49   |M     |executive/managerial|4     |02460  |
|25-34   |M     |writer              |5     |55455  |
+--------+------+--------------------+------+-------+
only showing top 5 rows



In [12]:
# Persist it for better performance.
# usersDf1.persist(StorageLevel.MEMORY_AND_DISK)
usersDf1.cache()

DataFrame[age: string, gender: string, occupation: string, userID: bigint, zipcode: string]
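
If the decoded `age` and `occupation` values need to be turned back into numeric codes later, the idea behind a label indexer can be sketched in plain Python. This is only an illustration of frequency-based indexing (the most frequent label gets index 0, which is the default ordering documented for Spark's `StringIndexer`). The helper name `fit_label_index` is hypothetical, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each distinct label to an integer index, most frequent label first.

    Hypothetical helper for illustration only; ties are broken
    alphabetically (an assumption of this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

# Age values of the first five users shown above.
ages = ["Under 18", "Over 56", "25-34", "45-49", "25-34"]
age_index = fit_label_index(ages)
encoded = [age_index[a] for a in ages]
print(age_index)
print(encoded)
```

The mapping is learned once from the data and then applied to every record, which is the same fit/transform split that Spark's feature transformers follow.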

### Movies

Initially, each movie record is composed of 3 fields:

- movieID: Movie's unique identification.
- title: Movie's title and year (see below).
- genres: List of all genres the movie belongs to.

The `title` field includes the movie's year because there are movies with the same name made (or remade) in different years, so the year would be the only way to distinguish them. However, as we have a `movieID` field, this is not really necessary, so we can extract the year from the title and store it in a separate field.

Thus, in the following cells the plain textual data for each movie is split, the field values are computed, and the result is put into a structured `DataFrame`.

In [13]:
# Transform the RDD of strings into an RDD of structured Rows.
moviesRdd1 = moviesRdd0.map(lambda line: line.split("::")) \
                       .map(lambda t: Row(movieID=int(t[0]),
                                          title=t[1][:-7],
                                          year=int(t[1][-5:-1]),
                                          genres=t[2].split("|")))
moviesRdd1.take(5)

[Row(genres=['Animation', "Children's", 'Comedy'], movieID=1, title='Toy Story', year=1995),
 Row(genres=['Adventure', "Children's", 'Fantasy'], movieID=2, title='Jumanji', year=1995),
 Row(genres=['Comedy', 'Romance'], movieID=3, title='Grumpier Old Men', year=1995),
 Row(genres=['Comedy', 'Drama'], movieID=4, title='Waiting to Exhale', year=1995),
 Row(genres=['Comedy'], movieID=5, title='Father of the Bride Part II', year=1995)]

In [14]:
# Get a DataFrame from the RDD.
moviesDf1 = ss.createDataFrame(moviesRdd1)
moviesDf1.show(5, truncate=False)

+--------------------------------+-------+---------------------------+----+
|genres                          |movieID|title                      |year|
+--------------------------------+-------+---------------------------+----+
|[Animation, Children's, Comedy] |1      |Toy Story                  |1995|
|[Adventure, Children's, Fantasy]|2      |Jumanji                    |1995|
|[Comedy, Romance]               |3      |Grumpier Old Men           |1995|
|[Comedy, Drama]                 |4      |Waiting to Exhale          |1995|
|[Comedy]                        |5      |Father of the Bride Part II|1995|
+--------------------------------+-------+---------------------------+----+
only showing top 5 rows



In [15]:
# Persist it for better performance.
# moviesDf1.persist(StorageLevel.MEMORY_AND_DISK)
moviesDf1.cache()

DataFrame[genres: array<string>, movieID: bigint, title: string, year: bigint]
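
The slicing in cell 13 relies on every title ending with exactly ` (YYYY)`. A slightly more defensive variant, sketched here in plain Python, uses a regular expression and leaves the year as `None` when the pattern is missing. The function name `parse_movie` is hypothetical and is not part of the notebook.

```python
import re

# Matches "<title> (<4-digit year>)" at the end of the string.
TITLE_YEAR = re.compile(r"^(?P<title>.*)\s\((?P<year>\d{4})\)$")

def parse_movie(line):
    """Split one raw 'movieID::title (year)::genres' record into a dict."""
    movie_id, raw_title, raw_genres = line.split("::")
    match = TITLE_YEAR.match(raw_title)
    if match:
        title, year = match.group("title"), int(match.group("year"))
    else:
        # No trailing year found: keep the title untouched.
        title, year = raw_title, None
    return {"movieID": int(movie_id), "title": title,
            "year": year, "genres": raw_genres.split("|")}

record = parse_movie("1::Toy Story (1995)::Animation|Children's|Comedy")
print(record)
```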

### Ratings

Each rating record is composed of 4 fields:

- userID: ID of the user who gave the rating.
- movieID: ID of the rated movie.
- rating: A numerical value ranging from 1 (min) to 5 (max).
- timestamp: An integer that encodes the date and time at which the rating was given.

The `timestamp` can be easily converted to a `datetime` object, which is much easier to interpret.

Thus, in the following cells the plain textual data for each rating is split, the field values are computed, and the result is put into a structured DataFrame.

In [16]:
# Transform the RDD of strings into an RDD of structured Rows.
ratingsRdd1 = ratingsRdd0.map(lambda line: line.split("::")) \
                         .map(lambda t: Row(userID=int(t[0]),
                                            movieID=int(t[1]),
                                            rating=float(t[2]),
                                            timestamp=dt.datetime.fromtimestamp(int(t[3]))))
ratingsRdd1.take(5)

[Row(movieID=1193, rating=5.0, timestamp=datetime.datetime(2000, 12, 31, 20, 12, 40), userID=1),
 Row(movieID=661, rating=3.0, timestamp=datetime.datetime(2000, 12, 31, 20, 35, 9), userID=1),
 Row(movieID=914, rating=3.0, timestamp=datetime.datetime(2000, 12, 31, 20, 32, 48), userID=1),
 Row(movieID=3408, rating=4.0, timestamp=datetime.datetime(2000, 12, 31, 20, 4, 35), userID=1),
 Row(movieID=2355, rating=5.0, timestamp=datetime.datetime(2001, 1, 6, 21, 38, 11), userID=1)]

In [17]:
# Get a DataFrame from the RDD.
ratingsDf1 = ss.createDataFrame(ratingsRdd1)
ratingsDf1.show(5, truncate=False)

+-------+------+---------------------+------+
|movieID|rating|timestamp            |userID|
+-------+------+---------------------+------+
|1193   |5.0   |2000-12-31 20:12:40.0|1     |
|661    |3.0   |2000-12-31 20:35:09.0|1     |
|914    |3.0   |2000-12-31 20:32:48.0|1     |
|3408   |4.0   |2000-12-31 20:04:35.0|1     |
|2355   |5.0   |2001-01-06 21:38:11.0|1     |
+-------+------+---------------------+------+
only showing top 5 rows



In [18]:
# Persist it for better performance.
# ratingsDf1.persist(StorageLevel.MEMORY_AND_DISK)
ratingsDf1.cache()

DataFrame[movieID: bigint, rating: double, timestamp: timestamp, userID: bigint]
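
Note that `datetime.fromtimestamp()` without a time zone converts to the local time of the machine running the code, so the values printed above depend on where the notebook was executed. A minimal sketch of a conversion that does not depend on the machine, assuming the timestamps are Unix epoch seconds (the helper name `to_utc` is illustrative):

```python
import datetime as dt

def to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(int(epoch_seconds), tz=dt.timezone.utc)

# Timestamp of the first rating record shown above.
first_rating_time = to_utc("978300760")
print(first_rating_time.isoformat())
```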

## Exploratory data analysis

Here we'll just do some exploration on the structured datasets in order to better understand all the data we have.

In [19]:
# Create temporary views that can be accessed by SQL queries.
usersDf1.createOrReplaceTempView("users")
moviesDf1.createOrReplaceTempView("movies")
ratingsDf1.createOrReplaceTempView("ratings")

### Users

In [20]:
# Count the number of users.
numUsers = usersDf1.count()
numUsers

6040

In [21]:
# Get the gender distribution.
ss.sql("""SELECT gender, 
                 COUNT(gender) as frequency
          FROM users
          GROUP BY gender
          ORDER BY frequency DESC""").show()

+------+---------+
|gender|frequency|
+------+---------+
|     M|     4331|
|     F|     1709|
+------+---------+



In [22]:
# Get the age distribution.
ss.sql("""SELECT age,
                 COUNT(age) as frequency
          FROM users
          GROUP BY age
          ORDER BY frequency DESC""").show()

+--------+---------+
|     age|frequency|
+--------+---------+
|   25-34|     2096|
|   35-44|     1193|
|   18-24|     1103|
|   45-49|      550|
|   50-55|      496|
| Over 56|      380|
|Under 18|      222|
+--------+---------+



In [23]:
# Get the occupation distribution.
ss.sql("""SELECT occupation,
                 COUNT(occupation) as frequency
          FROM users
          GROUP BY occupation
          ORDER BY frequency DESC""").show(21)

+--------------------+---------+
|          occupation|frequency|
+--------------------+---------+
|college/grad student|      759|
|other or unspecified|      711|
|executive/managerial|      679|
|   academic/educator|      528|
| technician/engineer|      502|
|          programmer|      388|
|     sales/marketing|      302|
|              writer|      281|
|              artist|      267|
|       self-employed|      241|
|  doctor/health care|      236|
|        K-12 student|      195|
|      clerical/admin|      173|
|           scientist|      144|
|             retired|      142|
|              lawyer|      129|
|    customer service|      112|
|           homemaker|       92|
|          unemployed|       72|
| tradesman/craftsman|       70|
|              farmer|       17|
+--------------------+---------+



In [24]:
# Get the user groups stratified.
ss.sql("""SELECT gender,
                 age,
                 occupation,
                 COUNT(*) as frequency
          FROM users
          GROUP BY gender, age, occupation
          ORDER BY frequency DESC""").show()

+------+--------+--------------------+---------+
|gender|     age|          occupation|frequency|
+------+--------+--------------------+---------+
|     M|   18-24|college/grad student|      371|
|     M|   25-34|other or unspecified|      206|
|     M|   25-34|executive/managerial|      191|
|     M|   25-34| technician/engineer|      180|
|     M|   35-44|executive/managerial|      177|
|     M|   25-34|          programmer|      164|
|     F|   18-24|college/grad student|      163|
|     M|   25-34|college/grad student|      141|
|     M|   35-44| technician/engineer|      116|
|     M|Under 18|        K-12 student|      100|
|     M|   25-34|     sales/marketing|       97|
|     M|   35-44|other or unspecified|       92|
|     F|   25-34|other or unspecified|       92|
|     M|   25-34|              writer|       89|
|     M|   25-34|   academic/educator|       87|
|     M|   25-34|              artist|       82|
|     M|   35-44|   academic/educator|       77|
|     M| Over 56|   

The largest single group of users is male college/grad students aged 18-24 (which is not surprising). The next 3 most populated groups are also composed of men, aged 25-34 and with various occupations. Oddly, female users are a clear minority: there are about 2.5 times fewer women than men among the users (1709 vs. 4331), which is probably worth a deeper investigation. People at the limits of the age distribution (under 18 and over 56) are also a minority. As one might expect, farmers and tradesmen/craftsmen are among the least frequent occupations.

### Movies

In [25]:
# Count the number of movies.
numMovies = moviesDf1.count()
numMovies

3883

In [26]:
# Get the year distribution.
ss.sql("""SELECT year,
                 COUNT(year) as frequency
          FROM movies
          GROUP BY year
          ORDER BY frequency DESC""").show()

+----+---------+
|year|frequency|
+----+---------+
|1996|      345|
|1995|      342|
|1998|      337|
|1997|      315|
|1999|      283|
|1994|      257|
|1993|      165|
|2000|      156|
|1986|      104|
|1992|      102|
|1990|       77|
|1987|       71|
|1988|       69|
|1985|       65|
|1984|       60|
|1989|       60|
|1991|       60|
|1982|       50|
|1981|       43|
|1980|       41|
+----+---------+
only showing top 20 rows



In [27]:
# Get some year statistics.
ss.sql("""SELECT MIN(year) as min_year,
                 MAX(year) as max_year,
                 ROUND(AVG(year), 1) as avg_year
          FROM movies""").show()

+--------+--------+--------+
|min_year|max_year|avg_year|
+--------+--------+--------+
|    1919|    2000|  1986.1|
+--------+--------+--------+



All movies in the dataset were released between 1919 and 2000. Most of them are from recent decades: the mean year is 1986, and the six most frequent years are all in the 1990s (led by 1996).

### Ratings

In [28]:
# Count the number of ratings.
numRatings = ratingsDf1.count()
numRatings

1000209

In [29]:
# Get the rating distribution.
ss.sql("""SELECT rating,
                 COUNT(rating) as frequency
          FROM ratings
          GROUP BY rating
          ORDER BY frequency DESC""").show()

+------+---------+
|rating|frequency|
+------+---------+
|   4.0|   348971|
|   3.0|   261197|
|   5.0|   226310|
|   2.0|   107557|
|   1.0|    56174|
+------+---------+



In [30]:
# Get the most active users and their average ratings.
ss.sql("""SELECT userID,
                 COUNT(userID) as numRatings,
                 ROUND(AVG(rating), 1) as avgRating
          FROM ratings
          GROUP BY userID
          ORDER BY numRatings DESC""").show()

+------+----------+---------+
|userID|numRatings|avgRating|
+------+----------+---------+
|  4169|      2314|      3.6|
|  1680|      1850|      3.6|
|  4277|      1743|      4.1|
|  1941|      1595|      3.1|
|  1181|      1521|      2.8|
|   889|      1518|      2.8|
|  3618|      1344|      3.0|
|  2063|      1323|      2.9|
|  1150|      1302|      2.6|
|  1015|      1286|      3.7|
|  5795|      1277|      3.1|
|  4344|      1271|      3.3|
|  1980|      1260|      3.5|
|  2909|      1258|      3.8|
|  1449|      1243|      2.8|
|  4510|      1240|      2.8|
|   424|      1226|      3.7|
|  4227|      1222|      2.7|
|  5831|      1220|      3.7|
|  3391|      1216|      3.7|
+------+----------+---------+
only showing top 20 rows



In [31]:
# Get the overall rank of movies combining both their average rating
# and the number of ratings.
ss.sql("""SELECT movies.title,
                 movies.year,
                 COUNT(ratings.movieID) as numRatings,
                 AVG(ratings.rating) as avgRating
          FROM ratings
               INNER JOIN movies
               ON ratings.movieID = movies.movieID
          GROUP BY ratings.movieID,
                   movies.title,
                   movies.year
          ORDER BY (avgRating * numRatings / %s) DESC""" % numUsers) \
  .show(truncate=False)

+----------------------------------------------+----+----------+------------------+
|title                                         |year|numRatings|avgRating         |
+----------------------------------------------+----+----------+------------------+
|American Beauty                               |1999|3428      |4.3173862310385065|
|Star Wars: Episode IV - A New Hope            |1977|2991      |4.453694416583082 |
|Star Wars: Episode V - The Empire Strikes Back|1980|2990      |4.292976588628763 |
|Star Wars: Episode VI - Return of the Jedi    |1983|2883      |4.022892819979188 |
|Saving Private Ryan                           |1998|2653      |4.337353938937053 |
|Raiders of the Lost Ark                       |1981|2514      |4.477724741447892 |
|Silence of the Lambs, The                     |1991|2578      |4.3518231186966645|
|Matrix, The                                   |1999|2590      |4.315830115830116 |
|Sixth Sense, The                              |1999|2459      |4.4062627084

As 4 stars is the most common rating and 1 and 2 stars are the least common, we can say that there aren't many terribly bad movies in the dataset (in our users' opinion). Some users rated over 1000 movies (and they probably should be rewarded for such a contribution), and some movies were rated by over 2000 users.

American Beauty (1999) is the most rated movie and also the top ranked one when we combine both the average rating and the number of ratings to compute a final score (because being a 5-star movie according to only 1 user means nothing - quantity is important too).
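
The ordering used in cell 31 can be reproduced by hand. Since `avgRating * numRatings` is just the sum of a movie's ratings, the score is that sum divided by the number of users. The sketch below recomputes it in plain Python for a few rows of the output above (averages rounded to four decimals); the variable and function names are illustrative.

```python
NUM_USERS = 6040  # number of users counted earlier in the notebook

# (title, numRatings, avgRating) copied from the query output, averages rounded.
rows = [
    ("Silence of the Lambs, The", 2578, 4.3518),
    ("American Beauty", 3428, 4.3174),
    ("Raiders of the Lost Ark", 2514, 4.4777),
    ("Star Wars: Episode IV - A New Hope", 2991, 4.4537),
]

def score(num_ratings, avg_rating):
    """Combined score: average rating weighted by the share of users who rated."""
    return avg_rating * num_ratings / NUM_USERS

ranked = sorted(rows, key=lambda r: score(r[1], r[2]), reverse=True)
for title, num_ratings, avg_rating in ranked:
    print(f"{title}: {score(num_ratings, avg_rating):.3f}")
```

This also shows why Raiders of the Lost Ark is listed before The Silence of the Lambs despite having fewer ratings: its higher average is enough to give it a slightly larger score.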

## Preprocessing

In the preprocessing stage we put all data in good shape for the learning algorithms that come later. This may involve feature encoding, scaling, transforming, splitting, joining, etc.

As the goal here is to use collaborative filtering, the only data that needs to be preprocessed is the `ratings` dataset, because only the explicit feedback scores (aka the ratings) are needed to learn the recommender model.

In [32]:
from pyspark.sql.functions import col

### Mean shift

What should the recommender system do when it finds a new user that never rated anything before (i.e. nothing is known about his/her tastes)? A common approach is just recommending the overall top rated movies. This can be done in a simple manner by shifting each movie's rating relative to its average value (i.e. making ratings have zero mean) and then, in the end, adding such average value to the score predicted for that movie.
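
The idea can be sketched in plain Python before applying it to the whole dataset. The ratings below are toy values (not taken from the dataset), and the `0.6` shifted prediction is a made-up model output used only to show the final addition.

```python
# Toy ratings for a single movie (illustrative values, not from the dataset).
movie_ratings = {"userA": 4.0, "userB": 3.0, "userC": 5.0}

# 1. Average rating of the movie.
avg_rating = sum(movie_ratings.values()) / len(movie_ratings)

# 2. Shift every rating so that the movie's ratings have zero mean.
shifted = {user: r - avg_rating for user, r in movie_ratings.items()}

def unshift(shifted_prediction, avg):
    """Turn a shifted prediction back into a rating on the original scale."""
    return shifted_prediction + avg

# A known user gets the model's shifted prediction plus the average;
# a brand-new user gets a shifted prediction of 0, i.e. the plain average.
known_user_prediction = unshift(0.6, avg_rating)
new_user_prediction = unshift(0.0, avg_rating)
print(avg_rating, shifted, known_user_prediction, new_user_prediction)
```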

In [33]:
# Get the average rating and the rating frequency by movie.
ratingStatsDf = ss.sql("""SELECT movieID,
                                 AVG(rating) as avgRating,
                                 (COUNT(movieID) / %s) as freqRating
                          FROM ratings
                          GROUP BY movieID""" % numUsers)
ratingStatsDf.show(truncate=False)

+-------+------------------+---------------------+
|movieID|avgRating         |freqRating           |
+-------+------------------+---------------------+
|29     |4.062034739454094 |0.06672185430463576  |
|1806   |2.892857142857143 |0.02781456953642384  |
|474    |3.825102880658436 |0.1609271523178808   |
|2453   |3.25              |0.02913907284768212  |
|2529   |3.712564543889845 |0.1923841059602649   |
|2040   |2.9453125         |0.02119205298013245  |
|26     |3.53              |0.016556291390728478 |
|3506   |3.3652694610778444|0.027649006622516556 |
|3091   |4.283687943262412 |0.023344370860927152 |
|2250   |3.1451612903225805|0.010264900662251655 |
|1677   |2.7567567567567566|0.006125827814569536 |
|1950   |4.129310344827586 |0.0576158940397351   |
|3764   |2.6408163265306124|0.04056291390728477  |
|2927   |4.08080808080808  |0.01639072847682119  |
|964    |3.392156862745098 |0.008443708609271523 |
|2509   |2.7777777777777777|0.002980132450331126 |
|2214   |3.0               |1.6

In [34]:
ratingStatsDf.cache()

DataFrame[movieID: bigint, avgRating: double, freqRating: double]

In [35]:
# Register the statistics as a temporary view (safe to re-run).
ratingStatsDf.createOrReplaceTempView("ratingStats")

In [36]:
# Get the ratings relative to each movie's average rating.
ratingsDf2 = ss.sql("""SELECT ratings.movieID,
                              userID,
                              rating,
                              (rating - avgRating) as shiftedRating
                       FROM ratings
                            INNER JOIN ratingStats
                            ON ratings.movieID = ratingStats.movieID""")
ratingsDf2.show()

+-------+------+------+-------------------+
|movieID|userID|rating|      shiftedRating|
+-------+------+------+-------------------+
|     26|    18|   4.0| 0.4700000000000002|
|     26|    69|   4.0| 0.4700000000000002|
|     26|   229|   4.0| 0.4700000000000002|
|     26|   342|   4.0| 0.4700000000000002|
|     26|   524|   3.0|-0.5299999999999998|
|     26|   655|   3.0|-0.5299999999999998|
|     26|   748|   5.0| 1.4700000000000002|
|     26|   881|   3.0|-0.5299999999999998|
|     26|   890|   3.0|-0.5299999999999998|
|     26|   918|   4.0| 0.4700000000000002|
|     26|   963|   4.0| 0.4700000000000002|
|     26|   973|   4.0| 0.4700000000000002|
|     26|  1015|   3.0|-0.5299999999999998|
|     26|  1069|   3.0|-0.5299999999999998|
|     26|  1120|   3.0|-0.5299999999999998|
|     26|  1150|   3.0|-0.5299999999999998|
|     26|  1182|   2.0|-1.5299999999999998|
|     26|  1203|   4.0| 0.4700000000000002|
|     26|  1279|   3.0|-0.5299999999999998|
|     26|  1314|   1.0|         

In [37]:
# ratingsDf2.persist(StorageLevel.MEMORY_AND_DISK)
ratingsDf2.cache()

DataFrame[movieID: bigint, userID: bigint, rating: double, shiftedRating: double]

In [38]:
# Release cached data that is no longer needed.
ratingsDf1.unpersist()

DataFrame[movieID: bigint, rating: double, timestamp: timestamp, userID: bigint]

## Model training

As previously mentioned, the collaborative filtering technique only requires the user IDs, the movie IDs and the rating each user gave to each movie they watched. The *Alternating Least Squares (ALS)* technique is then used to learn latent factors for each user and each movie that best fit the observed ratings, so the model can predict the rating a user would probably give to a movie they have never seen before.
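
To give an intuition for the "alternating" part, here is a deliberately simplified sketch in plain Python. It uses a single latent factor per user and per movie (rank 1), so each update is a scalar closed-form formula. It is not how Spark implements `ALS`; the function name `als_rank1`, the starting values and the toy ratings are all assumptions made for the illustration.

```python
def als_rank1(ratings, num_users, num_items, reg=0.01, iterations=10):
    """Fit r[u, i] ~ x[u] * y[i] on the observed entries by alternating
    regularized least squares. Rank 1 keeps every update a scalar formula."""
    x = [0.0] * num_users
    y = [1.0 / (i + 1) for i in range(num_items)]  # arbitrary non-zero start
    for _ in range(iterations):
        # Fix the item factors and solve for each user factor in closed form.
        for u in range(num_users):
            obs = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            x[u] = sum(y[i] * r for i, r in obs) / (reg + sum(y[i] ** 2 for i, _ in obs))
        # Fix the user factors and solve for each item factor.
        for i in range(num_items):
            obs = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            y[i] = sum(x[u] * r for u, r in obs) / (reg + sum(x[u] ** 2 for u, _ in obs))
    return x, y

# Observed (user, item) -> shifted rating; user 1 has not rated item 1.
ratings = {(0, 0): 1.0, (0, 1): -1.0,
           (1, 0): 0.5,
           (2, 0): -1.0, (2, 1): 1.0}
x, y = als_rank1(ratings, num_users=3, num_items=2)
print("predicted rating of user 1 for item 1:", x[1] * y[1])
```

User 1 rated item 0 half as strongly as user 0 did, so the model fills the missing entry with roughly half of user 0's rating for item 1 (slightly shrunk toward zero by the regularization).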

Parameter tuning (the process of finding the best combination of input parameters for the learning algorithm) was performed using a grid-search with a 3-fold cross-validation procedure.
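
The procedure itself is independent of Spark and can be sketched in a few lines of plain Python. To keep the example short, the "model" is a stand-in (a mean shrunk toward zero by the regularization strength), not ALS, and the folds are built deterministically by position rather than at random. All names and values are illustrative.

```python
import math

def k_folds(items, k):
    """Split items into k folds by position (deterministic, for illustration)."""
    return [items[i::k] for i in range(k)]

def fit_shrunk_mean(train, reg):
    """Stand-in model: a mean shrunk toward 0 by the regularization strength."""
    return sum(train) / (len(train) + reg)

def validation_rmse(values, prediction):
    """RMSE of a constant prediction against the held-out values."""
    return math.sqrt(sum((v - prediction) ** 2 for v in values) / len(values))

def grid_search_cv(values, reg_grid, k=3):
    """Return (best_reg, {reg: mean validation RMSE}) over k folds."""
    folds = k_folds(values, k)
    scores = {}
    for reg in reg_grid:
        fold_errors = []
        for held_out in range(k):
            # Train on every fold except the held-out one.
            train = [v for j, fold in enumerate(folds) if j != held_out for v in fold]
            model = fit_shrunk_mean(train, reg)
            fold_errors.append(validation_rmse(folds[held_out], model))
        scores[reg] = sum(fold_errors) / k
    best_reg = min(scores, key=scores.get)
    return best_reg, scores

toy_ratings = [4.0, 3.0, 5.0, 4.0, 2.0, 4.0, 5.0, 3.0, 4.0]
best_reg, scores = grid_search_cv(toy_ratings, reg_grid=[0.01, 0.1, 1])
print(best_reg, scores)
```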

In [39]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

In [40]:
# Create the Alternating Least Squares (ALS) estimator.
als = ALS(userCol="userID",
          itemCol="movieID",
          ratingCol="shiftedRating",
          maxIter=5)

In [41]:
# Build the parameter grid for tuning.
paramGrid = ParamGridBuilder().addGrid(als.regParam, [0.01, 0.1, 1]) \
                              .build()

In [42]:
# 3-fold cross-validation procedure.
alsCV = CrossValidator(estimator=als,
                       estimatorParamMaps=paramGrid,
                       evaluator=RegressionEvaluator(labelCol="shiftedRating"),
                       numFolds=3)

In [43]:
# Train the model.
alsModel = alsCV.fit(ratingsDf2)

In [44]:
shiftedPredictionsDf = alsModel.transform(ratingsDf2) \
                               .withColumnRenamed("prediction", "shiftedPrediction")
shiftedPredictionsDf.show()

+-------+------+------+-------------------+-----------------+
|movieID|userID|rating|      shiftedRating|shiftedPrediction|
+-------+------+------+-------------------+-----------------+
|    148|    53|   5.0|  2.217391304347826|        1.5828782|
|    148|   673|   5.0|  2.217391304347826|        2.4099007|
|    148|  4169|   3.0|0.21739130434782616|       0.70518947|
|    148|  4227|   2.0|-0.7826086956521738|       -1.3721882|
|    148|  5333|   3.0|0.21739130434782616|      -0.48261184|
|    148|  3184|   4.0| 1.2173913043478262|        0.8259024|
|    148|  4387|   1.0|-1.7826086956521738|       -1.2662555|
|    148|  4784|   3.0|0.21739130434782616|        0.9237462|
|    148|  2383|   2.0|-0.7826086956521738|      -0.16618633|
|    148|  1242|   3.0|0.21739130434782616|       0.17920734|
|    148|  3539|   3.0|0.21739130434782616|       -0.1194027|
|    148|  1069|   2.0|-0.7826086956521738|      -0.70462483|
|    148|  1605|   2.0|-0.7826086956521738|       -0.7226115|
|    148

The trained model predicts recommendation scores that are relative to each movie's average rating. In order to convert them to the true, unshifted scores, some additional operations must be performed. The function below defines such operations.

In [45]:
def predict(model, inputsDf, ratingStatsDf):
    """Makes the final (unshifted) rating predictions according to the given movies and users.
    
    Inputs:
        model          The pretrained predictive model.
        inputsDf       A DataFrame containing the user and movie IDs for which the predictions will be made.
        ratingStatsDf  A DataFrame containing the rating statistics for each movie.

    Outputs:
        outputsDf  Same as inputsDf + a new column with the unshifted predictions. 
    """
    outputsDf = model.transform(inputsDf) \
                     .withColumnRenamed("prediction", "shiftedPrediction") \
                     .fillna(0, "shiftedPrediction") \
                     .join(ratingStatsDf, ["movieID"]) \
                     .withColumn("prediction", col("shiftedPrediction") + col("avgRating"))

    return outputsDf

In [46]:
predictionsDf = predict(alsModel, ratingsDf2, ratingStatsDf)
predictionsDf.show()

+-------+------+------+-------------------+-----------------+-----------------+--------------------+------------------+
|movieID|userID|rating|      shiftedRating|shiftedPrediction|        avgRating|          freqRating|        prediction|
+-------+------+------+-------------------+-----------------+-----------------+--------------------+------------------+
|    148|    53|   5.0|  2.217391304347826|        1.5828782|2.782608695652174|0.003807947019867...| 4.365486927654432|
|    148|   673|   5.0|  2.217391304347826|        2.4099007|2.782608695652174|0.003807947019867...| 5.192509360935377|
|    148|  4169|   3.0|0.21739130434782616|       0.70518947|2.782608695652174|0.003807947019867...|3.4877981621286143|
|    148|  4227|   2.0|-0.7826086956521738|       -1.3721882|2.782608695652174|0.003807947019867...|1.4104204851648081|
|    148|  5333|   3.0|0.21739130434782616|      -0.48261184|2.782608695652174|0.003807947019867...|2.2999968606492747|
|    148|  3184|   4.0| 1.21739130434782

In [47]:
# predictionsDf.persist(StorageLevel.MEMORY_AND_DISK)
predictionsDf.cache()

DataFrame[movieID: bigint, userID: bigint, rating: double, shiftedRating: double, shiftedPrediction: float, avgRating: double, freqRating: double, prediction: double]

## Results

As the recommender model predicts real numbers, it can be evaluated using the *Root Mean Squared Error* (RMSE) metric.
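
As a reminder of what the metric computes, here is a minimal plain-Python version applied to the first five rows of the predictions table above (predictions rounded to two decimals). The function name `rmse` and the rounding are choices made for this sketch.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# (rating, prediction) pairs from the first rows of the predictions table,
# with the predictions rounded to two decimals.
pairs = [(5.0, 4.37), (5.0, 5.19), (3.0, 3.49), (2.0, 1.41), (3.0, 2.30)]
print(round(rmse(pairs), 3))
```

Because the errors are squared before averaging, large individual errors weigh more heavily than many small ones.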

In [48]:
# Evaluate using the root mean squared error (RMSE).
rmse = RegressionEvaluator(labelCol="rating",
                           predictionCol="prediction",
                           metricName="rmse").evaluate(predictionsDf)

In [49]:
print("Root Mean Squared Error (RMSE) = %g" % rmse)

Root Mean Squared Error (RMSE) = 0.775064


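**NOTE**: The reason I trained and tested the model on the same dataset (instead of randomly splitting it into two disjoint datasets, one for training and the other for testing) is [this bug](https://issues.apache.org/jira/browse/SPARK-14489), which makes the evaluation metric `NaN` whenever the test split contains users or movies that are absent from the training split. Apparently the problem was fixed in Spark 2.2.0 with the addition of a new input parameter for `ALS`, but I'm currently using Spark 2.1.0 (the latest stable version when I wrote this), so it is not available yet. Keep in mind that an RMSE measured on the training data is an optimistic estimate of the error on unseen ratings.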
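
The effect of that bug can be illustrated without Spark. When a held-out pair involves a user or movie that was absent from the training split, the model has no factors for it and the prediction is `NaN`, which turns the whole RMSE into `NaN`. As far as I can tell, the parameter added in Spark 2.2.0 is `coldStartStrategy`, and setting it to `"drop"` discards such rows before evaluation. The plain-Python sketch below mimics that behaviour; the function names and the toy values are illustrative only.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def drop_nan_predictions(pairs):
    """Keep only the pairs whose prediction is a real number."""
    return [(a, p) for a, p in pairs if not math.isnan(p)]

# Toy held-out pairs; the last one stands for a user unseen during training.
held_out = [(4.0, 3.5), (2.0, 2.5), (5.0, float("nan"))]

print(rmse(held_out))                        # NaN poisons the metric
print(rmse(drop_nan_predictions(held_out)))  # evaluated on the valid rows only
```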

Now the recommender model is finally ready to be used! Predicting the ratings that specific users would give to specific movies is simply a matter of filling a `DataFrame` with the desired `movieID`s and `userID`s and sending it as input to the `predict()` function. The result will be another `DataFrame` containing an additional column with the predictions (among others).

Notice that, when the input data contains some new `userID` unknown to the model, the shifted predicted rating for the requested movie is 0. Hence, the final prediction is simply the overall average rating for that movie. This is a reasonable strategy: if we know nothing about some user's tastes, just recommend them popular movies that most people liked.

When the `movieID` is unknown to the model, the movie is simply ignored: no prediction is made for it. This is also a reasonable strategy, because it may not be a good idea to recommend a movie we know nothing about. It is better to wait until it receives some ratings first. Were we building a hybrid recommender system (e.g. a model that combines collaborative filtering with content-based recommendation), a different strategy could be employed for better results.
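
These two fallback rules follow from the `fillna` and `join` steps in `predict()`. The plain-Python sketch below mimics them with dictionaries standing in for the trained model and the rating statistics. The numbers are taken from the notebook's outputs, while `predict_rating` and the dictionary names are hypothetical.

```python
# Toy stand-ins for the trained model and the rating statistics.
# Values are taken from the notebook's output where available.
avg_rating_by_movie = {26: 3.53, 100: 3.0625}   # movies seen in training
shifted_prediction = {(26, 100): -0.78005}      # (movieID, userID) -> shifted score

def predict_rating(movie_id, user_id):
    """Mimic predict(): unknown user -> movie average, unknown movie -> None."""
    if movie_id not in avg_rating_by_movie:
        # The inner join with the rating statistics drops unknown movies.
        return None
    # Missing shifted predictions are filled with 0 (the fillna step).
    shift = shifted_prediction.get((movie_id, user_id), 0.0)
    return avg_rating_by_movie[movie_id] + shift

print(predict_rating(26, 100))  # known user and movie
print(predict_rating(100, 0))   # unknown user: falls back to the movie average
print(predict_rating(0, 100))   # unknown movie: no prediction
```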

In [50]:
# Define some test data.
testDf = ss.createDataFrame([(26, 100),
                             (26, 200),
                             (29, 100),
                             (29, 200),
                             (100, 0),   # unknown user
                             (200, 0),   # unknown user
                             (0, 100),   # unknown movie
                             (0, 200)],  # unknown movie
                            ["movieID", "userID"])
testDf.show()

+-------+------+
|movieID|userID|
+-------+------+
|     26|   100|
|     26|   200|
|     29|   100|
|     29|   200|
|    100|     0|
|    200|     0|
|      0|   100|
|      0|   200|
+-------+------+



In [51]:
# Predict ratings for them. Notice that:
#   - When the user is unknown, the predicted rating is just the movie's overall average rating.
#   - When the movie is unknown, no prediction is made for it.
predict(alsModel, testDf, ratingStatsDf).show(truncate=False)

+-------+------+-----------------+------------------+--------------------+------------------+
|movieID|userID|shiftedPrediction|avgRating         |freqRating          |prediction        |
+-------+------+-----------------+------------------+--------------------+------------------+
|26     |100   |-0.78005         |3.53              |0.016556291390728478|2.7499500203132627|
|26     |200   |0.6017402        |3.53              |0.016556291390728478|4.131740181446075 |
|100    |0     |0.0              |3.0625            |0.02119205298013245 |3.0625            |
|29     |100   |-0.6050494       |4.062034739454094 |0.06672185430463576 |3.4569853677347337|
|29     |200   |0.75856614       |4.062034739454094 |0.06672185430463576 |4.820600880582634 |
|200    |0     |0.0              |2.4166666666666665|0.001986754966887417|2.4166666666666665|
+-------+------+-----------------+------------------+--------------------+------------------+



If we want to recommend a list of `N` movies that a specific user is most likely to enjoy, the function below can be used. It builds a `DataFrame` with all the movies the user has not rated yet and predicts a rating (score) for each of them using the pretrained model. The results are then sorted in descending order, so that the `N` highest predicted ratings become the top recommendations for that user.
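Before the Spark version, here is the same selection logic as a small plain-Python sketch: filter out what the user already rated, score the rest, and keep the `N` best. The `top_n` helper, the `score` callback and the toy data are all hypothetical and stand in for the model and `DataFrame`s used in this notebook.

```python
import heapq

def top_n(user_id, all_movies, rated, score, n=10):
    """Return up to n (movie, score) pairs for movies the user has not rated.

    `rated` maps each user to the set of movies they rated, and
    `score(movie, user)` returns a predicted rating or None.
    """
    seen = rated.get(user_id, set())
    candidates = ((m, score(m, user_id)) for m in all_movies if m not in seen)
    # Skip movies that cannot be scored, then keep the n highest scores.
    scored = [(m, s) for m, s in candidates if s is not None]
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Toy data: user 7 already rated movie 2, and movie 4 cannot be scored.
all_movies = [1, 2, 3, 4, 5]
rated = {7: {2}}
toy_scores = {1: 3.2, 2: 4.9, 3: 4.1, 4: None, 5: 3.9}

recommendations = top_n(7, all_movies, rated,
                        lambda movie, user: toy_scores[movie], n=2)
print(recommendations)
```

Using `heapq.nlargest` avoids sorting every candidate when only a handful of results is needed. The Spark function expresses the same idea with `orderBy` followed by `limit`.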

In [52]:
from pyspark.sql.functions import lit

In [53]:
def recommend(model, userID, moviesDf, ratingsDf, ratingStatsDf, n=10):
    """Computes movie recommendations for a single user.

    Inputs:
        model          The pretrained predictive model.
        userID         The ID of the user to compute recommendations for.
        moviesDf       A DataFrame containing all the movies.
        ratingsDf      A DataFrame containing all the ratings.
        ratingStatsDf  A DataFrame containing the rating statistics for each movie.
        n              The maximum number of recommendations.

    Outputs:
        outputsDf  A DataFrame with the columns movieID, title, year and
                   prediction (unshifted) for the recommended movies.
    """
    ratedDf = ratingsDf.where(col("userID") == userID) \
                       .select("movieID")
    unratedDf = moviesDf.join(ratedDf, moviesDf["movieID"] == ratedDf["movieID"], "left_outer") \
                        .where(ratedDf["movieID"].isNull()) \
                        .select(moviesDf["movieID"])

    inputsDf = unratedDf.withColumn("userID", lit(userID))
    outputsDf = predict(model, inputsDf, ratingStatsDf) \
                    .orderBy(col("prediction"), ascending=False) \
                    .limit(n) \
                    .select("movieID", "prediction") \
                    .join(moviesDf, "movieID") \
                    .select("movieID", "title", "year", "prediction")

    return outputsDf

In [54]:
recommend(alsModel, 1000, moviesDf1, ratingsDf2, ratingStatsDf).show(truncate=False)

+-------+---------------------------------------------+----+------------------+
|movieID|title                                        |year|prediction        |
+-------+---------------------------------------------+----+------------------+
|2800   |Little Nemo: Adventures in Slumberland       |1992|7.503374143080277 |
|3847   |Ilsa, She Wolf of the SS                     |1974|6.817411861419678 |
|2342   |Hard Core Logo                               |1996|6.614123799584128 |
|1450   |Prisoner of the Mountains (Kavkazsky Plennik)|1996|6.53896393094744  |
|219    |Cure, The                                    |1995|6.406280214136297 |
|37     |Across the Sea of Time                       |1995|6.27704644203186  |
|3817   |Other Side of Sunday, The (Søndagsengler)    |1996|6.0432658195495605|
|1664   |Nénette et Boni                              |1996|5.965112519264221 |
|2721   |Trick                                        |1999|5.862164483350866 |
|326    |To Live (Huozhe)               