# Building a Recommendation System for the MovieLens Dataset - ALS Explicit Collaborative Filtering

Nick Pasternak (nfp5ga), Kara Fallin (kmf4tg), Aparna Marathe (am7ad)

In this notebook, we reload and process the data. Because the dataset contains 20 million rows, we took a 5% sample before making a 70/30 train-test split. We then built an explicit-feedback ALS collaborative filtering model, tuned it with 3-fold cross-validation, and evaluated it using RMSE and also the actual movie predictions.

In [1]:
import pyspark
import os
import pyspark.sql.types as typ
import pyspark.sql.functions as F
from pyspark.sql.functions import col, asc, desc, split, regexp_extract, explode
from pyspark.ml.evaluation import RegressionEvaluator
# DataFrame-based ALS; the unused RDD-based pyspark.mllib imports (including a wildcard import) were dropped
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, FloatType
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("collaborative_based_filtering") \
    .config("spark.executor.memory", '250g') \
    .config('spark.executor.cores', '8') \
    .config('spark.cores.max', '8') \
    .config("spark.driver.memory",'250g') \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
os.getcwd()

'/sfs/qumulo/qhome/nfp5ga/Desktop/ds5110'

## Load the data

### Movie links: links to the IMDb website

(URL pattern: https://www.imdb.com/title/tt0 + imdbId + /?ref_=fn_al_tt_1)
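
The URL pattern above can be sketched as a small helper. The function name `imdb_url` and the sample ID are hypothetical, and the sketch follows the pattern literally; whether the extra `0` after `tt` is needed depends on how `imdbId` is stored, which is an assumption not checked here.

```python
def imdb_url(imdb_id):
    """Build an IMDb title URL following the pattern: tt0 + imdbId + /?ref_=fn_al_tt_1."""
    return "https://www.imdb.com/title/tt0" + str(imdb_id) + "/?ref_=fn_al_tt_1"

# Hypothetical imdbId value, used only for illustration
print(imdb_url(123456))
```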

In [4]:
links = spark.read.csv("/sfs/qumulo/qhome/nfp5ga/Desktop/ds5110/link.csv", header=True)

### Movies

In [5]:
movies = spark.read.csv("/sfs/qumulo/qhome/nfp5ga/Desktop/ds5110/movie.csv", header=True)
movies = movies.withColumn('genres',split(col('genres'),"[|]"))
movies = movies.withColumn('year',regexp_extract('title', r'(.*)\((\d+)\)', 2))
movies = movies.withColumn('title',regexp_extract('title', r'(.*) \((\d+)\)', 1))
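
To see what the two regular expressions do, here is a minimal pure-Python sketch using the `re` module with the same patterns. The helper below only mimics Spark's `regexp_extract` (it returns an empty string when nothing matches), and the raw title string is the assumed pre-split form of a title that appears later in this notebook.

```python
import re

def regexp_extract(text, pattern, idx):
    """Mimic Spark's regexp_extract: return group idx, or '' when nothing matches."""
    match = re.search(pattern, text)
    return match.group(idx) if match else ""

# Assumed raw title; the greedy (.*) skips the first parenthesis and stops at the year
raw = "Twelve Monkeys (a.k.a. 12 Monkeys) (1995)"
year = regexp_extract(raw, r'(.*)\((\d+)\)', 2)
title = regexp_extract(raw, r'(.*) \((\d+)\)', 1)
print(title, "|", year)
```

A title without a parenthesized year yields an empty string here, which would become null once `year` is cast to an integer later in the notebook.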

### User ratings

In [6]:
ratings = spark.read.csv("/sfs/qumulo/qhome/nfp5ga/Desktop/ds5110/rating.csv", header=True).drop('timestamp')

-----------

## Process the data

### Merge existing data

In [7]:
df1 = ratings.join(movies,on="movieId",how="inner")

df2 = df1.join(links,on=['movieId'],how='inner').drop('tmdbId')

scrape = spark.read.csv("/sfs/qumulo/qhome/nfp5ga/Desktop/ds5110/movie_scrape.csv", header=True)
scrape = scrape.drop('title','year', 'genres')

df = df2.join(scrape,on='imdbId',how='inner')
df = df.drop('imdbId')
df.show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|movieId|userId|rating|               title|              genres|year|category|score|         description|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|      2|     1|   3.5|             Jumanji|[Adventure, Child...|1995|      PG|  7.0|Jumanji, one of t...|
|     29|     1|   3.5|City of Lost Chil...|[Adventure, Drama...|1995|       R|  7.5|Set in a dystopia...|
|     32|     1|   3.5|Twelve Monkeys (a...|[Mystery, Sci-Fi,...|1995|       R|  8.0|James Cole, a pri...|
|     47|     1|   3.5|Seven (a.k.a. Se7en)| [Mystery, Thriller]|1995|       R|  8.6|A film about two ...|
|     50|     1|   3.5| Usual Suspects, The|[Crime, Mystery, ...|1995|       R|  8.5|Following a truck...|
|    112|     1|   3.5|Rumble in the Bro...|[Action, Adventur...|1995|       R|  6.7|Keong comes from ...|
|    151|     1|     4|             R

In [8]:
df = df.withColumn('movieId', col('movieId').cast(IntegerType()))
df = df.withColumn('userId', col('userId').cast(IntegerType()))
df = df.withColumn('rating', col('rating').cast(FloatType()))
df = df.withColumn('year', col('year').cast(IntegerType()))
df = df.withColumn('score', col('score').cast(FloatType()))

In [9]:
df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- userId: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- year: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- score: float (nullable = true)
 |-- description: string (nullable = true)



In [10]:
df.count()

20000263

In [11]:
df = df.na.drop()
df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- userId: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- year: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- score: float (nullable = true)
 |-- description: string (nullable = true)



In [12]:
df.count()

19809344

----------------------

## Build the collaborative filtering model

In [13]:
seed=314

In [14]:
df_red = df.sample(fraction=0.05, seed=seed)
df_red.count()

991777
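
The sample holds 991777 rows rather than exactly 5% of 19809344 (about 990467), because `sample` does not guarantee an exact fraction. The sketch below illustrates why, assuming each row is kept independently with probability `fraction`; the helper name `bernoulli_sample` is hypothetical and this is plain Python, not Spark's implementation.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))
sample = bernoulli_sample(rows, fraction=0.05, seed=314)
# The realized share is close to, but usually not exactly, 0.05
print(len(sample), len(sample) / len(rows))
```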

In [15]:
training, test = df_red.randomSplit([0.7, 0.3], seed=seed)
training.cache()

DataFrame[movieId: int, userId: int, rating: float, title: string, genres: array<string>, year: int, category: string, score: float, description: string]

In [16]:
# Build the recommendation model using ALS on the training data
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")

import time
t = time.time()
model = als.fit(training)
print(time.time() - t)

51.25466227531433
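
For intuition about what `als.fit` is doing, here is a minimal NumPy sketch of explicit-feedback alternating least squares on made-up toy ratings. It is a simplified illustration under stated assumptions, not Spark's implementation: it runs on a single machine and uses a plain ridge penalty, whereas Spark's ALS is distributed and scales the regularization by the number of ratings per user or item. All function names are hypothetical.

```python
import numpy as np

def rmse(ratings, U, V):
    """Root-mean-square error over the observed (user, item, rating) triples."""
    errors = [r - U[u] @ V[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

def solve_side(target, fixed, ratings, target_pos, fixed_pos, reg):
    """Update every row of `target` by ridge regression against `fixed`."""
    rank = fixed.shape[1]
    for t in range(target.shape[0]):
        obs = [row for row in ratings if row[target_pos] == t]
        if not obs:
            continue  # no ratings for this user/item: leave its factors unchanged
        F = np.array([fixed[row[fixed_pos]] for row in obs])
        r = np.array([row[2] for row in obs])
        A = F.T @ F + reg * np.eye(rank)
        target[t] = np.linalg.solve(A, F.T @ r)

def als_explicit(ratings, n_users, n_items, rank=2, reg=0.1, n_iter=10, seed=314):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    history = [rmse(ratings, U, V)]
    for _ in range(n_iter):
        solve_side(U, V, ratings, 0, 1, reg)  # fix item factors, solve user factors
        solve_side(V, U, ratings, 1, 0, reg)  # fix user factors, solve item factors
        history.append(rmse(ratings, U, V))
    return U, V, history

# Made-up toy ratings: (userId, movieId, rating)
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
       (2, 1, 2.0), (2, 2, 5.0), (3, 0, 4.5), (3, 3, 2.5), (0, 3, 3.0)]
U, V, history = als_explicit(toy, n_users=4, n_items=4)
print("RMSE before/after:", history[0], history[-1])
```

Each half-step has a closed-form solution because, with one side held fixed, the objective is an ordinary ridge regression for every row of the other side.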


In [17]:
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error is {}".format(rmse))

Root-mean-square error is 1.1015120790414539
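
The metric itself is simple to state. The sketch below computes RMSE in plain Python on made-up values and shows why the cold start strategy matters: a single NaN prediction (which is what ALS produces by default for a user or movie not seen in training) turns the whole metric into NaN unless such rows are dropped first. The function name and data are illustrative only.

```python
import math

def rmse(pairs, drop_nan=True):
    """RMSE over (actual, predicted) pairs; optionally skip NaN predictions."""
    if drop_nan:
        pairs = [(a, p) for a, p in pairs if not math.isnan(p)]
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Made-up values; the NaN stands in for a user or movie unseen during training
pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, float("nan"))]
print(rmse(pairs, drop_nan=True))   # finite value
print(rmse(pairs, drop_nan=False))  # nan
```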


In [18]:
# Use a separate name for the estimator instance so the ALS class is not shadowed
als_cv = ALS(implicitPrefs=False, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")

In [19]:
## Tuning with CrossValidator
paramMap = ParamGridBuilder() \
            .addGrid(als_cv.rank, [10, 15, 20, 25]) \
            .addGrid(als_cv.maxIter, [10, 15, 20, 25]) \
            .addGrid(als_cv.regParam, [1, 0.1, 0.01]).build()


evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")


CVALS = CrossValidator(estimator=als_cv,
                       estimatorParamMaps=paramMap,
                       evaluator=evaluatorR,
                       numFolds=3)

t1 = time.time()
CVModel = CVALS.setParallelism(4).fit(training)
print(time.time() - t1)

4383.377334356308
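
The long runtime follows from the size of the search. A short sketch of the arithmetic, using the grid values from the cell above: every parameter combination is fitted once per fold, and `CrossValidator` then refits the best combination on the full training set.

```python
from itertools import product

ranks = [10, 15, 20, 25]
max_iters = [10, 15, 20, 25]
reg_params = [1, 0.1, 0.01]
num_folds = 3

grid = list(product(ranks, max_iters, reg_params))
cv_fits = len(grid) * num_folds   # one fit per combination per fold
total_fits = cv_fits + 1          # plus the final refit of the best combination
print(len(grid), cv_fits, total_fits)
```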


In [20]:
CVModel.bestModel._java_obj.parent().getRank()

25

In [21]:
CVModel.bestModel._java_obj.parent().getMaxIter()

25

In [22]:
CVModel.bestModel._java_obj.parent().getRegParam()

0.1
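
An alternative to reaching into `_java_obj` is to pair each parameter map with its averaged metric (a `CrossValidatorModel` exposes these as `avgMetrics`) and pick the best one. The sketch below shows that selection step in plain Python; the helper name, the dictionaries standing in for parameter maps, and the metric values are all hypothetical and are not results from this notebook.

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=False):
    """Return the parameter map whose averaged metric is best."""
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return param_maps[best_index]

# Hypothetical grid entries and averaged RMSE values, for illustration only
param_maps = [
    {"rank": 10, "regParam": 0.1},
    {"rank": 25, "regParam": 0.1},
    {"rank": 25, "regParam": 1},
]
avg_metrics = [1.10, 1.05, 1.40]
print(best_param_map(param_maps, avg_metrics))
```

RMSE is an error metric, so the smallest averaged value wins; for a metric such as R², `larger_is_better=True` would apply.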

In [23]:
model = CVModel.bestModel

In [24]:
preds = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(preds)
print("Root-mean-square error is {}".format(rmse))

Root-mean-square error is 1.0474128953750446


In [25]:
preds.show(10)

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+----------+
|movieId|userId|rating|               title|              genres|year|category|score|         description|prediction|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+----------+
|    148| 96769|   3.0|Awfully Big Adven...|             [Drama]|1995|       R|  5.9|Set right after W...|  2.390569|
|    148|  1716|   2.0|Awfully Big Adven...|             [Drama]|1995|       R|  5.9|Set right after W...| 3.1514006|
|    148| 82418|   3.0|Awfully Big Adven...|             [Drama]|1995|       R|  5.9|Set right after W...| 2.8295527|
|    148| 61712|   4.0|Awfully Big Adven...|             [Drama]|1995|       R|  5.9|Set right after W...| 2.9872584|
|    148| 19380|   1.0|Awfully Big Adven...|             [Drama]|1995|       R|  5.9|Set right after W...| 2.0969753|
|    148| 27248|   4.0|Awfully Big Adven...|            

In [26]:
## Generate top 5 movie recommendations for each user
recommendationsForUser = model.recommendForAllUsers(5)
recommendationsForUser.show(5)

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   148|[{1380, 4.5348597...|
|   463|[{38086, 4.275475...|
|   471|[{4515, 4.863552}...|
|   496|[{26749, 4.384217...|
|   833|[{5605, 4.3623004...|
+------+--------------------+
only showing top 5 rows
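
Conceptually, a top-N recommendation from a matrix factorization model scores each item by the dot product of the user's factor vector with the item's factor vector and keeps the highest scores. A minimal sketch under that assumption, with made-up two-dimensional factors and hypothetical movie IDs (the real model's factors are not shown in this notebook):

```python
import heapq

def top_k_items(user_vec, item_factors, k):
    """Score every item by dot product with the user vector; return the k best."""
    scores = {
        item_id: sum(u * v for u, v in zip(user_vec, item_vec))
        for item_id, item_vec in item_factors.items()
    }
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Made-up 2-dimensional factors (hypothetical movie IDs)
item_factors = {101: [1.0, 0.5], 102: [0.2, 2.0], 103: [1.5, 1.5], 104: [0.1, 0.1]}
user_vec = [2.0, 1.0]
print(top_k_items(user_vec, item_factors, k=2))
```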



In [27]:
## Generate top 5 user recommendations for each movie
recommendationsForMovie = model.recommendForAllItems(5)
recommendationsForMovie.show(5)

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|   1580|[{25952, 5.477379...|
|   4900|[{88415, 4.93003}...|
|   5300|[{89622, 5.499586...|
|   6620|[{85625, 5.386104...|
|   7240|[{42964, 2.084168...|
+-------+--------------------+
only showing top 5 rows



In [28]:
nrecommendations = recommendationsForUser.withColumn('rec_exp', explode('recommendations')).select('userId', col('rec_exp.movieId'), col('rec_exp.rating'))
nrecommendations.limit(10).show()

+------+-------+---------+
|userId|movieId|   rating|
+------+-------+---------+
|   148|   1380|4.5348597|
|   148|   1643|4.3827353|
|   148|  27735|4.2632804|
|   148|  26749| 4.257889|
|   148| 106048|  4.23601|
|   463|  38086|4.2754755|
|   463|  26749| 4.180683|
|   463|   4136|4.1747437|
|   463|  34135|3.9561024|
|   463|  69524|3.9524267|
+------+-------+---------+
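
The `explode` step turns one row per user, holding an array of recommendations, into one row per recommendation. A plain-Python sketch of the same reshaping, using the first two recommendations for user 148 from the output above (the helper name is hypothetical):

```python
def explode_recommendations(rows):
    """Turn (userId, [(movieId, rating), ...]) rows into one flat row per recommendation."""
    return [
        (user_id, movie_id, rating)
        for user_id, recs in rows
        for movie_id, rating in recs
    ]

# First two recommendations for user 148, copied from the output above
nested = [(148, [(1380, 4.5348597), (1643, 4.3827353)])]
for row in explode_recommendations(nested):
    print(row)
```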



In [29]:
nrecommendations2 = recommendationsForMovie.withColumn('rec_exp', explode('recommendations')).select('movieId', col('rec_exp.userId'), col('rec_exp.rating'))
nrecommendations2.limit(10).show()

+-------+------+---------+
|movieId|userId|   rating|
+-------+------+---------+
|   1580| 25952| 5.477379|
|   1580| 85616| 5.230733|
|   1580|114872|5.1982565|
|   1580| 22517| 5.176774|
|   1580| 20253|5.1362166|
|   4900| 88415|  4.93003|
|   4900| 40461| 4.807039|
|   4900|115666| 4.744512|
|   4900| 12376| 4.693067|
|   4900|126728| 4.684249|
+-------+------+---------+
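
Several predicted ratings above exceed 5, since ALS predictions are unconstrained dot products rather than values bounded to the rating scale. If bounded values are wanted for display, one option is to clip them; the sketch below assumes the MovieLens 0.5 to 5.0 scale and is not something the notebook itself does.

```python
def clip_rating(value, low=0.5, high=5.0):
    """Clamp a predicted rating to the assumed 0.5-5.0 rating scale."""
    return max(low, min(high, value))

# Predicted ratings copied from the output above
predicted = [5.477379, 5.230733, 4.93003]
print([clip_rating(p) for p in predicted])
```

Clipping changes only the displayed values; the ranking of users within a movie is unaffected unless several predictions are clipped to the same bound.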



In [30]:
## Recommendations for user 463
nrecommendations.join(movies, on='movieId').filter('userId = 463').show(truncate=False)

+-------+------+---------+-------------------------------------------------+-----------------------------+----+
|movieId|userId|rating   |title                                            |genres                       |year|
+-------+------+---------+-------------------------------------------------+-----------------------------+----+
|38086  |463   |4.2754755|Wishing Stairs (Yeogo goedam 3: Yeowoo gyedan)   |[Drama, Horror]              |2003|
|26749  |463   |4.180683 |Prospero's Books                                 |[Drama, Fantasy]             |1991|
|4136   |463   |4.1747437|Month in the Country, A                          |[Drama]                      |1987|
|34135  |463   |3.9561024|Bonjour Monsieur Shlomi (Ha-Kochavim Shel Shlomi)|[Comedy, Drama]              |2003|
|69524  |463   |3.9524267|Raiders of the Lost Ark: The Adaptation          |[Action, Adventure, Thriller]|1989|
+-------+------+---------+-------------------------------------------------+-----------------------------+----+

In [31]:
## Actual preferences for user 463
ratings.join(movies, on='movieId').filter('userId = 463').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+---------------------------+------------------------+----+
|movieId|userId|rating|title                      |genres                  |year|
+-------+------+------+---------------------------+------------------------+----+
|47     |463   |5     |Seven (a.k.a. Se7en)       |[Mystery, Thriller]     |1995|
|272    |463   |5     |Madness of King George, The|[Comedy, Drama]         |1994|
|150    |463   |5     |Apollo 13                  |[Adventure, Drama, IMAX]|1995|
|17     |463   |5     |Sense and Sensibility      |[Drama, Romance]        |1995|
|261    |463   |5     |Little Women               |[Drama]                 |1994|
+-------+------+------+---------------------------+------------------------+----+
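
One simple way to compare the two tables is to check whether any recommended movie already appears in the user's top-rated list. A minimal sketch using the `movieId` values displayed above for user 463:

```python
recommended = {38086, 26749, 4136, 34135, 69524}  # movieIds recommended to user 463
top_rated = {47, 272, 150, 17, 261}               # movieIds in user 463's top-5 rated list

overlap = recommended & top_rated
print(sorted(overlap))
```

An empty overlap is not by itself a sign of a poor model, since recommendations are meant to surface titles beyond those the user already rated highly; comparing genres or held-out ratings would be a more informative check.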



In [32]:
## Recommendations for user 318
nrecommendations.join(movies, on='movieId').filter('userId = 318').show(truncate=False)

+-------+------+---------+-----------------------------------------------------------------------+----------------------------------+----+
|movieId|userId|rating   |title                                                                  |genres                            |year|
+-------+------+---------+-----------------------------------------------------------------------+----------------------------------+----+
|4878   |318   |4.452909 |Donnie Darko                                                           |[Drama, Mystery, Sci-Fi, Thriller]|2001|
|26347  |318   |4.390346 |Irony of Fate, or Enjoy Your Bath! (Ironiya sudby, ili S legkim parom!)|[Comedy, Drama, Romance]          |1975|
|2726   |318   |4.34551  |Killing, The                                                           |[Crime, Film-Noir]                |1956|
|2662   |318   |4.3430314|War of the Worlds, The                                                 |[Action, Drama, Sci-Fi]           |1953|
|70871  |318   |4.335788 |D

In [33]:
## Actual preferences for user 318
ratings.join(movies, on='movieId').filter('userId = 318').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+------------------------------------------------------------------+-------------------------------------------------+----+
|movieId|userId|rating|title                                                             |genres                                           |year|
+-------+------+------+------------------------------------------------------------------+-------------------------------------------------+----+
|1      |318   |5     |Toy Story                                                         |[Adventure, Animation, Children, Comedy, Fantasy]|1995|
|589    |318   |5     |Terminator 2: Judgment Day                                        |[Action, Sci-Fi]                                 |1991|
|1201   |318   |5     |Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il)|[Action, Adventure, Western]                     |1966|
|1080   |318   |5     |Monty Python's Life of Brian                                      |[Comedy]                          

In [34]:
## Recommendations for user 148
nrecommendations.join(movies, on='movieId').filter('userId = 148').show(truncate=False)

+-------+------+---------+-------------------------------------------+--------------------------------------------+----+
|movieId|userId|rating   |title                                      |genres                                      |year|
+-------+------+---------+-------------------------------------------+--------------------------------------------+----+
|1380   |148   |4.5348597|Grease                                     |[Comedy, Musical, Romance]                  |1978|
|1643   |148   |4.3827353|Mrs. Brown (a.k.a. Her Majesty, Mrs. Brown)|[Drama, Romance]                            |1997|
|27735  |148   |4.2632804|Unstoppable                                |[Action, Adventure, Comedy, Drama, Thriller]|2004|
|26749  |148   |4.257889 |Prospero's Books                           |[Drama, Fantasy]                            |1991|
|106048 |148   |4.23601  |Four Days in July                          |[Comedy, Drama]                             |1985|
+-------+------+---------+-------------------------------------------+--------------------------------------------+----+

In [35]:
## Actual preferences for user 148
ratings.join(movies, on='movieId').filter('userId = 148').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+-----------------------+-----------------------------+----+
|movieId|userId|rating|title                  |genres                       |year|
+-------+------+------+-----------------------+-----------------------------+----+
|289    |148   |5     |Only You               |[Comedy, Romance]            |1994|
|497    |148   |5     |Much Ado About Nothing |[Comedy, Romance]            |1993|
|339    |148   |5     |While You Were Sleeping|[Comedy, Romance]            |1995|
|356    |148   |5     |Forrest Gump           |[Comedy, Drama, Romance, War]|1994|
|17     |148   |5     |Sense and Sensibility  |[Drama, Romance]             |1995|
+-------+------+------+-----------------------+-----------------------------+----+



In [36]:
## User recommendations for movie 1580
nrecommendations2.join(movies, on='movieId').filter('movieId = 1580').show(truncate=False)

+-------+------+---------+-------------------------+------------------------+----+
|movieId|userId|rating   |title                    |genres                  |year|
+-------+------+---------+-------------------------+------------------------+----+
|1580   |25952 |5.477379 |Men in Black (a.k.a. MIB)|[Action, Comedy, Sci-Fi]|1997|
|1580   |85616 |5.230733 |Men in Black (a.k.a. MIB)|[Action, Comedy, Sci-Fi]|1997|
|1580   |114872|5.1982565|Men in Black (a.k.a. MIB)|[Action, Comedy, Sci-Fi]|1997|
|1580   |22517 |5.176774 |Men in Black (a.k.a. MIB)|[Action, Comedy, Sci-Fi]|1997|
|1580   |20253 |5.1362166|Men in Black (a.k.a. MIB)|[Action, Comedy, Sci-Fi]|1997|
+-------+------+---------+-------------------------+------------------------+----+



In [37]:
## Actual preferences for user 25952
ratings.join(movies, on='movieId').filter('userId = 25952').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+-------------------------------+---------------------------+----+
|movieId|userId|rating|title                          |genres                     |year|
+-------+------+------+-------------------------------+---------------------------+----+
|3793   |25952 |5     |X-Men                          |[Action, Adventure, Sci-Fi]|2000|
|3785   |25952 |5     |Scary Movie                    |[Comedy, Horror]           |2000|
|3784   |25952 |5     |Kid, The                       |[Comedy, Fantasy]          |2000|
|1221   |25952 |5     |Godfather: Part II, The        |[Crime, Drama]             |1974|
|1193   |25952 |5     |One Flew Over the Cuckoo's Nest|[Drama]                    |1975|
+-------+------+------+-------------------------------+---------------------------+----+



In [38]:
## Actual preferences for user 85616
ratings.join(movies, on='movieId').filter('userId = 85616').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+----------------------------------+---------------------------+----+
|movieId|userId|rating|title                             |genres                     |year|
+-------+------+------+----------------------------------+---------------------------+----+
|16     |85616 |5     |Casino                            |[Crime, Drama]             |1995|
|47     |85616 |5     |Seven (a.k.a. Se7en)              |[Mystery, Thriller]        |1995|
|21     |85616 |5     |Get Shorty                        |[Comedy, Crime, Thriller]  |1995|
|25     |85616 |5     |Leaving Las Vegas                 |[Drama, Romance]           |1995|
|32     |85616 |5     |Twelve Monkeys (a.k.a. 12 Monkeys)|[Mystery, Sci-Fi, Thriller]|1995|
+-------+------+------+----------------------------------+---------------------------+----+



In [39]:
## Actual preferences for user 114872
ratings.join(movies, on='movieId').filter('userId = 114872').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+------------+-----------------+----+
|movieId|userId|rating|title       |genres           |year|
+-------+------+------+------------+-----------------+----+
|186    |114872|5     |Nine Months |[Comedy, Romance]|1995|
|201    |114872|5     |Three Wishes|[Drama, Fantasy] |1995|
|192    |114872|5     |Show, The   |[Documentary]    |1995|
|69     |114872|5     |Friday      |[Comedy]         |1995|
|194    |114872|5     |Smoke       |[Comedy, Drama]  |1995|
+-------+------+------+------------+-----------------+----+



In [40]:
## Actual preferences for user 22517
ratings.join(movies, on='movieId').filter('userId = 22517').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+-------------------------------+-----------------------------+----+
|movieId|userId|rating|title                          |genres                       |year|
+-------+------+------+-------------------------------+-----------------------------+----+
|47     |22517 |5     |Seven (a.k.a. Se7en)           |[Mystery, Thriller]          |1995|
|231    |22517 |5     |Dumb & Dumber (Dumb and Dumber)|[Adventure, Comedy]          |1994|
|104    |22517 |5     |Happy Gilmore                  |[Comedy]                     |1996|
|10     |22517 |5     |GoldenEye                      |[Action, Adventure, Thriller]|1995|
|110    |22517 |5     |Braveheart                     |[Action, Drama, War]         |1995|
+-------+------+------+-------------------------------+-----------------------------+----+



In [41]:
## Actual preferences for user 20253
ratings.join(movies, on='movieId').filter('userId = 20253').sort('rating', ascending=False).limit(5).show(truncate=False)

+-------+------+------+----------------------------------+----------------------------------------+----+
|movieId|userId|rating|title                             |genres                                  |year|
+-------+------+------+----------------------------------+----------------------------------------+----+
|260    |20253 |5     |Star Wars: Episode IV - A New Hope|[Action, Adventure, Sci-Fi]             |1977|
|368    |20253 |5     |Maverick                          |[Adventure, Comedy, Western]            |1994|
|292    |20253 |5     |Outbreak                          |[Action, Drama, Sci-Fi, Thriller]       |1995|
|329    |20253 |5     |Star Trek: Generations            |[Adventure, Drama, Sci-Fi]              |1994|
|145    |20253 |5     |Bad Boys                          |[Action, Comedy, Crime, Drama, Thriller]|1995|
+-------+------+------+----------------------------------+----------------------------------------+----+

