# Building a Recommendation System for the MovieLens Dataset - Content Based Filtering

Nick Pasternak (nfp5ga), Kara Fallin (kmf4tg), Aparna Marathe (am7ad)

In this notebook, we load and process the data again. Because this model predicts user ratings from movie metadata (release year, IMDb score, genres, category, and movie description), it required feature engineering. We cleaned, hashed, and vectorized the genres, category, and movie description variables, then assembled all five predictors into a single features column. Since our dataset contains 20 million rows, we took a 5% sample before doing a 70/30 train-test split. In this case the split was not random: we sorted the rows by movie ID and assigned the first 70% to training, so that each movie falls essentially in one set or the other but not both. We then built our content-based filtering model using linear regression, tuned it with 3-fold cross-validation, and evaluated it using RMSE and the actual movie predictions for a given user. After building and evaluating all three models, we compared RMSE values and movie predictions to determine the best overall model.

In [1]:
import pyspark
import os
import numpy as np
import pyspark.sql.types as typ
import pyspark.sql.functions as F
from pyspark.sql.functions import col, asc, desc, split, regexp_extract, explode
from pyspark.mllib import recommendation
from pyspark.mllib.recommendation import *
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, FloatType
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local") \
    .appName("data preprocessing") \
    .config("spark.executor.memory", '250g') \
    .config('spark.executor.cores', '8') \
    .config('spark.cores.max', '8') \
    .config("spark.driver.memory",'250g') \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
os.chdir('/project/ds5559/group10_reviews')
os.getcwd()

'/project/ds5559/group10_reviews'

## Load the data

### Movie links: links to the IMDb website

Each link has the form (https://www.imdb.com/title/tt0 + imdbId + /?ref_=fn_al_tt_1)
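
As a small illustration of the URL pattern above, the helper below concatenates the pieces in plain Python. The function name `imdb_url` and the example ID are hypothetical, and the sketch assumes `imdbId` is stored as a digit string that only needs the `tt0` prefix, as the pattern suggests.

```python
def imdb_url(imdb_id: str) -> str:
    """Build an IMDb title URL from the pattern quoted above.

    Assumption: imdb_id is the digit string from link.csv and only
    needs the 'tt0' prefix, exactly as the pattern is written.
    """
    return f"https://www.imdb.com/title/tt0{imdb_id}/?ref_=fn_al_tt_1"


# Illustrative ID value, not taken from the notebook output
print(imdb_url("114709"))
```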

In [4]:
links = spark.read.csv("/project/ds5559/group10_reviews/link.csv", header=True)

### Movies

In [5]:
movies = spark.read.csv("/project/ds5559/group10_reviews/movie.csv", header=True)
movies = movies.withColumn('genres',split(col('genres'),"[|]"))
movies = movies.withColumn('year',regexp_extract('title', r'(.*)\((\d+)\)', 2))
movies = movies.withColumn('title',regexp_extract('title', r'(.*) \((\d+)\)', 1))
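
To make the two regular expressions above easier to follow, here is a plain-Python sketch of what they extract from a raw title. It uses Python's `re` module rather than Spark, and the helper name `split_title` is our own; for these simple patterns we assume the two regex engines behave the same way. Like `regexp_extract`, the sketch returns an empty string when the pattern does not match.

```python
import re

# Same patterns as in the Spark cell above
TITLE_RE = re.compile(r"(.*) \((\d+)\)")
YEAR_RE = re.compile(r"(.*)\((\d+)\)")


def split_title(raw: str) -> tuple:
    """Return (title, year) as strings; empty strings when there is no match."""
    title_match = TITLE_RE.search(raw)
    year_match = YEAR_RE.search(raw)
    title = title_match.group(1) if title_match else ""
    year = year_match.group(2) if year_match else ""
    return title, year


print(split_title("Toy Story (1995)"))
print(split_title("Seven (a.k.a. Se7en) (1995)"))
```

Because `(.*)` is greedy, the year is taken from the last parenthesized run of digits, so a title that itself contains parentheses keeps them.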

### User ratings

In [6]:
ratings = spark.read.csv("/project/ds5559/group10_reviews/rating.csv", header=True).drop('timestamp')

-----------

## Process the data

### Merge existing data

In [7]:
df1 = ratings.join(movies,on="movieId",how="inner")

df2 = df1.join(links,on=['movieId'],how='inner').drop('tmdbId')

scrape = spark.read.csv("/project/ds5559/group10_reviews/movie_scrape.csv", header=True)
scrape = scrape.drop('title','year', 'genres')

df = df2.join(scrape,on='imdbId',how='inner')
df = df.drop('imdbId')
df.show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|movieId|userId|rating|               title|              genres|year|category|score|         description|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|      2|     1|   3.5|             Jumanji|[Adventure, Child...|1995|      PG|  7.0|Jumanji, one of t...|
|     29|     1|   3.5|City of Lost Chil...|[Adventure, Drama...|1995|       R|  7.5|Set in a dystopia...|
|     32|     1|   3.5|Twelve Monkeys (a...|[Mystery, Sci-Fi,...|1995|       R|  8.0|James Cole, a pri...|
|     47|     1|   3.5|Seven (a.k.a. Se7en)| [Mystery, Thriller]|1995|       R|  8.6|A film about two ...|
|     50|     1|   3.5| Usual Suspects, The|[Crime, Mystery, ...|1995|       R|  8.5|Following a truck...|
|    112|     1|   3.5|Rumble in the Bro...|[Action, Adventur...|1995|       R|  6.7|Keong comes from ...|
|    151|     1|     4|             R

In [8]:
df = df.withColumn('movieId', col('movieId').cast(IntegerType()))
df = df.withColumn('userId', col('userId').cast(IntegerType()))
df = df.withColumn('rating', col('rating').cast(FloatType()))
df = df.withColumn('year', col('year').cast(IntegerType()))
df = df.withColumn('score', col('score').cast(FloatType()))

In [9]:
df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- userId: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- year: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- score: float (nullable = true)
 |-- description: string (nullable = true)



In [10]:
df.count()

20000263

In [11]:
df = df.na.drop()
df.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- userId: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- genres: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- year: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- score: float (nullable = true)
 |-- description: string (nullable = true)



In [12]:
df.count()

19809344

----------------------

## Build the content-based filtering model
Recommending movies (predicting ratings) based on genres, release year, category, IMDb score, and plot summary:

1. Clean genres column: vectorize via TF-IDF
2. Clean category column: index and then vectorize via one-hot encoding
3. Clean and vectorize description column: convert to lowercase, remove punctuation and stopwords, and then vectorize via TF-IDF
4. Use VectorAssembler to combine all five predictors into a "features" column to ultimately predict rating via linear regression

In [13]:
seed=314

In [14]:
df_red = df.sample(fraction=0.05, seed=seed)
df_red.count()

991777

### Clean genres column

In [15]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
hashingTF = HashingTF(inputCol="genres", outputCol="rawGenres")
featurizedData = hashingTF.transform(df_red)

idf = IDF(inputCol="rawGenres", outputCol="vectorizedGenres")
idfModel = idf.fit(featurizedData)
df1 = idfModel.transform(featurizedData)
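
To show what the TF-IDF step computes, here is a minimal plain-Python sketch on three hypothetical genre lists. It counts terms exactly instead of hashing them into buckets as `HashingTF` does, and it assumes the smoothed IDF formula from the Spark documentation, log((m + 1) / (df + 1)), where m is the number of rows and df is the number of rows containing the term.

```python
import math
from collections import Counter

# Hypothetical genre lists, one per row
docs = [
    ["Adventure", "Children", "Fantasy"],
    ["Mystery", "Thriller"],
    ["Crime", "Mystery", "Thriller"],
]

m = len(docs)
# Document frequency: number of rows in which each genre appears
doc_freq = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency (assumed to match Spark's IDF)
idf = {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(doc):
    """Term frequency times IDF for every genre in one row."""
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items()}


print(tf_idf(docs[2]))
```

Rarer genres receive larger weights: here "Crime" appears in one row and outweighs "Mystery", which appears in two. Note that in the notebook the IDF is fit on rating rows, not on distinct movies, so frequently rated movies contribute more to the document frequencies.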

### Clean category column

In [16]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer
stringIndexer = StringIndexer(inputCol="category", outputCol="indexedCategory")
indexedModel = stringIndexer.fit(df1)
indexed = indexedModel.transform(df1)

encoder = OneHotEncoder(inputCol="indexedCategory", outputCol="vectorizedCategory")
encodedModel = encoder.fit(indexed)
df2 = encodedModel.transform(indexed)
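
The indexing and encoding steps can be sketched in plain Python as follows. The sketch assumes the Spark defaults as we understand them: `StringIndexer` gives index 0 to the most frequent label, and `OneHotEncoder` drops the last category so that it is encoded as an all-zero vector. The category values are hypothetical, and dense lists stand in for Spark's sparse vectors.

```python
from collections import Counter

# Hypothetical category column
categories = ["R", "PG", "R", "PG-13", "R", "PG"]

counts = Counter(categories)
# Most frequent label first; ties broken alphabetically (assumption)
labels = sorted(counts, key=lambda c: (-counts[c], c))
index = {label: i for i, label in enumerate(labels)}


def one_hot(label, drop_last=True):
    """Encode a label as a dense one-hot list."""
    size = len(labels) - 1 if drop_last else len(labels)
    vec = [0.0] * size
    position = index[label]
    if position < size:  # the dropped (last) category stays all zeros
        vec[position] = 1.0
    return vec


print(index)
print(one_hot("R"), one_hot("PG-13"))
```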

### Clean description column

In [17]:
from pyspark.sql.functions import regexp_replace, lower
from pyspark.ml.feature import StopWordsRemover
df2 = df2.withColumn("description",lower(regexp_replace(col("description"), "[^a-zA-Z0-9 ]+", "")))

tokenizer = Tokenizer(inputCol="description", outputCol="tokenizedDescription")
tokenized = tokenizer.transform(df2)

remover = StopWordsRemover(inputCol="tokenizedDescription", outputCol="filteredDescription")
removedData = remover.transform(tokenized)

hashingTF2 = HashingTF(inputCol="filteredDescription", outputCol="rawDescription")
featurizedData2 = hashingTF2.transform(removedData)

idf2 = IDF(inputCol="rawDescription", outputCol="vectorizedDescription")
idfModel2 = idf2.fit(featurizedData2)
df3 = idfModel2.transform(featurizedData2)
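
The cleaning steps before hashing can be illustrated in plain Python. This is only a sketch: the stopword set is a small hypothetical subset rather than Spark's default list, the input sentence is made up, and `str.split()` collapses repeated spaces, which may differ from how Spark's `Tokenizer` treats them.

```python
import re

# Hypothetical stopword subset; Spark's StopWordsRemover uses a longer list
STOPWORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}


def clean_description(text):
    """Strip punctuation, lowercase, tokenize on whitespace, drop stopwords."""
    stripped = re.sub(r"[^a-zA-Z0-9 ]+", "", text).lower()
    tokens = stripped.split()
    return [token for token in tokens if token not in STOPWORDS]


print(clean_description("A film about two detectives, set in 1995!"))
```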

### Assemble features

In [18]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["vectorizedGenres", "year", "vectorizedCategory", "score", "vectorizedDescription"], outputCol="features")
output = assembler.transform(df3)
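
Conceptually, `VectorAssembler` concatenates its inputs in the order given, treating each numeric column as a length-one vector. The plain-Python sketch below shows this with short hypothetical vectors; the helper name `assemble` is our own.

```python
def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature list."""
    features = []
    for part in parts:
        if isinstance(part, (int, float)):
            features.append(float(part))  # scalar column
        else:
            features.extend(float(x) for x in part)  # vector column
    return features


# Order: vectorizedGenres, year, vectorizedCategory, score, vectorizedDescription
row = assemble([0.0, 0.29], 1995, [1.0, 0.0], 7.0, [0.0, 1.1])
print(row)
```

In the notebook the assembled vector is far wider than this toy row: `HashingTF` is used with its default size, which we believe is 2^18 = 262,144 buckets, for both the genres and the description.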

### Predict movie rating via LinearRegression

In [19]:
# A random split is left commented out; the movie-based split below is used instead
#training, test = output.randomSplit([0.7, 0.3], seed=seed)
#training.cache()

In [20]:
n = output.count()

In [21]:
print("training size: ", n*0.70)
print("test size: ", n*0.30)

training size:  694243.8999999999
test size:  297533.1


In [22]:
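# Sort by movieId and take the first 70% of rows as the training set.
# limit() cuts at a row count, so the movie at the cut-off can have ratings in both sets.
training = output.sort(col("movieId"), ascending=True).limit(694243)
training.cache()
# Every row not in the training set goes to the test set
test = output.subtract(training)
test.cache()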

DataFrame[movieId: int, userId: int, rating: float, title: string, genres: array<string>, year: int, category: string, score: float, description: string, rawGenres: vector, vectorizedGenres: vector, indexedCategory: double, vectorizedCategory: vector, tokenizedDescription: array<string>, filteredDescription: array<string>, rawDescription: vector, vectorizedDescription: vector, features: vector]
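
The idea behind this split, and the boundary case that a row-count cut-off can hit, can be sketched in plain Python on hypothetical (movieId, userId) pairs. Cutting at a fixed row count can leave one movie in both sets; cutting at a movie ID keeps the sets disjoint at the cost of a slightly different split ratio. The helper `split_by_movie` is our own illustration, not the code used in this notebook.

```python
# Hypothetical (movieId, userId) rating rows
rows = [
    (1, 10), (1, 11),
    (2, 10), (2, 12), (2, 13),
    (3, 11), (3, 14), (3, 15),
    (4, 10), (4, 12),
]


def split_by_movie(rows, train_percent=70):
    """Split so that every movie lands entirely in train or entirely in test."""
    ordered = sorted(rows)
    cutoff_index = len(ordered) * train_percent // 100
    cutoff_movie = ordered[cutoff_index][0]
    train = [r for r in ordered if r[0] < cutoff_movie]
    test = [r for r in ordered if r[0] >= cutoff_movie]
    return train, test


# Row-count cut-off: movie 3 ends up on both sides
naive_train, naive_test = sorted(rows)[:7], sorted(rows)[7:]
naive_overlap = {m for m, _ in naive_train} & {m for m, _ in naive_test}

# Movie-ID cut-off: no movie is shared
train, test = split_by_movie(rows)
overlap = {m for m, _ in train} & {m for m, _ in test}

print(naive_overlap, overlap, len(train), len(test))
```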

In [23]:
# Build the recommendation model using LinearRegression on the training data
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol='features', labelCol='rating', maxIter=10, regParam=0.3, elasticNetParam=0.8)

import time
t = time.time()
lrModel = lr.fit(training)
print(time.time() - t)

612.6124124526978
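
For reference, the elastic-net objective that `LinearRegression` minimizes, as we understand it from the Spark ML documentation, is shown below. Here λ corresponds to `regParam` and α to `elasticNetParam`; the notation is ours and may differ slightly from the official docs.

```latex
\min_{w} \; \frac{1}{2n} \sum_{i=1}^{n} \left( x_i^{\top} w - y_i \right)^2
  + \lambda \left[ \frac{1 - \alpha}{2} \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1 \right]
```

With α = 0 the penalty is pure L2 (ridge), and with α = 1 it is pure L1 (lasso). The settings used above, λ = 0.3 and α = 0.8, put most of the penalty weight on the L1 term.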


In [24]:
# Evaluate the model by computing the RMSE on the test data
predictions = lrModel.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error is {}".format(rmse))

Root-mean-square error is 0.9628164436096527
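
For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it in plain Python on a few hypothetical values; it is meant only to illustrate the formula, not to reproduce the number above.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical ratings and predictions
print(rmse([2.0, 4.5, 3.5], [3.0, 3.5, 3.5]))
```

An RMSE close to 1 therefore means the predictions are off by roughly one star on a typical rating.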


In [25]:
# Fresh estimator for tuning: regParam and elasticNetParam come from the parameter grid
lr = LinearRegression(featuresCol='features', labelCol='rating', maxIter=10)

In [26]:
## Tuning with CrossValidator
paramMap = ParamGridBuilder() \
            .addGrid(lr.regParam, [0.1, 0.3, 0.5]) \
            .addGrid(lr.elasticNetParam, [0, 0.2, 0.5, 0.8, 1]).build()


evaluatorR = RegressionEvaluator(metricName="rmse", labelCol="rating")


CVALS = CrossValidator(estimator=lr,
                       estimatorParamMaps=paramMap,
                       evaluator=evaluatorR,
                       numFolds=3)

t1 = time.time()
CVModel = CVALS.setParallelism(4).fit(training)
print(time.time() - t1)

1118.3464393615723
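
To give a sense of how much work the tuning step does, the sketch below counts the model fits implied by the grid above. It assumes, as we understand `CrossValidator` to work, that every parameter combination is fit once per fold and that the best combination is then refit once on the full training set.

```python
from itertools import product

# Same grid values as in the cell above
reg_params = [0.1, 0.3, 0.5]
elastic_net_params = [0, 0.2, 0.5, 0.8, 1]
num_folds = 3

grid = list(product(reg_params, elastic_net_params))
fits_during_cv = len(grid) * num_folds
total_fits = fits_during_cv + 1  # plus the final refit of the best combination

print(len(grid), fits_during_cv, total_fits)
```

This is why tuning takes longer than the single fit above, even with `setParallelism(4)`.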


In [27]:
CVModel.bestModel._java_obj.parent().getRegParam()

0.3

In [28]:
CVModel.bestModel._java_obj.parent().getElasticNetParam()

0.0

In [29]:
model = CVModel.bestModel

In [30]:
preds = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(preds)
print("Root-mean-square error is {}".format(rmse))

Root-mean-square error is 1.0792119780218898


In [31]:
preds_final = preds.select("movieID","userID","rating","title","genres","year","category","score","description","prediction")

In [32]:
preds_final.show(10)

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|movieID|userID|rating|               title|              genres|year|category|score|         description|        prediction|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|   4366| 94972|   2.0|Atlantis: The Los...|[Adventure, Anima...|2001|      PG|  6.9|1914 milo thatch ...|3.4205452648858246|
|   6027| 43457|   4.5|            Dogfight|    [Drama, Romance]|1991|       R|  7.4|in 1963 the night...|3.5500877317096413|
|  33912|  8963|   3.5| Unmarried Woman, An|[Comedy, Drama, R...|1978|       R|  7.1|erica is unmarrie...|3.1085281100785727|
|   3994| 96946|   5.0|         Unbreakable|     [Drama, Sci-Fi]|2000|   PG-13|  7.3|david dunn willis...|  3.45743837019414|
|   3996|  4847|   2.5|Crouching Tiger, ...|[Action, Drama, R...|2000|   PG-13|  7.9|in early nineteen...| 3.253817576

In [33]:
## Recommendations for user 463 (empty when none of this user's sampled ratings fall in the test set)
preds_final.filter('userId = 463').sort('prediction', ascending=False).limit(5).show()

+-------+------+------+-----+------+----+--------+-----+-----------+----------+
|movieID|userID|rating|title|genres|year|category|score|description|prediction|
+-------+------+------+-----+------+----+--------+-----+-----------+----------+
+-------+------+------+-----+------+----+--------+-----+-----------+----------+



In [34]:
## Actual preferences for user 463
df.filter('userId = 463').sort('rating', ascending=False).limit(5).show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|movieId|userId|rating|               title|              genres|year|category|score|         description|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|     47|   463|   5.0|Seven (a.k.a. Se7en)| [Mystery, Thriller]|1995|       R|  8.6|A film about two ...|
|    281|   463|   5.0|       Nobody's Fool|[Comedy, Drama, R...|1994|       R|  7.3|Sully is a rascal...|
|    150|   463|   5.0|           Apollo 13|[Adventure, Drama...|1995|      PG|  7.7|"Based on the tru...|
|     17|   463|   5.0|Sense and Sensibi...|    [Drama, Romance]|1995|      PG|  7.7|When Mr. Dashwood...|
|    265|   463|   5.0|Like Water for Ch...|[Drama, Fantasy, ...|1992|       R|  7.1|In a forgotten Me...|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+



In [35]:
## Recommendations for user 318
preds_final.filter('userId = 318').sort('prediction', ascending=False).limit(5).show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|movieID|userID|rating|               title|              genres|year|category|score|         description|        prediction|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|   8533|   318|   2.0|       Notebook, The|    [Drama, Romance]|2004|   PG-13|  7.8|in a nursing home...|3.7191656579595653|
|  63082|   318|   5.0| Slumdog Millionaire|[Crime, Drama, Ro...|2008|       R|  8.0|the story of jama...|3.6994560227567357|
|  27311|   318|   4.0|Batman Beyond: Re...|[Action, Animatio...|2000|    2000|  7.7|while trying to u...|3.4999464087178707|
|  48394|   318|   4.0|Pan's Labyrinth (...|[Drama, Fantasy, ...|2006|       R|  8.2|in 1944 falangist...| 3.494974123990354|
|  90866|   318|   5.0|                Hugo|[Children, Drama,...|2011|      PG|  7.5|hugo is an orphan...|3.4844417179

In [36]:
## Actual preferences for user 318
df.filter('userId = 318').sort('rating', ascending=False).limit(5).show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|movieId|userId|rating|               title|              genres|year|category|score|         description|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|      1|   318|   5.0|           Toy Story|[Adventure, Anima...|1995|       G|  8.3|"A little boy nam...|
|    589|   318|   5.0|Terminator 2: Jud...|    [Action, Sci-Fi]|1991|       R|  8.6|Over 10 years hav...|
|   1201|   318|   5.0|Good, the Bad and...|[Action, Adventur...|1966|       R|  8.8|Blondie, The Good...|
|   1080|   318|   5.0|Monty Python's Li...|            [Comedy]|1979|       R|  8.1|The story of Bria...|
|   1136|   318|   5.0|Monty Python and ...|[Adventure, Comed...|1975|      PG|  8.2|History is turned...|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+



In [37]:
## Recommendations for user 148
preds_final.filter('userId = 148').sort('prediction', ascending=False).limit(5).show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|movieID|userID|rating|               title|              genres|year|category|score|         description|        prediction|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|   4356|   148|   4.0|Gentlemen Prefer ...|[Comedy, Musical,...|1953|Approved|  7.1|lorelei and dorot...|3.5390345525101425|
|   4062|   148|   4.0|        Mystic Pizza|[Comedy, Drama, R...|1988|       R|  6.3|sisters kat and d...| 3.250548629806492|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+



In [38]:
## Actual preferences for user 148
df.filter('userId = 148').sort('rating', ascending=False).limit(5).show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|movieId|userId|rating|               title|              genres|year|category|score|         description|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+
|    289|   148|   5.0|            Only You|   [Comedy, Romance]|1994|      PG|  6.5|"Destiny. Faith (...|
|    497|   148|   5.0|Much Ado About No...|   [Comedy, Romance]|1993|   PG-13|  7.3|"Young lovers Her...|
|    339|   148|   5.0|While You Were Sl...|   [Comedy, Romance]|1995|      PG|  6.8|"Nursing a secret...|
|    356|   148|   5.0|        Forrest Gump|[Comedy, Drama, R...|1994|   PG-13|  8.8|Forrest Gump is a...|
|     17|   148|   5.0|Sense and Sensibi...|    [Drama, Romance]|1995|      PG|  7.7|When Mr. Dashwood...|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+



In [39]:
## User recommendations for movie 1580 (empty when the movie has no rows in the test set)
preds_final.filter('movieId = 1580').sort('prediction', ascending=False).limit(5).show()

+-------+------+------+-----+------+----+--------+-----+-----------+----------+
|movieID|userID|rating|title|genres|year|category|score|description|prediction|
+-------+------+------+-----+------+----+--------+-----+-----------+----------+
+-------+------+------+-----+------+----+--------+-----+-----------+----------+



In [40]:
## User recommendations for movie 27311
# The features describe only the movie, so every user gets the same prediction for it
preds_final.filter('movieId = 27311').sort('prediction', ascending=False).limit(5).show()

+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|movieID|userID|rating|               title|              genres|year|category|score|         description|        prediction|
+-------+------+------+--------------------+--------------------+----+--------+-----+--------------------+------------------+
|  27311| 47605|   4.5|Batman Beyond: Re...|[Action, Animatio...|2000|    2000|  7.7|while trying to u...|3.4999464087178707|
|  27311|   318|   4.0|Batman Beyond: Re...|[Action, Animatio...|2000|    2000|  7.7|while trying to u...|3.4999464087178707|
|  27311|136479|   2.0|Batman Beyond: Re...|[Action, Animatio...|2000|    2000|  7.7|while trying to u...|3.4999464087178707|
|  27311| 35780|   3.0|Batman Beyond: Re...|[Action, Animatio...|2000|    2000|  7.7|while trying to u...|3.4999464087178707|
|  27311| 36482|   4.0|Batman Beyond: Re...|[Action, Animatio...|2000|    2000|  7.7|while trying to u...|3.4999464087