## Recommender System Using the MovieLens Dataset

In [2]:
# import pyspark
from pyspark.sql import SparkSession

In [3]:
# Create a SparkSession
spark = SparkSession.builder.appName('rec').getOrCreate()

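Spark's MLlib machine learning library implements collaborative filtering with Alternating Least Squares (ALS). The implementation has the parameters listed here. In the DataFrame-based `pyspark.ml.recommendation.ALS` class used in this notebook, iterations and lambda are set through `maxIter` and `regParam`: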

* numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* rank is the number of latent factors in the model.
* iterations is the number of iterations to run.
* lambda specifies the regularization parameter in ALS.
* implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
* alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
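
To make the roles of rank, iterations, and lambda concrete, here is a minimal pure-Python sketch of explicit-feedback ALS. It is an illustration under simplifying assumptions, not MLlib's implementation: there is no blocking or parallelism, and the helper names (`solve`, `update_factors`, `als`, `predict`, `regularized_loss`) and the toy ratings are invented for this example.

```python
import random


def solve(A, b):
    """Solve the small linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def update_factors(ratings_by_row, fixed, rank, lam):
    """Ridge-regress each row's factor vector against the fixed side's factors."""
    out = {}
    for row_id, observed in ratings_by_row.items():
        # Normal equations: (lam * I + sum f f^T) x = sum r * f
        A = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for col_id, r in observed:
            f = fixed[col_id]
            for i in range(rank):
                b[i] += r * f[i]
                for j in range(rank):
                    A[i][j] += f[i] * f[j]
        out[row_id] = solve(A, b)
    return out


def als(ratings, rank=2, iterations=10, lam=0.1, seed=0):
    """Alternate between solving for user factors and item factors."""
    rng = random.Random(seed)
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))
    items = {i: [rng.random() for _ in range(rank)] for i in by_item}
    users = {}
    for _ in range(iterations):
        users = update_factors(by_user, items, rank, lam)  # items held fixed
        items = update_factors(by_item, users, rank, lam)  # users held fixed
    return users, items


def predict(users, items, u, i):
    """Predicted rating is the dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(users[u], items[i]))


def regularized_loss(ratings, users, items, lam):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    squared_error = sum((r - predict(users, items, u, i)) ** 2 for u, i, r in ratings)
    vectors = list(users.values()) + list(items.values())
    return squared_error + lam * sum(x * x for vec in vectors for x in vec)


# Toy (userId, movieId, rating) triples, invented for this example
toy_ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
               (1, 2, 2.0), (2, 1, 5.0), (2, 2, 4.0)]
users, items = als(toy_ratings, rank=2, iterations=10, lam=0.1)
print(predict(users, items, 0, 2))  # a rating user 0 never gave
```

Each half-step solves its least-squares problem exactly, so the regularized loss cannot increase from one iteration to the next. A larger `lam` shrinks the factors and trades training fit for stability.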

In [4]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [5]:
# read in the ratings data
data = spark.read.csv('movielens_ratings.csv',inferSchema=True,header=True)

In [6]:
# show the data
data.head()

Row(movieId=2, rating=3.0, userId=0)

In [7]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



In [8]:
# see the schema
data.printSchema()

root
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- userId: integer (nullable = true)



In [9]:
# Do train test split
# Smaller dataset so we will use 0.8 / 0.2
train_data, test_data = data.randomSplit([0.8,0.2])

In [11]:
# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
model = als.fit(train_data)

In [12]:
# Generate predictions for the test data; the RMSE is computed from these
predictions = model.transform(test_data)

In [13]:
predictions.show()

+-------+------+------+----------+
|movieId|rating|userId|prediction|
+-------+------+------+----------+
|      1|   1.0|    28|-1.2124188|
|      6|   3.0|    26| 2.7169647|
|      5|   2.0|    22| 0.7517233|
|      0|   1.0|    13| 1.7150234|
|      2|   1.0|    16| 1.9919672|
|      5|   3.0|    16|0.02145978|
|      1|   1.0|     3| 0.1411851|
|      6|   1.0|    15|0.89219576|
|      2|   3.0|     9|-3.7661297|
|      1|   1.0|     4| 1.2393562|
|      0|   1.0|     8| 1.8387822|
|      4|   2.0|     8|  2.569644|
|      1|   1.0|     7| 1.8027151|
|      4|   1.0|     7| 1.2354789|
|      7|   1.0|     7| 0.2748811|
|      7|   2.0|    21|0.05178885|
|      0|   1.0|    11| 2.7972293|
|      4|   1.0|    14| 2.6754642|
|      5|   1.0|    14| 0.7042428|
|      2|   3.0|     0|0.09668193|
+-------+------+------+----------+
only showing top 20 rows
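
Several predictions in the table fall outside the valid rating range (for example -3.77 for a true rating of 3.0), because the factor model places no bounds on its output. One possible post-processing step, sketched here in plain Python as an assumption rather than something this notebook does, is to clamp predictions to the rating scale. The `clamp` helper is a hypothetical name, and the pairs are a few rows copied from the table above.

```python
def clamp(value, low=1.0, high=5.0):
    """Restrict a predicted rating to the valid range [low, high]."""
    return max(low, min(high, value))


# (true rating, raw prediction) pairs copied from the predictions table
rows = [(1.0, -1.2124188), (3.0, 2.7169647), (2.0, 0.7517233),
        (3.0, 0.02145978), (3.0, -3.7661297)]

clamped = [(rating, clamp(pred)) for rating, pred in rows]
for (rating, raw), (_, fixed) in zip(rows, clamped):
    print(f"rating={rating}  raw={raw:+.3f}  clamped={fixed:.3f}")
```

Because every true rating lies between 1 and 5, clamping can never move a prediction further from its label, so it cannot make the error worse.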



In [14]:
# Select (prediction, true label) and compute test error
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 1.912644265844091


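The RMSE is expressed in the same units as the rating column, so an RMSE of about 1.91 means predictions typically miss by nearly two stars on a 1-to-5 scale. That is larger than the standard deviation of the ratings (about 1.19), which suggests that on this small dataset the model does worse than simply predicting the mean rating for every movie would.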
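
For reference, the metric itself is simple to compute by hand. The sketch below is a plain-Python version of RMSE together with a mean-rating baseline to compare against; it is not the code inside `RegressionEvaluator`, and the helper names and sample numbers are illustrative assumptions.

```python
import math


def rmse(labels, predictions):
    """Root-mean-square error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline_rmse(train_labels, test_labels):
    """RMSE obtained by predicting the training mean for every test row."""
    mean = sum(train_labels) / len(train_labels)
    return rmse(test_labels, [mean] * len(test_labels))


# Illustrative numbers only, not taken from the dataset
print(rmse([3.0, 1.0, 4.0], [2.5, 1.5, 3.0]))
print(mean_baseline_rmse([1.0, 2.0, 3.0], [2.0, 4.0]))
```

A model is only useful if its RMSE is lower than this kind of trivial baseline.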

In [19]:
# Select the movies that a single user (userId 11) rated in the test set
single_user = test_data.filter(test_data['userId']==11).select(['movieId','userId'])

In [20]:
# This user has 14 ratings in the test set
# Realistically, this should be some sort of hold-out set!
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      0|    11|
|     16|    11|
|     21|    11|
|     32|    11|
|     35|    11|
|     40|    11|
|     43|    11|
|     45|    11|
|     50|    11|
|     59|    11|
|     61|    11|
|     75|    11|
|     81|    11|
|     82|    11|
+-------+------+



In [21]:
# What did our model predict they would like?
recommendations = model.transform(single_user)

In [22]:
# Show the user's top recommendations
recommendations.orderBy('prediction',ascending=False).show()

+-------+------+------------+
|movieId|userId|  prediction|
+-------+------+------------+
|     32|    11|    4.927175|
|      0|    11|   2.7972293|
|     75|    11|   2.7694626|
|     81|    11|    2.136906|
|     50|    11|   2.1019013|
|     21|    11|    1.951603|
|     82|    11|   0.9203961|
|     45|    11|  0.79088193|
|     61|    11|  0.67430556|
|     40|    11|-0.049490035|
|     43|    11| -0.80276585|
|     35|    11|  -0.8230052|
|     59|    11|  -1.0738926|
|     16|    11|   -1.181202|
+-------+------+------------+
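
If only the best few titles are needed, the sorted list can be cut to the top N. Below is a minimal plain-Python sketch using the predictions for user 11 printed above; the `top_n` helper is a hypothetical name, and in Spark the same effect would come from adding `.limit(n)` after `orderBy`.

```python
import heapq

# (movieId, prediction) pairs for userId 11, copied from the output above
scored = [(32, 4.927175), (0, 2.7972293), (75, 2.7694626), (81, 2.136906),
          (50, 2.1019013), (21, 1.951603), (82, 0.9203961), (45, 0.79088193),
          (61, 0.67430556), (40, -0.049490035), (43, -0.80276585),
          (35, -0.8230052), (59, -1.0738926), (16, -1.181202)]


def top_n(scored_items, n=3):
    """Return the n (movieId, prediction) pairs with the highest prediction."""
    return heapq.nlargest(n, scored_items, key=lambda pair: pair[1])


print(top_n(scored))  # [(32, 4.927175), (0, 2.7972293), (75, 2.7694626)]
```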



Recommender systems also have to deal with the cold start problem: in this case, new users on the platform who have not yet watched or rated a movie, so the model has no latent factors for them. We can work around this by asking them to fill out a quick survey or to choose a typical profile, depending on the domain.
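
One simple fallback while a new user has no ratings, sketched below in plain Python, is to recommend the movies with the highest average rating overall. This popularity baseline is a stand-in chosen for illustration, not the survey or profile approach itself, and the function names, the `min_ratings` threshold, and the toy ratings are all assumptions.

```python
from collections import defaultdict


def popular_movies(ratings, n=2, min_ratings=2):
    """Rank movies by mean rating, ignoring movies with too few ratings."""
    totals = defaultdict(lambda: [0.0, 0])  # movieId -> [sum, count]
    for _user, movie, rating in ratings:
        totals[movie][0] += rating
        totals[movie][1] += 1
    means = {m: s / c for m, (s, c) in totals.items() if c >= min_ratings}
    return sorted(means, key=lambda m: means[m], reverse=True)[:n]


def recommend(user, known_user_recs, ratings, n=2):
    """Use personalized results when available, else fall back to popularity."""
    if user in known_user_recs:
        return known_user_recs[user][:n]
    return popular_movies(ratings, n=n)


# Toy (userId, movieId, rating) triples, invented for this example
ratings = [(0, 10, 5.0), (1, 10, 4.0), (0, 20, 2.0), (1, 20, 3.0),
           (0, 30, 4.0), (1, 30, 4.0), (2, 40, 5.0)]
personalized = {0: [30, 10]}  # e.g. produced by the ALS model for known users

print(recommend(0, personalized, ratings))   # [30, 10]
print(recommend(99, personalized, ratings))  # [10, 30]
```

Movie 40 has the highest mean but only one rating, so the `min_ratings` threshold keeps it out of the fallback list. On the evaluation side, Spark's `ALS` also accepts `coldStartStrategy="drop"`, which discards test rows whose user or movie was not seen during training instead of returning NaN predictions for them.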