In Spark, there is one paritcular recommendation algorithm, Alternating Least Squares (ALS). This algorithm leverages collaborative filtering, which makes recommendations based only on which items users interacted with in the past. That is, it does not require or use any additional features about the users or the items.

In [26]:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr


# select 4 cores to process this
spark = SparkSession\
        .builder\
        .appName("ALSExample")\
        .config("spark.executor.cores", '4')\
        .getOrCreate()

# Loading data

In [5]:
ratings = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("../data/goodbooks-10k-master/ratings.csv")
ratings.printSchema()
ratings.createOrReplaceTempView("dfTable")

root
 |-- user_id: integer (nullable = true)
 |-- book_id: integer (nullable = true)
 |-- rating: integer (nullable = true)



In [7]:
ratings.show(3)

+-------+-------+------+
|user_id|book_id|rating|
+-------+-------+------+
|      1|    258|     5|
|      2|   4081|     4|
|      2|    260|     5|
+-------+-------+------+
only showing top 3 rows



In [8]:
books = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("../data/goodbooks-10k-master/books.csv")
books.printSchema()
books.createOrReplaceTempView("dfTable")

root
 |-- book_id: integer (nullable = true)
 |-- goodreads_book_id: integer (nullable = true)
 |-- best_book_id: integer (nullable = true)
 |-- work_id: integer (nullable = true)
 |-- books_count: integer (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: double (nullable = true)
 |-- authors: string (nullable = true)
 |-- original_publication_year: double (nullable = true)
 |-- original_title: string (nullable = true)
 |-- title: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- ratings_count: string (nullable = true)
 |-- work_ratings_count: string (nullable = true)
 |-- work_text_reviews_count: string (nullable = true)
 |-- ratings_1: double (nullable = true)
 |-- ratings_2: integer (nullable = true)
 |-- ratings_3: integer (nullable = true)
 |-- ratings_4: integer (nullable = true)
 |-- ratings_5: integer (nullable = true)
 |-- image_url: string (nullable = true)
 |-- small_image_url: string (nu

In [111]:
book_names = books.select("book_id", "title", "authors")
book_names.show(5)

+-------+--------------------+--------------------+
|book_id|               title|             authors|
+-------+--------------------+--------------------+
|      1|The Hunger Games ...|     Suzanne Collins|
|      2|Harry Potter and ...|J.K. Rowling, Mar...|
|      3|Twilight (Twiligh...|     Stephenie Meyer|
|      4|To Kill a Mocking...|          Harper Lee|
|      5|    The Great Gatsby| F. Scott Fitzgerald|
+-------+--------------------+--------------------+
only showing top 5 rows



# Build Model and Fit

In [83]:
from pyspark.ml.recommendation import ALS

#split data into training and test set
training, test = ratings.randomSplit([0.8, 0.2])

als = ALS()\
  .setMaxIter(5)\
  .setRegParam(0.01)\
  .setUserCol("user_id")\
  .setItemCol("book_id")\
  .setRatingCol("rating")
# print(als.explainParams())

In [84]:
alsModel = als.fit(training)
predictions = alsModel.transform(test)

# Evaluate

When covering the cold-start strategy, we can set up an automatic model evaluator when working with ALS. One thing that may not be immediately obvious is that this recommendation problem is really just a kind of regression problem. Since we’re predicting values (ratings) for given users, we want to optimize for reducing the total difference between our users’ ratings and the true values. We can do this using the RegressionEvaluator.

In [85]:
# in Python
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator()\
  .setMetricName("rmse")\
  .setLabelCol("rating")\
  .setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = %f" % rmse)

Root-mean-square error = 0.841255


# Recommendation Results

We can now output the top 𝘬 recommendations for each user or book. The model’s recommendForAllUsers method returns a DataFrame of a user_id, an array of recommendations, as well as a rating for each of those books. recommendForAllItems returns a DataFrame of a book_id, as well as the top users for that book:

In [86]:
from pyspark.sql.functions import col

# generate top 10 book recs for each user
alsModel.recommendForAllUsers(10)\
  .selectExpr("user_id", "explode(recommendations)").show(5)

# generate top 10 user recommendations for each book 
alsModel.recommendForAllItems(10)\
  .selectExpr("book_id", "explode(recommendations)").show(5)

+-------+-----------------+
|user_id|              col|
+-------+-----------------+
|    148| [8326, 5.287354]|
|    148|[8498, 5.2569065]|
|    148|[8703, 5.2383165]|
|    148|[9576, 5.2006955]|
|    148|[8271, 5.1758304]|
+-------+-----------------+
only showing top 5 rows

+-------+------------------+
|book_id|               col|
+-------+------------------+
|   1580| [40753, 5.755218]|
|   1580|[51314, 5.5977416]|
|   1580|[50817, 5.5185285]|
|   1580| [24329, 5.516051]|
|   1580|[43544, 5.4652486]|
+-------+------------------+
only showing top 5 rows



### select test user

In [93]:
test_user_id = 8

test_user = ratings.filter(ratings['user_id'] == test_user_id)
joinExpression = test_user["book_id"] == book_names['book_id']
test_user.join(book_names, joinExpression, joinType)\
 .orderBy('rating', ascending = False).show(truncate = False)

+-------+-------+------+-------+----------------------------------------+
|user_id|book_id|rating|book_id|title                                   |
+-------+-------+------+-------+----------------------------------------+
|8      |977    |5     |977    |Inferno (The Divine Comedy #1)          |
|8      |8312   |5     |8312   |Miracles                                |
|8      |769    |5     |769    |The Complete Sherlock Holmes            |
|8      |5425   |5     |5425   |Darkness at Noon                        |
|8      |80     |5     |80     |The Little Prince                       |
|8      |485    |5     |485    |The Brothers Karamazov                  |
|8      |718    |5     |718    |The Sound and the Fury                  |
|8      |493    |5     |493    |Mere Christianity                       |
|8      |362    |5     |362    |The Screwtape Letters                   |
|8      |9114   |5     |9114   |The Complete Tales and Poems            |
|8      |177    |5     |177    |Crime 

### filter for results and join with book names

In [94]:
userRecs = alsModel.recommendForAllUsers(10)

test_userRecs = userRecs.filter(userRecs['user_id'] == test_user_id)\
                    .selectExpr("user_id", "explode(recommendations)")

test_userRecs = test_userRecs.select("user_id", 'col.*')

In [97]:
joinExpression = test_userRecs["book_id"] == book_names['book_id']
joinType = "inner"
test_userRecs.join(book_names, joinExpression, joinType).show(truncate = False)

+-------+-------+---------+-------+--------------------------------------------------------------------+------------------------------------+
|user_id|book_id|rating   |book_id|title                                                               |authors                             |
+-------+-------+---------+-------+--------------------------------------------------------------------+------------------------------------+
|8      |7548   |6.3333125|7548   |Systematic Theology: An Introduction to Biblical Doctrine           |Wayne A. Grudem                     |
|8      |8390   |6.038451 |8390   |Philosophical Investigations                                        |Ludwig Wittgenstein, G.E.M. Anscombe|
|8      |9549   |5.9829793|9549   |Gargantua and Pantagruel                                            |François Rabelais, M.A. Screech     |
|8      |8011   |5.794862 |8011   |An Enquiry Concerning Human Understanding                           |David Hume                          |
|8    

### can also find top users for a given book

In [112]:
test_book_id = 177

book_names.filter(book_names['book_id'] == test_book_id).show()


+-------+--------------------+--------------------+
|book_id|               title|             authors|
+-------+--------------------+--------------------+
|    177|Crime and Punishment|Fyodor Dostoyevsk...|
+-------+--------------------+--------------------+



In [115]:
bookRecs = alsModel.recommendForAllItems(10)\
                    .selectExpr("book_id", "explode(recommendations)")

test_bookRec = bookRecs.filter(bookRecs['book_id'] == test_book_id)\
                        .select("book_id", "col.*")

test_bookRec.show()

+-------+-------+---------+
|book_id|user_id|   rating|
+-------+-------+---------+
|    177|  38367| 6.388581|
|    177|  40948|6.2325187|
|    177|  43800|6.1170244|
|    177|  15344|5.8767686|
|    177|  14215|5.7751093|
|    177|  40753| 5.655826|
|    177|  31196|5.6470795|
|    177|  28982| 5.641826|
|    177|  23621|5.5977125|
|    177|  35183| 5.591388|
+-------+-------+---------+



# Further evaluation metrics

A RankingMetric allows us to compare our recommendations with an actual set of ratings (or preferences) expressed by a given user. RankingMetric does not focus on the value of the rank but rather whether or not our algorithm recommends an already ranked item again to a user. 

First, we need to collect a set of highly ranked movies for a given user. In our case, we’re going to use a rather low threshold: movies ranked above 2.5. Tuning this value will largely be a business decision:

In [104]:
# in Python
from pyspark.mllib.evaluation import RankingMetrics, RegressionMetrics
from pyspark.sql.functions import col, expr
perUserActual = predictions\
  .where("rating > 2.5")\
  .groupBy("user_id")\
  .agg(expr("collect_set(book_id) as books"))

At this point, we have a collection of users, along with a truth set of previously ranked movies for each user. Now we will get our top 10 recommendations from our algorithm on a per-user basis. We will then see if the top 10 recommendations show up in our truth set. If we have a well-trained model, it will correctly recommend the movies a user already liked. If it doesn’t, it may not have learned enough about each particular user to successfully reflect their preferences:

In [105]:
perUserPredictions = predictions\
  .orderBy(col("user_id"), expr("prediction DESC"))\
  .groupBy("user_id")\
  .agg(expr("collect_list(book_id) as books"))

Now we have two DataFrames, one of predictions and another the top-ranked items for a particular user. We can pass them into the RankingMetrics object. This object accepts an RDD of these combinations, as you can see in the following join and RDD conversion:

In [108]:
# in Python
perUserActualvPred = perUserActual.join(perUserPredictions, ["user_id"]).rdd\
  .map(lambda row: (row[1], row[2][:15]))
ranks = RankingMetrics(perUserActualvPred)

Now we can see the metrics from that ranking. For instance, we can see how precise our algorithm is with the mean average precision. We can also get the precision at certain ranking points, for instance, to see where the majority of the positive recommendations fall:

In [109]:
ranks.meanAveragePrecision
ranks.precisionAt(5)

0.6791110445413869