# Introduction

From a business perspective, a recommender system is worth considering when any of the following questions is relevant:

* What would a user like?
* What would a user buy?
* What would a user click?

> *Is there something we could suggest to improve a user experience?*

There are other situations, outside the scope of these questions, where recommendations might be appropriate. One example would be if the AAVAIL team wanted to recommend words or phrases for an autofill feature on the company's website or app. To build a recommender system, we need appropriate data. This most often comes in the form of a ratings matrix, sometimes known as a utility matrix. Here is what a portion of a ratings matrix might look like for AAVAIL data.

|User|Feed 1|Feed 2|Feed 3|Feed 4|Feed 5|Feed 6|Feed 7|Feed 8|Feed 9|
|:--:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
|1 | ? | ? | 4 | ? | ? | ? | 1 | ? | ? |
|2 | ? | ? | 4 | ? | ? | ? | 1 | ? | 2 |
|3 | ? | ? | 4 | 5 | ? | 1 | 1 | 3 | ? |
|4 | 3 | 2 | ? | ? | ? | 1 | ? | ? | ? |
|5 | 1 | 4 | 4 | ? | ? | 1 | 1 | ? | ? |

Notice that the majority of entries are missing, as is typical with utility matrices. We can't expect that every user has watched every feed, or even a significant portion of them. Most User/Feed intersections will be unrated, or blank, resulting in a sparse matrix.
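As a quick check of how sparse the example is, the sketch below counts the rated cells in the table above. It uses plain Python lists with `None` standing in for `?`; the variable names are illustrative only.

```python
# Ratings from the table above; None marks an unrated user/feed pair.
ratings = [
    [None, None, 4, None, None, None, 1, None, None],  # user 1
    [None, None, 4, None, None, None, 1, None, 2],     # user 2
    [None, None, 4, 5, None, 1, 1, 3, None],           # user 3
    [3, 2, None, None, None, 1, None, None, None],     # user 4
    [1, 4, 4, None, None, 1, 1, None, None],           # user 5
]

n_cells = sum(len(row) for row in ratings)
n_rated = sum(value is not None for row in ratings for value in row)
sparsity = 1 - n_rated / n_cells  # fraction of cells with no rating

print("rated cells: {} of {}".format(n_rated, n_cells))
print("sparsity: {:.2f}".format(sparsity))
```

Even in this small excerpt, 27 of the 45 cells are empty. Real utility matrices are usually far sparser.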

Ratings come in two types: **explicit** and **implicit**. The above utility matrix contains explicit ratings because the users rated feeds directly. Implicit data is derived from a user's behaviors or actions, for example likes, shares, page visits, or amount of time watched. These can be used to construct a utility matrix. Keeping with our AAVAIL feed example, we can engineer a measure based on *indirect* evidence. For example, the score for **Feed 1** could be based on a user's location, comment history, preferred type of feed, specified topic preferences, and more. Each element that contributes to the overall score could have a maximal value of 1.0, and the final number could be scaled to a range of 1-5. Explicit and implicit data can be combined using this type of approach as well. Naturally, you would want a solid understanding of the stories before engineering a score.
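One possible way to engineer such a score is sketched below. It assumes each signal has already been normalized to the range 0 to 1, takes the plain mean, and rescales it linearly to 1-5. The signal names and values are hypothetical, and an unweighted mean is only one of many reasonable choices.

```python
def implicit_score(components, low=1.0, high=5.0):
    """Scale the mean of per-signal values in [0, 1] to the range [low, high]."""
    if not components:
        raise ValueError("at least one component is required")
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError("component {!r} must be in [0, 1]".format(name))
    mean = sum(components.values()) / len(components)
    return low + (high - low) * mean

# Hypothetical signals for one user and Feed 1 (names and values are made up).
feed_1_signals = {
    "location_match": 1.0,
    "comment_activity": 0.25,
    "preferred_feed_type": 0.5,
    "topic_preference": 0.75,
}
score = implicit_score(feed_1_signals)
print(round(score, 2))
```

An explicit rating, where one exists, could be folded in as one more component after rescaling it to the same 0 to 1 range.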

Most recommender systems today are able to leverage both explicit (e.g. numerical ratings) and implicit (e.g. likes, purchases, skipped, bookmarked) patterns in a ratings matrix. The SVD++ algorithm is an example of a method that exploits both patterns.

# Recommendation Systems

The majority of modern recommender systems embrace either a collaborative filtering or a content-based approach. A number of other approaches and hybrids exist, making some implemented systems difficult to categorize.

> **Collaborative filtering:** Collaborative filtering is a family of methods that identify a subset of users who have preferences similar to the target user. From these users, a ratings matrix is created. The items preferred by these users are combined and filtered to create a ranked list of recommended items. In short, recommendations use similarity between users to infer preference or behavior.

> **Content-based filtering:** Predictions are made based on the properties and characteristics of an item. User behavior is not considered.

> **Hybrid recommender systems:** A combination of collaborative filtering and content-based systems.

> **Other types of systems:** Some systems, especially legacy ones, were based on demographics. Other systems attempt to infer utility before making a recommendation.
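To make the collaborative filtering idea concrete, here is a minimal user-based sketch on the AAVAIL example matrix. It assumes cosine similarity computed only over feeds that both users rated, and predicts a rating as the similarity-weighted mean of other users' ratings. The function names are illustrative; production systems typically use more robust similarity measures and far more data.

```python
import numpy as np

# Utility matrix from the introduction; np.nan marks an unrated feed.
nan = np.nan
R = np.array([
    [nan, nan, 4, nan, nan, nan, 1, nan, nan],
    [nan, nan, 4, nan, nan, nan, 1, nan, 2],
    [nan, nan, 4, 5, nan, 1, 1, 3, nan],
    [3, 2, nan, nan, nan, 1, nan, nan, nan],
    [1, 4, 4, nan, nan, 1, 1, nan, nan],
])

def cosine_on_shared(a, b):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    shared = ~np.isnan(a) & ~np.isnan(b)
    if not shared.any():
        return 0.0
    return float(a[shared] @ b[shared] /
                 (np.linalg.norm(a[shared]) * np.linalg.norm(b[shared])))

def predict(R, user, item):
    """Similarity-weighted mean of other users' ratings for `item`."""
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or np.isnan(R[other, item]):
            continue  # skip the target user and users who did not rate the item
        sim = cosine_on_shared(R[user], R[other])
        num += sim * R[other, item]
        den += abs(sim)
    return num / den if den > 0 else np.nan

# Predict user 3's rating for Feed 1 (row index 2, column index 0).
print(predict(R, user=2, item=0))
```

Users 4 and 5 both rated Feed 1 (3 and 1), and each agrees exactly with user 3 on the feeds they share, so the prediction is the mean of those two ratings. A feed nobody has rated yields `nan`, which is the item cold-start case.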

### Matrix factorization techniques

> *Find the latent factors that help explain the patterns in a ratings matrix*

*Matrix factorization* is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of lower-dimension matrices. In general, the user-item interaction matrix will be very large and very sparse. The lower-dimension matrices will be much smaller and denser, and can be used to reconstruct the user-item interaction matrix, including predictions where values were previously missing.
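One common way to write this down is shown below. The notation is an assumption chosen for illustration: $m$ users, $n$ items, $k$ latent factors, $\Omega$ the set of observed user-item pairs, and $\lambda$ a regularization weight.

```latex
R \approx U V^{\top}, \qquad U \in \mathbb{R}^{m \times k}, \quad V \in \mathbb{R}^{n \times k}, \quad k \ll \min(m, n)

\hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{v}_i

\min_{U, V} \sum_{(u, i) \in \Omega} \left[ \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \lVert \mathbf{u}_u \rVert^2 + \lVert \mathbf{v}_i \rVert^2 \right) \right]
```

The sum runs only over observed ratings, so missing entries do not contribute to the loss. After fitting, $\hat{r}_{ui}$ can be evaluated for any user-item pair, which is what fills in the blanks.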

**Common approaches**

* Singular Value Decomposition (SVD)
* UV Decomposition (UVD)
* Non-negative Matrix Factorization (NMF)

Matrix factorization is generally accomplished using *Alternating Least Squares (ALS)* or *Stochastic Gradient Descent (SGD)*. Hyperparameters are used to control the regularization and the relative weighting of implicit versus explicit rating matrices. With recommender systems we are most concerned with scale at prediction. Because user ratings change slowly, if at all, the algorithm does not need to be retrained frequently and so this can be done at night. For this reason, **Spark** is a common platform for developing recommender systems. The computation is already distributed under the Spark framework so scaling infrastructure is straightforward.
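The following is a minimal NumPy sketch of the SGD variant, fitted to the small AAVAIL example matrix. It is a plain two-factor model without bias terms, so it is not the implementation used by `surprise` or Spark. The function name and the hyperparameter values (`k`, `lr`, `reg`, `epochs`) are assumptions picked for this toy example.

```python
import numpy as np

def factorize_sgd(R, k=2, lr=0.01, reg=0.05, epochs=1000, seed=0):
    """Fit R ~ U @ V.T on the observed (non-NaN) entries using SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    observed = np.argwhere(~np.isnan(R))  # (user, item) index pairs
    for _ in range(epochs):
        rng.shuffle(observed)  # visit the known ratings in random order
        for u, i in observed:
            err = R[u, i] - U[u] @ V[i]
            u_old = U[u].copy()  # keep the old user vector for the item update
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

nan = np.nan
R = np.array([
    [nan, nan, 4, nan, nan, nan, 1, nan, nan],
    [nan, nan, 4, nan, nan, nan, 1, nan, 2],
    [nan, nan, 4, 5, nan, 1, 1, 3, nan],
    [3, 2, nan, nan, nan, 1, nan, nan, nan],
    [1, 4, 4, nan, nan, 1, 1, nan, nan],
])

U, V = factorize_sgd(R)
R_hat = U @ V.T  # dense reconstruction, including previously unrated cells
mask = ~np.isnan(R)
train_rmse = np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2))
print("training RMSE: {:.3f}".format(train_rmse))
print(np.round(R_hat, 1))
```

ALS optimizes the same kind of objective but alternates between solving for the user factors with the item factors held fixed and vice versa. Each of those steps is a set of independent least-squares problems, which is one reason it distributes well.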

There are several Python packages available to help create recommenders, including `surprise`. Because scale with respect to prediction is often a concern for recommender systems, many production environments use the implementation found in Spark MLlib. The Spark collaborative filtering implementation uses Alternating Least Squares.

# Surprise Package

In [1]:
#!pip install surprise



In [2]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

In [3]:
# Load the movielens-100k dataset
data = Dataset.load_builtin("ml-100k")

# We will use the famous SVD algorithm
algo = SVD()

# Run 5-fold cross-validation and print the results
results = cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=False)

print("test RMSE: {}".format(round(results["test_rmse"].mean(), 3)))

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /home/spark/shared/.surprise_data/ml-100k
test RMSE: 0.936


# Spark MLlib Recommender Example

In [4]:
import os
import shutil
import pyspark as ps
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.sql import Row
from pyspark.sql.types import DoubleType

In [5]:
## ensure the spark context is available
spark = (ps.sql.SparkSession.builder
         .appName("sandbox")
         .getOrCreate()
        )

sc = spark.sparkContext
print(spark.version)

2.4.3


### Make Notebook Run within IBM Watson

In [6]:
# The code was removed by Watson Studio for sharing.

In [7]:
# START CODE BLOCK
# cos2file - takes an object from Cloud Object Storage and writes it to file on container file system.
# Uses the IBM project_lib library.
# See https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/project-lib-python.html
# Arguments:
# p: project object defined in project token
# data_path: the directory to write the file
# filename: name of the file in COS

import os
def cos2file(p,data_path,filename):
    data_dir = p.project_context.home + data_path
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    # use a context manager so the file handle is closed after writing
    with open(data_dir + '/' + filename, 'wb') as f:
        f.write(p.get_file(filename).read())

# file2cos - takes file on container file system and writes it to an object in Cloud Object Storage.
# Uses the IBM project_lib library.
# See https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/project-lib-python.html
# Arguments:
# p: project object defined in project token
# data_path: the directory to read the file from
# filename: name of the file on container file system

def file2cos(p,data_path,filename):
    data_dir = p.project_context.home + data_path
    path_to_file = data_dir + '/' + filename
    if os.path.exists(path_to_file):
        # close the file handle once the upload finishes
        with open(path_to_file, 'rb') as file_object:
            p.save_data(filename, file_object, set_project_asset=True, overwrite=True)
    else:
        print("file2cos error: File not found")
# END CODE BLOCK

In [8]:
cos2file(project, '/data', 'movies.csv')
cos2file(project, '/data', 'ratings.csv')

In [9]:
# paths to the sample MovieLens ratings and movies files
data_dir = os.path.join(".", "data")
ratings_file = os.path.join(data_dir, "ratings.csv")
movies_file = os.path.join(data_dir, "movies.csv")

In [10]:
# Load the data
df = spark.read.csv(ratings_file, header=True, inferSchema=True)
df2 = spark.read.csv(movies_file, header=True, inferSchema=True)

df.show(n=4)
df2.show(n=4)

# split the data to training and test sets
(training, test) = df.randomSplit([0.8, 0.2])

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
+------+-------+------+---------+
only showing top 4 rows

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
+-------+--------------------+--------------------+
only showing top 4 rows



### Model Training

In [11]:
## train the recommender model with ALS
als_alg = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop")

model = als_alg.fit(training)

## evaluate the recommender with the holdout set
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

rmse = evaluator.evaluate(predictions)
print("RMSE = {}".format(round(rmse, 3)))

RMSE = 1.093


### Generate user and movie recommendations

In [12]:
## Generate top 10 movie recommendations for each user
user_recs = model.recommendForAllUsers(10)
user_recs.show(n=4)

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[7099, 8.933234]...|
|   463|[[175303, 7.42290...|
|   496|[[86320, 8.286928...|
|   148|[[3099, 6.1729946...|
+------+--------------------+
only showing top 4 rows



In [13]:
## Generate top 10 user recommendations for each movie
movies_recs = model.recommendForAllItems(10)
movies_recs.show(n=4)

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|   1580|[[37, 5.566883], ...|
|   4900|[[549, 11.236831]...|
|   5300|[[296, 8.064962],...|
|   6620|[[55, 6.7833996],...|
+-------+--------------------+
only showing top 4 rows



In [14]:
## Generate top 10 movie recommendations for a specified set of users
users = df.select(als_alg.getUserCol()).distinct().limit(3)
users_subset_recs = model.recommendForUserSubset(users, 10)
users_subset_recs.show(4)

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[7099, 8.933234]...|
|   463|[[175303, 7.42290...|
|   148|[[3099, 6.1729946...|
+------+--------------------+



In [15]:
## Generate top 10 user recommendations for a specified set of movies
movies = df.select(als_alg.getItemCol()).distinct().limit(3)
movie_subset_recs = model.recommendForItemSubset(movies, 10)
movie_subset_recs.show(n=4)

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|   1580|[[37, 5.566883], ...|
|   3175|[[296, 7.0045047]...|
|   2366|[[296, 9.403867],...|
+-------+--------------------+



In [16]:
## match the recs to movie titles and genres by movieId
recs = movie_subset_recs.toPandas()
movies = df2.toPandas().set_index("movieId")  # look up by movieId, not by row position
rec_titles = movies.loc[recs["movieId"], "title"].tolist()
rec_genres = movies.loc[recs["movieId"], "genres"].tolist()

for title, genres in zip(rec_titles, rec_genres):
    print(title, "-->", genres)


### Model Persistence

In [17]:
## remove directory if it already exists
save_dir = "saved-recommender"
if os.path.isdir(save_dir):
    print("overwriting saved model")
    shutil.rmtree(save_dir)
    
# save model
model.save(save_dir)

In [18]:
from_saved_model = ALSModel.load(save_dir)

In [19]:
## score a few (user, movie) pairs with the reloaded model to confirm it round-trips
test = spark.createDataFrame([(1, 2), (2, 10), (3, 20)], ["userId", "movieId"])
predictions = sorted(from_saved_model.transform(test).collect(), key=lambda r: r[0])
print(predictions)

[Row(userId=1, movieId=2, prediction=3.8280413150787354), Row(userId=2, movieId=10, prediction=3.844179153442383), Row(userId=3, movieId=20, prediction=-2.1697897911071777)]


# Recommender Systems in Production

### Cold Start

One issue that arises with recommender systems in production is known as the cold-start problem. Cold start is a potential problem in computer-based information systems which involve a degree of automated data modelling. Specifically, it concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.

There are two scenarios when it comes to the cold start problem:

**What shall we recommend to a new user?**
* If the recommender is popularity-based then the most popular items are recommended and this is not a problem. If the recommender is similarity-based, the user could rate five items as part of the sign-up or you could attempt to infer similarity based on user meta-data such as age, gender, location, etc. Even if recommendations are based on similarities, you may still use the most popular items to get the user started, but you would likely want to customize the list possibly based on meta-data.

**How should we treat a new item that hasn't been reviewed?**
* In order to make good recommendations, you need data about how users review the item. But until the item has been recommended, it's unlikely that users will review it. To overcome this dilemma, the item could be randomly suggested for a trial period to collect data. You could put it in a special section, such as new releases, to gauge initial interest. You can also use meta-data associated with the item to find similar items and infer its recommendations from these similar items.
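The new-user scenario can be sketched as a simple fallback: serve personalized recommendations when the model knows the user, and popular items otherwise. The example below is a hedged illustration in plain Python. The `personalized` dictionary stands in for the output of a trained model, and the user ids, `min_ratings` threshold, and function names are all hypothetical.

```python
from collections import defaultdict

def popular_items(ratings, min_ratings=2, top_n=3):
    """Rank items by mean rating, ignoring items with too few ratings."""
    totals = defaultdict(lambda: [0.0, 0])  # item -> [sum of ratings, count]
    for _user, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    means = {item: s / n for item, (s, n) in totals.items() if n >= min_ratings}
    # highest mean first; ties broken by item name for a stable order
    return sorted(means, key=lambda item: (-means[item], item))[:top_n]

def recommend(user, ratings, personalized, top_n=3):
    """Use personalized recommendations if present, else fall back to popularity."""
    if user in personalized:
        return personalized[user][:top_n]
    return popular_items(ratings, top_n=top_n)

# A few (user, feed, rating) triples in the spirit of the AAVAIL table.
ratings = [
    ("u1", "Feed 3", 4), ("u1", "Feed 7", 1),
    ("u2", "Feed 3", 4), ("u2", "Feed 9", 2),
    ("u3", "Feed 4", 5),
    ("u4", "Feed 1", 3), ("u4", "Feed 2", 2),
    ("u5", "Feed 1", 1), ("u5", "Feed 2", 4),
]
personalized = {"u1": ["Feed 4", "Feed 8"]}  # stand-in for model output

print(recommend("u1", ratings, personalized))        # known user
print(recommend("new_user", ratings, personalized))  # cold start
```

The `min_ratings` threshold keeps an item with a single enthusiastic rating from topping the list, which is also why a brand-new item never surfaces here and needs one of the trial strategies described above.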

**Concurrency** can be a challenge for recommender systems. A recommender might, for example, find the 20 closest users based on latent factor profiles. From those users it would identify a list of potential recommendations that could be sorted and filtered given what is known about the user. The distances between users can often be pre-computed to speed up the recommendations because user characteristics change slowly. Nevertheless, this process has a few steps that require a burst of compute time. If five users hit the service at the same time, there is a possibility that the processors get weighed down with these simultaneous requests and recommendations become u