# Spark Learning Note - Recommendation
Jia Geng | gjia0214@gmail.com


<a id='directory'></a>

## Directory

- [Data Source](https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/)
- [1. Alternating Least Squares and Collaborative Filtering](#sec1)
- [2. Model Params](#sec2)
- [3. Evaluator and Metrics](#sec3)
    - [3.1 Regression Metrics](#sec3-1)
    - [3.2 Ranking Metrics](#sec3-2)

In [74]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('MLexample').getOrCreate()
spark

In [75]:
data_path = '/home/jgeng/Documents/Git/SparkLearning/book_data/sample_movielens_ratings.txt'
data = spark.read.text(data_path)
data.show(1)
# split each raw line on '::' to turn the string into an array of strings
data = data.selectExpr("split(value, '::') as col")
data.show(1, False)

# cast each array element to a typed, named column
data = data.selectExpr('cast(col[0] as int) as userID',
                       'cast(col[1] as int) as movieID',
                       'cast(col[2] as float) as rating',
                       'cast(col[3] as long) as timestamp')
data.show(1)
train, test = data.randomSplit([0.8, 0.2])
train.cache()
test.cache()
print(train.count(), test.count())
print(train.where('rating is null').count())

+-------------------+
|              value|
+-------------------+
|0::2::3::1424380312|
+-------------------+
only showing top 1 row

+---------------------+
|col                  |
+---------------------+
|[0, 2, 3, 1424380312]|
+---------------------+
only showing top 1 row

+------+-------+------+----------+
|userID|movieID|rating| timestamp|
+------+-------+------+----------+
|     0|      2|   3.0|1424380312|
+------+-------+------+----------+
only showing top 1 row

1179 322
0


## 1. Alternating Least Squares and Collaborative Filtering <a id='sec1'></a>

Spark has an implementation of Alternating Least Squares (ALS) for collaborative filtering. ALS finds a k-dimensional feature vector for each user and each item such that the dot product of a user's feature vector with an item's feature vector approximates that user's rating for that item. The dataset should include existing ratings for user-item pairs:
- a user ID column (must be an int)
- an item ID column (must be an int)
- a rating column (must be a float)
    - the rating can be explicit: a numerical rating that the system should predict directly
    - or implicit: rating represents the strength of interactions between a user and item (e.g. number of visits to a particular page)

The goal of the recommendation system: given an input DataFrame, the model produces feature vectors that can be used to predict users' ratings for items they have not yet rated.
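As a minimal sketch of the dot-product idea described above, the plain-Python snippet below predicts a rating from two feature vectors. The factor values, the IDs, and the helper name `predict_rating` are made up for illustration; a fitted ALS model would supply the real vectors.

```python
# Hypothetical rank-3 feature vectors (illustrative values, not from a fitted model)
user_factors = {0: [0.5, 1.0, -0.2]}
item_factors = {2: [1.0, 2.0, 0.5], 7: [0.0, -1.0, 1.0]}


def predict_rating(user_id, item_id):
    """Predicted rating = dot product of the user and item feature vectors."""
    u = user_factors[user_id]
    v = item_factors[item_id]
    return sum(a * b for a, b in zip(u, v))


# The same user vector yields different predictions for different items
print(predict_rating(0, 2))
print(predict_rating(0, 7))
```

Note that nothing in a dot product restricts the result to the rating scale, so predictions can fall outside the range of the observed ratings.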

A potential issue with such a system is the **cold start problem**:
- when a new product is introduced and no user has expressed a preference for it yet, the algorithm will not recommend it to many people.
- when a new user onboards onto the platform, they may not have many ratings yet, so the algorithm will not know what to recommend to them.

MLlib can scale the algorithm to millions of users, millions of items, and billions of ratings.

[back to top](#directory)

## 2. Model Params <a id='sec2'></a>

**Hyperparams**

|Name|Input|Notes|
|-|-|-|
|rank|int|the dimension of the feature vectors learned for users and items. **Controls the bias-variance trade-off.** Default is 10.|
|alpha|float|when training on implicit feedback, alpha sets a baseline confidence for preference. Default is 1.0.|
|regParam|float|regularization parameter. Default is 0.1.|
|implicitPrefs|bool|whether to train on implicit or explicit feedback. Default is False (explicit).|
|nonnegative|bool|whether to place non-negative constraints on the feature vectors in the least squares problem. Default is False.|
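To make the role of `alpha` more concrete, here is a small sketch of the confidence weighting commonly described for implicit-feedback ALS, where confidence is `1 + alpha * strength`. This is an assumption about the general formulation, not a statement about Spark's internals, and the function name is hypothetical.

```python
def implicit_preference_and_confidence(raw_strength, alpha=1.0):
    """Map an implicit interaction strength to (preference, confidence).

    Assumed formulation: preference is 1 when any interaction was observed,
    and confidence grows linearly with the interaction strength.
    """
    preference = 1.0 if raw_strength > 0 else 0.0
    confidence = 1.0 + alpha * raw_strength
    return preference, confidence


# A page never visited vs. a page visited ten times
no_visits = implicit_preference_and_confidence(0)
ten_visits = implicit_preference_and_confidence(10, alpha=1.0)
print(no_visits, ten_visits)
```

Under this reading, a larger `alpha` makes the model trust observed interactions more strongly relative to unobserved ones.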

**Training Params**

|Name|Input|Notes|
|-|-|-|
|numUserBlocks|int|how many blocks to split the users into. Default is 10.|
|numItemBlocks|int|how many blocks to split the items into. Default is 10.|
|maxIter|int|total number of iterations over the data before stopping. Default is 10.|
|checkpointInterval|int|how often (in iterations) to checkpoint during training. Ignored if no checkpoint directory is set on the SparkContext. Default is 10.|
|seed|int|random seed for replicating results.|

Rule of thumb for deciding how much data to put in each block:
- aim for one to five million ratings per block
- if each block already holds less data than that, adding more blocks will not improve the algorithm's performance.
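The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The helper below is hypothetical (it is not part of Spark) and simply translates "one to five million ratings per block" into a range of block counts.

```python
import math


def suggested_block_range(num_ratings, low=1_000_000, high=5_000_000):
    """Return (min_blocks, max_blocks) so each block holds roughly low..high ratings."""
    # Fewest blocks: fill every block up to the upper bound
    min_blocks = max(1, math.ceil(num_ratings / high))
    # Most blocks: keep at least the lower bound in every block
    max_blocks = max(min_blocks, num_ratings // low)
    return min_blocks, max_blocks


print(suggested_block_range(100_000_000))  # a large dataset
print(suggested_block_range(1179))         # the small training split used in this note
```

For a dataset as small as the one in this note, the rule suggests that tuning the block counts is unlikely to matter.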

**Prediction Params**

|Name|Input|Notes|
|-|-|-|
|coldStartStrategy|'nan', 'drop'|strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Default is 'nan'.|

By default, Spark assigns NaN as the prediction when it encounters a user or item that was not present when the model was trained. If this happens while evaluating the model (for example, on a held-out split), the NaN values prevent the evaluator from properly measuring the model's performance.

Setting `coldStartStrategy` to 'drop' drops any rows that contain a NaN prediction, so that the remaining rows are valid for evaluation.
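The effect of a single NaN prediction on an evaluation metric can be illustrated without Spark. The snippet below is a plain-Python sketch with made-up (rating, prediction) pairs; it mimics what 'drop' achieves by filtering out NaN predictions before computing RMSE.

```python
import math

# Hypothetical (rating, prediction) pairs; the NaN stands for an unseen user or item
pairs = [(3.0, 2.5), (1.0, 1.5), (4.0, float('nan'))]


def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    squared_errors = [(rating - pred) ** 2 for rating, pred in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# One NaN prediction turns the whole metric into NaN
rmse_with_nan = rmse(pairs)

# Dropping rows with NaN predictions makes the metric usable again
kept = [(r, p) for r, p in pairs if not math.isnan(p)]
rmse_after_drop = rmse(kept)

print(rmse_with_nan, rmse_after_drop)
```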

[back to top](#directory)

In [76]:
from pyspark.ml.recommendation import ALS

print(ALS().explainParams())

alpha: alpha for implicit preference (default: 1.0)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'. (default: nan)
finalStorageLevel: StorageLevel for ALS model factors. (default: MEMORY_AND_DISK)
implicitPrefs: whether to use implicit preference (default: False)
intermediateStorageLevel: StorageLevel for intermediate datasets. Cannot be 'NONE'. (default: MEMORY_AND_DISK)
itemCol: column name for item ids. Ids must be within the integer value range. (default: item)
maxIter: max number of iterations (>= 0). (default: 10)
n

In [77]:
train.printSchema()

root
 |-- userID: integer (nullable = true)
 |-- movieID: integer (nullable = true)
 |-- rating: float (nullable = true)
 |-- timestamp: long (nullable = true)



In [78]:
als = ALS().setMaxIter(5)\
            .setRegParam(0.01)\
            .setUserCol('userID')\
            .setItemCol('movieID')\
            .setRatingCol('rating')
alsclf = als.fit(train)

In [79]:
predictions = alsclf.transform(test)
predictions.show(20, False)
predictions.count()

+------+-------+------+----------+-----------+
|userID|movieID|rating|timestamp |prediction |
+------+-------+------+----------+-----------+
|26    |31     |1.0   |1424380312|-1.9767356 |
|27    |31     |1.0   |1424380312|-1.0848575 |
|12    |31     |4.0   |1424380312|0.87788236 |
|4     |31     |1.0   |1424380312|-0.2520102 |
|8     |31     |3.0   |1424380312|2.2586     |
|28    |85     |1.0   |1424380312|6.9975033  |
|26    |85     |1.0   |1424380312|-0.6348226 |
|23    |85     |1.0   |1424380312|-1.3387568 |
|16    |65     |1.0   |1424380312|-0.518541  |
|3     |65     |2.0   |1424380312|-0.20969684|
|19    |65     |1.0   |1424380312|0.7038306  |
|4     |65     |1.0   |1424380312|-0.31682414|
|23    |65     |5.0   |1424380312|1.1317939  |
|20    |53     |3.0   |1424380312|-1.670781  |
|21    |53     |5.0   |1424380312|1.3977592  |
|14    |53     |3.0   |1424380312|2.7680461  |
|22    |78     |1.0   |1424380312|0.38458624 |
|24    |78     |1.0   |1424380312|0.81120306 |
|11    |78   

322

## 3. Evaluator and Metrics <a id='sec3'></a>

[back to top](#directory)

### 3.1 Regression Metrics <a id='sec3-1'></a>
We can use regression metrics to measure the root mean squared error (RMSE) between the predicted ratings and the actual ratings.
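For reference, RMSE over $N$ test rows is usually written as follows, where $r_i$ is the actual rating and $\hat{r}_i$ is the predicted rating (the symbols are notation chosen here, not taken from the Spark documentation):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - \hat{r}_i\right)^2}
$$

Because the error is squared, swapping the roles of $r_i$ and $\hat{r}_i$ does not change the value.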

[back to top](#directory)

In [80]:
from pyspark.mllib.evaluation import RegressionMetrics

# build an RDD of (prediction, observation) pairs, the order RegressionMetrics documents
input_data = predictions.select('prediction', 'rating').rdd.map(lambda x: (x[0], x[1]))
input_data

PythonRDD[2059] at RDD at PythonRDD.scala:53

In [81]:
# compute the regression metrics
metrics = RegressionMetrics(input_data)
metrics.rootMeanSquaredError

2.053136657182337

### 3.2 Ranking Metrics <a id='sec3-2'></a>
Spark provides another tool, `RankingMetrics`, which offers a more sophisticated way of measuring the performance of a recommendation system. Ranking metrics do not focus on the predicted rating value itself. They focus on whether the algorithm recommends items that the user has already rated highly.

For example, if a person gave a movie a very high rating, will the system recommend this movie to that person?

From a high-level point of view, we can:
- predict the person's rating for every movie in the dataset
- rank the movies by predicted rating
- check whether the highly rated movie is associated with a high rank

In Spark, we:
- train the model and make predictions on the test set
- set a rating threshold that defines a 'high' rating
- filter the ground truth by the threshold, then aggregate by user to collect the highly rated items into a list (DF A)
- filter the predictions by the threshold, then aggregate by user to collect the items with a high predicted rating into a list (DF B)
- join A and B on the user
- call the ranking metrics on the joined DataFrame's RDD (taking only the top k items from the predictions column as recommendations)
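To show what a ranking metric measures once the per-user lists exist, here is a plain-Python sketch of precision at k. The lists are illustrative values and the function is a simplified stand-in, not Spark's implementation. It divides by k even when fewer than k items were predicted, which is how Spark's documentation describes `precisionAt`.

```python
def precision_at_k(predicted, truths, k):
    """Fraction of the top-k predicted items that appear in the ground-truth set.

    Divides by k even when fewer than k items are predicted.
    """
    truth_set = set(truths)
    hits = sum(1 for item in predicted[:k] if item in truth_set)
    return hits / k


# Illustrative (predicted ranking, ground truth) pairs, one per user
per_user = [
    ([85, 2, 58, 95, 92], [81, 19, 2, 62, 92]),
    ([88, 44], [88, 24, 94]),
    ([80], [75, 55, 80]),
]

# Average the per-user precision to get one score for the model
mean_precision = sum(precision_at_k(p, t, 5) for p, t in per_user) / len(per_user)
print(mean_precision)
```

The order of the predicted list matters for ranking metrics, so in practice the predictions should be sorted by predicted rating before the top k are taken.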

[back to top](#directory)

In [82]:
from pyspark.sql.functions import expr

# keep only the movies with a high actual rating (> 2.5 in this case)
filtered_truth = predictions.where('rating > 2.5')

# use collect_set to gather the highly rated movies for each user
agg_truth = filtered_truth.groupBy('userID').agg(expr('collect_set(movieID) as truths'))
agg_truth.show(3)

# keep only the movies with a high predicted rating
# note: collect_set does not guarantee any order, so this list is not sorted by predicted rating
filtered_pred = predictions.where('prediction > 2.5')
agg_pred = filtered_pred.groupBy('userID').agg(expr('collect_set(movieID) as predictions'))
agg_pred.show(3)

# join the two DF
joined = agg_truth.join(agg_pred, 'userID')
joined.show(3)

+------+-------------------+
|userID|             truths|
+------+-------------------+
|    28|[81, 19, 2, 62, 92]|
|    26|       [88, 24, 94]|
|    27|       [75, 55, 80]|
+------+-------------------+
only showing top 3 rows

+------+-------------------+
|userID|        predictions|
+------+-------------------+
|    28|[85, 2, 58, 95, 92]|
|    26|           [88, 44]|
|    27|               [80]|
+------+-------------------+
only showing top 3 rows

+------+-------------------+-------------------+
|userID|             truths|        predictions|
+------+-------------------+-------------------+
|    28|[81, 19, 2, 62, 92]|[85, 2, 58, 95, 92]|
|    26|       [88, 24, 94]|           [88, 44]|
|    27|       [75, 55, 80]|               [80]|
+------+-------------------+-------------------+
only showing top 3 rows



In [83]:
from pyspark.mllib.evaluation import RankingMetrics

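k = 15  # recommend top k from predictions
# RankingMetrics documents its input as (predicted ranking, ground truth) pairs.
# This cell passes (truths, top-k predictions), so the scores below were computed in that order.
rdds = joined.rdd.map(lambda row: (row[1], row[2][:k]))
metrics = RankingMetrics(rdds)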

In [84]:
metrics.meanAveragePrecision

0.38449735449735445

In [85]:
metrics.precisionAt(5)

0.20952380952380956

In [86]:
??metrics