# Spark Learning Note - Recommendation
Jia Geng | gjia0214@gmail.com


In [1]:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.appName('MLexample').getOrCreate()
spark

## 1. Alternating Least Squares and Collaborative Filtering

Spark has an implementation of Alternating Least Squares (ALS) for collaborative filtering. ALS finds a k-dimensional feature vector for each user and each item (k is the `rank` parameter) such that the dot product of a user's feature vector with an item's feature vector approximates that user's rating for that item. The dataset should include existing ratings for user-item pairs:
- a user ID column (must be an integer)
- an item ID column (must be an integer)
- a rating column (must be a float)
    - the rating can be explicit: a numerical rating that the system should predict directly
    - or implicit: the rating represents the strength of the interactions between a user and an item (e.g. the number of visits to a particular page)

The goal of a recommendation system is this: given an input DataFrame, the model produces feature vectors that can be used to predict users' ratings for items they have not yet rated.
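
To make the alternating idea concrete, here is a minimal single-machine sketch in NumPy. It is not MLlib's implementation (it ignores the distributed blocking and uses a plain ridge penalty), and the toy ratings matrix, the `als_sketch` and `objective` names, and the convention that `0` means "not rated" are all assumptions made for illustration. With the item vectors fixed, each user vector is the solution of a small regularized least-squares problem, and vice versa.

```python
import numpy as np

# Toy explicit ratings (made-up numbers); 0 marks "not rated yet".
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])
observed = R > 0  # boolean mask of known user-item ratings


def objective(R, observed, U, V, reg):
    """Squared error on observed ratings plus a ridge penalty on all factors."""
    err = (U @ V.T - R)[observed]
    return float(err @ err + reg * ((U ** 2).sum() + (V ** 2).sum()))


def als_sketch(R, observed, rank=2, reg=0.1, n_iter=20, seed=0):
    """Alternate between solving for user factors U and item factors V."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(size=(n_users, rank))
    V = rng.normal(size=(n_items, rank))
    ridge = reg * np.eye(rank)
    history = [objective(R, observed, U, V, reg)]
    for _ in range(n_iter):
        # Fix V: each user vector is a ridge regression on the items they rated.
        for u in range(n_users):
            Vu = V[observed[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, observed[u]])
        # Fix U: each item vector is a ridge regression on the users who rated it.
        for i in range(n_items):
            Ui = U[observed[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[observed[:, i], i])
        history.append(objective(R, observed, U, V, reg))
    return U, V, history


U, V, history = als_sketch(R, observed)
predicted = U @ V.T  # dot product of every user vector with every item vector
print("predicted rating of user 0 for unrated item 2:", predicted[0, 2])
```

Because each inner step exactly minimizes the objective over one block of factors, the recorded `history` should never increase from one sweep to the next.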

A potential weakness of such a system is the **cold start problem**:
- when a new product is introduced and no user has expressed a preference for it yet, the algorithm will not recommend it to many people.
- when a new user is onboarding onto the platform, they may not have many ratings yet, so the algorithm won't know what to recommend to them.

MLlib can scale the algorithm to millions of users, millions of items, and billions of ratings.


## 2. Model Params 

**Hyperparams**

|Name|Input|Notes|
|-|-|-|
|rank|int|The dimension of the feature vectors learned for users and items. **Controls the bias-variance trade-off.** Default is 10.|
|alpha|float|Alpha for implicit preference (applies when training on implicit feedback). Default is 1.0.|
|regParam|float|Regularization parameter. Default is 0.1.|
|implicitPrefs|bool|Whether to train on implicit or explicit feedback. Default is False (explicit).|
|nonnegative|bool|Whether to place non-negativity constraints on the least-squares problem. Default is False.|

**Training Params**

|Name|Input|Notes|
|-|-|-|
|numUserBlocks|int|How many blocks to split the users into. Default is 10.|
|numItemBlocks|int|How many blocks to split the items into. Default is 10.|
|maxIter|int|Total number of iterations over the data before stopping. Default is 10.|
|checkpointInterval|int|Checkpoint every N iterations during training (-1 disables it). Ignored if no checkpoint directory is set in the SparkContext. Default is 10.|
|seed|int|Random seed for replicating results.|
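
As a rough sketch of how the hyperparameters and training parameters above map onto the estimator, the dictionary below collects them as keyword arguments for `pyspark.ml.recommendation.ALS`. The values are the defaults listed in the tables, except `seed`, which is an arbitrary choice. The `build_als` helper and the column names `user`, `item` and `rating` are assumptions for illustration; adjust them to your DataFrame.

```python
# Keyword arguments for ALS; values mirror the defaults in the tables above.
als_params = {
    # hyperparameters
    "rank": 10,
    "alpha": 1.0,
    "regParam": 0.1,
    "implicitPrefs": False,
    "nonnegative": False,
    # training parameters
    "numUserBlocks": 10,
    "numItemBlocks": 10,
    "maxIter": 10,
    "checkpointInterval": 10,
    "seed": 42,  # arbitrary value, chosen only for reproducibility
}


def build_als(params, user_col="user", item_col="item", rating_col="rating"):
    """Hypothetical helper: create an ALS estimator from a params dict.

    Requires an active SparkSession, so the import is deferred until call time.
    """
    from pyspark.ml.recommendation import ALS

    return ALS(userCol=user_col, itemCol=item_col, ratingCol=rating_col, **params)


# Typical usage inside the notebook (ratings_df is a hypothetical DataFrame
# with integer user/item columns and a float rating column):
# als = build_als(als_params)
# model = als.fit(ratings_df)
```

Keeping the parameters in a plain dictionary makes it easy to log them or vary one value at a time when tuning.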

**Prediction Params**

|Name|Input|Notes|
|-|-|-|
|coldStartStrategy|'nan', 'drop'|Strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item IDs the model has not seen in the training data. Default is 'nan'.|
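
The difference between the two strategies can be illustrated with a small pure-Python emulation. This is not Spark code: the factor dictionaries, their numbers, and the `predict` function are made up for illustration. With `'nan'`, a user or item without a learned feature vector gets a NaN prediction; with `'drop'`, that row is removed from the output. Dropping matters in cross-validation, because a single NaN prediction turns an aggregate metric such as RMSE into NaN.

```python
import math

# Hypothetical learned factors keyed by ID (rank 2, made-up numbers).
user_factors = {1: [0.9, 0.1], 2: [0.2, 0.8]}
item_factors = {10: [4.0, 1.0], 20: [1.0, 5.0]}


def predict(pairs, cold_start_strategy="nan"):
    """Emulate prediction for (user, item) pairs under a cold start strategy."""
    rows = []
    for user, item in pairs:
        u = user_factors.get(user)
        v = item_factors.get(item)
        if u is None or v is None:
            # The model never saw this user or item during training.
            if cold_start_strategy == "drop":
                continue
            rows.append((user, item, math.nan))
        else:
            rows.append((user, item, sum(a * b for a, b in zip(u, v))))
    return rows


pairs = [(1, 10), (2, 20), (3, 10), (1, 99)]  # user 3 and item 99 are unseen
with_nan = predict(pairs, "nan")
dropped = predict(pairs, "drop")
print(with_nan)
print(dropped)
```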

In [3]:
from pyspark.ml.recommendation import ALS

print(ALS().explainParams())

alpha: alpha for implicit preference (default: 1.0)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)
coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'. (default: nan)
finalStorageLevel: StorageLevel for ALS model factors. (default: MEMORY_AND_DISK)
implicitPrefs: whether to use implicit preference (default: False)
intermediateStorageLevel: StorageLevel for intermediate datasets. Cannot be 'NONE'. (default: MEMORY_AND_DISK)
itemCol: column name for item ids. Ids must be within the integer value range. (default: item)
maxIter: max number of iterations (>= 0). (default: 10)
n