## Model-based Recommendation System with Matrix Factorization - ALS Model

In this example, we will build a recommendation system using the ALS model from __pyspark__.

A sparse utility matrix __R__ can be built from the user-item interactions and their ratings.

The goal of matrix factorization is to decompose the utility matrix into a __user latent matrix__ $U$ and a __product latent matrix__ $P$, such that

$$R \approx U P^T$$

There are many ways to factorize the utility matrix, such as Singular Value Decomposition (SVD) and Probabilistic Latent Semantic Analysis (PLSA). Alternating Least Squares (ALS) is an iterative method that optimizes the factorization by alternately fixing one latent matrix and solving for the other.



### Dataset
In this example, we use the MovieLens 100K dataset (ml-100k).

Source: https://grouplens.org/datasets/movielens/

### Mathematics behind the ALS Model
First, we define the objective function in terms of a loss function; the model is then optimized by minimizing this loss.

Loss function: $RMSE = \sqrt{\frac{1}{N}\sum (real - prediction)^2}$,
where real = $R$, prediction = $UP^T$, and $N$ is the number of ratings being compared.
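
As a quick check of the formula, here is a minimal sketch in plain Python. The function name `rmse` and the toy ratings are illustrative assumptions, not values from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between paired ratings and predictions."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values for illustration only
ratings = [3.0, 4.0, 5.0, 2.0]
predictions = [2.5, 4.0, 4.0, 3.0]
print(rmse(ratings, predictions))  # squared errors 0.25, 0, 1, 1 -> sqrt(2.25 / 4) = 0.75
```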

Assume there are $m$ users and $n$ items, so that $R \in \mathbb{R}^{m \times n}$, $U \in \mathbb{R}^{m \times k}$ and $P \in \mathbb{R}^{n \times k}$, where $k$ is the number of __latent factors__.

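Minimizing the RMSE is equivalent to minimizing the sum of squared errors, where $U_x$ and $P_y$ denote the rows of $U$ and $P$ for user $x$ and item $y$:

$$
\begin{aligned}
\min_{U,P}\ loss &= \min_{U,P} \sum (real-prediction)^2 \\
    &= \min_{U,P} \lVert R-UP^T \rVert^2 \\
    &= \min_{U,P} \sum_{x,y}{(R_{x,y} - U_{x}P_{y}^T)^2}
\end{aligned}
$$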

To avoid overfitting, we add an $L_2$ regularization term to the objective function, such that
$$loss = \min_{U,P} \sum_{x,y}{(R_{x,y} - U_{x}P_{y}^T)^2} + \lambda(\lVert U \rVert^2+\lVert P \rVert^2)$$
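
The regularized objective can be evaluated directly with NumPy. The sketch below is only an illustration: `regularized_loss` and the `mask` argument are hypothetical names, and the mask (1 for an observed rating, 0 otherwise) reflects the assumption that the sum runs over observed ratings only.

```python
import numpy as np

def regularized_loss(R, mask, U, P, lam):
    """Squared error on observed entries plus an L2 penalty on U and P."""
    residual = (R - U @ P.T) * mask          # ignore unobserved entries
    penalty = lam * (np.sum(U ** 2) + np.sum(P ** 2))
    return float(np.sum(residual ** 2) + penalty)

# Toy example: 2 users, 2 items, k = 1 latent factor
R = np.array([[5.0, 0.0], [0.0, 3.0]])
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
U = np.array([[1.0], [1.0]])
P = np.array([[4.0], [2.0]])
print(regularized_loss(R, mask, U, P, lam=0.1))  # error 1 + 1, penalty 0.1 * (2 + 20)
```

A larger $\lambda$ penalizes large factor values more strongly, trading training error for smaller, smoother factors.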

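Next, we take the partial derivatives of the loss with respect to $U_x$ and $P_y$ and set them to zero. Here $R_x$ denotes row $x$ of $R$, $R_{\cdot y}$ denotes column $y$ of $R$, and $U_x$ and $P_y$ are row vectors of length $k$.

$$
\begin{aligned}
\frac{\partial loss}{\partial U_x} &= \frac{\partial}{\partial U_x} \left[ \sum_{x,y}{(R_{x,y} - U_{x}P_{y}^T)^2} + \lambda(\lVert U \rVert^2+\lVert P \rVert^2) \right] = 0 \\
\Rightarrow\ & -2\sum_{y}{(R_{x,y}-U_xP_y^T)P_y}+2\lambda U_x = 0 \\
\Rightarrow\ & -(R_x-U_xP^T)P + \lambda U_x = 0 \\
\Rightarrow\ & U_x(P^TP+\lambda I) = R_xP \\
\Rightarrow\ & U_x = R_xP(P^TP+\lambda I)^{-1}
\end{aligned}
$$

$$
\begin{aligned}
\frac{\partial loss}{\partial P_y} &= \frac{\partial}{\partial P_y} \left[ \sum_{x,y}{(R_{x,y} - U_{x}P_{y}^T)^2} + \lambda(\lVert U \rVert^2+\lVert P \rVert^2) \right] = 0 \\
\Rightarrow\ & -2\sum_{x}{(R_{x,y}-U_xP_y^T)U_x}+2\lambda P_y = 0 \\
\Rightarrow\ & -(R_{\cdot y}^T-P_yU^T)U + \lambda P_y = 0 \\
\Rightarrow\ & P_y = R_{\cdot y}^TU(U^TU+\lambda I)^{-1}
\end{aligned}
$$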

Therefore, we have closed-form update equations for both $U_x$ and $P_y$. By fixing one latent matrix, we can solve for the other. By alternating these updates between $U$ and $P$, we iteratively optimize the factorization of the utility matrix.
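
The alternating updates can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not Spark's implementation: the names `als_step`, `als`, and `mask` are hypothetical, the toy matrix is made up, and each row is solved using only its observed entries.

```python
import numpy as np

def als_step(R, mask, fixed, lam):
    """Solve the regularized least-squares problem for every row of R.

    `fixed` holds the factors of the column dimension, shape (cols, k).
    """
    k = fixed.shape[1]
    solved = np.zeros((R.shape[0], k))
    for x in range(R.shape[0]):
        seen = mask[x] > 0                   # entries observed in row x
        F = fixed[seen]
        A = F.T @ F + lam * np.eye(k)        # (F^T F + lambda I)
        b = F.T @ R[x, seen]                 # F^T r_x
        solved[x] = np.linalg.solve(A, b)
    return solved

def als(R, mask, k=2, lam=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(R.shape[0], k))
    P = rng.normal(scale=0.1, size=(R.shape[1], k))
    for _ in range(iters):
        U = als_step(R, mask, P, lam)        # fix P, update U
        P = als_step(R.T, mask.T, U, lam)    # fix U, update P
    return U, P

# Toy utility matrix: 4 users x 3 items, 0 marks a missing rating
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])
mask = (R > 0).astype(float)
U, P = als(R, mask, k=2, lam=0.1, iters=20)
print(np.round(U @ P.T, 2))                  # reconstructed ratings, including the missing cells
```

Because each step solves its subproblem exactly while the other matrix is held fixed, the regularized loss cannot increase from one step to the next.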

In [1]:
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import pandas as pd

conf = SparkConf()
conf.set("spark.executor.memory","6g")
conf.set("spark.driver.memory", "6g")
conf.set("spark.driver.cores", "8")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()


In [2]:
# load data as dataframe
movielens = sc.textFile('../data/ml-100k/u.data').map(lambda x: tuple(x.split('\t'))) \
                .map(lambda x: tuple([float(x[0]), float(x[1]), float(x[2])]))

data = movielens.toDF(['userid','itemid','rating'])
data.take(2)

[Row(userid=196.0, itemid=242.0, rating=3.0),
 Row(userid=186.0, itemid=302.0, rating=3.0)]

In [3]:
data.count()

100000

In [4]:
# Split the data into training (70%) and test (30%) sets, using a fixed seed for reproducibility
train, test = data.randomSplit([0.7, 0.3], seed=7856)

In [5]:
train.cache()
test.cache()

DataFrame[userid: double, itemid: double, rating: double]

In the Spark ALS model, we can set __rank__, __maxIter__, __regParam__, and other parameters; the full list can be found at https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.recommendation.ALS.html#pyspark.ml.recommendation.ALS

The ALS model usually converges quickly, so we set maxIter = 10. __rank__ is the number of __latent factors__.


In [7]:
# we use the cross validator to tune the hyperparameters
als = ALS(
         userCol="userid", 
         itemCol="itemid",
         ratingCol="rating", 
         coldStartStrategy="drop"
)

param_grid = ParamGridBuilder() \
            .addGrid(als.rank, [10, 100]) \
            .addGrid(als.regParam, [.1]) \
            .addGrid(als.maxIter, [10]) \
            .build()

evaluator = RegressionEvaluator(
           metricName="rmse", 
           labelCol="rating", 
           predictionCol="prediction")

cv = CrossValidator(
    estimator=als,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=6,
)
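
To get a feel for the cost of this search, the sketch below enumerates the grid in plain Python. It assumes each parameter combination is fitted once per fold; the names `grid` and `num_folds` are hypothetical and simply mirror the values used above.

```python
from itertools import product

# Same values as the parameter grid above
grid = {"rank": [10, 100], "regParam": [0.1], "maxIter": [10]}
num_folds = 3

# Every combination of parameter values, one dict per combination
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
fits_in_cv = len(combos) * num_folds

for combo in combos:
    print(combo)
print(f"{len(combos)} combinations x {num_folds} folds = {fits_in_cv} fits")
```

After the search, the best combination is refitted on the full training set, which adds one more fit.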

In [8]:
model = cv.fit(train)

In [9]:
best_model = model.bestModel

# Read the tuned hyperparameters from the parent estimator through the
# underlying Java object (_java_obj is a private PySpark attribute)
print(f"Rank = {best_model._java_obj.parent().getRank()}")
print(f"MaxIter = {best_model._java_obj.parent().getMaxIter()}")
print(f"RegParam = {best_model._java_obj.parent().getRegParam()}")

Rank = 100
MaxIter = 10
RegParam = 0.1
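
An alternative that avoids the private attribute is to pick the parameter map with the best average validation metric. The helper below is a hypothetical plain-Python sketch of that selection step; the metric values are made up for illustration and are not results from this run.

```python
def pick_best(param_maps, avg_metrics, larger_is_better=False):
    """Return the (params, metric) pair with the best average metric."""
    if len(param_maps) != len(avg_metrics) or not param_maps:
        raise ValueError("inputs must be non-empty and of equal length")
    chooser = max if larger_is_better else min
    best_index = chooser(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return param_maps[best_index], avg_metrics[best_index]

# Illustrative values only
param_maps = [{"rank": 10}, {"rank": 100}]
avg_metrics = [0.95, 0.94]
best_params, best_metric = pick_best(param_maps, avg_metrics)
print(best_params, best_metric)  # RMSE is an error metric, so the smaller value wins
```

With the fitted cross-validator, the two lists should be available as `model.getEstimatorParamMaps()` and `model.avgMetrics`, and `evaluator.isLargerBetter()` gives the direction of the metric.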


In [10]:
prediction = best_model.transform(test)
rmse = evaluator.evaluate(prediction)
print(f'RMSE = {rmse}')

RMSE = 0.9293164895244701


In [12]:
# we can get the user latent factors and item latent factors from the model
best_model.userFactors.show()
best_model.itemFactors.show()

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[0.3314802, -0.07...|
| 20|[0.3870823, 0.024...|
| 30|[0.3141748, -0.09...|
| 40|[0.26709092, -0.2...|
| 50|[0.35146096, -0.0...|
| 60|[0.36123818, -0.1...|
| 70|[0.44221446, -0.0...|
| 80|[0.4122668, -0.07...|
| 90|[0.070995346, -0....|
|100|[0.45245463, 0.05...|
|110|[0.2630007, -0.06...|
|120|[0.1782444, -0.15...|
|130|[0.38577065, -0.1...|
|140|[0.41791943, 0.01...|
|150|[0.36449644, -0.2...|
|160|[0.34152013, -0.3...|
|170|[0.33740997, -0.2...|
|180|[0.056818523, -0....|
|190|[0.36812487, -0.0...|
|200|[0.36409047, 0.00...|
+---+--------------------+
only showing top 20 rows

+---+--------------------+
| id|            features|
+---+--------------------+
| 10|[0.24241637, -0.2...|
| 20|[0.21774949, -0.0...|
| 30|[0.2805123, -0.04...|
| 40|[0.08548905, -0.0...|
| 50|[0.47504237, -0.2...|
| 60|[0.046063706, -0....|
| 70|[0.21902403, -0.1...|
| 80|[0.33022547, -0.1...|
| 90|[0.42008877, -0.0...|
|1
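
Each predicted rating is the dot product of one user's factor vector and one item's factor vector, as in the $U_x P_y^T$ term of the loss. The sketch below shows this with made-up two-dimensional factors; `predict_rating` and the dictionaries are hypothetical and do not use the values printed above.

```python
def predict_rating(user_vector, item_vector):
    """Dot product of a user's and an item's latent factor vectors."""
    if len(user_vector) != len(item_vector):
        raise ValueError("factor vectors must have the same length")
    return sum(u * p for u, p in zip(user_vector, item_vector))

# Toy factors keyed by id, for illustration only
user_factors = {10: [0.5, 1.0], 20: [1.5, -0.5]}
item_factors = {10: [2.0, 1.0], 50: [1.0, 3.0]}

print(predict_rating(user_factors[10], item_factors[50]))  # 0.5 * 1.0 + 1.0 * 3.0 = 3.5
```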

#### Recommendations based on the model

We can recommend items to users, or find the users most likely to be interested in given items, using the following methods.

__recommendForAllUsers__, __recommendForAllItems__


__recommendForUserSubset__, __recommendForItemSubset__

In [14]:
# top 3 item recommendations for every user
best_model.recommendForAllUsers(3).show()

+------+--------------------+
|userid|     recommendations|
+------+--------------------+
|   471|[[932, 4.677558],...|
|   463|[[887, 4.2685623]...|
|   833|[[1597, 4.7141786...|
|   496|[[1240, 4.508081]...|
|   148|[[1449, 4.908884]...|
|   540|[[1449, 4.8743043...|
|   392|[[187, 4.9549775]...|
|   243|[[1449, 4.533235]...|
|   623|[[174, 4.5522203]...|
|   737|[[127, 4.716929],...|
|   897|[[1368, 4.895013]...|
|   858|[[9, 4.339468], [...|
|    31|[[705, 4.6753798]...|
|   516|[[1449, 4.6160107...|
|   580|[[1368, 4.4243093...|
|   251|[[1368, 4.7131], ...|
|   451|[[333, 4.1837206]...|
|    85|[[1449, 4.3844137...|
|   137|[[50, 5.1624665],...|
|   808|[[1368, 5.445765]...|
+------+--------------------+
only showing top 20 rows
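
Conceptually, a top-N recommendation scores every item for a user and keeps the highest scores. The sketch below illustrates that idea in plain Python with made-up factors; `recommend_top_k` is a hypothetical helper, not how Spark computes its results internally.

```python
import heapq

def recommend_top_k(user_vector, item_factors, k):
    """Return the k (item_id, score) pairs with the highest predicted score."""
    scored = (
        (sum(u * p for u, p in zip(user_vector, vec)), item_id)
        for item_id, vec in item_factors.items()
    )
    top = heapq.nlargest(k, scored)          # sorted by score, highest first
    return [(item_id, score) for score, item_id in top]

# Toy factors for illustration only
user_vector = [1.0, 2.0]
item_factors = {1: [1.0, 1.0], 2: [2.0, 0.0], 3: [0.0, 2.5], 4: [0.5, 0.5]}

print(recommend_top_k(user_vector, item_factors, k=2))  # [(3, 5.0), (1, 3.0)]
```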

