# Matrix Factorization with MLlib in Spark

This notebook continues our previous Matrix Factorization notebook, where we built an MF engine with Keras. Here we use the matrix factorization implementation from Spark's MLlib.

In [1]:
#!pip install pyspark

In [2]:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator

In [3]:
# set the random state
rs = 123

In [4]:
# create (or reuse) a session; note that `sc` is a SparkSession here, not a SparkContext
sc = SparkSession \
    .builder \
    .getOrCreate()

We load the same sampled dataframe that we used previously for our Keras matrix factorization engine.

In [5]:
# load in the data (gzip-compressed CSV with a header row; column types are inferred)
df = sc.read.csv("sample_df.csv.gz", header=True, inferSchema=True)

Let's have a look at our data.

In [6]:
df.show(10)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      2|   3.5|
|     1|     29|   3.5|
|     1|     32|   3.5|
|     1|     47|   3.5|
|     1|     50|   3.5|
|     1|    112|   3.5|
|     1|    151|   4.0|
|     1|    223|   4.0|
|     1|    253|   4.0|
|     1|    260|   4.0|
+------+-------+------+
only showing top 10 rows



We can easily print the schema and the number of rows and columns.

In [7]:
df.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)



In [8]:
print('number of rows : {}, number of columns :{}'.format(df.count(),len(df.columns)))

number of rows : 5399624, number of columns :3


In [9]:
# split into train (90%) and test (10%); pass the seed so the split is repeatable
train, test = df.randomSplit([0.9, 0.1], seed=rs)
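
To make the behaviour of a weighted random split more concrete, here is a minimal pure-Python sketch. It assumes that each row is assigned to a split by an independent random draw, which is why the resulting sizes only approximate the requested 90/10 proportions. The helper `random_split` is a hypothetical name used for illustration and is not Spark's implementation.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent random draw (illustrative only)."""
    total = float(sum(weights))
    # cumulative boundaries in (0, 1], e.g. [0.9, 1.0] for weights [0.9, 0.1]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, b in enumerate(bounds):
            # the last split catches any rounding leftovers
            if x < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.9, 0.1], seed=123)
print(len(train_rows), len(test_rows))  # roughly 900 and 100, not exactly
```

Because the assignment is random, fixing the seed is what makes two runs produce the same split.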

We train a matrix factorization model with regularized alternating least squares (`ALS`).

`K` is the number of latent dimensions (the rank of the factorization) and `epochs` is the number of ALS iterations.

In [10]:
# train the model
K = 10
epochs = 10
model = ALS.train(train, K, epochs, nonnegative=True, seed=rs)
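
To give some intuition for what alternating least squares does, below is a minimal pure-Python sketch for the special case of a single latent dimension (rank 1), where each update has a closed form. It assumes plain L2 regularization with a hypothetical strength `lam`; the function names and the toy ratings are made up for illustration, and this is a simplification rather than the algorithm as MLlib implements it.

```python
def als_rank1(ratings, n_iters=10, lam=0.1):
    """ratings: dict mapping (user, item) -> rating. Returns (u, v) factor dicts."""
    users = sorted({a for a, _ in ratings})
    items = sorted({b for _, b in ratings})
    u = {a: 1.0 for a in users}  # one latent number per user
    v = {b: 1.0 for b in items}  # one latent number per item
    for _ in range(n_iters):
        # fix the item factors and solve each user's 1-D ridge regression exactly
        for a in users:
            num = sum(r * v[i] for (x, i), r in ratings.items() if x == a)
            den = lam + sum(v[i] ** 2 for (x, i) in ratings if x == a)
            u[a] = num / den
        # fix the user factors and solve each item's 1-D ridge regression exactly
        for b in items:
            num = sum(r * u[x] for (x, i), r in ratings.items() if i == b)
            den = lam + sum(u[x] ** 2 for (x, i) in ratings if i == b)
            v[b] = num / den
    return u, v


def regularized_loss(ratings, u, v, lam=0.1):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    sq = sum((r - u[a] * v[b]) ** 2 for (a, b), r in ratings.items())
    reg = lam * (sum(x * x for x in u.values()) + sum(x * x for x in v.values()))
    return sq + reg


# toy data: (user, item) -> rating
toy = {(1, 10): 4.0, (1, 20): 3.0, (2, 10): 5.0,
       (2, 30): 2.0, (3, 20): 3.5, (3, 30): 1.5}

u0, v0 = als_rank1(toy, n_iters=0)   # untouched initial factors
u, v = als_rank1(toy, n_iters=10)
print(regularized_loss(toy, u0, v0), regularized_loss(toy, u, v))
```

Each half-step minimizes the regularized loss exactly for one side while the other is held fixed, so the loss can never increase from one iteration to the next. With `K` latent dimensions the scalar division becomes a small `K x K` linear solve per user and per item.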

### Now we evaluate our model on the train and test sets

In [11]:
# train
x1 = train.rdd.map(lambda p: (p[0], p[1]))
x2 = model.predictAll(x1)
p = x2.map(lambda r: ((r[0], r[1]), r[2]))
# map train to ((user_id, movie_id), rating) so that it has the same shape as p
ratesAndPreds = train.rdd.map(lambda r: ((r[0], r[1]), r[2])).join(p)
# joins on first item: (user_id, movie_id)
# each row of result is: ((user_id, movie_id), (rating, prediction))
mse = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("***** train rmse: %s *****" % mse**0.5)

***** train rmse: 0.6802168283962867 *****


In [12]:
# test
x1 = test.rdd.map(lambda p: (p[0], p[1]))
x2 = model.predictAll(x1)
p = x2.map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = test.rdd.map(lambda r: ((r[0], r[1]), r[2])).join(p)
# joins on first item: (user_id, movie_id)
# each row of result is: ((user_id, movie_id), (rating, prediction))
mse = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("***** test rmse: %s *****" % mse**0.5)

***** test rmse: 0.873094632538312 *****


Recall that the RMSE of our Keras matrix factorization model on the test set was 0.8723, which is on par with the result obtained here.
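
The evaluation cells above boil down to two steps: an inner join of true ratings and predictions on the key `(user_id, movie_id)`, followed by the root of the mean squared difference. The pure-Python sketch below mirrors that logic on made-up numbers; `rmse_from_join` is a hypothetical helper, not part of Spark.

```python
import math


def rmse_from_join(ratings, predictions):
    """Inner-join two dicts keyed by (user_id, movie_id) and compute the RMSE."""
    # keep only the keys present on both sides, like RDD.join does
    joined = {k: (ratings[k], predictions[k]) for k in ratings if k in predictions}
    mse = sum((r - p) ** 2 for r, p in joined.values()) / len(joined)
    return math.sqrt(mse), joined


# toy values for illustration only
true_ratings = {(1, 47): 3.5, (1, 112): 3.5, (2, 70): 5.0, (3, 999): 4.0}
predicted = {(1, 47): 4.0, (1, 112): 3.0, (2, 70): 4.0}

rmse, joined = rmse_from_join(true_ratings, predicted)
print(len(joined), rmse)  # 3 pairs survive the join, RMSE is sqrt(0.5), about 0.7071
```

The pair `(3, 999)` has no prediction, so it is dropped by the join and does not contribute to the error. As far as we understand, the same thing happens in the Spark version when a user or movie in the test set never appeared in training, so the reported test RMSE covers only the pairs the model could score.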

----

## We can use our model to recommend products to users!

Let us try recommending movies to a user.

The method `.recommendProducts(user, num)` returns the top `num` products for a given user as a list of `Rating` objects, sorted by predicted rating in descending order.

In [21]:
# one example: recommend 10 items to userId 126443

model.recommendProducts(126443, 10)

[Rating(user=126443, product=106, rating=4.300946771647378),
 Rating(user=126443, product=326, rating=4.19226143014788),
 Rating(user=126443, product=844, rating=4.176850052221196),
 Rating(user=126443, product=659, rating=4.144022301815534),
 Rating(user=126443, product=341, rating=4.118683655867986),
 Rating(user=126443, product=720, rating=4.0540810876237945),
 Rating(user=126443, product=701, rating=4.053854493166524),
 Rating(user=126443, product=214, rating=4.0418627080314),
 Rating(user=126443, product=745, rating=4.037544339946727),
 Rating(user=126443, product=32, rating=4.028476491870228)]
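
Conceptually, a top-N recommendation from a factorization model scores every product by the dot product of the user's latent vector with the product's latent vector and keeps the highest scores. The sketch below illustrates this in pure Python; the factor values and the helper `recommend_top_n` are invented for illustration and are not taken from the trained model.

```python
import heapq


def recommend_top_n(user_vec, item_factors, n):
    """Return the n (item_id, score) pairs with the highest dot-product score."""
    scored = (
        (sum(a * b for a, b in zip(user_vec, vec)), item_id)
        for item_id, vec in item_factors.items()
    )
    top = heapq.nlargest(n, scored)  # sorted by score, descending
    return [(item_id, score) for score, item_id in top]


# hypothetical 2-dimensional latent factors
user_vec = [1.0, 0.5]
item_factors = {
    106: [4.0, 0.0],  # score 4.0
    326: [2.0, 2.0],  # score 3.0
    844: [0.0, 1.0],  # score 0.5
    32: [3.0, 1.0],   # score 3.5
}

print(recommend_top_n(user_vec, item_factors, 2))  # [(106, 4.0), (32, 3.5)]
```

Using `heapq.nlargest` avoids sorting the full list of products when only a handful of recommendations is needed.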

----

Below is a quick check of what `x1`, `x2`, `p`, and `ratesAndPreds` look like.

In [14]:
print(x1)
x1.take(5)


PythonRDD[289] at RDD at PythonRDD.scala:53


[(1, 47), (1, 112), (1, 337), (2, 70), (2, 110)]

In [15]:
print(x2)
x2.take(5)

MapPartitionsRDD[278] at mapPartitions at PythonMLLibAPI.scala:1344


[Rating(user=65722, product=515, rating=1.6422999522519517),
 Rating(user=65722, product=949, rating=2.6059620166890283),
 Rating(user=129434, product=912, rating=3.9893138882692036),
 Rating(user=129434, product=665, rating=4.242397620069801),
 Rating(user=91902, product=140, rating=2.8549149853674236)]

In [16]:
print(p)
p.take(5)

PythonRDD[292] at RDD at PythonRDD.scala:53


[((65722, 515), 1.6422999522519517),
 ((65722, 949), 2.6059620166890283),
 ((129434, 912), 3.9893138882692036),
 ((129434, 665), 4.242397620069801),
 ((91902, 140), 2.8549149853674236)]

In [17]:
print(ratesAndPreds)
ratesAndPreds.take(5)

PythonRDD[294] at RDD at PythonRDD.scala:53


[((1, 47), (3.5, 3.997202225559832)),
 ((1, 112), (3.5, 3.4209485245494218)),
 ((2, 70), (5.0, 3.415139092526429)),
 ((3, 260), (5.0, 4.911124741134522)),
 ((3, 512), (2.0, 3.1283293439241255))]