**Matrix factorization** is a class of collaborative filtering algorithms used in recommender systems. It approximates a given rating matrix as the product of two lower-rank matrices.
It decomposes a rating matrix R (n×m) into a product of a user-factor matrix U (n×k) and the transpose of an item-factor matrix V (m×k), where k is the number of latent factors.

\begin{equation*}
\mathbf{R}_{n \times m} \approx \mathbf{\hat{R}} = 
\mathbf{U}_{n \times k} \times \mathbf{V}_{m \times k}^T
\end{equation*}
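
To make the shapes concrete, here is a minimal NumPy sketch of the product above. The sizes and the random factor values are made up for illustration only; in the notebook the factors are learned by ALS rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 3, 2          # toy sizes, chosen only for illustration

U = rng.normal(size=(n_users, k))      # one k-dimensional row per user
V = rng.normal(size=(n_items, k))      # one k-dimensional row per item

R_hat = U @ V.T                        # reconstructed n x m rating matrix

# A single predicted rating is the dot product of a user row and an item row.
user, item = 1, 2
single = U[user] @ V[item]

print(R_hat.shape)
print(np.isclose(R_hat[user, item], single))
```

Because every entry of the reconstruction is built from only k numbers per user and k per item, its rank can never exceed k, which is what "lower-rank" refers to.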

In [1]:
!pip install pyspark       #installing pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/87/21/f05c186f4ddb01d15d0ddc36ef4b7e3cedbeb6412274a41f26b55a650ee5/pyspark-2.4.4.tar.gz (215.7MB)
Collecting py4j==0.10.7
  Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-2.4.4-py2.py3-none-any.whl size=216131250 sha256=b80d1a6d39bb509235b86d8b767189b71c4d93498dd0c80690e95

#### Importing the necessary libraries

In [2]:
from pyspark import SparkContext, SQLContext   # required for dealing with dataframes
import numpy as np
from pyspark.ml.recommendation import ALS      # for Matrix Factorization using ALS 

In [3]:
sc = SparkContext()      # instantiating spark context 
sqlContext = SQLContext(sc) # instantiating SQL context 

#### Step 1. Loading the data into a PySpark dataframe

In [4]:
# Read the dataset into a dataframe
jester_ratings_df = sqlContext.read.csv("/kaggle/input/jester-17m-jokes-ratings-dataset/jester_ratings.csv", header=True, inferSchema=True)

In [5]:
#show the ratings
jester_ratings_df.show(5)

+------+------+-------------------+
|userId|jokeId|             rating|
+------+------+-------------------+
|     1|     5|0.21899999999999997|
|     1|     7|             -9.281|
|     1|     8|             -9.281|
|     1|    13| -6.781000000000001|
|     1|    15|              0.875|
+------+------+-------------------+
only showing top 5 rows



In [6]:
#Print total number of ratings, unique users and unique jokes.
print("Total number of ratings: ", jester_ratings_df.count())
print("Number of unique users: ", jester_ratings_df.select("userId").distinct().count())
print("Number of unique jokes: ", jester_ratings_df.select("jokeId").distinct().count())

Total number of ratings:  1761439
Number of unique users:  59132
Number of unique jokes:  140


#### Step 2. Splitting into train and test sets

In [7]:
#Split the dataset using randomSplit in a 90:10 ratio
X_train, X_test = jester_ratings_df.randomSplit([0.9,0.1])   # 90:10 ratio

In [8]:
#Print the training data size and the test data size
print("Training data size : ", X_train.count())
print("Test data size : ", X_test.count())

Training data size :  1585285
Test data size :  176154
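
An exact 90:10 split of 1,761,439 ratings would put about 1,585,295 rows in the training set; the observed 1,585,285 is close but not identical. That is consistent with a split that assigns each row at random rather than cutting at a fixed position. The NumPy sketch below illustrates the idea on a toy row count; it is an illustration of random per-row assignment, not Spark's actual code, and the seed and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)          # fixed seed so the split is repeatable
n_rows = 1000                            # toy row count, not the Jester data
in_train = rng.random(n_rows) < 0.9      # each row lands in train with probability 0.9

train_idx = np.flatnonzero(in_train)     # row positions assigned to train
test_idx = np.flatnonzero(~in_train)     # the remaining rows go to test

print(len(train_idx), len(test_idx))     # close to 900 and 100, but rarely exact
```

`randomSplit` also accepts a `seed` argument, which is worth setting if the train and test sets need to be the same across runs.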


In [9]:
#Show the train set
X_train.show(5)

+------+------+------------------+
|userId|jokeId|            rating|
+------+------+------------------+
|     1|     7|            -9.281|
|     1|     8|            -9.281|
|     1|    13|-6.781000000000001|
|     1|    15|             0.875|
|     1|    16|            -9.656|
+------+------+------------------+
only showing top 5 rows



In [10]:
#Show the test set
X_test.show(5)

+------+------+-------------------+
|userId|jokeId|             rating|
+------+------+-------------------+
|     1|     5|0.21899999999999997|
|     1|    29|              8.781|
|     1|    50|              9.906|
|     1|    66|  8.687999999999999|
|     1|    69|  8.687999999999999|
+------+------+-------------------+
only showing top 5 rows



#### Step 3. Fitting an ALS model

In [11]:
# Fit an ALS model with rank=5, maxIter=10 and seed=0
als = ALS(userCol="userId", itemCol="jokeId", ratingCol="rating", rank=5, maxIter=10, seed=0)
model = als.fit(X_train)
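
For intuition about what the fit does, here is a minimal NumPy sketch of alternating least squares on a tiny dense matrix. It is not Spark's implementation, which is distributed and differs in details such as how regularization is scaled. The names `als_sketch` and `objective`, the toy matrix, and the convention that 0 means "not rated" are all assumptions made for this illustration.

```python
import numpy as np

def objective(R, mask, U, V, reg):
    """Regularized squared error over the observed entries."""
    err = (R - U @ V.T)[mask]
    return float(np.sum(err ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2)))

def als_sketch(R, mask, k=2, reg=0.1, n_iter=10, seed=0):
    """Toy ALS. R is (n, m); mask is True where a rating was observed."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    I = np.eye(k)
    for _ in range(n_iter):
        # Fix V and solve a small ridge regression for every user row.
        for u in range(n):
            Vo = V[mask[u]]                       # factors of items rated by u
            U[u] = np.linalg.solve(Vo.T @ Vo + reg * I, Vo.T @ R[u, mask[u]])
        # Fix U and solve the same kind of problem for every item row.
        for i in range(m):
            Uo = U[mask[:, i]]                    # factors of users who rated i
            V[i] = np.linalg.solve(Uo.T @ Uo + reg * I, Uo.T @ R[mask[:, i], i])
    return U, V

# Toy ratings; 0 stands for "not rated" purely in this example.
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])
mask = R != 0

U, V = als_sketch(R, mask)
print(objective(R, mask, U, V, reg=0.1))
```

Each half-step has a closed-form solution because, with one factor matrix held fixed, the problem for the other is ordinary ridge regression. In this sketch `k` plays the role of `rank` and `n_iter` the role of `maxIter` in the cell above.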

In [12]:
model.userFactors.show(5, truncate = False)  # displaying the latent features for five users

+---+------------------------------------------------------------+
|id |features                                                    |
+---+------------------------------------------------------------+
|10 |[-0.71043116, 0.5012814, -1.010544, 0.93265253, 0.47890794] |
|40 |[0.87856364, -0.3649627, -1.7392969, -1.9242384, -0.6742972]|
|50 |[0.5181268, -0.0895328, -1.4033533, 0.6592673, 2.206175]    |
|60 |[-0.38699847, 0.22039635, 1.991712, -1.2445426, -4.902595]  |
|80 |[1.082909, 2.9622498, -0.65711886, 2.8884735, 0.2750681]    |
+---+------------------------------------------------------------+
only showing top 5 rows



#### Step 4. Making predictions

In [13]:
predictions = model.transform(X_test.select("userId", "jokeId"))  # predict ratings for the (userId, jokeId) pairs in the test set

In [14]:
# Join the test ratings with the predictions and drop the rows for which no prediction was made
ratesAndPreds = X_test.join(other=predictions, on=['userId', 'jokeId'], how='inner').na.drop()
ratesAndPreds.show(5)

+------+------+-------------------+----------+
|userId|jokeId|             rating|prediction|
+------+------+-------------------+----------+
|  5518|   148|  7.343999999999999| 7.3738794|
| 28836|   148|              5.438| 1.5408345|
| 32539|   148|              9.656| 3.8033004|
| 41890|   148|-0.6559999999999999|-0.6798879|
| 43714|   148|              1.844|  3.648929|
+------+------+-------------------+----------+
only showing top 5 rows
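
The `na.drop()` call matters because, with its default `coldStartStrategy` of `"nan"`, Spark's ALS returns NaN for any user or joke that did not appear in the training split. The NumPy sketch below, using made-up numbers rather than values from the notebook, shows how a single NaN prediction would otherwise make the error metric undefined.

```python
import numpy as np

# Made-up values for illustration; NaN marks a pair the model could not score.
rating = np.array([7.3, 5.4, 9.7, -0.7])
prediction = np.array([7.4, np.nan, 3.8, -0.7])

keep = ~np.isnan(prediction)             # True where a prediction exists

rmse_all = np.sqrt(np.mean((rating - prediction) ** 2))
rmse_kept = np.sqrt(np.mean((rating[keep] - prediction[keep]) ** 2))

print(rmse_all)     # nan, because one missing prediction poisons the mean
print(rmse_kept)    # finite, computed on the three scored pairs
```

An alternative to dropping rows afterwards is to construct the estimator with `coldStartStrategy="drop"`, in which case `transform` omits those rows itself.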



#### Step 5. Evaluating the model

In [15]:
# Collect both columns to the driver as NumPy arrays and compute the RMSE
rating = np.array(ratesAndPreds.select("rating").collect()).ravel()
prediction = np.array(ratesAndPreds.select("prediction").collect()).ravel()
print("RMSE : ", np.sqrt(np.mean((rating - prediction)**2)))

RMSE :  4.383377894931288
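
Written out, the quantity computed in the cell above is the root mean squared error over the test pairs that received a prediction. The symbols are notation assumed here, not taken from the notebook: $T$ is the set of (user, joke) pairs kept in `ratesAndPreds`, $r_{uj}$ is the observed rating and $\hat{r}_{uj}$ the predicted one.

\begin{equation*}
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,j) \in T} \left( r_{uj} - \hat{r}_{uj} \right)^2}
\end{equation*}

For scale, the ratings displayed in the earlier cells range from about -9.7 to 9.9, so an RMSE of roughly 4.4 is a sizeable error relative to that span.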


#### Step 6. Recommending jokes

In [16]:
# Recommend the 3 jokes with the highest predicted rating for each user
model.recommendForAllUsers(3).show(5,truncate = False)

+------+------------------------------------------------------+
|userId|recommendations                                       |
+------+------------------------------------------------------+
|148   |[[138, 13.617236], [115, 12.9109745], [80, 12.498299]]|
|463   |[[15, 6.7115293], [16, 5.9902053], [17, 5.568451]]    |
|471   |[[62, 5.756143], [122, 5.328079], [63, 5.1936283]]    |
|496   |[[16, 6.6375237], [43, 6.5081167], [20, 6.137681]]    |
|833   |[[127, 4.3381605], [80, 4.327071], [132, 4.056589]]   |
+------+------------------------------------------------------+
only showing top 5 rows
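
Conceptually, a top-k recommendation for one user amounts to scoring every joke with the dot product of the user and joke factors and keeping the k largest scores. The sketch below does this in NumPy with hand-picked toy factors; the helper name `top_k_items` and all values are hypothetical, and Spark's own routine is more elaborate.

```python
import numpy as np

def top_k_items(user_vec, item_factors, item_ids, k=3):
    """Return the k (item_id, score) pairs with the highest predicted score."""
    scores = item_factors @ user_vec          # predicted rating for every item
    best = np.argsort(scores)[::-1][:k]       # indices of the k largest scores
    return [(int(item_ids[i]), float(scores[i])) for i in best]

# Hand-picked toy factors, not learned from the Jester data.
item_ids = np.array([5, 7, 8, 13])
item_factors = np.array([[1.0, 0.0],
                         [0.0, 1.0],
                         [1.0, 1.0],
                         [-1.0, 0.5]])
user_vec = np.array([2.0, 1.0])

print(top_k_items(user_vec, item_factors, item_ids, k=3))
# [(8, 3.0), (5, 2.0), (7, 1.0)]
```

Since the scores are unconstrained dot products, they can fall outside the range of the observed ratings, as the 13.6 predicted for user 148 in the output above suggests.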

