# Matrix Factorization

We will experiment with the recent MovieLens 25M Dataset and build a recommender system using two approaches:
* Factorizing the user-item matrix using Spark ALS implementation
* Factorizing the item-item PMI matrix using randomized SVD

In both settings we will index the item embeddings and inspect their quality using KNN queries.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!curl -O https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark



In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"
import findspark
findspark.init()

In [6]:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf, StorageLevel
import pyspark.sql.functions as F
from pyspark.sql import Window
import matplotlib.pyplot as plt
import pandas as pd

%matplotlib inline

In [7]:
conf = (
    SparkConf()
    .set('spark.ui.port', '4050')
    .set("spark.driver.memory", "16g")
    .set('spark.driver.maxResultSize', '8g')  # the property name is maxResultSize (no plural "Results")
)
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local[*]').getOrCreate()
ss = spark

### Download the dataset

In [8]:
!wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m

--2022-05-20 09:20:17--  http://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip’


2022-05-20 09:20:21 (73.2 MB/s) - ‘ml-25m.zip’ saved [261978986/261978986]

Archive:  ml-25m.zip
   creating: ml-25m/
  inflating: ml-25m/tags.csv         
  inflating: ml-25m/links.csv        
  inflating: ml-25m/README.txt       
  inflating: ml-25m/ratings.csv      
  inflating: ml-25m/genome-tags.csv  
  inflating: ml-25m/genome-scores.csv  
  inflating: ml-25m/movies.csv       


### Loading the ratings dataset

In [9]:
movies_df = spark.read.csv('ml-25m/movies.csv', header=True, inferSchema=True).cache()
ratings_df = spark.read.csv('ml-25m/ratings.csv', header=True, inferSchema=True)

# Part 1 : Alternating least squares

### Split the dataset
We want to randomly split the dataset into a training part and a held-out (validation) part.

In [None]:
(training_df, validation_df) = ratings_df.randomSplit([0.8, 0.2])

In [None]:
training_df.count()

### Build ALS model
Using the Spark ALS implementation described at https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html,
build a model on the ml-25m dataset.

How long does the training take? Change the rank (i.e. the dimension of the vectors) from 10 to 20. How does that affect training speed?
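
For intuition about what the rank controls, here is a minimal NumPy sketch of alternating least squares on a tiny dense matrix. It is not Spark's implementation, which is distributed and more elaborate; the function name `als_sketch`, the toy ratings, the convention that `0` means "not rated", and the hyperparameters are all assumptions chosen for illustration. Each half-step solves one small rank-by-rank ridge regression per user or per item, so a larger rank makes every iteration more expensive.

```python
import numpy as np

def als_sketch(R, mask, rank=2, reg=0.1, n_iters=20, seed=0):
    """Toy ALS: approximate R by U @ V.T using only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Item factors fixed: one small ridge regression per user.
        for u in range(n_users):
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, mask[u]])
        # User factors fixed: one small ridge regression per item.
        for i in range(n_items):
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[mask[:, i], i])
    return U, V

# Toy ratings matrix (assumption: 0 marks an unobserved rating).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
mask = R > 0

U, V = als_sketch(R, mask)
train_rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print(f"RMSE on observed entries: {train_rmse:.3f}")
```

The rows of `V` play the same role as the `itemFactors` vectors of the Spark model: one vector of length `rank` per movie.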

In [None]:
from pyspark.ml.recommendation import ALS
import time

ranks = [10,15,20]
models = []
training_time = []

for rank in ranks:
  start_time = time.time()
  als = ALS(rank=rank, maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
  model = als.fit(training_df)
  models.append(model)
  training_time.append(time.time() - start_time)

In [None]:
from matplotlib.pyplot import plot
%matplotlib inline
plot(ranks, training_time)

# Training time seems to grow linearly with the rank, as long as we don't run into memory issues.

### Evaluation
Using the code described in the Spark documentation, evaluate how well your model does on the held-out set.
The goal is to predict the held-out ratings.
A good metric could be RMSE or MAE.
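
To make the two metrics concrete, here is a small NumPy sketch that computes both on a handful of made-up ratings. The arrays and the helper names `mae` and `rmse` are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up held-out ratings and predictions (illustration only).
y_true = np.array([4.0, 3.0, 5.0, 2.0])
y_pred = np.array([3.5, 3.0, 4.0, 4.0])

toy_mae = mae(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)
print(f"MAE = {toy_mae:.3f}, RMSE = {toy_rmse:.3f}")
```

RMSE is never smaller than MAE, and the gap widens when a few predictions are far off. With Spark's `RegressionEvaluator`, switching between the two should only require changing `metricName` to `"rmse"` or `"mae"`.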

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

# `model` is the last model trained in the loop above (the largest rank).
predictions = model.transform(validation_df)
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
mae = evaluator.evaluate(predictions)
print(f"MAE = {mae}")

### Inspecting the results

Retrieve the movie vectors from the learned model object (the property is called `itemFactors`)
and `collect` all these vectors in a list.

In [None]:
movie_vectors_df = (
    model.itemFactors
    .join(movies_df.withColumnRenamed('movieId', 'id'), 'id')
    .select('title', 'features')
)

Now we need to create a dictionary mapping the movieId to its title to ease the inspection.
Load the `movies.csv` file using pyspark or pandas and create a `dict` movieId -> title.

In [None]:
movie_vect_dict = {r['title'] : r['features'] for r in movie_vectors_df.collect()}

### Using Nearest neighbours

Pick a few movies and, for each of them, find the top 5 nearest neighbours. This is very similar to an optional question of the PLSA project...

In [None]:
title_vector_array = movie_vectors_df.collect()
titles = [r['title'] for r in title_vector_array]
vectors = [r['features'] for r in title_vector_array]

In [None]:
import numpy
from numpy import linalg as LA
import heapq
# Naive exact KNN: compute all Euclidean distances in one batched numpy
# operation, then keep the k smallest with a heap.
def knn(query, k, titles, vectors):
  start_time = time.time()
  nb_movies = len(titles)
  diff = numpy.array(vectors) - numpy.array(query)
  distances = LA.norm(diff, axis=1)
  indices = heapq.nsmallest(k, range(nb_movies), key=lambda x: distances[x])
  ret = [(titles[i], distances[i]) for i in indices]
  print(f"Query time: {time.time() - start_time:.4f} s")
  return ret

In [None]:
def analyze(i):
  print(f"Query title : {titles[i]}")
  query_vec = vectors[i]
  ret = knn(query_vec, 10, titles, vectors)
  for res in ret:
    print(res)

In [None]:
analyze(4)

# Part 2 : PMI and RSVD


We are now going to factorize the item-item PMI matrix using randomized SVD.

## Size estimation

Let's first estimate the size of the matrix we are about to build.

Reminder: we will generate the co-occurrence matrix $C$ from all the pairs of
movies that we find in users' ratings.
This matrix can be big.

Namely, if a user has rated movies `(1, 2, 3, 4)`, we will generate 6 pairs :
`(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (3, 4)`.

Formally, if a user has rated $n$ movies, they will generate $\binom{n}{2} = \frac{n(n-1)}{2}$ pairs. This count grows quadratically with $n$, so we should be careful with users who have rated a lot of movies.
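
As a quick sanity check on the example above, the sketch below enumerates the unordered pairs with `itertools.combinations` and compares the count with the closed form $n(n-1)/2$. The helper name `n_pairs` is hypothetical.

```python
from itertools import combinations

def n_pairs(n):
    """Number of unordered pairs among n rated movies."""
    return n * (n - 1) // 2

rated_movies = (1, 2, 3, 4)
pairs = list(combinations(rated_movies, 2))
print(pairs)
print(len(pairs), n_pairs(len(rated_movies)))

# The count grows quadratically: a single heavy user dominates the total.
for n in (10, 100, 1000):
    print(n, n_pairs(n))
```

A single user with 1,000 ratings contributes 499,500 pairs, which is why heavy users need special care.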


In [None]:
def number_of_pairs_to_be_generated(ratings_df):
  ratings_count_by_user_df = (
      ratings_df
      .groupby('userId')
      .agg(F.count('*').alias('count'))
      .sort(F.col('count').desc())
  )

  return int((
      ratings_count_by_user_df
     .withColumn("n_pairs", F.col("count") * (F.col("count") - 1) / 2)  # n * (n - 1) / 2 pairs per user
     .select(F.sum("n_pairs"))
  ).collect()[0]["sum(n_pairs)"])

print(
    f"Number of positive ratings : {ratings_df.count():,}"
    f", that should generate {number_of_pairs_to_be_generated(ratings_df):,} pairs"
)

### Keep positive interactions

### Keep meaningful movies

We will keep only movies having a sufficient amount of positive ratings.
First, that will make the computations lighter, second, it will prevent us from
computing embeddings for movies we have very little information on.

### Limit the number of pairs

We have far fewer movies, but still almost all the ratings and a lot of pairs!

### Creating the PMI matrix

As a reminder, the PMI matrix is defined as
$$
PMI(i, j) = \log\left(\frac{p(i, j)}{p(i)p(j)}\right)
$$
that we estimate with
$$
\widehat{PMI}(i, j) = \log\left(\frac{C_{i, j} \cdot n}{C_i \cdot  C_j}\right)
$$
where
* $C_{i, j}$ is the number of pairs $(i, j)$, i.e. the number of users that have
  given positive feedback for both movie $i$ and movie $j$
* $C_{i}$ is the total number of pairs containing $i$
* $C_{j}$ is the total number of pairs containing $j$
* $n$ is the total number of pairs
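
The estimator can be traced by hand on a toy example. The sketch below is plain Python rather than Spark, and the `baskets` dictionary (user to positively rated movies) is made up for illustration; it follows the definitions above, counting each unordered pair once per user.

```python
from collections import Counter
from itertools import combinations
from math import log

# Made-up positive interactions: user -> movies they liked (illustration only).
baskets = {"u1": [1, 2, 3], "u2": [1, 2], "u3": [2, 3], "u4": [1, 2, 4]}

# C_{i,j}: number of users who liked both i and j.
pair_counts = Counter()
for movies in baskets.values():
    for i, j in combinations(sorted(movies), 2):
        pair_counts[(i, j)] += 1

# n: total number of pairs.
n = sum(pair_counts.values())

# C_i: total number of pairs containing movie i.
item_counts = Counter()
for (i, j), c in pair_counts.items():
    item_counts[i] += c
    item_counts[j] += c

# PMI estimate for every observed pair.
pmi = {
    (i, j): log(c * n / (item_counts[i] * item_counts[j]))
    for (i, j), c in pair_counts.items()
}

for pair, value in sorted(pmi.items()):
    print(pair, round(value, 3))
```

Only pairs that were actually observed get a value, which is what keeps the resulting matrix sparse.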

#### Step 1 : compute co-occurrence matrix $C_{i,j}$

#### Step 2 : compute total number of pairs per movie $C_i$, $C_j$

#### Step 3 : compute $\widehat{PMI}(i, j)$

Using the movie counts and the pair counts, compute the PMI dataframe.

### Factorizing the PMI matrix with RSVD

First, we need to create a mapping between movie ids and position in the matrix that we call vocabulary.

In [None]:
count_ordering_window = Window().orderBy(F.col("count").desc())
vocabulary_df = movie_counts_df.select(
    "movieId",
    (F.row_number().over(count_ordering_window) - F.lit(1)).alias("index"),  # row number starts at 1
).cache()

vocabulary_df.show(3)

Now we need to build a scipy sparse matrix from the PMI dataframe. 
Thanks to our filtering, it is small enough to be collected into memory.

Still this might take a minute or two.

In [25]:
%%time

pmi_pdf = (
    pmi_df
    .join(vocabulary_df.withColumnRenamed("movieId", "movieId1"), on="movieId1")
    .withColumnRenamed("index", "i")
    .join(vocabulary_df.withColumnRenamed("movieId", "movieId2"), on="movieId2")
    .withColumnRenamed("index", "j")
    .select("i", "j", "pmi")
).toPandas()

CPU times: user 40 s, sys: 3.24 s, total: 43.2 s
Wall time: 1min 2s


From this (i, j, value) representation of the PMI matrix, you can create a scipy sparse matrix (prefer the [coo format](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html)).

Use the scikit-learn implementation of truncated SVD https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html to factorize the PMI matrix. By default, it uses the randomized SVD algorithm.
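
A minimal sketch of these two steps on a toy frame is shown below. The `toy_pdf` values, the matrix size, and `n_components=2` are made up for illustration. The sketch also assumes each pair appears only once in the frame and therefore mirrors the entries to obtain a symmetric matrix; if `pmi_pdf` already contains both orders, that mirroring step should be skipped.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for pmi_pdf: one row per (i, j) pair with its PMI value.
toy_pdf = pd.DataFrame({
    "i": [0, 0, 1, 2, 3],
    "j": [1, 2, 2, 3, 4],
    "pmi": [1.2, 0.4, 0.9, 0.3, 1.5],
})
n_items = 5

# Assumption: pairs are stored once, so mirror them to make the matrix symmetric.
rows = np.concatenate([toy_pdf["i"].to_numpy(), toy_pdf["j"].to_numpy()])
cols = np.concatenate([toy_pdf["j"].to_numpy(), toy_pdf["i"].to_numpy()])
vals = np.concatenate([toy_pdf["pmi"].to_numpy(), toy_pdf["pmi"].to_numpy()])

# Build in COO format (natural for i, j, value triplets), then convert to CSR.
pmi_matrix = coo_matrix((vals, (rows, cols)), shape=(n_items, n_items)).tocsr()

# Truncated SVD with the randomized solver; random_state fixed for repeatability.
svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=0)
svd.fit(pmi_matrix)

# One embedding of size n_components per movie index.
movie_embeddings = svd.components_.T
print(movie_embeddings.shape)
```

On the real matrix, `n_components` would be much larger than 2; the value is a modelling choice, much like the rank in the ALS part.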

### Faiss Index

Let's install faiss-cpu and create an index from these vectors. Then query the index as we did previously with the naive KNN.

In [28]:
!pip install -q faiss-cpu

In [34]:
# create faiss index
import faiss

movie_embeddings = svd.components_.transpose().astype('float32')
index = faiss.IndexFlatIP(movie_embeddings.shape[1])
index.add(movie_embeddings)

In [None]:
inner_product, neighbors = index.search(movie_embeddings[0].reshape(1, -1), 5)
pd.DataFrame({"neighbors": neighbors[0], "inner product": inner_product[0]})
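
For reference, a flat inner-product index performs an exhaustive search, so its results can be reproduced with a matrix product. The NumPy sketch below uses random vectors in place of the real embeddings, and the helper name `top_k_inner_product` is hypothetical. It also L2-normalizes the vectors first, so that the inner product becomes a cosine similarity; that normalization is an assumption added here, not something the cells above do.

```python
import numpy as np

def top_k_inner_product(embeddings, query, k):
    """Brute-force search: score every row against the query, keep the k best."""
    scores = embeddings @ query
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Random stand-in for the movie embeddings (illustration only).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(100, 8)).astype("float32")

# With unit-norm rows, the inner product equals the cosine similarity.
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)

ids, scores = top_k_inner_product(unit, unit[0], 5)
print(ids, scores)
```

With normalized vectors, the query is its own nearest neighbour with a score of about 1. Without normalization, vectors with a large norm tend to dominate inner-product rankings; with faiss, `faiss.normalize_L2` can be applied to the float32 array before `index.add` to get the same cosine behaviour.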