# Matrix Factorization

We will experiment with the recent MovieLens 25M Dataset and build a recommender system using two approaches:
* Factorizing the user-item matrix using Spark ALS implementation
* Factorizing the item-item PMI matrix using randomized SVD

In both settings we will index the item embeddings and inspect their quality using KNN queries.

# Part 1

### Download the dataset

In [0]:
!wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m
dbutils.fs.ls("file:/databricks/driver/ml-25m/")
dbutils.fs.mv("file:/databricks/driver/ml-25m/", "dbfs:/ml-25m/", recurse=True)

In [0]:
dbutils.fs.ls("dbfs:/ml-25m/")

### Loading the ratings dataset

In [0]:
#from pyspark.sql import SparkSession
#from pyspark.ml.evaluation import RegressionEvaluator
#from pyspark.ml.recommendation import ALS
#from pyspark.sql import Row
#import pyspark.sql.functions as F

In [0]:
movies_df = spark.read.csv('dbfs:/ml-25m/movies.csv', header=True, inferSchema=True).cache()
ratings_df = spark.read.csv('dbfs:/ml-25m/ratings.csv', header=True, inferSchema=True)

### Split the dataset
We want to randomly split the dataset into training and test sets.

In [0]:
import pyspark.sql.functions as F

In [0]:
# Alternative you may want to try: a deterministic split by user, based on a hash of userId.

#training_percent = 80
#training_df = (
#    ratings_df
#    .filter(F.expr('PMOD(HASH(userId), 100)') < training_percent)
#    .repartition('userId', 'movieId')
#).cache()
#validation_df = (
#    ratings_df
#    .filter(F.expr('PMOD(HASH(userId), 100)') >= training_percent)
#    .repartition('userId', 'movieId')
#).cache()

# It won't help much in the validation phase though: every validation user is then
# absent from the training set, so ALS has no factors for them.

In [0]:
(training_df, validation_df) = ratings_df.randomSplit([0.8, 0.2])

In [0]:
training_df.count()

### Build ALS model
Using the Spark ALS implementation described at https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html, build a model on the ml-25m dataset.

How long does the training take? Change the rank (i.e. the dimension of the vectors) from 10 to 20. How does that affect training speed?

In [0]:
from pyspark.ml.recommendation import ALS
import time

ranks=[10,15,20,30]
models=[]
training_time=[]

for rank in ranks:
  start_time = time.time()
  als = ALS(rank=rank, maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
  model = als.fit(training_df)
  models.append(model)
  training_time.append(time.time() - start_time)

In [0]:
from matplotlib.pyplot import plot
%matplotlib inline
plot(ranks, training_time)

# Training time seems to grow linearly with the rank, as long as we do not run into memory issues.

### Evaluation
Using the code described in the Spark documentation, evaluate how well your model performs on the test set.
The goal is to predict the held out ratings.
A good metric could be RMSE or MAE.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

# `model` is the last model trained in the loop above (rank 30)
predictions = model.transform(validation_df)
evaluator = RegressionEvaluator(metricName="mae", labelCol="rating", predictionCol="prediction")
mae = evaluator.evaluate(predictions)
print("MAE = " + str(mae))
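
To make the two metrics concrete, here is a minimal pure-Python sketch that computes MAE and RMSE on a handful of made-up ratings. The helper names `mae` and `rmse` and the sample values are illustrative assumptions, not part of the notebook.

```python
import math

def mae(labels, predictions):
    # Mean absolute error: average size of the errors
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)

def rmse(labels, predictions):
    # Root mean squared error: squaring gives large errors more weight
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels))

# Made-up held-out ratings and predictions
labels = [4.0, 3.0, 5.0, 2.0]
predictions = [3.5, 3.0, 4.0, 4.0]

print("MAE  =", mae(labels, predictions))
print("RMSE =", rmse(labels, predictions))
```

The single error of 2 stars pulls RMSE above MAE here. With the Spark evaluator, passing `metricName="rmse"` instead of `"mae"` reports the other metric.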

### Inspecting the results

Retrieve the movie vectors from the learned model object (the property is called `itemFactors`) and `collect` all these vectors in a list.

In [0]:
movie_vectors_df = model.itemFactors.join(movies_df.withColumnRenamed('movieId', 'id'), 'id').select('title', 'features')

Now we need to create a dictionary mapping the movieId to its title to ease the inspection.
Load the `movies.csv` file using pyspark or pandas and create a `dict` movieId -> title.

In [0]:
movie_vect_dict = {r['title'] : r['features'] for r in movie_vectors_df.collect()}

### Using Nearest neighbours

Pick a few movies and, for each of them, find the top 5 nearest neighbours. This is very similar to an optional question of the PLSA project.

In [0]:
title_vector_array = movie_vectors_df.collect()
titles = [r['title'] for r in title_vector_array]
vectors = [r['features'] for r in title_vector_array]

In [0]:
import heapq
import time

import numpy
from numpy import linalg as LA

# Naive exact KNN: compute all Euclidean distances in one numpy batch,
# then keep the k smallest ones with a heap.
def knn(query, k, titles, vectors):
  start_time = time.time()
  nb_movies = len(titles)
  diff = numpy.array(vectors) - numpy.array(query)
  distances = LA.norm(diff, axis=1)
  indices = heapq.nsmallest(k, range(nb_movies), key=lambda x: distances[x])
  ret = [(titles[i], distances[i]) for i in indices]
  print(f"Query time: {time.time() - start_time:.3f}s")
  return ret

In [0]:
def analyze(i):
  print(f"Query title : {titles[i]}")
  query_vec = vectors[i]
  ret = knn(query_vec, 10, titles, vectors)
  for res in ret:
    print(res)

In [0]:
analyze(4)

# Part 2

### Another approach - RSVD

We are now going to factorize the item-item PMI matrix using randomized SVD.

### Creating the PMI matrix

Compute the movie pair counts by doing a self join on the ratings dataframe (filtered to keep only the relevant movies).

Caution! This computation is expensive, as it explicitly enumerates all movie pairs for every user.

You will need to filter / sample your data wisely to avoid a big join.

In [0]:
# First things first: we only keep movies liked by the user (rating >= 3.5).
ratings_df = ratings_df.filter(F.col('rating')>=3.5).cache()

In [0]:
# Let's look at how many ratings each user has given.
# When a user has rated a lot of movies, the number of pairs grows quadratically!
ratings_count_by_user_df = ratings_df.groupby('userId').agg(F.count('*').alias('count')).sort(F.col('count').desc()).cache()
display(ratings_count_by_user_df)

userId,count
72315,12802
75309,5525
80974,4131
137293,4116
110971,4111
92046,3991
20055,3576
85757,3062
24869,2890
24610,2890


In [0]:
# We sample each user's ratings so that, in expectation, no user keeps more than `threshold` of them.
threshold = 40
ratings_sampled_df = (
  ratings_df
  .join(ratings_count_by_user_df, 'userId')
  .filter(F.rand() < threshold / F.col('count'))
  .select('userId', 'movieId')
  .repartition('userId', 'movieId')
  .cache()
)
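
As a back-of-the-envelope check of why this sampling matters, the sketch below computes the expected number of ratings kept per user and the number of rows that user contributes to a self join on `userId`. The helper names `expected_kept` and `self_join_rows` are hypothetical. The counts 12802 and 5525 come from the table above, and 25 is a made-up light user.

```python
def expected_kept(count, threshold):
    # Each rating is kept with probability min(1, threshold / count),
    # which is what `F.rand() < threshold / count` amounts to
    return count * min(1.0, threshold / count)

def self_join_rows(n_ratings):
    # A self join on userId emits one row per ordered pair of the user's ratings
    return n_ratings ** 2

threshold = 40
for count in (12802, 5525, 25):
    kept = expected_kept(count, threshold)
    print(f"count={count:6d}  expected kept={kept:6.1f}  "
          f"join rows before={self_join_rows(count):>11,d}  "
          f"after={self_join_rows(round(kept)):>6,d}")
```

Users below the threshold keep all their ratings, while the heaviest user drops from over a hundred million join rows to about 1,600 in expectation.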

In [0]:
# The self join will rely on a sort-merge join. To avoid sorting twice, we store the dataset bucketed and sorted by userId.
ratings_sampled_df.write\
    .bucketBy(8, 'userId') \
    .sortBy('userId') \
    .saveAsTable('bucketed_ratings', format='parquet')
sorted_ratings_df = spark.table('bucketed_ratings').cache()

In [0]:
pairs_df = (
  sorted_ratings_df
    .join(sorted_ratings_df.withColumnRenamed('movieId', 'movieId2'), 'userId')
    .groupby(F.concat(F.greatest('movieId', 'movieId2'), F.lit("-"), F.least('movieId','movieId2')).alias('pair'))
    .agg(F.count("*").alias('pair_count'))
    .cache()
)

In [0]:
pairs_df.show()
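
The self join can be mimicked in pure Python on a toy input to see what `pair_count` contains. This is only a sketch: the `ratings` dictionary is made up, and the nested loop is an assumed stand-in for the Spark join on `userId`.

```python
from collections import Counter

# Hypothetical userId -> liked movieIds
ratings = {1: [10, 20], 2: [10, 20, 30]}

pair_counts = Counter()
for user, movies in ratings.items():
    # The self join produces every ordered pair (a, b) of the user's movies
    for a in movies:
        for b in movies:
            # Same key as F.concat(F.greatest(...), "-", F.least(...))
            pair_counts[f"{max(a, b)}-{min(a, b)}"] += 1

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

If the sketch mirrors the join faithfully, each pair of distinct movies is counted twice per user (once as `(a, b)` and once as `(b, a)`), and each movie is also paired with itself. This is worth keeping in mind when reading the PMI values.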

Compute the number of ratings per movie. You will need it to compute the PMI formula.

In [0]:
movie_counts = ratings_sampled_df.groupby('movieId').agg(F.count("*").alias('count')).cache()
print(f"Nb Movies : {movie_counts.count()}")
display(movie_counts.sort(F.col('count').desc()))

movieId,count
318,43218
296,36518
356,35947
593,34772
2571,30697
527,29102
260,28867
2959,25264
50,25193
110,24752


Using the movie counts and the pair counts, compute the PMI dataframe using the formula provided in the lecture.
You will be doing a join between the pairs and counts twice.

In [0]:
n_ratings = ratings_sampled_df.count()

pmi_df = (
  pairs_df
    .withColumn('split', F.split(F.col('pair'), '-'))
    .select(F.element_at('split', 1).alias('movieId1'), F.element_at('split', 2).alias('movieId2'), F.col('pair_count'))
    .join(movie_counts.withColumnRenamed('movieId', 'movieId2'), 'movieId2')
    .withColumnRenamed('count', 'count_movie2')
    .join(movie_counts.withColumnRenamed('movieId', 'movieId1'), 'movieId1')
    .withColumnRenamed('count', 'count_movie1')
    .select(
      F.col('movieId1'), 
      F.col('movieId2'), 
      # Ratio inside the PMI logarithm; the log itself is applied later, when filling the sparse matrix
      ((F.col('pair_count') * n_ratings) / (F.col('count_movie1') * F.col('count_movie2'))).alias('pmi')
    )
    .cache()
)

In [0]:
pmi_df.show()
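
The lecture formula is not reproduced in this notebook. The sketch below assumes the usual definition PMI(i, j) = log(p(i, j) / (p(i) p(j))) with probabilities estimated from counts, which is the ratio computed above once the logarithm is applied. The helper name `pmi` and all the counts are made up for illustration.

```python
from math import log

def pmi(pair_count, count_1, count_2, n_ratings):
    # p(i, j) / (p(i) * p(j)) with p(i, j) = pair_count / n and p(i) = count_i / n
    ratio = (pair_count * n_ratings) / (count_1 * count_2)
    return log(ratio)

n = 1000
print(pmi(10, 50, 40, n))  # ratio 5: seen together more often than chance, PMI > 0
print(pmi(2, 50, 40, n))   # ratio 1: exactly what independence predicts, PMI = 0
print(pmi(1, 50, 40, n))   # ratio 0.5: seen together less often than chance, PMI < 0
```

Pairs that never co-occur have no row in the pair counts, so they stay at zero in the sparse matrix rather than taking the value log(0).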

### RSVD

Now we need to build a scipy sparse matrix (lil_matrix) from the PMI dataframe. It is small enough to be collected into memory.

In [0]:
rows = pmi_df.collect()
vocabulary={}
for row in rows:
  vocabulary.setdefault(row['movieId1'], len(vocabulary))
  vocabulary.setdefault(row['movieId2'], len(vocabulary))

In [0]:
from math import log
from scipy.sparse import lil_matrix
matrix = lil_matrix((len(vocabulary),len(vocabulary)))
for row in rows:
  i = vocabulary[row['movieId1']]
  j = vocabulary[row['movieId2']]
  matrix[i,j] = log(row['pmi'])
  matrix[j,i] = log(row['pmi'])

Use the scikit-learn implementation of SVD (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html) to factorize the PMI matrix. It uses the randomized SVD algorithm by default.

In [0]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=30, random_state=42)
svd.fit(matrix)
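
For intuition about what "randomized" means here, this is a minimal numpy sketch of the basic scheme: project the matrix on a small random subspace, orthonormalize, and run an exact SVD in that subspace. It is not scikit-learn's implementation, which adds refinements such as power iterations. The function name `randomized_svd_sketch` and the toy matrix are assumptions.

```python
import numpy as np

def randomized_svd_sketch(A, n_components, n_oversamples=10, seed=0):
    rng = np.random.default_rng(seed)
    k = n_components + n_oversamples
    # 1. Sample the range of A with a random Gaussian test matrix
    omega = rng.standard_normal((A.shape[1], k))
    Q, _ = np.linalg.qr(A @ omega)
    # 2. Project A on that small orthonormal basis and run an exact SVD there
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    # 3. Map the left singular vectors back to the original space
    U = Q @ U_small
    return U[:, :n_components], s[:n_components], Vt[:n_components]

# Toy matrix of exact rank 5, so 5 components should reconstruct it
rng = np.random.default_rng(42)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
U, s, Vt = randomized_svd_sketch(A, n_components=5)

print("Vt shape:", Vt.shape)
print("reconstruction ok:", np.allclose(A, (U * s) @ Vt))
```

`Vt` has one row per component and one column per input feature, the same layout as `svd.components_`, which is why the notebook transposes it to get one vector per movie.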

### Faiss Index

Let's install faiss-cpu and create an index from these vectors. Query the index as we did previously.

In [0]:
!/databricks/python3/bin/python -m pip install --upgrade pip
!pip install faiss-cpu

In [0]:
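# create faiss index
import faiss
# One row per movie, one column per SVD component; faiss expects a C-contiguous float32 array
faiss_matrix = numpy.ascontiguousarray(svd.components_.transpose(), dtype='float32')
index = faiss.IndexFlatL2(faiss_matrix.shape[1])
index.add(faiss_matrix)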

In [0]:
# used to display movie names
inverted_index = {vocabulary[k]:k for k in vocabulary.keys()}
titles_by_id = {row['movieId']:row['title'] for row in movies_df.collect()}

# utility function to display top k
def analyze(movie_index, k):
  nb_dims = faiss_matrix.shape[1]
  # index.search returns (distances, indices), with one row per query vector
  (distances, indexes) = index.search(faiss_matrix[movie_index,:].reshape((1,nb_dims)), k)
  for movie in [titles_by_id[int(inverted_index[i])] for i in indexes[0,:]]:
    print(movie)

In [0]:
analyze(14, 10)
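
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The numpy sketch below shows what a single query amounts to. The function name `flat_l2_search` and the toy vectors are hypothetical, and this is not how faiss is implemented internally.

```python
import numpy as np

def flat_l2_search(database, query, k):
    # Squared L2 distance from the query to every stored vector
    sq_dists = ((database - query) ** 2).sum(axis=1)
    # Indices of the k closest vectors, nearest first
    order = np.argsort(sq_dists)[:k]
    return sq_dists[order], order

database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype='float32')
distances, indexes = flat_l2_search(database, database[1], k=3)
print(distances, indexes)
```

Because the query vector is itself stored in the index, its own entry comes back first with distance 0, which is also why the queried movie appears at the top of the `analyze` output.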