## Large-Scale Matrix Factorization

We will experiment with the recent MovieLens 25M dataset and build a recommender system using two approaches:
* Factorizing the user-item matrix using the Spark ALS implementation
* Factorizing the item-item PMI matrix using randomized SVD

In both settings we will index the item embeddings and inspect their quality using KNN queries.

### Download the dataset

In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-25m.zip

In [None]:
!unzip ml-25m

### Creating the Spark session

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
import pyspark.sql.functions as F

In [None]:
ss = SparkSession \
    .builder \
    .appName("mf") \
    .master("local[4]") \
    .config("spark.submit.deployMode", "client") \
    .config("spark.driver.memory", "4g") \
    .config("spark.ui.port", "0") \
    .getOrCreate()
ss

### Loading the ratings dataset

In [None]:
ratings_df = ???

In [None]:
ratings_df.head()

### Split the dataset
We want to randomly split the dataset into a training set and a test set.
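
To make the idea of a weighted random split concrete, here is a minimal plain-Python sketch. It is an illustration only, not Spark's implementation: the helper `random_split` and the toy `rows` list are hypothetical names introduced here. In the notebook itself, the DataFrame method `randomSplit(weights, seed)` plays this role.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative, normalised boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating point drift
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform draw in [0, 1)
        for k, b in enumerate(bounds):
            if u < b:
                splits[k].append(row)
                break
    return splits

rows = list(range(1000))  # toy stand-in for the ratings rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
```

Because every row is drawn independently, the split sizes are only approximately 80/20. Fixing the seed makes the split reproducible across runs.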

In [None]:
(training, test) = ???

### Build ALS model
Using the Spark ALS implementation described at https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html, build a model on the ml-25m dataset.

How long does training take? Change the rank (i.e. the dimension of the vectors) from 10 to 20. How does that affect training speed?

In [None]:
model = ???

### Evaluation
Using the code described in the Spark documentation, evaluate how well your model performs on the test set.
The goal is to predict the held-out ratings.
A good metric could be RMSE or MAE.
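
As a reminder of what the two metrics measure, the sketch below computes them by hand on a few made-up ratings. The function names and the toy lists are hypothetical and only serve the illustration. In the notebook, Spark's `RegressionEvaluator` computes the same quantities when `metricName` is set to `"mae"` or `"rmse"`.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; penalises large errors more than MAE."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Made-up held-out ratings and model predictions.
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]

toy_mae = mean_absolute_error(actual, predicted)        # 0.625
toy_rmse = root_mean_squared_error(actual, predicted)   # 0.75
```

One thing to watch for on the real data: the Spark documentation linked above notes that ALS can produce `NaN` predictions for users or movies that were not seen during training, which makes the metric `NaN` as well. Setting `coldStartStrategy="drop"` on the ALS estimator is the documented way to exclude those rows.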

In [None]:
mae = ???
print("MAE = " + str(mae))

### Inspecting the results

Retrieve the movie vectors from the learned model object (the property is called `itemFactors`)
and `collect` all these vectors into a list.

In [None]:
movie_vectors_df = ???

In [None]:
movie_vectors_df.head()

Now we need to create a dictionary mapping each movieId to its title to ease the inspection.
Load the `movies.csv` file using pyspark or pandas and create a `dict` movieId -> title.
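
Whichever loader you pick, note that some titles contain commas and are therefore quoted in the file, so splitting lines on `,` by hand would break them. The sketch below shows the idea with the standard `csv` module on a small in-memory sample. The sample rows and the column names (`movieId`, `title`, `genres`) are assumptions about the file layout, and `sample_names_dict` is a hypothetical name.

```python
import csv
import io

# In-memory stand-in for the first lines of movies.csv (assumed columns: movieId,title,genres).
sample_csv = io.StringIO(
    'movieId,title,genres\n'
    '1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy\n'
    '11,"American President, The (1995)",Comedy|Drama|Romance\n'
)

# DictReader handles the quoted title that contains a comma.
reader = csv.DictReader(sample_csv)
sample_names_dict = {int(row["movieId"]): row["title"] for row in reader}
```

Casting `movieId` to `int` keeps the keys consistent with the integer ids used by the model's item factors.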

In [None]:
movie_names_dict = ???

### Using approximate nearest neighbours

Now we are going to use the Annoy library https://github.com/spotify/annoy to retrieve the nearest neighbours of a given movie.
We will create the index and then add the vectors created previously one by one with a key corresponding to the movie id.

In [None]:
???

Let's just get some movies to test the querying on.

In [None]:
movies_df.show()

Now take a few movie ids and retrieve the closest movies (and their titles) in the embedding space.
How do the results look?
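
Since Annoy returns approximate neighbours, it can help to compare a few queries against an exact search. The following is a brute-force cosine-similarity sketch in plain Python, not Annoy itself. The names `exact_neighbours` and `toy_vectors` are hypothetical, and the approach is only practical on a small sample of movies.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def exact_neighbours(query_id, vectors, k):
    """Return the k ids most similar to query_id, best first, excluding the query itself."""
    query = vectors[query_id]
    scored = [
        (cosine_similarity(query, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other_id for _, other_id in scored[:k]]

# Toy 2-d "embeddings" keyed by movie id.
toy_vectors = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [-1.0, 0.0],
}
toy_neighbours = exact_neighbours(1, toy_vectors, k=2)  # [2, 3]
```

If the Annoy index was built with the `angular` metric, its ranking should broadly agree with this cosine ordering. Large disagreements usually suggest that more trees are needed or that the ids were mixed up when adding items.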

In [None]:
???

### Another approach - RSVD

We are now going to factorize the item-item PMI matrix using randomized SVD.

### Creating the PMI matrix

#### Counting movies and pairs

Create two dataframes: one containing movie counts (how many times each movie was rated), and one containing movie pair counts (how many times two movies were rated by the same user).

TIP: only keep movies with at least 10 ratings, for instance, and only pairs with at least 10 occurrences.
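
Before writing the Spark version, it may help to see the two counts on a toy example. This is a plain-Python sketch with hypothetical names (`toy_ratings`, `min_count`, and so on), and it uses a threshold of 2 instead of 10 so that the tiny dataset is not filtered away entirely.

```python
from collections import Counter
from itertools import combinations

# Toy (userId, movieId) ratings; the rating value is irrelevant for counting.
toy_ratings = [
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 20),
    (3, 10), (3, 30),
    (4, 20),
]
min_count = 2  # stands in for the threshold of 10 suggested above

# Movie counts: how many ratings each movie received.
toy_movie_counts = Counter(movie for _, movie in toy_ratings)
kept = {m for m, c in toy_movie_counts.items() if c >= min_count}

# Group the kept movies by user, then count each unordered pair once per user.
movies_by_user = {}
for user, movie in toy_ratings:
    if movie in kept:
        movies_by_user.setdefault(user, set()).add(movie)

toy_pair_counts = Counter()
for movies in movies_by_user.values():
    for a, b in combinations(sorted(movies), 2):  # a < b, so (a, b) is canonical
        toy_pair_counts[(a, b)] += 1

# Drop rare pairs.
toy_pair_counts = Counter({p: c for p, c in toy_pair_counts.items() if c >= min_count})
```

In Spark, the per-user pairing corresponds to joining the ratings with themselves on the user id. Keeping only rows where the first movie id is smaller than the second counts each unordered pair once, in which case the matrix built later has to be symmetrized.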

In [None]:
movie_counts = ???

Using the filtered movies, compute the pair counts by doing a self join on the ratings dataframe (filtered to keep only the relevant movies).

In [None]:
pairs_df = ???

Using the movie counts and the pair counts, compute the PMI dataframe using the formula provided in the lecture.
You will need to join the pairs with the counts twice, once for each movie of the pair.
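
The lecture formula is not reproduced in this notebook, so the sketch below assumes the standard definition, PMI(a, b) = log(p(a, b) / (p(a) p(b))), with each probability estimated as a count divided by the number of users. That normalisation is an assumption, and the lecture may use a different one or a variant such as positive PMI. All names here are hypothetical.

```python
import math

def pmi(pair_count, count_a, count_b, total):
    """log( p(a, b) / (p(a) * p(b)) ), with every probability estimated as count / total."""
    return math.log((pair_count * total) / (count_a * count_b))

n_users = 4                                # assumed normalisation constant
toy_counts = {10: 3, 20: 3, 30: 2}         # movieId -> number of ratings
toy_pairs = {(10, 20): 2, (10, 30): 2}     # (movieId_a, movieId_b) -> co-occurrences

# Each pair needs the count of its first movie and of its second movie,
# which is why the Spark version joins the pairs with the counts twice.
toy_pmi = {
    (a, b): pmi(n_ab, toy_counts[a], toy_counts[b], n_users)
    for (a, b), n_ab in toy_pairs.items()
}
```

A positive value means the two movies are rated together more often than their individual popularity would predict, and a negative value means less often.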

In [None]:
pmi_df = ???

In [None]:
pmi_df.show()

Now we need to build a scipy sparse matrix (`csr_matrix` format for instance https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix) from the PMI dataframe. It is small enough to be collected into memory.
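
Movie ids are not contiguous, so they cannot be used directly as row and column indices. A minimal sketch of the index mapping follows, in plain Python with hypothetical names. It assumes the collected PMI rows hold each unordered pair once, which may not match your dataframe.

```python
# Toy PMI rows as (movieId_a, movieId_b, pmi); the ids are deliberately non-contiguous.
toy_pmi_rows = [
    (10, 20, -0.12),
    (10, 30, 0.29),
]

# Map each movieId to a dense 0..n-1 index, and keep the inverse for lookups later.
movie_ids = sorted({m for a, b, _ in toy_pmi_rows for m in (a, b)})
id_to_index = {movie_id: i for i, movie_id in enumerate(movie_ids)}
index_to_id = {i: movie_id for movie_id, i in id_to_index.items()}

rows, cols, data = [], [], []
for a, b, value in toy_pmi_rows:
    # Add both (a, b) and (b, a) so the resulting matrix is symmetric.
    rows += [id_to_index[a], id_to_index[b]]
    cols += [id_to_index[b], id_to_index[a]]
    data += [value, value]

n = len(movie_ids)
```

These three lists can then be passed to SciPy as `csr_matrix((data, (rows, cols)), shape=(n, n))`. If your PMI dataframe already contains both orders of every pair, skip the mirroring step, because duplicate entries are summed by that constructor. Keep `index_to_id` around: it is needed to translate matrix rows back to movie ids when querying the vectors.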

In [None]:
pmi_matrix = ???

Use the scikit-learn implementation of SVD https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html to factorize the PMI matrix. It uses the randomized SVD algorithm we presented as its default solver.

In [None]:
???

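Now retrieve the components from the model; these are our movie vectors. Note that `components_` has shape `(n_components, n_features)`, so it must be transposed to get one row per movie.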

In [None]:
movie_vectors = ???

In [None]:
movie_vectors

Let's create an Annoy index from these vectors and query it as before.

In [None]:
???