# Credit for this notebook: Victor Hatinguais (former lecturer of this course)

# Movie Recommendation with Spark MLlib

In this notebook, we will use Spark MLlib to build a recommender system from MovieLens datasets.

MovieLens is a project by GroupLens, a research laboratory at the University of Minnesota, that provides a movie recommender application and uses the collected data to improve recommendation algorithms. On https://movielens.org/, anyone can try the app for free and get movie recommendations. To help people develop better recommendation algorithms, MovieLens has also released several datasets at http://grouplens.org/datasets/movielens/. We will use those datasets in this notebook.

We will work with the two latest datasets available on MovieLens. The smaller one will help us build our application as fast as possible, but you can switch to the bigger one whenever you want to experience Spark's power on a larger dataset. Please keep in mind that we'll be using a free, low-capacity Spark cluster. Spark's scalability lets you run the exact same code on a much bigger cluster if you wish.

The files to be uploaded from the MovieLens latest small dataset are:
- movies.csv
- ratings.csv

Additional files may be uploaded depending on the exercises, such as:
- moviesBig.csv
- ratingsBig.csv

The small dataset contains around 100k ratings. The big one contains around 22M ratings.

## 0. Databricks, Jupyter & Markdown

First, here are some links with useful information to help you answer the following exercises.

This is a Databricks notebook environment where you can interactively develop Spark programs: https://docs.databricks.com/user-guide/notebooks/index.html

Markdown is a simple markup language to structure text: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet

Additional useful resources include:
- Spark Python API documentation: http://spark.apache.org/docs/latest/api/python/
    - SparkContext: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext
    - RDD: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD
    - SQLContext: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SQLContext
    - DataFrame: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame

## 1. MovieLens datasets: load & access

Spark lets you explore data of any structure from a lot of different data sources and data formats.

To load the data, upload the files in the data section of the left pane.
**Please only upload the smaller dataset.**

You should then get the paths needed to access the data from Spark.

In [4]:
movies_path = "/FileStore/tables/1a6s039m1508136002922/movies.csv"
ratings_path = "/FileStore/tables/1a6s039m1508136002922/ratings.csv"
# moviesBig_path = "/FileStore/tables/6lb85s4g1508136827033/movies.csv"
# ratingsBig_path = "/FileStore/tables/6lb85s4g1508136827033/ratings.csv"

Then, execute the following cell to be sure the data access is working fine.

Notice that we are specifying an action - **first()** - to test the data access. Calling **textFile()** alone is not sufficient because **textFile()** is a transformation, so Spark only builds its DAG. Execution only takes place when an action, such as **first()**, is called.
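The difference between a transformation and an action can be illustrated without a cluster. The sketch below is only an analogy built on a plain Python generator, not Spark code: `read_lines`, `log` and the two sample lines are hypothetical names and data introduced for the illustration.

```python
def read_lines(lines, log):
    """Lazily yield lines, recording each line that is actually read."""
    for line in lines:
        log.append(line)
        yield line


# Two sample lines in the movies.csv format (illustrative data)
sample = [
    "movieId,title,genres",
    "1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy",
]

log = []
pipeline = read_lines(sample, log)  # like a transformation: nothing is read yet
read_before = list(log)             # still empty at this point

first = next(pipeline)              # like first(): triggers reading one element
print(first)
```

Building `pipeline` reads nothing. Only the call to `next()` pulls a line, and it pulls exactly one. Spark behaves in a comparable way: the work is described first and executed only when an action asks for a result.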

In [6]:
print "First record in the movies.csv dataset:", sc.textFile(movies_path).first()

In this notebook, we will be using data stored as text files in the Databricks File System. The format is CSV (comma-separated values) and is well documented (see the links below), but we will proceed as if we did not know the structure.

- Small dataset documentation: http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html
- Big dataset documentation: http://files.grouplens.org/datasets/movielens/ml-latest-README.html

We will use two files from this MovieLens dataset: *ratings.csv* and *movies.csv*. All ratings are contained in the file *ratings.csv* and are in the following format:
```
userId,movieId,rating,timestamp
```
Movie information is in the file *movies.csv* and is in the following format:
```
movieId,title,genres
```

Now that you are able to access the data, let's explore Spark's functionality.

As you probably know, any Spark session needs a SparkContext to submit jobs to a cluster of executors. In this managed environment, you are provided with a free trial Spark cluster, and a SparkContext is already available as **sc**.

Refer to the Spark Python API documentation at http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext to learn which methods you can call on a SparkContext object.

Below are some examples to get the Spark version running on your environment or the default parallelism.

In [9]:
sc.version

In [10]:
sc.defaultParallelism

Notice that this environment is running Spark 2.2.0 and the current default parallelism level is set to 8. This is the number of partitions Spark uses for operations such as **parallelize()** when you do not specify one. It is good practice to configure the default parallelism to at least four times the number of executors to accommodate variance between the workloads.

To create RDDs from the datasets, we use the **.textFile()** method, which loads the data from our Databricks File System. The second parameter lets us specify the minimum number of partitions we want, and **.cache()** has been added so that the dataset is retained in memory once an action has computed it (remember that Spark evaluates lazily). Since we will be using those datasets a lot, keeping them in memory should improve performance.

In [12]:
movies = sc.textFile(movies_path, 10).cache()
ratings = sc.textFile(ratings_path, 10).cache()
# moviesBig = sc.textFile(moviesBig_path, 100).cache()
# ratingsBig = sc.textFile(ratingsBig_path, 100).cache()

It was fast, but remember that nothing has happened yet. Spark has only begun to build an execution plan and is waiting for you to provide an action before executing anything. The RDDs are nevertheless ready to be analyzed.

## 2. Spark basics
Let's discover Spark through simple commands first. Let's say we know nothing about the dataset we just loaded. The data could be unstructured, semi-structured or structured and could be in any format. Spark does not really care: the **textFile()** method lets you load those files into RDDs, and each line of those files is now an element of an RDD.

From this chapter on, you will find some exercises. The places where you have to add code are marked with **#TODO: explanation**.

The first thing you want to know is what is in your dataset: how many elements it has, what its structure is, what the attribute types are, etc.

In [15]:
print movies.first()

The *movies* RDD seems to be in CSV format, and it is good to know that there is a header.

But to understand the data types, you probably want to get more lines. Use the Spark Python API documentation to find out how to retrieve 10 lines from both datasets *ratings* and *movies*.

Notice that you probably don't want to retrieve **all lines**. In distributed computation, the dataset could be huge, and it is usually a bad idea to bring all the data from executors on hundreds of machines back to the driver on a single machine.

In [17]:
# Exercise 1: get 10 elements from every dataset
rats = ratings # TODO: get 10 elements
print "--------\nRatings:\n--------"
for r in rats:
    print r
movs = movies # TODO: get 10 elements
print "\n--------\nMovies:\n--------"
for m in movs:
    print m

We notice that the ratings elements are strings of comma-separated values. The values are integers or floats.

For movies, the elements are also strings of comma-separated values. The values are strings, possibly containing pipe-separated values (for the genres).

In [19]:
# Exercise 2: print the number of elements in every dataset
for rdd in movies, ratings:
# for rdd in moviesBig, ratingsBig:
    c = rdd #TODO: number of elements in rdd
    print "RDD '{}' has {} elements.".format(rdd, c)

The bigger dataset has 22M+ elements. If we experience computing delays, we may prefer to work on the smaller dataset.

When you work in Spark with data from an input file, you usually start with this kind of RDD of *lines* from your input file. But this input file probably has a structure or some specific elements that you want to extract in order to give your Spark RDD a structure. For example, the ratings CSV file has four attributes: userId, movieId, rating and timestamp. A Spark RDD does not understand the data structure, but you can give your data one by splitting the lines on the comma separator.

You'll learn later that another Spark data structure, the DataFrame, understands the data structure and is thus associated with a schema.

But for now, prepare the RDDs by extracting the different fields and removing the header row and the timestamp field. You can also cast the fields to integer and float. Start with the small dataset, check the final RDD with **first()** or **take()** and, once it is correct, repeat your work on the bigger dataset. The **map()** method is the RDD method you are looking for if you wish to apply a function to every element of an RDD and get another RDD in return.
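One caveat when splitting on commas: in *movies.csv*, a title that itself contains a comma is wrapped in double quotes. The sketch below uses plain Python (no Spark) and an illustrative line in the *movies.csv* format to compare a naive split with Python's standard `csv` module. Whether you need this for the exercise depends on how strictly you want to parse titles, so treat it as a hint rather than the expected solution.

```python
import csv

# Illustrative line in the movies.csv format: the title contains a comma,
# so the field is wrapped in double quotes.
line = '11,"American President, The (1995)",Comedy|Drama|Romance'

naive = line.split(",")            # also splits inside the quoted title
parsed = next(csv.reader([line]))  # honours the quotes

print(len(naive))   # number of fields produced by the naive split
print(parsed)       # fields produced by the csv module
```

The naive split produces four fields because the title is cut in two, while the `csv` reader returns the three expected fields. A function wrapping the second approach could be passed to **map()** like any other function.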

In [21]:
# Exercise 3: prepare the RDDs

# Step 1: remove the header row
movies2 = movies #TODO: Remove the header row from movies
# Check that the new RDD has a row less
print "movies2 has ", movies2.count(), " elements while movies has ", movies.count()
ratings2 = ratings #TODO: Remove the header row from ratings
# Check that the new RDD has a row less
print "ratings2 has ", ratings2.count(), " elements while ratings has ", ratings.count()

In [22]:
# Step 2: split the lines
movies3 = movies2 #TODO: Split the lines to get an RDD of arrays of strings
# Check an element of the new RDD
print movies3.first()
ratings3 = ratings2 #TODO: Split the lines to get an RDD of arrays of strings
# Check an element of the new RDD
print ratings3.first()

In [23]:
# Step 3: take the fields you need (i.e. remove the timestamp from ratings) and cast them to the appropriate data type
movies4 = movies3 #TODO: Cast the fields to get an RDD of tuples (int, str, str)
# Check an element of the new RDD
print movies4.first()
ratings4 = ratings3 #TODO: Cast the required fields to get an RDD of tuples (int, int, float)
# Check an element of the new RDD
print ratings4.first()

We had a *ratings* RDD of strings representing lines in our input file.

We now have a *ratings4* RDD of (integer, integer, float).

In [25]:
# Step 4: apply all those operations at once to the dataset; you can break a line in your code by ending it with a backslash (\)

movies4 = movies #TODO: apply all the above operations at once
# Check an element of the new RDD
print movies4.first()

ratings4 = ratings #TODO: apply all the above operations at once
# Check an element of the new RDD
print ratings4.first()

# If the bigger dataset is available in your environment, you may want to apply those operations to it as well. You can cache it for better performance.
# moviesBig.cache()
# ratingsBig.cache()

Even with 22M+ records, the computation with Spark is super fast as long as the dataset can fit in memory and has been cached. This dataset is "only" 600MB. Imagine the possibilities with a cluster of ten 128GB RAM nodes for instance.

With those RDDs, it will be easier to answer the following exercises. In fact, it would be even easier with SQL (Structured Query Language), which is available through the DataFrame abstraction. We will come back to that later.

In [27]:
# Exercise 4: how many different users are there in the dataset and how many movies have been rated?
print "Number of different users:", ratings4 #TODO
print "Number of different movies that have been rated:", ratings4 #TODO

In [28]:
# Exercise 5: what are the maximum rating and the minimum rating that appear in the big dataset?
print "Rating max: ", ratings4 #TODO
print "Rating min: ", ratings4 #TODO

In [29]:
# Exercise 6: give the full distribution of the ratings, i.e. the number of occurrences of each rating; the WordCount example can help you
distribution = ratings4 #TODO
# Check the results. You can collect() because you know that the resulting dataset is small
for tuples in distribution.sortByKey().collect():
    print tuples[0], ": ", tuples[1]

In the previous code, it is important to understand where the code executes. You should take advantage of your Spark cluster's power whenever possible and only manipulate small datasets on the single driver machine.

Notice that the distribution of the ratings is not uniform. We can plot it with Matplotlib.

In [31]:
distribution

In [32]:
# %matplotlib inline

import pandas
import matplotlib.pyplot as plt

# The distribution is small, so it is safe to collect it on the driver
distribution_pandas = pandas.DataFrame(distribution.sortByKey().collect(), columns=["rating", "freq"])

# Bar chart of the number of occurrences of each rating value
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(16, 12))
distribution_pandas["freq"].plot(kind="bar", ax=ax)
ax.set_xticklabels(distribution_pandas["rating"])

display(fig)

It seems that users rate the movies they like more often than the ones they don't...

## 3. Collaborative filtering
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix, in our case the user-movie rating matrix. MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. In particular, MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
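To make the idea of latent factors concrete, here is a minimal sketch in plain Python. In a matrix factorization model, the predicted rating for a (user, movie) pair is the dot product of the user's factor vector and the movie's factor vector. The factor values below are made up for illustration and are not learned from the dataset.

```python
# Made-up latent factors for one user and one movie (rank = 2)
user_factors = [1.2, 0.8]
movie_factors = [2.0, 1.5]

# Predicted rating = dot product of the two factor vectors
predicted_rating = sum(u * m for u, m in zip(user_factors, movie_factors))
print(round(predicted_rating, 2))
```

With these values the prediction is about 3.6 (1.2 × 2.0 + 0.8 × 1.5). Training a model amounts to finding the factor vectors whose dot products best reproduce the known ratings.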

What you want to estimate in our example is the rating a user would give to a movie they have not yet watched. If that estimate is precise enough, you just have to recommend to each user the movies you estimate they would have rated 4 or more if they had watched them.

*ALS* is a supervised algorithm, which means you need data that already contains the target you want to estimate, i.e. the rating. You have this dataset, but it is usually a good idea to split it in two. The first part, the training set, is used to train the model. The other part, the test set, is used to evaluate the model's performance by applying the model to blind data (the data without the answer) and comparing the predicted values to the actual ones.

So let's split the *ratings4* RDD between the training set and the test set with respectively about 80%/20% of the original dataset. The ratings should be split approximately randomly.
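The principle of an approximately random 80%/20% split can be sketched in plain Python on toy data. This only illustrates the idea and is not the Spark call expected in the exercise: `ratings_sample` is made-up data, and the seed value is an arbitrary choice that makes the split reproducible.

```python
import random

rng = random.Random(42)  # fixed seed so the split is reproducible

# 1000 toy ratings as (userId, movieId, rating) tuples (made-up data)
ratings_sample = [(u, m, 3.0) for u in range(1, 101) for m in range(1, 11)]

training_sample, test_sample = [], []
for rating in ratings_sample:
    # Each rating goes to the training set with probability 0.8
    if rng.random() < 0.8:
        training_sample.append(rating)
    else:
        test_sample.append(rating)

print("training: {}, test: {}".format(len(training_sample), len(test_sample)))
```

Because each rating is assigned independently, the proportions are only approximately 80%/20%, which is why the text above says "about".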

In [35]:
# Exercise 7: split the big dataset between a training set and a test set
training = ratings4 #TODO
test = ratings4 #TODO
# Check the number of ratings in each RDD
print "Number of ratings in the training dataset: ", training.count()
print "Number of ratings in the test dataset: ", test.count()

We will use MLlib's *ALS* (Alternating Least Squares) to train a *MatrixFactorizationModel*, which takes an RDD[(user, product, rating)]. ALS has training parameters such as the rank of the matrix factors and regularization constants. To determine a good combination of training parameters, we should train the model on the training set, evaluate it on the test set and iterate with different parameters. Let's first try some arbitrary parameters and compute the score with the RMSE (Root Mean Squared Error).
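As a reminder, the RMSE is the square root of the mean of the squared differences between predicted and actual ratings. A minimal plain-Python sketch with made-up (predicted, actual) pairs follows. The helper name `rmse` is introduced here for illustration only.

```python
from math import sqrt


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    squared_errors = [(predicted - actual) ** 2 for predicted, actual in pairs]
    return sqrt(sum(squared_errors) / float(len(pairs)))


# Made-up (predicted, actual) pairs: errors are -0.5, 0.0 and 1.0
pairs = [(3.5, 4.0), (4.0, 4.0), (2.0, 1.0)]
score = rmse(pairs)
print(round(score, 4))
```

Here the squared errors are 0.25, 0 and 1, so the RMSE is the square root of 1.25 / 3, roughly 0.65. The lower the RMSE, the closer the predictions are to the actual ratings.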

Among the training parameters of ALS, the most important ones are the rank, lambda (the regularization constant) and the number of iterations. The train method of ALS we are going to use is defined as follows:
```python
class ALS(object):

    @classmethod
    def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1):
        # ...
        return MatrixFactorizationModel(sc, mod)
```
Ideally, we want to try a large number of combinations of parameters in order to find the best one but we won't for now. So let's train the model.

You can find additional documentation at: http://spark.apache.org/docs/1.4.1/mllib-collaborative-filtering.html

In [38]:
from pyspark.mllib.recommendation import ALS

model = ALS.train(training, 2, 5, 0.01)

This kind of algorithm may take some time on a huge dataset...

Once the model is trained, we can apply it to the test set from which we remove the actual ratings (blind estimation). Then we join the predictions with the actual ratings and compute the difference.

In [40]:
predictions = model.predictAll(test.map(lambda x: (x[0], x[1])))
print "An example of prediction: ", predictions.first()
predictionsAndRatings = predictions \
    .map(lambda x: ((x[0], x[1]), x[2])) \
    .join(test.map(lambda x: ((x[0], x[1]), x[2])))
print "An example of predictions and ratings: ", predictionsAndRatings.first()

from operator import add
from math import sqrt
n = test.count()
testRMSE = sqrt(predictionsAndRatings.values().map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))
print "RMSE = ", testRMSE

We can also use the **recommendProducts(user, nb)** method to recommend *nb* products (i.e. movies) to a specific user.

In [42]:
model.recommendProducts(10,5)


## 4. To continue your journey with Spark
### Comparing to a naive baseline
Does ALS output a non-trivial model? We can compare the evaluation result with a naive baseline model that only outputs the average rating (or you may try one that outputs the average rating per movie). Computing the baseline’s RMSE is straightforward:

In [44]:
meanRating = training.map(lambda x: x[2]).mean()
baselineRMSE = sqrt(test.map(lambda x: (meanRating - x[2]) ** 2).reduce(add) / n)
improvement = (baselineRMSE - testRMSE) / baselineRMSE * 100
print "The baseline RMSE is ", baselineRMSE, ", we improved it by ", improvement, "%"

### Automatic cross validation and parameter selection
To improve the performance of our model, we could:
- correct the predicted ratings above 5 or below 0.5
- round to the nearest .5 or not
- select better parameters through automatic cross validation
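The first two ideas can be sketched in plain Python. The helper names `clip_rating` and `round_to_half` are hypothetical, and the 0.5 to 5 bounds come from the rating scale of the dataset. Whether clipping or rounding actually lowers the RMSE is something to measure, not something this sketch shows.

```python
from math import floor


def clip_rating(x, low=0.5, high=5.0):
    """Bring a predicted rating back into the valid [low, high] range."""
    return max(low, min(high, x))


def round_to_half(x):
    """Round a rating to the nearest multiple of 0.5 (ties round up)."""
    return floor(x * 2 + 0.5) / 2.0


# Made-up raw predictions: one too high, one in range, one too low
raw_predictions = [5.7, 3.26, -0.3]
corrected = [round_to_half(clip_rating(p)) for p in raw_predictions]
print(corrected)
```

Such functions could be applied to the predicted value of each prediction before recomputing the RMSE.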

### Confusion Matrix
It would be useful to build a confusion matrix to understand where our model performs well and where it does not, i.e. a 10x10 matrix where the rows are the actual ratings, the columns are the predictions and each cell gives the number of predictions made in each class (i.e. 0.5, 1.0, 1.5, etc.) for each actual rating. Ideally, all the counts would fall on the diagonal of the matrix. This matrix could also be plotted as a heatmap with Matplotlib.
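A minimal plain-Python sketch of such a matrix is given below, assuming the predictions have already been rounded to the nearest 0.5. The (actual, predicted) pairs are made up. On the real data, the counts would more likely be computed on the cluster and only the 100 resulting cells collected on the driver.

```python
# The ten rating levels: 0.5, 1.0, ..., 5.0
levels = [0.5 * i for i in range(1, 11)]
index = {level: i for i, level in enumerate(levels)}

# Made-up (actual, predicted) pairs, predictions already rounded to 0.5
pairs = [(4.0, 4.0), (4.0, 3.5), (5.0, 4.0), (0.5, 1.0), (4.0, 4.0)]

# matrix[row][col]: row = actual rating, col = predicted rating
matrix = [[0] * len(levels) for _ in levels]
for actual, predicted in pairs:
    matrix[index[actual]][index[predicted]] += 1

# Correct predictions are the counts on the diagonal
correct = sum(matrix[i][i] for i in range(len(levels)))
print("{} of {} predictions on the diagonal".format(correct, len(pairs)))
```

In this toy example, two of the five predictions fall on the diagonal (actual 4.0, predicted 4.0).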

### DataFrames (SparkSQL)

Try to do exercises 4, 5 and 6 with SQL using Spark DataFrames. You'll have to define a schema and use the **toDF()** method.