# Exploring the MovieLens dataset with the Spark RDD API

## Install Spark Environment
Since we are not running on Databricks, we need to install Spark ourselves every time we start a session.  
We need to install Spark as well as a Java Runtime Environment.  
Then we need to set up a few environment variables.  

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!curl -O https://dlcdn.apache.org/spark/spark-3.2.3/spark-3.2.3-bin-hadoop3.2.tgz
!tar xf spark-3.2.3-bin-hadoop3.2.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.3-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local[*]').getOrCreate()

## Optional step : Enable SparkUI through secure tunnel
This step is useful if you want to look at the Spark UI.
First, you need to create a free ngrok account : https://dashboard.ngrok.com/login.  
Then log in on the website and copy your AuthToken.

In [None]:
# this step downloads ngrok, configures your AuthToken, then starts the tunnel
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
# Replace the placeholder below with the AuthToken copied from your ngrok dashboard.
!./ngrok authtoken my_ngrok_auth_token_retrieved_from_website
get_ipython().system_raw('./ngrok http 4050 &')

## Other Imports

In [None]:
%matplotlib inline

import urllib
import urllib.request as req
import zipfile
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Downloading the dataset

In [None]:
url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
filehandle, _ = urllib.request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
zip_file_object.namelist()
zip_file_object.extractall()

In [None]:
!ls

In [None]:
movies_path = "ml-20m/movies.csv"
ratings_path = "ml-20m/ratings.csv"

### Loading the data

We load the data with the Spark DataFrame API.  
Loading a CSV file with the RDD API is not supported out of the box and is painful to implement.

In [None]:
ratings_df = spark.read.options(header=True, inferSchema=True).csv(ratings_path)

Did you notice that this created a job in the Spark UI? Isn't Spark supposed to be lazy until we request an action?  
Rerun the same command with inferSchema=False and compare the two schemas using ratings_df.printSchema(). Can you work out why Spark triggered a job and what it was for?

Two jobs are created when using the inferSchema option. Spark needs to scan the whole dataset in order to infer the data type of each column. Yet if you disable this option, you will see that one short job is still created. So much for laziness! To build the DataFrame, Spark needs to know how many columns each row has, and that is why a first job is created. Let's keep the inferSchema option set to True for now.
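
One way to observe the cost of schema inference is to hand Spark the schema yourself. The sketch below is only an illustration: the names RATINGS_SCHEMA and read_ratings are made up for this example, and the column names and types are assumed from the header of ratings.csv rather than taken from this notebook. With an explicit schema, Spark should not need the full scan that inferSchema=True triggers.

```python
# DDL-formatted schema string. Column names and types are an assumption based
# on the ratings.csv header (userId, movieId, rating, timestamp).
RATINGS_SCHEMA = "userId INT, movieId INT, rating DOUBLE, timestamp BIGINT"


def read_ratings(spark, path):
    """Read the ratings CSV with an explicit schema instead of inferSchema."""
    # `spark` is an existing SparkSession; DataFrameReader.csv accepts the
    # schema either as a StructType or as a DDL string like the one above.
    return spark.read.csv(path, header=True, schema=RATINGS_SCHEMA)
```

Calling read_ratings(spark, ratings_path) followed by printSchema() lets you compare the result with the inferred schema, and the Spark UI shows whether a job was still launched.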

In [None]:
ratings_df.take(1)

In [None]:
ratings_rdd = ratings_df.rdd.map(lambda x: x.asDict())

In [None]:
ratings_rdd.take(3)

The record type of a DataFrame is the 'Row'. An RDD can hold any record type; here we use Python dictionaries.

In [None]:
movies_df = spark.read.options(header=True, inferSchema=True).csv(movies_path)
movies_rdd = movies_df.rdd.map(lambda x: x.asDict())

In [None]:
movies_rdd.take(1)

The ratings RDD is a bit large (a request on it takes about 2 min in a Docker container with two cores). You can develop and debug your job on a smaller version of it and then run it on the full RDD to get the result.  
Why do we persist the small RDD and not the full one?

In [None]:
ratings_small_rdd = ratings_rdd.filter(lambda x: x['userId'] < 20000).persist(pyspark.StorageLevel.DISK_ONLY)

If we persisted the non-filtered RDD, we would lose most of the benefit of the persist function, because every request would still have to read the whole dataset. Some other remarks:  
When doing real analysis, filtering on userId may yield biased results, because you are likely to keep only the oldest users subscribed to MovieLens.  
The benefits of persisting to disk are lost if you stop your Spark session. If your analysis spans multiple sessions, you should save your dataset to a distributed file system.  
Persisting to memory may be suitable if you are running an iterative algorithm, but be wary: executor memory may be shared with other users, and you have no guarantee that some partitions won't be recomputed from scratch at some point.

Here we will sample the dataset, save it to the DFS, then read it again.

In [None]:
sampled_path = "ml-20m/sampled_ratings.csv"
ratings_df.sample(fraction=0.1).write.format("csv").save(sampled_path, mode="overwrite", header=True)

In [None]:
ratings_small_df = spark.read.options(header=True, inferSchema=True).csv(sampled_path)
ratings_small_rdd = ratings_small_df.rdd.map(lambda x: x.asDict())

Use the sampled RDD while tinkering. When you are sure about what you are doing, you can run your job on the entire RDD.

### Q1. How many ratings ?

### Q2. How many users ?

Read the documentation for the distinct function in the RDD API and find a solution with this method.
There is another solution that relies on a more generic function. Can you solve the problem without using the distinct function?
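
One candidate for the "more generic function" is the RDD aggregate method, which folds each partition into an accumulator and then merges the accumulators. The sketch below simulates that contract on plain Python lists so it runs without Spark. The helper names and the toy partitions are invented for illustration, and this is only one possible approach, not necessarily the intended answer.

```python
from functools import reduce


def add_user(seen, rating):
    """seqOp: fold one rating record into the set of userIds seen so far."""
    seen.add(rating["userId"])
    return seen


def merge_seen(left, right):
    """combOp: merge the sets built on two different partitions."""
    left |= right
    return left


# Toy stand-in for an RDD split into two partitions (made-up data).
partitions = [
    [{"userId": 1}, {"userId": 2}],
    [{"userId": 2}, {"userId": 3}],
]

# Conceptually, aggregate applies seqOp inside each partition,
# then combOp across the per-partition results.
per_partition = [reduce(add_user, part, set()) for part in partitions]
n_users = len(reduce(merge_seen, per_partition, set()))
print(n_users)
```

On the real data the same two functions could be passed as len(ratings_small_rdd.aggregate(set(), add_user, merge_seen)). The final set is assembled on the driver, so this only suits a modest number of distinct users.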

### Q3. How many ratings per grade ?

How many users rated a movie with grade r for r in [0,5]?    
Plot it. Do you notice something unusual ?

### Q4. Histogram of number of ratings per user

Plot the distribution of the number of movies rated per user. In other words, what is the fraction of users that rated between bins[i] and bins[i+1] movies for the following bins.  
What is the average and median number of ratings per user?

In [None]:
bins = np.unique(np.logspace(0, 160, base=1.05, num=50, dtype='int32'))
bins
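
Once the number of ratings per user has been brought back to the driver as a plain list, the binning itself can be done with NumPy. The following is a minimal sketch under that assumption; the function name fraction_per_bin and the toy numbers are hypothetical, and how you obtain the per-user counts from the RDD is left to the exercise.

```python
import numpy as np


def fraction_per_bin(counts, edges):
    """Return the fraction of users whose rating count falls in each bin.

    Counts outside the given edges fall in no bin but still count in the
    denominator, so the fractions may sum to less than 1.
    """
    hist, _ = np.histogram(counts, bins=edges)
    return hist / len(counts)


# Made-up ratings-per-user counts and bin edges, only to show the call.
toy_counts = [1, 1, 2, 3]
toy_edges = [1, 2, 4]
fractions = fraction_per_bin(toy_counts, toy_edges)
print(fractions, np.mean(toy_counts), np.median(toy_counts))
```

With the real data, the bins array defined above would be passed as edges, and np.mean and np.median on the same list of counts give the average and median asked for.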

### Q5. Most popular movies

What are the 20 movies with the most ratings ?  
We would like the answer with the movie title and not the movie id.  
Look at the documentation of the join and top functions.

### Q6. Writing partitioned datasets

The ratings dataset is available as one big csv file. It is not very convenient since we have to go through the entire file to look for ratings for a specific userId. Moreover, we cannot open only a small part of the dataset.  
Could you write the ratings dataset into 16 files located in /tmp/ratings/part=X/ratings.csv for X in [0, 16), where the userIds in part=X are such that userId % 16 == X? Your function should return the list of written files with the number of ratings for each file.  
Look at the documentation of partitionBy and mapPartitionsWithIndex.
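
The two functions named above each take a callback: partitionBy accepts a function from key to partition index, and mapPartitionsWithIndex calls a function with the partition index and an iterator over its records. The sketch below shows what such callbacks might look like in plain Python. The names part_of and write_partition, the base_dir parameter, and the list of CSV columns are assumptions made for this example, and records are assumed to be (userId, rating_dict) pairs.

```python
import csv
import os

NUM_PARTS = 16  # number of output files requested in the exercise


def part_of(user_id):
    """Partition function: map a userId key to its target partition index."""
    return user_id % NUM_PARTS


def write_partition(index, rows, base_dir="/tmp/ratings"):
    """Write one partition of (userId, rating_dict) pairs to a CSV file.

    Yields a single (path, n_rows) pair, since mapPartitionsWithIndex
    expects the callback to return an iterable.
    """
    directory = os.path.join(base_dir, f"part={index}")
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "ratings.csv")
    # Column names are assumed from the ratings.csv header.
    fields = ["userId", "movieId", "rating", "timestamp"]
    n_rows = 0
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for _, rating in rows:
            writer.writerow(rating)
            n_rows += 1
    yield path, n_rows
```

A possible way to wire these together is ratings_small_rdd.map(lambda r: (r["userId"], r)).partitionBy(NUM_PARTS, part_of).mapPartitionsWithIndex(write_partition).collect(). Writing to a local path like this only makes sense when the executors run on the same machine as the driver, as in local mode.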

### Q7. Most popular genre per year

For every year since 1980, determine what is the most popular genre.  
Look at the documentation of the flatMap function.
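
flatMap fits here because one movie can carry several genres, so one input record has to produce several output records. The sketch below is a possible helper to pass to flatMap, under two assumptions that are not stated in this notebook: that the release year appears in parentheses at the end of the title, and that genres are separated by the | character. The name year_genre_pairs is hypothetical. It also reads "year" as the release year, whereas the question could equally be about the year in which the ratings were given.

```python
import re

# Assumed title format: "Some Title (1995)", with the year at the very end.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def year_genre_pairs(movie):
    """Turn one movie record into a list of (year, genre) pairs."""
    match = YEAR_PATTERN.search(movie["title"])
    if match is None:
        # No recognisable year in the title: emit nothing for this movie.
        return []
    year = int(match.group(1))
    return [(year, genre) for genre in movie["genres"].split("|")]


# Made-up record, only to show the shape of the output.
example = {"movieId": 1, "title": "Toy Story (1995)", "genres": "Adventure|Animation"}
print(year_genre_pairs(example))
```

movies_rdd.flatMap(year_genre_pairs) then gives one record per (year, genre), which can be filtered on year >= 1980 and counted. Whether popularity means the number of movies or the number of ratings is left to your reading of the question.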

### Q8. Best movies

Amongst the movies with at least 1000 ratings, what are the top 20 movies per median rating ?
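
The median is not an associative reduction, so it cannot be computed with a simple reduceByKey; one option is to gather each movie's ratings and compute the median on the gathered values. The helper below is a small sketch of that per-movie step. The name median_if_popular and its min_count parameter are hypothetical.

```python
from statistics import median


def median_if_popular(ratings, min_count=1000):
    """Return the median rating, or None if there are fewer than min_count ratings."""
    values = list(ratings)  # also accepts the iterable produced by groupByKey
    if len(values) < min_count:
        return None
    return median(values)


# Made-up ratings with a lowered threshold, only to show the behaviour.
print(median_if_popular([1.0, 5.0, 3.0], min_count=3))
print(median_if_popular([1.0, 5.0, 3.0], min_count=4))
```

It could be applied with ratings_rdd.map(lambda r: (r["movieId"], r["rating"])).groupByKey().mapValues(median_if_popular), then filtered to drop the None values before taking the top 20. groupByKey holds all the ratings of one movie in memory at once, which is a trade-off to keep in mind.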

In [None]:
# When you're done with the session you created, stop it
spark.stop()