# Exploring the MovieLens dataset with the Spark RDD API

In [2]:
%matplotlib inline

import urllib.request
import zipfile
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import pyspark
from pyspark.sql import SparkSession

### Downloading the dataset

In [4]:
url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
filehandle, _ = urllib.request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
zip_file_object.namelist()

zip_file_object.extractall()

movies_path = "file:///databricks/driver/ml-20m/movies.csv"
ratings_path = "file:///databricks/driver/ml-20m/ratings.csv"

### Loading the data

In [6]:
ss = SparkSession.builder \
    .master("local[*]")  \
    .appName('movielens-rdd') \
    .getOrCreate()

sc = ss.sparkContext
sc

We load the data with the Spark DataFrame API.  
Loading a CSV file with the RDD API is not supported out of the box and is painful to implement.

In [8]:
# Use the session created above (`ss`); no variable named `spark` is defined in this notebook
ratings_df = ss.read.options(header=True, inferSchema=True).csv(ratings_path)

Did you notice that this created a job in the Spark UI? Isn't Spark supposed to be lazy until an action is requested?  
Rerun the same command with inferSchema=False and compare the two schemas with ratings_df.printSchema(). Can you work out why Spark triggered a job and what it was for?

In [10]:
ratings_df.take(1)

In [11]:
ratings_rdd = ratings_df.rdd.map(lambda x: x.asDict())

In [12]:
ratings_rdd.take(10)

In [13]:
# Use the session created above (`ss`); no variable named `spark` is defined in this notebook
movies_df = ss.read.options(header=True, inferSchema=True).csv(movies_path)

In [14]:
movies_rdd = movies_df.rdd.map(lambda x: x.asDict())

In [15]:
movies_rdd.take(10)

The ratings RDD is fairly large (a query on it takes about 2 minutes in a Docker container with two cores). You can develop and debug your job on a smaller version of it, then run it on the full RDD to get the result.  
Why do we persist the small RDD and not the full one?

In [17]:
ratings_small_rdd = ratings_rdd.filter(lambda x: x['userId'] < 20000).persist(pyspark.StorageLevel.DISK_ONLY)

You may run into issues with the persist command. When sampling a dataset for analysis, it is good practice to write the sample out once as a new dataset, which speeds up later analyses.

In [19]:
sampled_path = "file:///databricks/driver/ml-20m/sampled_ratings.csv"
ratings_df.sample(fraction=0.1).write.format("csv").save(sampled_path, mode="overwrite")

In [20]:
# Use the session created above (`ss`); no variable named `spark` is defined in this notebook
ratings_small_df = ss.read.options(header=True, inferSchema=True).csv(sampled_path)
ratings_small_rdd = ratings_small_df.rdd.map(lambda x: x.asDict())

### Q1. How many ratings?

### Q2. How many users?

Read the documentation for the distinct function in the RDD API.  
Where is the userId column? What happened during the sampling? Check the 'save' API for a solution.  
Can you compute it without using distinct?

### Q3. How many ratings per grade?

How many users rated a movie with grade r for r in [0,5]?    
Plot it. Do you notice something unusual?

### Q4. Histogram of number of ratings per user

Plot the distribution of the number of movies rated per user. In other words, for the bins defined below, what fraction of users rated between bins[i] and bins[i+1] movies?  
What are the mean and the median number of ratings per user?

In [25]:
bins = np.unique(np.logspace(0, 160, base=1.05, num=50, dtype='int32'))
bins
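Once the number of ratings per user has been collected to the driver, turning it into per-bin fractions is a plain NumPy step. The sketch below shows only that step. The array `ratings_per_user` holds made-up toy values (an assumption for illustration, not numbers from the dataset), and computing the real counts with the RDD API is left to the exercise.

```python
import numpy as np

# Same bin edges as in the cell above.
bins = np.unique(np.logspace(0, 160, base=1.05, num=50, dtype='int32'))

# Hypothetical per-user rating counts (toy values, not taken from the dataset).
ratings_per_user = np.array([1, 3, 3, 20, 25, 150, 400, 2000])

# counts[i] is the number of users n with bins[i] <= n < bins[i+1]
# (the last bin also includes its right edge).
counts, edges = np.histogram(ratings_per_user, bins=bins)

# Fraction of users falling in each bin.
fractions = counts / counts.sum()

print(fractions.sum())
print(ratings_per_user.mean(), np.median(ratings_per_user))
```

Values that fall outside the range covered by `bins` are silently dropped by `np.histogram`, so it is worth checking that `counts.sum()` matches the number of users.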

### Q5. Most popular movies

What are the 20 movies with the most ratings?  
We would like the answer with the movie title and not the movie id.  
Look at the documentation of the join and top functions.

### Q6. Writing partitioned datasets

The ratings dataset is available as one big csv file. It is not very convenient since we have to go through the entire file to look for ratings for a specific userId. Moreover, we cannot open only a small part of the dataset.  
Can you write the ratings dataset into 16 files located at /tmp/ratings/part=X/ratings.csv for X in [0, 16), where part=X holds the ratings whose userId satisfies userId % 16 == X? Your function should return the list of written files with the number of ratings in each file.  
Look at the documentation of partitionBy and mapPartitionsWithIndex.
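To make the target layout concrete, here is a minimal plain-Python sketch of the same bucketing rule applied to a handful of toy rows. It is not the Spark solution: the helper `write_partitioned`, the toy rows, and the temporary output directory (used instead of /tmp/ratings) are all assumptions made for illustration, and only non-empty partitions are written.

```python
import csv
import os
import tempfile

N_PARTS = 16

def write_partitioned(rows, root):
    """Write rows into root/part=X/ratings.csv with X = userId % N_PARTS.

    Returns a dict mapping each written file path to its number of ratings.
    """
    # Group the rows by their target partition.
    buckets = {}
    for row in rows:
        buckets.setdefault(row["userId"] % N_PARTS, []).append(row)

    written = {}
    for part, part_rows in sorted(buckets.items()):
        part_dir = os.path.join(root, f"part={part}")
        os.makedirs(part_dir, exist_ok=True)
        path = os.path.join(part_dir, "ratings.csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["userId", "movieId", "rating"])
            writer.writeheader()
            writer.writerows(part_rows)
        written[path] = len(part_rows)
    return written

# Toy rows (hypothetical values), written to a temporary directory.
toy_rows = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": 17, "movieId": 11, "rating": 3.5},
    {"userId": 2, "movieId": 10, "rating": 5.0},
]
out_dir = tempfile.mkdtemp()
result = write_partitioned(toy_rows, out_dir)
for path, n in result.items():
    print(path, n)
```

Users 1 and 17 land in the same partition because 17 % 16 == 1. In the Spark version, each partition should do this kind of write on its own records.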

### Q7. Most popular genre per year

For every year since 1980, determine the most popular genre.  
Look at the documentation of the flatMap function.
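As a hint on the shape of the data, the sketch below expands each movie into one (year, genre) pair per genre, which is the kind of one-to-many mapping flatMap is meant for. It runs on a plain Python list rather than an RDD, and it assumes the release year appears in parentheses at the end of the title and that genres are separated by `|`. The helper `year_genre_pairs` and the toy rows are hypothetical.

```python
import re
from itertools import chain

# Matches a four-digit year in parentheses at the end of a title, e.g. "(1995)".
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")

def year_genre_pairs(movie):
    """Return one (year, genre) pair per genre, or [] if the title has no year."""
    match = YEAR_RE.search(movie["title"])
    if match is None:
        return []
    year = int(match.group(1))
    return [(year, genre) for genre in movie["genres"].split("|")]

# Toy rows in the shape of movies.csv (hypothetical titles).
movies = [
    {"movieId": 1, "title": "Movie A (1995)", "genres": "Adventure|Comedy"},
    {"movieId": 2, "title": "Movie B (1979)", "genres": "Drama"},
    {"movieId": 3, "title": "Movie C", "genres": "Comedy"},
]

# Plain-list equivalent of movies_rdd.flatMap(year_genre_pairs).
pairs = list(chain.from_iterable(year_genre_pairs(m) for m in movies))
print(pairs)
```

Returning an empty list for a row drops it from the result, which is a convenient way to skip titles without a parsable year.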

### Q8. Best movies

Among the movies with at least 1000 ratings, what are the top 20 by median rating?

In [30]:
ss.stop()