# Practical Session 2: Introduction to Spark DataFrames and Spark SQL

In this session we use the MovieLens dataset to introduce the essential features of the Spark DataFrame API. The tutorial also links to the relevant Spark documentation and other reference material.

In [0]:
import matplotlib.pyplot as plt

In [0]:
%matplotlib inline

## Downloading and unzipping the data (run only once!)

In [0]:
import urllib.request  # the request submodule must be imported explicitly
import zipfile

url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
filehandle, _ = urllib.request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
zip_file_object.namelist()

In [0]:
zip_file_object.extractall()

## Part 1: Spark DataFrames essentials


### Reading data

In [0]:
movies_path = "file:///databricks/driver/ml-20m/movies.csv"
ratings_path = "file:///databricks/driver/ml-20m/ratings.csv"

We read the CSV files using [`spark.read`](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html).

In [0]:
movies_df = spark.read.options(header=True).csv(movies_path)
ratings_df = spark.read.options(header=True).csv(ratings_path).sample(0.01)

We cache the DataFrames so that subsequent computations do not re-read the CSV files.

In [0]:
movies_df.cache()
ratings_df.cache()

We then print a few rows from each dataframe.

In [0]:
movies_df.show(5)

In [0]:
ratings_df.show(5)

### Manipulating data

In [0]:
movies_df.select("title").show(5)

In [0]:
ratings_df.filter("rating=5").show(5)

In [0]:
ratings_df.groupby("userId").agg({"movieId": "count"}).show(5)

In [0]:
ratings_df.withColumn("is_rating_high", ratings_df["rating"] >= 4).show(5)

In [0]:
ratings_df.withColumn("is_rating_low", ratings_df.rating < 4).show(5)

In [0]:
ratings_df.withColumnRenamed("rating", "note").show(5)

### Built-in transformations and aggregations

The built-in functions are listed in the [`pyspark.sql.functions` module reference](https://spark.apache.org/docs/2.3.0/api/python/pyspark.sql.html#module-pyspark.sql.functions).

In [0]:
import pyspark.sql.functions as F

In [0]:
ratings_df.select(F.avg("rating"), F.min("rating"), F.max("rating")).show()

### Joining DataFrames

In [0]:
ratings_df.join(movies_df, "movieId").show(5)

### User-defined functions (UDFs)

In [0]:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def length(string: str):
    return len(string)

length_udf = udf(length, LongType())

In [0]:
movies_df.select(length_udf("title")).show(5)

In [0]:
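# len() returns an integer, so declare a numeric return type;
# max/min on a string column would not compare numerically.
@udf("long")
def length2(string: str) -> int:
    return len(string)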

In [0]:
movies_df.select(length2("title")).show(5)

In [0]:
title_lengths = movies_df.select(length2("title").alias("title_length"))

In [0]:
title_lengths.select(F.max("title_length")).show()

In [0]:
title_lengths.select(F.min("title_length")).show()

### Query plan inspection and caching

In [0]:
title_lengths.select(F.max("title_length")).explain()

In [0]:
title_lengths.select(F.min("title_length")).explain()

In [0]:
title_lengths.cache()
title_lengths.select(F.max("title_length")).show()

In [0]:
title_lengths.select(F.min("title_length")).explain()

### Writing CSV and Parquet

In [0]:
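# Write next to the extracted data so that os.listdir("ml-20m") below shows the output.
movies_df.sample(0.1).write.csv("file:///databricks/driver/ml-20m/movies-sample.csv")

In [0]:
movies_df.sample(0.1).write.mode("overwrite").csv("file:///databricks/driver/ml-20m/movies-sample.csv")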

In [0]:
import os
os.listdir("ml-20m")

This command writes a DataFrame in Parquet format:

In [0]:
spark.read.options(header=True).csv(ratings_path).write.parquet("file:///databricks/driver/ml-20m/ratings.parquet")

### Question 0: Compare the processing time and the number of executors used when reading from CSV versus reading from Parquet, for the following pipelines:
- count the total number of records
- count the total number of records for user 1
- count the distinct timestamps

**hint** The `countDistinct` function can be used for the third pipeline.
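
One possible way to measure processing time is a small wall-clock helper like the sketch below. The name `timed` is hypothetical, and the workload shown is a plain Python stand-in rather than a Spark job.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook one would pass a Spark action instead,
# for example: timed(lambda: some_df.count())
total, seconds = timed(lambda: sum(range(1_000_000)))
print(f"result={total}, elapsed={seconds:.3f}s")
```

Because Spark transformations are lazy, only an action such as `count()` triggers the work being timed. A DataFrame that is already cached will also skew the comparison.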

### Question 1: Compute the (average, max, min) rating per movie, and find the highest- and lowest-rated movies.

**hint** A straightforward `groupBy` followed by an aggregation.

### Question 2: Amongst movies that were rated by at least 20 users, which movies have the highest and lowest rating standard deviation?

**hint** How do you use a join to keep only a subset of movies?

### Question 3: Compute the (average, max, min) rating per genre and find the highest- and lowest-rated genres, as well as the genres with the highest rating standard deviation.

**hint** How can you extract the individual genres from the genres column? How do you use a custom function to do this?
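
The `genres` column packs several genres into one string separated by `|`. A minimal sketch of a splitting function follows; the name `split_genres` is hypothetical, and the function is plain Python so that it can be tested outside Spark before being wrapped in a UDF.

```python
def split_genres(genres):
    """Split a pipe-separated genres string into a list of genre names."""
    if genres is None:
        return []
    # Drop empty pieces in case of a stray separator.
    return [genre for genre in genres.split("|") if genre]


example = split_genres("Adventure|Animation|Children")
print(example)
```

Such a function could be registered with `udf(split_genres, ArrayType(StringType()))` and combined with `F.explode` to get one row per genre. The built-in `F.split` is an alternative, but its pattern is a regular expression, so the pipe has to be escaped.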

### Question 4: Extract the year from the title and compute the average rating per year (for years in which more than 10 movies came out). How does this quantity evolve?

**hint** The year can be extracted from the title with a regular expression.
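
MovieLens titles usually end with the release year in parentheses, as in `Toy Story (1995)`. The sketch below shows one possible pattern in plain Python; the pattern and the name `extract_year` are assumptions, and titles without a trailing year return `None`.

```python
import re

# Four digits in parentheses at the end of the title, allowing trailing spaces.
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing four-digit year of a title as an int, or None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))
print(extract_year("A title without a year"))
```

In Spark, the same pattern can be passed to the built-in `F.regexp_extract`, which avoids a Python UDF; it returns an empty string when nothing matches.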

### Question 5: What are the top 3 genres per year?

**hint** See the answer at https://stackoverflow.com/questions/38397796/retrieve-top-n-in-each-group-of-a-dataframe-in-pyspark

### Question 6: Which title words co-occur the most with each genre? Is the number of co-occurrences enough? Compute the [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) between genres and movie title words, and filter out words that appear fewer than 100 times.
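
For reference, the pointwise mutual information of a word `w` and a genre `g` is `log(p(w, g) / (p(w) * p(g)))`. The sketch below computes it in plain Python on a handful of made-up (word, genre) pairs, with probabilities estimated as relative frequencies; the data and names are hypothetical, and the Spark version is left as the exercise.

```python
import math
from collections import Counter

# Made-up (title word, genre) co-occurrences, one entry per observed pair.
pairs = [
    ("love", "Romance"), ("love", "Romance"), ("love", "Drama"),
    ("war", "Drama"), ("war", "Drama"), ("war", "Action"),
]

pair_counts = Counter(pairs)
word_counts = Counter(word for word, _ in pairs)
genre_counts = Counter(genre for _, genre in pairs)
total = len(pairs)


def pmi(word, genre):
    """PMI of an observed (word, genre) pair, using relative frequencies."""
    p_joint = pair_counts[(word, genre)] / total
    p_word = word_counts[word] / total
    p_genre = genre_counts[genre] / total
    # Only valid for pairs seen at least once: log(0) is undefined.
    return math.log(p_joint / (p_word * p_genre))


print(pmi("love", "Romance"))  # positive: the pair is seen more often than chance
print(pmi("love", "Drama"))    # negative: the pair is seen less often than chance
```

Raw co-occurrence counts favour frequent words and frequent genres, whereas PMI compares the joint frequency with what independence would predict. Rare words get unstable estimates, which is why the question asks for a minimum count.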