# Light EDA on MovieLens

## 1. MovieLens datasets: load & access

Spark lets you explore data of any structure, from many different data sources and formats.

To load the data, upload the files in the data section of the left pane.
**Please only upload the smallest dataset.**

You then need the file paths to access the data from Spark.

In [None]:
from pyspark import SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

movies_path = "../data/movies.csv"
ratings_path = "../data/ratings.csv"

Print the first record of the movies.csv and ratings.csv datasets.

In [None]:
# Your code below

The files are in CSV (comma-separated values) format and are well documented (see the links below), but we will assume that we do not know the structure at all.

- Small dataset documentation: http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html
- Big dataset documentation: http://files.grouplens.org/datasets/movielens/ml-latest-README.html

We will use two files from this MovieLens dataset: *ratings.csv* and *movies.csv*. All ratings are contained in the file *ratings.csv* and are in the following format:
```
userId,movieId,rating,timestamp
```
Movie information is in the file *movies.csv*, in the following format:
```
movieId,title,genres
```

Now that you are able to access the data, let's explore Spark's features.

As you probably know, any Spark application needs a SparkSession to submit jobs to a cluster of executors. In this managed environment you were given a free-trial Spark cluster, and a SparkSession is already available as **spark**.

Refer to the Spark Python API documentation to learn which methods you can call on a SparkSession object.

In [None]:
# Display Spark version used in this notebook
# Your code below


In [None]:
# 1. Load the 2 datasets
# Your code below

# 2. The 2 datasets will be reused several times. What could we do to avoid re-reading the files?

That was fast, but remember that nothing has really happened yet. Spark has only started building an execution plan and is waiting for you to call an action before executing anything. The DataFrames are nevertheless ready to be analyzed.

## 2. Spark basics
Let's discover Spark through simple commands first. Let's say we know nothing about the datasets we just loaded. The data could be unstructured, semi-structured or structured, in any format. Spark does not really care: the **read.text()** method lets you load the files into DataFrames, in which each line of a file becomes one element (row).

From this chapter on, you will find some exercises. The places where you have to put code are marked with **#TODO: explanation**.

The first things you want to know are what is in your dataset, how many elements it has, what its structure is, what the attribute types are, etc.

In [None]:
# Display schema
# Your code below

# Show first items
# Your code below


The *movies* DataFrame seems to be in CSV format, and it is good to know that there is a header.

But to understand the data types, you probably want to get more lines. Use the Spark Python API documentation to find out how to retrieve 10 lines from both datasets *ratings* and *movies*.

Note that you probably don't want to retrieve **all lines**. In distributed computation the dataset could be huge, and it is usually a bad idea to pull all the data from executors running on hundreds of machines to the driver, which runs on a single machine.

In [None]:
# Exercise 1: get 10 elements from every dataset
rats = # TODO: get 10 elements
print("--------\nRatings:\n--------")
print(rats)

movs = # TODO: get 10 elements
print("\n--------\nMovies:\n--------")
print(movs)

We notice that the ratings elements are strings of comma-separated values. The values are integers or floats.

As for movies, the elements are also strings of comma-separated values. The values are strings, possibly containing pipe-separated values (for genres).

In [None]:
# Exercise 2: print the number of elements in every dataset
# Your code below


The biggest dataset has 22M+ elements. If we experience computing delays, we may prefer to work on the smaller dataset.

When you work in Spark with data from an input file, you usually start with this kind of DataFrame of *rows*, one per line of the input file. But the input file probably has a structure, or specific elements that you want to extract, in order to give your DataFrame a structure. For example, the ratings CSV file has four attributes: userId, movieId, rating and timestamp. A DataFrame loaded with **read.text()** does not know this structure, but you can give it one by splitting the lines on the comma separator.

Prepare the DataFrames by extracting the different fields and removing the header row and the timestamp field. You can also cast the fields to integer and float. Start with the small dataset, and check the final DataFrame with **first()** or **take()**. Note that in PySpark **map()** is a method of the underlying RDD (**df.rdd.map()**), not of the DataFrame itself: it applies a function to every element and returns a new RDD. On DataFrames, the same result is usually obtained with **select()** or **withColumn()** and column functions such as **split()** and **cast()**.
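The per-line logic (split, cast, drop the timestamp, skip the header) can be sketched in plain Python before moving it to Spark. The function names and the sample lines below are hypothetical. The `csv` module is used for movies on the assumption that titles containing a comma are quoted in the file.

```python
import csv


def parse_rating(line):
    """Turn 'userId,movieId,rating,timestamp' into (int, int, float), dropping the timestamp."""
    user_id, movie_id, rating, _timestamp = line.split(",")
    return int(user_id), int(movie_id), float(rating)


def parse_movie(line):
    """Turn 'movieId,title,genres' into (int, str, list of genres)."""
    # csv.reader copes with quoted titles that contain commas; a bare split(",") would not.
    movie_id, title, genres = next(csv.reader([line]))
    return int(movie_id), title, genres.split("|")


# Hypothetical sample lines, header first, as they would appear in the raw files.
rating_lines = ["userId,movieId,rating,timestamp", "1,31,2.5,1260759144"]
movie_lines = ["movieId,title,genres", '11,"American President, The (1995)",Comedy|Drama|Romance']

# Skip the header by filtering out every line equal to the first one.
ratings_parsed = [parse_rating(l) for l in rating_lines if l != rating_lines[0]]
movies_parsed = [parse_movie(l) for l in movie_lines if l != movie_lines[0]]
print(ratings_parsed)
print(movies_parsed)
```

In Spark, functions like these could be applied to every row of a text DataFrame through its RDD, for example `lines.rdd.map(lambda row: parse_rating(row["value"]))` once the header has been filtered out.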

In [None]:
# Exercise 3: prepare the DataFrames
# Movies
# Your code below


In [None]:
# Ratings
# Your code below


In [None]:
ratings_df.printSchema()

We had a *ratings* DataFrame of strings representing lines in our input file.

We now have a *ratings_df* DataFrame of (integer, integer, float).

With those DataFrames, it will be easier to answer the following exercises. In fact, it would be even easier with SQL (Structured Query Language), which the DataFrame abstraction supports. We will do that later.

In [None]:
# Exercise 4: how many different users are there in the dataset, and how many movies have been rated?
num_users = ## TODO ##

num_movies = ## TODO ##

In [None]:
# Exercise 5: what are the maximum rating and the minimum rating that appear in the dataset?
# Your code below

In [None]:
# Exercise 6: give the full distribution of the ratings, i.e. the number of occurrences of each rating; the WordCount example can help
# Your code below

In the previous code, it is important to understand where the code executes. You should take advantage of your Spark cluster's power whenever possible and only manipulate small datasets on the driver, which is a single machine.

Notice that the distribution of the ratings is not uniform. We can plot it with Matplotlib.

In [None]:
# Standard PySpark DataFrames have no display() method; show() prints the rows
ratings_distribution.orderBy("rating").show()

In [None]:
# %matplotlib inline
import pandas

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

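# This line needs to be changed with your variable name
# Sort by rating so the bars appear in rating order (groupBy does not guarantee an order)
distribution_pandas = ratings_distribution.toPandas().sort_values("rating")

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(16, 12))
# Plot on the axes created above, using the ratings as x tick labels
distribution_pandas.plot(kind="bar", x="rating", y="count", ax=ax, legend=False)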

display(fig)