# Practical Session 3: Association rule mining

In this session, we will build a first set of algorithms to infer rules from a dataset. This is known as association rule mining. For example, people who buy potatoes and bread are likely to be making burgers, and are therefore probably also interested in salad and steaks.
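To make the idea of a rule concrete, here is a minimal sketch of the two usual measures, support and confidence, computed in plain Python. The baskets are made up for illustration, and the helper names `support` and `confidence` are assumptions of this sketch, not part of the session's code.

```python
# Toy baskets, invented for illustration (not taken from MovieLens).
baskets = [
    {"potatoes", "bread", "steak"},
    {"potatoes", "bread", "salad", "steak"},
    {"bread", "salad"},
    {"potatoes", "bread", "salad"},
    {"milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) over the baskets."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Rule under test: {potatoes, bread} -> {salad}
rule_support = support({"potatoes", "bread", "salad"}, baskets)
rule_confidence = confidence({"potatoes", "bread"}, {"salad"}, baskets)
print(rule_support, rule_confidence)
```

A rule is usually kept only when both values exceed thresholds chosen by the analyst.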

In [2]:
import matplotlib.pyplot as plt

In [3]:
%matplotlib inline

## Downloading and unzipping the data (MovieLens)

In [5]:
import urllib.request  # a bare `import urllib` does not reliably expose urllib.request
import zipfile

url = 'http://files.grouplens.org/datasets/movielens/ml-20m.zip'
filehandle, _ = urllib.request.urlretrieve(url, '/tmp/data.zip')
zip_file_object = zipfile.ZipFile(filehandle, 'r')
zip_file_object.namelist()
zip_file_object.extractall()

## Reading data

In [8]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Dataset") \
    .getOrCreate()


In [9]:
movies_path = "file:///databricks/driver/ml-20m/movies.csv"
ratings_path = "file:///databricks/driver/ml-20m/ratings.csv"

We read the CSV files using [`spark.read`](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html).

In [11]:
from pyspark.sql import functions as sf  # needed by sf.expr in the filter below

movies_df = spark.read.options(header=True).csv(movies_path)
# TASK 1: explain what the filter below does.  Why did we not use sample instead?
ratings_df = spark.read.options(header=True).csv(ratings_path).filter(sf.expr('PMOD(HASH(userId),10)')==0)

We cache the dataframes to avoid reloading them in subsequent computations.

In [13]:
movies_df.cache()
ratings_df.cache()

We then print a few rows from each dataframe.

In [15]:
movies_df.show(5)

In [16]:
ratings_df.show(5)

In [17]:
ratings_df.select('movieId').distinct().count()

In [18]:
ratings_df.select('userId').distinct().count()

In [19]:
from pyspark.sql.window import Window
from pyspark.sql import functions as sf
import pyspark

In [20]:
# TASK 2: filter ratings to keep only the latest 100 ratings per user
# Hint: create a new column called tm_rank that ranks each user's ratings by timestamp, then by hash of movie id.
#       Use sf.rank(), sf.struct() and sf.hash() to create this column, then filter() to filter the data.
#       do not forget to drop this new column once you are done.
lim_ratings_df = ...

In [21]:
lim_ratings_df.show(5)

In [22]:
# sanity check: this should return a min of 20 and a max of 100
lim_ratings_df\
    .groupby('userId')\
    .agg(sf.count('*').alias('num_ratings'))\
    .agg(sf.min('num_ratings'), sf.max('num_ratings'))\
    .toPandas()

### 1. Naive approach: Find recurring pairs & triplets
This approach is simple but inefficient. It gives you a baseline and some intuition for the next steps.
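As a plain-Python illustration of what "naive" means here (a sketch on invented data, not the Spark solution expected in the tasks), one can enumerate every combination of movies rated by each user and count them. The dictionary `user_movies` and the helper `count_itemsets` are hypothetical names.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: user -> set of movies that the user rated.
user_movies = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"A", "C"},
    "u4": {"B", "C", "D"},
}

def count_itemsets(user_movies, k):
    """Count, for every k-combination of movies, how many users rated all of them."""
    counts = Counter()
    for movies in user_movies.values():
        # Sorting gives each combination a single canonical ordering.
        for combo in combinations(sorted(movies), k):
            counts[combo] += 1
    return counts

pair_counts = count_itemsets(user_movies, 2)
triplet_counts = count_itemsets(user_movies, 3)
print(pair_counts.most_common(3))
print(triplet_counts.most_common(3))
```

The cost grows quickly: a user with 100 ratings contributes 4,950 pairs and 161,700 triplets, which is why this approach only serves as a baseline.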

In [24]:
import pyspark.sql.functions as F
from pyspark.sql.functions import udf

In [25]:
# TASK 3: find recurring pairs with a naive approach, then show the top 25 results
# Remember to use the title field to make the results interpretable
# Also, make sure to work with lim_ratings_df, not ratings_df!



In [26]:
all_movie_pairs_df.count()

In [27]:
# TASK 4: find recurring triplets and show the top 25 results.

### 2. Second approach: Apriori
Implement your own version of the Apriori algorithm. You may use resources from the web.
https://fr.wikipedia.org/wiki/Algorithme_APriori
https://www.hackerearth.com/blog/developers/beginners-tutorial-apriori-algorithm-data-mining-r-implementation
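For orientation, here is a minimal single-machine sketch of the level-wise idea behind Apriori: a k-itemset can only be frequent if all of its (k-1)-subsets are frequent, so candidates are built only from the previous level. It runs on invented baskets and uses an absolute count threshold; the function name `apriori` and its parameters are assumptions of this sketch, and a Spark implementation would be organised differently.

```python
from itertools import combinations

def apriori(baskets, min_count, max_size=3):
    """Return {frozenset: count} for all itemsets seen in at least `min_count` baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count single items.
    counts = {}
    for basket in baskets:
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent and k <= max_size:
        previous = set(frequent)
        # Candidate generation: join two frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a in previous:
            for b in previous:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in previous for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Counting pass restricted to the surviving candidates.
        counts = {c: sum(c <= basket for basket in baskets) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

# Invented baskets for illustration.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D"},
    {"A", "B", "C"},
]
frequent_itemsets = apriori(baskets, min_count=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Item `D` appears in a single basket, so it is discarded at level 1 and no pair or triplet containing it is ever counted.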

In [29]:
# TASK 5: implement the Apriori approach to find recurring pairs and triplets more efficiently

In [5]:
# report the execution time of the different approaches in your notebook

### 3. FP-growth
https://www.softwaretestinghelp.com/fp-growth-algorithm-data-mining/
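The data structure at the heart of FP-growth is the FP-tree, a prefix tree over baskets whose items are sorted by decreasing frequency. The sketch below, in plain Python on invented baskets, covers only the first half of the algorithm: building the tree and reading off the prefix paths (the conditional pattern base) for one item. The recursive mining step is left out, and the names `FPNode`, `build_fp_tree` and `conditional_pattern_base` are assumptions of this sketch.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, a count, and links to parent and children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(baskets, min_count):
    # Pass 1: count items and keep only the frequent ones.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {i: c for i, c in item_counts.items() if c >= min_count}

    root = FPNode(None)
    header = {item: [] for item in frequent}  # item -> all tree nodes carrying it

    # Pass 2: insert each basket with its frequent items sorted by decreasing
    # frequency (ties broken by name), so that common prefixes share nodes.
    for basket in baskets:
        items = sorted(
            (i for i in set(basket) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths leading to `item`, each paired with the count of that `item` node."""
    paths = []
    for node in header[item]:
        path = []
        parent = node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        paths.append((path[::-1], node.count))
    return paths

# Invented baskets for illustration.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D"},
    {"A", "B", "C"},
]
root, header = build_fp_tree(baskets, min_count=2)
print(conditional_pattern_base("C", header))
```

Both passes read the data once each, and no candidate itemsets are generated: frequent itemsets ending in an item are mined from that item's prefix paths.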

In [6]:
# TASK 6: explain how FP-growth works in a few lines (in English or French!)

In [7]:
# TASK 7: use FP-growth from Spark MLlib to generate rules

### 4. Validation
Build a validation dataset to debug your code (naive, Apriori, FP-growth).
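One possible shape for such a dataset, given here as a sketch with invented users and movies, is a handful of rows small enough that the expected pair and triplet counts can be worked out by hand. The names `validation_rows`, `expected_pairs`, `expected_triplets` and `check_counts` are hypothetical.

```python
# Hypothetical toy ratings as (userId, movieId) rows, small enough to verify by hand.
validation_rows = [
    ("u1", "m1"), ("u1", "m2"), ("u1", "m3"),
    ("u2", "m1"), ("u2", "m2"),
    ("u3", "m1"), ("u3", "m2"), ("u3", "m3"),
    ("u4", "m3"), ("u4", "m4"),
]

# Expected co-occurrence counts, derived by hand from the rows above.
expected_pairs = {
    ("m1", "m2"): 3,  # u1, u2, u3
    ("m1", "m3"): 2,  # u1, u3
    ("m2", "m3"): 2,  # u1, u3
    ("m3", "m4"): 1,  # u4
}
expected_triplets = {
    ("m1", "m2", "m3"): 2,  # u1, u3
}

def check_counts(found, expected):
    """Return the list of discrepancies between two {itemset: count} mappings."""
    problems = []
    for itemset in sorted(set(found) | set(expected)):
        if found.get(itemset, 0) != expected.get(itemset, 0):
            problems.append((itemset, found.get(itemset, 0), expected.get(itemset, 0)))
    return problems

# An output with one wrong count is reported as a single discrepancy.
print(check_counts({("m1", "m2"): 2}, {("m1", "m2"): 3}))
```

The rows can be loaded into Spark with `spark.createDataFrame(validation_rows, ["userId", "movieId"])`, and the output of each approach can then be converted to a dictionary keyed by sorted tuples before calling `check_counts`.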

In [8]:
# TASK 8: implement your validation dataset here and run your naive, Apriori and FP-growth code. Report running times.
lim_ratings_df = ...

### Grading & instructions

You must return your notebook before **Monday Feb 10th midnight Paris time** by email to Amine, Marc and Olivier.

The grade will be based on:
1. Timely return
2. Correctness (how you built and used your validation dataset)
3. Readability
4. Performance (this is not a race but we want to see that you compared the running time of your three algorithms)