# Assignment 3: Association Analysis

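To read Excel files, you might need to install an Excel reader package. For `.xlsx` files such as the one used here, recent versions of pandas rely on `openpyxl` (`xlrd` 2.0 and later reads only the older `.xls` format). You can install it using something like:

    conda activate myEnvironment # where myEnvironment is the conda environment you use for this module
    conda install openpyxl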


You may find the following useful to obtain the data from the UCI data repository, and to read it into a dataframe.

In [None]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

import os
import requests

# Alternative dataset (CSV): uncomment and point dataFile/url at these to use it
#csvUrl = "http://www.sci.csueastbay.edu/~esuess/classes/Statistics_6620/Presentations/ml13/groceries.csv"
#csvFile = 'data/groceries.csv'
xlUrl = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
xlFile = 'data/Online Retail.xlsx'
dataFile = xlFile
url = xlUrl

# Download the file once; later runs reuse the local copy
os.makedirs('data', exist_ok=True)
if not os.path.isfile(dataFile):
    r = requests.get(url)
    r.raise_for_status()  # stop here rather than saving an error page as data
    with open(dataFile, 'wb') as f:
        f.write(r.content)

# Use the reader that matches the file type
if dataFile == xlFile:
    df = pd.read_excel(dataFile)
else:
    df = pd.read_csv(dataFile)
df.head()

__Task 1.1__: Select the transactions arising from the `Country` having _9042_ records in the dataframe and convert them to the OneHotEncoded form, where each column has (0,1) values representing the (absence,presence) of that product in a given basket, where each basket (row) is labeled by its `InvoiceNo`.

_Hints_
1. Use `groupby` and `size()` to determine the number of rows per `Country`.
2. Use `groupby` and `sum()` on the `Quantity` to get the total quantity of each product per invoice (0 or a positive integer), and `reset_index()` so that the rows are labeled by `InvoiceNo`. Remember to set any positive numbers to 1 rather than keeping a frequency count.
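The hints above can be sketched on a tiny made-up dataframe. The column names mirror those of the retail data, but the rows, the country labels `X` and `Y`, and the use of `unstack` to pivot products into columns are assumptions for illustration; this is one possible route, not the required one.

```python
import pandas as pd

# Toy stand-in for the retail data; all values are made up for illustration.
toy = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A1", "B2", "B2", "C3"],
    "Description": ["mug", "plate", "mug", "mug", "spoon", "plate"],
    "Quantity":    [2, 1, 3, 1, 4, 6],
    "Country":     ["X", "X", "X", "X", "X", "Y"],
})

# Number of rows per country
rows_per_country = toy.groupby("Country").size()

# Keep one country, total the quantity per (invoice, product),
# then pivot the products into columns with 0 where a product is absent.
selected = toy[toy["Country"] == "X"]
basket = (
    selected.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()
    .unstack(fill_value=0)
)

# Any positive total becomes 1; zero (or negative) totals become 0.
onehot = (basket > 0).astype(int)
print(onehot)
```

Each row of `onehot` is one basket labeled by its `InvoiceNo`, and each column is one product.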

__Task 1.2__: Use mlxtend's `apriori` function to find the frequent itemsets where the minimum support threshold is set to 0.02. Hence derive the association rules where the minimum lift threshold is 1.
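Before calling `apriori` (with its `min_support` parameter) and `association_rules` (with its `metric` and `min_threshold` parameters), it may help to see what the measures mean. The following sketch computes support, confidence and lift by hand for a single rule; the baskets and the helper `support` are made up for illustration and do not come from mlxtend.

```python
import pandas as pd

# Made-up one-hot baskets: rows are invoices, columns are products.
onehot = pd.DataFrame(
    {"mug": [1, 1, 1, 0], "plate": [1, 1, 0, 0], "spoon": [0, 1, 1, 1]},
    index=["A1", "B2", "C3", "D4"],
).astype(bool)

def support(items):
    """Fraction of baskets that contain every product in `items`."""
    return onehot[list(items)].all(axis=1).mean()

# Measures for the rule {mug} -> {plate}
s_rule = support(["mug", "plate"])        # both products together
confidence = s_rule / support(["mug"])    # P(plate | mug)
lift = confidence / support(["plate"])    # confidence relative to the plate baseline

print(f"support={s_rule:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 means the consequent appears more often in baskets containing the antecedent than it does overall, which is why 1 is a natural minimum threshold.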

__Task 1.3__: Defining the _rule length_ to be the total number of products in the rule, plot the distribution of association rules by rule length and explain why the distribution looks like it does. Choosing the _longest_ rules, find the most attractive rule for use when recommending a (set of) products to a customer. Explain why reversing the rule might not be as effective.

_Hints_
1. The rule length is the sum of the lengths of the `antecedents` and `consequents` per rule.
2. Association rules can be used for recommendation to customers who have already bought the antecedent products and might be interested in buying the consequent products. Note that you will need to use (by sorting) and justify suitable association measures to choose the most attractive rule for recommendation purposes.
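The rule-length hint can be sketched as follows. The `rules` table here is written by hand with made-up values, in the shape you should expect from `association_rules` (frozenset columns named `antecedents` and `consequents`); the choice of sorting by lift and then confidence is only one option that you would still need to justify.

```python
import pandas as pd

# Hand-made rules table; the numbers are placeholders, not real results.
rules = pd.DataFrame({
    "antecedents": [frozenset({"mug"}), frozenset({"mug", "plate"}), frozenset({"spoon"})],
    "consequents": [frozenset({"plate"}), frozenset({"spoon"}), frozenset({"mug", "plate"})],
    "confidence":  [0.60, 0.75, 0.30],
    "lift":        [1.2, 1.5, 1.5],
})

# Rule length = products in the antecedent + products in the consequent
rules["rule_length"] = rules["antecedents"].apply(len) + rules["consequents"].apply(len)

# Number of rules per length (the data behind the distribution plot)
length_counts = rules["rule_length"].value_counts().sort_index()

# Keep the longest rules and rank them by the chosen measures
longest = rules[rules["rule_length"] == rules["rule_length"].max()]
ranked = longest.sort_values(["lift", "confidence"], ascending=False)
print(ranked[["antecedents", "consequents", "confidence", "lift"]])
```

Note that the last two toy rules are reverses of each other: they share the same lift, because lift is symmetric, but not the same confidence.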

## Recommender Systems

We use the well-known MovieLens dataset (in this case the small version). You may find the following useful to obtain the data from the GroupLens repository, and to read it into a dataframe.

In [None]:
import os
import zipfile

import numpy as np
import requests

# Other MovieLens sizes that can be substituted here
#mlSize = "ml-1m"
#mlSize = "ml-100k"
mlSize = "ml-latest-small"
zipUrl = 'http://files.grouplens.org/datasets/movielens/' + mlSize + '.zip'
zipFile = 'data/' + mlSize + '.zip'
dataFile = zipFile
url = zipUrl
dataDir = 'data'

# Download the archive once; later runs reuse the local copy
os.makedirs(dataDir, exist_ok=True)
if not os.path.isfile(zipFile):
    r = requests.get(zipUrl)
    r.raise_for_status()  # stop here rather than saving an error page as data
    with open(zipFile, 'wb') as f:
        f.write(r.content)

# Need to unzip the file to read its contents
with zipfile.ZipFile(zipFile, "r") as zip_ref:
    zip_ref.extractall(dataDir)


__Task 2.1__: Read the `users.dat`, `movies.dat` and `ratings.dat` data files into data frames.

_Hints_

1. You may find Pandas `read_csv` provides most of what you need, although you will need to override its default `sep` parameter.
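As a minimal sketch of overriding `sep`, the following reads two sample lines in a `::`-delimited layout from an in-memory buffer rather than from disk. The column names are assumptions, chosen to match the `MovieID` name used later in this notebook; check the README shipped with the data for the actual field order.

```python
import io
import pandas as pd

# Two sample lines in a "::"-delimited layout, held in memory for illustration.
sample = io.StringIO("1::1193::5::978300760\n1::661::3::978302109\n")

ratings = pd.read_csv(
    sample,
    sep="::",
    engine="python",  # separators longer than one character need the python engine
    header=None,      # the file has no header row
    names=["UserID", "MovieID", "Rating", "Timestamp"],  # assumed column names
)
print(ratings)
```

The same call should work on a file path in place of the `StringIO` buffer.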

__Task 2.2__: Generate the distribution of ratings (number of user-movie ratings, per rating value).

_Hints_

1. You may find that `value_counts()` helps to count the ratings.
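A small sketch of the counting step, using made-up ratings and an assumed column name `Rating`:

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.DataFrame({"Rating": [5, 3, 4, 5, 4, 5, 1]})

# Count how often each rating value occurs, ordered by rating value
distribution = ratings["Rating"].value_counts().sort_index()
print(distribution)
```

Sorting by index puts the rating values in ascending order, which is usually what you want on the x-axis of a bar chart.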

The following code can be used to filter the Movies by the number of ratings each has received. Choosing a large threshold (like 1750) ensures that only "blockbuster" movies with more than that number of ratings will be considered. This is convenient (much reduced runtimes!) when developing your solution, but a less stringent threshold should be used for the result you hand in (100 is suggested). You should also apply a similar filter to the Users.

To apply this filter to the ratings dataframe, you might find the `isin(filteredSet)` function useful.

In [None]:
minMovieRatings = 1750
#minMovieRatings = 100
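# Boolean Series indexed by MovieID: True where the movie has more than minMovieRatings ratings
filterMovies = ratingsDf['MovieID'].value_counts() > minMovieRatings
# Keep only the MovieIDs that passed the threshold
filterMovies = filterMovies[filterMovies].index.tolist()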
len(filterMovies)
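To show how such a list of IDs is then applied with `isin`, here is a sketch on a toy `ratingsDf`. The rows, the `UserID` and `Rating` column names, and the tiny threshold are assumptions made so that the example runs on its own.

```python
import pandas as pd

# Toy ratings table; values are made up for illustration.
ratingsDf = pd.DataFrame({
    "UserID":  [1, 1, 2, 2, 3, 3, 3],
    "MovieID": [10, 20, 10, 30, 10, 20, 40],
    "Rating":  [5, 3, 4, 2, 5, 4, 1],
})

minMovieRatings = 1  # tiny threshold to suit the toy data

# MovieIDs with more than minMovieRatings ratings
counts = ratingsDf["MovieID"].value_counts()
filterMovies = counts[counts > minMovieRatings].index.tolist()

# Keep only the ratings whose MovieID is in the filtered list
filteredDf = ratingsDf[ratingsDf["MovieID"].isin(filterMovies)]
print(filteredDf)
```

The same pattern, applied to the user ID column, would give the corresponding filter for Users.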

__Task 2.3__: Using the filtered ratings dataframe, count the ratings per User and plot this data in a histogram. You should do the same with the Movies and comment on the similarities and differences between the two distributions.

_Hint_

1. You might find the `groupby()` and `count()` functions suitable for generating the data you need. 

__Task 2.4__: Repeat Task 2.3 above, but deriving the average ratings rather than their counts.
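Tasks 2.3 and 2.4 both reduce to grouping the ratings and aggregating them. A minimal sketch on made-up data follows, with `UserID` and `Rating` as assumed column names; it uses `agg` to get the count and the mean in one pass, though separate `count()` and `mean()` calls would do equally well.

```python
import pandas as pd

# Toy ratings table; values are made up for illustration.
ratings = pd.DataFrame({
    "UserID":  [1, 1, 2, 2, 3, 3, 3],
    "MovieID": [10, 20, 10, 30, 10, 20, 30],
    "Rating":  [5, 3, 4, 2, 5, 4, 1],
})

# Number of ratings and average rating per user
per_user = ratings.groupby("UserID")["Rating"].agg(["count", "mean"])

# The same aggregation per movie
per_movie = ratings.groupby("MovieID")["Rating"].agg(["count", "mean"])

print(per_user)
print(per_movie)
```

Each resulting column is a Series, so it can be passed straight to a histogram routine such as the Series `hist()` method.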

__Task 2.5__: Load the (filtered) movie ratings data from the dataframe we have been exploring into the preferred 3-column format used by the `scikit-surprise` package. Now benchmark the performance (in terms of RMS error, time to fit, and time to generate predictions for test data) of the `SVD()`, `SlopeOne()`, `NMF()` and `KNNBasic()` recommendation algorithms. Discuss the strengths and weaknesses of each algorithm, based on its benchmarked results.

_Hints_

1. `scikit-surprise` provides the `Reader` and `Dataset` classes, which can be used to load its own dataset object from a dataframe.
2. `scikit-surprise` also provides a `cross_validate` function that can be used to estimate the test error using the requested error metric; its results also include the fit and test times.
3. When collecting the benchmark data, you might find it convenient to loop over the algorithms and to add the results for each algorithm as a row to your benchmark array or dataframe.
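The bookkeeping in these hints can be sketched without running any recommender. The code below puts a toy dataframe into user, item, rating column order, then collapses per-fold results into one row per algorithm. It assumes that `cross_validate` returns a dictionary with the keys `test_rmse`, `fit_time` and `test_time`, one value per fold. The algorithm names `AlgoA` and `AlgoB`, the helper `summarise`, and all numbers are placeholders, not real measurements.

```python
import numpy as np
import pandas as pd

# Toy ratings with the columns in an arbitrary order; values are made up.
ratings = pd.DataFrame({
    "MovieID":   [10, 20, 10],
    "Rating":    [5, 3, 4],
    "UserID":    [1, 1, 2],
    "Timestamp": [0, 1, 2],
})

# Three columns in user, item, rating order
three_col = ratings[["UserID", "MovieID", "Rating"]]

def summarise(name, cv):
    """Collapse per-fold results into a single benchmark row (hypothetical helper)."""
    return {
        "algorithm": name,
        "rmse": float(np.mean(cv["test_rmse"])),
        "fit_time": float(np.mean(cv["fit_time"])),
        "test_time": float(np.mean(cv["test_time"])),
    }

# Placeholder per-fold numbers, shaped like the assumed cross_validate output.
placeholder_results = {
    "AlgoA": {"test_rmse": [1.0, 1.2], "fit_time": (0.5, 0.7), "test_time": (0.1, 0.3)},
    "AlgoB": {"test_rmse": [0.9, 1.1], "fit_time": (2.0, 4.0), "test_time": (0.5, 1.5)},
}

# One row per algorithm, collected in a loop
rows = [summarise(name, cv) for name, cv in placeholder_results.items()]
benchmark = pd.DataFrame(rows).set_index("algorithm")
print(benchmark)
```

In your solution, the placeholder dictionary would be replaced by the result of calling `cross_validate` on each algorithm in turn.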