<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/6%20-%20Association%20Rules/AssociationRules_SOLUTIONS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Business Intelligence and Analytics Lab 
# Week 6 - Association Rules - Solutions

In this week's lab, we are going to revise the notions of support, confidence and lift, as well as mine association rules using the Python library [`mlxtend`](http://rasbt.github.io/mlxtend/). As usual, you can install this library using `pip`.

In [None]:
!pip install mlxtend

## Exercise 1

Before we start coding, let's practice with a toy example. Calculate support and confidence for the following association rules given the shopping transactions further below:

```
{Water} => {Juice}
    Support = 2/4 = 50%
    Confidence = 2/2 = 100%
    Lift = 1/1 = 1
    
{Juice} => {Water}
    Support = 2/4 = 50%
    Confidence = 2/4 = 50%
    Lift = 0.5/0.5 = 1
    
{Milk} => {Bread}
    Support = 1/4 = 25%
    Confidence = 1/1 = 100%
    Lift = 1/0.25 = 4
    
{Juice, Beer} => {Water}
    Support = 1/4 = 25%
    Confidence = 1/2 = 50%
    Lift = 0.5/0.5 = 1

```

<img src="https://github.com/michalis0/BigScaleAnalytics/blob/master/week5/img/association_rules.png?raw=1" width="300" style="float: left">

As a reminder, for the association rule `{S} => {i}`:

* Support = `# transactions containing S and i / total # transactions`
* Confidence = `# transactions containing S and i / # transactions containing S`
* Lift = `confidence / (# transactions containing i / total # transactions)`
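
To make the definitions concrete, here is a minimal sketch in plain Python. The transaction list is made up for illustration (it is not necessarily the one shown in the figure), and the helper names `support` and `rule_metrics` are hypothetical. Lift is taken here as confidence divided by the support of the consequent, which matches the worked values above.

```python
# Hypothetical transactions, made up for illustration
# (not necessarily the ones shown in the figure).
transactions = [
    {"Water", "Juice", "Beer"},
    {"Water", "Juice"},
    {"Juice", "Beer"},
    {"Juice", "Milk", "Bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return both, confidence, lift


print(rule_metrics({"Water"}, {"Juice"}, transactions))  # (0.5, 1.0, 1.0)
print(rule_metrics({"Milk"}, {"Bread"}, transactions))   # (0.25, 1.0, 4.0)
```

The sketch does not guard against an antecedent that never occurs, which would cause a division by zero.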

Let the support threshold (minsup) be 40% and the confidence threshold (minconf) be 75%, so that a rule is kept only if its support is at least 40% and its confidence is at least 75%.

Which of the four rules are the most interesting, and why? Do you think using only support and confidence is enough to identify a rule as interesting?
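
One way to approach the question is to apply the two thresholds to the worked values and then look at the lift of whatever survives. The sketch below does this with the numbers copied from the answers above. The variable names are illustrative only.

```python
# (support, confidence, lift) copied from the worked answers above.
rules = {
    "{Water} => {Juice}": (0.50, 1.00, 1.0),
    "{Juice} => {Water}": (0.50, 0.50, 1.0),
    "{Milk} => {Bread}": (0.25, 1.00, 4.0),
    "{Juice, Beer} => {Water}": (0.25, 0.50, 1.0),
}
MINSUP, MINCONF = 0.40, 0.75

# Keep the rules that meet both thresholds, together with their lift.
passing = {
    name: lift
    for name, (sup, conf, lift) in rules.items()
    if sup >= MINSUP and conf >= MINCONF
}
print(passing)  # {'{Water} => {Juice}': 1.0}
```

Only `{Water} => {Juice}` passes both thresholds, yet its lift is 1, which suggests that buying Water does not make Juice any more likely than it already is. The rule with the highest lift, `{Milk} => {Bread}`, is filtered out by the support threshold.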

## The Apriori algorithm

The `mlxtend` library provides us with an implementation of the Apriori algorithm, which we can use to mine the frequent itemsets.
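
To give an idea of what happens under the hood, here is a simplified, unoptimized sketch of the level-wise Apriori idea in plain Python. It is not the `mlxtend` implementation, and the function name `apriori_sketch` and the toy transactions are made up for illustration. The key point is the pruning step: an itemset can only be frequent if all of its subsets are frequent.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        return sum(itemset <= t for t in tsets) / n

    # Level 1: frequent single items.
    items = {item for t in tsets for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep the candidates that are actually frequent.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions.
toy = [["Water", "Juice", "Beer"], ["Water", "Juice"], ["Juice", "Beer"], ["Juice", "Milk", "Bread"]]
result = apriori_sketch(toy, min_support=0.5)
print(len(result))  # 5
```

With a minimum support of 0.5, the candidate `{Water, Juice, Beer}` is never counted, because its subset `{Water, Beer}` is not frequent.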

### Data

The dataset we are going to use is a synthetic dataset containing customer purchases. You can find this dataset [here](https://gist.github.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751) or in the course repository.

In [None]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("https://raw.githubusercontent.com/michalis0/BigScaleAnalytics/master/week5/data/retail.csv", sep=",")
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


Each row of the dataset represents items that were purchased together by a customer, on the same day at the same store.

The dataset is **sparse**, as a relatively high percentage of cells is null (NA, NaN or equivalent). These null values make it difficult to read the table. Let's find out which unique items can actually be found in the table (based on the first column).

In [None]:
df["0"].unique()

array(['Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil',
       'Diaper', 'Milk'], dtype=object)

### Preprocessing
To use the `apriori` function provided by the `mlxtend` library, we need to convert the dataset into the format it expects: a DataFrame whose values are either 0 and 1 or True and False. Our data consists of strings (item names), so we need to one-hot encode it. We first collect the non-null items of each row into a list of transactions.

In [None]:
dataset = []
for ind, row in df.iterrows():
    transaction = []
    for item in row:
        if item == item:  # NaN != NaN, so this is True only for non-null items
            transaction.append(item)
    dataset.append(transaction)
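
The loop above relies on the fact that NaN is never equal to itself, so comparing a cell with itself filters out the missing values. The following plain-Python sketch illustrates this on three rows taken from the table above, with `float("nan")` standing in for the missing cells.

```python
nan = float("nan")  # stands in for the missing cells that pandas reads as NaN

# Three rows from the table above, with missing cells written as NaN.
rows = [
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", nan, nan],
    ["Meat", "Pencil", "Wine", nan, nan, nan, nan],
]

print(nan == nan)  # False: NaN never equals itself

# Keep only the cells that equal themselves, i.e. the real item names.
transactions = [[item for item in row if item == item] for row in rows]
print(transactions[2])  # ['Meat', 'Pencil', 'Wine']
```

In pandas code, `pd.notna(item)` expresses the same test more explicitly.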

Next, using the `TransactionEncoder` class, we can transform the transactions into rows of True and False values, with one column per item.

In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,False,True,True,True,True,True,False,True,True
1,False,True,True,True,False,True,True,True,True
2,False,False,True,False,True,True,True,False,True
3,False,False,True,False,True,True,True,False,True
4,False,False,False,False,False,True,False,True,True
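
Conceptually, the encoding asks one question per cell: does this transaction contain this item? The sketch below reproduces two rows of the output above in plain Python. It is only an illustration of the idea, not how `TransactionEncoder` is implemented, and the helper name `one_hot` is hypothetical.

```python
# The nine item names, in the same order as the columns above.
columns = ["Bagel", "Bread", "Cheese", "Diaper", "Eggs", "Meat", "Milk", "Pencil", "Wine"]

# Rows 0 and 4 of the original table.
transactions = [
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Meat", "Pencil", "Wine"],
]


def one_hot(transactions, columns):
    """One list of booleans per transaction: True where the item was bought."""
    rows = []
    for transaction in transactions:
        bought = set(transaction)
        rows.append([col in bought for col in columns])
    return rows


encoded = one_hot(transactions, columns)
print(encoded[1])
# [False, False, False, False, False, True, False, True, True]
```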


### Applying Apriori

Now we use the `apriori` function from the `mlxtend` library to find the frequent itemsets. Before that, let's look at some of its parameters:

- `df` : One-Hot-Encoded DataFrame or DataFrame that has 0 and 1 or True and False as values
- `min_support` : Floating point value between 0 and 1 that indicates the minimum support required for an itemset to be selected.
- `use_colnames` : If True, the itemsets are reported with the column names instead of the column indices, which makes them more readable.
- `max_len` : Max length of itemset generated. If not set, all possible lengths are evaluated.
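
To get a feeling for why `min_support` and `max_len` matter, it helps to count how many itemsets exist in the first place. The sketch below computes this upper bound for the nine items of this dataset. The helper name `count_itemsets` is made up for illustration.

```python
from math import comb


def count_itemsets(n_items, max_len=None):
    """Number of non-empty itemsets with at most `max_len` items."""
    if max_len is None:
        max_len = n_items
    return sum(comb(n_items, k) for k in range(1, max_len + 1))


print(count_itemsets(9))     # 511 possible itemsets over nine items
print(count_itemsets(9, 2))  # 45 if we stop at pairs
```

The number of possible itemsets grows exponentially with the number of items, so both parameters can reduce the work considerably on larger datasets.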

In [None]:
freq_items = apriori(df, min_support=0.2, use_colnames=True)
freq_items.head(10)

Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.361905,(Pencil)
8,0.438095,(Wine)
9,0.279365,"(Bagel, Bread)"


In [None]:
freq_items.shape

(33, 2)

## Mining association rules

As you know by now, frequent if-then associations are called "association rules". They consist of an antecedent (if) and a consequent (then): `{antecedent} => {consequent}`.

The `metric` parameter can be set to "support", "confidence", "lift", "leverage" and "conviction" (see [this page](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) for more information on how these metrics are defined). In the example below, we use the confidence metric with a threshold of **0.6**. This means that we keep only rules with a confidence at or above 0.6.

In [None]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.head(15).sort_values(by="lift")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
5,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
11,"(Meat, Cheese)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
2,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
3,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
13,"(Cheese, Milk)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429


The `rules` dataframe contains all the association rules that passed our confidence threshold. What do you think? Are they really interesting? What does the __lift__ metric tell you?
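
All the metric columns in the table can be derived from three numbers: the support of the antecedent, the support of the consequent, and the support of the rule. The sketch below uses the standard definitions of these metrics and checks them against the `(Cheese) => (Milk)` row. The function name `rule_metrics` is hypothetical and not part of `mlxtend`.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive confidence, lift, leverage and conviction from three supports."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule is never violated (confidence of 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports copied from the (Cheese) => (Milk) row of the table above.
metrics = rule_metrics(0.501587, 0.501587, 0.304762)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Up to rounding, the printed values should agree with the corresponding columns of that row. A lift above 1 means the items appear together more often than they would if they were bought independently.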

Try to generate the above rules again but now with a smaller threshold for confidence, say **0.4**. What do you think about the rules now?

In [None]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.4)
rules[rules["lift"] < 1]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(Cheese),(Bread),0.501587,0.504762,0.238095,0.474684,0.940411,-0.015087,0.942742
5,(Bread),(Cheese),0.504762,0.501587,0.238095,0.471698,0.940411,-0.015087,0.943424
8,(Meat),(Bread),0.47619,0.504762,0.206349,0.433333,0.858491,-0.034014,0.87395
9,(Bread),(Meat),0.504762,0.47619,0.206349,0.408805,0.858491,-0.034014,0.886018
15,(Diaper),(Cheese),0.406349,0.501587,0.2,0.492188,0.98126,-0.00382,0.98149
37,(Milk),(Wine),0.501587,0.438095,0.219048,0.436709,0.996835,-0.000695,0.997539
38,(Wine),(Milk),0.438095,0.501587,0.219048,0.5,0.996835,-0.000695,0.996825
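
These rules all clear the 0.4 confidence threshold, yet their lift is below 1. The short sketch below, using values copied from three rows of the table above, shows what that means: the confidence of each rule is lower than the overall support of its consequent, so seeing the antecedent makes the consequent slightly less likely.

```python
# (rule, confidence, consequent support, lift, leverage), copied from the table above.
rows = [
    ("(Bread) => (Meat)", 0.408805, 0.476190, 0.858491, -0.034014),
    ("(Diaper) => (Cheese)", 0.492188, 0.501587, 0.981260, -0.003820),
    ("(Wine) => (Milk)", 0.500000, 0.501587, 0.996835, -0.000695),
]

for name, confidence, consequent_support, lift, leverage in rows:
    # Below baseline: the consequent is rarer after the antecedent than overall.
    below_baseline = confidence < consequent_support
    print(f"{name}: lift={lift:.3f}, below baseline={below_baseline}, "
          f"negative leverage={leverage < 0}")
```

Confidence alone is therefore not enough to call a rule interesting; lift (or leverage) is needed to compare it against what independence would predict.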


## Exercise 2

Let's try this library on a more realistic and bigger dataset.

### Loading the data

First, download the `groceries.csv` file from the GitHub repository and put it in the same folder as your notebook (or, if working in Colab, upload it to the runtime files). Notice that this is not a proper CSV file: each row has a different number of values, so you have to read the file manually.

In [None]:
with open("groceries.csv", "r") as f:
    dataset = []
    for line in f:
        transaction = []
        row = line.rstrip("\n").split(",")
        for item in row:
            transaction.append(item)
        dataset.append(transaction)
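
As an alternative sketch, the standard `csv` module can also read rows of different lengths, since it simply returns one list per line. The sample text below is made up and only stands in for a ragged file; it does not reproduce the contents of `groceries.csv`.

```python
import csv
import io

# Made-up sample standing in for a file whose rows have different lengths.
sample = "milk,bread\nyogurt\n\nbeer,chips,soda\n"

with io.StringIO(sample) as f:
    # csv.reader yields one list of strings per line; blank lines give [].
    dataset = [row for row in csv.reader(f) if row]

print(dataset)
# [['milk', 'bread'], ['yogurt'], ['beer', 'chips', 'soda']]
```

With a real file, `io.StringIO(sample)` would be replaced by something like `open("groceries.csv", newline="")`. Unlike a plain `split(",")`, the `csv` module would also handle quoted values that contain commas.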

### Mining rules

Try to find association rules for the Groceries dataset using **confidence** as the `metric` parameter, and a support threshold of **0.001** and confidence threshold of **0.05**. 

Extract all the rules you have found containing "bottled beer" as *antecedent*. Which rules do you find interesting?

In [None]:
# Create the one_hot encoded dataframe with TransactionEncoder()
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [None]:
# Find the frequent itemsets with  min_support=0.001, max_len=2
freq_items = apriori(df, min_support=0.001, use_colnames=True, max_len=2)

In [None]:
# Find the association rules with metric='confidence' and min_threshold=0.05
rules = association_rules(freq_items, metric="confidence", min_threshold=0.05)

In [None]:
# Extract rules with 'bottled beer' as antecedents 
rules[rules["antecedents"].astype(str).str.contains("bottled beer")].sort_values(by="lift", ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
418,(bottled beer),(liquor),0.080529,0.011083,0.004677,0.058081,5.240594,0.003785,1.049896
446,(bottled beer),(red/blush wine),0.080529,0.019217,0.004881,0.060606,3.15376,0.003333,1.044059
378,(bottled beer),(bottled water),0.080529,0.110524,0.01576,0.195707,1.770726,0.00686,1.105911
382,(bottled beer),(butter),0.080529,0.055414,0.005796,0.07197,1.298756,0.001333,1.017839
422,(bottled beer),(margarine),0.080529,0.058566,0.006101,0.075758,1.293534,0.001384,1.0186
426,(bottled beer),(napkins),0.080529,0.052364,0.005186,0.064394,1.229737,0.000969,1.012858
461,(bottled beer),(soda),0.080529,0.174377,0.01698,0.210859,1.209209,0.002938,1.046229
409,(bottled beer),(fruit/vegetable juice),0.080529,0.072293,0.007016,0.087121,1.205116,0.001194,1.016244
402,(bottled beer),(frankfurter),0.080529,0.058973,0.005389,0.066919,1.134742,0.00064,1.008516
442,(bottled beer),(pork),0.080529,0.057651,0.005186,0.064394,1.116957,0.000543,1.007207
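
Note that the filter above converts each antecedent to a string and looks for a substring. This works here, but it would also match any item whose name merely contains "bottled beer". The plain-Python sketch below contrasts the substring test with an exact membership test. The rules and the item "bottled beer glasses" are made up for illustration.

```python
# Hypothetical rules; antecedents are frozensets, as in the `rules` DataFrame.
rules = [
    {"antecedents": frozenset({"bottled beer"}), "consequents": frozenset({"soda"})},
    # "bottled beer glasses" is a made-up item name.
    {"antecedents": frozenset({"bottled beer glasses"}), "consequents": frozenset({"soda"})},
    {"antecedents": frozenset({"soda"}), "consequents": frozenset({"bottled beer"})},
]

target = "bottled beer"

# Substring test on the string form, as in the cell above.
substring_hits = [r for r in rules if target in str(r["antecedents"])]
# Exact membership test on the frozenset itself.
exact_hits = [r for r in rules if target in r["antecedents"]]

print(len(substring_hits), len(exact_hits))  # 2 1
```

On the actual DataFrame, the exact test could be written as `rules[rules["antecedents"].apply(lambda items: "bottled beer" in items)]`.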
