<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/6%20-%20Association%20Rules/Walkthrough/AssociationRules.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mining Association Rules
In this week's lab, we are going to mine association rules using the Python library `mlxtend`. You can install the library using pip:

In [None]:
!pip install mlxtend  

## Hands-On
Before starting to code, let's practice with a toy example.
Calculate support and confidence for the following association rules given the shopping receipts database below.
<br><img src="https://github.com/michalis0/Business-Intelligence-and-Analytics/blob/master/6%20-%20Association%20Rules/img/association_rules.png?raw=1" width="500" style="float: left"><br>
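
If you want to check your hand calculations, the two measures can be sketched in a few lines of plain Python. The receipts below are hypothetical and made up for illustration (they are not the ones in the image), and the helper names `support` and `confidence` are our own, not part of `mlxtend`.

```python
# Hypothetical receipts, made up for illustration (not the ones in the image).
transactions = [
    {"Tea", "Sugar"},
    {"Tea", "Sugar", "Lemon"},
    {"Tea", "Lemon"},
    {"Sugar"},
    {"Tea", "Sugar", "Honey"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Measures for the hypothetical rule {Tea} => {Sugar}
print("support:", support({"Tea", "Sugar"}, transactions))
print("confidence:", confidence({"Tea"}, {"Sugar"}, transactions))
```

Note that the support of a rule is the support of the antecedent and the consequent taken together, so `{Tea} => {Sugar}` and `{Sugar} => {Tea}` share the same support but generally differ in confidence.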


Answer:
```
{Water} => {Juice}
    Support = 
    Confidence = 
    
{Juice} => {Water}
    Support = 
    Confidence = 
    
{Milk} => {Bread}
    Support = 
    Confidence = 
    
{Juice, Beer} => {Water}
    Support = 
    Confidence = 
```

Suppose that we have a support threshold of 40% and confidence threshold of 75%. Which rules are most interesting? Why?<br>

Do you think using only the support and confidence measures is enough to identify a rule as interesting?

## Apriori algorithm

We will use the apriori algorithm to mine the frequent itemsets. The `mlxtend` library has an implementation of this algorithm.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

## Load data
The dataset we are going to use is a synthetic dataset. It contains the purchases of customers. You can find the source of the dataset [here](https://gist.github.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751).

In [None]:
df = pd.read_csv('https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/6%20-%20Association%20Rules/data/retail.csv', sep=',')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


In [None]:
df.shape

(315, 7)

Each row of the dataset represents items that were purchased together on the same day at the same store. The dataset is sparse: a relatively high percentage of its cells are NA, NaN, or equivalent.
These NaNs make the table hard to read. Let’s find out how many unique items there actually are in the table.

In [None]:
# unique items in the first column (here it happens to contain every item)
items = df['0'].unique()
items

array(['Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil',
       'Diaper', 'Milk'], dtype=object)
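
The cell above only inspects the first column, which is enough here because every item happens to appear there. A more general sketch, assuming the same layout (one row per transaction, NaN in the unused slots), looks at all columns at once. The `toy` frame below is hypothetical and only stands in for `df`.

```python
import pandas as pd

# Hypothetical mini-frame with the same layout as `df`: one row per
# transaction, with missing values in the unused slots.
toy = pd.DataFrame(
    [["Bread", "Wine", None],
     ["Cheese", "Milk", "Eggs"],
     ["Wine", None, None]],
    columns=["0", "1", "2"],
)

# Flatten all cells into one Series, drop the missing values, deduplicate.
all_items = pd.Series(toy.to_numpy().ravel()).dropna().unique()
print(sorted(all_items))
```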

## Data Preprocessing
To use the `apriori` function from the `mlxtend` library, we need to convert the dataset into the format it expects: a DataFrame whose values are either 0 and 1 or True and False. Our data consists entirely of strings (item names), so we need to one-hot encode it.

In [None]:
dataset = []
for _, row in df.iterrows():
    # keep only the real items; pd.notna() is False for the NaN padding
    transaction = [item for item in row if pd.notna(item)]
    dataset.append(transaction)

In [None]:
type(dataset)

list

In [None]:
len(dataset)

315

Next, using the `TransactionEncoder`, we can transform the transactions into a table of True and False values.

In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,False,True,True,True,True,True,False,True,True
1,False,True,True,True,False,True,True,True,True
2,False,False,True,False,True,True,True,False,True
3,False,False,True,False,True,True,True,False,True
4,False,False,False,False,False,True,False,True,True
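
To see what this encoding amounts to, here is a minimal hand-rolled sketch in plain Python. It is not how `TransactionEncoder` is implemented; it only reproduces the same kind of table (one column per item in alphabetical order, one boolean row per transaction) on hypothetical transactions.

```python
# Hypothetical transactions, made up for illustration.
transactions = [
    ["Bread", "Wine"],
    ["Cheese", "Bread", "Milk"],
    ["Wine"],
]

# Columns: the sorted set of all items seen in any transaction.
columns = sorted({item for t in transactions for item in t})

# One boolean row per transaction: True where the item was bought.
encoded = [[item in t for item in columns] for t in transactions]

print(columns)
for row in encoded:
    print(row)
```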


## Applying Apriori
Now we use the `apriori` function from the `mlxtend` library to find the frequent itemsets. Before that, let's look at some of its parameters:

- `df` : One-Hot-Encoded DataFrame or DataFrame that has 0 and 1 or True and False as values
- `min_support` : Floating point value between 0 and 1 that indicates the minimum support required for an itemset to be selected.
- `use_colnames` : If `True`, the returned itemsets use the column names instead of the column indices, which makes them more readable.
- `max_len` : Max length of itemset generated. If not set, all possible lengths are evaluated.



In [None]:
freq_items = apriori(df, min_support=0.2, use_colnames=True)
freq_items.head(10)

Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.361905,(Pencil)
8,0.438095,(Wine)
9,0.279365,"(Bagel, Bread)"
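
For intuition about what happens behind this call, the level-wise idea of Apriori can be sketched in plain Python. This is a simplified illustration under our own assumptions, not `mlxtend`'s implementation: the function name `frequent_itemsets` and the toy transactions are hypothetical, and no attention is paid to efficiency.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {frozenset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Hypothetical transactions, made up for illustration.
toy = [["Bread", "Milk"], ["Bread", "Milk", "Eggs"], ["Bread", "Eggs"], ["Milk"]]
freq = frequent_itemsets(toy, min_support=0.5)
for itemset, sup in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is the key idea: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded without ever counting them.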


## Mining Association Rules
Association rules are frequent if-then associations. Each rule consists of an antecedent (if) and a consequent (then), in other words `{antecedent} => {consequent}`. The `metric` parameter can be set to confidence, lift, support, leverage or conviction. In the example below we use the confidence metric with a threshold of __0.6__. This means that we are selecting the rules with a confidence of at least __0.6__.

In [None]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.sort_values(by='lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
12,"(Meat, Milk)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137
9,"(Eggs, Meat)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667
10,"(Meat, Cheese)",(Eggs),0.32381,0.438095,0.215873,0.666667,1.521739,0.074014,1.685714
8,"(Eggs, Cheese)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773
13,"(Cheese, Milk)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429
1,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
2,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
3,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
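
The derived columns can be recomputed by hand from the three support columns. The sketch below does this for the first row of the output above, `(Meat, Milk) => (Cheese)`, using the commonly stated definitions of confidence, lift, leverage and conviction. The inputs are the rounded values from the table, so the results should agree with it only to about three decimal places.

```python
# Values copied from the first row of the output above: (Meat, Milk) => (Cheese)
antecedent_support = 0.244444
consequent_support = 0.501587
support = 0.203175  # support of antecedent and consequent together

# confidence: how often the consequent appears when the antecedent does
confidence = support / antecedent_support
# lift: confidence relative to the baseline frequency of the consequent
lift = confidence / consequent_support
# leverage: observed joint support minus the support expected under independence
leverage = support - antecedent_support * consequent_support
# conviction: how much more often the rule would fail under independence
conviction = (1 - consequent_support) / (1 - confidence)

print(round(confidence, 3), round(lift, 3), round(leverage, 3), round(conviction, 3))
```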


The `rules` dataframe contains all the association rules that we determined as interesting. What do you think? Are they really interesting? What does the __lift__ metric tell you?

Try to generate the above rules again but now with a smaller threshold for confidence, say $0.4$. What do you think about the rules now?

In [None]:
rules = association_rules(freq_items, metric="confidence", min_threshold=0.4)
rules[rules["lift"] < 1]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
4,(Cheese),(Bread),0.501587,0.504762,0.238095,0.474684,0.940411,-0.015087,0.942742
5,(Bread),(Cheese),0.504762,0.501587,0.238095,0.471698,0.940411,-0.015087,0.943424
8,(Meat),(Bread),0.47619,0.504762,0.206349,0.433333,0.858491,-0.034014,0.87395
9,(Bread),(Meat),0.504762,0.47619,0.206349,0.408805,0.858491,-0.034014,0.886018
15,(Diaper),(Cheese),0.406349,0.501587,0.2,0.492188,0.98126,-0.00382,0.98149
37,(Milk),(Wine),0.501587,0.438095,0.219048,0.436709,0.996835,-0.000695,0.997539
38,(Wine),(Milk),0.438095,0.501587,0.219048,0.5,0.996835,-0.000695,0.996825


## Exercise: your turn!
Let's try with a larger, more realistic dataset. Load the `Groceries.csv` file and try to find association rules using the __confidence__ metric, a support threshold of __0.001__ and a confidence threshold of __0.05__.

Check out the rules you have found that have "bottled beer" as antecedent. Are all of these rules interesting?

### Load the data
Notice that this is not a proper CSV file: the rows contain different numbers of values. So you have to read the file manually.

In [None]:
with open('Groceries.csv', 'r') as f:
    dataset = []
    for line in f:
        transaction = []
        row = line.rstrip('\n').split(',')
        for item in row:
            transaction.append(item)
        dataset.append(transaction)

In [None]:
# create the one_hot encoded dataframe with TransactionEncoder()


In [None]:
# find the frequent itemsets with  min_support=0.001, max_len=2


# find the association rules with metric='confidence' and min_threshold=0.05


# extract rules with 'bottled beer' as antecedents 


