## Lab 5: Association Rule Mining with Apriori Algorithm

Objectives of this lab:
* Introduction to rule mining with Python and the mlxtend library.
* Explore how to generate frequent itemsets using the Apriori principle.

In this lab, we will work through a tutorial on a famous example of association rule mining: market basket analysis.


First, let's create a small dataset in which each row is a transaction and each entry in a row is an item that was bought in that transaction.

In [None]:
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

In [None]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()  # te: transaction encoder
# fit learns the unique item labels; transform builds a boolean array with one column per item
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df
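
To make the encoding step concrete, here is a minimal pure-Python sketch of the kind of table the encoder produces: one boolean column per distinct item, in sorted order. It is only an illustration, not mlxtend's implementation, and the names `columns` and `one_hot` are made up for this example.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Collect every distinct item; sorting keeps the column order deterministic
columns = sorted({item for transaction in dataset for item in transaction})

# One row per transaction, one True/False entry per column
one_hot = [[item in transaction for item in columns] for transaction in dataset]

print(columns)
print(one_hot[0])
```

Note that the repeated 'Onion' in the last transaction still yields a single True: the encoding records presence, not quantity.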

# Support and Confidence: 
Support is the fraction of transactions in the dataset that contain a given item or itemset, that is, how often it appears among the purchased items.
Confidence of a rule A -> B is the fraction of the transactions containing A that also contain B, that is, the support of A and B together divided by the support of A.

{Diaper, Gum} -> {Beer, Chips} 

For instance, a confidence of 0.5 in the above example would mean that in 50% of the cases where Diaper and Gum were purchased, the transaction also included Beer and Chips.
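
As a quick check of these definitions, the following sketch computes support and confidence by hand. It uses the small dataset from this lab rather than the Diaper/Gum rule, and the helper names `count`, `support` and `confidence` are made up for this example.

```python
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
transactions = [set(t) for t in dataset]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also hold the consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    return count(set(antecedent) | set(consequent)) / count(antecedent)

print(support({'Eggs'}))                 # Eggs is in 4 of 5 transactions
print(confidence({'Eggs'}, {'Onion'}))   # 3 of the 4 Eggs transactions contain Onion
print(confidence({'Onion'}, {'Eggs'}))   # all 3 Onion transactions contain Eggs
```

The two confidence values differ, which shows that a rule and its reverse are not interchangeable.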


Now let's see which itemsets have a support of at least 60%:

In [None]:
from mlxtend.frequent_patterns import apriori
apriori(df, min_support=0.6, use_colnames=True)
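
The call above hides the level-wise search behind it. The following is a minimal pure-Python sketch of the Apriori principle (every subset of a frequent itemset must itself be frequent), written for readability rather than speed. It is not mlxtend's implementation, and `apriori_sketch` is a hypothetical name used only here.

```python
from itertools import combinations

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

def apriori_sketch(raw_transactions, min_support):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    transactions = [set(t) for t in raw_transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count step: keep only candidates that reach the support threshold
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

result = apriori_sketch(dataset, min_support=0.6)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is where the Apriori principle saves work: a candidate is never counted against the transactions if one of its subsets already failed the threshold.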

# Example 2 -- Selecting and Filtering Results
The advantage of working with pandas DataFrames is that we can use its convenient features to filter the results. For instance, let's assume we are only interested in itemsets of length 2 that have a support of at least 60 percent. First, we create the frequent itemsets via apriori and add a new column that stores the length of each itemset:

In [None]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

Now we may want to select itemsets of a specific length and support:

In [None]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.6) ]

In [None]:
apriori(df, min_support=0.5, use_colnames=True, max_len=None, n_jobs=1)

## Generating association rules:

Now suppose you want a function that lets you specify (1) your metric of interest and (2) the corresponding threshold. The association_rules function does this; its supported metrics include confidence and lift. Let's say you are interested in rules derived from the frequent itemsets only if their confidence is at least 90 percent:

In [None]:
from mlxtend.frequent_patterns import association_rules

rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.9)
rules
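
To see where such rules come from, the sketch below splits one frequent itemset into every antecedent/consequent pair and keeps the pairs that reach the confidence threshold. Lift is computed as the rule's confidence divided by the support of the consequent. This is a hand-rolled illustration on the lab's small dataset, not mlxtend's implementation; `rules_from_itemset` is a hypothetical helper.

```python
from itertools import combinations

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
transactions = [set(t) for t in dataset]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)

def rules_from_itemset(itemset, min_confidence):
    """List (antecedent, consequent, confidence, lift) for one frequent itemset."""
    itemset = frozenset(itemset)
    n = len(transactions)
    found = []
    # Try every non-empty proper subset of the itemset as the antecedent
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            confidence = count(itemset) / count(antecedent)
            lift = confidence / (count(consequent) / n)
            if confidence >= min_confidence:
                found.append((antecedent, consequent, confidence, lift))
    return found

example_rules = rules_from_itemset({'Eggs', 'Kidney Beans', 'Onion'}, min_confidence=0.9)
for antecedent, consequent, confidence, lift in example_rules:
    print(sorted(antecedent), '->', sorted(consequent), confidence, lift)
```

A lift above 1 suggests the antecedent and consequent occur together more often than they would if they were independent; a lift of 1 suggests no association.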


Pandas DataFrames make it easy to filter the results further. Let's say we are only interested in rules that satisfy the following criteria:

1. at least 2 antecedents
2. a confidence > 0.75

We could compute the antecedent length as follows:

In [None]:
rules["antecedent_len"] = rules["antecedents"].apply(lambda x: len(x))
rules

Applying our condition:


In [None]:
rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) ]

# Exercise time! 
Let's use another dataset. First, download the dataset Excel file from Blackboard.

In [None]:
# put the path to the Excel file you downloaded here
path_to_excel=''
df = pd.read_excel(path_to_excel)
df.head()

In [None]:
df

In [None]:
partial_data=df[0:100]
partial_data

We need to do a little cleanup. First, some of the descriptions have leading or trailing spaces that need to be removed. We’ll also drop the rows that don’t have invoice numbers and remove the credit transactions (those with invoice numbers containing C).

In [None]:
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

After the cleanup, we need to consolidate the items into one transaction per row, with each product one-hot encoded. To keep the dataset small, we only look at sales for France. Comparing these results with sales from other countries, such as Germany, would be interesting to investigate further.

In [None]:
basket = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))
basket.shape

There are a lot of zeros in the data, but we also need to make sure that any positive values are converted to 1 and anything less than or equal to 0 is set to 0. This step completes the one-hot encoding of the data and removes the postage column (since that charge is not one we wish to explore):

In [None]:
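def encode_units(x):
    # 1 if the item was bought at least once on the invoice, otherwise 0
    return 1 if x >= 1 else 0

# note: pandas 2.1 and later prefer DataFrame.map over applymap
basket_sets = basket.applymap(encode_units)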
basket_sets.drop('POSTAGE', inplace=True, axis=1)
basket_sets
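
The consolidation and encoding steps above can be traced on a handful of invoice lines without pandas. The rows below are made up for illustration and are not taken from the lab's Excel file; the sketch mirrors the group-and-sum, pivot, and binarize steps in plain Python.

```python
from collections import defaultdict

# Made-up invoice lines: (InvoiceNo, Description, Quantity)
rows = [
    ("1001", "ALARM CLOCK", 2),
    ("1001", "POSTAGE", 1),
    ("1001", "ALARM CLOCK", 1),
    ("1002", "LUNCH BAG", 3),
    ("1002", "ALARM CLOCK", -1),   # a negative quantity, e.g. a return
]

# Step 1: sum quantities per (invoice, item), like groupby(...).sum()
totals = defaultdict(int)
for invoice, item, quantity in rows:
    totals[(invoice, item)] += quantity

# Step 2: one row per invoice and one column per item, with 0 where the
# item is absent, like unstack().fillna(0); POSTAGE is left out
invoices = sorted({invoice for invoice, _ in totals})
items = sorted({item for _, item in totals} - {"POSTAGE"})
toy_basket = {invoice: {item: totals.get((invoice, item), 0) for item in items}
              for invoice in invoices}

# Step 3: keep only whether the item was bought (total quantity above 0)
toy_basket_sets = {invoice: {item: quantity > 0 for item, quantity in row.items()}
                   for invoice, row in toy_basket.items()}

print(toy_basket)
print(toy_basket_sets)
```

Storing True/False instead of 1/0 is a deliberate choice in this sketch: the result is the same presence table, expressed as booleans.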

Now that the data is structured properly, generate the frequent itemsets with a support of at least 7%.

Now, generate the association rules with a confidence of at least 50%.

Next, show which rules have a support greater than 10%.

Finally, how many "ALARM CLOCK BAKELIKE GREEN" were sold?
Hint: use .sum()