# Association Analysis

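Association analysis is the task of finding interesting relationships in large datasets, usually expressed as frequent itemsets (items that often occur together) and association rules (if X is purchased, Y tends to be purchased as well).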

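Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. It works level by level: it first finds the frequent single items, then extends them to larger itemsets, discarding any candidate that contains an infrequent subset.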
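To make the level-wise idea concrete, here is a minimal pure-Python sketch of Apriori. It is not the mlxtend implementation used later in this notebook; the function name `apriori_sketch` and the toy baskets are made up for illustration only.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every frequent itemset."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({c: support(c) for c in current})
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical toy baskets, not taken from the retail dataset.
toy = [
    ["Bread", "Milk"],
    ["Bread", "Cheese", "Milk"],
    ["Cheese", "Meat"],
    ["Bread", "Cheese", "Milk", "Meat"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The prune step is what distinguishes Apriori from brute-force enumeration: a candidate is never counted against the data if one of its subsets is already known to be infrequent.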

We will work with a dataset in which each row lists the items that were purchased together on the same day at the same store. The dataset is sparse: most rows fill only some of the available columns.
The dataset can be found here: https://gist.github.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751



In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

In [None]:
# Read the dataset 
df = pd.read_csv('retail_dataset.csv', sep=',')

# Print the first 5 rows
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


In [None]:
# Find the unique items in the table
# (empty cells are read as NaN, so NaN shows up in the set as well)
items = set()
for col in df:
    items.update(df[col].unique())
print(items)

{'Wine', nan, 'Bagel', 'Cheese', 'Eggs', 'Pencil', 'Milk', 'Bread', 'Meat', 'Diaper'}


# Data Preprocessing

The `apriori` function requires a DataFrame whose values are either 0 and 1 or True and False. Our data consists entirely of strings (item names), so we need to one-hot encode it.

In [None]:
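# One-hot encode each transaction: 1 if the item is in the row, 0 otherwise.
# Note: `items` still contains NaN (the empty cells), so NaN becomes a column
# too and shows up as an "item" in the rules mined below.
itemset = set(items)
encoded_vals = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)            # items absent from this row
    commons = list(itemset.intersection(rowset))  # items present in this row
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
ohe_df = pd.DataFrame(encoded_vals)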
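
The encoding loop also turns the empty slots (NaN) into a column, which is why `nan` appears as a consequent in some of the rules further down. A minimal sketch of an encoding that skips empty slots could look like the following; `one_hot_rows` is a hypothetical helper, and the two sample rows are copied from the `df.head(5)` output above.

```python
def one_hot_rows(rows):
    """Encode each basket as {item: bool}, ignoring empty (NaN/None) slots."""

    def is_item(value):
        # Real items are non-empty strings; padding shows up as NaN or None.
        return isinstance(value, str) and value != ""

    vocabulary = sorted({v for row in rows for v in row if is_item(v)})
    encoded = []
    for row in rows:
        present = {v for v in row if is_item(v)}
        encoded.append({item: item in present for item in vocabulary})
    return encoded


nan = float("nan")
sample_rows = [
    ["Meat", "Pencil", "Wine", nan, nan, nan, nan],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", nan, nan],
]
encoded = one_hot_rows(sample_rows)
print(encoded[0])
```

Assuming `df` is the DataFrame read earlier, something like `pd.DataFrame(one_hot_rows(df.values.tolist()))` should produce a boolean table without the NaN column. The rules that involve `nan` would then disappear, so the outputs shown in this notebook would differ.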

Generate the frequent itemsets with a support of at least 10% (a threshold chosen to keep a reasonable number of itemsets for this dataset).

Generate the rules with their corresponding support, confidence and lift, keeping only rules with a confidence of at least 0.5.

In [None]:
%%time
# Applying apriori
freq_items = apriori(ohe_df, min_support=0.1, use_colnames=True)

# Mining association rules
apriori_rules = association_rules(freq_items, metric="confidence", min_threshold=0.5)
apriori_rules.head()

CPU times: user 13.7 ms, sys: 0 ns, total: 13.7 ms
Wall time: 19.1 ms


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Milk),0.425397,0.501587,0.225397,0.529851,1.056348,0.012023,1.060116
1,(Bagel),(nan),0.425397,0.869841,0.336508,0.791045,0.909413,-0.03352,0.622902
2,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
3,(Bread),(Bagel),0.504762,0.425397,0.279365,0.553459,1.301042,0.064641,1.286787
4,(Milk),(nan),0.501587,0.869841,0.409524,0.816456,0.938626,-0.026778,0.709141


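The **confidence** of a rule A → C tells us how often C appears in the transactions that contain A: it is support(A and C) divided by support(A).

The **lift** gives us the strength of the association: it is the confidence divided by support(C). A lift above 1 means A and C occur together more often than they would if they were independent; a lift below 1 means they occur together less often.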
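
As a quick check of how these columns relate, the sketch below recomputes confidence, lift, leverage and conviction for the (Bagel) → (Bread) row of the table above from its three support values. The helper name `rule_metrics` is made up for illustration; the formulas are the standard definitions.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Derive rule metrics for A -> C from the three support values."""
    confidence = support_ac / support_a          # P(C | A)
    lift = confidence / support_c                # confidence relative to P(C)
    leverage = support_ac - support_a * support_c
    # Conviction is undefined (infinite) when the rule never fails.
    conviction = (1 - support_c) / (1 - confidence) if confidence < 1 else float("inf")
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Support values copied from the (Bagel) -> (Bread) row above.
bagel_bread = rule_metrics(
    support_a=0.425397,   # antecedent support
    support_c=0.504762,   # consequent support
    support_ac=0.279365,  # support of the rule
)
for name, value in bagel_bread.items():
    print(f"{name}: {value:.4f}")
```

Up to rounding, the values agree with the confidence, lift, leverage and conviction columns of that row.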

In [None]:
apriori_rules[ (apriori_rules['lift'] >= 1.5) &
      (apriori_rules['confidence'] >= 0.7) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
45,"(Milk, Meat)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137
55,"(Eggs, Meat)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667
56,"(Eggs, Cheese)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773


# TASKS

1. Execute the association analysis using the **fpgrowth** and **ECLAT** algorithms on the same dataset. 
2. For each of the two algorithms, find the following:
  
  a. rate of Milk, Meat and Cheese being purchased together.
  
  b. percentage of customers who buy Eggs, Meat and Cheese. 

3. Compute the overall time for the association analysis.
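
As a starting point for the ECLAT part, here is a minimal pure-Python sketch, written on the assumption that no ready-made ECLAT function is available in the libraries imported above. ECLAT stores, for each item, the set of transaction ids that contain it, and grows itemsets by intersecting those sets. The function name `eclat` and the toy baskets are hypothetical; timing is done with `time.perf_counter` from the standard library.

```python
import time


def eclat(transactions, min_support):
    """Return {frozenset(itemset): support} using vertical tid-set intersection."""
    n = len(transactions)
    min_count = min_support * n

    # Vertical layout: item -> set of transaction ids that contain it.
    tidsets = {}
    for tid, basket in enumerate(transactions):
        for item in set(basket):
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Each candidate is (item, tids), where tids are the transactions
        # that contain the prefix together with that item.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[itemset] = len(tids) / n
            narrower = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    narrower.append((other, shared))
            extend(itemset, narrower)

    start = [(i, t) for i, t in sorted(tidsets.items()) if len(t) >= min_count]
    extend(frozenset(), start)
    return frequent


# Hypothetical toy baskets; replace them with the rows of the retail dataset.
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Cheese", "Milk"],
    ["Cheese", "Meat"],
    ["Bread", "Cheese", "Milk", "Meat"],
]

started = time.perf_counter()
itemsets = eclat(baskets, min_support=0.5)
elapsed = time.perf_counter() - started

# The support of an itemset is the rate at which its items are bought together.
print(itemsets.get(frozenset({"Bread", "Cheese", "Milk"})))
print(f"elapsed: {elapsed:.6f} s")
```

Looking up an itemset with `.get` returns `None` when it falls below the support threshold, which avoids a `KeyError` for combinations that are not frequent.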