# Association Analysis

Association analysis is the task of finding interesting relationships in large datasets, such as items that are frequently purchased together.

Apriori is an algorithm for frequent item set mining and association rule learning over relational databases.
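
To make the idea concrete, here is a minimal pure-Python sketch of the level-wise search that Apriori is known for: frequent itemsets of size k are built only from frequent itemsets of size k-1. It is an illustration under simplifying assumptions, not the mlxtend implementation; the function name `apriori_sketch` and the `toy` transactions are made up for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    tx_sets = [set(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in tx_sets) / n

    # Level 1: frequent single items.
    singles = {frozenset([i]) for t in tx_sets for i in t}
    current = {s: support(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy transactions, not the retail dataset.
toy = [
    ["Bread", "Milk"],
    ["Bread", "Cheese", "Milk"],
    ["Cheese", "Milk"],
    ["Bread", "Cheese"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

With a threshold of 0.5, every single item and every pair is frequent in this toy data, while the three-item set is not, because it appears in only one of the four transactions.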

We will work with a dataset in which each row lists the items that were purchased together on the same day at the same store. The dataset is sparse: rows with fewer items are padded with empty cells.
The dataset can be found here: https://gist.github.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751



In [2]:
# Import Libraries
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

In [3]:
# Read the dataset 
df = pd.read_csv('retail_dataset.csv', sep=',')

# Print the first 5 rows
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


In [4]:
# Find the unique items in the table
items = set()
for col in df:
    items.update(df[col].unique())
print(items)

{'Diaper', nan, 'Meat', 'Eggs', 'Bread', 'Pencil', 'Cheese', 'Bagel', 'Wine', 'Milk'}
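
Note that `nan` shows up in this set because shorter transactions are padded with empty cells; it is a missing value, not a product. One possible way to collect only real item names is sketched below. The `sample` frame is a hypothetical stand-in for `df`, used so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the retail data: shorter rows are padded with NaN.
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", np.nan],
    ["Meat", np.nan, np.nan],
])

# Flatten all cells into one Series, drop missing values, keep distinct names.
items_clean = set(pd.Series(sample.to_numpy().ravel()).dropna().unique())
print(sorted(items_clean))
```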


# Data Preprocessing

The apriori function requires a dataframe whose values are either 0 and 1 or True and False. Our data consists of strings (item names), so we need to one-hot encode it.

In [5]:
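# One-hot encode each transaction: 1 if the item is in the row, 0 otherwise.
# Note: `items` still contains nan, so ohe_df also gets a nan column.
itemset = set(items)
encoded_vals = []
for index, row in df.iterrows():
    rowset = set(row)
    labels = {}
    uncommons = list(itemset - rowset)            # items absent from this transaction
    commons = list(itemset.intersection(rowset))  # items present in this transaction
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_vals.append(labels)
ohe_df = pd.DataFrame(encoded_vals)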
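
As an alternative sketch, the same encoding can be built with boolean values while skipping empty cells, so that no `nan` column is created. The `sample` frame and the names `transactions`, `all_items` and `ohe_bool` are illustrative and not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the retail data.
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", np.nan],
    ["Meat", np.nan, np.nan],
])

# One list of real item names per transaction (missing cells are skipped).
transactions = [[x for x in row if pd.notna(x)] for row in sample.to_numpy()]

# Fixed, sorted column order so the encoding is reproducible.
all_items = sorted({x for t in transactions for x in t})

# True where the transaction contains the item, False otherwise.
ohe_bool = pd.DataFrame(
    [{item: item in t for item in all_items} for t in transactions]
)
print(ohe_bool)
```

A frame built this way has one boolean column per product and could be passed to `apriori` in place of `ohe_df`, assuming the rest of the notebook is left unchanged.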

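Generate the frequent itemsets with a support of at least 10%, i.e. itemsets that occur in at least 10% of the transactions (the threshold is a judgment call; 10% keeps enough itemsets to work with).

Then generate the rules with a confidence of at least 0.5, together with their corresponding support, confidence and lift.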

In [6]:
%%time
# Applying apriori
freq_items = apriori(ohe_df, min_support=0.1, use_colnames=True)

# Mining association rules
apriori_rules = association_rules(freq_items, metric="confidence", min_threshold=0.5)
apriori_rules.head()

CPU times: total: 31.2 ms
Wall time: 36 ms




Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Milk),(nan),0.501587,0.869841,0.409524,0.816456,0.938626,-0.026778,0.709141,-0.115976
1,(Bagel),(nan),0.425397,0.869841,0.336508,0.791045,0.909413,-0.03352,0.622902,-0.147743
2,(Diaper),(nan),0.406349,0.869841,0.31746,0.78125,0.898152,-0.035999,0.595011,-0.160381
3,(Meat),(nan),0.47619,0.869841,0.368254,0.773333,0.889051,-0.045956,0.57423,-0.192405
4,(Eggs),(nan),0.438095,0.869841,0.336508,0.768116,0.883053,-0.044565,0.56131,-0.190735


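The **confidence** of a rule A → C is the fraction of transactions containing A that also contain C, i.e. support(A ∪ C) / support(A). It estimates how often the rule holds when its antecedent occurs.

The **lift** measures the strength of the association: it is the confidence divided by the support of C. A lift above 1 means A and C occur together more often than expected if they were independent; a lift below 1 means less often.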
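
As a worked check, the snippet below recomputes both metrics for the rule (Meat, Milk) → (Cheese), using confidence = support(A ∪ C) / support(A) and lift = confidence / support(C). The three support values are copied from row 92 of the filtered rule table; the variable names are illustrative.

```python
# Supports copied from the rule table (rounded to six decimals there).
support_rule = 0.203175        # support of {Meat, Milk, Cheese}
support_antecedent = 0.244444  # support of {Meat, Milk}
support_consequent = 0.501587  # support of {Cheese}

# Confidence: how often Cheese appears when Meat and Milk are both present.
confidence = support_rule / support_antecedent

# Lift: confidence relative to the baseline frequency of Cheese.
lift = confidence / support_consequent

print(round(confidence, 3))  # 0.831
print(round(lift, 3))        # 1.657
```

Both values agree with the `confidence` and `lift` columns of that row up to rounding of the inputs.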

In [7]:
apriori_rules[ (apriori_rules['lift'] >= 1.5) &
      (apriori_rules['confidence'] >= 0.7) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
85,"(Milk, Bagel)",(Bread),0.225397,0.504762,0.171429,0.760563,1.506777,0.057657,2.068347,0.434199
88,"(Meat, Milk)",(Eggs),0.244444,0.438095,0.177778,0.727273,1.660079,0.070688,2.060317,0.526261
89,"(Eggs, Milk)",(Meat),0.244444,0.47619,0.177778,0.727273,1.527273,0.061376,1.920635,0.456933
92,"(Meat, Milk)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137,0.524816
100,"(Eggs, Milk)",(Cheese),0.244444,0.501587,0.196825,0.805195,1.605293,0.074215,2.558519,0.499051
158,"(Meat, Eggs)",(Cheese),0.266667,0.501587,0.215873,0.809524,1.613924,0.082116,2.616667,0.518717
159,"(Cheese, Eggs)",(Meat),0.298413,0.47619,0.215873,0.723404,1.519149,0.073772,1.893773,0.487091
185,"(Eggs, Pencil)",(Wine),0.165079,0.438095,0.120635,0.730769,1.66806,0.048314,2.087075,0.479688
205,"(nan, Meat, Milk)",(Eggs),0.168254,0.438095,0.12381,0.735849,1.679655,0.050098,2.127211,0.486494
210,"(nan, Meat, Milk)",(Cheese),0.168254,0.501587,0.146032,0.867925,1.730356,0.061638,3.773696,0.507468


# TASKS

1. Execute the association analysis using the **fpgrowth** and **ECLAT** algorithms on the same dataset. 
2. For the above 2 algorithms, find the following:
  
  a. rate of Milk, Meat and Cheese being purchased together.
  
  b. percentage of customers who buy Eggs, Meat and Cheese. 

3. Compute the overall time for the association analysis.
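
For task 3, one possible approach is to time the whole pipeline (mining plus any lookups) with a single timer, so that the measurements for different algorithms cover the same steps. The helper `timed` and the stand-in `pipeline` function below are hypothetical; in the notebook, `pipeline` would be replaced by the actual mining calls.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def pipeline(data):
    # Hypothetical stand-in for "mine itemsets, then query them".
    return sorted(data)

result, seconds = timed(pipeline, [3, 1, 2])
print(f"pipeline took {seconds:.6f} s")
```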

In [8]:
from mlxtend.frequent_patterns import fpgrowth
from pyECLAT import ECLAT

In [9]:
# Mine frequent itemsets and association rules with FP-Growth
freq_items = fpgrowth(ohe_df, min_support=0.1, use_colnames=True)
fpgrowth_rules = association_rules(freq_items, metric="confidence", min_threshold=0.5)



In [10]:
%%time

# FP-Growth: support of the first rule whose antecedents contain Milk, Meat and Cheese.
# Note: a rule's support covers antecedents and consequents together, so this is
# the support of a larger itemset than {Milk, Meat, Cheese} alone.
for index, row in fpgrowth_rules.iterrows():
    if {'Milk', 'Meat', 'Cheese'} <= row['antecedents']:
        rate = row['support']
        break

# FP-Growth: share of rules whose antecedents contain Eggs, Meat and Cheese.
# Note: this counts rules, not transactions.
total_count, emc_count = 0, 0
for index, row in fpgrowth_rules.iterrows():
    if {'Eggs', 'Meat', 'Cheese'} <= row['antecedents']:
        emc_count += 1
    total_count += 1

print("Rate of Milk, Meat, and Cheese being purchased together:", rate)
print("Percentage of customers who buy Eggs, Meat, and Cheese:", (emc_count/total_count)*100)

Rate of Milk, Meat, and Cheese being purchased together: 0.14603174603174604
Percentage of customers who buy Eggs, Meat, and Cheese: 2.2813688212927756
CPU times: total: 46.9 ms
Wall time: 50 ms
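
The rate printed here (0.146) differs from the 0.203175 support that the Apriori table reports for (Meat, Milk) → (Cheese). It matches instead the support of the rule (nan, Meat, Milk) → (Cheese), which suggests the lookup picked up an itemset that also includes `nan`. Likewise, the percentage above is a share of rules rather than of transactions. A more direct route may be to read the support from the frequent-itemset frame, which has a `support` column and an `itemsets` column of frozensets. The sketch below uses a small hand-built frame (`freq_demo`, values copied from the Apriori table) so that it runs without mlxtend; `itemset_support` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-in with the same layout as the frame returned by fpgrowth().
freq_demo = pd.DataFrame({
    "support": [0.501587, 0.244444, 0.203175],
    "itemsets": [
        frozenset({"Cheese"}),
        frozenset({"Meat", "Milk"}),
        frozenset({"Meat", "Milk", "Cheese"}),
    ],
})

def itemset_support(freq, items):
    """Support of exactly this itemset, or None if it is not frequent."""
    wanted = frozenset(items)
    mask = freq["itemsets"].map(lambda s: s == wanted)
    match = freq.loc[mask, "support"]
    return float(match.iloc[0]) if not match.empty else None

rate_demo = itemset_support(freq_demo, ["Milk", "Meat", "Cheese"])
print("Support:", rate_demo)
print("As a percentage of transactions:", rate_demo * 100)
```

Multiplying a support by 100 gives the percentage of transactions that contain all items of the set, which is what task 2b asks for.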


In [13]:
# Preprocessing for ECLAT: relabel the columns with integers 0..n-1
eclat_df = df.copy()
eclat_df.columns = range(eclat_df.shape[1])

In [15]:
%%time

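# Create the ECLAT instance and mine all 3-item combinations.
# Note: min_support is 0.08 here, versus 0.1 for Apriori and FP-Growth.
eclat_instance = ECLAT(eclat_df, verbose=False)
indexes, supports = eclat_instance.fit(min_support=0.08, min_combination=3, max_combination=3, separator=' & ', verbose=False)

# Share of 3-item combinations that consist of Eggs, Meat and Cheese.
# Note: iterating over `supports` yields the combination names (keys), so this
# counts combinations, not transactions.
total_count, emc_count = 0, 0
for combination in supports:
    if 'Eggs' in combination and 'Meat' in combination and 'Cheese' in combination:
        emc_count += 1
    total_count += 1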

print("Rate of Milk, Meat, and Cheese being purchased together:", supports['Meat & Cheese & Milk'])
print("Percentage of customers who buy Eggs, Meat, and Cheese:", (emc_count/total_count)*100)

Rate of Milk, Meat, and Cheese being purchased together: 0.20317460317460317
Percentage of customers who buy Eggs, Meat, and Cheese: 1.36986301369863
CPU times: total: 1.22 s
Wall time: 1.25 s
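
The percentage printed above is the share of 3-item combinations, not of transactions. Since `supports` behaves like a dictionary keyed by item names joined with the separator, the support of a combination can be read from it directly. The key order inside each name is not guaranteed to be the one we expect, so the sketch below matches keys as sets. `combo_support` is a hypothetical helper, and `supports_demo` is a hand-built stand-in whose values are copied from the Apriori rule table.

```python
def combo_support(supports, items, separator=" & "):
    """Support of the combination made of exactly these items, or None."""
    wanted = frozenset(items)
    for key, value in supports.items():
        # Compare as sets so 'Meat & Cheese & Milk' matches any item order.
        if frozenset(key.split(separator)) == wanted:
            return value
    return None

# Hypothetical stand-in for the dictionary returned by ECLAT.fit().
supports_demo = {
    "Meat & Cheese & Milk": 0.203175,
    "Meat & Eggs & Cheese": 0.215873,
}

emc_support = combo_support(supports_demo, ["Eggs", "Meat", "Cheese"])
print("Percentage of transactions with Eggs, Meat and Cheese:", emc_support * 100)
```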
