## ASSOCIATION RULES

In [16]:
!pip install pandas mlxtend

Collecting mlxtend
  Downloading mlxtend-0.23.4-py3-none-any.whl.metadata (7.3 kB)
Downloading mlxtend-0.23.4-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.4


### Dataset:

Use the Online retail dataset to apply the association rules.

In [17]:
import pandas as pd

In [18]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [24]:
df = pd.read_excel('Online retail.xlsx', header=None)
df.head()

Unnamed: 0,0
0,"shrimp,almonds,avocado,vegetables mix,green gr..."
1,"burgers,meatballs,eggs"
2,chutney
3,"turkey,avocado"
4,"mineral water,milk,energy bar,whole wheat rice..."


### Data Preprocessing

Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

In [25]:
empty_transactions = df[0].isna() | df[0].apply(lambda x: str(x).strip() == '')
print(f"Number of empty transactions: {empty_transactions.sum()}")

Number of empty transactions: 0


In [26]:
df = df[~empty_transactions].copy()  # copy so the helper column added next does not trigger SettingWithCopyWarning

In [27]:
df['transaction_str'] = df[0].apply(lambda x: ','.join(sorted(str(x).split(','))))
duplicates = df.duplicated(subset='transaction_str', keep=False)
print(f"Number of duplicate transactions: {duplicates.sum()}")

Number of duplicate transactions: 2741


In [28]:
df = df.drop_duplicates(subset='transaction_str')
df = df.drop(columns=['transaction_str'])

In [29]:
transactions = df[0].apply(lambda x: [item.strip() for item in str(x).split(',')]).tolist()

In [32]:
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_onehot = pd.DataFrame(te_ary, columns=te.columns_)

In [33]:
df_onehot.head()

Unnamed: 0,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,True,True,False,True,False,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


In [34]:
frequent_itemsets = apriori(df_onehot, min_support=0.01, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)

Frequent Itemsets:
      support                               itemsets
0    0.029492                              (almonds)
1    0.011253                    (antioxydant juice)
2    0.046178                              (avocado)
3    0.012612                                (bacon)
4    0.015522                       (barbecue sauce)
..        ...                                    ...
433  0.014746  (olive oil, spaghetti, mineral water)
434  0.016686   (pancakes, spaghetti, mineral water)
435  0.012418     (shrimp, spaghetti, mineral water)
436  0.010865       (spaghetti, mineral water, soup)
437  0.013582   (tomatoes, spaghetti, mineral water)

[438 rows x 2 columns]


In [35]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
print("Association Rules:")
print(rules.head())

Association Rules:
                        antecedents      consequents  antecedent support  \
0              (chocolate, chicken)  (mineral water)            0.021149   
1            (olive oil, chocolate)  (mineral water)            0.023671   
2               (ground beef, eggs)  (mineral water)            0.028910   
3  (ground beef, frozen vegetables)  (mineral water)            0.024641   
4  (ground beef, frozen vegetables)      (spaghetti)            0.024641   

   consequent support   support  confidence      lift  representativity  \
0            0.299961  0.010865    0.513761  1.712760               1.0   
1            0.299961  0.012029    0.508197  1.694208               1.0   
2            0.299961  0.014552    0.503356  1.678069               1.0   
3            0.299961  0.013388    0.543307  1.811258               1.0   
4            0.230113  0.012612    0.511811  2.224177               1.0   

   leverage  conviction  zhangs_metric   jaccard  certainty  kulczynski  

### Association Rule Mining

##### A. Implement the Apriori algorithm in Python using libraries such as pandas and mlxtend.

In [36]:
frequent_itemsets = apriori(df_onehot, min_support=0.01, use_colnames=True)

In [37]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

In [38]:
rules['jaccard'] = rules['support'] / (rules['antecedent support'] + rules['consequent support'] - rules['support'])

In [39]:
display_columns = ['antecedents', 'consequents', 'support', 'confidence', 'lift', 'leverage', 'conviction', 'zhangs_metric', 'jaccard', 'certainty', 'kulczynski']
print(rules[display_columns])

                        antecedents      consequents   support  confidence  \
0              (chocolate, chicken)  (mineral water)  0.010865    0.513761   
1            (olive oil, chocolate)  (mineral water)  0.012029    0.508197   
2               (ground beef, eggs)  (mineral water)  0.014552    0.503356   
3  (ground beef, frozen vegetables)  (mineral water)  0.013388    0.543307   
4  (ground beef, frozen vegetables)      (spaghetti)  0.012612    0.511811   
5               (ground beef, milk)  (mineral water)  0.016104    0.506098   
6           (pancakes, ground beef)  (mineral water)  0.010865    0.518519   
7                 (olive oil, milk)  (mineral water)  0.012418    0.512000   
8                      (soup, milk)  (mineral water)  0.012418    0.576577   
9                 (spaghetti, soup)  (mineral water)  0.010865    0.523364   

       lift  leverage  conviction  zhangs_metric   jaccard  certainty  \
0  1.712760  0.004522    1.439702       0.425138  0.035022   0.30541

##### B. Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.

In [40]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)

In [41]:
rules['leverage'] = rules.apply(lambda x: x['support'] - (x['antecedent support'] * x['consequent support']), axis=1)
rules['conviction'] = rules.apply(lambda x: (1 - x['consequent support']) / (1 - x['confidence']) if (1 - x['confidence']) != 0 else float('inf'), axis=1)
# Lift-based score, not the standard Zhang's metric; it overwrites the zhangs_metric column that association_rules already returns.
rules['zhangs_metric'] = rules['lift'].apply(lambda lift: (lift - 1) / (max(lift, 1) + 1))
rules['jaccard'] = rules.apply(lambda x: x['support'] / (x['antecedent support'] + x['consequent support'] - x['support']) if (x['antecedent support'] + x['consequent support'] - x['support']) != 0 else 0, axis=1)
rules['certainty'] = rules.apply(lambda x: (x['confidence'] - x['consequent support']) / (1 - x['consequent support']) if (1 - x['consequent support']) != 0 else 0, axis=1)
rules['kulczynski'] = rules.apply(lambda x: (x['confidence'] + (x['support'] / x['consequent support'])) / 2 if x['consequent support'] != 0 else 0, axis=1)
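
The `zhangs_metric` expression in the cell above depends only on lift. For comparison, here is a minimal sketch of the commonly cited definition of Zhang's metric, which divides leverage by the larger of two support products. The function name `zhangs_metric` and the sample supports (0.3, 0.4, 0.5) are illustrative assumptions, not values from the retail dataset.

```python
def zhangs_metric(support_ac, support_a, support_c):
    """Zhang's metric for a rule A -> C.

    support_ac: support of A and C together
    support_a:  support of the antecedent A
    support_c:  support of the consequent C
    """
    leverage = support_ac - support_a * support_c
    denominator = max(
        support_ac * (1 - support_a),
        support_a * (support_c - support_ac),
    )
    # Degenerate supports (for example an antecedent in every basket) give a zero denominator.
    return 0.0 if denominator == 0 else leverage / denominator


# Hypothetical supports chosen for illustration only.
s_ac, s_a, s_c = 0.3, 0.4, 0.5
lift = s_ac / (s_a * s_c)

print("Zhang's metric:  ", round(zhangs_metric(s_ac, s_a, s_c), 4))
print("Lift-based score:", round((lift - 1) / (max(lift, 1) + 1), 4))
```

With these inputs the two scores differ (roughly 0.56 versus 0.2), so the lift-based column should not be read as Zhang's metric. Both are 0 under independence and negative when the items co-occur less often than expected.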

In [42]:
rules = rules[rules['lift'] >= 1.2].sort_values(by='lift', ascending=False)

In [43]:
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift', 'leverage', 'conviction', 'zhangs_metric', 'jaccard', 'certainty', 'kulczynski']].head(10)

Unnamed: 0,antecedents,consequents,support,confidence,lift,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
177,(whole wheat pasta),(olive oil),0.011059,0.271429,3.081372,0.00747,1.251645,0.509969,0.093904,0.201052,0.19849
127,(herb & pepper),(ground beef),0.022895,0.343023,2.514853,0.013791,1.314508,0.430986,0.127018,0.239259,0.255438
314,"(shrimp, mineral water)",(frozen vegetables),0.010477,0.310345,2.380234,0.006076,1.260943,0.408325,0.068182,0.206943,0.195351
306,"(spaghetti, frozen vegetables)",(ground beef),0.012612,0.321782,2.359126,0.007266,1.273339,0.404607,0.077381,0.214663,0.207122
345,"(spaghetti, milk)",(olive oil),0.010283,0.204633,2.323083,0.005857,1.146531,0.398149,0.080303,0.127804,0.160687
305,"(ground beef, frozen vegetables)",(spaghetti),0.012612,0.511811,2.224177,0.006941,1.577028,0.379687,0.052083,0.365896,0.283309
339,"(soup, mineral water)",(milk),0.012418,0.369942,2.17162,0.006699,1.316779,0.369407,0.064843,0.240571,0.221418
191,"(french fries, eggs)",(burgers),0.011447,0.245833,2.151146,0.006126,1.174435,0.36531,0.076623,0.148527,0.173002
331,"(spaghetti, mineral water)",(ground beef),0.024835,0.291572,2.13764,0.013217,1.219038,0.362578,0.126233,0.179681,0.236824
315,"(frozen vegetables, mineral water)",(shrimp),0.010477,0.206897,2.082705,0.005447,1.135614,0.351219,0.075104,0.119419,0.156183


##### C. Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

In [44]:
min_support = 0.01

In [45]:
min_confidence = 0.2

In [46]:
min_lift = 1.2

In [47]:
frequent_itemsets = apriori(df_onehot, min_support=min_support, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
rules = rules[rules['lift'] >= min_lift].sort_values(by='lift', ascending=False)

In [48]:
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head(10)

Unnamed: 0,antecedents,consequents,support,confidence,lift
177,(whole wheat pasta),(olive oil),0.011059,0.271429,3.081372
127,(herb & pepper),(ground beef),0.022895,0.343023,2.514853
314,"(shrimp, mineral water)",(frozen vegetables),0.010477,0.310345,2.380234
306,"(spaghetti, frozen vegetables)",(ground beef),0.012612,0.321782,2.359126
345,"(spaghetti, milk)",(olive oil),0.010283,0.204633,2.323083
305,"(ground beef, frozen vegetables)",(spaghetti),0.012612,0.511811,2.224177
339,"(soup, mineral water)",(milk),0.012418,0.369942,2.17162
191,"(french fries, eggs)",(burgers),0.011447,0.245833,2.151146
331,"(spaghetti, mineral water)",(ground beef),0.024835,0.291572,2.13764
315,"(frozen vegetables, mineral water)",(shrimp),0.010477,0.206897,2.082705


### Analysis and Interpretation

Analyse the generated rules to identify interesting patterns and relationships between the products.

Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.

In [49]:
MIN_LIFT = 2.0
MIN_CONFIDENCE = 0.2

In [50]:
strong_rules = rules[
    (rules['lift'] >= MIN_LIFT) &
    (rules['confidence'] >= MIN_CONFIDENCE)
].sort_values(by=['lift', 'confidence'], ascending=[False, False])

In [51]:
strong_rules['antecedents'] = strong_rules['antecedents'].apply(lambda x: tuple(x))
strong_rules['consequents'] = strong_rules['consequents'].apply(lambda x: tuple(x))


In [52]:
final_rules = strong_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]


In [53]:
print(f"--- Top 10 Strongest Rules (Lift >= {MIN_LIFT}, Confidence >= {MIN_CONFIDENCE}) ---")
print(final_rules.head(10))

--- Top 10 Strongest Rules (Lift >= 2.0, Confidence >= 0.2) ---
                            antecedents           consequents   support  \
177                (whole wheat pasta,)          (olive oil,)  0.011059   
127                    (herb & pepper,)        (ground beef,)  0.022895   
314             (shrimp, mineral water)  (frozen vegetables,)  0.010477   
306      (spaghetti, frozen vegetables)        (ground beef,)  0.012612   
345                   (spaghetti, milk)          (olive oil,)  0.010283   
305    (ground beef, frozen vegetables)          (spaghetti,)  0.012612   
339               (soup, mineral water)               (milk,)  0.012418   
191                (french fries, eggs)            (burgers,)  0.011447   
331          (spaghetti, mineral water)        (ground beef,)  0.024835   
315  (frozen vegetables, mineral water)             (shrimp,)  0.010477   

     confidence      lift  
177    0.271429  3.081372  
127    0.343023  2.514853  
314    0.310345  2.380234 

#### Association rules using different values of support and confidence

In [54]:
from mlxtend.frequent_patterns import apriori, association_rules
import pandas as pd

In [56]:
support_values = [0.05, 0.1, 0.15]
confidence_values = [0.5, 0.6, 0.7]


In [57]:
data = {
    'antecedents': [{'milk'}, {'bread'}, {'milk', 'bread'}],
    'consequents': [{'bread'}, {'butter'}, {'butter'}]
}
rules = pd.DataFrame(data)

support_values = [0.05, 0.1, 0.15]
confidence_values = [0.5, 0.6, 0.7]

results = []

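# Build a readable "A -> B" label per rule and drop repeats.
# This does not depend on the thresholds, so it runs once.
rules["antecedents_str"] = rules["antecedents"].apply(lambda x: ', '.join(sorted(x)))
rules["consequents_str"] = rules["consequents"].apply(lambda x: ', '.join(sorted(x)))
rules["rule"] = rules["antecedents_str"] + " -> " + rules["consequents_str"]
rules = rules.drop_duplicates(subset=["rule"])

# The thresholds are only recorded here: the hand-written rules carry no support or
# confidence values to filter on, so num_rules is 3 for every pair.
for support, confidence in zip(support_values, confidence_values):
    results.append({
        "support": support,
        "confidence": confidence,
        "num_rules": rules.shape[0]
    })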

print(results)

[{'support': 0.05, 'confidence': 0.5, 'num_rules': 3}, {'support': 0.1, 'confidence': 0.6, 'num_rules': 3}, {'support': 0.15, 'confidence': 0.7, 'num_rules': 3}]


Number of association rules obtained

In [58]:
data = {
    'milk': [1, 0, 1, 1],
    'bread': [1, 1, 0, 1],
    'butter': [0, 1, 1, 1],
}

In [59]:
df = pd.DataFrame(data)

In [60]:
support_values = [0.05, 0.1, 0.15]
confidence_values = [0.5, 0.6, 0.7]

In [65]:
results = []

In [64]:
df_bool = df.astype(bool)

In [63]:
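# `support` would otherwise be whatever an earlier loop left behind (0.15), so set it explicitly.
support = support_values[-1]
frequent_itemsets = apriori(df_bool, min_support=support, use_colnames=True)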

for confidence in confidence_values:
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence)

    num_rules = rules.shape[0]

    results.append({
        "support": support,
        "confidence": confidence,
        "num_rules": num_rules
    })

print(pd.DataFrame(results))

   support  confidence  num_rules
0     0.15         0.5          9
1     0.15         0.6          6
2     0.15         0.7          0
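
The table above covers a single support value (0.15). Below is a minimal sketch that sweeps every (support, confidence) pair. It is written in plain Python as a brute-force stand-in for `apriori` and `association_rules`, not as mlxtend's implementation. The helper names `support` and `count_rules` are illustrative, and the baskets restate the milk/bread/butter table row by row.

```python
from itertools import combinations, product

# Same toy baskets as the milk/bread/butter table, one set per row.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def count_rules(min_support, min_confidence):
    """Count rules A -> B whose joint itemset and confidence pass both thresholds.

    Assumes min_support > 0, so every antecedent of a kept itemset has
    non-zero support and the division below is safe.
    """
    all_items = sorted(set().union(*transactions))
    n_rules = 0
    for size in range(2, len(all_items) + 1):
        for itemset in combinations(all_items, size):
            joint = support(itemset)
            if joint < min_support:
                continue  # infrequent itemset: no rules are generated from it
            # Every non-empty proper subset can act as the antecedent.
            for k in range(1, size):
                for antecedent in combinations(itemset, k):
                    if joint / support(antecedent) >= min_confidence:
                        n_rules += 1
    return n_rules


support_values = [0.05, 0.1, 0.15]
confidence_values = [0.5, 0.6, 0.7]

for s, c in product(support_values, confidence_values):
    print(f"support={s}, confidence={c}, num_rules={count_rules(s, c)}")
```

On this four-row table the rarest itemset has support 0.25, so all three support thresholds keep the same seven itemsets and each yields 9, 6, and 0 rules. The support threshold only starts to matter on larger, sparser data such as the retail transactions.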


To eliminate redundant rules

In [66]:
support_values = [0.05, 0.1, 0.15]
confidence_values = [0.5, 0.6, 0.7]

In [67]:
for confidence in confidence_values:
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=confidence)
    
    # Drop duplicates to remove redundant rules
    rules = rules.drop_duplicates(subset=['antecedents', 'consequents'])
    
    num_rules = rules.shape[0]

    results.append({
        "support": support,
        "confidence": confidence,
        "num_rules": num_rules
    })
    print(pd.DataFrame(results))


   support  confidence  num_rules
0     0.15         0.5          9
   support  confidence  num_rules
0     0.15         0.5          9
1     0.15         0.6          6
   support  confidence  num_rules
0     0.15         0.5          9
1     0.15         0.6          6
2     0.15         0.7          0


In [68]:
rules = rules.drop_duplicates(subset=['antecedents', 'consequents'])

In [69]:
num_rules

0

### Interview Questions

1. What is lift and why is it important in Association rules?  
   Lift measures how much more often the antecedent and consequent occur together than expected if they were independent.  
   Formula: Lift(A→B) = Support(A ∪ B) / (Support(A) * Support(B))  
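   Importance: Unlike confidence, lift corrects for how common the consequent already is. A lift > 1 indicates a positive association, a lift of 1 indicates independence, and a lift < 1 indicates a negative association.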

2. What is support and confidence? How do you calculate them?  
   - Support: The proportion of transactions that contain the itemset.  
     Support(A) = (Number of transactions containing A) / (Total transactions)  
   - Confidence: The likelihood of occurrence of the consequent given the antecedent.  
     Confidence(A→B) = Support(A ∪ B) / Support(A)


3. What are some limitations or challenges of Association rules mining?  
   - Can generate a huge number of rules, many of which may be irrelevant or redundant.  
   - Doesn’t consider causality, only correlations.  
   - Sensitive to choice of support and confidence thresholds.  
   - Computationally expensive on large datasets.  
   - May miss interesting rare itemsets due to minimum support thresholds.
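
As a worked illustration of the support, confidence, and lift formulas in answers 1 and 2, the sketch below computes all three by hand for the rule milk → bread. The four baskets and the helper names are assumptions chosen for illustration; they are not drawn from the retail dataset.

```python
# Four hypothetical baskets used only to illustrate the formulas.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]


def support(itemset, transactions):
    """Support(A) = transactions containing A / total transactions."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence(A -> B) = Support(A ∪ B) / Support(A)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


A, B = {"milk"}, {"bread"}
print("support(A ∪ B):   ", support(A | B, transactions))
print("confidence(A → B):", round(confidence(A, B, transactions), 3))
print("lift(A → B):      ", round(lift(A, B, transactions), 3))
```

Here milk and bread each appear in three of four baskets but together in only two, so the lift comes out below 1. The rule has a confidence of about 0.67, yet bread is slightly less likely in baskets with milk than in baskets overall.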
