ASSOCIATION RULES

The objective of this assignment is to introduce students to rule mining techniques, with a focus on market basket analysis, and to provide hands-on experience.

Dataset:

Use the Online retail dataset to apply the association rules.

Data Preprocessing:

Pre-process the dataset so that it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.

Association Rule Mining:

* Implement the Apriori algorithm using a tool such as Python, with libraries such as pandas and mlxtend.

* Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.

* Set appropriate thresholds for support, confidence, and lift to extract meaningful rules.

Analysis and Interpretation:

* Analyse the generated rules to identify interesting patterns and relationships between the products.

* Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.

Interview Questions:
1.	What is lift, and why is it important in association rules?
2.	What are support and confidence? How do you calculate them?
3.	What are some limitations or challenges of association rule mining?



In [2]:
!pip install pandas mlxtend openpyxl





In [3]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
import warnings
warnings.filterwarnings("ignore")

In [4]:
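# Load the dataset. The file has no header row, so read_excel treats the first
# transaction as the column name; passing header=None would keep it as data.
df = pd.read_excel('/content/Online retail.xlsx', engine='openpyxl')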

# Display the first few rows of the dataset
print(df.head())
print(df.shape)

  shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0                             burgers,meatballs,eggs                                                                                                                                                                             
1                                            chutney                                                                                                                                                                             
2                                     turkey,avocado                                                                                                                                                                             
3  mineral water,milk,energy bar,whole wheat rice...                                            

In [5]:
print(df.columns)

Index(['shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil'], dtype='object')


In [6]:
# Data Preprocessing
df['Transaction'] = df.iloc[:, 0].apply(lambda x: x.split(','))
df['Transaction'] = df['Transaction'].apply(lambda x: [item.strip() for item in x])
df

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil",Transaction
0,"burgers,meatballs,eggs","[burgers, meatballs, eggs]"
1,chutney,[chutney]
2,"turkey,avocado","[turkey, avocado]"
3,"mineral water,milk,energy bar,whole wheat rice...","[mineral water, milk, energy bar, whole wheat ..."
4,low fat yogurt,[low fat yogurt]
...,...,...
7495,"butter,light mayo,fresh bread","[butter, light mayo, fresh bread]"
7496,"burgers,frozen vegetables,eggs,french fries,ma...","[burgers, frozen vegetables, eggs, french frie..."
7497,chicken,[chicken]
7498,"escalope,green tea","[escalope, green tea]"


In [7]:
# Get all unique items
all_items = sorted({item for sublist in df['Transaction'] for item in sublist})
all_items

['almonds',
 'antioxydant juice',
 'asparagus',
 'avocado',
 'babies food',
 'bacon',
 'barbecue sauce',
 'black tea',
 'blueberries',
 'body spray',
 'bramble',
 'brownies',
 'bug spray',
 'burger sauce',
 'burgers',
 'butter',
 'cake',
 'candy bars',
 'carrots',
 'cauliflower',
 'cereals',
 'champagne',
 'chicken',
 'chili',
 'chocolate',
 'chocolate bread',
 'chutney',
 'cider',
 'clothes accessories',
 'cookies',
 'cooking oil',
 'corn',
 'cottage cheese',
 'cream',
 'dessert wine',
 'eggplant',
 'eggs',
 'energy bar',
 'energy drink',
 'escalope',
 'extra dark chocolate',
 'flax seed',
 'french fries',
 'french wine',
 'fresh bread',
 'fresh tuna',
 'fromage blanc',
 'frozen smoothie',
 'frozen vegetables',
 'gluten free bar',
 'grated cheese',
 'green beans',
 'green grapes',
 'green tea',
 'ground beef',
 'gums',
 'ham',
 'hand protein bar',
 'herb & pepper',
 'honey',
 'hot dogs',
 'ketchup',
 'light cream',
 'light mayo',
 'low fat yogurt',
 'magazines',
 'mashed potato',
 'ma

In [8]:
# One-hot encode the dataset: Create a basket of items
basket = pd.DataFrame(0, index=df.index, columns=all_items)
for i, items in enumerate(df['Transaction']):
    basket.loc[i, list(set(items))] = 1
basket

Unnamed: 0,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Apply the Apriori algorithm
frequent_itemsets = apriori(basket, min_support=0.01, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.020267,(almonds)
1,0.033200,(avocado)
2,0.010800,(barbecue sauce)
3,0.014267,(black tea)
4,0.011467,(body spray)
...,...,...
254,0.011067,"(ground beef, mineral water, milk)"
255,0.017067,"(ground beef, mineral water, spaghetti)"
256,0.015733,"(mineral water, spaghetti, milk)"
257,0.010267,"(mineral water, olive oil, spaghetti)"


In [10]:
# Generate association rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(avocado),(mineral water),0.033200,0.238267,0.011467,0.345382,1.449559,1.0,0.003556,1.163629,0.320785,0.044103,0.140620,0.196753
1,(mineral water),(avocado),0.238267,0.033200,0.011467,0.048125,1.449559,1.0,0.003556,1.015680,0.407144,0.044103,0.015438,0.196753
2,(cake),(burgers),0.081067,0.087200,0.011467,0.141447,1.622103,1.0,0.004398,1.063185,0.417349,0.073129,0.059430,0.136473
3,(burgers),(cake),0.087200,0.081067,0.011467,0.131498,1.622103,1.0,0.004398,1.058068,0.420154,0.073129,0.054881,0.136473
4,(chocolate),(burgers),0.163867,0.087200,0.017067,0.104150,1.194377,1.0,0.002777,1.018920,0.194639,0.072934,0.018569,0.149934
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403,"(pancakes, spaghetti)",(mineral water),0.025200,0.238267,0.011467,0.455026,1.909736,1.0,0.005462,1.397744,0.488682,0.045503,0.284561,0.251576
404,"(mineral water, spaghetti)",(pancakes),0.059733,0.095067,0.011467,0.191964,2.019260,1.0,0.005788,1.119917,0.536836,0.080000,0.107077,0.156291
405,(pancakes),"(mineral water, spaghetti)",0.095067,0.059733,0.011467,0.120617,2.019260,1.0,0.005788,1.069235,0.557797,0.080000,0.064752,0.156291
406,(mineral water),"(pancakes, spaghetti)",0.238267,0.025200,0.011467,0.048125,1.909736,1.0,0.005462,1.024084,0.625373,0.045503,0.023518,0.251576


In [11]:
# Keep only rules with confidence above 0.5 and lift above 1.0
filtered_rules = rules[(rules['confidence'] > 0.5) & (rules['lift'] > 1.0)]
filtered_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
344,"(ground beef, eggs)",(mineral water),0.02,0.238267,0.010133,0.506667,2.126469,1.0,0.005368,1.544054,0.540548,0.040838,0.352354,0.274598
379,"(ground beef, milk)",(mineral water),0.022,0.238267,0.011067,0.50303,2.111207,1.0,0.005825,1.532756,0.538177,0.044409,0.34758,0.274738


In [12]:
# Check if rules are generated
if filtered_rules.empty:
    print("No rules found with the given thresholds. Consider lowering the thresholds.")
else:
    # Display the filtered rules with antecedents, consequents, support, confidence, and lift
    print("Filtered Association Rules:")
    print(filtered_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

Filtered Association Rules:
             antecedents      consequents   support  confidence      lift
344  (ground beef, eggs)  (mineral water)  0.010133    0.506667  2.126469
379  (ground beef, milk)  (mineral water)  0.011067    0.503030  2.111207


In [13]:
# Filter the association rules above using user-supplied thresholds

#@title User Input for Association Rule Filtering

min_confidence = 0.3  #@param {type:"number"}
min_lift = 1.0  #@param {type:"number"}


# Filter rules based on user input
user_filtered_rules = rules[(rules['confidence'] > min_confidence) & (rules['lift'] > min_lift)]


if user_filtered_rules.empty:
    print("No rules found with the given thresholds. Consider lowering the thresholds.")
else:
    # Display the filtered rules with antecedents, consequents, support, confidence, and lift
    print("Filtered Association Rules (based on user input):")
    print(user_filtered_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])


Filtered Association Rules (based on user input):
                    antecedents      consequents   support  confidence  \
0                     (avocado)  (mineral water)  0.011467    0.345382   
7                     (burgers)           (eggs)  0.028800    0.330275   
38                       (cake)  (mineral water)  0.027467    0.338816   
45                    (cereals)  (mineral water)  0.010267    0.398964   
58                    (chicken)  (mineral water)  0.022800    0.380000   
..                          ...              ...       ...         ...   
392           (spaghetti, milk)  (mineral water)  0.015733    0.443609   
396  (mineral water, olive oil)      (spaghetti)  0.010267    0.373786   
398      (olive oil, spaghetti)  (mineral water)  0.010267    0.447674   
402   (pancakes, mineral water)      (spaghetti)  0.011467    0.339921   
403       (pancakes, spaghetti)  (mineral water)  0.011467    0.455026   

         lift  
0    1.449559  
7    1.837585  
38   1.422002

In [14]:
# Check for an association rule by entering two item names

#@title Association Rule Check by Item Names

item1 = "mineral water"  #@param {type:"string"}
item2 = "pancakes"  #@param {type:"string"}


def check_association(item1, item2):
  """
  Prints the single-item association rule between two items, if one exists.

  Args:
    item1: The first item.
    item2: The second item.

  Returns:
    None. The matching rule, or a message if no rule is found, is printed.
  """

  # Find rules whose antecedent is exactly {item1} and whose consequent is exactly {item2}
  rule1 = rules[(rules['antecedents'] == frozenset([item1])) & (rules['consequents'] == frozenset([item2]))]

  # Find rules whose antecedent is exactly {item2} and whose consequent is exactly {item1}
  rule2 = rules[(rules['antecedents'] == frozenset([item2])) & (rules['consequents'] == frozenset([item1]))]


  if not rule1.empty:
    print(f"Association Rule: {item1} -> {item2}")
    print(rule1[['support', 'confidence', 'lift']])
    return

  if not rule2.empty:
    print(f"Association Rule: {item2} -> {item1}")
    print(rule2[['support', 'confidence', 'lift']])
    return

  print(f"No association rule found between {item1} and {item2}.")


check_association(item1, item2)


Association Rule: mineral water -> pancakes
      support  confidence     lift
267  0.033733    0.141578  1.48925


## Association Rule Mining - Key Concepts

### 1. **What is lift and why is it important in Association rules?**

**Lift** is a metric used to evaluate the strength of an association rule in association rule mining. It helps determine how much more likely two items (or itemsets) are to be bought together compared to being bought independently. It is defined as:

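$$
\text{Lift}(A \rightarrow B) = \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)}
$$

Here the denominator, the support of the consequent $B$, is the confidence the rule would be expected to have if $A$ and $B$ were independent.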

**Importance of Lift**:
- A lift value greater than 1 indicates that the occurrence of the antecedent increases the likelihood of the occurrence of the consequent, which signifies a positive association.
- A lift value of less than 1 implies that the antecedent reduces the likelihood of the consequent, indicating a negative association.
- A lift value equal to 1 suggests no association between the two itemsets.

Lift is crucial in **market basket analysis** because it helps retailers understand the strength of associations and can be used to identify items that should be promoted together. For example, if the rule `milk → bread` has a lift of 2, customers who buy milk are twice as likely to buy bread as customers in general.
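
As a quick check of the definition, the sketch below recomputes lift for the first two rules in the rules table above (avocado and mineral water, in both directions). The helper name `lift` is chosen here for illustration only and is not part of mlxtend; the input numbers are copied from the table, so the results are rounded.

```python
def lift(confidence: float, consequent_support: float) -> float:
    """Lift of A -> B: confidence(A -> B) divided by support(B)."""
    if consequent_support <= 0:
        raise ValueError("consequent support must be positive")
    return confidence / consequent_support


# Values copied from rows 0 and 1 of the rules table in this notebook.
avocado_to_water = lift(confidence=0.345382, consequent_support=0.238267)
water_to_avocado = lift(confidence=0.048125, consequent_support=0.033200)

# Both directions print 1.45: lift is symmetric, while confidence is not.
print(round(avocado_to_water, 3))
print(round(water_to_avocado, 3))
```

The two rules have very different confidence values (about 0.35 versus 0.05) but the same lift, which is why lift is often the more useful metric for ranking item pairs.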

### 2. **What is support and confidence? How do you calculate them?**

**Support** and **confidence** are key metrics in association rule mining.

- **Support**:
  Support is the proportion of transactions in the dataset that contain a specific itemset. It indicates how frequently an item or itemset appears in the data.

$$
\text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total number of transactions}}
$$

  For example, if `milk` appears in 30 out of 100 transactions, its support is 0.3.

- **Confidence**:
  Confidence measures how often items in the consequent appear in transactions that also contain the antecedent. It is the conditional probability of the consequent given the antecedent.

$$
\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cap B)}{\text{Support}(A)}
$$

  For example, if 30 out of 50 transactions containing `milk` also contain `bread`, the confidence of the rule `milk → bread` is 0.6 (or 60%).
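
A minimal sketch of both calculations in plain Python, using a small made-up list of transactions (the data and the function names `support` and `confidence` are illustrative assumptions, not taken from the dataset or from mlxtend). Note that "Support (A ∩ B)" in the formula means the share of transactions containing every item of both A and B, which in code is the support of the combined itemset.

```python
# Toy transactions, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


milk_support = support({"milk"}, transactions)                  # 3 of 5 transactions
milk_to_bread = confidence({"milk"}, {"bread"}, transactions)   # 2 of the 3 milk transactions

print(milk_support)
print(round(milk_to_bread, 3))
```

Here `milk` appears in 3 of 5 transactions (support 0.6), and 2 of those 3 also contain `bread`, so the confidence of `milk → bread` is about 0.667.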

### 3. **What are some limitations or challenges of association rule mining?**

**Limitations or Challenges** of Association Rule Mining:

1. **Computational Complexity**:
   - The number of candidate itemsets grows exponentially with the number of distinct items: n items give 2^n - 1 possible non-empty itemsets, and each counting pass has to scan every transaction. This makes association rule mining computationally expensive for large datasets, especially if the minimum support threshold is set low.

2. **Selection of Thresholds**:
   - The results depend heavily on the selection of minimum support, confidence, and lift thresholds. If the thresholds are set too high, interesting rules may be missed. If set too low, too many trivial or irrelevant rules might be generated.

3. **Overfitting to Noise**:
   - Association rule mining can sometimes generate rules based on noise or random co-occurrences in the data, leading to misleading conclusions.

4. **Interpretability**:
   - When many rules are generated, especially in large datasets, it becomes difficult to interpret which rules are truly meaningful and actionable.

5. **Sparsity of Data**:
   - In many real-world datasets (e.g., market basket data), most transactions contain only a small subset of all available items. This can result in a large number of itemsets having very low support, making it challenging to identify frequent patterns.

6. **Lack of Temporal Considerations**:
   - Association rule mining does not take into account the timing of purchases or events. In reality, some associations may only hold for a specific time period or under certain conditions, but traditional rule mining doesn’t capture these dynamics.
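
To make the computational complexity point concrete, the sketch below counts how many itemsets exist for a given number of distinct items. The function names are illustrative, and the item counts in the loop are arbitrary examples rather than figures from the dataset.

```python
from math import comb


def itemsets_of_size(n_items: int, k: int) -> int:
    """Number of distinct itemsets containing exactly k of n_items items."""
    return comb(n_items, k)


def all_nonempty_itemsets(n_items: int) -> int:
    """Number of non-empty itemsets over n_items items: 2^n - 1."""
    return 2 ** n_items - 1


# Arbitrary catalogue sizes, chosen only to show the growth rate.
for n in (10, 20, 50, 100):
    print(n, itemsets_of_size(n, 2), all_nonempty_itemsets(n))
```

Even the number of pairs grows quadratically, and the full search space doubles with every added item, which is why a minimum support threshold is needed to prune it.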

Addressing these challenges often involves adjusting thresholds, using pruning techniques, or applying more advanced algorithms like FP-Growth for large datasets.
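
The pruning idea behind Apriori can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the mlxtend implementation: the toy transactions and the function name `frequent_itemsets` are made up here, and the code favours readability over speed. It relies on the fact that an itemset can only be frequent if all of its subsets are frequent, so candidates with an infrequent subset are discarded before they are counted.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {itemset: support} for all frequent itemsets."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    result = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the support threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        # Build size-(k+1) candidates by joining frequent size-k itemsets.
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every size-(k-1) subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return result


# Toy data, invented for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk", "bread", "butter"},
]

found = frequent_itemsets(transactions, min_support=0.4)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With this data, `{bread, eggs}` falls below the threshold, so the triple `{milk, bread, eggs}` is pruned without ever being counted against the transactions.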
