# ASSOCIATION RULES

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [55]:
pip install mlxtend

Collecting mlxtend
  Downloading mlxtend-0.23.1-py3-none-any.whl.metadata (7.3 kB)
Downloading mlxtend-0.23.1-py3-none-any.whl (1.4 MB)

## Association Rule Mining:

In [57]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Step 1: Load or create a dataset
# Example dataset: Transactions of products
data = {
    'Transaction_ID': [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    'Product': ['Milk', 'Bread', 'Milk', 'Diapers', 'Beer', 
                'Bread', 'Butter', 'Milk', 'Bread', 'Diapers', 
                'Bread', 'Beer']
}
df = pd.DataFrame(data)

# Step 2: Pre-process the data to get a one-hot encoded table
basket = df.pivot_table(index='Transaction_ID', columns='Product', aggfunc=lambda x: 1, fill_value=0)

print("One-hot encoded dataset:")
print(basket)

# Step 3: Apply the Apriori algorithm to find frequent itemsets
min_support = 0.3  # Minimum support threshold
frequent_itemsets = apriori(basket, min_support=min_support, use_colnames=True)

print("\nFrequent Itemsets:")
print(frequent_itemsets)

# Step 4: Generate association rules
min_confidence = 0.5  # Minimum confidence threshold
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)

# Filter rules by lift (optional)
rules = rules[rules['lift'] > 1]  # Only keep rules with lift > 1

print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# Optional: Save the rules to a CSV file
rules.to_csv('association_rules.csv', index=False)

One-hot encoded dataset:
Product         Beer  Bread  Butter  Diapers  Milk
Transaction_ID                                    
1                  0      1       0        0     1
2                  1      0       0        1     1
3                  0      1       1        0     0
4                  0      1       0        1     1
5                  1      1       0        0     0

Frequent Itemsets:
   support         itemsets
0      0.4           (Beer)
1      0.8          (Bread)
2      0.4        (Diapers)
3      0.6           (Milk)
4      0.4    (Milk, Bread)
5      0.4  (Milk, Diapers)

Association Rules:
  antecedents consequents  support  confidence      lift
2      (Milk)   (Diapers)      0.4    0.666667  1.666667
3   (Diapers)      (Milk)      0.4    1.000000  1.666667




## Analysis and Interpretation

In [65]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Sample Data
data = {
    'Transaction_ID': [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    'Product': ['Milk', 'Bread', 'Milk', 'Diapers', 'Beer', 
                'Bread', 'Butter', 'Milk', 'Bread', 'Diapers', 
                'Bread', 'Beer']
}
df = pd.DataFrame(data)

# Step 1: One-hot encode the dataset
basket = df.pivot_table(index='Transaction_ID', columns='Product', aggfunc=lambda x: 1, fill_value=0)

# Step 2: Apply the Apriori algorithm to find frequent itemsets
frequent_itemsets = apriori(basket, min_support=0.3, use_colnames=True)

# Step 3: Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules = rules[rules['lift'] > 1]  # Only keep rules with lift > 1

# Step 4: Analyze the rules for insights
def analyze_rules(rules):
    print("\nTop Rules Sorted by Lift:")
    top_lift_rules = rules.sort_values(by='lift', ascending=False).head(5)
    print(top_lift_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

    print("\nRules with High Confidence (> 0.7):")
    high_conf_rules = rules[rules['confidence'] > 0.7]
    print(high_conf_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

    print("\nRules Involving 'Milk' as an Antecedent:")
    milk_rules = rules[rules['antecedents'].apply(lambda x: 'Milk' in x)]
    print(milk_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

# Call the analysis function
analyze_rules(rules)


Top Rules Sorted by Lift:
  antecedents consequents  support  confidence      lift
2      (Milk)   (Diapers)      0.4    0.666667  1.666667
3   (Diapers)      (Milk)      0.4    1.000000  1.666667

Rules with High Confidence (> 0.7):
  antecedents consequents  support  confidence      lift
3   (Diapers)      (Milk)      0.4         1.0  1.666667

Rules Involving 'Milk' as an Antecedent:
  antecedents consequents  support  confidence      lift
2      (Milk)   (Diapers)      0.4    0.666667  1.666667




## Interview Questions:

## 1. What is lift and why is it important in association rules?

Lift is a key metric used in Association Rule Mining to measure the strength of a rule. It tells us how much more likely the consequent (right-hand side of the rule) is to occur given the antecedent (left-hand side), compared to when the consequent occurs independently.

In other words, lift quantifies the degree to which two products are dependent on each other, or whether their co-occurrence is more than what would be expected by chance.

Formula for Lift:

Lift(A → B) = Confidence(A → B) / Support(B)

Where:

* Confidence(A → B) = P(B | A): the probability of buying B, given that A was bought.
* Support(B) = P(B): the overall probability that a transaction contains B, regardless of A.



Interpretation of Lift Values

* Lift > 1:
  1. The occurrence of A increases the likelihood of B.
  2. A and B are positively correlated, meaning they are more likely to be purchased together than by random chance.
* Lift = 1:
  1. A and B are independent of each other.
  2. The presence of A has no effect on the likelihood of B.
* Lift < 1:
  1. The occurrence of A reduces the likelihood of B.
  2. A and B are negatively correlated, meaning they are less likely to occur together than expected by chance.

Why is Lift Important in Association Rules?
1. Evaluates True Strength of a Rule:
   * Unlike confidence, which only measures how often B appears among the transactions that contain A, lift accounts for the baseline occurrence of the consequent. It tells us if the rule has real value beyond what would occur by chance.
2. Removes Bias from Frequent Items:
   * Items with high support (e.g., bread, milk) appear alongside many other items simply because they are common, which results in high confidence values. Lift helps filter out these trivial associations by checking if the association is stronger than expected by chance.
3. Identifies Interesting Relationships:
   * Rules with high lift values highlight products that are strongly associated, which can be used for product bundling, promotions, or cross-selling strategies.
4. Helps in Prioritization:
   * Business decisions like store layout and targeted marketing require identifying the most meaningful associations. Lift enables better prioritization by focusing on rules with the highest impact.
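
As a rough illustration of the "bias from frequent items" point, the sketch below recomputes confidence and lift by hand for two rules on the five sample transactions used in the notebook cells above. The helper names (`support`, `confidence`, `lift`) are illustrative only and are not part of mlxtend.

```python
# The five sample transactions from the notebook cells, as sets of products.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diapers", "Beer"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Diapers"},
    {"Bread", "Beer"},
]

def support(items):
    """Fraction of transactions that contain every product in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent), estimated from the transactions."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

for a, c in [("Milk", "Bread"), ("Milk", "Diapers")]:
    print(f"{a} -> {c}: confidence={confidence({a}, {c}):.3f}, lift={lift({a}, {c}):.3f}")
# Milk -> Bread: confidence=0.667, lift=0.833
# Milk -> Diapers: confidence=0.667, lift=1.667
```

Both rules have the same confidence, yet Milk → Bread has a lift below 1 because Bread already appears in 80% of transactions. This agrees with the notebook output above, where only the Milk/Diapers rules survive the `lift > 1` filter.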


Example:

If:

* 40% of transactions contain Milk (Support(Milk) = 0.4)
* 50% of transactions contain Bread (Support(Bread) = 0.5)
* 20% of transactions contain both Milk and Bread (Support(Milk ∩ Bread) = 0.2)

Then:

Confidence(Milk → Bread) = 0.2 / 0.4 = 0.5

Using the Lift formula:

Lift(Milk → Bread) = 0.5 / 0.5 = 1

In this case, the lift is 1, meaning buying Milk does not increase or decrease the likelihood of buying Bread beyond what would be expected by chance.
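
Because Confidence(A → B) is itself Support(A ∩ B) / Support(A), lift can also be written in a symmetric form. The worked equation below restates the example this way, using the same notation as the text; it is a rearrangement of the formula above, not a different metric.

```latex
\[
\mathrm{Lift}(A \to B)
  = \frac{\mathrm{Confidence}(A \to B)}{\mathrm{Support}(B)}
  = \frac{\mathrm{Support}(A \cap B)}{\mathrm{Support}(A)\,\mathrm{Support}(B)}
\]
\[
\mathrm{Lift}(\text{Milk} \to \text{Bread})
  = \frac{0.2}{0.4 \times 0.5}
  = \frac{0.2}{0.2}
  = 1
\]
```

The symmetric form also shows that Lift(A → B) = Lift(B → A), which is why Milk → Diapers and Diapers → Milk report the same lift (1.666667) in the rule tables earlier, even though their confidences differ.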

Conclusion

Lift is a crucial metric in Association Rule Mining because it helps us differentiate between meaningful and trivial relationships. By focusing on rules with lift > 1, businesses can identify opportunities to improve product bundling, promotions, and marketing strategies.



## 2. What are support and confidence? How do you calculate them?

1. What is Support in Association Rules?

Support measures how frequently an itemset or rule appears in the dataset: the proportion of transactions in which a particular product (or combination of products) occurs. It is useful for identifying popular items or itemsets.

Formula for Support:

Support(A → B) = (Transactions containing both A and B) / (Total number of transactions)

* Support(A ∩ B): The proportion of transactions that contain both A and B. This is the same quantity as Support(A → B).
* Support Threshold: A predefined value (e.g., 0.3 or 30%) used to filter out infrequent itemsets.

Example of Support Calculation:

If there are 10 transactions, and 3 of them contain both Milk and Bread, then:

Support(Milk → Bread) = 3 / 10 = 0.3

So, the support for the rule is 0.3, meaning 30% of the transactions contain both Milk and Bread.

2. What is Confidence in Association Rules?

Confidence measures the proportion of transactions containing the antecedent (A) that also contain the consequent (B). It indicates the likelihood of purchasing B given that A has already been purchased.

Formula for Confidence:

Confidence(A → B) = (Transactions containing both A and B) / (Transactions containing A)

* Confidence(A → B): The probability of buying B when A has been purchased.
* Confidence Threshold: A predefined value (e.g., 0.5 or 50%) used to filter out weak rules.

Example of Confidence Calculation:

If:

* 5 transactions contain Milk
* 3 of those transactions also contain Bread

Then:

Confidence(Milk → Bread) = 3 / 5 = 0.6

This means that 60% of the time, when Milk is purchased, Bread is also purchased.

Support vs Confidence:

1. Support helps identify frequent itemsets or product combinations across the entire dataset.
2. Confidence indicates the strength of a rule by measuring the likelihood of one product being purchased given the purchase of another.
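
Before handing the work to a library, it can help to see both formulas as plain counting. The following is a minimal sketch in standard-library Python, assuming the same long-format sample data (one transaction/product pair per row) used in the notebook cells; `count_containing` is a hypothetical helper introduced only for this illustration.

```python
from collections import defaultdict

# Same long-format sample data as the notebook cells:
# one (transaction, product) pair per position.
transaction_ids = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5]
products = ['Milk', 'Bread', 'Milk', 'Diapers', 'Beer',
            'Bread', 'Butter', 'Milk', 'Bread', 'Diapers',
            'Bread', 'Beer']

# Group the rows into one set of products per transaction.
baskets = defaultdict(set)
for tid, product in zip(transaction_ids, products):
    baskets[tid].add(product)

def count_containing(items):
    """Number of transactions that contain every product in `items`."""
    return sum(set(items) <= basket for basket in baskets.values())

n_transactions = len(baskets)
n_milk = count_containing({'Milk'})
n_milk_and_bread = count_containing({'Milk', 'Bread'})

# Support divides by all transactions; confidence divides by those containing Milk.
support_milk_bread = n_milk_and_bread / n_transactions
confidence_milk_bread = n_milk_and_bread / n_milk

print(f"Support(Milk -> Bread)    = {n_milk_and_bread}/{n_transactions} = {support_milk_bread:.3f}")
print(f"Confidence(Milk -> Bread) = {n_milk_and_bread}/{n_milk} = {confidence_milk_bread:.3f}")
# Support(Milk -> Bread)    = 2/5 = 0.400
# Confidence(Milk -> Bread) = 2/3 = 0.667
```

The only difference between the two metrics is the denominator, which is why a rule can never have confidence lower than its support.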


In [None]:
# Python Example for Support and Confidence Calculation

In [74]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Sample Data: Transactions with multiple products
data = {
    'Transaction_ID': [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    'Product': ['Milk', 'Bread', 'Milk', 'Diapers', 'Beer', 
                'Bread', 'Butter', 'Milk', 'Bread', 'Diapers', 
                'Bread', 'Beer']
}
df = pd.DataFrame(data)

# Step 1: Create a one-hot encoded dataset
basket = df.pivot_table(index='Transaction_ID', columns='Product', aggfunc=lambda x: 1, fill_value=0)

# Step 2: Generate frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(basket, min_support=0.2, use_colnames=True)

print("\nFrequent Itemsets with Support >= 0.2:")
print(frequent_itemsets)

# Step 3: Generate association rules with minimum confidence of 0.5
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence']])



Frequent Itemsets with Support >= 0.2:
    support                itemsets
0       0.4                  (Beer)
1       0.8                 (Bread)
2       0.2                (Butter)
3       0.4               (Diapers)
4       0.6                  (Milk)
5       0.2           (Beer, Bread)
6       0.2         (Beer, Diapers)
7       0.2            (Beer, Milk)
8       0.2         (Butter, Bread)
9       0.2        (Bread, Diapers)
10      0.4           (Milk, Bread)
11      0.4         (Milk, Diapers)
12      0.2   (Beer, Milk, Diapers)
13      0.2  (Milk, Bread, Diapers)

Association Rules:
         antecedents      consequents  support  confidence
0             (Beer)          (Bread)      0.2    0.500000
1             (Beer)        (Diapers)      0.2    0.500000
2          (Diapers)           (Beer)      0.2    0.500000
3             (Beer)           (Milk)      0.2    0.500000
4           (Butter)          (Bread)      0.2    1.000000
5          (Diapers)          (Bread)      0.2



## 3. What are some limitations or challenges of association rule mining?

Association rule mining is a powerful technique for discovering interesting relationships in large datasets, particularly in market basket analysis. However, it comes with several limitations and challenges that can affect the quality and applicability of the results. Here are some key limitations:

1. High Computational Complexity
   * Scalability Issues: The algorithm can become computationally intensive as the dataset size increases, particularly with a large number of items. This can lead to long processing times and high memory usage.
   * Exponential Growth of Itemsets: With n distinct items there are 2^n − 1 possible non-empty itemsets, so the candidate space grows exponentially with the number of items, making it challenging to compute frequent itemsets efficiently.
2. Choice of Parameters
   * Support and Confidence Thresholds: Choosing the right thresholds for support and confidence is crucial but can be subjective.
   * Low Thresholds: May yield too many trivial or uninteresting rules.
   * High Thresholds: Might filter out potentially useful rules, leading to loss of meaningful associations.
3. Interpretability of Results
   * Large Number of Rules: The algorithm can generate a large number of association rules, making it difficult to identify the most relevant ones.
   * Complexity of Relationships: Rules may not adequately capture the complexity of relationships between items, especially in more nuanced datasets.
4. Data Quality and Preprocessing
   * Missing or Noisy Data: The presence of missing, incomplete, or noisy data can lead to misleading results and affect the reliability of the rules.
   * Data Transformation: Effective transformation of data into a suitable format for mining is essential. Poor preprocessing can lead to poor-quality output.
5. Sparsity of Data
   * In datasets with a large number of items but relatively few transactions, many potential itemsets will have low support, leading to a sparsity issue. This makes it challenging to find meaningful associations.
6. Overfitting
   * The algorithm may identify associations that do not generalize well to new data, leading to overfitting. This is particularly a risk when rules are created based on high confidence without considering support or lift.
7. Lack of Context
   * Contextual Information: Association rules do not account for temporal or contextual information (e.g., seasonality, promotions) that might influence buying behavior. This limits their applicability in dynamic environments.
8. Causality vs. Correlation
   * Misinterpretation of Rules: Association rules indicate correlation but do not imply causation. Users may mistakenly assume that a strong association implies that purchasing one item causes the purchase of another, which is not necessarily true.
9. Simplicity of the Model
   * Each rule is evaluated in isolation from simple co-occurrence counts. The model does not consider how other items, quantities, or purchase order might alter the association between two items.
10. Single-level Analysis
    * Single-Level Mining: Traditional association rule mining does not capture hierarchical or multi-level relationships between products, limiting insights in more complex datasets.
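
The first two limitations (itemset growth and threshold sensitivity) can be made concrete on the notebook's five sample transactions. The sketch below is a brute-force enumeration in standard-library Python, written only to show the size of the search space. It is not how Apriori works internally, since Apriori prunes candidates instead of listing them all.

```python
from itertools import combinations

# The five sample transactions from the notebook cells, as sets of products.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diapers", "Beer"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Diapers"},
    {"Bread", "Beer"},
]
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions that contain every product in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Every non-empty subset of the catalogue is a candidate itemset: 2**n - 1 of them.
candidates = [c for k in range(1, len(items) + 1) for c in combinations(items, k)]
print(f"{len(items)} items -> {len(candidates)} candidate itemsets")

# Count how many candidates survive at different minimum-support thresholds.
frequent_counts = {}
for min_support in (0.2, 0.3, 0.5):
    frequent_counts[min_support] = sum(support(c) >= min_support for c in candidates)
    print(f"min_support={min_support}: {frequent_counts[min_support]} frequent itemsets")

# The candidate space doubles with every additional item.
for n in (10, 20, 30):
    print(f"{n} items -> {2**n - 1:,} candidate itemsets")
```

On this data the counts at thresholds 0.2 and 0.3 (14 and 6 itemsets) agree with the `apriori` outputs shown earlier in the notebook, and a modest change in `min_support` already changes the result set substantially.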

Conclusion
While association rule mining provides valuable insights, practitioners must be aware of these limitations and challenges. Careful parameter selection, preprocessing, and validation of results are essential to enhance the effectiveness of the analysis. Combining association rule mining with other data analysis techniques (e.g., clustering, classification) can also help mitigate some of these challenges and yield more robust insights.