# ASSOCIATION RULES

## Dataset:
#### Use the Online retail dataset to apply the association rules.
## Data Preprocessing:
#### Pre-process the dataset to ensure it is suitable for association rule mining. This may include handling missing values, removing duplicates, and converting the data to an appropriate format.
## Association Rule Mining:
#### •	Implement the Apriori algorithm in Python using libraries such as Pandas and Mlxtend.
#### •	Apply association rule mining techniques to the pre-processed dataset to discover interesting relationships between products purchased together.
#### •	Set appropriate thresholds for support, confidence and lift to extract meaningful rules.


In [2]:
import pandas as pd

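# Note: the file has no header row, so read_csv uses the first transaction as the
# column name (visible in the output below); pass header=None to keep it as data.
df = pd.read_csv("Online retail.csv")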

df.info(), df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 1 columns):
 #   Column                                                                                                                                                                                                                           Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                                           --------------  ----- 
 0   shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil  7500 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB


(None,
   shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
 0                             burgers,meatballs,eggs                                                                                                                                                                             
 1                                            chutney                                                                                                                                                                             
 2                                     turkey,avocado                                                                                                                                                                             
 3  mineral water,milk,energy bar,whole wheat rice...                                

In [3]:
# Splitting the single-column data into transactions
df_cleaned = df.iloc[:, 0].str.split(',', expand=False)

# Convert to a list of lists format
transactions = df_cleaned.tolist()

# Display a sample of processed transactions
transactions[:5]


[['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt']]
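
#### The split above does not yet handle the missing values or repeated items mentioned under Data Preprocessing. A minimal sketch of one way to do that, assuming each raw row is a comma-separated string; `clean_transactions` and `sample_rows` are hypothetical names used only for illustration:

```python
def clean_transactions(raw_rows):
    """Split comma-separated rows into item lists, dropping blanks and repeats."""
    cleaned = []
    for row in raw_rows:
        # Skip missing rows (None or NaN are not strings)
        if not isinstance(row, str):
            continue
        items = []
        for item in row.split(","):
            item = item.strip().lower()
            # Drop empty tokens and items repeated within the same basket
            if item and item not in items:
                items.append(item)
        if items:
            cleaned.append(items)
    return cleaned


sample_rows = ["burgers, meatballs,eggs", None, "turkey,avocado,turkey", " , "]
print(clean_transactions(sample_rows))
```

#### Only repeats inside a single basket are removed here. Identical baskets from different customers are kept, because dropping them would lower the support of popular combinations.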

In [9]:
pip install mlxtend

Note: you may need to restart the kernel to use updated packages.


In [10]:
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Convert transactions into a format suitable for apriori
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)

# Apply Apriori algorithm to find frequent itemsets
min_support = 0.02  # Setting a minimum support threshold
frequent_itemsets = apriori(df_encoded, min_support=min_support, use_colnames=True)

# Generate association rules
min_confidence = 0.3  # Setting minimum confidence
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)

# Display the top rules sorted by lift
rules.sort_values(by="lift", ascending=False).head(10)

index,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
8,(ground beef),(spaghetti),0.098267,0.174133,0.0392,0.398915,2.290857,1.0,0.022088,1.373959,0.624888,0.168096,0.272176,0.312015
18,(olive oil),(spaghetti),0.065733,0.174133,0.022933,0.348884,2.003547,1.0,0.011487,1.268387,0.536127,0.105716,0.211597,0.240292
14,(soup),(mineral water),0.050533,0.238267,0.023067,0.456464,1.915771,1.0,0.011026,1.401441,0.503458,0.086804,0.286449,0.276637
0,(burgers),(eggs),0.0872,0.179733,0.0288,0.330275,1.837585,1.0,0.013127,1.224782,0.499351,0.120941,0.183528,0.245256
19,(tomatoes),(spaghetti),0.0684,0.174133,0.020933,0.306043,1.75752,1.0,0.009023,1.190083,0.462663,0.094465,0.159723,0.213129
11,(olive oil),(mineral water),0.065733,0.238267,0.027467,0.41785,1.753707,1.0,0.011805,1.308483,0.460018,0.099325,0.235756,0.266563
7,(ground beef),(mineral water),0.098267,0.238267,0.040933,0.416554,1.748266,1.0,0.01752,1.305576,0.474647,0.138475,0.234054,0.294175
4,(cooking oil),(mineral water),0.051067,0.238267,0.020133,0.394256,1.654683,1.0,0.007966,1.257517,0.416947,0.074789,0.204782,0.239378
2,(chicken),(mineral water),0.06,0.238267,0.0228,0.38,1.594852,1.0,0.008504,1.228602,0.39679,0.082769,0.186067,0.237846
6,(frozen vegetables),(mineral water),0.095333,0.238267,0.035733,0.374825,1.573133,1.0,0.013019,1.218433,0.402718,0.119964,0.179273,0.262399
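
#### The task also asks for a lift threshold, which the cell above does not apply. A hedged sketch of filtering by lift with pandas, using a few rows copied from the output above; the cutoff `min_lift = 1.9` is an assumed value for illustration, and item names are stored as plain strings here rather than the frozensets that `association_rules` returns:

```python
import pandas as pd

# A few rows copied from the output above (other metric columns omitted)
rules_sample = pd.DataFrame(
    {
        "antecedents": ["ground beef", "olive oil", "soup", "chicken"],
        "consequents": ["spaghetti", "spaghetti", "mineral water", "mineral water"],
        "confidence": [0.398915, 0.348884, 0.456464, 0.38],
        "lift": [2.290857, 2.003547, 1.915771, 1.594852],
    }
)

min_lift = 1.9  # assumed cutoff for illustration only
strong_rules = rules_sample[rules_sample["lift"] >= min_lift]
print(strong_rules)
```

#### The same boolean mask works on the full `rules` DataFrame, and conditions on `confidence` or `support` can be combined with `&`.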


## Analysis and Interpretation:
#### •	Analyse the generated rules to identify interesting patterns and relationships between the products.
#### •	Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules.


### Analysis of Association Rules

#### Support: The proportion of transactions that contain a specific itemset. Higher support means the itemset appears frequently.
#### Confidence: The probability that if a customer buys item A, they will also buy item B. Higher confidence suggests a strong relationship.
#### Lift: Measures how much more likely item B is bought when item A is purchased, compared to B's overall purchase rate. Lift > 1 indicates a positive association; the further above 1, the stronger it is.

## Steps for Interpretation
### Identify strong rules
#### Sort the rules by lift to find the most impactful ones.
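#### Look for high confidence values (e.g., above 0.5) to ensure reliability. None of the ten rules shown above reaches 0.5 (the highest is about 0.46 for soup → mineral water), so a lower cutoff is needed for this dataset.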

### Find complementary products
#### In the output above, "ground beef → spaghetti" has the highest lift (2.29) with a confidence of about 0.40: customers who buy ground beef are more than twice as likely to buy spaghetti as customers overall.
#### Such insights can be used for bundling products or promotions.

### Spot substitute products
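#### Substitutes are bought instead of each other rather than together, so a pair such as "almonds" and "cashews" would be expected to show a lift below 1, not a frequent rule. A pair that appears together often points to complementary products instead.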

### Understand customer segments
#### If health-related products (e.g., "low-fat yogurt → honey") frequently appear together, it indicates a segment of health-conscious buyers.
#### If luxury items (e.g., "salmon → olive oil") show up together, they might appeal to premium customers.

### Marketing and Recommendation Strategies
#### Use high-lift rules for product recommendations (e.g., “People who bought this also bought...”).
#### Offer discounts or combo deals on frequently bought-together items.
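
#### As a rough illustration of the recommendation idea, the sketch below looks up rules whose antecedent is already in a basket and suggests their consequents, best lift first. The triples are copied from the rules table above; `recommend` is a hypothetical helper and handles only single-item antecedents:

```python
# (antecedent, consequent, lift) triples copied from the rules table above
rule_triples = [
    ("ground beef", "spaghetti", 2.290857),
    ("olive oil", "spaghetti", 2.003547),
    ("soup", "mineral water", 1.915771),
    ("burgers", "eggs", 1.837585),
]


def recommend(basket, rules, top_n=2):
    """Suggest consequents of rules whose antecedent is in the basket, best lift first."""
    best = {}
    for antecedent, consequent, lift in rules:
        # Only suggest items the customer does not already have
        if antecedent in basket and consequent not in basket:
            best[consequent] = max(lift, best.get(consequent, 0.0))
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:top_n]


print(recommend({"ground beef", "soup"}, rule_triples))
```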

## Interview Questions:
### 1.	What is lift and why is it important in Association rules?
#### Definition:
#### Lift measures how much more often the antecedent (A) and the consequent (B) occur together than would be expected if they were independent. Equivalently, it tells how much more likely B is when A is present than B is overall.

#### Formula:
#### Lift(A⇒B) = Support(A∩B) / (Support(A) × Support(B))
 
#### Where:
#### Support(A ∩ B) = Probability of both A and B occurring together.
#### Support(A) = Probability of A occurring.
#### Support(B) = Probability of B occurring.
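
#### To check the formula against the results, the sketch below recomputes lift and confidence for the rule (ground beef) → (spaghetti) from the support values in the rules table above. The variable names are chosen here for illustration:

```python
# Values for the rule (ground beef) -> (spaghetti), copied from the rules table
support_ab = 0.0392    # Support(A ∩ B), the "support" column
support_a = 0.098267   # Support(A), the "antecedent support" column
support_b = 0.174133   # Support(B), the "consequent support" column

lift = support_ab / (support_a * support_b)
confidence = support_ab / support_a

print(round(lift, 2))        # 2.29, matching the lift column
print(round(confidence, 3))  # 0.399, matching the confidence column
```

#### Dividing the confidence by Support(B) gives the same lift value, which is another common way to write the formula.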

#### Importance of Lift:
#### Measures Rule Strength: A higher lift indicates a stronger association between A and B.
#### Interpretability:
#### Lift = 1 → No association (A and B are independent).
#### Lift > 1 → Positive correlation (A increases the likelihood of B).
#### Lift < 1 → Negative correlation (A reduces the likelihood of B).
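#### Better than Confidence Alone: Confidence only looks at how often B occurs given A, so a very popular consequent can produce high confidence even without a real relationship. Lift also accounts for how common B is overall, which makes it a more reliable indicator of association.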

### 2.	What are support and confidence, and how do you calculate them?
#### 1. Support
#### Definition:
#### Support measures how frequently an itemset appears in the dataset. It helps in identifying commonly occurring itemsets.

#### Formula:
#### Support(A) = Number of transactions containing A / Total number of transactions

#### or for a rule A⇒B:

#### Support(A⇒B) = Number of transactions containing both A and B / Total number of transactions

#### Example:
#### If we have 10,000 transactions and "Milk & Bread" appear together in 800 of them, then Support(Milk ⇒ Bread) = 800 / 10,000 = 0.08.

#### 2. Confidence
#### Definition:
#### Confidence measures how often item B appears in transactions that already contain item A. It indicates the likelihood of B occurring given that A is present.

#### Formula:

#### Confidence(A⇒B)= Support(A∩B) / Support(A)
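
#### Both formulas can be computed directly from a list of baskets. The sketch below uses a small made-up set of transactions (`toy_transactions` is not taken from the dataset) and two hypothetical helper functions:

```python
toy_transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / support(antecedent, transactions)


print(support({"milk", "bread"}, toy_transactions))       # 2 of the 5 baskets
print(confidence({"milk"}, {"bread"}, toy_transactions))  # 2 of the 3 milk baskets
```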

### 3.	What are some limitations or challenges of Association rules mining?

#### Association rule mining is a powerful technique for discovering interesting relationships between variables in large datasets. However, it comes with several limitations and challenges, including:

#### High Computational Complexity – The process of generating frequent itemsets and association rules can be computationally expensive, especially for large datasets with many attributes.

#### Handling Large Datasets – The number of possible itemsets grows exponentially with the number of items, making it challenging to process large-scale databases efficiently.
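
#### The growth is easy to quantify: n distinct items allow 2^n − 1 non-empty itemsets. The sketch below counts them, optionally capped at a maximum itemset size; `candidate_itemsets` is a hypothetical helper and the item counts are arbitrary examples:

```python
from math import comb


def candidate_itemsets(n_items, max_size=None):
    """Count non-empty itemsets over n_items, optionally capped at max_size items."""
    if max_size is None:
        return 2 ** n_items - 1
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


for n in (10, 20, 50):
    # Columns: number of items, all itemsets, itemsets with at most 2 items
    print(n, candidate_itemsets(n), candidate_itemsets(n, max_size=2))
```

#### Apriori avoids enumerating all of these by pruning: any itemset that contains an infrequent subset cannot itself be frequent, so it is never counted.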

#### Choosing the Right Support and Confidence Thresholds – Setting appropriate minimum support and confidence values is often difficult. If thresholds are too high, important rules may be missed; if too low, too many trivial or irrelevant rules may be generated.

#### Generating Too Many Rules – Many association rule mining algorithms generate an excessive number of rules, making it difficult to extract meaningful insights from the results.

#### Redundant and Uninteresting Rules – Many discovered rules may be redundant or uninteresting, providing little new information to decision-makers.

#### Lack of Temporal Considerations – Traditional association rule mining does not take into account the sequence or time-order of transactions, which can be important in some applications.

#### Handling Continuous and Numeric Data – Association rule mining is primarily designed for categorical data, and preprocessing is often required to discretize numerical attributes, which can lead to loss of information.
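
#### A minimal sketch of such discretization, assuming a numeric "basket total" attribute that is not part of this dataset; the bin edges and labels are hypothetical choices:

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for a numeric "basket total" attribute
edges = [20, 50]  # boundaries between the three bins
labels = ["spend=low", "spend=mid", "spend=high"]


def to_item(value):
    """Map a number to a categorical item label using the bin edges."""
    return labels[bisect_right(edges, value)]


basket_totals = [12.5, 20, 49.99, 80]
print([to_item(v) for v in basket_totals])
```

#### Each label can then be added to a transaction like any other item. The information loss is visible here: 20 and 49.99 land in the same bin and become indistinguishable.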

#### Scalability Issues – As the dataset grows, the memory and processing power required to compute frequent itemsets and generate rules increase significantly.

#### Interpretability of Rules – Some discovered rules may be difficult to interpret or apply in real-world decision-making.

#### Ignoring Negative Associations – Most algorithms focus on finding positive associations, ignoring negative correlations (e.g., if one item is purchased, another is less likely to be purchased).

#### Data Sparsity – In datasets with a large number of unique items (e.g., e-commerce transactions), meaningful associations may be rare, leading to difficulties in mining useful rules.

#### Privacy Concerns – Mining associations in sensitive datasets (e.g., medical or financial data) may pose privacy risks and ethical concerns.