# Association rules

In [6]:
import pandas as pd
df = pd.read_excel('Online retail.xlsx',header = None)
df

|  | 0 |
|---|---|
| 0 | shrimp,almonds,avocado,vegetables mix,green gr... |
| 1 | burgers,meatballs,eggs |
| 2 | chutney |
| 3 | turkey,avocado |
| 4 | mineral water,milk,energy bar,whole wheat rice... |
| ... | ... |
| 7496 | butter,light mayo,fresh bread |
| 7497 | burgers,frozen vegetables,eggs,french fries,ma... |
| 7498 | chicken |
| 7499 | escalope,green tea |


# Data Preprocessing

In [7]:
df.isnull().sum()

0    0
dtype: int64

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
dtypes: object(1)
memory usage: 58.7+ KB


In [9]:
# Build one list of items per transaction by splitting each row on commas
transactions = []
for i in range(len(df)):
    transactions.append(df.iloc[i, 0].split(','))

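# Drop the first row, treated here as a header.
# Note: with header=None, row 0 above appears to be an ordinary transaction,
# so this step may discard real data.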
transactions = transactions[1:]

# Display number of transactions
print("Number of transactions:", len(transactions))

Number of transactions: 7500


# Association Rule Mining

In [11]:
# Apply Apriori algorithm
from apyori import apriori
rules = apriori(transactions=transactions,
                min_support=0.003,
                min_confidence=0.2,
                min_lift=3,
                min_length=2,
                max_length=2)
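
As a rough guide to what these thresholds mean in absolute terms, the sketch below converts `min_support` into a transaction count, using the 7,500 transactions reported above. It assumes a rule is kept when its support is at least `min_support`; check the apyori source if the boundary case matters.

```python
import math

n_transactions = 7500   # count printed in the preprocessing step
min_support = 0.003     # value passed to apriori above

# Smallest whole number of baskets whose share reaches the threshold
# (assumes the comparison is "support >= min_support").
min_count = math.ceil(min_support * n_transactions)
print(min_count)
```

Under that assumption, an item pair has to appear in at least 23 baskets before it can produce a rule.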

In [12]:
# Converting rules to list
report = list(rules)

# Displaying number of generated association rules
print("Number of association rules:", len(report))

Number of association rules: 9


In [13]:
report

[RelationRecord(items=frozenset({'chicken', 'light cream'}), support=0.004533333333333334, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.2905982905982906, lift=4.843304843304844)]),
 RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005733333333333333, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.30069930069930073, lift=3.7903273197390845)]),
 RelationRecord(items=frozenset({'escalope', 'pasta'}), support=0.005866666666666667, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.37288135593220345, lift=4.700185158809287)]),
 RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.0033333333333333335, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confiden

In [14]:
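# Accessing the fields of a report entry
report[0]           # first RelationRecord
report[0][0]        # items: frozenset of all items in the rule
report[0][1]        # support
report[0][2][0][0]  # base items (antecedent)
report[0][2][0][1]  # added items (consequent)
report[0][2][0][2]  # confidence
report[0][2][0][3]  # lift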


a=[]
b=[]
c=[]
d=[]
e = []

In [15]:
for i in range(len(report)):
    a.append(report[i][1]) # support
    b.append(report[i][2][0][0]) # base item
    c.append(report[i][2][0][1]) # add item
    d.append(report[i][2][0][2]) # confidence
    e.append(report[i][2][0][3]) # lift
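
The loop above reads only the first `OrderedStatistic` of each record. A possible alternative, sketched below, walks every ordered statistic and collects one row per rule. The two namedtuples are hypothetical stand-ins that mirror the field names visible in the printed `report`, so the example runs without apyori; `flatten` is an illustrative helper, not part of any library.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the field names shown in the printed report
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)

# One record copied from the output above
sample_report = [
    RelationRecord(
        items=frozenset({"chicken", "light cream"}),
        support=0.004533333333333334,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"light cream"}),
                items_add=frozenset({"chicken"}),
                confidence=0.2905982905982906,
                lift=4.843304843304844,
            )
        ],
    )
]


def flatten(report):
    """Return one dict per rule, covering every ordered statistic."""
    rows = []
    for record in report:
        for stat in record.ordered_statistics:
            rows.append(
                {
                    "support": record.support,
                    "base item": ", ".join(sorted(stat.items_base)),
                    "add item": ", ".join(sorted(stat.items_add)),
                    "confidence": stat.confidence,
                    "lift": stat.lift,
                }
            )
    return rows


rows = flatten(sample_report)
print(rows[0]["base item"], "->", rows[0]["add item"])
```

Passing the result to `pd.DataFrame(rows)` should give a table with column names taken from the dictionary keys.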

# Interpret the results and provide insights into customer purchasing behaviour based on the discovered rules

In [16]:
df_new  = pd.concat([pd.DataFrame(a),
          pd.DataFrame(b),
          pd.DataFrame(c),
          pd.DataFrame(d),
          pd.DataFrame(e)],axis=1)

df_new.columns = ['Support','baseitem','add item','confidence','lift']
df_new

|  | Support | baseitem | add item | confidence | lift |
|---|---|---|---|---|---|
| 0 | 0.004533 | light cream | chicken | 0.290598 | 4.843305 |
| 1 | 0.005733 | mushroom cream sauce | escalope | 0.300699 | 3.790327 |
| 2 | 0.005867 | pasta | escalope | 0.372881 | 4.700185 |
| 3 | 0.003333 | fromage blanc | honey | 0.245098 | 5.178128 |
| 4 | 0.016 | herb & pepper | ground beef | 0.32345 | 3.291555 |
| 5 | 0.005333 | tomato sauce | ground beef | 0.377358 | 3.840147 |
| 6 | 0.0032 | light cream | olive oil | 0.205128 | 3.120612 |
| 7 | 0.008 | whole wheat pasta | olive oil | 0.271493 | 4.130221 |
| 8 | 0.005067 | pasta | shrimp | 0.322034 | 4.514494 |


# Interview Questions

### 1. What is lift and why is it important in Association rules?

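**Lift** is a measure used in association rule mining to evaluate the strength of a rule \( X \rightarrow Y \). It is the ratio of the observed support of the rule to the support expected if \( X \) and \( Y \) were independent: \( \text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)} \), which is equivalent to \( \frac{\text{Confidence}(X \rightarrow Y)}{\text{Support}(Y)} \).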

**Importance of Lift:**
- **Independence Check:** Lift helps in identifying whether the occurrence of itemsets \( X \) and \( Y \) together is more than expected by chance. A lift value of 1 indicates independence, greater than 1 indicates a positive correlation (items co-occur more frequently than expected), and less than 1 indicates a negative correlation.
- **Strength of Association:** A higher lift value indicates a stronger association between the itemsets \( X \) and \( Y \). This helps in identifying the most impactful rules.
- **Filter Weak Rules:** It helps in filtering out rules that might have high support and confidence but are not interesting because they occur merely due to the high frequency of individual items.
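
To make the definition concrete, the sketch below takes the support, confidence, and lift reported for the rule {light cream} → {chicken} in the results and works backwards to the individual item supports, using confidence = support(X ∪ Y) / support(X) and lift = confidence / support(Y). Variable names are illustrative; only the three input numbers come from the output above.

```python
# Metrics reported for {light cream} -> {chicken}
support_xy = 0.004533333333333334   # support of both items together
confidence = 0.2905982905982906
lift = 4.843304843304844

# Back out the individual supports from the definitions
support_x = support_xy / confidence   # share of baskets with light cream
support_y = confidence / lift         # share of baskets with chicken

# Support the pair would have if the two items were independent
expected_if_independent = support_x * support_y

# Observed support divided by expected support reproduces the reported lift
lift_recomputed = support_xy / expected_if_independent

print(round(support_x, 4), round(support_y, 4), round(lift_recomputed, 4))
```

Read this way, chicken appears in about 6% of all baskets but in about 29% of baskets that contain light cream, which is where the lift of roughly 4.8 comes from.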

### 2. What are support and confidence? How do you calculate them?

**Support** and **Confidence** are fundamental measures in association rule mining.

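- **Support:** The fraction of transactions that contain an itemset: \( \text{Support}(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}} \).

- **Confidence:** How often the rule \( X \rightarrow Y \) holds among transactions that contain \( X \). It is the ratio of the support of the combined itemset to the support of the antecedent: \( \text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \).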
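
A minimal sketch of both calculations on a small made-up basket list follows. The five transactions and the helper names `support` and `confidence` are assumptions for illustration only; they are not taken from the dataset or from any library.

```python
# Toy transactions, invented for this example
transactions = [
    ["pasta", "escalope"],
    ["pasta", "shrimp"],
    ["pasta", "escalope", "olive oil"],
    ["olive oil"],
    ["escalope"],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


def confidence(base, add, transactions):
    """Support of base and add together, divided by support of base."""
    combined = set(base) | set(add)
    return support(combined, transactions) / support(base, transactions)


print(support(["pasta"], transactions))                   # 3 of 5 baskets
print(support(["pasta", "escalope"], transactions))       # 2 of 5 baskets
print(confidence(["pasta"], ["escalope"], transactions))  # 2 of the 3 pasta baskets
```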

### 3. What are some limitations or challenges of association rule mining?

**Limitations and Challenges:**

1. **Scalability:** Association rule mining can be computationally expensive, especially with large datasets and long itemsets. Algorithms like Apriori can be inefficient due to the need to generate and test a vast number of candidate itemsets.

2. **Support Threshold:** Setting an appropriate minimum support threshold is challenging. A high threshold may miss interesting patterns with lower frequency, while a low threshold may produce an overwhelming number of rules, many of which could be irrelevant or spurious.

3. **Redundancy:** Many rules can be redundant or very similar, making it difficult to identify the truly interesting or novel patterns.

4. **Actionability:** Not all discovered rules are actionable or useful in a practical sense. The relevance of the rules must be evaluated in the context of the specific business or research objective.

5. **Interpretability:** Rules with a high number of items can be complex and hard to interpret, reducing their practical usability.

6. **Noise and Outliers:** Data quality issues such as noise and outliers can affect the reliability of the mined rules. These can lead to discovering misleading or unrepresentative rules.

7. **Imbalance:** In datasets with imbalanced transactions (some items are much more frequent than others), frequent itemsets can dominate the rule generation process, leading to biased results.

8. **Parameter Sensitivity:** The results of association rule mining are sensitive to the parameters set for support and confidence. Small changes in these parameters can significantly alter the rules generated.

Addressing these challenges often requires preprocessing, parameter tuning, and the use of advanced algorithms or heuristic methods to improve the efficiency and relevance of the mined rules.