# Market Basket Analysis

Market Basket Analysis and Association Rule Mining are crucial for businesses, particularly in retail, to understand consumer behavior and optimize their marketing efforts. Here's a breakdown of how these techniques work and their benefits:

### Key Concepts:
- **Market Basket Analysis**: Involves analyzing customer transactions to determine which products are often purchased together. The goal is to identify associations between different items in a shopping cart. For example, if customers who buy bread often buy butter, this association can be used for cross-selling.
  
- **Association Rule Mining**: This is the mathematical technique behind Market Basket Analysis. It involves finding relationships between variables in large datasets. In retail, it typically focuses on identifying frequent itemsets and generating rules that describe these relationships.

### Techniques:
1. **Frequent Itemset Mining**: The first step is to identify frequent combinations of items purchased together using algorithms like the **Apriori algorithm** or **FP-Growth**.
   
2. **Association Rules**: After finding frequent itemsets, association rules are generated to describe the relationships between items. These rules are often expressed in the form of "If X, then Y," meaning if a customer buys product X, they are likely to buy product Y as well.

3. **Metrics Used in Association Rule Mining**:
   - **Support**: The frequency of an itemset appearing in the dataset. It helps to filter out itemsets that don't appear frequently enough to be useful.
   - **Confidence**: A measure of the likelihood that if one item is purchased, another item will also be purchased.
   - **Lift**: The ratio of the observed support of a rule to the expected support if the items were independent. A lift greater than 1 indicates a positive association between the items.
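
To make the "If X, then Y" form concrete, the sketch below enumerates every way a single frequent itemset can be split into an antecedent and a consequent. The helper name `candidate_rules` and the bread/butter/milk itemset are hypothetical, chosen only for illustration; a real rule generator would then score each split with the metrics above and keep only those passing a threshold.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield every (antecedent, consequent) split of one itemset.

    Hypothetical helper for illustration; it only enumerates splits
    and does not score them.
    """
    items = sorted(itemset)
    # Antecedent sizes run from 1 to len(items) - 1 so both sides are non-empty.
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent


rules = list(candidate_rules({"bread", "butter", "milk"}))
for antecedent, consequent in rules:
    print(f"If {', '.join(antecedent)}, then {', '.join(consequent)}")
```

An itemset with three items yields six candidate rules, which is why rule generation is normally restricted to itemsets that are already known to be frequent.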

### Applications:
1. **Optimizing Store Layout**: By analyzing which products are frequently bought together, retailers can arrange items in a store or on an e-commerce website in a way that increases cross-selling opportunities.
   
2. **Cross-Selling**: Retailers can recommend complementary products (e.g., recommending a phone case when a customer buys a smartphone).

3. **Promotions and Discounts**: Targeted promotions or bundle discounts can be offered for items frequently purchased together, enticing customers to purchase more.

4. **Customer Behavior Analysis**: Understanding purchasing patterns allows businesses to segment customers based on their shopping habits, leading to more personalized marketing campaigns.

5. **Catalog Design**: Retailers can design their product catalog by highlighting frequently purchased items together, making it easier for customers to find related products.

6. **Customized Emails**: Companies can send personalized emails offering discounts or suggestions based on past purchasing behavior, improving customer engagement.

### Example:
Suppose a customer buys milk. Through Market Basket Analysis, the system may reveal that customers who buy milk also often buy cookies. The retailer could then place these items near each other in the store or suggest them as a bundle online.

### Tools:
- **Apriori Algorithm**: Widely used for mining frequent itemsets and generating association rules.
- **FP-Growth Algorithm**: Typically faster than Apriori on larger datasets, because it avoids generating candidate itemsets explicitly.

By leveraging these techniques, companies can improve their sales strategies and provide a more personalized shopping experience for customers.

### Metrics

- **Support** : The baseline popularity of an item. In mathematical terms, the support of item A is the ratio of transactions involving A to the total number of transactions.


- **Confidence** : Likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.
     - Confidence(A => B) = Support(A, B)/Support(A)


- **Lift** : How much more likely B is to be bought when A is bought, compared with B's baseline popularity.

    - Lift(A => B) = Confidence(A => B)/Support(B)

    - Lift (A => B) = 1 means that there is no correlation within the itemset.
    - Lift (A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products A and B are more likely to be bought together.
    - Lift (A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products A and B are unlikely to be bought together.
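
As a quick check of these formulas, here is a minimal pure-Python sketch that computes support, confidence, and lift on five made-up transactions. The transactions and the function names are assumptions for illustration and are unrelated to the Instacart data used later.

```python
# Five hypothetical transactions, each a set of product names.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cookies"},
    {"bread", "butter", "jam"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(a, b):
    """Confidence(A => B) = Support(A, B) / Support(A)."""
    return support(set(a) | set(b)) / support(a)


def lift(a, b):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    return confidence(a, b) / support(b)


print("support(bread)          =", support({"bread"}))
print("support(bread, butter)  =", support({"bread", "butter"}))
print("confidence(bread => butter) =", round(confidence({"bread"}, {"butter"}), 3))
print("confidence(butter => bread) =", round(confidence({"butter"}, {"bread"}), 3))
print("lift(bread => butter)   =", round(lift({"bread"}, {"butter"}), 3))
```

Here bread appears in 4 of 5 transactions and bread with butter in 3 of 5, so Confidence(bread => butter) is 0.75 while Confidence(butter => bread) is 1.0. The lift is 1.25 in both directions, since lift is symmetric in A and B.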

**Apriori Algorithm:** Apriori is the frequent-itemset algorithm used for this Market Basket Analysis. It relies on the property that every subset of a frequent itemset must also be frequent. For example, every transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}, so if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must be frequent too. Conversely, if {Grapes, Mango} is infrequent, no larger itemset containing it can be frequent, which lets the algorithm discard such candidates without counting them.
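
The level-wise search that this principle enables can be sketched in a few lines of plain Python. This is a simplified illustration under assumed toy data, not the mlxtend implementation; the function name `apriori_itemsets` and the transactions are hypothetical.

```python
from itertools import combinations


def apriori_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support.

    Simplified sketch of the level-wise Apriori search.
    """
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only candidates that meet the support threshold.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


toy_transactions = [
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Mango"},
    {"Grapes", "Apple", "Mango"},
    {"Apple"},
    {"Grapes", "Mango", "Banana"},
]
frequent = apriori_itemsets(toy_transactions, min_support=0.4)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(sorted(itemset), s)
```

With a threshold of 0.4, Banana (1 of 5 transactions) is dropped at the first level, so no pair or triple containing it is ever counted.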

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules  

In [None]:
orders = pd.read_csv(r"instacart-market_basket_analysis\orders.csv")
products = pd.read_csv(r"instacart-market_basket_analysis\products.csv")
order_products_prior = pd.read_csv(r"instacart-market_basket_analysis\order_products__prior.csv")
order_products_train = pd.read_csv(r"instacart-market_basket_analysis\order_products__train.csv")

In [4]:
order_products = pd.concat([order_products_prior, order_products_train], axis=0)

In [5]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [6]:
order_products.shape

(33819106, 4)

In [7]:
order_products['product_id'].nunique()

49685

In [8]:
product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
product_counts = product_counts.merge(products, on = 'product_id', how = 'left')
product_counts.head(20)

Unnamed: 0,product_id,frequency,product_name,aisle_id,department_id
0,24852,491291,Banana,24,4
1,13176,394930,Bag of Organic Bananas,24,4
2,21137,275577,Organic Strawberries,24,4
3,21903,251705,Organic Baby Spinach,123,4
4,47209,220877,Organic Hass Avocado,24,4
5,47766,184224,Organic Avocado,24,4
6,47626,160792,Large Lemon,24,4
7,16797,149445,Strawberries,24,4
8,26209,146660,Limes,24,4
9,27845,142813,Organic Whole Milk,84,16


In [9]:
product_counts.shape

(100, 5)

In [10]:
freq_products = list(product_counts.product_id)
freq_products[1:10]

[13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845]

In [11]:
len(freq_products)

100

In [12]:
order_products = order_products[order_products["product_id"].isin(freq_products)]
order_products.shape

(7795471, 4)

In [13]:
order_products = order_products.merge(products, on = 'product_id', how='left')
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,2,28985,2,1,Michigan Organic Kale,83,4
1,2,17794,6,1,Carrots,83,4
2,3,24838,2,1,Unsweetened Almondmilk,91,16
3,3,21903,4,1,Organic Baby Spinach,123,4
4,3,46667,6,1,Organic Ginger Root,83,4


In [14]:
order_products['reordered'].nunique()

2

In [15]:
basket = order_products.pivot_table(index='order_id', 
                                    columns='product_name', 
                                    values='reordered', 
                                    aggfunc='count', 
                                    fill_value=0)
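
The effect of this pivot is easier to see on a few rows. The sketch below uses a hypothetical toy frame with the same column names. Note that `aggfunc='count'` counts rows regardless of the value in `reordered`, so a cell is 1 whenever the product is in the order. Converting the result to booleans is an optional extra step that is not part of the notebook; as far as I know, mlxtend accepts 0/1 integers but recent versions warn about non-boolean input.

```python
import pandas as pd

# Hypothetical toy data mirroring the order_products columns used above.
toy = pd.DataFrame({
    "order_id":     [1, 1, 2, 3, 3, 3],
    "product_name": ["Banana", "Limes", "Banana",
                     "Limes", "Large Lemon", "Banana"],
    "reordered":    [1, 0, 1, 0, 1, 1],
})

# One row per order, one column per product, cell = number of matching rows.
toy_basket = toy.pivot_table(index="order_id",
                             columns="product_name",
                             values="reordered",
                             aggfunc="count",
                             fill_value=0)

# Optional: convert counts to a boolean "was the product in the order?" table.
toy_basket_bool = toy_basket > 0
print(toy_basket_bool)
```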


In [16]:
basket.head()

order_id,100% Raw Coconut Water,100% Whole Wheat Bread,2% Reduced Fat Milk,Apple Honeycrisp Organic,Asparagus,Bag of Organic Bananas,Banana,Bartlett Pears,Blueberries,Boneless Skinless Chicken Breasts,...,Sparkling Natural Mineral Water,Sparkling Water Grapefruit,Spring Water,Strawberries,Uncured Genoa Salami,Unsalted Butter,Unsweetened Almondmilk,Unsweetened Original Almond Breeze Almond Milk,Whole Milk,Yellow Onions
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
5,0,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
basket.shape

(2444982, 100)

# Apriori Function in Association Rule Mining

The `apriori` function is used to find **frequent itemsets**—combinations of products that frequently appear together in transactions. Here's an explanation of its key parameters:

---

## Key Parameters

### 1. **`min_support=0.01`**
- **Support** measures how often an itemset appears in transactions:

  \[
  \text{Support} = \frac{\text{Number of transactions containing the itemset}}{\text{Total number of transactions}}
  \]

- Setting `min_support=0.01` means **only itemsets that appear in at least 1% of all transactions** will be considered frequent.
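
To get a feel for what 1% means here, the threshold can be converted into an absolute number of orders. The sketch below uses the row count reported by `basket.shape` earlier; the variable names are illustrative.

```python
import math

n_transactions = 2_444_982  # number of rows in `basket`, from basket.shape
min_support = 0.01

# Smallest number of orders an itemset must appear in to count as frequent.
min_count = math.ceil(min_support * n_transactions)
print(min_count)
```

So an itemset has to occur in roughly 24,450 orders before it is kept.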

---

### 2. **`use_colnames=True`**
- This ensures the **product names** are displayed in the output, rather than column indices.
- It makes the results more interpretable, especially when working with datasets that have meaningful column names.

---

### 3. **`low_memory=True`**
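- This parameter **reduces memory usage** by searching for itemsets above `min_support` with an iterator instead of materializing all candidates at once. It is useful when memory is limited on large datasets, but the mlxtend documentation notes that this mode is several times slower than the default.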

---

## Example Usage

```python
from mlxtend.frequent_patterns import apriori

# Applying the apriori algorithm on a transaction dataset
frequent_itemsets = apriori(
    basket,
    min_support=0.01,
    use_colnames=True,
    low_memory=True
)

print(frequent_itemsets)
```


In [18]:
frequent_items = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)
frequent_items.head()



Unnamed: 0,support,itemsets
0,0.016062,(100% Raw Coconut Water)
1,0.025814,(100% Whole Wheat Bread)
2,0.0158,(2% Reduced Fat Milk)
3,0.035694,(Apple Honeycrisp Organic)
4,0.029101,(Asparagus)


In [19]:
frequent_items.shape

(129, 2)
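
A common follow-up is to check how many of the frequent itemsets are single products and how many are combinations. The sketch below shows the pattern on a hypothetical three-row frame shaped like the `apriori` output (a `support` column and an `itemsets` column of frozensets); the support values are made up.

```python
import pandas as pd

# Hypothetical stand-in for the apriori output; values are illustrative only.
toy_itemsets = pd.DataFrame({
    "support": [0.06, 0.07, 0.012],
    "itemsets": [
        frozenset({"Limes"}),
        frozenset({"Large Lemon"}),
        frozenset({"Limes", "Large Lemon"}),
    ],
})

# Number of products in each itemset.
toy_itemsets["length"] = toy_itemsets["itemsets"].apply(len)

# How many itemsets there are of each size.
size_counts = toy_itemsets["length"].value_counts().sort_index()
print(size_counts)

# Keep only the pairs.
pairs = toy_itemsets[toy_itemsets["length"] == 2]
print(pairs)
```

The same two lines applied to `frequent_items` would show how its 129 rows split between single products and larger itemsets.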

The table below lists **metrics** used in **association rule mining** to evaluate the strength and importance of relationships between items; many of them appear as columns in the `association_rules` output. Let’s break down each metric with formulas and interpretations.

---

## **📌 Key Terms**
- **Antecedent (`A`)** – The item(s) on the **left side** of the rule (e.g., `{Milk}`)
- **Consequent (`B`)** – The item(s) on the **right side** of the rule (e.g., `{Bread}`)
- **Rule Format:**  
  \[
  A \Rightarrow B
  \]
  **(If A is bought, then B is likely to be bought)**

---

## **📊 Explanation of Each Column**
| **Metric**              | **Formula** | **Interpretation** | **Range** | **Ideal Value** |
|-------------------------|------------|--------------------|-----------|----------------|
| **Antecedents**         | -          | Items that **trigger** the rule (left-hand side of the rule) | - | - |
| **Consequents**         | -          | Items that **result** from the rule (right-hand side of the rule) | - | - |
| **Antecedent Support**  | $P(A)$ | How often the antecedent (A) appears in transactions | $[0,1]$ | Higher is better |
| **Consequent Support**  | $P(B)$ | How often the consequent (B) appears in transactions | $[0,1]$ | Higher is better |
| **Support**            | $P(A \cap B)$ | Fraction of transactions where both A & B appear **together** | $[0,1]$ | Higher is better |
| **Confidence**         | $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$ | Probability of buying B **given** that A was bought (higher = stronger rule) | $[0,1]$ | Higher is better |
| **Lift**               | $\frac{P(B \mid A)}{P(B)}$ | How many times more likely B is purchased when A is bought than under independence (Lift > 1 means a positive association) | $[0, \infty)$ | Greater than 1 |
| **Representativity**    | - | In mlxtend, this column relates to missing-value handling: the share of transactions with no missing values for the items in the rule. It is 1.0 when the data has no missing values | $[0,1]$ | Higher is better |
| **Leverage**           | $P(A \cap B) - P(A) P(B)$ | Difference between observed and expected co-occurrence (higher = stronger rule) | $[-1,1]$ | Higher is better |
| **Conviction**         | $\frac{1 - P(B)}{1 - P(B \mid A)}$ | Measures **how strongly B depends on A** (1 means independence; higher = stronger association) | $[0, \infty)$ | Higher is better |
| **Zhang's Metric**     | $\frac{P(A \cap B) - P(A) P(B)}{\max\left(P(A \cap B)(1 - P(A)),\ P(A)(P(B) - P(A \cap B))\right)}$ | Measures both **association and dissociation**: positive values indicate association, negative values dissociation | $[-1,1]$ | Closer to 1 |
| **Jaccard**            | $\frac{P(A \cap B)}{P(A) + P(B) - P(A \cap B)}$ | Measures **overlap** between A & B (like similarity) | $[0,1]$ | Higher is better |
| **Certainty**          | $\frac{P(B \mid A) - P(B)}{1 - P(B)}$ | Measures **certainty improvement** of B when A is present | $[-1,1]$ | Higher is better |
| **Kulczynski**         | $\frac{1}{2} \left( P(B \mid A) + P(A \mid B) \right)$ | Measures the **average conditional probability** of A and B occurring together | $[0,1]$ | Higher is better |
| **Cosine Similarity**  | $\frac{P(A \cap B)}{\sqrt{P(A) P(B)}}$ | Measures similarity between A and B based on co-occurrence | $[0,1]$ | Higher is better |
| **Odds Ratio**         | $\frac{P(A \cap B)\, P(\neg A \cap \neg B)}{P(A \cap \neg B)\, P(\neg A \cap B)}$ | Measures how much A increases the odds of B occurring | $[0, \infty)$ | Higher is better |
| **Yule's Q**           | $\frac{P(A \cap B)\, P(\neg A \cap \neg B) - P(A \cap \neg B)\, P(\neg A \cap B)}{P(A \cap B)\, P(\neg A \cap \neg B) + P(A \cap \neg B)\, P(\neg A \cap B)}$ | Measures **association symmetry** between A and B | $[-1,1]$ | Closer to 1 or -1 |
| **Yule's Y**           | $\frac{\sqrt{P(A \cap B)\, P(\neg A \cap \neg B)} - \sqrt{P(A \cap \neg B)\, P(\neg A \cap B)}}{\sqrt{P(A \cap B)\, P(\neg A \cap \neg B)} + \sqrt{P(A \cap \neg B)\, P(\neg A \cap B)}}$ | Measures **association strength** in a balanced way | $[-1,1]$ | Closer to 1 or -1 |
| **Gini Index**         | $P(A \cap B) (1 - P(A \cap B))$ | Measures **uncertainty reduction** when knowing A or B | $[0, 0.25]$ | Lower is better |


---
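
The formulas for the columns that mlxtend reports can be checked numerically. The sketch below recomputes them from three inputs, the antecedent support, the consequent support, and the joint support, using the rounded values of the Limes => Large Lemon row from the rules output further below. The function name `rule_metrics` is hypothetical, and small differences in the last digits come from the rounded inputs.

```python
def rule_metrics(s_a, s_b, s_ab):
    """Compute rule metrics for A => B from P(A), P(B) and P(A and B)."""
    confidence = s_ab / s_a
    leverage = s_ab - s_a * s_b
    return {
        "confidence": confidence,
        "lift": confidence / s_b,
        "leverage": leverage,
        # Conviction is infinite for a rule that never fails.
        "conviction": (1 - s_b) / (1 - confidence) if confidence < 1 else float("inf"),
        "zhangs_metric": leverage / max(s_ab * (1 - s_a), s_a * (s_b - s_ab)),
        "jaccard": s_ab / (s_a + s_b - s_ab),
        "certainty": (confidence - s_b) / (1 - s_b),
        "kulczynski": 0.5 * (s_ab / s_a + s_ab / s_b),
    }


# Rounded supports for (Limes) => (Large Lemon) from the rules output.
m = rule_metrics(s_a=0.059984, s_b=0.065764, s_ab=0.01186)
for name, value in m.items():
    print(f"{name:14s} {value:.4f}")
```

The printed values agree with that row to about three decimal places (for example lift near 3.007 and Zhang's metric near 0.710).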

## **📌 Example**
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|------------|------------|--------------------|--------------------|---------|------------|------|----------|------------|
| {Milk}     | {Bread}    | 0.50               | 0.40               | 0.30    | 0.60       | 1.50 | 0.10     | 1.50       |

🔹 **Interpretation:**  
- **60% of people who bought Milk also bought Bread** (`confidence = 0.60`).  
- **Milk increases the likelihood of buying Bread by 1.5x** (`lift = 1.50`).  
- **Milk and Bread appear together in 10 percentage points more transactions than independence would predict** (`leverage = 0.30 - 0.50 × 0.40 = 0.10`).  
- **The rule would fail 1.5x as often if Bread were independent of Milk** (`conviction = (1 - 0.40) / (1 - 0.60) = 1.50`).  

---

## **📌 Summary**
- **Support** – How often A & B appear together  
- **Confidence** – How often B appears given A was bought  
- **Lift** – Strength of the association (Lift > 1 means a positive relationship)  
- **Leverage & Conviction** – Measure how far the rule departs from **independence** (they are not significance tests)  
- **Jaccard, Kulczynski & Zhang's Metric** – Alternative ways to evaluate associations  

In [20]:
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.sort_values('lift', ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
35,(Limes),(Large Lemon),0.059984,0.065764,0.01186,0.197723,3.006544,1.0,0.007915,1.16448,0.70998,0.104139,0.141248,0.189034
34,(Large Lemon),(Limes),0.065764,0.059984,0.01186,0.180345,3.006544,1.0,0.007915,1.146843,0.714372,0.104139,0.128041,0.189034
53,(Organic Strawberries),(Organic Raspberries),0.112711,0.058325,0.014533,0.12894,2.210731,1.0,0.007959,1.081069,0.61723,0.092861,0.074989,0.189057
52,(Organic Raspberries),(Organic Strawberries),0.058325,0.112711,0.014533,0.249174,2.210731,1.0,0.007959,1.181751,0.581582,0.092861,0.153798,0.189057
37,(Organic Avocado),(Large Lemon),0.075348,0.065764,0.010538,0.139862,2.126728,1.0,0.005583,1.086147,0.572966,0.080708,0.079314,0.150053
36,(Large Lemon),(Organic Avocado),0.065764,0.075348,0.010538,0.160244,2.126728,1.0,0.005583,1.101097,0.567088,0.080708,0.091815,0.150053
47,(Organic Blueberries),(Organic Strawberries),0.042956,0.112711,0.010235,0.238274,2.114024,1.0,0.005394,1.16484,0.550621,0.070378,0.141513,0.164542
46,(Organic Strawberries),(Organic Blueberries),0.112711,0.042956,0.010235,0.090809,2.114024,1.0,0.005394,1.052633,0.593909,0.070378,0.050002,0.164542
48,(Organic Raspberries),(Organic Hass Avocado),0.058325,0.090339,0.010966,0.188018,2.081257,1.0,0.005697,1.120298,0.551699,0.079639,0.10738,0.154704
49,(Organic Hass Avocado),(Organic Raspberries),0.090339,0.058325,0.010966,0.121389,2.081257,1.0,0.005697,1.071777,0.571115,0.079639,0.06697,0.154704
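
Each pair of products appears twice in this output, once per direction, with the same lift but different confidence. The sketch below reproduces that from the rounded supports of the first two rows; the helper names are illustrative.

```python
# (antecedent, consequent, antecedent support, consequent support, support),
# copied from the first two rows of the rules output above.
rows = [
    ("Limes", "Large Lemon", 0.059984, 0.065764, 0.01186),
    ("Large Lemon", "Limes", 0.065764, 0.059984, 0.01186),
]


def confidence(s_a, s_ab):
    return s_ab / s_a


def lift(s_a, s_b, s_ab):
    return s_ab / (s_a * s_b)


results = {}
for a, b, s_a, s_b, s_ab in rows:
    results[(a, b)] = (confidence(s_a, s_ab), lift(s_a, s_b, s_ab))
    conf, lft = results[(a, b)]
    print(f"{a} => {b}: confidence={conf:.4f}, lift={lft:.4f}")
```

Lift depends only on the joint support and the product of the two individual supports, so it cannot tell the two directions apart. Confidence is higher for Limes => Large Lemon because Limes is the less common of the two products.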
