# Practicum Problems

In [11]:
# 2.1 Problem 1
# Step 1: Load Online Retail Dataset for France
# Import required libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Load the Online Retail dataset
df = pd.read_excel("Online Retail.xlsx")                              # UCI ML Repo dataset
print(df.head())
print(df.info())

  InvoiceNo StockCode                          Description  Quantity  \
0    536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1    536365     71053                  WHITE METAL LANTERN         6   
2    536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3    536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4    536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  UnitPrice  CustomerID         Country  
0 2010-12-01 08:26:00       2.55     17850.0  United Kingdom  
1 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
2 2010-12-01 08:26:00       2.75     17850.0  United Kingdom  
3 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
4 2010-12-01 08:26:00       3.39     17850.0  United Kingdom  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       -----------

In [15]:
# Step 2: Filter transactions for France and preprocess data for Apriori

# Filter the dataset to include only transactions from France
france_data = df[df['Country'] == 'France'].copy()

# Drop rows with missing invoice or description
france_data.dropna(subset=['InvoiceNo', 'Description'], inplace=True)

# Remove cancelled transactions (InvoiceNo starting with 'C')
france_data = france_data[~france_data['InvoiceNo'].astype(str).str.startswith('C')]

# Clean string columns
france_data['InvoiceNo'] = france_data['InvoiceNo'].astype(str)
france_data['Description'] = france_data['Description'].str.strip()




In [20]:
# Step 4: Create basket (transaction x item) matrix
basket = (france_data
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum()
          .unstack(fill_value=0))

# Convert quantities > 0 to 1, else 0 (binary encoding)
basket_sets = basket.gt(0).astype(int)



In [21]:
# Step 5: Generate frequent itemsets with minimum support of 5% 
# Note: basket_sets is a dense 0/1 DataFrame; apriori also accepts boolean or sparse one-hot input
frequent_itemsets = apriori(basket_sets, min_support=0.05, use_colnames=True)

# Sort frequent itemsets by support descending
frequent_itemsets = frequent_itemsets.sort_values(by='support', ascending=False)

# Identify itemset with the largest support
max_support_itemset = frequent_itemsets.iloc[0]
print("\nItemset with the largest support:")
print(f"Itemset: {{{', '.join(max_support_itemset['itemsets'])}}}")
print(f"Support: {max_support_itemset['support']:.4f}")


Itemset with the largest support:
Itemset: {POSTAGE}
Support: 0.7653




In [27]:
# Step 6: Generate association rules and find highest confidence and lift rules 
# Using min_threshold=0.01 to allow even low-confidence rules to be considered,
# which ensures we don't miss interesting rules that may have low confidence but high lift.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.01)

# Rule with highest confidence
max_confidence_rule = rules.loc[rules['confidence'].idxmax()]
print("\nRule with highest confidence:")
print(f"Antecedents: {', '.join(max_confidence_rule['antecedents'])}")
print(f"Consequents: {', '.join(max_confidence_rule['consequents'])}")
print(f"Support: {max_confidence_rule['support']:.4f}")
print(f"Confidence: {max_confidence_rule['confidence']:.4f}")
print(f"Lift: {max_confidence_rule['lift']:.4f}")

# Rule with highest lift
max_lift_rule = rules.loc[rules['lift'].idxmax()]
print("\nRule with highest lift:")
print(f"Antecedents: {', '.join(max_lift_rule['antecedents'])}")
print(f"Consequents: {', '.join(max_lift_rule['consequents'])}")
print(f"Support: {max_lift_rule['support']:.4f}")
print(f"Confidence: {max_lift_rule['confidence']:.4f}")
print(f"Lift: {max_lift_rule['lift']:.4f}")




Rule with highest confidence:
Antecedents: JUMBO BAG WOODLAND ANIMALS
Consequents: POSTAGE
Support: 0.0765
Confidence: 1.0000
Lift: 1.3067

Rule with highest lift:
Antecedents: PACK OF 6 SKULL PAPER CUPS
Consequents: PACK OF 6 SKULL PAPER PLATES
Support: 0.0510
Confidence: 0.8000
Lift: 14.2545


## Is the rule with the highest confidence the same as the rule with the highest lift? Why or why not?

The **rule with the highest confidence** and the **rule with the highest lift** are **not the same** in this analysis, and here is why:

---

### Rule with highest confidence:
- **Antecedents:** `JUMBO BAG WOODLAND ANIMALS`
- **Consequents:** `POSTAGE`
- **Support:** **0.0765** (7.65%)
- **Confidence:** **1.00** (100%)
- **Lift:** **1.31**

This means that **every transaction** containing the *JUMBO BAG WOODLAND ANIMALS* also includes *POSTAGE*. The confidence of **1.00** indicates a perfect conditional probability: whenever the bag is bought, postage is always present. However, the lift is only **1.31**, so buying the bag makes *POSTAGE* just 1.31 times as likely as its baseline rate. *POSTAGE* alone already appears in 76.53% of French transactions, so even a perfect confidence can raise its probability by at most a factor of 1 / 0.7653 ≈ 1.31.

---

### Rule with highest lift:
- **Antecedents:** `PACK OF 6 SKULL PAPER CUPS`
- **Consequents:** `PACK OF 6 SKULL PAPER PLATES`
- **Support:** **0.0510** (5.1%)
- **Confidence:** **0.80** (80%)
- **Lift:** **14.25**

This rule shows a very strong association between these two items. The confidence of **0.80** means that 80% of transactions containing the *PACK OF 6 SKULL PAPER CUPS* also contain the *PACK OF 6 SKULL PAPER PLATES*. The **lift of 14.25** indicates that these two items are bought together **over 14 times more frequently than if they were independent**, representing a very strong and meaningful relationship.

---

### Why are these rules different?

- **Confidence** measures the likelihood of the consequent given the antecedent. It can be very high simply because the consequent item (*POSTAGE*) is very common overall, which explains why the first rule has perfect confidence but only a modest lift.
  
- **Lift** measures how much more frequently the antecedent and consequent occur together than expected if they were independent. A high lift value (like 14.25) indicates a strong association even when confidence is lower. Lift is not a test of statistical significance, so a rule with low support should still be read with care.
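
The link between the two measures can be checked from the printed output. The sketch below assumes the standard identity lift = confidence / support(consequent) and uses only the rounded values reported above. The helper name `lift_from_confidence` is illustrative and not part of mlxtend.

```python
def lift_from_confidence(confidence, consequent_support):
    """Lift of A => B, given confidence(A => B) and support(B)."""
    return confidence / consequent_support

# Highest-confidence rule: JUMBO BAG WOODLAND ANIMALS => POSTAGE
postage_support = 0.7653  # support of {POSTAGE} reported by the apriori step
bag_rule_lift = lift_from_confidence(1.0000, postage_support)

# Highest-lift rule: rearranging the same identity gives the implied
# support of the consequent (PACK OF 6 SKULL PAPER PLATES)
plates_support = 0.8000 / 14.2545

print(f"Lift of bag => POSTAGE:    {bag_rule_lift:.4f}")
print(f"Implied support of plates: {plates_support:.4f}")
```

Because *POSTAGE* is already in about 77% of baskets, no rule with *POSTAGE* as the consequent can exceed a lift of about 1.31. The plates appear in only about 5.6% of baskets, which leaves room for a lift above 14.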

---

### Summary:

- The **highest confidence rule** shows a very predictable but less interesting association, largely because *POSTAGE* appears frequently in many transactions.
- The **highest lift rule** reveals a stronger, more surprising relationship between two less common items that are frequently purchased together.

Therefore, **the rule with the highest confidence is not the same as the rule with the highest lift**, as they capture different dimensions of association strength in the data.

---

## Citations:
- https://www.kaggle.com/code/sefercanapaydn/apriori-algorithm-on-a-online-retail-store
- https://www.kaggle.com/code/xvivancos/market-basket-analysis/report#data-analysis
- https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis
- https://www.kaggle.com/code/mbalvi75/15-apriori-arm

In [29]:
# 2.2 Problem 2
# Step 1: Load Extended Bakery Dataset
import pandas as pd
import numpy as np

# Load the Extended Bakery dataset 
df = pd.read_csv('75000-out2-binary.csv')
print(df.head())
print(df.info())

   Transaction Number  Chocolate Cake  Lemon Cake  Casino Cake  Opera Cake  \
0                   1               0           0            0           0   
1                   2               0           0            0           0   
2                   3               0           0            0           1   
3                   4               0           0            0           0   
4                   5               0           0            0           0   

   Strawberry Cake  Truffle Cake  Chocolate Eclair  Coffee Eclair  \
0                0             0                 0              0   
1                0             0                 0              1   
2                0             0                 0              0   
3                0             1                 0              0   
4                0             0                 1              0   

   Vanilla Eclair  ...  Lemon Lemonade  Raspberry Lemonade  Orange Juice  \
0               0  ...               0  

In [30]:
# Step 3: Extract the two item columns of interest: 'Chocolate Coffee' and 'Chocolate Cake'
choco_coffee = df['Chocolate Coffee']
choco_cake = df['Chocolate Cake']

In [31]:
# Step 4: Calculate the components of the contingency table (2x2 table)
# Count of transactions where both items are present (1,1)
n11 = np.sum((choco_coffee == 1) & (choco_cake == 1))

# Count where Chocolate Coffee = 1, Chocolate Cake = 0
n10 = np.sum((choco_coffee == 1) & (choco_cake == 0))

# Count where Chocolate Coffee = 0, Chocolate Cake = 1
n01 = np.sum((choco_coffee == 0) & (choco_cake == 1))

# Count where both items are absent (0,0)
n00 = np.sum((choco_coffee == 0) & (choco_cake == 0))

# Total number of transactions
N = len(df)

In [34]:
# Step 5: Calculate the binary correlation coefficient Φ (phi coefficient)
# Formula:
# Φ = (n11 * n00 - n10 * n01) / sqrt( (n1dot * n0dot * ndot1 * ndot0) )
# where
# n1dot = n11 + n10 (row sum for Chocolate Coffee=1)
# n0dot = n01 + n00 (row sum for Chocolate Coffee=0)
# ndot1 = n11 + n01 (column sum for Chocolate Cake=1)
# ndot0 = n10 + n00 (column sum for Chocolate Cake=0)

n1dot = n11 + n10
n0dot = n01 + n00
ndot1 = n11 + n01
ndot0 = n10 + n00

numerator = (n11 * n00) - (n10 * n01)
denominator = np.sqrt(n1dot * n0dot * ndot1 * ndot0)

phi = numerator / denominator if denominator != 0 else 0

In [35]:
# Step 6: Display contingency table clearly
print("\nContingency Table:")
print(f"{'':<20} {'Chocolate Cake=1':<15} {'Chocolate Cake=0':<15}")
print(f"{'Chocolate Coffee=1':<20} {n11:<15} {n10:<15}")
print(f"{'Chocolate Coffee=0':<20} {n01:<15} {n00:<15}")
print(f"Total transactions: {N}")

print(f"\nPhi coefficient (Φ) between Chocolate Coffee and Chocolate Cake: {phi:.4f}")

# Step 9: Explain symmetry of Φ coefficient
print("\nSymmetry check:")
print("The binary correlation coefficient Φ is symmetric by definition.")
print("Therefore, Φ({Chocolate Coffee} ⇒ {Chocolate Cake}) = Φ({Chocolate Cake} ⇒ {Chocolate Coffee}).")
print("This is because the formula and contingency table counts are invariant to swapping the variables.")


Contingency Table:
                     Chocolate Cake=1 Chocolate Cake=0
Chocolate Coffee=1   3303            2933           
Chocolate Coffee=0   2962            65802          
Total transactions: 75000

Phi coefficient (Φ) between Chocolate Coffee and Chocolate Cake: 0.4856

Symmetry check:
The binary correlation coefficient Φ is symmetric by definition.
Therefore, Φ({Chocolate Coffee} ⇒ {Chocolate Cake}) = Φ({Chocolate Cake} ⇒ {Chocolate Coffee}).
This is because the formula and contingency table counts are invariant to swapping the variables.


## 1) Are these two items symmetric binary variables? Provide supporting calculations.

To evaluate whether **Chocolate Coffee** and **Chocolate Cake** can be considered **symmetric binary variables**, we first compute the **binary correlation coefficient (Phi)**. This coefficient measures the strength and direction of the association between two binary variables, with values ranging from -1 (perfect negative association) to +1 (perfect positive association), and 0 indicating no association.

---

### Contingency Table Overview

The observed counts from the transaction data are summarized in the contingency table below:

|                            | **Chocolate Cake = 1** | **Chocolate Cake = 0** |
|----------------------------|------------------------|------------------------|
| **Chocolate Coffee = 1**   | 3303                   | 2933                   |
| **Chocolate Coffee = 0**   | 2962                   | 65802                  |

These counts represent the frequency of co-occurrence and individual occurrences of the two items:

- n11 = 3303 — both items purchased together  
- n10 = 2933 — only Chocolate Coffee purchased  
- n01 = 2962 — only Chocolate Cake purchased  
- n00 = 65802 — neither item purchased  
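
As a supporting calculation for the symmetry question, the share of each outcome can be computed from these counts. This is a minimal sketch in plain Python; the variable names are illustrative.

```python
n11, n10, n01, n00 = 3303, 2933, 2962, 65802
total = n11 + n10 + n01 + n00            # 75000 transactions

coffee_support = (n11 + n10) / total     # share of baskets with Chocolate Coffee
cake_support = (n11 + n01) / total       # share of baskets with Chocolate Cake
neither_share = n00 / total              # share of baskets with neither item

print(f"Chocolate Coffee support: {coffee_support:.4f}")
print(f"Chocolate Cake support:   {cake_support:.4f}")
print(f"Neither item:             {neither_share:.4f}")
```

Each item appears in roughly 8% of transactions, while almost 88% contain neither, so the 1 and 0 outcomes are far from equally frequent.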

---

### Calculation of Phi Coefficient (Phi)

The Phi coefficient is defined as:

**Phi = (n11 × n00 - n10 × n01) / sqrt(n1• × n0• × n•1 × n•0)**

where the marginal totals are:

- n1• = n11 + n10 = 3303 + 2933 = 6236  
- n0• = n01 + n00 = 2962 + 65802 = 68764  
- n•1 = n11 + n01 = 3303 + 2962 = 6265  
- n•0 = n10 + n00 = 2933 + 65802 = 68735  

Substituting the values:

Phi = (3303 × 65802 - 2933 × 2962) / sqrt(6236 × 68764 × 6265 × 68735)  
= (217,344,006 - 8,687,546) / sqrt(1.8466 × 10¹⁷)  
= 208,656,460 / 429,717,584  
≈ 0.4856

This indicates a **moderate positive correlation** between the purchase of Chocolate Coffee and Chocolate Cake. Such a relationship suggests that when a customer buys one of these items, there is an increased likelihood of also purchasing the other compared to chance.
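
For a 2x2 table, Phi equals the Pearson correlation of the two 0/1 columns, which gives an independent check on the hand calculation. The sketch below rebuilds two indicator vectors from the four counts (an illustrative reconstruction, not the original CSV) and compares `np.corrcoef` with the closed-form formula.

```python
import numpy as np

n11, n10, n01, n00 = 3303, 2933, 2962, 65802

# Rebuild 0/1 indicator vectors that reproduce the contingency table
counts = [n11, n10, n01, n00]
coffee = np.repeat([1, 1, 0, 0], counts)
cake = np.repeat([1, 0, 1, 0], counts)

# Pearson correlation of the two binary vectors
pearson_r = np.corrcoef(coffee, cake)[0, 1]

# Closed-form Phi; Python ints avoid any overflow in the product
marginal_product = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
phi = (n11 * n00 - n10 * n01) / np.sqrt(float(marginal_product))

print(f"Pearson r: {pearson_r:.4f}")
print(f"Phi:       {phi:.4f}")
```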

---

## 2) Would the association rules {Chocolate Coffee} ⇒ {Chocolate Cake} have the same value for Phi as {Chocolate Cake} ⇒ {Chocolate Coffee}?

By the mathematical properties of the Phi coefficient, it is a **symmetric measure** of association between two binary variables. Swapping the antecedent and consequent only exchanges n10 with n01 and the row totals with the column totals, which leaves both the numerator and the denominator of the formula unchanged.

Consequently,

**Phi({Chocolate Coffee} ⇒ {Chocolate Cake}) = Phi({Chocolate Cake} ⇒ {Chocolate Coffee}) = 0.4856**

---

### Implications

- **Confidence** differs between the two rules because it is a directional conditional probability: 3303 / 6236 ≈ 0.5297 for {Chocolate Coffee} ⇒ {Chocolate Cake} versus 3303 / 6265 ≈ 0.5272 for the reverse. **Lift**, like Phi, is symmetric and takes the same value in both directions.  
- This symmetry makes Phi an effective metric for assessing the **intrinsic strength of co-occurrence** without bias toward causality or directionality.
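
The directional behaviour of each measure can be verified from the contingency counts. This is a minimal sketch using the standard definitions of confidence and lift; the variable names are illustrative.

```python
n11, n10, n01, n00 = 3303, 2933, 2962, 65802
total = n11 + n10 + n01 + n00

coffee_support = (n11 + n10) / total
cake_support = (n11 + n01) / total

# Confidence conditions on the antecedent, so it depends on direction
conf_coffee_to_cake = n11 / (n11 + n10)   # about 0.5297
conf_cake_to_coffee = n11 / (n11 + n01)   # about 0.5272

# Lift divides confidence by the consequent's support
lift_coffee_to_cake = conf_coffee_to_cake / cake_support
lift_cake_to_coffee = conf_cake_to_coffee / coffee_support

print(f"confidence: {conf_coffee_to_cake:.4f} vs {conf_cake_to_coffee:.4f}")
print(f"lift:       {lift_coffee_to_cake:.4f} vs {lift_cake_to_coffee:.4f}")
```

Both lift expressions reduce to n11 × N / (n1• × n•1), which is why they agree while the two confidences do not.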

---

### Conclusion

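- The computed Phi coefficient of approximately **0.4856** demonstrates a **moderate positive correlation** between Chocolate Coffee and Chocolate Cake.  
- The symmetry of Phi is a property of the measure, not evidence that the items are **symmetric binary variables**. A binary variable is symmetric only when both outcomes are equally informative. Here 65,802 of 75,000 transactions (87.7%) contain neither item, so presence carries far more information than absence and the items are better treated as **asymmetric binary variables**.  
- The value of Phi is nonetheless **identical for both association rules**:  
  Phi({Chocolate Coffee} ⇒ {Chocolate Cake}) = Phi({Chocolate Cake} ⇒ {Chocolate Coffee})  
  highlighting the bidirectional nature of their association.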

---

#### Citations:

- https://www.youtube.com/watch?v=SVM_pX0oTU8  
- https://www.youtube.com/watch?v=4QIWJVVWJdQ  
- https://www.kaggle.com/code/busegngr/recommendation-system  
- https://www.kaggle.com/code/mgmarques/customer-segmentation-and-market-basket-analysis  
