# Phase 5: Association Rule Mining (Apriori)



**Objective**: Discover hidden patterns and relationships in the data.

**Theory**:
Association Rule Mining is often used in Market Basket Analysis (e.g., "People who buy Bread also buy Butter").
We will use the **Apriori Algorithm**.

**Key Metrics**:
1.  **Support**: How frequently the itemset appears in the dataset.
2.  **Confidence**: How often the rule has been found to be true.
3.  **Lift**: The ratio of the observed support to that expected if X and Y were independent. Lift > 1 means a strong association.

**Steps**:
1.  **Discretization**: Apriori works on categorical data. We must convert continuous variables (Rating, Size) into bins (e.g., 'High', 'Medium', 'Low').
2.  **One-Hot Encoding**: Convert to a format suitable for the algorithm.
3.  **Generate Rules**: Apply Apriori.
    


## 1. Import Libraries & Load Data


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
import os

def get_data_path(filename):
    possible_paths = [
        f"../output/{filename}",
        f"output/{filename}",
        f"/Users/jatinbisen/Desktop/Data_mining/output/{filename}"
    ]
    for path in possible_paths:
        if os.path.exists(path):
            return path
    return None

# Load cleaned data
google_df = pd.read_csv(get_data_path('google_cleaned.csv'))
print("Data loaded.")


ValueError: Invalid file path or buffer object type: <class 'NoneType'>

## 2. Discretization



We need to convert our numerical columns into categories.
- **Rating**: Low (<3.5), Medium (3.5-4.5), High (>4.5).
- **Size**: Small, Medium, Large (using quantiles).
- **Price**: Free vs Paid.
    


In [None]:
# Create a copy for rule mining
df_rules = google_df[['Category', 'Rating', 'Size', 'Price', 'Content_Rating']].dropna().copy()

# Bin Rating
df_rules['Rating_Bin'] = pd.cut(df_rules['Rating'], bins=[0, 3.5, 4.5, 5], labels=['Low_Rating', 'Avg_Rating', 'High_Rating'])

# Bin Size (using qcut for equal-sized buckets)
df_rules['Size_Bin'] = pd.qcut(df_rules['Size'], q=3, labels=['Small_App', 'Medium_App', 'Large_App'])

# Bin Price
df_rules['Price_Bin'] = df_rules['Price'].apply(lambda x: 'Free' if x == 0 else 'Paid')

# Drop original numeric columns
df_rules.drop(['Rating', 'Size', 'Price'], axis=1, inplace=True)

print("Data discretized.")
df_rules.head()


Data discretized.


Unnamed: 0,Category,Content_Rating,Rating_Bin,Size_Bin,Price_Bin
0,Adventure,Everyone,,Medium_App,Free
1,Tools,Everyone,Avg_Rating,Small_App,Free
2,Productivity,Everyone,,Small_App,Free
3,Communication,Everyone,High_Rating,Small_App,Free
4,Tools,Everyone,,Small_App,Free


## 3. One-Hot Encoding


In [None]:
# Convert to One-Hot Encoded format
df_encoded = pd.get_dummies(df_rules)
print(f"Encoded Shape: {df_encoded.shape}")
# Display first few columns
df_encoded.iloc[:, :5].head()


Encoded Shape: (2156375, 62)


Unnamed: 0,Category_Action,Category_Adventure,Category_Arcade,Category_Art & Design,Category_Auto & Vehicles
0,False,True,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False


## 4. Generate Frequent Itemsets



We look for itemsets that appear in at least 5% of the apps (`min_support=0.05`).
    


In [None]:
frequent_itemsets = apriori(df_encoded, min_support=0.05, use_colnames=True)
print(f"Found {len(frequent_itemsets)} frequent itemsets.")
frequent_itemsets.sort_values(by='support', ascending=False).head()


Found 63 frequent itemsets.


Unnamed: 0,support,itemsets
14,0.97985,(Price_Bin_Free)
6,0.872884,(Content_Rating_Everyone)
30,0.854833,"(Content_Rating_Everyone, Price_Bin_Free)"
12,0.336767,(Size_Bin_Medium_App)
11,0.336216,(Size_Bin_Small_App)


## 5. Generate Association Rules



We filter for rules with `min_confidence=0.2` (20%).
    


In [None]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
print(f"Found {len(rules)} rules.")

# Sort by Lift to find the strongest associations
rules.sort_values(by='lift', ascending=False).head(10)


Found 148 rules.


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
7,(Category_Tools),(Content_Rating_Everyone),0.060286,0.872884,0.058456,0.969638,1.110845,1.0,0.005833,4.186748,0.106186,0.066829,0.761151,0.518304
47,"(Category_Tools, Price_Bin_Free)",(Content_Rating_Everyone),0.058481,0.872884,0.05668,0.969216,1.110361,1.0,0.005634,4.129342,0.105566,0.064801,0.757831,0.517075
42,(Category_Business),"(Content_Rating_Everyone, Price_Bin_Free)",0.06438,0.854833,0.060927,0.946365,1.107077,1.0,0.005893,2.706591,0.103375,0.070987,0.630532,0.50882
2,(Category_Education),(Content_Rating_Everyone),0.105149,0.872884,0.101187,0.962327,1.102468,1.0,0.009405,3.374187,0.103866,0.115399,0.703632,0.539125
44,"(Category_Education, Price_Bin_Free)",(Content_Rating_Everyone),0.102222,0.872884,0.098327,0.961892,1.101971,1.0,0.009099,3.335723,0.103071,0.112145,0.700215,0.537269
48,(Category_Tools),"(Content_Rating_Everyone, Price_Bin_Free)",0.060286,0.854833,0.05668,0.940185,1.099846,1.0,0.005146,2.426921,0.096606,0.066027,0.587955,0.503245
45,(Category_Education),"(Content_Rating_Everyone, Price_Bin_Free)",0.105149,0.854833,0.098327,0.93512,1.093921,1.0,0.008442,2.237456,0.095946,0.114114,0.553064,0.525072
28,(Size_Bin_Large_App),(Rating_Bin_Avg_Rating),0.327017,0.284754,0.101678,0.310924,1.091905,1.0,0.008558,1.037979,0.125069,0.199331,0.036589,0.333998
27,(Rating_Bin_Avg_Rating),(Size_Bin_Large_App),0.284754,0.327017,0.101678,0.357072,1.091905,1.0,0.008558,1.046746,0.117679,0.199331,0.044659,0.333998
0,(Category_Business),(Content_Rating_Everyone),0.06438,0.872884,0.061219,0.950903,1.089381,1.0,0.005023,2.589096,0.087693,0.069882,0.613765,0.510519


## 6. Interpretation



Let's interpret the top rules.
Example:
`If (Price_Bin_Free) -> (Rating_Bin_Avg_Rating)`
This means free apps are likely to have average ratings.

We can filter for specific consequences, like 'High_Rating'.
    


In [None]:
# Find rules leading to High Rating
high_rating_rules = rules[rules['consequents'].apply(lambda x: 'Rating_Bin_High_Rating' in x)]
high_rating_rules.sort_values(by='lift', ascending=False).head()


NameError: name 'rules' is not defined