# **453 Individual Assignment 2 - End-to-End ML Project**
## **Recommender System using Association Rules (Predictive Analysis)**
### Shaolong (Fred) Xue

### **Introduction**

The goal of this assignment is to predict which items are likely to be purchased together, using association rule mining.

The data come from Instacart and are available here: https://www.kaggle.com/competitions/instacart-market-basket-analysis/data. The analysis is built on one transaction dataset, "order_products__train.csv"; "products.csv" is used only to look up product names.

In this notebook, I write code for two parts:

* Part A: Generate frequent itemsets and association rules for a recommender system

* Part B: Make two business recommendations for Instacart

Before these two parts, I spend some time exploring and preprocessing the data.

In [2]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

### **Data Preprocessing**

In [3]:
order_products_train = pd.read_csv("order_products__train.csv")
orders = pd.read_csv("orders.csv")
products = pd.read_csv("products.csv")

In [4]:
# Prepare the list of transactions
order_list = order_products_train[['order_id', 'product_id']]

In [5]:
# Group product IDs by order so that each order becomes one transaction (a list of products)
transactions = list(order_list.groupby('order_id')['product_id'].apply(list))

In [6]:
# One-hot encode the transaction list
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_ary, columns=te.columns_)

In [7]:
df.shape

(131209, 39123)
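
The shape above implies a very large dense matrix, which is part of why the support threshold matters so much later on. As a rough sketch, assuming each cell of the boolean array takes one byte (the usual size of a NumPy `bool`), the footprint can be estimated as below. The mlxtend documentation also describes a `sparse=True` option on `TransactionEncoder.transform`, which may be worth trying if memory is tight; it is not used in this notebook.

```python
# Rough memory estimate for the dense one-hot matrix.
# Assumption: one byte per boolean cell (NumPy's usual bool size).
n_orders, n_products = 131209, 39123  # from df.shape above

n_cells = n_orders * n_products
dense_bytes = n_cells * 1  # one byte per cell

print(f"cells: {n_cells:,}")
print(f"approximate dense size: {dense_bytes / 1e9:.2f} GB")
```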

### **Part A**

**Frequent Itemsets**

I had to choose a low minimum support threshold given the large number of unique items in the data (39,123) and the large number of orders (131,209).

I tried a few thresholds (0.05, 0.02, 0.015, 0.01, 0.005, etc.). I wanted to strike a balance: enough itemsets overall, including a good number of length-2 itemsets from which to form association rules between pairs of items, but not so many that generating the rules becomes computationally expensive.

Eventually, I settled on a minimum support threshold of 0.005. It generates 364 itemsets, many of them of length 2.
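
To make these thresholds more concrete, it can help to translate each one into a minimum number of orders. The sketch below assumes an itemset qualifies when its support is at least `min_support`; the helper `min_order_count` is a hypothetical name used only for illustration.

```python
import math

n_orders = 131209  # number of orders (rows of the one-hot DataFrame)


def min_order_count(min_support, n_transactions):
    """Smallest number of transactions an itemset needs to reach min_support."""
    return math.ceil(min_support * n_transactions)


# Convert each candidate threshold into an order count.
for threshold in (0.05, 0.02, 0.015, 0.01, 0.005):
    print(f"min_support={threshold}: at least {min_order_count(threshold, n_orders)} orders")
```

Seen this way, even the lowest threshold of 0.005 still requires an itemset to appear in several hundred orders.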

In [8]:
# Generate frequent itemsets, min support = 0.015
freq_itemsets_015 = apriori(df, min_support=0.015, use_colnames=True)

In [9]:
# Generate frequent itemsets, min support = 0.01
freq_itemsets_01 = apriori(df, min_support=0.01, use_colnames=True)

In [10]:
# Generate frequent itemsets, min support = 0.005
freq_itemsets_005 = apriori(df, min_support=0.005, use_colnames=True)

In [11]:
print(freq_itemsets_005.shape)
print(freq_itemsets_01.shape)
print(freq_itemsets_015.shape)

(364, 2)
(120, 2)
(71, 2)


**Association Rules**

In [12]:
# Generate association rules
rules = association_rules(freq_itemsets_005, metric="confidence", min_threshold=0.3)

In [13]:
rules.head(5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (5876) | (13176) | 0.026713 | 0.11798 | 0.008132 | 0.304422 | 2.580293 | 0.00498 | 1.26804 | 0.629257 |
| 1 | (8174) | (13176) | 0.01509 | 0.11798 | 0.005526 | 0.366162 | 3.103598 | 0.003745 | 1.391554 | 0.688178 |
| 2 | (8277) | (13176) | 0.017163 | 0.11798 | 0.005236 | 0.305062 | 2.585717 | 0.003211 | 1.269207 | 0.62397 |
| 3 | (8424) | (24852) | 0.022346 | 0.142719 | 0.00705 | 0.315484 | 2.21053 | 0.003861 | 1.252391 | 0.560137 |
| 4 | (9076) | (24852) | 0.017705 | 0.142719 | 0.005457 | 0.308222 | 2.159645 | 0.00293 | 1.239243 | 0.546639 |
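
As a sanity check on how the columns relate, the sketch below recomputes confidence, lift, leverage, and conviction for rule 0 from its three support values, using the standard textbook definitions. The inputs are the rounded figures shown in the table, so the results should agree only to a few decimal places.

```python
# Values copied from rule 0 in the table above: (5876) -> (13176)
antecedent_support = 0.026713
consequent_support = 0.11798
support = 0.008132  # support of antecedent and consequent together

# Standard definitions of the rule metrics
confidence = support / antecedent_support
lift = confidence / consequent_support
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence: {confidence:.4f}")
print(f"lift:       {lift:.4f}")
print(f"leverage:   {leverage:.5f}")
print(f"conviction: {conviction:.4f}")
```

A lift above 1 means the two products appear together more often than they would if they were purchased independently.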


In [14]:
# Integrate with product info to identify the names of the products

rules_named = rules.copy()

# Unpack the single product ID from each frozenset for merging.
# This works here because every rule has exactly one antecedent and one consequent.
rules_named['antecedents'] = rules_named['antecedents'].apply(lambda antecedent: list(antecedent)[0])
rules_named['consequents'] = rules_named['consequents'].apply(lambda consequent: list(consequent)[0])

# Merge antecedents
rules_named = rules_named.merge(products[['product_id', 'product_name']], 
                                left_on='antecedents', 
                                right_on='product_id', 
                                how='left')

rules_named.rename(columns={"product_name": "antecedent_name"}, inplace=True)

# Merge consequents
rules_named = rules_named.merge(products[['product_id', 'product_name']], 
                                left_on='consequents', 
                                right_on='product_id', 
                                how='left')

rules_named.rename(columns={"product_name": "consequent_name"}, inplace=True)

rules_named.drop(['product_id_x', 'product_id_y'], axis=1, inplace=True)

# Move the ID and name columns to the front, keeping the metric columns in their original order
name_cols = ['antecedents', 'antecedent_name', 'consequents', 'consequent_name']
cols = name_cols + [col for col in rules_named.columns if col not in name_cols]
rules_named = rules_named[cols]

rules_named

| | antecedents | antecedent_name | consequents | consequent_name | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5876 | Organic Lemon | 13176 | Bag of Organic Bananas | 0.026713 | 0.11798 | 0.008132 | 0.304422 | 2.580293 | 0.00498 | 1.26804 | 0.629257 |
| 1 | 8174 | Organic Navel Orange | 13176 | Bag of Organic Bananas | 0.01509 | 0.11798 | 0.005526 | 0.366162 | 3.103598 | 0.003745 | 1.391554 | 0.688178 |
| 2 | 8277 | Apple Honeycrisp Organic | 13176 | Bag of Organic Bananas | 0.017163 | 0.11798 | 0.005236 | 0.305062 | 2.585717 | 0.003211 | 1.269207 | 0.62397 |
| 3 | 8424 | Broccoli Crown | 24852 | Banana | 0.022346 | 0.142719 | 0.00705 | 0.315484 | 2.21053 | 0.003861 | 1.252391 | 0.560137 |
| 4 | 9076 | Blueberries | 24852 | Banana | 0.017705 | 0.142719 | 0.005457 | 0.308222 | 2.159645 | 0.00293 | 1.239243 | 0.546639 |
| 5 | 19057 | Organic Large Extra Fancy Fuji Apple | 13176 | Bag of Organic Bananas | 0.022034 | 0.11798 | 0.007416 | 0.336562 | 2.852709 | 0.004816 | 1.329469 | 0.664088 |
| 6 | 27966 | Organic Raspberries | 13176 | Bag of Organic Bananas | 0.042268 | 0.11798 | 0.013566 | 0.320952 | 2.7204 | 0.008579 | 1.298907 | 0.660318 |
| 7 | 47209 | Organic Hass Avocado | 13176 | Bag of Organic Bananas | 0.055583 | 0.11798 | 0.018444 | 0.331825 | 2.81256 | 0.011886 | 1.320044 | 0.682381 |
| 8 | 27966 | Organic Raspberries | 21137 | Organic Strawberries | 0.042268 | 0.083028 | 0.012728 | 0.301118 | 3.62671 | 0.009218 | 1.312056 | 0.756233 |
| 9 | 28204 | Organic Fuji Apple | 24852 | Banana | 0.024823 | 0.142719 | 0.009222 | 0.371508 | 2.603072 | 0.005679 | 1.364028 | 0.631515 |


### **Part B**

Given the association rules, consider how items should be laid out in Instacart's app and how recommendations should be made. Use other datasets, such as products.csv, to learn more.

**Cross-Recommend Fresh Produce**

From the association rules generated, it is clear that the associated baskets consist overwhelmingly of fresh produce: all ten rules pair a fruit or vegetable with bananas or strawberries. Instacart users tend to buy fruits and fresh vegetables together. For example:

* Hass avocados and bananas
* Fuji apples and bananas
* Navel oranges and bananas

My recommendation is for Instacart to design its item pages, shopping cart, and checkout interfaces to recommend products that are not yet in the cart, based on the products that are.

For example, when a customer views one product, the others frequently bought with it could be highlighted as "Customers who bought this also bought these items," similar to the feature on Amazon.
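
A minimal sketch of how such a lookup could work is shown below. It copies the antecedent, consequent, confidence, and lift values from the named rules table and suggests consequents for items already in the cart. The function `recommend`, the `RULES` list, and the choice to rank by lift are assumptions made for illustration, not a description of Instacart's actual system.

```python
# (antecedent, consequent, confidence, lift) copied from the named rules table.
RULES = [
    ("Organic Lemon", "Bag of Organic Bananas", 0.304422, 2.580293),
    ("Organic Navel Orange", "Bag of Organic Bananas", 0.366162, 3.103598),
    ("Apple Honeycrisp Organic", "Bag of Organic Bananas", 0.305062, 2.585717),
    ("Broccoli Crown", "Banana", 0.315484, 2.21053),
    ("Blueberries", "Banana", 0.308222, 2.159645),
    ("Organic Large Extra Fancy Fuji Apple", "Bag of Organic Bananas", 0.336562, 2.852709),
    ("Organic Raspberries", "Bag of Organic Bananas", 0.320952, 2.7204),
    ("Organic Hass Avocado", "Bag of Organic Bananas", 0.331825, 2.81256),
    ("Organic Raspberries", "Organic Strawberries", 0.301118, 3.62671),
    ("Organic Fuji Apple", "Banana", 0.371508, 2.603072),
]


def recommend(cart, rules=RULES, top_n=3):
    """Suggest products not yet in the cart, ranked by lift (confidence breaks ties)."""
    in_cart = set(cart)
    best = {}  # consequent -> best (lift, confidence) among the triggered rules
    for antecedent, consequent, confidence, lift in rules:
        if antecedent in in_cart and consequent not in in_cart:
            score = (lift, confidence)
            if consequent not in best or score > best[consequent]:
                best[consequent] = score
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:top_n]


print(recommend(["Organic Raspberries"]))
print(recommend(["Organic Raspberries", "Organic Strawberries"]))
```

Ranking by lift favors pairings that are unusually strong relative to each product's overall popularity; ranking by confidence instead would favor the most likely next purchase, which here would mostly surface bananas.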

**Bundling Frequently Bought Together Products**

Continuing with the observation that produce items are frequently bought together, Instacart could consider creating bundles that offer a small discount.

For example, since customers who buy Organic Lemons also tend to buy Bags of Organic Bananas, Instacart could create a bundle of Organic Lemons and Bags of Organic Bananas and offer a small discount for buying both together. This could make shopping at Instacart more cost-effective and convenient for users.
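
To get a rough sense of how many orders a bundle might affect, the support of each rule can be converted back into an approximate order count. The sketch below does this for three pairs from the named rules table. The helper `estimated_joint_orders` is hypothetical, and because the supports shown are rounded, the counts are estimates.

```python
n_orders = 131209  # number of orders in the encoded data

# (antecedent, consequent, support) copied from the named rules table.
candidate_bundles = [
    ("Organic Lemon", "Bag of Organic Bananas", 0.008132),
    ("Organic Hass Avocado", "Bag of Organic Bananas", 0.018444),
    ("Organic Raspberries", "Organic Strawberries", 0.012728),
]


def estimated_joint_orders(support, n_transactions):
    """Approximate number of orders that contain both products."""
    return round(support * n_transactions)


# List candidate bundles from most to least frequently co-purchased.
for first, second, support in sorted(candidate_bundles, key=lambda b: b[2], reverse=True):
    count = estimated_joint_orders(support, n_orders)
    print(f"{first} + {second}: about {count} orders")
```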

Lastly, the rules I generated with the minimum support and confidence thresholds suggest that fruits and vegetables are very popular with users. It might therefore be worthwhile for Instacart to evaluate its logistics for sourcing and delivering these products. Because fresh produce is highly perishable, improvements here could reduce costs.