# <p style="background-color:#F8F1E8; font-family:newtimeroman;color:#602F44; font-size:150%; text-align:center; border-radius: 15px 50px;"> 🛒 Market Basket Analysis 🛍️ </p>

<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#4E5672'>❔ What is it?</font></h3>

* **Market Basket Analysis:** A data science method that examines which products customers buy together within a given time window and the relationships between those products.
    

<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#4E5672'> ❔ What is it used for? </font></h3>

1. **Forecasting and Strategy Development**: It lets us predict customers' future purchases from their shopping habits, so we can plan promotions and assortment by time period (for example, by day or by hour) to increase sales.
    
    **Example**: We can promote leisure products more heavily on particular days of the week, such as Saturday.

2. **Product Placement and Shelf Layout**: If we can understand which products are bought together, we can optimize in-store product placement and shelf layout to increase profits. 
    
    **Example**: People who buy cola tend to buy chips, people who buy salad ingredients tend to buy lemons, and people who buy organic vegetables tend to buy other organic vegetables.

3. **Campaign and Discount Strategies**: When we know which products are bought together, we can run a promotion on one and slightly raise the price of the other, increasing sales without reducing profit, since customers who buy one are likely to buy the other.
    
    **Example**: If purchases of potato chips rise with purchases of Coke, we can apply a discount strategy such as a 50% discount on potato chips for those who buy Coke.

4. **Inventory Management**: It can be used to manage the stock of products whose sales rise seasonally, or to keep the stock levels of products sold together equal or close.
    
    **Example**: Keeping more umbrellas in stock during rainy periods, or keeping similar stock levels of shampoo and conditioner.


<div style="border-radius:10px; border:#D0C2F0 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#5E5273'>📄 The Story of the Dataset </font></h3>
    <center><img src="https://static-prod.adweek.com/wp-content/uploads/2022/03/instacart-new-logo-featured-image-2022.jpg"> </center>
    
* Instacart is an American company that offers grocery shopping, delivery and pickup services in the US and Canada.
* The company offers its services through a website and mobile application. 
* Customers shop at participating retailers through the service, and the shopping itself is done by a personal shopper.
    
* We obtained the data from the 2017 Instacart Market Basket Analysis competition, which offered a $25,000 prize for predicting users' next orders.

In [1]:
import pandas as pd

In [2]:
orders = pd.read_csv("/kaggle/input/instacart-market-basket-analysis/orders.csv")

In [3]:
orders.head()

,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


<div style="border-radius:10px; border:#6B8BA0 solid; padding: 15px; background-color: #F2EADF; font-size:100%; text-align:left">

<h3 align="left"><font color='#6B8BA0'>👀 Features: </font></h3>
    
The meanings of the variables in the Orders data are as follows:

- **order_id**: A unique ID for each order.
- **user_id**: A unique ID of the user.
- **eval_set**: Specifies which evaluation set the order belongs to. It takes one of three values: "prior", "train", or "test".

    **prior**: The user's past orders. These form the purchase history.

    **train**: Orders in the training set, which the model learns from.

    **test**: Orders in the test set, held out to evaluate predictions on data the model has not seen before.

- **order_number**: The sequence number of the order for that user (1 is the user's first order).

- **order_dow**: Indicates on which day of the week the order was made (a number from 0 to 6, Sunday to Saturday).

- **order_hour_of_day**: Indicates the time of day the order was made (a number from 0 to 23).

- **days_since_prior_order**: The number of days between the previous order and this one. For a user's first order, the value is NaN (Not a Number).

In [4]:
orders.shape

(3421083, 7)

In [5]:
# if you want to reduce data size
#orders = orders.sample(frac=0.5, random_state=42)

In [6]:
# Combine day of week and hour into a single "day-hour" key, e.g. "2-8"
orders["day_hour"] = orders["order_dow"].astype(str) + "-" + orders["order_hour_of_day"].astype(str)

In [7]:
# Combine user and day of week into a single "user-day" key, e.g. "1-2"
orders["user_day"] = orders["user_id"].astype(str) + "-" + orders["order_dow"].astype(str)

In [8]:
orders = orders[orders["eval_set"]=="prior"]

In [9]:
orders.shape

(3214874, 9)

In [10]:
orders.isnull().sum()

order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
day_hour                       0
user_day                       0
dtype: int64

In [11]:
orders.duplicated().sum()

0

In [12]:
order_products = pd.read_csv("/kaggle/input/instacart-market-basket-analysis/order_products__prior.csv")

In [13]:
order_products.head()

,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [14]:
order_products.shape

(32434489, 4)

In [15]:
order_products.isnull().sum()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64

In [16]:
order_products.duplicated().sum()

0

In [17]:
df = pd.merge(orders,order_products, how="inner", on="order_id")[["order_id","user_id","product_id","day_hour","user_day"]]
df.head()

,order_id,user_id,product_id,day_hour,user_day
0,2539329,1,196,2-8,1-2
1,2539329,1,14084,2-8,1-2
2,2539329,1,12427,2-8,1-2
3,2539329,1,26088,2-8,1-2
4,2539329,1,26405,2-8,1-2


In [18]:
df.shape

(32434489, 5)

In [19]:
df["product_id"].value_counts()

product_id
24852    472565
13176    379450
21137    264683
21903    241921
47209    213584
          ...  
11356         1
34463         1
2769          1
16461         1
28818         1
Name: count, Length: 49677, dtype: int64

In [20]:
df["product_id"].value_counts().mean()

652.90756285605

In [21]:
import statsmodels.stats.api as sms
low_conf, up_conf = sms.DescrStatsW(df["product_id"].value_counts()).tconfint_mean()
print(f"Lower Confidence Interval: {low_conf:.0f}")
print(f"Upper Confidence Interval: {up_conf:.0f}")

Lower Confidence Interval: 611
Upper Confidence Interval: 695


In [22]:
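# Keep products purchased more often than the lower bound of the confidence interval for the mean count
product_counts = df["product_id"].value_counts()
important_products = product_counts[product_counts > low_conf].index
important_products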

Index([24852, 13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845,
       ...
       38635, 23495, 42221, 17070,  4933, 18102,  5386, 30278, 32420, 24920],
      dtype='int64', name='product_id', length=7248)

In [23]:
df["user_id"].value_counts()

user_id
201268    3725
129928    3638
164055    3061
186704    2936
176478    2921
          ... 
178594       3
201027       3
106736       3
119944       3
63995        3
Name: count, Length: 206209, dtype: int64

In [24]:
df = df[df["product_id"].isin(important_products)]
df.shape

(28214831, 5)

In [25]:
df["product_id"].value_counts()

product_id
24852    472565
13176    379450
21137    264683
21903    241921
47209    213584
          ...  
24920       611
17070       611
18102       611
4933        611
42221       611
Name: count, Length: 7248, dtype: int64

In [26]:
low_conf, up_conf = sms.DescrStatsW(df["user_id"].value_counts()).tconfint_mean()
print(f"Lower Confidence Interval: {low_conf:.0f}")
print(f"Upper Confidence Interval: {up_conf:.0f}")

Lower Confidence Interval: 136
Upper Confidence Interval: 138


In [27]:
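# Keep users with more purchased items than the lower bound of the confidence interval for the mean count
user_counts = df["user_id"].value_counts()
important_baskets = user_counts[user_counts > low_conf].index
important_baskets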

Index([201268, 129928, 186704, 182401, 137629, 176478, 164055,  79106,  60694,
        13701,
       ...
       195006, 182111, 184400, 142455, 201579, 188746, 102881,  96393,  37849,
       157130],
      dtype='int64', name='user_id', length=60532)

In [28]:
df = df[df["user_id"].isin(important_baskets)]
df.shape

(20528009, 5)

In [29]:
df["user_id"].value_counts()

user_id
201268    3547
129928    3166
186704    2712
182401    2669
137629    2640
          ... 
42655      137
34421      137
168604     137
21474      137
5094       137
Name: count, Length: 60532, dtype: int64

In [30]:
basket = df.groupby(["user_id","product_id"])["order_id"].count().unstack().notnull()
basket

user_id / product_id,1,10,23,25,28,34,37,45,49,54,...,49615,49621,49628,49640,49644,49652,49655,49667,49680,49683
2,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
10,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
17,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
21,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
206201,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
206202,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,True
206206,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
206207,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#4E5672'>🛠️ Apriori Algorithm</font></h3>
    
* The Apriori algorithm is often used for market basket analysis.

* The Apriori implementation in the mlxtend library uses the "support" metric (the share of baskets that contain an itemset) to find itemsets whose support is at or above a specified minimum value.
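
To make the support threshold concrete, here is a minimal pure-Python sketch of the level-wise search that Apriori performs. It is not mlxtend's implementation; the function name `apriori_sketch` and the toy baskets are illustrative assumptions. An itemset's support is the fraction of baskets that contain it, and a k-itemset is only counted if all of its (k-1)-subsets were already frequent.

```python
from itertools import combinations


def apriori_sketch(baskets, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        previous = set(level)
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset,
        # then keep only those that reach the support threshold
        level = [
            c for c in candidates
            if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return frequent


# Toy baskets (hypothetical data, not taken from the Instacart files)
toy_baskets = [
    {"banana", "lime", "lemon"},
    {"banana", "lime"},
    {"banana", "spinach"},
    {"lime", "lemon"},
    {"banana", "lime", "lemon", "spinach"},
]

toy_frequent = apriori_sketch(toy_baskets, min_support=0.6)
for itemset, value in sorted(toy_frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(value, 2))
```

With a minimum support of 0.6, `spinach` is dropped at the first level, and the triple `{banana, lime, lemon}` is never counted because `{banana, lemon}` is not frequent.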


In [31]:
from mlxtend.frequent_patterns import apriori, association_rules

In [32]:
frequent_itemsets = apriori(basket,min_support=0.1,use_colnames=True,verbose=1)
frequent_itemsets.sort_values("support", ascending=False)

Processing 580 combinations | Sampling itemset size 5 4


,support,itemsets
53,0.545893,(24852)
40,0.531834,(21137)
22,0.499851,(13176)
43,0.464862,(21903)
57,0.417449,(26209)
...,...,...
686,0.100211,"(26209, 5876, 21903)"
1079,0.100211,"(21137, 45007, 24964, 21903)"
366,0.100079,"(21616, 40706)"
469,0.100063,"(24852, 39877)"


In [33]:
rules = association_rules(frequent_itemsets,metric="support",min_threshold=0.01)
rules.sort_values(by="lift")

,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
1703,(24852),"(13176, 47209)",0.545893,0.273492,0.116897,0.214139,0.782981,-0.032400,0.924474,-0.379023
1698,"(13176, 47209)",(24852),0.273492,0.545893,0.116897,0.427424,0.782981,-0.032400,0.793094,-0.276155
218,(13176),(24852),0.499851,0.545893,0.222031,0.444195,0.813703,-0.050834,0.817025,-0.314018
219,(24852),(13176),0.545893,0.499851,0.222031,0.406730,0.813703,-0.050834,0.843038,-0.335184
1451,(24852),"(13176, 21137)",0.545893,0.322953,0.145477,0.266493,0.825176,-0.030821,0.923027,-0.318127
...,...,...,...,...,...,...,...,...,...,...
3106,(24964),"(46667, 22935)",0.351302,0.136275,0.104028,0.296120,2.172961,0.056154,1.227092,0.832126
3667,"(47209, 24964)","(13176, 22935)",0.211657,0.215737,0.102673,0.485092,2.248533,0.057011,1.523113,0.704345
3666,"(13176, 22935)","(47209, 24964)",0.215737,0.211657,0.102673,0.475917,2.248533,0.057011,1.504234,0.708010
3668,"(47209, 22935)","(13176, 24964)",0.209691,0.216993,0.102673,0.489640,2.256482,0.057172,1.534225,0.704575


<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#4E5672'>🛠️ Association Rules Algorithm </font></h3>
    
* Association rule generation derives rules from frequent itemsets.
    
* It takes the **itemsets** found by a frequent itemset mining algorithm such as Apriori and generates rules from the relationships between their items.
    

1. **antecedents:** Items on the left-hand side of the rule (the "if" part).
2. **consequents:** Items on the right-hand side of the rule (the "then" part).
3. **antecedent support:** Support of the antecedent.
4. **consequent support:** Support of the consequent.
5. **support:** Support of the antecedent and consequent together, i.e. the share of baskets that contain both.
6. **confidence:** How likely B is to be present when A is present (support / antecedent support).
7. **lift:** How much more often A and B occur together than expected if they were independent (confidence / consequent support). Values above 1 indicate a positive association, values below 1 a negative one.
8. **leverage:** The difference between the observed support and the support expected under independence (support - antecedent support × consequent support). A value of 0 indicates independence.
9. **conviction:** How strongly the consequent depends on the antecedent ((1 - consequent support) / (1 - confidence)). The higher the value, the stronger the dependency.
10. **zhangs_metric:** Measures both association and dissociation on a scale from -1 to 1. Positive values indicate association, negative values dissociation.
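
As a worked check of these definitions, the sketch below recomputes the metrics for the rule (13176) → (24852) from the three support values shown in the rules table above. The helper `rule_metrics` is a hypothetical name introduced for illustration, and the Zhang's metric formula used here is the commonly cited one, stated as an assumption rather than as mlxtend's exact code.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute association rule metrics from three support values."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    # Observed co-occurrence minus the co-occurrence expected under independence
    leverage = support - antecedent_support * consequent_support
    # Conviction is unbounded when the rule never fails (confidence == 1)
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    # Zhang's metric (assumed formula): leverage scaled into the range [-1, 1]
    denominator = max(
        support * (1 - antecedent_support),
        antecedent_support * (consequent_support - support),
    )
    zhangs_metric = leverage / denominator if denominator else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhangs_metric,
    }


# Support values copied from the rule (13176) -> (24852) in the table above
metrics = rule_metrics(
    antecedent_support=0.499851,
    consequent_support=0.545893,
    support=0.222031,
)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The results agree with the table row up to rounding of the inputs. A lift below 1 and a negative leverage both say that these two products appear together less often than independence would predict.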


In [34]:
random_product = rules.sample(1,random_state=45)["antecedents"].explode().iloc[0]
random_product

47626

In [35]:
lime  = 26209
banana = 24852


In [36]:
def arl_recommender(rules_df, product_id, rec=1):
    # Rank rules by lift so that the strongest associations come first
    sorted_rules = rules_df.sort_values("lift", ascending=False)
    recommendation_list = []
    for antecedents, consequents in zip(sorted_rules["antecedents"], sorted_rules["consequents"]):
        # Only rules whose antecedent contains the purchased product apply
        if product_id in antecedents:
            for item in consequents:
                if item not in recommendation_list:
                    recommendation_list.append(item)

    return recommendation_list[:rec]

In [37]:
arl_recommender(rules,random_product,5)

[26209, 21903, 24964, 21137, 24852]

In [38]:
products = pd.read_csv("/kaggle/input/instacart-market-basket-analysis/products.csv")
products.head()

,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


In [39]:
def names_of_products(rules_df, bought,recommend = 5):
    
    rec = arl_recommender(rules_df,bought,recommend)

    name_of_rec={}
    bought_name = products[products["product_id"]==bought]["product_name"].iloc[0]
    for i in rec:
        name_of_rec[i] = products[products["product_id"]==i]["product_name"].iloc[0]
    recommend_df = pd.DataFrame(name_of_rec.items(), columns=["product_id","product_name"])
    print(f"Bought: {bought_name}\n")
    return recommend_df

In [40]:
names_of_products(rules,random_product,5)

Bought: Large Lemon



,product_id,product_name
0,26209,Limes
1,21903,Organic Baby Spinach
2,24964,Organic Garlic
3,21137,Organic Strawberries
4,24852,Banana


<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#4E5672'>🛠️ H-Mine Algorithm</font></h3>
    
* H-Mine (hyper-structure mining) is a frequent itemset mining algorithm that, like Apriori, is used in association rule mining and is designed to run quickly on large datasets.
    
* It scans the dataset and keeps only the items and itemsets that meet the minimum support.
    
* It stores transactions in a hyperlinked structure (H-struct), so memory usage can be more efficient.
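
The pattern-growth family that H-Mine belongs to avoids generating candidate itemsets. Instead it extends a prefix one item at a time and rescans only the transactions that contain that prefix. The sketch below illustrates that idea in plain Python. It is a simplification, not H-Mine itself and not mlxtend's `hmine`: ordinary lists of tuples stand in for the algorithm's internal data structure, and `pattern_growth` and the toy baskets are hypothetical names.

```python
from collections import Counter


def pattern_growth(transactions, min_count, prefix=()):
    """Depth-first frequent itemset mining on projected transactions.

    Returns {itemset_tuple: count} for itemsets found in at least
    `min_count` transactions. Items inside each tuple are in sorted order.
    """
    # Count how many of the current (projected) transactions contain each item
    counts = Counter(item for t in transactions for item in t)
    results = {}
    for item in sorted(i for i, c in counts.items() if c >= min_count):
        itemset = prefix + (item,)
        results[itemset] = counts[item]
        # Projection: keep transactions containing `item`, restricted to the
        # items that sort after it, so each itemset is visited exactly once
        projected = [
            tuple(x for x in t if x > item) for t in transactions if item in t
        ]
        results.update(pattern_growth(projected, min_count, itemset))
    return results


# Toy baskets (hypothetical data, not taken from the Instacart files)
toy_transactions = [
    ("banana", "lime", "lemon"),
    ("banana", "lime"),
    ("banana", "spinach"),
    ("lime", "lemon"),
    ("banana", "lime", "lemon", "spinach"),
]

found = pattern_growth(toy_transactions, min_count=3)
for itemset, count in found.items():
    print(itemset, count, count / len(toy_transactions))
```

Dividing each count by the number of transactions gives the same support value that the `min_support` argument of the library functions refers to.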


In [41]:
from mlxtend.frequent_patterns import hmine

In [42]:
frequent_itemsets = hmine(basket,min_support=0.05,use_colnames=True)
frequent_itemsets.sort_values("support", ascending=False)

,support,itemsets
11508,0.545893,(24852)
6871,0.531834,(21137)
3534,0.499851,(13176)
9244,0.464862,(21903)
12523,0.417449,(26209)
...,...,...
857,0.050007,"(4920, 49683, 47766)"
10065,0.050007,"(26209, 47209, 47626, 21903, 47766)"
5048,0.050007,"(13176, 26209, 47626, 46667)"
9542,0.050007,"(24489, 26209, 21903)"


<div style="border-radius:10px; border:#4E5672 solid; padding: 15px; background-color: #F8F1E8; font-size:150%; text-align:left">

<h3 align="left"><font color='#4E5672'>👂 A Hearsay </font></h3>
    
* FP-Growth is reported to still outperform H-Mine on large datasets.

In [43]:
from mlxtend.frequent_patterns import fpgrowth

In [44]:
frequent_itemsets = fpgrowth(basket,min_support=0.05,use_colnames=True)
frequent_itemsets.sort_values("support", ascending=False)

,support,itemsets
0,0.545893,(24852)
31,0.531834,(21137)
1,0.499851,(13176)
125,0.464862,(21903)
32,0.417449,(26209)
...,...,...
13265,0.050007,"(40706, 47766, 48679)"
3192,0.050007,"(4920, 49683, 47766)"
8458,0.050007,"(27086, 44359)"
8926,0.050007,"(13176, 26209, 46667, 44359)"
