Dataset:

https://github.com/amankharwal/Website-data/blob/master/Groceries_dataset.csv

### Import libraries

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
from apyori import apriori

### Load the dataset

In [2]:
data = pd.read_csv("Groceries_dataset.csv")
data.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


### Data Exploration

Let’s first look at the 10 best-selling products:

In [3]:
print("Top 10 frequently sold products (Tabular Representation)")
# value_counts() already returns the counts in descending order
x = data['itemDescription'].value_counts().head(10)
fig = px.bar(x=x.index, y=x.values)
fig.update_layout(title_text="Top 10 frequently sold products (Graphical Representation)",
                  xaxis_title="Products", yaxis_title="Count")
fig.show()

Top 10 frequently sold products (Tabular Representation)


Now let’s see which months have the highest sales:

In [4]:
data["Year"] = data['Date'].str.split("-").str[-1]
data["Month-Year"] = data['Date'].str.split("-").str[1] + "-" + data['Date'].str.split("-").str[-1]
fig1 = px.bar(data["Month-Year"].value_counts(ascending=False),
              orientation="v",
              color = data["Month-Year"].value_counts(ascending=False),
              labels={'value':'Count','index':'Date','color':'Meter'})

fig1.update_layout(title_text="Exploring higher sales by the date")

fig1.show()

### Observations

1. Whole milk is the most frequently bought product, followed by other vegetables.
2. Most purchases occur in August and September, while February and March see the lowest demand.

In [5]:
products = data["itemDescription"].unique()

In [6]:
# one hot encoding the products

dummy = pd.get_dummies(data['itemDescription'], dtype=int)
data.drop(['itemDescription'], inplace=True, axis=1)

data = data.join(dummy)

data.head()

Unnamed: 0,Member_number,Date,Year,Month-Year,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,1808,21-07-2015,2015,07-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2552,05-01-2015,2015,01-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,2300,19-09-2015,2015,09-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1187,12-12-2015,2015,12-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3037,01-02-2015,2015,02-2015,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [8]:
# Transaction: all products a customer bought on the same day count as one transaction

data1 = data.groupby(['Member_number', 'Date'])[products].sum()
data1 = data1.reset_index()[products]

print("New Dimension", data1.shape)
data1.head()

New Dimension (14963, 167)


Unnamed: 0,tropical fruit,whole milk,pip fruit,other vegetables,rolls/buns,pot plants,citrus fruit,beef,frankfurter,chicken,...,flower (seeds),rice,tea,salad dressing,specialty vegetables,pudding powder,ready soups,make up remover,toilet cleaner,preservation products
0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
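
To make the transaction definition concrete, here is a minimal plain-Python sketch of the same grouping. The member numbers, dates, and items below are made up for illustration and are not taken from the dataset, and `build_transactions` is a hypothetical helper name. It groups purchases by `(Member_number, Date)` and keeps each item once per basket.

```python
from collections import defaultdict

# Hypothetical purchase rows: (member_number, date, item). Illustrative only.
rows = [
    (1000, "15-03-2015", "whole milk"),
    (1000, "15-03-2015", "yogurt"),
    (1000, "24-06-2014", "pastry"),
    (1001, "15-03-2015", "soda"),
    (1000, "15-03-2015", "whole milk"),  # same item bought twice in one visit
]


def build_transactions(rows):
    """Group rows by (member, date); each group becomes one transaction."""
    baskets = defaultdict(list)
    for member, date, item in rows:
        key = (member, date)
        if item not in baskets[key]:  # keep each item once per basket
            baskets[key].append(item)
    return dict(baskets)


transactions_by_key = build_transactions(rows)
for key, items in transactions_by_key.items():
    print(key, items)
```

In pandas, a similar grouping could probably be written as `data.groupby(["Member_number", "Date"])["itemDescription"].apply(list)` on the raw frame, before the `itemDescription` column is dropped. That variant would keep repeated items, so duplicates would need to be removed separately.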


In [9]:
# Replace every non-zero value with the name of the product:

def product_names(x):
    for product in products:
        if x[product] > 0:
            x[product] = product
    return x

# Cast to object first so that product names (strings) can be stored in the
# formerly integer columns without pandas' incompatible-dtype warning
data1 = data1.astype(object).apply(product_names, axis=1)
data1.head()

Unnamed: 0,tropical fruit,whole milk,pip fruit,other vegetables,rolls/buns,pot plants,citrus fruit,beef,frankfurter,chicken,...,flower (seeds),rice,tea,salad dressing,specialty vegetables,pudding powder,ready soups,make up remover,toilet cleaner,preservation products
0,0,whole milk,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,whole milk,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
print("Total Number of Transactions:", len(data1))

Total Number of Transactions: 14963


In [12]:
# Remove the zeros and extract the list of items bought in each transaction

transactions = [row[row != 0].tolist() for row in data1.values if (row != 0).any()]
transactions[0:10]

[['whole milk', 'yogurt', 'sausage', 'semi-finished bread'],
 ['whole milk', 'pastry', 'salty snack'],
 ['canned beer', 'misc. beverages'],
 ['sausage', 'hygiene articles'],
 ['soda', 'pickled vegetables'],
 ['frankfurter', 'curd'],
 ['whole milk', 'rolls/buns', 'sausage'],
 ['whole milk', 'soda'],
 ['beef', 'white bread'],
 ['frankfurter', 'soda', 'whipped/sour cream']]

### Implementation of Apriori Algorithm

In [13]:
# apyori reads min_support, min_confidence, min_lift and max_length; it does not define a "target" option
rules = apriori(transactions, min_support=0.00030, min_confidence=0.05, min_lift=3, max_length=2)
association_results = list(rules)
print(association_results[0])

RelationRecord(items=frozenset({'liver loaf', 'fruit/vegetable juice'}), support=0.00040098910646260775, ordered_statistics=[OrderedStatistic(items_base=frozenset({'liver loaf'}), items_add=frozenset({'fruit/vegetable juice'}), confidence=0.12, lift=3.5276227897838903)])
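
The three numbers in this record follow the usual definitions: support is the fraction of transactions that contain the itemset, confidence of a rule A -> B is support(A and B) / support(A), and lift is that confidence divided by support(B). The sketch below computes them by hand on a few made-up baskets (the `toy` list and the helper names are illustrative, not part of `apyori`). It then relates the thresholds used above to raw counts, assuming the `min_support` threshold is inclusive.

```python
import math


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, base, add):
    """support(base and add) / support(base)."""
    return support(transactions, set(base) | set(add)) / support(transactions, base)


def lift(transactions, base, add):
    """Confidence of base -> add divided by the support of add."""
    return confidence(transactions, base, add) / support(transactions, add)


# Hypothetical toy baskets, for illustration only.
toy = [
    ["whole milk", "yogurt"],
    ["whole milk", "soda"],
    ["yogurt", "soda"],
    ["whole milk", "yogurt", "soda"],
    ["beef"],
]

pair_support = support(toy, ["whole milk", "yogurt"])         # 2 of 5 baskets
rule_confidence = confidence(toy, ["whole milk"], ["yogurt"])  # 2 of the 3 milk baskets
rule_lift = lift(toy, ["whole milk"], ["yogurt"])
print(pair_support, rule_confidence, rule_lift)

# Relating the values above to raw counts over the 14963 transactions.
n_transactions = 14963
min_count = math.ceil(0.00030 * n_transactions)               # smallest count meeting min_support
pair_count = round(0.00040098910646260775 * n_transactions)  # count behind the printed record
print(min_count, pair_count)
```

Read this way, `min_support=0.00030` keeps only itemsets that occur in at least 5 of the 14963 transactions, and the printed pair occurs in 6 of them. A lift above 1 means the two items appear together more often than they would if bought independently.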


In [14]:
for item in association_results:

    # Take the rule direction from the first ordered statistic, because the
    # iteration order of the items frozenset is arbitrary
    stat = item.ordered_statistics[0]
    base = ", ".join(stat.items_base)
    add = ", ".join(stat.items_add)

    print("Rule : ", base, " -> " + add)
    print("Support : ", str(item.support))
    print("Confidence : ", str(stat.confidence))
    print("Lift : ", str(stat.lift))

    print("=================================")

Output for the first record (values as in the RelationRecord printed above):

Rule :  liver loaf  -> fruit/vegetable juice
Support :  0.00040098910646260775
Confidence :  0.12
Lift :  3.5276227897838903
=================================

Item pairs, confidence, and lift for the remaining six records:

Item 1,Item 2,Confidence,Lift
ham,pickled vegetables,0.05970149253731344,3.4895055970149254
roll products,meat,0.06097560975609757,3.620547812620984
misc. beverages,salt,0.05617977528089888,3.5619405827461437
spread cheese,misc. beverages,0.05,3.170127118644068
seasonal products,soups,0.10416666666666667,14.704205974842768
spread cheese,sugar,0.06,3.3878490566037733
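
The loop above reads only the first ordered statistic of each record. A record can hold more than one, for example when both directions of a pair pass the thresholds. The sketch below flattens records into one row per rule and sorts them by lift. To stay self-contained it uses `namedtuple` stand-ins with the same field names as the record printed earlier; the first record copies those printed values, the second record is hypothetical, and `flatten` is an illustrative helper name.

```python
from collections import namedtuple

# Stand-ins with the same field names as the record printed earlier.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

records = [
    # Values copied from the first record printed earlier.
    RelationRecord(
        items=frozenset({"liver loaf", "fruit/vegetable juice"}),
        support=0.00040098910646260775,
        ordered_statistics=[
            OrderedStatistic(frozenset({"liver loaf"}), frozenset({"fruit/vegetable juice"}),
                             0.12, 3.5276227897838903),
        ],
    ),
    # Hypothetical record with a rule in each direction (made-up numbers).
    RelationRecord(
        items=frozenset({"item a", "item b"}),
        support=0.001,
        ordered_statistics=[
            OrderedStatistic(frozenset({"item a"}), frozenset({"item b"}), 0.10, 20.0),
            OrderedStatistic(frozenset({"item b"}), frozenset({"item a"}), 0.20, 20.0),
        ],
    ),
]


def flatten(records):
    """Return one dict per rule (base -> add), sorted by lift, highest first."""
    rows = []
    for record in records:
        for stat in record.ordered_statistics:
            rows.append({
                "base": ", ".join(sorted(stat.items_base)),
                "add": ", ".join(sorted(stat.items_add)),
                "support": record.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return sorted(rows, key=lambda r: r["lift"], reverse=True)


rules_table = flatten(records)
for row in rules_table:
    print(f'{row["base"]} -> {row["add"]}: support={row["support"]:.6f}, '
          f'confidence={row["confidence"]:.2f}, lift={row["lift"]:.2f}')
```

With the real results, `flatten(association_results)` should work the same way, since the records expose these fields by name. The resulting list of dicts can also be passed to `pd.DataFrame` for further filtering.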


Source:

https://thecleverprogrammer.com/2020/11/16/apriori-algorithm-using-python/

https://github.com/amankharwal/Website-data/blob/master/association_rule_market_basket_analysis.ipynb