In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Association rule learning** is a rule-based machine learning method for discovering relations between variables in large databases. It aims to identify strong rules using measures of interestingness.
Association rule learning tries to quantify the strength of co-occurrence. It does not take order into account: each transaction is treated as a set of items, and the emphasis is on capturing all the items within a transaction as a group.
Association rule mining finds interesting associations and relationships among large sets of data items. A rule shows how frequently an itemset occurs in the transactions. A typical example is Market Basket Analysis.
Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together.

Association rules are very useful for analyzing transaction datasets. In supermarkets, the data is collected using bar-code scanners. Such databases consist of a large number of transaction records, each listing all items bought by a customer in a single purchase. A manager can therefore find out whether certain groups of items are consistently purchased together and use this information to adjust store layouts, cross-selling, and promotions.

The algorithm is named **Apriori** because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets.
To make the level-wise generation of frequent itemsets more efficient, it relies on an important property, called the Apriori property, which reduces the search space.

The key concept of the Apriori algorithm is the anti-monotonicity of the support measure: adding items to an itemset can never increase its support. This gives two equivalent statements:
* All non-empty subsets of a frequent itemset must be frequent (Apriori property).
* If an itemset is infrequent, all its supersets will be infrequent.
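
The level-wise search and the pruning step described above can be sketched in plain Python. This is only an illustrative sketch on a made-up toy dataset, not the implementation used by mlxtend; the function name `apriori_sketch` and the toy baskets are assumptions introduced for the example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets (level-wise search)."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 1
    while current:
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (Apriori property): drop candidates with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k))}
        # Count support only for the surviving candidates
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical toy baskets, not taken from the groceries dataset
toy = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "butter"],
    ["milk", "bread", "butter"],
]
result = apriori_sketch(toy, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

With a minimum support of 0.6, the three single items and the three pairs are frequent, while the triple (support 0.4) is not, so the search stops at level 3.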

**The Apriori Algorithm** is the foundation of market basket analysis and is used to find items frequently bought together. Knowing that two items X and Y are often bought together is very useful for retailers because:

* Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.
* Promotional discounts could be applied to just one out of the two items.
* Advertisements on X could be targeted at buyers who purchase Y.
* X and Y could be combined into a new product, such as having Y in flavors of X.

* **Support Count ($\sigma$)** – The number of transactions in which an itemset occurs.
* **Frequent Itemset** – An itemset whose support is greater than or equal to the minimum support (minsup) threshold.
* **Association Rule** – An implication expression of the form X -> Y, where X and Y are disjoint itemsets.


Rule Evaluation Metrics –

* **Support (s)** –
The number of transactions that include all items in both the {X} and {Y} parts of the rule, as a fraction of the total number of transactions. It measures how frequently the items occur together across all transactions.
* $\text{Supp}(X \Rightarrow Y) = \sigma(X \cup Y) \div N$ –
Here N is the total number of transactions. It is interpreted as the fraction of transactions that contain both X and Y.
* **Confidence (c)** –
The ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
* $\text{Conf}(X \Rightarrow Y) = \text{Supp}(X \cup Y) \div \text{Supp}(X)$ –
It measures how often the items in Y appear in transactions that contain X.
* **Lift (l)** –
The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. Under independence, the expected confidence is simply the support of {Y}.
* $\text{Lift}(X \Rightarrow Y) = \text{Conf}(X \Rightarrow Y) \div \text{Supp}(Y)$ –
A lift value near 1 indicates that X and Y appear together about as often as expected under independence, a value greater than 1 means they appear together more often than expected, and a value less than 1 means they appear together less often than expected. Larger lift values indicate a stronger association.
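
As a worked example of the three metrics above, the following sketch computes support, confidence, and lift for a single rule on a small made-up dataset. The helper `rule_metrics` and the toy baskets are hypothetical and only serve to illustrate the formulas.

```python
def rule_metrics(transactions, X, Y):
    """Return (support, confidence, lift) of the rule X => Y."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    X, Y = set(X), set(Y)

    count_x = sum(X <= b for b in baskets)         # sigma(X)
    count_y = sum(Y <= b for b in baskets)         # sigma(Y)
    count_xy = sum((X | Y) <= b for b in baskets)  # sigma(X u Y)

    support = count_xy / n
    confidence = count_xy / count_x
    lift = confidence / (count_y / n)
    return support, confidence, lift

# Hypothetical toy baskets, not taken from the groceries dataset
toy_baskets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread"],
    ["milk"],
    ["butter"],
]
support, confidence, lift = rule_metrics(toy_baskets, X=["milk"], Y=["bread"])
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here milk and bread each appear in 3 of 5 baskets and together in 2, so the support is 2/5, the confidence is 2/3, and the lift is (2/3) / (3/5) = 10/9, slightly above 1.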

In [2]:
! pip install squarify
! pip install --index-url https://test.pypi.org/simple/ PyARMViz

In [3]:
# Importing libraries
import numpy as np
import pandas as pd 

# mlxtend will be used for market basket analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
import squarify
import matplotlib
from matplotlib import style
import matplotlib.pyplot as plt
import seaborn as sns
from PyARMViz import PyARMViz
from PyARMViz.Rule import generate_rule_from_dict

sns.set()
matplotlib.rcParams['figure.figsize'] = (50, 40)
style.use('ggplot')

* mlxtend-
Mlxtend (machine learning extensions) is a Python library of useful tools for day-to-day data science tasks.
* squarify-
Squarify is a good fit for plotting treemaps. A treemap displays hierarchical data as a set of nested rectangles, which makes it suitable for showing a large amount of data in a compact way.
* PyARMViz-
PyARMViz is a Python library for visualizing association rules; it uses the Efficient-Apriori algorithm as its backend.

In [4]:
import pandas as pd

basket = pd.read_csv('/kaggle/input/groceries-dataset/Groceries_dataset.csv')
basket.head()

In [5]:
basket.info()

In [6]:
# Assumption: dates in the CSV are written day-first (DD-MM-YYYY)
basket['Date'] = pd.to_datetime(basket['Date'], dayfirst=True)

In [7]:
basket['Member_number'].nunique()

**There are 3898 unique customers**

In [8]:
basket.describe(include=object)

**There are 167 unique grocery items**

In [9]:
basket['itemDescription'].value_counts()

In [10]:
Top_20_selling_product = basket["itemDescription"].value_counts().reset_index(name='Quantity').head(20)
Top_20_selling_product

In [11]:
plt.figure(figsize=(30,10))
colors=sns.color_palette('crest')
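# The first column holds the item names; its label is "index" or "itemDescription" depending on the pandas version
ax=sns.barplot(x=Top_20_selling_product.columns[0],y="Quantity",data=Top_20_selling_product,palette=colors)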
for i in ax.containers:
    ax.bar_label(i)
plt.title("Top_20_selling_product")
plt.show()

In [12]:
# Number of item rows recorded per customer, sorted in ascending order
basket_agg = basket.groupby('Member_number')['Date'].agg(['count']).sort_values('count')
basket_agg

In [13]:
basket_agg.describe()

**On average, each customer buys 5-6 products**

In [14]:
# Purchase history of a single customer (member number 1187), ordered by purchase date
basket[basket['Member_number'] == 1187].sort_values(by='Date')

In [15]:
# Get the items purchased for each transaction
# One transaction = all items bought by the same member on the same date
transactions = basket.groupby(['Member_number', 'Date'])['itemDescription'].apply(list).tolist()
print(len(transactions))

**There are 14963 unique transactions**
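
The grouping step above treats every (member, date) pair as one basket. A minimal sketch of the same idea in plain Python, using a few made-up rows (the member numbers, dates, and the names `toy_rows` and `toy_transactions` are hypothetical):

```python
from collections import defaultdict

# Hypothetical rows in the same shape as the dataset: (Member_number, Date, itemDescription)
toy_rows = [
    (1001, "2015-01-05", "whole milk"),
    (1001, "2015-01-05", "rolls/buns"),
    (1001, "2015-03-12", "soda"),
    (1002, "2015-01-05", "whole milk"),
    (1002, "2015-01-05", "yogurt"),
]

# Collect the items of each (member, date) pair into one basket
toy_baskets_by_key = defaultdict(list)
for member, date, item in toy_rows:
    toy_baskets_by_key[(member, date)].append(item)

toy_transactions = list(toy_baskets_by_key.values())
print(len(toy_transactions))
print(toy_transactions)
```

Member 1001 shopped on two different dates, so the five rows collapse into three transactions.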

In [16]:
transactions

In [17]:
# We use TransactionEncoder from the mlxtend module to one-hot encode our data
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
te_ary

In [18]:
# Convert to a pandas dataframe
transactions = pd.DataFrame(te_ary, columns=te.columns_)
transactions

In [19]:
pf = transactions.describe()
pf

In [20]:
pf.iloc[0]

In [21]:
pf.iloc[3]

In [22]:
# count minus the frequency of the most common value (False) = number of transactions containing each item
f = pf.iloc[0]-pf.iloc[3]
f

In [23]:
a=f.tolist()
a

In [24]:
b=list(f.index)
b

In [25]:
#creating item purchase counts from our transaction data
#pf = transactions.describe()
#f = pf.iloc[0]-pf.iloc[3]
#a = f.tolist()
#b = list(f.index)
item = pd.DataFrame([[a[r],b[r]]for r in range(len(a))], columns=['Count','Item'])
item = item.sort_values(['Count'], ascending=False).head(20) #Focusing on top 20 items 
item

In [26]:
item.info()

In [27]:
mini = min(item["Count"])
maxi = max(item["Count"])
print(mini,maxi)

In [28]:
fig, ax = plt.subplots()

# set color scheme
cmap = matplotlib.cm.coolwarm

# Get upper and lower bounds for the color mapping
mini = min(item["Count"])
maxi = max(item["Count"])

# Set our color mapping limits 
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)

# Obtain our raw colors 
colors = [cmap(norm(value)) for value in item["Count"]]

# Create the TreeMap plot with Squarify
squarify.plot(sizes=item["Count"], label=item["Item"], alpha=0.8, color=colors)
plt.axis('off')
plt.title("Top 20 Frequent Basket Items", fontsize=32)
ttl = ax.title
ttl.set_position([.5, 1.05])

In [29]:
# Find frequent itemsets with mlxtend's apriori, using a minimum support of 0.001 and a maximum itemset length of 5
frequent_itemsets = apriori(transactions, min_support=0.001, use_colnames=True, max_len=5)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets.sort_values(['support'], ascending=False)

Above is the list of frequent itemsets (single items or combinations of items), in descending order of support.
Whole milk is present in about 15% of baskets, while other vegetables is present in about 12% of baskets.

**Let's first analyze the rules with a minimum support of 5% and then of 1%, in both cases using support as the metric.**

In [30]:
ar = association_rules(frequent_itemsets, metric="support", min_threshold=0.05)
ar["antecedent_len"] = ar["antecedents"].apply(lambda x: len(x))
ar["consequents_len"] = ar["consequents"].apply(lambda x: len(x))
ar = ar[ar['antecedent_len'] == 1].copy()  # copy so the column assignment below does not act on a slice
ar['antecedents'] = ar['antecedents'].apply(lambda x: list(x)[0])
ar = ar[['antecedents', "antecedent_len", 'consequents', "consequents_len", 'antecedent support','consequent support', 'support', 'confidence']]
ar

In [31]:
ar[ar['antecedents'] == 'other vegetables'].sort_values(['confidence'], ascending=False).head(10)

In [32]:
ar[ar['antecedents'] == 'whole milk'].sort_values(['confidence'], ascending=False).head(10)

In [33]:
ar = association_rules(frequent_itemsets, metric="support", min_threshold=0.01)
ar["antecedent_len"] = ar["antecedents"].apply(lambda x: len(x))
ar["consequents_len"] = ar["consequents"].apply(lambda x: len(x))
ar = ar[ar['antecedent_len'] == 1].copy()  # copy so the column assignment below does not act on a slice
ar['antecedents'] = ar['antecedents'].apply(lambda x: list(x)[0])
ar = ar[['antecedents', "antecedent_len", 'consequents', "consequents_len", 'antecedent support','consequent support', 'support', 'confidence']]
ar

In [34]:
ar[ar['antecedents'] == 'other vegetables'].sort_values(['confidence'], ascending=False).head(10)

In [35]:
ar[ar['antecedents'] == 'whole milk'].sort_values(['confidence'], ascending=False).head(10)

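**Let's now analyze the rules using lift as the metric, with a minimum lift threshold of 0.05 and then of 0.01. Since a lift of 1 corresponds to independence, thresholds this far below 1 filter out very few rules.**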

In [36]:
ar = association_rules(frequent_itemsets, metric="lift", min_threshold=0.05)
ar["antecedent_len"] = ar["antecedents"].apply(lambda x: len(x))
ar["consequents_len"] = ar["consequents"].apply(lambda x: len(x))
ar = ar[ar['antecedent_len'] == 1].copy()  # copy so the column assignment below does not act on a slice
ar['antecedents'] = ar['antecedents'].apply(lambda x: list(x)[0])
ar = ar[['antecedents', "antecedent_len", 'consequents', "consequents_len", 'antecedent support','consequent support', 'support', 'confidence','lift']]
ar

In [37]:
ar[ar['antecedents'] == 'other vegetables'].sort_values(['confidence'], ascending=False).head(20)

In [38]:
ar[ar['antecedents'] == 'whole milk'].sort_values(['confidence'], ascending=False).head(20)

In [39]:
ar = association_rules(frequent_itemsets, metric="lift", min_threshold=0.01)
ar["antecedent_len"] = ar["antecedents"].apply(lambda x: len(x))
ar["consequents_len"] = ar["consequents"].apply(lambda x: len(x))
ar = ar[ar['antecedent_len'] == 1].copy()  # copy so the column assignment below does not act on a slice
ar['antecedents'] = ar['antecedents'].apply(lambda x: list(x)[0])
ar = ar[['antecedents', "antecedent_len", 'consequents', "consequents_len", 'antecedent support','consequent support', 'support', 'confidence','lift']]
ar

In [40]:
ar[ar['antecedents'] == 'other vegetables'].sort_values(['confidence'], ascending=False).head(20)

In [41]:
ar[ar['antecedents'] == 'whole milk'].sort_values(['confidence'], ascending=False).head(20)

In [42]:
b = association_rules(frequent_itemsets, metric="lift", min_threshold=0.001)
b['uni'] = np.nan
b['ant'] = np.nan
b['con'] = np.nan
b['tot'] = len(transactions)  # total number of transactions (14963)

In [43]:
transactions = [a[1]['itemDescription'].tolist() for a in list(basket.groupby(['Member_number','Date']))]

def trans():
    for t in transactions:
        yield t
    
def ant(x):
    # Count the transactions that contain every item of itemset x
    cnt = 0
    for t in trans():
        if x.issubset(t):
            cnt = cnt + 1 
    return cnt

bb = b.values.tolist()  

In [44]:
rules_dict = []
# Access columns by name rather than by position, so the code does not
# depend on how many metric columns association_rules returns
for lhs, rhs, total in zip(b['antecedents'], b['consequents'], b['tot']):
    diction = {
        'lhs': tuple(lhs), 
        'rhs': tuple(rhs),
        'count_full': ant(lhs.union(rhs)),
        'count_lhs': ant(lhs),
        'count_rhs': ant(rhs),
        'num_transactions': int(total)
    }
    rules_dict.append(diction)


In [45]:
rules = []
for rd in rules_dict: 
    rules.append(generate_rule_from_dict(rd))

**Parallel categories diagram** is a visualization of multi-dimensional categorical datasets in which each variable is represented by a column of rectangles, and each rectangle corresponds to a discrete value taken by that variable.

**Affinity Analysis**: By looking for combinations of items that frequently occur together in transactions, we try to uncover associations between these items, for example to improve product placement in physical stores. The graph below shows these associations visualized as rules.


In [46]:
#Parallel Category Plot
PyARMViz.generate_parallel_category_plot(rules)

The plot above depicts the relationships between the items for which rules were identified. For example, a relationship such as {Bread, Cake} → {Coffee} can be read directly from it.

**Network Graph** - Graph-based techniques visualize association rules using vertices and edges. Vertices annotated with item labels represent items, and itemsets or rules are represented as a second set of vertices.
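
The two kinds of vertices described above can be illustrated with a small edge list. This is only a sketch of the general idea, not how PyARMViz builds its graph internally; the rules in `toy_rules` and the `rule_<i>` node names are made up for the example.

```python
# Hypothetical rules as (antecedent items, consequent items)
toy_rules = [
    (("whole milk",), ("yogurt",)),
    (("sausage", "rolls/buns"), ("whole milk",)),
]

edges = []
for i, (lhs, rhs) in enumerate(toy_rules):
    rule_node = f"rule_{i}"                            # one vertex per rule
    edges += [(item, rule_node) for item in lhs]       # antecedent item -> rule
    edges += [(rule_node, item) for item in rhs]       # rule -> consequent item

# Item vertices are every endpoint that is not a rule vertex
item_nodes = {v for edge in edges for v in edge if not v.startswith("rule_")}

for source, target in edges:
    print(f"{source} -> {target}")
```

Each rule contributes one incoming edge per antecedent item and one outgoing edge per consequent item, which is why such graphs become cluttered quickly as the number of rules grows.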

In [47]:
# Network Graph
PyARMViz.generate_rule_graph_plotly(rules)

These visualizations give a clear and attractive representation of the mined rules. However, they tend to become cluttered as the number of rules grows, so they are only useful for visualizing a small set of association rules.


**Association rule strength** is a scatterplot with support and confidence on its axes; a third measure, lift by default, is indicated by the color scale.
An association rule is nothing more than a mapping between an event that occurred (for example, the purchase of product X by a customer) and an event that is likely to occur (for example, the purchase of product Y given that X was purchased).

In [48]:
#Association rule strength distribution
PyARMViz.generate_rule_strength_plot(rules)

From the plot above, it is clear that rules with high lift have a relatively low support.
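
This observation also follows from the definitions. Every transaction that contains $X \cup Y$ contains both X and Y, so $\text{Supp}(X \cup Y) \le \text{Supp}(X)$ and $\text{Supp}(X \cup Y) \le \text{Supp}(Y)$. Substituting these bounds into the lift formula gives:

$$
\text{Lift}(X \Rightarrow Y) = \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X)\,\text{Supp}(Y)} \le \frac{\text{Supp}(X \cup Y)}{\text{Supp}(X \cup Y)^2} = \frac{1}{\text{Supp}(X \cup Y)}
$$

A rule with support 0.5 can therefore never have a lift above 2, whereas a rule with support 0.001 can in principle reach a lift of up to 1000. Very high lift values are only possible for rules with low support.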