# **CSI 382 - Data Mining and Knowledge Discovery**

# **Lab 10 - Association Rules**

Affinity analysis is the study of attributes or characteristics that “go together.”
Methods for affinity analysis, also known as market basket analysis, seek to uncover associations among these attributes; that is, they seek to uncover rules for
quantifying the relationship between two or more attributes. Association rules
take the form “If antecedent, then consequent,” along with a measure of the support and confidence associated with the rule.

For example, a particular supermarket may find that of the 1000 customers shopping on a Thursday night, 200 bought a pen, and of the 200 who bought a pen,
50 bought paper. Thus, the association rule would be: “If buy pen, then buy
paper,” with a support of 50/1000 = 5% and a confidence of 50/200 = 25%.
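
The arithmetic in this example can be checked in a few lines. This is a minimal sketch that uses only the counts quoted above; the variable names are illustrative.

```python
# Counts taken from the supermarket example above
total_customers = 1000      # customers shopping on Thursday night
bought_pen = 200            # customers who bought a pen (the antecedent)
bought_pen_and_paper = 50   # customers who bought both a pen and paper

# Support: share of all transactions that contain both items
support = bought_pen_and_paper / total_customers

# Confidence: share of the antecedent's transactions that also contain the consequent
confidence = bought_pen_and_paper / bought_pen

print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# prints: support = 5%, confidence = 25%
```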

# **Installing packages and importing libraries**

In [None]:
!pip install squarify

In [None]:
# for basic operations
import numpy as np
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import squarify
import seaborn as sns
plt.style.use('fivethirtyeight')

# for defining path
import os

# for market basket analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# **Dataset for Lab 10**

**Data Set Information:**

We have a dataset from a store in a mall containing about 7500 transactions, each recording the items bought by one customer. We want to find associations between the items in the store, so that if a customer is buying, say, apple, banana, and mango, we can tell which item that customer is likely to be interested in buying next.


**Problem Statement**

Market owners want to know what customers will buy next based on the products they have already bought, so that they can adjust product placement to increase sales. This problem can be addressed by using the Apriori algorithm to perform a market basket analysis of the customers' buying behavior.

The dataset can be found at this [URL](https://drive.google.com/file/d/1DDtVOZwFQJBn0zXc69sfb2zld1PofX57/view?usp=sharing).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## **Loading the dataset**

In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/CSI 382 - Datasets/Market_Basket_Optimisation.csv', header = None)

#Check number of rows and columns in the dataset
print("The dataset has %d rows and %d columns." % df.shape)

In [None]:
df.head()

# **Dataset Visualization**


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud

plt.rcParams['figure.figsize'] = (15, 15)
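# build the text from every item in the dataset (str(df) would only cover the truncated preview)
text = ' '.join(str(item) for item in df.values.ravel() if pd.notna(item))
wordcloud = WordCloud(background_color = 'white', width = 1200,  height = 1200, max_words = 121).generate(text)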
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular Items',fontsize = 20)
plt.show()

The bigger words in the word cloud depict the most popular items sold in the supermarket.

In [None]:
# looking at the frequency of most popular items

plt.rcParams['figure.figsize'] = (18, 7)
color = plt.cm.copper(np.linspace(0, 1, 121))
df[0].value_counts().head(40).plot.bar(color = color)
plt.title('frequency of most popular items', fontsize = 20)
plt.xticks(rotation = 90 )
plt.grid()
plt.show()

Let's see a treemap implementation of the frequency of the data.

In [None]:

y = df[1].value_counts().head(40).to_frame()

y.index

In [None]:
# plotting a tree map

plt.rcParams['figure.figsize'] = (20, 20)
color = plt.cm.cool(np.linspace(0, 1, 40))
squarify.plot(sizes = y.values, label = y.index, alpha=.8, color = color)
plt.title('Tree Map for Popular Items')
plt.axis('off')
plt.show()

Now we can check the first choices of the customers for all data.

In [None]:
df['food'] = 'Food'
food = df.truncate(before = -1, after = 15)

import networkx as nx

food = nx.from_pandas_edgelist(food, source = 'food', target = 0, edge_attr = True)

In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (20, 20)
pos = nx.spring_layout(food)
color = plt.cm.Wistia(np.linspace(0, 15, 1))
nx.draw_networkx_nodes(food, pos, node_size = 15000, node_color = color)
nx.draw_networkx_edges(food, pos, width = 3, alpha = 0.6, edge_color = 'black')
nx.draw_networkx_labels(food, pos, font_size = 20, font_family = 'sans-serif')
plt.axis('off')
plt.grid()
plt.title('Top 15 First Choices', fontsize = 40)
plt.show()

Now we can check the second choices of the customers for all data.

In [None]:
df['secondchoice'] = 'Second Choice'
secondchoice = df.truncate(before = -1, after = 15)
secondchoice = nx.from_pandas_edgelist(secondchoice, source = 'food', target = 1, edge_attr = True)

In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (20, 20)
pos = nx.spring_layout(secondchoice)
color = plt.cm.Blues(np.linspace(0, 15, 1))
nx.draw_networkx_nodes(secondchoice, pos, node_size = 15000, node_color = color)
nx.draw_networkx_edges(secondchoice, pos, width = 3, alpha = 0.6, edge_color = 'brown')
nx.draw_networkx_labels(secondchoice, pos, font_size = 20, font_family = 'sans-serif')
plt.axis('off')
plt.grid()
plt.title('Top 15 Second Choices', fontsize = 40)
plt.show()

Now we can check the third choices of the customers for all data.

In [None]:
df['thirdchoice'] = 'Third Choice'
thirdchoice = df.truncate(before = -1, after = 10)
thirdchoice = nx.from_pandas_edgelist(thirdchoice, source = 'food', target = 2, edge_attr = True)


In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (20, 20)
pos = nx.spring_layout(thirdchoice)
color = plt.cm.Reds(np.linspace(0, 15, 1))
nx.draw_networkx_nodes(thirdchoice, pos, node_size = 15000, node_color = color)
nx.draw_networkx_edges(thirdchoice, pos, width = 3, alpha = 0.6, edge_color = 'pink')
nx.draw_networkx_labels(thirdchoice, pos, font_size = 20, font_family = 'sans-serif')
plt.axis('off')
plt.grid()
plt.title('Top 10 Third Choices', fontsize = 40)
plt.show()


# **Data Preprocessing**

There are two principal methods of representing this type of market basket data:
using either the transactional data format or the tabular data format. The transactional data format requires only two fields, an ID field and a content field,
with each record representing a single item only.

For example, the data in Table 1 could be represented using transactional data
format as shown in Table 2. In the tabular data format, each record represents
a separate transaction, with as many 0/1 flag fields as there are items. The data
from Table 2 could be represented using the tabular data format, as shown in
Figure 1.
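
The conversion between the two formats can be sketched as follows. This is a minimal illustration on hypothetical toy records, not the data from the tables referenced above; the names `transactional`, `baskets`, and `tabular` are assumptions made for the example.

```python
# Hypothetical toy data in transactional format:
# one (transaction ID, item) pair per record
transactional = [
    (1, "broccoli"), (1, "corn"),
    (2, "corn"), (2, "squash"), (2, "beans"),
    (3, "beans"),
]

# Collect the distinct items in a fixed (sorted) column order
items = sorted({item for _, item in transactional})

# Group the items of each transaction ID into a set
baskets = {}
for tid, item in transactional:
    baskets.setdefault(tid, set()).add(item)

# Tabular format: one row per transaction, one 0/1 flag per item
tabular = {tid: [int(item in basket) for item in items]
           for tid, basket in baskets.items()}

print(items)
for tid, flags in tabular.items():
    print(tid, flags)
```

The `TransactionEncoder` used later in this lab produces the same kind of flag table, with `True`/`False` values instead of 0/1.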

In [None]:
# converting each customer's basket into a list of the same length (20 item slots)
trans = []
for i in range(0, df.shape[0]):
    trans.append([str(df.values[i,j]) for j in range(0, 20)])

# converting it into a numpy array
trans = np.array(trans)

# checking the shape of the array
print(trans.shape)

In [None]:
trans[:,0:10]

In [None]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
df = te.fit_transform(trans)
df = pd.DataFrame(df, columns = te.columns_)

# getting the shape of the data
df.shape

In [None]:
df.head(10)

In [None]:
import warnings
warnings.filterwarnings('ignore')

# getting correlations for 121 items would be messy
# so let's reduce the items from 121 to 40

df = df.loc[:, ['mineral water', 'burgers', 'turkey', 'chocolate', 'frozen vegetables', 'spaghetti',
                    'shrimp', 'grated cheese', 'eggs', 'cookies', 'french fries', 'herb & pepper', 'ground beef',
                    'tomatoes', 'milk', 'escalope', 'fresh tuna', 'red wine', 'ham', 'cake', 'green tea',
                    'whole wheat pasta', 'pancakes', 'soup', 'muffins', 'energy bar', 'olive oil', 'champagne',
                    'avocado', 'pepper', 'butter', 'parmesan cheese', 'whole wheat rice', 'low fat yogurt',
                    'chicken', 'vegetables mix', 'pickles', 'meatballs', 'frozen smoothie', 'yogurt cake']]

# checking the shape
df.shape

In [None]:
df.columns

In [None]:
df.head()

# **Apriori Algorithm**

The algorithm was first proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant. The Apriori algorithm finds the frequent itemsets in a transaction database and identifies association rules between the items, as in the pen-and-paper example above.

To construct association rules between items, the algorithm considers three important measures: support, confidence, and lift. Each is explained below.

**Support**:

The support of an item I is the ratio of the number of transactions containing I to the total number of transactions, as given by the equation in the lecture slides.

**Confidence**:

Confidence is the proportion of transactions containing item I1 in which item I2 also appears. The confidence of the rule from I1 to I2 is the number of transactions containing both I1 and I2 divided by the number of transactions containing I1. (Take I1 as X and I2 as Y.)

**Lift**:

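Lift is the ratio of the confidence of a rule to the support of its consequent. With I1 as X and I2 as Y, the lift of "if X, then Y" is the confidence of the rule divided by the support of Y. A lift above 1 means that X and Y appear together more often than would be expected if they were independent.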
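
The three measures can be computed directly from a list of baskets. This is a minimal sketch on hypothetical toy transactions; the function names `support`, `confidence`, and `lift` are illustrative and are not part of mlxtend.

```python
# Hypothetical toy transactions; each basket is a set of items
transactions = [
    {"pen", "paper"},
    {"pen"},
    {"paper", "ink"},
    {"pen", "paper", "ink"},
    {"ink"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)

print("support(pen, paper)     =", round(support({"pen", "paper"}), 3))
print("confidence(pen -> paper) =", round(confidence({"pen"}, {"paper"}), 3))
print("lift(pen -> paper)       =", round(lift({"pen"}, {"paper"}), 3))
```

In this toy data the lift of "if pen, then paper" is slightly above 1, so buying a pen makes buying paper a little more likely than its baseline rate.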

**Strong Rules**

Analysts may prefer rules that have either high support or high confidence, and
usually both. Strong rules are those that meet or surpass certain minimum support and confidence criteria.
For example, an analyst interested in finding which supermarket items are purchased together may set a minimum support level of 20% and a minimum confidence level of 70%. On the other hand, a fraud detection analyst or a terrorism
detection analyst would need to reduce the minimum support level to 1% or less,
since comparatively few transactions are either fraudulent or terror-related.
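
Selecting strong rules amounts to filtering on both thresholds at once. The sketch below uses hypothetical rules with made-up support and confidence values, and the thresholds from the supermarket example above.

```python
# Hypothetical rules with made-up support and confidence values
rules = [
    {"rule": "pen -> paper", "support": 0.25, "confidence": 0.80},
    {"rule": "ink -> paper", "support": 0.05, "confidence": 0.90},
    {"rule": "paper -> pen", "support": 0.25, "confidence": 0.50},
]

min_support = 0.20      # minimum support level of 20%
min_confidence = 0.70   # minimum confidence level of 70%

# A rule is strong only if it meets or surpasses both thresholds
strong_rules = [
    r for r in rules
    if r["support"] >= min_support and r["confidence"] >= min_confidence
]

for r in strong_rules:
    print(r["rule"], r["support"], r["confidence"])
```

Only "pen -> paper" passes: "ink -> paper" has high confidence but too little support, and "paper -> pen" has enough support but too little confidence.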

**Itemset**

An itemset is a set of items contained in $I$, and a $k$-itemset is an itemset containing
$k$ items. For example, \{beans, squash\} is a 2-itemset, and \{broccoli, green peppers,
corn\} is a 3-itemset, each from the vegetable stand set $I$. The itemset frequency is
simply the number of transactions that contain the particular itemset.

A frequent
itemset is an itemset that occurs at least a certain minimum number of times, having
itemset frequency $\geq \phi$. For example, suppose that we set $\phi = 4$. Then itemsets that
occur four or more times are said to be frequent. We denote the set of frequent
$k$-itemsets as $F_{k}$.
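
The definition of $F_{k}$ can be illustrated by counting itemsets directly. This is a minimal sketch on hypothetical vegetable-stand transactions; it counts every $k$-itemset by brute force and does not apply the candidate pruning that the Apriori algorithm uses. The helper name `frequent_k_itemsets` is an assumption made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical vegetable-stand transactions
transactions = [
    {"beans", "squash", "corn"},
    {"beans", "squash"},
    {"beans", "corn"},
    {"beans", "squash", "corn"},
    {"squash", "corn"},
    {"beans", "squash"},
]
phi = 4  # minimum itemset frequency

def frequent_k_itemsets(transactions, k, phi):
    """Return {itemset: frequency} for the k-itemsets that occur at least phi times."""
    counts = Counter()
    for basket in transactions:
        # every k-item combination inside the basket counts as one occurrence
        for combo in combinations(sorted(basket), k):
            counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= phi}

F1 = frequent_k_itemsets(transactions, 1, phi)
F2 = frequent_k_itemsets(transactions, 2, phi)

print({tuple(sorted(s)): n for s, n in F2.items()})
# prints: {('beans', 'squash'): 4}
```

Here all three single items are frequent, but only {beans, squash} reaches the threshold among the 2-itemsets.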

## **Finding Frequent itemsets**

In [None]:
from mlxtend.frequent_patterns import apriori

# Now, let us return the items and itemsets with at least 3% support:
apriori(df, min_support = 0.03, use_colnames = True)


In [None]:
frequent_itemsets = apriori(df, min_support = 0.01, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

In [None]:
# getting the itemsets with length of at least 2 and support of at least 1%

frequent_itemsets[ (frequent_itemsets['length'] >= 2) &
                   (frequent_itemsets['support'] >= 0.01) ]


## **Finding Association Rules**

In [None]:
rules_mlxtend = association_rules(frequent_itemsets, metric="support", min_threshold=0.02)
rules_mlxtend.sort_values(by=["support"],ascending=False)

In [None]:
rules_mlxtend[ (rules_mlxtend['lift'] >= 1) & (rules_mlxtend['confidence'] >= 0.3) ].sort_values(by=["support"],ascending=False).head()

# **That's all for today!**

# **Tasks**

## **Dataset**

This data set was produced for the purpose of analyzing the products purchased in the same basket. For more information you can follow this link - [URL](https://www.kaggle.com/ahmtcnbs/datasets-for-appiori).

Download the dataset from here - [Download Link](https://drive.google.com/file/d/1bIQ0R3GC43h6TG7FlVHME7Ev8DtazM_-/view?usp=sharing)


Now try to do the following:

1. Apply the Apriori algorithm to explore the frequent itemsets.
2. Use support, consequent support, confidence, lift, leverage, and conviction to measure the association rules.