# __D212 PA Task 3:__ *Association Rules and Risk Analysis*
>### Aaminah Halipoto
>### Western Governors University
>### D212: Data Mining II
>### Prof. Kesselly Kamara
>### Nov 12, 2024

### Table of Contents
>A1. [Proposal of Question](#question) </br>
A2. [Defined Goal](#goal)</br>
B1. [Explanation of Market Basket](#explanation)</br>
B2. [Transaction Example](#transaction)</br>
B3. [Market Basket Assumption](#market)</br>
C1. [Transforming the dataset](#transforming)</br>
C2. [Code execution](#codeexecution)</br>
C3. [Association rules table](#association)</br>
C4. [Top three rules](#topthree)</br>
D1. [Significance of support, lift and confidence summary](#significance)</br>
D2. [Practical significance and findings](#practical)</br>
D3. [Course of action](#course)</br> 
E, F. [Panopto video](#panoptocode)</br>
G. [Sources of Third-Party Code](#codesources)</br>
H. [Web Sources](#sources)</br>


#### __A1. Proposal of Question__ <a name="question"></a>
Using market basket analysis, what are the most popular and likely items bought together by our customers?

#### __A2. Defined Goal__ <a name="goal"></a>
I intend to use market basket analysis and the apriori algorithm's 3 parameters to support the finding of the most frequently bought itemsets within our customer base, in order to uncover unseen patterns amongst our products and create better marketing strategies.

#### __B1. Explanation of Market Basket__ <a name="explanation"></a>
Market basket analysis is a "data mining method that is used to define the strength of a relationship between pairs of items bought together" (_Data Mining II - D212 Theory_). Finding hidden patterns within groups of items bought together is beneficial to our business, and can help us create different strategies to encourage sales.

There are 2 aspects to the theory: IF and THEN. IF certain _antecedent_ items are bought, THEN _consequent_ items are bought (_Data Mining II - D212 Theory_). This theory is applied in data mining through the Apriori algorithm, which allows us to examine large datasets and learn about any frequently-coupled items (Engati). Three parameters define the algorithm: 
><ins>Support</ins>, the proportion of transactions in a dataset that contain an item or itemset, </br>
><ins>Confidence</ins>, how likely an item is to be purchased given that another item is purchased, and </br>
><ins>Lift</ins>, the ratio of a rule's confidence to the consequent's own support, which shows how much more often items are bought together than their individual popularity would predict,</br>

all of which "denote the reasoning of association rules obtained from logged transaction datasets" (_Data Mining II - D212 Theory_).
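
As a rough illustration of the three parameters, the sketch below computes them by hand for a single rule over a handful of made-up transactions. The item names and the helper functions `support`, `confidence`, and `lift` are hypothetical; they are not part of the teleco dataset or of mlxtend.

```python
# Toy transactions (hypothetical items, not taken from the teleco dataset)
transactions = [
    {"mouse", "keyboard"},
    {"mouse", "keyboard", "monitor"},
    {"mouse"},
    {"monitor"},
    {"keyboard", "monitor"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Share of antecedent transactions that also contain the consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the consequent's own support."""
    return confidence(antecedent, consequent) / support(consequent)

# Rule: IF mouse THEN keyboard
rule_support = support({"mouse", "keyboard"})
rule_confidence = confidence({"mouse"}, {"keyboard"})
rule_lift = lift({"mouse"}, {"keyboard"})
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the mouse and keyboard each appear in 3 of 5 transactions and together in 2 of 5, so the rule's lift is only slightly above 1.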

The dataset in question is a log of transactions, including up to 20 items per record. I expect to transform this data and then, using association rules, pare down the itemsets to those that have the highest support (bought together most frequently), the most confidence, and the most lift overall. 

#### __B2. Transaction Example__ <a name="transaction"></a>
One example of a transaction in the dataset is the single purchase of a product "UNEN Mfi Certified 5-pack Lightning Cable". 

#### __B3. Market Basket Assumption__ <a name="market"></a>
Market basket analysis will be carried out using the apriori algorithm, where we measure the likelihood of certain items being bought together or individually by our customers. The central assumption of the apriori algorithm that makes it usable by definition of the market basket theory is "that all items in a frequent itemset must also be frequent" (Chaudhary, 2022). Thus, by finding the most frequently bought itemsets, deriving association rules from them, and creating thresholds for significant support and confidence, we can isolate items that occur together <ins>and</ins> likely influence each other's purchase. Lift measures how much more likely items are to be bought together than would be expected if they were bought independently. A lift of 1 indicates that the items are bought independently of each other, and a lift below 1 indicates that buying one item makes buying the other less likely. Rules with lift values greater than 1 can be prioritized, as they indicate items that are positively associated and tend to be bought together.
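
The frequent-itemset assumption can be checked on a small scale. The following is a minimal sketch, assuming a few made-up baskets and two hypothetical helpers, `support` and `closure_holds`: an itemset can never be more frequent than any of its subsets, which is what lets the algorithm discard candidates early.

```python
from itertools import combinations

# Hypothetical baskets for illustration only
baskets = [
    {"cable", "gas"},
    {"cable", "gas", "mount"},
    {"gas", "mount"},
    {"gas"},
    {"cable"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def closure_holds(itemset):
    """True if no proper non-empty subset is rarer than the itemset itself."""
    items = sorted(itemset)
    return all(
        support(subset) >= support(items)
        for size in range(1, len(items))
        for subset in combinations(items, size)
    )

triple = {"cable", "gas", "mount"}
triple_support = support(triple)
holds = closure_holds(triple)
print(triple_support, holds)
```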

#### __C1. Transforming the dataset__ <a name="transforming"></a>
Transforming the data consists mainly of using mlxtend's TransactionEncoder to take a list of all the unique values within the rows, or all products involved within the dataset, and one-hot encode their occurrences per transaction into a new dataframe. This process is illustrated in the code section below.
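
Before the notebook cells, the idea behind one-hot encoding can be sketched in plain Python. This is a stand-in for illustration and not mlxtend's implementation: the rows are made up, `None` is assumed to mark an empty cell, and empty cells are skipped so that they never become an item of their own.

```python
# Hypothetical raw rows, padded with None where a transaction has fewer items
raw_rows = [
    ["HP 61 ink", "Apple Pencil", None],
    ["Apple Pencil", None, None],
    ["HP 61 ink", "Stylus Pen for iPad", "Apple Pencil"],
]

# Keep only real item names so missing cells never become an item
baskets = [[item for item in row if item is not None] for row in raw_rows]

# The sorted list of unique items plays the role of the encoder's column names
columns = sorted({item for basket in baskets for item in basket})

# One row per transaction, one True/False flag per item
encoded = [[item in basket for item in columns] for basket in baskets]

for row in encoded:
    print(dict(zip(columns, row)))
```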

In [17]:
%matplotlib inline
import pandas as pd
import math
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import tensorflow

import warnings
warnings.filterwarnings('ignore')

#Below is code for relative pathing the file, which did not work on my computer. An absolute path is implemented instead
#import os
#dirname = os.getcwd()
#filename = os.path.join(dirname, 'teleco_market_basket.csv')
#df = pd.read_csv(filename)

df = pd.read_csv('C://Users/Aaminah/Desktop/masters/D212/teleco_market_basket.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15002 entries, 0 to 15001
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Item01  7501 non-null   object
 1   Item02  5747 non-null   object
 2   Item03  4389 non-null   object
 3   Item04  3345 non-null   object
 4   Item05  2529 non-null   object
 5   Item06  1864 non-null   object
 6   Item07  1369 non-null   object
 7   Item08  981 non-null    object
 8   Item09  654 non-null    object
 9   Item10  395 non-null    object
 10  Item11  256 non-null    object
 11  Item12  154 non-null    object
 12  Item13  87 non-null     object
 13  Item14  47 non-null     object
 14  Item15  25 non-null     object
 15  Item16  8 non-null      object
 16  Item17  4 non-null      object
 17  Item18  4 non-null      object
 18  Item19  3 non-null      object
 19  Item20  1 non-null      object
dtypes: object(20)
memory usage: 2.3+ MB


In [12]:
df.head(5)

Unnamed: 0,Item01,Item02,Item03,Item04,Item05,Item06,Item07,Item08,Item09,Item10,Item11,Item12,Item13,Item14,Item15,Item16,Item17,Item18,Item19,Item20
0,,,,,,,,,,,,,,,,,,,,
1,Logitech M510 Wireless mouse,HP 63 Ink,HP 65 ink,nonda USB C to USB Adapter,10ft iPHone Charger Cable,HP 902XL ink,Creative Pebble 2.0 Speakers,Cleaning Gel Universal Dust Cleaner,Micro Center 32GB Memory card,YUNSONG 3pack 6ft Nylon Lightning Cable,TopMate C5 Laptop Cooler pad,Apple USB-C Charger cable,HyperX Cloud Stinger Headset,TONOR USB Gaming Microphone,Dust-Off Compressed Gas 2 pack,3A USB Type C Cable 3 pack 6FT,HOVAMP iPhone charger,SanDisk Ultra 128GB card,FEEL2NICE 5 pack 10ft Lighning cable,FEIYOLD Blue light Blocking Glasses
2,,,,,,,,,,,,,,,,,,,,
3,Apple Lightning to Digital AV Adapter,TP-Link AC1750 Smart WiFi Router,Apple Pencil,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,


In [13]:
#(Kamara, 2023)
#removing all rows that are completely empty, easy to find as those that have NA in the first column of purchases
df = df[df['Item01'].notna()]
df.shape

(7501, 20)

In [14]:
rows = []
for i in range(0, 7501):
    rows.append([str(df.values[i, j]) for j in range(0, 20)])

In [15]:
#Kamara, 2024
#creating encoder object, transforming an array to one-hot encoded values based on unique entries in all rows
#then storing this transformation of our data in transaction dataframe
trenc = TransactionEncoder()
array = trenc.fit(rows).transform(rows)

transaction = pd.DataFrame(array, columns = trenc.columns_)

In [16]:
#all items ever bought are now columns and one-hot encoded
transaction

Unnamed: 0,10ft iPHone Charger Cable,10ft iPHone Charger Cable 2 Pack,3 pack Nylon Braided Lightning Cable,3A USB Type C Cable 3 pack 6FT,5pack Nylon Braided USB C cables,ARRIS SURFboard SB8200 Cable Modem,Anker 2-in-1 USB Card Reader,Anker 4-port USB hub,Anker USB C to HDMI Adapter,Apple Lightning to Digital AV Adapter,...,iFixit Pro Tech Toolkit,iPhone 11 case,iPhone 12 Charger cable,iPhone 12 Pro case,iPhone 12 case,iPhone Charger Cable Anker 6ft,iPhone SE case,nan,nonda USB C to USB Adapter,seenda Wireless mouse
0,True,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
7497,False,False,False,False,False,True,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
7498,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
7499,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False


In [17]:
#'nan' is erroneously given a column 
clean_apriori = transaction.drop(['nan'], axis=1)
clean_apriori.shape

(7501, 119)

In [18]:
#exporting cleaned dataset for apriori learning
clean_apriori.to_csv('C://Users/Aaminah/Desktop/masters/D212/clean_apriori.csv', index=False)

Rules in data mining are understood as the combinations of itemsets within a list of transactions. The different itemsets and their respective supports, or frequency of occurrence, are listed within the rules dataframe generated by mlxtend's apriori function. Because min_support is set to 0.02, an itemset is only shown if it occurs in at least 2% of the transactions in the dataset.
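
How a support threshold narrows the search can be sketched with a small level-wise loop. This is a simplified illustration under stated assumptions, not mlxtend's implementation: the baskets are made up, `MIN_SUPPORT` is set to 0.4 only so that the toy data produces a visible cut-off, and the helper `support` is hypothetical.

```python
from itertools import combinations

# Hypothetical baskets; the notebook itself relies on mlxtend's apriori
baskets = [
    {"gas", "mount"},
    {"gas", "mount", "card"},
    {"gas", "ink"},
    {"mount", "card"},
    {"gas"},
]
MIN_SUPPORT = 0.4  # illustrative threshold (the notebook uses 0.02)

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Level 1: frequent single items
items = sorted({item for b in baskets for item in b})
current = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]
frequent = {}

while current:
    frequent.update({s: support(s) for s in current})
    size = len(current[0]) + 1
    # Candidates are unions of frequent sets that grow by exactly one item
    candidates = {a | b for a in current for b in current if len(a | b) == size}
    # Prune: every subset one item smaller must itself be frequent
    current = [
        c for c in candidates
        if all(frozenset(s) in frequent for s in combinations(c, size - 1))
        and support(c) >= MIN_SUPPORT
    ]

for itemset, value in frequent.items():
    print(sorted(itemset), value)
```

Here the three-item candidate is never counted against the baskets, because one of its two-item subsets already failed the threshold.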

In [20]:
#creating dataframe containing every frequent itemset, or combination of purchases, that occurs in at least 2% of the recorded transactions
rules = apriori(clean_apriori, min_support = 0.02, max_len=None, use_colnames = True)
rules.head(105)

Unnamed: 0,support,itemsets
0,0.050527,(10ft iPHone Charger Cable 2 Pack)
1,0.042528,(3A USB Type C Cable 3 pack 6FT)
2,0.029463,(Anker 2-in-1 USB Card Reader)
3,0.068391,(Anker USB C to HDMI Adapter)
4,0.087188,(Apple Lightning to Digital AV Adapter)
...,...,...
98,0.023730,"(USB 2.0 Printer cable, Screen Mom Screen Clea..."
99,0.035462,"(VIVO Dual LCD Monitor Desk mount, Screen Mom ..."
100,0.020131,"(USB 2.0 Printer cable, Stylus Pen for iPad)"
101,0.025197,"(VIVO Dual LCD Monitor Desk mount, Stylus Pen ..."


#### __C2. Code execution__ <a name="codeexecution"></a>
The entire script is located in the document 'ahalipoto_pa3.ipynb'.

#### __C3. Association rules table__ <a name="association"></a>
The association rules table lists each rule's antecedents and consequents, the individual support of each side, and the rule's own support, confidence, and lift. Only rules with a lift of at least 1 are shown -- this isolates itemsets where both items are likely to be bought together, while taking into account any distortion from a certain item being popular in general (Ng, 2016). This table is useful for isolating the rules that are most valuable to us, and can be filtered based on the 3 main parameters.

In [22]:
rul_table = association_rules(rules, metric ='lift', min_threshold = 1)
rul_table.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Dust-Off Compressed Gas 2 pack),(10ft iPHone Charger Cable 2 Pack),0.238368,0.050527,0.023064,0.096756,1.914955,0.01102,1.051182
1,(10ft iPHone Charger Cable 2 Pack),(Dust-Off Compressed Gas 2 pack),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255
2,(Anker USB C to HDMI Adapter),(Dust-Off Compressed Gas 2 pack),0.068391,0.238368,0.024397,0.356725,1.49653,0.008095,1.183991
3,(Dust-Off Compressed Gas 2 pack),(Anker USB C to HDMI Adapter),0.238368,0.068391,0.024397,0.102349,1.49653,0.008095,1.03783
4,(Anker USB C to HDMI Adapter),(VIVO Dual LCD Monitor Desk mount),0.068391,0.17411,0.020931,0.306043,1.757755,0.009023,1.190117
5,(VIVO Dual LCD Monitor Desk mount),(Anker USB C to HDMI Adapter),0.17411,0.068391,0.020931,0.120214,1.757755,0.009023,1.058905
6,(Apple Lightning to Digital AV Adapter),(Apple Pencil),0.087188,0.179709,0.028796,0.330275,1.83783,0.013128,1.224818
7,(Apple Pencil),(Apple Lightning to Digital AV Adapter),0.179709,0.087188,0.028796,0.160237,1.83783,0.013128,1.086988
8,(Apple Lightning to Digital AV Adapter),(Dust-Off Compressed Gas 2 pack),0.087188,0.238368,0.024397,0.279817,1.173883,0.003614,1.057552
9,(Dust-Off Compressed Gas 2 pack),(Apple Lightning to Digital AV Adapter),0.238368,0.087188,0.024397,0.102349,1.173883,0.003614,1.016889
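
The metric columns in the table above follow directly from its three support columns. As a sketch, the values for the rule (10ft iPHone Charger Cable 2 Pack) -> (Dust-Off Compressed Gas 2 pack) can be recomputed by hand using the standard definitions of confidence, lift, leverage, and conviction. The variable names are illustrative, and because the inputs are the rounded values printed in the table, the results should agree with it only to about three decimal places.

```python
# Supports copied from the table row for
# (10ft iPHone Charger Cable 2 Pack) -> (Dust-Off Compressed Gas 2 pack)
antecedent_support = 0.050527
consequent_support = 0.238368
rule_support = 0.023064

# Share of antecedent transactions that also contain the consequent
confidence = rule_support / antecedent_support
# Confidence relative to how common the consequent is on its own
lift = confidence / consequent_support
# Observed joint support minus the support expected under independence
leverage = rule_support - antecedent_support * consequent_support
# Ratio of how often the consequent is absent overall vs. absent given the antecedent
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```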


#### __C4. Top three rules__ <a name="topthree"></a>
The top three rules for each parameter -- support, confidence, and lift -- are listed in descending order of magnitude within the rule tables below.

In [26]:
top_three_rules = rul_table.sort_values('support', ascending=False).head(3)
top_three_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
62,(VIVO Dual LCD Monitor Desk mount),(Dust-Off Compressed Gas 2 pack),0.17411,0.238368,0.059725,0.343032,1.439085,0.018223,1.159314
63,(Dust-Off Compressed Gas 2 pack),(VIVO Dual LCD Monitor Desk mount),0.238368,0.17411,0.059725,0.250559,1.439085,0.018223,1.102008
41,(HP 61 ink),(Dust-Off Compressed Gas 2 pack),0.163845,0.238368,0.05266,0.3214,1.348332,0.013604,1.122357


In [24]:
top_three_rules = rul_table.sort_values('confidence', ascending=False).head(3)
top_three_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
1,(10ft iPHone Charger Cable 2 Pack),(Dust-Off Compressed Gas 2 pack),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255
36,(FEIYOLD Blue light Blocking Glasses),(Dust-Off Compressed Gas 2 pack),0.065858,0.238368,0.027596,0.419028,1.757904,0.011898,1.310962
53,(SanDisk Ultra 64GB card),(Dust-Off Compressed Gas 2 pack),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401


In [25]:
top_three_rules = rul_table.sort_values('lift', ascending=False).head(3)
top_three_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
84,(VIVO Dual LCD Monitor Desk mount),(SanDisk Ultra 64GB card),0.17411,0.098254,0.039195,0.225115,2.291162,0.022088,1.163716
85,(SanDisk Ultra 64GB card),(VIVO Dual LCD Monitor Desk mount),0.098254,0.17411,0.039195,0.398915,2.291162,0.022088,1.373997
64,(FEIYOLD Blue light Blocking Glasses),(VIVO Dual LCD Monitor Desk mount),0.065858,0.17411,0.02293,0.348178,1.999758,0.011464,1.267048


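The top 3 rules overall are taken to be the ones with the greatest lift, as listed in the lift-sorted table above. The cell below filters on a lift greater than 0.08, a condition that every rule already meets because the rules table only contains rules with a lift of at least 1, so it displays the first three rows of the rules table rather than a ranking by lift.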

In [28]:
sort_rules = rul_table[(rul_table['lift'] > 0.08)]
sort_rules.head(3)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Dust-Off Compressed Gas 2 pack),(10ft iPHone Charger Cable 2 Pack),0.238368,0.050527,0.023064,0.096756,1.914955,0.01102,1.051182
1,(10ft iPHone Charger Cable 2 Pack),(Dust-Off Compressed Gas 2 pack),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255
2,(Anker USB C to HDMI Adapter),(Dust-Off Compressed Gas 2 pack),0.068391,0.238368,0.024397,0.356725,1.49653,0.008095,1.183991


#### __D1. Significance of support, lift and confidence summary__ <a name="significance"></a>
The top 3 rules ranked by support are: 
>1. (VIVO Dual LCD Monitor Desk mount) leading into (Dust-Off Compressed Gas 2 pack) with a support of 0.059725,</br>
>2. (Dust-Off Compressed Gas 2 pack) leading into (VIVO Dual LCD Monitor Desk mount) with a support of 0.059725, and</br>
>3. (HP 61 ink) leading into (Dust-Off Compressed Gas 2 pack) with a support of 0.052660.</br>

Support is the measure of how frequently an item or itemset is bought, and is reported in the table per-item and per-rule. All 3 rules contain the Dust-Off Compressed Gas 2 pack, which is consistent with its individual support value of 0.238368 -- this means the item occurs in 23.84% of all recorded transactions. The top 3 rules have overall support values of at least 0.05 -- meaning the item pairs in these rules occur together in 5-6% of all transactions within this dataset, the highest of any rule in the table. In particular, the VIVO Dual LCD Monitor Desk mount and Dust-Off Compressed Gas 2 pack have 2 rules containing them both as antecedent and consequent of each other. Support is the same in both directions because it only counts how often the two items appear together; the direction matters for confidence, which is 0.343032 when the desk mount is the antecedent and 0.250559 when the gas pack is.

The top 3 rules ranked by confidence are:
>1. (10ft iPHone Charger Cable 2 Pack) leading into (Dust-Off Compressed Gas 2 pack) with a confidence of 0.456464,</br>
>2. (FEIYOLD Blue light Blocking Glasses) leading into (Dust-Off Compressed Gas 2 pack) with a confidence of 0.419028, and</br>
>3. (SanDisk Ultra 64GB card) leading into (Dust-Off Compressed Gas 2 pack) with a confidence of 0.416554.</br>

Confidence measures the proportion of transactions containing the antecedent that also contain the consequent. All 3 rules contain the Dust-Off Compressed Gas 2 pack, which is consistent with its individual support value of 0.238368 -- this means the item occurs in 23.84% of all recorded transactions. What is interesting when ranking these rules by confidence is that the gas pack is a <ins>consequent</ins> in all 3. The rules' confidence values range from about 42% to 46%, meaning almost half of all purchases containing the antecedent items also include the gas packs -- this is in spite of customers buying the antecedent items much less frequently (all have less than 10% support individually). The support values illustrate how confidence is derived: the rule pairing the 10ft iPHone Charger Cable 2 Pack with the gas pack has a support of 0.023064 or around 2.3%, which is almost half of the charging cable's individual support of about 5%.

The top 3 rules ranked by lift are:
>1. (VIVO Dual LCD Monitor Desk mount) leading into (SanDisk Ultra 64GB card) with a lift of 2.291162,</br>
>2. (SanDisk Ultra 64GB card) leading into (VIVO Dual LCD Monitor Desk mount) with a lift of 2.291162, and</br>
>3. (FEIYOLD Blue light Blocking Glasses) leading into (VIVO Dual LCD Monitor Desk mount) with a lift of 1.999758.</br>

Lift accounts for the popularity of individual items as well as the likelihood of items being bought together. The VIVO Dual LCD Monitor Desk mount appears in all 3 rules; it occurs in 17.41% of all recorded transactions, so its high lift values show that its pairings with less-frequently bought items are not explained by its popularity alone. 2 rules contain the same 2 items, the VIVO Dual LCD Monitor Desk mount along with the SanDisk Ultra 64GB card. These rules have the same lift, where customers are around 2.3 times more likely to buy these items together than if the two purchases were independent. The FEIYOLD Blue light Blocking Glasses are bought in only about 6.6% of transactions, but customers who buy the glasses are almost 2 times more likely to buy the monitor desk mount as well.
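
One way to read the top lift value is to compare the observed number of joint purchases with the number that would be expected if the two items were bought independently. The sketch below does this for the desk mount and the SanDisk card, using the rounded supports from the lift table and the 7501 transactions left after cleaning; the variable names are illustrative, and the counts are approximate because the inputs are rounded.

```python
n_transactions = 7501      # rows left after removing empty records
mount_support = 0.17411    # VIVO Dual LCD Monitor Desk mount
card_support = 0.098254    # SanDisk Ultra 64GB card
pair_support = 0.039195    # both items in the same transaction

# Joint support that independence would predict
expected_support = mount_support * card_support

observed_count = pair_support * n_transactions
expected_count = expected_support * n_transactions
lift = pair_support / expected_support

print(round(observed_count), round(expected_count), round(lift, 2))
```

Under these assumptions the pair is observed in roughly 294 transactions against roughly 128 expected under independence, which is where the lift of about 2.3 comes from.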

#### __D2. Practical significance and findings__ <a name="practical"></a>
The highest support being 0.0597 means that the VIVO desk mount is bought together with the compressed gas pack in almost 6% of all transactions, not accounting for the popularity of each individual item. Overall, we can understand the VIVO desk mount, the gas pack, as well as the ink cartridges as popular items amongst all transactions.

The top confidence measure being 0.456464 indicates how often the consequent item, the gas pack, is also sold in a transaction when the antecedent item, the phone charger cable pack, is sold (Ng, 2016). About 45.6% of transactions containing the charger cable pack also contain the gas pack; in the reverse direction, only about 9.7% of transactions containing the gas pack also contain the charger cable pack. This measure may overstate the strength of the association, since it does not account for how popular the consequent is on its own, and the gas pack already appears in 23.84% of all transactions. In general, the top 3 rules of confidence show us that the gas packs are very likely to be bought alongside the charging cable packs, blue light blocking glasses, and the SanDisk Ultra card.

Lift is the most thorough measure when it comes to understanding the frequency and likelihood of popular itemsets -- the 2.291162 lift score for the VIVO desk mount and the SanDisk Ultra card indicates that there is a strong relationship between the sale of the antecedent and consequent items (Ng, 2016). Customers who buy the card also buy the desk mount about 40% of the time (confidence 0.398915), and the pair appears in about 3.9% of all transactions. The top 3 rules sorted by lift show a strong association between the VIVO desk mount and the SanDisk Ultra card, as well as a tendency for the desk mount to be bought with blue light blocking glasses. These are purchases that we can expect to recur, with the items being related and likely complementary to each other.


#### __D3. Course of action__ <a name="course"></a>
With the association rule summary above in mind, I can make several suggestions to our marketing department to incentivize common purchases that customers make -- the VIVO desk mount, SanDisk Ultra card, and blue light glasses are a promising trio of products to be marketed together. Having a sale on any of these items may drive the purchase of the other items, and we could expect many customers to partake in the deal. Additionally, technology maintenance items like the compressed gas packs are typically paired with other computer-related products -- physically placing gas packs near where these products are sold in stores, or suggesting gas packs to customers browsing phone or computer products online, would likely increase sales. Learning about these pairs allows me to proffer better business solutions and helps us market our products more effectively. 

#### __E, F. Panopto video__ <a name="panoptocode"></a>
https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=55197073-b33d-4bd0-a14c-b227003f7b48

#### __G. Sources of Third-Party Code__ <a name="codesources"></a>
Kamara, K. (n.d.). Data Mining II - D212 Task 3. Panopto. https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=db85c4f1-0da5-4bde-a1a4-b07c0019d46d

#### __H. Web Sources__ <a name="sources"></a>
Apriori algorithm. Engati. (n.d.). https://www.engati.com/glossary/apriori-algorithm 

Chaudhary, S. (2022, February 11). Understanding market basket analysis in data mining. Turing. https://www.turing.com/kb/market-basket-analysis 

Kamara, K. (n.d.). Data Mining II - D212 Theory. Panopto. https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=9541a29b-2f14-4c5d-9d86-af030005bcf6

Ng, A. (2016). Association rules and the Apriori Algorithm: A tutorial. KDnuggets. https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html 