# D212 Data Mining II - Market Basket Analysis
<br>David Harvell
<br>Master of Science, Data Analytics
<br>October 2021
<br>
<br>

#### <font color="blue">A-1.  Propose one question relevant to a real-world organizational situation that you will answer using market basket analysis.</font>

Can we identify products that are commonly purchased together for marketing and product layout purposes?

#### <font color="blue">A-2.  Define one goal of the data analysis.</font>

Discover at least three pairs of products that are commonly purchased together.  This information can then be used for marketing and store layouts, ultimately helping to increase sales.

#### <font color="blue">B-1.  Explain how market basket analyzes the selected dataset.</font>

We will first encode the data for analysis by pivoting the products into columns with one-hot encoding, so that each row is a transaction and each column records whether a given product was purchased.
<br><br>
After that, we will use the Apriori algorithm to prune the candidate itemsets and examine the top associations.  Apriori first measures the frequency of single items and discards those below a minimum support.  It then builds larger itemsets only from the items that survived, because an itemset cannot be frequent unless all of its subsets are frequent.  This gives us a workable set of associations to investigate. (GeeksforGeeks, 2020)
<br><br>
Finally, we will compute metrics like confidence, lift, and Zhang's metric.  This will allow us to rank the associations and report the strongest ones.
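
The level-wise pruning described above can be sketched on a small example. This is a minimal illustration, not the mlxtend implementation: the toy transactions, the `min_support` value of 0.4, and all variable names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not taken from the dataset)
toy_transactions = [
    {"mouse", "ink", "cable"},
    {"mouse", "cable"},
    {"ink", "cable"},
    {"mouse", "cable", "glasses"},
    {"ink"},
]
min_support = 0.4  # illustrative threshold
n = len(toy_transactions)

# Pass 1: count single items and keep only those meeting the threshold
item_counts = Counter(item for t in toy_transactions for item in t)
frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}

# Pass 2: count only pairs whose members both survived pass 1
pair_counts = Counter(
    pair
    for t in toy_transactions
    for pair in combinations(sorted(t & frequent_items), 2)
)
frequent_pairs = {
    pair: c / n for pair, c in pair_counts.items() if c / n >= min_support
}

print(sorted(frequent_items))
print(frequent_pairs)
```

Because "glasses" fails the threshold in the first pass, no pair containing it is ever counted in the second pass. That is the saving Apriori provides on a catalog with many products.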

#### <font color="blue">B-2.  Provide one example of transactions in the dataset.</font>

<div class="alert alert-block alert-warning">
We will begin by reviewing the dataset for anomalies.</div>

In [1]:
import pandas as pd
df = pd.read_csv('teleco_market_basket.csv')
df.head(10)

Unnamed: 0,Item01,Item02,Item03,Item04,Item05,Item06,Item07,Item08,Item09,Item10,Item11,Item12,Item13,Item14,Item15,Item16,Item17,Item18,Item19,Item20
0,,,,,,,,,,,,,,,,,,,,
1,Logitech M510 Wireless mouse,HP 63 Ink,HP 65 ink,nonda USB C to USB Adapter,10ft iPHone Charger Cable,HP 902XL ink,Creative Pebble 2.0 Speakers,Cleaning Gel Universal Dust Cleaner,Micro Center 32GB Memory card,YUNSONG 3pack 6ft Nylon Lightning Cable,TopMate C5 Laptop Cooler pad,Apple USB-C Charger cable,HyperX Cloud Stinger Headset,TONOR USB Gaming Microphone,Dust-Off Compressed Gas 2 pack,3A USB Type C Cable 3 pack 6FT,HOVAMP iPhone charger,SanDisk Ultra 128GB card,FEEL2NICE 5 pack 10ft Lighning cable,FEIYOLD Blue light Blocking Glasses
2,,,,,,,,,,,,,,,,,,,,
3,Apple Lightning to Digital AV Adapter,TP-Link AC1750 Smart WiFi Router,Apple Pencil,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,
5,UNEN Mfi Certified 5-pack Lightning Cable,,,,,,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,,,,,,,
7,Cat8 Ethernet Cable,HP 65 ink,,,,,,,,,,,,,,,,,,
8,,,,,,,,,,,,,,,,,,,,
9,Dust-Off Compressed Gas 2 pack,Screen Mom Screen Cleaner kit,Moread HDMI to VGA Adapter,HP 62XL Tri-Color ink,Apple USB-C Charger cable,,,,,,,,,,,,,,,


<div class="alert alert-block alert-warning">
Some blank entries are appearing in the dataset.  We will check the underlying data.</div>

![title](dataset.png)

In [2]:
df.Item01.isna().sum()

7501

In [3]:
len(df)

15002

<div class="alert alert-block alert-warning">
It appears that there is a blank row between each pair of valid entries: 7,501 of the 15,002 rows have no Item01.  We will clean the data by removing every record that has no Item01 in the transaction.</div>

In [4]:
df = df[df.Item01.notna()]
df.head()

Unnamed: 0,Item01,Item02,Item03,Item04,Item05,Item06,Item07,Item08,Item09,Item10,Item11,Item12,Item13,Item14,Item15,Item16,Item17,Item18,Item19,Item20
1,Logitech M510 Wireless mouse,HP 63 Ink,HP 65 ink,nonda USB C to USB Adapter,10ft iPHone Charger Cable,HP 902XL ink,Creative Pebble 2.0 Speakers,Cleaning Gel Universal Dust Cleaner,Micro Center 32GB Memory card,YUNSONG 3pack 6ft Nylon Lightning Cable,TopMate C5 Laptop Cooler pad,Apple USB-C Charger cable,HyperX Cloud Stinger Headset,TONOR USB Gaming Microphone,Dust-Off Compressed Gas 2 pack,3A USB Type C Cable 3 pack 6FT,HOVAMP iPhone charger,SanDisk Ultra 128GB card,FEEL2NICE 5 pack 10ft Lighning cable,FEIYOLD Blue light Blocking Glasses
3,Apple Lightning to Digital AV Adapter,TP-Link AC1750 Smart WiFi Router,Apple Pencil,,,,,,,,,,,,,,,,,
5,UNEN Mfi Certified 5-pack Lightning Cable,,,,,,,,,,,,,,,,,,,
7,Cat8 Ethernet Cable,HP 65 ink,,,,,,,,,,,,,,,,,,
9,Dust-Off Compressed Gas 2 pack,Screen Mom Screen Cleaner kit,Moread HDMI to VGA Adapter,HP 62XL Tri-Color ink,Apple USB-C Charger cable,,,,,,,,,,,,,,,


<div class="alert alert-block alert-warning">
This shows the first 5 valid transactions in the dataset.</div>

#### <font color="blue">B-3.  Summarize one assumption of market basket analysis.</font>

A key assumption of market basket analysis is that relationships/associations exist between products.  We use various metrics to measure these relationships and identify sets of items that are likely to be purchased together repeatedly. (Kamakura, 2012)

#### <font color="blue">C-1.  Transform the dataset to make it suitable for market basket analysis.</font>

We have already removed the empty records.  Next we will use one-hot encoding to move products to the columns.  This will prepare us for Apriori and association rules in the following steps.

In [5]:
# Transform the current dataframe into a list of lists that contain the items for each purchase
transactions = []

for index, row in df.iterrows():
    products = []
    for col in row:
        if not pd.isna(col):
            products.append(col)
    transactions.append(products)

# Encode the new lists using one hot encoding
from mlxtend.preprocessing import TransactionEncoder

encoder = TransactionEncoder().fit(transactions)
onehot = encoder.transform(transactions)
onehot = pd.DataFrame(onehot, columns = encoder.columns_)
onehot.head()

Unnamed: 0,10ft iPHone Charger Cable,10ft iPHone Charger Cable 2 Pack,3 pack Nylon Braided Lightning Cable,3A USB Type C Cable 3 pack 6FT,5pack Nylon Braided USB C cables,ARRIS SURFboard SB8200 Cable Modem,Anker 2-in-1 USB Card Reader,Anker 4-port USB hub,Anker USB C to HDMI Adapter,Apple Lightning to Digital AV Adapter,...,hP 65 Tri-color ink,iFixit Pro Tech Toolkit,iPhone 11 case,iPhone 12 Charger cable,iPhone 12 Pro case,iPhone 12 case,iPhone Charger Cable Anker 6ft,iPhone SE case,nonda USB C to USB Adapter,seenda Wireless mouse
0,True,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [6]:
# Save the encoded data for upload with findings
onehot.to_csv('market_basket_encoded.csv', index = False)
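
To make the shape of the encoded data concrete, here is a minimal pure-Python sketch of the same idea on three small baskets. It only illustrates the concept and is not how `TransactionEncoder` is implemented; the baskets and the names `toy_baskets`, `columns`, and `onehot_rows` are assumptions made for the example.

```python
# Hypothetical toy baskets using a few product names from the dataset
toy_baskets = [
    ["HP 65 ink", "Cat8 Ethernet Cable"],
    ["Apple Pencil"],
    ["HP 65 ink", "Apple Pencil"],
]

# One column per distinct product, in sorted order
columns = sorted({item for basket in toy_baskets for item in basket})

# One row per transaction; True where the basket contains the product
onehot_rows = [[item in basket for item in columns] for basket in toy_baskets]

for row in onehot_rows:
    print(dict(zip(columns, row)))
```

Each row has one Boolean per product, which is the layout that `apriori` expects as input.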

#### <font color="blue">C-2.  Execute the code used to generate association rules with the Apriori algorithm.</font>

<div class="alert alert-block alert-warning">
Run Apriori with a minimum support of 0.005, limiting itemsets to at most two items (one antecedent and one consequent).</div>

In [7]:
len(onehot)

7501

In [8]:
from mlxtend.frequent_patterns import apriori, association_rules
frequent_itemsets = apriori(onehot, min_support = 0.005, max_len = 2, use_colnames = True)

len(frequent_itemsets)

552

In [9]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets.head()

Unnamed: 0,support,itemsets,length
0,0.009065,(10ft iPHone Charger Cable),1
1,0.050527,(10ft iPHone Charger Cable 2 Pack),1
2,0.005199,(3 pack Nylon Braided Lightning Cable),1
3,0.042528,(3A USB Type C Cable 3 pack 6FT),1
4,0.019064,(5pack Nylon Braided USB C cables),1


<div class="alert alert-block alert-warning">
Keep only rules with a confidence of at least 0.40 and sort them by confidence.</div>

In [10]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.40)
rules = rules.sort_values(by = 'confidence', ascending = False)
rules.describe()

Unnamed: 0,antecedent support,consequent support,support,confidence,lift,leverage,conviction
count,11.0,11.0,11.0,11.0,11.0,11.0,11.0
mean,0.032796,0.232527,0.014022,0.43691,1.895332,0.006287,1.366972
std,0.028483,0.019375,0.011836,0.031215,0.252447,0.005023,0.086197
min,0.010399,0.17411,0.005066,0.401254,1.683336,0.002364,1.272045
25%,0.014131,0.238368,0.005999,0.413951,1.736601,0.003062,1.299629
50%,0.018531,0.238368,0.007732,0.419028,1.757904,0.003614,1.310962
75%,0.046527,0.238368,0.020064,0.463275,1.988233,0.008973,1.447858
max,0.098254,0.238368,0.040928,0.487179,2.546642,0.017507,1.485182


#### <font color="blue">C-3.  Provide values for the support, lift, and confidence of the association rules table.</font>

<div class="alert alert-block alert-warning">
All relevant metrics for the remaining rules.</div>

In [11]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8,(SanDisk Extreme 256GB card),(Dust-Off Compressed Gas 2 pack),0.010399,0.238368,0.005066,0.487179,2.043811,0.002587,1.485182
6,(DisplayPort ot HDMI adapter),(Dust-Off Compressed Gas 2 pack),0.011998,0.238368,0.005733,0.477778,2.004369,0.002873,1.458444
2,(Apple Lightning to USB cable),(Dust-Off Compressed Gas 2 pack),0.015598,0.238368,0.007332,0.470085,1.972098,0.003614,1.437273
0,(10ft iPHone Charger Cable 2 Pack),(Dust-Off Compressed Gas 2 pack),0.050527,0.238368,0.023064,0.456464,1.914955,0.01102,1.401255
4,(AutoFocus 1080p Webcam),(VIVO Dual LCD Monitor Desk mount),0.014131,0.17411,0.006266,0.443396,2.546642,0.003805,1.483802
7,(FEIYOLD Blue light Blocking Glasses),(Dust-Off Compressed Gas 2 pack),0.065858,0.238368,0.027596,0.419028,1.757904,0.011898,1.310962
5,(Brother Genuine High Yield Toner Cartridge),(Dust-Off Compressed Gas 2 pack),0.018531,0.238368,0.007732,0.417266,1.750511,0.003315,1.306998
10,(SanDisk Ultra 64GB card),(Dust-Off Compressed Gas 2 pack),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401
9,(SanDisk Extreme Pro 128GB card),(Dust-Off Compressed Gas 2 pack),0.018797,0.238368,0.007732,0.411348,1.725681,0.003252,1.293856
3,(AutoFocus 1080p Webcam),(Dust-Off Compressed Gas 2 pack),0.014131,0.238368,0.005733,0.40566,1.701822,0.002364,1.281476
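
Section B-1 also mentions Zhang's measure, which does not appear among the columns of the table above. A minimal sketch of how it could be derived from the three support columns, assuming the commonly used definition of Zhang's metric (values range from -1 to 1, with 0 indicating independence). The function name `zhangs_metric` and its parameter names are illustrative, and the example inputs are the rounded supports from the first row of the table.

```python
def zhangs_metric(support_a, support_b, support_ab):
    """Zhang's metric for the rule A -> B, computed from supports."""
    numerator = support_ab - support_a * support_b
    denominator = max(
        support_ab * (1 - support_a),
        support_a * (support_b - support_ab),
    )
    return numerator / denominator


# Rounded values from the row: SanDisk Extreme 256GB card -> Dust-Off Compressed Gas 2 pack
zhang = zhangs_metric(
    support_a=0.010399,   # antecedent support
    support_b=0.238368,   # consequent support
    support_ab=0.005066,  # support of the pair
)
print(round(zhang, 3))
```

A positive value points to association and a negative value to dissociation, so the measure could be used as an additional check on the rules ranked by confidence.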


#### <font color="blue">C-4.  Identify the top three rules generated by the Apriori algorithm. </font>

<div class="alert alert-block alert-warning">
The top 3 rules by confidence.</div>

In [12]:
rules.head(3)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8,(SanDisk Extreme 256GB card),(Dust-Off Compressed Gas 2 pack),0.010399,0.238368,0.005066,0.487179,2.043811,0.002587,1.485182
6,(DisplayPort ot HDMI adapter),(Dust-Off Compressed Gas 2 pack),0.011998,0.238368,0.005733,0.477778,2.004369,0.002873,1.458444
2,(Apple Lightning to USB cable),(Dust-Off Compressed Gas 2 pack),0.015598,0.238368,0.007332,0.470085,1.972098,0.003614,1.437273


#### <font color="blue">D-1.  Summarize the significance of support, lift, and confidence from the results of the analysis.</font>

<strong>Support</strong><br>
Support is the proportion of all transactions that contain the association.  This is the simplest metric.  A value of 1 would mean the item (or combination) appears in every transaction.  Support alone isn't enough to indicate a strong relationship between items, because highly popular items will seem to be related to whatever else is being purchased. (Sivek, 2020)
<br><br>
<strong>Confidence</strong><br>
Confidence is also a proportion, but it limits the denominator to transactions that contain the antecedent.  This makes the metric more relevant when looking at relationships.  It estimates the probability that the consequent is purchased given that the antecedent is purchased.  The relationship is weak when confidence is close to 0 and strongest at a value of 1. (Sivek, 2020)
<br><br>
<strong>Lift</strong><br>
Lift is another metric that helps determine the strength of a relationship.  It is the support of both items together divided by the product of the supports of the individual items.  The denominator is the support we would expect if the two items were assigned to transactions independently.  A lift greater than 1 indicates that the items occur together more often than chance would predict; a lift equal to 1 indicates no association; and a lift less than 1 indicates the items may be substitutes for one another. (Sivek, 2020)
<br><br><br>
<strong>Results of Analysis</strong><br>
Our top three associations <strong>do not have strong support</strong>: each pair appears in under 1% of transactions.  This is not surprising given the large and varied assortment of items.<br><br>
The associations have <strong>confidence between 0.47 and 0.49</strong>.  This indicates that the consequent is purchased in just under 50% of the transactions that contain the antecedent.<br><br>
The <strong>lift values are all around 2</strong>, meaning the items in each association are purchased together about twice as often as would be expected if they were independent, so they could be bundled together. While the bundles should work according to this analysis, all three associations have the same consequent. Because of this, I would suggest a separate course of action (see below in D-2).
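
As a sanity check on the definitions above, the metrics for the top rule can be recomputed by hand from its three support values. This is a minimal sketch: the inputs are the rounded figures shown in the rules table, so the results should agree with the table only up to rounding, and the variable names are illustrative.

```python
# Rounded values from the top rule in the table:
# SanDisk Extreme 256GB card -> Dust-Off Compressed Gas 2 pack
antecedent_support = 0.010399  # share of transactions with the card
consequent_support = 0.238368  # share of transactions with compressed gas
joint_support = 0.005066       # share of transactions with both

# Confidence: restrict the denominator to transactions with the antecedent
confidence = joint_support / antecedent_support

# Lift: joint support relative to what independence would predict
lift = joint_support / (antecedent_support * consequent_support)

# Leverage and conviction, the other two columns in the rules table
leverage = joint_support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.3f} lift={lift:.3f} "
      f"leverage={leverage:.4f} conviction={conviction:.3f}")
```

Note that lift equals confidence divided by the consequent support. Because compressed gas is already in roughly 24% of all transactions, a confidence near 0.49 corresponds to a lift of only about 2.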

#### <font color="blue">D-2.  Discuss the practical significance of the findings from the analysis.</font>

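The results suggest that compressed gas is by far the most common consequent.  It is also a very popular item on its own, appearing in about 24% of all transactions (consequent support of 0.238).  Because of the wide array of antecedents, I believe that compressed gas is picked up as an impulse item rather than being directly tied to the antecedents.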

#### <font color="blue">D-3.  Recommend a course of action.</font>

I would recommend placing displays of compressed gas near the registers, so we can capitalize on impulse purchases when possible.

#### <font color="blue">Code Resources</font>

Raschka, S. (n.d.). Apriori - mlxtend. Mlxtend. Retrieved November 10, 2021, from http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/

#### <font color="blue">Text Resources</font>

GeeksforGeeks. (2020, April 4). Apriori Algorithm. Retrieved November 10, 2021, from https://www.geeksforgeeks.org/apriori-algorithm/
<br><br>
Kamakura, W. A. (2012). Sequential market basket analysis. Marketing Letters, 23(3), 505–516. https://doi.org/10.1007/s11002-012-9181-6
<br><br>
Sivek, S. C. (2020, November 17). Market Basket Analysis 101: Key Concepts - Towards Data Science. Medium. Retrieved November 10, 2021, from https://towardsdatascience.com/market-basket-analysis-101-key-concepts-1ddc6876cd00