Context

A real online retail transaction data set spanning two calendar years (2010 and 2011).

Content

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Column Descriptors

InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.

StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.

Description: Product (item) name. Nominal.

Quantity: The quantities of each product (item) per transaction. Numeric.

InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.

UnitPrice: Unit price. Numeric, Product price per unit in sterling.

CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.

Country: Country name. Nominal, the name of the country where each customer resides.

Acknowledgements

References about the data set can be found here: http://archive.ics.uci.edu/ml/datasets/Online+Retail

In [2]:
import numpy as np
import pandas as pd
#from mlxtend.frequent_patterns import apriori, association_rules
from apyori import apriori

In [8]:
retail_data = pd.read_csv(r'retail_dataset.csv')

In [9]:
retail_data.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


In [10]:
retail_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315 entries, 0 to 314
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       315 non-null    object
 1   1       285 non-null    object
 2   2       245 non-null    object
 3   3       187 non-null    object
 4   4       133 non-null    object
 5   5       71 non-null     object
 6   6       41 non-null     object
dtypes: object(7)
memory usage: 17.4+ KB


In [11]:
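# Convert each row to a list of strings, since apyori expects a list of lists.
# Note: str() turns missing cells (NaN) into the literal item 'nan'.
records = []
for i in range(len(retail_data)):
    records.append([str(retail_data.values[i, j]) for j in range(retail_data.shape[1])])

print(records)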

[['Bread', 'Wine', 'Eggs', 'Meat', 'Cheese', 'Pencil', 'Diaper'], ['Bread', 'Cheese', 'Meat', 'Diaper', 'Wine', 'Milk', 'Pencil'], ['Cheese', 'Meat', 'Eggs', 'Milk', 'Wine', 'nan', 'nan'], ['Cheese', 'Meat', 'Eggs', 'Milk', 'Wine', 'nan', 'nan'], ['Meat', 'Pencil', 'Wine', 'nan', 'nan', 'nan', 'nan'], ['Eggs', 'Bread', 'Wine', 'Pencil', 'Milk', 'Diaper', 'Bagel'], ['Wine', 'Pencil', 'Eggs', 'Cheese', 'nan', 'nan', 'nan'], ['Bagel', 'Bread', 'Milk', 'Pencil', 'Diaper', 'nan', 'nan'], ['Bread', 'Diaper', 'Cheese', 'Milk', 'Wine', 'Eggs', 'nan'], ['Bagel', 'Wine', 'Diaper', 'Meat', 'Pencil', 'Eggs', 'Cheese'], ['Cheese', 'Meat', 'Eggs', 'Milk', 'Wine', 'nan', 'nan'], ['Bagel', 'Eggs', 'Meat', 'Bread', 'Diaper', 'Wine', 'Milk'], ['Bread', 'Diaper', 'Pencil', 'Bagel', 'Meat', 'nan', 'nan'], ['Bagel', 'Cheese', 'Milk', 'Meat', 'nan', 'nan', 'nan'], ['Bread', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan'], ['Pencil', 'Diaper', 'Bagel', 'nan', 'nan', 'nan', 'nan'], ['Meat', 'Bagel', 'Bread', 'nan',

The apyori library used here expects the dataset as a list of lists: the outer list is the whole dataset and each inner list is one transaction. Because the conversion above applies str() to every cell, missing values end up in the transactions as the string 'nan', which Apriori then treats as an ordinary item.
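
One way to keep 'nan' out of the transactions is to drop missing cells before building the inner lists. The following is a minimal sketch, not the notebook's own code: `clean_rows` and `raw_rows` are hypothetical names, and `raw_rows` is a three-row stand-in for what `retail_data.values.tolist()` would return.

```python
import math

def clean_rows(rows):
    """Return one list of item names per row, skipping missing (NaN) cells."""
    cleaned = []
    for row in rows:
        items = [
            str(item) for item in row
            # pandas stores a missing cell as a float NaN
            if not (isinstance(item, float) and math.isnan(item))
        ]
        cleaned.append(items)
    return cleaned

# Hypothetical stand-in for retail_data.values.tolist()
raw_rows = [
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", float("nan")],
    ["Meat", float("nan"), float("nan")],
]

records_clean = clean_rows(raw_rows)
print(records_clean)
```

With the real data this would presumably be called as `clean_rows(retail_data.values.tolist())`, and the rules mined from the result would differ from the ones printed in this notebook.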

In [12]:
association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)
association_results = list(association_rules)

The apriori function takes the list of lists as its first argument. The min_support parameter keeps only itemsets whose support is at least the given value. The min_confidence parameter keeps only rules whose confidence is at least the given threshold, and min_lift does the same for lift. Finally, min_length is meant to set the minimum number of items in a rule. Note, however, that the length option apyori documents is max_length, and min_length appears to be ignored by the library, so rules of any length may be returned.
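
To make the three measures behind these thresholds concrete, here is a small hand-rolled sketch using their standard definitions. The helper names and the four-basket `toy` dataset are made up for illustration and are not part of apyori.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, base, add):
    """support(base and add together) divided by support(base)."""
    both = set(base) | set(add)
    return support(transactions, both) / support(transactions, base)

def lift(transactions, base, add):
    """confidence(base -> add) divided by support(add)."""
    return confidence(transactions, base, add) / support(transactions, add)

# Hypothetical toy baskets
toy = [
    ["Bread", "Milk"],
    ["Bread", "Milk", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk"],
]

print(support(toy, ["Bread", "Milk"]))        # 2 of the 4 baskets
print(confidence(toy, ["Bread"], ["Milk"]))   # 2 of the 3 Bread baskets
print(lift(toy, ["Bread"], ["Milk"]))         # below 1: Milk is not boosted by Bread here
```

A lift above 1 indicates that the added items appear more often together with the base items than they do overall, which is what min_lift=3 selects for.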

In [18]:
print(association_results[0])

RelationRecord(items=frozenset({'Bread', 'Bagel', 'Meat', 'Cheese', 'Diaper', 'Eggs', 'nan'}), support=0.006349206349206349, ordered_statistics=[OrderedStatistic(items_base=frozenset({'Bread', 'Bagel', 'Diaper', 'Eggs', 'nan'}), items_add=frozenset({'Cheese', 'Meat'}), confidence=1.0, lift=3.0882352941176467)])


The first result is a seven-item set containing Bread, Bagel, Meat, Cheese, Diaper, Eggs and the placeholder 'nan'. Its rule reads {Bread, Bagel, Diaper, Eggs, nan} -> {Cheese, Meat}. With a support of about 0.6%, this is a rare combination rather than a common one.

The support value for the first rule is 0.0063. This number is the count of transactions that contain every item in the set, divided by the total number of transactions (2 of 315 here). The confidence of 1.0 means that every transaction containing the base items (Bread, Bagel, Diaper, Eggs and 'nan') also contains Cheese and Meat. Finally, the lift of 3.09 tells us that Cheese and Meat are bought together about 3.09 times as often by customers who buy the base items as they are across all transactions.
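
The reported numbers can be cross-checked with a little arithmetic. The sketch below assumes only the standard definitions (support = count / total, lift = confidence / support of the added items) and the 315 rows reported by `retail_data.info()`; the variable names are illustrative.

```python
n_transactions = 315                       # from retail_data.info()
reported_support = 0.006349206349206349    # from the printed RelationRecord
reported_confidence = 1.0
reported_lift = 3.0882352941176467

# support = count / n, so the full itemset occurs in this many baskets
itemset_count = round(reported_support * n_transactions)

# lift = confidence / support(items_add), so the support of
# {Cheese, Meat} on its own can be recovered from the lift
add_support = reported_confidence / reported_lift
add_count = round(add_support * n_transactions)

print(itemset_count, add_count)
```

Under these assumptions the full itemset appears in 2 baskets, and Cheese and Meat appear together in 102 of the 315 baskets.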

In [15]:
for item in association_results:

    # item[0] is the frozenset of all items in the rule (base and added items).
    # A frozenset has no defined order, so items[0] and items[1] are just two
    # arbitrary members of the set, not necessarily the base and the added item.
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # item[1] is the support of the whole itemset
    print("Support: " + str(item[1]))

    # item[2] is the list of ordered statistics; its first entry holds
    # (items_base, items_add, confidence, lift) at positions 0 to 3
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

Rule: Bread -> Bagel
Support: 0.006349206349206349
Confidence: 1.0
Lift: 3.0882352941176467
Rule: Bread -> Bagel
Support: 0.006349206349206349
Confidence: 1.0
Lift: 3.5795454545454546
Rule: Bread -> Milk
Support: 0.006349206349206349
Confidence: 0.6666666666666666
Lift: 3.230769230769231
Rule: Bagel -> Wine
Support: 0.012698412698412698
Confidence: 0.2857142857142857
Lift: 3.1034482758620685
Rule: Milk -> Bagel
Support: 0.006349206349206349
Confidence: 0.5
Lift: 3.2142857142857144
Rule: Bread -> Milk
Support: 0.006349206349206349
Confidence: 1.0
Lift: 4.2
Rule: Bread -> Milk
Support: 0.009523809523809525
Confidence: 1.0
Lift: 4.090909090909091
Rule: Bread -> Milk
Support: 0.006349206349206349
Confidence: 1.0
Lift: 5.0
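
The loop above labels each rule with two arbitrary members of the itemset, which is why several different rules print as "Bread -> Milk". A hedged alternative is to read the base and added items from the ordered statistics. So that the sketch runs without apyori installed, it uses namedtuple stand-ins with the same field names that appear in the printed RelationRecord; `format_rules` is a hypothetical helper, not a library function.

```python
from collections import namedtuple

# Stand-ins mirroring the field names printed by apyori above (assumption:
# real results expose the same attributes).
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

# The first result from the notebook output, rebuilt by hand
example = RelationRecord(
    items=frozenset({"Bread", "Bagel", "Meat", "Cheese", "Diaper", "Eggs", "nan"}),
    support=0.006349206349206349,
    ordered_statistics=[OrderedStatistic(
        items_base=frozenset({"Bread", "Bagel", "Diaper", "Eggs", "nan"}),
        items_add=frozenset({"Cheese", "Meat"}),
        confidence=1.0,
        lift=3.0882352941176467,
    )],
)

def format_rules(results):
    """Return one line per ordered statistic, showing the full base and added sets."""
    lines = []
    for record in results:
        for stat in record.ordered_statistics:
            base = ", ".join(sorted(stat.items_base))
            add = ", ".join(sorted(stat.items_add))
            lines.append(
                f"Rule: {{{base}}} -> {{{add}}} | support={record.support:.4f} "
                f"confidence={stat.confidence:.2f} lift={stat.lift:.2f}"
            )
    return lines

for line in format_rules([example]):
    print(line)
```

Sorting the item names is only a presentation choice here; it makes the printed rule stable from run to run.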


Advantages

Easy to understand algorithm

Join and Prune steps are easy to implement on large itemsets in large databases
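
As an illustration of how compact the join and prune steps can be, here is a minimal sketch of candidate generation. It is a simplified version under stated assumptions (itemsets as frozensets, a naive pairwise join), not an optimized or library implementation; `join_and_prune` and the sample data are hypothetical.

```python
from itertools import combinations

def join_and_prune(frequent_prev, k):
    """Build size-k candidates from the frequent (k-1)-itemsets."""
    frequent_prev = set(frequent_prev)
    # Join: merge pairs of (k-1)-itemsets whose union has exactly k items
    candidates = {
        a | b
        for a in frequent_prev
        for b in frequent_prev
        if len(a | b) == k
    }
    # Prune: a candidate survives only if all its (k-1)-subsets are frequent
    return {
        c for c in candidates
        if all(frozenset(s) in frequent_prev for s in combinations(c, k - 1))
    }

# Hypothetical frequent 2-itemsets
frequent_2 = {
    frozenset(p)
    for p in [("Bread", "Milk"), ("Bread", "Eggs"), ("Milk", "Eggs"), ("Milk", "Wine")]
}

print(join_and_prune(frequent_2, 3))
```

Here only {Bread, Milk, Eggs} survives, because the other joined candidates each contain a pair (such as Bread and Wine) that is not frequent.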

Disadvantages

It is computationally expensive when the itemsets are very large and the minimum support is kept very low.

The entire database needs to be scanned repeatedly, once for each itemset size.

Methods To Improve Apriori Efficiency

Many methods are available for improving the efficiency of the algorithm.

Hash-Based Technique: This method uses a hash table to generate the k-itemsets and their counts. A hash function maps each itemset to a bucket of the table.
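
A rough sketch of the idea for 2-itemsets: while counting single items, every pair in a transaction is also hashed into a small table of bucket counters. A bucket count is an upper bound on the count of any pair that falls into it, so pairs in light buckets can be discarded early. The function name, the bucket count and the use of `zlib.crc32` as the hash function are all assumptions made for this illustration.

```python
import zlib
from collections import Counter
from itertools import combinations

def bucket_of(pair, n_buckets):
    """Deterministic bucket index for a sorted pair of item names."""
    return zlib.crc32("|".join(pair).encode("utf-8")) % n_buckets

def hashed_pair_candidates(transactions, min_count, n_buckets=7):
    """Return candidate pairs that pass both the item filter and the bucket filter."""
    item_counts = Counter()
    buckets = [0] * n_buckets
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            buckets[bucket_of(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    return [
        pair for pair in combinations(frequent_items, 2)
        if buckets[bucket_of(pair, n_buckets)] >= min_count
    ]

# Hypothetical toy baskets
toy = [["Bread", "Milk"], ["Bread", "Milk", "Eggs"], ["Bread", "Eggs"], ["Milk"], ["Milk", "Wine"]]
candidates = hashed_pair_candidates(toy, min_count=2)
print(candidates)
```

Because of hash collisions the result may still include some infrequent pairs, so the candidates need an exact count afterwards, but no truly frequent pair is lost.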

Transaction Reduction: This method reduces the number of transactions scanned in later iterations. Transactions that do not contain any frequent itemsets are marked or removed.
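
A minimal sketch of this idea, assuming the frequent single items are already known: infrequent items are dropped from each transaction, and a transaction that is left with fewer than k items cannot contain any k-itemset and is removed. `reduce_transactions` and the sample data are hypothetical.

```python
def reduce_transactions(transactions, frequent_items, k):
    """Trim infrequent items and drop transactions too short to hold a k-itemset."""
    reduced = []
    for t in transactions:
        kept = [item for item in t if item in frequent_items]
        if len(kept) >= k:
            reduced.append(kept)
    return reduced

# Hypothetical toy baskets and frequent items
toy = [["Bread", "Milk", "Pencil"], ["Pencil"], ["Bread", "Eggs"], ["Wine"]]
frequent_items = {"Bread", "Milk", "Eggs"}

print(reduce_transactions(toy, frequent_items, k=2))
```

Only two of the four baskets remain for the pass that counts 2-itemsets.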

Partitioning: This method requires only two database scans to mine the frequent itemsets. It relies on the fact that any itemset that is frequent in the whole database must be frequent in at least one of its partitions.
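
The two scans can be sketched as follows. For brevity this illustration mines each partition by brute force and only up to itemsets of size 2; the function names, the `max_size` limit and the toy data are assumptions, not part of any library.

```python
from collections import Counter
from itertools import combinations

def local_frequent(partition, min_support, max_size=2):
    """Brute-force frequent itemsets (up to max_size items) within one partition."""
    counts = Counter()
    for t in partition:
        items = sorted(set(t))
        for size in range(1, max_size + 1):
            counts.update(frozenset(c) for c in combinations(items, size))
    threshold = min_support * len(partition)
    return {s for s, c in counts.items() if c >= threshold}

def partitioned_frequent(transactions, min_support, n_partitions=2):
    size = -(-len(transactions) // n_partitions)  # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: the union of locally frequent itemsets forms the global candidates
    candidates = set().union(*(local_frequent(p, min_support) for p in partitions))
    # Scan 2: count every candidate once over the whole database
    counts = Counter()
    for t in transactions:
        items = set(t)
        counts.update(c for c in candidates if c <= items)
    threshold = min_support * len(transactions)
    return {c: counts[c] for c in candidates if counts[c] >= threshold}

# Hypothetical toy baskets
toy = [["Bread", "Milk"], ["Bread", "Milk", "Eggs"], ["Bread", "Eggs"],
       ["Milk"], ["Milk", "Wine"], ["Bread", "Milk"]]
result = partitioned_frequent(toy, min_support=0.5)
print(result)
```

Itemsets such as {Bread, Eggs} are frequent in the first partition only, so they become candidates in scan 1 and are then rejected by the global count in scan 2.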

Sampling: This method picks a random sample S from database D and then searches for frequent itemsets in S. A globally frequent itemset may be missed this way. The risk can be reduced by lowering min_sup when mining the sample.
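
A small sketch of sampling with a lowered threshold, restricted to single items to keep it short. The `slack` factor, the fixed seed and the final check against the full data are assumptions added for this illustration; the description above only covers mining the sample.

```python
import random
from collections import Counter

def frequent_items_from_sample(transactions, min_support, sample_size, slack=0.8, seed=0):
    """Find candidate items in a random sample, then confirm them on the full data."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, sample_size)
    # Lowered threshold on the sample to reduce the chance of missing an item
    lowered = min_support * slack
    sample_counts = Counter(item for t in sample for item in set(t))
    candidates = {i for i, c in sample_counts.items() if c / sample_size >= lowered}
    # Assumed extra step: verify the candidates against the whole database
    full_counts = Counter(item for t in transactions for item in set(t))
    return {i for i in candidates if full_counts[i] / len(transactions) >= min_support}

# Hypothetical toy baskets
toy = [["Bread", "Milk"], ["Bread", "Milk", "Eggs"], ["Bread", "Eggs"],
       ["Milk"], ["Milk", "Wine"], ["Bread", "Milk"]]

print(frequent_items_from_sample(toy, min_support=0.5, sample_size=3))
```

With a sample smaller than the database the result can miss items that are frequent overall, which is the trade-off the method accepts in exchange for less scanning.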

Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked start point of the database during the scanning of the database.

Applications Of Apriori Algorithm

Some fields where Apriori is used:

In education: Extracting association rules from data on admitted students, based on their characteristics and specialties.

In medicine: For example, analysis of patient databases.

In forestry: Analysis of the probability and intensity of forest fires using forest fire data.

Apriori-style association rule mining is also often cited as a technique behind recommender systems such as Amazon's and auto-complete features such as Google's.

Conclusion

Apriori is a simple, level-wise algorithm, but it is not a single-pass method: it scans the database once for each itemset size.

Its pruning step considerably reduces the number of candidate itemsets that have to be counted, which gives good performance on moderately sized data. Association rule mining of this kind helps consumers and industries make better decisions.