
# Association Rule Mining with the Apriori Method Using mlxtend

In [1]:
# import library
import pandas as pd
import numpy as np

In [2]:
# save dataset location
dataset_path = "../input/BreadBasket_DMS.csv"

# load dataset
df = pd.read_csv(dataset_path)

## Exploratory Data Analysis

This section presents a simple exploration of the dataset.

## Check the Dataset Structure

In [3]:
df.head(10)

| | Date | Time | Transaction | Item |
|---|---|---|---|---|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam |
| 5 | 2016-10-30 | 10:07:57 | 3 | Cookies |
| 6 | 2016-10-30 | 10:08:41 | 4 | Muffin |
| 7 | 2016-10-30 | 10:13:03 | 5 | Coffee |
| 8 | 2016-10-30 | 10:13:03 | 5 | Pastry |
| 9 | 2016-10-30 | 10:13:03 | 5 | Bread |


The dataset has 4 features. We will drop the **Date** and **Time** features because they are not needed for association rule mining (ARM).

Some transactions contain only one item, such as **Transaction 1**, and some list the same item more than once, such as **Transaction 2**.

### Check for Duplicate Values

In [4]:
df.duplicated().sum()

1653

There are 1653 duplicate rows. These duplicates indicate that an item was bought more than once in the same transaction.

### Check for Missing Values
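To make clear which rows `duplicated()` counts, here is a minimal sketch on a toy frame (the data is made up for illustration). It flags every repeat of an earlier identical row, which in this layout means an item listed again within the same transaction.

```python
import pandas as pd

# Toy data (hypothetical): transaction 2 lists "Scandinavian" twice.
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 3],
    "Item": ["Bread", "Scandinavian", "Scandinavian", "Jam", "Cookies"],
})

# duplicated() marks each row that repeats an earlier identical row.
n_duplicates = int(toy.duplicated().sum())

# Count how often each item appears within each transaction.
item_counts = toy.groupby(["Transaction", "Item"]).size()
repeated = item_counts[item_counts > 1]

print(n_duplicates)
print(repeated)
```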

In [5]:
df.isna().sum()

Date           0
Time           0
Transaction    0
Item           0
dtype: int64

The dataset has no missing values.

### Check the Best-Selling Products

In [6]:
# save item frequencies
top_selling = df.Item.value_counts()

# top 10 most sold products
top_selling.head(10)

Coffee           5471
Bread            3325
Tea              1435
Cake             1025
Pastry            856
NONE              786
Sandwich          771
Medialuna         616
Hot chocolate     590
Cookies           540
Name: Item, dtype: int64

Among the 10 best-selling products there is one named **NONE**. This item is probably the result of a data-entry error, so we will remove it from the dataset.

## Data Preprocessing

### Drop the Date & Time Features

In [7]:
# drop feature date and time
df.drop(columns=['Date', 'Time'], 
        inplace=True)

### Drop Transactions with Item NONE

Rows whose item is NONE are removed by first collecting their index labels; `drop` then deletes the rows at those labels.

In [8]:
# save index where Item == NONE
index_of_none = df[ df.Item == 'NONE' ].index

# dimension of  instance where item == NONE 
index_of_none.shape

(786,)

In [9]:
# drop instance according to index_of_none
df.drop(index=index_of_none, 
        inplace=True)
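An equivalent way to remove these rows is to keep only the rows whose item is not NONE, using a boolean mask. A minimal sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): transaction 2 contains only the NONE item.
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3],
    "Item": ["Bread", "NONE", "NONE", "Coffee"],
})

# Keep rows whose Item is not the placeholder value "NONE".
cleaned = toy[toy["Item"] != "NONE"]

print(cleaned)
print(cleaned["Transaction"].nunique())
```

In this toy example, transaction 2 disappears entirely because NONE was its only item, so the number of unique transaction ids drops from 3 to 2.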

### Restructure the Dataset

The **mlxtend** library expects the dataset as a list of lists: the outer list is the dataset and each inner list is one transaction.

In [10]:
# unique transaction ids
transaksi_unik = df.Transaction.unique()
transaksi_unik

array([   1,    2,    3, ..., 9682, 9683, 9684], dtype=int64)

In [11]:
# number of transaction ids
n_transaksi = df.Transaction.nunique()
n_transaksi

9465

In [12]:
list_transaksi = []

# transaction_id replaces the original loop variable `id`,
# which shadowed Python's built-in id()
for transaction_id in transaksi_unik:

    # use a set so items within a transaction are not duplicated
    transaksi = set(df[df.Transaction == transaction_id]['Item'])

    # append the items as a sorted list
    list_transaksi.append(sorted(transaksi))
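The loop above filters the full DataFrame once per transaction id. As an alternative sketch, a single `groupby` can build the same kind of list of sorted, de-duplicated item lists. The toy frame below is made up for illustration, and the name `transactions` is hypothetical.

```python
import pandas as pd

# Toy data (hypothetical), shaped like the Transaction/Item columns.
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 3, 3],
    "Item": ["Bread", "Scandinavian", "Scandinavian",
             "Hot chocolate", "Jam", "Cookies"],
})

# Group rows by transaction id, then reduce each group's items
# to a sorted list of unique item names.
transactions = (
    toy.groupby("Transaction")["Item"]
       .apply(lambda items: sorted(set(items)))
       .tolist()
)

print(transactions)
# [['Bread'], ['Scandinavian'], ['Cookies', 'Hot chocolate', 'Jam']]
```

Note that `groupby` orders the groups by transaction id, whereas the loop follows the order of first appearance; the two agree when the ids are already ascending.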

In [14]:
# view the transaction list
list_transaksi[:15]

[['Bread'],
 ['Scandinavian'],
 ['Cookies', 'Hot chocolate', 'Jam'],
 ['Muffin'],
 ['Bread', 'Coffee', 'Pastry'],
 ['Medialuna', 'Muffin', 'Pastry'],
 ['Coffee', 'Medialuna', 'Pastry', 'Tea'],
 ['Bread', 'Pastry'],
 ['Bread', 'Muffin'],
 ['Medialuna', 'Scandinavian'],
 ['Bread', 'Medialuna'],
 ['Coffee', 'Jam', 'Pastry', 'Tartine', 'Tea'],
 ['Basket', 'Bread', 'Coffee'],
 ['Bread', 'Medialuna', 'Pastry'],
 ['Mineral water', 'Scandinavian']]

In [15]:
# save the transaction list
df_transaksi = pd.DataFrame(data=list_transaksi)
df_transaksi.to_csv('../output/valid_transactions.csv', 
                    index=False, 
                    header=False)

## Apriori Implementation

### Encode the Transaction Items

In [16]:
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
te_array = te.fit_transform(list_transaksi)

te_array

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
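Conceptually, the encoder turns each transaction into a row of booleans, with one column per distinct item. The pure-Python sketch below illustrates that idea on made-up transactions; it is not mlxtend's implementation, and the column order is sorted here only for readability.

```python
# Toy transactions (hypothetical).
transactions = [["Bread"], ["Bread", "Coffee", "Pastry"], ["Coffee"]]

# One column per distinct item, in alphabetical order.
columns = sorted({item for t in transactions for item in t})

# One row per transaction; True where the item is present.
encoded = [[item in t for item in columns] for t in transactions]

print(columns)
for row in encoded:
    print(row)
```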

In [18]:
# convert to a DataFrame
encoded_transactions = pd.DataFrame(data=te_array, 
                                    columns=te.columns_)
# save the encoded transactions
encoded_transactions.to_csv('../output/encoded_transactions.csv', 
                            index=False)

In [19]:
encoded_transactions.head()

| | Adjustment | Afternoon with the baker | Alfajores | Argentina Night | Art Tray | Bacon | Baguette | Bakewell | Bare Popcorn | Basket | ... | The BART | The Nomad | Tiffin | Toast | Truffles | Tshirt | Valentine's card | Vegan Feast | Vegan mincepie | Victorian Sponge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


### Find Itemsets with support >= min_support

Parameter used:
- min_support: 0.005


In [20]:
from mlxtend.frequent_patterns import apriori

# frequent itemsets that meet the min_support threshold
frequent_itemsets = apriori(df=encoded_transactions, 
                            min_support=0.005,
                            use_colnames=True)

# sort itemsets by support, highest first
frequent_itemsets.sort_values(by=['support'], 
                              ascending=False, 
                              inplace=True)

# save frequent itemsets
frequent_itemsets.to_csv('../output/frequent_itemsets.csv', 
                         index=False)
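The support of an itemset is the fraction of transactions that contain every item in it. The sketch below computes it directly from a small one-hot table to show what the `min_support` threshold is compared against. The table and the helper `support` are made up for illustration.

```python
import pandas as pd

# Toy one-hot table (hypothetical): rows are transactions.
encoded = pd.DataFrame({
    "Bread":  [True, True, False, True],
    "Coffee": [True, False, True, True],
    "Jam":    [False, False, False, True],
})

def support(itemset):
    """Fraction of transactions containing all items in `itemset`."""
    return float(encoded[list(itemset)].all(axis=1).mean())

print(support(["Bread"]))            # 0.75
print(support(["Bread", "Coffee"]))  # 0.5
print(support(["Jam"]))              # 0.25
```

With `min_support=0.5` on this toy table, {Bread} and {Bread, Coffee} would count as frequent and {Jam} would not.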

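### Find Association Rules

Target parameters:

- confidence: 0.6
- lift: 3.0

No rule reaches these thresholds (see the conclusion), so the cell below relaxes the lift threshold to 2.5 and filters on lift only.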

In [23]:
from mlxtend.frequent_patterns import association_rules

# find rules whose lift is at least min_threshold
assoc_rules = association_rules(df=frequent_itemsets, 
                                metric='lift', 
                                min_threshold=2.5)

# sort association rules
assoc_rules.sort_values(by=['lift'], 
                        ascending=False, 
                        inplace=True)

In [24]:
assoc_rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 3 | (Coke) | (Sandwich) | 0.01944 | 0.071844 | 0.005177 | 0.266304 | 3.706722 | 0.00378 | 1.265043 |
| 2 | (Sandwich) | (Coke) | 0.071844 | 0.01944 | 0.005177 | 0.072059 | 3.706722 | 0.00378 | 1.056705 |
| 0 | (Juice) | (Cookies) | 0.038563 | 0.054411 | 0.006128 | 0.158904 | 2.920442 | 0.00403 | 1.124234 |
| 1 | (Cookies) | (Juice) | 0.054411 | 0.038563 | 0.006128 | 0.112621 | 2.920442 | 0.00403 | 1.083457 |
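The metric columns follow from the support values. As a check, this sketch recomputes confidence, lift, and leverage for the (Coke) -> (Sandwich) rule from the rounded supports in the table above, using the standard definitions: confidence = support(A and C) / support(A), lift = confidence / support(C), and leverage = support(A and C) - support(A) * support(C). Because the inputs are rounded, the results match the table only to a few decimal places.

```python
# Rounded values copied from the (Coke) -> (Sandwich) row.
antecedent_support = 0.01944   # support(Coke)
consequent_support = 0.071844  # support(Sandwich)
rule_support = 0.005177        # support(Coke and Sandwich together)

confidence = rule_support / antecedent_support
lift = confidence / consequent_support
leverage = rule_support - antecedent_support * consequent_support

print(round(confidence, 3))  # 0.266
print(round(lift, 2))        # 3.71
print(round(leverage, 4))    # 0.0038
```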


## Conclusion

The initial criteria were min_support=0.005, min_confidence=0.6, min_lift=3, and min_length=2. 
With these criteria, the analysis finds no *association rule*.

I therefore relaxed the following parameters:
- min_lift=2.5
- min_confidence=0.2

The table below lists the *association rules* with a lift of at least 2.5. The code filters on lift only, so the confidence threshold is not applied to this table; of the four rules, only (Coke) -> (Sandwich), with confidence 0.266, also satisfies min_confidence=0.2.

In [25]:
assoc_rules

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 3 | (Coke) | (Sandwich) | 0.01944 | 0.071844 | 0.005177 | 0.266304 | 3.706722 | 0.00378 | 1.265043 |
| 2 | (Sandwich) | (Coke) | 0.071844 | 0.01944 | 0.005177 | 0.072059 | 3.706722 | 0.00378 | 1.056705 |
| 0 | (Juice) | (Cookies) | 0.038563 | 0.054411 | 0.006128 | 0.158904 | 2.920442 | 0.00403 | 1.124234 |
| 1 | (Cookies) | (Juice) | 0.054411 | 0.038563 | 0.006128 | 0.112621 | 2.920442 | 0.00403 | 1.083457 |
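`association_rules` filters on a single metric, so a second threshold has to be applied to the resulting DataFrame afterwards. A sketch of that step follows. The frame `rules` is a hypothetical stand-in for `assoc_rules` that copies the confidence and lift columns from the table above and uses plain strings where mlxtend stores frozensets.

```python
import pandas as pd

# Stand-in for assoc_rules, with values copied from the table above.
# (In the real frame, antecedents and consequents are frozensets.)
rules = pd.DataFrame({
    "antecedents": ["Coke", "Sandwich", "Juice", "Cookies"],
    "consequents": ["Sandwich", "Coke", "Cookies", "Juice"],
    "confidence": [0.266304, 0.072059, 0.158904, 0.112621],
    "lift": [3.706722, 3.706722, 2.920442, 2.920442],
})

min_confidence = 0.2
min_lift = 2.5

# Keep only the rules that satisfy both thresholds.
selected = rules[(rules["confidence"] >= min_confidence)
                 & (rules["lift"] >= min_lift)]

print(selected)
```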


## References

- [Fast Algorithms for Mining Association Rules](https://www.it.uu.se/edu/course/homepage/infoutv/ht08/vldb94_rj.pdf)
- [mlxtend API: mlxtend.frequent_patterns](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)
- [mlxtend user guide: apriori](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/)
- [mlxtend user guide: association_rules](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)