# Association Rule Mining

Association rule mining is a data mining method that identifies co-occurrence relationships between items. The method needs an algorithm to search for candidate association rules. In this practicum we use a sample dataset of transactions, where each row records a product purchased by a customer.
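
To make "relationships between items" concrete, here is a minimal sketch of the two standard measures used to score a candidate rule: support and confidence. The text above does not define them, so treat this as background under that assumption. The toy transactions and the helper names `support` and `confidence` are hypothetical and are not part of the practicum dataset.

```python
# Hypothetical toy transactions; the item names are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

# Rule under test: {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

A rule with high support applies to many transactions, and a rule with high confidence holds in most of the transactions where its antecedent appears.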

**Load Dataset**

In [35]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from mlxtend.preprocessing import TransactionEncoder

# The CSV file uses ';' as its field separator
dataset = pd.read_csv("https://raw.githubusercontent.com/brandonndun/Summary_Week-9/main/OnlineRetail.csv", sep = ";")
dataset.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01-12-2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 01-12-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01-12-2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01-12-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01-12-2010 08:26 | 3.39 | 17850.0 | United Kingdom |


**This code reads the dataset from a CSV file and displays its first five rows, so we can see what the data looks like.**

# Data Preprocessing

In the data preprocessing step, we prepare the data by checking which parts of it are needed for Association Rule Mining and which are not.

In [36]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269123 entries, 0 to 269122
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    269123 non-null  object 
 1   StockCode    269123 non-null  object 
 2   Description  268136 non-null  object 
 3   Quantity     269123 non-null  int64  
 4   InvoiceDate  269123 non-null  object 
 5   UnitPrice    269123 non-null  float64
 6   CustomerID   191906 non-null  float64
 7   Country      269123 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 16.4+ MB


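**`dataset.info()` reports the non-null count of each column, which shows whether the dataset contains null values. Here `Description` and `CustomerID` have fewer non-null entries than the 269123 rows, so both columns contain nulls.**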
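
The null counts can also be read off directly instead of being inferred from `info()`. The sketch below uses a hypothetical three-row frame (`toy`) rather than the real dataset, and relies only on pandas' `isnull()` and `sum()`.

```python
import pandas as pd

# Hypothetical toy frame that mimics two columns of the dataset.
toy = pd.DataFrame({
    "Description": ["LANTERN", None, "HOLDER"],
    "CustomerID": [17850.0, None, None],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
null_counts = toy.isnull().sum()
print(null_counts)
```

Applied to the real data, the same call would be `dataset.isnull().sum()`.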

In [37]:
dataset.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 269123.0 | 269123.0 | 191906.0 |
| mean | 9.242882 | 5.068197 | 15291.868691 |
| std | 211.090869 | 115.127509 | 1729.210455 |
| min | -74215.0 | 0.0 | 12346.0 |
| 25% | 1.0 | 1.25 | 13862.0 |
| 50% | 3.0 | 2.1 | 15159.0 |
| 75% | 10.0 | 4.15 | 16842.0 |
| max | 74215.0 | 38970.0 | 18287.0 |


In [38]:
# Keep only the invoice number and the product code; the other columns are not needed here
dataset = dataset[['InvoiceNo', 'StockCode']]
dataset.head()

| | InvoiceNo | StockCode |
|---|---|---|
| 0 | 536365 | 85123A |
| 1 | 536365 | 71053 |
| 2 | 536365 | 84406B |
| 3 | 536365 | 84029G |
| 4 | 536365 | 84029E |


# Data Cleaning

In [39]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 269123 entries, 0 to 269122
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   InvoiceNo  269123 non-null  object
 1   StockCode  269123 non-null  object
dtypes: object(2)
memory usage: 4.1+ MB


In [40]:
dataset[dataset.duplicated()]

| | InvoiceNo | StockCode |
|---|---|---|
| 125 | 536381 | 71270 |
| 498 | 536409 | 90199C |
| 502 | 536409 | 85116 |
| 517 | 536409 | 21866 |
| 525 | 536409 | 90199C |
| ... | ... | ... |
| 268087 | 560385 | 85123A |
| 268091 | 560385 | 21408 |
| 268190 | 560393 | 22233 |
| 268191 | 560393 | 22629 |


In [41]:
dataset = dataset.drop_duplicates()
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 264861 entries, 0 to 269122
Data columns (total 2 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   InvoiceNo  264861 non-null  object
 1   StockCode  264861 non-null  object
dtypes: object(2)
memory usage: 6.1+ MB
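
The two cells above can be illustrated on a small scale. This is a minimal sketch on a hypothetical four-row frame (`toy`), not the real data. It shows that `duplicated()` flags only the second and later occurrences of a row, and that `drop_duplicates()` keeps the original index labels, which is why the output above still reports an index running from 0 to 269122 after rows were removed.

```python
import pandas as pd

# Hypothetical toy frame: row 2 repeats row 0 (same invoice, same product).
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A1", "B2"],
    "StockCode": ["111", "222", "111", "111"],
})

# True for every row that repeats an earlier row exactly.
dup_mask = toy.duplicated()

# Drops the flagged rows; the surviving rows keep their original index labels.
deduped = toy.drop_duplicates()

print(dup_mask.tolist())
print(deduped)
```

Row 3 is not a duplicate even though product `111` appears again, because it belongs to a different invoice.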


In [42]:
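def getProductList(row):
  # Collect the distinct product codes of one invoice into a list
  return list(row.unique())

# Group the rows by invoice so that each invoice becomes one transaction (a list of product codes)
dataset = dataset[['InvoiceNo', 'StockCode']].groupby('InvoiceNo').agg(getProductList).reset_index()['StockCode']
dataset = list(dataset)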

dataset[:5]

[['85123A', '71053', '84406B', '84029G', '84029E', '22752', '21730'],
 ['22633', '22632'],
 ['84879',
  '22745',
  '22748',
  '22749',
  '22310',
  '84969',
  '22623',
  '22622',
  '21754',
  '21755',
  '21777',
  '48187'],
 ['22960', '22913', '22912', '22914'],
 ['21756']]
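
The grouping step can be checked on a small example. The sketch below uses a hypothetical three-row frame (`toy`) and the shorter spelling `groupby(...)[...].agg(list)`, which is an alternative to the cell above rather than the notebook's own code. Unlike `getProductList`, it does not remove repeated codes within an invoice, so it assumes duplicates were already dropped.

```python
import pandas as pd

# Hypothetical toy frame: invoice A1 has two products, invoice B2 has one.
toy = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "B2"],
    "StockCode": ["111", "222", "111"],
})

# One list of product codes per invoice, in sorted invoice order.
baskets = toy.groupby("InvoiceNo")["StockCode"].agg(list).tolist()
print(baskets)
```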

In [43]:
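tEncoder = TransactionEncoder()
# Learn the unique product codes, then encode each transaction as a row of booleans
datasetArr = tEncoder.fit(dataset).transform(dataset)
# One column per product code; True means the product appears in that transaction
data = pd.DataFrame(datasetArr, columns = tEncoder.columns_)
data.head(5)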

| | 10002 | 10080 | 10120 | 10123C | 10123G | 10124A | 10124G | 10125 | 10133 | 10134 | ... | M | PADS | POST | S | gift_0001_10 | gift_0001_20 | gift_0001_30 | gift_0001_40 | gift_0001_50 | m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
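
To show what the encoding step produces, here is a minimal pure-Python sketch of one-hot encoding a list of transactions. It is not mlxtend's implementation; the helper name `one_hot_encode` is hypothetical, and the only property it shares with the output above is that the columns are the sorted unique product codes.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in set(t) for item in columns] for t in transactions]
    return columns, rows

# Two short transactions taken from the product lists shown earlier.
columns, rows = one_hot_encode([["22633", "22632"], ["21756"]])
print(columns)
for row in rows:
    print(row)
```

Each row has one boolean per known product code, so on the real data most entries are `False`, as in the table above.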
