# EDA: Market Basket Analysis
- Analyze the association between the slots predefined in the ontology (`meta_slot`) and the words that appear in dialogue utterances
- Technique: market basket analysis

## Market Basket Analysis
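- Also referred to as association rules
- Uses the following three metrics
    1. Support: (number of cases containing both item A and item B) / (total number of cases), P(A∩B)
    2. Confidence: (number of cases containing both item A and item B) / (number of cases containing item A), P(A∩B) / P(A)
    3. Lift: the ratio of the probability of B given that A is present to the overall probability of B, P(A∩B) / (P(A)·P(B)) = P(B|A) / P(B). A value above 1 means A and B co-occur more often than independence would predict; a value of 1 means they are independent.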
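As a minimal worked sketch of the three metrics, assume a hypothetical toy set of five transactions. The item names `A`, `B`, `C` and the helper functions are illustrative only and are not taken from the dataset or from mlxtend.

```python
# Toy transactions (hypothetical); each set is one "basket".
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B"},
    {"C"},
]

def support(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """P(B|A) = support(A and B) / support(A)."""
    return support({a, b}, transactions) / support({a}, transactions)

def lift(a, b, transactions):
    """P(B|A) / P(B); a value of 1.0 means A and B are independent."""
    return confidence(a, b, transactions) / support({b}, transactions)

print(support({"A", "B"}, transactions))   # 2 of the 5 baskets contain both
print(confidence("A", "B", transactions))  # 2 of the 3 baskets containing A
print(lift("A", "B", transactions))        # confidence divided by P(B) = 3/5
```

Here the lift of A → B is slightly above 1, so in this toy data B shows up a little more often in baskets that contain A than it does overall.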

## WoS
- \# Meta Domain: Travel
- \# Domain: 5
- \# Slot: 45
- \# Avg Turn: 14.67
- \# Tokens per turn


In [5]:
import json
import sys
from tqdm import tqdm
import pandas as pd
import konlpy
from konlpy.tag import Okt # Mecab is not supported on Windows
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
sys.path.insert(0, '../')

from utils import load_json

#### Load Data

In [2]:
SLOTMETA = '../input/data/train_dataset/slot_meta.json'
ONTOLOGY = '../input/data/train_dataset/ontology.json'
DIALS = '../input/data/train_dataset/train_dials.json'

In [3]:
ontology = load_json(ONTOLOGY)
slot_meta = load_json(SLOTMETA)
dials = load_json(DIALS)

#### Preprocessing for Market Basket Analysis

In [7]:
tagger = Okt()

In [23]:
NOUN = 'Noun'

def get_slot_meta(x):
    return '-'.join(x.split('-')[:-1])

def get_keywords(x, tagger: Okt):
    text_parsed = tagger.pos(x)  # list of (token, POS tag) pairs
    # keep only nouns that are at least two characters long
    parsed_filtered = filter(lambda tok: tok[-1] == NOUN and len(tok[0]) > 1, text_parsed)
    keywords = list(map(lambda tok: tok[0], parsed_filtered))
    return keywords

total_items = [] # one list of items per dialogue

for dial in tqdm(dials):
    dial_items = set()
    for turn in dial['dialogue']:
        if turn['role'] == 'sys':
            continue
        slot_items = set(map(get_slot_meta, turn['state']))
        keywords_items = set(get_keywords(turn['text'], tagger))
        turn_items = slot_items.union(keywords_items)
        dial_items = dial_items.union(turn_items)
    
    total_items.append(list(dial_items))

100%|█████████████████████████████████████████████████████████████████████████████| 7000/7000 [01:03<00:00, 110.45it/s]
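To make the preprocessing above concrete, the sketch below rebuilds the per-dialogue item set on a hypothetical toy dialogue. It assumes state strings follow a `domain-slot-value` pattern, which is what `get_slot_meta` implies by dropping the last `-`-separated field. It also swaps in a whitespace-based `toy_keywords` stand-in for the Okt tagger, so it does not reproduce the real noun extraction.

```python
def get_slot_meta(x):
    # 'domain-slot-value' -> 'domain-slot' (drops the last '-'-separated field)
    return '-'.join(x.split('-')[:-1])

def toy_keywords(text):
    # Stand-in for get_keywords: whitespace tokens of length >= 2 (NOT real POS tagging).
    return [tok for tok in text.split() if len(tok) > 1]

# Hypothetical dialogue; the field names mirror those used in the loop above.
toy_dial = {
    'dialogue': [
        {'role': 'user', 'text': 'cheap hotel please',
         'state': ['hotel-price-cheap']},
        {'role': 'sys', 'text': 'which area do you prefer'},
        {'role': 'user', 'text': 'north area',
         'state': ['hotel-price-cheap', 'hotel-area-north']},
    ]
}

dial_items = set()
for turn in toy_dial['dialogue']:
    if turn['role'] == 'sys':  # system turns are skipped, as in the loop above
        continue
    dial_items |= set(map(get_slot_meta, turn['state']))
    dial_items |= set(toy_keywords(turn['text']))

print(sorted(dial_items))
```

Because the items are collected into a set, a slot that persists across turns (here `hotel-price`) is counted once per dialogue. Note also that a value containing `-` would leave part of the value in the result of `get_slot_meta`, since only the last field is dropped.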


In [24]:
enc = TransactionEncoder()
enc.fit(total_items)

transactions = enc.transform(total_items)
transaction_table = pd.DataFrame(transactions, columns=enc.columns_)
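For intuition about what the encoding step produces, here is a dependency-free sketch of the same idea: every distinct item becomes a column and each dialogue becomes a row of booleans. The `one_hot` helper and the toy items are hypothetical; the sketch mirrors the role `TransactionEncoder` plays here and is not its implementation.

```python
def one_hot(item_lists):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for items in item_lists for item in items})
    rows = []
    for items in item_lists:
        present = set(items)
        rows.append([col in present for col in columns])
    return columns, rows

# Hypothetical item lists, one per dialogue.
toy_items = [
    ['hotel-price', 'cheap', 'hotel'],
    ['hotel-area', 'north'],
]

columns, rows = one_hot(toy_items)
print(columns)
for row in rows:
    print(row)
```

With thousands of distinct nouns, the resulting table is wide and mostly `False`, which matters for the cost of the frequent-itemset search that follows.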

In [26]:
transaction_table.head()

Unnamed: 0,가가,가게,가격,가구,가기,가까이,가끔,가나,가능,가능성,...,후보,후움,휴가,휴식,휴일,흡연,흥미,흥인지문,힐링,힙니
0,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [None]:
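# Trial run on the first 10 dialogues only. With 10 rows, min_support=0.005 is below 1/10,
# so every itemset that occurs at least once counts as frequent.
temp = apriori(transaction_table.head(10), min_support=0.005, use_colnames=True)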

In [None]:
temp
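Since the stated goal is slot-to-word association, one possible follow-up is to score only (keyword → slot) pairs instead of enumerating all itemsets, which keeps the number of candidates manageable over thousands of columns. The sketch below is a pure-Python illustration under assumptions: `pair_rules`, its `min_support` parameter, and the toy data are hypothetical, and slot items are recognised by membership in a given set of slot names (in this notebook that would presumably be built from `slot_meta`).

```python
from collections import Counter
from itertools import product

def pair_rules(item_lists, slots, min_support=0.0):
    """Score keyword -> slot pairs by support, confidence, and lift."""
    n = len(item_lists)
    item_count = Counter()  # number of transactions containing each item
    pair_count = Counter()  # number of transactions containing (word, slot)
    for items in item_lists:
        items = set(items)
        item_count.update(items)
        slot_items = items & slots
        word_items = items - slots
        pair_count.update(product(word_items, slot_items))

    rules = []
    for (word, slot), cnt in pair_count.items():
        support = cnt / n
        if support < min_support:
            continue
        confidence = cnt / item_count[word]         # P(slot | word)
        lift = confidence / (item_count[slot] / n)  # P(slot | word) / P(slot)
        rules.append((word, slot, support, confidence, lift))
    # highest lift first
    return sorted(rules, key=lambda r: r[-1], reverse=True)

# Hypothetical slot names and per-dialogue item lists.
toy_slots = {'hotel-price', 'hotel-area'}
toy_items = [
    ['hotel-price', 'cheap', 'hotel'],
    ['hotel-price', 'cheap'],
    ['hotel-area', 'north', 'hotel'],
    ['hotel-area', 'hotel'],
]

for rule in pair_rules(toy_items, toy_slots):
    print(rule)
```

In the toy data, `cheap` only ever appears alongside `hotel-price`, so that pair gets a lift of 2, while the generic word `hotel` is spread across both slots and scores lower.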