## Synthetic Multiple Dataset Generation
Author: Lin Lee Cheong <br>
Modified By: Tesfagabir Meharizghi <br>
Date: 01/20/2021 <br>

The goal of this notebook is to create synthetic datasets that help us understand how different relationships between tokens affect attention, SHAP, and other interpretability measures. The datasets vary along the following dimensions:
- sequence length in events (30, 300, 900)
- spacing between two or more coupled events, i.e. cases where the order of the sequence matters
- amount of noise, i.e. the trade-off between performance and interpretability
- vocabulary space

Each run of the notebook generates one dataset. To generate further datasets for experimentation, change the `EXP_NUMBER` variable and run it again.
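
To illustrate how `EXP_NUMBER` keeps datasets apart on disk, here is a minimal sketch of the path pattern. `build_paths` is a hypothetical helper introduced for this example, and it assumes the `data/<SEQ_LEN>/<EXP_NUMBER>/` layout used in the configuration cell of this notebook.

```python
def build_paths(seq_len, exp_number, root="data"):
    """Hypothetical helper: one directory per (sequence length, experiment)."""
    # The experiment number is zero-padded to two digits, e.g. 3 -> "03".
    base = f"{root}/{seq_len}/{exp_number:02}"
    return {split: f"{base}/{split}.csv" for split in ("train", "val", "test")}


paths = build_paths(seq_len=30, exp_number=3)
print(paths["train"])  # data/30/03/train.csv
```

Changing only `exp_number` yields a new directory, so earlier datasets are not overwritten.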

In [98]:
#! pip install nb-black

#! pip install botocore==1.12.201

#! pip install shap
#! pip install xgboost

In [99]:
%load_ext lab_black

%load_ext autoreload

%autoreload 2

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [100]:
import yaml
import string
import os
import pandas as pd

from utils import *

In [102]:
TOKEN_NAMES_FP = "./tokens.yaml"

SEQ_LEN = 30

EXP_NUMBER = 10

TRAIN_FP = f"data/{SEQ_LEN}/{EXP_NUMBER:02}/train.csv"
VAL_FP = f"data/{SEQ_LEN}/{EXP_NUMBER:02}/val.csv"
TEST_FP = f"data/{SEQ_LEN}/{EXP_NUMBER:02}/test.csv"

UID_COLNAME = "patient_id"

TRAIN_NROWS = 3000
VAL_NROWS = 1000
TEST_NROWS = 1000

UID_LEN = 10

In [103]:
# Load tokens from yaml file path
tokens = load_tokens(TOKEN_NAMES_FP)
for key in tokens.keys():
    print(f"{key}: {len(tokens[key])} tokens")

adverse_tokens: 4 tokens
adverse_helper_tokens: 6 tokens
adverse_unhelper_tokens: 5 tokens
noise_tokens: 15 tokens
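
Since each category is meant to carry a different signal, it can be worth checking that no token appears in more than one category. The sketch below assumes `tokens` is a dict mapping category names to lists of strings, as the loop above suggests. `find_overlaps` and the token names in the example are placeholders, not the contents of `tokens.yaml`.

```python
from itertools import combinations


def find_overlaps(tokens):
    """Return {(category_a, category_b): shared_tokens} for overlapping pairs."""
    overlaps = {}
    for (name_a, toks_a), (name_b, toks_b) in combinations(tokens.items(), 2):
        shared = set(toks_a) & set(toks_b)
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps


# Placeholder vocabulary with one deliberate overlap ("helper_2").
example_tokens = {
    "adverse_tokens": ["adverse_1", "adverse_2"],
    "adverse_helper_tokens": ["helper_1", "helper_2"],
    "adverse_unhelper_tokens": ["unhelper_1"],
    "noise_tokens": ["noise_1", "noise_2", "helper_2"],
}
print(find_overlaps(example_tokens))
# {('adverse_helper_tokens', 'noise_tokens'): {'helper_2'}}
```

An empty result means the categories are disjoint.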


### Simple dataset

The simple dataset contains six categories of sequences, three positive and three negative:
- positive set: (+++, 1 major (adverse) event + 1 helper), (++, 1 major event), (+, 3 helpers)
- negative set: (---, 3 unhelpers), (--, 1 helper + 2 unhelpers), (-, 2 helpers + 1 unhelper)


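**NOTES** (the counts below are example values; the cells that follow use `TRAIN_NROWS`, `VAL_NROWS`, and `TEST_NROWS` instead)<br>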
n_ppp_adverse = 2000 # 1 adverse event + 1 helper event <br>
n_pp_adverse = 2000 # 1 adverse event <br>
n_p_adverse = 2000 # 3 helper events <br><br>
n_nnn_adverse = 2000 # 3 unhelper events <br>
n_nn_adverse = 2000 # 1 helper + 2 unhelper <br>
n_n_adverse = 2000 # 2 helper + 1 unhelper <br>
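
The implementation of `get_simple_dataset` is not shown in this notebook, so the following is only a sketch of one way the six categories could be turned into sequences. It assumes that the signal tokens are mixed with a random number of noise tokens and that sequences are left-padded with `<pad>`. `COMPOSITION`, `make_sequence`, and the token names are hypothetical, and label assignment is left out.

```python
import random

PAD = "<pad>"

# Number of signal tokens drawn from each token group, per category.
COMPOSITION = {
    "n_ppp_adverse": {"adverse_tokens": 1, "adverse_helper_tokens": 1},
    "n_pp_adverse": {"adverse_tokens": 1},
    "n_p_adverse": {"adverse_helper_tokens": 3},
    "n_nnn_adverse": {"adverse_unhelper_tokens": 3},
    "n_nn_adverse": {"adverse_helper_tokens": 1, "adverse_unhelper_tokens": 2},
    "n_n_adverse": {"adverse_helper_tokens": 2, "adverse_unhelper_tokens": 1},
}


def make_sequence(category, tokens, seq_len, rng):
    """Build one left-padded sequence for the given category (assumed scheme)."""
    signal = []
    for group, count in COMPOSITION[category].items():
        signal += rng.choices(tokens[group], k=count)
    # Assumption: the amount of noise is drawn uniformly at random.
    n_noise = rng.randint(0, seq_len - len(signal))
    events = signal + rng.choices(tokens["noise_tokens"], k=n_noise)
    rng.shuffle(events)
    return [PAD] * (seq_len - len(events)) + events


# Placeholder vocabulary, not the contents of tokens.yaml.
example_tokens = {
    "adverse_tokens": ["adverse_1", "adverse_2"],
    "adverse_helper_tokens": ["helper_1", "helper_2"],
    "adverse_unhelper_tokens": ["unhelper_1", "unhelper_2"],
    "noise_tokens": ["noise_1", "noise_2", "noise_3"],
}

rng = random.Random(0)
seq = make_sequence("n_nn_adverse", example_tokens, seq_len=30, rng=rng)
print(len(seq))  # 30
```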

In [104]:
train_count_dict = {
    "n_ppp_adverse": TRAIN_NROWS,
    "n_pp_adverse": TRAIN_NROWS,
    "n_p_adverse": TRAIN_NROWS,
    "n_nnn_adverse": TRAIN_NROWS,
    "n_nn_adverse": TRAIN_NROWS,
    "n_n_adverse": TRAIN_NROWS,
}

val_count_dict = {
    "n_ppp_adverse": VAL_NROWS,
    "n_pp_adverse": VAL_NROWS,
    "n_p_adverse": VAL_NROWS,
    "n_nnn_adverse": VAL_NROWS,
    "n_nn_adverse": VAL_NROWS,
    "n_n_adverse": VAL_NROWS,
}

test_count_dict = {
    "n_ppp_adverse": TEST_NROWS,
    "n_pp_adverse": TEST_NROWS,
    "n_p_adverse": TEST_NROWS,
    "n_nnn_adverse": TEST_NROWS,
    "n_nn_adverse": TEST_NROWS,
    "n_n_adverse": TEST_NROWS,
}
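
The three dictionaries above differ only in the row count, so they could also be built from a single list of category keys. A minimal sketch, where `CATEGORY_KEYS`, `make_count_dict`, and `train_counts` are names introduced here for illustration:

```python
CATEGORY_KEYS = [
    "n_ppp_adverse",
    "n_pp_adverse",
    "n_p_adverse",
    "n_nnn_adverse",
    "n_nn_adverse",
    "n_n_adverse",
]


def make_count_dict(n_rows):
    """Assign the same number of rows to every category."""
    return dict.fromkeys(CATEGORY_KEYS, n_rows)


train_counts = make_count_dict(3000)
print(sum(train_counts.values()))  # 18000
```

Six categories at 3000 rows each give 18000 rows in total, which matches the training set shape printed by the next cell.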

In [105]:
train_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=train_count_dict,
    tokens=tokens,
)

val_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=val_count_dict,
    tokens=tokens,
)

test_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=test_count_dict,
    tokens=tokens,
)

dataset: (18000, 33)
ratio:
1    0.500833
0    0.499167
Name: label, dtype: float64

dataset: (6000, 33)
ratio:
0    0.503667
1    0.496333
Name: label, dtype: float64

dataset: (6000, 33)
ratio:
1    0.503833
0    0.496167
Name: label, dtype: float64
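
The printed ratios can be converted back to row counts to see how far each split is from an exact 50/50 balance. This is plain arithmetic on the numbers above; `label_counts` is a helper name introduced for the example.

```python
def label_counts(n_rows, positive_ratio):
    """Recover (positive, negative) row counts from a rounded ratio."""
    positives = round(n_rows * positive_ratio)
    return positives, n_rows - positives


print(label_counts(18000, 0.500833))  # train: (9015, 8985)
print(label_counts(6000, 0.496333))   # val:   (2978, 3022)
print(label_counts(6000, 0.503833))   # test:  (3023, 2977)
```

All three splits are close to balanced but not exactly so.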



In [106]:
save_csv(train_simple_data, TRAIN_FP)
save_csv(val_simple_data, VAL_FP)
save_csv(test_simple_data, TEST_FP)
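
Because the output paths are nested under `data/<SEQ_LEN>/<EXP_NUMBER>/`, the parent directory has to exist before a file can be written there. Whether `save_csv` takes care of this is not shown here. As a sketch of a guard that could be run beforehand, with `ensure_parent_dir` being a hypothetical helper:

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if it is missing."""
    parent = os.path.dirname(file_path)
    if parent:  # an empty string means the current directory
        os.makedirs(parent, exist_ok=True)
    return parent


# Demonstrated in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "data", "30", "10", "train.csv")
    created = ensure_parent_dir(target)
    dir_exists = os.path.isdir(created)
print(dir_exists)  # True
```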

In [107]:
df = pd.read_csv(TRAIN_FP)
print(df.shape)
df.head()

(18000, 33)


| Unnamed: 0 | index | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | ... | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | label | patient_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1068 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | peanut_allergy_N | cardiac_rehab_U | hay_fever_N | cold_sore_N | foot_pain_N | foot_pain_N | quad_injury_N | annual_physical_N | 0 | N5UE5PRL3K |
| 1 | 1490 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | headache_N | hay_fever_N | headache_N | cold_sore_N | dental_exam_N | peanut_allergy_N | ankle_sprain_N | dental_exam_N | 1 | QZX2RVSPU3 |
| 2 | 1575 | `<pad>` | `<pad>` | backache_N | annual_physical_N | cut_finger_N | backache_N | cut_finger_N | ingrown_nail_N | headache_N | ... | peanut_allergy_N | eye_exam_N | eye_exam_N | dental_exam_N | ACE_inhibitors_U | headache_N | headache_N | dental_exam_N | 1 | QZDARC3OF1 |
| 3 | 1759 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | ACE_inhibitors_U | normal_bmi_U | ACE_inhibitors_U | cold_sore_N | cut_finger_N | foot_pain_N | ACL_tear_N | eye_exam_N | 0 | CCOYO5RILI |
| 4 | 2184 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | tachycardia_H | backache_N | quad_injury_N | ... | annual_physical_N | peanut_allergy_N | ankle_sprain_N | headache_N | ingrown_nail_N | ankle_sprain_N | backache_N | backache_N | 1 | ITA6KU5O0A |
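
When inspecting rows like these, it can help to drop the padding and count tokens by suffix. The sketch below assumes the suffixes `_N`, `_H`, and `_U` correspond to the noise, helper, and unhelper categories, which the token group names suggest but the notebook does not state. `strip_padding` and `suffix_counts` are hypothetical helpers, and the example row holds only the last eight visible tokens of row 0 plus a few pads, not the full sequence.

```python
from collections import Counter

PAD = "<pad>"


def strip_padding(row):
    """Remove padding tokens, keeping the events in their original order."""
    return [token for token in row if token != PAD]


def suffix_counts(events):
    """Count events by the suffix after the last underscore."""
    return Counter(token.rsplit("_", 1)[-1] for token in events)


# Partial row: three pads plus the visible columns 7..0 of row 0.
partial_row = [PAD] * 3 + [
    "peanut_allergy_N",
    "cardiac_rehab_U",
    "hay_fever_N",
    "cold_sore_N",
    "foot_pain_N",
    "foot_pain_N",
    "quad_injury_N",
    "annual_physical_N",
]

events = strip_padding(partial_row)
print(len(events))            # 8
print(suffix_counts(events))  # Counter({'N': 7, 'U': 1})
```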
