## Synthetic dataset generation
Author: Lin Lee Cheong <br>
Date: 12/12/2020 <br>

The goal of this notebook is to generate synthetic datasets that help us understand how different relationships between tokens affect attention, SHAP, and other interpretability methods. The factors of interest are:
- sequence length (30, 300, or 900 events)
- spacing between two or more coupled events, i.e. cases where the order of the sequence matters
- amount of noise, i.e. the trade-off between performance and interpretability
- size of the vocabulary space

In [1]:
import yaml
import string
import os
import pandas as pd

from utils import get_simple_dataset, load_tokens, save_csv

In [2]:
%load_ext autoreload

%autoreload 2

In [3]:
TOKEN_NAMES_FP = './tokens.yaml'
TRAIN_FP = 'data/train.csv'
VAL_FP = 'data/val.csv'
TEST_FP = 'data/test.csv'

UID_COLNAME = 'patient_id'

TRAIN_NROWS = 3000
VAL_NROWS = 1000
TEST_NROWS = 1000

UID_LEN = 10
SEQ_LEN = 30

In [4]:
# Load the token groups from the YAML file and report the size of each group
tokens = load_tokens(TOKEN_NAMES_FP)
for key, group in tokens.items():
    print(f"{key}: {len(group)} tokens")

adverse_tokens: 4 tokens
adverse_helper_tokens: 6 tokens
adverse_unhelper_tokens: 5 tokens
noise_tokens: 15 tokens


### Simple dataset

The simple dataset contains six groups of sequences, three positive and three negative:
- positive set: (+++, 1 adverse + 1 helper), (++, 1 adverse), (+, 3 helper)
- negative set: (---, 3 unhelper), (--, 1 helper + 2 unhelper), (-, 2 helper + 1 unhelper)


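**NOTES**<br>
Each count is the number of sequences generated for that group. The value 2000 is illustrative; this notebook sets the counts per split through `TRAIN_NROWS`, `VAL_NROWS` and `TEST_NROWS`. <br>
n_ppp_adverse = 2000 # 1 adverse event + 1 helper event <br>
n_pp_adverse = 2000 # 1 adverse event <br>
n_p_adverse = 2000 # 3 helper events <br><br>
n_nnn_adverse = 2000 # 3 unhelper events <br>
n_nn_adverse = 2000 # 1 helper + 2 unhelper <br>
n_n_adverse = 2000 # 2 helper + 1 unhelper <br>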
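
To make the group definitions concrete, here is a minimal sketch of how a single sequence could be assembled from the four token groups. It is not the implementation of `get_simple_dataset` in `utils`: the helper `make_sequence`, the table `GROUP_SPEC`, the random number of noise events, the shuffling, and the placeholder vocabulary are all assumptions made for illustration. Only the per-group token counts come from the notes.

```python
import random

PAD = "<pad>"

# (adverse, helper, unhelper) token counts per group, as listed in the notes.
GROUP_SPEC = {
    "n_ppp_adverse": (1, 1, 0),
    "n_pp_adverse": (1, 0, 0),
    "n_p_adverse": (0, 3, 0),
    "n_nnn_adverse": (0, 0, 3),
    "n_nn_adverse": (0, 1, 2),
    "n_n_adverse": (0, 2, 1),
}


def make_sequence(group, tokens, seq_len, rng):
    """Build one front-padded sequence for `group` (hypothetical helper)."""
    n_adv, n_help, n_unhelp = GROUP_SPEC[group]
    signal = (
        rng.choices(tokens["adverse_tokens"], k=n_adv)
        + rng.choices(tokens["adverse_helper_tokens"], k=n_help)
        + rng.choices(tokens["adverse_unhelper_tokens"], k=n_unhelp)
    )
    # Assumption: the number of noise events is drawn uniformly at random.
    n_noise = rng.randint(0, seq_len - len(signal))
    events = signal + rng.choices(tokens["noise_tokens"], k=n_noise)
    # Assumption: signal tokens are scattered among the noise tokens.
    rng.shuffle(events)
    # Pad at the front so that every sequence has exactly seq_len entries.
    return [PAD] * (seq_len - len(events)) + events


# Placeholder vocabulary: the names and group sizes are illustrative only.
demo_tokens = {
    "adverse_tokens": ["adv_a", "adv_b"],
    "adverse_helper_tokens": ["help_a", "help_b"],
    "adverse_unhelper_tokens": ["unhelp_a", "unhelp_b"],
    "noise_tokens": ["noise_a", "noise_b", "noise_c"],
}

seq = make_sequence("n_nn_adverse", demo_tokens, seq_len=30, rng=random.Random(0))
print(len(seq), seq[-5:])
```

Passing a seeded `random.Random` instance keeps the sketch reproducible without touching the global random state.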

In [5]:
train_count_dict = {
    'n_ppp_adverse': TRAIN_NROWS,
    'n_pp_adverse': TRAIN_NROWS,
    'n_p_adverse': TRAIN_NROWS,
    'n_nnn_adverse': TRAIN_NROWS,
    'n_nn_adverse': TRAIN_NROWS,
    'n_n_adverse': TRAIN_NROWS
}

val_count_dict = {
    'n_ppp_adverse': VAL_NROWS,
    'n_pp_adverse': VAL_NROWS,
    'n_p_adverse': VAL_NROWS,
    'n_nnn_adverse': VAL_NROWS,
    'n_nn_adverse': VAL_NROWS,
    'n_n_adverse': VAL_NROWS
}

test_count_dict = {
    'n_ppp_adverse': TEST_NROWS,
    'n_pp_adverse': TEST_NROWS,
    'n_p_adverse': TEST_NROWS,
    'n_nnn_adverse': TEST_NROWS,
    'n_nn_adverse': TEST_NROWS,
    'n_n_adverse': TEST_NROWS
}
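
Because the three dictionaries share the same keys and differ only in the row count, they could also be built with a small helper. The sketch below is one possible way to do that; `GROUP_KEYS` and `make_count_dict` are hypothetical names that do not exist in `utils`, and the constants are repeated here only so the block runs on its own.

```python
TRAIN_NROWS = 3000
VAL_NROWS = 1000
TEST_NROWS = 1000

# The six sequence groups used by the simple dataset.
GROUP_KEYS = [
    "n_ppp_adverse",
    "n_pp_adverse",
    "n_p_adverse",
    "n_nnn_adverse",
    "n_nn_adverse",
    "n_n_adverse",
]


def make_count_dict(n_rows):
    """Return a count dict that assigns `n_rows` sequences to every group."""
    return {key: n_rows for key in GROUP_KEYS}


train_count_dict = make_count_dict(TRAIN_NROWS)
val_count_dict = make_count_dict(VAL_NROWS)
test_count_dict = make_count_dict(TEST_NROWS)

print(sum(train_count_dict.values()))  # 18000
```

Writing the dictionaries out by hand, as the cell above does, makes it easier to give individual groups different sizes later.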

In [6]:
train_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN, uid_len=UID_LEN, uid_colname=UID_COLNAME, count_dict=train_count_dict, tokens=tokens)

val_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN, uid_len=UID_LEN, uid_colname=UID_COLNAME, count_dict=val_count_dict, tokens=tokens)

test_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN, uid_len=UID_LEN, uid_colname=UID_COLNAME, count_dict=test_count_dict, tokens=tokens)

dataset: (18000, 33)
ratio:
1    0.501
0    0.499
Name: label, dtype: float64

dataset: (6000, 33)
ratio:
1    0.5035
0    0.4965
Name: label, dtype: float64

dataset: (6000, 33)
ratio:
0    0.500167
1    0.499833
Name: label, dtype: float64
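
The printed shapes can be checked by hand: each split has six groups with the same number of rows, and each row has one column per sequence position plus a few extra columns. The sketch below reproduces that arithmetic. That the three extra columns are `index`, `label` and the uid column is an assumption inferred from the column header of the saved file, not something confirmed by `utils`.

```python
SEQ_LEN = 30
N_GROUPS = 6
ROWS_PER_GROUP = {"train": 3000, "val": 1000, "test": 1000}

# Assumption: the extra columns are `index`, `label` and the uid column.
N_EXTRA_COLS = 3

expected_shapes = {
    split: (N_GROUPS * n_rows, SEQ_LEN + N_EXTRA_COLS)
    for split, n_rows in ROWS_PER_GROUP.items()
}

for split, shape in expected_shapes.items():
    print(split, shape)
# train (18000, 33)
# val (6000, 33)
# test (6000, 33)
```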



In [7]:
save_csv(train_simple_data, TRAIN_FP)
save_csv(val_simple_data, VAL_FP)
save_csv(test_simple_data, TEST_FP)

In [8]:
df = pd.read_csv(TRAIN_FP)
print(df.shape)
df.head()

(18000, 33)


| | index | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | ... | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | label | patient_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 244 | `<pad>` | `<pad>` | `<pad>` | normal_bmi | dental_exam | dental_exam | backache | eye_exam | quad_injury | ... | myopia | peanut_allergy | backache | cold_sore | ACE_inhibitors | eye_exam | foot_pain | headache | 0 | 4X1LVMG7ZT |
| 1 | 540 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | quad_injury | annual_physical | headache | eye_exam | resistent_hyp | cold_sore | high_creatinine | furosemide | 1 | JQXIG7JVAO |
| 2 | 2778 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | cold_sore | normal_bmi | ACL_tear | headache | ... | annual_physical | ingrown_nail | headache | quad_injury | eye_exam | cardiac_rehab | ankle_sprain | low_salt_diet | 0 | 4OKFQV743F |
| 3 | 1631 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | dental_exam | dental_exam | eye_exam | headache | ACL_tear | furosemide | pneumonia | ACE_inhibitors | 0 | F53WVKKBHR |
| 4 | 261 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | peanut_allergy | backache | cold_sore | ingrown_nail | myopia | quad_injury | cold_sore | eye_exam | 1 | 1DUEQUAA9R |
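
In the preview, the position columns are named `29` down to `0` and shorter sequences are filled with `<pad>` in the higher-numbered columns. A minimal sketch of recovering the actual events of one row is shown below. `strip_padding` is a hypothetical helper, not part of `utils`, and the toy row uses only five positions to keep the example short.

```python
PAD = "<pad>"


def strip_padding(row, seq_len):
    """Return the non-pad tokens of `row`, read from column seq_len-1 down to 0."""
    ordered = [row[str(pos)] for pos in range(seq_len - 1, -1, -1)]
    return [tok for tok in ordered if tok != PAD]


# Toy row with five positions; the column names are strings, as in a CSV header.
toy_row = {
    "4": PAD,
    "3": PAD,
    "2": "eye_exam",
    "1": "foot_pain",
    "0": "headache",
    "label": 0,
}

events = strip_padding(toy_row, seq_len=5)
print(events)  # ['eye_exam', 'foot_pain', 'headache']
```

The same function should also accept a row of `df`, since `pd.read_csv` yields string column names; whether column `0` holds the most recent event is not stated in this notebook and is left open here.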
