## Synthetic dataset generation
Author: Lin Lee Cheong <br>
Date: 12/12/2020 <br>

The goal of this notebook is to create synthetic datasets that help us understand how different relationships between tokens affect attention, SHAP, and other interpretability methods. The factors varied are:
- sequence length, in events (30, 300, 900)
- spacing between 2+ coupled events, i.e. the order of the sequence matters
- amount of noise, i.e. performance vs interpretability
- vocabulary space

In [12]:
import yaml
import string
import os
import pandas as pd

from utils import *

In [13]:
%load_ext autoreload

%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [26]:
TOKEN_NAMES_FP = './tokens.yaml'

SEQ_LEN = 900  # sequence length; other lengths considered: 30, 300

TRAIN_FP = 'data/train_seq{}.csv'.format(SEQ_LEN)
VAL_FP = 'data/val_seq{}.csv'.format(SEQ_LEN)
TEST_FP = 'data/test_seq{}.csv'.format(SEQ_LEN)

UID_COLNAME = 'patient_id'

TRAIN_NROWS = 3000
VAL_NROWS = 1000
TEST_NROWS = 1000

UID_LEN = 10
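
The `patient_id` values in the generated data are `UID_LEN`-character strings of uppercase letters and digits (for example `XCR8CS7TK3`). As a minimal sketch of how such identifiers could be produced, assuming uniform random sampling: `make_uid` and `make_unique_uids` are hypothetical helpers, not functions from `utils`, and the real implementation may differ.

```python
import random
import string

def make_uid(uid_len, rng):
    """Return a random ID made of uppercase letters and digits."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choices(alphabet, k=uid_len))

def make_unique_uids(n, uid_len, rng):
    """Draw IDs until n distinct ones have been collected."""
    uids = set()
    while len(uids) < n:
        uids.add(make_uid(uid_len, rng))
    return sorted(uids)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
uids = make_unique_uids(5, 10, rng)
print(uids)
```

Collecting into a set guards against the (unlikely) case of two rows receiving the same identifier.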


In [27]:
# Load the token lists from the YAML file
tokens = load_tokens(TOKEN_NAMES_FP)
for key in tokens.keys():
    print(f"{key}: {len(tokens[key])} tokens")

adverse_tokens: 4 tokens
adverse_helper_tokens: 6 tokens
adverse_unhelper_tokens: 5 tokens
noise_tokens: 15 tokens


### Simple dataset

The simple dataset is built from six categories of sequences, defined by the signal tokens each sequence contains:
- positive set: (+++, 1 adverse + 1 helper), (++, 1 adverse), (+, 3 helper)
- negative set: (---, 3 unhelper), (--, 1 helper + 2 unhelper), (-, 2 helper + 1 unhelper)


**NOTES** (the value 2000 is illustrative; the counts actually used per split are `TRAIN_NROWS`, `VAL_NROWS` and `TEST_NROWS`)<br>
n_ppp_adverse = 2000 # 1 adverse event + 1 helper event <br>
n_pp_adverse = 2000 # 1 adverse event <br>
n_p_adverse = 2000 # 3 helper events <br><br>
n_nnn_adverse = 2000 # 3 unhelper events <br>
n_nn_adverse = 2000 # 1 helper + 2 unhelper <br>
n_n_adverse = 2000 # 2 helper + 1 unhelper <br>
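
As a rough illustration of the six recipes above, the sketch below builds a single token sequence: it draws noise tokens, overwrites a few random positions with the signal tokens, and left-pads the result to `seq_len` with `<pad>`. This is not the `get_simple_dataset` implementation from `utils`. The function name `build_sequence`, the placeholder token pools, and the placement strategy are all assumptions made for illustration, and label assignment is deliberately left out.

```python
import random

PAD = "<pad>"

# Hypothetical token pools; the real ones are loaded from tokens.yaml.
ADVERSE = ["adverse_a", "adverse_b"]
HELPER = ["helper_a", "helper_b", "helper_c"]
UNHELPER = ["unhelper_a", "unhelper_b", "unhelper_c"]
NOISE = ["noise_a", "noise_b", "noise_c", "noise_d"]

# Number of (adverse, helper, unhelper) tokens per recipe, as listed above.
RECIPES = {
    "n_ppp_adverse": (1, 1, 0),
    "n_pp_adverse": (1, 0, 0),
    "n_p_adverse": (0, 3, 0),
    "n_nnn_adverse": (0, 0, 3),
    "n_nn_adverse": (0, 1, 2),
    "n_n_adverse": (0, 2, 1),
}

def build_sequence(recipe, seq_len, n_events, rng):
    """Return seq_len tokens: left padding followed by n_events events."""
    n_adv, n_help, n_unhelp = RECIPES[recipe]
    signal = (
        rng.choices(ADVERSE, k=n_adv)
        + rng.choices(HELPER, k=n_help)
        + rng.choices(UNHELPER, k=n_unhelp)
    )
    if not len(signal) <= n_events <= seq_len:
        raise ValueError("n_events must fit the signal tokens and seq_len")
    events = rng.choices(NOISE, k=n_events)
    # Overwrite distinct random positions with the signal tokens.
    for pos, tok in zip(rng.sample(range(n_events), len(signal)), signal):
        events[pos] = tok
    return [PAD] * (seq_len - n_events) + events

rng = random.Random(0)  # fixed seed so the sketch is reproducible
seq = build_sequence("n_nn_adverse", seq_len=30, n_events=12, rng=rng)
print(seq)
```

Padding on the left mirrors the saved CSV, where `<pad>` fills the high-numbered columns and the events sit in the columns closest to 0.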

In [28]:
train_count_dict = {
    'n_ppp_adverse': TRAIN_NROWS,
    'n_pp_adverse': TRAIN_NROWS,
    'n_p_adverse': TRAIN_NROWS,
    'n_nnn_adverse': TRAIN_NROWS,
    'n_nn_adverse': TRAIN_NROWS,
    'n_n_adverse': TRAIN_NROWS
}

val_count_dict = {
    'n_ppp_adverse': VAL_NROWS,
    'n_pp_adverse': VAL_NROWS,
    'n_p_adverse': VAL_NROWS,
    'n_nnn_adverse': VAL_NROWS,
    'n_nn_adverse': VAL_NROWS,
    'n_n_adverse': VAL_NROWS
}

test_count_dict = {
    'n_ppp_adverse': TEST_NROWS,
    'n_pp_adverse': TEST_NROWS,
    'n_p_adverse': TEST_NROWS,
    'n_nnn_adverse': TEST_NROWS,
    'n_nn_adverse': TEST_NROWS,
    'n_n_adverse': TEST_NROWS
}
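
The three dictionaries differ only in the per-category row count, so they could equally be produced by a small helper. The sketch below assumes every category always receives the same count; `make_count_dict` is a hypothetical name and not part of `utils`.

```python
CATEGORIES = [
    "n_ppp_adverse", "n_pp_adverse", "n_p_adverse",
    "n_nnn_adverse", "n_nn_adverse", "n_n_adverse",
]

def make_count_dict(n_rows):
    """Give every category the same number of rows."""
    return {name: n_rows for name in CATEGORIES}

train_counts = make_count_dict(3000)
val_counts = make_count_dict(1000)

# Six categories per split: 6 x 3000 = 18000 and 6 x 1000 = 6000 rows.
print(sum(train_counts.values()), sum(val_counts.values()))
```

These totals agree with the dataset shapes reported for the train, validation and test splits (18,000, 6,000 and 6,000 rows).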

In [29]:
train_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN, uid_len=UID_LEN, uid_colname=UID_COLNAME, count_dict=train_count_dict, tokens=tokens)

val_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN, uid_len=UID_LEN, uid_colname=UID_COLNAME, count_dict=val_count_dict, tokens=tokens)

test_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN, uid_len=UID_LEN, uid_colname=UID_COLNAME, count_dict=test_count_dict, tokens=tokens)

dataset: (18000, 903)
ratio:
0    0.501111
1    0.498889
Name: label, dtype: float64

dataset: (6000, 903)
ratio:
1    0.502333
0    0.497667
Name: label, dtype: float64

dataset: (6000, 903)
ratio:
1    0.503667
0    0.496333
Name: label, dtype: float64
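
The `ratio` output is the normalized frequency of each `label` value. For the training set, 0.501111 of 18,000 rows corresponds to 9,020 rows with label 0 and 8,980 rows with label 1. A minimal standard-library sketch of the same calculation follows; `label_ratio` is a hypothetical helper, and in pandas the equivalent is `df['label'].value_counts(normalize=True)`.

```python
from collections import Counter

def label_ratio(labels):
    """Return the fraction of rows carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Reproduce the training-set ratio from the row counts derived above.
train_labels = [0] * 9020 + [1] * 8980
ratios = label_ratio(train_labels)
print({label: round(r, 6) for label, r in ratios.items()})
```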



In [30]:
save_csv(train_simple_data, TRAIN_FP)
save_csv(val_simple_data, VAL_FP)
save_csv(test_simple_data, TEST_FP)

In [31]:
df = pd.read_csv(TRAIN_FP)
print(df.shape)
df.head()

(18000, 903)


| | index | 899 | 898 | 897 | 896 | 895 | 894 | 893 | 892 | 891 | ... | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | label | patient_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1881 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | cold_sore | ... | eye_exam | headache | cold_sore | backache | ACL_tear | ankle_sprain | ankle_sprain | tachycardia | 0 | XCR8CS7TK3 |
| 1 | 883 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | quad_injury | peanut_allergy | ACL_tear | myopia | cut_finger | ingrown_nail | CHF | apnea | 1 | ELGHNAU6ES |
| 2 | 1674 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | quad_injury | headache | dental_exam | foot_pain | headache | dental_exam | ACL_tear | eye_exam | 0 | 96NBVNP9W4 |
| 3 | 597 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | eye_exam | myopia | peanut_allergy | hay_fever | myopia | annual_physical | headache | headache | 1 | VYDH3LX0RJ |
| 4 | 321 | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | `<pad>` | ... | quad_injury | cut_finger | headache | ACL_tear | annual_physical | quad_injury | headache | dental_exam | 1 | X2FESVIS2O |
