## Synthetic dataset generation
Author: Lin Lee Cheong <br>
Date: 12/12/2020 <br>

This notebook generates synthetic datasets for studying how different relationships between tokens affect attention, SHAP, and other interpretability measures. The datasets vary along the following dimensions:
- sequence length in events (30, 300, 900)
- spacing between two or more coupled events, i.e. cases where the order of the sequence matters
- amount of noise, i.e. the trade-off between performance and interpretability
- vocabulary space

In [11]:
#! pip install nb-black

#! pip install botocore==1.12.201

#! pip install shap
#! pip install xgboost

In [12]:
%load_ext lab_black

%load_ext autoreload

%autoreload 2

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
import yaml
import string
import os
import pandas as pd

from utils import *



In [15]:
TOKEN_NAMES_FP = "./tokens_v2.yaml"

SEQ_LEN = 300

TRAIN_FP = f"data/event_final_v2/{SEQ_LEN}/train.csv"
VAL_FP = f"data/event_final_v2/{SEQ_LEN}/val.csv"
TEST_FP = f"data/event_final_v2/{SEQ_LEN}/test.csv"

UID_COLNAME = "patient_id"

TRAIN_NROWS = 3000
VAL_NROWS = 1000
TEST_NROWS = 1000

UID_LEN = 10

In [16]:
# Load tokens from yaml file path
tokens = load_tokens(TOKEN_NAMES_FP)
for key in tokens.keys():
    print(f"{key}: {len(tokens[key])} tokens")

adverse_tokens: 10 tokens
adverse_helper_tokens: 10 tokens
adverse_unhelper_tokens: 10 tokens
noise_tokens: 15 tokens


### Simple dataset

The simple dataset is built from six sequence categories, three per class:
- positive set: (+++, 1 adverse (major) event + 1 helper), (++, 1 adverse event), (+, 3 helpers)
- negative set: (---, 3 unhelpers), (--, 1 helper + 2 unhelpers), (-, 2 helpers + 1 unhelper)


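**NOTES**<br>
Each key below sets the number of sequences generated for one category. The value 2000 is illustrative; the cells below use `TRAIN_NROWS`, `VAL_NROWS` and `TEST_NROWS`. <br><br>
n_ppp_adverse = 2000 # 1 adverse event + 1 helper event <br>
n_pp_adverse = 2000 # 1 adverse event <br>
n_p_adverse = 2000 # 3 helper events <br><br>
n_nnn_adverse = 2000 # 3 unhelper events <br>
n_nn_adverse = 2000 # 1 helper + 2 unhelper events <br>
n_n_adverse = 2000 # 2 helper + 1 unhelper events <br>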
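
To make the category recipes above concrete, here is a minimal sketch of how one sequence per category could be assembled. It is not the `get_simple_dataset` implementation from `utils`, which is not shown in this notebook. The token pools, the `RECIPES` table, the `make_sequence` helper, and the choices to left-pad and to shuffle signal tokens in with noise are all assumptions made for illustration.

```python
import random

PAD = "<pad>"

# Hypothetical token pools; the real ones are loaded from tokens_v2.yaml.
TOKENS = {
    "adverse_tokens": ["adverse_A", "adverse_B"],
    "adverse_helper_tokens": ["helper_A", "helper_B", "helper_C"],
    "adverse_unhelper_tokens": ["unhelper_A", "unhelper_B", "unhelper_C"],
    "noise_tokens": ["noise_A", "noise_B", "noise_C", "noise_D"],
}

# (n_adverse, n_helper, n_unhelper, label) per category, following the notes above.
RECIPES = {
    "n_ppp_adverse": (1, 1, 0, 1),
    "n_pp_adverse": (1, 0, 0, 1),
    "n_p_adverse": (0, 3, 0, 1),
    "n_nnn_adverse": (0, 0, 3, 0),
    "n_nn_adverse": (0, 1, 2, 0),
    "n_n_adverse": (0, 2, 1, 0),
}


def make_sequence(category, seq_len, n_noise, rng):
    """Build one left-padded sequence and its label for the given category."""
    n_adv, n_help, n_unhelp, label = RECIPES[category]
    events = (
        rng.choices(TOKENS["adverse_tokens"], k=n_adv)
        + rng.choices(TOKENS["adverse_helper_tokens"], k=n_help)
        + rng.choices(TOKENS["adverse_unhelper_tokens"], k=n_unhelp)
        + rng.choices(TOKENS["noise_tokens"], k=n_noise)
    )
    if len(events) > seq_len:
        raise ValueError("seq_len is too short for the requested events")
    # Assumption: signal tokens land at random positions among the noise.
    rng.shuffle(events)
    padded = [PAD] * (seq_len - len(events)) + events
    return padded, label


rng = random.Random(0)
seq, label = make_sequence("n_ppp_adverse", seq_len=30, n_noise=5, rng=rng)
print(label, seq[-7:])
```

With equal counts per category, a deterministic recipe like this one would give a label ratio of exactly 0.5. The ratios printed further down are close to but not exactly 0.5, so the actual generator evidently differs from this sketch in some respect.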

In [17]:
train_count_dict = {
    "n_ppp_adverse": TRAIN_NROWS,
    "n_pp_adverse": TRAIN_NROWS,
    "n_p_adverse": TRAIN_NROWS,
    "n_nnn_adverse": TRAIN_NROWS,
    "n_nn_adverse": TRAIN_NROWS,
    "n_n_adverse": TRAIN_NROWS,
}

val_count_dict = {
    "n_ppp_adverse": VAL_NROWS,
    "n_pp_adverse": VAL_NROWS,
    "n_p_adverse": VAL_NROWS,
    "n_nnn_adverse": VAL_NROWS,
    "n_nn_adverse": VAL_NROWS,
    "n_n_adverse": VAL_NROWS,
}

test_count_dict = {
    "n_ppp_adverse": TEST_NROWS,
    "n_pp_adverse": TEST_NROWS,
    "n_p_adverse": TEST_NROWS,
    "n_nnn_adverse": TEST_NROWS,
    "n_nn_adverse": TEST_NROWS,
    "n_n_adverse": TEST_NROWS,
}

In [18]:
train_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=train_count_dict,
    tokens=tokens,
)

val_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=val_count_dict,
    tokens=tokens,
)

test_simple_data = get_simple_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=test_count_dict,
    tokens=tokens,
)

dataset: (18000, 303)
ratio:
1    0.502667
0    0.497333
Name: label, dtype: float64

dataset: (6000, 303)
ratio:
1    0.507167
0    0.492833
Name: label, dtype: float64

dataset: (6000, 303)
ratio:
1    0.503667
0    0.496333
Name: label, dtype: float64
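
The shapes printed above can be checked with a little arithmetic. The row count is the sum of the per-category counts. The column count is `SEQ_LEN` plus three extra columns, which are assumed here to be `index`, `label` and `patient_id`, as they appear in the preview of the saved file. The helper name `expected_shape` is hypothetical.

```python
def expected_shape(count_dict, seq_len, n_extra_cols=3):
    """Return (rows, columns) expected for a dataset built from count_dict."""
    n_rows = sum(count_dict.values())
    # Assumption: the extra columns are index, label and patient_id.
    return n_rows, seq_len + n_extra_cols


categories = [
    "n_ppp_adverse",
    "n_pp_adverse",
    "n_p_adverse",
    "n_nnn_adverse",
    "n_nn_adverse",
    "n_n_adverse",
]
train_counts = {name: 3000 for name in categories}
val_counts = {name: 1000 for name in categories}

print(expected_shape(train_counts, seq_len=300))  # (18000, 303)
print(expected_shape(val_counts, seq_len=300))  # (6000, 303)
```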



In [19]:
save_csv(train_simple_data, TRAIN_FP)
save_csv(val_simple_data, VAL_FP)
save_csv(test_simple_data, TEST_FP)

In [20]:
df = pd.read_csv(TRAIN_FP)
print(df.shape)
df.head()

(18000, 303)


Unnamed: 0,index,299,298,297,296,295,294,293,292,291,...,7,6,5,4,3,2,1,0,label,patient_id
0,2808,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,dental_exam_N,ankle_sprain_N,cold_sore_N,backache_N,myopia_N,ankle_sprain_N,headache_N,foot_pain_N,1,ERVCYRZXH1
1,2065,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,quad_injury_N,cut_finger_N,dental_exam_N,ACL_tear_N,peanut_allergy_N,cut_finger_N,cut_finger_N,ingrown_nail_N,0,BT2Z13OBJF
2,25,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,ACL_tear_N,headache_N,dental_exam_N,cold_sore_N,cut_finger_N,foot_pain_N,cut_finger_N,quad_injury_N,0,2ATVRLS02I
3,2202,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,cut_finger_N,cut_finger_N,myopia_N,annual_physical_N,ingrown_nail_N,dental_exam_N,annual_physical_N,cut_finger_N,0,0WF3PEIBZS
4,1800,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,quad_injury_N,foot_pain_N,ingrown_nail_N,dental_exam_N,ankle_sprain_N,quad_injury_N,cold_sore_N,peanut_allergy_N,0,8MBXJX6H1M
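
In the preview above, each row holds one sequence spread across columns `299` down to `0`, with `<pad>` filling the unused positions on the left. The following sketch recovers the non-padding tokens of one row. The helper `row_to_tokens` is hypothetical and is shown on a toy dictionary rather than on the real data. It relies on the column names being strings, which is how `pd.read_csv` parses a header row.

```python
PAD = "<pad>"


def row_to_tokens(row, seq_len, pad=PAD):
    """Return the non-padding tokens of one row, in displayed column order."""
    cols = [str(i) for i in range(seq_len - 1, -1, -1)]
    return [row[c] for c in cols if row[c] != pad]


# Toy row with seq_len=5, mimicking the layout of the saved CSV.
toy_row = {
    "4": PAD,
    "3": PAD,
    "2": "headache_N",
    "1": "myopia_N",
    "0": "foot_pain_N",
    "label": 1,
}
print(row_to_tokens(toy_row, seq_len=5))
# ['headache_N', 'myopia_N', 'foot_pain_N']
```

Because the function only indexes by column name, it should also accept a row taken from the data frame, e.g. `row_to_tokens(df.iloc[0], SEQ_LEN)`. Whether column `0` holds the most recent or the oldest event is not stated in this notebook.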
