## Synthetic dataset generation -- Sequence based
**Author: Lin Lee Cheong <br>
Updated by: Tesfagabir Meharizghi<br>
Date created: 12/12/2020 <br>
Date updated: 02/18/2021 <br>**

The goal of this synthetic dataset is to help understand how different relationships between tokens affect attention, SHAP, and other interpretability measures. The dataset design varies:
- sequence length in events (30, 300)
- spacing between two or more coupled events, i.e. the order of the sequence matters
- amount of noise, i.e. performance vs. interpretability
- vocabulary space

### Sequence dataset

The positive label is driven by a sequence of tokens.
- The probability of a positive label is given by the following formula:
``` min(1.0, math.exp(-(a * ta)) + math.exp(-(h * th)) - math.exp(-(u * tu))) ```
Where:
- `a` is a decay constant for `_A` events. The larger it is, the faster the contribution of `_A` events to the positive label shrinks as the event moves away from the end of the sequence.
- `h` is a decay constant for `_H` events. The larger it is, the faster the contribution of `_H` events to the positive label shrinks as the event moves away from the end of the sequence.
- `u` is a decay constant for `_U` events. The larger it is, the faster the (subtracted) contribution of `_U` events shrinks as the event moves away from the end of the sequence.

- `ta` is the position of the `_A` event counted from the end of the sequence (0 for the last event), rescaled to a base sequence length of 30.
- `th` is the position of the `_H` event counted from the end of the sequence, rescaled in the same way.
- `tu` is the position of the `_U` event counted from the end of the sequence, rescaled in the same way.

Note:
- Every patient has exactly one `_A`, one `_H` and one `_U` event.
- Since `_U` events counteract the adverse event, their contribution is subtracted.
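
As a worked example of the formula above, the following sketch evaluates it for a few hand-picked positions. The constants `a = 0.1`, `h = 0.5` and `u = 0.95` are the ones set in `get_proba` later in this notebook; the function name `label_probability` and the example positions are assumptions made purely for illustration.

```python
import math


def label_probability(ta, th, tu, a=0.1, h=0.5, u=0.95):
    """Evaluate the positive-label formula for positions counted from the end."""
    return min(1.0, math.exp(-(a * ta)) + math.exp(-(h * th)) - math.exp(-(u * tu)))


# _A and _H at the very end, _U far back: the value is capped at 1.0
p_high = label_probability(ta=0, th=0, tu=29)

# _A and _H further back, _U recent: the subtracted term pulls the value down
p_low = label_probability(ta=10, th=5, tu=2)

# _U at the very end, _A and _H far back: the raw value drops below zero
p_neg = label_probability(ta=29, th=29, tu=0)

print(p_high, round(p_low, 4), round(p_neg, 4))
```

The formula only caps the value at 1.0, so it can fall below zero when the `_U` event is much closer to the end than the `_A` and `_H` events. How such values are turned into labels depends on `get_label` in `utils`, which is not shown here.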

In [495]:
%load_ext lab_black

%load_ext autoreload

%autoreload 2

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [496]:
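import yaml
import string
import os
import random  # used below for sampling tokens, positions and shuffling
import numpy as np
import pandas as pd
import math

# utils provides load_tokens, get_tokens, get_label, get_uid and save_csv
from utils import *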


In [498]:
TOKEN_NAMES_FP = "./tokens_v2.yaml"

SEQ_LEN = 300

TRAIN_FP = "data/seq_final_v3/{}/train.csv".format(SEQ_LEN)
VAL_FP = "data/seq_final_v3/{}/val.csv".format(SEQ_LEN)
TEST_FP = "data/seq_final_v3/{}/test.csv".format(SEQ_LEN)

UID_COLNAME = "patient_id"

TRAIN_NROWS = 4000
VAL_NROWS = 2000
TEST_NROWS = 2000

UID_LEN = 10

# Total patients in each split (will be balanced)
TOTAL_TRAIN = 18000
TOTAL_VAL = 6000
TOTAL_TEST = 6000

In [499]:
# Load tokens from yaml file path
tokens = load_tokens(TOKEN_NAMES_FP)
for key in tokens.keys():
    print(f"{key}: {len(tokens[key])} tokens")

adverse_tokens: 10 tokens
adverse_helper_tokens: 10 tokens
adverse_unhelper_tokens: 10 tokens
noise_tokens: 15 tokens


In [500]:
for key, tok in tokens.items():
    print(key)
    print(tok)
    print("-" * 50)

adverse_tokens
['Acute_Myocardial_Infarction_A', 'hypertension_A', 'arrhythmia_A', 'congestive_heart_failure_A', 'heart_valve_failure_A', 'pulmonary_embolism_A', 'ventricular_aneurysm_A', 'ventricular_hypertrophy_A', 'cardiomyopathy_A', 'Chronic_Obstructive_Pulmonary_Disease_A']
--------------------------------------------------
adverse_helper_tokens
['sleep_apnea_H', 'pneumonia_H', 'coronary_artery_disease_H', 'edema_H', 'troponin_H', 'Brain_Natriuretic_Peptide_H', 'alchoholism_H', 'metabolic_disorder_H', 'elevated_creatinine_H', 'electrolyte_imbalance_H']
--------------------------------------------------
adverse_unhelper_tokens
['Percutaneous_Coronary_Intervention_U', 'electrical_cardioversion_U', 'catheter_ablation_U', 'pacemaker_U', 'cardiac_rehab_U', 'sleep_apnea_treatment_U', 'ACE_inhibitors_U', 'ARB_U', 'diuretics_U', 'beta_blockers_U']
--------------------------------------------------
noise_tokens
['eye_exam_N', 'annual_physical_N', 'hay_fever_N', 'headache_N', 'foot_pain_N',

Total number of observations

In [501]:
# x = TRAIN_NROWS
# total = x * 6
# pos_lab = x * 0.99 + x * 0.8 + x * 0.6 + x * 0.4 + x * 0.2 + x * 0.01
# neg_lab = total - pos_lab
# print(f"#pos: {pos_lab}, #neg: {neg_lab}")

In [502]:
# key --> ordering of the adverse (A), helper (H), and unhelper (U) events
# value --> number of rows to generate for that ordering
train_count_dict = {
    "UHA": TRAIN_NROWS,
    "UAH": TRAIN_NROWS,
    "HUA": TRAIN_NROWS,
    "AUH": TRAIN_NROWS,
    "HAU": TRAIN_NROWS,
    "AHU": TRAIN_NROWS,
}

val_count_dict = {
    "UHA": VAL_NROWS,
    "UAH": VAL_NROWS,
    "HUA": VAL_NROWS,
    "AUH": VAL_NROWS,
    "HAU": VAL_NROWS,
    "AHU": VAL_NROWS,
}

test_count_dict = {
    "UHA": TEST_NROWS,
    "UAH": TEST_NROWS,
    "HUA": TEST_NROWS,
    "AUH": TEST_NROWS,
    "HAU": TEST_NROWS,
    "AHU": TEST_NROWS,
}
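
The three dictionaries above list all six orderings of `U`, `H` and `A` by hand. A minimal sketch of building the same kind of mapping with `itertools.permutations` is shown below; the helper name `make_count_dict` is hypothetical and not part of the notebook.

```python
from itertools import permutations


def make_count_dict(n_rows, events="UHA"):
    """Map every ordering of the event abbreviations to the same row count."""
    return {"".join(order): n_rows for order in permutations(events)}


example_count_dict = make_count_dict(4000)
print(example_count_dict)
```

Generating the keys this way avoids missing or duplicating an ordering if more event types are added later.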

In [503]:
# Mappings of the token groups with the abbreviation
token_mappings = {
    "A": "adverse_tokens",
    "H": "adverse_helper_tokens",
    "U": "adverse_unhelper_tokens",
}
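
To illustrate how `token_mappings` is used, the sketch below turns an ordering key such as `"UHA"` into one concrete token per event, in that order, mirroring the selection loop in `get_a_sequence_seq_v2`. The small `demo_tokens` dictionary and the function name `pick_key_tokens` are assumptions for this example; the real token lists come from `tokens_v2.yaml`.

```python
import random

demo_token_mappings = {
    "A": "adverse_tokens",
    "H": "adverse_helper_tokens",
    "U": "adverse_unhelper_tokens",
}

# Tiny stand-in for the loaded token dictionary (illustrative subset only)
demo_tokens = {
    "adverse_tokens": ["hypertension_A", "arrhythmia_A"],
    "adverse_helper_tokens": ["sleep_apnea_H", "pneumonia_H"],
    "adverse_unhelper_tokens": ["pacemaker_U", "diuretics_U"],
}


def pick_key_tokens(ordering, tokens, token_mappings):
    """Pick one random token per abbreviation, preserving the given ordering."""
    return [random.choice(tokens[token_mappings[key]]) for key in ordering]


picked = pick_key_tokens("UHA", demo_tokens, demo_token_mappings)
print(picked)
```

The specific tokens vary from run to run, but their suffixes always follow the ordering key: `_U`, then `_H`, then `_A`.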

In [504]:
def downsample(df0, label, total):
    """Sample total / 2 rows from each class so the result is balanced."""
    df = df0.copy()
    df_c0 = df[df[label] == 0]
    df_c1 = df[df[label] == 1]

    df_c0 = df_c0.sample(int(total / 2))
    df_c1 = df_c1.sample(int(total / 2))

    df = pd.concat([df_c0, df_c1], axis=0)
    df = df.sample(frac=1)
    return df


def get_proba(seq, base_seq_len=30):
    """Get probability of being positive label for a sequence."""

    def get_position(seq, substring, base_seq_len):
        """Get position of event with substring from end of sequence"""
        pos = -1
        for i, event in enumerate(seq):
            if event.endswith(substring):
                pos = i
                break
        if pos == -1:
            raise ValueError(f"Error! {substring} not found!")

        # Convert the index from the start into a distance from the end
        pos = len(seq) - pos - 1
        return pos

    # These constants must live in get_proba's scope (not after the
    # return inside get_position), otherwise they are never defined.
    a = 0.1  # Constant for Adverse
    h = 0.5  # Constant for helper
    u = 0.95  # Constant for unhelper

    #     a = 0.03  # Constant for Adverse
    #     h = 0.05  # Constant for helper
    #     u = 0.09  # Constant for unhelper

    # Rescale positions so sequences of any length map onto base_seq_len
    seq_len = len(seq)
    multiplier = float(base_seq_len) / seq_len
    ta = get_position(seq, "_A", base_seq_len) * multiplier
    th = get_position(seq, "_H", base_seq_len) * multiplier
    tu = get_position(seq, "_U", base_seq_len) * multiplier

    prob = min(1.0, math.exp(-(a * ta)) + math.exp(-(h * th)) - math.exp(-(u * tu)))
    prob = round(prob, 4)
    return prob


def get_a_sequence_seq_v2(seq_len, label, tokens, token_mappings, seq_tokens):
    """Create one left-padded sequence with the key events in the given order.

    Returns a flat list: the seq_len events, followed by the
    positive-label probability and the sampled label."""
    n_seq_tokens = len(seq_tokens)
    n_noise = (
        np.max(
            (
                10,
                random.choices(range(n_seq_tokens, seq_len), k=1)[0],
            )
        )
        - (n_seq_tokens)
    )
    sel_positions = sorted(random.sample(range(n_noise), k=n_seq_tokens))
    sel_tokens = []
    for key in seq_tokens:
        key_mapping = token_mappings[key]
        sel_tokens.append(random.choices(tokens[key_mapping])[0])
    sel_tokens = list(zip(sel_positions, sel_tokens))
    sel_noise = get_tokens(seq_len, tokens, "noise_tokens", n_noise)

    for idx, event in sel_tokens:
        sel_noise.insert(idx, event)

    sel_noise = ["<pad>"] * (seq_len - len(sel_noise)) + sel_noise

    # Get probability of being positive label
    proba = get_proba(sel_noise)
    # sel_noise.reverse()
    sim_lab = get_label(proba, target=label)

    sequence = sel_noise + [proba] + [sim_lab]

    return sequence


def get_sequences_v2(
    seq_len,
    label,
    uid_len,
    uid_colname,
    tokens,
    token_mappings,
    seq_tokens,
    n_seq,
):
    """Get multiple sequences."""

    sequences = [
        get_a_sequence_seq_v2(seq_len, label, tokens, token_mappings, seq_tokens)
        + [get_uid(uid_len)]
        for _ in range(n_seq)
    ]
    # print(f"seq based events generated")

    seq_df = pd.DataFrame(sequences)
    seq_df.columns = [str(x) for x in range(seq_len - 1, -1, -1)] + [
        "proba",
        "label",
        uid_colname,
    ]

    return seq_df


def get_sequence_dataset(
    seq_len, uid_len, uid_colname, count_dict, tokens, token_mappings, total_rows
):
    """Generate a balanced synthetic sequence dataset.

    Args:
    -----
        seq_len (int) : length of the generated sequence
        uid_len (int) : length of uid token
        uid_colname (str) : name of uid column, usually patient_id
        count_dict (dict) : number of sequences to generate per event ordering.
            Keys are orderings of the A, H and U events:
                UHA, UAH, HUA, AUH, HAU, AHU
        tokens (dict) : dictionary of the various token types
        token_mappings (dict) : maps the abbreviations A, H, U to keys of tokens
        total_rows (int) : number of rows kept after downsampling
                           (half per class)

    Returns:
    --------
        dataset (dataframe) : dataframe containing all the
                              generated sequences, randomly mixed

    """
    label = 1
    cat_lst = []
    for seq_tokens, n_seq in count_dict.items():
        df = get_sequences_v2(
            seq_len,
            label,
            uid_len,
            uid_colname,
            tokens,
            token_mappings,
            seq_tokens,
            n_seq,
        )

        df["seq_event"] = seq_tokens
        cat_lst.append(df.copy())
    dataset = pd.concat(cat_lst, axis=0)
    dataset.reset_index(inplace=True)
    indexes = [idx for idx in range(dataset.shape[0])]
    random.shuffle(indexes)
    dataset = dataset.iloc[indexes, :]
    # dataset = dataset.sample(frac=1).reset_index(drop=True)

    dataset = downsample(dataset, "label", total_rows)
    print(f"dataset: {dataset.shape}")
    print(f"ratio:\n{dataset.label.value_counts(normalize=True)}\n")

    return dataset
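
`get_proba` rescales event positions by `base_seq_len / len(seq)`, so a padded 300-event sequence is mapped onto the same range as a 30-event sequence. The sketch below reproduces just that step on a toy sequence; `scaled_position` is a hypothetical helper written for this example and is not one of the notebook's functions.

```python
def scaled_position(seq, suffix, base_seq_len=30):
    """Distance of the first matching event from the end, rescaled to base_seq_len."""
    idx = next(i for i, event in enumerate(seq) if event.endswith(suffix))
    distance_from_end = len(seq) - idx - 1
    return distance_from_end * base_seq_len / len(seq)


# 300-long, left-padded toy sequence with the _A event 50 steps before the last event
toy_seq = (
    ["<pad>"] * 200
    + ["headache_N"] * 49
    + ["hypertension_A"]
    + ["headache_N"] * 50
)

print(scaled_position(toy_seq, "_A"))
```

Here the `_A` event sits 50 steps from the end of a 300-event sequence, which becomes 5.0 on the base scale of 30. Note that padding counts toward `len(seq)`, so the multiplier depends on `seq_len` rather than on the number of real events.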

In [505]:
print(f"Train Data Imbalance for seq_len={SEQ_LEN}...")
df_train = get_sequence_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=train_count_dict,
    tokens=tokens,
    token_mappings=token_mappings,
    total_rows=TOTAL_TRAIN,
)

print(f"Val Data Imbalance for seq_len={SEQ_LEN}...")
df_val = get_sequence_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=val_count_dict,
    tokens=tokens,
    token_mappings=token_mappings,
    total_rows=TOTAL_VAL,
)

print(f"Test Data Imbalance for seq_len={SEQ_LEN}...")
df_test = get_sequence_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=test_count_dict,
    tokens=tokens,
    token_mappings=token_mappings,
    total_rows=TOTAL_TEST,
)

Train Data Imbalance for seq_len=300...
dataset: (18000, 305)
ratio:
1    0.5
0    0.5
Name: label, dtype: float64

Val Data Imbalance for seq_len=300...
dataset: (6000, 305)
ratio:
1    0.5
0    0.5
Name: label, dtype: float64

Test Data Imbalance for seq_len=300...
dataset: (6000, 305)
ratio:
1    0.5
0    0.5
Name: label, dtype: float64



In [506]:
print(df_train.shape)
# df_train.sort_values("proba")[::-1]
df_train.head()

(18000, 305)


Unnamed: 0,index,299,298,297,296,295,294,293,292,291,...,5,4,3,2,1,0,proba,label,patient_id,seq_event
8126,126,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,hay_fever_N,ingrown_nail_N,headache_N,annual_physical_N,quad_injury_N,myopia_N,0.771,0,0KK1UGRDLC,HUA
17036,1036,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,headache_N,cold_sore_N,annual_physical_N,eye_exam_N,backache_N,quad_injury_N,0.6367,0,MIU7M8JI0L,HAU
3286,3286,<pad>,<pad>,<pad>,cut_finger_N,ACL_tear_N,cold_sore_N,headache_N,cold_sore_N,backache_N,...,foot_pain_N,headache_N,ingrown_nail_N,hay_fever_N,cut_finger_N,cut_finger_N,0.2039,0,WFGDTDTDY5,UHA
418,418,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,ingrown_nail_N,hay_fever_N,eye_exam_N,myopia_N,quad_injury_N,ankle_sprain_N,1.0,1,C4T1HF7ISF,UHA
13981,1981,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,ACL_tear_N,cut_finger_N,cut_finger_N,hay_fever_N,backache_N,eye_exam_N,0.3799,0,AL3WPUN3O9,AUH


In [507]:
df_train[df_train["seq_event"] == "UHA"]["label"].value_counts()

1    1907
0     869
Name: label, dtype: int64

In [508]:
# df_train[df_train["seq_event"] == "UHA"].iloc[0]

In [509]:
df_train.seq_event.value_counts()

AHU    3290
HAU    3238
AUH    2981
HUA    2865
UAH    2850
UHA    2776
Name: seq_event, dtype: int64

In [510]:
save_csv(df_train, TRAIN_FP)
save_csv(df_val, VAL_FP)
save_csv(df_test, TEST_FP)

In [511]:
df = pd.read_csv(TRAIN_FP)
print(df.shape)
df.head()

(18000, 305)


Unnamed: 0,index,299,298,297,296,295,294,293,292,291,...,5,4,3,2,1,0,proba,label,patient_id,seq_event
0,126,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,hay_fever_N,ingrown_nail_N,headache_N,annual_physical_N,quad_injury_N,myopia_N,0.771,0,0KK1UGRDLC,HUA
1,1036,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,headache_N,cold_sore_N,annual_physical_N,eye_exam_N,backache_N,quad_injury_N,0.6367,0,MIU7M8JI0L,HAU
2,3286,<pad>,<pad>,<pad>,cut_finger_N,ACL_tear_N,cold_sore_N,headache_N,cold_sore_N,backache_N,...,foot_pain_N,headache_N,ingrown_nail_N,hay_fever_N,cut_finger_N,cut_finger_N,0.2039,0,WFGDTDTDY5,UHA
3,418,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,ingrown_nail_N,hay_fever_N,eye_exam_N,myopia_N,quad_injury_N,ankle_sprain_N,1.0,1,C4T1HF7ISF,UHA
4,1981,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,ACL_tear_N,cut_finger_N,cut_finger_N,hay_fever_N,backache_N,eye_exam_N,0.3799,0,AL3WPUN3O9,AUH


In [512]:
df.label.value_counts(normalize=True)

1    0.5
0    0.5
Name: label, dtype: float64