## Synthetic dataset generation -- Sequence based
**Author: Lin Lee Cheong <br>
Updated by: Tesfagabir Meharizghi<br>
Date created: 12/12/2020 <br>
Date updated: 02/18/2021 <br>**

The goal of this notebook is to generate synthetic datasets that help us understand how different relationships between tokens affect attention, SHAP, and other interpretability measures. The datasets vary in:
- sequence length (30 or 300 events)
- spacing between 2+ coupled events, i.e. the order of the sequence matters
- amount of noise, i.e. performance vs interpretability
- vocabulary space

### Sequence dataset

The positive label is driven by a sequence of tokens.
- The positive-label probability is given by the following formula:
`min(1.0, math.exp(-(a * ta)) + math.exp(-(h * th)) - math.exp(-(u * tu)))`
Where:
- `a` is a constant for `_A` events. It acts as a decay rate: the larger `a` is, the smaller the contribution of an `_A` event at a given position to the positive label.
- `h` is the corresponding constant for `_H` events.
- `u` is the corresponding constant for `_U` events.

- `ta` is the position of the `_A` event, counted from the end of the sequence.
- `th` is the position of the `_H` event, counted from the end of the sequence.
- `tu` is the position of the `_U` event, counted from the end of the sequence.

Note:
- Each patient has exactly one `_A`, one `_H`, and one `_U` event.
- Because `_U` events act against the adverse event, their contribution is subtracted.
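
As a rough illustration of the formula above, here is a minimal sketch in Python. The function name `positive_label_proba` is hypothetical, and the default constants are an assumption borrowed from the commented-out `get_proba` helper further down in this notebook. Note that only the upper bound is clamped, so the value can drop below zero when the `_U` event is very close to the end of the sequence.

```python
import math


def positive_label_proba(ta, th, tu, a=0.1, h=0.5, u=0.95):
    """Positive-label probability from event positions counted from the end.

    The defaults for a, h and u are assumptions taken from the
    commented-out get_proba helper in this notebook.
    """
    raw = math.exp(-(a * ta)) + math.exp(-(h * th)) - math.exp(-(u * tu))
    # Only the upper bound is clamped, exactly as in the formula above.
    return min(1.0, raw)


# _A and _H at the very end, _U far from the end: capped at 1.0.
print(positive_label_proba(ta=0, th=0, tu=29))
# _U at the very end, _A and _H far away: the value drops below zero.
print(positive_label_proba(ta=29, th=29, tu=0))
```

A caller that needs a proper probability would probably also clamp the result at 0.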

In [28]:
%load_ext lab_black

%load_ext autoreload

%autoreload 2

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [29]:
import yaml
import string
import os
import numpy as np
import pandas as pd
import math
import random
import json

from utils import *

In [31]:
TOKEN_NAMES_FP = "./tokens_v2.yaml"

SEQ_LEN = 300

TRAIN_FP = "data/final_final/raw/{}/train.json".format(SEQ_LEN)
VAL_FP = "data/final_final/raw/{}/val.json".format(SEQ_LEN)
TEST_FP = "data/final_final/raw/{}/test.json".format(SEQ_LEN)

TRAIN_FP_DF0 = "data/final_final/event_based/{}/train_org0.csv".format(SEQ_LEN)
VAL_FP_DF0 = "data/final_final/event_based/{}/val_org0.csv".format(SEQ_LEN)
TEST_FP_DF0 = "data/final_final/event_based/{}/test_org0.csv".format(SEQ_LEN)

TRAIN_FP_DF = "data/final_final/event_based/{}/train_orig.csv".format(SEQ_LEN)
VAL_FP_DF = "data/final_final/event_based/{}/val_orig.csv".format(SEQ_LEN)
TEST_FP_DF = "data/final_final/event_based/{}/test_orig.csv".format(SEQ_LEN)

UID_COLNAME = "patient_id"

TRAIN_NROWS = 4000
VAL_NROWS = 2000
TEST_NROWS = 2000

UID_LEN = 10

# Total patients in each split (will be balanced)
TOTAL_TRAIN = 18000
TOTAL_VAL = 6000
TOTAL_TEST = 6000

In [32]:
# df_train = pd.read_csv(TRAIN_FP_DF0)
# df_val = pd.read_csv(VAL_FP_DF0)
# df_test = pd.read_csv(TEST_FP_DF0)

In [33]:
# df_train.head()

In [34]:
# def reverse_events(row):
#     seq_len = 30
#     columns = [str(i) for i in range(seq_len - 1, -1, -1)]
#     row2 = row[columns].tolist()
#     row2.reverse()
#     row[columns] = row2[:]
#     return row.tolist()


# # Train data
# columns = df_train.columns
# results = df_train.apply(reverse_events, axis=1)
# results = [np.array(res) for res in results]
# df = pd.DataFrame(np.array(results), columns=columns)
# # df.sample(frac=1)
# save_csv(df, TRAIN_FP_DF)

# # Val data
# columns = df_val.columns
# results = df_val.apply(reverse_events, axis=1)
# results = [np.array(res) for res in results]
# df = pd.DataFrame(np.array(results), columns=columns)
# # df.sample(frac=1)
# save_csv(df, VAL_FP_DF)

# # Test data
# columns = df_test.columns
# results = df_test.apply(reverse_events, axis=1)
# results = [np.array(res) for res in results]
# df = pd.DataFrame(np.array(results), columns=columns)
# # df.sample(frac=1)
# save_csv(df, TEST_FP_DF)

In [35]:
# Load tokens from yaml file path
tokens = load_tokens(TOKEN_NAMES_FP)
for key in tokens.keys():
    print(f"{key}: {len(tokens[key])} tokens")

adverse_tokens: 10 tokens
adverse_helper_tokens: 10 tokens
adverse_unhelper_tokens: 10 tokens
noise_tokens: 15 tokens


In [36]:
# for key, tok in tokens.items():
#     print(key)
#     print(tok)
#     print("-" * 50)

Probability of a positive label for each sequence type

* 90%
    * 2 adverse + 1 helper (`AAH`)
* 80%
    * 1 adverse + 2 helper (`AHH`)
* 70%
    * 1 adverse + 1 helper (`AH`)
* 40%
    * 1 helper + 1 unhelper (`HU`)
* 30%
    * 1 adverse + 2 unhelper (`AUU`)
* 20%
    * 1 helper + 2 unhelper (`HUU`)
* 10%
    * 2 unhelper (`UU`)
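
The percentages above are turned into 0/1 labels by `get_label`, which lives in `utils` and is not shown in this notebook. The following is only a sketch of what a per-sequence random draw could look like; `sample_label` is a hypothetical stand-in, not the actual helper.

```python
import random


def sample_label(proba, rng):
    """Return 1 with probability `proba`, else 0 (hypothetical helper)."""
    return 1 if rng.random() < proba else 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
labels = [sample_label(0.9, rng) for _ in range(10_000)]
positive_rate = sum(labels) / len(labels)
print(f"positive rate for proba=0.9: {positive_rate:.3f}")
```

With enough sequences per type, the observed positive rate should land close to the listed percentage.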

In [37]:
with open(TRAIN_FP, "r") as fp:
    json_train = json.load(fp)

with open(VAL_FP, "r") as fp:
    json_val = json.load(fp)

with open(TEST_FP, "r") as fp:
    json_test = json.load(fp)

In [38]:
len(json_val["AAH"])

1000

In [39]:
TRAIN_COUNTS = 3000
VAL_COUNTS = 1000
TEST_COUNTS = 1000

TRAIN_COUNT_DICT = {
    "AAH": [0.9, TRAIN_COUNTS],
    "AHH": [0.8, TRAIN_COUNTS],
    "AH": [0.7, TRAIN_COUNTS],
    "HU": [0.4, TRAIN_COUNTS],
    "AUU": [0.3, TRAIN_COUNTS],
    "HUU": [0.2, TRAIN_COUNTS],
    "UU": [0.1, TRAIN_COUNTS],
}

VAL_COUNT_DICT = {
    "AAH": [0.9, VAL_COUNTS],
    "AHH": [0.8, VAL_COUNTS],
    "AH": [0.7, VAL_COUNTS],
    "HU": [0.4, VAL_COUNTS],
    "AUU": [0.3, VAL_COUNTS],
    "HUU": [0.2, VAL_COUNTS],
    "UU": [0.1, VAL_COUNTS],
}

TEST_COUNT_DICT = {
    "AAH": [0.9, TEST_COUNTS],
    "AHH": [0.8, TEST_COUNTS],
    "AH": [0.7, TEST_COUNTS],
    "HU": [0.4, TEST_COUNTS],
    "AUU": [0.3, TEST_COUNTS],
    "HUU": [0.2, TEST_COUNTS],
    "UU": [0.1, TEST_COUNTS],
}
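
The three count dictionaries above differ only in the number of sequences per type, so they could be built from one probability table. The sketch below shows one way to do that; `SEQ_TYPE_PROBAS`, `build_count_dict` and `expected_positive_rate` are hypothetical names, not part of `utils`.

```python
# Positive-label probability per sequence type, as listed in the cell above.
SEQ_TYPE_PROBAS = {
    "AAH": 0.9, "AHH": 0.8, "AH": 0.7, "HU": 0.4,
    "AUU": 0.3, "HUU": 0.2, "UU": 0.1,
}


def build_count_dict(n_seq, probas=SEQ_TYPE_PROBAS):
    """Map each sequence type to [proba, n_seq] (hypothetical helper)."""
    return {seq_type: [proba, n_seq] for seq_type, proba in probas.items()}


def expected_positive_rate(count_dict):
    """Expected share of positive labels if each label is drawn with `proba`."""
    total = sum(n_seq for _, n_seq in count_dict.values())
    return sum(proba * n_seq for proba, n_seq in count_dict.values()) / total


train_count_dict = build_count_dict(3000)
print(train_count_dict["AAH"])
print(round(expected_positive_rate(train_count_dict), 4))
```

Because every type has the same number of sequences, the expected positive rate is simply the mean of the seven probabilities, 3.4 / 7, or roughly 0.486. That is in line with the label ratios printed when the datasets are generated below.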

In [40]:
# Mappings of the token groups with the abbreviation
TOKEN_MAPPINGS = {
    "A": "adverse_tokens",
    "H": "adverse_helper_tokens",
    "U": "adverse_unhelper_tokens",
}

In [41]:
# def downsample(df0, label, total):
#     """Downsample the dataset to make it balanced class."""
#     df = df0.copy()
#     df_c0 = df[df[label] == 0]
#     df_c1 = df[df[label] == 1]

#     df_c0 = df_c0.sample(int(total / 2))
#     df_c1 = df_c1.sample(int(total / 2))

#     df = pd.concat([df_c0, df_c1], axis=0)
#     df = df.sample(frac=1)
#     return df


# def get_proba(seq, base_seq_len=30):
#     """Get probability of being positive label for a sequence."""

#     def get_position(seq, substring, base_seq_len):
#         """Get position of event with substring from end of sequence"""
#         pos = -1
#         for i, event in enumerate(seq):
#             if event.endswith(substring):
#                 pos = i
#                 break
#         if pos == -1:
#             raise ValueError(f"Error! {substring} not found!")

#         pos = len(seq) - pos - 1
#         return pos

#     a = 0.1  # Constant for Adverse
#     h = 0.5  # Constant for helper
#     u = 0.95  # Constant for unhelper

#     # a = 0.03  # Constant for Adverse
#     # h = 0.05  # Constant for helper
#     # u = 0.09  # Constant for unhelper

#     seq_len = len(seq)
#     multiplier = float(base_seq_len) / seq_len
#     ta = get_position(seq, "_A", base_seq_len) * multiplier
#     th = get_position(seq, "_H", base_seq_len) * multiplier
#     tu = get_position(seq, "_U", base_seq_len) * multiplier

#     prob = min(1.0, math.exp(-(a * ta)) + math.exp(-(h * th)) - math.exp(-(u * tu)))
#     prob = round(prob, 4)
#     return prob


def get_a_sequence_seq_v2(
    seq_len, label, tokens, token_mappings, seq_tokens, proba, json_row=None
):
    """Create one sequence of seq_len tokens (left-padded with "<pad>"),
    followed by proba and the simulated label.
    Returns a flat list: tokens + [proba, label]."""
    n_seq_tokens = len(seq_tokens)

    min_n_noise = 10
    if json_row is not None:
        # {'39': 'electrolyte_imbalance_H', '45': 'hypertension_A', '137': 'cardiomyopathy_A'}
        row_inds = list(json_row.keys())
        row_tokens = list(json_row.values())
        row_inds = [int(indx) for indx in row_inds]
        min_n_noise = min(seq_len, max(row_inds) + 1)

    n_noise = (
        np.max(
            (
                min_n_noise,
                random.choices(range(n_seq_tokens, seq_len), k=1)[0],
            )
        )
        - (n_seq_tokens)
    )
    if json_row is None:
        sel_positions = sorted(random.sample(range(n_noise), k=n_seq_tokens))
        sel_tokens = []
        for key in seq_tokens:
            key_mapping = token_mappings[key]
            sel_tokens.append(random.choices(tokens[key_mapping])[0])

        # Randomize sequence
        random.shuffle(sel_tokens)

        sel_tokens = list(zip(sel_positions, sel_tokens))
    else:
        sel_tokens = list(zip(row_inds, row_tokens))

    sel_noise = get_tokens(seq_len, tokens, "noise_tokens", n_noise)

    for idx, event in sel_tokens:
        sel_noise.insert(idx, event)

    sel_noise = ["<pad>"] * (seq_len - len(sel_noise)) + sel_noise

    # Get probability of being positive label
    # sel_noise.reverse()
    sim_lab = get_label(proba, target=label)

    sequence = sel_noise + [proba] + [sim_lab]

    return sequence


def get_sequences_v2(
    seq_len,
    label,
    uid_len,
    uid_colname,
    tokens,
    token_mappings,
    seq_tokens,
    n_seq,
    proba,
    json_group=None,
):
    """Get multiple sequences."""

    if json_group is None:
        sequences = [
            get_a_sequence_seq_v2(
                seq_len, label, tokens, token_mappings, seq_tokens, proba, None
            )
            + [get_uid(uid_len)]
            for _ in range(n_seq)
        ]
    else:
        sequences = [
            get_a_sequence_seq_v2(
                seq_len, label, tokens, token_mappings, seq_tokens, proba, json_row
            )
            + [get_uid(uid_len)]
            for json_row in json_group
        ]
    # print(f"seq based events generated")

    seq_df = pd.DataFrame(sequences)
    seq_df.columns = [str(x) for x in range(seq_len - 1, -1, -1)] + [
        "proba",
        "label",
        uid_colname,
    ]

    return seq_df


def get_sequence_dataset(
    seq_len,
    uid_len,
    uid_colname,
    count_dict,
    tokens,
    token_mappings,
    total_rows,
    json_data=None,
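):
    """Generate a toy sequence dataset.

    Args:
    -----
        seq_len (int) : length of the generated sequence
        uid_len (int) : length of uid token
        uid_colname (str) : name of uid column, usually patient_id
        count_dict (dict) : maps each sequence type (e.g. "AAH") to
            [proba, n_seq], i.e. the positive-label probability and the
            number of sequences to generate for that type
        tokens (dict) : dictionary of the various token types
        token_mappings (dict) : maps each abbreviation ("A", "H", "U")
            to its token group in `tokens`
        total_rows (int) : currently unused (only needed by the
            commented-out downsampling step)
        json_data (dict, optional) : pre-generated event positions per
            sequence type; if given, sequences are built from it instead
            of being sampled, and the rows are not shuffled

    Returns:
    --------
        dataset (dataframe) : dataframe containing all the generated
                              sequences

    """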
    label = 1
    cat_lst = []
    for seq_tokens, (proba, n_seq) in count_dict.items():
        json_group = None
        if json_data is not None:
            json_group = json_data[seq_tokens]
        df = get_sequences_v2(
            seq_len,
            label,
            uid_len,
            uid_colname,
            tokens,
            token_mappings,
            seq_tokens,
            n_seq,
            proba,
            json_group,
        )

        df["seq_event"] = seq_tokens
        cat_lst.append(df.copy())
    dataset = pd.concat(cat_lst, axis=0)
    dataset.reset_index(inplace=True)
    indexes = list(range(dataset.shape[0]))
    # Shuffle only freshly sampled data; rows built from json_data keep their order
    if json_data is None:
        random.shuffle(indexes)
        dataset = dataset.iloc[indexes, :]
    # dataset = dataset.sample(frac=1).reset_index(drop=True)

    # dataset = downsample(dataset, "label", total_rows)
    print(f"dataset: {dataset.shape}")
    print(f"ratio:\n{dataset.label.value_counts(normalize=True)}\n")

    return dataset
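
To make the row layout produced by `get_a_sequence_seq_v2` easier to follow, here is a small stand-alone sketch of the same three steps: start from a list of noise tokens, insert each event at its index, then left-pad to `seq_len`. The function `layout_row` and the token names are made up for illustration, and the sketch assumes the event positions are given in ascending order.

```python
def layout_row(seq_len, noise, events, pad="<pad>"):
    """Insert events into a noise list by index, then left-pad to seq_len.

    `events` maps a position (as a string, like the JSON keys) to a token.
    """
    row = list(noise)
    for idx, token in events.items():
        row.insert(int(idx), token)
    return [pad] * (seq_len - len(row)) + row


# Made-up tokens, for illustration only.
noise = ["n0_N", "n1_N", "n2_N", "n3_N"]
events = {"1": "x_H", "3": "y_A"}
row = layout_row(8, noise, events)
columns = [str(i) for i in range(8 - 1, -1, -1)]
print(dict(zip(columns, row)))
```

Since the column names run from `seq_len - 1` down to `0`, the padding ends up in the highest-numbered columns and column `"0"` holds the last element of the list.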

In [42]:
print(f"Train Data Imbalance for seq_len={SEQ_LEN}...")
df_train = get_sequence_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=TRAIN_COUNT_DICT,
    tokens=tokens,
    token_mappings=TOKEN_MAPPINGS,
    total_rows=TRAIN_COUNTS,
    json_data=json_train,
)

print(f"Val Data Imbalance for seq_len={SEQ_LEN}...")
df_val = get_sequence_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=VAL_COUNT_DICT,
    tokens=tokens,
    token_mappings=TOKEN_MAPPINGS,
    total_rows=VAL_COUNTS,
    json_data=json_val,
)

print(f"Test Data Imbalance for seq_len={SEQ_LEN}...")
df_test = get_sequence_dataset(
    seq_len=SEQ_LEN,
    uid_len=UID_LEN,
    uid_colname=UID_COLNAME,
    count_dict=TEST_COUNT_DICT,
    tokens=tokens,
    token_mappings=TOKEN_MAPPINGS,
    total_rows=TEST_COUNTS,
    json_data=json_test,
)

Train Data Imbalance for seq_len=300...
dataset: (21000, 305)
ratio:
0    0.514286
1    0.485714
Name: label, dtype: float64

Val Data Imbalance for seq_len=300...
dataset: (7000, 305)
ratio:
0    0.523429
1    0.476571
Name: label, dtype: float64

Test Data Imbalance for seq_len=300...
dataset: (7000, 305)
ratio:
0    0.513
1    0.487
Name: label, dtype: float64



In [43]:
print(df_train.shape)
# df_train.sort_values("proba")[::-1]
df_train.head()

(21000, 305)


Unnamed: 0,index,299,298,297,296,295,294,293,292,291,...,5,4,3,2,1,0,proba,label,patient_id,seq_event
0,0,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,annual_physical_N,quad_injury_N,backache_N,backache_N,peanut_allergy_N,annual_physical_N,0.9,0,12DP15TQ9W,AAH
1,1,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,foot_pain_N,hay_fever_N,myopia_N,hay_fever_N,cold_sore_N,eye_exam_N,0.9,1,JET8LDGL4Y,AAH
2,2,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,peanut_allergy_N,peanut_allergy_N,ankle_sprain_N,ACL_tear_N,dental_exam_N,Acute_Myocardial_Infarction_A,0.9,1,GNS7WLNPUP,AAH
3,3,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,eye_exam_N,myopia_N,ingrown_nail_N,ingrown_nail_N,cut_finger_N,hypertension_A,0.9,1,IJ650PXV3S,AAH
4,4,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,dental_exam_N,backache_N,ACL_tear_N,hay_fever_N,eye_exam_N,arrhythmia_A,0.9,1,9H8H8D7V97,AAH


In [54]:
TRAIN_FP_DF

'data/final_final/event_based/300/train_orig.csv'

In [45]:
save_csv(df_train, TRAIN_FP_DF)
save_csv(df_val, VAL_FP_DF)
save_csv(df_test, TEST_FP_DF)

In [46]:
# For sequence-based dataset

In [106]:
SEQ_LEN = 300

# Input from event-based
TRAIN_FP_DF = "data/final_final/event_based/{}/train_orig.csv".format(SEQ_LEN)
VAL_FP_DF = "data/final_final/event_based/{}/val_orig.csv".format(SEQ_LEN)
TEST_FP_DF = "data/final_final/event_based/{}/test_orig.csv".format(SEQ_LEN)

# Labels from Seq-based
TRAIN_SEQ_LABELS_FP = "data/final_final/raw/{}/seq_based_labels/label_train.json".format(
    SEQ_LEN
)
VAL_SEQ_LABELS_FP = "data/final_final/raw/{}/seq_based_labels/label_val.json".format(SEQ_LEN)
TEST_SEQ_LABELS_FP = "data/final_final/raw/{}/seq_based_labels/label_test.json".format(SEQ_LEN)

# Output for seq-based from event-based
TRAIN_FP_DF2 = "data/final_final/seq_based/{}/train_orig.csv".format(SEQ_LEN)
VAL_FP_DF2 = "data/final_final/seq_based/{}/val_orig.csv".format(SEQ_LEN)
TEST_FP_DF2 = "data/final_final/seq_based/{}/test_orig.csv".format(SEQ_LEN)

In [107]:
df_train = pd.read_csv(TRAIN_FP_DF)
df_val = pd.read_csv(VAL_FP_DF)
df_test = pd.read_csv(TEST_FP_DF)

In [108]:
with open(TRAIN_SEQ_LABELS_FP, "r") as fp:
    json_train_seq = json.load(fp)

with open(VAL_SEQ_LABELS_FP, "r") as fp:
    json_val_seq = json.load(fp)

with open(TEST_SEQ_LABELS_FP, "r") as fp:
    json_test_seq = json.load(fp)

In [109]:
df_train.head()

Unnamed: 0,index,299,298,297,296,295,294,293,292,291,...,5,4,3,2,1,0,proba,label,patient_id,category
0,0,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,annual_physical_N,quad_injury_N,backache_N,backache_N,peanut_allergy_N,annual_physical_N,0.9,0,12DP15TQ9W,AAH
1,1,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,foot_pain_N,hay_fever_N,myopia_N,hay_fever_N,cold_sore_N,eye_exam_N,0.9,1,JET8LDGL4Y,AAH
2,2,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,peanut_allergy_N,peanut_allergy_N,ankle_sprain_N,ACL_tear_N,dental_exam_N,Acute_Myocardial_Infarction_A,0.9,1,GNS7WLNPUP,AAH
3,3,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,eye_exam_N,myopia_N,ingrown_nail_N,ingrown_nail_N,cut_finger_N,hypertension_A,0.9,1,IJ650PXV3S,AAH
4,4,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,dental_exam_N,backache_N,ACL_tear_N,hay_fever_N,eye_exam_N,arrhythmia_A,0.9,1,9H8H8D7V97,AAH


In [110]:
# Train Data
df = df_train.copy()
json_data = json_train_seq.copy()

all_labels = []
all_seq_events = []
for seq_event, labels in json_data.items():
    all_labels += labels
    all_seq_events += [seq_event] * len(labels)
df["label"] = all_labels[:]
df["seq_event"] = all_seq_events
print("Train", df.shape)
print(df.shape[0], sum(df["category"] == df["seq_event"]))
if "proba" in df.columns:
    del df["proba"]
del df["category"]

save_csv(df, TRAIN_FP_DF2)

# Val Data
df = df_val.copy()
json_data = json_val_seq.copy()

all_labels = []
all_seq_events = []
for seq_event, labels in json_data.items():
    all_labels += labels
    all_seq_events += [seq_event] * len(labels)
df["label"] = all_labels[:]
df["seq_event"] = all_seq_events
print("Val", df.shape)
print(df.shape[0], sum(df["category"] == df["seq_event"]))
if "proba" in df.columns:
    del df["proba"]
del df["category"]
save_csv(df, VAL_FP_DF2)

# Test Data
df = df_test.copy()
json_data = json_test_seq.copy()

all_labels = []
all_seq_events = []
for seq_event, labels in json_data.items():
    all_labels += labels
    all_seq_events += [seq_event] * len(labels)
df["label"] = all_labels[:]
df["seq_event"] = all_seq_events
print("Test", df.shape)
print(df.shape[0], sum(df["category"] == df["seq_event"]))
if "proba" in df.columns:
    del df["proba"]
del df["category"]
save_csv(df, TEST_FP_DF2)

Train (21000, 306)
21000 21000
Val (7000, 306)
7000 7000
Test (7000, 306)
7000 7000
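
The cell above repeats the same steps for each split and relies on the JSON label groups being in the same order as the rows of the dataframe. A possible refactoring, sketched below in plain Python, flattens the groups once and fails loudly if they do not line up. `flatten_labels` and `check_alignment` are hypothetical helpers, and the inputs are made up.

```python
def flatten_labels(label_groups):
    """Flatten {seq_event: [labels]} into two parallel lists."""
    labels, seq_events = [], []
    for seq_event, group_labels in label_groups.items():
        labels += group_labels
        seq_events += [seq_event] * len(group_labels)
    return labels, seq_events


def check_alignment(categories, seq_events):
    """Raise if the flattened groups do not line up with the rows."""
    if len(categories) != len(seq_events):
        raise ValueError(f"{len(categories)} rows but {len(seq_events)} labels")
    mismatches = sum(c != s for c, s in zip(categories, seq_events))
    if mismatches:
        raise ValueError(f"{mismatches} rows have a different sequence type")


# Made-up example; in the notebook these would come from the label JSON
# and from df["category"].tolist().
label_groups = {"AAH": [1, 0], "UU": [0]}
labels, seq_events = flatten_labels(label_groups)
check_alignment(["AAH", "AAH", "UU"], seq_events)
print(labels, seq_events)
```

The two returned lists can then be assigned to `df["label"]` and `df["seq_event"]`, once per split.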


In [111]:
df.head()

Unnamed: 0,index,299,298,297,296,295,294,293,292,291,...,6,5,4,3,2,1,0,label,patient_id,seq_event
0,0,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,cut_finger_N,eye_exam_N,dental_exam_N,backache_N,myopia_N,ankle_sprain_N,eye_exam_N,1,MEVFK5VO3K,AAH
1,1,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,annual_physical_N,peanut_allergy_N,hay_fever_N,myopia_N,myopia_N,cold_sore_N,hay_fever_N,1,D33UGF3XH6,AAH
2,2,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,dental_exam_N,Chronic_Obstructive_Pulmonary_Disease_A,Brain_Natriuretic_Peptide_H,quad_injury_N,ACL_tear_N,myopia_N,myopia_N,1,RS0NOAZ6KR,AAH
3,3,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,ankle_sprain_N,cut_finger_N,annual_physical_N,myopia_N,hay_fever_N,ingrown_nail_N,quad_injury_N,1,WCOHV9D8Q2,AAH
4,4,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,...,peanut_allergy_N,headache_N,quad_injury_N,ACL_tear_N,cold_sore_N,eye_exam_N,annual_physical_N,1,UPT4WL3MHA,AAH


In [114]:
TRAIN_FP_DF

'data/final_final/event_based/300/train_orig.csv'

In [112]:
print("Done")

Done


In [113]:
# Next
# SHUFFLE...

In [143]:
seq_lens = [30, 300]
dtypes = ["event_based", "seq_based"]
print("Shuffling Data...")
for SEQ_LEN in seq_lens:
    for dtype in dtypes:
        train_path_in = f"data/final_final/{dtype}/{SEQ_LEN}/train_orig.csv"
        val_path_in = f"data/final_final/{dtype}/{SEQ_LEN}/val_orig.csv"
        test_path_in = f"data/final_final/{dtype}/{SEQ_LEN}/test_orig.csv"

        train_path_out = f"data/final_final/{dtype}/{SEQ_LEN}/train.csv"
        val_path_out = f"data/final_final/{dtype}/{SEQ_LEN}/val.csv"
        test_path_out = f"data/final_final/{dtype}/{SEQ_LEN}/test.csv"

        df_train = pd.read_csv(train_path_in)
        df_val = pd.read_csv(val_path_in)
        df_test = pd.read_csv(test_path_in)

        df_train = df_train.sample(frac=1, random_state=42)
        df_val = df_val.sample(frac=1, random_state=42)
        df_test = df_test.sample(frac=1, random_state=42)

        save_csv(df_train, train_path_out)
        save_csv(df_val, val_path_out)
        save_csv(df_test, test_path_out)
print("SUCCESS!")

Shuffling Data...
SUCCESS!


In [32]:
def process(df0, count_dict, seq_len, output_path):
    """Extract the non-noise events of each row, grouped by sequence type,
    and save them as JSON (one dict of position -> token per row)."""
    print("Processing data...")
    feature_names = [str(i) for i in range(seq_len - 1, -1, -1)]

    data = {}
    for category, values in count_dict.items():
        data[category] = []
        df = df0[df0["seq_event"] == category]
        df = df[feature_names]
        n_rows = df.shape[0]
        for idx in range(n_rows):
            row = df.iloc[idx].tolist()
            row.reverse()
            row = dict(zip(range(seq_len), row))
            row2 = row.copy()
            for key, value in row.items():
                if value.endswith("_N") or value == "<pad>":
                    del row2[key]
            data[category].append(row2.copy())

    output_dir = os.path.dirname(output_path)
    os.makedirs(output_dir, exist_ok=True)

    with open(output_path, "w") as fp:
        json.dump(data, fp, indent=4)
    print("SUCCESS!")

In [33]:
# process(df_train, TRAIN_COUNT_DICT, SEQ_LEN, TRAIN_FP)
# process(df_val, VAL_COUNT_DICT, SEQ_LEN, VAL_FP)
# process(df_test, TEST_COUNT_DICT, SEQ_LEN, TEST_FP)

Processing data...
SUCCESS!
Processing data...
SUCCESS!
Processing data...
SUCCESS!


In [35]:
# df = pd.read_csv(TRAIN_FP)
# print(df.shape)
# df.head()

In [36]:
# df.label.value_counts(normalize=True)

In [None]:
###Extra

In [34]:
split = "train"
train_data_org = f"/home/ec2-user/SageMaker/CMSAI_Research/data/toy_dataset/data/final_final/event_based/30/{split}_org.csv"
train_json_org = f"/home/ec2-user/SageMaker/CMSAI_Research/data/toy_dataset/data/final_final/raw/30/{split}.json"

In [35]:
df = pd.read_csv(train_data_org)

In [36]:
with open(train_json_org, "r") as fp:
    data = json.load(fp)

In [37]:
df.head()

Unnamed: 0,29,28,27,26,25,24,23,22,21,20,...,6,5,4,3,2,1,0,patient_id,label,category
0,annual_physical_N,pneumonia_H,annual_physical_N,eye_exam_N,quad_injury_N,cut_finger_N,myopia_N,dental_exam_N,ACL_tear_N,foot_pain_N,...,headache_N,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,XLDQ61URWF,1,AAH
1,hay_fever_N,eye_exam_N,myopia_N,hay_fever_N,hay_fever_N,headache_N,annual_physical_N,headache_N,annual_physical_N,annual_physical_N,...,cut_finger_N,electrolyte_imbalance_H,headache_N,quad_injury_N,ingrown_nail_N,<pad>,<pad>,SSC6U2R69W,1,AAH
2,dental_exam_N,annual_physical_N,eye_exam_N,eye_exam_N,hay_fever_N,dental_exam_N,annual_physical_N,backache_N,foot_pain_N,eye_exam_N,...,ACL_tear_N,backache_N,eye_exam_N,dental_exam_N,hay_fever_N,<pad>,<pad>,ILIDFNY4YX,1,AAH
3,cold_sore_N,ventricular_aneurysm_A,hypertension_A,myopia_N,coronary_artery_disease_H,annual_physical_N,ACL_tear_N,ACL_tear_N,backache_N,annual_physical_N,...,cold_sore_N,ingrown_nail_N,ingrown_nail_N,ankle_sprain_N,<pad>,<pad>,<pad>,D5KAFPVMK4,1,AAH
4,cut_finger_N,quad_injury_N,eye_exam_N,pulmonary_embolism_A,cardiomyopathy_A,headache_N,annual_physical_N,sleep_apnea_H,peanut_allergy_N,cold_sore_N,...,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,<pad>,CBC97BW27T,1,AAH


In [38]:
# print(df.shape)
print(df.iloc[5].tolist())

['myopia_N', 'headache_N', 'ingrown_nail_N', 'peanut_allergy_N', 'cut_finger_N', 'cut_finger_N', 'peanut_allergy_N', 'Acute_Myocardial_Infarction_A', 'headache_N', 'Acute_Myocardial_Infarction_A', 'troponin_H', 'annual_physical_N', 'cut_finger_N', 'annual_physical_N', 'cold_sore_N', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', 'HBTMH73L7U', 1, 'AAH']


In [39]:
data["AAH"][5]

{'4': 'troponin_H',
 '5': 'Acute_Myocardial_Infarction_A',
 '7': 'Acute_Myocardial_Infarction_A'}
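
Comparing the row printed above with its JSON entry suggests that, for this file, each key is the position of the event counted from the end of the unpadded sequence. The sketch below reproduces that single example. `events_from_row` is a hypothetical helper, and the rule is inferred from this one row only, not confirmed for the whole dataset.

```python
def events_from_row(tokens, pad="<pad>", noise_suffix="_N"):
    """Map position-from-end (as a string) to each non-noise token."""
    unpadded = [t for t in tokens if t != pad]
    last = len(unpadded) - 1
    return {
        str(last - i): token
        for i, token in enumerate(unpadded)
        if not token.endswith(noise_suffix)
    }


# Token columns of df.iloc[5] as printed above (without uid, label, category).
row = [
    "myopia_N", "headache_N", "ingrown_nail_N", "peanut_allergy_N",
    "cut_finger_N", "cut_finger_N", "peanut_allergy_N",
    "Acute_Myocardial_Infarction_A", "headache_N",
    "Acute_Myocardial_Infarction_A", "troponin_H", "annual_physical_N",
    "cut_finger_N", "annual_physical_N", "cold_sore_N",
] + ["<pad>"] * 15

print(events_from_row(row))
```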