# Important Instructions

#### To get the dataset mounted on Google Drive:

*  Go to the MIMIC-IV website: https://physionet.org/content/mimiciv/2.2
*  Download the dataset to your Google Drive.
*  Verify that the dataset is present under MyDrive/mimiciv/2.2/hosp.
    Example:
    *   /content/drive/MyDrive/mimiciv/2.2/hosp/admissions.csv.gz
    *   /content/drive/MyDrive/mimiciv/2.2/hosp/diagnoses_icd.csv.gz
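
A quick way to confirm the files are in place before running the rest of the notebook is sketched below. The helper name `missing_mimic_files` is hypothetical, and the default path assumes the Drive is mounted at /content/drive as in the example paths above.

```python
import os

# Hypothetical helper: returns the expected files that are NOT found under `hosp_dir`.
def missing_mimic_files(hosp_dir="/content/drive/MyDrive/mimiciv/2.2/hosp",
                        expected=("admissions.csv.gz", "diagnoses_icd.csv.gz")):
    return [name for name in expected
            if not os.path.isfile(os.path.join(hosp_dir, name))]
```

Calling `missing_mimic_files()` after mounting the Drive returns an empty list when both files are present.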

#### Link to the project draft Google Colab notebook (.ipynb):
https://colab.research.google.com/drive/1RMl4T3FsAQaJDColj0xPma8t3Xwk0c3z?authuser=1#scrollTo=-Ua2cM28fPO4








# Import Modules

In [None]:
# Required installations
#!pip install transformers[torch]
# Quote the requirement so the shell does not treat ">" as an output redirect.
!pip install --upgrade "accelerate>=0.21.0"
!pip install --upgrade transformers



In [None]:
# All module imports for the project.
from google.colab import drive
import pandas as pd
from datetime import datetime
import random
import math
import numpy as np
import logging
import os
from typing import Callable, Dict, List, Optional, Tuple
import csv
import json, time
from collections import defaultdict
from itertools import combinations, islice
import pickle

from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

import torch
torch.__version__
import torch.nn.functional as F
from torch import Tensor, nn

from torch.utils.data.dataloader import DataLoader
from torch.utils.data.dataset import Dataset
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data.sampler import RandomSampler, Sampler, SequentialSampler

from transformers.data.data_collator import DataCollator
from transformers.modeling_utils import PreTrainedModel
from transformers.optimization import AdamW, get_linear_schedule_with_warmup
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR, EvalPrediction, PredictionOutput, TrainOutput
from transformers.training_args import TrainingArguments

from transformers.activations import ACT2FN
from transformers.models.bart.configuration_bart import BartConfig
from transformers import BertTokenizer, BartTokenizer
# from transformers.file_utils import add_start_docstrings, add_start_docstrings_to_callable

from transformers import (
    CONFIG_MAPPING,
    MODEL_WITH_LM_HEAD_MAPPING,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    BertTokenizer,
    DataCollatorForLanguageModeling,
    HfArgumentParser,
    LineByLineTextDataset,
    PreTrainedTokenizer,
    TextDataset,
    TrainingArguments,
    set_seed,
)

from dataclasses import dataclass, field
from transformers import Trainer
#from dataset import DataCollatorForICDBERT, DataCollatorForICDBERTFINALPRED, DataCollatorForICDBART
#from icdmodelbart import ICDBartForPreTraining

In [None]:
## PyHealth installation and related module imports for future enhancements

# !pip install pyhealth
# # PyHealth modules
# from pyhealth.data import *
# from pyhealth.datasets import *
# from pyhealth.tasks import *
# from pyhealth.models import *
# from pyhealth.trainer import *
# from pyhealth.medcode import *
# from pyhealth.tokenizer import *

# Mount Notebook to Google Drive


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
#Check the File listing

!ls -lr /content/drive/MyDrive/mimiciv/2.2/*

-rw------- 1 root root  2884 Jan  6  2023 /content/drive/MyDrive/mimiciv/2.2/SHA256SUMS.txt
-rw------- 1 root root  2518 Jan  6  2023 /content/drive/MyDrive/mimiciv/2.2/LICENSE.txt
-rw------- 1 root root   789 Mar 29 00:31 /content/drive/MyDrive/mimiciv/2.2/index.html
-rw------- 1 root root 13332 Jan  5  2023 /content/drive/MyDrive/mimiciv/2.2/CHANGELOG.txt

/content/drive/MyDrive/mimiciv/2.2/icu:
total 3077969
-rw------- 1 root root   20717852 Jan  5  2023 procedureevents.csv.gz
-rw------- 1 root root   38747895 Jan  5  2023 outputevents.csv.gz
-rw------- 1 root root  324218488 Jan  5  2023 inputevents.csv.gz
-rw------- 1 root root  251962313 Jan  5  2023 ingredientevents.csv.gz
-rw------- 1 root root       1336 Mar 29 00:31 index.html
-rw------- 1 root root    2614571 Jan  5  2023 icustays.csv.gz
-rw------- 1 root root      57476 Jan  5  2023 d_items.csv.gz
-rw------- 1 root root   45721062 Jan  5  2023 datetimeevents.csv.gz
-rw------- 1 root root 2467761053 Jan  5  2023 chartevents.

# Introduction

  Longitudinal Electronic Health Records (EHRs) have been used successfully for clinical disease and outcome prediction with deep learning models.
  State-of-the-art (SOTA) models outperform traditional ML models by using pretrain-finetune methods in EHR-based predictive modeling. However, their pre-training objectives are limited to predicting a fraction of the ICD codes within each visit. In real-life scenarios, patients have multiple diseases that can be correlated and can contribute to disease progression and changes in outcome.
  Additionally, generalizing the same model to out-of-domain data in different medical settings with limited computing resources remains a major challenge.


*   Paper explanation

  The paper proposes TransformEHR, a generative encoder-decoder transformer model that is pre-trained with a new strategy: predicting the complete set of diseases and outcomes of a patient at a future visit from previous visits. The model is generalizable and can be fine-tuned for various clinical prediction tasks with limited data.

  TransformEHR uses an encoder-decoder transformer architecture. The encoder processes the input embeddings and generates a set of hidden representations. The model performs cross-attention over the hidden representations from the encoder and assigns an attention weight to each representation. The decoder generates ICD codes following the sequential order of code priority within a visit. It includes the date of each visit as a feature to integrate temporal information. The model differs from state-of-the-art models such as BERT (Bidirectional Encoder Representations from Transformers) in three components: visit masking, the encoder-decoder architecture, and time embedding.

  According to the authors, during pretraining on a larger set of longitudinal EHR data, the TransformEHR model learned the probability distribution of ICD codes through the correlations captured by cross-attention. It was later fine-tuned to predict a single disease or outcome.
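
To make the pre-training objective concrete, the sketch below forms one (input, target) pair from a patient's visit history: all earlier visits are the input, and the complete code set of the most recent visit is the target. The function name `make_future_visit_example` and the toy visits are illustrative assumptions; this is not the authors' data collator.

```python
from typing import List, Tuple

Visit = Tuple[str, List[str]]  # (visit date, ICD codes in priority order)

def make_future_visit_example(visits: List[Visit]):
    """Split a visit history into (previous visits, codes of the last visit)."""
    if len(visits) < 2:
        return None  # need at least one past visit and one future visit
    ordered = sorted(visits, key=lambda v: v[0])  # ISO dates sort chronologically
    return ordered[:-1], ordered[-1][1]

history = [
    ("2160-12-28", ["R4182", "G20"]),
    ("2160-11-25", ["G3183", "F0280"]),
]
past, target = make_future_visit_example(history)
print(past)    # earlier visit(s), the encoder input
print(target)  # all codes of the most recent visit, the decoder target
```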


# Scope of Reproducibility:

Hypothesis 1: The TransformEHR model can effectively predict the complete set of diseases and outcomes of a patient from past visits (i.e., the disease or outcome agnostic prediction (DOAP) task). The result will be compared with the state-of-the-art model, BERT (Bidirectional Encoder Representations from Transformers), which is usually trained to predict a fraction of the ICD codes within each visit. The model will use an encoder-decoder transformer architecture and seek to outperform bidirectional encoder-only models.

Hypothesis 2: The TransformEHR model performs better on single disease outcomes with pre-training than without pre-training. After reviewing the large data requirement of the VHA dataset used for pretraining, we realized that testing this hypothesis may not be feasible with limited computational resources.

Ablations planned:
1.	Evaluate the effectiveness of the three unique components of the TransformEHR model: visit masking, the encoder-decoder architecture, and time embedding.
2.	Assess the performance of an encoder-only architecture compared to an encoder-decoder architecture.
3.	Measure the impact of including or excluding temporal features such as the date of each visit.




# Methodology



We plan to follow the steps below for this implementation:

* Data Load - Extract the following data files from the MIMIC-IV dataset https://physionet.org/content/mimiciv/2.2 :

    1.   admissions.csv.gz
    2.   diagnoses_icd.csv.gz

* Preprocess MIMIC-IV dataset files:

    * Create a subset by extracting ICD-10 records to align with the cohorts of the pretrained model.

    * For the project draft, create a smaller subset (100 patients and their related records from the admissions and diagnoses_icd files).

    * Map the ICD codes to CUI codes [Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS)].

* Create Test and Train datasets  

* Model Architecture:
    *   Create Positional Embedding
    *   Multi-head attention mechanism
    *   Encoder Layer
    *   Decoder Layer
* Model Initialization
* Model Training
* Model Evaluation
* Results



##  Data

The MIMIC-IV dataset comprises data from intensive care unit patients admitted to the Beth Israel Deaconess Medical Center in Boston, Massachusetts. Although the dataset spans 2008 to 2019, the implementation of ICD-10-CM began in October 2015. Following the authors' implementation plan, and to align with the cohorts of the model pretrained on the Veterans Health Administration (VHA) dataset, only patients with ICD-10-CM records are selected.


Data Load - Extract the following data files from the MIMIC-IV dataset https://physionet.org/content/mimiciv/2.2 :

1.   admissions.csv.gz
2.   diagnoses_icd.csv.gz


        
        



In [None]:
# Read diagnoses_icd.csv.gz
diagnoses_df = pd.read_csv('/content/drive/MyDrive/mimiciv/2.2/hosp/diagnoses_icd.csv.gz',
                           nrows=None,
                           compression='gzip',
                           dtype={'subject_id': str, 'hadm_id': str, 'icd_code': str, 'icd_version': str},
#                           error_bad_lines=False)
                           on_bad_lines = 'skip')
print(f'Number of Rows and Columns in Diagnoses_icd file: {diagnoses_df.shape}')
#print(diagnoses_df.head(5))

Number of Rows and Columns in Diagnoses_icd file: (4756326, 5)


In [None]:
# Read admissions.csv.gz file
admissions_df = pd.read_csv('/content/drive/MyDrive/mimiciv/2.2/hosp/admissions.csv.gz',
                            nrows=None,
                            compression='gzip',
                            dtype={'subject_id': str, 'hadm_id': str},
#                            error_bad_lines=False)
                           on_bad_lines = 'skip')
print(f'Number of Rows and Columns in Admissions file: {admissions_df.shape}')
#print(admissions_df.head(5))

Number of Rows and Columns in Admissions file: (431231, 16)
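
The `dtype` arguments in the two cells above read identifiers and ICD codes as strings. The sketch below, using only the standard library and made-up rows (not MIMIC data), shows why that matters: a code with a leading zero changes once it is parsed as a number.

```python
import csv
import gzip
import io

# Build a tiny gzip-compressed CSV in memory (made-up rows).
raw = "subject_id,hadm_id,icd_code,icd_version\n1,100,0389,9\n1,101,E785,10\n"
buffer = io.BytesIO(gzip.compress(raw.encode("utf-8")))

with gzip.open(buffer, mode="rt", newline="") as handle:
    rows = list(csv.DictReader(handle))

codes_as_text = [row["icd_code"] for row in rows]
print(codes_as_text)          # the leading zero is preserved as text
print(int(codes_as_text[0]))  # numeric parsing would drop it
```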


#### Preprocess Data


*   Create a subset by extracting ICD-10 records to align with the cohorts of the pretrained model.

*   For the project draft, create a smaller subset (100 patients and their related records from the admissions and diagnoses_icd files).

*   Map the ICD codes to CUI codes [Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS)].



In [None]:
# Filter icd_version=10
diagnoses_df = diagnoses_df[diagnoses_df['icd_version'] == '10']
print(f'Number of Rows and Columns in Diagnoses_icd_10 file: {diagnoses_df.shape}')

# Select 100 unique subject_id from the filtered diagnoses file
selected_subject_ids = diagnoses_df['subject_id'].unique()[:100]
print(f'Selected 100 Patients from Diagnoses_icd_10 file: {selected_subject_ids}')


# Filter both the Diagnoses and admissions to include only the selected 100 subject_id
data_df = diagnoses_df[diagnoses_df['subject_id'].isin(selected_subject_ids)]

print(f'Number of Rows and Columns in Subset dataset Diagnoses_icd_10 file: {data_df.shape}')
admissions_data_df = admissions_df[admissions_df['subject_id'].isin(selected_subject_ids)]
print(f'Number of Rows and Columns in Subset dataset Admissions file: {admissions_data_df.shape}')
#print(len(data_df))
#print(len(admissions_data_df))

Number of Rows and Columns in Diagnoses_icd_10 file: (1989449, 5)
Selected 100 Patients from Diagnoses_icd_10 file: ['10000084' '10000117' '10000980' '10001401' '10001667' '10001843'
 '10001884' '10001919' '10002013' '10002131' '10002221' '10002266'
 '10002315' '10002348' '10002428' '10002430' '10002443' '10002495'
 '10002528' '10002545' '10002755' '10002800' '10002807' '10002869'
 '10002930' '10002976' '10003019' '10003299' '10003372' '10003385'
 '10003400' '10003412' '10003502' '10003637' '10003731' '10003757'
 '10004113' '10004296' '10004322' '10004457' '10004606' '10004719'
 '10004720' '10004764' '10005001' '10005123' '10005236' '10005308'
 '10005606' '10005749' '10005817' '10005858' '10005866' '10005909'
 '10006029' '10006269' '10006431' '10006457' '10006513' '10006630'
 '10006640' '10006825' '10007058' '10007134' '10007232' '10007266'
 '10007818' '10007920' '10008077' '10008245' '10008287' '10008647'
 '10008742' '10008816' '10008819' '10009129' '10010038' '10010058'
 '10010150' '

In [None]:
#Print the Max and Min of discharge date in admissions subset
visitid2dischargedate = {}
for ind, row in admissions_data_df.iterrows():
    visitid2dischargedate[row['hadm_id']] = row['dischtime'][0:10]

print(min(visitid2dischargedate.values()))
print(max(visitid2dischargedate.values()))
print(visitid2dischargedate)

2112-12-10
2204-07-25
{'23052089': '2160-11-25', '29888819': '2160-12-28', '22927623': '2181-11-15', '27988844': '2183-09-21', '20897796': '2193-08-17', '24947999': '2190-11-08', '25242409': '2191-04-11', '25911675': '2191-05-24', '26913865': '2189-07-03', '29654838': '2188-01-05', '29659838': '2191-07-19', '21544441': '2131-06-15', '24818636': '2131-08-04', '26840593': '2131-07-02', '27012892': '2133-07-13', '27060146': '2131-10-05', '28058085': '2131-11-15', '22672901': '2173-08-24', '21728396': '2131-11-11', '21192799': '2130-10-06', '21268656': '2125-10-20', '21577720': '2125-12-27', '22532141': '2130-10-14', '23594368': '2125-12-03', '24325811': '2126-11-04', '24746267': '2130-12-30', '24962904': '2130-12-08', '25758848': '2128-07-17', '26170293': '2130-04-19', '26184834': '2131-01-20', '26202981': '2130-08-23', '26679629': '2125-10-27', '26812645': '2127-07-25', '27016754': '2130-06-24', '27507515': '2130-12-24', '27765344': '2127-12-12', '28475784': '2130-10-22', '28664981': '21
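
Taking `min` and `max` directly on the date strings works because `YYYY-MM-DD` strings sort lexicographically in the same order as the dates they represent. A small check, using a few dates from the output above:

```python
from datetime import datetime

dates = ["2160-11-25", "2193-08-17", "2131-06-15", "2125-10-20"]

# Compare the string ordering with the ordering of parsed datetime objects.
by_string = sorted(dates)
by_datetime = sorted(dates, key=lambda d: datetime.strptime(d, "%Y-%m-%d"))

print(by_string == by_datetime)
print(min(dates), max(dates))
```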

In [None]:
#Create Patient dictionary
#Code reference: https://github.com/whaleloops/TransformEHR/blob/main/preprocess.py

patients = defaultdict(lambda: defaultdict(list))
print("Number of rows:" , data_df.shape[0])
for ind, row in data_df.iterrows():
    hadm_id = row['hadm_id']
    scrssn = row['subject_id']
    visit_date = visitid2dischargedate[hadm_id]
    patients[scrssn][visit_date].append(row['icd_version'] +'-'+ row['icd_code'])

num_icd_pat = defaultdict(int)
for k,v in patients.items():
    for kv, vv in v.items():
        for icdcode in vv:
            if icdcode.startswith("10-"):
                num_icd_pat[k] += 1
                break

#print(len(patients))
#print(len(num_icd_pat))
num_pos = 0
for k,v in num_icd_pat.items():
    if v > 1:
        num_pos += 1
#print(num_pos)
print(f'num_icd_pat dictionary: {num_icd_pat}')
#print("Done")

Number of rows: 2741
num_icd_pat dictionary: defaultdict(<class 'int'>, {'10000084': 2, '10000117': 2, '10000980': 3, '10001401': 6, '10001667': 1, '10001843': 1, '10001884': 12, '10001919': 1, '10002013': 4, '10002131': 1, '10002221': 4, '10002266': 1, '10002315': 1, '10002348': 1, '10002428': 2, '10002430': 4, '10002443': 1, '10002495': 1, '10002528': 2, '10002545': 1, '10002755': 1, '10002800': 4, '10002807': 1, '10002869': 2, '10002930': 6, '10002976': 1, '10003019': 1, '10003299': 5, '10003372': 1, '10003385': 1, '10003400': 1, '10003412': 1, '10003502': 1, '10003637': 4, '10003731': 1, '10003757': 1, '10004113': 1, '10004296': 1, '10004322': 3, '10004457': 2, '10004606': 4, '10004719': 1, '10004720': 1, '10004764': 1, '10005001': 2, '10005123': 2, '10005236': 1, '10005308': 1, '10005606': 2, '10005749': 1, '10005817': 1, '10005858': 2, '10005866': 6, '10005909': 1, '10006029': 4, '10006269': 1, '10006431': 6, '10006457': 3, '10006513': 2, '10006630': 1, '10006640': 1, '10006825':
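
The `patients` structure built above is a two-level mapping: patient id, then visit date, then a list of version-prefixed codes. The toy example below (made-up rows, with the hypothetical name `toy_patients`) follows the same pattern and shows how the per-patient count of visits with at least one ICD-10 code is derived.

```python
from collections import defaultdict

# Made-up rows: (subject_id, visit_date, icd_version, icd_code)
rows = [
    ("p1", "2160-11-25", "10", "E785"),
    ("p1", "2160-11-25", "10", "R441"),
    ("p1", "2160-12-28", "10", "G20"),
    ("p2", "2181-11-15", "10", "K219"),
]

toy_patients = defaultdict(lambda: defaultdict(list))
for subject_id, visit_date, version, code in rows:
    toy_patients[subject_id][visit_date].append(version + "-" + code)

# Count, per patient, the visits that contain at least one ICD-10 code.
visits_with_icd10 = {
    pid: sum(any(c.startswith("10-") for c in codes) for codes in visits.values())
    for pid, visits in toy_patients.items()
}
print(visits_with_icd10)
```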

In [None]:
# Map the ICD codes to CUI codes [Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS)]
# Note: as written, icd2cui keeps the version-prefixed ICD codes (e.g. '10-E785') as the identifiers.
#Code reference: https://github.com/whaleloops/TransformEHR/blob/main/preprocess.py

def icd2cui(patients, logging_step=50000):
    dictionary = defaultdict(int)
    # cuis_li = []
    cuis_di = {}
    date_di = {}
    num_idx = 0
    for pssn,v in patients.items():
        num_idx += 1
        if num_idx%logging_step == 0:
            print("|{} - Processed {}".format(time.asctime(time.localtime(time.time())), num_idx), flush=True)
        cuis_di[pssn] = []
        cuis_li_tmp = []
        date_li_tmp = []
        for datetime_str in sorted(v.keys()): # sort by time
            datetime_object = datetime.strptime(datetime_str, '%Y-%m-%d') # make sure time str is correct
            infos = v[datetime_str]
            if len(infos) > 0:
                # cuis_di[pssn].append((cuis, ext_cuis, strs))
                cuis_li_tmp.append((infos, [], []))
                date_li_tmp.append(datetime_str)
            for cui_id in infos:
                dictionary[cui_id] += 1
        if len(cuis_li_tmp) > 0:
            cuis_di[pssn] = cuis_li_tmp
            date_di[pssn] = date_li_tmp
    return cuis_di, date_di, dictionary

patients_few = dict(islice(patients.items(), 0, 200))
# cuis, date, dictionary = icd2cui(patients_few, logging_step=50000)
cuis, date, dictionary = icd2cui(patients, logging_step=50000)

dir_path = '/content/drive/My Drive/'
print("Number of cui in dictionary: {}".format(len(dictionary)), flush=True)
with open(os.path.join(dir_path, 'dict.txt'), 'w') as handle: #TODO
    handle.write("[PAD]"+"\n")
    for i in range(99):
        handle.write("[unused{}]".format(i)+"\n")
    handle.write("[UNK]"+"\n")
    handle.write("[CLS]"+"\n")
    handle.write("[SEP]"+"\n")
    handle.write("[MASK]"+"\n")
    for i in range(99,194):
        handle.write("[unused{}]".format(i)+"\n")
    for k,v in dictionary.items():
        handle.write("{}\n".format(k))
# save data (context managers close the files even if a write fails)
print("Saving patient data...", flush=True)
with open(os.path.join(dir_path, 'value.pickle'), 'wb') as f1, \
     open(os.path.join(dir_path, 'dates.pickle'), 'wb') as f3, \
     open(os.path.join(dir_path, 'key.txt'), 'w') as f2:
    for k,v in cuis.items():
        pickle.dump(v, f1, protocol=pickle.HIGHEST_PROTOCOL)
        pickle.dump(date[k], f3, protocol=pickle.HIGHEST_PROTOCOL)
        f2.write("{}\n".format(k))

#print("Done")

Number of cui in dictionary: 971
Saving patient data...
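
The order in which `dict.txt` is written fixes each token's id (its zero-based line number). The sketch below rebuilds that order in memory to show where the special tokens and the first code land. `build_vocab` is a hypothetical helper, and the two codes are only examples.

```python
def build_vocab(codes):
    """Return the token list in the same order the cell above writes dict.txt."""
    vocab = ["[PAD]"]
    vocab += ["[unused{}]".format(i) for i in range(99)]
    vocab += ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    vocab += ["[unused{}]".format(i) for i in range(99, 194)]
    vocab += list(codes)
    return vocab

vocab = build_vocab(["10-G3183", "10-F0280"])
token_to_id = {token: idx for idx, token in enumerate(vocab)}

for token in ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "10-G3183"]:
    print(token, token_to_id[token])
```

With this layout the special tokens take ids 0 and 100 to 103, and the first code takes id 199.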


#### Load Data

In [None]:
# Load key.txt, value.pickle and dates.pickle files
# Code reference: https://github.com/whaleloops/TransformEHR/blob/main/sample_load.py

dir_path = "/content/drive/My Drive/"
do_date = True


patients = {}
# Context managers close the files even if loading fails part-way.
with open(os.path.join(dir_path, 'value.pickle'), 'rb') as f1, \
     open(os.path.join(dir_path, 'dates.pickle'), 'rb') as f3, \
     open(os.path.join(dir_path, 'key.txt'), 'r') as f2:
    keys = f2.readlines()
    for key in keys:
        patient_idd = key.strip()
        each_visit = pickle.load(f1)
        f1obj = []
        for (cuis, ext_cuis, strs) in each_visit:
            f1obj.append((cuis, [], []))
        f3obj = pickle.load(f3)
        assert len(f1obj) == len(f3obj)
        if do_date:
            if patient_idd in patients:
                patients[patient_idd] += list(zip(f1obj, f3obj))
            else:
                patients[patient_idd] = list(zip(f1obj, f3obj))
        else:
            if patient_idd in patients:
                patients[patient_idd] += f1obj
            else:
                patients[patient_idd] = f1obj

#print("Number of patients in the sample dataset")
print(f'Number of patients in the sample dataset : {len(patients)}')
print(f'Patients : {patients}')
#print()

Number of patients in the sample dataset : 100
Patients : {'10000084': [((['10-G3183', '10-F0280', '10-R441', '10-R296', '10-E785', '10-Z8546'], [], []), '2160-11-25'), ((['10-R4182', '10-G20', '10-F0280', '10-R609', '10-E785', '10-Z8546'], [], []), '2160-12-28')], '10000117': [((['10-R1310', '10-R0989', '10-K31819', '10-K219', '10-K449', '10-F419', '10-I341', '10-M810', '10-Z87891'], [], []), '2181-11-15'), ((['10-S72012A', '10-W010XXA', '10-Y93K1', '10-Y92480', '10-K219', '10-E7800', '10-I341', '10-G43909', '10-Z87891', '10-Z87442', '10-F419', '10-M810', '10-Z7901'], [], []), '2183-09-21')], '10000980': [((['10-D500', '10-I5023', '10-N184', '10-E118', '10-K2970', '10-Z23', '10-K259', '10-K5730', '10-I2510', '10-Z87891', '10-I252', '10-Z955', '10-I129', '10-Z794', '10-Z8673', '10-R0789', '10-Z86718', '10-R791', '10-T45515A', '10-I70218', '10-K222', '10-K219'], [], []), '2191-05-24'), ((['10-I5023', '10-N184', '10-D631', '10-E1121', '10-Z86718', '10-I129', '10-Z955', '10-I2510', '10-Z7
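
The loader relies on the fact that objects pickled one after another into the same file can be read back in the same order with repeated `pickle.load` calls. This is why `key.txt` is needed: it records how many records to read and which patient each one belongs to. A minimal in-memory sketch with made-up records:

```python
import io
import pickle

records = {
    "p1": [(["10-E785"], [], [])],
    "p2": [(["10-K219"], [], []), (["10-G20"], [], [])],
}

# Write: one pickle.dump per patient, with the keys kept separately in the same order.
stream = io.BytesIO()
keys = []
for patient_id, visits in records.items():
    pickle.dump(visits, stream, protocol=pickle.HIGHEST_PROTOCOL)
    keys.append(patient_id)

# Read: one pickle.load per key, in the same order.
stream.seek(0)
restored = {patient_id: pickle.load(stream) for patient_id in keys}
print(restored == records)
```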

#### Train Test Data Preparation

In [None]:
# Split the subset dataset into train and test sets (80/20 split)
s = pd.Series(patients)
train_dataset, test_dataset = [i.to_dict() for i in train_test_split(s, train_size=0.8)]
print("len train:", len(train_dataset))
print(f'Train dataset: {train_dataset}')
print("len test:", len(test_dataset))
print(f'Test dataset: {test_dataset}')

len train: 80
Train dataset: {'10000980': [((['10-D500', '10-I5023', '10-N184', '10-E118', '10-K2970', '10-Z23', '10-K259', '10-K5730', '10-I2510', '10-Z87891', '10-I252', '10-Z955', '10-I129', '10-Z794', '10-Z8673', '10-R0789', '10-Z86718', '10-R791', '10-T45515A', '10-I70218', '10-K222', '10-K219'], [], []), '2191-05-24'), ((['10-I5023', '10-N184', '10-D631', '10-E1121', '10-Z86718', '10-I129', '10-Z955', '10-I2510', '10-Z7901', '10-Z794', '10-I340', '10-I252', '10-Z8673', '10-Z87891', '10-Z91128', '10-E785'], [], []), '2191-07-19'), ((['10-I130', '10-I5033', '10-E872', '10-N184', '10-E1122', '10-N2581', '10-I2510', '10-E11319', '10-D6489', '10-E785', '10-Z955', '10-Z86718', '10-I252', '10-Z2239', '10-G4700', '10-M1A9XX0', '10-R0902', '10-E1151', '10-Z794', '10-E669', '10-Z6831'], [], []), '2193-08-17')], '10005817': [((['10-J9621', '10-J910', '10-J189', '10-N170', '10-E883', '10-I82412', '10-C7B8', '10-I5022', '10-I871', '10-C7A8', '10-J449', '10-I2510', '10-F17210', '10-Z951', '10-
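
The split above is random, so the train and test patients change between runs. `train_test_split` accepts a `random_state` argument for reproducibility. The standard-library sketch below illustrates the same idea with a seeded patient-level 80/20 split. The helper `split_patients` and the seed value are assumptions for illustration and are not part of the notebook.

```python
import random

def split_patients(patients, train_fraction=0.8, seed=42):
    """Shuffle patient ids with a fixed seed and split them into train/test dicts."""
    ids = sorted(patients)            # fixed starting order
    random.Random(seed).shuffle(ids)  # seeded, reproducible shuffle
    cut = int(len(ids) * train_fraction)
    train = {pid: patients[pid] for pid in ids[:cut]}
    test = {pid: patients[pid] for pid in ids[cut:]}
    return train, test

toy = {"p{}".format(i): [] for i in range(10)}
train_a, test_a = split_patients(toy)
train_b, test_b = split_patients(toy)
print(len(train_a), len(test_a), train_a.keys() == train_b.keys())
```

Splitting by patient id keeps all visits of one patient on the same side of the split.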

##   Model

TransformEHR uses an encoder–decoder transformer architecture. The encoder processes the input embeddings and generates a set of hidden representations for each predictor. TransformEHR performs cross-attention over the hidden representations from the encoder and assigns an attention weight to each representation. These weighted representations are then fed to the decoder, which generates the ICD codes of the future visit. The decoder generates ICD codes following the sequential order of code priority within a visit.

We reuse the existing model architecture code from https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py, updated so that it runs without errors in this notebook.
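
The cross-attention step described above can be illustrated with a single decoder query attending over a few encoder hidden states: scores are scaled dot products, a softmax turns them into attention weights, and the output is the weighted sum. This is a pure-Python sketch with made-up numbers and a single head. The function name `cross_attend` is hypothetical, and the model code uses learned projections and multiple heads.

```python
import math

def cross_attend(query, encoder_states):
    """Single-head scaled dot-product attention of one query over encoder states."""
    scale = len(query) ** -0.5
    scores = [scale * sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

weights, context = cross_attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(weights)  # one weight per encoder state, summing to 1
print(context)  # weighted combination passed on to the decoder
```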

#### Positional Embedding and Masking

In [None]:
#Code Reference: https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py

def create_position_ids_from_input_ids(input_ids, padding_idx):
    """ Replace non-padding symbols with their position numbers. Position numbers begin at
    padding_idx+1. Padding symbols are ignored. This is modified from fairseq's
    `utils.make_positions`.
    :param torch.Tensor x:
    :return torch.Tensor:
    """
    # The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
    return incremental_indices.long() + padding_idx


logger = logging.getLogger(__name__)

def invert_mask(attention_mask):
    assert attention_mask.dim() == 2
    return attention_mask.eq(0)


def _make_linear_from_emb(emb):
    vocab_size, emb_size = emb.weight.shape
    lin_layer = nn.Linear(vocab_size, emb_size, bias=False)
    lin_layer.weight.data = emb.weight.data
    return lin_layer


# Helper Functions, mostly for making masks
def _check_shapes(shape_1, shape2):
    if shape_1 != shape2:
        raise AssertionError("shape mismatch: {} != {}".format(shape_1, shape2))


def shift_tokens_right(input_ids, pad_token_id):
    """Shift input ids one token to the right, and wrap the last non pad token (usually <eos>)."""
    prev_output_tokens = input_ids.clone()
    index_of_eos = (input_ids.ne(pad_token_id).sum(dim=1) - 1).unsqueeze(-1)
    prev_output_tokens[:, 0] = input_ids.gather(1, index_of_eos).squeeze()
    prev_output_tokens[:, 1:] = input_ids[:, :-1]
    return prev_output_tokens


def make_padding_mask(input_ids, padding_idx=1):
    """True for pad tokens"""
    padding_mask = input_ids.eq(padding_idx)
    if not padding_mask.any():
        padding_mask = None
    return padding_mask

class DateYearMonthDayEmbedding(nn.Embedding):
    """This module produces sinusoidal positional embeddings (year) and learned positional embeddings (month, day)."""
    def __init__(self, num_positions, embedding_dim, padding_idx=None):
        # print("num_positions:",num_positions) # 1024
        # print("embedding_dim:",embedding_dim)
        print("DateYearMonthDayEmbedding- starting init")
        # The parent module must be initialized before submodules can be assigned.
        super().__init__(num_positions, embedding_dim, padding_idx=padding_idx)
        self.embed_year = SinusoidalPositionalEmbedding(num_positions, embedding_dim, padding_idx=padding_idx)
        print("DateYearMonthDayEmbedding- finish SinusoidalPositionalEmbedding call??")
        self.embed_month = nn.Embedding(13, embedding_dim)
        self.embed_day = nn.Embedding(32, embedding_dim)
        print("DateYearMonthDayEmbedding- finish init")

    def forward(self, input, use_cache=False):
        # `datetime` here is the class imported via `from datetime import datetime`.
        assert isinstance(input, datetime)
        year = self.embed_year(input.year)
        month = self.embed_month(input.month)
        day = self.embed_day(input.day)
        # print(year,month,day)
        return year + month + day

class SinusoidalPositionalEmbedding(nn.Embedding):
    """This module produces sinusoidal positional embeddings of any length."""

    def __init__(self, num_positions, embedding_dim, padding_idx=None):
        print("SinusoidalPositionalEmbedding- starting init")
        super().__init__(num_positions, embedding_dim)
        if embedding_dim % 2 != 0:
            raise NotImplementedError(f"odd embedding_dim {embedding_dim} not supported")
        self.weight = self._init_weight(self.weight)
        print("SinusoidalPositionalEmbedding- finish init 1.")

    @staticmethod
    def _init_weight(out: nn.Parameter):
        """Identical to the XLM create_sinusoidal_embeddings except features are not interleaved.
            The cos features are in the 2nd half of the vector. [dim // 2:]
        """
        n_pos, dim = out.shape
        position_enc = np.array(
            [[pos / np.power(10000, 2 * (j // 2) / dim) for j in range(dim)] for pos in range(n_pos)]
        )

        sin_values = torch.FloatTensor(np.sin(position_enc[:, 0::2]))
        cos_values = torch.FloatTensor(np.cos(position_enc[:, 1::2]))

        # Copy under no_grad: in-place writes to a Parameter that requires grad raise an error otherwise.
        with torch.no_grad():
            out[:, 0 : dim // 2].copy_(sin_values)
            out[:, dim // 2 :].copy_(cos_values)

        out.detach_()
        out.requires_grad = False
        print("SinusoidalPositionalEmbedding- finish init_weight 2.")
        return out



    @torch.no_grad()
    def forward(self, input_ids, use_cache=False):
        """Input is expected to be of size [bsz x seqlen]."""
        bsz, seq_len = input_ids.shape[:2]
        if use_cache:
            positions = input_ids.data.new(1, 1).fill_(seq_len - 1)  # called before slicing
        else:
            # starts at 0, ends at seq_len - 1
            positions = torch.arange(seq_len, dtype=torch.long, device=self.weight.device)
        return super().forward(positions)

class LearnedPositionalEmbedding(nn.Embedding):
    """
    This module learns positional embeddings up to a fixed maximum size.
    Padding ids are ignored by either offsetting based on padding_idx
    or by setting padding_idx to None and ensuring that the appropriate
    position ids are passed to the forward function.
    """

    def __init__(
        self, num_embeddings: int, embedding_dim: int, padding_idx: int,
    ):
        # if padding_idx is specified then offset the embedding ids by
        # this index and adjust num_embeddings appropriately
        print("LearnedPositionalEmbedding- starting init")
        assert padding_idx is not None
        num_embeddings += padding_idx + 1  # WHY?
        super().__init__(num_embeddings, embedding_dim, padding_idx=padding_idx)

    def forward(self, input, use_cache=False):
        """Input is expected to be of size [bsz x seqlen]."""
        if use_cache:  # the position is our current step in the decoded sequence
            pos = int(self.padding_idx + input.size(1))
            positions = input.data.new(1, 1).fill_(pos)
        else:
            positions = create_position_ids_from_input_ids(input, self.padding_idx)
        return super().forward(positions)

# Helper Modules

def fill_with_neg_inf(t):
    """FP16-compatible function that fills a tensor with -inf."""
    return t.float().fill_(float("-inf")).type_as(t)


def _filter_out_falsey_values(tup) -> Tuple:
    """Remove entries that are None or [] from an iterable."""
    return tuple(x for x in tup if isinstance(x, torch.Tensor) or x)


# Public API
def _get_shape(t):
    return getattr(t, "shape", None)
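
Two of the helpers above are easy to check by hand. The pure-Python sketch below mirrors, for a single sequence, the arithmetic of `create_position_ids_from_input_ids` (non-padding tokens are numbered from `padding_idx + 1`, padding keeps `padding_idx`) and of `SinusoidalPositionalEmbedding._init_weight` (sine features in the first half of each row, cosine features in the second half). The function names `position_ids` and `sinusoidal_table` are illustrative stand-ins and are not part of the model code.

```python
import math

def position_ids(input_ids, padding_idx):
    """List version of the mask -> cumsum -> offset logic for one sequence."""
    positions, running = [], 0
    for token in input_ids:
        is_token = int(token != padding_idx)
        running += is_token
        positions.append(running * is_token + padding_idx)
    return positions

def sinusoidal_table(n_pos, dim):
    """Rows of [sin features | cos features], not interleaved; dim must be even."""
    table = []
    for pos in range(n_pos):
        angles = [pos / (10000 ** (2 * (j // 2) / dim)) for j in range(dim)]
        table.append([math.sin(a) for a in angles[0::2]] + [math.cos(a) for a in angles[1::2]])
    return table

print(position_ids([5, 6, 7, 1, 1], padding_idx=1))
print(sinusoidal_table(2, 4)[0])
```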

#### Attention Mechanism

In [None]:
#Code Reference: https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py

class SelfAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(
        self,
        embed_dim,
        num_heads,
        dropout=0.0,
        bias=True,
        encoder_decoder_attention=False,  # otherwise self_attention
    ):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads
        assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
        self.scaling = self.head_dim ** -0.5

        self.encoder_decoder_attention = encoder_decoder_attention
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.cache_key = "encoder_decoder" if self.encoder_decoder_attention else "self"

    def _shape(self, tensor, dim_0, bsz):
        return tensor.contiguous().view(dim_0, bsz * self.num_heads, self.head_dim).transpose(0, 1)

    def forward(
        self,
        query,
        key: Optional[Tensor],
        key_padding_mask: Optional[Tensor] = None,
        layer_state: Optional[Dict[str, Optional[Tensor]]] = None,
        attn_mask: Optional[Tensor] = None,
        need_weights=False,
    ) -> Tuple[Tensor, Optional[Tensor]]:
        """Input shape: Time(SeqLen) x Batch x Channel"""
        static_kv: bool = self.encoder_decoder_attention
        tgt_len, bsz, embed_dim = query.size()
        assert embed_dim == self.embed_dim
        assert list(query.size()) == [tgt_len, bsz, embed_dim]
        # get here for encoder decoder cause of static_kv
        if layer_state is not None:  # reuse k,v and encoder_padding_mask
            saved_state = layer_state.get(self.cache_key, {})
            if "prev_key" in saved_state:
                # previous time steps are cached - no need to recompute key and value if they are static
                if static_kv:
                    key = None
        else:
            saved_state = None
            layer_state = {}

        q = self.q_proj(query) * self.scaling
        if static_kv:
            if key is None:
                k = v = None
            else:
                k = self.k_proj(key)
                v = self.v_proj(key)
        else:
            k = self.k_proj(query)
            v = self.v_proj(query)

        q = self._shape(q, tgt_len, bsz)
        if k is not None:
            k = self._shape(k, -1, bsz)
        if v is not None:
            v = self._shape(v, -1, bsz)

        if saved_state is not None:
            k, v, key_padding_mask = self._use_saved_state(k, v, saved_state, key_padding_mask, static_kv, bsz)

        # Update cache
        layer_state[self.cache_key] = {
            "prev_key": k.view(bsz, self.num_heads, -1, self.head_dim),
            "prev_value": v.view(bsz, self.num_heads, -1, self.head_dim),
            "prev_key_padding_mask": key_padding_mask if not static_kv else None,
        }

        assert k is not None
        src_len = k.size(1)
        attn_weights = torch.bmm(q, k.transpose(1, 2))
        assert attn_weights.size() == (bsz * self.num_heads, tgt_len, src_len)

        if attn_mask is not None:
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attn_mask
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

        # This is part of a workaround to get around fork/join parallelism not supporting Optional types.
        if key_padding_mask is not None and key_padding_mask.dim() == 0:
            key_padding_mask = None
        assert key_padding_mask is None or key_padding_mask.size()[:2] == (bsz, src_len,)

        if key_padding_mask is not None:  # don't attend to padding symbols
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
            reshaped = key_padding_mask.unsqueeze(1).unsqueeze(2)
            attn_weights = attn_weights.masked_fill(reshaped, float("-inf"))
            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
        attn_weights = F.softmax(attn_weights, dim=-1)
        attn_probs = F.dropout(attn_weights, p=self.dropout, training=self.training,)

        assert v is not None
        attn_output = torch.bmm(attn_probs, v)
        assert attn_output.size() == (bsz * self.num_heads, tgt_len, self.head_dim)
        attn_output = attn_output.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
        attn_output = self.out_proj(attn_output)
        if need_weights:
            attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
        else:
            attn_weights = None
        return attn_output, attn_weights

    def _use_saved_state(self, k, v, saved_state, key_padding_mask, static_kv, bsz):
        # saved states are stored with shape (bsz, num_heads, seq_len, head_dim)
        if "prev_key" in saved_state:
            _prev_key = saved_state["prev_key"]
            assert _prev_key is not None
            prev_key = _prev_key.view(bsz * self.num_heads, -1, self.head_dim)
            if static_kv:
                k = prev_key
            else:
                assert k is not None
                k = torch.cat([prev_key, k], dim=1)
        if "prev_value" in saved_state:
            _prev_value = saved_state["prev_value"]
            assert _prev_value is not None
            prev_value = _prev_value.view(bsz * self.num_heads, -1, self.head_dim)
            if static_kv:
                v = prev_value
            else:
                assert v is not None
                v = torch.cat([prev_value, v], dim=1)
        assert k is not None and v is not None
        prev_key_padding_mask: Optional[Tensor] = saved_state.get("prev_key_padding_mask", None)
        key_padding_mask = self._cat_prev_key_padding_mask(
            key_padding_mask, prev_key_padding_mask, bsz, k.size(1), static_kv
        )
        return k, v, key_padding_mask

    @staticmethod
    def _cat_prev_key_padding_mask(
        key_padding_mask: Optional[Tensor],
        prev_key_padding_mask: Optional[Tensor],
        batch_size: int,
        src_len: int,
        static_kv: bool,
    ) -> Optional[Tensor]:
        # saved key padding masks have shape (bsz, seq_len)
        if prev_key_padding_mask is not None:
            if static_kv:
                new_key_padding_mask = prev_key_padding_mask
            else:
                new_key_padding_mask = torch.cat([prev_key_padding_mask, key_padding_mask], dim=1)

        elif key_padding_mask is not None:
            filler = torch.zeros(
                batch_size,
                src_len - key_padding_mask.size(1),
                dtype=key_padding_mask.dtype,
                device=key_padding_mask.device,
            )
            new_key_padding_mask = torch.cat([filler, key_padding_mask], dim=1)
        else:
            new_key_padding_mask = prev_key_padding_mask
        return new_key_padding_mask
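
The shape handling in `SelfAttention.forward` can be followed on a toy input. This is a minimal sketch, assuming made-up sizes and omitting the learned `q_proj`, `k_proj`, `v_proj` and `out_proj` layers, so it illustrates only the head split, the key padding mask and the softmax, not the full module. The helper name `split_heads` is hypothetical and mirrors `_shape`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tgt_len, bsz, embed_dim, num_heads = 4, 2, 8, 2  # illustrative sizes
head_dim = embed_dim // num_heads


def split_heads(t, seq_len, bsz):
    # (seq_len, bsz, embed_dim) -> (bsz * num_heads, seq_len, head_dim)
    return t.contiguous().view(seq_len, bsz * num_heads, head_dim).transpose(0, 1)


x = torch.randn(tgt_len, bsz, embed_dim)  # Time x Batch x Channel
q = split_heads(x * head_dim ** -0.5, tgt_len, bsz)  # scaled queries
k = split_heads(x, tgt_len, bsz)
v = split_heads(x, tgt_len, bsz)

scores = torch.bmm(q, k.transpose(1, 2))  # (bsz * num_heads, tgt_len, src_len)

# True marks padding: the last position of the second sequence is padded.
key_padding_mask = torch.tensor(
    [[False, False, False, False], [False, False, False, True]]
)
scores = scores.view(bsz, num_heads, tgt_len, tgt_len)
scores = scores.masked_fill(key_padding_mask.unsqueeze(1).unsqueeze(2), float("-inf"))
scores = scores.view(bsz * num_heads, tgt_len, tgt_len)

probs = F.softmax(scores, dim=-1)
out = torch.bmm(probs, v)  # (bsz * num_heads, tgt_len, head_dim)
out = out.transpose(0, 1).contiguous().view(tgt_len, bsz, embed_dim)
print(out.shape)
```

Because the masked scores are `-inf` before the softmax, the padded key receives exactly zero attention weight in every head.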


####Encoder and BART Encoder

In [None]:
#Code Reference: https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py

class EncoderLayer(nn.Module):
    def __init__(self, config: BartConfig):
        super().__init__()
        self.embed_dim = config.d_model
        self.output_attentions = config.output_attentions
        self.self_attn = SelfAttention(
            self.embed_dim, config.encoder_attention_heads, dropout=config.attention_dropout,
        )
        self.normalize_before = config.normalize_before
        self.self_attn_layer_norm = torch.nn.LayerNorm(self.embed_dim, 1e-5, True)
        self.dropout = config.dropout
        self.activation_fn = ACT2FN[config.activation_function]
        self.activation_dropout = config.activation_dropout
        self.fc1 = nn.Linear(self.embed_dim, config.encoder_ffn_dim)
        self.fc2 = nn.Linear(config.encoder_ffn_dim, self.embed_dim)
        self.final_layer_norm = torch.nn.LayerNorm(self.embed_dim, 1e-5, True)

    def forward(self, x, encoder_padding_mask):
        """
        Args:
            x (Tensor): input to the layer of shape `(seq_len, batch, embed_dim)`
            encoder_padding_mask (ByteTensor): binary ByteTensor of shape
                `(batch, src_len)` where padding elements are indicated by ``1``.
                Positions marked ``1`` are masked out of attention; positions
                marked ``0`` are included.

        Returns:
            encoded output of shape `(seq_len, batch, embed_dim)`
        """
        residual = x
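        # test: pre-norm is disabled here, so with config.normalize_before=True
        # no LayerNorm is applied around self-attention in this layer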
        # if self.normalize_before:
        #     x = self.self_attn_layer_norm(x)
        x, attn_weights = self.self_attn(
            query=x, key=x, key_padding_mask=encoder_padding_mask, need_weights=self.output_attentions
        )
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        if not self.normalize_before:
            x = self.self_attn_layer_norm(x)

        residual = x
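        # test: pre-norm is disabled here, so with config.normalize_before=True
        # no LayerNorm is applied around the feed-forward block in this layer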
        # if self.normalize_before:
        #     x = self.final_layer_norm(x)
        x = self.activation_fn(self.fc1(x))
        x = F.dropout(x, p=self.activation_dropout, training=self.training)
        x = self.fc2(x)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        if not self.normalize_before:
            x = self.final_layer_norm(x)
        return x, attn_weights


class BartEncoder(nn.Module):
    """
    Transformer encoder consisting of *config.encoder_layers* self attention layers. Each layer
    is a :class:`EncoderLayer`.

    Args:
        config: BartConfig
    """

    def __init__(self, config: BartConfig, embed_tokens):
        super().__init__()

        self.dropout = config.dropout
        self.layerdrop = config.encoder_layerdrop
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states

        embed_dim = embed_tokens.embedding_dim
        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0
        self.padding_idx = embed_tokens.padding_idx
        self.max_source_positions = config.max_position_embeddings

        self.embed_tokens = embed_tokens
        # # test
        # self.embed_positions = LearnedPositionalEmbedding(
        #         config.max_position_embeddings, embed_dim, self.padding_idx,
        #     )
        # self.embed_visitids = nn.Embedding(1460, embed_dim)
        if config.static_position_embeddings:
            self.embed_positions = SinusoidalPositionalEmbedding(
                config.max_position_embeddings, embed_dim, self.padding_idx
            )
        else:
            self.embed_positions = LearnedPositionalEmbedding(
                config.max_position_embeddings, embed_dim, self.padding_idx,
            )
        if config.date_visit_embeddings:
            self.embed_visitids = DateYearMonthDayEmbedding(config.max_position_embeddings, embed_dim, self.padding_idx)
        else:
            self.embed_visitids = nn.Embedding(1460, embed_dim)
        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.encoder_layers)])
        self.layernorm_embedding = torch.nn.LayerNorm(embed_dim, 1e-5, True) if config.normalize_embedding else nn.Identity()
        # mbart has one extra layer_norm
        self.layer_norm = torch.nn.LayerNorm(config.d_model, 1e-5, True) if config.normalize_before else None
        print("BartEncoder- finish init")

    def forward(
        self, input_ids, attention_mask=None, visit_ids=None
    ):
        """
        Args:
            input_ids (LongTensor): tokens in the source language of shape
                `(batch, src_len)`
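            attention_mask (torch.LongTensor): indicating which indices are padding tokens.
            visit_ids (LongTensor): visit identifier for each token, of shape
                `(batch, src_len)`; embedded and added to the token and position embeddings.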
        Returns:
            Tuple comprised of:
                - **x** (Tensor): the last encoder layer's output of
                  shape `(src_len, batch, embed_dim)`
                - **encoder_states** (List[Tensor]): all intermediate
                  hidden states of shape `(src_len, batch, embed_dim)`.
                  Only populated if *self.output_hidden_states:* is True.
                - **all_attentions** (List[Tensor]): Attention weights for each layer.
                During training might not be of length n_layers because of layer dropout.
        """
        # check attention mask and invert
        if attention_mask is not None:
            attention_mask = invert_mask(attention_mask)

        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale
        embed_pos = self.embed_positions(input_ids)
        embed_visit = self.embed_visitids(visit_ids)
        x = inputs_embeds + embed_pos + embed_visit
        x = self.layernorm_embedding(x)
        x = F.dropout(x, p=self.dropout, training=self.training)

        # B x T x C -> T x B x C
        x = x.transpose(0, 1)

        encoder_states, all_attentions = [], []
        for encoder_layer in self.layers:
            if self.output_hidden_states:
                encoder_states.append(x)
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            dropout_probability = random.uniform(0, 1)
            if self.training and (dropout_probability < self.layerdrop):  # skip the layer
                attn = None
            else:
                x, attn = encoder_layer(x, attention_mask)

            if self.output_attentions:
                all_attentions.append(attn)

        if self.layer_norm:
            x = self.layer_norm(x)
        if self.output_hidden_states:
            encoder_states.append(x)

        # T x B x C -> B x T x C
        encoder_states = [hidden_state.transpose(0, 1) for hidden_state in encoder_states]
        x = x.transpose(0, 1)

        return x, encoder_states, all_attentions
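
The encoder input is the sum of three embeddings: token, position and visit. The sketch below reproduces that sum on a toy batch. It is a simplification under stated assumptions: plain `nn.Embedding` tables stand in for `LearnedPositionalEmbedding` and for the visit embedding, positions are a simple `arange` rather than the padding-aware offsets used in the notebook, and the ids and sizes (other than the 1460 visit slots) are invented for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, max_positions, num_visits, embed_dim, pad_id = 20, 16, 1460, 8, 1

embed_tokens = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
embed_positions = nn.Embedding(max_positions, embed_dim)  # stand-in for the learned positions
embed_visits = nn.Embedding(num_visits, embed_dim)        # one vector per visit index

# Two toy patients; the second sequence is padded to the batch length.
input_ids = torch.tensor([[0, 5, 6, 7, 2], [0, 8, 2, pad_id, pad_id]])
# Hypothetical visit indices: tokens from the same visit share an id.
visit_ids = torch.tensor([[0, 1, 1, 2, 2], [0, 1, 1, 0, 0]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0).expand_as(input_ids)

x = embed_tokens(input_ids) + embed_positions(position_ids) + embed_visits(visit_ids)
x = x.transpose(0, 1)  # B x T x C -> T x B x C, the layout the encoder layers expect
print(x.shape)
```

Tokens recorded in the same visit share one visit vector, which lets the model distinguish visit boundaries independently of token order.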

####Decoder and BART Decoder

In [None]:
#Code Reference: https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py

class DecoderLayer(nn.Module):
    def __init__(self, config: BartConfig):
        super().__init__()
        self.embed_dim = config.d_model
        self.output_attentions = config.output_attentions
        self.self_attn = SelfAttention(
            embed_dim=self.embed_dim, num_heads=config.decoder_attention_heads, dropout=config.attention_dropout,
        )
        self.dropout = config.dropout
        self.activation_fn = ACT2FN[config.activation_function]
        self.activation_dropout = config.activation_dropout
        self.normalize_before = config.normalize_before

        self.self_attn_layer_norm = torch.nn.LayerNorm(self.embed_dim, 1e-5, True)
        self.encoder_attn = SelfAttention(
            self.embed_dim,
            config.decoder_attention_heads,
            dropout=config.attention_dropout,
            encoder_decoder_attention=True,
        )
        self.encoder_attn_layer_norm = torch.nn.LayerNorm(self.embed_dim, 1e-5, True)
        self.fc1 = nn.Linear(self.embed_dim, config.decoder_ffn_dim)
        self.fc2 = nn.Linear(config.decoder_ffn_dim, self.embed_dim)
        self.final_layer_norm = torch.nn.LayerNorm(self.embed_dim, 1e-5, True)

    def forward(
        self,
        x,
        encoder_hidden_states,
        encoder_attn_mask=None,
        layer_state=None,
        causal_mask=None,
        decoder_padding_mask=None,
    ):
        residual = x

        if layer_state is None:
            layer_state = {}
        if self.normalize_before:
            x = self.self_attn_layer_norm(x)
        # Self Attention

        x, self_attn_weights = self.self_attn(
            query=x,
            key=x,
            layer_state=layer_state,  # adds keys to layer state
            key_padding_mask=decoder_padding_mask,
            attn_mask=causal_mask,
            need_weights=self.output_attentions,
        )
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        if not self.normalize_before:
            x = self.self_attn_layer_norm(x)

        # Cross attention
        residual = x
        assert self.encoder_attn.cache_key != self.self_attn.cache_key
        if self.normalize_before:
            x = self.encoder_attn_layer_norm(x)
        x, _ = self.encoder_attn(
            query=x,
            key=encoder_hidden_states,
            key_padding_mask=encoder_attn_mask,
            layer_state=layer_state,  # mutates layer state
        )
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        if not self.normalize_before:
            x = self.encoder_attn_layer_norm(x)

        # Fully Connected
        residual = x
        if self.normalize_before:
            x = self.final_layer_norm(x)
        x = self.activation_fn(self.fc1(x))
        x = F.dropout(x, p=self.activation_dropout, training=self.training)
        x = self.fc2(x)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = residual + x
        if not self.normalize_before:
            x = self.final_layer_norm(x)
        return (
            x,
            self_attn_weights,
            layer_state,
        )  # just self_attn weights for now, following t5, layer_state = cache for decoding


class BartDecoder(nn.Module):
    """
    Transformer decoder consisting of *config.decoder_layers* layers. Each layer
    is a :class:`DecoderLayer`.
    Args:
        config: BartConfig
        embed_tokens (torch.nn.Embedding): output embedding
    """

    def __init__(self, config: BartConfig, embed_tokens: nn.Embedding):
        super().__init__()
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states
        self.dropout = config.dropout
        self.layerdrop = config.decoder_layerdrop
        self.padding_idx = embed_tokens.padding_idx
        self.max_target_positions = config.max_position_embeddings
        self.embed_scale = math.sqrt(config.d_model) if config.scale_embedding else 1.0
        self.embed_tokens = embed_tokens
        if config.static_position_embeddings:
            self.embed_positions = SinusoidalPositionalEmbedding(
                config.max_position_embeddings, config.d_model, config.pad_token_id
            )
        else:
            self.embed_positions = LearnedPositionalEmbedding(
                config.max_position_embeddings, config.d_model, self.padding_idx,
            )
        self.layers = nn.ModuleList(
            [DecoderLayer(config) for _ in range(config.decoder_layers)]
        )  # type: List[DecoderLayer]
        self.layernorm_embedding = torch.nn.LayerNorm(config.d_model, 1e-5, True) if config.normalize_embedding else nn.Identity()
        self.layer_norm = torch.nn.LayerNorm(config.d_model, 1e-5, True) if config.add_final_layer_norm else None

    def forward(
        self,
        input_ids,
        encoder_hidden_states,
        encoder_padding_mask,
        decoder_padding_mask,
        decoder_causal_mask,
        decoder_cached_states=None,
        use_cache=False,
        **unused
    ):
        """
        Includes several features from "Jointly Learning to Align and
        Translate with Transformer Models" (Garg et al., EMNLP 2019).

        Args:
            input_ids (LongTensor): previous decoder outputs of shape
                `(batch, tgt_len)`, for teacher forcing
            encoder_hidden_states: output from the encoder, used for
                encoder-side attention
            encoder_padding_mask: for ignoring pad tokens
            decoder_cached_states (dict or None): dictionary used for storing state during generation

        Returns:
            tuple:
                - the decoder's features of shape `(batch, tgt_len, embed_dim)`
                - hidden states
                - attentions
        """
        # check attention mask and invert
        if encoder_padding_mask is not None:
            encoder_padding_mask = invert_mask(encoder_padding_mask)

        # embed positions
        positions = self.embed_positions(input_ids, use_cache=use_cache)

        if use_cache:
            input_ids = input_ids[:, -1:]
            positions = positions[:, -1:]  # happens after we embed them
            # assert input_ids.ne(self.padding_idx).any()

        x = self.embed_tokens(input_ids) * self.embed_scale
        x += positions
        x = self.layernorm_embedding(x)
        x = F.dropout(x, p=self.dropout, training=self.training)

        # Convert to the layer input format: (BS, seq_len, model_dim) -> (seq_len, BS, model_dim)
        x = x.transpose(0, 1)
        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)

        # decoder layers
        all_hidden_states = ()
        all_self_attns = ()
        next_decoder_cache = []
        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if self.output_hidden_states:
                all_hidden_states += (x,)
            dropout_probability = random.uniform(0, 1)
            if self.training and (dropout_probability < self.layerdrop):
                continue

            layer_state = decoder_cached_states[idx] if decoder_cached_states is not None else None

            x, layer_self_attn, layer_past = decoder_layer(
                x,
                encoder_hidden_states,
                encoder_attn_mask=encoder_padding_mask,
                decoder_padding_mask=decoder_padding_mask,
                layer_state=layer_state,
                causal_mask=decoder_causal_mask,
            )

            if use_cache:
                next_decoder_cache.append(layer_past.copy())

            if self.layer_norm and (idx == len(self.layers) - 1):  # last layer of mbart
                x = self.layer_norm(x)
            if self.output_attentions:
                all_self_attns += (layer_self_attn,)

        # Convert to standard output format: (seq_len, BS, model_dim) -> (BS, seq_len, model_dim)
        all_hidden_states = [hidden_state.transpose(0, 1) for hidden_state in all_hidden_states]
        x = x.transpose(0, 1)
        encoder_hidden_states = encoder_hidden_states.transpose(0, 1)

        if use_cache:
            next_cache = ((encoder_hidden_states, encoder_padding_mask), next_decoder_cache)
        else:
            next_cache = None
        return x, next_cache, all_hidden_states, list(all_self_attns)
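
With `use_cache=True` the decoder feeds only the newest token and reuses the cached keys and values from earlier steps. The following single-head sketch, with made-up sizes and no projections, is an illustration of why that is safe: attending step by step over a growing cache gives the same result as one full pass with a causal mask.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, head_dim = 5, 4  # illustrative sizes
q = torch.randn(seq_len, head_dim)
k = torch.randn(seq_len, head_dim)
v = torch.randn(seq_len, head_dim)

# Full pass: -inf above the diagonal hides future positions.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), 1)
full = F.softmax(q @ k.T + causal_mask, dim=-1) @ v

# Incremental pass: append one key/value per step and attend with the newest query only.
cache_k = torch.empty(0, head_dim)
cache_v = torch.empty(0, head_dim)
steps = []
for t in range(seq_len):
    cache_k = torch.cat([cache_k, k[t : t + 1]], dim=0)
    cache_v = torch.cat([cache_v, v[t : t + 1]], dim=0)
    weights = F.softmax(q[t : t + 1] @ cache_k.T, dim=-1)
    steps.append(weights @ cache_v)
incremental = torch.cat(steps, dim=0)

print(torch.allclose(full, incremental, atol=1e-6))
```

This is also why no causal mask is needed during cached generation: the cache never contains future positions.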


####Model1 - Bart (Bidirectional encoder and left-to-right decoder) Model

In [None]:
#Code Reference: https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py

def _prepare_bart_decoder_inputs(
    config, input_ids, decoder_input_ids=None, decoder_padding_mask=None, causal_mask_dtype=torch.float32
):
    """Prepare masks that ignore padding tokens in the decoder and a causal mask for the decoder if
    none are provided. This mimics the default behavior in fairseq. To override it pass in masks.
    Note: this is not called during generation
    """
    pad_token_id = config.pad_token_id
    if decoder_input_ids is None:
        decoder_input_ids = shift_tokens_right(input_ids, pad_token_id)
    bsz, tgt_len = decoder_input_ids.size()
    if decoder_padding_mask is None:
        decoder_padding_mask = make_padding_mask(decoder_input_ids, pad_token_id)
    else:
        decoder_padding_mask = invert_mask(decoder_padding_mask)
    causal_mask = torch.triu(fill_with_neg_inf(torch.zeros(tgt_len, tgt_len)), 1).to(
        dtype=causal_mask_dtype, device=decoder_input_ids.device
    )
    return decoder_input_ids, decoder_padding_mask, causal_mask


class PretrainedBartModel(PreTrainedModel):
    config_class = BartConfig
    base_model_prefix = "model"

    def _init_weights(self, module):
        std = self.config.init_std
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, SinusoidalPositionalEmbedding):
            pass
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

    @property
    def dummy_inputs(self):
        pad_token = self.config.pad_token_id
        input_ids = torch.tensor([[0, 6, 10, 4, 2], [0, 8, 12, 2, pad_token]], device=self.device)
        dummy_inputs = {
            "attention_mask": input_ids.ne(pad_token),
            "input_ids": input_ids,
        }
        return dummy_inputs

class BartModel(PretrainedBartModel):
    def __init__(self, config: BartConfig):
        super().__init__(config)
        self.output_attentions = config.output_attentions
        self.output_hidden_states = config.output_hidden_states

        padding_idx, vocab_size = config.pad_token_id, config.vocab_size
        self.shared = nn.Embedding(vocab_size, config.d_model, padding_idx=padding_idx)
        print("BartModel starting BartEncoder")
        self.encoder = BartEncoder(config, self.shared)
        print("BartModel finish BartEncoder")
        self.decoder = BartDecoder(config, self.shared)
        print("BartModel finish BartDecoder")

        self.init_weights()
        print("BartModel finish init")

    def forward(
        self,
        input_ids,
        attention_mask=None,
        decoder_input_ids=None,
        encoder_outputs: Optional[Tuple] = None,
        decoder_attention_mask=None,
        decoder_cached_states=None,
        use_cache=False,
        visit_ids=None
    ):

        # make masks if user doesn't supply
        if not use_cache:
            decoder_input_ids, decoder_padding_mask, causal_mask = _prepare_bart_decoder_inputs(
                self.config,
                input_ids,
                decoder_input_ids=decoder_input_ids,
                decoder_padding_mask=decoder_attention_mask,
                causal_mask_dtype=self.shared.weight.dtype,
            )
        else:
            decoder_padding_mask, causal_mask = None, None

        assert decoder_input_ids is not None
        if encoder_outputs is None:
            encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask, visit_ids=visit_ids)
        assert isinstance(encoder_outputs, tuple)
        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
        decoder_outputs = self.decoder(
            decoder_input_ids,
            encoder_outputs[0],
            attention_mask,
            decoder_padding_mask,
            decoder_causal_mask=causal_mask,
            decoder_cached_states=decoder_cached_states,
            use_cache=use_cache,
        )
        # Attention and hidden_states will be [] or None if they aren't needed
        decoder_outputs: Tuple = _filter_out_falsey_values(decoder_outputs)
        assert isinstance(decoder_outputs[0], torch.Tensor)
        encoder_outputs: Tuple = _filter_out_falsey_values(encoder_outputs)
        return decoder_outputs + encoder_outputs

    def get_input_embeddings(self):
        return self.shared

    def set_input_embeddings(self, value):
        self.shared = value
        self.encoder.embed_tokens = self.shared
        self.decoder.embed_tokens = self.shared

    def get_output_embeddings(self):
        return _make_linear_from_emb(self.shared)  # make it on the fly
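
`_prepare_bart_decoder_inputs` produces two masks that the decoder self-attention later combines: a boolean padding mask and an additive causal mask. The sketch below rebuilds both on a toy batch and applies them to dummy scores. The token ids are invented, and `decoder_input_ids.eq(pad_token_id)` is used as an assumed stand-in for the notebook's `make_padding_mask`.

```python
import torch

pad_token_id = 1
# Toy decoder inputs; the second sequence ends with one padding token.
decoder_input_ids = torch.tensor([[2, 0, 7, 9], [2, 0, 5, pad_token_id]])
bsz, tgt_len = decoder_input_ids.size()

# True where the decoder input is padding (assumed equivalent of make_padding_mask).
padding_mask = decoder_input_ids.eq(pad_token_id)

# Additive causal mask: 0 on and below the diagonal, -inf above it.
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), 1)

# Apply both masks to dummy attention scores of shape (bsz, tgt_len, tgt_len).
scores = torch.zeros(bsz, tgt_len, tgt_len)
masked = scores + causal_mask
masked = masked.masked_fill(padding_mask.unsqueeze(1), float("-inf"))

visible = torch.isfinite(masked)  # True where attention is allowed
print(visible[1].int())
```

In the second sequence the last row can attend to the first three positions only, because the fourth key is padding even though the causal mask would allow it.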


#### Model2 - ICDBART Model

In [None]:
#Code Reference: https://github.com/whaleloops/TransformEHR/blob/main/icdmodelbart.py

def _reorder_buffer(attn_cache, new_order):
    for k, input_buffer_k in attn_cache.items():
        if input_buffer_k is not None:
            attn_cache[k] = input_buffer_k.index_select(0, new_order)
    return attn_cache

class ICDBartForPreTraining(PretrainedBartModel):
    base_model_prefix = "model"

    def __init__(self, config: BartConfig):
        print("ICDBartForPreTraining starting super init")
        super().__init__(config)
        print("ICDBartForPreTraining finished super init")
        base_model = BartModel(config)
        print("ICDBartForPreTraining finished base_model")
        self.model = base_model
        self.register_buffer("final_logits_bias", torch.zeros((1, self.model.shared.num_embeddings)))

    def resize_token_embeddings(self, new_num_tokens: int) -> nn.Embedding:
        old_num_tokens = self.model.shared.num_embeddings
        new_embeddings = super().resize_token_embeddings(new_num_tokens)
        self.model.shared = new_embeddings
        self._resize_final_logits_bias(new_num_tokens, old_num_tokens)
        return new_embeddings

    def _resize_final_logits_bias(self, new_num_tokens: int, old_num_tokens: int) -> None:
        if new_num_tokens <= old_num_tokens:
            new_bias = self.final_logits_bias[:, :new_num_tokens]
        else:
            extra_bias = torch.zeros((1, new_num_tokens - old_num_tokens), device=self.final_logits_bias.device)
            new_bias = torch.cat([self.final_logits_bias, extra_bias], dim=1)
        self.register_buffer("final_logits_bias", new_bias)

    def forward(
        self,
        input_ids,
        attention_mask=None,
        encoder_outputs=None,
        decoder_input_ids=None,
        decoder_attention_mask=None,
        decoder_cached_states=None,
        lm_labels=None,
        use_cache=False,
        visit_ids=None,
        **unused
    ):
        r"""
        lm_labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`, defaults to :obj:`None`):
            Labels for computing the masked language modeling loss.
            Indices should either be in ``[0, ..., config.vocab_size - 1]`` or -100 (see ``input_ids`` docstring).
            Tokens with indices set to ``-100`` are ignored (masked); the loss is only computed for the tokens
            with labels in ``[0, ..., config.vocab_size - 1]``.

    Returns:
        :obj:`tuple(torch.FloatTensor)` comprising various elements depending on the configuration (:class:`BartConfig`) and inputs:
        masked_lm_loss (`optional`, returned when ``lm_labels`` is provided) ``torch.FloatTensor`` of shape ``(1,)``:
            Masked language modeling loss.
        prediction_scores (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, config.vocab_size)`)
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        hidden_states (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_hidden_states=True``):
            Tuple of :obj:`torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer)
            of shape :obj:`(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        attentions (:obj:`tuple(torch.FloatTensor)`, `optional`, returned when ``config.output_attentions=True``):
            Tuple of :obj:`torch.FloatTensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.

    Examples::

            # Mask filling only works for bart-large
            from transformers import BartTokenizer, BartForConditionalGeneration
            tokenizer = BartTokenizer.from_pretrained('facebook/bart-large')
            TXT = "My friends are <mask> but they eat too many carbs."
            model = BartForConditionalGeneration.from_pretrained('facebook/bart-large')
            input_ids = tokenizer.batch_encode_plus([TXT], return_tensors='pt')['input_ids']
            logits = model(input_ids)[0]
            masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
            probs = logits[0, masked_index].softmax(dim=0)
            values, predictions = probs.topk(5)
            tokenizer.decode(predictions).split()
            # ['good', 'great', 'all', 'really', 'very']
        """
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            encoder_outputs=encoder_outputs,
            decoder_attention_mask=decoder_attention_mask,
            decoder_cached_states=decoder_cached_states,
            use_cache=use_cache,
            visit_ids=visit_ids,
        )
        lm_logits = F.linear(outputs[0], self.model.shared.weight, bias=self.final_logits_bias)
        outputs = (lm_logits,) + outputs[1:]  # Add cache, hidden states and attention if they are here
        if lm_labels is not None:
            loss_fct = nn.CrossEntropyLoss(ignore_index = -100)
            # TODO(SS): do we need to ignore pad tokens in lm_labels?
            masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), lm_labels.view(-1))
            outputs = (masked_lm_loss,) + outputs

        return outputs

    def prepare_inputs_for_generation(self, decoder_input_ids, past, attention_mask, use_cache, **kwargs):
        assert past is not None, "past has to be defined for encoder_outputs"

        # first step, decoder_cached_states are empty
        if not past[1]:
            encoder_outputs, decoder_cached_states = past, None
        else:
            encoder_outputs, decoder_cached_states = past
        return {
            "input_ids": None,  # encoder_outputs is defined. input_ids not needed
            "encoder_outputs": encoder_outputs,
            "decoder_cached_states": decoder_cached_states,
            "decoder_input_ids": decoder_input_ids,
            "attention_mask": attention_mask,
            "use_cache": use_cache,  # change this to avoid caching (presumably for debugging)
        }

    def prepare_logits_for_generation(self, logits, cur_len, max_length):
        if cur_len == 1:
            self._force_token_ids_generation(logits, self.config.bos_token_id)
        if cur_len == max_length - 1 and self.config.eos_token_id is not None:
            self._force_token_ids_generation(logits, self.config.eos_token_id)
        return logits

    def _force_token_ids_generation(self, scores, token_ids) -> None:
        """force one of token_ids to be generated by setting prob of all other tokens to 0"""
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        all_but_token_ids_mask = torch.tensor(
            [x for x in range(self.config.vocab_size) if x not in token_ids],
            dtype=torch.long,
            device=next(self.parameters()).device,
        )
        assert len(scores.shape) == 2, "scores should be of rank 2 with shape: [batch_size, vocab_size]"
        scores[:, all_but_token_ids_mask] = -float("inf")

    @staticmethod
    def _reorder_cache(past, beam_idx):
        ((enc_out, enc_mask), decoder_cached_states) = past
        reordered_past = []
        for layer_past in decoder_cached_states:
            # get the correct batch idx from decoder layer's batch dim for cross and self-attn
            layer_past_new = {
                attn_key: _reorder_buffer(attn_cache, beam_idx) for attn_key, attn_cache in layer_past.items()
            }
            reordered_past.append(layer_past_new)

        new_enc_out = enc_out if enc_out is None else enc_out.index_select(0, beam_idx)
        new_enc_mask = enc_mask if enc_mask is None else enc_mask.index_select(0, beam_idx)

        past = ((new_enc_out, new_enc_mask), reordered_past)
        return past

    def get_encoder(self):
        return self.model.encoder

    def get_output_embeddings(self):
        return _make_linear_from_emb(self.model.shared)  # make it on the fly
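
The `_force_token_ids_generation` helper above forces a token by setting the score of every other vocabulary entry to negative infinity, so that after a softmax only the allowed token keeps a non-zero probability. The following is a minimal sketch of that idea using plain Python lists instead of tensors, so it runs without PyTorch. The names `force_token_ids` and `softmax` are illustrative and not part of the model code, and unlike the helper above the sketch returns a copy rather than modifying the scores in place.

```python
import math


def force_token_ids(scores, token_ids):
    """Return a copy of `scores` where every token not in `token_ids` is -inf.

    `scores` is a list of rows (one per batch item), each of length vocab_size.
    """
    if isinstance(token_ids, int):
        token_ids = [token_ids]
    allowed = set(token_ids)
    return [
        [s if i in allowed else -math.inf for i, s in enumerate(row)]
        for row in scores
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    top = max(row)
    exps = [math.exp(s - top) for s in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [[0.5, 2.0, -1.0, 0.1]]     # batch_size=1, toy vocab_size=4
forced = force_token_ids(scores, 2)  # force token id 2
probs = softmax(forced[0])
print(probs)
```

Masking with negative infinity rather than zero matters because the values are scores before the softmax: a score of zero would still receive probability mass.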


# Model Training

The computational requirements for this implementation are:

*   Environment:
    *   Google Colab
    *   GPU runtime (may be needed)

*   Python 3.8.11 with libraries:
    *   NumPy (currently tested on version 1.20.3)
    *   PyTorch (currently tested on version 1.9.0+cu111)
    *   Transformers (currently tested on version 4.16.2)
    *   Accelerate (0.21.0)
    *   tqdm (4.62.2)
    *   scikit-learn (0.24.2)
    *   PyHealth (v1.1.6, the latest release at the time of writing)
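
Because the install cell upgrades some packages, the versions present in a Colab session may differ from the ones listed above. The sketch below compares installed versions against that list using only the standard library. The distribution names are assumed to be the usual PyPI names, and `TESTED_VERSIONS`, `installed_version`, and `compare_versions` are illustrative names, not part of the project code.

```python
from importlib import metadata

# Versions taken from the requirements list; PyPI distribution names assumed.
TESTED_VERSIONS = {
    "numpy": "1.20.3",
    "torch": "1.9.0+cu111",
    "transformers": "4.16.2",
    "accelerate": "0.21.0",
    "tqdm": "4.62.2",
    "scikit-learn": "0.24.2",
    "pyhealth": "1.1.6",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(tested):
    """Map each package to (tested_version, installed_version, matches)."""
    report = {}
    for package, wanted in tested.items():
        found = installed_version(package)
        report[package] = (wanted, found, found == wanted)
    return report


for package, (wanted, found, ok) in compare_versions(TESTED_VERSIONS).items():
    status = "ok" if ok else "differs"
    print(f"{package}: tested {wanted}, installed {found} ({status})")
```

A mismatch is not necessarily a problem, but it is a useful first thing to check when code from the original repository behaves differently in the notebook.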

### Model Initialization

In [None]:
# Initialize the BART model from the pretrained "facebook/bart-large" configuration.
# Note: customized code for this project.

model_name = "facebook/bart-large"
tokenizer = BartTokenizer.from_pretrained(model_name)
config = BartConfig.from_pretrained(model_name)
print(config)
# Project-specific settings added to the configuration after it is printed.
config.static_position_embeddings = False
config.date_visit_embeddings = True

model = PretrainedBartModel(config)
print(model)



BartConfig {
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "BartModel"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classif_dropout": 0.1,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "no_repeat_ngram_size":

In [None]:
#Model Training Configuration
#Note: Customized code for Project purpose

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer)
TEST_ID_STRS = [
    'chronic_PTSD_0',
    'type_2_diabtes_0',
    'hyperlipidemia_0',
    'loin_pain_0',
    'low_back_pain_0',
    'PTSD_0',
    'obstructive_sleep_apnea_hypopnea_0',
    'mental_depression_0',
    'chronic_obstructive_airway_disease_0',
    'sensorineural_hearing_loss_0',
    'gastroesophagel_reflux_disease_without_esophagitis_0',
    'gastroesophagel_reflux_disease_0',
    'coronary_arteriosclerosis_0',
    'arteriosclerotic_heart_disease_0',
    'chronic_PTSD_3',
    'type_2_diabtes_3',
    'hyperlipidemia_3',
    'loin_pain_3',
    'low_back_pain_3',
    'PTSD_3',
    'obstructive_sleep_apnea_hypopnea_3',
    'mental_depression_3',
    'chronic_obstructive_airway_disease_3',
    'sensorineural_hearing_loss_3',
    'gastroesophagel_reflux_disease_without_esophagitis_3',
    'gastroesophagel_reflux_disease_3',
    'coronary_arteriosclerosis_3',
    'arteriosclerotic_heart_disease_3',
    'chronic_PTSD_6',
    'type_2_diabtes_6',
    'hyperlipidemia_6',
    'loin_pain_6',
    'low_back_pain_6',
    'PTSD_6',
    'obstructive_sleep_apnea_hypopnea_6',
    'mental_depression_6',
    'chronic_obstructive_airway_disease_6',
    'sensorineural_hearing_loss_6',
    'gastroesophagel_reflux_disease_without_esophagitis_6',
    'gastroesophagel_reflux_disease_6',
    'coronary_arteriosclerosis_6',
    'arteriosclerotic_heart_disease_6',
    'least_happen'
]

training_args = TrainingArguments(
    output_dir="./results_icd",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
print(training_args)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    # compute_metrics=compute_metrics,
    # prediction_loss_only=False,
)


TrainingArguments(
_n_gpu=0,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=epoch,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_

dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
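
With `warmup_steps = 500` and the `Trainer` default linear scheduler, the learning rate rises linearly from 0 to `learning_rate` over the first 500 optimizer steps and then decays linearly to 0 by the last step. The sketch below reproduces that schedule in plain Python to show how the settings interact. It assumes a single device and no gradient accumulation, and the dataset size of 10,000 examples is hypothetical; `total_training_steps` and `linear_warmup_lr` are illustrative names.

```python
import math


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# batch_size, num_epochs, base_lr and warmup_steps mirror the TrainingArguments;
# num_examples is a hypothetical dataset size.
total = total_training_steps(num_examples=10_000, batch_size=4, num_epochs=3)
for step in (0, 250, 500, 4000, total):
    print(step, linear_warmup_lr(step, 2e-5, 500, total))
```

One consequence worth checking on a reduced subset of the data: if the total number of steps is below 500, the warmup never completes and the learning rate never reaches `2e-5`.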


In [None]:
# Model training
# Note: customized code for this project.
# Important: training currently fails because of a data collation/tokenization issue.
trainer.train()
trainer.save_model()

KeyError: 22
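
The full traceback is not shown, so the cause of this `KeyError` is not established here. One common way to get an integer `KeyError` from a data loader is to pass it a container that is indexed by row label rather than by position, because a map-style dataset is expected to support `len()` and integer positions from `0` to `len - 1`. The sketch below illustrates that situation under this assumption, with a plain dictionary standing in for the data. `PositionalDataset` and `rows_by_label` are hypothetical names and are not taken from the notebook.

```python
class PositionalDataset:
    """Wrap label-keyed rows so they can be fetched by position 0..len-1."""

    def __init__(self, rows_by_label):
        # Freeze the iteration order once; positions index into this list.
        self._rows = list(rows_by_label.values())

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, position):
        return self._rows[position]


# Hypothetical rows keyed by their original labels; after a shuffled split
# the labels are no longer the contiguous range 0..N-1.
rows_by_label = {
    105: {"input_ids": [0, 11, 2]},
    7: {"input_ids": [0, 12, 2]},
    42: {"input_ids": [0, 13, 2]},
}

# A loader asks for positions, not labels, so a direct lookup can fail.
try:
    rows_by_label[1]
except KeyError as err:
    print("KeyError:", err)

dataset = PositionalDataset(rows_by_label)
print(len(dataset), dataset[1])
```

If this is the cause, wrapping the training and test data in a class of this shape before passing them to `Trainer` would be one way to rule it out.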

## Model Evaluation

In [None]:
# Model evaluation
# Note: customized code for this project.
# Important: evaluation cannot run until model training succeeds.
eval_output = trainer.evaluate()
print(eval_output)
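
Once training succeeds, `trainer.evaluate()` returns a dictionary of metrics that includes `eval_loss`. A common way to summarize a language-modeling loss is perplexity, the exponential of the mean cross-entropy loss. The sketch below shows that conversion; the loss value is made up for illustration and `perplexity_from_loss` is an illustrative helper, not part of the project code.

```python
import math


def perplexity_from_loss(eval_loss):
    """Perplexity is exp(mean cross-entropy loss); returns inf on overflow."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        return float("inf")


# Hypothetical metrics dictionary shaped like the output of trainer.evaluate().
example_eval_output = {"eval_loss": 2.0}
print(perplexity_from_loss(example_eval_output["eval_loss"]))
```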

# Results
Project draft achievements:

*   Extracted the MIMIC IV dataset from https://physionet.org/content/mimiciv/2.2
*   Mounted the data on Google Drive.
*   Performed the data preprocessing required to test the hypothesis. (Most of the time went into extracting this large dataset and creating a subset that runs on the available computing resources.)
*   Extracted the models required for the hypothesis and ablation implementation from the available GitHub code. (The paper implements multiple use cases, and selecting the necessary code was challenging.)
*   Updated the model implementation code to run without compilation errors. (We had to customize the model initialization, training, and evaluation parts because the paper's code uses multiple classes that cater to different use cases.)
*   Attempted to train the BART model; training currently fails because of a data collation/tokenization issue. (The paper uses the Transformers library and customizes a few of its classes. We did not use this library in the DLH course assignments, so learning it while implementing the code is currently our biggest challenge.)

Final project plan:
*   Train and evaluate the BART and ICDBART models without errors.
*   Train and evaluate the BART and ICDBART models with the planned ablations.
*   Compare the models to test the hypothesis.



In [None]:
# TODO for Final Project

## Model comparison

In [None]:
# TODO for Final Project

# Discussion




In [None]:
#TODO for Final Project

# References



1. Citation to original paper: Yang, Z., Mitra, A., Liu, W. et al. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat Commun 14, 7857 (2023). https://doi.org/10.1038/s41467-023-43715-z. PMID: 38030638; PMCID: PMC10687211.
2. https://www.nature.com/articles/s41467-023-43715-z
3. https://physionet.org/content/mimiciv/2.2/icu/#files-panel
4. https://github.com/whaleloops/TransformEHR




In [None]:
# Unmount the drive
from google.colab import drive
drive.flush_and_unmount()