# CS598 Project: Enhancing Healthcare Predictive Models through Text-Based EHR Code Embedding

**Name:** Gene Horecka  
**Email:** [geneeh2@illinois.edu](mailto:geneeh2@illinois.edu)  
**Course:** CS 598 Deep Learning for Healthcare - Spring 2024
## Project GitHub Link: [https://github.com/genefever/cs598_descemb_project](https://github.com/genefever/cs598_descemb_project)

In [13]:
import torch
print(torch.cuda.is_available())

True


In [14]:
!git pull

remote: Enumerating objects: 9, done.
remote: Counting objects: 100% (9/9), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 5 (delta 3), reused 5 (delta 3), pack-reused 0
Unpacking objects: 100% (5/5), 595 bytes | 297.00 KiB/s, done.
From https://github.com/genefever/cs598_descemb_project
   4933420..0c92977  gpu        -> origin/gpu
Already up to date.


In [3]:
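!git clone https://github.com/genefever/cs598_descemb_project.git
# Note: each `!` command runs in its own subshell, so `!cd` does not change the
# notebook's working directory and the two git commands below run outside the clone.
# The next cell switches into the repository with os.chdir().
!cd cs598_descemb_project
!git checkout gpu
!git branch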

Cloning into 'cs598_descemb_project'...
remote: Enumerating objects: 307, done.
remote: Counting objects: 100% (307/307), done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 307 (delta 146), reused 282 (delta 122), pack-reused 0
Receiving objects: 100% (307/307), 16.29 MiB | 12.10 MiB/s, done.
Resolving deltas: 100% (146/146), done.
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git


In [4]:
import os
# from google.colab import drive
# drive.mount('/content/drive')

os.chdir('/content/cs598_descemb_project')
print("Current Working Directory: ", os.getcwd())

Current Working Directory:  /content/cs598_descemb_project



# Introduction

The paper "[Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding](https://arxiv.org/abs/2108.03625)" addresses the significant challenge of heterogeneity in Electronic Health Records (EHR) systems. These systems, essential for modern healthcare, often differ in their coding and formatting of medical data, which hampers the development and application of predictive models across different institutions or datasets.

The primary contribution of this paper is the development of a novel framework named [Description-based Embedding (DescEmb)](https://github.com/hoon9405/DescEmb). This framework uses neural language models to create a unified, code-agnostic text-based representation of medical data. By transforming various coding formats into a consistent text-based embedding, DescEmb allows for more flexible and effective application of deep learning models across diverse EHR systems. This approach notably enhances the performance of predictive healthcare models, demonstrating superior results in several experimental setups compared to traditional code-based embedding methods.


# Scope of Reproducibility

The scope of reproducibility for the paper "[Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding](https://arxiv.org/abs/2108.03625)" entails verifying the claims of improved predictive performance through the implementation of the DescEmb framework. This framework leverages a neural language model to convert medical codes into a unified, text-based embedding, which is purported to enhance predictive healthcare research without the constraints imposed by diverse EHR systems.

## Key Claims for Reproduction:
1. **Unified Learning Across Diverse EHR Formats:** DescEmb can unify learning across various EHR systems without needing individualized pre-processing or domain-specific knowledge, due to its text-based nature.
2. **Superior Predictive Performance:** The framework demonstrates better or comparable predictive performance than traditional code-based approaches across several clinical prediction tasks.
3. **Efficient Deployment in Diverse Environments:** The text-based approach allows models trained with DescEmb to be easily transferred and applied across different hospitals with differing EHR systems.

The reproducibility effort will focus on these claims by attempting to replicate the experiments outlined in the original paper using the datasets and code provided in the project's [GitHub repository](https://github.com/hoon9405/DescEmb). The process will involve re-running the model training and evaluation procedures to verify the reported performance improvements and the operational flexibility of the DescEmb approach.


# Methodology

## Requirements

- PyTorch version >= 1.8.1
- Python version >= 3.7
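
A quick way to confirm that an interpreter satisfies these minimums is sketched below. The helper names `parse_version` and `meets_minimum` are illustrative and not part of the project, and the PyTorch check is simply skipped when `torch` cannot be imported.

```python
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.8.1+cu111' into a tuple like (1, 8, 1)."""
    parts = []
    # Drop any local build suffix after '+', then read the leading digits of each piece.
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(found, minimum):
    """Compare versions numerically, so that '1.10.0' counts as newer than '1.8.1'."""
    return parse_version(found) >= parse_version(minimum)


python_version = ".".join(str(n) for n in sys.version_info[:3])
print("Python", python_version, "ok:", meets_minimum(python_version, "3.7"))

try:
    import torch
    print("PyTorch", torch.__version__, "ok:", meets_minimum(torch.__version__, "1.8.1"))
except ImportError:
    print("PyTorch is not installed in this environment")
```

Comparing tuples of integers rather than raw strings avoids the common mistake of treating `1.10` as older than `1.8`.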

## Setting Up the Environment

To replicate the preprocessing and modeling described in the project, the following environment must be set up:

1. **Conda Environment**: Use the provided `environment.yml` file to create a Conda environment. This will install all required dependencies, including Python and PyTorch. Run the following command in your terminal:

   ```bash
   conda env create -f environment.yml
   ```

2. **Activate the Environment**: Switch to the newly created environment before running any of the commands below:

   ```bash
   conda activate descemb
   ```
## Data

### Data Description

The datasets used in this project are MIMIC-III and eICU, which are publicly available on the PhysioNet repository. These datasets include comprehensive data from intensive care units (ICUs), such as time-stamped records of medical events, lab results, medications, and more, recorded in different medical code systems.

- [**MIMIC-III**](https://physionet.org/content/iii/1.4/): Contains data for over 60,000 ICU stays at Beth Israel Deaconess Medical Center between 2001 and 2012. It includes information such as lab measurements, medication orders, and diagnostic codes.
- [**eICU**](https://physionet.org/content/eicu-crd/2.0/): A multi-center dataset containing data for over 200,000 ICU stays across the United States between 2014 and 2015. It includes similar types of data to MIMIC-III but is structured differently.
- [**ccs_multi_dx_tool_2015**](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/Multi_Level_CCS_2015.zip): The Clinical Classifications Software (CCS) 2015 dataset groups ICD-9-CM diagnosis and procedure codes into clinically meaningful categories that are useful for health data analysis and research.
- [**icd10cmtoicd9gem**](https://data.nber.org/gem/icd10cmtoicd9gem.csv): The `icd10cmtoicd9gem.csv` file is a mapping table that converts ICD-10-CM codes to ICD-9-CM codes.

### Data Access

The datasets utilized in this project, MIMIC-III and eICU, are publicly available via PhysioNet. Users must adhere to licensing agreements and data usage policies, including the requirement for completing a training course on data handling. Detailed instructions for data access are as follows:

- **MIMIC-III** and **eICU**: Access these datasets by registering and completing the required data usage agreement at [PhysioNet](https://physionet.org/). After gaining access, download the data directly from their respective project pages.

- Alternatively, you can use the public [**MIMIC-III Demo Dataset**](https://physionet.org/content/mimiciii-demo/1.4/) and [**eICU Demo Dataset**](https://www.physionet.org/content/eicu-crd-demo/2.0.1/) without creating an account. Each contains only a small fraction of the full dataset, but they are a good option if you would like to get started quickly and run the computations on a CPU.

### Data Preparation

The preparation of the datasets for training involves several steps, from downloading the data to preprocessing it into a usable format. Here’s how you can prepare your data:

1. **Download the Data**: After obtaining the necessary permissions, download the datasets from PhysioNet.
2. **Organize the Data**: Arrange the downloaded files according to the directory structure below:
```
data_input_path
├─ mimic
│  ├─ ADMISSIONS.csv
│  ├─ PATIENTS.csv
│  ├─ ICUSTAYS.csv
│  ├─ LABEVENTS.csv
│  ├─ PRESCRIPTIONS.csv
│  ├─ PROCEDURES.csv
│  ├─ INPUTEVENTS_CV.csv
│  ├─ INPUTEVENTS_MV.csv
│  ├─ D_ITEMS.csv
│  ├─ D_ICD_PROCEDURES.csv
│  └─ D_LABITEMS.csv
├─ eicu
│  ├─ diagnosis.csv
│  ├─ infusionDrug.csv
│  ├─ lab.csv
│  ├─ medication.csv
│  └─ patient.csv
├─ ccs_multi_dx_tool_2015.csv
└─ icd10cmtoicd9gem.csv

```
```
data_output_path
├─mimic
├─eicu
├─pooled
├─label
└─fold
```

3. **Preprocess the Data**: Use the following Python script to execute the preprocessing steps. This script automates the process of converting raw datasets into a format ready for model training.


In [6]:
!pip install iterative-stratification tqdm

Collecting iterative-stratification
  Downloading iterative_stratification-0.1.7-py3-none-any.whl (8.5 kB)
Installing collected packages: iterative-stratification
Successfully installed iterative-stratification-0.1.7


In [53]:
# --- MIMIC-III: uncomment `data` and one `data_src_directory` line to preprocess MIMIC-III instead ---
# data = 'mimiciii'
# data_src_directory = 'datasets/data_input_path/mimic'
# data_src_directory = 'datasets/data_input_path/mimic/mlm'  # use this path instead to preprocess the MIMIC-III MLM data

# --- eICU (active configuration) ---
data = 'eicu'
data_src_directory = 'datasets/data_input_path/eicu'

run_ready_directory = 'datasets/data_output_path'
ccs_dx_tool_path = 'datasets/data_input_path/ccs_multi_dx_tool_2015.csv'
icd10to9_path = 'datasets/data_input_path/icd10cmtoicd9gem.csv'

# preprocess the data
!python3 preprocess/preprocess_main.py --src_data {data} --dataset_path {data_src_directory} --dest_path {run_ready_directory} --ccs_dx_tool_path {ccs_dx_tool_path} --icd10to9_path {icd10to9_path}

working directory .. :  /content/cs598_descemb_project
Destination directory is set to datasets/data_output_path
Data directory is set to datasets/data_input_path/eicu
eicu_cohort.pkl already exists skip create cohort step!___
eicu_df.pkl already exists skip dataframe generation step!___
label numpy file save to  datasets/data_output_path/eicu/label/mortality.npy
label numpy file save to  datasets/data_output_path/eicu/label/readmission.npy
label numpy file save to  datasets/data_output_path/eicu/label/los_3day.npy
label numpy file save to  datasets/data_output_path/eicu/label/los_7day.npy
['1' '10' '12' '16' '17' '18' '2' '3' '4' '5' '6' '7' '8' '9']
label numpy file save to  datasets/data_output_path/eicu/label/diagnosis.npy
seed :  1
mortality train and test split
X fold_task value counts 
 mortality_fold
1    101
Name: count, dtype: int64

1 label distribution:
mortality
0    0.957746
1    0.042254
Name: proportion, dtype: float64

2 label distribution:
mortality
0    1.0
Name: pro

**Note**
- **Computational Resources**: Preprocessing is computationally intensive. The machine configuration (e.g., CPU cores and RAM) should match the recommended specifications (described in the **Training** section below).
- **Data Security and Compliance**: Always comply with the licensing agreements of the data sources, particularly regarding the handling and privacy of sensitive healthcare data.
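
Because preprocessing is expensive, it can help to confirm the input layout from the Data Preparation step before launching it. The sketch below is a convenience check and not part of the project; `find_missing_inputs` and `EXPECTED_FILES` are hypothetical names, and only the eICU files and the two top-level mapping files are listed (the MIMIC-III files can be added in the same way).

```python
from pathlib import Path

# Expected inputs, copied from the directory tree in "Data Preparation".
# The key "." stands for files that sit directly under the input root.
EXPECTED_FILES = {
    "eicu": ["diagnosis.csv", "infusionDrug.csv", "lab.csv", "medication.csv", "patient.csv"],
    ".": ["ccs_multi_dx_tool_2015.csv", "icd10cmtoicd9gem.csv"],
}


def find_missing_inputs(root, expected=EXPECTED_FILES):
    """Return the expected files that are absent under `root`, as relative paths."""
    root = Path(root)
    missing = []
    for subdir, names in expected.items():
        for name in names:
            candidate = root / subdir / name
            if not candidate.is_file():
                missing.append(str(candidate.relative_to(root)))
    return missing


missing = find_missing_inputs("datasets/data_input_path")
print("All inputs present" if not missing else f"Missing: {missing}")
```

Running this from the repository root reports any file that still needs to be downloaded or renamed.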

## Model

### References and Links

**Citation to the original paper**: [Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding](https://arxiv.org/abs/2108.03625)

**Link to the paper's GitHub repo**: [Visit the repository](https://github.com/hoon9405/DescEmb?tab=readme-ov-file)


### Model Description
The DescEmb (Description-based Embedding) model utilizes advanced NLP techniques to handle heterogeneous data from Electronic Health Records (EHRs) systems. The key components used in training these models are:

- **Masked Language Modeling (MLM)**: This approach is inspired by BERT (Bidirectional Encoder Representations from Transformers) and is used to pre-train the DescEmb model. The model learns contextual relationships between the words in the text descriptions of medical codes by predicting randomly masked words in each description.
- **Word2Vec Embedding**: This method is used for training a CodeEmb model, which learns vector representations of medical codes. Unlike MLM, Word2Vec (in its skip-gram form) predicts the surrounding codes given a target code, which helps capture the semantic relationships between different medical codes.

These methods are crucial for enabling the DescEmb model to generate embeddings that can unify disparate EHR systems, allowing for improved performance on various predictive tasks such as readmission, mortality, and length of stay predictions.
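
To make the MLM objective concrete, the following is a minimal sketch of the masking step only. It is a simplification written for this report, not the repository's implementation: BERT's full scheme also leaves some selected tokens unchanged or swaps them for random tokens, and the example description string is made up.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.3, seed=None):
    """Replace each token with [MASK] independently with probability `mask_prob`.

    Returns the masked sequence and a dict {position: original token}; the
    dict holds the targets a model would be trained to recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


# Made-up code description, split on whitespace for illustration.
description = "sodium chloride 0.9 % 1000 ml".split()
masked, targets = mask_tokens(description, mask_prob=0.3, seed=0)
print(masked)
print(targets)
```

During pre-training, the loss is computed only at the masked positions, which is why the function returns the original tokens alongside the masked sequence.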

### Implementation Code
Below are code snippets for pre-training and fine-tuning the models.

#### Pre-training the DescEmb Model with MLM

#### Pre-training a CodeEmb Model with Word2Vec

In [34]:
!python main.py \
    --distributed_world_size 1 \
    --input_path '/content/cs598_descemb_project/datasets/data_output_path' \
    --src_data 'eicu' \
    --task 'w2v' \
    --model 'codeemb'

2024-05-08 00:22:25 | INFO numexpr.utils NumExpr defaulting to 2 threads.)))
Type of pos_pair: <class 'dict'>
[2024-05-08 00:22:29,625][trainers.word2vec_trainer][INFO] - epoch: 0, loss: 1.386
[2024-05-08 00:22:29,625][trainers.word2vec_trainer][INFO] - Saving checkpoint to checkpoints/checkpoint_best.pt
[2024-05-08 00:22:29,627][trainers.word2vec_trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_best.pt
[2024-05-08 00:22:29,668][trainers.word2vec_trainer][INFO] - epoch: 1, loss: 1.386
Validation AUROC increased (0.000000 --> -1.386131)
[2024-05-08 00:22:29,668][trainers.word2vec_trainer][INFO] - Saving checkpoint to checkpoints/checkpoint_best.pt
[2024-05-08 00:22:29,670][trainers.word2vec_trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_best.pt
[2024-05-08 00:22:29,711][trainers.word2vec_trainer][INFO] - epoch: 2, loss: 1.386
Validation AUROC increased (-1.386131 --> -1.385945)
[2024-05-08 00:22:29,711][trainers.word2vec_trainer][INFO] - Savin

**Note**

These options apply to the pre-training commands:

- `--src_data` should be set to `mimiciii` or `eicu`.
- `--mlm_prob` sets the masking probability for MLM pre-training (default: `0.3`).
- `--model` should be set to `descemb_bert` or `descemb_rnn` for MLM pre-training.
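
The options in the note above could be combined into an MLM pre-training command along the following lines. This is a sketch under stated assumptions, not a command taken from this notebook: the task name `mlm` and the `--mlm_prob` flag are inferred from the `mlm_prob` key in the logged configuration, and `build_mlm_command` is a hypothetical helper that only assembles the argument list without running it.

```python
import shlex


def build_mlm_command(input_path, src_data="eicu", model="descemb_bert", mlm_prob=0.3):
    """Assemble (but do not run) a main.py invocation for MLM pre-training."""
    if src_data not in {"mimiciii", "eicu"}:
        raise ValueError(f"src_data must be 'mimiciii' or 'eicu', got {src_data!r}")
    if model not in {"descemb_bert", "descemb_rnn"}:
        raise ValueError(f"model must be 'descemb_bert' or 'descemb_rnn', got {model!r}")
    if not 0.0 < mlm_prob < 1.0:
        raise ValueError("mlm_prob must lie strictly between 0 and 1")
    return [
        "python", "main.py",
        "--distributed_world_size", "1",
        "--input_path", input_path,
        "--src_data", src_data,
        "--task", "mlm",              # assumed task name for MLM pre-training
        "--model", model,
        "--mlm_prob", str(mlm_prob),  # assumed flag, inferred from the logged config key
    ]


command = build_mlm_command("/content/cs598_descemb_project/datasets/data_output_path")
# Print a copy-pastable shell line; quoting each argument keeps paths with spaces intact.
print(" ".join(shlex.quote(arg) for arg in command))
```

Validating the arguments before launching avoids discovering a typo only after the data has been loaded.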

## Training

### Computational Requirements
The computational requirements for training the DescEmb model, based on the information from the README.md and the paper, include:

- **Processor**: Training is computationally intensive. The recommended configuration is 128 cores of AMD EPYC 7502 32-Core processors.
- **Memory**: At least 60GB of RAM is required due to the large size of the datasets and the complexity of the model architectures involved.
- **Software**: Python 3.7 or higher with PyTorch 1.8.1 or higher is required. Make sure all dependencies from the `environment.yml` are installed.

These requirements ensure that the model training can proceed without hardware-induced limitations, particularly for this task, which involves large datasets like MIMIC-III and eICU.

### Implementation Code

Below are the commands used to train and fine-tune the models. They are based on the commands found in the README.md:

#### Training a New Model

All other options keep their default values, which are the ones used in the DescEmb paper.

- `--embed_model` should be `descemb_bert` or `descemb_rnn` (or `codeemb` for the code-based baseline).
- `--ratio` should be one of `10`, `30`, `50`, `70`, `100` (default: `100`).
- `--value_mode` should be one of `NV`, `VA`, `DSVA`, `DSVA_DPE`, `VC`.
- `--task` should be one of `readmission`, `mortality`, `los_3day`, `los_7day`, `diagnosis`.

Note that `--input_path` should be the root directory containing the preprocessed data.
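
The allowed values listed above can be checked before a long training run is launched. The sketch below is illustrative only; `ALLOWED` and `validate_training_args` are hypothetical names, and `codeemb` is included because the notebook cells pass it to `--embed_model`.

```python
# Allowed values, copied from the option descriptions above.
ALLOWED = {
    "embed_model": {"codeemb", "descemb_bert", "descemb_rnn"},
    "ratio": {10, 30, 50, 70, 100},
    "value_mode": {"NV", "VA", "DSVA", "DSVA_DPE", "VC"},
    "task": {"readmission", "mortality", "los_3day", "los_7day", "diagnosis"},
}


def validate_training_args(embed_model, task, value_mode, ratio=100):
    """Return the arguments as a dict, or raise ValueError on the first invalid one."""
    chosen = {
        "embed_model": embed_model,
        "ratio": ratio,
        "value_mode": value_mode,
        "task": task,
    }
    for name, value in chosen.items():
        if value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(ALLOWED[name])}")
    return chosen


args = validate_training_args("descemb_bert", task="readmission", value_mode="VA", ratio=70)
print(args)
```

The returned dictionary can then be turned into `--name value` pairs for `main.py`.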

##### Training a New CodeEmb Model

In [56]:
# Example training new CodeEmb model:
# train_new_codeemb_model('data_output_path/mimic', 'mimic', 'readmission', 100, 'DSVA')
# train_new_codeemb_model('data_output_path/eicu', 'eicu', 'mortality', 70, 'VA')

!python main.py \
    --distributed_world_size 1 \
    --input_path '/content/cs598_descemb_project/datasets/data_output_path' \
    --model 'ehr_model' \
    --embed_model 'codeemb' \
    --pred_model 'rnn' \
    --src_data 'eicu' \
    --ratio 70 \
    --value_mode DSVA \
    --task 'readmission'

2024-05-08 01:17:05 | INFO numexpr.utils NumExpr defaulting to 2 threads.)))
[2024-05-08 01:17:06,651][trainers.trainer][INFO] - {'batch_size': 128,
 'bert_model': 'bert_tiny',
 'device_ids': [0],
 'disable_validation': False,
 'distributed_world_size': 1,
 'dropout': 0.3,
 'embed_model': 'codeemb',
 'enc_embed_dim': 128,
 'enc_hidden_dim': 256,
 'eval_data': None,
 'fold': None,
 'init_bert_params': False,
 'init_bert_params_with_freeze': False,
 'input_path': '/content/cs598_descemb_project/datasets/data_output_path',
 'load_pretrained_weights': False,
 'lr': 0.0001,
 'max_event_len': 150,
 'mlm_prob': 0.3,
 'model': 'ehr_model',
 'model_path': None,
 'n_epochs': 1000,
 'patience': 5,
 'pred_embed_dim': 128,
 'pred_hidden_dim': 256,
 'pred_model': 'rnn',
 'ratio': '70',
 'rnn_layer': 1,
 'save_dir': 'checkpoints',
 'save_prefix': 'checkpoint',
 'seed': 1,
 'src_data': 'eicu',
 'task': 'readmission',
 'transfer': False,
 'valid_subsets': ['valid', 'test'],
 'value_mode': 'DSVA'}
[2024

##### Training a New DescEmb Model

In [58]:
# Example training new DescEmb model:
# train_new_descemb_model('data_output_path/mimic', 'mimic', 'readmission', 100, 'DSVA')
# train_new_descemb_model('data_output_path/eicu', 'eicu', 'mortality', 70, 'VA')

!python main.py \
    --distributed_world_size 1 \
    --input_path '/content/cs598_descemb_project/datasets/data_output_path' \
    --model 'ehr_model' \
    --embed_model 'descemb_bert' \
    --pred_model 'rnn' \
    --src_data 'eicu' \
    --ratio 70 \
    --value_mode VA \
    --task 'readmission'

2024-05-08 01:28:04 | INFO numexpr.utils NumExpr defaulting to 2 threads.)))
[2024-05-08 01:28:05,449][trainers.trainer][INFO] - {'batch_size': 128,
 'bert_model': 'bert_tiny',
 'device_ids': [0],
 'disable_validation': False,
 'distributed_world_size': 1,
 'dropout': 0.3,
 'embed_model': 'descemb_bert',
 'enc_embed_dim': 128,
 'enc_hidden_dim': 256,
 'eval_data': None,
 'fold': None,
 'init_bert_params': False,
 'init_bert_params_with_freeze': False,
 'input_path': '/content/cs598_descemb_project/datasets/data_output_path',
 'load_pretrained_weights': False,
 'lr': 0.0001,
 'max_event_len': 150,
 'mlm_prob': 0.3,
 'model': 'ehr_model',
 'model_path': None,
 'n_epochs': 1000,
 'patience': 5,
 'pred_embed_dim': 128,
 'pred_hidden_dim': 256,
 'pred_model': 'rnn',
 'ratio': '70',
 'rnn_layer': 1,
 'save_dir': 'checkpoints',
 'save_prefix': 'checkpoint',
 'seed': 1,
 'src_data': 'eicu',
 'task': 'readmission',
 'transfer': False,
 'valid_subsets': ['valid', 'test'],
 'value_mode': 'VA'}
[2

**Note**: To train with a pre-trained BERT model, add the command-line flag `--init_bert_params` or `--init_bert_params_with_freeze`. The latter loads the BERT parameters and freezes them.

#### Fine-tune a Pre-Trained Model

##### Fine-tuning a Pre-trained CodeEmb Model

In [59]:
!python main.py \
    --distributed_world_size 1 \
    --input_path '/content/cs598_descemb_project/datasets/data_output_path' \
    --model_path '/content/cs598_descemb_project/outputs/2024-05-08/10-17-06/checkpoints/checkpoint_best.pt' \
    --model ehr_model \
    --embed_model 'codeemb' \
    --pred_model 'rnn' \
    --src_data 'eicu' \
    --ratio 70 \
    --value_mode 'DSVA' \
    --task 'readmission'

2024-05-08 01:33:26 | INFO numexpr.utils NumExpr defaulting to 2 threads.)))
[2024-05-08 01:33:27,764][trainers.trainer][INFO] - {'batch_size': 128,
 'bert_model': 'bert_tiny',
 'device_ids': [0],
 'disable_validation': False,
 'distributed_world_size': 1,
 'dropout': 0.3,
 'embed_model': 'codeemb',
 'enc_embed_dim': 128,
 'enc_hidden_dim': 256,
 'eval_data': None,
 'fold': None,
 'init_bert_params': False,
 'init_bert_params_with_freeze': False,
 'input_path': '/content/cs598_descemb_project/datasets/data_output_path',
 'load_pretrained_weights': False,
 'lr': 0.0001,
 'max_event_len': 150,
 'mlm_prob': 0.3,
 'model': 'ehr_model',
 'model_path': '/content/cs598_descemb_project/outputs/2024-05-08/10-17-06/checkpoints/checkpoint_best.pt',
 'n_epochs': 1000,
 'patience': 5,
 'pred_embed_dim': 128,
 'pred_hidden_dim': 256,
 'pred_model': 'rnn',
 'ratio': '70',
 'rnn_layer': 1,
 'save_dir': 'checkpoints',
 'save_prefix': 'checkpoint',
 'seed': 1,
 'src_data': 'eicu',
 'task': 'readmission'

##### Fine-tuning a Pre-trained DescEmb Model

In [60]:
!python main.py \
    --distributed_world_size 1 \
    --input_path '/content/cs598_descemb_project/datasets/data_output_path' \
    --model_path '/content/cs598_descemb_project/outputs/2024-05-08/10-28-05/checkpoints/checkpoint_best.pt' \
    --model ehr_model \
    --embed_model 'descemb_bert' \
    --pred_model 'rnn' \
    --src_data 'eicu' \
    --ratio 70 \
    --value_mode VA \
    --task 'readmission'

2024-05-08 01:35:03 | INFO numexpr.utils NumExpr defaulting to 2 threads.)))
[2024-05-08 01:35:04,306][trainers.trainer][INFO] - {'batch_size': 128,
 'bert_model': 'bert_tiny',
 'device_ids': [0],
 'disable_validation': False,
 'distributed_world_size': 1,
 'dropout': 0.3,
 'embed_model': 'descemb_bert',
 'enc_embed_dim': 128,
 'enc_hidden_dim': 256,
 'eval_data': None,
 'fold': None,
 'init_bert_params': False,
 'init_bert_params_with_freeze': False,
 'input_path': '/content/cs598_descemb_project/datasets/data_output_path',
 'load_pretrained_weights': False,
 'lr': 0.0001,
 'max_event_len': 150,
 'mlm_prob': 0.3,
 'model': 'ehr_model',
 'model_path': '/content/cs598_descemb_project/outputs/2024-05-08/10-28-05/checkpoints/checkpoint_best.pt',
 'n_epochs': 1000,
 'patience': 5,
 'pred_embed_dim': 128,
 'pred_hidden_dim': 256,
 'pred_model': 'rnn',
 'ratio': '70',
 'rnn_layer': 1,
 'save_dir': 'checkpoints',
 'save_prefix': 'checkpoint',
 'seed': 1,
 'src_data': 'eicu',
 'task': 'readmis

## Evaluation

The primary metric used in the `README.md` and the associated paper is the **Area Under the Precision-Recall Curve (AUPRC)**.

### Metrics Descriptions

#### Area Under the Precision-Recall Curve (AUPRC):

- **Precision (Positive Predictive Value)**: The ratio of true positive predictions to the total predicted positives. It shows the accuracy of the positive predictions.
- **Recall (Sensitivity)**: The ratio of true positives to the actual total positives in the dataset. It measures the model's ability to capture positive instances.
- **AUPRC**: The AUPRC is a single-number summary of these two metrics across all decision thresholds, emphasizing the balance between precision and recall. It is especially valuable for imbalanced clinical prediction tasks, where positive cases are rare and the cost of false negatives is high.
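
The definitions above can be worked through by hand on a small example. The sketch below computes precision and recall at each threshold in plain Python; the labels and scores are toy values made up for illustration, and `precision_recall_at` is a hypothetical helper rather than a function from the project.

```python
def precision_recall_at(y_true, y_scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [score >= threshold for score in y_scores]
    tp = sum(1 for label, pred in zip(y_true, predicted) if pred and label == 1)
    fp = sum(1 for label, pred in zip(y_true, predicted) if pred and label == 0)
    fn = sum(1 for label, pred in zip(y_true, predicted) if not pred and label == 1)
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels and scores, not model outputs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

for threshold in sorted(set(toy_scores)):
    precision, recall = precision_recall_at(toy_labels, toy_scores, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

Each printed row is one point on the precision-recall curve; the AUPRC is the area under the curve traced by these points as the threshold moves.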

### Implementation Code

In [61]:
from sklearn.metrics import precision_recall_curve, auc
import matplotlib.pyplot as plt

def calculate_auprc(y_true, y_scores):
    """
    Calculate the Area Under the Precision-Recall Curve (AUPRC).

    Args:
    y_true (list or array): True binary labels in range {0, 1}.
    y_scores (list or array): Target scores, can either be probability estimates of the positive class,
                              confidence values, or non-thresholded measure of decisions.

    Returns:
    float: AUPRC score
    """
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    auprc = auc(recall, precision)
    return auprc

def plot_precision_recall_curve(y_true, y_scores):
    """
    Plot the Precision-Recall curve for a given set of true labels and scores.

    Args:
    y_true (list or array): True binary labels in range {0, 1}.
    y_scores (list or array): Target scores, similar to calculate_auprc.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    plt.figure(figsize=(8, 6))
    plt.plot(recall, precision, label=f'AUPRC = {auc(recall, precision):.2f}')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve')
    plt.legend(loc='best')
    plt.show()

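# Example evaluation usage.
# The labels and scores below are toy placeholders, not model outputs; replace them
# with the true labels and predicted probabilities from a trained model.
y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
auprc_score = calculate_auprc(y_true, y_scores)
print(f"AUPRC: {auprc_score:.3f}")
plot_precision_recall_curve(y_true, y_scores)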

**Note**
- Ensure that the `y_true` and `y_scores` are correctly formatted as arrays of true labels and model predictions respectively.
- The plot_precision_recall_curve function provides a visual understanding of the trade-off between precision and recall for different threshold settings.

# Results

## Results Overview

As reported in the original paper, the DescEmb models demonstrated superior or comparable performance to traditional code-based embeddings (CodeEmb) across various clinical prediction tasks. The models were evaluated on tasks such as predicting readmission, mortality, length of stay (both 3-day and 7-day), and diagnosis, using the MIMIC-III and eICU datasets.

## Analyses

- **Performance Gains**: DescEmb models, especially those leveraging BERT-based embeddings, showed consistent improvements in AUPRC (Area Under the Precision-Recall Curve) across most tasks compared to traditional models.
- **Model Comparisons**: BERT-based DescEmb models generally outperformed simpler RNN-based models in complex tasks like diagnosis prediction, highlighting the effectiveness of pre-trained language models in handling complex textual data from EHRs.
- **Impact of Pre-training**: The addition of Masked Language Modeling (MLM) pre-training marginally improved performance, suggesting that further domain-specific adaptation of language models could be beneficial.

## Plans

Moving forward, the research can focus on:
- **Further Optimization**: Enhancing the efficiency of the models to make them accessible for real-time applications in clinical settings.
- **Expanding Dataset Usage**: Applying the DescEmb framework to additional datasets and exploring its effectiveness across different healthcare systems.
- **Advanced Model Architectures**: Investigating the integration of more complex neural architectures and their impact on the performance of EHR-based predictive models.

## Conclusion

The DescEmb approach marks a significant step forward in the use of NLP techniques for EHR data, offering a promising avenue for enhancing predictive healthcare analytics.




# Discussion of Reproducibility and Experimental Results

## Implications of Experimental Results
The attempt to reproduce the findings of the original paper highlighted several crucial aspects of the experimental setup and the limitations of available resources. Despite diligent efforts, the project did not achieve the same results as those documented, primarily due to constraints related to data handling and computational resources.

## Reproducibility of the Original Paper
The original paper, while comprehensive in many respects, presented significant challenges that hindered exact reproducibility:

- **Data and Pretraining:** The datasets used were massive, which significantly prolonged the pretraining period. Even with access to enhanced computational resources through Google Colab Pro, the available resources fell short of the demands for processing such large datasets efficiently.
- **Codebase Complexity:** The codebase provided by the original authors was extensive and complex. This complexity made it particularly challenging to navigate and implement the evaluation phase effectively.
- **Documentation Gaps:** The documentation, especially around the evaluation methodology, lacked sufficient detail. This made it difficult to understand and replicate how the model’s performance was assessed, further complicating the reproduction process.


## Challenges Encountered
- **Resource Limitation:** The primary challenge was the sheer scale of data and the computational power required for pretraining. The resources available through Google Colab Pro, although substantial, were inadequate for handling the dataset efficiently.
- **Technical Complexity:** Working through a vast and intricate codebase alone was particularly challenging. The complexity not only made it difficult to set up the project but also to reach the evaluation stage effectively.
- **Isolated Working Environment:** Handling this project solo, especially given my busy schedule with work and family commitments, added another layer of difficulty. Collaboration could have alleviated some of the technical burdens and provided a platform for problem-solving and idea exchange.


## Recommendations for Improving Reproducibility
1. **Enhanced Documentation:** Future iterations of the work could benefit greatly from more detailed documentation, particularly concerning the evaluation methodology. Clear, step-by-step guidance could help replicate the results more effectively and efficiently.
2. **Codebase Simplification:** Simplifying the codebase or at least organizing it with better modularity might help future researchers navigate the setup more easily.
3. **Community Collaboration:** Encouraging collaboration by setting up a discussion forum or a community portal could help researchers and practitioners tackle common issues collectively. Collaboration could be particularly beneficial in pooling resources, sharing computational power, and exchanging practical insights.
4. **Incremental Dataset Handling:** For projects involving massive datasets, it might be helpful to offer strategies or code alternatives for CPU testing to allow for incremental loading and processing of data. This could make the project more accessible to individuals or institutions with limited computational resources.
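
As one illustration of the incremental handling suggested in the last recommendation, a large CSV file can be processed in fixed-size chunks so that the whole table never has to sit in memory. This is a generic sketch using only the standard library, not code from the original repository; `iter_csv_chunks` and `count_rows_by` are hypothetical helpers, and the demo file with its `stay_id` and `item` columns is made up.

```python
import csv
import os
import tempfile
from itertools import islice


def iter_csv_chunks(path, chunk_size=10_000):
    """Yield lists of up to `chunk_size` row dicts read lazily from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def count_rows_by(path, column, chunk_size=10_000):
    """Example aggregation: count rows per value of `column`, one chunk at a time."""
    counts = {}
    for chunk in iter_csv_chunks(path, chunk_size):
        for row in chunk:
            counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


# Tiny made-up file, used only to show the call pattern.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("stay_id,item\n1,a\n1,b\n2,a\n")
    print(count_rows_by(demo_path, "stay_id", chunk_size=2))
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the file.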

By addressing these points, the original authors and others in the field can enhance the accessibility and reproducibility of their research, making it more feasible for a broader audience to engage with and build upon their work.