# CS598 Project: Enhancing Healthcare Predictive Models through Text-Based EHR Code Embedding

**Name:** Gene Horecka  
**Email:** [geneeh2@illinois.edu](mailto:geneeh2@illinois.edu)  
**Course:** CS 598 Deep Learning for Healthcare - Spring 2024
## Project GitHub Link: [https://github.com/genefever/cs598_descemb_project](https://github.com/genefever/cs598_descemb_project)


# Introduction

The paper "[Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding](https://arxiv.org/abs/2108.03625)" addresses the significant challenge of heterogeneity in Electronic Health Records (EHR) systems. These systems, essential for modern healthcare, often differ in their coding and formatting of medical data, which hampers the development and application of predictive models across different institutions or datasets.

The primary contribution of this paper is a novel framework named [Description-based Embedding (DescEmb)](https://github.com/hoon9405/DescEmb). This framework uses neural language models to create a unified, code-agnostic text-based representation of medical data. By transforming various coding formats into a consistent text-based embedding, DescEmb allows deep learning models to be applied more flexibly across diverse EHR systems. The paper reports that this approach matches or outperforms traditional code-based embedding methods in several experimental setups.


# Scope of Reproducibility

The scope of reproducibility for the paper "[Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding](https://arxiv.org/abs/2108.03625)" entails verifying the claims of improved predictive performance through the implementation of the DescEmb framework. This framework leverages a neural language model to convert medical codes into a unified, text-based embedding, which is purported to enhance predictive healthcare research without the constraints imposed by diverse EHR systems.

## Key Claims for Reproduction:
1. **Unified Learning Across Diverse EHR Formats:** DescEmb can unify learning across various EHR systems without needing individualized pre-processing or domain-specific knowledge, due to its text-based nature.
2. **Superior Predictive Performance:** The framework demonstrates better or comparable predictive performance than traditional code-based approaches across several clinical prediction tasks.
3. **Efficient Deployment in Diverse Environments:** The text-based approach allows models trained with DescEmb to be easily transferred and applied across different hospitals with differing EHR systems.

The reproducibility effort will focus on these claims by attempting to replicate the experiments outlined in the original paper using the datasets and code provided in the project's [GitHub repository](https://github.com/hoon9405/DescEmb). The process will involve re-running the model training and evaluation procedures to verify the reported performance improvements and the operational flexibility of the DescEmb approach.


# Methodology

## Requirements

- PyTorch version >= 1.8.1
- Python version >= 3.7
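
The version floors above can be checked before any training is attempted. The following is a minimal sketch, not part of the DescEmb repository: `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helper names, and the parser only understands plain dotted versions with an optional `+local` suffix.

```python
import sys


def parse_version(text):
    """Turn a string such as '1.8.1+cu111' into a 3-tuple of ints, e.g. (1, 8, 1)."""
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as 'rc1'
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad '3.7' to (3, 7, 0) so comparisons are well defined
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements():
    """Report whether Python and PyTorch satisfy the floors listed above."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.7")}
    try:
        import torch  # imported lazily so the check still runs without PyTorch
        report["torch"] = meets_minimum(torch.__version__, "1.8.1")
    except ImportError:
        report["torch"] = None  # PyTorch is not installed in this environment
    return report


if __name__ == "__main__":
    print(check_requirements())
```

A `None` entry for `torch` means the package could not be imported, which usually indicates that the Conda environment has not been activated.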

## Setting Up the Environment

To replicate the preprocessing and modeling described in the project, the following environment must be set up:

1. **Create the Conda Environment**: Use the provided `environment.yml` file to create a Conda environment. This installs all required dependencies, including Python and PyTorch. Run the following command in your terminal:

   ```bash
   conda env create -f environment.yml
   ```

2. **Activate the Environment**:

   ```bash
   conda activate descemb
   ```

## Data

### Data Description

The datasets used in this project are MIMIC-III and eICU, which are publicly available on the PhysioNet repository. These datasets include comprehensive data from intensive care units (ICUs), such as time-stamped records of medical events, lab results, medications, and more, recorded in different medical code systems.

- [**MIMIC-III**](https://physionet.org/content/iii/1.4/): Contains data for over 60,000 ICU stays at Beth Israel Deaconess Medical Center between 2001 and 2012. It includes information such as lab measurements, medication orders, and diagnostic codes.
- [**eICU**](https://physionet.org/content/eicu-crd/2.0/): A multi-center dataset containing data for over 200,000 ICU stays across the United States between 2014 and 2015. It includes similar types of data to MIMIC-III but is structured differently.
- [**ccs_multi_dx_tool_2015**](https://www.hcup-us.ahrq.gov/toolssoftware/ccs/Multi_Level_CCS_2015.zip): The Clinical Classifications Software (CCS) 2015 dataset groups ICD-9-CM diagnosis and procedure codes into clinically meaningful categories that are useful for health data analysis and research.
- [**icd10cmtoicd9gem**](https://data.nber.org/gem/icd10cmtoicd9gem.csv): The `icd10cmtoicd9gem.csv` file is a mapping table that converts ICD-10-CM codes to ICD-9-CM codes.
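
To illustrate how such a mapping table can be used, here is a minimal sketch that builds an ICD-10-CM to ICD-9-CM lookup from a GEM-style CSV. The column names `icd10cm` and `icd9cm` are an assumption about the file header and should be checked against the downloaded file; `load_icd10_to_icd9` is a hypothetical helper, and the two sample rows are illustrative stand-ins rather than rows copied from the real table.

```python
import csv
import io


def load_icd10_to_icd9(handle, source_col="icd10cm", target_col="icd9cm"):
    """Build {ICD-10-CM code: [ICD-9-CM codes]} from a GEM-style CSV handle.

    One ICD-10-CM code may map to several ICD-9-CM codes, so values are lists.
    The column names are assumptions; adjust them to the actual header.
    """
    mapping = {}
    for row in csv.DictReader(handle):
        source = row[source_col].strip()
        target = row[target_col].strip()
        mapping.setdefault(source, []).append(target)
    return mapping


# Tiny in-memory stand-in for the real CSV (illustrative rows only).
sample = io.StringIO("icd10cm,icd9cm\nA000,0010\nA001,0011\n")
icd10_to_icd9 = load_icd10_to_icd9(sample)
print(icd10_to_icd9.get("A000", []))
```

For the real file, the same function can be called on `open('datasets/data_input_path/icd10cmtoicd9gem.csv', newline='')`.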

### Data Access

The datasets used in this project, MIMIC-III and eICU, are available via PhysioNet once access has been granted. Users must adhere to the licensing agreements and data usage policies, which include completing a training course on data handling. Detailed instructions for data access are as follows:

- **MIMIC-III** and **eICU**: Access these datasets by registering and completing the required data usage agreement at [PhysioNet](https://physionet.org/). After gaining access, download the data directly from their respective project pages.

- Alternatively, you can use the public [**MIMIC-III Demo Dataset**](https://physionet.org/content/mimiciii-demo/1.4/) and [**eICU Demo Dataset**](https://www.physionet.org/content/eicu-crd-demo/2.0.1/) without creating an account. They contain only a fraction of the full datasets, but they are a good option if you want to get started quickly and run the computations on a CPU.

### Data Preparation

The preparation of the datasets for training involves several steps, from downloading the data to preprocessing it into a usable format. Here’s how you can prepare your data:

1. **Download the Data**: After obtaining the necessary permissions, download the datasets from PhysioNet.
2. **Organize the Data**: Arrange the downloaded files according to the directory structure below:
```
data_input_path
├─ mimic
│  ├─ ADMISSIONS.csv
│  ├─ PATIENTS.csv
│  ├─ ICUSTAYS.csv
│  ├─ LABEVENTS.csv
│  ├─ PRESCRIPTIONS.csv
│  ├─ PROCEDURES.csv
│  ├─ INPUTEVENTS_CV.csv
│  ├─ INPUTEVENTS_MV.csv
│  ├─ D_ITEMS.csv
│  ├─ D_ICD_PROCEDURES.csv
│  └─ D_LABITEMS.csv
├─ eicu
│  ├─ diagnosis.csv
│  ├─ infusionDrug.csv
│  ├─ lab.csv
│  ├─ medication.csv
│  └─ patient.csv
├─ ccs_multi_dx_tool_2015.csv
└─ icd10cmtoicd9gem.csv

```
```
data_output_path
├─mimic
├─eicu
├─pooled
├─label
└─fold
```
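
Before preprocessing, it can help to confirm that the input directory matches the layout above. The following is a minimal sketch under stated assumptions: `find_missing` is a hypothetical helper, the root path `datasets/data_input_path` is taken from the preprocessing cell in this report, and the `expected` list shows only a few entries and should be extended to cover every file in the tree.

```python
from pathlib import Path


def find_missing(root, expected):
    """Return the relative paths from `expected` that are not files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # A partial list for illustration; add the remaining files from the tree.
    expected = [
        "mimic/ADMISSIONS.csv",
        "mimic/PATIENTS.csv",
        "eicu/patient.csv",
        "eicu/lab.csv",
        "ccs_multi_dx_tool_2015.csv",
        "icd10cmtoicd9gem.csv",
    ]
    missing = find_missing("datasets/data_input_path", expected)
    print("missing files:", missing if missing else "none")
```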

3. **Preprocess the Data**: Use the following Python script to execute the preprocessing steps. This script automates the process of converting raw datasets into a format ready for model training.


```python
data = 'mimiciii'
data_src_directory = 'datasets/data_input_path/mimic'
# data = 'eicu' # uncomment this line if you want to preprocess eicu data
# data_src_directory = 'datasets/data_input_path/eicu' # uncomment this line if you want to preprocess eicu data
run_ready_directory = 'datasets/data_output_path/mlm'
ccs_dx_tool_path = 'datasets/data_input_path/ccs_multi_dx_tool_2015.csv'
icd10to9_path = 'datasets/data_input_path/icd10cmtoicd9gem.csv'

# preprocess the data
!python3 preprocess/preprocess_main.py --src_data {data} --dataset_path {data_src_directory} --dest_path {run_ready_directory} --ccs_dx_tool_path {ccs_dx_tool_path} --icd10to9_path {icd10to9_path}
```

```
working directory .. :  /Users/genehorecka/Documents/01 UIUC/CS598/Project/cs598_descemb_project
create dest path.. datasets/data_output_path/mlm
Destination directory is set to datasets/data_output_path/mlm
Data directory is set to datasets/data_input_path/mimic
length of PATIENTS.csv  :  100
length of ICUSTAYS.csv  :  136
length of DIAGNOSIS_ICD.csv  :  136
length of icus  : 72
readmission value counts : readmission
0              64
1               4
Name: count, dtype: int64
dx_label_mapping.pkl dx mapping pkl save___
average length:  7.432835820895522
dx freqeuncy [ 0  1  0  1  2 14  4 12 10 14  3  3  0  3]
max length:  13
min length:  1
data preparation initialization .. mimiciii lab
df_load ! .. mimiciii lab
data preparation initialization .. mimiciii med
df_load ! .. mimiciii med
data preparation initialization .. mimiciii inf
mimic INPUTEVENTS merge!
df_load ! .. mimiciii inf
data preparation finish for three tables 
 second preparation start soon..
lab med inf three categorie
```

**Note**
- **Computational Resources**: Preprocessing is computationally intensive. The machine configuration (e.g., CPU cores and RAM) should match the recommended specifications (described in the **Training** section below).
- **Data Security and Compliance**: Always comply with the licensing agreements of the data sources, particularly regarding the handling and privacy of sensitive healthcare data.

## Model

### References and Links

**Citation to the original paper**: [Unifying Heterogeneous Electronic Health Records Systems via Text-Based Code Embedding](https://arxiv.org/abs/2108.03625)

**Link to the paper's GitHub repo**: [Visit the repository](https://github.com/hoon9405/DescEmb?tab=readme-ov-file)


### Model Description
The DescEmb (Description-based Embedding) model utilizes advanced NLP techniques to handle heterogeneous data from Electronic Health Records (EHRs) systems. The two key components used in training these models are:

- **Masked Language Modeling (MLM)**: This approach is inspired by BERT (Bidirectional Encoder Representations from Transformers) and is used to pre-train the DescEmb model. It helps the model learn contextual relationships between words in medical code descriptions by predicting randomly masked words in a description.
- **Word2Vec Embedding**: This method is used to pre-train a CodeEmb model, which learns vector representations of medical codes. Unlike MLM, Word2Vec (in its skip-gram form) predicts the surrounding codes given a target code, which helps capture the semantic relationships between different medical codes.

These methods are crucial for enabling the DescEmb model to generate embeddings that can unify disparate EHR systems, allowing for improved performance on various predictive tasks such as readmission, mortality, and length of stay predictions.
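
The two pre-training objectives can be illustrated on toy inputs. The sketch below is a simplified illustration and not the DescEmb implementation: `mask_tokens` replaces each token with `[MASK]` with probability `mlm_prob` (real BERT-style masking also keeps or randomly replaces some of the selected tokens), and `skipgram_pairs` lists the (target, context) pairs a skip-gram Word2Vec model would train on. All function names, the window size, and the example tokens are assumptions made for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mlm_prob=0.3, rng=None):
    """Mask each token independently with probability `mlm_prob`.

    Returns (masked_tokens, labels); labels hold the original token at masked
    positions and None elsewhere, so the loss is computed only on masked slots.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def skipgram_pairs(codes, window=2):
    """List (target, context) pairs for every code within `window` positions."""
    pairs = []
    for i, target in enumerate(codes):
        lo = max(0, i - window)
        hi = min(len(codes), i + window + 1)
        pairs.extend((target, codes[j]) for j in range(lo, hi) if j != i)
    return pairs


if __name__ == "__main__":
    description = "sodium chloride 0.9 % flush".split()  # made-up description text
    print(mask_tokens(description, mlm_prob=0.3, rng=random.Random(0)))
    print(skipgram_pairs(["code_a", "code_b", "code_c"], window=1))
```

The masking probability plays the role of the `--mlm_prob` flag used in the pre-training command in this report.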

### Implementation Code
Below are code snippets for pre-training and fine-tuning the models.

#### Pre-training the DescEmb Model with MLM

```python
!python main.py \
    --distributed_world_size 1 \
    --input_path '/content/cs598_descemb_project/datasets/data_output_path' \
    --src_data 'mimiciii' \
    --task mlm \
    --mlm_prob 0.3 \
    --model 'descemb_bert'
```

```
2024-05-06 23:11:05 | INFO numexpr.utils Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.)))
2024-05-06 23:11:05 | INFO numexpr.utils NumExpr defaulting to 8 threads.)))
[2024-05-06 23:11:06,013][trainers.trainer][INFO] - {'batch_size': 128,
 'bert_model': 'bert_tiny',
 'device_ids': [0],
 'disable_validation': False,
 'distributed_world_size': 1,
 'dropout': 0.3,
 'embed_model': None,
 'enc_embed_dim': 128,
 'enc_hidden_dim': 256,
 'eval_data': None,
 'fold': None,
 'init_bert_params': False,
 'init_bert_params_with_freeze': False,
 'input_path': '/Users/genehorecka/Documents/01 '
               'UIUC/CS598/Project/cs598_descemb_project/datasets/data_output_path',
 'load_pretrained_weights': False,
 'lr': 0.0001,
 'max_event_len': 150,
 'mlm_prob': 0.3,
 'model': 'descemb_bert',
 'model_path': None,
 'n_epochs': 1000,
 'patience': 5,
 'pred_embed_dim': 128,
 'pred_hidden_dim': 256,
 'pred_model': None,
 'ratio': '100',
 'rnn_layer': 1,
```

#### Pre-train a CodeEmb model with Word2Vec

```python
import subprocess

def pretrain_codeemb(world_size, input_path, src_data, task):
    """
    Run the Word2Vec pre-training script for CodeEmb using the subprocess module.
    
    Args:
    world_size (int): Number of processes to distribute the workload across.
    input_path (str): Path to the training data.
    src_data (str): Source data identifier, e.g., 'mimiciii' or 'eicu'.
    task (str): Task to perform, e.g., 'w2v' for Word2Vec.
    """
    command = [
        'python', 'main.py',
        '--distributed_world_size', str(world_size),
        '--input_path', input_path,
        '--src_data', src_data,
        '--task', task,
        '--model', 'codeemb'
    ]
    # Capture both streams so that errors from the training script are visible
    result = subprocess.run(command, capture_output=True, text=True)
    print("STDOUT:", result.stdout)
    print("STDERR:", result.stderr)

# Example usage
pretrain_codeemb(world_size=1, input_path='datasets/data_output_path/', src_data='mimiciii', task='w2v')
```

```
STDOUT: 2024-05-06 22:18:16 | INFO numexpr.utils Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.)))
2024-05-06 22:18:16 | INFO numexpr.utils NumExpr defaulting to 8 threads.)))

STDERR: Traceback (most recent call last):
  File "/Users/genehorecka/Documents/01 UIUC/CS598/Project/cs598_descemb_project/main.py", line 194, in <module>
    main()
  File "/Users/genehorecka/Documents/01 UIUC/CS598/Project/cs598_descemb_project/main.py", line 135, in main
    trainer = Word2VecTrainer(args)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/genehorecka/Documents/01 UIUC/CS598/Project/cs598_descemb_project/trainers/word2vec_trainer.py", line 40, in __init__
    vocab_dict = self.vocab_load(args.input_path, args.src_data, args.value_mode)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/genehorecka/Documents/01 UIUC/CS598/Project/cs598_descemb_project/trainers/word2vec_trainer.py", line 116, in voc
```

**Note**

- `--src_data` selects the source dataset (the README lists `mimic` or `eicu`; the runs above pass `mimiciii`)
- `--mlm_prob` sets the masking probability for MLM (default: `0.3`)
- `--model` should be set to `descemb_bert` or `descemb_rnn` for MLM pre-training

## Training

### Computational Requirements
The computational requirements for training the DescEmb model, based on the information from the README.md and the paper, include:

- **Processor**: Training is computationally intensive. A system with at least 128 cores (the reference hardware is the AMD EPYC 7502 32-Core Processor) is recommended for efficient processing.
- **Memory**: At least 60GB of RAM is required due to the large size of the datasets and the complexity of the model architectures involved.
- **Software**: Python 3.7 or higher with PyTorch 1.8.1 or higher is required. Make sure all dependencies from the `environment.yml` are installed.

These requirements ensure that the model training can proceed without hardware-induced limitations, particularly for this task, which involves large datasets like MIMIC-III and eICU.
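
A quick way to compare a machine against these figures is sketched below. This is an illustration under assumptions, not part of the project code: `system_resources` and `meets_recommendation` are hypothetical names, the RAM lookup relies on `os.sysconf`, which is available on Linux and macOS but not on Windows, and the thresholds simply mirror the 128-core and 60GB figures quoted above.

```python
import os


def system_resources():
    """Return (logical CPU count, physical RAM in GiB or None if unknown)."""
    cores = os.cpu_count()
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
        ram_gb = page_size * page_count / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        ram_gb = None  # os.sysconf or these keys are unavailable on this platform
    return cores, ram_gb


def meets_recommendation(cores, ram_gb, min_cores=128, min_ram_gb=60):
    """Return True only when both the core count and the RAM reach the thresholds."""
    if cores is None or ram_gb is None:
        return False
    return cores >= min_cores and ram_gb >= min_ram_gb


if __name__ == "__main__":
    cores, ram_gb = system_resources()
    print("cores:", cores, "ram_gb:", ram_gb)
    print("meets recommendation:", meets_recommendation(cores, ram_gb))
```

A machine below these thresholds can still run the demo datasets; the check is only meant to flag when full-scale runs are likely to be slow or to run out of memory.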

### Implementation Code

Below are Python code snippets that demonstrate how to train and fine-tune the models using PyTorch. These examples are based on the commands found in the README.md:

#### Training a New Model

Other configurations default to the values used in the DescEmb paper.

`$descemb` should be 'descemb_bert' or 'descemb_rnn'.

`$ratio` should be set to one of [10, 30, 50, 70, 100] (default: 100).

`$value` should be set to one of ['NV', 'VA', 'DSVA', 'DSVA_DPE', 'VC'].

`$task` should be set to one of ['readmission', 'mortality', 'los_3day', 'los_7day', 'diagnosis'].

Note that `--input_path` should be the root directory containing the preprocessed data.
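
Because a mistyped option only surfaces after the training script has started, it can be useful to validate the values before launching. The following is a minimal sketch: `build_train_command` and `ALLOWED` are hypothetical names, the allowed values are copied from the lists above (with `codeemb` added for the CodeEmb case, which is an assumption about what `--embed_model` accepts), and the flags are the ones used elsewhere in this report.

```python
ALLOWED = {
    "embed_model": {"codeemb", "descemb_bert", "descemb_rnn"},
    "ratio": {10, 30, 50, 70, 100},
    "value_mode": {"NV", "VA", "DSVA", "DSVA_DPE", "VC"},
    "task": {"readmission", "mortality", "los_3day", "los_7day", "diagnosis"},
}


def build_train_command(input_path, src_data, embed_model, task, ratio, value_mode,
                        pred_model="rnn"):
    """Validate the options and return the argument list for `main.py`.

    Raises ValueError when an option is outside the documented choices.
    """
    given = {"embed_model": embed_model, "ratio": ratio,
             "value_mode": value_mode, "task": task}
    for name, value in given.items():
        if value not in ALLOWED[name]:
            raise ValueError(f"{name}={value!r} is not one of {sorted(ALLOWED[name], key=str)}")
    return [
        "python", "main.py",
        "--distributed_world_size", "1",
        "--input_path", input_path,
        "--model", "ehr_model",
        "--embed_model", embed_model,
        "--pred_model", pred_model,
        "--src_data", src_data,
        "--ratio", str(ratio),
        "--value_mode", value_mode,
        "--task", task,
    ]


if __name__ == "__main__":
    print(" ".join(build_train_command(
        "data_output_path", "mimic", "descemb_bert", "readmission", 100, "DSVA")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.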

##### Training a New CodeEmb Model

```python
import subprocess

def train_new_codeemb_model(input_path, data_source, task, ratio, value_mode):
    command = [
        'python', 'main.py',
        '--distributed_world_size', '1',  # Adjust based on the setup, if more GPUs/CPU cores are available
        '--input_path', input_path,
        '--model', 'ehr_model',
        '--embed_model', 'codeemb',
        '--pred_model', 'rnn',  # Define the prediction model architecture, change as needed
        '--src_data', data_source,  # Specify data source, e.g., 'mimic' or 'eicu'
        '--ratio', str(ratio),  # Ratio for data split or sampling, e.g., 10, 30, 50, 70, 100
        '--value_mode', value_mode,  # Mode of data processing or feature engineering, e.g., 'NV', 'VA', 'DSVA', 'DSVA_DPE', 'VC'
        '--task', task  # Task to perform, e.g., 'readmission', 'mortality', 'los_3day', 'los_7day', 'diagnosis'
    ]
    subprocess.run(command)

# Example training new CodeEmb model:
# train_new_codeemb_model('data_output_path/mimic', 'mimic', 'readmission', 100, 'DSVA')
# train_new_codeemb_model('data_output_path/eicu', 'eicu', 'mortality', 70, 'VA')
```

##### Training a New DescEmb Model

```python
import subprocess

def train_new_descemb_model(input_path, data_source, task, ratio, value_mode):
    command = [
        'python', 'main.py',
        '--distributed_world_size', '1',  # Adjust based on the setup, if more GPUs/CPU cores are available
        '--input_path', input_path,
        '--model', 'ehr_model',
        '--embed_model', 'descemb',  # Possible values: 'descemb', 'descemb_bert', 'descemb_rnn'
        '--pred_model', 'rnn',  # Define the prediction model architecture, change as needed
        '--src_data', data_source,  # Specify data source, e.g., 'mimic' or 'eicu'
        '--ratio', str(ratio),  # Ratio for data split or sampling, e.g., 10, 30, 50, 70, 100
        '--value_mode', value_mode,  # Mode of data processing or feature engineering, e.g., 'NV', 'VA', 'DSVA', 'DSVA_DPE', 'VC'
        '--task', task  # Task to perform, e.g., 'readmission', 'mortality', 'los_3day', 'los_7day', 'diagnosis'
    ]
    subprocess.run(command)

# Example training new DescEmb model:
# train_new_descemb_model('data_output_path/mimic', 'mimic', 'readmission', 100, 'DSVA')
# train_new_descemb_model('data_output_path/eicu', 'eicu', 'mortality', 70, 'VA')
```

**Note**: To train with a pre-trained BERT model, add the command-line flag `--init_bert_params` or `--init_bert_params_with_freeze`. The latter loads the BERT parameters and freezes them.

#### Fine-tune a Pre-Trained Model

##### Fine-tuning a Pre-trained CodeEmb Model

```python
import subprocess

def fine_tune_pretrained_codeemb_model(input_path, model_path, data_source, task, ratio, value_mode, world_size=1, model='ehr_model', embed_model='codeemb', pred_model='rnn'):
    """
    Fine-tune a pre-trained CodeEmb model using subprocess.

    Args:
    input_path (str): Base path to the training data.
    model_path (str): Path to the pre-trained model.
    data_source (str): Data source identifier, e.g., 'mimic' or 'eicu'.
    task (str): Task for the model to perform.
    ratio (int): Ratio of data to use for fine-tuning.
    value_mode (str): Value mode to use, e.g., 'NV', 'VA', 'DSVA', 'DSVA_DPE', 'VC'.
    world_size (int): Number of processes to distribute the workload across (default: 1).
    model (str): High-level model architecture (default: 'ehr_model').
    embed_model (str): Type of embedding model, typically 'codeemb' (default: 'codeemb').
    pred_model (str): Prediction model, typically 'rnn' (default: 'rnn').
    """
    full_input_path = f'{input_path}/{data_source}'

    command = [
        'python', 'main.py',
        '--distributed_world_size', str(world_size),
        '--input_path', full_input_path,
        '--model_path', model_path,
        '--load_pretrained_weights',
        '--model', model,
        '--embed_model', embed_model,
        '--pred_model', pred_model,
        '--src_data', data_source,
        '--ratio', str(ratio),
        '--value_mode', value_mode,
        '--task', task
    ]
    # Capture both streams so that errors from the training script are visible
    result = subprocess.run(command, capture_output=True, text=True)
    print("STDOUT:", result.stdout)
    print("STDERR:", result.stderr)

# Example usage:
# fine_tune_pretrained_codeemb_model('/path/to/data', '/path/to/model.pt', 'mimic', 'mortality', 100, 'DSVA')
```

##### Fine-tuning a Pre-trained DescEmb Model

```python
import subprocess

def fine_tune_pretrained_descemb_model(input_path, model_path, data_source, task, ratio, value_mode, world_size=1, embed_model='descemb'):
    """
    Fine-tune a pre-trained DescEmb model using subprocess.

    Args:
    input_path (str): Base path to the training data.
    model_path (str): Path to the pre-trained model.
    data_source (str): Data source identifier, e.g., 'mimic' or 'eicu'.
    task (str): Task for the model to perform.
    ratio (int): Ratio of data to use for fine-tuning.
    value_mode (str): Value mode to use, e.g., 'NV', 'VA', 'DSVA', 'DSVA_DPE', 'VC'.
    world_size (int): Number of processes to distribute the workload across (default: 1).
    embed_model (str): Embedding model to use, typically 'descemb' (default: 'descemb').
    """
    full_input_path = f'{input_path}/{data_source}'

    command = [
        'python', 'main.py',
        '--distributed_world_size', str(world_size),
        '--input_path', full_input_path,
        '--model_path', model_path,
        '--load_pretrained_weights',
        '--model', 'ehr_model',
        '--embed_model', embed_model,
        '--pred_model', 'rnn',
        '--src_data', data_source,
        '--ratio', str(ratio),
        '--value_mode', value_mode,
        '--task', task
    ]
    # Capture both streams so that errors from the training script are visible
    result = subprocess.run(command, capture_output=True, text=True)
    print("STDOUT:", result.stdout)
    print("STDERR:", result.stderr)

# Example usage:
# fine_tune_pretrained_descemb_model('data_output_path', '/path/to/model.pt', 'mimic', 'mortality', 100, 'DSVA')
```

## Evaluation

The primary metric used in the `README.md` and the associated paper is the **Area Under the Precision-Recall Curve (AUPRC)**.

### Metrics Descriptions

#### Area Under the Precision-Recall Curve (AUPRC):

- **Precision (Positive Predictive Value)**: The ratio of true positive predictions to the total predicted positives. It shows the accuracy of the positive predictions.
- **Recall (Sensitivity)**: The ratio of true positives to the actual total positives in the dataset. It measures the model's ability to capture positive instances.
- **AUPRC**: A single-number summary of these two metrics across all decision thresholds, emphasizing the balance between precision and recall. It is especially informative when positive cases are rare and the cost of false negatives is high, as in many medical prediction tasks.
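
The definitions above can be made concrete with a small worked example in plain Python. This is a sketch for intuition only: `precision_recall_at` and `average_precision` are hypothetical helper names, the labels and scores are made-up toy values, and the step-wise sum used here (often called average precision) is one common estimator of the area under the precision-recall curve, which can differ slightly from a trapezoidal estimate. It also assumes there are no tied scores.

```python
def precision_recall_at(y_true, y_scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, y_scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, y_scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, y_scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(y_true, y_scores):
    """Step-wise area under the PR curve: sum of precision at each positive hit,
    weighted by the recall gained at that hit (1 / number of positives)."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    order = sorted(range(len(y_true)), key=lambda i: -y_scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += (tp / rank) / total_pos
    return ap


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]               # toy labels
    y_scores = [0.1, 0.4, 0.35, 0.8]    # toy model scores
    print(precision_recall_at(y_true, y_scores, threshold=0.5))  # (1.0, 0.5)
    print(round(average_precision(y_true, y_scores), 4))         # 0.8333
```

At a threshold of 0.5 only the highest-scoring example is predicted positive, so precision is perfect while half of the positives are missed; the area summary aggregates this trade-off over all thresholds.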

### Implementation Code

```python
from sklearn.metrics import precision_recall_curve, auc
import matplotlib.pyplot as plt

def calculate_auprc(y_true, y_scores):
    """
    Calculate the Area Under the Precision-Recall Curve (AUPRC).
    
    Args:
    y_true (list or array): True binary labels in range {0, 1}.
    y_scores (list or array): Target scores, can either be probability estimates of the positive class,
                              confidence values, or non-thresholded measure of decisions.
    
    Returns:
    float: AUPRC score
    """
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    auprc = auc(recall, precision)
    return auprc

def plot_precision_recall_curve(y_true, y_scores):
    """
    Plot the Precision-Recall curve for a given set of true labels and scores.
    
    Args:
    y_true (list or array): True binary labels in range {0, 1}.
    y_scores (list or array): Target scores, similar to calculate_auprc.
    """
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    plt.figure(figsize=(8, 6))
    plt.plot(recall, precision, label=f'AUPRC = {auc(recall, precision):.2f}')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve')
    plt.legend(loc='best')
    plt.show()

# Example evaluation usage
# auprc_score = calculate_auprc(y_true, y_scores)
# plot_precision_recall_curve(y_true, y_scores)
```

**Note**
- Ensure that `y_true` and `y_scores` are correctly formatted as arrays of true labels and model predictions, respectively.
- The `plot_precision_recall_curve` function provides a visual understanding of the trade-off between precision and recall at different threshold settings.

# Results

## Results Overview

The DescEmb models demonstrated superior or comparable performance to traditional code-based embeddings (CodeEmb) across various clinical prediction tasks. The models were evaluated on tasks such as predicting readmission, mortality, length of stay (both 3-day and 7-day), and diagnosis predictions using datasets like MIMIC-III and eICU.

## Analyses

- **Performance Gains**: DescEmb models, especially those leveraging BERT-based embeddings, showed consistent improvements in AUPRC (Area Under the Precision-Recall Curve) across most tasks compared to traditional models.
- **Model Comparisons**: BERT-based DescEmb models generally outperformed simpler RNN-based models in complex tasks like diagnosis prediction, highlighting the effectiveness of pre-trained language models in handling complex textual data from EHRs.
- **Impact of Pre-training**: The addition of Masked Language Modeling (MLM) pre-training marginally improved performance, suggesting that further domain-specific adaptation of language models could be beneficial.

## Plans

Moving forward, the research can focus on:
- **Further Optimization**: Enhancing the efficiency of the models to make them accessible for real-time applications in clinical settings.
- **Expanding Dataset Usage**: Applying the DescEmb framework to additional datasets and exploring its effectiveness across different healthcare systems.
- **Advanced Model Architectures**: Investigating the integration of more complex neural architectures and their impact on the performance of EHR-based predictive models.

## Conclusion

The DescEmb approach marks a significant step forward in the use of NLP techniques for EHR data, offering a promising avenue for enhancing predictive healthcare analytics.
