HERMES: Graph-Based Healthcare Prediction Model using Clinical-Text Only

Last updated: 05/01/2026

Overview

This is the official and only repository for the paper "HERMES: Graph-Based Healthcare Prediction Model using Clinical-Text Only", which has been submitted to CITA '26 and is currently under review. If you have trouble reproducing the results, please contact me at the email address listed on my GitHub profile.

Abstract

Quick Start

Prerequisites

MIMIC Dataset: Obtain access to MIMIC-III or MIMIC-IV from PhysioNet
PrimeKG: Download the PrimeKG knowledge graph
Environment: Python 3.10+, PyTorch, PyTorch Lightning (see requirements.txt)
Data location: Place the raw data (MIMIC, PrimeKG) in the dataset/raw/ directory
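Before running any script, it can help to confirm that the basics listed above are in place. The snippet below is a minimal sketch, not part of this repository: the function name `check_prerequisites` is hypothetical, and it only checks the Python version and that dataset/raw/ exists and is not empty. It does not verify which MIMIC or PrimeKG files are present.

```python
import sys
from pathlib import Path


def check_prerequisites(raw_dir="dataset/raw", min_python=(3, 10)):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: it checks only the interpreter version and that the
    raw data directory exists and contains at least one entry.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    raw = Path(raw_dir)
    if not raw.is_dir():
        problems.append(f"missing raw data directory: {raw}")
    elif not any(raw.iterdir()):
        problems.append(f"raw data directory is empty: {raw}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run it from the repository root so that the relative path dataset/raw resolves correctly.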

MIMIC-III Pipeline

# Step 1: Preprocess raw MIMIC-III data (EHR, notes, ICD codes)
bash scripts/mimic-iii-preprocess.sh

# Step 2: Run full pipeline (graph extraction, embeddings, training, evaluation)
bash scripts/mimic-iii-full-pipeline.sh

MIMIC-IV Pipeline

# Step 1: Preprocess raw MIMIC-IV data
bash scripts/mimic-iv-preprocess.sh

# Step 2: Run full pipeline
bash scripts/mimic-iv-full-pipeline.sh

Note: Each script contains multiple steps that can be run individually. Review and uncomment the desired steps before execution.
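If you prefer to drive the two steps from Python, for example inside a notebook or a job scheduler, the sketch below runs them in order and stops at the first failure. It is an illustration only: `pipeline_commands` and `run_pipeline` are hypothetical names, and the only things taken from this repository are the four script paths shown above.

```python
import subprocess
from pathlib import Path

SCRIPTS_DIR = Path("scripts")  # relative to the repository root


def pipeline_commands(dataset):
    """Return the preprocess and full-pipeline commands, in run order."""
    if dataset not in ("mimic-iii", "mimic-iv"):
        raise ValueError(f"unknown dataset: {dataset!r}")
    return [
        ["bash", str(SCRIPTS_DIR / f"{dataset}-preprocess.sh")],
        ["bash", str(SCRIPTS_DIR / f"{dataset}-full-pipeline.sh")],
    ]


def run_pipeline(dataset):
    """Run both steps; check=True raises CalledProcessError if a step fails."""
    for cmd in pipeline_commands(dataset):
        subprocess.run(cmd, check=True)


# Example (run from the repository root):
# run_pipeline("mimic-iii")
```

Because each script still decides which internal steps are active, uncomment the desired steps in the scripts before calling `run_pipeline`.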

Project's Directory Tree

HERMES-EHR/
├── README.md
├── config/                                 # Configuration files
│   ├── experiment_config.yaml              # Training & experiment hyperparameters
│   ├── mimic_iii_config.yaml               # MIMIC-III dataset paths & settings
│   └── mimic_iv_config.yaml                # MIMIC-IV dataset paths & settings
├── dataset/                                # Data folder (not in repo)
│   ├── raw/                                # Raw data (MIMIC, PrimeKG,...)
│   ├── intermediate/                       # Temporary data (processed MIMIC data, splits,...)
│   └── processed/                          # Final training data & processed EHR
├── logs/                                   # Training logs (not in repo)
├── results/                                # Experiment results & metrics
├── papers/                                 # Related scientific papers
│   └── threats/                            # Papers that challenge our research
├── scripts/                                # Bash scripts for pipeline execution
│   ├── mimic-iii-preprocess.sh             # MIMIC-III raw data preprocessing
│   ├── mimic-iii-full-pipeline.sh          # MIMIC-III complete pipeline
│   ├── mimic-iv-preprocess.sh              # MIMIC-IV raw data preprocessing
│   ├── mimic-iv-full-pipeline.sh           # MIMIC-IV complete pipeline
│   └── test.sh                             # Custom script for dev test
└── src/                                    # Main source code
    ├── data/                               # Data processing modules
    │   ├── preprocessing.py                # General data preprocessing utilities
    │   ├── note_graphs.py                  # Clinical notes → knowledge graphs
    │   ├── graph_embedding.py              # Graph embeddings with BGE-M3
    │   ├── create_training.py              # Create final HDF5 training files
    │   └── training_data_split.py          # Train/val/test splitting
    ├── evaluation/                         # Evaluation & metrics
    │   └── evaluation_toolkit.py           # Bootstrap metrics, plots, AUROC/AUPRC
    ├── experiment/                         # Experiment orchestration
    │   └── run_experiment.py               # Main experiment loop & grid search
    ├── KGSum/                              # Knowledge Graph Summarization (LLM-based)
    │   ├── entity_extractor.py             # Extract entities from clinical notes
    │   ├── relation_extractor.py           # Extract relations between entities
    │   ├── kgsum_agent.py                  # Main KGSum orchestration agent
    │   └── prompts.py                      # LLM prompts for KG extraction
    ├── language_models/                    # Language model wrappers
    │   ├── bgem3.py                        # BGE-M3 embedding model
    │   ├── clinical_longformer.py          # Clinical Longformer encoder
    │   └── call_llm_mistral.py             # Mistral API wrapper
    ├── mimic-preprocessing/                # MIMIC dataset preprocessing
    ├── models/                             # Neural network architectures
    ├── training/                           # Training components
    │   ├── data_loader.py                  # PyTorch Lightning DataModule
    │   ├── emerge.py                       # EMERGE multimodal fusion model
    │   ├── ehr_encoder.py                  # EHR time-series encoder (Raindrop)
    │   ├── graph_encoder.py                # GNN encoder (GCN/GAT/RGCN)
    │   └── text_fusion.py                  # Text modality fusion layers
    └── utils/                              # Utility functions
        ├── files_loader.py                 # File I/O (YAML, CSV, JSON, H5)
        ├── logging.py                      # Logging configuration
        └── cleanup.py                      # Resource cleanup utilities
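The tree lists bootstrap metrics and AUROC under src/evaluation/. For readers unfamiliar with the idea, the snippet below is a minimal, standard-library sketch of a percentile bootstrap confidence interval for AUROC. It is not the code in evaluation_toolkit.py: the function names, the default of 1000 resamples, and the 95% interval are all assumptions chosen for illustration.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as 0.5. Quadratic in the number of samples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) for AUROC."""
    point = auroc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined when a resample has one class only
        stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lower, upper
```

A call such as `bootstrap_auroc_ci(y_true, y_score)` returns the point estimate together with the interval bounds. The pairwise AUROC here favors readability over speed, so a rank-based implementation is a better fit for large test sets.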
