Last updated: 05/01/2026
This is the official and only repository for the paper "HERMES: Graph-Based Healthcare Prediction Model using Clinical-Text Only", which has been submitted to CITA '26 and is currently under review. If you have any trouble reproducing the results, please reach out to me via the personal email on my GitHub profile.
- MIMIC Dataset: Obtain access to MIMIC-III or MIMIC-IV from PhysioNet
- PrimeKG: Download the PrimeKG knowledge graph
- Environment: Python 3.10+, PyTorch, PyTorch Lightning (see requirements.txt)
- Place raw data in the dataset/raw/ directory
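
The prerequisites above can be checked with a small pre-flight script before anything is run. This is only a sketch and is not part of the repository: the `version_ok` helper is a hypothetical name, and the script assumes it is run from the repository root. Dependencies themselves would be installed in the usual way with `pip install -r requirements.txt`.

```shell
#!/usr/bin/env bash
# Pre-flight check (illustrative sketch, not shipped with the repository).
# Run from the repository root.

# Return 0 if a "major.minor" version string is at least 3.10.
version_ok() {
    local major minor
    major="${1%%.*}"        # text before the first dot
    minor="${1#*.}"         # text after the first dot
    minor="${minor%%.*}"    # drop any patch component
    if [ "$major" -gt 3 ]; then
        return 0
    fi
    [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]
}

if command -v python3 >/dev/null 2>&1; then
    py_version="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
    if version_ok "$py_version"; then
        echo "OK: Python $py_version"
    else
        echo "WARN: Python $py_version found, 3.10+ expected"
    fi
else
    echo "WARN: python3 not found on PATH"
fi

if [ -d dataset/raw ]; then
    echo "OK: dataset/raw/ exists"
else
    echo "WARN: dataset/raw/ is missing; create it and place the raw data there"
fi
```

The script only prints warnings and never aborts, so it is safe to run at any point during setup.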
# Step 1: Preprocess raw MIMIC-III data (EHR, notes, ICD codes)
bash scripts/mimic-iii-preprocess.sh
# Step 2: Run full pipeline (graph extraction, embeddings, training, evaluation)
bash scripts/mimic-iii-full-pipeline.sh

# Step 1: Preprocess raw MIMIC-IV data
bash scripts/mimic-iv-preprocess.sh
# Step 2: Run full pipeline
bash scripts/mimic-iv-full-pipeline.sh

Note: Each script contains multiple steps that can be run individually. Review and uncomment the desired steps before execution.
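
If both stages should run back to back for one dataset, a thin wrapper along the following lines may be convenient. It is a sketch under assumptions, not something the repository provides: the `run_pipeline` function and the log file names under logs/ are hypothetical, and each script still executes only the steps that are currently uncommented inside it.

```shell
#!/usr/bin/env bash
# Illustrative wrapper (not part of the repository): runs the preprocess and
# full-pipeline scripts for one dataset in order, stopping at the first failure.

run_pipeline() {
    dataset="$1"    # "mimic-iii" or "mimic-iv"
    mkdir -p logs
    for stage in preprocess full-pipeline; do
        script="scripts/${dataset}-${stage}.sh"
        log="logs/${dataset}-${stage}.log"    # log file name is an assumption
        echo "Running ${script} (output saved to ${log})"
        if ! bash "$script" >"$log" 2>&1; then
            echo "Failed: ${script}; see ${log}" >&2
            return 1
        fi
    done
}

# Usage, from the repository root:
#   run_pipeline mimic-iii
#   run_pipeline mimic-iv
```

Stopping at the first failure avoids launching the full pipeline on incomplete preprocessing output.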
HERMES-EHR/
├── README.md
├── config/  # Configuration files
│   ├── experiment_config.yaml  # Training & experiment hyperparameters
│   ├── mimic_iii_config.yaml  # MIMIC-III dataset paths & settings
│   └── mimic_iv_config.yaml  # MIMIC-IV dataset paths & settings
├── dataset/  # Data folder (not in repo)
│   ├── raw/  # Raw data (MIMIC, PrimeKG, ...)
│   ├── intermediate/  # Temporary data (processed MIMIC data, splits, ...)
│   └── processed/  # Final training data & processed EHR
├── logs/  # Training logs (not in repo)
├── results/  # Experiment results & metrics
├── papers/  # Related scientific papers
│   └── threats/  # Papers that challenge our research
├── scripts/  # Bash scripts for pipeline execution
│   ├── mimic-iii-preprocess.sh  # MIMIC-III raw data preprocessing
│   ├── mimic-iii-full-pipeline.sh  # MIMIC-III complete pipeline
│   ├── mimic-iv-preprocess.sh  # MIMIC-IV raw data preprocessing
│   ├── mimic-iv-full-pipeline.sh  # MIMIC-IV complete pipeline
│   └── test.sh  # Custom script for dev test
└── src/  # Main source code
    ├── data/  # Data processing modules
    │   ├── preprocessing.py  # General data preprocessing utilities
    │   ├── note_graphs.py  # Clinical notes → knowledge graphs
    │   ├── graph_embedding.py  # Graph embeddings with BGE-M3
    │   ├── create_training.py  # Create final HDF5 training files
    │   └── training_data_split.py  # Train/val/test splitting
    ├── evaluation/  # Evaluation & metrics
    │   └── evaluation_toolkit.py  # Bootstrap metrics, plots, AUROC/AUPRC
    ├── experiment/  # Experiment orchestration
    │   └── run_experiment.py  # Main experiment loop & grid search
    ├── KGSum/  # Knowledge Graph Summarization (LLM-based)
    │   ├── entity_extractor.py  # Extract entities from clinical notes
    │   ├── relation_extractor.py  # Extract relations between entities
    │   ├── kgsum_agent.py  # Main KGSum orchestration agent
    │   └── prompts.py  # LLM prompts for KG extraction
    ├── language_models/  # Language model wrappers
    │   ├── bgem3.py  # BGE-M3 embedding model
    │   ├── clinical_longformer.py  # Clinical Longformer encoder
    │   └── call_llm_mistral.py  # Mistral API wrapper
    ├── mimic-preprocessing/  # MIMIC dataset preprocessing
    ├── models/  # Neural network architectures
    ├── training/  # Training components
    │   ├── data_loader.py  # PyTorch Lightning DataModule
    │   ├── emerge.py  # EMERGE multimodal fusion model
    │   ├── ehr_encoder.py  # EHR time-series encoder (Raindrop)
    │   ├── graph_encoder.py  # GNN encoder (GCN/GAT/RGCN)
    │   └── text_fusion.py  # Text modality fusion layers
    └── utils/  # Utility functions
        ├── files_loader.py  # File I/O (YAML, CSV, JSON, H5)
        ├── logging.py  # Logging configuration
        └── cleanup.py  # Resource cleanup utilities
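
Since the tree marks dataset/ and logs/ as not in the repo, a fresh clone will not contain them. Assuming the scripts do not create these folders on their own (not verified here), they can be prepared up front from the repository root:

```shell
# Create the data and log folders that the tree marks as "not in repo".
# -p makes this safe to re-run: existing folders are left untouched.
mkdir -p dataset/raw dataset/intermediate dataset/processed logs
```

After this, the MIMIC and PrimeKG downloads go under dataset/raw/.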