StructVar-Bench: A Multi-Modal Structural Dataset and Benchmark for Missense Variant Pathogenicity Prediction
Clinicians frequently identify genetic variants but lack reliable methods to determine whether they are benign or pathogenic (disease-causing). While tools like AlphaFold provide high-quality protein structures, and FoldX enables energetic mutation analysis, these signals are rarely unified into a single, scalable framework.
StructVar-Bench bridges this gap by constructing a large-scale dataset that integrates:
- structural context (AlphaFold)
- thermodynamic stability (FoldX ΔΔG)
- evolutionary priors (BLOSUM62)
- physicochemical changes
- graph-based representations of protein micro-environments
This repository provides both:
- a benchmark dataset for variant pathogenicity prediction
- an end-to-end pipeline for training classical ML and Graph Neural Networks (GNNs)
| Model | AUC-ROC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Random Forest | 0.8621 | 0.7805 | 0.7674 | 0.7384 |
| XGBoost | 0.8642 | 0.7839 | 0.7783 | 0.7301 |
| GCN | 0.8694 | 0.7972 | 0.8175 | 0.7100 |
| GAT | 0.8572 | 0.7838 | 0.7970 | 0.7002 |
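The GCN achieves the highest AUC-ROC, accuracy, and precision, narrowly ahead of the feature-based baselines, which suggests that local 3D structural context adds predictive signal beyond engineered features alone. The margin is small, however: the GAT trails both baselines on AUC-ROC, and both graph models have lower recall than Random Forest and XGBoost.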
- Data Acquisition: ClinVar + UniProt + AlphaFold
- Filtering: High-confidence missense variants only
- Energy Modeling: FoldX ΔΔG
- Feature Engineering: Physicochemical + evolutionary + structural
- Graph Construction: Local residue neighborhoods
- Model Training and Evaluation: RF, XGBoost, GCN, GAT
| Column Name | Description | Values | Example |
|---|---|---|---|
| Name | ClinVar mutation name | string | NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser) |
| Gene Symbol | What gene the mutation affects | string | MEN1 |
| UniProtID | UniProt protein ID | string | O00255 |
| Chromosome | Which chromosome the mutation affects | 1-22, X, Y | 11 |
| WildType | Wild-type amino acid | 3-letter amino acid code | Pro |
| ResidueIndex | Position of the mutated residue in the protein sequence | float | 373.0 |
| MutantAA | Mutant amino acid | 3-letter amino acid code | Ser |
| Class | If the mutation causes disease | Benign, Pathogenic | Pathogenic |
| ReviewStatus | ClinVar review status | string | criteria provided, single submitter |
| pLDDT | AlphaFold's confidence in the residue | 0-100, float | 97.94 |
| StructureFile | AlphaFold .pdb.gz file | file path | AF-O00255-F1-model_v6.pdb.gz |
| MutantStructureFile | Repaired and mutated PDB | file path | O00255_P373S.pdb |
| ddG | FoldX ΔΔG: change in folding free energy upon mutation (kcal/mol) | float | 3.21647 |
| d_Hydrophobicity | Change in hydrophobicity (mutant minus wild type) | float | 0.8 |
| d_Charge | Change in charge at pH 7 (mutant minus wild type) | float | 0.0 |
| d_MW | Change in molecular weight in Daltons (mutant minus wild type) | float | -10.0 |
| Blosum62 | Substitution score from the BLOSUM62 matrix | float | -1.0 |
| RSA | Relative Solvent Accessibility | float | 0.0002387318563789 |
| SecStruct | Secondary structure | Helix, Sheet, Coil | Helix |
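The `WildType`, `ResidueIndex`, and `MutantAA` columns can be recovered from the protein change embedded in the ClinVar `Name` field. The following is a minimal sketch, assuming the name ends in a `(p.Xxx123Yyy)` group as in the example row. The helper names `parse_missense` and `mutation_key` are hypothetical and are not taken from the repository.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches names such as "NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser)".
MISSENSE_RE = re.compile(
    r"\((?P<gene>[^()]+)\):.*"
    r"\(p\.(?P<wt>[A-Z][a-z]{2})(?P<pos>\d+)(?P<mut>[A-Z][a-z]{2})\)\s*$"
)


def parse_missense(name):
    """Return (gene, wild_type, position, mutant), or None if not a missense change."""
    match = MISSENSE_RE.search(name)
    if match is None:
        return None
    wt, mut = match["wt"], match["mut"]
    # Reject stop codons ("Ter"), non-standard residues, and identical residues.
    if wt not in THREE_TO_ONE or mut not in THREE_TO_ONE or wt == mut:
        return None
    return match["gene"], wt, int(match["pos"]), mut


def mutation_key(wt, pos, mut):
    """Build a one-letter mutation key such as P373S."""
    return f"{THREE_TO_ONE[wt]}{pos}{THREE_TO_ONE[mut]}"
```

For the example row, the key is `P373S`, which combined with the UniProt ID gives the `O00255_P373S.pdb` file name shown under `MutantStructureFile`.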
Variant-Level Features (Graph Features)
- ΔΔG
- BLOSUM62 substitution score
- Δ Hydrophobicity, Δ Charge, Δ Molecular Weight
- RSA (Relative Solvent Accessibility)
- Secondary Structure (Helix / Sheet / Coil)
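The physicochemical deltas are mutant-minus-wild-type differences of per-residue properties. The sketch below assumes the Kyte-Doolittle hydropathy scale and integer side-chain charges at pH 7 with histidine treated as neutral. The README does not name the scales it uses, although these choices reproduce the `d_Hydrophobicity = 0.8` and `d_Charge = 0.0` values of the Pro373Ser example row. The function name `physchem_deltas` is hypothetical.

```python
# Kyte-Doolittle hydropathy index (assumed scale; not confirmed by the repository).
HYDROPATHY = {
    "Ala": 1.8, "Arg": -4.5, "Asn": -3.5, "Asp": -3.5, "Cys": 2.5,
    "Gln": -3.5, "Glu": -3.5, "Gly": -0.4, "His": -3.2, "Ile": 4.5,
    "Leu": 3.8, "Lys": -3.9, "Met": 1.9, "Phe": 2.8, "Pro": -1.6,
    "Ser": -0.8, "Thr": -0.7, "Trp": -0.9, "Tyr": -1.3, "Val": 4.2,
}

# Approximate side-chain charge at pH 7 (assumption: His counted as neutral).
# Residues not listed here are treated as uncharged.
CHARGE = {"Asp": -1.0, "Glu": -1.0, "Lys": 1.0, "Arg": 1.0}


def physchem_deltas(wild_type, mutant):
    """Return mutant-minus-wild-type property changes for a substitution."""
    d_hydrophobicity = HYDROPATHY[mutant] - HYDROPATHY[wild_type]
    d_charge = CHARGE.get(mutant, 0.0) - CHARGE.get(wild_type, 0.0)
    return {
        "d_Hydrophobicity": round(d_hydrophobicity, 2),
        "d_Charge": d_charge,
    }


print(physchem_deltas("Pro", "Ser"))
```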
Structural Features (Node-Level)
- Amino acid identity (one-hot)
- pLDDT
- Hydrophobicity, Charge, Molecular Weight
- Center residue indicator
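One way to picture the node-level features is as a single flat vector per residue. The sketch below concatenates the listed features in the order given above. The ordering, the absence of any scaling, and the function name `node_features` are assumptions made for illustration, not the repository's actual encoding.

```python
# The 20 standard amino acids, in alphabetical order of their three-letter codes.
AMINO_ACIDS = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]


def node_features(residue, plddt, hydrophobicity, charge, mol_weight, is_center):
    """Return a 25-dimensional feature vector for one residue (node)."""
    # One-hot amino acid identity (20 values).
    one_hot = [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]
    # Confidence, physicochemical properties, and the center residue flag (5 values).
    return one_hot + [
        float(plddt),
        float(hydrophobicity),
        float(charge),
        float(mol_weight),
        1.0 if is_center else 0.0,
    ]
```

The center residue indicator lets a model distinguish the mutated residue from its neighbours, which otherwise share the same feature layout.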
Graph Construction
- Nodes: residues within 10 Å of the mutation
- Edges: CA–CA distance < 8 Å
- Edge feature: Euclidean distance
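The construction rules above can be sketched in a few lines of standard-library Python. This version assumes that the node neighbourhood is measured between CA atoms and that the 10 Å bound is inclusive; neither detail is stated here. The function name `build_local_graph` and its parameters are hypothetical.

```python
import math


def build_local_graph(ca_coords, center, node_radius=10.0, edge_cutoff=8.0):
    """Build a local residue graph around a mutated residue.

    ca_coords: list of (x, y, z) CA positions, one per residue.
    center: index of the mutated residue in ca_coords.
    Returns (nodes, edges), where each edge is (i, j, distance) with i < j.
    """
    # Nodes: residues whose CA atom lies within node_radius of the center CA.
    nodes = [
        i for i, xyz in enumerate(ca_coords)
        if math.dist(xyz, ca_coords[center]) <= node_radius
    ]
    # Edges: undirected pairs of selected residues closer than edge_cutoff.
    edges = []
    for offset, i in enumerate(nodes):
        for j in nodes[offset + 1:]:
            distance = math.dist(ca_coords[i], ca_coords[j])
            if distance < edge_cutoff:
                # The Euclidean distance is kept as the edge feature.
                edges.append((i, j, distance))
    return nodes, edges


coords = [(0, 0, 0), (5, 0, 0), (9, 0, 0), (20, 0, 0)]
nodes, edges = build_local_graph(coords, center=0)
print(nodes, edges)
```

In this toy example the fourth residue lies outside the 10 Å neighbourhood, and residues 0 and 2 are both nodes but are not connected because they are 9 Å apart.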
Classical ML
- Random Forest: primary baseline
- XGBoost
Graph Neural Networks
- Graph Convolutional Network (GCN): deep convolution + pooling
- Graph Attention Network (GAT): multi-head attention (4 heads)
- GCNSimple: lightweight GCN baseline
- Loss function: Binary Cross-Entropy with Logits (`BCEWithLogitsLoss`)
- Optimization: Adam optimizer with weight decay for regularization
- Learning rate scheduling: `ReduceLROnPlateau`, adjusting learning rate based on validation AUC
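The training components listed above fit together roughly as follows in PyTorch. This is a sketch under stated assumptions: the linear layer is only a stand-in for the GCN or GAT, and the learning rate, weight decay, scheduler factor, and patience are illustrative values rather than the repository's settings.

```python
import torch
from torch import nn

# Stand-in model that emits one logit per sample (the real models are GNNs).
model = nn.Linear(16, 1)

# BCEWithLogitsLoss applies the sigmoid internally, so the model outputs raw logits.
criterion = nn.BCEWithLogitsLoss()

# Adam with weight decay for regularization (hyperparameter values are assumed).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# mode="max" because the monitored quantity, validation AUC, should increase.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5
)

# One illustrative training step on random data.
features = torch.randn(8, 16)
labels = torch.randint(0, 2, (8, 1)).float()

optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()

# After each validation pass, report the validation AUC to the scheduler.
val_auc = 0.85  # placeholder value, not a measured result
scheduler.step(val_auc)
```

Passing the validation AUC to `scheduler.step` is what ties the learning rate to validation performance: the rate is reduced only after the AUC has stopped improving for `patience` epochs.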
Evaluation Metrics
- AUC-ROC (primary)
- Accuracy
- Precision
- Recall
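For reference, the four metrics can be computed from labels and predicted scores as shown below. This is a dependency-free sketch rather than the code in `evaluate_model.py`; the 0.5 decision threshold and the function name `classification_metrics` are assumptions. AUC-ROC is computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half.

```python
def classification_metrics(labels, scores, threshold=0.5):
    """Compute accuracy, precision, recall, and AUC-ROC for binary labels (0 or 1)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    # AUC-ROC via pairwise comparison of positive and negative scores.
    # This is quadratic in the number of samples and needs both classes present.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc_roc": wins / (len(pos) * len(neg)),
    }


print(classification_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

Unlike the other three metrics, AUC-ROC does not depend on the threshold, which is one reason it is a reasonable primary metric when comparing models.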
src/
├── local/
| ├── verify_integrity.py # Checks proper file and column loading for ClinVar and AlphaFold (1 sample) files
| ├── build_cohort.py # Filters cohort to only human entries that meet criteria, maps to UniProt IDs, outputs CSV
| ├── filter_structures.py # Cross-references structures from the mapped cohort with the AlphaFold DB
| └── run_foldx.py # Runs the RepairPDB and BuildModel FoldX processes on the filtered cohort
|
├── parallel/
| ├── split_workload.py # Splits a CSV into n parts (used for worker_foldx.py)
| ├── worker_foldx.py # run_foldx.py, but uses a worker_id argument for an assigned portion of the filtered cohort
| └── merge_csvs.py # Merges CSVs in a directory
|
├── features/
| ├── audit.py # Lists all proteins in filtered cohort, compares to /data/structures, finds missing repaired PDBs
| ├── run_missing_repairs.py # Runs RepairPDB on missing proteins found in audit.py
| ├── extract_features.py # Generates variant-level physicochemical and structural features for cohort_with_ddg.csv
| └── generate_graphs.py # Generates per-variant local structural graphs (JSON) from wild-type PDBs and cohort_features.csv
|
├── ml/
| ├── train_baseline_rf.py # Trains and evaluates a Random Forest classifier to predict variant pathogenicity
| ├── train_baseline_xgb.py # Trains and evaluates an XGBoost classifier to predict variant pathogenicity
| ├── gnn_model.py # Sets up GCN, GAT, and GCNSimple
| ├── graph_dataset.py # Loads JSON files from generate_graphs.py and converts them into a .pt file
| ├── train_gnn.py # GNN training workflow with model selection and epoch patience
| ├── balance_check.py # Checks Benign/Pathogenic balance in test_set.csv (can be used for any CSV)
| └── evaluate_model.py # Evaluates models on test set and outputs Accuracy/Recall/Precision/AUC-ROC and figures
data/
├── raw/ # Not committed due to size
| ├── variant_summary.txt # Ground truth labels
| ├── alphaFold_human/ # 3D structures in .pdb.gz or .cif.gz format
| ├── human_reviewed.fasta # Sequences
| └── human_id_mapping.tsv # Metadata table with RefSeq and AlphaFold columns
├── processed/
| ├── cohort_final.csv # Cohort with variant-level physicochemical and structural features
| ├── structvar_graphs.pt # JSON graphs reformatted as a .pt object for GNN training; external
| └── structures/ # External
| ├── mutants/ # Mutant PDBs (format: UniProtID_MutationKey.pdb)
| └── wildtype/ # Repaired wild-type PDBs (format: WT_UniProtID.pdb)
├── figures/
| ├── Random Forest/ # Contains confusion_matrix.png, roc_curve.png, feature_importance.png
| ├── XGBoost/ # Contains confusion_matrix.png, roc_curve.png, feature_importance.png
| ├── GCN/ # Contains confusion_matrix.png, roc_curve.png
| ├── GAT/ # Contains confusion_matrix.png, roc_curve.png
| ├── roc_curve_stacked.png # All models' ROC curves
| └── graph_viz_3d/ # HTML files with 3D viewer of JSON graphs
├── splits/
| ├── test_set.csv # Test split
| ├── train_set.csv # Train split
| └── val_set.csv # Validation split
└── graphs/ # JSON graphs generated by generate_graphs.py; external
models/
├── baseline_rf.pkl # Random Forest model trained on 80/10/10 Train/Test/Val split
├── baseline_xgb.pkl # XGBoost model trained on 80/10/10 Train/Test/Val split
├── best_gcn.pth # Best GCN model from most recent train
├── best_gat.pth # Best GAT model from most recent train
└── training_history.csv # Epoch history for latest fully-trained GNN model
Due to size constraints, large artifacts are hosted externally.
GitHub
- All code
- Figures
- Dataset splits
- Models
- `cohort_final.csv`
- Logs (`terminal_log.md`)
External Downloads
- `structures.zip`: repaired wild-type and mutant PDBs
- `structvar_graphs.pt`: PyG dataset
- ClinVar is a public archive of human genetic variants and their reported clinical significance.
- For this project, we use the ClinVar summary file, a tab-delimited text file with a header row: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/ -> variant_summary.txt.gz
- For future work, the full ClinVar XML dump may be used for more thorough labeling: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/
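A first-pass filter over `variant_summary.txt.gz` might look like the sketch below. The column names (`Type`, `Assembly`, `ClinicalSignificance`, `ReviewStatus`, `Name`) reflect the summary file's header as best understood here and should be checked against the downloaded file. The filtering criteria, including the GRCh38 restriction and the accepted review statuses, are assumptions and may differ from those applied in `build_cohort.py`.

```python
import csv
import gzip
import re

# Protein-level missense change at the end of the name, excluding stop codons ("Ter").
MISSENSE_RE = re.compile(r"\(p\.(?!Ter)[A-Z][a-z]{2}\d+(?!Ter)[A-Z][a-z]{2}\)\s*$")

# Labels kept for binary classification, matching the Class column values.
KEPT_LABELS = {"Benign", "Pathogenic"}

# Review statuses treated as sufficiently confident (an assumed criterion).
TRUSTED_PREFIXES = ("criteria provided", "reviewed by expert panel", "practice guideline")


def is_candidate(row):
    """Return True if a summary row looks like a labelled missense variant."""
    return (
        row.get("Type") == "single nucleotide variant"
        and row.get("Assembly") == "GRCh38"  # avoid counting both assemblies
        and row.get("ClinicalSignificance") in KEPT_LABELS
        and row.get("ReviewStatus", "").startswith(TRUSTED_PREFIXES)
        and MISSENSE_RE.search(row.get("Name", "")) is not None
    )


def iter_candidates(path):
    """Stream matching rows from a gzipped, tab-delimited summary file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if is_candidate(row):
                yield row
```

Streaming the file row by row keeps memory use low, which matters because the summary file covers every variant in ClinVar rather than only missense changes.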
- UniProtKB is a public database of protein sequence and functional information.
- For reviewed human entries, do the following:
  - Go to UniProtKB: https://www.uniprot.org/uniprotkb
  - Search for reviewed humans: `(taxonomy_id:9606) AND (reviewed:true)` -> https://www.uniprot.org/uniprotkb?query=*%28taxonomy_id%3A9606%29+AND+%28reviewed%3Atrue%29
  - Download FASTA (canonical) for sequences
  - Download TSV
    - Click customize columns and select PDB, AlphaFoldDB, and RefSeq for cross-referencing
- For all UniProtKB reviewed resources: https://www.uniprot.org/uniprotkb
- The AlphaFold Protein Structure Database (AlphaFold DB) provides access to hundreds of millions of protein structure predictions: https://alphafold.ebi.ac.uk/download
- Only the human proteome is needed here: https://alphafold.ebi.ac.uk/download -> UP000005640_9606_HUMAN_v4.tar
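AlphaFold stores the per-residue pLDDT score in the B-factor column of its PDB files, so the `pLDDT` value for a variant can be read directly from the downloaded structure. The sketch below parses the fixed-width `ATOM` records and keeps one value per residue from the CA atom. The function names `plddt_by_residue` and `load_plddt` are hypothetical.

```python
import gzip


def plddt_by_residue(lines):
    """Map residue number to pLDDT using the B-factor of each CA atom."""
    plddt = {}
    for line in lines:
        # PDB columns are fixed width: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (1-based), hence the 0-based slices below.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            plddt[int(line[22:26])] = float(line[60:66])
    return plddt


def load_plddt(path):
    """Read pLDDT values from a gzipped AlphaFold PDB file."""
    with gzip.open(path, "rt") as handle:
        return plddt_by_residue(handle)
```

For example, `load_plddt("AF-O00255-F1-model_v6.pdb.gz").get(373)` would look up the confidence at the mutated position of the example row, assuming that file is present locally.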