djcrocker/structvar-bench


StructVar-Bench: A Multi-Modal Structural Dataset and Benchmark for Missense Variant Pathogenicity Prediction

Clinicians frequently identify genetic variants but lack reliable methods to determine whether they are benign or pathogenic (disease-causing). While tools like AlphaFold provide high-quality protein structures, and FoldX enables energetic mutation analysis, these signals are rarely unified into a single, scalable framework.

StructVar-Bench bridges this gap by constructing a large-scale dataset that integrates:

  • structural context (AlphaFold)
  • thermodynamic stability (FoldX ΔΔG)
  • evolutionary priors (BLOSUM62)
  • physicochemical changes
  • graph-based representations of protein micro-environments

This repository provides both:

  • a benchmark dataset for variant pathogenicity prediction
  • an end-to-end pipeline for training classical ML models and Graph Neural Networks (GNNs)

✅ Results

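| Model         | AUC-ROC | Accuracy | Precision | Recall |
|---------------|---------|----------|-----------|--------|
| Random Forest | 0.8621  | 0.7805   | 0.7674    | 0.7384 |
| XGBoost       | 0.8642  | 0.7839   | 0.7783    | 0.7301 |
| GCN           | 0.8694  | 0.7972   | 0.8175    | 0.7100 |
| GAT           | 0.8572  | 0.7838   | 0.7970    | 0.7002 |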

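The GCN achieves the highest AUC-ROC (0.8694), accuracy, and precision, narrowly ahead of XGBoost (0.8642) and Random Forest (0.8621). The GAT (0.8572) trails both baselines on AUC-ROC, and Random Forest retains the best recall (0.7384). The margins are small, but the GCN result suggests that local 3D structural context can provide predictive signal beyond engineered features alone.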

🧠 Method Overview

  1. Data Acquisition: ClinVar + UniProt + AlphaFold
  2. Filtering: High-confidence missense variants only
  3. Energy Modeling: FoldX ΔΔG
  4. Feature Engineering: Physicochemical + evolutionary + structural
  5. Graph Construction: Local residue neighborhoods
  6. Model Training and Evaluation: RF, XGBoost, GCN, GAT
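Step 3 relies on the two FoldX commands named later in this README (RepairPDB and BuildModel). The following is a minimal sketch of how their inputs might be assembled for one variant. It is not taken from `run_foldx.py`: the helper names are hypothetical, the binary is assumed to be on the PATH as `foldx`, and chain `A` is assumed because AlphaFold DB models are single-chain. The sketch only builds the argument lists; it does not run FoldX.

```python
# Hypothetical helpers for illustration; not the repository's run_foldx.py.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def foldx_mutation_code(wild_type, position, mutant, chain="A"):
    """Build one individual_list.txt entry: wild type, chain, position, mutant, ';'."""
    return f"{THREE_TO_ONE[wild_type]}{chain}{position}{THREE_TO_ONE[mutant]};"


def foldx_commands(pdb_name, mutant_file="individual_list.txt", binary="foldx"):
    """Return argument lists for RepairPDB followed by BuildModel (assumed FoldX CLI flags)."""
    repaired = pdb_name.replace(".pdb", "_Repair.pdb")  # RepairPDB output name
    repair = [binary, "--command=RepairPDB", f"--pdb={pdb_name}"]
    build = [binary, "--command=BuildModel", f"--pdb={repaired}",
             f"--mutant-file={mutant_file}"]
    return repair, build


code = foldx_mutation_code("Pro", 373, "Ser")
repair_cmd, build_cmd = foldx_commands("O00255.pdb")
print(code)
print(" ".join(repair_cmd))
print(" ".join(build_cmd))
```

The argument lists could be passed to `subprocess.run` once the paths match a local FoldX installation.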

📝 Dataset Details (cohort_final.csv)

| Column Name         | Description                                              | Values                   | Example                                       |
|---------------------|----------------------------------------------------------|--------------------------|-----------------------------------------------|
| Name                | ClinVar variant name                                     | string                   | NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser)  |
| Gene Symbol         | Gene affected by the variant                             | string                   | MEN1                                          |
| UniProtID           | UniProt protein ID                                       | string                   | O00255                                        |
| Chromosome          | Chromosome carrying the variant                          | 1-22, X, Y               | 11                                            |
| WildType            | Wild-type amino acid                                     | 3-letter amino acid code | Pro                                           |
| ResidueIndex        | Residue position of the substitution                     | float                    | 373.0                                         |
| MutantAA            | Mutant amino acid                                        | 3-letter amino acid code | Ser                                           |
| Class               | Whether the variant causes disease                       | Benign, Pathogenic       | Pathogenic                                    |
| ReviewStatus        | ClinVar review status                                    | string                   | criteria provided, single submitter           |
| pLDDT               | AlphaFold's per-residue confidence                       | 0-100, float             | 97.94                                         |
| StructureFile       | AlphaFold .pdb.gz file                                   | file path                | AF-O00255-F1-model_v6.pdb.gz                  |
| MutantStructureFile | Repaired and mutated PDB                                 | file path                | O00255_P373S.pdb                              |
| ddG                 | FoldX ΔΔG: change in folding free energy (kcal/mol)      | float                    | 3.21647                                       |
| d_Hydrophobicity    | Change in hydrophobicity                                 | float                    | 0.8                                           |
| d_Charge            | Change in charge at pH 7                                 | float                    | 0.0                                           |
| d_MW                | Change in molecular weight (Da)                          | float                    | -10.0                                         |
| Blosum62            | Substitution score from the BLOSUM62 matrix              | float                    | -1.0                                          |
| RSA                 | Relative solvent accessibility                           | float                    | 0.0002387318563789                            |
| SecStruct           | Secondary structure                                      | Helix, Sheet, Coil       | Helix                                         |
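Because `ResidueIndex` is stored as a float and `Class` as a string, a loader typically needs a small typing step. The sketch below uses only a subset of the columns above and an in-memory CSV built from the example row. The helper `parse_row`, the 0/1 label encoding, and the consistency check against the `Name` field are illustrative assumptions, not part of the repository.

```python
import csv
import io
import re

# Matches the protein-level change in a ClinVar name, e.g. "(p.Pro373Ser)".
PROTEIN_CHANGE = re.compile(r"\(p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)")


def parse_row(row):
    """Convert one CSV row (dict of strings) into typed fields plus a 0/1 label."""
    position = int(float(row["ResidueIndex"]))  # stored as a float, e.g. "373.0"
    label = 1 if row["Class"] == "Pathogenic" else 0  # assumed encoding
    match = PROTEIN_CHANGE.search(row["Name"])
    # True only when the name's protein change agrees with the three columns.
    consistent = bool(match) and match.groups() == (
        row["WildType"], str(position), row["MutantAA"])
    return {"uniprot": row["UniProtID"], "position": position,
            "label": label, "name_consistent": consistent}


# In-memory stand-in for cohort_final.csv, using the example values from the table.
sample = io.StringIO(
    "Name,UniProtID,WildType,ResidueIndex,MutantAA,Class\n"
    '"NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser)",O00255,Pro,373.0,Ser,Pathogenic\n'
)
rows = [parse_row(r) for r in csv.DictReader(sample)]
print(rows[0])
```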

🧬 Graph Details

Variant-Level Features (Graph Features)

  • ΔΔG
  • BLOSUM62 substitution score
  • Δ Hydrophobicity, Δ Charge, Δ Molecular Weight
  • RSA (Relative Solvent Accessibility)
  • Secondary Structure (Helix / Sheet / Coil)
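The Δ features are differences between mutant and wild-type residue properties. As a minimal sketch, the code below assumes the Kyte-Doolittle hydropathy scale (which reproduces the `d_Hydrophobicity` value of 0.8 shown for Pro→Ser in the dataset table) and a simplified integer side-chain charge at pH 7 with histidine treated as neutral. Both choices, and the helper name `delta_features`, are assumptions; the repository's `extract_features.py` may use different tables.

```python
# Kyte-Doolittle hydropathy values (assumed scale).
KYTE_DOOLITTLE = {
    "Ala": 1.8, "Arg": -4.5, "Asn": -3.5, "Asp": -3.5, "Cys": 2.5,
    "Gln": -3.5, "Glu": -3.5, "Gly": -0.4, "His": -3.2, "Ile": 4.5,
    "Leu": 3.8, "Lys": -3.9, "Met": 1.9, "Phe": 2.8, "Pro": -1.6,
    "Ser": -0.8, "Thr": -0.7, "Trp": -0.9, "Tyr": -1.3, "Val": 4.2,
}

# Simplified side-chain charge at pH 7; residues not listed count as 0 (assumption).
CHARGE_PH7 = {"Asp": -1.0, "Glu": -1.0, "Lys": 1.0, "Arg": 1.0}


def delta_features(wild_type, mutant):
    """Return mutant-minus-wild-type differences for hydrophobicity and charge."""
    d_hydro = round(KYTE_DOOLITTLE[mutant] - KYTE_DOOLITTLE[wild_type], 2)
    d_charge = CHARGE_PH7.get(mutant, 0.0) - CHARGE_PH7.get(wild_type, 0.0)
    return {"d_Hydrophobicity": d_hydro, "d_Charge": d_charge}


print(delta_features("Pro", "Ser"))
print(delta_features("Lys", "Glu"))
```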

Structural Features (Node-Level)

  • Amino acid identity (one-hot)
  • pLDDT
  • Hydrophobicity, Charge, Molecular Weight
  • Center residue indicator

Graph Construction

  • Nodes: residues within 10Å of mutation
  • Edges: CA–CA distance < 8Å
    • Edge feature: Euclidean distance
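The construction rules above can be sketched in a few lines. This version assumes the 10 Å neighborhood is measured between CA atoms (the README does not say which atoms define it) and takes already-parsed CA coordinates as input. The function name and the toy coordinates are hypothetical.

```python
import math


def build_local_graph(ca_coords, center, node_radius=10.0, edge_cutoff=8.0):
    """Build a local residue graph around `center`.

    ca_coords maps residue index -> (x, y, z) CA coordinates in angstroms.
    Returns the sorted node list and (i, j, distance) edges.
    """
    origin = ca_coords[center]
    # Nodes: residues whose CA lies within node_radius of the mutated residue.
    nodes = sorted(r for r, xyz in ca_coords.items()
                   if math.dist(xyz, origin) <= node_radius)
    edges = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = math.dist(ca_coords[a], ca_coords[b])
            if d < edge_cutoff:  # Edges: CA-CA distance below the cutoff.
                edges.append((a, b, d))  # Euclidean distance as edge feature
    return nodes, edges


# Toy chain: six residues spaced 3.8 angstroms apart along the x axis.
coords = {i: (3.8 * (i - 1), 0.0, 0.0) for i in range(1, 7)}
nodes, edges = build_local_graph(coords, center=3)
print(nodes)
print(len(edges))
```

In the toy chain, residue 6 lies 11.4 Å from the center and is excluded, and only residue pairs one or two positions apart fall under the 8 Å edge cutoff.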

🤖 Models

Classical ML

  • Random Forest: primary baseline
  • XGBoost

Graph Neural Networks

  • Graph Convolutional Network (GCN): deep convolution + pooling
  • Graph Attention Network (GAT): multi-head attention (4 heads)
  • GCNSimple: lightweight GCN baseline

⚙️ Training Details

  • Loss function: Binary Cross-Entropy with Logits (BCEWithLogitsLoss)
  • Optimization: Adam optimizer with weight decay for regularization
  • Learning rate scheduling: ReduceLROnPlateau, adjusting learning rate based on validation AUC
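For reference, the quantity that `BCEWithLogitsLoss` computes (with its default mean reduction) can be written in plain Python. This is a worked illustration of the loss formula only, not the repository's training loop. The function names are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit; target is 0.0 or 1.0.

    Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))],
    rearranged so that exp() never overflows for large |x|.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-sample loss over a batch."""
    return sum(bce_with_logits(x, y) for x, y in zip(logits, targets)) / len(logits)


print(bce_with_logits(0.0, 1.0))       # an uninformative logit costs log(2)
print(bce_with_logits(1000.0, 1.0))    # confident and correct: loss near 0
print(mean_bce_with_logits([2.0, -2.0], [1.0, 0.0]))
```

Working on logits rather than probabilities is what lets the loss stay finite when the model is very confident.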

Evaluation Metrics

  • AUC-ROC (primary)
  • Accuracy
  • Precision
  • Recall
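As a small worked example of these four metrics, the sketch below computes them from labels and predicted scores without external libraries. The 0.5 decision threshold and the function name are assumptions; `evaluate_model.py` may use a different threshold or a library implementation. AUC-ROC is computed here as the probability that a random pathogenic example scores above a random benign one, counting ties as one half.

```python
def classification_metrics(y_true, scores, threshold=0.5):
    """Return accuracy, precision, recall, and AUC-ROC for binary labels (1 = positive)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 0)

    # AUC-ROC via pairwise comparison of positive and negative scores.
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)

    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc_roc": wins / (len(pos) * len(neg)),
    }


metrics = classification_metrics(
    y_true=[1, 1, 0, 0, 1, 0],
    scores=[0.9, 0.6, 0.4, 0.7, 0.3, 0.2],
)
print(metrics)
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration but slow on a full test set.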

📁 Repository Structure

src/
├── local/
|   ├── verify_integrity.py         # Checks proper file and column loading for ClinVar and AlphaFold (1 sample) files
|   ├── build_cohort.py             # Filters cohort to only human entries that meet criteria, maps to UniProt IDs, outputs CSV
|   ├── filter_structures.py        # Cross-references structures from the mapped cohort with the AlphaFold DB
|   └── run_foldx.py                # Runs the RepairPDB and BuildModel FoldX processes on the filtered cohort
|
├── parallel/
|   ├── split_workload.py           # Splits a CSV into n parts (used for worker_foldx.py)
|   ├── worker_foldx.py             # run_foldx.py, but uses a worker_id argument for an assigned portion of the filtered cohort
|   └── merge_csvs.py               # Merges CSVs in a directory
|
├── features/
|   ├── audit.py                    # Lists all proteins in filtered cohort, compares to /data/structures, finds missing repaired PDBs
|   ├── run_missing_repairs.py      # Runs RepairPDB on missing proteins found in audit.py
|   ├── extract_features.py         # Generates variant-level physicochemical and structural features for cohort_with_ddg.csv
|   └── generate_graphs.py          # Generates per-variant local structural graphs (JSON) from wild-type PDBs and cohort_features.csv
|
├── ml/
|   ├── train_baseline_rf.py        # Trains and evaluates a Random Forest classifier to predict variant pathogenicity
|   ├── train_baseline_xgb.py       # Trains and evaluates an XGBoost classifier to predict variant pathogenicity
|   ├── gnn_model.py                # Sets up GCN, GAT, and GCNSimple
|   ├── graph_dataset.py            # Loads JSON files from generate_graphs.py and converts them into a .pt file
|   ├── train_gnn.py                # GNN training workflow that implements model choosing and epoch patience
|   ├── balance_check.py            # Checks Benign/Pathogenic balance in test_set.csv (can be used for any CSV)
|   └── evaluate_model.py           # Evaluates models on test set and outputs Accuracy/Recall/Precision/AUC-ROC and figures
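The parallel FoldX stage depends on dividing the cohort into n parts, one per worker. A minimal sketch of that partitioning step is shown below. It is not the code in `split_workload.py`: the function name is hypothetical, and contiguous, near-equal chunks are an assumed strategy.

```python
def split_rows(rows, n_parts):
    """Split rows into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for worker_id in range(n_parts):
        # The first `extra` workers each take one additional row.
        size = base + (1 if worker_id < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks


chunks = split_rows(list(range(10)), 3)
print(chunks)
```

Each chunk index could then serve as the `worker_id` passed to a worker script.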

🗂️ Data Organization

data/                               
├── raw/                            # Not committed due to size
|   ├── variant_summary.txt         # Ground truth labels
|   ├── alphaFold_human/            # 3D structures in .pdb.gz or .cif.gz format
|   ├── human_reviewed.fasta        # Sequences
|   └── human_id_mapping.tsv        # Metadata table with RefSeq and AlphaFold columns    
├── processed/ 
|   ├── cohort_final.csv            # Cohort with variant-level physicochemical and structural features
|   ├── structvar_graphs.pt         # JSON graphs reformatted as a .pt object for GNN training; external
|   └── structures/                 # External
|       ├── mutants/                # Mutant PDBs (format: UniProtID_MutationKey.pdb)
|       └── wildtype/               # Repaired wild-type PDBs (format: WT_UniProtID.pdb)
├── figures/     
|   ├── Random Forest/              # Contains confusion_matrix.png, roc_curve.png, feature_importance.png
|   ├── XGBoost/                    # Contains confusion_matrix.png, roc_curve.png, feature_importance.png
|   ├── GCN/                        # Contains confusion_matrix.png, roc_curve.png
|   ├── GAT/                        # Contains confusion_matrix.png, roc_curve.png
|   ├── roc_curve_stacked.png       # All models' ROC curves
|   └── graph_viz_3d/               # HTML files with 3D viewer of JSON graphs
├── splits/                         
|   ├── test_set.csv                # Test split
|   ├── train_set.csv               # Train split
|   └── val_set.csv                 # Validation split
└── graphs/                         # JSON graphs generated by generate_graphs.py; external

models/
├── baseline_rf.pkl                 # Random Forest model trained on 80/10/10 Train/Test/Val split
├── baseline_xgb.pkl                # XGBoost model trained on 80/10/10 Train/Test/Val split
├── best_gcn.pth                    # Best GCN model from most recent train
├── best_gat.pth                    # Best GAT model from most recent train
└── training_history.csv            # Epoch history for latest fully-trained GNN model
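The baselines are described as trained on an 80/10/10 split. The sketch below shows one way such a split could be produced while keeping the Benign/Pathogenic ratio similar in each part. Whether the repository stratifies by class, and which random seed it uses, is not stated, so the stratification, the seed, and the function name here are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, split per class by the given fractions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = ["Benign"] * 50 + ["Pathogenic"] * 50
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```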

📦 Data Availability

Due to size constraints, large artifacts are hosted externally.

GitHub

  • All code
  • Figures
  • Dataset splits
  • Models
  • cohort_final.csv
  • Logs (terminal_log.md)

External Downloads

  • structures.zip: repaired wild-type and mutant PDBs
  • structvar_graphs.pt: PyG dataset

📚 Resources Used and Downloads
