StructVar-Bench: A Multi-Modal Structural Dataset and Benchmark for Missense Variant Pathogenicity Prediction

Clinicians frequently identify genetic variants but lack reliable methods to determine whether they are benign or pathogenic (disease-causing). While AlphaFold provides high-quality predicted protein structures and FoldX enables energy-based analysis of mutations, these signals are rarely unified into a single, scalable framework.

StructVar-Bench bridges this gap by constructing a large-scale dataset that integrates:

structural context (AlphaFold)
thermodynamic stability (FoldX ΔΔG)
evolutionary priors (BLOSUM62)
physicochemical changes
graph-based representations of protein micro-environments

This repository provides both:

a benchmark dataset for variant pathogenicity prediction
an end-to-end pipeline for training classical ML models and Graph Neural Networks (GNNs)

✅ Results

| Model | AUC-ROC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Random Forest | 0.8621 | 0.7805 | 0.7674 | 0.7384 |
| XGBoost | 0.8642 | 0.7839 | 0.7783 | 0.7301 |
| GCN | 0.8694 | 0.7972 | 0.8175 | 0.7100 |
| GAT | 0.8572 | 0.7838 | 0.7970 | 0.7002 |

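The GCN slightly outperforms the feature-based baselines on AUC-ROC (0.8694 vs. 0.8642 for XGBoost), accuracy, and precision, while the GAT trails both baselines on AUC-ROC (0.8572). Both classical models achieve higher recall than either GNN. The margins are small, but the GCN result suggests that local 3D structural context can provide predictive signal beyond engineered features alone.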

🧠 Method Overview

Data Acquisition: ClinVar + UniProt + AlphaFold
Filtering: High-confidence missense variants only
Energy Modeling: FoldX ΔΔG
Feature Engineering: Physicochemical + evolutionary + structural
Graph Construction: Local residue neighborhoods
Model Training and Evaluation: RF, XGBoost, GCN, GAT
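
The filtering step can be sketched roughly as follows. This is a minimal illustration, not the logic in `build_cohort.py`: the column names (`Type`, `Name`, `ClinicalSignificance`, `ReviewStatus`) are those of ClinVar's `variant_summary.txt`, but the set of accepted review statuses and the two-class label mapping are assumptions, since "high-confidence" is not defined here. All sample rows except the MEN1 one are hypothetical.

```python
import csv
import io
import re

# Assumption: which ClinVar review statuses count as "high-confidence".
ACCEPTED_REVIEW = {
    "criteria provided, single submitter",
    "criteria provided, multiple submitters, no conflicts",
    "reviewed by expert panel",
    "practice guideline",
}
# Assumption: ClinVar significance terms are collapsed into two classes.
LABELS = {
    "Benign": "Benign",
    "Likely benign": "Benign",
    "Pathogenic": "Pathogenic",
    "Likely pathogenic": "Pathogenic",
}
# Protein change such as "(p.Pro373Ser)": residue, position, residue.
PROTEIN_CHANGE = re.compile(
    r"\(p\.(?P<wt>[A-Z][a-z]{2})(?P<pos>\d+)(?P<mut>[A-Z][a-z]{2})\)"
)


def is_missense(name):
    """True for a substitution that is neither nonsense (Ter) nor synonymous."""
    m = PROTEIN_CHANGE.search(name)
    return bool(m) and m["mut"] != "Ter" and m["wt"] != m["mut"]


def filter_rows(tsv_text):
    """Return (Name, label) pairs for rows that pass every filter."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        label = LABELS.get(row["ClinicalSignificance"])
        if (
            label is not None
            and row["ReviewStatus"] in ACCEPTED_REVIEW
            and row["Type"] == "single nucleotide variant"
            and is_missense(row["Name"])
        ):
            kept.append((row["Name"], label))
    return kept


SNV = "single nucleotide variant"
ROWS = [
    ["Type", "Name", "ClinicalSignificance", "ReviewStatus"],
    [SNV, "NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser)", "Pathogenic",
     "criteria provided, single submitter"],
    # Hypothetical rows: nonsense, uncertain label, weak review status.
    [SNV, "NM_000000.0(GENE1):c.10C>T (p.Arg4Ter)", "Pathogenic",
     "reviewed by expert panel"],
    [SNV, "NM_000000.0(GENE2):c.20A>G (p.Lys7Arg)", "Uncertain significance",
     "criteria provided, single submitter"],
    [SNV, "NM_000000.0(GENE3):c.29G>A (p.Gly10Asp)", "Benign",
     "no assertion criteria provided"],
]
SAMPLE = "\n".join("\t".join(r) for r in ROWS)

if __name__ == "__main__":
    for name, label in filter_rows(SAMPLE):
        print(label, name)
```

Only the MEN1 row survives: the other three are dropped for being a nonsense variant, for having an uncertain label, and for lacking assertion criteria, respectively.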

📝 Dataset Details (`cohort_final.csv`)

| Column Name | Description | Values | Example |
|---|---|---|---|
| Name | ClinVar variant name | string | NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser) |
| Gene Symbol | Gene affected by the variant | string | MEN1 |
| UniProtID | UniProt protein accession | string | O00255 |
| Chromosome | Chromosome on which the variant lies | 1-22, X, Y | 11 |
| WildType | Wild-type amino acid | 3-letter amino acid code | Pro |
| ResidueIndex | Residue position at which the substitution occurs | float | 373.0 |
| MutantAA | Mutant amino acid | 3-letter amino acid code | Ser |
| Class | Whether the variant causes disease | Benign, Pathogenic | Pathogenic |
| ReviewStatus | ClinVar review status | string | criteria provided, single submitter |
| pLDDT | AlphaFold's per-residue confidence at the variant position | 0-100, float | 97.94 |
| StructureFile | AlphaFold .pdb.gz file | file path | AF-O00255-F1-model_v6.pdb.gz |
| MutantStructureFile | Repaired and mutated PDB | file path | O00255_P373S.pdb |
| ddG | FoldX-predicted change in folding free energy (ΔΔG) | float | 3.21647 |
| d_Hydrophobicity | Change in hydrophobicity | float | 0.8 |
| d_Charge | Change in charge at pH 7 | float | 0.0 |
| d_MW | Change in molecular weight in Daltons | float | -10.0 |
| Blosum62 | Substitution score from the BLOSUM62 matrix | float | -1.0 |
| RSA | Relative Solvent Accessibility | float | 0.0002387318563789 |
| SecStruct | Secondary structure | Helix, Sheet, Coil | Helix |
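
As an illustration of how the `Name`, `WildType`, `ResidueIndex`, `MutantAA`, and `MutantStructureFile` columns relate, the sketch below derives the last four from the first. The function name `parse_variant` is hypothetical, and the `UniProtID_MutationKey.pdb` pattern with a one-letter mutation key is inferred from the example row (`O00255_P373S.pdb`), so treat it as an assumption about the pipeline rather than its actual code.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches the protein-level change at the end of a ClinVar name,
# e.g. "(p.Pro373Ser)".
PROTEIN_CHANGE = re.compile(
    r"\(p\.(?P<wt>[A-Z][a-z]{2})(?P<pos>\d+)(?P<mut>[A-Z][a-z]{2})\)"
)


def parse_variant(name, uniprot_id):
    """Extract the substitution from a ClinVar name and build the mutant file name."""
    m = PROTEIN_CHANGE.search(name)
    if m is None:
        raise ValueError(f"no protein-level substitution in {name!r}")
    wt, mut = m["wt"], m["mut"]
    if wt not in THREE_TO_ONE or mut not in THREE_TO_ONE:
        # Rejects stop codons ("Ter") and other non-standard codes.
        raise ValueError(f"not a standard missense change: {name!r}")
    pos = int(m["pos"])
    key = f"{THREE_TO_ONE[wt]}{pos}{THREE_TO_ONE[mut]}"
    return {
        "WildType": wt,
        "ResidueIndex": pos,
        "MutantAA": mut,
        "MutantStructureFile": f"{uniprot_id}_{key}.pdb",
    }


if __name__ == "__main__":
    row = parse_variant("NM_001370259.2(MEN1):c.1117C>T (p.Pro373Ser)", "O00255")
    print(row)
```

For the example row this yields `Pro`, `373`, `Ser`, and `O00255_P373S.pdb`. Note that `cohort_final.csv` stores `ResidueIndex` as a float, whereas the sketch returns an integer.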

🧬 Graph Details

Variant-Level Features (Graph Features)

ΔΔG
BLOSUM62 substitution score
Δ Hydrophobicity, Δ Charge, Δ Molecular Weight
RSA (Relative Solvent Accessibility)
Secondary Structure (Helix / Sheet / Coil)
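
The physicochemical deltas are mutant-minus-wild-type differences of per-residue properties. The sketch below shows the idea for a handful of amino acids. The property scales are assumptions: Kyte-Doolittle hydropathy, integer side-chain charge at pH 7, and free amino acid molecular weight. The hydropathy scale reproduces the `d_Hydrophobicity` of 0.8 in the Pro373Ser example row, but the scales used by `extract_features.py` are not stated here.

```python
# (hydropathy, charge at pH 7, molecular weight in Da) for a subset of
# residues. Assumed scales: Kyte-Doolittle hydropathy, integer side-chain
# charge, free amino acid molecular weight.
PROPS = {
    "Pro": (-1.6, 0, 115.13),
    "Ser": (-0.8, 0, 105.09),
    "Gly": (-0.4, 0, 75.07),
    "Leu": (3.8, 0, 131.17),
    "Asp": (-3.5, -1, 133.10),
    "Lys": (-3.9, 1, 146.19),
}


def delta_features(wild_type, mutant):
    """Return mutant-minus-wild-type property differences."""
    h0, c0, w0 = PROPS[wild_type]
    h1, c1, w1 = PROPS[mutant]
    return {
        "d_Hydrophobicity": round(h1 - h0, 2),
        "d_Charge": float(c1 - c0),
        "d_MW": round(w1 - w0, 2),
    }


if __name__ == "__main__":
    print(delta_features("Pro", "Ser"))
    print(delta_features("Asp", "Lys"))
```

With these weights the Pro to Ser change gives a `d_MW` of about -10.04 Da, whereas the example row reports -10.0, so the repository may use rounded weights.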

Structural Features (Node-Level)

Amino acid identity (one-hot)
pLDDT
Hydrophobicity, Charge, Molecular Weight
Center residue indicator

Graph Construction

Nodes: residues within 10 Å of the mutated residue
Edges: CA–CA distance < 8 Å
- Edge feature: Euclidean distance
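
The construction rules above can be sketched in a few lines. This is an illustrative version, not the code in `generate_graphs.py`: the function name `build_local_graph` is hypothetical, and it assumes the 10 Å node radius is measured between CA atoms, which is not stated here. Coordinates would normally be read from the wild-type PDB; the sketch takes them as a plain dictionary.

```python
import math


def build_local_graph(ca_coords, center, node_radius=10.0, edge_cutoff=8.0):
    """Build a local residue graph around the mutated residue.

    ca_coords maps residue index -> (x, y, z) CA coordinate in angstroms.
    Returns (nodes, edges), where each edge is (residue_a, residue_b, distance).
    """
    origin = ca_coords[center]
    # Nodes: residues whose CA atom lies within node_radius of the center.
    nodes = sorted(
        res for res, xyz in ca_coords.items()
        if math.dist(xyz, origin) <= node_radius
    )
    # Edges: node pairs with CA-CA distance below edge_cutoff; the Euclidean
    # distance is kept as the edge feature.
    edges = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = math.dist(ca_coords[a], ca_coords[b])
            if d < edge_cutoff:
                edges.append((a, b, d))
    return nodes, edges


if __name__ == "__main__":
    # Toy chain along the x-axis with 3.8 A spacing plus one distant residue.
    coords = {
        1: (0.0, 0.0, 0.0),
        2: (3.8, 0.0, 0.0),
        3: (7.6, 0.0, 0.0),
        4: (11.4, 0.0, 0.0),
        5: (30.0, 0.0, 0.0),
    }
    nodes, edges = build_local_graph(coords, center=2)
    print(nodes)
    for a, b, d in edges:
        print(a, b, round(d, 1))
```

The edge list is undirected here; a message-passing library that expects directed edges would need each pair added in both directions.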

🤖 Models

Classical ML

Random Forest: primary baseline
XGBoost

Graph Neural Networks

Graph Convolutional Network (GCN): deep convolution + pooling
Graph Attention Network (GAT): multi-head attention (4 heads)
GCNSimple: lightweight GCN baseline

⚙️ Training Details

Loss function: Binary Cross-Entropy with Logits (BCEWithLogitsLoss)
Optimization: Adam optimizer with weight decay for regularization
Learning rate scheduling: ReduceLROnPlateau, adjusting learning rate based on validation AUC
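
For reference, binary cross-entropy with logits applies the sigmoid inside the loss, so the model outputs raw scores. The sketch below evaluates the per-example loss in plain Python using a standard numerically stable rearrangement. It is meant to show the quantity being minimized, not to mirror the training code, and the helper names are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Per-example binary cross-entropy computed from a raw logit.

    Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))], rearranged
    as max(z, 0) - z*y + log(1 + exp(-|z|)) so large |z| does not overflow.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-example loss over a batch."""
    losses = [bce_with_logits(z, y) for z, y in zip(logits, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Targets: 1.0 = Pathogenic, 0.0 = Benign (label encoding is an assumption).
    print(mean_bce_with_logits([2.0, -1.0, 0.0], [1.0, 0.0, 1.0]))
```

A logit of 0 corresponds to a predicted probability of 0.5, for which the loss is ln 2 regardless of the label.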

Evaluation Metrics

AUC-ROC (primary)
Accuracy
Precision
Recall
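
The four metrics can be computed from labels and predicted scores as sketched below. This is a dependency-free illustration rather than the code in `evaluate_model.py`: it assumes scores are probabilities with Pathogenic encoded as 1, uses a 0.5 decision threshold for the thresholded metrics, and computes AUC-ROC through its rank interpretation (the probability that a random positive outscores a random negative, with ties counted as one half).

```python
def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, precision, and recall at a fixed decision threshold."""
    y_pred = [int(s >= threshold) for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def auc_roc(y_true, y_score):
    """Threshold-free AUC-ROC via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]
    print(classification_metrics(labels, scores))
    print(auc_roc(labels, scores))
```

The pairwise AUC is quadratic in the number of examples, which is fine for illustration; a sort-based implementation is preferable for a full test set.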

📁 Repository Structure

src/
├── local/
|   ├── verify_integrity.py         # Checks proper file and column loading for ClinVar and AlphaFold (1 sample) files
|   ├── build_cohort.py             # Filters cohort to only human entries that meet criteria, maps to UniProt IDs, outputs CSV
|   ├── filter_structures.py        # Cross-references structures from the mapped cohort with the AlphaFold DB
|   └── run_foldx.py                # Runs the RepairPDB and BuildModel FoldX processes on the filtered cohort
|
├── parallel/
|   ├── split_workload.py           # Splits a CSV into n parts (used for worker_foldx.py)
|   ├── worker_foldx.py             # run_foldx.py, but uses a worker_id argument for an assigned portion of the filtered cohort
|   └── merge_csvs.py               # Merges CSVs in a directory
|
├── features/
|   ├── audit.py                    # Lists all proteins in filtered cohort, compares to /data/structures, finds missing repaired PDBs
|   ├── run_missing_repairs.py      # Runs RepairPDB on missing proteins found in audit.py
|   ├── extract_features.py         # Generates variant-level physicochemical and structural features for cohort_with_ddg.csv
|   └── generate_graphs.py          # Generates per-variant local structural graphs (JSON) from wild-type PDBs and cohort_features.csv
|
├── ml/
|   ├── train_baseline_rf.py        # Trains and evaluates a Random Forest classifier to predict variant pathogenicity
|   ├── train_baseline_xgb.py       # Trains and evaluates an XGBoost classifier to predict variant pathogenicity
|   ├── gnn_model.py                # Sets up GCN, GAT, and GCNSimple
|   ├── graph_dataset.py            # Loads JSON files from generate_graphs.py and converts them into a .pt file
|   ├── train_gnn.py                # GNN training workflow that implements model choosing and epoch patience
|   ├── balance_check.py            # Checks Benign/Pathogenic balance in test_set.csv (can be used for any CSV)
|   └── evaluate_model.py           # Evaluates models on test set and outputs Accuracy/Recall/Precision/AUC-ROC and figures

🗂️ Data Organization

data/                               
├── raw/                            # Not committed due to size
|   ├── variant_summary.txt         # Ground truth labels
|   ├── alphaFold_human/            # 3D structures in .pdb.gz or .cif.gz format
|   ├── human_reviewed.fasta        # Sequences
|   └── human_id_mapping.tsv        # Metadata table with RefSeq and AlphaFold columns    
├── processed/ 
|   ├── cohort_final.csv            # Cohort with variant-level physicochemical and structural features
|   ├── structvar_graphs.pt         # JSON graphs reformatted as a .pt object for GNN training; external
|   └── structures/                 # External
|       ├── mutants/                # Mutant PDBs (format: UniProtID_MutationKey.pdb)
|       └── wildtype/               # Repaired wild-type PDBs (format: WT_UniProtID.pdb)
├── figures/     
|   ├── Random Forest/              # Contains confusion_matrix.png, roc_curve.png, feature_importance.png
|   ├── XGBoost/                    # Contains confusion_matrix.png, roc_curve.png, feature_importance.png
|   ├── GCN/                        # Contains confusion_matrix.png, roc_curve.png
|   ├── GAT/                        # Contains confusion_matrix.png, roc_curve.png
|   ├── roc_curve_stacked.png       # All models' ROC curves
|   └── graph_viz_3d/               # HTML files with 3D viewer of JSON graphs
├── splits/                         
|   ├── test_set.csv                # Test split
|   ├── train_set.csv               # Train split
|   └── val_set.csv                 # Validation split
└── graphs/                         # JSON graphs generated by generate_graphs.py; external

models/
├── baseline_rf.pkl                 # Random Forest model trained on 80/10/10 Train/Test/Val split
├── baseline_xgb.pkl                # XGBoost model trained on 80/10/10 Train/Test/Val split
├── best_gcn.pth                    # Best GCN model from most recent train
├── best_gat.pth                    # Best GAT model from most recent train
└── training_history.csv            # Epoch history for latest fully-trained GNN model

📦 Data Availability

Due to size constraints, large artifacts are hosted externally.

GitHub

All code
Figures
Dataset splits
Models
cohort_final.csv
Logs (terminal_log.md)

External Downloads

structures.zip: repaired wild-type and mutant PDBs
structvar_graphs.pt: PyG dataset

📚 Resources Used and Downloads

ClinVar is a public archive of human genetic variants and their reported clinical significance.
- For this project, we use the ClinVar txt summary, a clean, tab-delimited text file that already has headers outlined: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/ -> variant_summary.txt.gz
- For future work, the full ClinVar XML dump may be used for more thorough labeling: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/
UniProtKB is a public database of protein sequence and functional information.
- For reviewed human entries, do the following:
  - Go to UniProtKB: https://www.uniprot.org/uniprotkb
  - Search for reviewed human entries: (taxonomy_id:9606) AND (reviewed:true) -> https://www.uniprot.org/uniprotkb?query=*%28taxonomy_id%3A9606%29+AND+%28reviewed%3Atrue%29
  - Download FASTA (canonical) for sequences
  - Download TSV
    - Click customize columns and select PDB, AlphaFoldDB, and RefSeq for cross-referencing
- For all UniProtKB reviewed resources: https://www.uniprot.org/uniprotkb
The AlphaFold Protein Structure Database (AlphaFold DB) provides access to hundreds of millions of protein structure predictions: https://alphafold.ebi.ac.uk/download
- Here, we only care about the human proteins, so: https://alphafold.ebi.ac.uk/download -> UP000005640_9606_HUMAN_v4.tar
