PATHOS - Predicting variant pathogenicity by combining Protein Language Models and biological features
PATHOS predicts the pathogenicity of protein variants using protein language models (ESM-C 600M, Ankh2 Large). Pre-computed scores are available for 140M+ variants across 17,734 human proteins.
Set up PATHOS with a single script that installs dependencies and downloads the database.
Prerequisites: ~35 GB disk space
git clone https://github.com/DSIMB/PATHOS.git
cd PATHOS
./setup_pathos.sh
conda activate PATHOS_env
Query pathogenicity scores for protein variants using UniProt IDs and mutation notation.
# Query a single mutation
python run_pathos.py --protein P51787 --mutation M1A
# Query the variants listed in an input file
python run_pathos.py --file example_input.txt --output results.csv
# Keep only highly pathogenic variants
python run_pathos.py --protein P51787 --min-score 0.9 --output pathogenic.csv
# Keep only highly benign variants
python run_pathos.py --protein P51787 --max-score 0.1 --output benign.csv
# Keep variants in a specific range
python run_pathos.py --protein P51787 --min-score 0.55 --max-score 0.65 --output uncertain.csv
# For proteins in database (instant)
python run_pathos.py --protein P51787 --output P51787_all.csv
# For proteins NOT in database (requires --scan, can take hours)
python run_pathos.py --protein Q96PN7 --scan --output Q96PN7_all.csv
Supports TXT, TSV, and CSV formats. Headers are auto-detected and skipped.
TXT/TSV (space or tab-separated):
P51787 M1A R56V # Multiple mutations per line
Q9Y6X3 M1C # Single mutation
P10635 # Full scan (all 19 substitutions per position)
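The TXT/TSV layout above can also be generated from a script. The following is a minimal sketch, not part of PATHOS: `format_txt_lines` is a hypothetical helper that assumes one protein per line, mutations separated by single spaces, and a bare protein ID to request a full scan, as in the example.

```python
def format_txt_lines(variants):
    """Render {protein: [mutations]} as whitespace-separated input lines.

    Hypothetical helper (not part of PATHOS). An empty mutation list
    yields a bare protein ID, which the example above uses for a full scan.
    """
    lines = []
    for protein, mutations in variants.items():
        lines.append(" ".join([protein, *mutations]))
    return lines


variants = {
    "P51787": ["M1A", "R56V"],  # multiple mutations on one line
    "Q9Y6X3": ["M1C"],          # single mutation
    "P10635": [],               # no mutation listed: full scan
}
txt_lines = format_txt_lines(variants)
print("\n".join(txt_lines))
```

Writing these lines to a file gives an input that can be passed with --file.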
CSV (comma-separated):
Protein,Mutation
P51787,M1A
P51787,L50R
Q9Y6X3,M1C
If all queried variants are already in the precomputed database (139M+ variants), results are returned instantly.
For variants not in the database, PATHOS performs de novo prediction:
- Load UniProt sequences and validate mutations
- Check/generate MSAs using mmseqs2 (if not already generated)
- Compute PASTML conservation scores
- Extract UniProt annotations and allele frequencies
- Generate embeddings with ESM-C 600M and Ankh2 Large
- Run PATHOS inference (ensemble of both models)
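To illustrate what validating a mutation against a sequence can involve (the first step above), here is a minimal sketch. It is not the PATHOS implementation: `validate_mutation` is a hypothetical function, positions are assumed to be 1-based, only the 20 standard amino acids are accepted, and the sequence is a toy string rather than a real UniProt entry.

```python
import re

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Notation such as M1A: reference residue, 1-based position, substituted residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def validate_mutation(sequence, mutation):
    """Return (ref, position, alt) or raise ValueError (hypothetical helper)."""
    match = MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation format: {mutation!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        raise ValueError(f"non-standard residue in {mutation!r}")
    if ref == alt:
        raise ValueError(f"{mutation!r} does not change the residue")
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {pos}, found {sequence[pos - 1]}"
        )
    return ref, pos, alt


toy_sequence = "MKRLA"  # toy sequence for illustration only
parsed = validate_mutation(toy_sequence, "M1A")
print(parsed)
```

Checking the reference residue against the sequence catches mutations reported against a different isoform or numbering scheme before any expensive feature generation starts.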
Results are displayed in the terminal and exported to CSV with the following columns:
- UniProt ID
- Mutation (e.g., M1A)
- PATHOS score (0-1)
- Classification (Benign/Pathogenic)
PATHOS outputs a score between 0 and 1 indicating the probability of pathogenicity.
| Score | Classification |
|---|---|
| < 0.63 | Benign |
| >= 0.63 | Pathogenic |
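The threshold in the table can be applied to any score read back from a results file. A minimal sketch, where `classify` is a hypothetical helper and not a function exposed by PATHOS:

```python
PATHOGENIC_THRESHOLD = 0.63  # cutoff from the table above


def classify(score):
    """Map a PATHOS score in [0, 1] to its label (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return "Pathogenic" if score >= PATHOGENIC_THRESHOLD else "Benign"


labels = [classify(s) for s in (0.10, 0.62, 0.63, 0.95)]
print(labels)
```

Note that a score exactly equal to 0.63 falls on the pathogenic side of the cutoff.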
Full list of available options for run_pathos.py.
| Option | Description |
|---|---|
| -p, --protein | UniProt protein ID (e.g., P51787) |
| -m, --mutation | Mutation in format like M1A (requires --protein) |
| -f, --file | Input file with protein IDs and mutations (TXT, TSV, or CSV) |
| -o, --output | Output CSV file (default: stdout for single mutation) |
| --min-score | Minimum PATHOS score threshold for filtering results (0.0-1.0) |
| --max-score | Maximum PATHOS score threshold for filtering results (0.0-1.0) |
| --scan | Enable de novo full protein scan (required for proteins not in database) |
| --n-jobs | Number of parallel workers for feature generation (default: 5) |
| --batch-size | Batch size for embedding generation (default: 100) |
| --mmseqs-mem-limit | Memory limit for mmseqs2 MSA generation (default: 8G) |
| --batch-threshold | Number of variants above which batched mode is enabled (default: 10000) |
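When run_pathos.py is driven from another script, the options above can be assembled into an argument list. The following is a sketch under stated assumptions: `build_pathos_command` is a hypothetical wrapper, it uses only a subset of the flags listed in the table, and the rule that either a protein or an input file must be given is an assumption rather than documented behaviour.

```python
import shlex


def build_pathos_command(protein=None, mutation=None, input_file=None,
                         output=None, min_score=None, max_score=None,
                         scan=False):
    """Assemble an argv list for run_pathos.py (hypothetical wrapper)."""
    if mutation is not None and protein is None:
        raise ValueError("--mutation requires --protein")
    if protein is None and input_file is None:
        # Assumption: at least one source of variants is needed.
        raise ValueError("provide a protein ID or an input file")
    for name, value in (("min_score", min_score), ("max_score", max_score)):
        if value is not None and not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")

    cmd = ["python", "run_pathos.py"]
    if protein is not None:
        cmd += ["--protein", protein]
    if mutation is not None:
        cmd += ["--mutation", mutation]
    if input_file is not None:
        cmd += ["--file", input_file]
    if min_score is not None:
        cmd += ["--min-score", str(min_score)]
    if max_score is not None:
        cmd += ["--max-score", str(max_score)]
    if scan:
        cmd.append("--scan")
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_pathos_command(protein="P51787", min_score=0.9,
                           output="pathogenic.csv")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository directory with the PATHOS environment activated.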
PATHOS can predict scores for all possible mutations of a protein (19 substitutions × sequence length).
For proteins already in the pre-computed database, simply omit the --mutation argument:
python run_pathos.py --protein P51787 --output P51787_all.csv
This instantly retrieves all pre-computed scores for that protein.
The list of 17,574 pre-computed human proteins (with fewer than 1,024 residues) is available in proteins_in_db.txt.
For proteins not in the database, a full de novo scan is required. This can take several hours as it needs to:
- Generate MSA alignments
- Compute conservation scores
- Generate embeddings for thousands of mutations
To enable de novo scanning, add the --scan flag:
python run_pathos.py --protein Q96PN7 --scan --output Q96PN7_all.csv
Warning: De novo scans are computationally expensive. For a typical 500-residue protein, this means predicting ~9,500 mutations.
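The size of a full scan follows directly from 19 substitutions at each position. The sketch below enumerates them for a toy sequence. `all_substitutions` is a hypothetical helper, not part of PATHOS, and it assumes the sequence contains only the 20 standard amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_substitutions(sequence):
    """List every single-residue substitution in M1A-style notation.

    Hypothetical helper; positions are 1-based and each position gets the
    19 standard residues that differ from the reference.
    """
    return [
        f"{ref}{pos}{alt}"
        for pos, ref in enumerate(sequence, start=1)
        for alt in AMINO_ACIDS
        if alt != ref
    ]


toy_sequence = "M" * 500  # placeholder sequence of typical length
mutations = all_substitutions(toy_sequence)
print(len(mutations))  # 19 * 500 = 9500
```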
Soon available
If you use PATHOS in your research, please cite:
Radjasandirane, R., Cretin, G., Diharce, J., de Brevern, A. G., & Gelly, J. C. (2026). PATHOS: Predicting Variant Pathogenicity by Combining Protein Language Models and Biological Features. Artificial Intelligence in the Life Sciences, 100165.
For bug reports, feature requests, or questions, please contact: