GitHub - cx0/llm-for-clinical-variants: Data repository for NeurIPS 2022 LMRL workshop paper.

Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes

This repo contains the scripts and metadata used in our work presented at the NeurIPS 2022 Learning Meaningful Representations of Life (LMRL) workshop.

Workshop website | Paper

Abstract: Despite being self-supervised, protein language models have shown remarkable performance in fundamental biological tasks such as predicting the impact of genetic variation on protein structure and function. The effectiveness of these models on a diverse set of tasks suggests that they learn a meaningful representation of the fitness landscape that can be useful for downstream clinical applications. Here, we interrogate the use of these language models in characterizing known pathogenic mutations in curated, medically actionable genes through an exhaustive search of putative compensatory mutations on each variant's genetic background. Systematic analysis of the predicted effects of these compensatory mutations reveals unappreciated structural features of proteins that are missed by other structure predictors like AlphaFold. While deep mutational scan experiments provide an unbiased estimate of the mutational landscape, we encourage the community to generate and curate rescue mutation experiments to inform the design of more sophisticated co-masking strategies and leverage large language models more effectively for downstream clinical prediction tasks.
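
As a rough illustration of the exhaustive search described in the abstract, the sketch below enumerates every single-residue substitution on a pathogenic variant's background. It is not taken from the repo's scripts: the function names, the `K2E`-style mutation notation, and the choice to skip the pathogenic site itself (which excludes simple reversions) are all assumptions made for this example.

```python
# Hypothetical sketch: enumerate candidate compensatory (rescue) mutations
# on the background of one pathogenic variant. Names and conventions here
# are illustrative assumptions, not the repository's actual code.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def apply_mutation(seq, mutation):
    """Apply a substitution written as <wt><1-based position><mt>, e.g. 'K2E'."""
    wt, pos, mt = mutation[0], int(mutation[1:-1]), mutation[-1]
    if seq[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {seq[pos - 1]}")
    return seq[: pos - 1] + mt + seq[pos:]


def candidate_rescue_mutations(wt_seq, pathogenic):
    """Yield all single substitutions on the pathogenic background.

    Assumption: the pathogenic site itself is skipped, so reversions to
    wild type are not counted as rescue candidates.
    """
    background = apply_mutation(wt_seq, pathogenic)
    site = int(pathogenic[1:-1])
    for pos, residue in enumerate(background, start=1):
        if pos == site:
            continue
        for aa in AMINO_ACIDS:
            if aa != residue:
                yield f"{residue}{pos}{aa}"


# Toy usage on a three-residue sequence (not a real protein).
candidates = list(candidate_rescue_mutations("MKT", "K2E"))
print(len(candidates), candidates[:3])
```

Each candidate would then be scored on the mutant background with a protein language model; for a protein of length L this yields 19 × (L − 1) candidates per pathogenic variant under the assumptions above.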

Pretrained models

| Model | Number of layers | Number of parameters | Training dataset | Implementation in our work |
| --- | --- | --- | --- | --- |
| ESM-2 | 33 | 650M | UR50/D | Single model with `wt-marginals` scoring strategy |
| ESM-1v | 33 | 650M | UR90/S | Ensemble of 5 models with the same scoring strategy as ESM-2 |
| ESMFold | 48 | 690M | PDB + UR50 | Structure prediction for BAG3 |
| AlphaFold2 | | | | AlphaFold2 structural model prediction for BAG3 |
| Cross-protein transfer | | | | Zero-shot prediction scores for all 53 ACMG genes except MAX and HNF1A |
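
For readers unfamiliar with the `wt-marginals` scoring strategy named in the table, here is a minimal sketch of the idea: the wild-type sequence is passed through the model once, unmasked, and a substitution is scored as the log-probability of the mutant residue minus that of the wild-type residue at the mutated position. The code works on a plain list of per-position logits rather than a real ESM model; the 20-letter alphabet ordering, the function names, and the simple mean used for the ensemble are assumptions for illustration.

```python
import math

# Assumed alphabet ordering for the toy logits; real ESM models use their
# own token vocabulary, which also includes special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def wt_marginal_score(logits, wt_seq, mutation):
    """Score a substitution such as 'A2G' from wild-type logits.

    logits[i] holds the 20 logits predicted at position i + 1 when the
    unmasked wild-type sequence is given to the model.
    """
    wt, pos, mt = mutation[0], int(mutation[1:-1]), mutation[-1]
    if wt_seq[pos - 1] != wt:
        raise ValueError(f"expected {wt} at position {pos}")
    log_probs = log_softmax(logits[pos - 1])
    return log_probs[AMINO_ACIDS.index(mt)] - log_probs[AMINO_ACIDS.index(wt)]


def ensemble_score(per_model_logits, wt_seq, mutation):
    """Average the score over several models (assumed: unweighted mean)."""
    scores = [wt_marginal_score(l, wt_seq, mutation) for l in per_model_logits]
    return sum(scores) / len(scores)


# Toy usage with made-up logits for a two-residue sequence.
toy_logits = [[0.0] * 20 for _ in range(2)]
toy_logits[1][AMINO_ACIDS.index("A")] = 2.0  # model favours wild-type A
toy_logits[1][AMINO_ACIDS.index("G")] = 0.5
print(wt_marginal_score(toy_logits, "MA", "A2G"))
```

Negative scores indicate that the model considers the mutant residue less likely than the wild-type residue at that position. With a real model, the logits would come from a single forward pass over the wild-type sequence.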

Data on gene list and sequence variation

| Description | Data source |
| --- | --- |
| List of clinically actionable genes | ACMG v3.1 |
| Allele frequency | gnomAD v2 GRCh38 liftover |
| ClinVar annotations | Accessed on 09/17/2022 |
| Multiple sequence alignments | UCSC multiz-100 way CDS alignment (Placental mammals) |

Citation

If you find this work useful, please cite it as follows:

@misc{
  url = {https://arxiv.org/abs/2211.10000},
  author = {Soylemez, Onuralp and Cordero, Pablo},
  keywords = {Machine Learning (cs.LG), Genomics (q-bio.GN), FOS: Computer and information sciences, FOS: Biological sciences},
  title = {Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Feedback

If you have any questions or comments, or would like to collaborate, please feel free to reach out.

