bartongroup/SM_Pfam-gnomAD-statistics

A unified approach to evolutionary conservation and population constraint in proteins


Repository containing notebooks to compute statistics in the paper "A unified approach to evolutionary conservation and population constraint in proteins".

Author: Stuart MacGowan (smacgowan@dundee.ac.uk)

Dataset

The analysis is based on aggregated statistics that we computed from data in the following databases:

  • Pfam-A database of protein families (version 31.0).
  • gnomAD database of human genetic variation (version 2.1.1).
  • ClinVar database of human genetic variants and their clinical significance.
  • PDBe database of protein structures.

These were processed into a single dataset of aggregated statistics for each Pfam domain, which is provided in data/pfam-gnomAD-clinvar-pdb-colstats_c7c3e19.csv.gz.
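
As a quick way to inspect the dataset outside the notebooks, the sketch below reads the compressed CSV using only the Python standard library. The path is the one given above; the helper names `load_colstats` and `summarise` are hypothetical and are not part of this repository. The column names are read from the file's header rather than assumed.

```python
import csv
import gzip

# Path as stated in this README; adjust it if you run from another directory.
COLSTATS_PATH = "data/pfam-gnomAD-clinvar-pdb-colstats_c7c3e19.csv.gz"


def load_colstats(path=COLSTATS_PATH):
    """Read the gzip-compressed CSV into a list of dicts, one per row.

    Hypothetical helper: values are kept as strings because the column
    types are not documented here.
    """
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def summarise(rows):
    """Return (row count, column names) without assuming any schema."""
    columns = list(rows[0].keys()) if rows else []
    return len(rows), columns


if __name__ == "__main__":
    n_rows, columns = summarise(load_colstats())
    print(f"{n_rows} rows, {len(columns)} columns")
    print(columns)
```

The notebooks may load the file differently (for example with pandas); this is only meant as a dependency-free first look at the header and row count.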

Manuscript figures

The figures in the manuscript are generated by the notebooks in the figure folders under manuscript-figures.

  • Figure 1B: Frequency distribution of gnomAD missense variants across all amino acid residues in Pfam domains.
  • Figure 1C: Frequency distributions of gnomAD missense variants over alignment columns of Pfam domains.
  • Figure 1D: Total number of gnomAD missense or synonymous variants vs. the Shenkin diversity at each position across SH2 domains.
  • Figure 2A: Cumulative distributions of the normalised missense enrichment score or normalised Shenkin for positions where the consensus relative solvent accessibility class is core, partially exposed, or surface.
  • Figure 3A: The conservation plane: classifying residues in Pfam domains with evolutionary conservation and population constraint.
  • Figure 4A: Odds ratios of the enrichment of protein-ligand interacting residues from BioLiP within sites in different conservation plane categories.
  • Figure 4B: Enrichment of protein-protein interaction (PPI) sites.
  • Figure 4C: ClinVar Pathogenic site enrichments relative to the gnomAD missense background.
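
Figures 4A to 4C report site enrichments as odds ratios. For readers unfamiliar with the quantity, here is a minimal sketch of an odds ratio with a Wald (log-scale) confidence interval computed from a 2x2 table of site counts. It illustrates the general statistic and is not necessarily the procedure used in the notebooks, which may use a different test or interval. The function name `odds_ratio` and the example counts are made up for illustration.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: sites in the category that have the feature (e.g. ligand-binding)
    b: sites in the category that lack the feature
    c: sites outside the category that have the feature
    d: sites outside the category that lack the feature
    z: normal quantile for the interval (1.96 gives roughly 95%)
    """
    if min(a, b, c, d) <= 0:
        # The log-scale standard error is undefined when a cell is zero.
        raise ValueError("all four cells must be positive for this estimate")
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


if __name__ == "__main__":
    # Made-up counts, purely illustrative.
    estimate, lower, upper = odds_ratio(20, 80, 10, 90)
    print(f"OR = {estimate:.2f} (CI {lower:.2f} to {upper:.2f})")
```

An odds ratio above 1 indicates that the feature is over-represented in the category relative to the remaining sites, and a ratio below 1 indicates depletion.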

Citation

Stuart A. MacGowan, Fábio Madeira, Thiago Britto-Borges et al. A unified approach to evolutionary conservation and population constraint in protein domains highlights structural features and pathogenic sites, 13 July 2023, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-3160340/v1]

License

This repository and its contents were created by Stuart A. MacGowan (@stuartmac) at the University of Dundee and are provided under the MIT license. See LICENSE for details.
