aabaker99/cdom

cdom

Identify commonly mutated protein domains in a given Mutation Annotation Format file.

scripts/mutated_domains.R

Analyzes mutation data in MAF format. The script is applied to mutations in patients with neurofibromatosis (NF) and to mutations in the general population from the 1000 Genomes Project. In both cases, it reports the most frequently mutated protein domains.
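
The core idea, counting how many mutations land inside each annotated domain, can be sketched in a few lines. This is a Python illustration only, not the logic of the R script itself; the function name `count_mutated_domains`, the tuple layouts, and the toy identifiers are all assumptions made for this example.

```python
from collections import Counter


def count_mutated_domains(mutations, domains):
    """Count how many mutations fall inside each annotated domain.

    mutations: iterable of (protein_accession, amino_acid_position) pairs.
    domains:   iterable of (protein_accession, domain_id, start, end) tuples,
               with 1-based inclusive coordinates (an assumption of this sketch).
    """
    # Index domains by protein so each mutation only scans its own protein.
    by_protein = {}
    for accession, domain_id, start, end in domains:
        by_protein.setdefault(accession, []).append((domain_id, start, end))

    counts = Counter()
    for accession, position in mutations:
        for domain_id, start, end in by_protein.get(accession, []):
            if start <= position <= end:
                counts[domain_id] += 1
    return counts


# Toy data, invented for illustration only.
domains = [
    ("P1", "IPR_A", 10, 50),
    ("P1", "IPR_B", 40, 90),
    ("P2", "IPR_A", 5, 20),
]
mutations = [("P1", 45), ("P1", 12), ("P2", 7), ("P2", 300), ("P3", 1)]

domain_counts = count_mutated_domains(mutations, domains)
for domain_id, n in domain_counts.most_common():
    print(domain_id, n)
```

A mutation that overlaps two domains is counted once for each, and mutations outside any domain or in an unannotated protein are ignored.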

scripts/diffusion.py

Diffuses genetic mutations over a protein interaction network, as done in Network-Based Stratification. Genetic variants are filtered to keep only those with a rare allele frequency and a damaging PolyPhen prediction.
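
As a rough illustration of network diffusion, the sketch below iterates the update F <- alpha * F * A' + (1 - alpha) * F0, where A' is a degree-normalized adjacency matrix and F0 holds the initial mutation scores. It is not taken from diffusion.py: the function name `diffuse`, the dictionary-based graph representation, the normalization choice, and the default `alpha=0.7` are assumptions made for this example.

```python
def diffuse(adjacency, initial, alpha=0.7, tol=1e-9, max_iter=1000):
    """Propagate mutation scores over an undirected network.

    adjacency: dict mapping each node to the set of its neighbours; every
               neighbour must itself be a key of the dict.
    initial:   dict mapping nodes to starting scores (e.g. 1.0 if mutated).
    alpha:     fraction of score passed along edges at each step
               (illustrative default, not taken from the script).
    """
    nodes = list(adjacency)
    f0 = {n: float(initial.get(n, 0.0)) for n in nodes}
    current = dict(f0)
    for _ in range(max_iter):
        updated = {}
        for node in nodes:
            # Each neighbour passes on its score divided by its own degree.
            incoming = sum(
                current[nb] / len(adjacency[nb]) for nb in adjacency[node]
            )
            updated[node] = alpha * incoming + (1 - alpha) * f0[node]
        change = max(abs(updated[n] - current[n]) for n in nodes)
        current = updated
        if change < tol:  # stop once the scores have stabilised
            break
    return current


# Toy path graph A - B - C with a single mutated node A.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
scores = diffuse(network, {"A": 1.0})
for node in sorted(scores):
    print(node, round(scores[node], 4))
```

In this toy graph the score decays with distance from the mutated node, and the total score is preserved because each node splits its score evenly among its neighbours.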

scripts/filter_protein_to_ipr.py

Makes the InterPro protein2ipr data more manageable by excluding all non-human proteins.
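
A minimal sketch of this kind of filtering is shown below. It assumes each protein2ipr row is tab-separated with the UniProt accession in the first field, and that a set of human accessions is already available; the function name `filter_protein2ipr` and the toy rows are hypothetical and not taken from the script.

```python
def filter_protein2ipr(lines, human_accessions):
    """Yield only rows whose first tab-separated field is a human accession."""
    for line in lines:
        accession = line.split("\t", 1)[0]
        if accession in human_accessions:
            yield line


# Toy rows; accessions and field values are invented for illustration.
rows = [
    "ACC001\tIPR_A\tExample domain\tSIG1\t10\t50\n",
    "ACC002\tIPR_B\tOther domain\tSIG2\t5\t20\n",
    "ACC003\tIPR_A\tExample domain\tSIG1\t7\t44\n",
]
human = {"ACC001", "ACC003"}

kept = list(filter_protein2ipr(rows, human))
print(len(kept), "of", len(rows), "rows kept")
```

Because the function is a generator, it can be fed a file handle (for example one opened with `gzip.open(path, "rt")`) and will stream through the data without loading the whole file into memory.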

scripts/parse_uniprot.py

Called by filter_protein_to_ipr.py
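
The README does not say what this helper does. One plausible role, sketched here purely as an assumption, is collecting accessions from the `AC` lines of a UniProt flat file so that they can be used as the set of human proteins. The function name `read_accessions` and the toy entries are hypothetical.

```python
def read_accessions(lines):
    """Collect every accession listed on 'AC' lines of a UniProt flat file."""
    accessions = set()
    for line in lines:
        # AC lines look like "AC   ACC1; ACC2;" (two-letter code, three spaces).
        if line.startswith("AC   "):
            for token in line[5:].split(";"):
                token = token.strip()
                if token:
                    accessions.add(token)
    return accessions


# Toy flat-file excerpt; entry names and accessions are invented.
flat_file = [
    "ID   EXAMPLE_HUMAN   Reviewed;   100 AA.\n",
    "AC   ACC001; ACC009;\n",
    "DE   RecName: Full=Example protein;\n",
    "//\n",
    "ID   OTHER_HUMAN   Reviewed;   200 AA.\n",
    "AC   ACC003;\n",
    "//\n",
]

human_accessions = read_accessions(flat_file)
print(sorted(human_accessions))
```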

scripts/nf_lollipop.py

Plot the frequency of amino acid mutations for a protein of interest. Currently only analyzes the NF1 protein.
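
The counting step behind a lollipop plot can be sketched as follows. This example assumes mutations are given as short protein-change strings such as `p.R100Q`; the regular expression, the function name `position_counts`, and the toy inputs are assumptions, and the plotting itself is replaced by a plain-text tally.

```python
import re
from collections import Counter

# Matches short protein changes such as "p.R100Q" and captures the position.
PROTEIN_CHANGE = re.compile(r"^p\.[A-Z*](\d+)")


def position_counts(protein_changes):
    """Count mutations per amino acid position, skipping unparseable entries."""
    counts = Counter()
    for change in protein_changes:
        match = PROTEIN_CHANGE.match(change)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Toy input, invented for illustration; the last two entries are ignored.
changes = ["p.R100Q", "p.R100W", "p.L250P", "p.G12*", "c.300A>G", ""]

counts = position_counts(changes)
for position in sorted(counts):
    print(position, "#" * counts[position])
```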

Data

  • uniprot_sprot_human at ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_sprot_human.dat.gz
  • protein2ipr at ftp://ftp.ebi.ac.uk/pub/databases/interpro/74.0/protein2ipr.dat.gz

Documentation

Dependencies

Scripts to install some of the dependencies are located in the install directory.

  • Python 3.x
    • pyreadr
    • pandas
    • argparse (part of the Python 3 standard library)
  • R

TODO

  • Update mutated_domains.R to support a generic MAF file or one on Synapse
  • Finalize application of mutated_domains.R to 1000 Genomes data
  • Write a program to compare frequently mutated domains between disease and control populations (which domains are mutated significantly more often in the disease state than in the control?); this should include a difference-in-proportions test such as prop.test in R
  • Check if subsetting InterPro's protein2ipr data to include only human proteins is a sufficient reduction in the size of data to be able to perform in-memory joins of InterPro and mutation data in R (and obviate the need for SQLite).
  • Generalize nf_lollipop.py to genes other than NF1
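
The difference-in-proportions test mentioned in the TODO list could look roughly like the sketch below. It is a pooled two-sample z-test without continuity correction, which differs from the default behavior of R's prop.test (prop.test applies a continuity correction unless `correct = FALSE` is passed). The function name `two_proportion_test` and the counts are hypothetical.

```python
import math


def two_proportion_test(hits_a, total_a, hits_b, total_b):
    """Two-sided pooled z-test for a difference between two proportions.

    Returns (z, p_value). No continuity correction is applied.
    """
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Both groups are all hits or all misses: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: a domain mutated in 30 of 100 disease samples
# and in 10 of 100 control samples.
z, p_value = two_proportion_test(30, 100, 10, 100)
print(f"z = {z:.3f}, p = {p_value:.5f}")
```

When many domains are tested at once, the resulting p-values would also need a multiple-testing correction before any domain is called significant.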
