GeVIR: Gene Variation Intolerance Rank

Gene Variation Intolerance Rank (GeVIR) is a gene-level metric that can aid the discovery of dominant and recessive Mendelian disease genes.

Project Description

With large-scale population sequencing projects gathering pace, there is a need for strategies that advance disease gene prioritisation. Metrics that describe a gene's ability to tolerate protein-altering variation can aid clinical interpretation of human genomes and advance disease gene discovery. Previous methods analysed the total variant load in a gene, but not the distribution pattern of variants within it. Utilising data from 138,632 exome/genome sequences, we developed Gene Variation Intolerance Rank (GeVIR), a continuous gene-level metric for 19,361 genes that is able to prioritise both dominant and recessive Mendelian disease genes. It outperforms missense constraint metrics and is comparable, but complementary, to loss-of-function constraint metrics. GeVIR is also able to prioritise short genes, for which loss-of-function constraint cannot be confidently estimated. The majority of the most intolerant genes identified have no defined phenotype and are candidates for severe dominant disorders.

Required Datasets

GeVIR analysis requires a local copy of the gnomAD v2.0.1 database, specifically the following collections: exome_coverage, exome_variants, genome_variants, exons, genes, transcripts. Instructions for installing the gnomAD database can be found here: https://github.com/macarthur-lab/gnomad_browser
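
Before starting the analysis, it can help to confirm that all six collections are present. The following is a minimal sketch, not part of the repository: the function name is hypothetical, and it assumes the list of available collection names is obtained separately (for example with pymongo's `Database.list_collection_names()` against your local database, whose name and connection settings depend on your setup).

```python
# Collections listed above as required from the local gnomAD database.
REQUIRED_COLLECTIONS = (
    "exome_coverage",
    "exome_variants",
    "genome_variants",
    "exons",
    "genes",
    "transcripts",
)


def missing_collections(available):
    """Return the required collection names absent from `available`.

    The result keeps the order in which the collections are listed above.
    (Hypothetical helper, not part of the GeVIR code base.)
    """
    present = set(available)
    return [name for name in REQUIRED_COLLECTIONS if name not in present]


if __name__ == "__main__":
    # Hypothetical listing; in practice this would come from the local database.
    example = ["genes", "exons", "transcripts", "exome_variants"]
    print(missing_collections(example))
```

An empty result means every required collection was found.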

Additionally, it requires the following datasets:

Note 1: All datasets must be placed in the ./source_data directory and named as defined at the top of the import_data.py script (the names can be changed there). The ClinVar dataset requires reannotation with the Ensembl VEP script; instructions for installing and using it can be found here: https://github.com/Ensembl/ensembl-vep.
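
A quick way to catch a misplaced or misnamed file before running the import is to check the directory up front. This is a rough sketch only: the helper name is hypothetical, and the file names in the example are placeholders, since the real names are the ones defined at the top of import_data.py.

```python
from pathlib import Path


def missing_source_files(expected_names, source_dir="./source_data"):
    """Return the expected file names that are not present in source_dir.

    (Hypothetical helper, not part of the GeVIR code base.)
    """
    directory = Path(source_dir)
    return [name for name in expected_names if not (directory / name).is_file()]


if __name__ == "__main__":
    # Placeholder names for illustration; use the names from import_data.py.
    expected = ["example_dataset_a.tsv", "example_dataset_b.csv"]
    for name in missing_source_files(expected):
        print(f"missing: {name}")
```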

Note 2: The ./tables folder contains the gnomAD gene constraint scores and the mouse heterozygous lethal, cell essential, cell non-essential and loss-of-function tolerant datasets that were used in the original GeVIR study.

Usage

Download all required datasets and place them in the ./source_data folder. To install the required Python modules, run:

pip install -r requirements.txt

To import datasets into the local database run:

python import_data.py

To calculate GeVIR scores run:

python gevir.py

To create the gene sets used to produce the figures, run:

python gene_sets.py

To produce all the figures run:

python figures.py

To export gene scores from the database run:

python export_data.py

Note: Individual operations can be enabled or disabled by commenting lines in or out in the main method of each script; check them before performing the analysis.
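
The scripts above are meant to be run one after another. As a convenience, the sequence could be wrapped in a small driver such as the sketch below. This wrapper is not part of the repository; its function names and the dry-run default are assumptions made for illustration, and it should be started from the repository root so the script paths resolve.

```python
import subprocess
import sys

# Script order as given in the Usage section.
PIPELINE = [
    "import_data.py",
    "gevir.py",
    "gene_sets.py",
    "figures.py",
    "export_data.py",
]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one argument list per pipeline step, in execution order."""
    return [[python, script] for script in scripts]


def run_pipeline(dry_run=True):
    """Print each step and, unless dry_run is set, execute it.

    Execution stops at the first step that exits with a non-zero status.
    """
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)  # set dry_run=False to actually execute
```

Keeping the dry run as the default lets you review the order, and the enabled operations inside each script, before anything touches the database.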

Code Description

  • import_data.py - imports data from various datasets into the local database

  • gevir.py - computes Variant Intolerant Regions (VIRs) from gnomAD variant data and creates GeVIR gene scores

  • gene_sets.py - combines gene scores (e.g. GeVIR, LOEUF, Missense z-score) into a single dataset, loads disease and essential gene lists from the database

  • export_data.py - exports VIR and GeVIR gene score data as CSV files

  • figures.py - performs gene scores evaluation, draws figures and reports statistics

  • gnomad_utils.py - modified version of gnomAD browser code, obtained from: https://github.com/macarthur-lab/gnomad_browser/blob/master/utils.py

  • common.py - contains methods commonly used by other scripts

  • csv_reader.py - custom CSV reader, used to import data into the local database
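
To give a feel for what a region free of variants looks like, here is a toy sketch. It is not taken from gevir.py and does not reproduce GeVIR scoring: it assumes, as a simplification, that a region is a maximal run of positions lying between consecutive observed variants, and it works on a plain list of positions rather than on gnomAD collections. The function name and inputs are hypothetical.

```python
def variant_free_regions(length, variant_positions):
    """Return (start, end, size) for each run of positions without a variant.

    Positions are 1-based and inclusive; `length` is the sequence length.
    Variant positions are assumed to lie within 1..length.
    """
    regions = []
    previous = 0  # sentinel just before the first position
    # A second sentinel just past the end closes the final region.
    for position in sorted(set(variant_positions)) + [length + 1]:
        if position - previous > 1:
            start, end = previous + 1, position - 1
            regions.append((start, end, end - start + 1))
        previous = position
    return regions


if __name__ == "__main__":
    # Toy input: a 20-position sequence with variants at positions 3, 4 and 15.
    for region in variant_free_regions(20, [3, 4, 15]):
        print(region)
```

In this toy input the longest variant-free stretch spans positions 5 to 14.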
