Code repository for the CAGI6 SickKids challenge

A challenge submission by the DROPpers.

The team members are: Julien Gagneur, Christian Mertes, Ines Scheller, Nicholas H. Smith, Vicente A. Yépez.

This repository contains the code used to generate the results of our two submissions to the CAGI6 SickKids challenge. We tackled the challenge by predicting the molecular events underlying disease from a patient's genome and transcriptome, using variant annotation, aberrant gene expression events, and Human Phenotype Ontology (HPO) terms.

The code consists of 4 parts, each described below:

1. Aberrant event detection in RNA-seq data using DROP
2. Annotating and filtering variants
3. Computing phenotypic similarity scores
4. Prioritizing events using XGBoost

A detailed description of our full analysis can be found here.

Aberrant event detection in RNA-seq

We used DROP with its default configuration to call aberrant events. In short, to run the full pipeline: (i) install DROP through bioconda, (ii) put all relevant data into Data/project_data/raw/, and (iii) create a sample annotation at Data/project_data/sample_annotation.tsv. You can then run the full DROP pipeline with:

snakemake -j 20

The main pipeline configuration can be found here.
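
As a starting point for step (iii), the following minimal sketch writes a sample annotation table as tab-separated text. The column names reflect our understanding of the DROP documentation and should be checked against the DROP version you install. The sample IDs, file naming scheme, group name, and HPO terms are hypothetical placeholders, not values from the challenge data.

```python
import csv
import io

# Column names as we understand them from the DROP documentation (assumption:
# verify against your installed DROP version; further columns may be needed).
COLUMNS = [
    "RNA_ID", "RNA_BAM_FILE", "DNA_VCF_FILE", "DNA_ID", "DROP_GROUP",
    "PAIRED_END", "COUNT_MODE", "COUNT_OVERLAPS", "STRAND", "HPO_TERMS",
]


def build_sample_annotation(samples, raw_dir="Data/project_data/raw"):
    """Return the sample annotation for `samples` as TSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for s in samples:
        writer.writerow({
            "RNA_ID": s["rna_id"],
            # Hypothetical file naming scheme: <raw_dir>/<id>.bam and .vcf.gz
            "RNA_BAM_FILE": f"{raw_dir}/{s['rna_id']}.bam",
            "DNA_VCF_FILE": f"{raw_dir}/{s['dna_id']}.vcf.gz",
            "DNA_ID": s["dna_id"],
            "DROP_GROUP": "sickkids",            # hypothetical group name
            "PAIRED_END": "TRUE",                # illustrative setting
            "COUNT_MODE": "IntersectionStrict",  # illustrative setting
            "COUNT_OVERLAPS": "TRUE",            # illustrative setting
            "STRAND": "no",                      # illustrative setting
            "HPO_TERMS": ",".join(s["hpo_terms"]),
        })
    return buf.getvalue()


# Hypothetical example samples
example_samples = [
    {"rna_id": "S001_RNA", "dna_id": "S001_DNA",
     "hpo_terms": ["HP:0001250", "HP:0001263"]},
    {"rna_id": "S002_RNA", "dna_id": "S002_DNA",
     "hpo_terms": ["HP:0003198"]},
]
tsv_text = build_sample_annotation(example_samples)
```

The resulting text can then be written to Data/project_data/sample_annotation.tsv before starting the pipeline.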

Variant annotation and filtering

As described in our method description, we used VEP to annotate the variants. In short, we included all default VEP annotations, allele frequencies from gnomAD, CADD, SpliceAI, and EVE scores, and ClinVar and UTRannotator information. The respective configuration and scripts can be found here and here. After adapting the config to your local infrastructure and successfully running the DROP pipeline, you should be able to run the annotation with snakemake as follows:

snakemake -j 20 --snakefile Snakefile_vep_anno.smk
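
To illustrate what filtering on these annotations can look like, here is a minimal sketch that keeps rare variants with at least one supporting signal. It is not the filter used in our submission: the thresholds, field names, and example variants below are hypothetical and chosen for illustration only.

```python
# Hypothetical thresholds for illustration; not the values used in the submission.
MAX_GNOMAD_AF = 0.001
MIN_CADD_PHRED = 20.0
MIN_SPLICEAI = 0.5
PATHOGENIC_LABELS = {"Pathogenic", "Likely_pathogenic"}


def keep_variant(variant):
    """Keep rare variants that carry at least one deleteriousness signal.

    `variant` is a dict with hypothetical keys: gnomad_af, cadd_phred,
    spliceai, clinvar. Missing values are represented as None.
    """
    af = variant.get("gnomad_af")
    # A variant absent from gnomAD (None) is treated as rare.
    if af is not None and af > MAX_GNOMAD_AF:
        return False
    if variant.get("clinvar") in PATHOGENIC_LABELS:
        return True
    cadd = variant.get("cadd_phred")
    spliceai = variant.get("spliceai")
    high_cadd = cadd is not None and cadd >= MIN_CADD_PHRED
    high_spliceai = spliceai is not None and spliceai >= MIN_SPLICEAI
    return high_cadd or high_spliceai


# Hypothetical annotated variants
example_variants = [
    {"gene": "GENE_A", "gnomad_af": None, "cadd_phred": 28.1, "spliceai": 0.02, "clinvar": None},
    {"gene": "GENE_B", "gnomad_af": 0.12, "cadd_phred": 25.0, "spliceai": 0.01, "clinvar": None},
    {"gene": "GENE_C", "gnomad_af": 0.0002, "cadd_phred": 3.1, "spliceai": 0.81, "clinvar": None},
    {"gene": "GENE_D", "gnomad_af": 0.0001, "cadd_phred": 5.0, "spliceai": 0.01, "clinvar": None},
]
kept_genes = [v["gene"] for v in example_variants if keep_variant(v)]
```

In this toy input, GENE_B is dropped for being common and GENE_D for lacking any supporting score.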

Phenotypic similarity scores

We computed the phenotypic similarity scores as described by Kopajtich et al. A more detailed description can also be found in our Methods section. The scripts to run the computation can be found here.
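
For readers unfamiliar with ontology-based similarity, the sketch below shows a generic Resnik-style symmetric similarity between a patient's terms and a gene's terms. Whether it matches the exact scoring of Kopajtich et al. is not confirmed here; it only illustrates the general idea. The toy ontology, term IDs, and term frequencies are hypothetical and do not come from HPO.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms (not real HPO IDs).
PARENTS = {
    "T:root": [],
    "T:neuro": ["T:root"],
    "T:seizure": ["T:neuro"],
    "T:ataxia": ["T:neuro"],
    "T:muscle": ["T:root"],
    "T:myopathy": ["T:muscle"],
}
# Hypothetical fraction of annotated genes carrying each term or a descendant.
FREQ = {
    "T:root": 1.0, "T:neuro": 0.4, "T:seizure": 0.1,
    "T:ataxia": 0.05, "T:muscle": 0.3, "T:myopathy": 0.1,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Rarer terms are more informative: IC = -log(frequency)."""
    return -math.log(FREQ[term])


def resnik(a, b):
    """Information content of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


def symmetric_similarity(patient_terms, gene_terms):
    """Average of best-match scores in both directions."""
    if not patient_terms or not gene_terms:
        return 0.0
    p2g = sum(max(resnik(p, g) for g in gene_terms) for p in patient_terms) / len(patient_terms)
    g2p = sum(max(resnik(g, p) for p in patient_terms) for g in gene_terms) / len(gene_terms)
    return (p2g + g2p) / 2


patient = ["T:seizure", "T:ataxia"]
score_neuro_gene = symmetric_similarity(patient, ["T:seizure"])
score_muscle_gene = symmetric_similarity(patient, ["T:myopathy"])
```

Here the gene annotated with a neurological term scores higher than the gene annotated with a muscle term, because the latter shares only the uninformative root with the patient's terms.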

Prioritizing events using XGBoost

For the final submission to the SickKids challenge, we used XGBoost to predict the disease-causing gene given an individual's HPO terms, genetic information, and RNA-seq-based aberrant events. The code for our model can be found here and here. The model can be trained once the RNA-seq outliers are called, the variants are annotated, filtered, and preprocessed, and the phenotypic similarity scores are calculated.
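
The sketch below illustrates how the three inputs could be combined into one feature row per sample-gene pair and how genes could then be ranked within each sample. It is not our model: a hand-written linear score with arbitrary weights stands in for the trained XGBoost classifier, and all feature names, sample IDs, genes, and values are hypothetical.

```python
def build_feature_rows(outliers, variant_scores, pheno_scores):
    """Merge evidence per (sample, gene) pair into one feature row each.

    outliers:       set of (sample, gene) pairs called as expression outliers
    variant_scores: dict (sample, gene) -> highest CADD score among kept variants
    pheno_scores:   dict (sample, gene) -> phenotypic similarity score
    """
    keys = set(outliers) | set(variant_scores)
    rows = []
    for sample, gene in sorted(keys):
        rows.append({
            "sample": sample,
            "gene": gene,
            "is_expression_outlier": int((sample, gene) in outliers),
            "max_cadd": variant_scores.get((sample, gene), 0.0),
            "pheno_similarity": pheno_scores.get((sample, gene), 0.0),
        })
    return rows


def placeholder_score(row):
    """Stand-in for a trained model's prediction; the weights are arbitrary."""
    return (1.0 * row["is_expression_outlier"]
            + 0.05 * row["max_cadd"]
            + 0.5 * row["pheno_similarity"])


def rank_genes(rows, score_fn):
    """Return, per sample, its candidate genes sorted by decreasing score."""
    by_sample = {}
    for row in rows:
        by_sample.setdefault(row["sample"], []).append(row)
    return {
        sample: [r["gene"] for r in sorted(sample_rows, key=score_fn, reverse=True)]
        for sample, sample_rows in by_sample.items()
    }


# Hypothetical inputs
outliers = {("S001", "GENE_A")}
variant_scores = {("S001", "GENE_A"): 28.1, ("S001", "GENE_C"): 12.0, ("S002", "GENE_D"): 31.0}
pheno_scores = {("S001", "GENE_A"): 1.2, ("S001", "GENE_C"): 0.1}

rows = build_feature_rows(outliers, variant_scores, pheno_scores)
ranking = rank_genes(rows, placeholder_score)
```

With XGBoost installed, the numeric columns of such rows could be passed as a feature matrix to a classifier such as xgboost.XGBClassifier, and its predicted probabilities would replace placeholder_score in the ranking step.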

Disclaimer

This code was put together for the CAGI6 SickKids challenge and is not production-ready. This repository is meant to complement our method description and to help others get started. If you have any questions about the model or code, please open a new issue.
