#################################################################################################

PLEASE NOTE

ExInAtor is ONLY designed to be used with mutations from Whole Genome Sequences. It is NOT designed to be used with Exome data.

#################################################################################################

Authors

Andrés Lanzós 1,2,3 Joana Carlevaro-Fita 1,2,3 Loris Mularoni 3,4 Ferran Reverter 1,2,3 Emilio Palumbo 1,2,3 Roderic Guigó 1,2,3 Rory Johnson 1,2,3,5 *

  1. Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona, 08003, Spain.
  2. Universitat Pompeu Fabra (UPF), Barcelona, Spain.
  3. Institut Hospital del Mar d’Investigacions Mèdiques (IMIM), 08003 Barcelona, Spain.
  4. Research Unit on Biomedical Informatics, Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Dr. Aiguader 88, Barcelona, Spain.
  5. Present address: Department of Clinical Research, University of Bern, Murtenstrasse 35, 3010 Bern, Switzerland.

    * Corresponding author.

Description

ExInAtor is designed to detect cancer driver long non-coding RNA genes. The general approach is to identify genes that have an excess of mutations in their exons compared to local background regions (including intronic sequences). The statistical method is based on the hypergeometric distribution and generates a Q value (False Discovery Rate) for each gene. The main output file is called “ExInAtor_Gene_List.txt”, which contains comprehensive results for all analyzed genes.
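To make the statistical idea concrete, here is a minimal sketch of a one-sided hypergeometric test followed by a false discovery rate correction. It is not taken from the ExInAtor code. The parameterization (genomic positions as the population, exonic positions as "successes", mutations as draws), the use of the Benjamini-Hochberg procedure for the Q value, and all function names are assumptions made for illustration. The sketch targets Python 3.8 or later, not the Python 2.7 that ExInAtor was tested on.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    if n > N or K > N:
        raise ValueError("draws and successes cannot exceed the population")
    lo = max(k, n - (N - K), 0)   # smallest value of X that counts
    hi = min(K, n)                # largest possible value of X
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1))
    return tail / comb(N, n)


def exon_enrichment_pvalue(exon_mut, exon_len, bg_mut, bg_len):
    """Assumed parameterization: chance that at least `exon_mut` of all
    observed mutations fall in the exonic positions."""
    return hypergeom_sf(exon_mut, exon_len + bg_len, exon_len,
                        exon_mut + bg_mut)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues
```

For example, `exon_enrichment_pvalue(8, 1500, 20, 40000)` would score a hypothetical gene with 8 mutations in 1,500 exonic bases against 20 mutations in 40,000 background bases, and `benjamini_hochberg` would then be applied to the p-values of all genes together.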

The general approach is represented in the following figure:

A. Example gene to analyse and two flanking genes.

B. Merging of exons for each gene.

C. Extension of the background region while removing any exon of any flanking gene.

D. Mutation mapping.

E. Definition of the new background region to obtain the same trinucleotide content (i.e. percentage) as the exonic region.

F. Mutation mapping on the new background region.

[Figure: ExInAtor workflow, panels A to F]
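Steps B and C can be illustrated with plain interval arithmetic. The sketch below is not how ExInAtor implements them (the tool relies on bedtools). The half-open `(start, end)` coordinates, the function names and the example numbers are all assumptions chosen for illustration.

```python
def merge_intervals(intervals):
    """Step B: merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_intervals(region, blocked):
    """Return the parts of `region` not covered by any blocked interval."""
    start, end = region
    pieces = []
    cursor = start
    for b_start, b_end in merge_intervals(blocked):
        if b_end <= cursor or b_start >= end:
            continue  # blocked interval lies outside the remaining region
        if b_start > cursor:
            pieces.append((cursor, b_start))
        cursor = max(cursor, b_end)
    if cursor < end:
        pieces.append((cursor, end))
    return pieces


def background_region(gene_exons, flanking_exons, extension, chrom_size):
    """Step C: extend around the gene, then drop every exon (own or flanking)."""
    exons = merge_intervals(gene_exons)
    start = max(0, exons[0][0] - extension)
    end = min(chrom_size, exons[-1][1] + extension)
    return subtract_intervals((start, end), exons + list(flanking_exons))
```

With hypothetical exons `[(100, 200), (150, 300), (400, 500)]`, flanking exons `[(0, 60), (550, 700)]` and an extension of 50, the background consists of the upstream flank, the intron and the downstream flank.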

Requirements

1. Linux:

Tested in Ubuntu 14.04

2. Bedtools:

Please use version 2.19.1. More recent versions will not work because of changes in some bedtools commands.

3. Python:

Tested on version 2.7.6

4. R:

Tested on version 3.3.1

5. Awk:

Tested on version mawk 1.3.3
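Because the bedtools requirement is an exact version, it can be worth checking before a long run. The helper below is a small sketch, not part of ExInAtor: it only parses a version string, and its function names are hypothetical. It assumes the text passed in is the output of `bedtools --version`, which usually looks like `bedtools v2.19.1`.

```python
import re

# Version required by this README.
REQUIRED_BEDTOOLS = (2, 19, 1)


def parse_version(text):
    """Extract the first x.y.z version number found in `text`."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no x.y.z version found in: %r" % text)
    return tuple(int(part) for part in match.groups())


def bedtools_is_supported(version_output):
    """True only when the reported version is exactly the required one."""
    return parse_version(version_output) == REQUIRED_BEDTOOLS
```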

Installation

1. Download the compressed file “All_files.zip” from:

https://github.com/alanzos/ExInAtor

2. Uncompress it in the directory of your choice

3. Ready to run

For each file you must provide the full path. Example command:

$  python2.7 Main.py -i /home/All_files/Inputs/Breast.bed -o /home/All_files/Outputs -f /home/All_files/Inputs/Genome_v19.fasta -g /home/All_files/Inputs/gencode.v19.long_noncoding_RNAs.gtf -s /home/All_files/Inputs/chromosomes.txt -k /home/All_files/Inputs/3mers.txt -w /home/All_files/Inputs/gencode.v19.annotation.gtf -c 6 -n 119 -b 10000

In the folder “Outputs” you can find the expected results.
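Since every file must be given as a full path, a small wrapper can catch relative paths before the analysis starts. This is a sketch only: the flags are the ones shown in the example command, while `build_command`, `PATH_FLAGS` and the dictionary layout are hypothetical names introduced here.

```python
import os

# Flags that take a file or folder path, as listed in the example command.
PATH_FLAGS = ("-i", "-o", "-f", "-g", "-s", "-k", "-w")


def build_command(paths, genomes, cores=None, background_size=None,
                  interpreter="python2.7", script="Main.py"):
    """Return the ExInAtor command as an argument list, refusing relative paths."""
    command = [interpreter, script]
    for flag in PATH_FLAGS:
        path = paths[flag]
        if not os.path.isabs(path):
            raise ValueError("%s needs a full path, got %r" % (flag, path))
        command += [flag, path]
    command += ["-n", str(genomes)]
    if cores is not None:
        command += ["-c", str(cores)]
    if background_size is not None:
        command += ["-b", str(background_size)]
    return command
```

The returned list can be passed to `subprocess.call` from the folder that contains `Main.py`.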

Inputs

All of the following files are provided in the compressed file “All_files.zip”, except the FASTA file of the whole genome (“-f | --fasta_file”), which can be obtained here (818 MB)

1. Mandatory:
  • -i | --input_file -> File containing the location of the cancer mutations in BED format.
  • -o | --output_folder -> Name of the folder where files and plots are saved.
  • -g | --gtf_file -> File containing the genes and exons to analyse in GTF format. Other features, including transcripts, will be ignored.
  • -f | --fasta_file -> FASTA file of the whole genome.
  • -s | --chr_sizes -> Two-column tab-separated text file containing assembly sequence names and sizes.
  • -k | --kmers_file -> Text file containing all the possible trinucleotides.
  • -w | --whole_genome -> File containing the genes and exons of the whole genome in GTF format. Other features, including transcripts, will be ignored.
  • -n | --number_of_genomes -> The number of samples or genomes corresponding to the “input_file”.
2. Optional:
  • -e | --exonic_filter -> Minimum number of exonic mutations a gene must have to be analyzed.
  • -x | --background_filter -> Minimum number of background mutations a gene must have to be analyzed.
  • -b | --background_size -> Extension length of the background region that includes all introns.
  • -c | --cores -> Number of CPU cores to use in the analysis.
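The trinucleotide and chromosome-size inputs are simple enough to regenerate for another assembly. The sketch below shows one way to do it. The exact layout of the bundled “3mers.txt” and “chromosomes.txt” is not described here, so the one-trinucleotide-per-line assumption, the choice of the first word of each FASTA header as the sequence name, and the function names are all illustrative.

```python
from itertools import product


def all_kmers(k=3, alphabet="ACGT"):
    """All possible k-mers over the alphabet (64 trinucleotides for k=3)."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def fasta_lengths(lines):
    """Return (name, length) pairs from FASTA lines, in file order."""
    sizes = []
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sizes.append((name, length))
            # Assumption: the sequence name is the first word of the header.
            name, length = line[1:].split()[0], 0
        else:
            length += len(line)
    if name is not None:
        sizes.append((name, length))
    return sizes


def chrom_sizes_lines(sizes):
    """Format (name, length) pairs as two tab-separated columns."""
    return ["%s\t%d" % (name, length) for name, length in sizes]
```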

Outputs

1. ExInAtor_Gene_List.txt: Final output. List of driver candidates with exon_mutations, exon_length, intron_mutations, intron_length, pval, qval. Sorted by Qval.

2. genes.bed: BED file containing the localization of each gene.

3. counts.txt: File containing the number of exonic mutations, exonic length, background mutations and background length for each gene.

4. table_kmer_counts.txt: File containing the number of exonic mutations, exonic length, background mutations and background length for each trinucleotide in each gene.

5. qqplot.png: QQplot of the expected and observed p-values.

QQplots

These plots are used to evaluate whether the p-values follow a uniform distribution, and hence the overall statistical behaviour of the results. “QQ” stands for quantile-quantile. The plot compares the distribution of observed p-values to the distribution expected under the null hypothesis of no association. Under the null hypothesis the p-values are uniformly distributed, that is, their histogram over all statistical tests is flat with a total density of 1.

For more information see:

https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot

http://www.jstor.org/stable/2987987?origin=crossref&seq=1#page_scan_tab_contents
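As a rough illustration of how the points of such a plot can be computed, the sketch below pairs each sorted observed p-value with its expected uniform quantile. The `-log10` scale and the `rank / (n + 1)` expected quantiles are common conventions assumed here, not confirmed details of the plot ExInAtor produces, and `qq_points` is a hypothetical name.

```python
from math import log10


def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale."""
    observed = sorted(pvalues)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = rank / (n + 1.0)   # uniform quantile for this rank
        p = max(p, 1e-300)            # avoid log10(0) for p-values of zero
        points.append((-log10(expected), -log10(p)))
    return points
```

If the p-values are uniform, the points fall close to the diagonal. Points that rise above it at the smallest p-values indicate more significant genes than expected by chance.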

The following is the expected QQplot of the example data provided with ExInAtor in the GitHub link: https://github.com/alanzos/ExInAtor

[Figure: QQplot]