A pipeline that uses read-level features and extra trees/random forest algorithms for accurate and fast detection of somatic mutations in next generation sequencing data.
We have created a docker image with all dependencies installed:
- If you don't have docker already installed in your system, please follow these instructions.
docker pull labxa/rfcaller:1.0.0
The image has the following structure:
databases
directory contains dbSNP and ploidy files.RFcaller
has the scripts used by the pipeline.output
is the working directory.
/home
|-- databases
|-- dbSNP
|-- UCSC_dbSNP153Common_hg19_combined.vcf.gz
|-- UCSC_dbSNP153Common_hg19_combined.vcf.gz.tbi
|-- UCSC_dbSNP153Common_hg38_combined.vcf.gz
|-- UCSC_dbSNP153Common_hg38_combined.vcf.gz.tbi
|-- ploidy_files
|-- GRCm38.ploidy.file
|-- GRCm39.ploidy.file
|-- RFcaller
/output
Here is basic configuration, for more information see docker and RFcaller options.
docker run --rm -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38
Among all the options offered by docker (docker run --help
), we recommend:
--rm
: Automatically remove the container when it exits.-v, --volume
: Mount local volumes in the container.- With the option
-v $(pwd):/output/
, RFcaller results will be in your current directory.
- With the option
-u, --user
: Specify the user ID and its group ID. It's useful to not run the pipeline as root.-i, --interactive
: Keep STDIN open even if not attached.-t, --tty
: Allocate a pseudo-TTY. When combined with-i
it allows you to connect your terminal with the container terminal.-e, --env
: Set environment variables.
RFcaller has the following required inputs:
-i, --input
: The input is a TSV (tab-separated values) file with five required columns and an optional extra one. This last column is used to specify the file with the positions you want to analyze.
Normal name | Normal BAM (CRAM supported) |
Tumor name | Tumor BAM (CRAM supported) |
Output name | VCF/BED (optional) |
-g, --genome
: Reference genome in FASTA format (bgzip supported).-p, --dbSNP
: VCF file provided by the image with common SNPs (MAF ≥ 1%) to eliminate these positions from the analysis. In case you want to use your own VCF (bgzip supported), use also the--ploidy_file
argument.- Choices:
[hg19, hg38, your_dbSNP]
- Choices:
RFcaller provides additional configuration through the following optional inputs:
-@
: Max number of threads to use.- Default: 20
-b, --bamsDir
: Main directory where BAMs are located inside the container. By default, the program will look in the/bams/
directory and that's why we set the docker option-v /BAMS_PATH/:/bams/
. However, if you mount your BAMs in other directory inside the container, you have to use this option to specify it.- Default:
/bams/
- Default:
-o, --outputDir
: Name for the output directory. By default, the results will be saved in your current directory, but with this option a new folder will be created to save them.--positions
: This argument refers to the file format of the--input
optional column.- Default:
VCF
- Choices:
[VCF, BED]
- Default:
--ploidy_file
:bcftools call --ploidy_file
to use in variant calling. By default, if the user specifies--dbSNP hg19
or--dbSNP hg38
,--ploidy_file
is GRCh37 or GRCh38 respectively. Otherwise, you must select between the following options:- Choices:
[GRCh37, GRCh38, GRCm38 (mouse), GRCm39 (mouse), None (consider all sites as diploid), your_ploidy_file]
- Choices:
--includePatches
: Tag used to analyze both canonical chromosomes and patches.--keep
: Tag used to keep intermediate files.
RFcaller has some extra parameters to use as filters:
Parameter | Description | Default |
---|---|---|
contamination | Percentage of tumor contamination in normal sample. For example, if you expect that 10% of reads in normal comes from tumor (10% contamination),you have to set "contamination": 0.1 . This argument is used to increase the maximum number of mutated reads allowed in normal sample |
0.05 |
TD_cov_SNV | Minimum coverage for tumor sample (SNVs) | 7 |
TD_cov_INDEL | Minimum coverage for tumor sample (INDELs) | 7 |
ND_cov_SNV | Minimum coverage for normal sample (SNVs) | 7 |
ND_cov_INDEL | Minimum coverage for normal sample (INDELs) | 7 |
TD_mut_SNV | Minimum number of mutated reads for a position in tumor sample to not discard it (SNVs) | 3 |
TD_mut_INDEL | Minimum number of mutated reads for a position in tumor sample to not discard it (INDELs) | 4 |
ND_mut_SNV | Maximum number of mutated reads allowed for a position in normal sample (SNVs) | 3 |
ND_mut_INDEL | Maximum number of mutated reads allowed for a position in normal sample (INDELs) | 2 |
ND_window | Window size around a position to look for mutations in normal (INDELs) | 10 |
SNV_threshold | Minimum QUAL value to consider a SNV as good | 10.1774 |
INDEL_threshold | Minimum QUAL value to consider an INDEL as good | 32.1418 |
polyINDEL_threshold | Minimum QUAL value to consider an homopolymerINDEL (mononucleotide microsatellites) as good | 0.7723 |
Imagine that you are working in the directory /home/example
which has the following structure:
/home/example/
|-- example.metadata
|-- bams
|-- normal.bam
|-- normal.bam.bai
|-- tumor.bam
|-- tumor.bam.bai
|-- genome
|-- hg19.fa
|-- hg19.fa.fai
If we use the following command to run RFcaller:
docker run --rm -v /home/example/bams/:/bams/ -v /home/example/genome/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -i example.metadata --genome /genome/hg19.fa --dbSNP hg19
The example.metadata file should be:
normal normal.bam tumor tumor.bam example
This is because thanks to the -v
command, the files inside the container are located in the following way:
Computer | Docker |
---|---|
/home/example/example.metadata | /output/example.metadata |
/home/example/bams/normal.bam | /bams/normal.bam |
/home/example/bams/tumor.bam | /bams/tumor.bam |
/home/example/genome/hg19.fa | /genome/hg19.fa |
A more detailed explanation:
- We use
-i example.metadata
because inside the container, we are working in the/output/
directory, which is where theexample.metadata
file is located. It would also be valid to write-i /output/example.metadata
. - We have said before that the default directory for the BAMs inside the container is
/bams/
. In this sense, as our BAMs are in this directory, we don't have to specify the full path in theexample.metadata
file, just write where these BAMs are located inside this folder. - In the case of genome, we have to specify the full path inside the container, that is why we use
--genome /genome/hg19.fa
.
To make it clear how to use the Docker options together with RFcaller, we have prepared a more complex example. For this time, we start with the following directory structure:
/home/complex_example/
|-- example.metadata
/data/
|-- bams
|-- normal_dir
|-- normal.bam
|-- normal.bam.bai
|-- tumor_dir
|-- tumor.bam
|-- tumor.bam.bai
|-- hg19_genome
|-- hg19.fa
|-- hg19.fa.fai
In the event that we were working in the /home/
directory:
docker run --rm -v /data/bams/:/example_bams/ -v /data/hg19_genome/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -i complex_example/example.metadata --bamsDir /example_bams/ --genome /genome/hg19.fa --dbSNP hg19
The distribution of the files will be:
Computer | Docker |
---|---|
/home/complex_example/example.metadata | /output/complex_example/example.metadata |
/data/bams/normal_dir/normal.bam | /example_bams/normal_dir/normal.bam |
/data/bams/tumor_dir/tumor.bam | /example_bams/tumor_dir/tumor.bam |
/data/hg19_genome/genome/hg19.fa | /genome/hg19.fa |
Thus, the metadata.example
file should be:
normal normal_dir/normal.bam tumor tumor_dir/tumor.bam example
A more detailed explanation:
- As we are using
-v $(pwd):/output/
and our working directory outside the container is/home/
, theexample.metadata
file in the container is inside the foldercomplex_example
. This is why we use-i complex_example/example.metadata
. - We have mounted the BAMs directory in
/example_bams/
instead of/bams/
, so we also have to changed BAMs directory for RFcaller with the-b, --bamsDir
option. Now, for theexample.metadata
we have to write the full BAMs path removing the main directory because it is defined with--bamsDir
.
The time zone of the docker image is Europe/Madrid
and it's used to set the time for the log file. To change this, add the -e
option to set environment variables:
docker run --rm -e "TZ=$(cat /etc/timezone)" -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38
Or set it manually (to know your TZ visit: TZ database)
docker run --rm -e "TZ=America/Toronto" -v /BAMS_PATH/:/bams/ -v /GENOME_PATH/:/genome/ -v $(pwd):/output/ -u $(id -u):$(id -g) -it labxa/rfcaller:1.0.0 -@ INT -i INPUT.LIST --genome /genome/GENOME.FA --dbSNP hg19/hg38
For each case three directories and a final file will be created:
- calling: bcftools basic call results
${name}.vcf.gz
-> Original calling from bcftools${name}.norm.vcf.gz
-> After normalizing the indels${name}.filter.norm.vcf.gz
-> After removing SNPs and very low quality mutations${name}.SNVs.filter.norm.vcf
and${name}.INDELs.filter.norm.vcf
-> Separate SNVs and INDELs into two filespolyIndel.pos
-> BED file with those positions which contain an homopolymer INDEL
- somaticSNV: RFcaller results for SNVs
reduced.positions
-> All positions to analyze in${name}.SNVs.filter.norm.vcf
reduced_*.bam
-> A small bam with just the reads overlapping positions inreduced.positions
*.mini.pileup
and${name}.mini.pileup.vcf
-> Filtered mini.pileup and its VCF for the positions inreduced.positions
common.positions
-> File with the positions in${name}.mini.pileup.vcf
with the formatCHR \t POS
${tumor}.read.names
,${name}.mutations.interval
,${name}.sequence
,${name}.stats
,${name}.cigar
and${name}.mapq
-> Intermediate files with read-level features${name}.prepare_muts.csv
-> File generated from gathering the above features and filtering themregression_results_SNVs_${name}.txt
-> Regression algorithm result for SNVsmutations_SNVs_${name}.vcf.gz
-> Final set of SNVs
- somaticINDEL: RFcaller results for INDELs
reduced.positions
-> All positions to analyze in${name}.INDELs.filter.norm.vcf
short_reduced.positions
andlong_reduced.positions
-> Positions separated by the size of the INDEL (short < 7 and long >= 7)reduced_*.bam
-> A small bam with just the reads overlapping positions inreduced.positions
*.mini.pileup
and${name}.mini.pileup.vcf
-> Filtered mini.pileup and its VCF for the positions inreduced.positions
short_common.positions
,long_common.positions
andcommon.positions
-> File with the positions in${name}.mini.pileup.vcf
with the formatCHR \t POS
${name}.features
,${name}.mutations.interval
,${name}.sequence
,${name}.distribution
,${name}.stats
,${name}.cigar
and${name}.mapq
-> Intermediate files with read-level features${name}.prepare_muts.csv
-> File generated from gathering the above features and filtering themregression_results_INDELs_${name}.txt
-> Regression algorithm result for INDELsmutations_INDELs_${name}.vcf.gz
-> Final set of INDELs
mutations_${name}.vcf
-> Final VCF with the mutations from SNV and INDEL pipelines
If the tag --keep
is set, all above directories are removed and it is only conserved the final VCF file mutations_${name}.vcf
.
Two additional global files are created:
- RFcaller.log -> Log file with the commands that have been executed for each case and their results
- .stderr.log -> A file with errors and warnings that have occurred during the pipeline
Authors: Ander Díaz-Navarro, Pablo Bousquets-Muñoz & Xose S. Puente -- Universidad de Oviedo