Functionality: DNA

arhofman edited this page Jul 6, 2017 · 16 revisions

Here we describe the functionality included in NGS-pipe to analyse DNA experiments. The figure depicted below provides a schematic overview of the general categories we used to cluster the rules provided in NGS-pipe.

[Figure: schematic overview of the rule categories in NGS-pipe]

For each rule we give a short description if its functionality deviates from the cluster description.

Adapter removal and quality trimming

Raw sequences sometimes contain adapters, which need to be removed to avoid false positive calls in the analysis. It is also often desirable to remove low-quality bases from the raw sequencing files. Rules for two tools are available for this purpose:

  • rule trimmomatic_paired, trimmomatic_single: Trimmomatic (Bolger 2014)
  • rule seqpurge_paired: SeqPurge (Sturm 2016)
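
To illustrate what quality trimming does, here is a minimal Python sketch of a sliding-window trimmer. It is a simplified illustration of the general idea, not the exact algorithm of Trimmomatic or SeqPurge; the function name and the default window size and threshold are assumptions chosen for the example.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=15):
    """Cut a read at the first window whose mean base quality is too low.

    seq   -- read sequence as a string
    quals -- list of Phred quality scores, one per base
    Hypothetical helper for illustration only; reads shorter than the
    window are returned unchanged.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean_q:
            # Keep everything before the first low-quality window.
            return seq[:start], quals[:start]
    return seq, quals


if __name__ == "__main__":
    read = "ACGTACGTAC"
    quality = [30, 30, 30, 30, 30, 30, 10, 5, 5, 5]
    print(sliding_window_trim(read, quality))
```

In this example the read is cut where the average quality over four consecutive bases first drops below 15, so the low-quality tail is discarded.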

Alignment

The raw sequencing files need to be mapped to a reference sequence in order to determine their point of origin.

  • rule bwa_single, bwa_paired: BWA (Li, 2009_a, 2013) (bwa mem and bwa aln are available)
  • rule bowtie2_single, bowtie2_paired: Bowtie2 (Langmead and Salzberg, 2012)
  • rule yara_single, yara_end: Yara (Siragusa, 2015)
  • rule soap_paired, soap2sam: Soap (Li, 2008)

SAM/BAM file processing

After the sequenced reads are aligned against the corresponding reference sequence, many post-processing utilities that work on SAM/BAM files are available. Here we introduce the ones integrated into NGS-pipe (in the order they would be applied).

  • rule samtools_create_index: This rule creates an index of a BAM file using SAMtools (Li, 2009_b)
  • rule picards_fix_mate_pair_and_sort: This rule sorts a BAM file according to coordinate and fixes wrong mate pair information using Picard tools (http://broadinstitute.github.io/picard).
  • rule picard_merge_bams: There are often multiple sequencing files for a single sample which are mapped independently and need to be merged. NGS-pipe offers this functionality using Picard tools (http://broadinstitute.github.io/picard).
  • rule samtools_remove_secondary_alignments: Some read mappers and custom scripts do not handle secondary read alignments properly; therefore, the user can delete them from the BAM file. This rule uses the SAMtools (Li, 2009_b) view command to achieve this goal.
  • rule picards_mark_PCR_duplicates: This rule uses Picard tools (http://broadinstitute.github.io/picard) to find and mark likely PCR duplicates. PCR duplicates are reads that originate from the same biological sequence template and can bias the analysis.
  • rule samtools_remove_PCR_duplicates: If a BAM file contains marked PCR duplicates, this rule can be used to delete them. In order to do so, the SAMtools (Li, 2009_b) view command is used.
  • rule gatk_reassign_one_mapping_quality_filter: This rule manipulates the mapping qualities of a BAM file using the GATK (McKenna, 2010) functionality. This may be necessary, for example, to override the mapping quality of 255 used by Bowtie2 (Langmead and Salzberg, 2012) as the best mapping quality. The official SAM/BAM specification reserves a mapping quality of 255 to indicate that there is no mapping quality for the read.
  • rule gatk_realign_target_creation, gatk_realign_indels: Because the read mappers make use of heuristics, their alignments are often not optimal. This is especially true for sites harbouring indels. Therefore, it can be very useful to perform a realignment around these sites. In NGS-pipe, this is achieved using the GATK (McKenna, 2010) functionality.
  • rule gatk_first_pass_create_recalibration_table, gatk_base_recalibration: Since the sequencers exhibit systematic biases in many cases, the user can use the GATK (McKenna, 2010) functionality to re-assign the base qualities of the reads using a machine learning approach.
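
The removal of secondary alignments and of marked PCR duplicates both come down to filtering records on bits of the SAM FLAG field. The sketch below shows that idea on plain SAM text lines in Python. The flag values 0x100 (secondary alignment) and 0x400 (PCR or optical duplicate) are taken from the SAM specification; the function name is hypothetical, and the code is an illustration of the filtering principle, not the command NGS-pipe runs.

```python
SECONDARY = 0x100  # 256: secondary alignment (SAM specification)
DUPLICATE = 0x400  # 1024: PCR or optical duplicate (SAM specification)


def drop_flagged(sam_lines, exclude_mask):
    """Return SAM lines whose FLAG has none of the bits in exclude_mask set.

    Header lines (starting with '@') are always kept.
    Hypothetical helper for illustration only.
    """
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if (flag & exclude_mask) == 0:
            kept.append(line)
    return kept


if __name__ == "__main__":
    records = [
        "@HD\tVN:1.6",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r1\t256\tchr2\t500\t0\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    ]
    for rec in drop_flagged(records, SECONDARY | DUPLICATE):
        print(rec)
```

Passing a single constant removes only that class of records; combining both with a bitwise OR removes secondary alignments and marked duplicates in one pass.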

Single nucleotide variant calling

This section describes the rules that can be used to call single nucleotide variants (SNVs) and single nucleotide polymorphisms (SNPs). Most SNV calling approaches use statistical models to test whether the frequency of a variant differs significantly between a tumor sample and a control sample.
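
As a concrete illustration of such a test, the following Python sketch applies a two-sided Fisher's exact test to the counts of variant-supporting and reference-supporting reads in a tumor and a control sample. It is one simple example of the kind of statistical comparison described above and is not meant to reproduce the model of any specific caller listed here; the function name and the read counts are assumptions made for the example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Example layout (illustrative):
        a = tumor reads with variant,   b = tumor reads with reference
        c = control reads with variant, d = control reads with reference
    Returns the p-value: the summed probability of all tables with the
    same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = table_prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = table_prob(x)
        if p <= p_observed * (1 + 1e-9):  # tolerance for float rounding
            p_value += p
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: 12 of 100 tumor reads vs 0 of 100 control reads.
    print(fisher_exact_two_sided(12, 88, 0, 100))
```

A small p-value indicates that the variant frequency differs between the two samples more than would be expected by chance, which is the signal a somatic SNV caller looks for.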

SNV callers

  • rule gatk_mutect_2: MuTect2 (published 2016 as part of GATK (McKenna, 2010))
  • rule mutect_1: MuTect (Cibulskis, 2013)
  • rule varscan_somatic: VarScan2 (Koboldt, 2012)
  • rule strelka: Strelka (Saunders, 2012)
  • rule somatic_sniper: SomaticSniper (Larson, 2011)
  • rule joint_SNVMix_2_TRAIN, joint_SNVMix_2_CLASSIFY: JointSNVMix (Roth, 2012)
  • rule vardict: VarDict (Lai, 2016)
  • rule deepSNV: deepSNV (Gerstung, 2012) - here we provide an additional R script in the scripts folder which is invoked by the rule deepSNV

SNV caller combination

To exploit the strengths of several SNV callers, different methods exist to combine their results. The following are available via NGS-pipe:

  • rule gatk_variant_combine: This rule uses GATK's (McKenna, 2010) CombineVariants to combine the output of different SNV callers. For instance, the user is able to define the number of callers which are required to have identified a variant in order to report it.
  • SomaticSeq: SomaticSeq is a variant caller that uses several other variant callers as input. It uses a machine learning approach to best combine the input variant caller results. In NGS-pipe we implemented SomaticSeq version 2.0.1 using several rules that can be found in somatiq_sec_snake.py.
  • rule rank_combine_variants: This tool combines the results of different SNV callers using the rank information. In order to do so, the correlation between the different used variant callers is computed and a list of re-ranked variants is compiled. The idea is that variants identified by several callers with different underlying models get a higher rank in the final output file. The tool can be downloaded here: rank_combination.R
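
The simplest of these combination strategies, reporting a variant only if a minimum number of callers identified it, can be sketched in a few lines of Python. This is an illustration of the voting idea only; it does not reproduce GATK's CombineVariants, SomaticSeq, or the rank combination script, and the caller names, variant tuples, and function name below are hypothetical.

```python
from collections import Counter


def consensus_variants(calls_by_caller, min_callers=2):
    """Return variants reported by at least min_callers callers.

    calls_by_caller -- dict mapping a caller name to a set of variants,
                       each variant a (chrom, pos, ref, alt) tuple
    Hypothetical helper for illustration only.
    """
    support = Counter()
    for variants in calls_by_caller.values():
        support.update(set(variants))  # each caller counts once per variant
    return sorted(v for v, n in support.items() if n >= min_callers)


if __name__ == "__main__":
    calls = {
        "caller_a": {("chr1", 100, "A", "T"), ("chr2", 50, "G", "C")},
        "caller_b": {("chr1", 100, "A", "T")},
        "caller_c": {("chr1", 100, "A", "T"), ("chr3", 7, "C", "G")},
    }
    print(consensus_variants(calls, min_callers=2))
```

Raising `min_callers` trades sensitivity for specificity: fewer variants are reported, but each has support from more independent models.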

Copy number change estimation

In addition to mutations affecting short stretches of the genome, there are copy number events where whole genomic segments are amplified or deleted. NGS-pipe offers several rules to detect these events.

  • rule bicseq2_*: This set of rules implements the usage of BicSeq2 (Xi, 2016) (WGS only)
  • rule varscan_copy_number, varscan_copy_caller: These rules implement the copy number change estimation using VarScan2 (Koboldt, 2012)
  • rule facets: FACETS (Shen, 2016)

Annotation

There are several approaches to annotate the results of mutation calls to link them to databases or some measure of impact. In NGS-pipe we provide the following rules to do so:

  • rule snpEff_annotation: SnpEff (Cingolani, 2012a)
  • rule snpSift_dbSNP_Annotation: Uses SnpSift (Cingolani, 2012b) to annotate variants from the mutation caller that are present in dbSNP (Sherry, 1999, 2001)
  • rule snpSift_COSMIC_annotation: Uses SnpSift (Cingolani, 2012b) to annotate variants from the mutation caller that are present in COSMIC (Forbes, 2014)
  • rule snpSift_clinVar_annotation: Uses SnpSift (Cingolani, 2012b) to annotate variants from the mutation caller that are present in ClinVar (Landrum, 2013)
  • rule snpSift_dbNSFP_annotation: Uses SnpSift (Cingolani, 2012b) to annotate variants from the mutation caller that are present in dbNSFP (Liu, 2011)

Quality checking

There are many different possibilities to check the quality of an NGS experiment. Here we list the rules that can be used to create quality statistics:

  • rule fastqc: FastQC is used to get general statistics of FASTQ files
  • rule qualimap_PDF, qualimap_HTML: Qualimap2 (Okonechnikov, 2015) can be used to get general statistics on SAM/BAM files
  • rule samtools_flagstat: SAMtools (Li, 2009_b) flagstat provides some basic statistics on SAM/BAM files.
  • rule multiqc_fastq, multiqc_bam: MultiQC (Ewels, 2016) can be used to combine the results of other quality checking tools (here FastQC, Qualimap2, and SAMtools flagstat) across samples.

References

Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), 2114-2120.

Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., ... & Getz, G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature biotechnology, 31(3), 213-219.

Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., ... & Ruden, D. M. (2012a). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80-92.

Cingolani, P., Patel, V. M., Coon, M., Nguyen, T., Land, S. J., Ruden, D. M., & Lu, X. (2012b). Using Drosophila melanogaster as a model for genotoxic chemical mutational studies with a new program, SnpSift. Toxicogenomics in non-mammalian species, 3, 35.

Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), 3047-3048.

Forbes, S. A., Beare, D., Gunasekaran, P., Leung, K., Bindal, N., Boutselakis, H., ... & Kok, C. Y. (2014). COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic acids research, 43(D1), D805-D811.

Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., & Beerenwinkel, N. (2012). Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature communications, 3, 811.

Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., ... & Wilson, R. K. (2012). VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research, 22(3), 568-576.

Lai, Z., Markovets, A., Ahdesmaki, M., Chapman, B., Hofmann, O., McEwen, R., ... & Dry, J. R. (2016). VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic acids research, 44(11), e108-e108.

Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357-359.

Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., & Maglott, D. R. (2013). ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic acids research, 42(D1), D980-D985.

Larson, D. E., Harris, C. C., Chen, K., Koboldt, D. C., Abbott, T. E., Dooling, D. J., ... & Ding, L. (2011). SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 28(3), 311-317.

Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997.

Li, H., & Durbin, R. (2009_a). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009_b). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.

Li, R., Li, Y., Kristiansen, K., & Wang, J. (2008). SOAP: short oligonucleotide alignment program. Bioinformatics, 24(5), 713-714.

Liu, X., Jian, X., & Boerwinkle, E. (2011). dbNSFP: a lightweight database of human nonsynonymous SNPs and their functional predictions. Human mutation, 32(8), 894-899.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research, 20(9), 1297-1303.

Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2015). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292-294.

Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., ... & Marra, M. A. (2012). JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7), 907-913.

Saunders, C. T., Wong, W. S., Swamy, S., Becq, J., Murray, L. J., & Cheetham, R. K. (2012). Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics, 28(14), 1811-1817.

Shen, R., & Seshan, V. E. (2016). FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing. Nucleic acids research, 44(16), e131-e131.

Sherry, S. T., Ward, M., & Sirotkin, K. (1999). dbSNP—database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome research, 9(8), 677-679.

Sherry, S. T., Ward, M. H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., & Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic acids research, 29(1), 308-311.

Siragusa, E. (2015). Approximate string matching for high-throughput sequencing. http://www.diss.fu-berlin.de/diss/receive/FUDISS_thesis_000000099827

Sturm, M., Schroeder, C., & Bauer, P. (2016). SeqPurge: highly-sensitive adapter trimming for paired-end NGS data. BMC bioinformatics, 17(1), 208.

Xi, R., Lee, S., Xia, Y., Kim, T. M., & Park, P. J. (2016). Copy number analysis of whole-genome data using BIC-seq2 and its application to detection of cancer susceptibility variants. Nucleic acids research, 44(13), 6274-6286.