This pipeline processes long-read sequencing data to identify somatic mutations. The workflow is defined in Snakemake for reproducibility and scalability.
The pipeline integrates several bioinformatics tools to take samples from raw sequencing data through somatic variant calling and analysis. Key steps include mapping with Minimap2, sorting BAM files, quality control with Qualimap and NanoPlot, variant calling with ClairS, variant annotation with VEP, somatic structural variant (SV) calling, and somatic copy number alteration (CNA) analysis.
The pipeline is designed to take long-read sequencing data through a comprehensive analysis to identify and characterize somatic mutations, including single nucleotide variants (SNVs), structural variants (SVs), and copy number alterations (CNAs). Below is a breakdown of the critical stages in the pipeline:
- Purpose: Aligns raw sequencing reads to a reference genome. This step is crucial for identifying the genomic locations of the reads.
- Tool: Minimap2 is a fast sequence alignment program designed to handle long reads (e.g., from PacBio or Oxford Nanopore technologies).
- Purpose: Sorts the aligned reads in the BAM files by their genomic coordinates. Sorting is necessary for many downstream analyses, including variant calling.
- Tool: This step is performed using Samtools, which can efficiently sort and index BAM files.
- Purpose: Assesses the quality of the sequencing and alignment. Quality control metrics help identify potential issues with sequencing runs or alignment processes.
- Tools:
  - Qualimap provides detailed statistics about alignment quality, coverage, and other essential metrics.
  - NanoPlot generates graphical summaries of sequencing quality, offering insights into the distribution of read lengths, quality scores, and more.
- Purpose: Identifies somatic variants from the aligned reads, focusing on mutations that occur in cancer cells.
- Tool: ClairS is a somatic variant caller designed for high accuracy in detecting single nucleotide variants (SNVs) and indels from long-read sequencing data.
- Purpose: Enriches variant calls with information on their potential effects on genes, proteins, and disease phenotypes. This step is crucial for interpreting the functional impact of identified variants.
- Tool: The Variant Effect Predictor (VEP) annotates detected variants with data from multiple databases, providing insights into their biological significance.
- Purpose: Detects larger genomic rearrangements such as deletions, duplications, inversions, and translocations that can play significant roles in cancer development.
- Tool: Severus analyzes the alignments for patterns indicative of somatic structural variation.
- Purpose: Identifies changes in the number of copies of genomic regions, which are common in cancer genomes and can indicate regions of amplification or deletion associated with oncogenes or tumor suppressor genes.
- Tool: Wakhan can perform somatic CNA analysis, offering insights into the genomic gains and losses across the cancer genome.
By integrating these bioinformatics tools and steps, the pipeline provides a comprehensive analysis of somatic mutations from long-read sequencing data, enabling researchers to understand the genetic alterations driving cancer and potentially identify targets for therapy.
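To make the first stages concrete, the following is a minimal sketch of the shell commands that mapping, sorting, and quality control typically correspond to. It is not taken from the pipeline's Snakefile: the function name `build_mapping_commands`, the output file names, the thread count, and the `map-ont` preset are all assumptions chosen for illustration (PacBio HiFi data would use a different Minimap2 preset).

```python
import shlex


def build_mapping_commands(sample, reads, reference, threads=8, preset="map-ont"):
    """Return illustrative shell commands for alignment, sorting, indexing, and QC.

    All output names below are hypothetical; the real pipeline may use a
    different layout and additional options.
    """
    q = shlex.quote
    bam = f"{sample}.sorted.bam"
    return [
        # Align long reads and pipe the SAM stream directly into a coordinate sort.
        f"minimap2 -ax {q(preset)} -t {threads} {q(reference)} {q(reads)} "
        f"| samtools sort -@ {threads} -o {q(bam)} -",
        # Index the sorted BAM so downstream tools can access regions randomly.
        f"samtools index {q(bam)}",
        # Alignment-level QC with Qualimap.
        f"qualimap bamqc -bam {q(bam)} -outdir {q(sample + '_qualimap')}",
        # Read-level QC plots with NanoPlot.
        f"NanoPlot --fastq {q(reads)} -o {q(sample + '_nanoplot')}",
    ]


if __name__ == "__main__":
    for cmd in build_mapping_commands("tumor1", "tumor1.fastq.gz", "ref.fa"):
        print(cmd)
```

In the actual workflow, Snakemake derives these invocations from its rules and the configuration, so the commands do not need to be run by hand.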
Edit config/config.yaml and config/sample_sheet.csv to specify your samples and analysis parameters.
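The exact columns of the sample sheet are not documented here, so the sketch below only illustrates how such a sheet might be validated before a run. The column names `sample`, `tumor_reads`, and `normal_reads` and the function `load_sample_sheet` are hypothetical; check the sample sheet shipped with the pipeline for the real header.

```python
import csv
import io

# Hypothetical column names; the pipeline's actual sample sheet may differ.
REQUIRED_COLUMNS = {"sample", "tumor_reads", "normal_reads"}


def load_sample_sheet(handle):
    """Parse a CSV sample sheet into {sample_name: {column: value}}."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"sample sheet is missing columns: {sorted(missing)}")

    samples = {}
    for row in reader:
        name = row["sample"].strip()
        if not name:
            raise ValueError("sample sheet contains an empty sample name")
        if name in samples:
            raise ValueError(f"duplicate sample name: {name}")
        samples[name] = {
            "tumor_reads": row["tumor_reads"].strip(),
            "normal_reads": row["normal_reads"].strip(),
        }
    return samples


if __name__ == "__main__":
    example = io.StringIO(
        "sample,tumor_reads,normal_reads\n"
        "patient1,p1_tumor.fastq.gz,p1_normal.fastq.gz\n"
    )
    print(load_sample_sheet(example))
```

Failing early on a missing column or duplicate sample name is cheaper than discovering the problem after alignment has already run.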
The pipeline produces the following key outputs:
- Sorted BAM files
- Quality control reports from Qualimap and NanoPlot
- Somatic variants in VCF format annotated with VEP
- Somatic structural variant (SV) results
- Somatic copy number alteration (CNA) analysis results
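As a quick sanity check on the VCF output, the sketch below counts records by type using only the standard VCF columns (REF, ALT, FILTER). The function name `summarize_vcf` is an assumption for illustration, and the classification is deliberately simple: anything that is not a single-base substitution is grouped as `indel_or_other`, and only records whose FILTER is exactly `PASS` are counted as passing.

```python
BASES = set("ACGTN")


def summarize_vcf(lines):
    """Count PASS SNVs, PASS indels/other, and non-PASS records in VCF text lines."""
    counts = {"snv": 0, "indel_or_other": 0, "not_pass": 0}
    for line in lines:
        # Skip blank lines and header lines (both '##' metadata and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        alts = fields[4].upper().split(",")
        filter_value = fields[6]

        if filter_value != "PASS":
            counts["not_pass"] += 1
        elif ref in BASES and all(alt in BASES for alt in alts):
            # Single reference base replaced by single alternate base(s).
            counts["snv"] += 1
        else:
            counts["indel_or_other"] += 1
    return counts


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "chr1\t100\t.\tA\tT\t50\tPASS\t.\n",
        "chr1\t200\t.\tAT\tA\t50\tPASS\t.\n",
        "chr1\t300\t.\tG\tC\t5\tLowQual\t.\n",
    ]
    print(summarize_vcf(example))
```

For compressed output, the same function can be fed lines from `gzip.open(path, "rt")`.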