Ready-to-use components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks
Here you can find a wide range of components for building genomics data processing pipelines based on the Apache Beam unified programming model and runnable with Google Cloud Dataflow. The current package includes tools for:
- Building batch and streaming transformation graphs for genomics data
- Working with SRA metadata annotations
- Manipulating FASTQ files
- Working with FASTA genome references
- Sequence alignment
- Various SAM/BAM data manipulations (sorting, merging, etc.)
- Variant Calling
- Exporting Variant Calling results (VCF)
Using components from this library allows you to build highly scalable, parallel, and efficient genomics data processing pipelines. Below is a high-level schema of a pipeline, built from library components, that identifies genetic variations from sequence data.
To build and run the pipelines you will need:
- Java Development Kit (JDK) version 8
- Apache Maven
The repository contains two Maven modules:
- genomics-dataflow-core - the module with the Dataflow Genomics Core Components Java source code
- giab-example - the module with a demo project that shows an example usage of genomics-dataflow-core
There are several high-level classes that can be used as the main building blocks of your pipeline. Here are some of them:
- ParseSourceCsvTransform - provides the chain of input data transformations. It includes reading the input CSV file (example), parsing, filtering, and checking for anomalies in the metadata. Returns ready-to-use key-value pairs of SampleMetaData and lists of FileWrapper
- SplitFastqIntoBatches - provides a FASTQ splitting mechanism to increase parallelism and balance the load between workers
- AlignAndSamProcessingTransform - contains the chain of genomics transformations, namely sequence alignment (FASTQ->SAM), conversion to binary format (SAM->BAM), BAM sorting, and BAM merging within a specific contig region
- VariantCallingTransform - an Apache Beam PTransform that provides the Variant Calling logic. Currently, the GATK HaplotypeCaller and Google's DeepVariant pipeline are supported.
- PrepareAndExecuteVcfToBqTransform - an Apache Beam PTransform that groups Variant Calling results (VCF) by contig region and exports them into a BigQuery table. Uses the vcf-to-bigquery transform from GCP Variant Transforms
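To illustrate the idea behind SplitFastqIntoBatches, here is a minimal plain-Java sketch of batching FASTQ records (each record is four lines: header, sequence, separator, quality) into fixed-size groups so downstream alignment work can be spread evenly across workers. The class and method names (`FastqBatcher`, `splitIntoBatches`) are hypothetical and are not the library's actual API; the real transform operates on Beam PCollections.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the batching idea behind SplitFastqIntoBatches.
public class FastqBatcher {
    /** Splits raw FASTQ lines into batches of at most batchSize records. */
    public static List<List<String>> splitIntoBatches(List<String> fastqLines, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int recordsInCurrent = 0;
        for (int i = 0; i < fastqLines.size(); i += 4) {
            // One FASTQ record = header, sequence, separator, quality.
            current.addAll(fastqLines.subList(i, Math.min(i + 4, fastqLines.size())));
            recordsInCurrent++;
            if (recordsInCurrent == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
                recordsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // final, possibly smaller, batch
        }
        return batches;
    }
}
```

Keeping batches of bounded size means no single worker receives a disproportionately large chunk of reads to align.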
By default, the minimap2 aligner is used for the sequence alignment stage. Optionally, you can use the BWA aligner by passing --aligner=BWA
to the Apache Beam PipelineOptions.
You can also add a custom aligner by extending the AlignService class.
By default, the pipeline uses the GATK HaplotypeCaller.
Optionally, you can run the pipeline with the DeepVariant variant caller by passing --variantCaller=DEEP_VARIANT
to the Apache Beam PipelineOptions.
You can also add a custom variant caller by extending the VariantCallingService class.
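In the real pipeline, flags such as --aligner and --variantCaller are parsed by Beam's PipelineOptionsFactory. The following self-contained sketch (the `ToolSelector` class and its methods are hypothetical, not the library's API) shows the underlying selection idea: mapping flag values onto pluggable tool implementations, with minimap2 and HaplotypeCaller as defaults.

```java
// Illustrative sketch of selecting alignment and variant-calling tools
// from "--name=value" style flags. Hypothetical names throughout.
public class ToolSelector {
    public enum Aligner { MINIMAP2, BWA }
    public enum VariantCaller { GATK_HAPLOTYPE_CALLER, DEEP_VARIANT }

    /** Returns the value of a "--name=value" flag, or defaultValue if absent. */
    public static String flagValue(String[] args, String name, String defaultValue) {
        String prefix = "--" + name + "=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return defaultValue;
    }

    public static Aligner chooseAligner(String[] args) {
        // minimap2 is the documented default aligner.
        return Aligner.valueOf(flagValue(args, "aligner", "MINIMAP2"));
    }

    public static VariantCaller chooseVariantCaller(String[] args) {
        // GATK HaplotypeCaller is the documented default variant caller.
        return VariantCaller.valueOf(flagValue(args, "variantCaller", "GATK_HAPLOTYPE_CALLER"));
    }
}
```

Modeling the choices as enums mirrors the extension points described above: adding a new aligner or caller means adding one implementation (by extending AlignService or VariantCallingService in the real library) plus one selectable value.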
This repository contains an example usage of the Dataflow Genomics Core Components library: a demo pipeline that batch-processes the NA12878 sample from Genome in a Bottle.
Nanostream Dataflow - a scalable, reliable, and cost-effective end-to-end pipeline for fast DNA sequence analysis using Dataflow on Google Cloud
GCP-PopGen Processing Pipeline - a repository that contains a number of Apache Beam pipeline configurations for processing different populations of genomes (e.g. Homo Sapiens, Rice, Cannabis)
The repository contains unit tests that cover all main components, and one end-to-end integration test.
For integration testing, you have to configure the TEST_BUCKET
environment variable.
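A minimal setup might look like the following; the bucket name is a placeholder, and the exact Maven goals depend on how the build binds the integration test (shown here as comments rather than commands, since they require the repository and GCP credentials).

```shell
# Hypothetical bucket name; replace with a Cloud Storage bucket you own.
export TEST_BUCKET=my-test-bucket

# Then run the tests with Maven, e.g. (exact goals depend on the build setup):
#   mvn test     # unit tests
#   mvn verify   # may also run the end-to-end integration test
echo "TEST_BUCKET=$TEST_BUCKET"
```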