Ready-to-use components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks
Here you can find a wide range of components for building genomics data processing pipelines based on the Apache Beam unified programming model and runnable with Google Cloud Dataflow. The current package includes tools for:
- Building batch and streaming transformation graphs for genomics data
- Working with SRA metadata annotations
- Manipulating FASTQ files
- Working with FASTA genome references
- Sequence alignment
- Various SAM/BAM data manipulations (sorting, merging, etc.)
- Variant Calling
- Exporting Variant Calling results (VCF)
Using components from this library allows you to build highly scalable, parallel, and efficient genomics data processing pipelines. Below is a high-level schema of a pipeline, built from library components, that identifies genetic variations from sequence data.
To build and run the pipelines you will need:
- Java Development Kit (JDK) version 8
- Apache Maven
The repository contains two Maven modules:
- genomics-dataflow-core - the module with the Dataflow Genomics Core Components Java source code
- giab-example - the module with a demo project that shows an example usage of genomics-dataflow-core
There are several high-level classes that can be used as the main building blocks of your pipeline. Here are some of them:
- ParseSourceCsvTransform - provides the chain of input data transformations. It includes reading the input CSV file (example), parsing, filtering, and checking for anomalies in the metadata. Returns ready-to-use key-value pairs of SampleMetaData and lists of FileWrapper
- SplitFastqIntoBatches - provides a FASTQ splitting mechanism to increase parallelism and balance the load between workers
- AlignAndSamProcessingTransform - contains the chain of genomics transformations, namely sequence alignment (FASTQ->SAM), conversion to binary format (SAM->BAM), BAM sorting, and BAM merging within a specific contig region
- VariantCallingTransform - an Apache Beam PTransform that provides the Variant Calling logic. Currently, the GATK HaplotypeCaller and Google's DeepVariant pipeline are supported.
- PrepareAndExecuteVcfToBqTransform - an Apache Beam PTransform that groups Variant Calling results (VCF) by contig region and exports them into a BigQuery table. Uses the vcf-to-bigquery transform from GCP Variant Transforms
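To illustrate the idea behind SplitFastqIntoBatches, here is a minimal plain-Java sketch of batching FASTQ records (each record is four lines: header, sequence, separator, quality) into fixed-size groups so downstream alignment work can be spread evenly across workers. The class and method names (`FastqBatcher`, `splitIntoBatches`) are hypothetical and are not the library's actual API; the real transform operates on Beam PCollections.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the batching idea behind SplitFastqIntoBatches.
public class FastqBatcher {
    /** Splits raw FASTQ lines into batches of at most batchSize records. */
    public static List<List<String>> splitIntoBatches(List<String> fastqLines, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int recordsInCurrent = 0;
        for (int i = 0; i < fastqLines.size(); i += 4) {
            // One FASTQ record = header, sequence, separator, quality.
            current.addAll(fastqLines.subList(i, Math.min(i + 4, fastqLines.size())));
            recordsInCurrent++;
            if (recordsInCurrent == batchSize) {
                batches.add(current);
                current = new ArrayList<>();
                recordsInCurrent = 0;
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // final, possibly smaller, batch
        }
        return batches;
    }
}
```

Keeping batches of bounded size means no single worker receives a disproportionately large chunk of reads to align.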
By default, the minimap2 aligner is used for the sequence alignment stage. Optionally, you can use the BWA aligner by passing --aligner=BWA
to the Apache Beam PipelineOptions.
You can also add a custom aligner by extending the AlignService class.
By default, the pipeline uses the GATK HaplotypeCaller.
Optionally, you can run the pipeline with the DeepVariant variant caller by passing --variantCaller=DEEP_VARIANT
to the Apache Beam PipelineOptions.
You can also add a custom variant caller by extending the VariantCallingService class.
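In the real pipeline, flags such as --aligner and --variantCaller are parsed by Beam's PipelineOptionsFactory. The following self-contained sketch (the `ToolSelector` class and its methods are hypothetical, not the library's API) shows the underlying selection idea: mapping flag values onto pluggable tool implementations, with minimap2 and HaplotypeCaller as defaults.

```java
// Illustrative sketch of selecting alignment and variant-calling tools
// from "--name=value" style flags. Hypothetical names throughout.
public class ToolSelector {
    public enum Aligner { MINIMAP2, BWA }
    public enum VariantCaller { GATK_HAPLOTYPE_CALLER, DEEP_VARIANT }

    /** Returns the value of a "--name=value" flag, or defaultValue if absent. */
    public static String flagValue(String[] args, String name, String defaultValue) {
        String prefix = "--" + name + "=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                return arg.substring(prefix.length());
            }
        }
        return defaultValue;
    }

    public static Aligner chooseAligner(String[] args) {
        // minimap2 is the documented default aligner.
        return Aligner.valueOf(flagValue(args, "aligner", "MINIMAP2"));
    }

    public static VariantCaller chooseVariantCaller(String[] args) {
        // GATK HaplotypeCaller is the documented default variant caller.
        return VariantCaller.valueOf(flagValue(args, "variantCaller", "GATK_HAPLOTYPE_CALLER"));
    }
}
```

Modeling the choices as enums mirrors the extension points described above: adding a new aligner or caller means adding one implementation (by extending AlignService or VariantCallingService in the real library) plus one selectable value.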
This repository contains an example usage of the Dataflow Genomics Core Components library: a demo pipeline that batch-processes the NA12878 sample from Genome in a Bottle.
Nanostream Dataflow - a scalable, reliable, and cost-effective end-to-end pipeline for fast DNA sequence analysis using Dataflow on Google Cloud
GCP-PopGen Processing Pipeline - a repository that contains a number of Apache Beam pipeline configurations for processing different populations of genomes (e.g. Homo Sapiens, Rice, Cannabis)
The repository contains unit tests that cover all main components, and one end-to-end integration test.
For integration testing, you have to configure the TEST_BUCKET
environment variable.
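A minimal setup might look like the following; the bucket name is a placeholder, and the exact Maven goals depend on how the build binds the integration test (shown here as comments rather than commands, since they require the repository and GCP credentials).

```shell
# Hypothetical bucket name; replace with a Cloud Storage bucket you own.
export TEST_BUCKET=my-test-bucket

# Then run the tests with Maven, e.g. (exact goals depend on the build setup):
#   mvn test     # unit tests
#   mvn verify   # may also run the end-to-end integration test
echo "TEST_BUCKET=$TEST_BUCKET"
```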