Dataflow Genomics Core Components

Ready-to-use components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks

Overview

This library provides a broad set of components for building genomics data processing pipelines based on the Apache Beam unified programming model and runnable on Google Cloud Dataflow. The current package includes tools for:

  • Building Batch and Streaming processing transformation graphs of genomics data
  • Working with SRA metadata annotations
  • Manipulating FASTQ files
  • Working with FASTA genome references
  • Sequence alignment
  • SAM/BAM data manipulations (sorting, merging, etc.)
  • Variant Calling
  • Variant Calling results (VCF) export

Efficiency

Using components from this library, you can build highly scalable, parallel, and efficient genomics data processing pipelines. The following diagram shows the overall structure of a pipeline, built from library components, that identifies genetic variations from sequence data:

[Figure: Pipeline principle schema]

Prerequisites

Structure

The repository contains two Maven modules:

High-level components

There are several high-level classes that can be used as the main building blocks for your pipeline. Here are some of them:

Sequence aligning

By default, the minimap2 aligner is used for the sequence alignment stage. Optionally, you can use the BWA aligner by passing --aligner=BWA to the Apache Beam PipelineOptions.

You can also add a custom aligner by extending the AlignService class.
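
The default-and-override behaviour of the aligner flag can be illustrated with a minimal, dependency-free sketch. In a real pipeline, Apache Beam parses the arguments itself (for example via PipelineOptionsFactory.fromArgs), so the Aligner enum and AlignerFlag class below are hypothetical names used only for illustration and are not part of this library.

```java
import java.util.Locale;

/** Hypothetical stand-in for the two aligner choices named in this README. */
enum Aligner { MINIMAP2, BWA }

/** Illustrative parser; a real pipeline would let Apache Beam read the flag. */
final class AlignerFlag {
    private static final String PREFIX = "--aligner=";

    /** Returns the aligner requested in args, or MINIMAP2 when the flag is absent. */
    static Aligner parse(String[] args) {
        for (String arg : args) {
            if (arg.startsWith(PREFIX)) {
                String value = arg.substring(PREFIX.length()).toUpperCase(Locale.ROOT);
                // valueOf throws IllegalArgumentException for unknown aligner names
                return Aligner.valueOf(value);
            }
        }
        return Aligner.MINIMAP2; // default described in the README
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```

Running the sketch with no arguments selects MINIMAP2, while passing --aligner=BWA selects BWA. Any other value is rejected with an exception rather than silently falling back to the default.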

Variant Calling

By default, the pipeline uses the GATK HaplotypeCaller. Optionally, you can run the pipeline with the DeepVariant variant caller by passing --variantCaller=DEEP_VARIANT to the Apache Beam PipelineOptions.

You can also add a custom variant caller by extending the VariantCallingService class.
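
The extension pattern might look roughly like the sketch below. The actual signature of VariantCallingService is not shown in this README, so VariantCallerSketch, MyCustomCaller, the my-caller command, and the "CUSTOM" key are all hypothetical stand-ins. Consult the library source for the real methods to override.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in; NOT the library's VariantCallingService API. */
abstract class VariantCallerSketch {
    /** Describes the command this caller would run for one BAM file (illustrative only). */
    abstract String describe(String bamPath, String referencePath);
}

/** A custom caller plugs in by subclassing the base type. */
final class MyCustomCaller extends VariantCallerSketch {
    @Override
    String describe(String bamPath, String referencePath) {
        // "my-caller" is a made-up tool name used purely for illustration
        return "my-caller --bam " + bamPath + " --ref " + referencePath;
    }
}

final class VariantCallerDemo {
    /** Looks up a caller by name, failing loudly on unknown names. */
    static VariantCallerSketch select(String name) {
        Map<String, VariantCallerSketch> callers = new HashMap<>();
        callers.put("CUSTOM", new MyCustomCaller());
        VariantCallerSketch caller = callers.get(name);
        if (caller == null) {
            throw new IllegalArgumentException("Unknown variant caller: " + name);
        }
        return caller;
    }

    public static void main(String[] args) {
        System.out.println(select("CUSTOM").describe("sample.bam", "ref.fasta"));
    }
}
```

The point of the sketch is only the shape of the design: pipeline code depends on the abstract type, and a new caller is added by subclassing it and registering the subclass under a name.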

Usage

This repository contains an example of using the Dataflow Genomics Core Components library: a demo pipeline that batch-processes the NA12878 sample from Genome in a Bottle.

Already used by

Nanostream Dataflow - a scalable, reliable, and cost-effective end-to-end pipeline for fast DNA sequence analysis using Dataflow on Google Cloud

GCP-PopGen Processing Pipeline - a repository that contains a number of Apache Beam pipeline configurations for processing different populations of genomes (e.g. Homo Sapiens, Rice, Cannabis)

Testing

The repository contains unit tests that cover all main components, plus one end-to-end integration test. To run the integration test, you must set the TEST_BUCKET environment variable.
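
As a rough sketch of how a test harness might check for this variable before running the integration test, the snippet below reads TEST_BUCKET and reports whether it is usable. The TestBucketConfig class is a hypothetical helper, not part of this repository, and the expected format of the value is not specified here.

```java
import java.util.Optional;

/** Hypothetical helper that checks whether TEST_BUCKET is configured. */
final class TestBucketConfig {
    /** Returns the trimmed value, or empty when it is missing or blank. */
    static Optional<String> resolve(String rawValue) {
        return Optional.ofNullable(rawValue)
                .map(String::trim)
                .filter(value -> !value.isEmpty());
    }

    public static void main(String[] args) {
        Optional<String> bucket = resolve(System.getenv("TEST_BUCKET"));
        if (bucket.isPresent()) {
            System.out.println("Integration test would use: " + bucket.get());
        } else {
            System.out.println("TEST_BUCKET is not set; integration test cannot run");
        }
    }
}
```

Keeping the environment lookup in main and the validation in resolve makes the check itself easy to unit test without touching the process environment.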