This is the main repository of the Flint project for Amazon Web Services. Flint is a metagenomics profiling pipeline built on top of the Apache Spark framework, designed for fast, real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map sequencing reads against a large collection of bacterial reference genomes.
Our computational framework is primarily implemented using the MapReduce model and deployed in a cluster launched with the Elastic MapReduce (EMR) service offered by AWS (Amazon Web Services). The cluster consists of multiple commodity worker machines (computational nodes); in the configuration we currently use, each worker machine has 15 GB of RAM, 8 vCPUs (each a hyperthread of an Intel Xeon core), and 100 GB of EBS disk storage. The worker nodes align the input DNA sequencing reads in parallel, each against its own partitioned shard of the reference database; once the alignment step is completed, each worker node acts as a regular Spark executor node.
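The map/reduce profiling model described above can be illustrated with a minimal sketch in plain Python (this is not the actual Flint code, and the toy k-mer "index" below is a hypothetical stand-in for a real aligner running on each worker): each shard of the reference database is searched independently in the map step, and per-genome hit counts are merged in the reduce step.

```python
# Toy sketch of the map/reduce profiling model: NOT the Flint
# implementation. Each "shard" stands in for one worker's partition of
# the reference database, and substring matching stands in for a real
# read aligner.
from collections import Counter

# Hypothetical sharded reference: genome name -> set of k-mers.
SHARDS = [
    {"genome_A": {"ACGT", "CGTA"}, "genome_B": {"TTTT"}},
    {"genome_C": {"GGGG", "ACGT"}},
]

def map_reads(shard, reads):
    """Map step: emit (genome, 1) for every read that hits the shard."""
    for read in reads:
        for genome, kmers in shard.items():
            if any(k in read for k in kmers):
                yield genome, 1

def reduce_counts(pairs):
    """Reduce step: sum the hit counts per genome."""
    counts = Counter()
    for genome, n in pairs:
        counts[genome] += n
    return counts

reads = ["AACGTT", "TTTTTT", "GGGGCC"]

# Each shard would be processed in parallel on its own worker node;
# the partial results are then merged in the reduce step.
profile = reduce_counts(
    pair for shard in SHARDS for pair in map_reads(shard, reads)
)
print(dict(profile))
```

In the real pipeline the map step runs a full aligner against each shard and the reduce step aggregates abundance profiles across the cluster, but the data flow is the same.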
Valdes, C., Stebliankin, V., and Narasimhan, G. (2019). Large Scale Microbiome Profiling in the Cloud. ISMB 2019 (in review).
How To Get Started
- Download the Code and follow the instructions on how to create an EMR cluster, setup the streaming source, and start Flint.
- Instructions, as well as a manual and reference, can be found at the project’s website.
- If you find a bug, open an issue and provide detailed steps to reproduce it reliably.
- If you have a feature request, open an issue.
- If you would like to contribute, please submit a pull request.
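As a starting point for the cluster-creation step above, the following is a hedged sketch of launching an EMR cluster with the AWS CLI that matches the worker configuration described earlier. The instance type, instance count, release label, cluster name, and key pair are all assumptions, not the project's official launch settings; consult the project's website for the exact procedure.

```shell
# Hypothetical EMR launch sketch -- values below are placeholders, not
# the project's official configuration. c4.2xlarge provides 8 vCPUs and
# 15 GiB of RAM per node, matching the worker specs described above.
aws emr create-cluster \
    --name "flint-cluster" \
    --release-label emr-5.23.0 \
    --applications Name=Spark Name=Hadoop \
    --instance-type c4.2xlarge \
    --instance-count 4 \
    --ebs-root-volume-size 100 \
    --ec2-attributes KeyName=my-key-pair \
    --use-default-roles
```

Running this requires configured AWS credentials and an existing EC2 key pair; it is shown only to illustrate the shape of the command.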
The basic requirements for the worker nodes are:
The remaining requirements are Python packages that Flint needs for a successful run; please refer to each package's documentation for installation instructions.
Contact Camilo Valdes for pull requests, bug reports, good jokes and coffee recipes.