Main repository of the Flint project for Spark and Amazon EMR.
amazon-emr
data
docs
examples/index_deployment
genome_cleaning
genome_preprocessing
hpc_indexing
modules
services
spark_vms
utilities
visualization
LICENSE
README.md
flint.py
provision_index.py

Flint

This is the main repository of the Flint project for Amazon Web Services. Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large reference collection of bacterial genomes.

Our computational framework is primarily implemented using the MapReduce model and deployed in a cluster launched with the Elastic MapReduce (EMR) service offered by Amazon Web Services (AWS). The cluster consists of multiple commodity worker machines (computational nodes); in our current configuration, each worker machine has 15 GB of RAM, 8 vCPUs (each vCPU being a hyperthread of a single Intel Xeon core), and 100 GB of EBS disk storage. The worker nodes operate in parallel, each aligning the input DNA sequencing reads to its own partitioned shard of the reference database; once the alignment step is complete, each worker node acts as a regular Spark executor node.
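The sharded alignment described above can be illustrated with a small, pure-Python toy. This is only a sketch under stated assumptions, not Flint's implementation: the shard contents, the `align_to_shard` and `profile` names, and the use of exact substring matching in place of a real aligner are all hypothetical, and the shards are processed sequentially here, whereas on a cluster each one would be handled by a separate worker.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy data: each shard maps a genome name to its sequence.
SHARDS = [
    {"genome_A": "ACGTACGTGG"},
    {"genome_B": "TTGACCAGTA"},
]


def align_to_shard(reads, shard):
    """Map step: return (read, genome) pairs for reads found in this shard.

    Exact substring matching stands in for a real aligner.
    """
    return [
        (read, genome)
        for read in reads
        for genome, sequence in shard.items()
        if read in sequence
    ]


def profile(reads, shards):
    """Reduce step: count hits per genome across all shards."""
    hits = chain.from_iterable(align_to_shard(reads, shard) for shard in shards)
    return Counter(genome for _, genome in hits)


reads = ["ACGT", "GACC", "CGTG", "AAAA"]
counts = profile(reads, SHARDS)
```

Every shard sees the full set of reads, so the per-shard results can be merged with a simple count, which is what makes the reference database easy to partition across workers.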

The current database for running Flint is version 41 from Ensembl Bacteria, but we are currently working on the latest version of RefSeq, which should be available this summer.

Publications

Valdes, Stebliankin, Narasimhan (2019), Large Scale Microbiome Profiling in the Cloud, ISMB 2019, in review.

How To Get Started

Communication

  • If you found a bug, open an issue and please provide detailed steps to reliably reproduce it.
  • If you have a feature request, open an issue.
  • If you would like to contribute, please submit a pull request.

Requirements

Flint is designed to run on Apache Spark, but the current implementation is tuned for Amazon's Elastic MapReduce (EMR) service. The basic requirements for an EMR cluster are:

The basic requirements for the worker nodes are:

Bowtie2

Bowtie2 is required for the alignment step and must be installed on all worker nodes of the Spark cluster. See the Bowtie2 manual for more information.
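One way to confirm that the aligner is available on a node is to check whether its executable is on the `PATH`. The snippet below is a minimal sketch that assumes the executable is named `bowtie2`. The `missing_tools` helper is hypothetical and not part of Flint; it would need to be run on each worker node.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Assumption: the Bowtie2 executable is called "bowtie2".
    missing = missing_tools(["bowtie2"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```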

Python Packages

The remaining requirements are Python packages that Flint needs for a successful run. Please refer to each package's documentation for installation instructions.
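A quick way to see which Python packages are still missing from an environment is to ask the import system whether each one can be located. The sketch below is illustrative only. The `missing_packages` helper is hypothetical, and the names in `REQUIRED` are placeholders from the standard library that should be replaced with the packages Flint actually requires.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Placeholder names only; substitute the packages Flint actually needs.
REQUIRED = ["json", "csv"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", missing if missing else "none")
```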

Contact

Contact Camilo Valdes for pull requests, bug reports, good jokes and coffee recipes.

Maintainers

Collaborators

License

The software in this repository is available under the MIT License. See the LICENSE file for more information.
