Introduction

The McDonnell Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. The GMS code is available at http://github.com/genome/gms.

Publication

Please read the GMS publication for additional background, concepts and terminology: Genome Modeling System: A Knowledge Management Platform for Genomics

Concepts

[Figure: GMS_Figure3.png]

Model

The central metaphor for analysis products in the GMS is the Genome Model. Each model represents one state of belief about the sequence and features of a given subject. Multiple approaches to arrive at a conclusion for the same subject will be represented as multiple models in the system, each with a different "processing profile" to describe the methods in precise computational terms.

A processing profile defines the exact software versions and parameters used to process data. The same data reprocessed with the same processing profile, inputs, and model should produce identical builds (results).

Once a model is defined, it is built. Each attempt to build a model launches a workflow on the compute cluster and adds a record of that build to the database, so that processing for the model in question can be tracked.
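To make these relationships concrete, here is a minimal Python sketch of the model/processing-profile/build bookkeeping described above. The GMS itself is implemented in Perl, so every class, field, and method name below is an illustrative assumption rather than the actual GMS schema.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ProcessingProfile:
    """Pins the exact software versions and parameters used to process data."""
    name: str
    software_versions: dict  # e.g. {"bwa": "0.7.17", "samtools": "1.9"}
    parameters: dict         # e.g. {"threads": 4}

@dataclass
class Model:
    """One state of belief about a subject, under one processing profile."""
    subject: str             # e.g. a sample identifier
    processing_profile: ProcessingProfile
    inputs: dict             # reference sequence, instrument data, etc.
    builds: list = field(default_factory=list)

    def build(self):
        """Each build attempt launches a workflow and records the attempt."""
        record = {"id": len(self.builds) + 1, "status": "scheduled"}
        self.builds.append(record)  # the database row that tracks processing
        # submit_workflow(self, record)  # hypothetical cluster submission
        return record
```

Because the profile pins versions and parameters, two builds of the same model with the same inputs should produce identical results, which is exactly the repeatability guarantee described above.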

Subject

The subject of a model determines which genome it intends to examine, much as the processing profile determines how it will be examined. The subject of a model is sometimes a particular individual, but is more often a specific sample from some individual. In cancer analysis, one model will be made for the genome of the tumor, and another for the genome of a matched normal, with a third performing the comparison between the two. The MedSeq models target the individual in general, taking other models as inputs, each with more specific subjects relating to tumor or normal DNA or RNA. Other types of models in the system have an entire cohort as a subject, analyzing the genomics of the group with regard to phenotype or clinical characteristics. For cases in which a given organism is considered a model organism for its species, the species itself is the subject, e.g. when de novo assembling a new reference genome.

A model must include a subject. The subjects of models are generally biological samples that our analysis activities are meant to characterize. A common example is DNA extracted from many cells of a solid or liquid tumor that was surgically removed from a patient in a clinical setting. Note that multiple samples may be extracted from a single patient, especially in the case of a cancer patient with both diseased and normal tissue. Each of these samples could become the subject of independent models (e.g., Reference Alignment). Some kinds of models, like Phenotype Correlation, have a group of samples as their subject. These groups of samples are described as population groups in the terminology of the Genome Modeling System.
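Continuing the sketch above, the cancer example might look like this. The sample names, profile contents, and the choice of subject for the comparison model are all invented for illustration; they are not real GMS identifiers or conventions.

```python
# Invented processing profiles (see the ProcessingProfile sketch above).
refalign = ProcessingProfile("refalign-example", {"bwa": "0.7.17"}, {"threads": 4})
somatic = ProcessingProfile("somatic-example", {"varscan": "2.4.2"}, {})

# One model per sample: tumor and matched normal from the same patient.
tumor_model = Model("H_KA-123-tumor", refalign, {"reference": "GRCh38"})
normal_model = Model("H_KA-123-normal", refalign, {"reference": "GRCh38"})

# A third model performs the comparison; its inputs are the other models.
# Using the individual as subject here is an assumption for illustration.
somatic_model = Model("H_KA-123", somatic,
                      {"tumor": tumor_model, "normal": normal_model})
```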

Inputs

A model’s inputs typically include data. Depending on the model type, models can require reference sequences, sequence data, genotypes, variants, annotation, or regions of interest. The subject of a model may limit which inputs can be assigned, ensuring, for example, that data match the subject and that annotation matches the reference sequence.
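A hedged sketch of the kind of consistency check implied here, reusing the Model class from the earlier sketch; the check itself is hypothetical, not a GMS API:

```python
def validate_inputs(model):
    """Illustrative guard: annotation must match the model's reference."""
    reference = model.inputs.get("reference")
    annotation = model.inputs.get("annotation")  # assumed shape: {"reference": ...}
    if annotation is not None and annotation.get("reference") != reference:
        raise ValueError(
            f"annotation was built against {annotation.get('reference')!r}, "
            f"but the model uses reference {reference!r}"
        )
```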

Cohorts

For analysis of a cohort, there is often one model for each member, producing conclusions about each genome in isolation, with an additional model for the entire cohort. The processing profile for the cohort describes how to take the initial models and draw further conclusions about the cohort as a unit. In many cases, the cohort-level analysis draws directly from primary data behind the original models, or their intermediate results. For instance, a pedigree-aware variant detector may go back to the original alignments from members of a family rather than merely merging variant-calling results from the individuals.
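One way to picture that composition, continuing the illustrative Python sketch (the helper and its field names are assumptions, not GMS code):

```python
def build_cohort_model(member_models, cohort_profile):
    """Hypothetical helper: the cohort-level model composes its members.

    Its workflow may reach past the members' final outputs to intermediate
    results, e.g. a pedigree-aware caller consuming the original alignments.
    """
    return Model("family-cohort", cohort_profile, {"members": member_models})
```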


Pipeline

Each sub-type of model defines a distinct analysis pipeline. The model subclass definition includes a specification for inputs and parameters to be supplied when models are created, as well as logic to construct a workflow to build results. Adding new pipelines requires writing a software module to describe the new sub-type of model. The simplest pipeline can be no more complicated than a small script, and the most complicated will have an elaborate graph of steps, each with distinct processing requirements.
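A hedged Python sketch of what such a module might declare; the real GMS defines model sub-types as Perl classes, so the names, steps, and resource fields here are illustrative only:

```python
class ReferenceAlignment(Model):
    """Hypothetical model sub-type defining its own pipeline."""
    required_inputs = ("reference", "instrument_data")

    def construct_workflow(self):
        """Return the steps to run; a real pipeline may be an elaborate
        graph, with distinct compute requirements per step."""
        return [
            {"step": "align",           "requires": {"cores": 8}},
            {"step": "merge_and_dedup", "requires": {"cores": 2}},
            {"step": "refine_variants", "requires": {"mem_gb": 16}},
        ]
```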

External Databases

The model/build metaphor extends throughout the system, such that the human reference itself is represented as a model, with each release from the Genome Reference Consortium represented as a GMS build of that model. External data sets from dbSNP, Ensembl, etc., are also tracked and versioned with the model/build process. When reference genomes or annotation data sets are updated, models of individual genomes know that their inputs are no longer current and can readily be rebuilt to reflect the latest data from the community. Annotation of genome models for individual subjects is typically produced by crossing the annotation of the reference genome with the variants found in the individual. The GMS integrates tools such as VEP (the Ensembl Variant Effect Predictor), as well as a custom annotator tuned to perform a similar task on somatic variants from cancer tumors. For pipelines processing RNA-seq data, the supplied annotation build is also used during alignment to assist in aligning across introns.
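A sketch of the staleness idea: if any input is itself a model whose latest build is newer than the one this build consumed, the downstream model is out of date. All names here are assumptions layered on the earlier sketch, not the GMS mechanism.

```python
def is_current(build, model):
    """Hypothetical check for stale inputs (e.g. a rebuilt dbSNP model).

    Assumes each build record stores which input builds it consumed,
    as {"input_builds": {"annotation": 3, ...}}.
    """
    for name, value in model.inputs.items():
        if isinstance(value, Model) and value.builds:
            latest = value.builds[-1]["id"]
            used = build.get("input_builds", {}).get(name)
            if used is not None and latest > used:
                return False  # input was rebuilt; this model should rebuild too
    return True
```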

Disk Management

Detailed analysis results are represented in files as defined by the bioinformatics community and are stored in a fashion built to scale with computational tools. For example, Illumina DNA sequence reads are stored in BAM format, as are alignments. Variant detection results are stored in VCF files with block-gzip compression, allowing querying via tabix without full decompression. The GMS includes a Disk Allocation System that stores data about each slice of disk used in the RDBMS, along with information about the owner of the data. The latter is typically a given build, or a "software result" that can be shared by builds (Box 1). Processes that require disk request an appropriate amount beforehand, and then re-size their allocation after processing completes to handle differences between expected and actual disk usage. The disk allocation system allows administrators to relocate data as required without interrupting processing.
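The allocate-then-resize pattern might look like the following. This is an illustrative Python stand-in; the GMS's actual Disk Allocation System lives in Perl and the RDBMS, and the mount path below is invented.

```python
import os

class DiskAllocation:
    """Hypothetical stand-in for a disk allocation record."""

    def __init__(self, owner, requested_kb):
        self.owner = owner            # typically a build or a software result
        self.requested_kb = requested_kb
        self.path = f"/gscmnt/example/{owner}"  # invented mount path

    def reallocate(self):
        """Re-size to actual usage once processing completes."""
        actual_kb = sum(
            os.path.getsize(os.path.join(root, name)) // 1024
            for root, _, names in os.walk(self.path)
            for name in names
        )
        self.requested_kb = actual_kb  # would be persisted to the database

# alloc = DiskAllocation(owner="build-12345", requested_kb=50_000_000)
# ... workflow writes results under alloc.path ...
# alloc.reallocate()
```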

Guides

Build

Docker

Instrument Data

Model Group

Processing Profile

GMS on compute1

Analysis Project