Home

Zachary Skidmore edited this page Jun 10, 2016 · 42 revisions

The Genome Modeling System

The Genome Institute at Washington University has developed a high-throughput, fault-tolerant analysis information management system called the Genome Modeling System (GMS), capable of executing complex, interdependent, and automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. GMS includes a full system image with software and services, expandable from one workstation to a large compute cluster.

GMS also serves as a platform for bioinformatics development, allowing a large team to collaborate on data analysis, or an individual researcher to leverage the work of others effectively within its federated data management system that also supports external collaboration. Most importantly, rather than separating ad-hoc analysis from rigorous, reproducible pipelines, the GMS provides systematic integration between the two. The GMS thus promotes versioned data tracking of ad hoc analyses while facilitating rapid development of formal pipelines.

As a demonstration of the GMS, we performed an integrated analysis of whole genome, exome, and transcriptome sequencing data from a breast cancer cell line (HCC1395) and a matched lymphoblastoid line (HCC1395 BL). The results are available for users to test the software, complete tutorials, and develop novel GMS pipeline configurations.

For a brief demonstration of the GMS, please start with the Quick Tour in a Pre-configured Virtual Machine.

The summary of results from the ClinSeq pipeline is here.

Introduction to this documentation

The GMS is a complicated system with many components. Running GMS pipelines will allow you to perform data QC; align whole genome, exome, and RNA-seq data to a reference genome; summarize the coverage achieved; detect somatic variation of multiple types (each with multiple callers); perform transcriptome assembly, expression estimation, and differential expression analysis; carry out integrated analysis of orthogonal data types; and interpret the clinical relevance of 'omic' events in a patient's tumor. Performing these analyses involves installation, automation, and integration of dozens of open source bioinformatics tools. These tools are highly heterogeneous in their implementation and level of engineering. The GMS is the glue that holds all of these pieces together. This project attempts to make it as easy as possible to install and configure all necessary tools and services. Some basic bioinformatics experience and familiarity with genomics concepts on the part of the user are assumed. However, the GMS wiki attempts to document everything you will need to know to get started.

There are several alternative installation strategies, and the right one depends on the computer hardware at your disposal. These are documented in the Installation Types Overview page. A number of tutorials are provided to introduce you to various areas of the system, including installing the system on your own hardware, installing the system on Amazon AWS, completing the demonstration tumor/normal analysis, using more advanced GMS commands, using the GMS web viewer, and importing your own data. The docs page covers many more specific topics for both users and developers who contribute to the maintenance and development of the GMS project. These docs also include detailed descriptions of six major pipelines of the GMS. Finally, an FAQ page attempts to cover the most common questions we encounter. If you find that your question remains unanswered by this documentation, or that coverage of a critical topic is lacking, please contact us.

Quick navigation:

Install: Step-by-step instructions for installing the sGMS
Docs: Technical documentation about the internals of the sGMS
Tutorials: Tutorials for running different analyses using the sGMS
FAQ: Frequently asked questions about the sGMS

Target audience

We hope that the system will be installed at small to medium genome centers, core facilities, and individual labs with some IT and bioinformatics support. We expect that large genome centers will continue to use their own infrastructure because of their existing investment in it and the corresponding momentum at these centers. Small to medium centers will need at least minimal system administration and bioinformatics support. Bioinformatics and analysis are complex domains and will continue to require bioinformaticists and analysts for the foreseeable future. Similarly, deploying a large-scale GMS installation will require IT support. That being said, we have documented several cloud installation options for the GMS, and we hope this will substantially reduce the IT support burden for some of our target users.

Acknowledgements:

The development of the Genome Modeling System was funded by an NHGRI Large Scale Sequencing and Analysis Center grant (U54 HG003079) to Richard K Wilson. Additional funding to make this system usable by the community was provided by the NHGRI Genome Sequencing Informatics Tools (GS-IT) Program (U01 HG006517) to David J Dooling (year 1) and Li Ding (years 1-4).

Contributions:

The GMS is the result of the efforts of a dedicated group of personnel over a number of years. For a fairly detailed list of contributions, please see the [contributions](https://github.com/genome/gms/wiki/Contributions) page.