ginkgobioworks/gen

Gen

Gen is a version control system for genetic sequences. It efficiently stores genome-length sequences and sequence variations, with native support for polyploid genomes and pooled genotypes. Each project is organized into a repository, where collections of sequences and associated data are stored and tracked over time. Within a repository, branches can be created to explore different modifications or variations without affecting the main project. These branches can later be merged to integrate results from different experiments or collaborators.

The gen client can import standard sequence file formats from sources like NCBI and genetic design tools. Internally, sequences are stored as pangenomic molecules that represent not just a single strain, cultivar, or cell line, but also any engineered or naturally derived variants for use in laboratory experiments. Gen molecules take the form of a graph structure, as shown in the figure below. Each molecule is made up of a network of nodes that represent sequence fragments, and edges that define how sequence fragments are connected. Multiple molecules are organized into collections that could represent the different chromosomes in a cell or fragments in a reaction mixture. Molecules generally start out as a single node that holds a reference sequence, and new edges and nodes are added for every sequence variant that is designed or observed. To reconstitute a linear sequence, the client walks from node to node along a defined path.

<figure 1>
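The graph model described above can be sketched in a few lines of Rust. This is a minimal illustration, not gen's actual schema or API: the `Edge`, `Molecule`, and `Path` types, their fields, and the `linearize` and `example_molecule` functions are all hypothetical names chosen for this example. It assumes that an edge leaves one node at a coordinate and enters another node at a coordinate, so that a molecule can start out as a single reference node and gain a variant by adding one node and two edges.

```rust
/// Directed edge (hypothetical layout): reading leaves `from_node` just before
/// `from_pos` and continues in `to_node` at `to_pos`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Edge {
    pub from_node: usize,
    pub from_pos: usize,
    pub to_node: usize,
    pub to_pos: usize,
}

/// A molecule: nodes hold sequence fragments, edges connect them.
#[derive(Debug, Default)]
pub struct Molecule {
    pub nodes: Vec<String>,
    pub edges: Vec<Edge>,
}

/// A path: start at position 0 of `start_node`, follow `edges` (indices into
/// `Molecule::edges`) in order, and stop at `end_pos` in the last node reached.
#[derive(Debug, Clone)]
pub struct Path {
    pub start_node: usize,
    pub edges: Vec<usize>,
    pub end_pos: usize,
}

impl Molecule {
    pub fn add_node(&mut self, sequence: &str) -> usize {
        self.nodes.push(sequence.to_string());
        self.nodes.len() - 1
    }

    pub fn add_edge(&mut self, edge: Edge) -> usize {
        self.edges.push(edge);
        self.edges.len() - 1
    }

    /// Walk the path and concatenate the fragments it covers.
    /// Returns `None` if the path is inconsistent with the graph.
    pub fn linearize(&self, path: &Path) -> Option<String> {
        let mut out = String::new();
        let (mut node, mut pos) = (path.start_node, 0usize);
        for &edge_id in &path.edges {
            let edge = self.edges.get(edge_id)?;
            if edge.from_node != node {
                return None; // the edge does not leave the node we are in
            }
            out.push_str(self.nodes.get(node)?.get(pos..edge.from_pos)?);
            node = edge.to_node;
            pos = edge.to_pos;
        }
        out.push_str(self.nodes.get(node)?.get(pos..path.end_pos)?);
        Some(out)
    }
}

/// Reference node plus one variant that replaces "GA" (positions 3..5) with "TT".
pub fn example_molecule() -> (Molecule, Path, Path) {
    let mut m = Molecule::default();
    let reference = m.add_node("ATCGATCG");
    let variant = m.add_node("TT");
    let into = m.add_edge(Edge { from_node: reference, from_pos: 3, to_node: variant, to_pos: 0 });
    let back = m.add_edge(Edge { from_node: variant, from_pos: 2, to_node: reference, to_pos: 5 });
    let reference_path = Path { start_node: reference, edges: vec![], end_pos: 8 };
    let variant_path = Path { start_node: reference, edges: vec![into, back], end_pos: 8 };
    (m, reference_path, variant_path)
}
```

In this sketch the reference sequence is never copied or split: the variant path reuses the reference node on both sides of the substitution, which is the property that lets many variants share one stored sequence.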

Imported feature annotations can be propagated from path to path in a sequence-agnostic way that relies on coordinate translation. Paths can also be compared to one another to detect features that are common or different between sets of paths, which can be used to analyze experimental data. A sample object represents the subset of the possible paths and edges that is actually present in an experimental sample. A value between 0 and 1 is assigned to each edge and path to represent the probability that an edge or path is observed. These numbers can be derived from sequencing results, or set by the user to represent an isolate or cloning reaction. This allows a user to focus on distinguishing features of a molecule by masking out irrelevant edges. The figure below demonstrates how this can be used to represent a polyploid genome obtained through cross-breeding. Like paths, samples can be compared to one another to detect differences and common features.
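One way to picture the coordinate translation mentioned above is to describe each path as a list of node ranges and map a position through the node it falls on. The sketch below is an assumption about how such a translation could work, not gen's implementation; `Segment`, `path_to_node`, `node_to_path`, `translate`, and `example_paths` are hypothetical names. No sequence comparison is involved, only coordinates.

```rust
/// One stretch of a path: the half-open range `start..end` of node `node`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Segment {
    pub node: usize,
    pub start: usize,
    pub end: usize,
}

/// Map an offset along a path to (node, position within that node).
pub fn path_to_node(path: &[Segment], offset: usize) -> Option<(usize, usize)> {
    let mut consumed = 0;
    for seg in path {
        let len = seg.end.saturating_sub(seg.start);
        if offset < consumed + len {
            return Some((seg.node, seg.start + (offset - consumed)));
        }
        consumed += len;
    }
    None // offset lies beyond the end of the path
}

/// Map (node, position) back to an offset along a path, if the path covers it.
pub fn node_to_path(path: &[Segment], node: usize, pos: usize) -> Option<usize> {
    let mut consumed = 0;
    for seg in path {
        if seg.node == node && (seg.start..seg.end).contains(&pos) {
            return Some(consumed + (pos - seg.start));
        }
        consumed += seg.end.saturating_sub(seg.start);
    }
    None // the target path does not include this part of the node
}

/// Translate an offset on `from` into the matching offset on `to`.
pub fn translate(from: &[Segment], to: &[Segment], offset: usize) -> Option<usize> {
    let (node, pos) = path_to_node(from, offset)?;
    node_to_path(to, node, pos)
}

/// Reference path over an 8-base node 0, and a variant path that replaces
/// positions 3..5 of node 0 with a 4-base node 1.
pub fn example_paths() -> (Vec<Segment>, Vec<Segment>) {
    let reference = vec![Segment { node: 0, start: 0, end: 8 }];
    let variant = vec![
        Segment { node: 0, start: 0, end: 3 },
        Segment { node: 1, start: 0, end: 4 },
        Segment { node: 0, start: 5, end: 8 },
    ];
    (reference, variant)
}
```

With these example paths, a feature at reference offset 6 lands at offset 8 on the variant path because the inserted node is two bases longer than the range it replaces, while a feature at reference offset 3 has no counterpart on the variant path and translates to `None`.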

<figure 2>
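The sample concept described above can be sketched as a map from edges to probabilities. This is an illustrative sketch only: the `Sample` type, its methods, and the `compare` function are hypothetical and do not reflect gen's real data model. Masking is modeled here as a simple threshold on the edge probability, which is an assumption made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// A sample assigns each edge (by hypothetical numeric id) a probability of
/// being observed, between 0 and 1 inclusive.
#[derive(Debug, Default)]
pub struct Sample {
    edge_probability: BTreeMap<usize, f64>,
}

impl Sample {
    /// Record the probability for one edge, rejecting values outside 0..=1.
    pub fn set_edge(&mut self, edge_id: usize, probability: f64) -> Result<(), String> {
        if !(0.0..=1.0).contains(&probability) {
            return Err(format!("probability {probability} is outside 0..=1"));
        }
        self.edge_probability.insert(edge_id, probability);
        Ok(())
    }

    /// Edges that remain after masking out everything below `min_probability`.
    pub fn visible_edges(&self, min_probability: f64) -> BTreeSet<usize> {
        self.edge_probability
            .iter()
            .filter(|(_, p)| **p >= min_probability)
            .map(|(id, _)| *id)
            .collect()
    }
}

/// Compare two samples: returns (edges visible in both, edges visible in only one).
pub fn compare(a: &Sample, b: &Sample, min_probability: f64) -> (BTreeSet<usize>, BTreeSet<usize>) {
    let (ea, eb) = (a.visible_edges(min_probability), b.visible_edges(min_probability));
    let common = ea.intersection(&eb).copied().collect();
    let different = ea.symmetric_difference(&eb).copied().collect();
    (common, different)
}
```

A probability of 1 on every edge of a single path would correspond to an isolate, while fractional values could stand in for allele frequencies estimated from sequencing data.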

Installing from Source

Make sure you have a Rust compiler installed on your system. You can install the Rust toolset using the rustup installer.

  1. Clone the source with git:

    git clone https://github.com/ginkgobioworks/gen.git
    cd gen
  2. Compile the gen package and its dependencies:

    cargo build
    
  3. You can find the gen executable in ./target/debug/ or execute it via cargo:

    cargo run -- <arguments>
    

Usage

Starting a new repository

    gen --db <file> init

    gen --db <file> import --fasta <file> --name <string>

Cloning an existing repository

Recording sequence changes

Sequence variants observed through NGS can be imported into a gen repository via a standard VCF file obtained from variant callers like Freebayes, GATK, or DeepVariant. [...]

Associating numerical data with paths and edges

Commits and merges
