Feature annotation locations in RDF

Introduction

This grew out of the combination of groups discussing:

In all these cases, and in related tasks such as converting GTF, GFF or DDBJ/GenBank files to RDF, we need to describe feature annotations and their locations in RDF.

We are using GitHub both for the formal ontology (https://github.com/JervenBolleman/FALDO) and to write up a paper (https://github.com/JervenBolleman/FALDO-paper), using LaTeX and a journal template. The paper has now been submitted, and a preprint is available here:

FALDO: A semantic standard for describing the location of nucleotide and protein feature annotation. J. Bolleman, C.J. Mungall, F. Strozzi, J. Baran, M. Dumontier, R.J.P. Bonnal, R. Buels, R. Hoehndorf, T. Fujisawa, T. Katayama, P.J.A. Cock. bioRxiv, doi: 10.1101/002121

Update: It was decided to put the stable release of FALDO at http://biohackathon.org/resource/faldo (Nov 2012).

Proposal

For the formal ontology, see https://github.com/JervenBolleman/FALDO.

Simple continuous feature locations can be described with start and end co-ordinates (typically integers) giving a range or sub-region on a given (strand of a) reference sequence. This corresponds to one line in the GFF and GTF file formats. Often features are made up of several of these sub-regions, which corresponds to multiple lines in GFF and GTF with a common identifier, and to a join or order location in DDBJ/GenBank/EMBL format.

Therefore a feature in RDF can have a single Region object, or a list (where the order is known) or a bag (where the order is unknown) of Region objects. Each location has a start and end Position, which will have a co-ordinate (a literal integer for an ExactlyKnownPosition) and a Reference. Subclasses of Position are used to record the strand information (e.g. forward strand, or strand-less).
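
The model described above can be sketched with plain Python dataclasses. This is only an illustration of the structure, not the ontology itself: the class and field names (Strand, Position, Region, Feature, make_region) and the reference identifier "chr1" are assumptions that follow the wording on this page, and the formal ontology may use different terms.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Strand(Enum):
    """Strand of a position (hypothetical labels, not ontology terms)."""
    FORWARD = "forward"
    REVERSE = "reverse"
    NONE = "strand-less"


@dataclass(frozen=True)
class Position:
    """An exactly known position: an integer co-ordinate on a reference."""
    coordinate: int                  # one-based
    reference: str                   # identifier of the reference sequence
    strand: Strand = Strand.FORWARD


@dataclass(frozen=True)
class Region:
    """A continuous sub-region bounded by a start and an end Position."""
    start: Position
    end: Position


@dataclass
class Feature:
    """A feature with one or more Regions.

    ordered=True models a list (order known), ordered=False a bag.
    """
    name: str
    regions: List[Region] = field(default_factory=list)
    ordered: bool = True


def make_region(reference: str, start: int, end: int,
                strand: Strand = Strand.FORWARD) -> Region:
    """Build a Region whose start and end share one reference and strand."""
    return Region(Position(start, reference, strand),
                  Position(end, reference, strand))


# A two-part feature on a made-up reference called "chr1".
gene = Feature("example_gene", [make_region("chr1", 100, 200),
                                make_region("chr1", 300, 400)])
```

Because each Position carries its own reference, nothing in this sketch forces the start and end of a Region onto the same reference sequence.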

Although not needed for mapping GFF/GTF or DDBJ/GenBank/EMBL features, this does permit the use of a Location whose start and end refer to different Reference sequences.

As in GFF/GTF or DDBJ/GenBank/EMBL features, start and end co-ordinates will be one-based, and for nucleotide sequences given with respect to the forward strand (i.e. 1 <= start <= end <= length of reference).
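
The co-ordinate convention can be checked mechanically. The following is a minimal sketch; the function names check_coordinates and region_length are made up for this example and are not part of any existing tool.

```python
def check_coordinates(start: int, end: int, reference_length: int) -> None:
    """Raise ValueError unless 1 <= start <= end <= reference_length."""
    if not (1 <= start <= end <= reference_length):
        raise ValueError(
            f"expected 1 <= start <= end <= {reference_length}, "
            f"got start={start}, end={end}"
        )


def region_length(start: int, end: int) -> int:
    """Length of a one-based, inclusive range: both end bases are counted."""
    return end - start + 1


check_coordinates(10, 20, 1000)     # passes silently
print(region_length(10, 20))        # 11 bases, not 10
```

Note that with one-based inclusive counting a single base has start equal to end and a length of one.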

DDBJ/GenBank/EMBL to RDF

INSDC feature table locations using 'join' should give an ordered list of Regions, whereas the rarely used 'order' location regions are unordered, and should give a bag of Regions.
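
As a rough sketch of this mapping, the following Python parses only the simplest location strings (plain ranges, optionally wrapped in complement, inside an optional join or order) and reports whether the resulting Regions form an ordered list or a bag. The function names and the tuple layout are assumptions for illustration; fuzzy positions, nested operators such as complement(join(...)), and remote references are deliberately not handled and raise an error.

```python
import re
from typing import List, Tuple

# (start, end, strand): strand is +1 for forward, -1 for complement(...)
Span = Tuple[int, int, int]


def _parse_span(part: str) -> Span:
    """Parse 'a..b' or 'complement(a..b)' into a (start, end, strand) tuple."""
    strand = 1
    if part.startswith("complement(") and part.endswith(")"):
        strand = -1
        part = part[len("complement("):-1]
    match = re.fullmatch(r"(\d+)\.\.(\d+)", part)
    if match is None:
        raise ValueError(f"unsupported location part: {part!r}")
    return int(match.group(1)), int(match.group(2)), strand


def parse_location(location: str) -> Tuple[str, List[Span]]:
    """Return (kind, spans), where kind is 'join', 'order' or 'single'.

    'join' and 'single' should be treated as an ordered list of Regions,
    'order' as an unordered bag.
    """
    text = "".join(location.split())    # drop line breaks and spaces
    kind = "single"
    for operator in ("join", "order"):
        if text.startswith(operator + "(") and text.endswith(")"):
            kind = operator
            text = text[len(operator) + 1:-1]
            break
    return kind, [_parse_span(part) for part in text.split(",")]


kind, spans = parse_location("join(complement(69611..69724),139856..140650)")
print(kind, spans)
```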

Separately from the location issue itself, it would be sensible to infer links between gene/mRNA/CDS features - also required for DDBJ/GenBank/EMBL to GFF3.

GFF to RDF

Todo - link to Rob H's tool here.

Examples

We will now illustrate several examples, including pathological and corner cases, showing their GFF and/or DDBJ/GenBank/EMBL location string followed by a conversion to RDF.

Simple genes with exons

In DDBJ/GenBank/EMBL, gene and CDS features are described using a join, combining what are typically exons into a single feature (i.e. no explicit parent/child relationship).

Features spanning the origin of a circular genome

Consider a gene on the forward strand of a circular genome of length 1000bp, running from start 900bp through the origin to end at 200bp. With one-based inclusive co-ordinates this gives a feature of length 301bp (101 bases from 900 to 1000, plus 200 bases from 1 to 200). We could just use one Region with appropriate start, end etc. However, this would mean end < start, which is perhaps undesirable.

We could just use the GFF3-style 'overflow' end co-ordinate, i.e. the reference length plus the end value, here 1000bp + 200bp giving an apparent end value of 1200bp. However, this would mean reference length < end.

We could use the DDBJ/GenBank/EMBL approach and split the location into two, join(900..1000,1..200).
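
The three options can be compared with a little arithmetic. The sketch below uses hypothetical helper names and assumes a forward-strand feature with one-based inclusive co-ordinates; it does not settle which representation is preferable.

```python
from typing import List, Tuple


def wrapped_length(start: int, end: int, reference_length: int) -> int:
    """Feature length on a circular reference, allowing end < start."""
    if start <= end:
        return end - start + 1
    # Bases from start to the last base, plus bases from 1 to end.
    return (reference_length - start + 1) + end


def overflow_end(start: int, end: int, reference_length: int) -> int:
    """GFF3-style end: add the reference length when the feature wraps."""
    return end if start <= end else end + reference_length


def split_at_origin(start: int, end: int,
                    reference_length: int) -> List[Tuple[int, int]]:
    """INSDC-style split into two ranges when the feature wraps."""
    if start <= end:
        return [(start, end)]
    return [(start, reference_length), (1, end)]


print(wrapped_length(900, 200, 1000))    # 301
print(overflow_end(900, 200, 1000))      # 1200
print(split_at_origin(900, 200, 1000))   # [(900, 1000), (1, 200)]
```

All three encodings describe the same 301 bases: 1200 - 900 + 1 for the overflow form, and 101 + 200 for the split form.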

TODO - discussion of which is considered preferable. Is there a suitable ontology term which might be included to make this wrapping explicit?

Trans-spliced genes

In trans-splicing, particularly common in tRNA, regions of mRNA from very different regions of the genome are spliced together. This allows a number of atypical situations to occur:

Unusually ordered exons

Example from plant tRNA to follow

Mixed strand genes

Here is a GenBank example from NC_000932, the Arabidopsis thaliana chloroplast:

 gene            join(complement(69611..69724),139856..140650)
                 /gene="rps12"
                 /locus_tag="ArthCp047"
                 /trans_splicing
                 /db_xref="GeneID:844801"
 CDS             join(complement(69611..69724),139856..140087,
                 140625..140650)
                 /gene="rps12"
                 /locus_tag="ArthCp047"
                 /trans_splicing
                 /note="trans-spliced"
                 /codon_start=1
                 /transl_table=11
                 /product="ribosomal protein S12"
                 /protein_id="NP_051038.1"
                 /db_xref="GI:7525057"
                 /db_xref="GeneID:844801"
                 /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
                 PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
                 DAVGVKDRQQGRSKYGVKKPK"
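
As a quick sanity check on this record, the sketch below copies the three CDS ranges and the translation by hand from the entry above (the tuple layout and variable names are assumptions for illustration) and confirms that the spliced length matches the protein length even though the parts lie on different strands.

```python
# (start, end, strand) copied from the CDS location; -1 marks complement(...).
cds_spans = [
    (69611, 69724, -1),
    (139856, 140087, +1),
    (140625, 140650, +1),
]

translation = (
    "MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK"
    "PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL"
    "DAVGVKDRQQGRSKYGVKKPK"
)

# One-based inclusive ranges: each part contributes end - start + 1 bases.
spliced_length = sum(end - start + 1 for start, end, _ in cds_spans)
codons, remainder = divmod(spliced_length, 3)
strands = {strand for _, _, strand in cds_spans}

print(spliced_length)    # 114 + 232 + 26 = 372 bases
print(codons, remainder)
print(len(translation))  # one fewer than the codon count (stop codon)
print(sorted(strands))   # both -1 and 1 occur: a mixed-strand feature
```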

Mixed chromosome genes

Example from plant tRNA to follow.

See also the example below of a gene split between two contigs of a draft genome assembly.

Gene (or transcript) split between two contigs

When working with a draft genome assembly, it is quite possible for a region of low coverage, a difficult-to-assemble repeat region, or some other such problem to prevent the complete assembly of the middle of a gene, yet leave recognisable fragments at the ends of two contigs. If this is identified during the annotation (or from RNA-Seq mapping), this could be grounds to scaffold those contigs together.

Same gene on different assembly versions

When working with a new genome assembly, it is common to go through a series of assembly and scaffolding refinements - where possible automatically lifting earlier annotation to the new reference. It would be reasonable in this case for an RDF triple store to contain both the old triples describing a given gene's location on the old assembly, and new triples describing the revised location against the new assembly. Note that the reference sequences from the different assemblies must have different identifiers.

SNPs and other mutations

Simple SNP

Give an example, perhaps from dbSNP referencing a particular build of the human genome, of a single base mutation. In this case the Region's start and end would be the same Position (since we use one-based counting).

Insertion site

For an insertion site between bases 100 and 101, in DDBJ/GenBank/EMBL we would write 100^101 for the location. In RDF we suggest the Region's start be "after 100" and the end be "before 101" (add the class names here later).
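
The difference between a single-base location and an insertion site can be sketched as follows. The Bound class and its "exact", "after" and "before" labels are placeholders invented for this example, since the class names have not been chosen yet; the parser accepts only the simple n^n+1 form and ignores the circular case where the site spans the origin.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Bound:
    """One end of a Region (hypothetical stand-in for a Position class)."""
    kind: str          # "exact", "after" or "before"
    coordinate: int    # one-based


def single_base_bounds(position: int) -> Tuple[Bound, Bound]:
    """A single base (e.g. a SNP): start and end are the same position."""
    bound = Bound("exact", position)
    return bound, bound


def insertion_site_bounds(left: int) -> Tuple[Bound, Bound]:
    """A site between bases left and left + 1: after left, before left + 1."""
    return Bound("after", left), Bound("before", left + 1)


def parse_between(location: str) -> Tuple[Bound, Bound]:
    """Parse an INSDC 'n^m' location where m must equal n + 1."""
    left, right = (int(value) for value in location.split("^"))
    if right != left + 1:
        raise ValueError(f"expected adjacent bases, got {location!r}")
    return insertion_site_bounds(left)


print(parse_between("100^101"))
print(single_base_bounds(100))
```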

Fuzzy locations

Partial genes

Examples with a before/after position, for example at the very end of an incomplete contig.

Multiple possible start codons

It is not always easy to predict which start codon is used, and in fact sometimes a gene can be expressed in multiple forms by using different start codons. In DDBJ/GenBank/EMBL this used to be expressed with a one-of location (this seems to have been withdrawn; probably such cases are now annotated as two genes?).

Restriction Digest site

In some/all cases you would want to use two features/locations for the recognition site and the cutting site.

Some recognition sites will be strand neutral (e.g. palindromic recognition sequences), others will be strand specific.

Simple blunt end cutting probably just needs a location as in the insertion site example above. For overhanging cut sites, perhaps two locations are needed - one for each strand?

Annotated protein

The examples above have been focused on genomic annotations, but one might equally well want to annotate a protein sequence, for example to mark a signal peptide or a PFAM domain.

Group

Discussion included the following people:

Jerven Bolleman
Peter Cock
Toshiaki Katayama
Rob Buels
Robert Hoehndorf
Raoul Bonnal (INGM, BioRuby) Linkedin: http://it.linkedin.com/in/raoulbonnal
Michel Dumontier
Takatomo Fujisawa
Francesco Strozzi
Joachim Baran
