Feature annotation locations in RDF
This grew out of the combination of groups discussing:
- Protein Domain RDF
- Genome-RDF
- Genome/Human gene annotation RDF
- Cufflinks RDF
- GFF3 to RDF/OWL converter
In all these cases, and in related tasks like converting GTF, GFF or DDBJ/GenBank to RDF, we need to describe feature annotations and their locations in RDF.
We are using GitHub for both the formal ontology (https://github.com/JervenBolleman/FALDO) and the paper write-up (https://github.com/JervenBolleman/FALDO-paper), which uses LaTeX and a journal template. The paper has now been submitted, and a preprint is available here:
Update: It was decided to put the stable release of the FALDO at http://biohackathon.org/resource/faldo (Nov 2012).
For the formal ontology, see https://github.com/JervenBolleman/FALDO.
Simple continuous feature locations can be described with start and end co-ordinates (typically integers) giving a range or sub-region on a given (strand of a) reference sequence. This corresponds to one line in the GFF and GTF file formats. Often features are made up of several of these sub-regions, which corresponds to multiple lines in GFF and GTF with a common identifier, and to a join or order location in DDBJ/GenBank/EMBL format.
Therefore a feature in RDF can have a single Region object, or a list (where the order is known) or a bag (where the order is unknown) of Region objects. Each location has a start and end Position, which will have a co-ordinate (a literal integer for an ExactlyKnownPosition) and a Reference. Subclasses of Position are used to record the strand information (e.g. forward strand, or strand-less).
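The model described above can be sketched in plain Python before committing to any RDF vocabulary. This is only an illustration under stated assumptions: the class and field names are ours, the strand is held as a field rather than as a Position subclass, and the reference name "chr1" and the gene are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Position:
    """A one-based coordinate on a named reference sequence."""
    coordinate: int
    reference: str
    strand: str = "+"  # "+", "-" or "." (strand-less); a field here, a subclass in the text


@dataclass(frozen=True)
class Region:
    """A continuous range with its own start and end Position."""
    start: Position
    end: Position


@dataclass
class Feature:
    """A feature located by one Region, or by several."""
    name: str
    regions: List[Region]
    ordered: bool = True  # True: list (order known); False: bag (order unknown)


def describe(feature: Feature) -> List[Tuple[str, int, str, int]]:
    """Flatten a feature to (start reference, start, end reference, end) rows."""
    return [
        (r.start.reference, r.start.coordinate, r.end.reference, r.end.coordinate)
        for r in feature.regions
    ]


# A hypothetical two-exon gene on a made-up reference called "chr1".
gene = Feature(
    name="exampleGene",
    regions=[
        Region(Position(100, "chr1"), Position(200, "chr1")),
        Region(Position(300, "chr1"), Position(450, "chr1")),
    ],
)
rows = describe(gene)
```

Because each Position carries its own reference, nothing in this sketch forces the start and end of a Region onto the same sequence.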
Although not needed for mapping GFF/GTF or DDBJ/GenBank/EMBL features, this does permit the use of a Location whose start and end refer to different Reference sequences.
As in GFF/GTF or DDBJ/GenBank/EMBL features, start and end co-ordinates will be one-based, and for nucleotide sequences given with respect to the forward strand (i.e. 1 <= start <= end <= length of reference).
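The coordinate rule above is easy to check mechanically. A minimal sketch, with function names of our own choosing, that validates a simple region and computes its length under one-based inclusive counting:

```python
def check_region(start: int, end: int, reference_length: int) -> None:
    """Raise ValueError unless 1 <= start <= end <= reference_length."""
    if not 1 <= start <= end <= reference_length:
        raise ValueError(
            f"invalid region {start}..{end} on a reference of length {reference_length}"
        )


def region_length(start: int, end: int) -> int:
    """Number of bases covered by start..end when both ends are included."""
    return end - start + 1
```

Under this convention a single-base feature has start equal to end and a length of 1.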
INSDC feature table locations using 'join' should give an ordered list of Regions, whereas the rarely used 'order' location regions are unordered, and should give a bag of Regions.
Separately from the location issue itself, it would be sensible to infer links between gene/mRNA/CDS features - also required for DDBJ/GenBank/EMBL to GFF3.
Todo - link to Rob H's tool here.
We will now illustrate several examples, including pathological and corner cases showing their GFF and/or DDBJ/GenBank/EMBL location string, followed by a conversion to RDF.
In DDBJ/GenBank/EMBL genes and CDS features are described using a join, combining what are typically exons into a single feature (i.e. no explicit parent/child relationship).
Consider a gene on the forward strand of a circular genome of length 1000bp, running from a start at 900bp through the origin to an end at 200bp. With one-based inclusive co-ordinates this gives a feature of length 301bp (101 bases from 900 to 1000, plus 200 bases from 1 to 200). We could just use one Region with the appropriate start, end etc. However, this would mean end < start, which is perhaps undesirable.
We could just use the GFF3 style 'overflow' end coordinate, i.e. use the reference length plus the end value, here 1000bp + 200bp giving an apparent end value of 1200bp. However, this means reference length < end.
We could use the DDBJ/GenBank/EMBL approach and split the location into two, join(900..1000,1..200).
TODO - discussion of which is considered preferable. Is there a suitable ontology term which might be included to make this wrapping explicit?
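To compare the three options for the origin-spanning gene, the arithmetic can be sketched as below. The function names are ours, and the code assumes a forward-strand feature with one-based inclusive coordinates; it does not take a position on which representation is preferable.

```python
from typing import List, Tuple


def wrapped_length(start: int, end: int, ref_len: int) -> int:
    """Feature length on a circular reference, allowing end < start."""
    if start <= end:
        return end - start + 1
    # Bases from start up to the origin, plus bases from 1 up to end.
    return (ref_len - start + 1) + end


def overflow_end(start: int, end: int, ref_len: int) -> int:
    """GFF3-style end: add the reference length when the origin is crossed."""
    return end if start <= end else ref_len + end


def split_at_origin(start: int, end: int, ref_len: int) -> List[Tuple[int, int]]:
    """INSDC-style split into sub-regions that each satisfy start <= end."""
    if start <= end:
        return [(start, end)]
    return [(start, ref_len), (1, end)]


length = wrapped_length(900, 200, 1000)
gff3_end = overflow_end(900, 200, 1000)
parts = split_at_origin(900, 200, 1000)
```

For the example in the text this gives a length of 301, an overflow end of 1200, and the two parts 900..1000 and 1..200, so all three representations describe the same 301 bases.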
In trans-splicing, particularly common in tRNA, regions of mRNA from very different regions of the genome are spliced together. This allows a number of atypical situations to occur:
Example from plant tRNA to follow
Here is a GenBank example from NC_000932, the Arabidopsis thaliana chloroplast:
     gene            join(complement(69611..69724),139856..140650)
                     /gene="rps12"
                     /locus_tag="ArthCp047"
                     /trans_splicing
                     /db_xref="GeneID:844801"
     CDS             join(complement(69611..69724),139856..140087,
                     140625..140650)
                     /gene="rps12"
                     /locus_tag="ArthCp047"
                     /trans_splicing
                     /note="trans-spliced"
                     /codon_start=1
                     /transl_table=11
                     /product="ribosomal protein S12"
                     /protein_id="NP_051038.1"
                     /db_xref="GI:7525057"
                     /db_xref="GeneID:844801"
                     /translation="MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKK
                     PNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTL
                     DAVGVKDRQQGRSKYGVKKPK"
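As a first step towards converting such a record, the location string can be broken into per-region start, end and strand values. The following is a deliberately small sketch rather than a full INSDC location parser: the function name is ours, and it only handles plain ranges, complement(...) around a single range, and one outer join(...) or order(...). Anything else raises an error.

```python
import re
from typing import List, Tuple

# One sub-location: a plain range, optionally wrapped in complement(...).
_PART = re.compile(r"(complement\()?(\d+)\.\.(\d+)(\))?")


def parse_location(location: str) -> Tuple[bool, List[Tuple[int, int, str]]]:
    """Return (ordered, [(start, end, strand), ...]) for a simple location.

    ordered is True for join(...) and for a single range (list of Regions),
    and False for order(...) (bag of Regions).
    """
    text = "".join(location.split())  # drop spaces and line wraps
    ordered = True
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    elif text.startswith("order(") and text.endswith(")"):
        text = text[len("order("):-1]
        ordered = False
    regions = []
    for part in text.split(","):
        match = _PART.fullmatch(part)
        # Reject unknown syntax and unbalanced complement( ... ) brackets.
        if match is None or bool(match.group(1)) != bool(match.group(4)):
            raise ValueError(f"unsupported location part: {part!r}")
        strand = "-" if match.group(1) else "+"
        regions.append((int(match.group(2)), int(match.group(3)), strand))
    return ordered, regions


cds = parse_location(
    "join(complement(69611..69724),139856..140087,\n140625..140650)"
)
```

For the rps12 CDS above this yields an ordered list of three regions, the first on the reverse strand and the other two on the forward strand, which is exactly the mixed-strand case that trans-splicing makes possible.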
Example from plant tRNA to follow.
See also the example below of a gene split between two contigs of a draft genome assembly.
When working with a draft genome assembly, it is quite possible for a region of low coverage or difficult to assemble repeat region or other such problem to prevent the complete assembly of the middle of a gene, yet leave recognisable fragments at the ends of two contigs. If this is identified during the annotation (or from RNA Seq mapping), this could be grounds to scaffold those contigs together.
When working with a new genome assembly, it is common to go through a series of assembly and scaffolding refinements - where possible automatically lifting earlier annotation to the new reference. It would be reasonable in this case for an RDF triple store to contain both the old triples describing a given gene's location on the old assembly, and new triples describing the revised location against the new assembly. Note that the reference sequences from the different assemblies must have different identifiers.
Give an example, perhaps from dbSNP referencing a particular build of the human genome, of a single base mutation. In this case the Region's start and end would be the same Position (since we use one-based counting).
For an insertion site between bases 100 and 101, in DDBJ/GenBank/EMBL we would write 100^101 for the location. In RDF we suggest the Region's start be "after 100" and the end be "before 101" (add the class names here later).
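The suggested mapping for an insertion site can be sketched as follows. Since the class names are still to be decided, the "after" and "before" labels below are placeholders of our own, as is the function name; the sketch only accepts two adjacent bases and does not handle a site that spans the origin of a circular sequence.

```python
from typing import Dict, Tuple


def parse_site(location: str) -> Dict[str, Tuple[str, int]]:
    """Turn an INSDC between-bases location such as '100^101' into a Region.

    The start is described as "after" the left base and the end as
    "before" the right base, so the Region covers no base itself.
    """
    left, caret, right = location.strip().partition("^")
    if not caret:
        raise ValueError(f"not a between-bases location: {location!r}")
    before_base, after_base = int(left), int(right)
    if after_base != before_base + 1:
        raise ValueError(f"bases are not adjacent: {location!r}")
    return {"start": ("after", before_base), "end": ("before", after_base)}


site = parse_site("100^101")
```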
Examples with a before/after position, for example at the very end of an incomplete contig.
It is not always easy to predict which start codon is used, and in fact sometimes a gene can be expressed in multiple forms by using different start codons. In DDBJ/GenBank/EMBL this used to be expressed with a one-of location (seems to have been withdrawn, probably such cases are now annotated as two genes?).
In some/all cases you would want to use two features/locations for the recognition site and the cutting site.
Some recognition sites will be strand neutral (e.g. palindromic recognition sequences), others will be strand specific.
Simple blunt end cutting probably just needs a location as in the insertion site example above. For overhanging cut sites, perhaps two locations are needed - one for each strand?
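To make the "one location per strand" idea concrete, here is a rough sketch for a palindromic recognition site. All names are ours, and the model is an assumption rather than an agreed design: each cut is recorded as the pair of forward-strand bases it falls between, and cut_offset counts the bases before the cut when each strand is read 5' to 3'. The site position 101 is made up.

```python
from typing import Dict, Tuple


def cut_sites(site_start: int, site_length: int, cut_offset: int) -> Dict[str, Tuple[int, int]]:
    """Return the between-bases cut location on each strand.

    Coordinates are one-based and given on the forward strand, so the
    reverse-strand cut is mirrored from the far end of the site.
    """
    site_end = site_start + site_length - 1
    top = (site_start + cut_offset - 1, site_start + cut_offset)
    bottom = (site_end - cut_offset, site_end - cut_offset + 1)
    return {"+": top, "-": bottom}


# EcoRI (G^AATTC) leaves an overhang, so the two strands are cut at different places.
sticky = cut_sites(site_start=101, site_length=6, cut_offset=1)

# EcoRV (GAT^ATC) cuts in the middle, so both strands share one location.
blunt = cut_sites(site_start=101, site_length=6, cut_offset=3)
```

In the blunt case the two locations coincide, which matches the suggestion that a single insertion-site style location is enough there.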
The examples above have been focused on genomic annotations, but one might equally well want to annotate a protein sequence, for example to mark a signal peptide or a PFAM domain.
Discussion included the following people:
- Jerven Bolleman
- Peter Cock
- Toshiaki Katayama
- Rob Buels
- Robert Hoehndorf
- Raoul Bonnal (INGM, BioRuby) Linkedin: http://it.linkedin.com/in/raoulbonnal
- Michel Dumontier
- Takatomo Fujisawa
- Francesco Strozzi
- Joachim Baran