Document typecheck option in gff3validator #910

Closed
cmungall opened this issue Feb 23, 2019 · 5 comments

@cmungall

Context: The-Sequence-Ontology/SO-Ontologies#465

The GFF3 validator uses the sofa.obo file to check GFF3 files. Is there documentation on what it expects from that file, regarding both the term names and the ontology graph?

I'm trying to document a kind of service level agreement on the SO side, and want to ensure that future changes to SO don't violate your expectations.

Also, is this the right repo for the canonical GFF3 validator? Last I recall it was in Perl, not C.

@gordon
Member

gordon commented Feb 27, 2019

At the time I wrote a generic OBO file parser; the requirement is that the ontology file can be parsed by it.
You can check that automatically in your test suite by running gt gff3 -typecheck test.obo testdata/standard_gene_simple.gff3 or gt gff3validator -typecheck test.obo testdata/standard_gene_simple.gff3

It doesn't make any assumptions about term names and ontology graphs, but changing established names or graph relationships might invalidate existing GFF3 files.

I think the constraints mentioned at the top of The-Sequence-Ontology/SO-Ontologies#465 are a good start.

This repository is not related to the Perl GFF3 validator. I wrote a new GFF3 parser and validator in C based on the specifications. If I recall correctly, the Perl validator didn't meet my performance requirements. The C validator has been extensively tested, and at the time it performed much better than any other GFF3 validator I tested.

@cmungall
Author

Thanks for the info!

Can you describe how you use the graph? Are there assumptions about particular relationship types (e.g. part_of)? For example, is the exon-part_of-transcript relationship used to check that exons are within the bounds of transcripts?

@gordon
Member

gordon commented Mar 5, 2019

I parse an OBO file according to the OBO Flat File Format Specification, version 1.2 (see API and implementation).

This gives me all the "Term" stanzas. From the set of "Term" stanzas I build the ontology graph (while ignoring obsolete stanzas). See implementation.
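
As a rough illustration of the step described above, here is a minimal sketch of scanning OBO text for "Term" stanzas while skipping obsolete ones. It is not the GenomeTools parser: the function name count_active_terms is hypothetical, and the sketch assumes that a stanza starts at a line beginning with "[" and that obsolete terms carry the tag "is_obsolete: true".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TERM_HEADER  "[Term]"
#define OBSOLETE_TAG "is_obsolete: true"

/* Hypothetical helper (not GenomeTools code): counts the [Term] stanzas in
   OBO-formatted text, skipping stanzas tagged "is_obsolete: true". */
size_t count_active_terms(const char *obo)
{
  size_t active = 0;
  bool in_term = false, obsolete = false;
  const char *line = obo;
  assert(obo != NULL);
  while (*line) {
    size_t len = strcspn(line, "\n"); /* length of the current line */
    if (len > 0 && line[0] == '[') {
      /* a new stanza starts, so the previous one is complete */
      if (in_term && !obsolete)
        active++;
      in_term = !strncmp(line, TERM_HEADER, strlen(TERM_HEADER));
      obsolete = false;
    }
    else if (in_term && !strncmp(line, OBSOLETE_TAG, strlen(OBSOLETE_TAG)))
      obsolete = true;
    line += len;
    if (*line == '\n')
      line++; /* step over the line terminator */
  }
  if (in_term && !obsolete)
    active++; /* the last stanza has no following header */
  return active;
}
```

A real parser would also keep the id, name and relationship tags of each stanza; the sketch only shows how the stanza boundaries and the obsolete flag interact.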

This gives me all valid terms and the is_a and part_of relations. If type checking is enabled, the GFF3 parser then uses this graph to make sure that all terms are valid and that all parent-child relationships in the GFF3 file are part_of relationships in the ontology. In some special cases I also check is_a relationships.
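
The check described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the toy edge table is not the real SO graph, the function names are hypothetical, and the thread does not say how GenomeTools actually traverses the graph. The sketch accepts a child type under a parent type if a path of is_a and part_of edges containing at least one part_of edge leads from the child to the parent or to one of the parent's is_a ancestors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { IS_A, PART_OF } EdgeKind;

typedef struct {
  const char *child, *parent;
  EdgeKind kind;
} Edge;

/* Toy, acyclic ontology for illustration only (simplified, not the real SO). */
static const Edge toy_edges[] = {
  { "mRNA", "transcript", IS_A },
  { "exon", "transcript", PART_OF },
  { "transcript", "gene", PART_OF }
};
static const size_t toy_nedges = sizeof toy_edges / sizeof toy_edges[0];

/* A type is valid if it appears anywhere in the toy ontology. */
bool is_known_term(const char *type)
{
  for (size_t i = 0; i < toy_nedges; i++)
    if (!strcmp(toy_edges[i].child, type) || !strcmp(toy_edges[i].parent, type))
      return true;
  return false;
}

/* True if 'to' is reachable from 'from' over a path with >= 1 part_of edge. */
static bool reaches(const char *from, const char *to, bool seen_part_of)
{
  if (seen_part_of && !strcmp(from, to))
    return true;
  for (size_t i = 0; i < toy_nedges; i++) {
    if (strcmp(toy_edges[i].child, from))
      continue; /* not an outgoing edge of 'from' */
    if (reaches(toy_edges[i].parent, to,
                seen_part_of || toy_edges[i].kind == PART_OF))
      return true;
  }
  return false;
}

/* Hypothetical check for one GFF3 parent-child pair of feature types. */
bool parent_child_ok(const char *child_type, const char *parent_type)
{
  assert(child_type && parent_type);
  if (!is_known_term(child_type) || !is_known_term(parent_type))
    return false;
  if (reaches(child_type, parent_type, false))
    return true;
  /* also accept the pair if it holds for an is_a ancestor of the parent */
  for (size_t i = 0; i < toy_nedges; i++)
    if (toy_edges[i].kind == IS_A && !strcmp(toy_edges[i].child, parent_type)
        && parent_child_ok(child_type, toy_edges[i].parent))
      return true;
  return false;
}
```

With this toy graph, an exon whose Parent is an mRNA passes because mRNA is_a transcript and exon part_of transcript, while a gene listed as the child of an exon is rejected.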

@cmungall
Author

Many thanks for your explanation!

I am assuming that your implementation also uses the member_of relation; it seems this way:

    /* match member_of */
    if (!strncmp(rel, MEMBER_OF, strlen(MEMBER_OF))) {
      const char *member_of = rel + strlen(MEMBER_OF) + 1;
      gt_str_append_cstr_nt(buf, member_of, strcspn(member_of, " \n"));
      gt_type_node_part_of_add(node, gt_symbol(gt_str_get(buf)));
      continue;
    }

This is good because otherwise most feature graphs would be declared invalid, as the path in SO from a transcript to a gene involves a hop over member_of (which is a sub-relation of part_of).
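
To make the quoted snippet easier to follow without the GenomeTools headers, here is a self-contained sketch of the same prefix-matching idea. The function part_of_target is hypothetical, and the input format is an assumption: the value of a relationship tag such as "member_of EX:0000002 ! gene", where the EX: identifiers are made up. Both part_of and member_of are accepted, so a member_of edge ends up recorded as a part_of edge.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PART_OF   "part_of"
#define MEMBER_OF "member_of"

/* Hypothetical helper (not GenomeTools code): if rel starts with "part_of "
   or "member_of ", copy the target term (up to the next space or newline)
   into out and return true. Returns false for any other relation, for an
   empty target, or if the target does not fit into out. */
bool part_of_target(const char *rel, char *out, size_t outsize)
{
  static const char *const accepted[] = { PART_OF, MEMBER_OF };
  assert(rel && out && outsize > 0);
  for (size_t i = 0; i < sizeof accepted / sizeof accepted[0]; i++) {
    size_t n = strlen(accepted[i]);
    /* require the relation name followed by a single space */
    if (strncmp(rel, accepted[i], n) || rel[n] != ' ')
      continue;
    const char *target = rel + n + 1;
    size_t len = strcspn(target, " \n");
    if (len == 0 || len >= outsize)
      return false;
    memcpy(out, target, len);
    out[len] = '\0';
    return true;
  }
  return false;
}
```

Unlike the quoted snippet, the sketch also checks that the relation name is followed by a space, so a longer name that merely begins with part_of is not mistaken for it.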

@gordon
Member

gordon commented Apr 12, 2019

You are right, I forgot to mention that.
