Help needed moving away from the MySQL database #126

Closed
juanjoDiaz opened this issue May 10, 2018 · 15 comments

@juanjoDiaz

Hi,

I've been using the MySQL dump of GO in the past for my Cytoscape app (http://apps.cytoscape.org/apps/gfdnet)

I would like to move to the new model, which I understand is based on Blazegraph or OBO files.

Essentially, given a gene network, I was using GO to:

  • Get the list of possible genus and species so the user can select the right one for the analysis.
  • Match all the synonyms of a gene (in case a non-standard name is used)
  • Match each gene with its associated gene products (including synonyms)
  • Match each gene product with its associated GO terms
  • Construct the DAG (Tree) from the annotations of those gene products all the way to the ontology root.

You can see my DB queries in https://github.com/juanjoDiaz/gfdnet/tree/master/src/main/java/org/cytoscape/gfdnet/model/dataaccess/go

Any help to port the tool to the latest available data (either Blazegraph, OBO files, or whatnot) would be really appreciated.

@kltm
Member

kltm commented May 10, 2018

@juanjoDiaz

Welcome to the transition--there's a lot going on and we're still trying to get all the documentation and links in order as well.

While OBO is still used as a format for the ontologies, you may want to look at OWL as a possibility moving forward as well: in the future the most data rich "annotation" models that we'll be creating will be based on that (often in TTL format).

The latest blazegraph dumps for a release can be found at http://current.geneontology.org, monthly-ish releases at http://release.geneontology.org (we still need to clean that out--please ignore anything before March 2018); frequent (almost daily) snapshots can be found at http://snapshot.geneontology.org. For all of these, look under products/blazegraph, with a date first for the "releases". You would be interested in the blazegraph-production.jnl.gz file.
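As a rough illustration of the layout described above, the journal location could be assembled as in this minimal Python sketch. The helper name `journal_url`, the `YYYY-MM-DD` date format, and the example date are assumptions for illustration only; the hosts, the `products/blazegraph` path, and the file name come from the comment above.

```python
from typing import Optional

# File name and hosts are taken from the comment above.
JOURNAL_NAME = "blazegraph-production.jnl.gz"
HOSTS = {
    "current": "http://current.geneontology.org",
    "release": "http://release.geneontology.org",
    "snapshot": "http://snapshot.geneontology.org",
}


def journal_url(channel: str = "current", release_date: Optional[str] = None) -> str:
    """Build the URL of the Blazegraph journal for one of the three channels.

    For "release" a date path segment comes first; its exact format
    (assumed here to be YYYY-MM-DD) is not stated in the thread.
    """
    if channel not in HOSTS:
        raise ValueError(f"unknown channel: {channel!r}")
    parts = [HOSTS[channel]]
    if channel == "release":
        if not release_date:
            raise ValueError("the release channel needs a date")
        parts.append(release_date)
    parts += ["products", "blazegraph", JOURNAL_NAME]
    return "/".join(parts)


if __name__ == "__main__":
    print(journal_url())
    # "2018-04-01" is a hypothetical date used only to show the shape of the URL.
    print(journal_url("release", "2018-04-01"))
```

It is worth checking the directory listing in a browser before relying on any generated URL.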

I am actually unfamiliar with the relations used to store information like synonyms, or if we're even loading that into the blazegraph at this point; I'll defer comment about that and public data model documentation to @cmungall .

@juanjoDiaz
Author

Thanks for the response @kltm

I'll look into OWL and wait for @cmungall for the details :)

Couple more questions:

  • Is there any information about how those dump files are organized and how they match the old MySQL structure?
  • Is there a publicly accessible database, and are there any guidelines on how to access it?

@kltm
Member

kltm commented May 10, 2018

@juanjoDiaz

For annotation/modeling data, the "dump" files that we provide are either going to be OWL TTL, so an OWL model, or a GAF 2.1. For the blazegraph, the layout in the triplestore is an extremely close analog to how the data is modeled in the OWL, but with small adjustments where they do not fit.
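For the GAF 2.1 route, a minimal Python sketch of reading annotation rows might look like the following. It assumes the standard 17 tab-separated GAF 2.1 columns, with `!` marking header and comment lines and `|` separating synonyms. The function names and the sample row are invented for illustration and are not real annotation data.

```python
from typing import Dict, Iterable, Iterator, List

# Column names for the 17 tab-separated GAF 2.1 fields, in order.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_or_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per annotation row, skipping '!' header/comment lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Pad short rows so every column name is present in the record.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


def synonyms(record: Dict[str, str]) -> List[str]:
    """Split the pipe-separated synonym column into a list."""
    return [s for s in record["db_object_synonym"].split("|") if s]


# Invented sample row, used only to show the shape of the data.
SAMPLE_ROW = "\t".join([
    "UniProtKB", "P00000", "GENE1", "", "GO:0005634", "PMID:0000000",
    "IDA", "", "C", "Example protein", "GENE1_ALIAS|G1", "protein",
    "taxon:9606", "20180101", "UniProt", "", "",
])

if __name__ == "__main__":
    for rec in parse_gaf(["!gaf-version: 2.1\n", SAMPLE_ROW + "\n"]):
        print(rec["db_object_symbol"], rec["go_id"], rec["taxon"], synonyms(rec))
```

The symbol, synonym, taxon, and GO ID columns together cover the gene-to-synonym, organism, and gene-product-to-term lookups listed at the top of the thread.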

For the time being, we have made available http://rdf.geneontology.org . While not widely advertised, it should be usable as a beta tool as we shore it up.

@suzialeksander
Collaborator

@cmungall, was there anything you can add to this?

@juanjoDiaz
Author

Thanks for following up!
To be honest I'm still 100% stuck on this one.

No clue on how to get the information about gene synonyms or gene products.
And no clue on how to query the new Blazegraph. I've tried the examples at https://github.com/geneontology/go-graphstore/wiki/Example-queries but they don't seem to work. http://geneontology.org/rdf/ just returns not found, for example.

I hope we can get more info soon. I'm surprised if I'm the only one having this problem. I was expecting more people relying on the MySQL database...

@kltm kltm assigned dougli1sqrd and unassigned cmungall Jul 6, 2018
@kltm
Member

kltm commented Jul 6, 2018

@juanjoDiaz Ugh, sorry, actually those wiki examples are a little crusty. Our public RDF endpoint is more properly at http://rdf.geneontology.org . I tagged on @dougli1sqrd , who is currently working on the graph store and new documentation--he should be able to help get you started a bit.

@dougli1sqrd

dougli1sqrd commented Jul 10, 2018

Hi @juanjoDiaz!
Check out https://github.com/geneontology/go-site/blob/master/graphstore/triplestore_info.md. This is an in-progress document about how we construct the triplestore. SPARQL is the way to query the data there, and I have links to SPARQL tutorials in the document. There is some explanation of the data model, as well as links to more detailed resources on GO-CAM models.

I can also help with specific SPARQL queries you might want. Check out the document, and let me know what other questions you have.

FYI, we currently do not have taxon information on gene products in the triple store directly. But we are actively working on that right now, and should have that completed in a day or so, trickling into the triple store after that.

@juanjoDiaz
Author

Thanks @dougli1sqrd !

The queries that I'm trying to do are listed on my first message.

  • Get the list of available genus and species in GO.
  • Given a gene name get all the synonyms (so, in case a non-standard name is used, I can use the standard one for the following queries)
  • Given a gene get its gene products (including synonyms)
  • Given a gene product get its associated GO terms
  • Construct the DAG (Tree) from the annotations of those gene products all the way to the ontology root. I guess I might not need this anymore if OBO/OWL/Blazegraph/etc. have a way to get a GO term's depth, the shortest distance between two GO terms, and the LCA (lowest common ancestor) of two GO terms.
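If the DAG does end up being assembled client-side, the depth, distance, and LCA measures in the last item could be computed along the lines of this minimal Python sketch. The toy graph, the term names, and the function names are all hypothetical, and "distance" is taken here to mean the shortest path that goes up to a shared ancestor and back down, which is only one possible definition.

```python
from collections import deque
from typing import Dict, List, Optional, Set

# Hypothetical toy DAG (child -> parents); not real GO terms.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["A"],
    "F": ["A", "B"],
}


def distances_up(term: str, parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Breadth-first walk towards the root: fewest edges to each ancestor."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def depth(term: str, parents: Dict[str, List[str]], root: str = "root") -> int:
    """Shortest number of edges from the term up to the root."""
    return distances_up(term, parents)[root]


def lowest_common_ancestors(a: str, b: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Common ancestors that are not themselves ancestors of another common ancestor.

    In a DAG there can be more than one, so a set is returned.
    """
    common = set(distances_up(a, parents)) & set(distances_up(b, parents))
    lowest = set(common)
    for candidate in common:
        for ancestor in distances_up(candidate, parents):
            if ancestor != candidate:
                lowest.discard(ancestor)
    return lowest


def distance_via_ancestor(a: str, b: str, parents: Dict[str, List[str]]) -> Optional[int]:
    """Shortest up-then-down path length between two terms, or None if unrelated."""
    da, db = distances_up(a, parents), distances_up(b, parents)
    common = set(da) & set(db)
    return min((da[c] + db[c] for c in common), default=None)


if __name__ == "__main__":
    print(depth("D", PARENTS))
    print(lowest_common_ancestors("D", "E", PARENTS))
    print(lowest_common_ancestors("C", "F", PARENTS))
    print(distance_via_ancestor("D", "E", PARENTS))
```

Because a term can have several parents, two terms may share more than one lowest common ancestor, as the `C`/`F` pair in the toy graph shows.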

You can see my MySQL DB queries in https://github.com/juanjoDiaz/gfdnet/tree/master/src/main/java/org/cytoscape/gfdnet/model/dataaccess/go

Any help on getting those queries in the new system is more than welcome.

@suzialeksander
Collaborator

@dougli1sqrd can you help out @juanjoDiaz with these queries?
Thanks!

@kltm
Member

kltm commented Jul 30, 2018

@juanjoDiaz
Unfortunately, while we have some examples (https://github.com/geneontology/sparqlr/tree/master/templates), and are working on getting more examples into the system, we don't really have the capacity to work through these right now. As well, on a little inspection with our public endpoint, some of these queries would time out before completion.
For the final query you have, it may be worth noting that "distance" and term "depth" are often not particularly information-carrying concepts in the GO in many use cases, as there are multiple paths over the closure of many types of relationships. For some of your use cases, you may actually be interested in the BioLink API https://biolink.geneontology.io/ (apologies, HTTPS exception needed for the moment).

@suzialeksander
Collaborator

@juanjoDiaz, is there anything else we can clarify before I close this ticket?

@juanjoDiaz
Author

Hi @suzialeksander,
Sorry for the delay answering.

Unfortunately, I still haven't been able to find a clear path to migrate away from MySQL.

I understand from @kltm's response that the BioLink API might be a better option than using the new GO database directly. Is my understanding correct that BioLink just wraps the new GO database together with other databases to offer a consistent API?

I've taken a look at the API definition and it seems that I'd be able to build the subsection of the GO tree for a set of genes using:

  • /bioentity/goterm/{id}/genes/ to extract the related GO terms
  • /ontol/subgraph/{ontology}/{node} to get the tree up to the root for each GO term
  • custom code to assemble the tree at my end.
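The three steps above could be sketched in Python roughly as follows. The two route templates are the ones quoted in the list; the base URL is the one mentioned earlier in the thread and may need an extra API prefix. The response shape (`nodes` with `id`/`lbl`, `edges` with `sub`/`pred`/`obj`) and the function names are assumptions that should be checked against the API definition.

```python
from typing import Dict, Iterable, Set, Tuple
from urllib.parse import quote

# Base URL as given in the thread; any additional API prefix is unknown here.
BASE_URL = "https://biolink.geneontology.io"

Edge = Tuple[str, str, str]


def goterm_genes_url(go_id: str, base: str = BASE_URL) -> str:
    """URL for the /bioentity/goterm/{id}/genes/ route."""
    return f"{base}/bioentity/goterm/{quote(go_id, safe=':')}/genes/"


def subgraph_url(ontology: str, node: str, base: str = BASE_URL) -> str:
    """URL for the /ontol/subgraph/{ontology}/{node} route."""
    return f"{base}/ontol/subgraph/{quote(ontology)}/{quote(node, safe=':')}"


def merge_subgraphs(graphs: Iterable[dict]) -> Tuple[Dict[str, str], Set[Edge]]:
    """Union several subgraph responses into one node map and one edge set.

    Assumes each response looks like
    {"nodes": [{"id": ..., "lbl": ...}], "edges": [{"sub": ..., "pred": ..., "obj": ...}]}.
    """
    nodes: Dict[str, str] = {}
    edges: Set[Edge] = set()
    for graph in graphs:
        for node in graph.get("nodes", []):
            nodes.setdefault(node["id"], node.get("lbl", ""))
        for edge in graph.get("edges", []):
            edges.add((edge["sub"], edge["pred"], edge["obj"]))
    return nodes, edges


if __name__ == "__main__":
    print(goterm_genes_url("GO:0008150"))
    print(subgraph_url("go", "GO:0008150"))
```

Merging by node ID and by (subject, predicate, object) triple means that ancestors shared between terms appear only once in the assembled tree.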

Is that the suggested approach?

However, it also seems that I won't be able to get the list of available genus and species in GO.
In that sense, I can't see any reference to genus or species. How do I specify the organism when doing queries against GO like the ones mentioned above?

Also, I'd like to know how production-ready this BioLink API is. I wouldn't like to put in the effort to migrate from MySQL to BioLink only to have it changed or deprecated and have to start all over again.

Thanks for all the support!

[P.S.: as a side note, I disagree on "distance" and "depth" not being relevant concepts. They definitely are relevant for my research (https://www.sciencedirect.com/science/article/pii/S1532046417300382) 🙂]

@kltm
Member

kltm commented Sep 4, 2018

@juanjoDiaz No worries--it's not like we're speed demons ourselves.

The BioLink API (https://github.com/biolink/) is indeed in production and can be used for some things (it is currently used by Monarch and the Alliance Genome Ribbon), but it is still in progress, with routes still being filled out and fixed as use cases arise. It may be useful to look at their tracker (biolink-api) and see if the features that you want are in progress, or to suggest them for addition.

That said, for what you're doing, you may have better luck with directly sending queries to the SPARQL endpoint (http://rdf.geneontology.org) to get the data that you're interested in.
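A minimal Python sketch of sending a query over the standard SPARQL protocol is shown below. The `/blazegraph/sparql` path appended to the host is an assumption and should be verified against the endpoint's landing page. The function names are illustrative, and the example query assumes that ontology term labels are loaded in the store, which the thread does not confirm.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Host from the thread; the "/blazegraph/sparql" path is an assumption.
ENDPOINT = "http://rdf.geneontology.org/blazegraph/sparql"

# Example query: the label of GO:0008150, assuming labels are in the store.
LABEL_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://purl.obolibrary.org/obo/GO_0008150> rdfs:label ?label .
} LIMIT 5
"""


def build_request(query: str, endpoint: str = ENDPOINT) -> Request:
    """Prepare a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(query: str, endpoint: str = ENDPOINT, timeout: float = 30.0) -> List[Dict]:
    """Send the query and return the list of result bindings."""
    with urlopen(build_request(query, endpoint), timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


if __name__ == "__main__":
    # Performs a live network call; it fails if the endpoint path assumption is wrong.
    for row in run_query(LABEL_QUERY):
        print(row["label"]["value"])
```

Keeping a small `LIMIT` and a client-side timeout is a sensible precaution given the earlier note that some queries time out on the public endpoint.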

[I'd agree that distance may be fine for some kinds of topological use cases; we just try and get users to think about what they're doing in case they are using it as a proxy for some concept of "information", in which case it may be misleading, depending on the use. It's boilerplate that I add to any query we get that makes reference to anything "depth"-like.]

@kltm kltm removed the stale label Sep 4, 2018
@suzialeksander
Collaborator

Hi @juanjoDiaz, did you have any more questions on this right now? If not, I'd like to close this ticket, and you are welcome to open a new ticket anytime. If there's something else on this ticket you still would like more information on, please let us know!

Thanks!

@juanjoDiaz
Author

Hi @suzialeksander ,

Feel free to close this ticket. When I can finally start working on this, I'll create more specific tickets as questions arise.

Thanks everyone for the help!
