Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize / rework problematic SPARQL query in API in pathways widget #3

Closed
kltm opened this issue Nov 19, 2021 · 13 comments
Closed

Optimize / rework problematic SPARQL query in API in pathways widget #3

kltm opened this issue Nov 19, 2021 · 13 comments

Comments

@kltm
Copy link
Member

kltm commented Nov 19, 2021

Recently, with our migration to EC2 and move to a smaller machine, we've come to understand that some queries coming through the GO-CAM API to the SPARQL endpoint and both voluminous (possibly crawling bots) and are no longer able to meet the query timeout.

An example of the main problematic query is https://gist.github.com/kltm/19e2d12cde2aa2e1f3f92cebdf16d937
Note that the code for this is https://github.com/lpalbou/api-gorest-2020/blob/aee0b9bd1e8b6c7ea1c815cfd70e2f1972deb0d7/queries/sparql-gp.js#L67; we'll probably want to

After some discussion on the software call, steps here may be:

  • determine optimizations for the query
    • direct optimization
    • splitting, with coordination from the UI
  • otherwise, explore caching in the API
  • Test that fixes are producing same results as current, or that we want to change current

Secondarily, we'll want to make sure the devops processes are clear to roll this out into production without hiccups. (May want to try just redeployment first?) TBD. (Caveat here that we may run into other such queries. Good to get our practices down and document how we do this.)

@balhoff Already has some very (very) promising ideas here.

Tagging @tmushayahama @dustine32 @sierra-moxon @cmungall @vanaukenk @balhoff .

@balhoff
Copy link
Member

balhoff commented Nov 19, 2021

Here is the fast version of the query. Note that the gene product ID needs to be injected in two different lines within the query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pr: <http://purl.org/ontology/prv/core#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX providedBy: <http://purl.org/pav/providedBy>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX causally_upstream_of_or_within: <http://purl.obolibrary.org/obo/RO_0002418>
PREFIX causally_upstream_of_or_within_negative_effect: <http://purl.obolibrary.org/obo/RO_0004046>
PREFIX causally_upstream_of_or_within_positive_effect: <http://purl.obolibrary.org/obo/RO_0004047>
PREFIX causally_upstream_of: <http://purl.obolibrary.org/obo/RO_0002411>
PREFIX causally_upstream_of_negative_effect: <http://purl.obolibrary.org/obo/RO_0002305>
PREFIX causally_upstream_of_positive_effect: <http://purl.obolibrary.org/obo/RO_0002304>
PREFIX regulates: <http://purl.obolibrary.org/obo/RO_0002211>
PREFIX negatively_regulates: <http://purl.obolibrary.org/obo/RO_0002212>
PREFIX positively_regulates: <http://purl.obolibrary.org/obo/RO_0002213>
PREFIX directly_regulates: <http://purl.obolibrary.org/obo/RO_0002578>
PREFIX directly_positively_regulates: <http://purl.obolibrary.org/obo/RO_0002629>
PREFIX directly_negatively_regulates: <http://purl.obolibrary.org/obo/RO_0002630>
PREFIX directly_activates: <http://purl.obolibrary.org/obo/RO_0002406>
PREFIX indirectly_activates: <http://purl.obolibrary.org/obo/RO_0002407>
PREFIX directly_inhibits: <http://purl.obolibrary.org/obo/RO_0002408>
PREFIX indirectly_inhibits: <http://purl.obolibrary.org/obo/RO_0002409>
PREFIX transitively_provides_input_for: <http://purl.obolibrary.org/obo/RO_0002414>
PREFIX immediately_causally_upstream_of: <http://purl.obolibrary.org/obo/RO_0002412>
PREFIX directly_provides_input_for: <http://purl.obolibrary.org/obo/RO_0002413>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX hint: <http://www.bigdata.com/queryHints#>
SELECT DISTINCT ?gocam ?title
WHERE {
  GRAPH ?gocam {  
    # Inject gene product ID here
    ?gene rdf:type <http://identifiers.org/rgd/1586460> .
  }
  ?gocam metago:graphType metago:noctuaCam .
  ?gocam dc:title ?title .
  GRAPH ?gocam {
    VALUES ?causal { causally_upstream_of_or_within: causally_upstream_of_or_within_negative_effect: causally_upstream_of_or_within_positive_effect: causally_upstream_of: causally_upstream_of_negative_effect: causally_upstream_of_positive_effect: regulates: negatively_regulates: positively_regulates: directly_regulates: directly_positively_regulates: directly_negatively_regulates: directly_activates: indirectly_activates: directly_inhibits: indirectly_inhibits: transitively_provides_input_for: immediately_causally_upstream_of: directly_provides_input_for: }
    VALUES ?causal2 { causally_upstream_of_or_within: causally_upstream_of_or_within_negative_effect: causally_upstream_of_or_within_positive_effect: causally_upstream_of: causally_upstream_of_negative_effect: causally_upstream_of_positive_effect: regulates: negatively_regulates: positively_regulates: directly_regulates: directly_positively_regulates: directly_negatively_regulates: directly_activates: indirectly_activates: directly_inhibits: indirectly_inhibits: transitively_provides_input_for: immediately_causally_upstream_of: directly_provides_input_for: }
    #?s enabled_by: ?gpnode . # I don't think this is needed for this query
    #?gpnode rdf:type ?identifier . # I don't think this is needed for this query
    ?ind1 enabled_by: ?gpnode1 .
    ?ind2 enabled_by: ?gpnode2 .
    ?ind3 enabled_by: ?gpnode3 .
    ?ind1 ?causal ?ind2 .
    ?ind2 ?causal2 ?ind3 .
    ?gpnode1 rdf:type ?gp1 .
    ?gpnode2 rdf:type ?gp2 .
    ?gpnode3 rdf:type ?gp3 .
    # Inject gene product ID here
    FILTER(?gp1 = <http://identifiers.org/rgd/1586460> || ?gp2 = <http://identifiers.org/rgd/1586460> || ?gp3 = <http://identifiers.org/rgd/1586460>)  
  }
  FILTER EXISTS {
    ?ind1 rdf:type MF: .
  }
  FILTER EXISTS {
    ?ind2 rdf:type MF: .
  }
  FILTER EXISTS {
    ?ind3 rdf:type MF: .
  }
}
ORDER BY ?gocam

@kltm
Copy link
Member Author

kltm commented Nov 19, 2021

Thanks @balhoff ! I guess the next steps would be:

  • test this to make sure that we're getting the same results
  • roll this out to product (<-more discussion here)

@kltm
Copy link
Member Author

kltm commented Nov 22, 2021

For people watching here, noting possible crossover with geneontology/helpdesk#366

@kltm kltm changed the title Optimize / rework problematic SPARQL query in API Optimize / rework problematic SPARQL query in API in pathways widget Nov 23, 2021
@kltm
Copy link
Member Author

kltm commented Dec 15, 2021

@dustine32 Now has a test instance at https://api2.geneontology.xyz. His testing confirms that the queries time out when there is no other traffic.

Next steps are likely:

  1. update with @balhoff 's code
  2. check if there are noticeable improvements
  3. figure out the juggle

@kltm
Copy link
Member Author

kltm commented Dec 17, 2021

@dustine32 To kick this off, if I wanted to update the code currently at api2, would I just make a PR (that then gets merged) on https://github.com/geneontology/api-gorest-2021? If so, how long might the propagation take (and what would be the best way to observe the progress (if possible with current permissions))?
In case, started: https://github.com/geneontology/api-gorest-2021/tree/issue-api-gorest-3-update-sparql

kltm added a commit to geneontology/api-gorest-2021 that referenced this issue Dec 17, 2021
For geneontology/api-gorest#3
Trying to bring on @balhoff 's changes to the experimental codebase (geneontology/api-gorest#3 (comment))
@kltm
Copy link
Member Author

kltm commented Dec 17, 2021

@dustine32 Possibly a first attempt here? geneontology/api-gorest-2021#1

@dustine32
Copy link

@kltm Yes! Thank you for PR-ing this. That's the route I would go. Then we'll merge to master and it should only take about a minute to auto-deploy.

It would be cool if we could setup a branch-specific auto-deploy to a test or dev URL. Hm...

@kltm
Copy link
Member Author

kltm commented Dec 17, 2021

Well, we have two at this point, right? api and api2. That should at least, for now, allow testing. Unless autodeploy for branches is easy, it might not be worth pursuing for the moment.

@kltm
Copy link
Member Author

kltm commented Dec 17, 2021

That said, we can go either route on your determination. I'd ideally like to see if we can get this cleared before we all start wandering away, but more than that I'd like to make sure we don't bork the production API.

@dustine32
Copy link

@kltm Yep, I'm down with just getting this out using api2. I'll approve the PR in a sec.

What I've read so far makes setting up a dev deploy pipeline (as I imagined it) look a bit tricky but, I think, doable.

dustine32 added a commit to geneontology/api-gorest-2021 that referenced this issue Dec 17, 2021
@kltm
Copy link
Member Author

kltm commented Dec 17, 2021

Before and after, from @dustine32

https://api.geneontology.xyz/gp/RGD:1586460/models?causalmf=true (~30 sec timeout)
https://api2.geneontology.xyz/gp/RGD:1586460/models?causalmf=true (5-10s, correct "empty" results)

I guess the thing to do from here would be to do a little more testing to make sure we have similar/correct results between the two endpoints, then go to production. That may require a little help from @balhoff (or others setting up a local instance) so that we can actually run the old query to completion.

@kltm
Copy link
Member Author

kltm commented Dec 17, 2021

That said, since we are basically looking at working vs non-working, I think that just a little spot checking would probably be enough.

@dustine32
Copy link

On 2021-12-17, we switched the api.geneontology.xyz DNS record in AWS to point to the new repo. Watching the traffic through the new API's dashboard, it's clear that the switchover took effect and the pathways GO-CAM count is being returned (seemingly) instantaneously on the Alliance gene pages (e.g. FB:FBgn0000499).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants