Optimize / rework problematic SPARQL query in API in /go-cam site #4

Closed
kltm opened this issue Nov 23, 2021 · 3 comments

@kltm
Member

kltm commented Nov 23, 2021

Noting that there is a lot of similarity with #3, but that this is a separate issue.

Recently, with our migration to EC2 and move to a smaller machine, we've found that some queries coming through the GO-CAM API to the SPARQL endpoint no longer complete within the query timeout.

An example of the main problematic query is https://github.com/lpalbou/api-gorest-2020/blob/aee0b9bd1e8b6c7ea1c815cfd70e2f1972deb0d7/queries/sparql-models.js#L219
Note that the code for this is https://github.com/lpalbou/api-gorest-2020/blob/aee0b9bd1e8b6c7ea1c815cfd70e2f1972deb0d7/queries/sparql-gp.js#L67; we'll probably want to

We're still looking at how to proceed, but the steps here may be:

  • determine if optimizations for the query are possible
    • direct optimization
    • splitting, with coordination from the UI
  • otherwise, explore caching in the API
  • test that fixes produce the same results as the current query, or decide that we want to change the current results
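
For the last step, a minimal sketch of an order-independent comparison between the results of the current query and a candidate rewrite. It assumes both results are available in the standard SPARQL 1.1 JSON results format; the function names `normalize_bindings` and `diff_results` are hypothetical and not part of the existing API code.

```python
from collections import Counter


def normalize_bindings(sparql_json):
    """Turn SPARQL JSON results into an order-independent multiset of rows."""
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # Each row becomes a sorted tuple of (variable, value) pairs, so that
        # neither row order nor variable order affects the comparison.
        row = tuple(sorted((var, cell["value"]) for var, cell in binding.items()))
        rows.append(row)
    return Counter(rows)


def diff_results(current, candidate):
    """Return (rows only in current, rows only in candidate) as Counters."""
    cur = normalize_bindings(current)
    cand = normalize_bindings(candidate)
    return cur - cand, cand - cur
```

Two empty Counters mean the rewrite returns the same rows; anything else lists exactly which rows were lost or gained, which is what we would need in order to decide whether a difference is a regression or a wanted change.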

Secondarily, we'll want to make sure the devops processes are clear to roll this out into production without hiccups. (May want to try just redeployment first?) TBD. (Caveat here that we may run into other such queries. Good to get our practices down and document how we do this.)

@balhoff Would it be possible for you to look at the SPARQL for this one as well?

Tagging @tmushayahama @dustine32 @sierra-moxon @cmungall @vanaukenk @balhoff .

@balhoff
Member

balhoff commented Nov 29, 2021

I'm finding this one much harder to improve.

@lpalbou
Contributor

lpalbou commented Dec 2, 2021

Hi, just passing by quickly.

I don’t know about the latest modifications, but the first query was cached in a compressed JSON file and loaded by the website. The idea is that for a given month, the result of that (very frequent) query is always the same, so it’s ideal to cache.

The second one is more problematic and more recent. It was created to find the models with at least one MF connected to other MFs by one incoming and one outgoing causal relationship. This one could possibly be optimized, but it is harder to cache because the input is the ID of a gene. Harder but not impossible: we probably only have around 2k genes, so we could run and cache those 2k queries, especially with something like memcache or Redis, but a JSON file on a CDN would do the trick too.
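
To make the per-gene caching idea concrete, here is a rough sketch that stores one compressed JSON file per gene ID on disk. Everything in it is an assumption for illustration: `run_query` stands in for whatever function actually sends the SPARQL query for a gene, and the names `cached_gene_query` and `warm_cache` are hypothetical. The same shape would apply with memcache or Redis in place of the files.

```python
import gzip
import json
from pathlib import Path
from urllib.parse import quote


def cached_gene_query(gene_id, run_query, cache_dir):
    """Return the result for gene_id, running the query only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Gene IDs may contain characters such as ":", so percent-encode them
    # before using them as file names.
    path = cache_dir / (quote(gene_id, safe="") + ".json.gz")
    if path.exists():
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            return json.load(handle)
    # Hypothetical callable: takes a gene ID, returns JSON-serializable data.
    result = run_query(gene_id)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result


def warm_cache(gene_ids, run_query, cache_dir):
    """Pre-compute the cache for every gene, e.g. from a nightly job."""
    for gene_id in gene_ids:
        cached_gene_query(gene_id, run_query, cache_dir)
```

Running `warm_cache` over the roughly 2k gene IDs once per release would mean the API never has to send the slow query while serving a request. Note that this sketch never invalidates entries, so the cache directory would need to be cleared whenever the underlying data changes.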

Ideally, we were also discussing indexing GO-CAMs in GOlr. That was my initial thought for the future evolution of this code, and it will probably become more and more important as the resource grows. FYI, I have discussed this with a few other people in the RDF world and, because of those speed issues, they cache everything every night.

Hope this helps a little and hope everyone is doing great - Laurent-Phillipe

@kltm
Member Author

kltm commented Jan 12, 2022

There are no longer any optimizations that need to be added that aren't going to be dealt with by another mechanism.
Thank you everybody!

@kltm kltm closed this as completed Jan 12, 2022
@kltm kltm moved this from In progress to Done in Software essential and proactive maintenance Jan 12, 2022