Fix SPARQL required for GO-CAM website resource files #2

Closed
dustine32 opened this issue Dec 17, 2021 · 8 comments

@dustine32
Contributor

Carrying on with the work to overcome timeout issues with the SPARQL queries called by the GO-CAM API. Similar to how we improved the models-by-GP query in geneontology/api-gorest#3, we still have two queries that are essential for the GO-CAM website to function but are currently timing out after 30 seconds:

QUERY 1: This one is meant to get a GO-CAM-to-GP lookup file:

AllModelsGPs() {
// URL-encode the SPARQL query so it can be passed as the ?query= parameter
var encoded = encodeURIComponent(`
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT ?gocam (GROUP_CONCAT(distinct ?identifier;separator="` + separator + `") as ?gpids)
(GROUP_CONCAT(distinct ?name;separator="` + separator + `") as ?gpnames)
WHERE
{
GRAPH ?gocam {
?gocam metago:graphType metago:noctuaCam .
?s enabled_by: ?gpnode .
?gpnode rdf:type ?identifier .
FILTER(?identifier != owl:NamedIndividual) .
FILTER(!contains(str(?gocam), "_inferred"))
}
optional {
?identifier rdfs:label ?name
}
BIND(IF(bound(?name), ?name, ?identifier) as ?name)
}
GROUP BY ?gocam
`);
return "?query=" + encoded;
},

The actual query after resolving the separator from the config.json:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT ?gocam   (GROUP_CONCAT(distinct ?identifier;separator="@@") as ?gpids)
			 (GROUP_CONCAT(distinct ?name;separator="@@") as ?gpnames)
WHERE 
{
    GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
        ?s enabled_by: ?gpnode .    
        ?gpnode rdf:type ?identifier .
        FILTER(?identifier != owl:NamedIndividual) .
        FILTER(!contains(str(?gocam), "_inferred"))
    }
    optional {
        ?identifier rdfs:label ?name
    }
    BIND(IF(bound(?name), ?name, ?identifier) as ?name)
}
GROUP BY ?gocam

QUERY 2: This one is meant to get a GO-CAM-to-GO-term lookup file:

AllModelsGOs() {
// URL-encode the SPARQL query so it can be passed as the ?query= parameter
var encoded = encodeURIComponent(`
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE
{
GRAPH ?gocam {
?gocam metago:graphType metago:noctuaCam .
?entity rdf:type owl:NamedIndividual .
?entity rdf:type ?goids
}
VALUES ?goclasses { BP: MF: CC: } .
# rdf:type is faster than subClassOf+ but requires a filter
# ?goids rdfs:subClassOf+ ?goclasses .
?entity rdf:type ?goclasses .
# Filtering out the root BP, MF & CC terms
filter(?goids != MF: )
filter(?goids != BP: )
filter(?goids != CC: )
# then getting their definitions
?goids rdfs:label ?gonames .
?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)
`);
return "?query=" + encoded;
},

Raw query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE 
{
    GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
        ?entity rdf:type owl:NamedIndividual .
        ?entity rdf:type ?goids
    }

    VALUES ?goclasses { BP: MF: CC: } .
    # rdf:type is faster than subClassOf+ but requires a filter
    # ?goids rdfs:subClassOf+ ?goclasses .
    ?entity rdf:type ?goclasses .

    # Filtering out the root BP, MF & CC terms
    filter(?goids != MF: )
    filter(?goids != BP: )
    filter(?goids != CC: )

    # then getting their definitions
    ?goids rdfs:label ?gonames .
    ?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)

@balhoff @kltm Any ideas how we can speed these up to return results in under 30 seconds? They don't need to run crazy fast as they typically only execute when triggered by a GO release (so ~once a month).

@balhoff
Member

balhoff commented Dec 17, 2021

@dustine32 for Query 1, if you can do the grouping on the client side, this will complete in 18 seconds:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT DISTINCT ?gocam ?identifier ?name
WHERE 
{
  GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
  }
  FILTER NOT EXISTS {
    ?gocam prov:wasDerivedFrom ?asserted_cam .
  }
  GRAPH ?gocam {
    ?s enabled_by: ?gpnode .    
 	?gpnode rdf:type ?identifier .
	FILTER(?identifier != owl:NamedIndividual) .
  }
  OPTIONAL {
    ?identifier rdfs:label ?label
  }
  BIND(COALESCE(?label, ?identifier) AS ?name)
}

@dustine32
Contributor Author

@balhoff Oh sweet! We can probably add a step here to handle the grouping of the results. Thanks!
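
A minimal sketch of what such a client-side grouping step could look like, assuming the endpoint returns the standard SPARQL JSON results format (`results.bindings`, each variable wrapped in an object with a `value` field). The function name `groupByGocam` is hypothetical and is not the repository's own helper; the `"@@"` default mirrors the separator shown in the original query.

```javascript
// Hypothetical helper: groups flat (?gocam, ?identifier, ?name) rows by GO-CAM,
// reproducing the shape that GROUP_CONCAT(distinct ...) used to return.
// Assumes each binding looks like:
//   { gocam: { value }, identifier: { value }, name: { value } }
function groupByGocam(bindings, separator = "@@") {
  const groups = new Map();
  for (const row of bindings) {
    const gocam = row.gocam.value;
    if (!groups.has(gocam)) {
      groups.set(gocam, { ids: new Set(), names: new Set() });
    }
    const group = groups.get(gocam);
    // Sets give the same de-duplication as "distinct" inside GROUP_CONCAT.
    group.ids.add(row.identifier.value);
    // Fall back to the identifier if a row happens to carry no name.
    group.names.add(row.name ? row.name.value : row.identifier.value);
  }
  return Array.from(groups, ([gocam, { ids, names }]) => ({
    gocam,
    gpids: [...ids].join(separator),
    gpnames: [...names].join(separator),
  }));
}
```

Called as `groupByGocam(json.results.bindings)`, this yields one object per GO-CAM with `gpids` and `gpnames` joined by the separator, so downstream code written against the old grouped output should need few changes.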

@balhoff
Member

balhoff commented Dec 17, 2021

@dustine32 here is an 11.5 sec version of query 2:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE 
{
  GRAPH ?gocam {
    ?gocam metago:graphType metago:noctuaCam  .
  }
  FILTER NOT EXISTS {
    ?gocam prov:wasDerivedFrom ?asserted_cam .
  }
  GRAPH ?gocam {
    ?entity rdf:type owl:NamedIndividual .
    ?entity rdf:type ?goids
  }
  VALUES ?goclasses { BP: MF: CC:  } . 
    # rdf:type is faster than subClassOf+ but requires a filter
    # ?goids rdfs:subClassOf+ ?goclasses .
  ?entity rdf:type ?goclasses .

  # Filtering out the root BP, MF & CC terms
  filter(?goids != MF: )
  filter(?goids != BP: )
  filter(?goids != CC: )

  # then getting their definitions
  ?goids rdfs:label ?gonames .
  ?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)

lpalbou added a commit to lpalbou/api-gorest-2021 that referenced this issue Dec 30, 2021
Linked to geneontology#2

I see the timeout of the rdf endpoint has been increased, but the timeout of the API / requests here also needed to be increased. By doing so, I tested locally and the query now works
@lpalbou
Contributor

lpalbou commented Dec 30, 2021

Query2

I tried both versions of query 2 and didn't find a speed improvement (old query: 27s, then 34s on a second run; new query: 29s, then 39s). As shown, the time varies greatly depending on when the server receives the query.

It seems that the timeout of the RDF endpoint was increased (up to 60s would probably be a good idea for now?), so I also increased the timeout of the GO-CAM API itself: #3. If you merge this PR, query 2 seems to run, and this would fix the cache created on AWS Lambda, as it uses https://api.geneontology.xyz/models/go.

Query1

For query 1, I still get a server timeout at 30s. I'm unsure why this is not the case for query 2; maybe there is some config to check on the RDF server? Indeed, removing the grouping on the RDF side gives a much faster query (10s). Note that this is the query used to create the GPs cache on AWS Lambda: https://api.geneontology.xyz/models/gp. @dustine32 remember cloud9?

Notes

  • If you want to continue to use a cache and avoid those timeouts, I would suggest using blazegraph runner during the release and storing the files on the GO S3. Improving query performance is just a temporary fix, as more GO-CAMs will be created. Just be sure to compress the object in S3 (gocam-goterms.json is 10.4 MB, or 1.4 MB compressed) - example
  • Indexing GO-CAMs would solve that issue and many others, but for it to be used by 3rd-party sites (e.g. Alliance), GOLr would need to be served over https. You could still use the GO API or GO-CAM API (https) as a proxy to deliver https responses from GOLr.
  • More to the point, @kltm: those caches were created in the first place to enable client-side GO-CAM search at a time when Ben's API didn't exist (which is why we are loading all terms and all GPs for all GO-CAMs). Now that we have proper server-side search, these caches could/should probably be deprecated in favor of server-side search @tmushayahama. If you do that, the rest of the page only needs data for 10 models, and this was always extremely fast through pagination.

Happy holiday season to all ! 🎄🎉

@dustine32
Contributor Author

Whoa, thanks again @lpalbou for all the advice! I'm now leaning towards your first note suggestion (using blazegraph runner during the release) but of course I also have to try the easy way out short-term.

Commit 5fe0a4b applies @balhoff's fix for Query1 (/models/gp) and moves handling of "group by gocam" results outside of the query, reusing @lpalbou's super-handy mergeResults function that was just lying there. Results are returned from the API in around 10 seconds.

I tried applying @balhoff's new Query2 but still ran into a timeout issue while testing the lambda locally:

Function 'GOREST' timed out after 30 seconds

Then I bumped this timeout from 30 to 60 sec in the template.yml (the line below shows the original value):

Timeout: 30

This change at least got me to the next error:

Response payload size (10785926 bytes) exceeded maximum allowed payload size (6291556 bytes).

Looks like this 6MB limit is a hard AWS Lambda limit that can't be changed. There are some workarounds, such as having the API immediately store the response payload in S3 and then return an S3 URL. This miiight work for us since our goal is to get it into S3 anyway, but it probably won't work for external users (then again, this route has been broken for a while so...). Also, the effort to implement this workaround might as well be spent coding blazegraph-runner calls into the release pipeline. Tagging @kltm.
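
As a rough way to see how far a given response is from that limit, and how much gzip would help, here is a small Node.js sketch. The helper name `payloadReport` is hypothetical; the limit constant is the value quoted in the error message above. Whether a gzipped body could actually be returned through this API is not something the thread confirms, and a body that has to be base64-encoded for transport grows by about a third.

```javascript
const zlib = require("node:zlib");

// Limit taken from the Lambda error message quoted above (bytes).
const MAX_LAMBDA_PAYLOAD_BYTES = 6291556;

// Hypothetical helper: serializes a value to JSON and reports its raw and
// gzipped sizes, plus whether each would fit under the payload limit.
function payloadReport(data) {
  const raw = Buffer.from(JSON.stringify(data), "utf8");
  const gzipped = zlib.gzipSync(raw);
  return {
    rawBytes: raw.length,
    gzippedBytes: gzipped.length,
    fitsRaw: raw.length <= MAX_LAMBDA_PAYLOAD_BYTES,
    fitsGzipped: gzipped.length <= MAX_LAMBDA_PAYLOAD_BYTES,
    gzipped, // Buffer that could be written to S3 or a file as-is
  };
}
```

Running `payloadReport(results)` on the /models/go result before returning it would show whether the payload is over the limit and whether compression alone would bring it under.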

@lpalbou
Contributor

lpalbou commented Jan 5, 2022

Glad if it helps Dustin 🙂. I do think the longer-term solution would be blazegraph runner, but in the meantime this may/should work. What really puzzles me is how we reach a 6 MB payload limit. That's a lot; what are we sending? From memory, /models/go or /models/gp would in the worst case send a list of GO-CAM IDs, and by default it already does so for all models, so am I missing something here?

ps: the “Winston” article was a lot of fun. I love AWS but sometimes there are hard constraints that can really cause issues (eg code pipeline can’t target an existing GH repo 😅)

@kltm
Member

kltm commented Jan 12, 2022

[Note: documentation for manual hack of file update/upload while we work things out: https://docs.google.com/document/d/18vYy9sZq-dyjYWW0mnw3XpXRJjlI7pbQWvMlSSdXdjA/edit#heading=h.tzx1g6nhmgtd .]

@kltm
Member

kltm commented Jan 12, 2022

Closing in favor of geneontology/pipeline#265

@kltm kltm closed this as completed Jan 12, 2022
@kltm kltm moved this from In progress to Done in Software essential and proactive maintenance Jan 12, 2022