Fix SPARQL required for GO-CAM website resource files #2

dustine32 · 2021-12-17T19:02:26Z

Carrying on with the work to overcome timeout issues with SPARQL queries called by the GO-CAM API. We improved the models-by-GP query in geneontology/api-gorest#3, but two queries that are essential for the GO-CAM website to function still time out after 30 seconds:

QUERY 1: This one is meant to get a GO-CAM-to-GP lookup file:

api-gorest-2021/queries/sparql-models.js

Lines 262 to 293 in 480092b

    AllModelsGPs() {
        // Transform the array in string
        var encoded = encodeURIComponent(`
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX metago: <http://model.geneontology.org/>
        PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
        PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
        SELECT ?gocam   (GROUP_CONCAT(distinct ?identifier;separator="` + separator + `") as ?gpids)
                        (GROUP_CONCAT(distinct ?name;separator="` + separator + `") as ?gpnames)
        WHERE
        {
            GRAPH ?gocam {
                ?gocam metago:graphType metago:noctuaCam .
                ?s enabled_by: ?gpnode .
                ?gpnode rdf:type ?identifier .
                FILTER(?identifier != owl:NamedIndividual) .
                FILTER(!contains(str(?gocam), "_inferred"))
            }
            optional {
                ?identifier rdfs:label ?name
            }
            BIND(IF(bound(?name), ?name, ?identifier) as ?name)
        }
        GROUP BY ?gocam
        `);
        return "?query=" + encoded;
    },

The actual query after resolving the separator from the config.json:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT ?gocam   (GROUP_CONCAT(distinct ?identifier;separator="@@") as ?gpids)
			 (GROUP_CONCAT(distinct ?name;separator="@@") as ?gpnames)
WHERE 
{
    GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam .
        ?s enabled_by: ?gpnode .    
        ?gpnode rdf:type ?identifier .
        FILTER(?identifier != owl:NamedIndividual) .
        FILTER(!contains(str(?gocam), "_inferred"))
    }
    optional {
        ?identifier rdfs:label ?name
    }
    BIND(IF(bound(?name), ?name, ?identifier) as ?name)
}
GROUP BY ?gocam

QUERY 2: This one is meant to get a GO-CAM-to-GO-term lookup file:

api-gorest-2021/queries/sparql-models.js

Lines 219 to 259 in 480092b

    AllModelsGOs() {
        // Transform the array in string
        var encoded = encodeURIComponent(`
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX metago: <http://model.geneontology.org/>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
        PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
        PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
        PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
        SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
        WHERE
        {
            GRAPH ?gocam {
                ?gocam metago:graphType metago:noctuaCam  .
                ?entity rdf:type owl:NamedIndividual .
                ?entity rdf:type ?goids
            }
            VALUES ?goclasses { BP: MF: CC:  } .
            # rdf:type faster then subClassOf+ but require filter
            # ?goids rdfs:subClassOf+ ?goclasses .
            ?entity rdf:type ?goclasses .
            # Filtering out the root BP, MF & CC terms
            filter(?goids != MF: )
            filter(?goids != BP: )
            filter(?goids != CC: )
            # then getting their definitions
            ?goids rdfs:label ?gonames .
            ?goids definition: ?definitions .
        }
        ORDER BY DESC(?gocam)
        `);
        return "?query=" + encoded;
    },

Raw query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE 
{

    GRAPH ?gocam {
        ?gocam metago:graphType metago:noctuaCam  .
        ?entity rdf:type owl:NamedIndividual .
        ?entity rdf:type ?goids
    }

    VALUES ?goclasses { BP: MF: CC:  } .
    # rdf:type faster then subClassOf+ but require filter
    # ?goids rdfs:subClassOf+ ?goclasses .
    ?entity rdf:type ?goclasses .

    # Filtering out the root BP, MF & CC terms
    filter(?goids != MF: )
    filter(?goids != BP: )
    filter(?goids != CC: )

    # then getting their definitions
    ?goids rdfs:label ?gonames .
    ?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)

@balhoff @kltm Any ideas how we can speed these up to return results in under 30 seconds? They don't need to run crazy fast as they typically only execute when triggered by a GO release (so ~once a month).


balhoff · 2021-12-17T19:46:26Z

@dustine32 for Query 1, if you can do the grouping on the client side, this will complete in 18 seconds:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>
PREFIX in_taxon: <http://purl.obolibrary.org/obo/RO_0002162>
SELECT DISTINCT ?gocam ?identifier ?name
WHERE 
{
  GRAPH ?gocam {
    ?gocam metago:graphType metago:noctuaCam .
  }
  FILTER NOT EXISTS {
    ?gocam prov:wasDerivedFrom ?asserted_cam .
  }
  GRAPH ?gocam {
    ?s enabled_by: ?gpnode .
    ?gpnode rdf:type ?identifier .
    FILTER(?identifier != owl:NamedIndividual) .
  }
  OPTIONAL {
    ?identifier rdfs:label ?label
  }
  BIND(COALESCE(?label, ?identifier) AS ?name)
}

dustine32 · 2021-12-17T19:57:22Z

@balhoff Oh sweet! We can probably add a step here to handle the grouping of the results. Thanks!
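A minimal sketch of what such a client-side grouping step could look like, assuming the endpoint returns the standard SPARQL JSON results format and that the grouped output should keep the `@@`-separated shape of the original `GROUP_CONCAT` columns. The function name `groupByGocam` is hypothetical and is not the helper used in the repository.

```javascript
// Hypothetical helper (not from the repository): groups the flat rows of the
// ungrouped query by ?gocam and re-creates the gpids / gpnames columns.
// Assumes each row follows the SPARQL JSON results shape:
//   { gocam: { value }, identifier: { value }, name: { value } }
function groupByGocam(bindings, separator = "@@") {
  const groups = new Map();
  for (const row of bindings) {
    const gocam = row.gocam.value;
    if (!groups.has(gocam)) {
      groups.set(gocam, { gpids: new Set(), gpnames: new Set() });
    }
    const group = groups.get(gocam);
    // Sets mimic GROUP_CONCAT(distinct ...) by dropping repeated values.
    group.gpids.add(row.identifier.value);
    // Fall back to the identifier if a row has no ?name binding.
    group.gpnames.add(row.name ? row.name.value : row.identifier.value);
  }
  return Array.from(groups, ([gocam, g]) => ({
    gocam,
    gpids: Array.from(g.gpids).join(separator),
    gpnames: Array.from(g.gpnames).join(separator),
  }));
}
```

Because a `Map` preserves insertion order, models come out in the order the endpoint first returned them; sort the input or the output if a stable order matters.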

balhoff · 2021-12-17T19:58:19Z

@dustine32 here is an 11.5 sec version of query 2:

PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX metago: <http://model.geneontology.org/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

PREFIX definition: <http://purl.obolibrary.org/obo/IAO_0000115>
PREFIX BP: <http://purl.obolibrary.org/obo/GO_0008150>
PREFIX MF: <http://purl.obolibrary.org/obo/GO_0003674>
PREFIX CC: <http://purl.obolibrary.org/obo/GO_0005575>
SELECT distinct ?gocam ?goclasses ?goids ?gonames ?definitions
WHERE 
{
  GRAPH ?gocam {
    ?gocam metago:graphType metago:noctuaCam  .
  }
  FILTER NOT EXISTS {
    ?gocam prov:wasDerivedFrom ?asserted_cam .
  }
  GRAPH ?gocam {
    ?entity rdf:type owl:NamedIndividual .
    ?entity rdf:type ?goids
  }
  VALUES ?goclasses { BP: MF: CC:  } . 
    # rdf:type faster then subClassOf+ but require filter 			
    # ?goids rdfs:subClassOf+ ?goclasses .
  ?entity rdf:type ?goclasses .

  # Filtering out the root BP, MF & CC terms
  filter(?goids != MF: )
  filter(?goids != BP: )
  filter(?goids != CC: )

  # then getting their definitions
  ?goids rdfs:label ?gonames .
  ?goids definition: ?definitions .
}
ORDER BY DESC(?gocam)

Linked to geneontology#2. I see the timeout of the RDF endpoint has been increased, but the timeout of the API requests here also needed to be increased. After doing so, I tested locally and the query now works.

lpalbou · 2021-12-30T08:51:49Z

Query2

I tried both versions of query 2 and found no speed improvement (old query: 27 s, then 34 s on a second run; new query: 29 s, then 39 s). As these numbers show, the run time varies greatly depending on when the server receives the query.

It seems that the timeout of the RDF endpoint was increased (raising it to 60 s would probably be a good idea for now), so I also increased the timeout of the GO-CAM API itself: #3. If you merge this PR, query 2 seems to run, which would fix the cache created on AWS Lambda, as it uses https://api.geneontology.xyz/models/go.

Query1

For query 1, I still get a server timeout at 30 s. I am unsure why this is not the case for query 2; maybe there is some config to check on the RDF server? Indeed, removing the grouping on the RDF side gives a much faster query (10 s). Note that this is the query used to create the GPs cache on AWS Lambda: https://api.geneontology.xyz/models/gp . @dustine32 remember cloud9?

Notes

If you want to continue to use a cache and avoid those timeouts, I would suggest using blazegraph-runner during the release and storing the files on the GO S3. Improving query performance is just a temporary fix, as more GO-CAMs will be created. Just be sure to compress the object in S3 (gocam-goterms.json is 10.4 MB, and 1.4 MB compressed) - example
Indexing GO-CAMs would solve that issue and many others, but for it to be used by third-party sites (e.g. Alliance), GOLr would need to be served over https. You could still use the GO API or GO-CAM API (https) as a proxy to deliver https responses from GOLr.
More to the point, @kltm, those caches were created in the first place to enable client-side GO-CAM search at a time when Ben's API didn't exist (which is why we load all terms and all GPs for all GO-CAMs). Now that we have proper server-side search, these caches could, and probably should, be deprecated in favor of server-side search, @tmushayahama. If you do that, the rest of the page only needs data for 10 models, and this was always extremely fast through pagination.
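As a rough illustration of the paginated approach from the last note, here is a sketch of a query builder that asks for one page of GO-CAM model IRIs at a time. The graph pattern is taken from the queries in this issue; the function name `pagedModelsQuery`, its parameters, and the zero-based page convention are assumptions, and this sketch does not filter out inferred models.

```javascript
// Hypothetical query builder (not from the repository): returns the
// "?query=..." string for one page of GO-CAM model IRIs.
function pagedModelsQuery(page, pageSize = 10) {
  if (!Number.isInteger(page) || page < 0) {
    throw new RangeError("page must be a non-negative integer");
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError("pageSize must be a positive integer");
  }
  // ORDER BY keeps the page boundaries stable between requests.
  const query = `
PREFIX metago: <http://model.geneontology.org/>
SELECT ?gocam
WHERE {
  GRAPH ?gocam {
    ?gocam metago:graphType metago:noctuaCam .
  }
}
ORDER BY ?gocam
LIMIT ${pageSize}
OFFSET ${page * pageSize}`;
  return "?query=" + encodeURIComponent(query);
}
```

For example, `pagedModelsQuery(2)` requests models 21 to 30; the GP and GO-term details would then only need to be fetched for those 10 models.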

Happy holiday season to all ! 🎄🎉

dustine32 · 2022-01-05T01:37:31Z

Whoa, thanks again @lpalbou for all the advice! I'm now leaning towards the suggestion in your first note (using blazegraph-runner during the release), but of course I also have to try the easy way out short-term.

Commit 5fe0a4b applies @balhoff's fix for Query1 (/models/gp) and moves handling of "group by gocam" results outside of the query, reusing @lpalbou's super-handy mergeResults function that was just lying there. Results are returned from the API in around 10 seconds.

I tried applying @balhoff's new Query2 but still ran into a timeout issue while testing the lambda locally:

Function 'GOREST' timed out after 30 seconds

Then I bumped this timeout from 30 to 60 sec in the template.yml:

api-gorest-2021/template.yml

Line 21 in 3f2e995

Timeout: 30

This change at least got me to the next error:

Response payload size (10785926 bytes) exceeded maximum allowed payload size (6291556 bytes).

Looks like this 6MB limit is tied to an unchangeable AWS Lambda limit. There are some workarounds such as having the API immediately store the response payload in S3 then returning an S3 URL. This miiight work for us since our goal is to get it into S3 anyway, but it probably won't work for external users (then again, this route has been broken for a while so...). Also, the effort to implement this workaround might as well be spent coding blazegraph-runner calls into the release pipeline. Tagging @kltm.
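To get a feel for how far a response is from the limit, a small size check like the following could be run locally. This is only a sketch: the limit constant is copied from the error message above, `payloadReport` is a hypothetical helper, and it assumes a gzipped body would be returned base64-encoded, as Lambda proxy integrations do for binary responses.

```javascript
const zlib = require("zlib");

// Limit copied from the Lambda error message quoted above.
const LAMBDA_MAX_PAYLOAD_BYTES = 6291556;

// Hypothetical helper (not from the repository): measures a JSON response
// both raw and gzipped, and reports whether each form fits under the limit.
function payloadReport(data) {
  const raw = Buffer.from(JSON.stringify(data), "utf8");
  const gzipped = zlib.gzipSync(raw);
  // Assumption: the gzipped body is sent base64-encoded, which grows it by
  // about a third, so the encoded length is what counts against the limit.
  const gzippedBase64Bytes = Buffer.byteLength(gzipped.toString("base64"), "utf8");
  return {
    rawBytes: raw.length,
    gzippedBase64Bytes,
    fitsRaw: raw.length <= LAMBDA_MAX_PAYLOAD_BYTES,
    fitsGzipped: gzippedBase64Bytes <= LAMBDA_MAX_PAYLOAD_BYTES,
  };
}
```

Given the figures mentioned earlier in this thread (10.4 MB raw, 1.4 MB compressed for gocam-goterms.json), a compressed response might fit under the limit, but clients would then need to accept gzip, so this is not a drop-in replacement for the S3 route.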

lpalbou · 2022-01-05T01:58:55Z

Glad if it helps Dustin 🙂. I do think the longer-term solution would be blazegraph-runner, but in the meantime this may/should work. What really puzzles me is how we reach a 6 MB payload limit. That's a lot; what are we sending? From memory, /models/go or /models/gp would in the worst case send a list of GO-CAM IDs, and by default it already does that for all of them, so am I missing something here?

ps: the “Winston” article was a lot of fun. I love AWS, but sometimes there are hard constraints that can really cause issues (e.g. CodePipeline can't target an existing GH repo 😅)

kltm · 2022-01-12T00:39:56Z

[Note: documentation for manual hack of file update/upload while we work things out: https://docs.google.com/document/d/18vYy9sZq-dyjYWW0mnw3XpXRJjlI7pbQWvMlSSdXdjA/edit#heading=h.tzx1g6nhmgtd .]

kltm · 2022-01-12T02:27:55Z

Closing in favor of geneontology/pipeline#265

kltm added this to In progress in Software essential and proactive maintenance Dec 17, 2021

kltm mentioned this issue Dec 21, 2021

The GO-CAM browser just spins--no display created geneontology/web-gocam#17

Closed

lpalbou mentioned this issue Dec 30, 2021

Change timeout to enable models/go #3

Merged

dustine32 added a commit that referenced this issue Jan 5, 2022

Jim's improved query for models/gp w/ grouping handled in API. For #2

5fe0a4b

kltm mentioned this issue Jan 12, 2022

Add JSON product production for GO-CAM API to pipeline geneontology/pipeline#265

Open

12 tasks

kltm closed this as completed Jan 12, 2022

kltm moved this from In progress to Done in Software essential and proactive maintenance Jan 12, 2022
