Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AQL memoize subquery or pre-load collection to coordinator before executing query #17244

Open
dan1els opened this issue Sep 29, 2022 · 0 comments

Comments

@dan1els
Copy link

dan1els commented Sep 29, 2022

My Environment

  • ArangoDB Version: 3.9.0
  • Deployment Mode: Cluster
  • Deployment Strategy: Kubernetes
  • Configuration: 6 coordinators, 6 primaries, 3 agency
  • Infrastructure: own
  • Operating System: -
  • Total RAM in your machine: 32Gb
  • Disks in use: SSD
  • Used Package: official Docker library

Component, Query & Data

Affected feature:
AQL

Size of your Dataset on disk:
~70Gb

Problem:

I have a question regarding caching some AQL subqueries on coordinator.
I'm traversing a tree structure recursively, like

 for step1Doc,step1Edge,step1Path in 1..1 INBOUND root edges
      let step1Children = (  
       // subquery call
        for step2Doc,step2Edge,step2Path in 1..1 INBOUND step1Doc edges       
           // subquery call
          let step2Children = (
              // subquery call
            for step3Doc,step3Edge,step3Path in 1..1 INBOUND step2Doc edges
                // subquery call
              let step3Children = (
                 // subquery call
                for step4Doc,step4Edge,step4Path in 1..1 INBOUND step3Doc edges    
                  return distinct step4Doc
              )      
              return step3Children
          )
          ...
  )

And also I have a separate collection with additional metadata (static) related to the each vertex, which is a really small one (about 200 elements). I wish to query this collection for each node on each level to add some data to the result, but doing it drastically increase execution time (from 9 sec. to a minute)

The problem is that the subquery is being evaluated each time for each vertex and too many connections is spawned between coordinator and dbs. It's clearly seen in the profiler report:

 398   SubqueryStartNode   COOR   5252   10390       0.02294               - LET doc4 = ( /* subquery begin */
 399   ScatterNode         COOR   5252   10390       0.04006                     - SCATTER
 400   RemoteNode          DBS   10244   10390       2.31177                      - REMOTE
 329   IndexNode           DBS   10244    6338       0.07971                          - FOR doc IN metadata   /* primary index scan, 1 shard(s) */    
 368   RemoteNode          COOR  10244    6338       2.42944                       - REMOTE
 369   GatherNode          COOR   5129    6338       0.03051                          - GATHER   /* unsorted */
 401   SubqueryEndNode     COOR   5128    5269       0.00300                    - RETURN  doc ) /* subquery end */

Results without subquery:

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00000
 parsing                    0.00028
 optimizing ast             0.00004
 loading collections        0.00000
 instantiating plan         0.00034
 optimizing plan            0.05803
 executing                  9.05962
 finalizing                 0.00117

Results with subquery:

Query Profile:
 Query Stage           Duration [s]
 initializing               0.00001
 parsing                    0.00071
 optimizing ast             0.00009
 loading collections        0.00000
 instantiating plan         0.00175
 optimizing plan            0.05680
 executing                 40.63792
 finalizing                 0.00836

Another bad side effect is that network load is too high.

Expected result:

So, I'm looking for some kind of memoize operation or any other way to explicitly evaluate and the result of subquery, or for some way to preload these kind of collections to the coordinator before the query starts.

Actually, I can execute 3 different queries and then aggregate the results somewhere on the backend if there's no way to get job done inside arangodb.

UPD:
Actually I figured out the next trick. I added next line on the top of the query:

let collectionResults = true ? (subquery) : []

This forced execution of the query on startup and same results are reused.

@dan1els dan1els changed the title AQL memoize subquery or preloading collection on coordinator before executing query AQL memoize subquery or pre-load collection to coordinator before executing query Sep 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant