
Faster lookups for reuse leveraging IR #287

Open
jtratner opened this issue Aug 9, 2019 · 12 comments

@jtratner
Contributor

jtratner commented Aug 9, 2019

Right now I still notice that queries for existing applets and files can take a varying amount of time, sometimes quite long. My guess is that some of this has to do with how the system performs when rendering workflows or applets with long specifications.

Would it make sense to use an "in"-style query to limit the response to a smaller set of files? Pseudocode-wise, rather than:

applets := findDataObjects(class=applet)
workflows := findDataObjects(class=workflow)

instead do:

applet_checksums = ['C56E42263E2AD139AC92BF6AE0AF4CDA', ...]
applets := findDataObjects(class=applet, properties=[{"dxWDL_checksum": checksum} for checksum in applet_checksums])
workflows := findDataObjects(class=workflow, properties=...)

I think the complex property will still cause the backend to search through all objects, but it should limit how many of the found objects get described, thus speeding up response time.

(this is based on my guesses of overall implementation)
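The property-constrained lookup sketched above could be expressed as a request payload along these lines. This is a hedged sketch, not dxWDL's actual code (dxWDL is Scala): the `$or` property syntax and the exact field names in the `scope`/`describe` sections are assumptions about the platform's findDataObjects API, and no API call is made here.

```python
# Sketch: build a findDataObjects-style request body that restricts the
# search to objects carrying one of a known set of dxWDL_checksum property
# values. Pure payload construction -- the "$or" shape is an assumption.

def build_find_request(object_class, checksums, project, folder="/"):
    """Return a request payload limiting results by checksum property."""
    return {
        "class": object_class,
        "scope": {"project": project, "folder": folder, "recurse": True},
        "properties": {
            "$or": [{"dxWDL_checksum": c} for c in checksums]
        },
        # Ask only for the fields we need, to keep the describes cheap.
        "describe": {"fields": {"id": True, "name": True, "properties": True}},
    }

applet_checksums = ["C56E42263E2AD139AC92BF6AE0AF4CDA"]
req = build_find_request("applet", applet_checksums, "project-xxxx")
```

The intent is that the server matches on the indexed property first, so only objects whose checksum is already known get described.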

@jtratner
Contributor Author

jtratner commented Aug 9, 2019

(Another optimization would be to only look up (id, name, property) when searching within the current folder, though that may already be happening: Found 7 workflows in project-Fb6Pxx00J4fJ38kjJqbpZ0ZK folder=/workflows/jtratner/<snip> (16349 millisec))

@jtratner
Contributor Author

jtratner commented Aug 9, 2019

For context, I think our standard compiled workflow now has 56 applets and 7 workflows per build. We are creating builds more and more frequently (over the past 6 months we have accumulated 2.6K workflows and 8K applets, and that's ramping up), and we hope to keep using the same project indefinitely for builds in order to leverage reuse.

@orodeh
Contributor

orodeh commented Aug 9, 2019

I dug into this yesterday, with the help of the platform team. There are two separate problems, both on the platform side, not in dxWDL.

  1. If the folder you are searching in is "/", the project root folder, the database search will recurse into the entire project.
  2. The workflows generated by dxWDL are big, much larger than was envisioned a few years ago, when all workflows were written by hand. We think that the queries for workflows return the entire metadata on the back-end, not just the specified fields. Naturally, this is slow for large workflows. If you have lots of large workflows, this is even worse.

The platform team has filed bugs for these, and will work on them.

@orodeh
Contributor

orodeh commented Aug 10, 2019

The dxWDL part of this issue has been fixed; optimizations have been implemented for the find-data-objects queries. Therefore, I am closing it here. The platform-side problems remain open with that team.

@orodeh orodeh closed this as completed Aug 10, 2019
@jtratner
Contributor Author

I actually meant a slightly different point here. Currently, my understanding is that the reuse component of workflow compilation functions as follows:

  1. Compiles WDL to an internal IR
  2. Grabs the first 1000 applets and first 1000 workflows that have a dxWDL_checksum property and creates a big table of digest => executable ID (ObjectDirectory)
  3. Iterates through the IR, calculating the checksum for each applet or workflow on the fly and then seeing if it's present in the object directory.

Now that we have more than 2K applets in our project, step (2) is going to start causing issues, since only the first 1000 results per class are considered.
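The current ObjectDirectory construction described in the steps above could be sketched like this. This is an illustrative model, not dxWDL's actual Scala implementation; the result-record field names are assumptions.

```python
# Sketch of the ObjectDirectory as described: take the first 1000 query
# results per class and index them by their dxWDL_checksum property.
# Once the project holds more than `limit` objects, later matches are
# silently missed -- the problem raised in this comment.

def build_object_directory(found_objects, limit=1000):
    """Map dxWDL_checksum -> executable ID from at most `limit` results."""
    directory = {}
    for obj in found_objects[:limit]:
        checksum = obj.get("properties", {}).get("dxWDL_checksum")
        if checksum is not None and checksum not in directory:
            directory[checksum] = obj["id"]
    return directory
```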

I was thinking of a different strategy for projectWideReuse that would potentially be more performant (or at least require less data across the wire) and would work, without pagination, regardless of the size of the project.

  1. Compile WDL to internal IR
  2. Walk through IR, calculate digests for all items, collect digests into an array for workflows and array for applets.
  3. Decide on some batching size (maybe 10?) and make requests to findDataObjects with properties: {"$or": [{"dxWDL_checksum": "<hash1>"}, {"dxWDL_checksum": "<hash2>"}, ...]}. Use that to construct the ObjectDirectory.
  4. Only rebuild applets or workflows not found.

The good part is that this makes the lookup time proportional to the size of the workflow rather than the size of the project, and (I believe) the lookups should be relatively quick because the backend will query on the property first and only then describe the found IDs (my understanding is there's an index like (project, property, dxID)). That means each request should be (relatively!) small.
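The batched-lookup strategy in the steps above can be sketched as follows. The `find_data_objects` parameter here is a hypothetical stand-in for the platform query, and the `$or` property shape is an assumption; only the chunking and directory-building logic is shown.

```python
# Sketch: chunk the digests computed from the IR into small batches and
# issue one property-constrained query per batch, accumulating a
# digest -> executable ID directory. Objects with no match are simply
# absent from the result and would be rebuilt.

def chunked(seq, size):
    """Yield consecutive slices of `seq` of length at most `size`."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def lookup_by_digest(digests, find_data_objects, batch_size=10):
    """Return {digest: object_id} using batched property queries."""
    directory = {}
    for batch in chunked(digests, batch_size):
        query = {"$or": [{"dxWDL_checksum": d} for d in batch]}
        for obj in find_data_objects(properties=query):
            checksum = obj["properties"]["dxWDL_checksum"]
            directory.setdefault(checksum, obj["id"])
    return directory
```

With this shape the number of requests scales with the number of executables in the compiled workflow, not with the number of objects in the project.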

@orodeh orodeh reopened this Aug 12, 2019
@orodeh
Contributor

orodeh commented Aug 12, 2019

I asked the back-end team, and this is worth a try.

@orodeh
Contributor

orodeh commented Aug 16, 2019

It turns out that this isn't so simple to do, because the checksum cannot be computed from the IR alone; it also covers referenced data objects. It can only be calculated incrementally, while building a complex workflow from the bottom up.

There are two ways I can think of to do this:

  1. Query the entire project, and make a big map from checksum to data-object. This requires one large query at the beginning.
  2. As the workflow is built, query the platform for every new applet/sub-workflow generated. This requires many small queries.

Both approaches are suboptimal, and I am not sure which is better. In the meantime, I limited the number of results returned by adding a constraint on the data-object name: it has to be one of the names we are generating, which are known after the IR phase.

Let's see if 1.18 is sufficient.
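The name-constraint mitigation described above amounts to restricting matches to the executable names the compiler knows it will generate. This sketch shows the idea as client-side filtering; the actual dxWDL change applies the constraint server-side in the query, and the record shape here is an assumption.

```python
# Sketch: after the IR phase the full set of generated executable names
# is known, so any result whose name is not in that set can be discarded
# before building the reuse directory.

def restrict_to_generated(found_objects, generated_names):
    """Keep only results whose name was produced by this compilation."""
    names = set(generated_names)
    return [obj for obj in found_objects if obj["name"] in names]
```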

@orodeh
Contributor

orodeh commented Aug 20, 2019

@jtratner, is this version better?

@jtratner
Contributor Author

What specifically can't be computed just from the IR? Any data objects should be resolvable prior to compilation, right?

@orodeh
Contributor

orodeh commented Sep 26, 2019

Right. But when you create a new applet, workflow, or data object, it gets an unpredictable ID. Say you have a workflow that compiles into applet B, which depends on applet A. B's checksum requires A's ID, so you have to create A first, and only then create B.
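The bottom-up dependency described above can be made concrete with a toy checksum. The hash recipe here is illustrative, not dxWDL's actual digest scheme: the point is only that B's digest takes A's platform-assigned ID as an input, so it cannot be precomputed from the IR.

```python
# Sketch: a checksum that covers both the executable's own definition and
# the platform IDs of its dependencies. Since IDs are assigned only at
# creation time, digests must be computed bottom-up during the build.
import hashlib

def checksum(source, dependency_ids):
    """Digest over the source plus the (sorted) IDs it references."""
    h = hashlib.md5()
    h.update(source.encode())
    for dep_id in sorted(dependency_ids):
        h.update(dep_id.encode())
    return h.hexdigest().upper()

# A has no dependencies: its digest is known from the IR alone.
digest_a = checksum("applet A source", [])

# B depends on A's platform ID, which is unpredictable until A is built.
applet_a_id = "applet-xxxx"  # hypothetical ID assigned at creation time
digest_b = checksum("applet B source", [applet_a_id])
```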

@jtratner
Contributor Author

jtratner commented Sep 26, 2019 via email

@jdidion
Contributor

jdidion commented Feb 22, 2021

More recent versions of dxWDL, as well as dxCompiler, constrain the search by the applet names (which are deterministic). Hopefully this has sped up the query.
