Design and merge plan: operator output port result cache (MVP) #5880
Xiao-zhen-Liu
started this conversation in
Ideas
Replies: 1 comment
@Xiao-zhen-Liu Thanks for the great summary. Please also describe our plan to manage the lifecycle of the cached results.
Design and merge plan: operator output port result cache (MVP)
In current Texera main, the engine runs a workflow from the start every time, even when the user has changed only one operator near the end. This proposal adds a result cache so that, on a re-run, an output port whose upstream computation logic is unchanged reads its saved result instead of recomputing it. The code is written and working on a prototype branch. This post describes the design and the plan to bring it into main as small PRs, so anyone can raise concerns before the PRs go up.
Matching results across executions
Each output port has a cache key built from its upstream operators, their parameters, their output schemas, and the wiring between them. Two ports with the same cache key produce the same result (output port equivalence). When a run saves a port's result, we record (workflow, port, cache key) -> result location. On a later run, a port whose cache key has a recorded result is a matched port, and that result can be reused. Any edit upstream of a port changes its cache key, so its old result is no longer matched.
Scope (MVP)
In scope: reuse the saved result at every matched port (full reuse), match by cache key, invalidate entries that no longer match after an edit, and read, write, and clear the cache from the UI.
Out of scope (future work, not in these PRs): choosing per port whether reuse is cheaper than recompute (cost-based reuse planning), and removing results under storage limits (eviction). The merged code always reuses a matched port's result.
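The matching and invalidation rules above can be sketched in a few lines of Python. This is only an illustration under assumptions, not the prototype's code: the helper names (`upstream_closure`, `cache_key`, `ResultCache`), the choice of SHA-256 over a JSON payload, the operator descriptions, and the result location string are all hypothetical.

```python
import hashlib
import json


def upstream_closure(op_id, links):
    """Return op_id plus every operator that feeds into it, directly or not."""
    seen, stack = set(), [op_id]
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(src for src, dst in links if dst == cur)
    return seen


def cache_key(port, operators, links):
    """Hash the upstream operators, their parameters, schemas, and wiring."""
    op_id, port_index = port
    ops = upstream_closure(op_id, links)
    payload = {
        "port": [op_id, port_index],
        "operators": {o: operators[o] for o in sorted(ops)},
        "links": sorted([s, d] for s, d in links if d in ops),
    }
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


class ResultCache:
    """In-memory stand-in for (workflow, port, cache key) -> result location."""

    def __init__(self):
        self._entries = {}

    def record(self, workflow_id, port, key, location):
        self._entries[(workflow_id, port, key)] = location

    def lookup(self, workflow_id, port, key):
        # A non-None return value means the port is a matched port.
        return self._entries.get((workflow_id, port, key))

    def invalidate(self, workflow_id, current_keys):
        """Drop this workflow's entries whose key differs from the current plan."""
        stale = [e for e in self._entries
                 if e[0] == workflow_id and current_keys.get(e[1]) != e[2]]
        for entry in stale:
            del self._entries[entry]
        return len(stale)


# Hypothetical two-operator workflow: scan -> filter.
operators = {
    "scan": {"type": "CSVScan", "params": {"file": "a.csv"}, "schema": ["text", "lang"]},
    "filter": {"type": "Filter", "params": {"lang": "en"}, "schema": ["text", "lang"]},
}
links = [("scan", "filter")]
port = ("filter", 0)

cache = ResultCache()
key_v1 = cache_key(port, operators, links)
cache.record("wf-1", port, key_v1, "results/wf-1/filter-0")  # made-up location

# Re-run with no edits: same key, so the port is matched.
hit_before_edit = cache.lookup("wf-1", port, cache_key(port, operators, links))

# Edit an upstream parameter: the key changes and the old result no longer matches.
operators["scan"]["params"]["file"] = "b.csv"
key_v2 = cache_key(port, operators, links)
hit_after_edit = cache.lookup("wf-1", port, key_v2)
dropped = cache.invalidate("wf-1", {port: key_v2})
```

Because the key covers the whole upstream closure, editing `scan` changes the key of the `filter` port even though `filter` itself was not touched, which is the behavior the proposal describes for upstream edits.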
How it fits the current system
Current main (figure below): the Workflow Compiler builds a physical plan, CostBasedScheduleGenerator builds a schedule of regions, and the executor runs the regions on workers that read and write tables in storage.
With the cache MVP (figure below), the engine includes additional modules:
[...] (CostBasedScheduleGenerator): schedule the run-skeleton as it does today. The removed part becomes regions that are skipped, and operators that read from it use the saved result locations.
The executor saves results to the cached-result storage as ports finish.
Nothing changes when there are no matched ports
On the first run of a workflow, or any run right after an upstream edit, there are no matched ports: the run-skeleton is the whole plan and the schedule is the same as today. The cache changes behavior only when a matched port exists, so the code can land inactive and turn on once results are saved.
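One way to picture the run-skeleton is as a backward walk from the workflow's final operators that stops at matched ports. The sketch below is a simplification under stated assumptions, not the prototype's algorithm: each operator is treated as having a single output port, and `run_skeleton`, the `upstream` map, and the operator names are hypothetical.

```python
from collections import deque


def run_skeleton(upstream, sinks, matched):
    """Split a plan into operators that must run and matched ports to reuse.

    upstream: operator -> list of operators it reads from
    sinks:    operators whose results the user wants
    matched:  operators whose (single) output port has a saved result
    """
    to_run, reused = set(), set()
    queue = deque(sinks)
    while queue:
        op = queue.popleft()
        if op in matched:
            # Read the saved result; nothing upstream of it needs to run.
            reused.add(op)
            continue
        if op in to_run:
            continue
        to_run.add(op)
        queue.extend(upstream.get(op, []))
    return to_run, reused


# Hypothetical plan: scan -> filter -> join <- scan2, and join -> chart.
upstream = {
    "filter": ["scan"],
    "join": ["filter", "scan2"],
    "chart": ["join"],
}

first_run = run_skeleton(upstream, ["chart"], matched=set())
after_edit_to_join = run_skeleton(upstream, ["chart"], matched={"filter"})
after_edit_to_chart = run_skeleton(upstream, ["chart"], matched={"join"})
```

With an empty `matched` set the walk visits every operator, so the skeleton is the whole plan, mirroring the point that behavior is unchanged when there are no matched ports. With `filter` matched, `scan` is never visited and `join` would read the saved `filter` result instead.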
Merge plan: five PRs
In dependency order; each has its own issue:
PRs 1 and 2 are independent. PR 3 needs 1 and 2; PR 4 needs 3; PR 5 needs 4.