Store intermediate data in a directory named after the compute ID #413

Open

TomNicholas opened this issue Mar 6, 2024 · 6 comments

@TomNicholas (Collaborator)
> We should probably make Cubed store its intermediate data in a directory named `{CONTEXT_ID}/{compute_id}`, but that's a bit more work.

Originally posted by @TomNicholas in cubed-dev/cubed-benchmarks#10 (comment)
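
For illustration, a minimal sketch of that layout (the identifier values here are invented, not what Cubed actually generates):

```python
from pathlib import PurePosixPath

# Hypothetical values: CONTEXT_ID identifies the Cubed session, and
# compute_id identifies a single call to compute() within that session.
CONTEXT_ID = "cubed-20240306-abcdef"
compute_id = "compute-0001"

# Each execution's intermediate arrays get their own directory, so
# separate runs never share (or clobber) intermediate data.
run_dir = PurePosixPath(CONTEXT_ID) / compute_id
print(run_dir / "array-001")
# cubed-20240306-abcdef/compute-0001/array-001
```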

@TomNicholas (Collaborator, Author)

Saving temporary data from individual executions into different directories would be useful for benchmarking. This requires the ops to know both the `CONTEXT_ID` and the `compute_id`.

Currently the `CONTEXT_ID` is a global variable and hence always available, but the `compute_id` is generated when the plan is executed and passed to the executor. The ops functions never see it, so what's the best way to pass the `compute_id` down so that it's available inside the `ops.py` functions? It seems wrong to add extra arguments to e.g. `blockwise`, but it also seems bad to have a global variable that gets rewritten every time a new execution starts...
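
For what it's worth, here is one possible sketch (not something discussed above, and not Cubed's current API) of exposing the `compute_id` without threading it through every op signature, using Python's `contextvars`:

```python
import contextvars

# Hypothetical: a context variable set for the duration of one execution,
# rather than a module-level global that each new run overwrites.
_compute_id = contextvars.ContextVar("compute_id", default=None)

def execute_with_id(plan, compute_id):
    token = _compute_id.set(compute_id)
    try:
        ...  # run the plan; any op called from here can read the id
    finally:
        _compute_id.reset(token)

def some_op():
    # Ops read the id without it appearing in their signatures.
    compute_id = _compute_id.get()
    ...
```

(The obvious caveat: a `ContextVar` only lives within one process, so for remote executors the id would still have to be shipped with each serialized task.)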

@tomwhite changed the title from "Thanks for the explanation!" to "Store intermediate data in a directory named after the compute ID" on Mar 6, 2024
@tomwhite (Member) commented Mar 6, 2024

I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the `compute_id` is created.

I think the easier way to solve the original problem in cubed-dev/cubed-benchmarks#10 would be to just get the intermediate array paths from the DAG.
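
Something along these lines, assuming the plan's DAG is a networkx graph and that each materialized array node carries its target store as a node attribute (the `"target"` attribute name is a guess; the real plan structure may differ):

```python
import networkx as nx

def intermediate_array_paths(dag: nx.MultiDiGraph) -> list[str]:
    # Collect the Zarr store path of every materialized array node.
    paths = []
    for _, data in dag.nodes(data=True):
        target = data.get("target")
        if target is not None and getattr(target, "store", None) is not None:
            paths.append(str(target.store))
    return paths
```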

@tomwhite (Member) commented Mar 6, 2024

Thinking about this more, it would be possible to change `lazy_zarr_array` to take just an array name (`"array-001"`) rather than a store, and then turn it into the full path for the Zarr store only when the store is created at the beginning of the computation. So it's possible, but still a fairly substantial change.
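
Roughly like this hypothetical sketch (the real `lazy_zarr_array` carries much more state, of course):

```python
from dataclasses import dataclass

@dataclass
class LazyZarrArraySketch:
    # Only the short array name is known at plan-construction time,
    # because the compute_id does not exist yet.
    name: str  # e.g. "array-001"

    def store_path(self, context_id: str, compute_id: str) -> str:
        # Resolved at the beginning of the computation, once the
        # compute_id has been generated.
        return f"{context_id}/{compute_id}/{self.name}"
```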

@TomNicholas (Collaborator, Author)

> I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the `compute_id` is created.

So should we perhaps instead create only the known part of the directory path (i.e. the "prefixes") at plan construction time, and then join the `compute_id` to make a full path only once the execution begins?

> I think the easier way to solve the original problem in cubed-dev/cubed-benchmarks#10 would be to just get the intermediate array paths from the DAG.

So then the benchmark context managers would need to know about the plan object, right? Or could we add it to what's saved in `history.plan`?

@TomNicholas (Collaborator, Author)

Oh I didn't see your comment when I wrote mine - I think we're suggesting basically the same thing.

I agree this is probably overkill to get cubed-dev/cubed-benchmarks#10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).
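
For example, with per-run directories such a cleanup tool becomes almost trivial (illustrative sketch, assuming a local filesystem and the `{CONTEXT_ID}/{compute_id}` layout above):

```python
import shutil
import time
from pathlib import Path

MAX_AGE = 7 * 24 * 60 * 60  # purge runs older than a week

def purge_old_runs(context_dir):
    # One subdirectory per compute_id means stale runs can be removed
    # wholesale by modification time, without inspecting their contents.
    now = time.time()
    for run_dir in Path(context_dir).iterdir():
        if run_dir.is_dir() and now - run_dir.stat().st_mtime > MAX_AGE:
            shutil.rmtree(run_dir)
```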

@tomwhite (Member) commented Mar 7, 2024

> I agree this is probably overkill to get cubed-dev/cubed-benchmarks#10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).

Yes, that would be useful.
