Store intermediate data in a directory named after the compute ID #413

Open

TomNicholas opened this issue Mar 6, 2024 · 6 comments

@TomNicholas (Collaborator)
> We should probably make Cubed store its intermediate data in a directory named `{CONTEXT_ID}/{compute_id}`, but that's a bit more work.

Originally posted by @TomNicholas in cubed-dev/cubed-benchmarks#10 (comment)
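
For illustration, a minimal sketch of that layout (the identifier values here are invented, not what Cubed actually generates):

```python
from pathlib import PurePosixPath

# Hypothetical values: CONTEXT_ID identifies the Cubed session, and
# compute_id identifies a single call to compute() within that session.
CONTEXT_ID = "cubed-20240306-abcdef"
compute_id = "compute-0001"

# Each execution's intermediate arrays get their own directory, so
# separate runs never share (or clobber) intermediate data.
run_dir = PurePosixPath(CONTEXT_ID) / compute_id
print(run_dir / "array-001")
# cubed-20240306-abcdef/compute-0001/array-001
```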

@TomNicholas (Collaborator, Author)

Saving temporary data from individual executions into different directories would be useful for benchmarking. This requires the ops to know both the `CONTEXT_ID` and the `compute_id`.

Currently the `CONTEXT_ID` is a global variable and hence always available, but the `compute_id` is generated when the plan is executed and passed to the executor. The ops functions never see it, so what's the best way to pass the `compute_id` down so that it's available inside the `ops.py` functions? It seems wrong to add extra arguments to e.g. `blockwise`, but it also seems bad to have a global variable that gets rewritten every time a new execution starts...
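
For what it's worth, here is one possible sketch (not something discussed above, and not Cubed's current API) of exposing the `compute_id` without threading it through every op signature, using Python's `contextvars`:

```python
import contextvars

# Hypothetical: a context variable set for the duration of one execution,
# rather than a module-level global that each new run overwrites.
_compute_id = contextvars.ContextVar("compute_id", default=None)

def execute_with_id(plan, compute_id):
    token = _compute_id.set(compute_id)
    try:
        ...  # run the plan; any op called from here can read the id
    finally:
        _compute_id.reset(token)

def some_op():
    # Ops read the id without it appearing in their signatures.
    compute_id = _compute_id.get()
    ...
```

(The obvious caveat: a `ContextVar` only lives within one process, so for remote executors the id would still have to be shipped with each serialized task.)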

@tomwhite changed the title from "Thanks for the explanation!" to "Store intermediate data in a directory named after the compute ID" on Mar 6, 2024
@tomwhite (Member) commented Mar 6, 2024

I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the `compute_id` is created.

I think the easier way to solve the original problem in cubed-dev/cubed-benchmarks#10 would be to just get the intermediate array paths from the DAG.
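
Something along these lines, assuming the plan's DAG is a networkx graph and that each materialized array node carries its target store as a node attribute (the `"target"` attribute name is a guess; the real plan structure may differ):

```python
import networkx as nx

def intermediate_array_paths(dag: nx.MultiDiGraph) -> list[str]:
    # Collect the Zarr store path of every materialized array node.
    paths = []
    for _, data in dag.nodes(data=True):
        target = data.get("target")
        if target is not None and getattr(target, "store", None) is not None:
            paths.append(str(target.store))
    return paths
```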

@tomwhite (Member) commented Mar 6, 2024

Thinking about this more, it would be possible to change `lazy_zarr_array` to take just an array name (`"array-001"`) rather than a store, and then turn it into the full path for the Zarr store only when the store is created at the beginning of the computation. So it's possible, but still a fairly substantial change.
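
Roughly like this hypothetical sketch (the real `lazy_zarr_array` carries much more state, of course):

```python
from dataclasses import dataclass

@dataclass
class LazyZarrArraySketch:
    # Only the short array name is known at plan-construction time,
    # because the compute_id does not exist yet.
    name: str  # e.g. "array-001"

    def store_path(self, context_id: str, compute_id: str) -> str:
        # Resolved at the beginning of the computation, once the
        # compute_id has been generated.
        return f"{context_id}/{compute_id}/{self.name}"
```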

@TomNicholas (Collaborator, Author)

> I think it's worse than that - the intermediate directory paths are created (not on the filesystem, just as strings) as the array functions are called, but before the computation is run - and therefore before the `compute_id` is created.

So should we perhaps instead create only the known part of the directory path (i.e. the "prefixes") at plan construction time, and then join the `compute_id` to make a full path only once the execution begins?

> I think the easier way to solve the original problem in cubed-dev/cubed-benchmarks#10 would be to just get the intermediate array paths from the DAG.

So then the benchmark context managers would need to know about the plan object, right? Or could we add it to what's saved in `history.plan`?

@TomNicholas (Collaborator, Author)

Oh I didn't see your comment when I wrote mine - I think we're suggesting basically the same thing.

I agree this is probably overkill to get cubed-dev/cubed-benchmarks#10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).
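
For example, with per-run directories such a cleanup tool becomes almost trivial (illustrative sketch, assuming a local filesystem and the `{CONTEXT_ID}/{compute_id}` layout above):

```python
import shutil
import time
from pathlib import Path

MAX_AGE = 7 * 24 * 60 * 60  # purge runs older than a week

def purge_old_runs(context_dir):
    # One subdirectory per compute_id means stale runs can be removed
    # wholesale by modification time, without inspecting their contents.
    now = time.time()
    for run_dir in Path(context_dir).iterdir():
        if run_dir.is_dir() and now - run_dir.stat().st_mtime > MAX_AGE:
            shutil.rmtree(run_dir)
```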

@tomwhite (Member) commented Mar 7, 2024

> I agree this is probably overkill to get cubed-dev/cubed-benchmarks#10 working, but I do think being able to distinguish different run directories might be useful in other contexts (e.g. perhaps an external tool whose job is to periodically purge temporary data from older runs).

Yes, that would be useful.
