upload: progress indication for "digesting"? #400
Comments
I will leave the UX decision to you (a separate column or just reuse an existing one) @jwodder
@yarikoptic I believe …
perhaps use pydra instead and store the cache directory locally somewhere.
@satra Exactly what relevant features does pydra have? Digest progress is the easy part; caching it on the filesystem is the hard part.
that's exactly what pydra would provide, in addition to parallelization of a process :) i would suggest going through the tutorial: https://github.com/nipype/pydra-tutorial
@satra I see that that provides caching of functions, but I don't see how one would get digest progress information out of it. Keep in mind that we're currently digesting using sha256, which isn't parallelizable, unlike the contemplated Merkle tree hash. Or are you recommending pydra due to it having a queryable cache? I don't see that in the docs.
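For illustration, the "easy part" here is just chunked hashing with a callback. A minimal sketch, using a hypothetical `sha256_with_progress` helper (names and signature are illustrative, not dandi's actual code):

```python
import hashlib
import os
from typing import Callable, Optional

def sha256_with_progress(
    path: str,
    callback: Optional[Callable[[int, int], None]] = None,
    chunk_size: int = 1 << 20,
) -> str:
    """Hash a file with sha256, reporting (bytes_done, bytes_total) as it goes."""
    total = os.path.getsize(path)
    done = 0
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
            done += len(chunk)
            if callback is not None:
                callback(done, total)
    return h.hexdigest()
```

Because sha256 consumes bytes strictly in order, progress falls out of the read loop for free; parallelizing the digest itself would require a tree scheme like the Merkle hash mentioned above.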
@satra FWIW pydra is a relatively heavy dependency (and its dependencies down the chain) to use just for caching. Even joblib is somewhat too much, since we use just the memoization part... Also I see it depends on cloudpickle -- is that what it uses for pickling? Its README warns …
which possibly makes it suboptimal (although I am not sure how good we are now anyways: con/fscacher#36)
@jwodder, with all the generators involved it needs more thinking on how to wrap it all up... meanwhile, having support for them in fscacher would be useful, so I filed con/fscacher#37
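For context, con/fscacher#37 asks for memoizing functions that return generators. One obvious approach (a hypothetical sketch, not fscacher's implementation) is to exhaust the generator once, cache the yielded items, and replay them on later calls:

```python
import functools

def cached_generator(func):
    """Memoize a generator function by materializing its output."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            # First call: run the generator to completion and store the items.
            cache[args] = list(func(*args))
        # Replay the cached items as a fresh generator.
        yield from cache[args]

    return wrapper
```

The trade-off is that nothing is yielded until the first run completes, which is exactly why wrapping progress-reporting generators this way "needs more thinking".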
perhaps i misunderstood what you meant by …
are you saying you are caching the in-progress state so you could resume from whatever point it stopped, or are you caching the entire hash? indeed pydra is not suited for progressive filesystem caching. but whenever you run a function, once it has completed, the local cache acts like memoization on disk.
@satra We need the cache to be a datastore mapping tuples to bytes that (a) can be freely queried & modified, not just used to dumbly memoize a function, and (b) is threading- and multiprocessing-safe. I originally envisioned something like how joblib.Memory works, where pickled files are laid out in a directory tree based on hashes of the inputs or something like that, which would mean coming up with our own layout scheme, though now that I think about it more, we might just be able to get away with a sqlite database....
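To make those requirements concrete, here is a minimal sketch of such a datastore backed by SQLite (a hypothetical class, not what was eventually implemented). SQLite serializes writers, which covers the threading/multiprocessing concern:

```python
import pickle
import sqlite3
from contextlib import closing
from typing import Optional

class DigestCache:
    """A (key tuple) -> bytes store that can be freely queried and modified."""

    def __init__(self, path: str) -> None:
        self.path = path
        with closing(sqlite3.connect(self.path)) as db, db:
            db.execute(
                "CREATE TABLE IF NOT EXISTS cache (key BLOB PRIMARY KEY, value BLOB)"
            )

    def get(self, key: tuple) -> Optional[bytes]:
        with closing(sqlite3.connect(self.path)) as db:
            row = db.execute(
                "SELECT value FROM cache WHERE key = ?", (pickle.dumps(key),)
            ).fetchone()
        return row[0] if row else None

    def set(self, key: tuple, value: bytes) -> None:
        # A fresh connection per call keeps this safe across threads and
        # processes; SQLite takes a write lock for each insert.
        with closing(sqlite3.connect(self.path)) as db, db:
            db.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                (pickle.dumps(key), value),
            )
```

Pickling the key is deterministic enough for tuples of paths, sizes, and mtimes, which sidesteps inventing a joblib-style directory layout.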
i think i'm still not fully following what the cache stores, i.e. what the contents of the tuples are and what the bytes are, but it sounds like you have figured out a solution :)
@yarikoptic Regarding the pyout columns: I've renamed "upload" to "pct" and used it for both digest progress and upload progress, but it appears that the "size" and "pct" columns both stop updating after digesting is done, even though they should start counting from zero again while uploading. Do you know why this is? EDIT: Never mind, reusing "upload" messed with a custom display function, so I've put the digest progress in the "message" column for now. The downside is that this ruins the summary below the table.
no problem -- "message": dict(
color=dict(
re_lookup=[["^exists", "yellow"], ["^(failed|error|ERROR)", "red"]]
),
aggregate=counts,
), so for |
@yarikoptic I tried this, but it didn't make a difference.
re the overall approach... I think it could also be done quite non-intrusively (although I am not sure if having a callback bound to Digester would have some negative side effect on fscacher): add an optional callback to Digester and wrap the call to a digester into

```python
ret = func(callback)
raise StopIteration(ret) if ret is not None else StopIteration
```
didn't try to debug but may be because …
please also see #400 (comment) on what may be a simpler path toward the needed functionality while avoiding adding a new dependency (diskcache) with a known NFS issue
@yarikoptic That was it. The summary is still rather flickery, though.
@yarikoptic If I'm understanding …
we can parametrize …
@yarikoptic The good news is that joblib's Memory.cache has an …
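The sentence is cut off above, but joblib's `Memory.cache` does accept an `ignore` list of argument names to exclude from the cache key; assuming that is the feature being referred to, it lets a progress callback vary without invalidating cached digests:

```python
from joblib import Memory

memory = Memory(".joblib-cache", verbose=0)  # hypothetical cache location

@memory.cache(ignore=["callback"])
def cached_digest(path, callback=None):
    # "callback" is not part of the cache key, so a hit is returned
    # immediately no matter which callback (if any) is passed.
    # sha256_with_progress is the hypothetical helper sketched earlier.
    return sha256_with_progress(path, callback=callback)
```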
FTR: the PR was merged and we are waiting for a joblib release; asked in joblib/joblib#1165 (comment)
Our LINC collaborator @dstansby is uploading a large Zarr with the LINC Client and has suggested that we provide feedback for the uploads. Below I have copied the text from the original issue (lincbrain#45).
Thank you. cc @aaronkanzer |
I guess we would need to get back to the postponed …
@yarikoptic The "status" for a Zarr upload should start out at "producing asset". If the reporter's program was displaying nothing before even getting to the pyout display, then the client hadn't even gotten to the point of starting the Zarr upload proper. Running the command as |
@kabilar It's an option of the top-level …
Perfect, thank you. Will test out today. |
"digesting" could take comparable to upload time, in particular in the case of the parallel
upload
s. Although with caching we at least would not bother redoing it twice if did recently, it would be nice if we have a way to report back to the user on the progress of digesting. For upload we have a dedicated columnupload
which shows%
. But I wonder if we better have a dedicated column "progress" which would be populated (and either cleared or report time on how long it took after it is done) used by whatever 'status' column reports (digesting, or uploading, etc).The text was updated successfully, but these errors were encountered: