Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upload: progress indication for "digesting"? #400

Open
yarikoptic opened this issue Feb 18, 2021 · 27 comments
Open

upload: progress indication for "digesting"? #400

yarikoptic opened this issue Feb 18, 2021 · 27 comments
Assignees
Labels
blocked Blocked by some needed development/fix cmd-upload UX

Comments

@yarikoptic
Copy link
Member

"digesting" could take comparable to upload time, in particular in the case of the parallel uploads. Although with caching we at least would not bother redoing it twice if did recently, it would be nice if we have a way to report back to the user on the progress of digesting. For upload we have a dedicated column upload which shows %. But I wonder if we better have a dedicated column "progress" which would be populated (and either cleared or report time on how long it took after it is done) used by whatever 'status' column reports (digesting, or uploading, etc).

@yarikoptic
Copy link
Member Author

yarikoptic commented Mar 11, 2021

I will leave UX decision to you (a separate column or just reuse existing) @jwodder

@jwodder
Copy link
Member

jwodder commented Mar 11, 2021

@yarikoptic I believe joblib.Memory (on which fscacher depends) uses pickle to store data, and iterators aren't pickleable. In order to implement this, we'd need a way to query the cache to see whether it already has an entry for a given path and a way to manually insert entries in the cache, neither of which seem to be supported by joblib.

@satra
Copy link
Member

satra commented Mar 11, 2021

perhaps use pydra instead and store the cache directory locally somewhere.

@jwodder
Copy link
Member

jwodder commented Mar 11, 2021

@satra Exactly what relevant features does pydra have? Digest progress is the easy part; caching it on the filesystem is the hard part.

@satra
Copy link
Member

satra commented Mar 11, 2021

that's exactly what pydra would provide in addition to parallelization of a process :)

i would suggest going through the tutorial: https://github.com/nipype/pydra-tutorial

@jwodder
Copy link
Member

jwodder commented Mar 11, 2021

@satra I see that that provides caching of functions, but I don't see how one would get digest progress information out of it. Keep in mind that we're currently digesting using sha256, which isn't parallelizable, unlike the contemplated Merkel tree hash. Or are you recommending pydra due to it having a queryable cache? I don't see that in the docs.

@yarikoptic
Copy link
Member Author

@satra FWIW pydra is relatively heavy of a dependency (and dependencies down) to use just for caching. Even joblib is somewhat too much since we use just memoization part... Also I see it depends on cloudpickle -- is that is what it uses for pickling? its README warns

Cloudpickle can only be used to send objects between the exact same version of Python.

Using cloudpickle for long-term object storage is not supported and strongly discouraged.

which possibly makes it suboptimal (although I am not sure how good we are now for sure anyways: con/fscacher#36)

@yarikoptic
Copy link
Member Author

@jwodder , with all the generators it needs more thinking on how to wrap it all up ... meanwhile - having support for them in fscacher would be useful, so filed con/fscacher#37

@satra
Copy link
Member

satra commented Mar 12, 2021

perhaps i misunderstood what you meant by

caching it on the filesystem is the hard part.

are you saying caching the progressing state so you could start at whatever point it stopped? or are you caching the entire hash? indeed pydra is not suited for progressive filesystem caching. but whenever you run a function, if it's completed the local cache is like a memoization on disk.

@jwodder
Copy link
Member

jwodder commented Mar 12, 2021

@satra We need the cache to be a datastore mapping tuples to bytes that (a) can be freely queried & modified, not just used to dumbly memoize a function, and (b) is threading- and multiprocessing-safe. I originally envisioned something like how joblib.Memory works, where pickled files are laid out in a directory tree based on hashes of the inputs or something like that, which would mean coming up with our own layout scheme, though now that I think about it more, we might just be able to get away with a sqlite database....

@satra
Copy link
Member

satra commented Mar 12, 2021

i think i'm still not fully following what the cache stores i.e what the contents of tuples are what bytes, but sounds like you have figured out a solution :)

@jwodder
Copy link
Member

jwodder commented Mar 12, 2021

@yarikoptic Regarding the pyout columns: I've renamed "upload" to "pct" and used it for both digest progress and upload progress, but it appears that the "size" and "pct" columns both stop updating after digesting is done, even though they should start counting from zero again while uploading. Do you know why this is?

EDIT: Never mind, reusing "upload" messed with a custom display function, so I've put the digest progress in the "message" column for now. The downside is that this ruins the summary at below the table.

@yarikoptic
Copy link
Member Author

yarikoptic commented Mar 12, 2021

no problem -- summary could be customized and we could exclude progress reporting. ATM it is

        "message": dict(
            color=dict(
                re_lookup=[["^exists", "yellow"], ["^(failed|error|ERROR)", "red"]]
            ),
            aggregate=counts,
        ),

so for aggregate you could just provide a counts_no_progress (or just lambda or filter right there on top of counts) which would first filter entries

@jwodder
Copy link
Member

jwodder commented Mar 12, 2021

@yarikoptic I tried this, but it didn't make a difference.

@yarikoptic
Copy link
Member Author

re the overall approach... I think it could also be done quite non-intrusively (although not sure if having a callback bound to Digester would have some negative side effect on fscacher) via use of the generator_from_callback we have and used for reporting progress from upload for girder backend:

add optional callback to Digester, wrap call to a digester into generator_from_callback and iterate it while getting the final result from StopIteration exception since that is where the final result is provided by the generator_from_callback

                ret = func(callback)
                raise StopIteration(ret) if ret is not None else StopIteration

@yarikoptic
Copy link
Member Author

@yarikoptic I tried this, but it didn't make a difference.

didn't try to debug but may be because % is yielded in "status" and not "message" field which you are filtering?

please also see #400 (comment) on may be a simpler path toward needed functionality and avoiding adding a new dependency (diskcache) with a known NFS issue

@jwodder
Copy link
Member

jwodder commented Mar 12, 2021

@yarikoptic That was it. The summary is still rather flickery, though.

@jwodder
Copy link
Member

jwodder commented Mar 12, 2021

@yarikoptic If I'm understanding generator_from_callback() correctly, you use it by creating a function func that takes a callback, and then you do generator_from_callback(func) to get an iterable of the return values from the callback. Is that correct? What are you envisioning as the func in this case? It can't be get_digest(), because then the callback argument would mess with argument caching. It can't be a function called by get_digest(), because then the generator would be inside get_digest(), and we're back to square one with the impossibility of returning an iterator from a cached function. It can't be a function that calls get_digest(), because get_digest() doesn't return progress information and can't without being uncacheable.

@yarikoptic
Copy link
Member Author

It can't be get_digest(), because then the callback argument would mess with argument caching

we can parametrize PersistentCache with specific exclude_kwargs=None|iterable (e.g. in our case exclude_kwargs=['callback']) to be excluded from the signature used for caching . Then callback could be passed through without affecting caching

@jwodder
Copy link
Member

jwodder commented Mar 12, 2021

@yarikoptic The good news is that joblib's Memory.cache has an ignore parameter that can be used to implement this. The bad news is that the implementation does not work with fscacher's repeated function-wrapping. I filed a bug report.

@jwodder jwodder added the blocked Blocked by some needed development/fix label Mar 24, 2021
@yarikoptic
Copy link
Member Author

FTR: PR was merged and we are waiting for joblib release, asked: joblib/joblib#1165 (comment)

@kabilar
Copy link
Member

kabilar commented Jul 20, 2024

Our LINC collaborator @dstansby is uploading a large Zarr with the LINC Client and has suggested that we provide feedback for the uploads. Below I have copied the text from the original issue (lincbrain#45).

I'm currently uploading a large (300 GB) dataset, or at least I think I am... there's a Python process using my outgoing internet at 200Mbps, but all that the lincbrain cli has told me is:

$ lincbrain upload
2024-05-02 15:20:14,885 [    INFO] Found 6 files to consider

It would be nice if there was some sort of progress bar, or at a minimum a message saying the CLI was starting a file upload.

Thank you.

cc @aaronkanzer

@yarikoptic
Copy link
Member Author

I guess we would need to get back to postponed

@jwodder
Copy link
Member

jwodder commented Jul 22, 2024

@yarikoptic The "status" for a Zarr upload should start out at "producing asset". If the reporter's program was displaying nothing before even getting to the pyout display, then the client hadn't even gotten to the point of starting the Zarr upload proper. Running the command as dandi --log-level debug download ... would give more information about what it's actually doing at that point.

@kabilar
Copy link
Member

kabilar commented Jul 22, 2024

Thanks @jwodder. To clarify, does dandi upload also have this option (--log-level debug)? I didn't see it in the docs.

@jwodder
Copy link
Member

jwodder commented Jul 22, 2024

@kabilar It's an option of the top-level dandi command, not a subcommand option.

@kabilar
Copy link
Member

kabilar commented Jul 22, 2024

Perfect, thank you. Will test out today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked Blocked by some needed development/fix cmd-upload UX
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants