
upload: progress indication for "digesting"? #400

Open · yarikoptic opened this issue Feb 18, 2021 · 21 comments
Labels: blocked (Blocked by some needed development/fix), cmd-upload, UX

Comments

@yarikoptic (Member)

"digesting" can take time comparable to the upload itself, particularly with parallel uploads. Thanks to caching we at least would not redo it if it was done recently, but it would be nice to have a way to report digesting progress back to the user. For upload we have a dedicated "upload" column which shows a %. But I wonder whether we would be better off with a dedicated "progress" column, populated by whatever the "status" column reports (digesting, uploading, etc.) and either cleared, or left showing how long the stage took, once it is done.

@yarikoptic (Member Author) commented Mar 11, 2021

I will leave the UX decision (a separate column, or reuse of the existing one) to you, @jwodder

@jwodder (Member) commented Mar 11, 2021

@yarikoptic I believe joblib.Memory (on which fscacher depends) uses pickle to store data, and iterators aren't pickleable. In order to implement this, we'd need a way to query the cache to see whether it already has an entry for a given path and a way to manually insert entries in the cache, neither of which seem to be supported by joblib.
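The un-pickleability jwodder mentions is easy to demonstrate with the standard library alone; this is a minimal illustration (the `digest_chunks` generator is a made-up stand-in, not dandi code):

```python
import pickle

def digest_chunks():
    # stand-in for a digester that yields progress as it goes
    yield "25%"
    yield "100%"

gen = digest_chunks()
try:
    # joblib.Memory would attempt something like this when storing the result
    pickle.dumps(gen)
except TypeError as err:
    print(f"not pickleable: {err}")
```

Since the generator cannot be serialized, the memoized function cannot simply return an iterator of progress updates.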

@satra (Member) commented Mar 11, 2021

perhaps use pydra instead and store the cache directory locally somewhere.

@jwodder (Member) commented Mar 11, 2021

@satra Exactly what relevant features does pydra have? Digest progress is the easy part; caching it on the filesystem is the hard part.

@satra (Member) commented Mar 11, 2021

that's exactly what pydra would provide in addition to parallelization of a process :)

i would suggest going through the tutorial: https://github.com/nipype/pydra-tutorial

@jwodder (Member) commented Mar 11, 2021

@satra I see that it provides caching of functions, but I don't see how one would get digest progress information out of it. Keep in mind that we currently digest using sha256, which isn't parallelizable, unlike the contemplated Merkle tree hash. Or are you recommending pydra because it has a queryable cache? I don't see that in the docs.
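For context, the "easy part" of digest progress amounts to streaming the file through sha256 in chunks and invoking a callback along the way. A sketch (the function name, callback shape, and chunk size are illustrative, not dandi's actual API):

```python
import hashlib
import os

def sha256_with_progress(path, callback, chunk_size=1 << 20):
    # hypothetical helper: hash `path` incrementally and report
    # percent-complete to `callback` after each chunk
    total = os.path.getsize(path)
    digest = hashlib.sha256()
    done = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            done += len(chunk)
            callback(100 * done // max(total, 1))
    return digest.hexdigest()
```

The hard part, as discussed below, is making such a function coexist with on-disk memoization of its result.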

@yarikoptic (Member Author)

@satra FWIW pydra is a rather heavy dependency (with its own tree of dependencies) to use just for caching. Even joblib is somewhat too much, since we use only its memoization part... Also, I see it depends on cloudpickle -- is that what it uses for pickling? Its README warns:

Cloudpickle can only be used to send objects between the exact same version of Python.

Using cloudpickle for long-term object storage is not supported and strongly discouraged.

which possibly makes it suboptimal (although I am not sure how well we fare now anyway: con/fscacher#36)

@yarikoptic (Member Author)

@jwodder, with all the generators involved it needs more thought on how to wrap it all up... Meanwhile, having support for them in fscacher would be useful, so I filed con/fscacher#37

@satra (Member) commented Mar 12, 2021

perhaps i misunderstood what you meant by

caching it on the filesystem is the hard part.

are you saying you want to cache the in-progress state so you could resume from wherever it stopped? or are you caching the entire hash? indeed pydra is not suited for progressive filesystem caching. but whenever you run a function to completion, its local cache acts like on-disk memoization.

@jwodder (Member) commented Mar 12, 2021

@satra We need the cache to be a datastore mapping tuples to bytes that (a) can be freely queried & modified, not just used to dumbly memoize a function, and (b) is threading- and multiprocessing-safe. I originally envisioned something like how joblib.Memory works, where pickled files are laid out in a directory tree based on hashes of the inputs or something like that, which would mean coming up with our own layout scheme, though now that I think about it more, we might just be able to get away with a sqlite database....
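The sqlite idea floated above can be sketched in a few lines. This is a hypothetical illustration of the requirements jwodder lists (queryable, modifiable, tuple-keyed), not dandi's actual implementation; sqlite serializes concurrent writers, which is what makes it tolerable under threads and multiple processes:

```python
import pickle
import sqlite3

class DigestCache:
    """Hypothetical sketch: a queryable on-disk mapping from key tuples
    to bytes.  Keys are pickled (deterministic enough for simple tuples
    of strings/ints on one Python version)."""

    def __init__(self, path):
        self.path = path
        with sqlite3.connect(self.path) as db:
            db.execute(
                "CREATE TABLE IF NOT EXISTS cache"
                " (key BLOB PRIMARY KEY, value BLOB)"
            )

    def get(self, key):
        # return the cached bytes, or None when the key is absent
        with sqlite3.connect(self.path) as db:
            row = db.execute(
                "SELECT value FROM cache WHERE key = ?",
                (pickle.dumps(key),),
            ).fetchone()
        return row[0] if row else None

    def set(self, key, value):
        with sqlite3.connect(self.path) as db:
            db.execute(
                "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                (pickle.dumps(key), value),
            )
```

Unlike a memoizing decorator, this store can be probed ("is the digest for this path already cached?") and written to out-of-band, which is exactly what joblib does not expose.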

@satra (Member) commented Mar 12, 2021

i think i'm still not fully following what the cache stores, i.e. what the contents of the tuples are and what the bytes are, but it sounds like you have figured out a solution :)

@jwodder (Member) commented Mar 12, 2021

@yarikoptic Regarding the pyout columns: I've renamed "upload" to "pct" and used it for both digest progress and upload progress, but it appears that the "size" and "pct" columns both stop updating after digesting is done, even though they should start counting from zero again while uploading. Do you know why this is?

EDIT: Never mind, reusing "upload" messed with a custom display function, so I've put the digest progress in the "message" column for now. The downside is that this ruins the summary below the table.

@yarikoptic (Member Author) commented Mar 12, 2021

no problem -- the summary could be customized and we could exclude progress reporting from it. ATM it is:

        "message": dict(
            color=dict(
                re_lookup=[["^exists", "yellow"], ["^(failed|error|ERROR)", "red"]]
            ),
            aggregate=counts,
        ),

so for aggregate you could just provide a counts_no_progress (or just a lambda, or filter right there on top of counts) which would first filter out the progress entries
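A filtering aggregate along the lines yarikoptic suggests might look like this. Everything here is hypothetical: the name `counts_no_progress` comes from the comment above, the "%" heuristic for spotting progress entries is an assumption, and this only assumes that pyout aggregates receive the column's list of values:

```python
from collections import Counter

def counts_no_progress(values):
    # hypothetical aggregate for the "message" column: drop transient
    # progress entries (anything containing a %) before counting, so the
    # summary under the table is not churned by every progress update
    kept = [v for v in values if "%" not in str(v)]
    return " ".join(f"{n} {msg}" for msg, n in Counter(kept).items())
```

It would slot in as `aggregate=counts_no_progress` in the "message" column spec shown above.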

@jwodder (Member) commented Mar 12, 2021

@yarikoptic I tried this, but it didn't make a difference.

@yarikoptic (Member Author)

re the overall approach... I think it could also be done quite non-intrusively (although I am not sure whether having a callback bound to Digester would have negative side effects on fscacher) via the generator_from_callback we already have and use for reporting upload progress in the girder backend:

add an optional callback to Digester, wrap the call to the digester in generator_from_callback, and iterate it, getting the final result from the StopIteration exception, since that is where generator_from_callback provides it:

                ret = func(callback)
                raise StopIteration(ret) if ret is not None else StopIteration

@yarikoptic (Member Author)

@yarikoptic I tried this, but it didn't make a difference.

I didn't try to debug, but could it be because the % is yielded in the "status" field and not in the "message" field which you are filtering?

please also see #400 (comment) for a possibly simpler path toward the needed functionality, which would avoid adding a new dependency (diskcache) with a known NFS issue

@jwodder (Member) commented Mar 12, 2021

@yarikoptic That was it. The summary is still rather flickery, though.

@jwodder (Member) commented Mar 12, 2021

@yarikoptic If I'm understanding generator_from_callback() correctly, you use it by creating a function func that takes a callback, and then you do generator_from_callback(func) to get an iterable of the return values from the callback. Is that correct? What are you envisioning as the func in this case? It can't be get_digest(), because then the callback argument would mess with argument caching. It can't be a function called by get_digest(), because then the generator would be inside get_digest(), and we're back to square one with the impossibility of returning an iterator from a cached function. It can't be a function that calls get_digest(), because get_digest() doesn't return progress information and can't without being uncacheable.
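jwodder's reading of the API can be checked against a minimal reimplementation. This is a thread-and-queue sketch matching the behavior described in the thread, not dandi's actual code; `fake_digest` is an illustrative stand-in:

```python
import queue
import threading

def generator_from_callback(func):
    # hypothetical sketch: run func(callback) in a worker thread, yield
    # every value the callback receives, and surface func's return value
    # as the generator's StopIteration value
    q = queue.Queue()
    done = object()          # sentinel marking the end of the stream
    result = {}

    def worker():
        try:
            result["value"] = func(q.put)
        finally:
            q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while (item := q.get()) is not done:
        yield item
    return result.get("value")


def fake_digest(callback):
    # stand-in for a digester reporting percent-complete
    for pct in (25, 50, 100):
        callback(pct)
    return "sha256:feedbeef"  # illustrative value, not a real digest
```

The consumer iterates for progress and catches StopIteration (whose `.value` holds the generator's return) for the final digest, which is the pattern yarikoptic's quoted snippet relies on.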

@yarikoptic (Member Author)

It can't be get_digest(), because then the callback argument would mess with argument caching

we can parametrize PersistentCache with a specific exclude_kwargs=None|iterable (e.g. exclude_kwargs=['callback'] in our case) to be excluded from the signature used for caching. Then the callback could be passed through without affecting caching.
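The proposed exclude_kwargs mechanism can be sketched with a toy in-memory memoizer (joblib's `ignore=` parameter, mentioned in the next comment, does essentially this with an on-disk store). All names here are hypothetical, not fscacher's actual API:

```python
import functools

def memoize(exclude_kwargs=()):
    # toy sketch of the proposed PersistentCache parameter: build the
    # cache key from the arguments *minus* the excluded kwargs, so a
    # per-call callback does not defeat memoization
    def decorator(func):
        store = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            keyed = {k: v for k, v in kwargs.items() if k not in exclude_kwargs}
            key = (args, tuple(sorted(keyed.items())))
            if key not in store:
                store[key] = func(*args, **kwargs)
            return store[key]

        return wrapper

    return decorator


calls = []

@memoize(exclude_kwargs=("callback",))
def get_digest(path, callback=None):
    # hypothetical digester; `calls` records cache misses
    calls.append(path)
    if callback is not None:
        callback(100)
    return f"digest-of-{path}"
```

Note one consequence the thread circles around: on a cache hit the function body never runs, so the callback fires only on the first (uncached) call.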

@jwodder (Member) commented Mar 12, 2021

@yarikoptic The good news is that joblib's Memory.cache has an ignore parameter that can be used to implement this. The bad news is that the implementation does not work with fscacher's repeated function-wrapping. I filed a bug report.

@jwodder jwodder added the blocked Blocked by some needed development/fix label Mar 24, 2021
@yarikoptic (Member Author)

FTR: the PR was merged and we are waiting for a joblib release; asked in joblib/joblib#1165 (comment)
