
more efficient caching #848

Open

satra opened this issue Dec 7, 2021 · 7 comments

Comments

@satra
Member

satra commented Dec 7, 2021

The two imaging dandisets are large and will continuously run into caching efficiency issues. Giacomo's is only about 5TB, but Lee's is around 120TB and growing. Any kind of BIDS-related rewrite could thus involve significant checksum computation overhead that could take weeks. I would say it's time to consider the efficiency of local checksum computation for both Zarr versions and large files. The overall problem is to ensure that a local directory can be checksummed efficiently.

One easy way is to maintain a table mapping an mtime+size key to the dandi-etag in the cache. A rename or move of a file does not change this key, and a file can even be copied across filesystems with both of those elements preserved. Thus a table that is simply an LRU-type cache would allow for local movement instead of tying the digest to a path name.
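
A minimal sketch of that idea (the class name, key layout, and eviction policy below are illustrative assumptions, not the dandi-cli cache format): a small LRU-style table keyed by (size, mtime) lets a renamed or moved file with an unchanged stat signature reuse its previously computed dandi-etag.

```python
from __future__ import annotations

from collections import OrderedDict
from pathlib import Path


class StatKeyedDigestCache:
    """Hypothetical LRU-style table mapping (size, mtime_ns) -> cached digest.

    Illustration only: in practice the key would likely need more context
    (e.g. the containing dandiset) to keep accidental collisions unlikely.
    """

    def __init__(self, maxsize: int = 100_000) -> None:
        self.maxsize = maxsize
        self._table: OrderedDict[tuple[int, int], str] = OrderedDict()

    @staticmethod
    def _key(path: Path) -> tuple[int, int]:
        st = path.stat()
        return (st.st_size, st.st_mtime_ns)

    def lookup(self, path: Path) -> str | None:
        key = self._key(path)
        digest = self._table.get(key)
        if digest is not None:
            self._table.move_to_end(key)  # refresh LRU position
        return digest

    def store(self, path: Path, digest: str) -> None:
        self._table[self._key(path)] = digest
        if len(self._table) > self.maxsize:
            self._table.popitem(last=False)  # evict the least recently used entry
```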

@jwodder
Member

jwodder commented Dec 7, 2021

@satra So you're saying that if a file about to be digested has the same mtime and size as any previously digested file, it should be assumed to be the same file and the previous digest returned? That seems liable to frequently give wrong results.

@satra
Member Author

satra commented Dec 7, 2021

@jwodder - mtime + size by itself is insufficient as a checksum. However, in the specific setting of a dandiset it can be used as a proxy. In the cache I would maintain the path and allow the option of replacing it with a new path, so a rename would still work. For example, dandi mv path1 path2 would update the cache and not require re-checksumming.
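
A rough sketch of what such a command could do (the command name comes from this discussion; the JSON cache file and its location are purely hypothetical, not how dandi-cli stores digests):

```python
import json
import shutil
from pathlib import Path

CACHE_FILE = Path(".dandi/digest-cache.json")  # hypothetical path-keyed digest cache


def dandi_mv(old_path: str, new_path: str) -> None:
    """Rename a file and carry its cached digest over to the new path,
    so the content does not have to be re-checksummed."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    shutil.move(old_path, new_path)
    if old_path in cache:
        cache[new_path] = cache.pop(old_path)  # digest follows the file
        CACHE_FILE.write_text(json.dumps(cache, indent=2))
```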

@yarikoptic
Member

I would really like to avoid mtime+size alone as some kind of a "proxy" measure. IMHO it is more important to have it correct (even if slow) than merely likely to be correct (but faster).

In principle dandi mv could be implemented, but we would need to figure out how to refactor fscacher, which we developed and use for caching such computed results.
(Note also that any dandi-cli, dandischema, pynwb, or other listed library update would also invalidate the cache; maybe for digesting we should relax that a bit.)

A complementary/alternative approach is for people to use git-annex (and/or datalad) to version their data (unless they are on Windows or a filesystem without symlink support); then fscacher, AFAIK (since we have a test https://github.com/con/fscacher/blob/HEAD/src/fscacher/tests/test_cache.py#L146 and con/fscacher#44 open), would resolve the symlink to the actual file with the content, thus avoiding re-digesting. If it still re-digests, we should get that fixed, I guess.
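
To illustrate why the git-annex route avoids re-digesting (a simplified sketch of the fingerprinting idea, not fscacher's actual implementation): if the fingerprint is computed from the resolved symlink target, renaming an annexed file changes only the symlink, not the object it points to, so the cached digest stays keyed to the same fingerprint.

```python
from __future__ import annotations

import os
from pathlib import Path


def fingerprint(path: str | os.PathLike) -> tuple[str, int, int]:
    """Fingerprint a file by the path it ultimately resolves to.

    For a git-annex symlink, renaming the link leaves the annexed object
    untouched, so (resolved path, size, mtime) -- and any digest cached
    under that fingerprint -- stays valid after the rename.
    """
    resolved = Path(path).resolve()  # follow git-annex symlinks to the key store
    st = resolved.stat()
    return (str(resolved), st.st_size, st.st_mtime_ns)
```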

@yarikoptic
Member

Also, if we are talking about mv in the archive itself, maybe such a mv (or move?) command should parallel the delete command we already have: move the assets in the archive (maybe with an option to move them locally as well).

@satra
Member Author

satra commented Dec 7, 2021

I would really like to avoid mtime+size alone as some kind of a "proxy" measure. IMHO it is more important to have it correct (even if slow) than merely likely to be correct (but faster).

Slow is what we are trying to avoid, so we need the engineering to avoid it.

(Note also that any dandi-cli, dandischema, pynwb, or other listed library update would also invalidate the cache; maybe for digesting we should relax that a bit.)

Indeed, I am not a fan of digests being invalidated by those library updates. The intent of a digest is that it is not dependent on a library version.

git-annex / datalad can be an option when it becomes something most neuroscientists can do. At present we will have a lot of microscopy data coming in, and most of the awardees have indicated that the terminal is not something they use daily. We have to adjust our efforts towards ease of use.

@yarikoptic
Member

the intent of a digest is that it is not dependent on a library version.

Well, if a library "fixes" the digest algorithm, the cache should be invalidated. But indeed, since this functionality is already well tested, @jwodder -- maybe instead of library versions, just add some explicit token value (e.g. 1) for the "digest implementation", so we could still invalidate the prior cache by explicitly changing that token.
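
A sketch of the token idea in generic terms (the constant name and the key construction are illustrative, not the actual fscacher/dandi-cli API): the cache key depends on an explicit digest-implementation token rather than on library versions, so only bumping the token invalidates previously cached digests.

```python
import hashlib
import json

# Bump only when the digest algorithm itself changes; library version bumps
# then no longer invalidate previously cached digests.
DIGEST_IMPLEMENTATION_TOKEN = 1  # hypothetical constant


def cache_key(path: str, size: int, mtime_ns: int) -> str:
    """Build a cache key from the file's identity plus the explicit token."""
    payload = json.dumps(
        {
            "path": path,
            "size": size,
            "mtime_ns": mtime_ns,
            "token": DIGEST_IMPLEMENTATION_TOKEN,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```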

Slow is what we are trying to avoid, so we need the engineering to avoid it.

Well, caching already allows avoiding it in many scenarios.

Besides an explicit dandi mv, I do not see some magical way to speed it up further (without some underlying mechanism -- from the filesystem or git-annex -- to provide reliable "content identity" information).

As for support for some move in fscacher, let's look into the feasibility separately -- con/fscacher#57.

Re-implementing support for "more efficient caching" directly in dandi-cli could, I guess, be done, but at large it would probably end up being just an "overfit fork" of the fscacher and/or joblib (which fscacher relies upon) implementation... It might need to be done, but I think we should first research possible solutions/approaches more.

@yarikoptic
Member

the intent of a digest is that it is not dependent on a library version.

Well, if a library "fixes" the digest algorithm, the cache should be invalidated. But indeed, since this functionality is already well tested, @jwodder -- maybe instead of library versions, just add some explicit token value (e.g. 1) for the "digest implementation", so we could still invalidate the prior cache by explicitly changing that token.

FWIW: I checked the code. Apparently we do not add any token ATM, so no upgrades should invalidate that cache (unless joblib does some invalidation upon its own upgrade; I didn't check): https://github.com/dandi/dandi-cli/blob/master/dandi/support/digests.py#L76 . So nothing to do in this regard.
