Check digests when redownloading #1364

Closed

Conversation

@sneakers-the-rat (Contributor) commented Nov 22, 2023

Hello again (again!)

Hoping this unsolicited PR isn't unwelcome - I figured it was lightweight enough that it didn't need to be discussed in advance.

This PR checks file digests when deciding whether to re-download a file. Basically, it implements this TODO:

# TODO: use digests if available? or if e.g. size is identical
# but mtime is different

While downloading datasets I had one disk go down, and IT keeps messing with my network, so I get intermittent connection breaks that cause downloads to stall - so I am pretty worried about corrupted data through all that. It makes sense to me to check file hashes at the moment we decide whether to redownload (so, semantically, attempting to download a file to the same location always results in the same file for a given dandiset and version). Thinking about the distributed future of dandi, we'll need to be doing a whole lot of hashing - e.g. this is a step towards being able to validate data from a mirror server against the hashes hosted on the 'main' server.

Pretty straightforward (a minimal sketch of the helper appears after this list):

  • Made a check_digests helper function - I noticed that in the relevant places there were several possible digests to check against, so to tidy up a few if/elses I put that into one function that checks hashes in order of priority. I do this rather than requiring that all hashes match, or just testing the first one, because not all hashing functions are created equal (e.g. md5 has been broken for >10 years and should be deprecated), so we do want to explicitly prefer some, especially since more will break in the future.
  • The OVERWRITE_DIFFERENT leg is unchanged - just refactored to use check_digests.
  • In REFRESH ...
    • If digests is None or an empty dict, behavior is unchanged.
    • If the mtime is missing and we have a digest, check it; don't redownload if the digest matches.
    • If we have an mtime, first check it and the size. If either doesn't match, redownload without checking the digest.
    • If mtime and size match and we have a digest, check it; if it matches, don't redownload.
  • Also added more log messages so we have greppable logs on all legs.
  • Tests for the check_digests function, and a test ensuring that files that match size and mtime but don't match the hash trigger a re-download.
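
A minimal sketch of the helper's idea, assuming hashlib-style algorithm names (the real helper also handles DANDI-specific digests such as dandi-etag, and the priority order here is illustrative):

```python
import hashlib
from pathlib import Path

# Illustrative priority order, strongest first; md5 last on purpose.
DIGEST_PRIORITY = ("sha512", "sha256", "md5")

def check_digests(path: Path, digests: dict, priority=DIGEST_PRIORITY) -> bool:
    """Return True iff the file matches the highest-priority digest
    present in ``digests`` (a mapping of algorithm name -> hex value)."""
    for algo in priority:
        if algo in digests:
            h = hashlib.new(algo)
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            # Only the first available algorithm is consulted.
            return h.hexdigest() == digests[algo]
    return False  # nothing to check against
```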

Tests pass; I ran the pre-commit actions. Not sure why mypy is failing on the ZarrChecksumTree import - I didn't touch that code. I see the code-coverage checker complaining; I didn't see tests for the _download_file function itself among the existing tests, which I'd need in order to test against a dandiset without a digest, but I can definitely add that if needed.

Potential changes

  • Check order: I wrote it to quickly decide to redownload if mtime or size don't match, and to only check file digests if they do. That seems like a reasonable order to me, but it could be changed!
  • Refactoring/tidying: the check at the top of _download_file that decides whether to download is now quite long, and it seems like a separate enough operation that it should be split into its own function, but I didn't want to do too much in one PR without checking.
  • Optionality: should this be optional? I think that when one redownloads a file to the same directory with one of the relevant options (refresh or overwrite-different), one always wants to be sure that the local file matches the one on the server. The digest might take a while, but it's the true check that the file is the same. If we want this to be optional, I can propagate a parameter up through the calling functions and the click command.
  • Put it in the digest command? The current dandi digest command just computes the hash, but we could also have it check against the server hashes?

Probably more! Let me know what needs to be changed :).


Side note: I followed the DEVELOPMENT.md guide to run local tests, but the vendored docker-compose files didn't work, and the fixtures aren't set up to use a dandi-archive docker container set up outside of the dandi-cli repo. I had to add an extra DANDI_DOCKER_DIR env variable to replace LOCAL_DOCKER_DIR, which is hardcoded to a directory beneath dandi/tests/fixtures. Pretty sure I followed the instructions, and it looks like some parts of that doc have gotten out of sync with the fixtures. I also had to override the versioneer auto-versioning, which caused ~half of the tests to fail since my version was something like dandi-0+untagged.3399.g20efd37 (tags don't get fetched for forks by default).

- redownload immediately if size or mtime don't match; check digest only if they do (sketched below)
- debug messages on all arms of the decision to download
- move import into download function
- test using debug messages
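
In pseudocode, the refresh decision described above looks roughly like this (names are assumptions for illustration, not the PR's actual code):

```python
def should_redownload(local_mtime, local_size, remote_mtime, remote_size,
                      path, digests) -> bool:
    """Decide whether REFRESH should re-download an existing file."""
    if local_mtime is None:
        # No recorded mtime: fall back to the digest if we have one.
        return not (digests and check_digests(path, digests))
    if (local_mtime, local_size) != (remote_mtime, remote_size):
        return True  # cheap mismatch: redownload without hashing
    # Size and mtime match; the digest (if any) is the decisive, slow check.
    return bool(digests) and not check_digests(path, digests)
```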

codecov bot commented Nov 22, 2023

Codecov Report

Attention: 13 lines in your changes are missing coverage. Please review.

Comparison is base (114da47) 88.69% compared to head (b6c2195) 88.65%.
Report is 1 commit behind head on master.

| Files | Patch % | Lines |
| --- | --- | --- |
| dandi/download.py | 50.00% | 13 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1364      +/-   ##
==========================================
- Coverage   88.69%   88.65%   -0.04%     
==========================================
  Files          77       77              
  Lines       10434    10503      +69     
==========================================
+ Hits         9254     9311      +57     
- Misses       1180     1192      +12     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 88.65% <83.75%> (-0.04%) ⬇️ |

Flags with carried forward coverage won't be shown.


@sneakers-the-rat (Contributor, Author):

Interesting - looks like the Windows tests are failing: they claim a file of all zeros has the same hash as the real file. I'm assuming that's because the PersistentCache doesn't recognize that the file has changed, so it's returning the memoized digests. I could make that test pass by taking the digest before setting the mtime, but this seems like a correct failure catching a bug in the get_digest function that we shouldn't ignore - it would defeat checksumming in practice.

That actually seems to come from the way fscacher uses st_ctime_ns, which is deprecated on Windows: https://docs.python.org/3/library/os.html#os.stat_result.st_ctime_ns

https://github.com/con/fscacher/blob/e3a6dee365f89a3b0212d19976e61a8165bf0488/src/fscacher/cache.py#L209C10-L209C10
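
For illustration, this is the kind of stat-based fingerprint at issue (the exact field choice is an assumption based on the linked cache.py): if none of these fields change after a file is rewritten in place, the cache returns the memoized digest for stale content.

```python
import os

def fingerprint(path: str) -> tuple:
    s = os.stat(path)
    # On Windows, st_ctime_ns reports creation time, which does not change
    # when a file's contents are rewritten, so a same-size in-place rewrite
    # with a restored mtime produces an identical fingerprint.
    return (s.st_mtime_ns, s.st_ctime_ns, s.st_size)
```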

@jwodder added the "minor" label (Increment the minor version when merged) on Nov 22, 2023
@jwodder changed the title from "[Feature][minor] - Check Digests when Redownloading" to "Check Digests when Redownloading" on Nov 22, 2023
@jwodder changed the title from "Check Digests when Redownloading" to "Check digests when redownloading" on Nov 22, 2023
dandi/support/digests.py (resolved review thread, outdated)
dandi/support/digests.py (resolved review thread, outdated)
Comment on lines 158 to 160
Only check the first digest that is present in ``digests`` , rather than
checking all and returning if any are ``True`` - save compute time and
honor the notion of priority - some hash algos are better than others.
Member:

Suggested change:
- Only check the first digest that is present in ``digests`` , rather than
- checking all and returning if any are ``True`` - save compute time and
- honor the notion of priority - some hash algos are better than others.
+ Only the first digest algorithm in ``priority`` that is present in
+ ``digests`` is checked.

Describing an approach we're not doing is confusing, and "why" reasons generally don't belong in docstrings.

@sneakers-the-rat (Contributor, Author):

I usually find "why" explanations helpful - e.g. why not have a function that just takes one hash rather than a dict of hashes, and if we're passing multiple hashes, why wouldn't we check all of them - but sure, I'll move that out of the docstring; no strong opinions. The test should protect against future memory loss.

@sneakers-the-rat (Contributor, Author):

Updated the docstring to also describe the params, which makes the previous description of what we're not doing obsolete anyway.

dandi/support/digests.py (resolved review thread, outdated)
dandi/support/digests.py (resolved review thread, outdated)
dandi/tests/test_download.py (resolved review thread, outdated)
dandi/tests/test_download.py (resolved review thread, outdated)
dandi/support/digests.py (resolved review thread, outdated)
dandi/download.py (resolved review thread, outdated)
dandi/download.py (resolved review thread, outdated)
@yarikoptic (Member):

and so I am pretty worried about corrupted data through all that.

Unrelated to the destiny of this PR, I would welcome you to try datalad or git/git-annex directly on https://github.com/dandi/dandisets, which should provide quite up-to-date and tightly version/digest-controlled access to DANDI data.

@@ -578,26 +583,16 @@ def _download_file(
and "sha256" in digests
):
if key_parts[-1].partition(".")[0] == digests["sha256"]:
lgr.debug("already exists - matching digest in filename")
Member:

It is always "not" fun to guess what a message is talking about, so better to use

Suggested change:
- lgr.debug("already exists - matching digest in filename")
+ lgr.debug("%s already exists - matching digest in filename", path)

or the like, here and elsewhere.

@sneakers-the-rat (Contributor, Author):

Got it. I also noticed that there was a checksum yield message that should apply when we skip because the digest matches, and I added that.

@sneakers-the-rat (Contributor, Author):

I think I made all the suggested changes; the only thing I added is also yielding {'checksum': 'ok'} in cases where we check the digest and skip the download. The typing test is still mysteriously failing, and the Windows tests also still fail, presumably because digests are being cached differently than on mac/linux - let me know what would be good there :)

@yarikoptic (Member):

@jwodder any clue why mypy has now become unhappy? Or is it just a coincidence, something new that came up now?

typing: commands[0]> mypy dandi
dandi/support/digests.py:109: error: Module "zarr_checksum" does not explicitly
export attribute "ZarrChecksumTree"  [attr-defined]
        from zarr_checksum import ZarrChecksumTree
        ^
dandi/files/zarr.py:292: error: Module "zarr_checksum" does not explicitly
export attribute "ZarrChecksumTree"  [attr-defined]
            from zarr_checksum import ZarrChecksumTree
            ^
Found 2 errors in 2 files (checked 80 source files)
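
(For context, an aside not from the thread: mypy reports attr-defined here because, with implicit re-exports disabled, a name that a package merely imports is treated as private. A library-side fix looks like the sketch below; the zarr_checksum.tree module path is an assumption for illustration.)

```python
# zarr_checksum/__init__.py
# The redundant "X as X" alias marks the name as an explicit re-export:
from zarr_checksum.tree import ZarrChecksumTree as ZarrChecksumTree

# Alternatively, listing the name in __all__ also marks it as exported:
__all__ = ["ZarrChecksumTree"]
```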

Re Windows: needs to be troubleshot, I guess.

):
yield _skip_file("already exists")
elif digests is not None and check_digests(path, digests):
lgr.debug(f"{path!r} already exists - matching digest")
Member:

I thought, in the light of e.g. dandi/dandi-archive#1750, that we had fully avoided using f-strings in logging, but at least 2 did sneak in already:

❯ git grep 'lgr\.[^(]*(f'
dandi/download.py:                lgr.debug(f"{path!r} - same attributes: {same}.  Redownloading")
dandi/upload.py:        lgr.info(f"Found {len(dandi_files)} files to consider")

I am not that much of a puritan, so I wouldn't force a change here.

@sneakers-the-rat (Contributor, Author):

Aha! I didn't realize there was a practical reason for this - I thought it was just old-style string formatting. Did this in bc28daa.
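
(An aside for readers outside the thread: the practical reason is a general Python logging point. With %-style lazy formatting, the message is rendered only if the record is actually emitted, and log-aggregating handlers such as Sentry's can group on the constant template rather than on a pre-rendered string.)

```python
import logging

lgr = logging.getLogger("dandi")
path = "sub-RAT123.nwb"  # example value

# Eager: the f-string is built even when DEBUG is disabled.
lgr.debug(f"{path!r} already exists - matching digest")

# Lazy: formatting is deferred until a handler actually emits the record.
lgr.debug("%r already exists - matching digest", path)
```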

@jwodder (Member) commented Nov 27, 2023

@yarikoptic The typing errors are addressed by #1365.

@yarikoptic (Member):

@jwodder could you please analyze that Windows issue, for which @sneakers-the-rat also reported con/fscacher#87?

@sneakers-the-rat (Contributor, Author):

The typing errors are addressed by #1365.

Merged from master

could you please analyze that windows issue, for which @sneakers-the-rat also reported con/fscacher#87 ?

Let me know how I can help - I can force the cache to be invalidated in the tests to make them pass, but it seems worth addressing the underlying source of inconsistency.

@jwodder (Member) left a review:

Review so far.

download(url, tmp_path, get_metadata=False)
dsdir = tmp_path / "000027"
nwb = dsdir / "sub-RAT123" / "sub-RAT123.nwb"
digests = digester(str(nwb))
Suggested change:
- digests = digester(str(nwb))
+ digests = digester(nwb)

os.utime(nwb, (time.time(), mtime))

# now original digests should fail since it's a bunch of zeros
zero_digests = digester(str(nwb))
Suggested change:
- zero_digests = digester(str(nwb))
+ zero_digests = digester(nwb)

assert "successfully downloaded" in last_5.lower()

# should pass again
redownload_digests = digester(str(nwb))
Suggested change:
- redownload_digests = digester(str(nwb))
+ redownload_digests = digester(nwb)

Comment on lines +199 to +200
for dtype in redownload_digests.keys():
assert redownload_digests[dtype] == digests[dtype]
Suggested change:
- for dtype in redownload_digests.keys():
-     assert redownload_digests[dtype] == digests[dtype]
+ assert redownload_digests == digests

lgr.warning(
f"{path!r} - no mtime or ctime in the record, redownloading"
)
if digests is not None and check_digests(path, digests):
Member:
This is a change in behavior. As documented here and here, "refresh" is only supposed to check size and mtime.

@jwodder (Member) commented Nov 28, 2023

@yarikoptic @sneakers-the-rat As I indicated here, making the "refresh" mode compare digests is a change in behavior. It's currently documented to only check size & mtime, and if we were to make it check digests as well, what would be the difference between "refresh" and "overwrite-different"?

@sneakers-the-rat (Contributor, Author):

Yes! I am proposing a change in behavior - I wasn't sure what the difference was, so please let me know if I'm missing some context here. It seems like the intention of both is "check whether we already have the file we're about to download; if so, skip."

The current OVERWRITE_DIFFERENT implementation (a) does a special-cased check for a sha256 hash in the filename if it is in git-annex, and then (b) checks for dandi-etag or md5 hashes if they're present. One thing this PR does is consolidate (b), so it's easier to change digest types if we want to, e.g., do IPFS things with CIDs and whatnot.

REFRESH seems to be just a faster but weaker version of the same intention. So ideally we would consolidate them, but I wasn't sure how y'all felt about that and didn't want to do too much in the same PR. The naming suggests REFRESH was intended to synchronize an already-downloaded dataset when the version on the server is updated, but that's also covered by a more general "download if different" strategy that first checks for mismatches in mtime or size and only then checks the hash. That might be slow, but since y'all cache hash digests on disk, repeated calls to refresh/overwrite-different should be fast if mtime/size haven't changed. The difference before and after this PR would then be the exclusion of the case where size and mtime match but the data doesn't.

So maybe we need a change in semantics? We could collapse those down into one download mode. If the difference between the two modes at the moment is checking sameness by time and size vs. by hash, then I think the names could be made a little clearer. Since refresh can also overwrite, maybe something like:

download --existing={error,skip,refresh,force}

which could also move downloading closer to uploading semantics:

E.g., currently:

| op | download | upload |
| --- | --- | --- |
| error | raise error | raise error |
| skip | skip | skip |
| force | | always overwrite |
| overwrite | always overwrite | overwrite if size or mtime differs |
| overwrite-different | overwrite if size or hash differs | |
| refresh | overwrite if size or mtime differs | overwrite if local mtime > remote |

we could do

| op | download | upload |
| --- | --- | --- |
| error | (same) | (same) |
| skip | (same) | (same) |
| force | always overwrite | always overwrite |
| refresh | overwrite if size, mtime, or hash differs | overwrite if size or hash differs and local mtime > remote mtime |

which isn't perfectly symmetric, but uploading and downloading aren't symmetric actions either in this case.

a humble proposal!

@yarikoptic (Member):

Hi @sneakers-the-rat, and sorry for the delay in providing more feedback here. Indeed there seems to be some need/desire to streamline the modes more! The original desire behind refresh was to be able to do a "quick refresh", so collapsing it with "also check a checksum" would ruin that. But I also see the desired specificity of overwrite-different, as it

  • would not rely on or test mtime at all, operating solely based on content (size and then checksum)
  • would not lead to a "refresh" of e.g. the file's mtime, so a subsequent refresh could be used for a quick up-to-dateness assurance.

I guess we could gain refresh-thorough (or replace the current refresh with refresh-fast and introduce a new refresh), which would

  • still use size and mtime as a shortcut, triggering a download if either differs
  • go on to checksums if those are identical

It would differ from overwrite-different only in having the mtime-based shortcut, if I see it right. The question is: is it worth the effort in addition to overwrite-different? And should we complement its functioning by

  • adjusting the mtime (if possible) to match the remote one?

@sneakers-the-rat (Contributor, Author) commented Jan 29, 2024

Gotcha, makes sense.

So one option is to make a separate action for update or sync rather than overloading download - the intent of "ensure what I have is the same as on the server" seems different enough from "get this item from the server" that it deserves its own command. Maybe that fits in with the current digest command?

The basic tradeoff is certainty vs. speed, right? The purpose of checksumming is to give an option where someone can be certain that what they have locally is the same as what's on the server, without doing the full download - i.e., there isn't a case where a false negative for triggering an update would be desirable, except for the time it costs.
So, in order of costliness:

  • Check mtime/size
  • Checksum
  • Downloading file

I am still not quite sure why overwrite-different = (size + checksum) and refresh = (size + mtime) should be two different paths. As is, I think this is the current flowchart:

flowchart TD
    Download
    DoDownload[Do Download]
    DLMode[Download Mode]
    DontDownload[Don't Download]
    CheckSize[Check Size]
    CheckMTime[Check mtime]
    Checksum[Checksum]
    

    Download -- "Doesnt Exist" --> DoDownload
    Download -- "Does Exist" --> DLMode
    DLMode -- "`ignore`" --> DontDownload
    DLMode -- "overwrite" --> DoDownload
    DLMode -- "overwrite-different" --> CheckSize
    DLMode -- "refresh" --> CheckSize
    CheckSize -- "match (overwrite-different)" --> Checksum
    CheckSize -- "match (refresh)" --> CheckMTime
    CheckSize -- "no match" --> DoDownload
    
    CheckMTime -- "match" --> DontDownload
    CheckMTime -- "no match" --> DoDownload

    Checksum -- "match" --> DontDownload
    Checksum -- "no match" --> DoDownload

I don't think that refresh and overwrite-different communicate the difference between checking mtime + size vs. checksum + size - what is the semantic difference between those paths? It's hard for me to articulate.

What would make sense to me would be to have a "fast" or "slow" path that goes through the same set of checks:

So, e.g.:

  • dandi download --force
  • dandi download --ignore
  • dandi download --refresh
  • dandi download --refresh --checksum
flowchart TD
    Download
    DoDownload[Do Download]
    DLMode[Download Mode]
    DontDownload[Don't Download]
    CheckSize[Check Size]
    CheckMTime[Check mtime]
    Checksum[Checksum]
    ChecksumEnabled[--checksum]
    
    Download -- "Doesnt Exist" --> DoDownload
    Download -- "Does Exist" --> DLMode
    DLMode -- "--ignore" --> DontDownload
    DLMode -- "--force" --> DoDownload
    DLMode -- "--refresh" --> CheckSize
    CheckSize -- "match" --> CheckMTime
    CheckSize -- "no match" --> DoDownload
    
    CheckMTime -- "match" --> ChecksumEnabled
    CheckMTime -- "no match" --> DoDownload

    ChecksumEnabled -- "yes" --> Checksum
    ChecksumEnabled -- "no" --> DontDownload

    Checksum -- "match" --> DontDownload
    Checksum -- "no match" --> DoDownload
    

So the refresh action stays the same, but the checksum flag determines whether one does the refresh in a "fast" or "slow" way, where the tradeoff is clearly speed vs. certainty. One could then set some env or config variable like "always checksum when refreshing", which would be awkward to express if that option were part of the same enum as the download mode.

I think mtime by itself is a pretty weak check, with lots of opportunity for false positives and negatives, since it's not an independent representation of the version/time of update on the server vs. touching or modifying the file locally. So I take no position on whether the download function should match the mtimes - to me the question is "how can we make the download command reflect the intention of 'download this and make sure what I have is correct'". I hope I'm not coming across as too pedantic or ornery here - happy to do whatever with this PR.

@yarikoptic (Member):

I am still not quite sure why overwrite-different = (size + checksum) and refresh = (size + mtime) would be two different paths

If size differs, the checksum cannot[*] be identical. If mtime differs, the checksum can still be the same.

Once again, the size check in the "checksum"-based mode is just a shortcut, for the aforementioned reason.

[*] unless there is a weak checksum like md5 and we are under attack... not the case here

@sneakers-the-rat (Contributor, Author):

Right, right - understood. Size and mtime are "fast" ways to check for mismatches, and the checksum is a slower but more accurate check. What I mean when I say I'm not sure why those should be two separate code paths is that overwrite-different and refresh seem to be doing two versions of the same thing, a "fast" and a "slow" version: we know that if the size differs the checksum will fail, but even when sizes match we might not want to compute the checksum if we just want a "fast" check (as per the purpose of refresh).

So why --overwrite-different and --refresh, rather than --refresh and --refresh --checksum, to indicate the "fast with potential for false negatives" and "slow but certain" versions of the same intention?

I don't want to get too far into the weeds on this; it seems like in any case this PR is not of interest, which is fine, but I'm wondering what the outcome should be:

  • Close this and pretend it never happened (totally fine)
  • Refactor to keep just the generalized digest checking (rather than if/else-ing each type of digest) in the OVERWRITE_DIFFERENT leg. If we pick this one, I'd also want to contribute to the docs a bit to explain what the different options mean (currently the docs only explain refresh)

In either case, I assume we would want to handle changes to option naming etc. in another PR, which I would also be happy to draft.

@sneakers-the-rat (Contributor, Author):

All good - I'm just gonna make my own wrapper <3. Thanks for your time on this one; closing to tidy up my lists.
