Retry on md5 mismatch in GCS #2488

Merged
zaneselvans merged 1 commit into dev from retry-on-md5-mismatch
Apr 4, 2023

Conversation

jdangerx
Member

@jdangerx jdangerx commented Apr 4, 2023

We ran into this during nightly builds:

The above exception was caused by the following exception:
google.resumable_media.common.DataCorruption: Checksum mismatch while downloading:

  https://storage.googleapis.com/download/storage/v1/b/internal-zenodo-cache.catalyst.coop/o/eia923%2F10.5281-zenodo.7236677%2Feia923-2020.zip?alt=media&userProject=catalyst-cooperative-pudl

The X-Goog-Hash header indicated an MD5 checksum of:

  7IwB5H1MDI+hT3YKRrmGQw==

but the actual MD5 checksum of the downloaded contents was:

  w2YhJ2+m55W/tHh5l3/fow==

I downloaded the actual file and it had a different checksum from what's in the error:

MD5 (eia923-2020.zip) = ec8c01e47d4c0c8fa14f760a46b98643

And that matches the md5 listed in the GCP console for internal-zenodo-cache.catalyst.coop/eia923/10.5281-zenodo.7236677/eia923-2020.zip.
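The local check described above can also be reproduced in Python with the standard library. This is a minimal sketch, not code from this PR: `md5_hex` and `chunk_size` are hypothetical names chosen for illustration.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    `md5_hex` and `chunk_size` are illustrative names, not part of PUDL.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so a large archive does not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_hex("eia923-2020.zip")` on a local copy would give a value in the same hexadecimal form as the `MD5 (...)` line above.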

So that leads me to believe there's something funny going on with chunked downloading & that this might just be a transient issue with the download (though, shouldn't TCP/IP be making sure that we get the data that we expect to get from Google? I guess things can and do break sometimes.)

So, I added DataCorruption to the list of additional retryable errors. The GCS client automatically does exponential backoff with a total timeout limit, so this shouldn't suddenly cause our nightly builds to take even longer.
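To illustrate why retrying on this error should not make builds run unboundedly longer, here is a rough standard-library sketch of exponential backoff with a total deadline. It is not the GCS client's implementation and not the change in this PR: the `DataCorruption` class below is a local stand-in for `google.resumable_media.common.DataCorruption`, and `retry_with_backoff` and all of its parameters are hypothetical.

```python
import time


class DataCorruption(Exception):
    """Local stand-in for google.resumable_media.common.DataCorruption."""


def retry_with_backoff(func, retryable=(DataCorruption,), initial=1.0,
                       multiplier=2.0, maximum=60.0, deadline=120.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds, fails with a non-retryable error,
    or the next wait would exceed the total deadline (in seconds)."""
    start = clock()
    delay = initial
    while True:
        try:
            return func()
        except retryable:
            # Stop retrying once another wait would pass the deadline,
            # so the total added time stays bounded.
            if clock() - start + delay > deadline:
                raise
            sleep(delay)
            # Grow the wait geometrically, capped at `maximum`.
            delay = min(delay * multiplier, maximum)
```

Errors that are not listed in `retryable` propagate immediately, so only the transient checksum failure gets a second chance.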

@jdangerx jdangerx self-assigned this Apr 4, 2023
@zaneselvans
Member

The checksums from the Goog look like maybe they're encoded somehow, not in hexadecimal like normal MD5 sums and the one you created (which matches my local copy of the file). Are there ways to output the hash in like, base64?

Hmm, that doesn't look like it.

echo "ec8c01e47d4c0c8fa14f760a46b98643" | base64
ZWM4YzAxZTQ3ZDRjMGM4ZmExNGY3NjBhNDZiOTg2NDMK

Member

@zaneselvans zaneselvans left a comment

Seems like a retry on corrupted data is a good idea regardless, but do you have any idea why the hashes from Google (both good and bad) don't look like normal MD5 hashes in hexadecimal? I tried encoding the real hash in base64 but that didn't look right either.

@zaneselvans zaneselvans added the zenodo and cloud labels Apr 4, 2023
@jdangerx
Member Author

jdangerx commented Apr 4, 2023

Oh, good point with the base64! I ran this and got something that makes sense:

> echo "7IwB5H1MDI+hT3YKRrmGQw==" | base64 --decode | hexdump -C
00000000  ec 8c 01 e4 7d 4c 0c 8f  a1 4f 76 0a 46 b9 86 43  |....}L...Ov.F..C|
00000010
> echo "w2YhJ2+m55W/tHh5l3/fow==" | base64 --decode | hexdump -C
00000000  c3 66 21 27 6f a6 e7 95  bf b4 78 79 97 7f df a3  |.f!'o.....xy....|
00000010

So this still looks like "whoops, got bad data once" but at least we see the matching md5 in the error!
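The same decoding can be done in Python, which may be handy when comparing digests inside a script. A small sketch using only the standard library; `b64_md5_to_hex` is a hypothetical helper name.

```python
import base64


def b64_md5_to_hex(b64_digest):
    """Decode a base64-encoded digest into lowercase hexadecimal."""
    return base64.b64decode(b64_digest).hex()


# Both values are copied from the error message quoted in this thread.
expected = b64_md5_to_hex("7IwB5H1MDI+hT3YKRrmGQw==")  # X-Goog-Hash header
actual = b64_md5_to_hex("w2YhJ2+m55W/tHh5l3/fow==")    # downloaded contents
print(expected)
print(actual)
```

The first value corresponds to the hexdump bytes `ec 8c 01 e4 ...` above, which is the hexadecimal MD5 of the real file.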

@jdangerx
Member Author

jdangerx commented Apr 4, 2023

Out of curiosity I was wondering why your earlier command didn't work - it's because base64 interpreted your input as an ASCII encoded string, instead of as a hexadecimal representation of bytes. This works (xxd -r -p turns the hexdump into an actual byte string):

> echo "ec8c01e47d4c0c8fa14f760a46b98643" | xxd -r -p | base64
7IwB5H1MDI+hT3YKRrmGQw==
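The difference between the two commands can be seen side by side in Python. This is an illustrative sketch with made-up variable names: one path encodes the 32 ASCII characters of the hex string (plus the newline that `echo` appends), the other converts the hex string to its 16 raw bytes first.

```python
import base64

hex_digest = "ec8c01e47d4c0c8fa14f760a46b98643"

# Encoding the text of the hex string, including echo's trailing newline,
# reproduces the long result from the first attempt.
from_text = base64.b64encode((hex_digest + "\n").encode("ascii")).decode("ascii")

# Converting the hex string to raw bytes first (what `xxd -r -p` does)
# yields the short form that appears in the error message.
from_bytes = base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")

print(from_text)
print(from_bytes)
```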

@zaneselvans zaneselvans merged commit 289657e into dev Apr 4, 2023
4 checks passed
@zaneselvans zaneselvans deleted the retry-on-md5-mismatch branch April 4, 2023 19:28
Labels
cloud: Stuff that has to do with adapting PUDL to work in cloud computing context.
zenodo: Issues having to do with Zenodo data archiving and retrieval.

2 participants