Ensure deterministic checksums on csv.gz outputs #856

rousik · 2020-12-14T18:31:08Z

The default behavior for gzip files is to add mtime timestamp to the header which results in non-deterministic checksum of the resulting files even if the content is the same.

Integration between pandas and gzip that allows writing compressed gzip files does not easily allow clearing the mtime attribute even though GzipFile library allows that (see pandas-dev/pandas#28103 for more details).

This change works around this problem using the solution suggested in the pandas bug discussion. Unfortunately, it requires some dirty work to construct iobuffer wrapped zipfiles by hand and passing those to pandas instead of relying on it the native pandas/gzip integration.

Deterministic checksums are necessary if we want to have truly reproducible results and will make it easier to quickly compare two outputs side-by-side (as we could compare resource checksum for each datapackage).

gzip files are adding mtimes in the headers which results in non-deterministic checksums of the resulting files. This change adds a workaround to ensure that mtime is not set and this should allow us to generate compressed datapackages with deterministic checksums. Unfortunately, due to the existing known-bug, the workaround is somewhat dirty and requires us to pass the iobuffer wrapped zip files that we have to open by hand with the necessary parameters instead of relying on the pandas/gzip integration. See pandas-dev/pandas#28103 for more details.

zaneselvans

FYI if you want to work on branches within the repo, CI will have access to the stored EIA API key and then the tests ought to pass.

rousik requested a review from zaneselvans December 14, 2020 18:31

zaneselvans approved these changes Dec 14, 2020

View reviewed changes

zaneselvans merged commit 1336376 into catalyst-cooperative:sprint29 Dec 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure deterministic checksums on csv.gz outputs #856

Ensure deterministic checksums on csv.gz outputs #856

rousik commented Dec 14, 2020

zaneselvans left a comment

Ensure deterministic checksums on csv.gz outputs #856

Ensure deterministic checksums on csv.gz outputs #856

Conversation

rousik commented Dec 14, 2020

zaneselvans left a comment

Choose a reason for hiding this comment