Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure deterministic checksums on csv.gz outputs #856

Merged

Conversation

rousik
Copy link
Collaborator

@rousik rousik commented Dec 14, 2020

The default behavior for gzip files is to add mtime timestamp to the header which results in non-deterministic checksum of the resulting files even if the content is the same.

Integration between pandas and gzip that allows writing compressed gzip files does not easily allow clearing the mtime attribute even though GzipFile library allows that (see pandas-dev/pandas#28103 for more details).

This change works around this problem using the solution suggested in the pandas bug discussion. Unfortunately, it requires some dirty work to construct iobuffer wrapped zipfiles by hand and passing those to pandas instead of relying on it the native pandas/gzip integration.

Deterministic checksums are necessary if we want to have truly reproducible results and will make it easier to quickly compare two outputs side-by-side (as we could compare resource checksum for each datapackage).

gzip files are adding mtimes in the headers which results in
non-deterministic checksums of the resulting files. This change
adds a workaround to ensure that mtime is not set and this should
allow us to generate compressed datapackages with deterministic
checksums.

Unfortunately, due to the existing known-bug, the workaround is somewhat
dirty and requires us to pass the iobuffer wrapped zip files that we
have to open by hand with the necessary parameters instead of relying
on the pandas/gzip integration.

See pandas-dev/pandas#28103 for more details.
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI if you want to work on branches within the repo, CI will have access to the stored EIA API key and then the tests ought to pass.

@zaneselvans zaneselvans merged commit 1336376 into catalyst-cooperative:sprint29 Dec 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants