Ensure deterministic checksums on csv.gz outputs #856
Merged
By default, gzip writes an mtime timestamp into the file header, so the resulting files have non-deterministic checksums even when their content is identical.
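The effect of the header timestamp can be illustrated with the standard library alone. This is a minimal sketch, not code from this change: `gzip_bytes` and `header_mtime` are hypothetical helper names, and the sample payload is made up. It relies on the gzip format storing MTIME as a little-endian 32-bit integer at bytes 4 to 7 of the header.

```python
import gzip
import io
import struct


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Compress data in memory with an explicit header timestamp (hypothetical helper)."""
    buf = io.BytesIO()
    # filename="" keeps the original file name out of the header as well.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


def header_mtime(blob: bytes) -> int:
    """Read the MTIME field (bytes 4-7, little-endian) from a gzip header."""
    return struct.unpack("<I", blob[4:8])[0]


payload = b"id,value\n1,a\n"  # made-up sample content

# Same content, different timestamps: the compressed bytes differ.
different = gzip_bytes(payload, 1) != gzip_bytes(payload, 2)

# Same content, timestamp pinned to 0: the compressed bytes are identical.
identical = gzip_bytes(payload, 0) == gzip_bytes(payload, 0)
```

When `mtime` is not passed, `GzipFile` uses the current time, which is why two otherwise identical runs can produce different checksums.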
The pandas/gzip integration used to write compressed files does not offer an easy way to clear the mtime field, even though the underlying GzipFile class supports it (see pandas-dev/pandas#28103 for more details).
This change works around the problem using the solution suggested in the pandas issue discussion. Unfortunately, it requires some dirty work: constructing the io-buffer-wrapped gzip files by hand and passing them to pandas instead of relying on the native pandas/gzip integration.
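The general shape of such a workaround might look like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `write_deterministic_gz` and `sha256_of` are hypothetical names, and the stdlib `csv` writer stands in for pandas so the example has no third-party dependencies. With pandas, the callback would presumably be something like `lambda fh: df.to_csv(fh, index=False)`, since `to_csv` accepts an open file handle.

```python
import csv
import gzip
import hashlib
import io
import os
import tempfile


def write_deterministic_gz(path, write_text):
    """Write text to a gzip file whose header has mtime=0 and no file name.

    write_text is a callback that receives a writable text handle
    (hypothetical interface chosen for this sketch).
    """
    with open(path, "wb") as raw:
        # Build the GzipFile by hand so that mtime can be pinned to 0.
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            # Wrap the binary gzip stream so that text writers can use it.
            with io.TextIOWrapper(gz, encoding="utf-8", newline="") as text:
                write_text(text)


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def write_rows(handle):
    # Stand-in for df.to_csv(handle, index=False); the rows are made up.
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "value"])
    writer.writerow([1, "a"])


tmpdir = tempfile.mkdtemp()
first = os.path.join(tmpdir, "first.csv.gz")
second = os.path.join(tmpdir, "second.csv.gz")
write_deterministic_gz(first, write_rows)
write_deterministic_gz(second, write_rows)

# Identical content now yields identical checksums, regardless of
# when or under which file name each file was written.
checksums_match = sha256_of(first) == sha256_of(second)
```

Passing `filename=""` matters here as much as `mtime=0`: otherwise `GzipFile` can embed the output file name in the header, which would also change the checksum between differently named files.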
Deterministic checksums are necessary for truly reproducible results, and they make it easy to compare two outputs side by side, since we can compare the resource checksums of each datapackage.