Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change default parquet compression format from Snappy to LZ4 #424

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mrocklin
Copy link
Member

Snappy's status as default is maybe just due to history. Snappy had better Java support and LZ4 wasn't always available in systems like Spark. Today Spark and other systems support LZ4 as well, and LZ4 generally performs a bit better, especially on decompression.

This is a significant change, but I think the only reason not to do it is historical, which I think maybe isn't a good enough reason these days.

Snappy's status as default is maybe just due to history.  Snappy had
better Java support and LZ4 wasn't always available in systems like
Spark.  Today Spark and other systems support LZ4 as well, and LZ4
generally performs a bit better, especially on decompression.

This is a significant change, but I think the only reason not to do it
is historical, which I think maybe isn't a good enough reason these
days.
@mrocklin
Copy link
Member Author

@rjzamora any concerns here? (no urgency during the holiday)

@rjzamora
Copy link
Member

rjzamora commented Nov 27, 2023

cuDF doesn't support LZ4 compression at the moment. However, that should also improve decompression performance on GPUs once we do have support (ref). I have a feeling that the lack of support is just because no one has asked for it yet (New Issue: rapidsai/cudf#14495).

I'd obviously prefer this kind of change to come after there is LZ4 support in cudf, but I can probably find short-term workarounds if you are eager to get this in immediately.

@rjzamora
Copy link
Member

rjzamora commented Nov 27, 2023

Just a side note that LZ4 is technically "deprecated" in the Parquet spec, so we should also make sure that pyarrow is actually using the interoperable LZ4_RAW codec under the hood when the user specifies compression="lz4".

@mrocklin
Copy link
Member Author

It looks like DuckDB doesn't do this either, which is a little sad 😞

but I can probably find short-term workarounds if you are eager to get this in immediately

I'm not in a huge rush.

Just a side note that LZ4 is technically "deprecated" in the Parquet spec, so we should also make sure that pyarrow is actually using the interoperable LZ4_RAW codec under the hood when the user specifies compression="lz4".

I think that I care not-at-all what the spec says. I do care deeply what conventions and general support is among popular libraries that are used to read and write parquet.

@rjzamora
Copy link
Member

I think that I care not-at-all what the spec says. I do care deeply what conventions and general support is among popular libraries that are used to read and write parquet.

Yeah, that makes sense. Perhaps I'll rephrase my comment a bit: It seems like there has been some commotion around the LZ4 Parquet codec recently. Since I don't personally understand this commotion very well, we should probably double check with the pyarrow folks that using compression="lz4" is not about to become a problem for some reason.

@mrocklin
Copy link
Member Author

It seems like there has been some commotion around the LZ4 Parquet codec recently

This came out of this issue: apache/arrow#38389 (comment)

To give more context, most Parquet deserialization cost seems to be SNAPPY decompression. The use of SNAPPY seems to be a significant performance bottleneck. You might find that issue of interest.

we should probably double check with the pyarrow folks that using compression="lz4" is not about to become a problem for some reason

To be clear, the fact that DuckDB doesn't support LZ4 already I think makes this a problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants