New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-321: Default maximum block padding to 8MB. #232
Conversation
@isnotinvain and @julienledem, I think we should set a default max padding size before we release the row group padding in 1.8.0. There is more discussion about the 8MB limit on the issue, PARQUET-321. Thoughts? |
So this would no longer be an opt-in feature? It would be on by default? |
Yeah, it would be on by default. I think we need to do something like this so we don't end up with tiny row groups squeezed into the end of HDFS blocks. That either means starting a full-size block or padding. The main question is: what should happen if there is 2MB left before the end of the block? Should we start a 2MB row group, pad, or ignore the situation? I think the right thing to do is to pad. Our expectation is for this to be less than ~2% overhead because we're pretty good at targeting a specific row group size, but we need to decide on the minimum size we are willing to use for a row group. |
b9cc154
to
b5750f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a very reasonable default to me, much better than the current default of 0. Any chance to get it in the codebase?
I'm still +1 on this. @julienledem, any thoughts? |
+1 |
Have you been using the padding in production? Do we have any numbers to show that it helps / doesn't hurt etc? |
CDH has been using this as the default for more than a year. I don't have numbers since we don't use it for S3, but I think it's a good thing to avoid remote reads. |
OK, yeah sounds fine to me. If there's trouble with it, it can be set back to 0 by the user right? |
Yeah, users can always override this. @Zicl, would you like to update this and submit a new PR? |
@rdblue, sure, I'll do that. |
Merged #391. |
rdblue's change applied to the newest code. Original pull request: apache/parquet-mr#232 Author: Zoltan Ivanfi <zi@cloudera.com> Closes #391 from zicl/master and squashes the following commits: b1c5c1d [Zoltan Ivanfi] PARQUET-321: Default maximum block padding to 8MB. (cherry picked from commit 98c2769) Change-Id: Ic1127f0621153007075810147ac28840f7343b6d
No description provided.