Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-321: Default maximum block padding to 8MB. #232

Closed

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Jun 30, 2015

No description provided.

@rdblue
Copy link
Contributor Author

rdblue commented Jun 30, 2015

@isnotinvain and @julienledem, I think we should set a default max padding size before we release the row group padding in 1.8.0. There is more discussion about the 8MB limit on the issue, PARQUET-321. Thoughts?

@isnotinvain
Copy link
Contributor

So this would no longer be an opt-in feature? It would be on by default?

@rdblue
Copy link
Contributor Author

rdblue commented Jul 1, 2015

Yeah, it would be on by default. I think we need to do something like this so we don't end up with tiny row groups squeezed into the end of HDFS blocks. That either means starting a full-size block or padding.

The main question is: what should happen if there is 2MB left before the end of the block? Should we start a 2MB row group, pad, or ignore the situation? I think the right thing to do is to pad. Our expectation is for this to be less than ~2% overhead because we're pretty good at targeting a specific row group size, but we need to decide on the minimum size we are willing to use for a row group.

@rdblue rdblue force-pushed the PARQUET-321-set-max-padding-default branch from b9cc154 to b5750f7 Compare February 5, 2016 16:32
Copy link
Contributor

@zivanfi zivanfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very reasonable default to me, much better than the current default of 0. Any chance to get it in the codebase?

@rdblue
Copy link
Contributor Author

rdblue commented Dec 5, 2016

I'm still +1 on this. @julienledem, any thoughts?

@julienledem
Copy link
Member

+1
@isnotinvain ?

@isnotinvain
Copy link
Contributor

Have you been using the padding in production? Do we have any numbers to show that it helps / doesn't hurt etc?

@rdblue
Copy link
Contributor Author

rdblue commented Dec 6, 2016

CDH has been using this as the default for more than a year. I don't have numbers since we don't use it for S3, but I think it's a good thing to avoid remote reads.

@isnotinvain
Copy link
Contributor

OK, yeah sounds fine to me. If there's trouble with it, it can be set back to 0 by the user right?

@rdblue
Copy link
Contributor Author

rdblue commented Dec 6, 2016

Yeah, users can always override this.

@Zicl, would you like to update this and submit a new PR?

@zivanfi
Copy link
Contributor

zivanfi commented Dec 7, 2016

@rdblue, sure, I'll do that.

@zivanfi
Copy link
Contributor

zivanfi commented Dec 7, 2016

@rdblue I created #391 as you suggested.

asfgit pushed a commit that referenced this pull request Dec 7, 2016
rdblue's change applied to the newest code.

Original pull request: #232

Author: Zoltan Ivanfi <zi@cloudera.com>

Closes #391 from zicl/master and squashes the following commits:

b1c5c1d [Zoltan Ivanfi] PARQUET-321: Default maximum block padding to 8MB.
@rdblue
Copy link
Contributor Author

rdblue commented Dec 7, 2016

Merged #391.

@rdblue rdblue closed this Dec 7, 2016
cloudera-hudson pushed a commit to cloudera/parquet-mr that referenced this pull request Jul 23, 2018
rdblue's change applied to the newest code.

Original pull request: apache/parquet-mr#232

Author: Zoltan Ivanfi <zi@cloudera.com>

Closes #391 from zicl/master and squashes the following commits:

b1c5c1d [Zoltan Ivanfi] PARQUET-321: Default maximum block padding to 8MB.

(cherry picked from commit 98c2769)

Change-Id: Ic1127f0621153007075810147ac28840f7343b6d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants