-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-11969] Adds an option for setting row-group size in ParquetIO #14227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
aromanenko-dev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM. Just a couple of minor comments.
| return new AutoValue_ParquetIO_Sink.Builder() | ||
| .setJsonSchema(schema.toString()) | ||
| .setCompressionCodec(CompressionCodecName.SNAPPY) | ||
| .setRowGroupSize(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to set a default size here instead of just 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a default in ParquetWriter.Builder for rowGroupSize (here) if it is not explicitly set in the builder. So I decided to interpret 0 as fallback to that default (see line 1128 below). Do you feel that we need to introduce another default here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just asked about this because imho 0 looks like as a magic number here which makes it a bit confusing to understand why it's set to 0 without digging into follow-up code.
I think it would be more evident to set it to ParquetWriter.DEFAULT_BLOCK_SIZE by default and avoid additional if (getRowGroupSize() > 0) later. Wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your point that 0 looks like magic here (how about adding a one line comment?). But the problem with picking a default here is that I feel this is something that needs to be left to the underlying writer. For example, I am not sure if ParquetWriter.DEFAULT_BLOCK_SIZE is even the "right" choice because there is also ParquetWriter.DEFAULT_PAGE_SIZE which actually has a different value here (i.e., 1MB vs 128MB for default block size).
It does seem to me that in some places rowGroupSize and blockSize are used interchangeably in ParquetWriter. For example this constructor call is passing its blockSize parameter as the rowGroupSize of the called constructor. If you feel strongly about taking a stance here, it is okay with me but because of these complexities in ParquetWriter I thought I just leave the choice of the default to that. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I see your point. So let's keep it as it is (or -1 maybe?) just, please, add a comment on this why we set it to 0 here. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a little more about this and decided to go with your original suggestion. Now I think it is actually not a bad idea to expose a little bit of complexities inside ParquetWriter here to give a signal to the user that rowGroupSize is actually used for block-size setting too (and there is a comment too, so that should be fine).
PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to mention that another reason I went this route is that we don't need the rowGroupSize==0 check in open() below (line 1128). If anywhere else we used Sink.Builder we should have accounted for this special case but with using the new default, we don't need such checks.
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
Outdated
Show resolved
Hide resolved
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review.
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
Outdated
Show resolved
Hide resolved
| return new AutoValue_ParquetIO_Sink.Builder() | ||
| .setJsonSchema(schema.toString()) | ||
| .setCompressionCodec(CompressionCodecName.SNAPPY) | ||
| .setRowGroupSize(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a default in ParquetWriter.Builder for rowGroupSize (here) if it is not explicitly set in the builder. So I decided to interpret 0 as fallback to that default (see line 1128 below). Do you feel that we need to introduce another default here?
aromanenko-dev
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
Please add a meaningful description for your change here
This adds the plumbing needed for setting row-group size in the underlying
ParquetWriterofParquetIO.Sink.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.