Skip to content

Conversation

@bashir2
Copy link
Contributor

@bashir2 bashir2 commented Mar 13, 2021

Please add a meaningful description for your change here

This adds the plumbing needed for setting row-group size in the underlying ParquetWriter of ParquetIO.Sink.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM. Just a couple of minor comments.

return new AutoValue_ParquetIO_Sink.Builder()
.setJsonSchema(schema.toString())
.setCompressionCodec(CompressionCodecName.SNAPPY)
.setRowGroupSize(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to set a default size here instead of just 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a default in ParquetWriter.Builder for rowGroupSize (here) if it is not explicitly set in the builder. So I decided to interpret 0 as fallback to that default (see line 1128 below). Do you feel that we need to introduce another default here?

Copy link
Contributor

@aromanenko-dev aromanenko-dev Mar 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just asked about this because imho 0 looks like as a magic number here which makes it a bit confusing to understand why it's set to 0 without digging into follow-up code.

I think it would be more evident to set it to ParquetWriter.DEFAULT_BLOCK_SIZE by default and avoid additional if (getRowGroupSize() > 0) later. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your point that 0 looks like magic here (how about adding a one line comment?). But the problem with picking a default here is that I feel this is something that needs to be left to the underlying writer. For example, I am not sure if ParquetWriter.DEFAULT_BLOCK_SIZE is even the "right" choice because there is also ParquetWriter.DEFAULT_PAGE_SIZE which actually has a different value here (i.e., 1MB vs 128MB for default block size).

It does seem to me that in some places rowGroupSize and blockSize are used interchangeably in ParquetWriter. For example this constructor call is passing its blockSize parameter as the rowGroupSize of the called constructor. If you feel strongly about taking a stance here, it is okay with me but because of these complexities in ParquetWriter I thought I just leave the choice of the default to that. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see your point. So let's keep it as it is (or -1 maybe?) just, please, add a comment on this why we set it to 0 here. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a little more about this and decided to go with your original suggestion. Now I think it is actually not a bad idea to expose a little bit of complexities inside ParquetWriter here to give a signal to the user that rowGroupSize is actually used for block-size setting too (and there is a comment too, so that should be fine).

PTAL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention that another reason I went this route is that we don't need the rowGroupSize==0 check in open() below (line 1128). If anywhere else we used Sink.Builder we should have accounted for this special case but with using the new default, we don't need such checks.

Copy link
Contributor Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review.

return new AutoValue_ParquetIO_Sink.Builder()
.setJsonSchema(schema.toString())
.setCompressionCodec(CompressionCodecName.SNAPPY)
.setRowGroupSize(0)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a default in ParquetWriter.Builder for rowGroupSize (here) if it is not explicitly set in the builder. So I decided to interpret 0 as fallback to that default (see line 1128 below). Do you feel that we need to introduce another default here?

Copy link
Contributor

@aromanenko-dev aromanenko-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@aromanenko-dev aromanenko-dev merged commit 93d0be0 into apache:master Mar 17, 2021
@bashir2 bashir2 deleted the BEAM-11969 branch March 17, 2021 15:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants