
Add Parquet Format #241

Merged

Conversation

@tony810430 (Contributor) commented Apr 1, 2019

After reviewing the work on #172 and the other tests in the repo, I think the overall implementation is solid and the test coverage is sufficient.

To help get this feature contributed back more quickly, I created this PR from my own branch. That makes it easier to apply any changes requested during review.

The work in this PR also includes:

  1. Rebased all commits onto master so the PR can be merged by fast-forward.
  2. Provided a new config to set the Parquet compression type.
  3. Fixed some minor issues, such as coding style and redundant code.

Implementation Brief:

Introduce ParquetFormat and ParquetRecordWriterProvider to receive records and write them into Parquet files, by converting each SinkRecord to Avro data and writing it through AvroParquetWriter.
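As a rough sketch of the shape of this (the constructor, field wiring, and AvroData conversion calls are illustrative assumptions, not the merged code; S3ParquetOutputFile is sketched after the next paragraph):

```java
import java.io.IOException;

import io.confluent.connect.avro.AvroData;
import io.confluent.connect.s3.S3SinkConnectorConfig;
import io.confluent.connect.s3.storage.S3Storage;
import io.confluent.connect.storage.format.RecordWriter;
import io.confluent.connect.storage.format.RecordWriterProvider;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetRecordWriterProvider implements RecordWriterProvider<S3SinkConnectorConfig> {
  private static final String EXTENSION = ".parquet";
  private final S3Storage storage;
  private final AvroData avroData;

  ParquetRecordWriterProvider(S3Storage storage, AvroData avroData) {
    this.storage = storage;
    this.avroData = avroData;
  }

  @Override
  public String getExtension() {
    return EXTENSION;
  }

  @Override
  public RecordWriter getRecordWriter(final S3SinkConnectorConfig conf, final String filename) {
    return new RecordWriter() {
      ParquetWriter<GenericRecord> writer;

      @Override
      public void write(SinkRecord record) {
        try {
          if (writer == null) {
            // Build the writer lazily, from the first record's schema.
            org.apache.avro.Schema schema = avroData.fromConnectSchema(record.valueSchema());
            writer = AvroParquetWriter.<GenericRecord>builder(
                    new S3ParquetOutputFile(storage, filename))
                .withSchema(schema)
                .build();
          }
          // Assumes record (struct) values, which convert to GenericRecord.
          writer.write((GenericRecord) avroData.fromConnectData(record.valueSchema(), record.value()));
        } catch (IOException e) {
          throw new ConnectException(e);
        }
      }

      @Override
      public void commit() {
        // Closing the ParquetWriter is the only way to finalize the file.
        close();
      }

      @Override
      public void close() {
        try {
          if (writer != null) {
            writer.close();
            writer = null;
          }
        } catch (IOException e) {
          throw new ConnectException(e);
        }
      }
    };
  }
}
```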

The inner class S3ParquetOutputFile in ParquetRecordWriterProvider implements OutputFile, which must be supplied in order to build an AvroParquetWriter. Unlike the other formats, which wrap an output stream in a writer or use the stream directly, AvroParquetWriter creates the output stream itself through OutputFile's methods. The implementation therefore passes the filename and the S3 storage object to S3ParquetOutputFile's constructor and only creates the output stream when it is needed.
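Something like the following, where OutputFile and PositionOutputStream are parquet-mr's real interfaces, but the storage.create(...) call is an assumed stand-in for however the S3 stream is actually opened:

```java
import java.io.IOException;

import io.confluent.connect.s3.storage.S3Storage;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

class S3ParquetOutputFile implements OutputFile {
  private final S3Storage storage;
  private final String filename;

  S3ParquetOutputFile(S3Storage storage, String filename) {
    this.storage = storage;
    this.filename = filename;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    // The stream is created only here, i.e. when AvroParquetWriter asks for it.
    // Assumed helper; the real signature for opening the S3 stream may differ.
    return (PositionOutputStream) storage.create(filename, true);
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return create(blockSizeHint);
  }

  @Override
  public boolean supportsBlockSize() {
    return false; // S3 uploads have no HDFS-style block size
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }
}
```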

S3OutputStream now extends PositionOutputStream, implementing the new getPos() method, so that it can be accepted by ParquetWriter.
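A minimal illustration of the contract (a ByteArrayOutputStream stands in for the real multipart-upload buffering, which is assumed):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.parquet.io.PositionOutputStream;

// Illustrative only: the real S3OutputStream buffers parts for an S3
// multipart upload; the point here is the position-tracking required
// by parquet-mr's PositionOutputStream.
public class S3OutputStream extends PositionOutputStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private long position = 0L;

  @Override
  public void write(int b) throws IOException {
    buffer.write(b);
    position += 1;
  }

  @Override
  public long getPos() throws IOException {
    return position; // ParquetWriter uses this to record column-chunk offsets
  }

  public void commit() throws IOException {
    // In the real connector this completes the S3 multipart upload.
  }
}
```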

Because we can't control when or how a file is committed through AvroParquetWriter, and the only way to commit a file manually is to close it, we wrapped S3OutputStream in S3ParquetOutputStream to guarantee that S3OutputStream#commit() is called whenever S3ParquetOutputStream#close() is called. Even though we can't tell whether close() was triggered by a commit or a close, that's fine because the S3 sink connector is idempotent.
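Sketched against the illustrative S3OutputStream above, the wrapper boils down to:

```java
import java.io.IOException;

public class S3ParquetOutputStream extends S3OutputStream {
  private boolean committed = false;

  @Override
  public void close() throws IOException {
    if (!committed) {
      // Guarantee S3OutputStream#commit() runs exactly once, whether the
      // ParquetWriter was closed for a commit or for a rotation/shutdown.
      commit();
      committed = true;
    }
    super.close();
  }
}
```

The committed flag makes the commit run only once even if close() is called again, which mirrors the idempotence argument above.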

The last file modified is S3SinkConnectorConfig. I added a new configuration for the Parquet compression type, such as gzip and snappy; any compression type supported by the Parquet library can be configured. Since I took the values directly from the Parquet library, and they don't match the existing S3 compression type config, I introduced a new config name to distinguish them.
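For illustration, a hedged sketch of how the config value could be applied to the writer builder. CompressionCodecName.fromConf(...) is parquet-mr's own helper (it maps "gzip", "snappy", etc. case-insensitively onto the codec enum, and null onto UNCOMPRESSED); the factory class and method names here are hypothetical, and note that the s3.parquet.compression.type key was later removed before merge (see the maintainer's comment further down):

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.config.AbstractConfig;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.OutputFile;

public class ParquetWriterFactory {
  // Hypothetical glue mapping the PR's config value to parquet-mr's codec enum.
  static ParquetWriter<GenericRecord> buildWriter(
      AbstractConfig conf, OutputFile file, Schema schema) throws IOException {
    CompressionCodecName codec =
        CompressionCodecName.fromConf(conf.getString("s3.parquet.compression.type"));
    return AvroParquetWriter.<GenericRecord>builder(file)
        .withSchema(schema)
        .withCompressionCodec(codec)
        .build();
  }
}
```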

Last but not least, the unit tests for the Parquet format implementation use the Avro format tests as a reference, and I think they are sufficient. Because we removed the dependency on Hive, we also lost some transitive dependencies needed for Parquet; I added back as few of them as possible.

@ConfluentCLABot commented Apr 1, 2019

@confluentinc It looks like @tony810430 just signed our Contributor License Agreement. 👍

Always at your service,

clabot

@tony810430 (Contributor, Author) commented Apr 1, 2019

@kkonstantine, @eakmanrq
Please help review this PR. Thanks.

By the way, this PR's tests rely on confluentinc/kafka-connect-storage-common#99 being merged first, since they use a new API from parquet-mr 1.10.0.

@tony810430 force-pushed the tony810430:feature/add_parquet_format branch from 6509fa9 to 66af002 Apr 1, 2019
@tony810430 (Contributor, Author) commented Apr 22, 2019

@kkonstantine, @eakmanrq
Would you have some time to review this PR? Thanks.

@eakmanrq commented Apr 23, 2019

Code itself looks good to me. @tony810430 Have you had a chance to test this out in a production environment? Curious what the performance looks like.

@tony810430 (Contributor, Author) commented Apr 24, 2019

@eakmanrq Yes, I have already tried this out in our production environment. However, my use case is just a small topic with only one partition and low QPS. I have been running it for over a week and have verified that the uploaded data is all correct. I haven't seen any critical performance degradation during that time, but as I said, at this scale it's hard to judge the performance.

@tony810430 mentioned this pull request May 8, 2019
@kkonstantine (Member) left a comment

Thanks @tony810430 for driving this PR!

Just left a few initial comments. I haven't finished a complete review, but before we continue, it'd be nice to describe at a high level how Parquet file upload is meant to be implemented with this S3 Kafka connector (ideally we'd add this description to the merge commit's message). Thanks!

@tony810430 force-pushed the tony810430:feature/add_parquet_format branch 2 times, most recently from d7debaf to 6cc7209 May 8, 2019
@tony810430 force-pushed the tony810430:feature/add_parquet_format branch from 6cc7209 to 334799a May 8, 2019
@tony810430 (Contributor, Author) commented May 8, 2019

@kkonstantine, I have updated the implementation description in my initial comment. I'm not sure if that's what you meant; please let me know if you need more information.

@jocelyndrean commented May 16, 2019

I tried this PR today and everything worked perfectly :) Thanks @tony810430 for this amazing job!

@tony810430 (Contributor, Author) commented May 17, 2019

@jocelyndrean Thanks for sharing your experience. If you have free time, it would be great if you could help review this PR, so we can get this feature merged and let more users benefit from it. =)

@jocelyndrean left a comment

LGTM. FYI: it's been running in my staging environment for hours now and has stored around 200 GB of Parquet files on S3. Nothing to report. Works perfectly.

@tony810430 (Contributor, Author) commented Sep 16, 2019

@kkonstantine I have addressed the comments.

@kkonstantine (Member) left a comment

Thanks @tony810430 for the quick turnaround.
Your commit implementation is what I had in mind.
Added a few more comments. Catching my delayed flight from SFO, I'll return for a final look on the tests.
Almost there!
Thanks!

@kkonstantine (Member) commented Sep 17, 2019

Also, DataWriterParquetTest.testProjectNoVersion seems to fail the build job. Are you able to reproduce locally?

@tony810430 (Contributor, Author) commented Sep 17, 2019

I'm not sure why DataWriterParquetTest.testProjectNoVersion failed. It works on my machine, and I also notice that it passed in the 12th build, even though there is no change between the 12th and 13th builds.

I'll try building the other Confluent dependencies locally at their latest versions and check whether I can reproduce the failure.

…t testing to align with DataWriterAvroTest
@tony810430 (Contributor, Author) commented Sep 17, 2019

I found some discrepancies between DataWriterParquetTest and DataWriterAvroTest. I have made them consistent, though I'm not sure I've fully understood the original author's intent.

@kkonstantine (Member) left a comment

Thanks for fixing the test @tony810430
I think we are almost set.

Let's pin the hadoop version to 3.2.0 in the pom.xml
Also, do you know whether we need hadoop-mapreduce-client-core outside tests? I'd like to include only the absolutely necessary deps.

One last thing that I'd prefer to have is a unified config for compression types. Adding another config doesn't seem absolutely necessary, but unfortunately I don't think we can check format.class from within the validator (I might be misremembering things here). Maybe it's worth considering in another iteration before we release.

@tony810430 (Contributor, Author) commented Sep 18, 2019

I have been running this feature since the 5.2.x release. I see that it has the hadoop-mapreduce-client-core jar in its lib/ folder, but I didn't try removing it when I upgraded to 5.3.x, so I have no idea whether it is necessary outside tests.

As for unified compression types, the current Validator API seems to lack the access to other configs' settings that Recommender has; otherwise we could declare format.class as a dependent and access it in CompressionTypeValidator. For now, I don't have a good solution for verifying format.class during the parsing phase.
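For reference, this is the asymmetry in the Kafka Connect API (the recommender class below is a hypothetical stub; the interface signatures are the real ones):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;

// ConfigDef.Recommender receives the whole parsed config map, so it could
// inspect format.class; ConfigDef.Validator#ensureValid(String, Object) only
// ever sees the single value being validated.
public class CompressionTypeRecommender implements ConfigDef.Recommender {
  @Override
  public List<Object> validValues(String name, Map<String, Object> parsedConfig) {
    // format.class is reachable here, e.g. parsedConfig.get("format.class")
    return Collections.emptyList();
  }

  @Override
  public boolean visible(String name, Map<String, Object> parsedConfig) {
    return true;
  }
}
```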

@tony810430 requested a review from kkonstantine Sep 24, 2019
kkonstantine added 2 commits Oct 8, 2019
  * parquet-codec will be added in storage-common for all the storage sink connectors
@kkonstantine (Member) commented Oct 8, 2019

With my last two commits I'm pinning the hadoop dependencies to the latest bugfix version (3.2.1) and removing the config s3.parquet.compression.type. The current PR will be merged without compression support for Parquet, but shortly, again targeting CP 5.4, a parquet.codec config similar to avro.codec will be added to storage-common.

@kkonstantine (Member) left a comment

Terrific work @tony810430 !
LGTM

Merging parquet support for S3

@kkonstantine changed the title from "Add Parquet Format based on PR #172" to "Add Parquet Format" Oct 8, 2019
@kkonstantine merged commit 61bb29d into confluentinc:master Oct 8, 2019
1 check passed
continuous-integration/jenkins/pr-merge: This commit looks good
@yauhen-sobaleu commented Oct 27, 2019

Hi guys, is it possible to test the Parquet format with the S3 connector using the confluentinc/cp-kafka-connect:5.4.0-beta1 Docker image? Is it already included?

@tony810430 deleted the tony810430:feature/add_parquet_format branch Oct 28, 2019
@bstaudacher commented Nov 7, 2019

Very excited for this! Thank you for your hard work @tony810430.
What release will this be available in? Can't wait to try it out.

@NathanNam commented Dec 19, 2019

It will be part of CP 5.4.
