Add Parquet Format #241
Conversation
@confluentinc It looks like @tony810430 just signed our Contributor License Agreement. 👍 Always at your service, clabot
@kkonstantine, @eakmanrq btw, this PR's tests rely on confluentinc/kafka-connect-storage-common#99 being merged first, since they use a new API from parquet-mr 1.10.0.
Force-pushed from 6509fa9 to 66af002
@kkonstantine, @eakmanrq
Code itself looks good to me. @tony810430 Have you had a chance to test this out in a production environment? Curious what the performance looks like.
@eakmanrq Yes, I have already tried this out in our production environment. However, my use case is just a small topic with only one partition and low QPS. I have been running it for over a week and have verified that all the uploaded data is correct. I didn't see any critical performance degradation during these days, but as I said above, it's hard to tell what the performance really looks like.
Thanks @tony810430 for driving this PR!
Just left a few initial comments. I haven't finished a complete review, but before we continue, it'd be nice to describe at a high level how Parquet file upload is meant to be implemented here with this S3 Kafka connector (ideally we'd add this description to the commit's merge message). Thanks!
Resolved review threads:
kafka-connect-s3/src/main/java/io/confluent/connect/s3/S3SinkConnectorConfig.java (outdated)
kafka-connect-s3/src/main/java/io/confluent/connect/s3/format/parquet/ParquetFormat.java (outdated)
kafka-connect-s3/src/main/java/io/confluent/connect/s3/format/parquet/ParquetRecordWriterProvider.java
Force-pushed from d7debaf to 6cc7209
…provide TimeBasedPartitioner and time field
Force-pushed from 6cc7209 to 334799a
@kkonstantine, I have updated the description of the implementation in my initial comment. I'm not sure if that's what you meant and wanted; please let me know if you need more information.
I tried this PR today and everything worked perfectly :) Thanks @tony810430 for this amazing job!
@jocelyndrean Thanks for sharing your experience. It would be great if you have some free time to help review this PR, so we can get this feature merged for more users to benefit from it. =)
LGTM. FYI: it's been running in my staging environment for hours now, and this version has stored around 200 GB of Parquet files on S3. Nothing to report. Works perfectly.
@kkonstantine I have already addressed the comments.
Thanks @tony810430 for the quick turnaround.
Your commit implementation is what I had in mind.
Added a few more comments. I'm catching my delayed flight from SFO; I'll return for a final look at the tests.
Almost there!
Thanks!
Resolved review threads:
kafka-connect-s3/src/main/java/io/confluent/connect/s3/S3SinkConnectorConfig.java (outdated)
kafka-connect-s3/src/main/java/io/confluent/connect/s3/storage/S3OutputStream.java
kafka-connect-s3/src/main/java/io/confluent/connect/s3/storage/S3Storage.java (outdated)
kafka-connect-s3/src/main/java/io/confluent/connect/s3/storage/S3ParquetOutputStream.java
kafka-connect-s3/src/main/java/io/confluent/connect/s3/storage/S3ParquetOutputStream.java (outdated)
Also, I'm not sure why the tests are failing. I'll try to build the other Confluent dependencies locally with the latest version, and check if I can reproduce the failure locally.
…t testing to align with DataWriterAvroTest
I found that there were some discrepancies between the Parquet format tests and `DataWriterAvroTest`, so I updated the tests to align with `DataWriterAvroTest`.
Thanks for fixing the test @tony810430
I think we are almost set.
Let's pin the hadoop version to 3.2.0 in the pom.xml.
Also, do you know whether we need `hadoop-mapreduce-client-core` outside tests? I'd like to include only the absolutely necessary deps.
One last thing that I'd prefer to have is a unified config for compression types. Adding another config doesn't seem absolutely necessary, but unfortunately I don't think we can check the `format.class` from within the validator (I might not be remembering things correctly here). Maybe it's worth considering in another iteration before we release.
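For context on that limitation: a Connect `ConfigDef.Validator` receives only the name and value of the single config being validated, so it has no access to other configs such as `format.class`. A minimal sketch of that constraint (the class name and codec list here are illustrative, not from this PR):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigException;

// Hypothetical validator for a compression config: ensureValid() only sees
// this one config's name and value, never other configs such as
// format.class, so the allowed values can't depend on the chosen format.
public class CompressionCodecValidator implements ConfigDef.Validator {
  private static final List<String> PARQUET_CODECS =
      Arrays.asList("none", "gzip", "snappy");

  @Override
  public void ensureValid(String name, Object value) {
    if (!PARQUET_CODECS.contains(String.valueOf(value))) {
      throw new ConfigException(name, value, "Unsupported Parquet compression codec");
    }
  }
}
```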
Resolved review thread:
kafka-connect-s3/src/main/java/io/confluent/connect/s3/storage/S3OutputStream.java
I have run this feature since the 5.2.x released version. For unified compression types, it seems that the current compression config doesn't line up with the Parquet codecs:
* parquet-codec will be added in storage-common for all the storage sink connectors
With my last two commits I'm setting the Hadoop dependencies to the latest bugfix version.
Terrific work @tony810430!
LGTM
Merging parquet support for S3
Hi guys, is it possible to test the Parquet format with the S3 connector using the confluentinc/cp-kafka-connect:5.4.0-beta1 Docker image? Is it already included?
Very excited for this! Thank you for your hard work @tony810430.
It will be part of CP 5.4.
When I try to use this, I'm getting an error that a Hadoop class is not found. Is this a defect, or am I maybe missing a prereq or other configuration in my container?
After reviewing the work in #172 and the other tests in the repo, I thought the overall implementation was great and the test coverage sufficient.
To help accelerate contributing this feature back, I created this PR based on my own branch.
That also makes it more convenient to apply any changes needed after others' reviews.
The work in the PR also includes:
Implementation Brief:
Introduce `ParquetFormat` and `ParquetRecordWriterProvider` to receive records and write them into Parquet files, by converting each `SinkRecord` to Avro data and writing it through `AvroParquetWriter`.
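For reference, a minimal sketch of building such a writer through the `OutputFile` builder API added in parquet-mr 1.10.0 (the reason this PR depends on storage-common#99). The helper class, schema, and codec choice are placeholders, not the PR's actual wiring; the `OutputFile` comes from the `S3ParquetOutputFile` sketched further below:

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.io.OutputFile;

class ParquetWriterSketch {
  // Build a ParquetWriter on top of an OutputFile instead of a Hadoop Path,
  // using the builder API introduced in parquet-mr 1.10.0. The schema and
  // codec stand in for whatever the connector derives from the SinkRecord
  // and its configuration.
  static ParquetWriter<GenericRecord> open(OutputFile file, Schema avroSchema)
      throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(file)
        .withSchema(avroSchema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
  }
}
```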
The inner class `S3ParquetOutputFile` in `ParquetRecordWriterProvider` implements `OutputFile`, which must be provided in order to build an `AvroParquetWriter`. Whereas the other formats wrap the output stream in a writer or use the stream directly, `AvroParquetWriter` uses `OutputFile`'s methods to create its output stream, so the implementation passes the filename and the S3 storage object to `S3ParquetOutputFile`'s constructor and creates the output stream only when it is needed.

For `S3OutputStream`, its interface changed to `PositionOutputStream`, which requires implementing one new method (`getPos()`), so that it can be accepted by `ParquetWriter`.
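A minimal sketch of that adapter, assuming `S3Storage` exposes a `create(path, overwrite)` method returning the connector's `PositionOutputStream` implementation; field and method details are illustrative, not the merged code:

```java
import java.io.IOException;
import io.confluent.connect.s3.storage.S3Storage;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Sketch of the inner S3ParquetOutputFile: it holds the filename and the S3
// storage object, and defers stream creation until the writer asks for it.
class S3ParquetOutputFile implements OutputFile {
  private final S3Storage storage;  // the connector's S3 storage object
  private final String filename;

  S3ParquetOutputFile(S3Storage storage, String filename) {
    this.storage = storage;
    this.filename = filename;
  }

  @Override
  public PositionOutputStream create(long blockSizeHint) throws IOException {
    // Created lazily: the stream only exists once AvroParquetWriter needs it.
    return storage.create(filename, true);
  }

  @Override
  public PositionOutputStream createOrOverwrite(long blockSizeHint) throws IOException {
    return create(blockSizeHint);
  }

  @Override
  public boolean supportsBlockSize() {
    return false;
  }

  @Override
  public long defaultBlockSize() {
    return 0;
  }
}
```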
Because we can't control when or how a file is committed through `AvroParquetWriter`, and the only way to manually commit a file is to close it, we wrapped `S3OutputStream` with `S3ParquetOutputStream` to make sure `S3OutputStream#commit()` is called whenever `S3ParquetOutputStream#close()` is called. Even though we don't know whether a close was triggered by a commit or an actual close, this is okay due to the idempotent property of the S3 sink connector.
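A minimal sketch of that wrapper, assuming the existing `S3OutputStream` constructor shape and that `commit()` completes the upload and releases the underlying resources; signatures are illustrative, not the merged code:

```java
import java.io.IOException;
import com.amazonaws.services.s3.AmazonS3;
import io.confluent.connect.s3.S3SinkConnectorConfig;

// Sketch: force commit() on close(), because closing the stream is the only
// way AvroParquetWriter lets us finish a file.
public class S3ParquetOutputStream extends S3OutputStream {
  private volatile boolean committed = false;

  public S3ParquetOutputStream(String key, S3SinkConnectorConfig conf, AmazonS3 s3) {
    super(key, conf, s3);
  }

  @Override
  public void close() throws IOException {
    // Whether this close was triggered by a commit or a plain close,
    // committing here is safe because the S3 sink's writes are idempotent.
    if (!committed) {
      committed = true;
      commit();
    }
  }
}
```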
The last modified file is `S3SinkConnectorConfig`. I added a new configuration for the Parquet compression type, such as `gzip`, `snappy`, etc. All compression types supported by the Parquet library can be configured. Since I took the values directly from the Parquet library, and they don't match the existing S3 compression type config, I chose to introduce a new config name to distinguish them.

Last but not least, all of the unit tests for the Parquet format implementation took the Avro format tests as a reference, and I think they are sufficient. Because we removed the dependency on Hive, we also lost some of the dependencies needed for Parquet; I added them back, keeping them as few as possible.