Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-11482] Thrift support for KafkaTableProvider #13572

Merged
merged 1 commit into from
Dec 30, 2020

Conversation

ccciudatu
Copy link
Contributor

@ccciudatu ccciudatu commented Dec 17, 2020

This is a follow-up for #13428 to add Thrift support in KafkaTableProvider, since we now have all the building blocks for a BeamKafkaThriftTable.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@ccciudatu
Copy link
Contributor Author

R: @TheNeuralBit @chrlarsen

@ccciudatu
Copy link
Contributor Author

@TheNeuralBit Is installing the thrift binary on Jenkins servers an option?
Otherwise, I'll have to fall back to generating thrift manually and add it to source control, the same as @chrlarsen and I did within the io module.

@ccciudatu
Copy link
Contributor Author

R: @piotr-szuberski

Copy link

@pjotrekk pjotrekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I've left few comments, mostly cosmetic.

I don't have any knowledge about Thrift though so I'd suggest adding someone who took part in it.

@codecov
Copy link

codecov bot commented Dec 19, 2020

Codecov Report

Merging #13572 (1466db9) into master (b6243e7) will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13572      +/-   ##
==========================================
- Coverage   82.74%   82.73%   -0.02%     
==========================================
  Files         466      466              
  Lines       57518    57518              
==========================================
- Hits        47594    47586       -8     
- Misses       9924     9932       +8     
Impacted Files Coverage Δ
...thon/apache_beam/runners/portability/job_server.py 51.06% <0.00%> (-14.57%) ⬇️
sdks/python/apache_beam/io/localfilesystem.py 91.66% <0.00%> (-0.76%) ⬇️
sdks/python/apache_beam/utils/subprocess_server.py 53.17% <0.00%> (-0.58%) ⬇️
...dks/python/apache_beam/options/pipeline_options.py 94.56% <0.00%> (-0.04%) ⬇️
sdks/python/apache_beam/examples/sql_taxi.py 0.00% <0.00%> (ø)
sdks/python/apache_beam/transforms/core.py 91.54% <0.00%> (+0.02%) ⬆️
...hon/apache_beam/runners/worker/bundle_processor.py 93.83% <0.00%> (+0.12%) ⬆️
sdks/python/apache_beam/runners/runner.py 74.54% <0.00%> (+0.15%) ⬆️
...eam/runners/interactive/interactive_environment.py 90.28% <0.00%> (+0.35%) ⬆️
...runners/interactive/display/pcoll_visualization.py 85.86% <0.00%> (+0.52%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b6243e7...be47279. Read the comment docs.

@ccciudatu
Copy link
Contributor Author

@piotr-szuberski I pushed a fixup to address all your comments. I'll double check if I missed anything, but please have another look.

Copy link

@pjotrekk pjotrekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've left only one comment.

@ccciudatu
Copy link
Contributor Author

Squashed it all in a single commit and rebased on top of master.

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more requests. This is looking good overall

import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TypeDescriptor;

public final class SqlRows {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting close to the abstraction that I was discussing in your other PR. A couple suggestions for this class:

  • Let's make it @Internal, we don't want Beam users to use this and expect backwards compatibility. It's only public so other Beam modules can use it.
  • I think it belongs in the schemas package
  • Drop SQL from the name. Right now this is only used in SQL, but we will want to use this infra outside of SQL at some point (which is why I don't think it belongs in the sql package). Maybe something like RowMessages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a tiny step towards whatever abstractions we want to end up with, as it doesn't even cover Avro yet.
I'm still trying to wrap my head around how would a proper data format abstraction look like in Beam.
I mean, instead of FileIO, AvroIO, KafkaIO, ThriftIO plus Kafka tables for each data format, what if we took the format out of the IO as an entirely different concern and provide the ability to mix those freely. If IOs are just [File, Kafka, PubSub, Socket, whatever], and data formats are [Avro, Protobuf, CSV, Thrift, whatever], we can (in theory) allow any combination like PubSub+Thrift or File+Proto with no extra cost (except for incompatible candidates, like Kafka+parquet, perhaps).

Do you think this is worth considering (for 3.0, of course)? I am aware that such an abstraction is really difficult to get right (if at all feasible), but I still think it may be worth trying, as we should at least end up reducing the number of data_source+data_format combinations that are now modelled as new datasource types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this sounds ideal. As far as 3.0 - I think it would make sense to start working on it in 2.x versions as an @Experimental API. We shouldn't remove the existing IOs prior to a major version change, but it would be fine to start work on an alternative. This would be similar to (and related to) the SchemaIOProvider concept we had a Google intern work on last summer:

Would you have any interest in working on an effort like this?

CC: @robinyqiu a design like this would be relevant for your work on IOs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be more than happy to get involved. Most likely, I'll need to join your slack channel, and for that I need an apache.org email, apparently. I'll see how I can sort that out...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you can get an invite at https://s.apache.org/slack-invite (more info here: https://infra.apache.org/slack.html)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds easy. Not sure who's supposed to invite me, but I'll ask around. If you can help with that, I'd be grateful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure I can invite you as a guest. I think https://s.apache.org/slack-invite is supposed to be self-service but the page indicates it breaks it often. Can you send me a message at bhulette@apache.org so I know what email to use?

CHANGES.md Outdated Show resolved Hide resolved
@TheNeuralBit
Copy link
Member

any chance this could end up in the 2.27.0 release or should I revisit the update to CHANGES.md?

The branch has already been cut so we'd need to cherrypick it to get it into 2.27.0. Usually that's reserved for bugfixes, but it's up to the discretion of the release manager

@TheNeuralBit
Copy link
Member

Run SQL Postcommit

Copy link
Member

@TheNeuralBit TheNeuralBit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you @ccciudatu!

@TheNeuralBit
Copy link
Member

@chrlarsen would you have time to take a look? Specifically it would be good to check over the changes to ThriftSchema

@chrlarsen
Copy link
Contributor

@TheNeuralBit definitely I'll have a look.

Copy link
Contributor

@chrlarsen chrlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The ThriftSchema changes look to be in line with what the Thrift IDL specifies, thanks @ccciudatu!

@ccciudatu
Copy link
Contributor Author

ccciudatu commented Dec 25, 2020

@piotr-szuberski @TheNeuralBit @chrlarsen Thanks for looking into this!

I resolved merge conflicts for CHANGES.md and added the following cosmetics:

  • simplified generics, as the extends Enum<..> constraint is no longer needed
  • dedicated mockTable versions for proto/thrift in KafkaTableProviderTest
  • listed the DEFAULT and OPTIONAL "requiredness" types explicitly, falling through to default -- for self-documenting purposes and to eliminate the compiler warning for too few cases (@piotr-szuberski, I hope you're ok with this)

@TheNeuralBit
Copy link
Member

Run SQL PostCommit

@TheNeuralBit TheNeuralBit merged commit 9de4ea4 into apache:master Dec 30, 2020
@aromanenko-dev
Copy link
Contributor

@ccciudatu Thanks for contribution!

For the future PRs, please, don't use your local master (or main) branch, create a dedicated feature branch for changes.

Perhaps, it might help:
https://cwiki.apache.org/confluence/display/BEAM/Git+Tips

@TheNeuralBit
Copy link
Member

@aromanenko-dev whoops sorry about that, I didn't notice this PR was opened from a master branch. Just curious, did it cause any problems in this case?

@aromanenko-dev
Copy link
Contributor

aromanenko-dev commented Jan 6, 2021

@TheNeuralBit Afaict, it should not cause big problems after it was merged (only it won't be possible to delete a "feature" branch in fork and it will rewrite a commit history I guess). Though, until merge, all pushes to fork master branch will appear in the PR which likely won't be acceptable and better avoid this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants