[BEAM-11065] Apache Beam pipeline example to ingest from Apache Kafka to Google Pub/Sub #13112

Merged: 78 commits into apache:master, Dec 14, 2020

Contributor

@ilya-kozyrev ilya-kozyrev commented Oct 14, 2020

[Proposal] Apache Beam pipeline example to ingest data from Apache Kafka to Google Cloud Pub/Sub. It can be used as a Dataflow Flex template in the Google Cloud Platform.

@ilya-kozyrev ilya-kozyrev changed the title [WIP] Apache Beam Template to ingest from Apache Kafka to Google Pub/Sub [BEAM-11065] [WIP] Apache Beam Template to ingest from Apache Kafka to Google Pub/Sub Oct 14, 2020
@kennknowles
Member

Also pinging @azurezyq

Member

@TheNeuralBit TheNeuralBit left a comment

I feel strongly that this example should drop the "Beam template" terminology. Templates (both flex and classic) are a Dataflow feature, and the concept has no meaning in the Beam context.

A Dataflow template is just a Beam pipeline that has been packaged into a container. This allows us (Dataflow) to provide a convenient way for users to run multiple instances of this pipeline (the template) just by specifying a few parameters, which are mapped to the pipeline options. Beam doesn't provide any features that can do that mapping, which is why I take issue with the "Beam template" concept.

That doesn't mean that something like this isn't useful in Apache Beam, it absolutely is! But for Beam the analogue is just an example pipeline, like the other pipelines in examples/. The language in the README suggesting that users modify the pipeline to try out different behavior is exactly the sort of thing we encourage people to do with the other example pipelines - for example in the wordcount walkthrough.

Fortunately I don't think there's much to do to make this into a Beam example:

  • Move the code to examples/, in the org.apache.beam.examples package.
  • Remove Dataflow-template specific files (I think this is just kafka_to_pubsub_metadata.json)
  • Remove directions for running this as a Dataflow template, and consider replacing them with directions for running on Dataflow (and other runners) directly.
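
The review above describes a template as a packaged pipeline whose run parameters are mapped onto pipeline options. A minimal sketch of that mapping idea follows, in plain Java with no Beam dependency. It is not Beam's or Dataflow's implementation: the class `TemplateParameterSketch`, its `parseArgs` method, and the parameter names (`bootstrapServers`, `inputTopics`, `outputTopic`) are hypothetical and chosen only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: this is NOT Beam's PipelineOptionsFactory
// or Dataflow's template launcher, just the "parameters -> options" idea.
final class TemplateParameterSketch {

  /** Turns "--name=value" arguments into an ordered name -> value map. */
  static Map<String, String> parseArgs(String... args) {
    Map<String, String> options = new LinkedHashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      // Reject anything that is not "--<non-empty name>=<value>".
      if (!arg.startsWith("--") || eq <= 2) {
        throw new IllegalArgumentException("Expected --name=value, got: " + arg);
      }
      // Split on the first '=' only, so values may themselves contain '='.
      options.put(arg.substring(2, eq), arg.substring(eq + 1));
    }
    return options;
  }

  public static void main(String[] args) {
    // Parameter names below are assumptions, not taken from the pull request.
    Map<String, String> options =
        parseArgs(
            "--bootstrapServers=broker:9092",
            "--inputTopics=orders",
            "--outputTopic=projects/my-project/topics/orders");
    options.forEach((name, value) -> System.out.println(name + " -> " + value));
  }
}
```

In a real Beam pipeline the same role is played by a `PipelineOptions` interface populated from command-line arguments, which is why the pipeline can also be run directly on any runner without the template packaging.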

@griscz
Contributor

griscz commented Dec 1, 2020

Thanks for the review @TheNeuralBit

@ilya-kozyrev these suggestions make sense. What we should aim to do is provide an example of how a Beam pipeline is constructed. In the README file, we can add that this Beam pipeline was also created as a Dataflow Flex Template and that people can learn how to write something like this in the Dataflow templates repo. We can then provide a link to your template hosted there.

@TheNeuralBit
Member

> In the README file, we can add that this Beam pipeline was also created as a Dataflow Flex Template and that people can learn how to write something like this in the Dataflow templates repo. We can then provide a link to your template hosted there.

This is a good idea. There could be a pointer in the other direction as well (from the Dataflow template to the Beam example), in case anyone discovers the flex template and is interested in learning how to run a similar pipeline on other runners.

@ilya-kozyrev
Contributor Author

retest this please

@ilya-kozyrev ilya-kozyrev changed the title [BEAM-11065] Apache Beam Template to ingest from Apache Kafka to Google Pub/Sub [BEAM-11065] Apache Beam pipeline example to ingest from Apache Kafka to Google Pub/Sub Dec 7, 2020
Member

@TheNeuralBit TheNeuralBit left a comment

I think actually the right place to put this would be in examples/java/complete/kafka-to-pubsub. Then we should link to it from apache/beam/examples/complete/README.md (sorry I wasn't more clear on this before).

Member

@TheNeuralBit TheNeuralBit left a comment

Sorry I accidentally hit submit early on the other review. I have a few more comments.

.apply("createValues", Values.create())
.apply("writeAvrosToPubSub", PubsubIO.writeAvros(AvroDataClass.class));

} else {
Member

Is it worth having this PUBSUB path? The README and javadoc only discuss the AVRO path. I think we should keep just that one and remove the enum.

Contributor

I see why it may seem to add little value to the example, thank you for noticing this!

I suppose it is worth having the PUBSUB path because this example works out of the box with it. For the AVRO path, the user has to add some code to make it work and also has to understand what should be changed and how. The PUBSUB path doesn't require that.

I also updated the README file to highlight the value of it.

Member

@TheNeuralBit TheNeuralBit left a comment

LGTM, thanks for all your work on this!

@TheNeuralBit TheNeuralBit merged commit 34ae21b into apache:master Dec 14, 2020
dxichen pushed a commit to linkedin/beam that referenced this pull request Aug 9, 2021
… to Google Pub/Sub (apache#13112)

* add initial template and dependencies

* Added flex template creation with metadata support and instructions

* added new gradle modules for templates

* moved metadata to template location, reverted examples build.gradle

* Moved KafkaToPubsub to template, implemented options in separate package

* Added package-info.java to new packages

* Reverted build.gradle to master branch state

* fixed JAVADOC and metadata

* Added the Read Me section with a step-by-step guide

* Update README.md

* Readme fixes regarding comments

* Update README.md

* Update README.md

* Fixed typos in README.md

* refactored README.md added case to run template locally

* Update README.md

* fix build script for dataflow in README.md

* Added unit test and fixed metadata file

* Added Licenses and style fixes

* Added support for retrieving Kafka credentials from HashiCorp Vault secret storage with url and token

* Updated README.md and metadata with parameters for Vault access; refactored Kafka configuration

* Style fix

* Added description for Vault parameters in metadata

* FIX trailing whitespaces in README.md

* FIX. Blank line contains whitespace README.md

* Update README.md

* Refactored to examples folder

* Added conversion from JSON into PubsubMessage and extracted all transformations from the pipeline class into the separate class

* Whitespacelint fix

* Updated README.md and output formats

* Update README.md

* Update README.md

* Added support for SSL and removed outputFormat option

* Added avro usage example

* Added ssl to AVRO reader

* FIX whitespaces.

* added readme/docs regarding of Avro

* README.md and javadoc fixes

* Added Vault's response JSON schema description

* Style fix

* Refactoring.

* Fixed ssl parameters

* Fixed style

* optimize build.gradle

* Resolve conversations

* Updated regarding comments and added unit tests

* README.md update

* made Avro class more abstract

* fix style

* fixed review conversation items

* fix getting ssl credentials from Vault

* FIX add empty && null map validation to sslConfig

* FIX. remove vault ssl certs parameters

* metadata fix

* Local paths fix for SSL from GCS

* add new log message to avoid wrong local files usage

* fix style

* Moved kafka-to-pubsub to examples/ directory and updated README.md (#6)

* Stylefix

* Removed unused file

* add tbd section for e-2-e tests

* fix styles

* specifying kafka-clients version

* fix readme

* template -> examples

* Update examples/kafka-to-pubsub/README.md

Co-authored-by: Brian Hulette <hulettbh@gmail.com>

* Fixed outdated import

* Moved template to examples/complete

* Updated paths in readme file

* Updated README.md and javadoc regarding comments

* README.md stylefix

* Added link to KafkaToPubsub example into complete/README.md

* Stylefix

Co-authored-by: Artur Khanin <artur.khanin@kzn.akvelon.com>
Co-authored-by: Artur Khanin <artur.khanin@akvelon.com>
Co-authored-by: AKosolapov <AKosolapov@users.noreply.github.com>
Co-authored-by: ramazan-yapparov <75415515+ramazan-yapparov@users.noreply.github.com>
Co-authored-by: Ramazan Yapparov <ramazan.yapparov@akvelon.com>
Co-authored-by: Brian Hulette <hulettbh@gmail.com>
9 participants