[BEAM-2661] Adds KuduIO #6021

timrobertson100 · 2018-07-23T21:29:12Z

Provides an implementation and tests for KuduIO.

Please note that design decisions have been captured on BEAM-2661.
This implementation follows design patterns similar to CassandraIO and the naming conventions of BigQueryIO.

The decision to use mocked and fake services for the unit tests was not taken lightly; they will be replaced when Kudu offers an easier solution for Java (see KUDU-2411).

This implementation would benefit from the addition of authentication, and the BoundedSource could be replaced by a DoFn. I propose adding those at a later date.

lukecwik · 2018-07-25T21:01:07Z

Run Go PreCommit

lukecwik · 2018-07-25T21:01:12Z

Run Java PreCommit

lukecwik · 2018-07-25T21:01:19Z

Run Python PreCommit

timrobertson100 · 2018-07-26T07:48:08Z

Run Java PreCommit

timrobertson100 · 2018-07-26T08:16:02Z

Thanks @lukecwik
As far as I can tell, the failing build is related to the build environment, not to KuduIO.

Most recent build scan here

timrobertson100 · 2018-07-26T13:11:31Z

Run Java PreCommit

timrobertson100 · 2018-07-26T13:49:36Z

PTAL @lukecwik

reuvenlax

Overall this PR looks great! I left a few comments that can be addressed in a follow-up PR, and a question about exactly-once semantics in Kudu.

reuvenlax · 2018-07-31T18:08:50Z

sdks/java/io/kudu/src/main/java/org/apache/beam/sdk/io/kudu/KuduIO.java

+
+  public static <T> Read<T> read() {
+    return new AutoValue_KuduIO_Read.Builder<T>().setKuduService(new KuduServiceImpl<>()).build();
+  }


I would recommend having some common helper functions here so that Coders don't always need to be set (e.g. readBytes -> byte[], readStrings -> String, etc.). However, this can be done in a later PR.

reuvenlax · 2018-07-31T18:10:27Z

sdks/java/io/kudu/src/main/java/org/apache/beam/sdk/io/kudu/KuduIO.java

+    /**
+     * Sets a {@link Coder} for the result of the parse function. This may be required if a coder
+     * can not be inferred automatically.
+     */


FYI you can also just have a withOutputType taking in a TypeDescriptor, to handle the case where the coder is in the registry but the type has been erased.

Thanks. I got a bit stumped on this and copied the approach of the TypedRead in BigQueryIO
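
For readers unfamiliar with the erasure problem mentioned above, here is a minimal plain-Java sketch of the general "type token" technique that descriptors of this kind rely on. It is not KuduIO or Beam code: `ErasureDemo`, `TypeCapture`, `describe` and `describeListOfStrings` are hypothetical names used only for illustration. The idea is that an anonymous subclass keeps its generic superclass in class metadata, so the element type can still be looked up after erasure.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

class ErasureDemo {
  // Hypothetical stand-in for a type descriptor: subclassing it anonymously
  // records the type argument in the subclass's generic superclass signature.
  abstract static class TypeCapture<T> {
    final Type type;

    protected TypeCapture() {
      // Only works for direct (usually anonymous) subclasses that bind T.
      ParameterizedType superType = (ParameterizedType) getClass().getGenericSuperclass();
      this.type = superType.getActualTypeArguments()[0];
    }
  }

  // Returns the full generic type name that was captured at the creation site.
  static String describe(TypeCapture<?> capture) {
    return capture.type.getTypeName();
  }

  // Convenience wrapper showing the creation-site syntax.
  static String describeListOfStrings() {
    return describe(new TypeCapture<List<String>>() {});
  }

  public static void main(String[] args) {
    // The element type survives even though List<String> is erased at runtime.
    System.out.println(describeListOfStrings());
  }
}
```

With a captured type in hand, a transform could look up a registered coder instead of requiring the caller to pass one explicitly; how KuduIO would wire that up is left to the follow-up work discussed here.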

reuvenlax · 2018-07-31T18:16:40Z

sdks/java/io/kudu/src/main/java/org/apache/beam/sdk/io/kudu/KuduIO.java

+        writer.openSession();
+      }
+
+      @ProcessElement


Given that bundles can be replayed, what are the semantics of Kudu with respect to writes? Will there simply be duplicates written to Kudu, or is there a way to make things exactly once?

Kudu requires a primary key, so repeats would usually be seen as Upsert operations. That is why in the Javadoc I said:

... a {@link FormatFunction<T>} which is responsible for converting the input into an idempotent transformation on a row

The tests provide an example of that in the GenerateUpsert method.

However, people can get creative in their format function (e.g. mint UUIDs), and then you could potentially force duplicates. This is similar to how I recently patched ElasticsearchIO to allow ID functions to enable doc ID and upsert behaviour.

I did originally attempt to enforce Upsert behaviour using the Kudu classes, but they simply do not lend themselves to serialization. As the alternative, I opted to model it as closely as possible on other IOs.
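
To make the replay argument concrete, here is a small self-contained sketch in plain Java. It does not use the Kudu client or KuduIO: `FakeTable` is a hypothetical in-memory stand-in for a table with a primary key, and the two write methods are illustrative only. It compares a format function that derives the key from the element (idempotent under replay) with one that mints a fresh UUID on every call (duplicates under replay).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

class ReplayDemo {
  // Hypothetical stand-in for a keyed table: an upsert on an existing
  // primary key overwrites the row instead of adding a new one.
  static final class FakeTable {
    private final Map<String, String> rows = new HashMap<>();

    void upsert(String primaryKey, String value) {
      rows.put(primaryKey, value);
    }

    int rowCount() {
      return rows.size();
    }
  }

  // Key derived deterministically from the element: replays hit the same rows.
  static int writeWithStableKeys(List<String> bundle, int attempts) {
    FakeTable table = new FakeTable();
    for (int attempt = 0; attempt < attempts; attempt++) {
      for (String element : bundle) {
        table.upsert("id-" + element, element);
      }
    }
    return table.rowCount();
  }

  // Key minted per call: every replay of the bundle creates new rows.
  static int writeWithMintedKeys(List<String> bundle, int attempts) {
    FakeTable table = new FakeTable();
    for (int attempt = 0; attempt < attempts; attempt++) {
      for (String element : bundle) {
        table.upsert(UUID.randomUUID().toString(), element);
      }
    }
    return table.rowCount();
  }

  public static void main(String[] args) {
    List<String> bundle = Arrays.asList("a", "b", "c");
    // Simulate the same bundle being processed twice.
    System.out.println("stable keys: " + writeWithStableKeys(bundle, 2));
    System.out.println("minted keys: " + writeWithMintedKeys(bundle, 2));
  }
}
```

With stable keys the row count stays at the bundle size however many times the bundle is replayed, whereas minted keys grow the table on every attempt. That is the sense in which the format function is responsible for idempotency.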

timrobertson100 · 2018-07-31T19:01:22Z

Thank you very much @reuvenlax
Comments replied to - PTAL

reuvenlax · 2018-07-31T19:14:09Z

Thanks! PR is now merged. If you plan on following up on my comments, please file matching JIRAs

timrobertson100 · 2018-07-31T19:42:36Z

That was fast. Thanks @reuvenlax

FYI: I hope to be assigned owner of KuduIO, will file Jiras for all improvements, and will encourage others to contribute. I've also volunteered to write a guest blog on Beam/Kudu for the Kudu team who are trying to raise the profile of their project (CC @griscz for info)

[BEAM-2661] Adds KuduIO

24b78f3

timrobertson100 requested review from chamikaramj, jbonofre and lukecwik as code owners July 23, 2018 21:29

[BEAM-2661] KuduIO: Add missing licenses

13a8a90

reuvenlax self-requested a review July 31, 2018 18:03

reuvenlax reviewed Jul 31, 2018

reuvenlax merged commit eb0b611 into apache:master Jul 31, 2018
