Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2661] Adds KuduIO #6021

Merged
merged 2 commits into from Jul 31, 2018
Merged

Conversation

timrobertson100
Copy link
Contributor

@timrobertson100 timrobertson100 commented Jul 23, 2018

Provides an implementation and tests for KuduIO.

Please note that design decisions have been captured on BEAM-2661.
This implementation follows similar design patterns to CassandraIO and naming convention from BigQueryIO.

The decision to use mocking and faking services for the unit tests was not taken lightly and will be replaced when Kudu offer an easier solution for Java - see KUDU-2411

This implementation will benefit from the addition of authentication and the BoundedSource could be replaced by a DoFn. I propose adding those at a later date.


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- --- --- --- ---
Java Build Status Build Status Build Status Build Status Build Status Build Status Build Status
Python Build Status --- Build Status
Build Status
--- --- --- ---

@lukecwik
Copy link
Member

Run Go PreCommit

@lukecwik
Copy link
Member

Run Java PreCommit

@lukecwik
Copy link
Member

Run Python PreCommit

@timrobertson100
Copy link
Contributor Author

Run Java PreCommit

@timrobertson100
Copy link
Contributor Author

timrobertson100 commented Jul 26, 2018

Thanks @lukecwik
As far as I can tell, the failing build is not KuduIO but build environment related.

Most recent build scan here

@timrobertson100
Copy link
Contributor Author

Run Java PreCommit

@timrobertson100
Copy link
Contributor Author

PTAL @lukecwik

@reuvenlax reuvenlax self-requested a review July 31, 2018 18:03
Copy link
Contributor

@reuvenlax reuvenlax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this PR looks great! I left a few comments that can be addressed in a followup PR, and a question about exactly-once semantics in Kudu


public static <T> Read<T> read() {
return new AutoValue_KuduIO_Read.Builder<T>().setKuduService(new KuduServiceImpl<>()).build();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend have some common helper functions here so that Coders don't need to be always set (e.g. readBytes -> byte[], readStrings -> String, etc.). However this can be done in a later PR

/**
* Sets a {@link Coder} for the result of the parse function. This may be required if a coder
* can not be inferred automatically.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI you can also just have a withOutputType taking in a TypeDescriptor, to handle the case where the coder is in the registry but the type has been erased.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I got a bit stumped on this and copied the approach of the TypedRead in BigQueryIO

writer.openSession();
}

@ProcessElement
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that bundles can be replay, what are the semantics of Kudu with respect to writes? Will there simply be duplicates written to Kudu, or is there a way to make things exactly once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudu requires a primary key so repeats would usually be seen as Upsert operations. That is why in the JDoc I said:

... a {@link FormatFunction<T>} which is responsible for converting the input into an idempotent transformation on a row

The tests provide an example of that in the GenerateUpsert method.

However, people can get creative in their format function (e.g. mint UUIDs) and then you could potentially force duplicates. This is similar to how I recently patched ElasticSearchIO to allow ID functions to enable doc ID and upsert behaviour.

I did originally attempt to enforce it as Upsert behaviour and using Kudu classes but they simply do not lend themselves to serialization. I opted to model as close as possible to other IOs as the alternative.

@timrobertson100
Copy link
Contributor Author

Thank you very much @reuvenlax
Comments replied to - PTAL

@reuvenlax reuvenlax merged commit eb0b611 into apache:master Jul 31, 2018
@reuvenlax
Copy link
Contributor

Thanks! PR is now merged. If you plan on following up on my comments, please file matching JIRAs

@timrobertson100
Copy link
Contributor Author

That was fast. Thanks @reuvenlax

FYI: I hope to be assigned owner of KuduIO, will file Jiras for all improvements, and will encourage others to contribute. I've also volunteered to write a guest blog on Beam/Kudu for the Kudu team who are trying to raise the profile of their project (CC @griscz for info)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants