New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-2661] Adds KuduIO #6021
[BEAM-2661] Adds KuduIO #6021
Conversation
Run Go PreCommit |
Run Java PreCommit |
Run Python PreCommit |
Run Java PreCommit |
Run Java PreCommit |
PTAL @lukecwik |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this PR looks great! I left a few comments that can be addressed in a followup PR, and a question about exactly-once semantics in Kudu
|
||
public static <T> Read<T> read() { | ||
return new AutoValue_KuduIO_Read.Builder<T>().setKuduService(new KuduServiceImpl<>()).build(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend have some common helper functions here so that Coders don't need to be always set (e.g. readBytes -> byte[], readStrings -> String, etc.). However this can be done in a later PR
/** | ||
* Sets a {@link Coder} for the result of the parse function. This may be required if a coder | ||
* can not be inferred automatically. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI you can also just have a withOutputType taking in a TypeDescriptor, to handle the case where the coder is in the registry but the type has been erased.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I got a bit stumped on this and copied the approach of the TypedRead in BigQueryIO
writer.openSession(); | ||
} | ||
|
||
@ProcessElement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that bundles can be replay, what are the semantics of Kudu with respect to writes? Will there simply be duplicates written to Kudu, or is there a way to make things exactly once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kudu requires a primary key so repeats would usually be seen as Upsert
operations. That is why in the JDoc I said:
... a {@link FormatFunction<T>} which is responsible for converting the input into an idempotent transformation on a row
The tests provide an example of that in the GenerateUpsert
method.
However, people can get creative in their format function (e.g. mint UUIDs) and then you could potentially force duplicates. This is similar to how I recently patched ElasticSearchIO
to allow ID functions to enable doc ID and upsert behaviour.
I did originally attempt to enforce it as Upsert behaviour and using Kudu classes but they simply do not lend themselves to serialization. I opted to model as close as possible to other IOs as the alternative.
Thank you very much @reuvenlax |
Thanks! PR is now merged. If you plan on following up on my comments, please file matching JIRAs |
That was fast. Thanks @reuvenlax FYI: I hope to be assigned owner of KuduIO, will file Jiras for all improvements, and will encourage others to contribute. I've also volunteered to write a guest blog on Beam/Kudu for the Kudu team who are trying to raise the profile of their project (CC @griscz for info) |
Provides an implementation and tests for KuduIO.
Please note that design decisions have been captured on BEAM-2661.
This implementation follows similar design patterns to
CassandraIO
and naming convention fromBigQueryIO
.The decision to use mocking and faking services for the unit tests was not taken lightly and will be replaced when Kudu offer an easier solution for Java - see KUDU-2411
This implementation will benefit from the addition of authentication and the
BoundedSource
could be replaced by aDoFn
. I propose adding those at a later date.Follow this checklist to help us incorporate your contribution quickly and easily:
[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.It will help us expedite review of your Pull Request if you tag someone (e.g.
@username
) to look at it.Post-Commit Tests Status (on master branch)