-
Notifications
You must be signed in to change notification settings - Fork 13.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-33058][formats] Add encoding option to Avro format #23395
Conversation
Apologies for the flurry of follow-on commits - this is my first contribution to Flink so I'd missed the checkstyle and spotless rules when testing locally. I think it's ready to review now, but I'm sure there are still other things I've unwittingly missed! Please let me know if there is anything else that I should do to get this PR into an acceptable state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dalelane, thanks for your contribution. We need to always consider what happens when the user tries to compile his old user code against the new version of Flink. As such, merging your changes would remove public APIs and is not backwards compatible. Public methods can only be removed though a deprecation process.
...ormats/flink-avro/src/main/java/org/apache/flink/formats/avro/AvroDeserializationSchema.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello! Overall, this does exactly what it says -- it looks good! Thanks for checking the public API, but I think I found one more in AvroSerializationSchema
.
Other than that, there's a couple of minor suggestions that you can take or leave, especially the @ParameterizedTest
. This is something that I largely prefer, but it makes little difference when all the tests are passing!
I mentioned in the Jira: using JSON-encoded Avro is really noot a best practice, but it does provide human-readable messages with schema-enforced structure... I guess we could put a warning someplace in the code, but if that's what customers are looking for, it's hard to argue!
That being said, @afedulov, do you think it's worthwhile bringing up the new feature on the mailing list to discuss?
...flink-avro/src/main/java/org/apache/flink/formats/avro/AvroRowDataDeserializationSchema.java
Outdated
Show resolved
Hide resolved
...-formats/flink-avro/src/main/java/org/apache/flink/formats/avro/AvroSerializationSchema.java
Show resolved
Hide resolved
...ts/flink-avro/src/test/java/org/apache/flink/formats/avro/AvroDeserializationSchemaTest.java
Outdated
Show resolved
Hide resolved
...k-avro/src/test/java/org/apache/flink/formats/avro/AvroRowDataDeSerializationSchemaTest.java
Outdated
Show resolved
Hide resolved
Thanks for the reviews - much appreciated 👍 |
@RyanSkraba This was my initial thought, yes. Ideally we do not want to introduce functionality for very niche use cases, but this one makes sense to me, especially for building demos etc. Although this change, in my opinion, does not deserve a FLIP, I think it still makes sense to do a quick vote in the dev mailing list. The idea would be to prepend the topic with [VOTE], briefly describe the proposal, why it is useful and the downsides of it not being the best practice (Ryan's concerns). If no one comments - this is a silent yes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thanks!
I took the liberty of bringing it up in the mailing list -- I thought that was fair! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for driving it!
@afedulov Is there anything else that you think is needed here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@afedulov @JingGe @RyanSkraba It looks like this is approved by all and ready to go for some time now, please could one of you merge this - so this is not forgotten. Many thanks. |
From us three only Jing has the permissions to merge. |
@dalelane would you please rebase and squash the commits? Once the CI passed, I will merge it. Thanks! |
Signed-off-by: Dale Lane <dale.lane@uk.ibm.com>
@JingGe Thanks very much - I've done the squash and rebase now |
What is the purpose of the change
Initially proposed in https://issues.apache.org/jira/browse/FLINK-33058
Avro supports two serialization encoding methods: binary and JSON (cf. Avro docs)
flink-avro currently has a hard-coded assumption that Avro data is binary-encoded (and cannot process Avro data that has been JSON-encoded).
This pull request introduces a new optional format option to flink-avro:
avro.encoding
It supports two options: 'binary' and 'json'.
It unset, it will default to 'binary' to maintain compatibility/consistency with current behaviour.
Brief change log
Flink uses Avro
Decoder
andEncoder
classes for deserializing/serializing Avro data.However it was hard-coding the use of factory classes to only use the binary-encoding implementations of these abstract classes. (
DecoderFactory.get().binaryDecoder
andEncoderFactory.get().directBinaryEncoder
)In this pull request, I'm using the value of the new
avro.encoding
option to create the JSON Decoder/Encoder classes where appropriate.Verifying this change
This change modified existing tests by re-running all of the tests that perform Avro serialization/deserialization to repeat the test using both binary and avro encoding.
This verifies that the existing binary behaviour is unaffected by the new option, as well as the new JSON support.
I've also manually verified the new support using Flink SQL such as:
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: yesDocumentation