[FLINK-33058][formats] Add encoding option to Avro format #23395

Merged: 1 commit, Nov 21, 2023

@dalelane dalelane commented Sep 11, 2023

What is the purpose of the change

Initially proposed in https://issues.apache.org/jira/browse/FLINK-33058

Avro supports two serialization encoding methods: binary and JSON (cf. Avro docs)

flink-avro currently has a hard-coded assumption that Avro data is binary-encoded (and cannot process Avro data that has been JSON-encoded).

This pull request introduces a new optional format option to flink-avro: avro.encoding
It supports two options: 'binary' and 'json'.
If unset, it will default to 'binary' to maintain compatibility/consistency with current behaviour.
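
As a rough illustration of the default described above, here is a minimal sketch in plain Java. The class `AvroEncodingOptionSketch`, the `Encoding` enum, and the `resolve` helper are hypothetical names chosen for illustration and are not the classes added by this pull request; case-insensitive matching of the value is likewise an assumption.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of resolving 'avro.encoding'; not the actual flink-avro code. */
final class AvroEncodingOptionSketch {

    /** The two encodings named in the PR description. */
    enum Encoding { BINARY, JSON }

    static final String KEY = "avro.encoding";

    /** Returns BINARY when the option is absent, so existing tables behave as before. */
    static Encoding resolve(Map<String, String> tableOptions) {
        String raw = tableOptions.get(KEY);
        if (raw == null) {
            return Encoding.BINARY;
        }
        // Assumption: values are matched case-insensitively.
        switch (raw.trim().toLowerCase(Locale.ROOT)) {
            case "binary":
                return Encoding.BINARY;
            case "json":
                return Encoding.JSON;
            default:
                throw new IllegalArgumentException(
                        "Unsupported value for '" + KEY + "': " + raw);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of("format", "avro")));        // BINARY
        System.out.println(resolve(Map.of("avro.encoding", "json"))); // JSON
    }
}
```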

Brief change log

Flink uses Avro Decoder and Encoder classes for deserializing/serializing Avro data.

However, it hard-coded the factory calls so that only the binary-encoding implementations of these abstract classes were ever used (DecoderFactory.get().binaryDecoder and EncoderFactory.get().directBinaryEncoder).

In this pull request, I'm using the value of the new avro.encoding option to create the JSON Decoder/Encoder classes where appropriate.
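
To make the idea concrete, the following is a dependency-free sketch of choosing an encoder from the configured encoding. Everything in it (`EncoderSelectionSketch`, `IntFieldEncoder`, `encoderFor`) is a hypothetical stand-in, not the code in this pull request. In real Avro code the two branches would presumably call `EncoderFactory.get().jsonEncoder(schema, out)` and `EncoderFactory.get().directBinaryEncoder(out, null)`. The stand-ins encode a record with a single int field the way each Avro encoding would: a zig-zag varint for binary, a JSON object for JSON.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-ins; real code would use Avro's Encoder and EncoderFactory. */
final class EncoderSelectionSketch {

    enum Encoding { BINARY, JSON }

    /** Stand-in for an Avro Encoder writing a record with one int field. */
    interface IntFieldEncoder {
        byte[] encode(String fieldName, int value);
    }

    /** Picks the encoder implementation from the configured encoding. */
    static IntFieldEncoder encoderFor(Encoding encoding) {
        if (encoding == Encoding.JSON) {
            // JSON encoding keeps the field name and is human-readable.
            return (name, value) ->
                    ("{\"" + name + "\":" + value + "}").getBytes(StandardCharsets.UTF_8);
        }
        // Binary encoding writes only the value; the schema supplies the field name.
        return (name, value) -> zigZagVarint(value);
    }

    /** Avro's binary int representation: zig-zag, then variable-length bytes. */
    static byte[] zigZagVarint(int value) {
        int n = (value << 1) ^ (value >> 31);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] json = encoderFor(Encoding.JSON).encode("id", 3);
        byte[] binary = encoderFor(Encoding.BINARY).encode("id", 3);
        System.out.println(new String(json, StandardCharsets.UTF_8));         // {"id":3}
        System.out.println(binary.length + " byte(s), first = " + binary[0]); // 1 byte(s), first = 6
    }
}
```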

Verifying this change

This change modifies the existing tests that perform Avro serialization/deserialization so that each one is run twice: once using binary encoding and once using JSON encoding.

This verifies both that the existing binary behaviour is unaffected by the new option and that the new JSON support works.
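
The "run every test under both encodings" approach can be sketched without any test framework, as below. This is only an illustration: `BothEncodingsRoundTripSketch` and its methods are hypothetical, and the stand-in serializer deliberately does not reproduce Avro's wire format. In a JUnit 5 test suite the same loop would more naturally be written with `@ParameterizedTest` and `@EnumSource`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical harness: repeats one round-trip check for every encoding. */
final class BothEncodingsRoundTripSketch {

    enum Encoding { BINARY, JSON }

    // Stand-in serializer: NOT Avro's format, just two distinguishable encodings.
    static byte[] serialize(Encoding encoding, int value) {
        return encoding == Encoding.JSON
                ? Integer.toString(value).getBytes(StandardCharsets.UTF_8)
                : ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    static int deserialize(Encoding encoding, byte[] bytes) {
        return encoding == Encoding.JSON
                ? Integer.parseInt(new String(bytes, StandardCharsets.UTF_8))
                : ByteBuffer.wrap(bytes).getInt();
    }

    /** Runs the same check once per encoding and returns how many encodings ran. */
    static int runRoundTripForAllEncodings(int value) {
        int runs = 0;
        for (Encoding encoding : Encoding.values()) {
            int decoded = deserialize(encoding, serialize(encoding, value));
            if (decoded != value) {
                throw new AssertionError(
                        encoding + " round trip changed " + value + " to " + decoded);
            }
            runs++;
        }
        return runs;
    }

    public static void main(String[] args) {
        System.out.println(runRoundTripForAllEncodings(42) + " encodings verified");
    }
}
```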

I've also manually verified the new support using Flink SQL such as:

CREATE TABLE JSONAVRO
(
    ... my columns ...
)
WITH (
    'connector' = 'kafka',
    'topic' = 'MY.TOPIC',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'my-group',
    'scan.startup.mode' = 'earliest-offset',
    'format' = 'avro',
    'avro.encoding' = 'json'
);

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? update to the Avro format page to explain the new option

@dalelane
Contributor Author

Apologies for the flurry of follow-on commits - this is my first contribution to Flink so I'd missed the checkstyle and spotless rules when testing locally.

I think it's ready to review now, but I'm sure there are still other things I've unwittingly missed! Please let me know if there is anything else that I should do to get this PR into an acceptable state.

Contributor

@afedulov afedulov left a comment

Hi @dalelane, thanks for your contribution. We always need to consider what happens when users compile their existing code against the new version of Flink. As it stands, merging your changes would remove public APIs, which is not backwards compatible. Public methods can only be removed through a deprecation process.
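
One common way to address this kind of concern is to keep the old public signature and have it delegate to a new overload with the previous default. The sketch below uses hypothetical names (`CompatibleFactorySketch`, `forSchema`) and a `String` in place of an Avro schema; it is not taken from the Flink code base.

```java
/** Hypothetical sketch of adding a parameter without removing a public method. */
final class CompatibleFactorySketch {

    enum Encoding { BINARY, JSON }

    private final String schema;
    private final Encoding encoding;

    private CompatibleFactorySketch(String schema, Encoding encoding) {
        this.schema = schema;
        this.encoding = encoding;
    }

    /** Pre-existing public signature: kept so that old user code still compiles. */
    static CompatibleFactorySketch forSchema(String schema) {
        // Delegates with the previous behaviour as the default.
        return forSchema(schema, Encoding.BINARY);
    }

    /** New overload added alongside the old one. */
    static CompatibleFactorySketch forSchema(String schema, Encoding encoding) {
        return new CompatibleFactorySketch(schema, encoding);
    }

    String schema() {
        return schema;
    }

    Encoding encoding() {
        return encoding;
    }

    public static void main(String[] args) {
        System.out.println(forSchema("{}").encoding());                // BINARY
        System.out.println(forSchema("{}", Encoding.JSON).encoding()); // JSON
    }
}
```

If the old method were ever to be removed, it would first be marked `@Deprecated` for a release cycle rather than deleted outright.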

Contributor

@RyanSkraba RyanSkraba left a comment

Hello! Overall, this does exactly what it says -- it looks good! Thanks for checking the public API, but I think I found one more in AvroSerializationSchema.

Other than that, there's a couple of minor suggestions that you can take or leave, especially the @ParameterizedTest. This is something that I largely prefer, but it makes little difference when all the tests are passing!

I mentioned in the Jira: using JSON-encoded Avro is really not a best practice, but it does provide human-readable messages with schema-enforced structure... I guess we could put a warning someplace in the code, but if that's what customers are looking for, it's hard to argue!

That being said, @afedulov, do you think it's worthwhile bringing up the new feature on the mailing list to discuss?

@dalelane
Contributor Author

Thanks for the reviews - much appreciated 👍

@afedulov
Contributor

afedulov commented Oct 24, 2023

> That being said, @afedulov, do you think it's worthwhile bringing up the new feature on the mailing list to discuss?

@RyanSkraba This was my initial thought, yes. Ideally we do not want to introduce functionality for very niche use cases, but this one makes sense to me, especially for building demos etc. Although this change, in my opinion, does not deserve a FLIP, I think it still makes sense to do a quick vote in the dev mailing list. The idea would be to prepend the topic with [VOTE], briefly describe the proposal, why it is useful and the downsides of it not being the best practice (Ryan's concerns). If no one comments - this is a silent yes.

Contributor

@RyanSkraba RyanSkraba left a comment

LGTM -- thanks!

@RyanSkraba
Contributor

I took the liberty of bringing it up in the mailing list -- I thought that was fair!

Contributor

@JingGe JingGe left a comment

LGTM. Thanks for driving it!

@dalelane
Contributor Author

dalelane commented Nov 7, 2023

@afedulov Is there anything else that you think is needed here?

Contributor

@afedulov afedulov left a comment

LGTM

@davidradl
Contributor

@afedulov @JingGe @RyanSkraba It looks like this has been approved by all and ready to go for some time now. Please could one of you merge it so that it is not forgotten? Many thanks.

@afedulov
Contributor

Of the three of us, only Jing has permission to merge.

@JingGe
Contributor

JingGe commented Nov 21, 2023

@dalelane would you please rebase and squash the commits? Once the CI passes, I will merge it. Thanks!

Signed-off-by: Dale Lane <dale.lane@uk.ibm.com>
@dalelane
Contributor Author

@JingGe Thanks very much - I've done the squash and rebase now
