
[FLINK-20385][canal][json] Allow to read metadata for canal-json format #14464

Merged
merged 2 commits into from
Dec 31, 2020

Conversation

SteNicholas
Member

What is the purpose of the change

Currently, FLIP-107 supports reading metadata from the Debezium format. According to FLIP-107, metadata should also be exposed for the Canal JSON format.

Brief change log

  • Let CanalJsonDeserializationSchema access the additional fields of the Canal JSON envelope and convert them to metadata columns.

Verifying this change

  • CanalJsonSerDeSchemaTest adds testDeserializationWithMetadata to verify that CanalJsonDeserializationSchema can read metadata for the Canal JSON format.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot
Collaborator

flinkbot commented Dec 22, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit d6ec831 (Fri May 28 07:01:42 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Dec 22, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

@wuchong
Member

wuchong commented Dec 24, 2020

Hi @SteNicholas , I would like to discuss the metadata keys first. What do you think about just using the following keys and types?

  • pk_names ARRAY<STRING>
  • sql_types MAP<STRING, INT>
  • table STRING
  • database STRING
  • ingestion_timestamp TIMESTAMP(3) WITH LOCAL TIME ZONE


@wuchong wuchong left a comment


The change looks good in general. Could you add an IT case in KafkaChangelogTableITCase to test kafka+canal-json with metadata access?

@SteNicholas
Member Author

SteNicholas commented Dec 27, 2020

@wuchong I have added an IT case in KafkaChangelogTableITCase to test Kafka and Canal JSON with metadata access.
About the Canal JSON metadata, I prefer the following keys and types:

  • database STRING
  • table STRING
  • sql-type MAP<STRING, INT>
  • pk-names ARRAY<STRING>
  • ingestion-timestamp TIMESTAMP(3) WITH LOCAL TIME ZONE

The key names are slightly different from yours. What do you think about the names above?
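For illustration, these metadata keys would surface as metadata columns in a DDL roughly like the following sketch (table, topic, and column names are hypothetical; with the Kafka connector, format metadata keys carry a value. prefix):

```sql
CREATE TABLE orders_binlog (
  -- hypothetical physical columns from the Canal payload
  order_id BIGINT,
  amount DECIMAL(10, 2),
  -- metadata columns mapped to the proposed canal-json keys
  origin_database STRING METADATA FROM 'value.database' VIRTUAL,
  origin_table STRING METADATA FROM 'value.table' VIRTUAL,
  origin_sql_type MAP<STRING, INT> METADATA FROM 'value.sql-type' VIRTUAL,
  origin_pk_names ARRAY<STRING> METADATA FROM 'value.pk-names' VIRTUAL,
  origin_ts TIMESTAMP(3) WITH LOCAL TIME ZONE METADATA FROM 'value.ingestion-timestamp' VIRTUAL
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders-binlog',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'canal-json'
);
```

The VIRTUAL keyword marks the columns as read-only, so they are excluded when writing back to the table.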

@wuchong
Member

wuchong commented Dec 27, 2020

I'm also fine with that. Will review it tomorrow.

@wuchong
Member

wuchong commented Dec 27, 2020

Btw, could you add documentation for this feature?

@SteNicholas
Member Author

@wuchong I have added documentation for the metadata of the Canal JSON format. Please help review the documentation as well.

@SteNicholas
Member Author

@wuchong I have run Maven Spotless on the code after resolving conflicts and updated the documentation. Please help review again.

@wuchong
Member

wuchong commented Dec 30, 2020

I helped to clean up the formatting. Will merge this once Azure passes.

@SteNicholas
Member Author

@wuchong Thanks for helping to clean up the formatting. I have set the time zone for the Canal JSON format metadata ingestion-timestamp to fix the Azure failure. Please help merge once Azure passes.


@wuchong wuchong left a comment


LGTM.

@wuchong wuchong merged commit 814fe0e into apache:master Dec 31, 2020
V1ncentzzZ pushed a commit to V1ncentzzZ/flink that referenced this pull request Dec 31, 2020
@wangfeigithub

Nicholas Jiang, Jark Wu: I found a bug in the Canal code. 'canal-json.table.include' does not filter out the binlog of the specified table correctly, which causes an error during parsing. For example, if I want to read the binlog of table a with canal-json.table.include = 'a', and table a has a source field of type INT, then if table b also has a source field of type STRING, an error is reported.
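The reported misconfiguration can be sketched as follows (topic and server names are hypothetical):

```sql
-- Hypothetical source table: the binlog topic carries Canal events for
-- both table a and table b, but only table a is wanted.
CREATE TABLE table_a_source (
  source INT   -- table a declares `source` as INT; table b's `source` is a STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'binlog-topic',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'canal-json',
  'canal-json.table.include' = 'a'
);
-- If records for table b are deserialized before the include filter is
-- applied, parsing b's STRING `source` against the INT column fails.
```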

@wangfeigithub

wangfeigithub commented Jan 7, 2021

[Three screenshots attached.]

@wuchong
Member

wuchong commented Jan 7, 2021

@wangfeigithub thanks for reporting this. Could you create a JIRA issue for this?

lmagic233 pushed a commit to lmagic233/flink that referenced this pull request Jan 11, 2021
jnh5y pushed a commit to jnh5y/flink that referenced this pull request Dec 18, 2023