
[FLINK-28910][Connectors/hbase]Fix potential data deletion while updating HBase rows #20542

Closed
wants to merge 2 commits

Conversation

ganlute

@ganlute ganlute commented Aug 11, 2022

What is the purpose of the change

https://issues.apache.org/jira/browse/FLINK-28910

Brief change log

  • Add a reduce step when the HBase connector processes mutations.
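
The reduce step described above can be sketched as follows — a minimal, hypothetical illustration (class and method names are invented, not the connector's actual code) of buffering mutations keyed by row key so that only the latest mutation per row is flushed:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MutationBuffer {
    // ByteBuffer gives content-based equals/hashCode for byte[] row keys
    private final Map<ByteBuffer, String> pending = new LinkedHashMap<>();

    synchronized void add(byte[] rowKey, String mutation) {
        // a later mutation for the same row key replaces the earlier one
        pending.put(ByteBuffer.wrap(rowKey), mutation);
    }

    synchronized List<String> flush() {
        List<String> batch = new ArrayList<>(pending.values());
        pending.clear();
        return batch;
    }
}
```

With this reduction, an insert followed by a delete of the same row within one buffer window flushes only the final delete, instead of sending both mutations to HBase.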

Verifying this change

CI passed

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@ganlute ganlute closed this Aug 11, 2022
@ganlute ganlute deleted the FLINK-28910 branch August 11, 2022 04:00
@ganlute ganlute restored the FLINK-28910 branch August 11, 2022 04:00
@ganlute ganlute reopened this Aug 11, 2022
@flinkbot
Collaborator

flinkbot commented Aug 11, 2022

CI report:

Bot commands: The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@ganlute ganlute changed the title [FLINK-28910]CDC From Mysql To Hbase Bugs [WIP][FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs Aug 11, 2022
@ganlute ganlute changed the title [WIP][FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs [FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs Aug 12, 2022
@ganlute
Author

ganlute commented Aug 12, 2022

@flinkbot run azure

@ganlute ganlute changed the title [FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs [WIP][FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs Aug 12, 2022
@ganlute
Author

ganlute commented Aug 18, 2022

@flinkbot run azure

@ganlute ganlute changed the title [WIP][FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs [FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs Aug 18, 2022
@ganlute
Author

ganlute commented Aug 18, 2022

Hi @luoyuxia, could you please help me review the changes? Thank you.

@ganlute ganlute changed the title [FLINK-28910][connectors/hbase]CDC From Mysql To Hbase Bugs [FLINK-28910][Connectors/hbase]CDC From Mysql To Hbase Bugs Aug 25, 2022
@ganlute
Author

ganlute commented Aug 25, 2022

@wuchong @dannycranmer could you please help me review the changes? Thank you.😄

@ganlute
Author

ganlute commented Sep 2, 2022

@MartijnVisser could you please help me review the changes? Thank you.😄

@ganlute ganlute changed the title [FLINK-28910][Connectors/hbase]CDC From Mysql To Hbase Bugs [FLINK-28910][Connectors/hbase]Hbase Sink Bug Sep 13, 2022
Contributor

@kylemeow kylemeow left a comment

For clarity, the title could be changed to something like 'Fix potential data deletion while updating HBase rows'. Just my suggestion : )

@ganlute
Author

ganlute commented Sep 13, 2022

For clarity, the title could be changed to something like 'Fix potential data deletion while updating HBase rows'. Just my suggestion : )

Thank you for your suggestion, I think it is really much clearer.

@ganlute ganlute changed the title [FLINK-28910][Connectors/hbase]Hbase Sink Bug [FLINK-28910][Connectors/hbase]Fix potential data deletion while updating HBase rows Sep 13, 2022
@MartijnVisser
Contributor

@MartijnVisser could you please help me review the changes? Thank you.😄

@ganlute I have no experience with HBase, so unfortunately I can't review it. To be honest, I think the Flink community is lacking HBase maintainers in general.

@MartijnVisser
Contributor

@leonardBang Do you think you could have a look? Since you have experience with CDC, I thought you might be able to help out here :)

Contributor

@dannycranmer dannycranmer left a comment

I also have no experience with HBase; however, your thread-safe logic looks OK, besides the following callouts.

@@ -76,6 +79,7 @@
private transient ScheduledExecutorService executor;
private transient ScheduledFuture scheduledFuture;
private transient AtomicLong numPendingRequests;
private static Map<byte[], Mutation> mutationMap = new HashMap<>();
Contributor

@dannycranmer dannycranmer Sep 15, 2022

Why is this static? This means subtasks of the same job would all have access to, and try to flush the same data.
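
To illustrate the concern, here is a standalone, hypothetical sketch (SinkSubtask is an invented stand-in, not the connector's class): a static field belongs to the class, not the instance, so every sink subtask running in the same JVM reads and writes the same buffer.

```java
import java.util.HashMap;
import java.util.Map;

class SinkSubtask {
    // static: a single map shared by ALL subtask instances in this JVM
    private static final Map<String, String> sharedBuffer = new HashMap<>();

    void buffer(String rowKey, String mutation) {
        sharedBuffer.put(rowKey, mutation);
    }

    int bufferedCount() {
        return sharedBuffer.size();
    }
}

public class StaticBufferDemo {
    public static void main(String[] args) {
        SinkSubtask subtaskA = new SinkSubtask();
        SinkSubtask subtaskB = new SinkSubtask();
        subtaskA.buffer("rk1", "+I");
        // subtaskB never wrote anything, yet it sees subtaskA's mutation
        System.out.println(subtaskB.bufferedCount()); // prints 1
    }
}
```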

Author

Thank you for your review. I agree. I originally tried to make it a global queue to deduplicate row keys, but in this case there is in fact no need to declare it as static.

Author

I will fix it later.

@@ -213,6 +225,7 @@ public void close() throws Exception {

if (mutator != null) {
try {
flush();
Contributor

While your thread safety looks ok (besides the static Map), generally speaking flush() on close() can cause issues. If the destination is down, the job might fail to stop. A better solution is to checkpoint the internal buffer. Longer term we can consider migrating to the Async Sink base.

Contributor

Hi @dannycranmer, personally I reckon this might not be an issue because, according to the documentation, the mutator's close() method inherently flushes buffered data to the HBase server before closing the connection, so the flush logic was already there before this PR.

Also, as it is already in the try ... catch block, when an IOException is thrown by the client during the flush, the job-stopping process would not be interrupted either.

Author

master:

    close() {
        ...
        mutator.close();
        ...
    }

this PR:

    close() {
        ...
        flush();
        mutator.close();
        ...
    }

On the one hand, mutator.close() calls mutator.flush() as well, so I think the mentioned problem would happen on master too. On the other hand, if the destination is down, mutator.close()/mutator.flush() will throw an IOException.


Mutation mutation = mutationConverter.convertToMutation(value);
synchronized (mutationMap) {
mutationMap.put(mutation.getRow(), mutation);
Contributor

@zjuwangg zjuwangg Sep 15, 2022

In this way, when multiple mutations share the same rowkey, only the last one will remain.
But the mutationConverter behavior is not controlled, which means mutations such as the following will cause a data quality problem:

-D (rk1, f1:v1)

+I (rk1, f2:v2)

Contributor

For now, all mutationConverter implementations are OK in this case.

@ganlute
Author

ganlute commented Sep 18, 2022

The CI failure seems to have nothing to do with this PR.


Mutation mutation = mutationConverter.convertToMutation(value);
synchronized (mutationMap) {
mutationMap.put(mutation.getRow(), mutation);
Contributor

One potential bug might arise because mutation.getRow() returns a byte array. As we know, the hashCode and equals of two different array instances differ regardless of whether their contents are identical.

Contributor

To overcome this, I suggest converting it to a Base64 string, e.g.:

String key = Base64.getEncoder().encodeToString(mutation.getRow());

Or creating a simple wrapper class where equals and hashCode are overridden properly for arrays.
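
The pitfall and both suggested fixes can be demonstrated in isolation with the plain JDK (no connector code involved); ByteBuffer.wrap is a ready-made wrapper whose equals/hashCode are content-based:

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class RowKeyDemo {
    public static void main(String[] args) {
        byte[] rk1 = {1, 2, 3};
        byte[] rk2 = {1, 2, 3}; // same content, different instance

        // byte[] keys use identity-based hashCode/equals: no deduplication
        Map<byte[], String> byArray = new HashMap<>();
        byArray.put(rk1, "+I");
        byArray.put(rk2, "-D");
        System.out.println(byArray.size()); // prints 2

        // Fix 1: Base64-encode the row key into a String
        Map<String, String> byString = new HashMap<>();
        byString.put(Base64.getEncoder().encodeToString(rk1), "+I");
        byString.put(Base64.getEncoder().encodeToString(rk2), "-D");
        System.out.println(byString.size()); // prints 1

        // Fix 2: wrap the row key in a ByteBuffer (content-based equality)
        Map<ByteBuffer, String> byBuffer = new HashMap<>();
        byBuffer.put(ByteBuffer.wrap(rk1), "+I");
        byBuffer.put(ByteBuffer.wrap(rk2), "-D");
        System.out.println(byBuffer.size()); // prints 1
    }
}
```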

Author

Thank you for your suggestion; I will improve it.

Contributor

@ferenc-csaky ferenc-csaky left a comment

Thank you for the ByteBuffer improvement, I added another comment, but the logic itself LGTM.

@@ -76,6 +80,7 @@
private transient ScheduledExecutorService executor;
private transient ScheduledFuture scheduledFuture;
private transient AtomicLong numPendingRequests;
private Map<ByteBuffer, Mutation> mutationMap = new HashMap<>();
Contributor

This field could be final as well. Can we move it below mutationConverter and initialize it inside the constructor, to be consistent with the other field initialization?

Contributor

@YesOrNo828 YesOrNo828 left a comment

@ganlute I tested ingesting about 2 million records into HBase.
With sink.buffer-flush.max-rows=1000, the data eventually became consistent.
With sink.buffer-flush.max-rows=1, consistency could not be guaranteed.

@@ -201,6 +208,12 @@ public void invoke(T value, Context context) throws Exception {
}

private void flush() throws IOException {
synchronized (mutationMap) {
Contributor

Adding mutationMap to drop duplicated data on the client side cannot avoid the data consistency issue. For example, with sink.buffer-flush.max-rows=1:
+I(1,...)
-U(1,...)
+U(1,...)
These three rows are put into HBase with the same timestamp version.
In the end, HBase cannot find the data with rowkey=1.

Contributor

@kylemeow kylemeow Jun 26, 2023

Hi @YesOrNo828 , this issue has already been addressed in FLINK-32139, and the PR is merged into master. Therefore, this PR may no longer be needed.

@MartijnVisser
Contributor

So is this superseded by #22612 or not?

@ferenc-csaky
Contributor

So is this superseded by #22612 or not?

Yes, the two issues have the same root cause: an insert and a delete operation are passed to HBase with the same millisecond-precision timestamp, and in that case the order of HBase execution is not guaranteed. The changes made in #22612 explicitly set nanosecond-precision timestamps for the HBase operations, which eliminates the possibility of having multiple operations "at the same time", so deletes and inserts are executed in the correct order.
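
For intuition, the idea can be sketched as follows — a hypothetical, simplified generator (not the actual code from #22612) that derives strictly increasing timestamps from the millisecond clock, so no two mutations can ever carry the same version:

```java
import java.util.concurrent.atomic.AtomicLong;

public class MonotonicTimestamps {
    private static final AtomicLong lastTs = new AtomicLong();

    // Scale milliseconds into a nanosecond-like range, then bump by at
    // least 1 if the wall clock has not advanced since the previous call.
    public static long next() {
        long candidate = System.currentTimeMillis() * 1_000_000L;
        return lastTs.updateAndGet(prev -> Math.max(prev + 1, candidate));
    }

    public static void main(String[] args) {
        long a = next();
        long b = next();
        // even within the same millisecond, b is strictly greater than a
        System.out.println(a < b); // prints true
    }
}
```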

8 participants