DBZ-4389: Add keepalive pings and handle retriable exceptions #55
Conversation
The change itself LGTM. One comment on that sleep call inline, though. And the option should be added to the connector docs. Thanks, @shichao-an!
protected TestConsumer testConsumer(int expectedRecordsCount, String... topicPrefixes) throws InterruptedException {
    TestConsumer consumer = new TestConsumer(expectedRecordsCount, topicPrefixes);
    // Sleep for 1 second before return to avoid race conditions.
    Thread.sleep(1000);
Can we avoid such an unconditional thread sleep call? It's typically not reliable, and it increases the execution time of the test suite. What's the race condition about? Usually the better fix is to use Awaitility to make sure some specific condition has been met in a test.
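For illustration, this is a minimal JDK-only sketch of the condition-polling pattern that Awaitility provides (with the library itself it would be `Awaitility.await().until(...)`); the class and method names here are hypothetical, not from the connector:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class AwaitSketch {
    // Poll a condition until it holds or a deadline passes, instead of
    // sleeping unconditionally for a fixed amount of time.
    public static void awaitUntil(BooleanSupplier condition, Duration timeout) {
        Instant deadline = Instant.now().plus(timeout);
        while (!condition.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(50); // short poll interval; returns quickly once the condition holds
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger received = new AtomicInteger();
        // Simulate records arriving on another thread.
        new Thread(() -> {
            for (int i = 0; i < 3; i++) {
                try {
                    Thread.sleep(30);
                }
                catch (InterruptedException ignored) {
                }
                received.incrementAndGet();
            }
        }).start();
        awaitUntil(() -> received.get() >= 3, Duration.ofSeconds(2));
        System.out.println("received=" + received.get()); // prints received=3
    }
}
```

The wait ends as soon as the condition is true, so the fast case stays fast and the slow case gets a real timeout instead of a guessed sleep length.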
+1
Could you also comment a bit on where the race condition comes from? Would stopConnector help handle that?
Others have chimed in, but I just wanted to add a thought. The general things to look out for with sleeps are (1) making unit tests longer than they need to be and (2) more importantly, sleeps have the potential to make unit tests flaky.
Depending on the code, you could do something with Awaitility as Gunnar said, like await().until(<something>)
or some other variant of the API. Another option is to inject a Clock so we can control/advance time as needed in our tests -- although this could be a larger refactor, for another time (I haven't looked as deeply into this part of the codebase yet).
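The Clock-injection idea can be sketched with only the JDK: code under test reads time from an injected `java.time.Clock` instead of calling `Instant.now()` directly, so a test can use `Clock.fixed(...)` and `Clock.offset(...)` to advance time deterministically. The class and method names below are illustrative, not from this codebase:

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;

public class ClockSketch {
    // Returns true once more than `limit` has elapsed between `start` and
    // the injected clock's current instant.
    public static boolean timedOut(Instant start, Clock clock, Duration limit) {
        return Duration.between(start, clock.instant()).compareTo(limit) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.EPOCH;
        Clock now = Clock.fixed(start, ZoneOffset.UTC);
        // Clock.offset lets a test "advance" time without sleeping.
        Clock tenMinutesLater = Clock.offset(now, Duration.ofMinutes(10));
        System.out.println(timedOut(start, now, Duration.ofMinutes(5)));             // prints false
        System.out.println(timedOut(start, tenMinutesLater, Duration.ofMinutes(5))); // prints true
    }
}
```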
All the Vitess connector's IT tests always failed on my laptop for a reason similar to https://issues.redhat.com/browse/DBZ-4391: the expected data change events could not be observed.
After troubleshooting for a while, I think there is a race condition. A typical example starts from this line: https://github.com/debezium/debezium-connector-vitess/blob/main/src/test/java/io/debezium/connector/vitess/VitessConnectorIT.java#L116
When the VStream is started, it uses "current" as the starting GTID position. The test inserts seem to run too early relative to when the connector starts the VStream: by the time VTGate receives the connection, the insert has already completed, so VTGate only emits changes made after the insert. This means the insert change event is "lost" in the test cases, and waiting for the record times out.
If I intentionally set the GTID to the last GTID in the current Vitess MySQL, by passing it in via configBuilder in the tests, like this:
builder = builder
.with(RelationalDatabaseConnectorConfig.SERVER_NAME, TEST_SERVER)
.with(VitessConnectorConfig.VTGATE_HOST, VTGATE_HOST)
.with(VitessConnectorConfig.VTGATE_PORT, VTGATE_PORT)
.with(VitessConnectorConfig.VTGATE_USER, USERNAME)
.with(VitessConnectorConfig.VTGATE_PASSWORD, PASSWORD)
.with(VitessConnectorConfig.GTID, "MySQL56/0588557d-5357-11ec-84d5-0242ac110002:1-1264")
.with(VitessConnectorConfig.POLL_INTERVAL_MS, 100);
The tests will work. This means it's definitely due to the race condition between VStream's "current" position and the inserts, as mentioned above.
I see, that makes sense. I was using a slightly different approach to tackle this exact problem: it took advantage of the fact that VStream passes along an event with "OTHER" type when no writes are happening.
But I think the logging approach you have is more reliable. Could you also change it in VitessReplicationConnectionIT as well, so that we are consistent on that front?
Thanks for working on this @shichao-an
The change makes sense to me; I had a few questions inline.
Related to this fix: I think we observed that the VStream endpoint would be "broken" if there were no events for net_read_timeout
(the MySQL setting). Do we observe the same behavior here? I know you verified that after this change no error is raised after some idle time; it might be good to also confirm that if we do another write after 10 minutes, we still see that event go through.
@@ -194,6 +194,7 @@ private int getNumOfRowEvents(Vtgate.VStreamResponse response) {
private ManagedChannel newChannel(String vtgateHost, int vtgatePort) {
    ManagedChannel channel = ManagedChannelBuilder.forAddress(vtgateHost, vtgatePort)
            .usePlaintext()
            .keepAliveTime(config.getKeepaliveInterval().toMillis(), TimeUnit.MILLISECONDS)
It is not super clear to me why the channel gets closed before this change. For my own understanding, is the channel closed by gRPC because it is idle? I think the default idle timeout is 30 min: https://grpc.github.io/grpc-java/javadoc/io/grpc/ManagedChannelBuilder.html#idleTimeout-long-java.util.concurrent.TimeUnit-
After troubleshooting for quite a bit, I couldn't reach a conclusion. I tried setting idleTimeout to different values and there was no effect at all. I added debug output and found the channel itself is mostly in the IDLE state even before the timeout, so this idle setting might be irrelevant, because there is only one gRPC activity (the VStream) and no new RPCs. I also debugged the server side.
The general assumption is that the connection/transport is closed after there are no active streams for a while. To gRPC this means no pings between peers for a period, so the transport is disconnected. I searched and found these references:
- https://stackoverflow.com/questions/57930529/grpc-connection-use-keepalive-or-idletimeout
- https://stackoverflow.com/questions/64119639/when-should-a-channel-be-closed-on-grpc
- https://grpc.github.io/grpc/core/md_doc_connectivity-semantics-and-api.html
- https://github.com/grpc/grpc/blob/master/doc/keepalive.md
- https://cs.mcgill.ca/~mxia3/2019/02/23/Using-gRPC-in-Production/
The links seem to suggest using idleTimeout over keepAliveTime 🤷
I think it's fine to add this option here. The only concern I have is that it requires a corresponding setting on the server side as well (e.g., grpc.keepalive_permit_without_calls).
I'm not sure in this case if it's worth adding a TODO for it; I'll defer to you and others here.
Yeah, idleTimeout didn't work in my verification, so I concluded it's not related to idleTimeout: idleTimeout is about the "channel", but our problem is with the "transport/connection", which we need to tackle with keepalive pings.
As for keepalive_permit_without_calls, I tested without this option and it works. According to this and this:
- allows keepalive pings to be sent even if there are no calls in flight
- Sets whether keepalive will be performed when there are no outstanding RPC on a connection
I think we do have a "call" (outstanding RPC): the VStream call itself. Therefore, omitting this option won't affect our keepalive pings. I will need to test it in prod for several passes once I have the snapshots ready. We can add the option if we find we need it.
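For reference, a sketch of how the client-side keepalive options fit together on grpc-java's ManagedChannelBuilder. This is a configuration fragment for illustration, not the connector's exact code: the interval values are assumed (the connector reads its interval from config), and keepAliveWithoutCalls is shown only as a comment since the long-lived VStream call keeps an RPC outstanding.

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// keepAliveTime: how often HTTP/2 PING frames are sent on a quiet transport.
// keepAliveTimeout: how long to wait for the ping ack before treating the
// transport as dead. keepAliveWithoutCalls(true) would only be needed if
// pings should continue with no RPC in flight; with the VStream call active,
// the default (false) is fine.
ManagedChannel newChannel(String vtgateHost, int vtgatePort) {
    return ManagedChannelBuilder.forAddress(vtgateHost, vtgatePort)
            .usePlaintext()
            .keepAliveTime(5, TimeUnit.MINUTES)     // assumed ping interval
            .keepAliveTimeout(20, TimeUnit.SECONDS) // assumed ack timeout
            .build();
}
```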
src/main/java/io/debezium/connector/vitess/VitessErrorHandler.java
@shichao-an LGTM. Could you also elaborate with a comment (in that same test is fine) on more details of the race condition? Otherwise, just some minor questions/comments. Thanks!
Yes, I tested these cases. If there are writes, the broken time is delayed until the connection has been idle for the same period of time again, at which point the transport is disconnected.
I added
Just to clarify, when you say "disconnected", do you mean the
Regardless, this is a good fix anyway 💯, just want to see if there is more we need to investigate. And since most of the time we should expect continuous events, the
I mean UNAVAILABLE was being raised. After that, new writes wouldn't be streamed out because the connector had crashed.
Right, there's no issue with new writes as long as the transport is kept open with keepalive pings. I added the retry behavior to handle the UNAVAILABLE error as well, so if the transport is somehow disconnected again when keepalive fails, the connector will be restarted automatically.
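The retriable-error idea can be sketched as a small predicate. This is a hedged, JDK-only illustration, not the connector's VitessErrorHandler: the real handler would check the exception type (io.grpc.StatusRuntimeException with Status.Code.UNAVAILABLE), whereas this sketch matches on message text only to stay dependency-free.

```java
public class RetriableSketch {
    // Walk the cause chain and report whether any failure looks like a gRPC
    // UNAVAILABLE status; such errors warrant an automatic connector restart
    // rather than a hard crash.
    public static boolean isRetriable(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            String message = t.getMessage();
            if (message != null && message.contains("UNAVAILABLE")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable broken = new RuntimeException("stream error",
                new RuntimeException("UNAVAILABLE: io exception"));
        System.out.println(isRetriable(broken));                       // prints true
        System.out.println(isRetriable(new RuntimeException("boom"))); // prints false
    }
}
```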
This LGTM, you will also need to update the doc here: https://github.com/debezium/debezium/blob/main/documentation/modules/ROOT/pages/connectors/vitess.adoc
Don't want to block you on the cleanup of VitessReplicationConnectionIT; feel free to do that separately.
Will do in the Debezium PR.
Sure. We can create a maintenance PR to refactor the IT tests a bit. There is a lot of duplicate code here and there.
Looks like I'm late to the party. I remember that my local Docker setup had this issue. Here's the connector error:
And my original Bolt's internal note for this issue was:
Thanks @keweishang, those are helpful insights! I think we do run docker compose in production, but locally we are just doing
I'm curious if you observed this locally only or in prod as well, and how did you solve the problem? Did we just move away from docker compose, or did we do something differently?
In the local environment, I tried to deploy both Vitess and the Connect Worker with the same docker-compose and I had this issue. However, if I ran the Connect Worker on bare metal without Docker locally (I don't remember how I ran Vitess in this case, probably in Docker or on bare metal), I didn't observe the issue. In production we've never had this issue, but that's probably because there are always events streamed from VTGate and the VStream gRPC is never idle.
Also, our production Vitess and Connect Workers are deployed on bare metal without Docker.
This fixes https://issues.redhat.com/browse/DBZ-4389