fix(core): self-heal etcd meta watch on transport reconnect #3062
Open
dpol1 wants to merge 2 commits into
EtcdMetaDriver.listen/listenPrefix handed jetcd a bare Consumer<WatchResponse>, so a terminal watch error (e.g. after a transport reconnect) was swallowed and the JVM-global schema-cache-clear listener died silently: a node stopped receiving cross-node cache-clear events with no error or warning. Switch to the Watch.Listener overload and re-subscribe on onError/onCompleted via a daemon-backed backoff, mirroring the self-heal PdMetaDriver already gets from KvClient. The driver watch now stays live across reconnects, so CachedSchemaTransactionV2's register-once flag staying true is correct; the unused resetMetaListenerForReconnect stopgap and its TODO are removed. Add EtcdMetaDriverTest covering re-subscribe on error/completion and event delivery, and register it in UnitTestSuite.
imbajin
approved these changes
Jun 22, 2026
imbajin
left a comment
Member
Blocking: no. Summary: No obvious issues found in the current head. Evidence: git diff --check passed; EtcdMetaDriverTest passed (4 tests); latest GitHub checks passed.
Purpose of the PR
CachedSchemaTransactionV2 registers a JVM-global watch on the meta store so that a schema change on one node clears the schema cache on the other nodes. That watch goes through EtcdMetaDriver.listen/listenPrefix, which handed jetcd the bare Consumer<WatchResponse> overload. That overload drops onError and onCompleted, so when jetcd ends a watch (for example after a transport reconnect tears down the gRPC stream) the listener stopped receiving events with no log line and no exception. The node kept serving stale schema and nothing reported it. PdMetaDriver does not have this problem because its KvClient already re-subscribes on error.
Main Changes
- EtcdMetaDriver.listen/listenPrefix now register a Watch.Listener instead of the bare Consumer overload. On onError or onCompleted the driver re-subscribes after a 1s backoff on a daemon thread, which is the same recovery PdMetaDriver gets through KvClient. A WARN is logged before each re-subscribe, so the failure is now visible.
- Remove CachedSchemaTransactionV2.resetMetaListenerForReconnect(). It was a manual stopgap added in fix(server): sync hstore schema cache clears #3011 with no callers, and it never shipped in a tagged release. Now that the driver recovers on its own, the JVM-global register-once flag is correct to stay set, so the manual reset has nothing left to do. The lifecycle comment on metaEventListenerRegistered is updated to describe the self-heal behavior.
- The MetaDriver interface is unchanged, so neither implementor needs edits beyond EtcdMetaDriver.

Known limitation: the re-subscribe opens a fresh watch without a stored revision, so any cache-clear events emitted during the short reconnect window are not replayed. This is the same behavior as PdMetaDriver/KvClient. The change removes the permanent silent failure. It does not add gap replay.
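The re-subscribe pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Listener` and `WatchSource` are hypothetical stand-ins for jetcd's `Watch.Listener` and watch client so the example compiles without jetcd, and `SelfHealingWatch`, the thread name, and the `System.err` WARN line are names chosen for the sketch.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for jetcd's Watch.Listener (same three callbacks).
interface Listener<T> {
    void onNext(T response);
    void onError(Throwable error);
    void onCompleted();
}

// Hypothetical stand-in for the watch client: opens a watch on a key.
interface WatchSource<T> {
    void watch(String key, Listener<T> listener);
}

final class SelfHealingWatch<T> {
    private final WatchSource<T> source;
    private final long backoffMillis;
    private volatile boolean closed;
    // Daemon thread, so a pending re-subscribe never blocks JVM shutdown.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "meta-watch-resubscribe");
            t.setDaemon(true);
            return t;
        });

    SelfHealingWatch(WatchSource<T> source, long backoffMillis) {
        this.source = source;
        this.backoffMillis = backoffMillis;
    }

    void listen(String key, Consumer<T> consumer) {
        source.watch(key, new Listener<T>() {
            @Override public void onNext(T response) { consumer.accept(response); }
            // Both terminal callbacks lead to a fresh watch instead of silence.
            @Override public void onError(Throwable error) { resubscribe(key, consumer, "error: " + error); }
            @Override public void onCompleted() { resubscribe(key, consumer, "completed"); }
        });
    }

    private void resubscribe(String key, Consumer<T> consumer, String reason) {
        if (closed) {
            return;
        }
        System.err.println("WARN watch on '" + key + "' ended (" + reason
                           + "); re-subscribing in " + backoffMillis + " ms");
        try {
            scheduler.schedule(() -> listen(key, consumer), backoffMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException ignored) {
            // close() ran concurrently; nothing left to re-subscribe.
        }
    }

    void close() {
        closed = true;
        scheduler.shutdownNow();
    }
}

final class SelfHealingWatchDemo {
    public static void main(String[] args) throws InterruptedException {
        // Fake source that only records the listeners it is handed.
        List<Listener<String>> opened = new CopyOnWriteArrayList<>();
        SelfHealingWatch<String> watch =
            new SelfHealingWatch<String>((key, listener) -> opened.add(listener), 1000L);
        watch.listen("schema-cache-clear", e -> System.out.println("event: " + e));

        opened.get(0).onError(new RuntimeException("gRPC stream torn down"));
        Thread.sleep(1500); // wait past the 1s backoff
        System.out.println("watches opened: " + opened.size());
        opened.get(opened.size() - 1).onNext("clear");
        watch.close();
    }
}
```

As in the limitation noted above, the fresh watch in this sketch starts from "now": nothing emitted between the terminal callback and the re-subscribe is replayed.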
Verifying these changes
- EtcdMetaDriverTest (JUnit + Mockito) mocks the jetcd Client/Watch, captures the registered Watch.Listener, and asserts that both onError and onCompleted trigger a re-subscribe (a second watch(...) call), and that onNext still reaches the consumer.
- EtcdMetaDriverTest 4/4, CachedSchemaTransactionTest 18/18, and MetaManagerSchemaCacheClearEventTest 6/6 pass after the stopgap removal.
- restoring it makes them pass.
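The capture-and-fire approach used by the test can be illustrated without JUnit or Mockito. The sketch below is not EtcdMetaDriverTest itself: `WatchListener`, `CapturingWatch`, `ResubscribingDriver`, and `ResubscribeChecks` are hypothetical names, the fake stands in for the mocked jetcd Watch, and the driver re-subscribes immediately (no backoff) so the checks stay deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for jetcd's Watch.Listener callbacks.
interface WatchListener {
    void onNext(String event);
    void onError(Throwable error);
    void onCompleted();
}

// Hand-rolled fake playing the role of the mocked Watch: it records every
// listener passed to watch(...), so the checks can fire callbacks on it.
final class CapturingWatch {
    final List<WatchListener> captured = new ArrayList<>();

    void watch(String key, WatchListener listener) {
        captured.add(listener);
    }
}

// Minimal driver under test; re-subscribes synchronously for determinism.
final class ResubscribingDriver {
    private final CapturingWatch watch;

    ResubscribingDriver(CapturingWatch watch) {
        this.watch = watch;
    }

    void listen(String key, Consumer<String> consumer) {
        watch.watch(key, new WatchListener() {
            @Override public void onNext(String event) { consumer.accept(event); }
            @Override public void onError(Throwable error) { listen(key, consumer); }
            @Override public void onCompleted() { listen(key, consumer); }
        });
    }
}

final class ResubscribeChecks {
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        CapturingWatch watch = new CapturingWatch();
        List<String> received = new ArrayList<>();
        new ResubscribingDriver(watch).listen("schema-cache-clear", received::add);
        check(watch.captured.size() == 1, "initial watch registered");

        // A terminal error must produce a second watch(...) call.
        watch.captured.get(0).onError(new RuntimeException("stream torn down"));
        check(watch.captured.size() == 2, "onError triggers a re-subscribe");

        // Completion must do the same.
        watch.captured.get(1).onCompleted();
        check(watch.captured.size() == 3, "onCompleted triggers a re-subscribe");

        // Events on the newest watch still reach the original consumer.
        watch.captured.get(2).onNext("clear");
        check(received.size() == 1 && "clear".equals(received.get(0)),
              "onNext still reaches the consumer");
        System.out.println("all checks passed");
    }
}
```

Counting calls on the fake plays the same role as verifying a second watch(...) invocation on a mock; with a real backoff the check would need to wait for the scheduled re-subscribe rather than assert immediately.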
Does this PR potentially affect the following parts?
Documentation Status
Doc - TODO
Doc - Done
Doc - No Need