fix(core): self-heal etcd meta watch on transport reconnect #3062

Open
dpol1 wants to merge 2 commits into apache:master from dpol1:fix/3036-meta-listener-reconnect

dpol1 (Member) commented Jun 22, 2026

Purpose of the PR

CachedSchemaTransactionV2 registers a JVM-global watch on the meta store so that a
schema change on one node clears the schema cache on the other nodes. That watch goes
through EtcdMetaDriver.listen / listenPrefix, which previously passed jetcd the bare
Consumer<WatchResponse> overload. That overload drops onError and onCompleted, so
when jetcd ended a watch (for example after a transport reconnect tore down the gRPC
stream) the listener stopped receiving events, with no log line and no exception. The node
kept serving a stale schema and nothing reported it. PdMetaDriver does not have this
problem because its KvClient already re-subscribes on error.

Main Changes

  • EtcdMetaDriver.listen / listenPrefix now register a Watch.Listener instead of the
    bare Consumer overload. On onError or onCompleted the driver re-subscribes after a
    1s backoff on a daemon thread, which is the same recovery PdMetaDriver gets through
    KvClient. A WARN is logged before each re-subscribe, so the failure is now visible.
  • Removed CachedSchemaTransactionV2.resetMetaListenerForReconnect(). It was a manual
    stopgap added in #3011 (fix(server): sync hstore schema cache clears), had no callers,
    and never shipped in a tagged release. Now that the driver recovers on its own, the
    JVM-global register-once flag can correctly stay set, so the manual reset has nothing
    left to do. The lifecycle comment on metaEventListenerRegistered is updated to
    describe the self-heal behavior.
  • The MetaDriver interface is unchanged, so no implementor other than EtcdMetaDriver
    needs changes.
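The re-subscribe behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: WatchListener and WatchSource are hypothetical stand-ins for jetcd's Watch.Listener and watch client so the example compiles without jetcd, and SelfHealingWatch, its thread name, and the stderr "WARN" line are assumptions made for illustration only.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical stand-in for a three-callback watch listener (not the jetcd type).
interface WatchListener<T> {
    void onNext(T event);
    void onError(Throwable error);
    void onCompleted();
}

// Hypothetical stand-in for the watch client: each call opens one watch stream.
interface WatchSource<T> {
    void watch(String key, WatchListener<T> listener);
}

// Re-subscribes after a fixed backoff whenever the current watch terminates.
final class SelfHealingWatch<T> {
    private final WatchSource<T> source;
    private final String key;
    private final Consumer<T> consumer;
    private final long backoffMillis;
    // Single daemon thread, so a pending re-subscribe never keeps the JVM alive.
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "meta-watch-resubscribe");
            t.setDaemon(true);
            return t;
        });

    SelfHealingWatch(WatchSource<T> source, String key,
                     Consumer<T> consumer, long backoffMillis) {
        this.source = source;
        this.key = key;
        this.consumer = consumer;
        this.backoffMillis = backoffMillis;
    }

    void subscribe() {
        source.watch(key, new WatchListener<T>() {
            @Override
            public void onNext(T event) {
                consumer.accept(event);
            }

            @Override
            public void onError(Throwable error) {
                resubscribe("error: " + error.getMessage());
            }

            @Override
            public void onCompleted() {
                resubscribe("completed");
            }
        });
    }

    // Makes the failure visible, then opens a fresh watch after the backoff.
    private void resubscribe(String reason) {
        System.err.println("WARN meta watch on '" + key + "' ended (" + reason
                           + "), re-subscribing in " + backoffMillis + " ms");
        scheduler.schedule(this::subscribe, backoffMillis, TimeUnit.MILLISECONDS);
    }
}

final class SelfHealingWatchDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger watchCalls = new AtomicInteger();
        AtomicReference<WatchListener<String>> current = new AtomicReference<>();
        // Fake source that only records the most recently registered listener.
        WatchSource<String> fake = (key, listener) -> {
            current.set(listener);
            watchCalls.incrementAndGet();
        };
        SelfHealingWatch<String> watch = new SelfHealingWatch<>(
            fake, "schema-cache-clear", e -> System.out.println("event: " + e), 1000L);
        watch.subscribe();
        current.get().onNext("clear");
        current.get().onError(new RuntimeException("stream reset"));
        Thread.sleep(1500L); // give the 1s backoff time to elapse
        System.out.println("watch calls: " + watchCalls.get());
    }
}
```

In this sketch a failure inside the scheduled subscribe() call would be swallowed by the executor, so a real driver would likely need to catch and retry there as well.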

Known limitation: the re-subscribe opens a fresh watch without a stored revision, so any
cache-clear events emitted during the short reconnect window are not replayed. This is the
same behavior as PdMetaDriver / KvClient. The change removes the permanent silent
failure. It does not add gap replay.
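To make the limitation concrete, here is a small in-memory model of the gap. RevisionedLog and GapDemo are hypothetical and only imitate a revisioned event stream; they are not etcd or jetcd code, and the "replay from the last seen revision" line shows what this change deliberately does not do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of a revisioned event log (not etcd itself).
final class RevisionedLog {
    private final List<String> events = new ArrayList<>();

    // Appends an event and returns its 1-based revision.
    long append(String event) {
        events.add(event);
        return events.size();
    }

    long latestRevision() {
        return events.size();
    }

    // Returns the events whose revision is strictly greater than afterRevision.
    List<String> readAfter(long afterRevision) {
        return new ArrayList<>(events.subList((int) afterRevision, events.size()));
    }
}

final class GapDemo {
    public static void main(String[] args) {
        RevisionedLog log = new RevisionedLog();
        log.append("clear-1");                      // delivered before the disconnect
        long lastSeen = log.latestRevision();       // the watch dies here
        log.append("clear-2");                      // emitted during the reconnect window
        long resubscribedAt = log.latestRevision(); // fresh watch starts from "now"
        log.append("clear-3");

        // Fresh watch without a stored revision: clear-2 is never delivered.
        System.out.println(log.readAfter(resubscribedAt)); // prints [clear-3]
        // Hypothetical gap replay from the last seen revision (not part of this PR).
        System.out.println(log.readAfter(lastSeen));       // prints [clear-2, clear-3]
    }
}
```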

Verifying these changes

  • Need tests and can be verified as follows:
    • New EtcdMetaDriverTest (JUnit + Mockito) mocks the jetcd Client/Watch, captures
      the registered Watch.Listener, and asserts that both onError and onCompleted
      trigger a re-subscribe (a second watch(...) call), and that onNext still reaches
      the consumer.
    • Local run: EtcdMetaDriverTest 4/4, CachedSchemaTransactionTest 18/18, and
      MetaManagerSchemaCacheClearEventTest 6/6 after the stopgap removal.
    • Red-green: commenting out the re-subscribe makes the three recovery tests fail;
      restoring it makes them pass.

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects (typed here)
  • Nope

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

dpol1 added 2 commits June 22, 2026 18:35

EtcdMetaDriver.listen/listenPrefix handed jetcd a bare Consumer<WatchResponse>,
so a terminal watch error (e.g. after a transport reconnect) was swallowed and
the JVM-global schema-cache-clear listener died silently: a node stopped
receiving cross-node cache-clear events with no error or warning.

Switch to the Watch.Listener overload and re-subscribe on onError/onCompleted
via a daemon-backed backoff, mirroring the self-heal PdMetaDriver already gets
from KvClient. The driver watch now stays live across reconnects, so
CachedSchemaTransactionV2's register-once flag staying true is correct; the
unused resetMetaListenerForReconnect stopgap and its TODO are removed.

Add EtcdMetaDriverTest covering re-subscribe on error/completion and event
delivery, and register it in UnitTestSuite.
dosubot (Bot) added the size:L, bug, and tests labels Jun 22, 2026

imbajin (Member) left a comment

Blocking: no. Summary: No obvious issues found in the current head. Evidence: git diff --check passed; EtcdMetaDriverTest passed (4 tests); latest GitHub checks passed.

dosubot (Bot) added the lgtm label Jun 22, 2026

Labels

bug, lgtm, size:L, tests

Development

Successfully merging this pull request may close these issues.

[Improvement] Re-register HStore schema cache clear listener after a Meta reconnect

2 participants