That PR made CachedSchemaTransactionV2 listen for schema-cache-clear meta
events and drop the matching local V2 caches, so a schema change on one server
no longer leaves other nodes serving stale schema in HStore / multi-server mode.
The catch is the listener registers exactly once per JVM. The gRPC watch behind
it is process-wide and there is no unlisten. If the Meta transport reconnects
and the old watch gets dropped, the listener goes deaf and nothing brings it
back. There is already a manual recovery hook,
resetMetaListenerForReconnect(), but nothing calls it, because Meta exposes no
reconnect signal to hang it on.
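The two halves of the gap, both in CachedSchemaTransactionV2:
- https://github.com/apache/hugegraph/blob/master/hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/backend/cache/CachedSchemaTransactionV2.java#L52-L58
- The resetMetaListenerForReconnect() hook that has no caller and the javadoc
  saying it must be wired to a Meta reconnect callback:
  https://github.com/apache/hugegraph/blob/master/hugegraph-server/hugegraph-core/src/main/java/org/apache/hugegraph/backend/cache/CachedSchemaTransactionV2.java#L254-L270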
What should happen
After the Meta transport reconnects, the JVM-wide schema-cache-clear listener
comes back on its own and keeps delivering events for every graph. No operator
action.
What happens now
MetaManager / MetaDriver expose listen and keepAlive but no reconnect
callback. A dropped watch goes unnoticed, events stop arriving, and the node can
keep serving stale schema until someone calls resetMetaListenerForReconnect()
by hand. Nothing does.
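
To make the failure mode concrete, here is a minimal in-memory sketch of it. Everything in it is hypothetical and exists only for illustration: FakeMetaTransport, OnceOnlyListener, ensureRegistered and resetForReconnect are not the real MetaManager / MetaDriver or CachedSchemaTransactionV2 code. It only assumes what is described above: a once-per-JVM registration guard, a watch that is lost on reconnect, and a reset hook that nobody calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical stand-in for the Meta transport; NOT the real MetaManager API.
class FakeMetaTransport {
    private final List<Consumer<String>> watches = new ArrayList<>();

    // Registers a watch. As in the issue, there is no unlisten counterpart.
    void listen(Consumer<String> watch) {
        this.watches.add(watch);
    }

    // Simulates a transport reconnect that silently drops every old watch.
    void reconnect() {
        this.watches.clear();
    }

    // Delivers a schema-cache-clear event to whichever watches are still alive.
    void publishCacheClear(String graph) {
        for (Consumer<String> watch : new ArrayList<>(this.watches)) {
            watch.accept(graph);
        }
    }
}

// Hypothetical model of a listener guarded by a once-per-JVM flag.
class OnceOnlyListener {
    private static final AtomicBoolean REGISTERED = new AtomicBoolean(false);
    // Graph names whose caches were "cleared", recorded for inspection.
    static final List<String> CLEARED = new ArrayList<>();

    // Only the first call in the JVM actually registers the watch.
    static void ensureRegistered(FakeMetaTransport meta) {
        if (REGISTERED.compareAndSet(false, true)) {
            meta.listen(CLEARED::add);
        }
    }

    // Manual recovery hook: allows the next ensureRegistered() to register again.
    static void resetForReconnect() {
        REGISTERED.set(false);
    }
}

class ReconnectGapDemo {
    public static void main(String[] args) {
        FakeMetaTransport meta = new FakeMetaTransport();
        OnceOnlyListener.ensureRegistered(meta);
        meta.publishCacheClear("g1");            // delivered
        meta.reconnect();                        // old watch is gone
        OnceOnlyListener.ensureRegistered(meta); // no-op: the flag is still set
        meta.publishCacheClear("g2");            // lost, nobody is listening
        System.out.println(OnceOnlyListener.CLEARED); // prints [g1]
    }
}
```

In this model the event for g2 is dropped without any error, which is the silent staleness described above. Calling resetForReconnect() before the second ensureRegistered() would restore delivery, but nothing in the flow triggers it.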
Proposed fix
Two parts, same theme, easiest to do in one go:
- Give MetaManager / MetaDriver a reconnect callback (something like
  listenReconnect / onTransportReconnect) that fires when the transport
  reconnects and the previous watch is gone.
- Point CachedSchemaTransactionV2 at that callback so it re-registers the
  listener. resetMetaListenerForReconnect() stops being a manual entry point
  and becomes the callback target.
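
One possible shape for the two parts, as a rough sketch rather than a proposed patch. MetaTransportSketch, SelfHealingListener, attach and simulateReconnect are made-up names, and listenReconnect is only the suggested name from the list above, not an existing Meta API. The body given to resetMetaListenerForReconnect() here is an assumption too; the real method may do more or less than this. The sketch uses per-instance state to stay small, whereas the real listener is JVM-wide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical transport that offers the proposed reconnect callback.
class MetaTransportSketch {
    private final List<Consumer<String>> watches = new ArrayList<>();
    private final List<Runnable> reconnectCallbacks = new ArrayList<>();

    void listen(Consumer<String> watch) {
        this.watches.add(watch);
    }

    // Proposed addition: unlike watches, these callbacks survive a reconnect.
    void listenReconnect(Runnable callback) {
        this.reconnectCallbacks.add(callback);
    }

    // Simulates a reconnect: old watches are dropped, then callbacks fire.
    void simulateReconnect() {
        this.watches.clear();
        for (Runnable callback : new ArrayList<>(this.reconnectCallbacks)) {
            callback.run();
        }
    }

    void publishCacheClear(String graph) {
        for (Consumer<String> watch : new ArrayList<>(this.watches)) {
            watch.accept(graph);
        }
    }
}

// Hypothetical listener that re-registers itself after every reconnect.
class SelfHealingListener {
    private final AtomicBoolean registered = new AtomicBoolean(false);
    private final MetaTransportSketch meta;
    // Graph names whose caches were "cleared", recorded for inspection.
    final List<String> cleared = new ArrayList<>();

    private SelfHealingListener(MetaTransportSketch meta) {
        this.meta = meta;
    }

    // Factory wires the reconnect callback first, then does the initial registration.
    static SelfHealingListener attach(MetaTransportSketch meta) {
        SelfHealingListener listener = new SelfHealingListener(meta);
        meta.listenReconnect(listener::resetMetaListenerForReconnect);
        listener.ensureRegistered();
        return listener;
    }

    private void ensureRegistered() {
        if (this.registered.compareAndSet(false, true)) {
            this.meta.listen(this.cleared::add);
        }
    }

    // Callback target: clear the guard and register a fresh watch.
    void resetMetaListenerForReconnect() {
        this.registered.set(false);
        this.ensureRegistered();
    }
}

class ReconnectFixDemo {
    public static void main(String[] args) {
        MetaTransportSketch meta = new MetaTransportSketch();
        SelfHealingListener listener = SelfHealingListener.attach(meta);
        meta.publishCacheClear("g1");
        meta.simulateReconnect();      // watch dropped, callback re-registers it
        meta.publishCacheClear("g2");  // delivered again
        System.out.println(listener.cleared); // prints [g1, g2]
    }
}
```

The detail that matters is ordering: the callback has to fire after the old watch is known to be gone. If it fired while the old watch was still alive, the listener would end up registered twice, because there is no unlisten to remove the first one.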
How to verify
Two servers against HStore/Meta. Force a Meta reconnect (restart Meta or kill
the connection), then change a schema on server A and assert server B clears its
V2 caches and stops returning stale schema.
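
The same scenario can be modelled in-process before setting up real servers. The sketch below is only a toy and no substitute for the two-server run: InMemoryMeta, NodeSketch, StaleSchemaCheck and all their methods are hypothetical, and the shared map stands in for HStore. It shows the assertion worth making: after a forced reconnect, server B must read the new definition rather than the cached one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory Meta: watches die on reconnect, reconnect hooks do not.
class InMemoryMeta {
    private final List<Runnable> watches = new ArrayList<>();
    private final List<Runnable> reconnectHooks = new ArrayList<>();

    void listen(Runnable watch) {
        this.watches.add(watch);
    }

    void listenReconnect(Runnable hook) {
        this.reconnectHooks.add(hook);
    }

    void publishSchemaCacheClear() {
        for (Runnable watch : new ArrayList<>(this.watches)) {
            watch.run();
        }
    }

    // Drops every watch, then notifies whoever asked to hear about reconnects.
    void forceReconnect() {
        this.watches.clear();
        for (Runnable hook : new ArrayList<>(this.reconnectHooks)) {
            hook.run();
        }
    }
}

// Hypothetical server node: a local schema cache in front of a shared store.
class NodeSketch {
    private final Map<String, String> sharedStore;
    private final Map<String, String> localCache = new HashMap<>();
    private final InMemoryMeta meta;

    NodeSketch(Map<String, String> sharedStore, InMemoryMeta meta,
               boolean rewireOnReconnect) {
        this.sharedStore = sharedStore;
        this.meta = meta;
        meta.listen(this.localCache::clear);
        if (rewireOnReconnect) {
            // Models the proposed fix: register a fresh watch after a reconnect.
            meta.listenReconnect(() -> meta.listen(this.localCache::clear));
        }
    }

    // Serves from the local cache, falling back to the shared store on a miss.
    String readSchema(String name) {
        return this.localCache.computeIfAbsent(name, this.sharedStore::get);
    }

    // Writes to the shared store and announces the change through Meta.
    void writeSchema(String name, String definition) {
        this.sharedStore.put(name, definition);
        this.meta.publishSchemaCacheClear();
    }
}

class StaleSchemaCheck {
    // Returns what server B reads after A changes the schema post-reconnect.
    static String readOnBAfterReconnect(boolean rewireOnReconnect) {
        Map<String, String> store = new HashMap<>();
        InMemoryMeta meta = new InMemoryMeta();
        NodeSketch a = new NodeSketch(store, meta, rewireOnReconnect);
        NodeSketch b = new NodeSketch(store, meta, rewireOnReconnect);
        a.writeSchema("person", "v1");
        b.readSchema("person");        // B now caches v1
        meta.forceReconnect();
        a.writeSchema("person", "v2");
        return b.readSchema("person");
    }

    public static void main(String[] args) {
        System.out.println(readOnBAfterReconnect(false)); // prints v1 (stale)
        System.out.println(readOnBAfterReconnect(true));  // prints v2
    }
}
```

In the real test the first case is the regression to guard against: without the reconnect wiring, B keeps answering with the old definition and nothing in the logs says why.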