Proposal: Keyspace Notifications for Kvrocks #3533
I think the overall design looks good, but for the first phase, we should only support publishing changes from the primary node and not implement synchronization to replica nodes. The synchronization logic for replica nodes may need further consideration. I think the current approach is not very efficient from a performance perspective.
@jihuayu I think I'm ready to start implementing. My plan is to split it into two PRs to keep each one easy to review: the first one implementing keyspace notifications on the primary node (the `notify-keyspace-events` config plus set/del publishing on the Does that sound reasonable? If so I'll get started on the first PR. Thanks!
1. Background and goals
Applications reading from a read-only Kvrocks replica keep a local cache to cut latency and reduce round-trips to the primary. Keeping that cache fresh requires a notification when a key changes. Kvrocks has no such mechanism today, which blocks applications that depend on Redis Keyspace Notifications from migrating.
This design implements the minimal Redis-compatible Keyspace Notifications surface needed by #2915:
- `notify-keyspace-events` configuration.
- `__keyspace@<db>__:<key>` with payload = event name.
- `__keyevent@<db>__:<event>` with payload = key.
- `SUBSCRIBE`/`PSUBSCRIBE`.
- `SET` and `DEL` notifications on both primary and replica.
The MVP is intentionally narrow. It does not infer Redis command semantics from RocksDB storage mutations alone: those records cannot distinguish `SET` from `APPEND`, `SETRANGE`, or `INCR`, nor a genuine `DEL` command from lazy expiry or internal delete-and-recreate flows. Instead, it reuses the existing Redis-type LogData carried in the same write batch to distinguish the supported `set` and `del` cases.
2. Architecture
Redis emits keyspace events at the command execution point, and replicas re-emit them by replaying the replicated command stream. Kvrocks replicates the RocksDB WAL (`WriteBatch`), not Redis commands, so replicas need explicit command context when a storage mutation is ambiguous.
The design therefore extends the existing `redis::WriteBatchLogData` context in the same `WriteBatch` as the data mutation:

    flowchart TD
      subgraph primary["Primary / standalone"]
        cmd["Command layer<br/>SET / DEL succeeds"]
        logdata["Create existing LogData<br/>for replicas"]
        local_event["Keep command-derived<br/>notify fact"]
        write["db_->Write(batch) succeeds"]
        publish_primary["After write succeeds<br/>publish local fact"]
        notify_primary["NotifyKeyspaceEvent"]
      end
      subgraph replica["Replica"]
        recv["Receive replicated WriteBatch"]
        apply["Apply batch succeeds"]
        decode_replica["Extract notify events<br/>during batch iteration"]
        notify_replica["NotifyKeyspaceEvent"]
      end
      pub["PublishMessage<br/>keyspace / keyevent"]
      clients["Local SUBSCRIBE / PSUBSCRIBE subscribers"]
      cmd --> logdata --> write --> publish_primary --> notify_primary
      cmd --> local_event --> publish_primary
      logdata -. "replicated with WriteBatch" .-> recv
      recv --> apply --> decode_replica --> notify_replica
      notify_primary --> pub
      notify_replica --> pub
      pub --> clients

This keeps the WAL as the replication carrier without guessing event names from column-family writes. The primary already has command execution context, so it keeps the local notification fact and publishes it only after `db_->Write` succeeds. The replica publishes after it applies the same batch and reconstructs the notification from the replicated LogData plus metadata records. With the same local `notify-keyspace-events` setting, both roles publish the same event names and payloads.
Pub/Sub fan-out is reused unchanged: `Server::PublishMessage` already handles exact-channel and pattern subscriptions and is a pure in-memory operation, so it works on read-only replicas.
3. Design decisions
3.1 Config: `notify-keyspace-events`
Add the `notify-keyspace-events` config. The default is `""` (disabled). The MVP parser accepts the Kvrocks-supported subset of Redis flags `K E A g $ l s h z t`, rejects unsupported or unknown characters, and supports runtime updates through `CONFIG SET`.
Only `g` (`del`) and `$` (`set`) currently produce notifications. The remaining data-type flags are accepted because Kvrocks has those storage types, but they have no MVP emitter yet. Internally these flags are parsed into a new notification bitmask, separate from the existing storage `RedisType` enum:

| Bit | Flag | MVP event |
|---|---|---|
| `1 << 0` | `K` | |
| `1 << 1` | `E` | |
| `1 << 2` | `g` | `del` |
| `1 << 3` | `$` | `set` |
| `1 << 4` | `l` | |
| `1 << 5` | `s` | |
| `1 << 6` | `h` | |
| `1 << 7` | `z` | |
| `1 << 8` | `t` | |

In this MVP, `A` expands only to the accepted Kvrocks data-class subset `g$lshzt`. Redis-only or out-of-scope flags such as `a`, `d`, `x`, `e`, `m`, `n`, `o`, and `c` are intentionally not accepted until their event semantics are implemented. Only `$` and `g` produce events, so the common `KEA` setting enables both channel forms and lets `set`/`del` pass the filter.
The other classes enabled by `A` are accepted but inert for now. If later releases add notification extraction for list/hash/etc., existing `KEA` deployments will naturally start receiving those events; that matches Redis semantics, but should be called out in release notes.
3.2 Reusing existing LogData
Kvrocks already writes `redis::WriteBatchLogData` into many write batches with `WriteBatch::PutLogData()`. In the current source model it is encoded as the numeric Redis type followed by optional arguments. Existing batch iterators already separate server-level log data from Redis-type log data before decoding. This MVP keeps that boundary and only extends Redis-type `WriteBatchLogData`; it does not add a new top-level log-data kind.
The MVP reuses that mechanism and slightly extends the optional arguments with notification command context, for example by adding Redis command codes such as `kRedisCmdSet` and `kRedisCmdDel`. New command codes must be appended without renumbering existing `RedisCommand` values. It does not add an independent `kse1` record and does not put raw key names into LogData arguments. The affected namespace and key are still decoded from the existing metadata column-family `PutCF`/`DeleteCF` records with `ExtractNamespaceKey`, which preserves raw key bytes and avoids adding one log record per key. Primary-subkey and stream records use `InternalKey` elsewhere and are ignored by the `del` extractor.
A new or extended replica-side notification extractor is stateful in the same way as the existing `WriteBatchExtractor`: `LogData` establishes the current Redis type and optional command context, and subsequent metadata-column-family records can produce notifications when both the LogData context and the column-family operation match a supported event.
- `set`: requires a Redis string LogData context explicitly marked as a supported SET command path, then a metadata `PutCF` whose decoded metadata type is `kRedisString`.
- `del`: requires a LogData context explicitly marked as the supported DEL command path, then a metadata `DeleteCF`. It must not fire for primary-subkey deletes such as `HDEL`, `SREM`, `ZREM`, or list element removals.
The LogData must precede the records it describes, matching current write-batch patterns. If the context is missing, malformed, a server log, or unrelated to key notifications, the extractor skips notification emission for those records.
Decode errors are non-fatal in the notification pass. A malformed or unknown LogData command context is logged and skipped for notification purposes; it must not fail an already committed write or block replication.
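The stateful extraction rule described above can be sketched as follows. This is a minimal, self-contained model rather than Kvrocks code: `Record`, `RecordKind`, `CmdContext`, and `ExtractNotifyEvents` are hypothetical names, and a real implementation would receive these records through the write-batch iteration callbacks instead of a vector.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of the records seen while iterating a
// write batch. None of these types exist in Kvrocks under these names.
enum class RecordKind { kLogData, kMetadataPut, kMetadataDelete, kOther };
enum class CmdContext { kNone, kSet, kDel };  // assumed command-context codes

struct Record {
  RecordKind kind;
  CmdContext ctx;           // meaningful only for kLogData
  bool malformed;           // LogData that failed to decode
  bool is_string_metadata;  // meaningful only for kMetadataPut
  std::string key;          // raw user key taken from the metadata record
};

struct NotifyEvent {
  std::string event;  // "set" or "del"
  std::string key;
};

// LogData establishes the current context. A later metadata record emits an
// event only when both the context and the column-family operation match.
std::vector<NotifyEvent> ExtractNotifyEvents(const std::vector<Record> &batch) {
  std::vector<NotifyEvent> events;
  CmdContext current = CmdContext::kNone;
  for (const auto &rec : batch) {
    switch (rec.kind) {
      case RecordKind::kLogData:
        // Malformed context is skipped for notification purposes, never fatal.
        current = rec.malformed ? CmdContext::kNone : rec.ctx;
        break;
      case RecordKind::kMetadataPut:
        if (current == CmdContext::kSet && rec.is_string_metadata) {
          events.push_back({"set", rec.key});
        }
        break;
      case RecordKind::kMetadataDelete:
        if (current == CmdContext::kDel) {
          events.push_back({"del", rec.key});
        }
        break;
      case RecordKind::kOther:
        break;  // subkey, stream, and other records never produce events
    }
  }
  return events;
}
```

In this sketch a batch carrying a `del` context followed by two metadata deletes yields two `del` events, while a metadata put that follows a malformed LogData record yields nothing.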
LogData command context records semantic facts, not publication decisions. It is written only when a command has definitely performed a supported mutation. Each node still applies its own `notify-keyspace-events` filter when publishing.
LogData write policy:
- When the primary's `notify-keyspace-events` enables no supported class bit (`g`/`$`), supported commands may skip adding notification command context to avoid default-off overhead. The skip keys only on class bits, never on the `K`/`E` channel selectors, so a primary with a class bit but no selector still carries the context for replicas to publish.
Configuration is thus checked twice: at LogData-context write time on the primary and at publish time on each node. A `CONFIG SET` that disables notifications between the two still ships the context to replicas but suppresses the primary's local publish.
3.3 Command paths
Only commands with unambiguous MVP semantics add notification command context:
- `SET`: mark the batch as a supported `set` producer only when the literal `SET` command actually writes the key. Conditional variants such as `SET key value NX` must not add notification context when the condition fails. Shared string helpers must not infer `set` from `kRedisString` alone, because other string commands also write string metadata.
- `DEL`: mark the batch as a supported `del` producer only when the literal `DEL` command deletes at least one key. The extractor emits one `del` event per metadata `DeleteCF`, each carrying that single key. Missing keys do not produce write records or notifications. `UNLINK` currently shares `CommandDel` and `MDel`, so the implementation must distinguish the command name or command attributes and avoid enabling `UNLINK` unless it is explicitly added to the MVP scope.
Every other write path (the out-of-scope list in §4) adds no notification command context in this MVP. This is deliberate: emitting no event is safer than emitting a Redis event name that does not match the command semantics.
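One way to picture the command-path rules together with the class-bit write policy from §3.2 is a single decision function. The sketch below is illustrative only: `ChooseNotifyContext`, `NotifyContext`, and the two constants are hypothetical names, the bit positions follow the bitmask described in §3.1, and passing the lowercase literal command name is an assumption about how `DEL` would be told apart from `UNLINK`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical class bits mirroring the `g` and `$` flags.
constexpr std::uint32_t kNotifyGeneric = 1u << 2;  // g -> del
constexpr std::uint32_t kNotifyString = 1u << 3;   // $ -> set

enum class NotifyContext { kNone, kSet, kDel };

// Decide which notification context, if any, a command attaches to its batch.
//   cmd_name     - literal command name, assumed lowercase
//   keys_changed - number of keys the command actually wrote or deleted
//   flags        - the primary's parsed notify-keyspace-events bitmask
NotifyContext ChooseNotifyContext(const std::string &cmd_name, int keys_changed,
                                  std::uint32_t flags) {
  // A failed NX/XX condition or a DEL of missing keys changes nothing.
  if (keys_changed <= 0) return NotifyContext::kNone;
  // Only class bits gate the context; the K/E selectors are not consulted.
  if (cmd_name == "set" && (flags & kNotifyString)) return NotifyContext::kSet;
  if (cmd_name == "del" && (flags & kNotifyGeneric)) return NotifyContext::kDel;
  // UNLINK, APPEND, INCR, and every other path stay silent in the MVP.
  return NotifyContext::kNone;
}
```

Keeping the decision keyed on the literal command name is what prevents `UNLINK`, which shares the delete code path, from being treated as `DEL`.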
3.4 Event source and trigger points
Primary and replica publish at different points, and only the replica reconstructs notifications from LogData:
- Primary: retains the local notification fact for a successful `SET`/`DEL`, then calls `Server::NotifyKeyspaceEvent` only after `db_->Write` succeeds. It does not infer primary notifications from LogData or column-family records.
- Replica: after applying the replicated batch, extracts notification events while iterating it in `parseWriteBatch`.
Supported command paths pass notification command context to the type/storage operation that creates the existing `WriteBatchLogData` for replicas, but those type/storage operations do not publish. On the primary, publication is a single after-write step from the retained local fact. This keeps failed writes silent and prevents a single-node write from publishing once from the command path and once from the batch path.
Publication is bound only to live write/apply paths. During restart, RocksDB replays WAL internally inside `DB::Open`; that recovery does not invoke the notification path, so historical LogData is not republished.
Decode errors are non-fatal in the notification handler. Malformed or unsupported notification context is logged and skipped; keyspace notifications must never abort a committed write or stall replication.
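The primary's "write first, publish once afterwards" rule can be sketched as below. Everything here is a stand-in under stated assumptions: the free function `NotifyKeyspaceEvent` only imitates the role of `Server::NotifyKeyspaceEvent`, `SetThenNotify` and the flag constants are hypothetical, the write is modelled as a callback returning success, and messages are returned instead of being handed to Pub/Sub. The channel and payload shapes follow the two forms listed in §1, with the default namespace shown as db `0`.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flag bits following the bitmask described in section 3.1.
constexpr std::uint32_t kFlagKeyspace = 1u << 0;  // K
constexpr std::uint32_t kFlagKeyevent = 1u << 1;  // E
constexpr std::uint32_t kFlagGeneric = 1u << 2;   // g
constexpr std::uint32_t kFlagString = 1u << 3;    // $

using Message = std::pair<std::string, std::string>;  // (channel, payload)

// Stand-in for the emitter: apply the node-local filter, then build the
// keyspace and keyevent messages that would be published.
std::vector<Message> NotifyKeyspaceEvent(std::uint32_t flags, std::uint32_t event_class,
                                         const std::string &event, const std::string &db,
                                         const std::string &key) {
  std::vector<Message> out;
  if (!(flags & event_class)) return out;  // event class not enabled locally
  if (flags & kFlagKeyspace) out.emplace_back("__keyspace@" + db + "__:" + key, event);
  if (flags & kFlagKeyevent) out.emplace_back("__keyevent@" + db + "__:" + event, key);
  return out;
}

// Stand-in for the primary's SET path: the retained fact is published in a
// single step, and only if the batch write succeeded.
std::vector<Message> SetThenNotify(const std::function<bool()> &write_batch,
                                   std::uint32_t flags, const std::string &key) {
  if (!write_batch()) return {};  // failed writes stay silent
  return NotifyKeyspaceEvent(flags, kFlagString, "set", "0", key);
}
```

Because the type/storage layer never publishes in this model, a single-node write cannot be announced twice, once from the command path and once from the batch path.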
3.5 Event emitter
`Server::NotifyKeyspaceEvent` applies the local config filter and reuses the existing Pub/Sub path to publish the two Redis-compatible channel forms:
- `__keyspace@<db>__:<key>` with payload = event name.
- `__keyevent@<db>__:<event>` with payload = key.
For a successful `SET foo bar` in the default namespace, `KEA` publishes:
- `__keyspace@0__:foo` with payload `set`.
- `__keyevent@0__:set` with payload `foo`.
For a successful `DEL foo`, it publishes:
- `__keyspace@0__:foo` with payload `del`.
- `__keyevent@0__:del` with payload `foo`.
3.6 `<db>` / namespace mapping
For Redis compatibility, the default namespace maps to db `0`, so standard subscriptions such as `PSUBSCRIBE __keyevent@0__:*` work out of the box.
Non-default Kvrocks namespaces cannot be inserted into `<db>` directly. A namespace literally named `"0"` would collide with Redis db `0` channels and could leak events across tenants. The mapping must therefore be collision-free. `MapNamespaceToKeyspaceDB(ns)` is defined as:
- default namespace: `"0"`;
- any other namespace: `"ns:" + PercentEncode(ns)`, where percent encoding escapes every byte outside `[A-Za-z0-9_.-]` and also escapes `%`.
Subscribers to `__keyevent@0__:*` therefore only see the default namespace. Subscribers that opt into non-default namespaces use the encoded form, for example `__keyevent@ns:tenantA__:*`. The payload remains only the key name, so namespace identity comes from the channel.
The key component in `__keyspace@<db>__:<key>` and the payload in `__keyevent@<db>__:<event>` are the raw user key bytes, matching Redis behavior. Keys are not percent-encoded, even if they contain `:`, whitespace, or binary bytes.
4. Event scope
Supported MVP events:

| Command | Flag | Event |
|---|---|---|
| `SET` | `$` | `set` |
| `DEL` | `g` | `del` |

Out of scope:
- `UNLINK`, `APPEND`, `SETRANGE`, `GETSET`, `INCR`, and other string commands.
- `EXPIRE`, `PEXPIRE`, `expired`, and lazy/compaction expiry.
- `FLUSHDB`/`FLUSHALL` and `DeleteRangeCF` notifications.
- `rename_from`, `rename_to`, `new`, `keymiss`, `evicted`, and module events.
`FLUSHDB` is explicitly a limitation for cache invalidation: this MVP does not emit per-key or aggregate flush notifications. Applications that rely on flush visibility need a later design for aggregate `flushdb`/`flushall` events.
5. Semantics and guarantees
- Events are published only after a successful `db_->Write` or replica apply.
- `notify-keyspace-events` is node-local, and the primary's setting gates notification LogData context (§3.2). A primary with no supported class bit writes no notification context, so its replicas publish nothing no matter how they are configured. For the #2915 replica-cache use case, operators must therefore enable a supported class bit (for example `KEA`) on the primary, even if it has no local subscribers, and set the same value on every subscribed replica.
- The notification context is carried in the optional `WriteBatchLogData` arguments. Older nodes may ignore the notification context and not publish, but they must still apply and replicate the write batch.
- Ordering is not guaranteed across the keys of `DEL k1 k2 k3`, concurrent writes, or events on different nodes. Extraction follows write-batch iteration order, but clients must not depend on cross-key ordering.
6. Tests
Add focused coverage, grouped by concern.
Configuration:
- The parser accepts the supported flags, rejects unsupported or unknown characters, and honors `CONFIG SET` runtime changes.
- The notification bitmask stays separate from the storage `RedisType` enum, and `g`/`$` make `del`/`set` pass the filter for their event classes.
- `A` expansion sets the accepted Kvrocks data-class bits (`g$lshzt`) and makes `KEA` pass `set`/`del` through filtering.
Emission:
- `SET` publishes `set` only on success; `NX`/`XX` condition failures publish nothing.
- `DEL` publishes one `del` per actually deleted key, none for missing keys.
- `UNLINK`, `APPEND`, `SETRANGE`, `INCR`, and expiry paths publish no notifications.
- Replica subscribers receive `set`/`del` from master writes.
LogData and replication:
- Notifications are extracted only when the supported `WriteBatchLogData` context is present.
- `WriteBatchLogData` still decodes Redis type and arguments; new notification command codes for `set`/`del` are recognized, while unknown command context is skipped by the notification handler without aborting `WriteBatch::Iterate`.
- Keys are decoded from the metadata `PutCF`/`DeleteCF` records, not from raw key names stored in LogData arguments.
Semantics:
- Changes made with `CONFIG SET` between LogData-context write and publication take effect at publish time.
- The default namespace maps to `0`; non-default names are prefixed and percent-encoded, including numeric names like `"0"`, so tenant channels cannot collide.
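The namespace-mapping check in the list above exercises the rule from §3.6, which can be sketched as follows. This is a sketch under assumptions, not the proposed implementation: the proposal defines `MapNamespaceToKeyspaceDB(ns)` with a single argument, so the extra `default_ns` parameter is added here only to keep the example self-contained, and the use of uppercase hex digits is likewise an assumption the proposal does not specify.

```cpp
#include <string>

// Percent-encode every byte outside [A-Za-z0-9_.-]. '%' is outside that set,
// so it is escaped as well. Uppercase hex digits are an assumption.
std::string PercentEncode(const std::string &in) {
  static const char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : in) {
    const bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                      (c >= '0' && c <= '9') || c == '_' || c == '.' || c == '-';
    if (safe) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back('%');
      out.push_back(kHex[c >> 4]);
      out.push_back(kHex[c & 0x0F]);
    }
  }
  return out;
}

// `default_ns` is a hypothetical parameter naming the default namespace.
std::string MapNamespaceToKeyspaceDB(const std::string &ns, const std::string &default_ns) {
  if (ns == default_ns) return "0";
  // The "ns:" prefix keeps a namespace literally named "0" away from db 0.
  return "ns:" + PercentEncode(ns);
}
```

Because `:` and `%` are always escaped in the namespace part, an encoded namespace can never contain the separators used by the channel format, which is what makes the mapping collision-free in this sketch.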