
storage: improve single delete safety #114492

Open

jbowens opened this issue Nov 15, 2023 · 3 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@jbowens
Collaborator

jbowens commented Nov 15, 2023

@sumeerbhola and I were talking in the office about what we can do to protect/assert against SINGLEDEL misuse or invariant violations, in light of a test cluster’s replica divergence that could plausibly be explained by a SINGLEDEL misuse. During intent resolution we have an iterator walking through the lock table. When we decide to use a SINGLEDEL, we could conceivably (with some Pebble interface contortions) next the internal iterator once to confirm that there are no additional internal keys within the LSM. We’d add a small regression to intent resolution in the name of safety and quality. The interface contortions to do this would be ugly, and we might still miss keys above the iterator’s sequence number, or keys that commit later but before our batch commits.

This has me wondering whether we might be better off relying wholly on Pebble to make the determination of whether to persist a DEL or a SINGLEDEL. Today we only use SINGLEDELs for intents because it’s hard to ensure that we’ll never write the same Pebble user key twice, even if the vast majority of the time we’ll only write once. Because of this we don’t get any of the SINGLEDEL benefits for raft log truncation or MVCC garbage collection in low GC TTL ranges.

We could make the decision to transform a DEL into a SINGLEDEL at flush time, if heuristics indicate there are enough DELs to be worth the overhead of reading the lower LSM state. This is a nontrivial additional cost to flushes, but we save write amp and compaction CPU time when we successfully remove a resulting SINGLEDEL before it reaches L6.
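To make the idea concrete, here is a minimal Go sketch of that flush-time decision. Everything in it is hypothetical: KeyKind, KindDelete, KindSingleDelete, and countKeysBeneath are illustrative stand-ins for Pebble internals, not real APIs.

```go
// KeyKind stands in for Pebble's internal key kinds (hypothetical).
type KeyKind int

const (
	KindDelete KeyKind = iota
	KindSingleDelete
)

// tombstoneKindForFlush decides, for a point tombstone being flushed, whether
// it can be written as a SINGLEDEL. countKeysBeneath is a hypothetical helper
// that reads the LSM beneath the flush output and reports how many internal
// keys exist for the tombstone's user key, capped at 2 (meaning "two or more").
func tombstoneKindForFlush(userKey []byte, countKeysBeneath func([]byte) int) KeyKind {
	if countKeysBeneath(userKey) == 1 {
		// Exactly one key beneath: a SINGLEDEL deletes it deterministically,
		// and the pair can annihilate in whatever compaction first brings
		// them together, well before reaching L6.
		return KindSingleDelete
	}
	// Zero keys (the tombstone may still be needed, e.g. for snapshots) or
	// two or more keys: keep the ordinary DEL, which deletes all older
	// versions regardless of count.
	return KindDelete
}
```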

Informs #114421.

Jira issue: CRDB-33530

@jbowens jbowens added A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Nov 15, 2023
@jbowens jbowens added this to Incoming in (Deprecated) Storage via automation Nov 15, 2023

blathers-crl bot commented Nov 15, 2023

Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@jbowens jbowens added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Nov 15, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 15, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 16, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 16, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 20, 2023
Add a new specialized InternalNext iterator operation that allows a user to
peek ahead at hidden internal keys within the current user key. The
InternalNext operation may only be used in the forward direction, and does not
change the external Iterator position whatsoever.

InternalNext is intended for use within a very specific CockroachDB context:
During intent resolution, CockroachDB will write a SINGLEDEL tombstone to
delete the intent if the intent's value indicates that it must have only been
written once. Subtle logic that cuts across the stack makes this optimization
possible, and one possible explanation for a recent instance of replica
divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate
using InternalNext within CockroachDB's intent resolution to peek ahead and
ensure that no additional SETs exist within a user key that we plan to remove
using a SINGLEDEL tombstone.

Informs cockroachdb/cockroach#114492.
craig bot pushed a commit that referenced this issue Nov 22, 2023
During intent resolution, examine internal keys when intending to delete an
intent using a single delete tombstone, to validate that the single delete is
indeed safe within the local Engine. This is a safety measure to validate that
we do not violate the invariants that make single deletion deterministic,
which would cause replica divergence. False negatives are possible, since this
safety mechanism only examines the local engine state, and only the state
visible to the open lock table iterator at the time of proposal evaluation.

Release note: none
Epic: none
Informs #114492.
Informs #114421.
craig bot pushed a commit that referenced this issue Nov 22, 2023

114820: storage: assert local single delete safety r=sumeerbhola a=jbowens

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
@jbowens jbowens changed the title storage: improve SINGLEDEL safety storage: improve single delete safety Nov 22, 2023
@jbowens
Collaborator Author

jbowens commented Nov 22, 2023

The biggest obstacle to an automatic single delete optimization performed entirely within the storage engine is the need to perform reads. Writes are always blind within the storage engine. In order to determine how many keys a tombstone deletes, the engine would need to read the full LSM at each tombstone. The memtable tends to be as wide as the entire engine keyspace, which in the worst case means every tombstone would require loading a disjoint set of n sstables, where n is the read amplification.

But in CockroachDB, we actually have significant locality of delete tombstones. There are three primary sources of deletes, each receiving a roughly equal share of all the deletes in the system:

  1. The lock table is a high-churn region of the keyspace. All of the KV ranges' lock table keyspaces are adjacent, so at the Engine level the lock table forms a single contiguous keyspace.
  2. The raft log is another high-churn region, and tombstones are written during log truncation. The KV ranges' raft logs are not contiguous, because there are other unreplicated RangeID-prefixed keys interspersed, but the interspersed keys should be relatively low volume.
  3. MVCC GC builds a batch deleting keys within a single KV range at a time. Since the resulting batch is constrained to a single KV range, the batch has high locality. With 512 MB ranges and 128 MB L6 sstables, we'd expect one of these batches to overlap ~5 sstables in L6 (512 MB / 128 MB = 4 sstables, plus one more to account for boundary misalignment).

If Pebble were to perform reads to transparently identify tombstones that may be relaxed into SINGLEDELs, where would it perform them? Doing so during a flush is viable, but adding additional read I/O to the flush risks further constraining an existing bottleneck: in some high-write-throughput experiments with high-bandwidth disks, we've observed consistently high flush utilization.

Alternatively, we could perform these reads in the background, while building the current mutable memtable.

A sketch:

Batches record the count of point tombstones they receive. At memtable application time, if point tombstones make up a sufficiently large portion of the batch's contents, the batch signals on a condition variable after recording:

  • the memtable arena offset of its first node (batch order)
  • the memtable arena offset of its last node (batch order)
  • the memtable arena offset of its first node in key order (this can be deduced without additional key comparisons during batch application by comparing offsets of incoming links in the lowest level)
  • the memtable arena offset of its last node in key order.

cockroachdb/pebble#3097 would avoid the above arena offsets capturing many more nodes than belong to the specific batch.
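In Go, the recorded per-batch metadata might look roughly like the following struct. The names are hypothetical, not Pebble's.

```go
// deleteBatchHint records what a tombstone-heavy batch publishes at
// memtable-application time. All fields except the tombstone count are
// arena offsets into the memtable's arenaskl.
type deleteBatchHint struct {
	// First and last nodes in batch order.
	firstNodeOffset, lastNodeOffset uint32
	// Smallest and largest nodes in key order, deduced during batch
	// application by comparing offsets of incoming links in the lowest
	// skiplist level.
	firstKeyOrderOffset, lastKeyOrderOffset uint32
	// Number of point tombstones the batch contained.
	numPointTombstones int
}
```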

The above batch metadata forms a queue of "delete batches" for a goroutine associated with the memtable to process. The goroutine inspects the smallest and largest keys of each queued delete batch (by accessing the memtable nodes at the recorded offsets), and uses them to find overlapping files within the LSM beneath the batch. Heuristics (?) determine whether the batch's contents are sufficiently dense to motivate reading the lower LSM. If they are, the goroutine uses an internal iterator constructed over the entirety of the LSM beneath the memtable, including the memtable itself, to determine the number of internal keys beneath each tombstone within the "delete batch": 0, 1, or 2+. It atomically updates a field on the KV's arenaskl.Node.

At compaction time, the additional knowledge retained by the memtable's goroutine is surfaced on the base.InternalKV, allowing the compaction iterator to mutate key kinds or elide tombstones. The memtable's goroutine would be racing with the compaction, but there's no correctness problem if a compaction fails to observe the goroutine's metadata. We would need to ensure we don't free the memtable until the memtable goroutine has exited.
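A sketch of the memtable goroutine's loop, building on the deleteBatchHint struct above. All of the hooks (keyAt, denseEnough, countBeneath, etc.) are stand-ins for memtable and LSM plumbing, stubbed as function fields so the sketch is self-contained; none of them are real Pebble APIs.

```go
// memtableScanner is the goroutine state associated with one memtable.
type memtableScanner struct {
	queue chan deleteBatchHint // "delete batches" awaiting processing
	done  chan struct{}        // closed on exit; the memtable must not be freed before this

	// Hypothetical hooks onto the memtable and the LSM beneath it.
	keyAt          func(offset uint32) []byte
	nextNodeOffset func(offset uint32) uint32
	isTombstone    func(offset uint32) bool
	denseEnough    func(h deleteBatchHint) bool   // heuristic gate on the LSM read
	countBeneath   func(userKey []byte) int       // 0, 1, or 2 meaning "two or more"
	recordCount    func(offset uint32, count int) // atomic store on the arenaskl.Node
}

func (s *memtableScanner) run() {
	defer close(s.done)
	for h := range s.queue {
		if !s.denseEnough(h) {
			continue // not worth reading the lower LSM for this batch
		}
		// Walk the batch's nodes in batch order, classifying each tombstone
		// by how many internal keys it covers beneath the memtable.
		for off := h.firstNodeOffset; ; off = s.nextNodeOffset(off) {
			if s.isTombstone(off) {
				s.recordCount(off, s.countBeneath(s.keyAt(off)))
			}
			if off == h.lastNodeOffset {
				break
			}
		}
	}
}
```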

There is still a lot of ambiguity in the above:

  1. What would a reasonable heuristic for deciding when to perform the LSM read look like?
  2. There are likely opportunities to reduce the cost of the LSM read. For example, if we used a getIter-style iterator and haven't found a version of a key beneath a tombstone by the time we reach L6, maybe it's not worth reading L6 at all: a single delete that must be compacted into L6 before it drops anything is no better than an ordinary delete tombstone (see the sketch after this list).
  3. Would it be worthwhile to transform DELs into appropriately-sized DELSIZEDs during reads that find multiple internal keys, or reads that find a single key in L6?
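The early-exit idea in question 2 might look like the following sketch, assuming a hypothetical perLevelGetter abstraction for a getIter-style read of one level; none of this is Pebble API.

```go
// perLevelGetter is a hypothetical stand-in for a getIter-style read of a
// single LSM level.
type perLevelGetter interface {
	// countForKey reports how many internal keys exist for userKey in this
	// level, capped at 2 (meaning "two or more").
	countForKey(userKey []byte) int
}

// countBeneathWithEarlyExit classifies a tombstone as covering 0, 1, or 2+
// keys. levels is ordered top (memtable) to bottom (L6). It skips the L6
// read when nothing was found above it: a SINGLEDEL that must be compacted
// into L6 before it drops anything is no better than an ordinary DEL, so
// the read wouldn't pay for itself.
func countBeneathWithEarlyExit(levels []perLevelGetter, userKey []byte) int {
	count := 0
	for i, lvl := range levels {
		if i == len(levels)-1 && count == 0 {
			return 0 // reached L6 without finding anything; don't read L6
		}
		count += lvl.countForKey(userKey)
		if count >= 2 {
			return 2 // "two or more": a SINGLEDEL would be unsafe
		}
	}
	return count
}
```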

The above is complicated, but arguably not compared to the KV transaction invariants we currently rely on.

cockroach-teamcity pushed a commit to cockroach-teamcity/cockroach that referenced this issue Nov 27, 2023
@nicktrav nicktrav moved this from Incoming to 24.1 candidates in (Deprecated) Storage Nov 28, 2023
RaduBerinde pushed a commit to RaduBerinde/pebble that referenced this issue Nov 30, 2023
Add a new specialized CanDeterministicallySingleDelete function that allows a
user to inspect hidden internal keys within the current user key to determine
whether the behavior of a SingleDelete of the current Key would be
deterministic, assuming no writes are made to the key between Iterator creation
and commit of the SingleDelete. The CanDeterministicallySingleDelete func may
only be called on an Iterator oriented in the forward direction and does not
change the external Iterator position.

During intent resolution, CockroachDB will write a SINGLEDEL tombstone to
delete the intent if the intent's value indicates that it must have only been
written once. Subtle logic that cuts across the stack makes this optimization
possible, and one possible explanation for a recent instance of replica
divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate
using CanDeterministicallySingleDelete within CockroachDB's intent resolution
to validate this logic at runtime.

There is some potential for this function to drive the decision to use
SingleDelete when creating batches for local Engine application only.

Informs cockroachdb/cockroach#114492.
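As an illustration of how a caller might consult the function described above when building a deletion batch, here is a simplified Go sketch; it is not CockroachDB's actual intent-resolution call site, and assumes the package-level pebble.CanDeterministicallySingleDelete signature added by the commit.

```go
package storage

import "github.com/cockroachdb/pebble"

// deleteLockTableKey removes an intent's lock-table key, using a SingleDelete
// only when Pebble reports that doing so would be deterministic. The iterator
// must be positioned on key and oriented in the forward direction, per the
// function's contract.
func deleteLockTableKey(b *pebble.Batch, it *pebble.Iterator, key []byte) error {
	ok, err := pebble.CanDeterministicallySingleDelete(it)
	if err != nil {
		return err
	}
	if ok {
		return b.SingleDelete(key, nil)
	}
	// Fall back to an ordinary DEL, which is always safe.
	return b.Delete(key, nil)
}
```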
@jbowens
Collaborator Author

jbowens commented Dec 4, 2023

Informs #8979
