-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: improve single delete safety #114492
Comments
Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Add a new specialized InternalNext iterator operation that allows a user to peek ahead at hidden internal keys within the current user key. The InternalNext operation may only be used in the forward direction, and does not change the external Iterator position whatsoever. InternalNext is intended for use within a very specific CockroachDB context: During intent resolution, CockroachDB will write a SINGLEDEL tombstone to delete the intent if the intent's value indicates that it must have only been written once. Subtle logic that cuts across the stack makes this optimization possible, and one possible explanation for a recent instance of replica divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate using InternalNext within CockroachDB's intent resolution to peek ahead and ensure that no additional SETs exist within a user key that we plan to remove using a SINGLEDEL tombstone. Informs cockroachdb/cockroach#114492.
During intent resolution, examine internal keys when intending to delete an intent using a single delete tombstone to validate that it is indeed safe within the local Engine. This is a safety measure to validate that we do not violate the invariants that make singe deletion deterministic causing replica divergence. False negatives are possible, since this safety mechanism only examines the local engine state and only the state visible to the open lock table iterator at the time of proposal evaluation. Release note: none Epic: none Informs #114492. Informs #114421.
114820: storage: assert local single delete safety r=sumeerbhola a=jbowens During intent resolution, examine internal keys when intending to delete an intent using a single delete tombstone to validate that it is indeed safe within the local Engine. This is a safety measure to validate that we do not violate the invariants that make singe deletion deterministic and cause replica divergence. False negatives are possible, since this safety mechanism only examines the local engine state and only the state visible to the open lock table iterator at the time of proposal evaluation. Release note: none Epic: none Informs #114492. Informs #114421. Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
The biggest obstacle to an automatic single delete optimization performed entirely within the storage engine is the need to perform reads. Writes are always blind within the storage engine. In order to determine how many keys a tombstone deletes, the engine would need to read the full LSM at each tombstone. The memtable tends to be as wide as the entire engine keyspace, which in the worst case means every tombstone would require loading a disjoint set of n sstables, where n is the read amplification. But In Cockroach, we actually have significant locality of delete tombstones. There are three primary sources of deletes, all receiving an about equal portion of all the deletes in the system:
If Pebble were to perform reads to transparently identify tombstones that may be relaxed into SINGLEDELs, where would it do it? During a flush is viable, but adding additional read I/O to the flush risks further constraining an existing bottleneck. In some high write throughput experiments with high bandwidth disks, we've observed consistently high flush utilization. Alternatively, we could perform these reads in the background, while building the current mutable memtable. A sketch:
cockroachdb/pebble#3097 would avoid the above arena offsets capturing many more nodes than belong to the specific batch. The above batch metadata forms a queue of "delete batches" for a goroutine associated with the memtable to process. The goroutine inspects the smallest and largest keys of the each queued delete batch (through accessing the memtable nodes at the offsets recorded), and uses them to find overlapping files within the LSM beneath the batch. Heuristics (?) determine whether the batch's contents are sufficiently dense to motivate reading the lower LSM. If they are, the goroutine uses an internal iterator constructed over the entirety of the LSM beneath the memtable and including the memtable itself to determine the number of internal keys beneath each tombstone within the "delete batch": 0, 1 or 2+. It atomically updates a field on the KV's arenaskl.Node. During compaction time, the additional knowledge retained by the memtable's goroutine is surfaced on the base.InternalKV, allowing the compaction iterator to mutate key kinds or elide tombstones. The memtable's goroutine would be racing with the compaction, but there's no correctness problem to failing to observe the memtable goroutine's metadata. We would need to ensure we don't free the memtable until the memtable goroutine has exited. There is still a lot of ambiguity in the above:
The above is complicated, but arguably not compared to the KV transaction invariants we currently rely on. |
During intent resolution, examine internal keys when intending to delete an intent using a single delete tombstone to validate that it is indeed safe within the local Engine. This is a safety measure to validate that we do not violate the invariants that make singe deletion deterministic causing replica divergence. False negatives are possible, since this safety mechanism only examines the local engine state and only the state visible to the open lock table iterator at the time of proposal evaluation. Release note: none Epic: none Informs cockroachdb#114492. Informs cockroachdb#114421.
Add a new specialized CanDeterministicallySingleDelete function that allows a user to inspect hidden internal keys within the current user key to determine whether the behavior of a SingleDelete of the current Key would be deterministic, assuming no writes are made to the key between Iterator creation and commit of the SingleDelete. The CanDeterministicallySingleDelete func may only be called on an Iterator oriented in the forward direction and does not change the external Iterator position. During intent resolution, CockroachDB will write a SINGLEDEL tombstone to delete the intent if the intent's value indicates that it must have only been written once. Subtle logic that cuts across the stack makes this optimization possible, and one possible explanation for a recent instance of replica divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate using CanDeterministicallySingleDelete within CockroachDB's intent resolution to validate this logic at runtime. There is some potential for this function to drive the decision to use SingleDelete when creating batches for local Engine application only. Informs cockroachdb/cockroach#114492.
Informs #8979 |
@sumeerbhola and I were talking in the office about what we can do to protect/assert against SINGLEDEL misuse or invariant violations in light of the a test cluster’s replica divergence which could plausibly be explained by a SINGLEDEL misuse. During intent resolution we have an iterator walking through the lock table. When we decide to use a SINGLEDEL, we could conceivably (with some Pebble interface contortions) next the internal iterator once to confirm that there are no additional internal keys within the LSM. We’d add a small regression to intent resolution in the name of safety and quality. The interface contortions to do this would be ugly, and we might still miss keys above the iterator’s sequence number or keys that might commit later but before our batch commits.
This has me wondering whether we might be better off relying wholly on Pebble to make the determination of whether to persist a DEL or a SINGLEDEL. Today we only use SINGLEDELs for intents because it’s hard to ensure that we’ll never write the same Pebble user key twice, even if the vast majority of the time we’ll only write once. Because of this we don’t get any of the SINGLEDEL benefits for raft log truncation or MVCC garbage collection in low GC TTL ranges.
We could make the decision to transform a DEL into a SINGLEDEL at flush time, if heuristics indicate there are enough DELs to be worth the overhead of reading the lower LSM state. This is a nontrivial additional cost to flushes, but we save write amp and compaction cpu time when we successfully remove a resulting SINGLEDEL before it reaches L6.
Informs #114421.
Jira issue: CRDB-33530
The text was updated successfully, but these errors were encountered: