
storage: improve single delete safety #114492

Open

jbowens opened this issue Nov 15, 2023 · 3 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-storage Storage Team

Comments

@jbowens
Collaborator

jbowens commented Nov 15, 2023

@sumeerbhola and I were talking in the office about what we can do to protect/assert against SINGLEDEL misuse or invariant violations, in light of a test cluster’s replica divergence that could plausibly be explained by a SINGLEDEL misuse. During intent resolution we have an iterator walking through the lock table. When we decide to use a SINGLEDEL, we could conceivably (with some Pebble interface contortions) next the internal iterator once to confirm that there are no additional internal keys within the LSM. We’d add a small regression to intent resolution in the name of safety and quality. The interface contortions to do this would be ugly, and we might still miss keys above the iterator’s sequence number, or keys that commit later but before our batch commits.

This has me wondering whether we might be better off relying wholly on Pebble to make the determination of whether to persist a DEL or a SINGLEDEL. Today we only use SINGLEDELs for intents because it’s hard to ensure that we’ll never write the same Pebble user key twice, even if the vast majority of the time we’ll only write once. Because of this we don’t get any of the SINGLEDEL benefits for raft log truncation or MVCC garbage collection in low GC TTL ranges.

We could make the decision to transform a DEL into a SINGLEDEL at flush time, if heuristics indicate there are enough DELs to be worth the overhead of reading the lower LSM state. This is a nontrivial additional cost to flushes, but we save write amp and compaction CPU time when we successfully remove a resulting SINGLEDEL before it reaches L6.
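To make the idea concrete, here is a minimal Go sketch of that flush-time decision. Everything in it is hypothetical: KeyKind, KindDelete, KindSingleDelete, and countKeysBeneath are illustrative stand-ins for Pebble internals, not real APIs.

```go
// KeyKind stands in for Pebble's internal key kinds (hypothetical).
type KeyKind int

const (
	KindDelete KeyKind = iota
	KindSingleDelete
)

// tombstoneKindForFlush decides, for a point tombstone being flushed, whether
// it can be written as a SINGLEDEL. countKeysBeneath is a hypothetical helper
// that reads the LSM beneath the flush output and reports how many internal
// keys exist for the tombstone's user key, capped at 2 (meaning "two or more").
func tombstoneKindForFlush(userKey []byte, countKeysBeneath func([]byte) int) KeyKind {
	if countKeysBeneath(userKey) == 1 {
		// Exactly one key beneath: a SINGLEDEL deletes it deterministically,
		// and the pair can annihilate in whatever compaction first brings
		// them together, well before reaching L6.
		return KindSingleDelete
	}
	// Zero keys (the tombstone may still be needed, e.g. for snapshots) or
	// two or more keys: keep the ordinary DEL, which deletes all older
	// versions regardless of count.
	return KindDelete
}
```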

Informs #114421.

Jira issue: CRDB-33530

@jbowens jbowens added A-storage Relating to our storage engine (Pebble) on-disk storage. T-storage Storage Team labels Nov 15, 2023
@jbowens jbowens added this to Incoming in (Deprecated) Storage via automation Nov 15, 2023

blathers-crl bot commented Nov 15, 2023

Hi @jbowens, please add a C-ategory label to your issue. Check out the label system docs.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@jbowens jbowens added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Nov 15, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 15, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 16, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 16, 2023
jbowens added a commit to jbowens/pebble that referenced this issue Nov 20, 2023
Add a new specialized InternalNext iterator operation that allows a user to
peek ahead at hidden internal keys within the current user key. The
InternalNext operation may only be used in the forward direction, and does not
change the external Iterator position whatsoever.

InternalNext is intended for use within a very specific CockroachDB context:
During intent resolution, CockroachDB will write a SINGLEDEL tombstone to
delete the intent if the intent's value indicates that it must have only been
written once. Subtle logic that cuts across the stack makes this optimization
possible, and one possible explanation for a recent instance of replica
divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate
using InternalNext within CockroachDB's intent resolution to peek ahead and
ensure that no additional SETs exist within a user key that we plan to remove
using a SINGLEDEL tombstone.

Informs cockroachdb/cockroach#114492.
craig bot pushed a commit that referenced this issue Nov 22, 2023
During intent resolution, examine internal keys when intending to delete an
intent using a single delete tombstone, to validate that the single delete is
indeed safe within the local Engine. This is a safety measure to validate that
we do not violate the invariants that make single deletion deterministic,
which would cause replica divergence. False negatives are possible, since this
safety mechanism only examines the local engine state, and only the state
visible to the open lock table iterator at the time of proposal evaluation.

Release note: none
Epic: none
Informs #114492.
Informs #114421.
craig bot pushed a commit that referenced this issue Nov 22, 2023

114820: storage: assert local single delete safety r=sumeerbhola a=jbowens

Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
@jbowens jbowens changed the title storage: improve SINGLEDEL safety storage: improve single delete safety Nov 22, 2023
@jbowens
Collaborator Author

jbowens commented Nov 22, 2023

The biggest obstacle to an automatic single delete optimization performed entirely within the storage engine is the need to perform reads. Writes are always blind within the storage engine. In order to determine how many keys a tombstone deletes, the engine would need to read the full LSM at each tombstone. The memtable tends to be as wide as the entire engine keyspace, which in the worst case means every tombstone would require loading a disjoint set of n sstables, where n is the read amplification.

But in CockroachDB, we actually have significant locality of delete tombstones. There are three primary sources of deletes, each receiving a roughly equal share of all the deletes in the system:

  1. The lock table is a high-churn region of the keyspace. All of the KV ranges' lock table keyspaces are adjacent, so at the Engine level the lock table forms a single contiguous keyspace.
  2. The raft log is another high-churn region, and tombstones are written during log truncation. The KV ranges' raft logs are not contiguous, because there are other unreplicated RangeID-prefixed keys interspersed, but the interspersed keys should be relatively low volume.
  3. MVCC GC builds a batch deleting keys within a single KV range at a time. Since the resulting batch is constrained to a single KV range, the batch has high locality. With 512 MB ranges and 128 MB L6 sstables, we'd expect one of these batches to overlap ~5 sstables in L6 (512 MB / 128 MB = 4 sstables, plus one more to account for boundary misalignment).

If Pebble were to perform reads to transparently identify tombstones that may be relaxed into SINGLEDELs, where would it perform them? Doing so during a flush is viable, but adding additional read I/O to the flush risks further constraining an existing bottleneck: in some high-write-throughput experiments with high-bandwidth disks, we've observed consistently high flush utilization.

Alternatively, we could perform these reads in the background, while building the current mutable memtable.

A sketch:

Batches record the count of point tombstones they receive. At memtable application time, if point tombstones make up a sufficiently large portion of the batch's contents, the batch signals on a condition variable after recording:

  • the memtable arena offset of its first node (batch order)
  • the memtable arena offset of its last node (batch order)
  • the memtable arena offset of its first node in key order (this can be deduced without additional key comparisons during batch application by comparing offsets of incoming links in the lowest level)
  • the memtable arena offset of its last node in key order.

cockroachdb/pebble#3097 would avoid the above arena offsets capturing many more nodes than belong to the specific batch.
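In Go, the recorded per-batch metadata might look roughly like the following struct. The names are hypothetical, not Pebble's.

```go
// deleteBatchHint records what a tombstone-heavy batch publishes at
// memtable-application time. All fields except the tombstone count are
// arena offsets into the memtable's arenaskl.
type deleteBatchHint struct {
	// First and last nodes in batch order.
	firstNodeOffset, lastNodeOffset uint32
	// Smallest and largest nodes in key order, deduced during batch
	// application by comparing offsets of incoming links in the lowest
	// skiplist level.
	firstKeyOrderOffset, lastKeyOrderOffset uint32
	// Number of point tombstones the batch contained.
	numPointTombstones int
}
```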

The above batch metadata forms a queue of "delete batches" for a goroutine associated with the memtable to process. The goroutine inspects the smallest and largest keys of each queued delete batch (by accessing the memtable nodes at the recorded offsets), and uses them to find overlapping files within the LSM beneath the batch. Heuristics (?) determine whether the batch's contents are sufficiently dense to motivate reading the lower LSM. If they are, the goroutine uses an internal iterator constructed over the entirety of the LSM beneath the memtable, including the memtable itself, to determine the number of internal keys beneath each tombstone within the "delete batch": 0, 1, or 2+. It atomically updates a field on the KV's arenaskl.Node.

At compaction time, the additional knowledge retained by the memtable's goroutine is surfaced on the base.InternalKV, allowing the compaction iterator to mutate key kinds or elide tombstones. The memtable's goroutine would be racing with the compaction, but there's no correctness problem if a compaction fails to observe the goroutine's metadata. We would need to ensure we don't free the memtable until the memtable goroutine has exited.
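A sketch of the memtable goroutine's loop, building on the deleteBatchHint struct above. All of the hooks (keyAt, denseEnough, countBeneath, etc.) are stand-ins for memtable and LSM plumbing, stubbed as function fields so the sketch is self-contained; none of them are real Pebble APIs.

```go
// memtableScanner is the goroutine state associated with one memtable.
type memtableScanner struct {
	queue chan deleteBatchHint // "delete batches" awaiting processing
	done  chan struct{}        // closed on exit; the memtable must not be freed before this

	// Hypothetical hooks onto the memtable and the LSM beneath it.
	keyAt          func(offset uint32) []byte
	nextNodeOffset func(offset uint32) uint32
	isTombstone    func(offset uint32) bool
	denseEnough    func(h deleteBatchHint) bool   // heuristic gate on the LSM read
	countBeneath   func(userKey []byte) int       // 0, 1, or 2 meaning "two or more"
	recordCount    func(offset uint32, count int) // atomic store on the arenaskl.Node
}

func (s *memtableScanner) run() {
	defer close(s.done)
	for h := range s.queue {
		if !s.denseEnough(h) {
			continue // not worth reading the lower LSM for this batch
		}
		// Walk the batch's nodes in batch order, classifying each tombstone
		// by how many internal keys it covers beneath the memtable.
		for off := h.firstNodeOffset; ; off = s.nextNodeOffset(off) {
			if s.isTombstone(off) {
				s.recordCount(off, s.countBeneath(s.keyAt(off)))
			}
			if off == h.lastNodeOffset {
				break
			}
		}
	}
}
```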

There is still a lot of ambiguity in the above:

  1. What would a reasonable heuristic for deciding when to perform the LSM read look like?
  2. There are likely opportunities to reduce the cost of the LSM read. For example, if we used a getIter-style iterator and haven't found a version of a key beneath a tombstone by the time we reach L6, maybe it's not worth reading L6 at all: a single delete that must be compacted into L6 before it drops anything is no better than an ordinary delete tombstone (see the sketch after this list).
  3. Would it be worthwhile to transform DELs into appropriately-sized DELSIZEDs during reads that find multiple internal keys, or reads that find a single key in L6?
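The early-exit idea in question 2 might look like the following sketch, assuming a hypothetical perLevelGetter abstraction for a getIter-style read of one level; none of this is Pebble API.

```go
// perLevelGetter is a hypothetical stand-in for a getIter-style read of a
// single LSM level.
type perLevelGetter interface {
	// countForKey reports how many internal keys exist for userKey in this
	// level, capped at 2 (meaning "two or more").
	countForKey(userKey []byte) int
}

// countBeneathWithEarlyExit classifies a tombstone as covering 0, 1, or 2+
// keys. levels is ordered top (memtable) to bottom (L6). It skips the L6
// read when nothing was found above it: a SINGLEDEL that must be compacted
// into L6 before it drops anything is no better than an ordinary DEL, so
// the read wouldn't pay for itself.
func countBeneathWithEarlyExit(levels []perLevelGetter, userKey []byte) int {
	count := 0
	for i, lvl := range levels {
		if i == len(levels)-1 && count == 0 {
			return 0 // reached L6 without finding anything; don't read L6
		}
		count += lvl.countForKey(userKey)
		if count >= 2 {
			return 2 // "two or more": a SINGLEDEL would be unsafe
		}
	}
	return count
}
```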

The above is complicated, but arguably not compared to the KV transaction invariants we currently rely on.

cockroach-teamcity pushed a commit to cockroach-teamcity/cockroach that referenced this issue Nov 27, 2023
@nicktrav nicktrav moved this from Incoming to 24.1 candidates in (Deprecated) Storage Nov 28, 2023
RaduBerinde pushed a commit to RaduBerinde/pebble that referenced this issue Nov 30, 2023
Add a new specialized CanDeterministicallySingleDelete function that allows a
user to inspect hidden internal keys within the current user key to determine
whether the behavior of a SingleDelete of the current Key would be
deterministic, assuming no writes are made to the key between Iterator creation
and commit of the SingleDelete. The CanDeterministicallySingleDelete func may
only be called on an Iterator oriented in the forward direction and does not
change the external Iterator position.

During intent resolution, CockroachDB will write a SINGLEDEL tombstone to
delete the intent if the intent's value indicates that it must have only been
written once. Subtle logic that cuts across the stack makes this optimization
possible, and one possible explanation for a recent instance of replica
divergence (cockroachdb/cockroach#114421) is a bug in this logic. We anticipate
using CanDeterministicallySingleDelete within CockroachDB's intent resolution
to validate this logic at runtime.

There is some potential for this function to drive the decision to use
SingleDelete when creating batches for local Engine application only.

Informs cockroachdb/cockroach#114492.
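As an illustration of how a caller might consult the function described above when building a deletion batch, here is a simplified Go sketch; it is not CockroachDB's actual intent-resolution call site, and assumes the package-level pebble.CanDeterministicallySingleDelete signature added by the commit.

```go
package storage

import "github.com/cockroachdb/pebble"

// deleteLockTableKey removes an intent's lock-table key, using a SingleDelete
// only when Pebble reports that doing so would be deterministic. The iterator
// must be positioned on key and oriented in the forward direction, per the
// function's contract.
func deleteLockTableKey(b *pebble.Batch, it *pebble.Iterator, key []byte) error {
	ok, err := pebble.CanDeterministicallySingleDelete(it)
	if err != nil {
		return err
	}
	if ok {
		return b.SingleDelete(key, nil)
	}
	// Fall back to an ordinary DEL, which is always safe.
	return b.Delete(key, nil)
}
```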
@jbowens
Collaborator Author

jbowens commented Dec 4, 2023

Informs #8979
