storage: don't send log truncations through Raft #16993

Merged
merged 1 commit into cockroachdb:master on Jul 17, 2017

Conversation

@tbg (Member) commented Jul 11, 2017

Please discuss - this looks like an easy win. I just whipped this up on a whim,
so I hope I'm not missing something obvious here.

--

Sending log truncations through Raft is inefficient: the Raft log is not itself
part of the replicated state. Instead, we only replicate the TruncatedState and,
as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure;
anecdotally, truncations always looked like a lot of work during write-heavy
workloads, and this change should alleviate that somewhat.

It also removes one migration concern for #16809, see
#16809 (comment).
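
For readers skimming this thread, here is a rough sketch of the shape of the proposed side effect as of the first revision. The names, the standalone function, and the error handling are approximated from the snippets quoted in the review below; this is not the literal diff.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// clearTruncatedEntries sketches the side effect each replica applies when it
// sees a new TruncatedState: the now-obsolete raft log entries are removed
// from the local engine, instead of the deletions being part of the
// replicated proposal itself.
func clearTruncatedEntries(
	ctx context.Context, eng engine.Engine, rangeID roachpb.RangeID, truncIndex uint64,
) {
	start := engine.MakeMVCCMetadataKey(keys.RaftLogKey(rangeID, 0))
	end := engine.MakeMVCCMetadataKey(keys.RaftLogKey(rangeID, truncIndex).PrefixEnd())
	// r1 used ClearRange here; the review below switches this to ClearIterRange.
	if err := eng.ClearRange(start, end); err != nil {
		// A failure here only delays cleanup; the replicated state is unaffected.
		log.Errorf(ctx, "unable to clear truncated Raft entries: %v", err)
	}
}
```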

@cockroach-teamcity (Member)

This change is Reviewable

@petermattis (Collaborator)

Doesn't this add a migration concern? In a mixed-version cluster, a server with this PR will not generate a Raft command that truncates the Raft log for a 1.0.x server.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

				keys.RaftLogKey(r.RangeID, newTruncState.Index).PrefixEnd(),
			)
			if err := r.store.engine.ClearRange(start, end); err != nil {

We definitely don't want to use ClearRange here as this would create an unreasonable number of range deletion tombstones. Use ClearIterRange instead.
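
For context, the difference between the two engine calls, shown side by side as alternatives (the surrounding variables follow the snippet quoted above; the error handling is illustrative):

```go
// Rejected here: a single RocksDB range-deletion tombstone covering the span.
// Cheap to write, but every read that touches an sstable containing such
// tombstones pays for them, so frequent truncations would pile them up.
if err := eng.ClearRange(start, end); err != nil {
	return err
}

// Used instead: iterate the existing raft-log keys in [start, end) and write
// an ordinary point deletion for each of them.
iter := eng.NewIterator(false /* !prefix */)
defer iter.Close()
if err := eng.ClearIterRange(iter, start, end); err != nil {
	return err
}
```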


Comments from Reviewable

@tbg (Member, Author) commented Jul 11, 2017

Doesn't this add a migration concern?

Does it? The below-Raft side effect is optional, and above Raft we can do what we want. The only thing that comes to mind is that when you run a mixed cluster for a long time, replicas running the old version may not actually remove their log entries.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@petermattis (Collaborator)

The only thing that comes to mind is that when you run a mixed cluster for a long time, replicas running the old version may not actually remove their log entries.

That was my concern. That seems like it could be problematic. Maybe I'm not thinking it through properly.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


Comments from Reviewable

@tbg (Member, Author) commented Jul 11, 2017

Could be problematic, but my point is rather that it's easy to migrate this change - the "new version" just has to emit old-style truncations as long as it thinks there are still old nodes around. Pretty simple compared to other migrations on the table right now, and it should be easy to cover.


Review status: 0 of 2 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We definitely don't want to use ClearRange here as this would create an unreasonable number of range deletion tombstones. Use ClearIterRange instead.

Done.


Comments from Reviewable

@tbg force-pushed the eval-truncate-log branch 2 times, most recently from 22b7835 to 2fac16a on July 11, 2017 19:24
@tamird (Contributor) commented Jul 11, 2017

Reviewed 2 of 2 files at r2.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/storage/replica_command.go, line 1908 at r2 (raw file):

	ms, err := iter.ComputeStats(start, end, 0 /* nowNanos */)
	if err != nil {
		return EvalResult{}, errors.Wrap(err, "while computing stats of Raft log freed by truncation")

"could not compute ..." is more in line with our convention, I think.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

Would be good to add a comment explaining why this doesn't ClearRange, which naively seems better.


pkg/storage/replica_proposal.go, line 764 at r2 (raw file):

			iter := r.store.engine.NewIterator(false /* !prefix */)
			if err := r.store.engine.ClearIterRange(iter, start, end); err != nil {
				iter.Close()

extract below, or wrap in a function (instead of the scope) and defer it


pkg/storage/replica_proposal.go, line 776 at r2 (raw file):

		// Truncate the sideloaded storage.
		{

this isn't a real scope, right? could do without it.


Comments from Reviewable

@tbg (Member, Author) commented Jul 11, 2017

Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/storage/replica_command.go, line 1908 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

"could not compute ..." is more in line with our convention, I think.

Done.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Would be good to add a comment explaining why this doesn't ClearRange, which naively seems better.

Done.


pkg/storage/replica_proposal.go, line 764 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

extract below, or wrap in a function (instead of the scope) and defer it

Done.


pkg/storage/replica_proposal.go, line 776 at r2 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this isn't a real scope, right? could do without it.

Done.


Comments from Reviewable

@tbg (Member, Author) commented Jul 11, 2017

Review status: 0 of 2 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

(Should have clarified: I added a TODO for Peter to add a comment...)


Comments from Reviewable

@tamird (Contributor) commented Jul 11, 2017

Reviewed 2 of 2 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@petermattis (Collaborator)

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/storage/replica_proposal.go, line 764 at r3 (raw file):

			iter := r.store.engine.NewIterator(false /* !prefix */)
			// TODO(petermattis): add a comment explaining why using ClearRange() below
			// is ill-advised.

Range deletion tombstones add a per-query overhead. The RocksDB folks recommend limiting the number that are used. Currently we only use them when deleting an entire range. Adding such tombstones for Raft log truncation is likely to result in an excessive number. I should really investigate how the range deletion tombstones are indexed within sstables to understand this better.


Comments from Reviewable

@tbg (Member, Author) commented Jul 11, 2017

Review status: 1 of 2 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


pkg/storage/replica_proposal.go, line 764 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Range deletion tombstones add a per-query overhead. The RocksDB folks recommend limiting the number that are used. Currently we only use them when deleting an entire range. Add such tombstones for Raft log truncation is likely to result in an excessive number. I should really investigate how the range deletion tombstones are indexed within sstables to understand this better.

Interested in understanding this better, too. The tombstones we make here are essentially [0, i) for increasing i, so it seems that any that hit the same level/sstable/whatever should, in theory, coalesce and leave only one (also seeing that we never "break them up" by writing over them).


Comments from Reviewable

@bdarnell (Contributor)

In addition to Peter's concern about migrating to this change from 1.0, this change moves logic downstream of raft. This will generally make future migrations harder. Maybe it's justifiable in this case, but we'll need to see what the performance numbers are like (or maybe we want to do this for other reasons, like the way it would interact with the migration needed for @irfansharif's change).


Review status: 1 of 2 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


Comments from Reviewable

@petermattis (Collaborator)

One nice bit this change would enable: we could defer the actual deletion of Raft log entries in order to avoid needing to sync the normal (non-Raft log) engine.
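
A hypothetical sketch of that idea (none of this is in the PR; `pending` and `syncEngine` are made up for illustration): truncation commands would only record how far each range's log may be cleared, and a later flush would perform the deletions for many ranges before a single sync.

```go
// flushTruncations clears the deferred raft-log deletions for many ranges in
// one go, so that a single engine sync covers an arbitrary number of
// truncations instead of syncing once per truncation.
func flushTruncations(eng engine.Engine, pending map[roachpb.RangeID]uint64) error {
	for rangeID, index := range pending {
		start := engine.MakeMVCCMetadataKey(keys.RaftLogKey(rangeID, 0))
		end := engine.MakeMVCCMetadataKey(keys.RaftLogKey(rangeID, index).PrefixEnd())
		iter := eng.NewIterator(false /* !prefix */)
		err := eng.ClearIterRange(iter, start, end)
		iter.Close()
		if err != nil {
			return err
		}
	}
	// Hypothetical helper: one sync of the base engine covering everything
	// deleted above (in real code this would go through a synced write batch).
	return syncEngine(eng)
}
```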


Review status: 1 of 2 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


Comments from Reviewable

@tbg (Member, Author) commented Jul 12, 2017 via email

@petermattis (Collaborator)

Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 764 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Interested in understanding this better, too. The tombstones we make here are essentially [0,i for increasing i, so it seems that any that hit the same level/sstable/whatever should, in theory, coagulate and leave only one (also seeing that we never "break them up" by writing over them).

Range tombstones are stored in a per-sstable "meta" block that is loaded when the sstable is opened. I'm not seeing any code which merges adjacent range tombstones together, though I did run across code which "collapses deletions". That could be it.

When retrieving a key, all of the tombstones contained in each sstable encountered during retrieval are added to a RangeDelAggregator which then answers whether the key has been deleted. It looks like some effort has been put into making RangeDelAggregator efficient, but note that all of the tombstones for a given sstable are added to it even when those tombstones do not overlap the key being retrieved or the span being iterated. I imagine this is where the warning about not using too many tombstones comes from.


Comments from Reviewable

keys.RaftLogKey(r.RangeID, newTruncState.Index).PrefixEnd(),
)
iter := r.store.engine.NewIterator(false /* !prefix */)
if err := r.store.engine.ClearIterRange(iter, start, end); err != nil {

This generates per-key tombstones. Why not use ClearRange instead?


nm...i just read the discussion.

@tbg (Member, Author) commented Jul 14, 2017

Review status: 0 of 2 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 764 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Range tombstones are stored in a per-sstable "meta" block that is loaded when the sstable is open. I'm not seeing any code which merges adjacent range tombstones together, though I did run across code which "collapses deletions". That could be it.

When retrieving a key, all of the tombstones contained in each sstable encountered during retrieval are added to a RangeDelAggregator which then answers whether the key has been deleted. It looks like some effort has been put into making RangeDelAggregator efficient, but note that all of the tombstones for a given sstable are added to it even when those tombstones do not overlap the key being retrieved or the span being iterated. I imagine this is where the warning about not using too many tombstones comes from.

As an additional datapoint for future archaeologists, SCATTER at one point created 1000s of these tombstones, and it really brought RocksDB to its knees: #16249 (comment)


Comments from Reviewable

@tbg (Member, Author) commented Jul 17, 2017

Ok, I take that back. I don't actually want to play with performance tests just yet. Instead, I first want to discuss the merits of this PR under the assumption that performance stays pretty much the same for now. The relevant points here:

  • we'll need to migrate this. It's straightforward with the in-flight PR rfc: version migration for backwards incompatible functionality #16977
  • we're moving logic downstream of Raft. However, we can easily migrate it upstream again, without a real migration, though I don't think that's going to happen.
  • the big upshot is hopefully a large reduction in complexity for @irfansharif's PR: log truncation is one of the odd cases that requires a RaftWriteBatch. storage: RHS HardState clobbering during splits #16749 is the only other one, and there the (correct) solution also involves going downstream of Raft for a Raft-related write. So, after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
  • as @petermattis pointed out, after @irfansharif's change, we should be able to not sync the base engine on truncation changes but do it only as we actually clear the log entries (which can be delayed as we see fit). So for 1000 log truncations across many ranges, we'll only have to sync once if that's how we set it up.

For performance, I think it'll be more obvious that it's a pro when the last bullet comes in. Right now, as the change is written, we save only a few on-the-wire bytes. That's nice, but not really something to write home about, and it's hard to demonstrate it, too.

My suggestion is the following:

  1. leave this change open until we have the migration framework in place, at which point I adapt & merge it
  2. @irfansharif rebases his change on top of this (and storage: RHS HardState clobbering during splits #16749 which I'll prioritize)
  3. once his change lands, we optimize the truncation sync as outlined above (punting to 1.2 if we really run out of time)

@bdarnell @petermattis Thoughts?

@irfansharif please confirm that the above plan would really simplify your PR as I imagine.


Review status: 0 of 2 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


Comments from Reviewable

@bdarnell (Contributor)

SGTM


Reviewed 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

(Should have clarified: I added a TODO for Peter to add a comment...)

Looks like you haven't pushed that TODO.


Comments from Reviewable

@irfansharif (Contributor)

As an aside re: delaying actual log entry deletions, this is what I was alluding to here (comment).

So, after solving both of those, I think RaftWriteBatch can go?

Yup (WriteBatch.RaftData you mean).

So for 1000 log truncations across many ranges, we'll only have to sync once if that's how we set it up.

Yup, this batching and lazy actual deletion will be crucial. I'll be posting to #16624 shortly with some recent numbers, but log truncations (and by extension, base engine syncs) happen far too often right now to bring about the desired speedups.

@irfansharif please confirm that the above plan would really simplify your PR as I imagine.

Yup.


Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

@tbg (Member, Author) commented Jul 17, 2017

Cool. PR for #16749 is about to be posted, too.

@petermattis (Collaborator)

SGTM too


Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


Comments from Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Note that there is no migration concern.

Fixes cockroachdb#16749.
@tbg (Member, Author) commented Jul 17, 2017

PTAL: I rebased on top of #17068.


Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions.


pkg/storage/replica_proposal.go, line 762 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Looks like you haven't pushed that TODO.

Added a comment.


Comments from Reviewable

@petermattis (Collaborator)

:lgtm:


Review status: 0 of 3 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/storage/replica_command.go, line 1893 at r5 (raw file):

	// truncation.
	start := engine.MakeMVCCMetadataKey(keys.RaftLogKey(cArgs.EvalCtx.RangeID(), 0))
	end := engine.MakeMVCCMetadataKey(keys.RaftLogKey(cArgs.EvalCtx.RangeID(), args.Index-1).PrefixEnd())

Was there a reason you changed end here? I'm not seeing why it makes a difference.


pkg/storage/replica_command.go, line 1896 at r5 (raw file):

	var ms enginepb.MVCCStats
	if util.IsMigrated() {

@spencerkimball Per my comment in #16977, I think the ergonomics here are somewhat important. I'd like this to look something like a cluster setting:

var inconsistentRaftLogTruncation = settings.RegisterClusterVersionRequirement("1.1-alpha.20170608")
...
if inconsistentRaftLogTruncation.enabled() {
}

pkg/storage/replica_proposal.go, line 773 at r5 (raw file):

			} else {
				iter.Close()
			}

Closing iter on both code paths is a bit awkward. Could restructure as:

  iter := r.store.engine.NewIterator(...)
  err := ...
  iter.Close()
  if err != nil {
    log.Errorf(...)
  }

Comments from Reviewable

@tbg (Member, Author) commented Jul 17, 2017

TFTR! Are you OK'ing the first commit (IsMigrated() as well)?


Review status: 0 of 3 files reviewed at latest revision, 7 unresolved discussions.


pkg/storage/replica_command.go, line 1893 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Was the a reason you changed end here? I'm not seeing why it makes a difference.

No good reason, changing it back.


pkg/storage/replica_proposal.go, line 773 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Closing iter on both code paths is a bit awkward. Could restructure as:

  iter := r.store.engine.NewIterator(...)
  err := ...
  iter.Close()
  if err != nil {
    log.Errorf(...)
  }

Done.


Comments from Reviewable

@petermattis (Collaborator)

Yes, IsMigrated looks fine in the short term.


Review status: 0 of 3 files reviewed at latest revision, 3 unresolved discussions.


Comments from Reviewable

Sending log truncations through Raft is inefficient: the Raft log is not itself
part of the replicated state. Instead, we only replicate the TruncatedState and,
as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure;
anecdotally, truncations always looked like a lot of work during write-heavy
workloads, and this change should alleviate that somewhat. As explained above,
though, the change isn't being made for performance at this point.

It also removes one migration concern for cockroachdb#16809, see
cockroachdb#16809 (comment).

We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.

- we're moving logic downstream of Raft. However, we can easily migrate it
  upstream again, without a real migration, though I don't think that's going
  to happen.
- the big upshot is hopefully a large reduction in complexity for
  @irfansharif's PR: log truncation is one of the odd cases that requires
  a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct)
  solution also involves going downstream of Raft for a Raft-related write. So,
  after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- as @petermattis pointed out, after @irfansharif's change, we should be able
  to not sync the base engine on truncation changes but do it only as we
  actually clear the log entries (which can be delayed as we see fit). So for
  1000 log truncations across many ranges, we'll only have to sync once if
  that's how we set it up.
tbg added a commit to tbg/cockroach that referenced this pull request Jul 17, 2017
Since the move to proposer-evaluated KV, we were potentially clobbering the
HardState on splits as we accidentally moved HardState synthesis upstream of
Raft as well. This change moves it downstream again.

Though not strictly necessary, writing lastIndex was moved as well. This is
cosmetic, though it aids @irfansharif's PR cockroachdb#16809, which moves lastIndex to
the Raft engine. After this PR, neither HardState nor last index keys are
added to the WriteBatch, so that pre-cockroachdb#16993 `TruncateLog` is the only remaining
command that does so (and it, too, won't keep doing that for long).

Migration concerns: a lease holder running the new version will propose splits
that don't propose the HardState to Raft. A follower running the old version
will not write the HardState downstream of Raft. In combination, the HardState
would never get written, and would thus be incompatible with the
TruncatedState. Thus, while 1.0 might be around, we're still sending the
potentially dangerous HardState.

Fixes cockroachdb#16749.
@tbg merged commit 6c37d18 into cockroachdb:master on Jul 17, 2017
@tbg deleted the eval-truncate-log branch on July 17, 2017 23:43
tbg added a commit to tbg/cockroach that referenced this pull request Jul 18, 2017
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 15, 2018
Raft log truncations currently perform two steps (there may be
others, but for the sake of this discussion, let's consider only
these two):
1. above raft, they compute the stats of all raft log entries up
   to the truncation entry.
2. beneath raft, they use ClearIterRange to clear all raft log
   entries up to the truncation entry.

In both steps, operations are performed on all entries up to the
truncation entry, and in both steps these operations start from
entry 0. A comment added in cockroachdb#16993 gives some idea as to why:

> // We start at index zero because it's always possible that a previous
> // truncation did not clean up entries made obsolete by the previous
> // truncation.

My current understanding is that this case where a Raft log has
been truncated but its entries not cleaned up is only possible if
a node crashes between `applyRaftCommand` and `handleEvalResultRaftMuLocked`.
This brings up the question: why don't we truncate raft entries
downstream of raft in `applyRaftCommand`? That way, the entries
could be deleted atomically with the update to the `RaftTruncatedStateKey`
and we wouldn't have to worry about them ever diverging or Raft entries
being leaked. That seems like a trivial change, and if that was the
case, would the approach here be safe? I don't see a reason why
not.

For motivation on why we should explore this, I've found that when
running `sysbench oltp_insert` on a fresh cluster without pre-splits to
measure single-range write throughput, raft log truncation accounts for
about 20% of CPU utilization.

If we switch the ClearIterRange to a ClearRange downstream of raft,
we improve throughput by 13% and reduce the amount of CPU that raft
log truncation uses to about 5%. It's obvious why this speeds up the
actual truncation itself downstream of raft. The reason why it speeds
up the stats computation is less clear, but it may be allowing a RocksDB
iterator to more easily skip over the deleted entry keys.

If we make the change proposed here, we improve throughput by 28% and
reduce the amount of CPU that raft log truncation uses to a negligible
amount (< 1%, hard to tell exactly). The reason this speeds both the
truncation and the stats computation is because it avoids iterating
over RocksDB tombstones for all Raft entries that have ever existed
on the range.

The throughput improvements are of course exaggerated because we are
isolating the throughput of a single range, but they're significant
enough to warrant exploration about whether we can make this approach
work.

Finally, the outsized impact of this small change naturally justifies
further exploration. If we could make the change here safe (i.e. if we
could depend on replica.FirstIndex() to always be a lower bound on raft
log entry keys), could we make similar changes elsewhere? Are there
other places where we iterate over an entire raft log keyspace and
inadvertently run into all of the deletion tombstones when we could
simply skip to the `replica.FirstIndex()`? At a minimum, I believe
that `clearRangeData` fits this description, so there may be room to
speed up snapshots and replica GC.

Release note (performance improvement): Reduce the cost of Raft log
truncations and increase single-range throughput.
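
To make the proposal concrete, a hedged sketch of the bounded clearing described above (names approximated; the key point is that the span starts at the replica's first index rather than at entry zero):

```go
// Clear only [FirstIndex, truncIndex) instead of [0, truncIndex), so neither
// the stats computation nor the deletion has to iterate over the tombstones
// of entries removed by earlier truncations.
firstIndex, err := r.FirstIndex() // lower bound on existing raft log entries
if err != nil {
	return err
}
start := engine.MakeMVCCMetadataKey(keys.RaftLogKey(r.RangeID, firstIndex))
end := engine.MakeMVCCMetadataKey(keys.RaftLogKey(r.RangeID, truncIndex).PrefixEnd())
iter := eng.NewIterator(false /* !prefix */)
defer iter.Close()
if err := eng.ClearIterRange(iter, start, end); err != nil {
	return err
}
```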
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 17, 2018
craig bot pushed a commit that referenced this pull request Aug 19, 2018
28126: storage: truncate log only between first index and truncate index r=nvanbenschoten a=nvanbenschoten

### Question

Raft log truncations currently perform two steps (there may be
others, but for the sake of this discussion, let's consider only
these two):
1. above raft, they compute the stats of all raft log entries up
   to the truncation entry.
2. beneath raft, they use ClearIterRange to clear all raft log
   entries up to the truncation entry.

In both steps, operations are performed on all entries up to the
truncation entry, and in both steps these operations start from
entry 0. A comment added in #16993 gives some idea as to why:

> // We start at index zero because it's always possible that a previous
> // truncation did not clean up entries made obsolete by the previous
> // truncation.

My current understanding is that this case where a Raft log has
been truncated but its entries not cleaned up is only possible if
a node crashes between `applyRaftCommand` and `handleEvalResultRaftMuLocked`.
This brings up the question: why don't we truncate raft entries
downstream of raft in `applyRaftCommand`? That way, the entries
could be deleted atomically with the update to the `RaftTruncatedStateKey`
and we wouldn't have to worry about them ever diverging or Raft entries
being leaked. That seems like a trivial change, and if that was the
case, would the approach here be safe? I don't see a reason why
not.

### Motivation

For motivation on why we should explore this, I've found that when
running `sysbench oltp_insert` on a fresh cluster without pre-splits to
measure single-range write throughput, raft log truncation accounts for
about **20%** of CPU utilization.

<img width="1272" alt="truncate" src="https://user-images.githubusercontent.com/5438456/43502846-bb7a98d2-952a-11e8-9ba0-0b886d3e3ad9.png">

If we switch the ClearIterRange to a ClearRange downstream of raft,
we improve throughput by **13%** and reduce the amount of CPU that raft
log truncation uses to about **5%**. It's obvious why this speeds up the
actual truncation itself downstream of raft. The reason why it speeds
up the stats computation is less clear, but it may be allowing a RocksDB
iterator to more easily skip over the deleted entry keys.

If we make the change proposed here, we improve throughput by **28%** and
reduce the amount of CPU that raft log truncation uses to a negligible
amount (**< 1%**, hard to tell exactly). The reason this speeds both the
truncation and the stats computation is because it avoids iterating
over RocksDB tombstones for all Raft entries that have ever existed
on the range.

The throughput improvements are of course exaggerated because we are
isolating the throughput of a single range, but they're significant
enough to warrant exploration about whether we can make this approach
work.

### Extension

Finally, the outsized impact of this small change naturally justifies
further exploration. If we could make the change here safe (i.e. if we
could depend on replica.FirstIndex() to always be a lower bound on raft
log entry keys), could we make similar changes elsewhere? Are there
other places where we iterate over an entire raft log keyspace and
inadvertently run into all of the deletion tombstones when we could
simply skip to the `replica.FirstIndex()`? At a minimum, I believe
that `clearRangeData` fits this description, so there may be room to
speed up snapshots and replica GC.

cc. @tschottdorf @petermattis @benesch

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>