
storage: fix Raft log size accounting #31914

Merged
merged 1 commit into cockroachdb:master on Oct 26, 2018

Conversation

@tbg tbg (Member) commented Oct 26, 2018

We were accounting for sideloaded payloads (SSTs) when adding them to
the log, but were omitting them during truncations. As a result, the
tracked Raft log size would permanently skyrocket, which in turn would
lead to extremely aggressive truncations and result in pathological
amounts of Raft snapshots.

I'm still concerned about this logic, as we're now relying on numbers
obtained from the file system to exactly match a prior in-memory
computation, and there may be other bugs that cause a divergence. But
this fixes the blatantly obvious one, so it's a step in the right
direction.
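
For illustration, a minimal sketch (in Go, with hypothetical names and interfaces, not the actual CockroachDB code) of the accounting change: the bytes freed by removing sideloaded payloads must be subtracted from the tracked log size, just as they were added when the entries were appended.

// sideloadedStorage stands in for the sideloaded SST storage; TruncateTo is
// assumed to remove payloads for entries below the given index and report how
// many bytes it freed. Hypothetical interface, for illustration only.
type sideloadedStorage interface {
    TruncateTo(index uint64) (freedBytes int64, err error)
}

// truncateAndAccount sketches the fix: shrink the tracked Raft log size by the
// sideloaded payload bytes as well, since those bytes were counted when the
// entries were appended.
func truncateAndAccount(trackedSize, entryBytes int64, sl sideloadedStorage, index uint64) (int64, error) {
    freed, err := sl.TruncateTo(index)
    if err != nil {
        return trackedSize, err
    }
    // Before the fix, only entryBytes was subtracted here, so trackedSize grew
    // without bound whenever SSTs were sideloaded.
    return trackedSize - entryBytes - freed, nil
}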

The added tests highlight a likely omission in the sideloaded storage
code, which doesn't access the file system through the RocksDB env as
it seems it should; filed as #31913.

At this point it's unclear whether this fixes the issues below, but at the
very least it seems to address a major problem they encountered:

Touches #31732.
Touches #31740.
Touches #30261.
Touches #31768.
Touches #31745.

Release note (bug fix): avoid a performance degradation related to
overly aggressive Raft log truncations that could occur during RESTORE
or IMPORT operations.

@tbg tbg requested a review from a team October 26, 2018 08:49
@cockroach-teamcity (Member)

This change is Reviewable

@petermattis petermattis (Collaborator) left a comment

:lgtm:

Thanks for pushing this over the finish line.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/cmd/roachtest/tpcc.go, line 362 at r1 (raw file):

	if _, err := db.ExecContext(ctx, `SET CLUSTER SETTING kv.range_merge.queue_enabled = false`); err != nil {
		return err
	}

Perhaps revert the changes in this file which I made to debug tpcc-bench stuff.

@benesch might have an opinion about disabling the merge queue for tpcc bench runs until the merge issues are sorted out.


pkg/storage/replica.go, line 6584 at r1 (raw file):

	// Gossip the cluster ID from all replicas of the first range; there
	// is no expiration on the cluster ID.
	if false && log.V(1) {

There are a bunch of log changes such as this that should be reverted.


pkg/storage/replica.go, line 6617 at r1 (raw file):

	log.Event(ctx, "gossiping sentinel and first range")
	if log.V(1) {
		//log.Infof(ctx, "gossiping sentinel from store %d, r%d", r.store.StoreID(), r.RangeID)

Revert?


pkg/storage/replica.go, line 6624 at r1 (raw file):

		log.Errorf(ctx, "failed to gossip sentinel: %s", err)
	}
	if false && log.V(1) {

Revert?


pkg/storage/replica_proposal.go, line 573 at r1 (raw file):

					log.Errorf(ctx, "while removing sideloaded files during log truncation: %s", err)
				} else {
					rResult.RaftLogDelta -= size

I trust that you agree this is copacetic. This was the part of the change that I was most unsure about.

@tbg tbg (Member, Author) commented Oct 26, 2018

I ran a few roachtests to verify that things are stable now.

These two look good (another two running now):

--- PASS: restore2TB/nodes=10 (8959.51s)
--- PASS: restore2TB/nodes=10 (8743.44s)

But I got this suspicious failure:

--- FAIL: import/tpch/nodes=8 (9943.52s)
	test.go:639,cluster.go:1453,import.go:122: pq: split at key /Table/53/1/390157478/5 failed: replica 0 is invalid: ReplicaID must not be zero

This originates in

// Validate performs some basic validation of the contents of a replica descriptor.
func (r ReplicaDescriptor) Validate() error {
    if r.NodeID == 0 {
        return errors.Errorf("NodeID must not be zero")
    }
    if r.StoreID == 0 {
        return errors.Errorf("StoreID must not be zero")
    }
    if r.ReplicaID == 0 {
        return errors.Errorf("ReplicaID must not be zero")
    }
    return nil
}

So we have a replica that has a NodeID and StoreID but a zero ReplicaID, either in

if err := updateRangeDescriptor(b, rightDescKey, nil, rightDesc); err != nil {

or in the other call to updateRangeDescriptor. I'm taking a look at the logs now. Doubtful that it's at all related to this change.

PS merges are turned off in this testing.

My best bet for this error is that

reply, lastErr = r.adminSplitWithDescriptor(ctx, args, r.Desc())
returns a descriptor containing a replica whose ReplicaID is zero.

@tbg tbg (Member, Author) commented Oct 26, 2018

Filed that error as a separate issue #31918.

@tbg tbg (Member, Author) left a comment

Reviewed 5 of 9 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_proposal.go, line 573 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I trust that you agree this is copacetic. This was the part of the change that I was most unsure about.

Are you concerned about anything in particular? This seems fine to me. We could compute this upstream, but it doesn't seem better, and we don't need this to be equivalent between replicas anyway.
One problem could be that we might be removing more than we added. For example, if a bunch of data gets sideloaded and then the cluster restarts, the first truncation after the restart will get lots of bytes here "for free". This is likely going to result in a Raft log size of zero, and it would be a little off (too small) from that point on. (This isn't a problem specific to sideloaded payloads; it's basically by design.) This can only result in a de facto size that is off by at most a factor of two (i.e. truncating 2x as late as configured), so it's probably not a big deal.
It'll be painful to get all of this "correct". We can't simply recompute the size of the remaining log here as there might be more sideloaded payloads. Besides, sideloaded payloads for older terms may coexist with new ones.

I just wrote a comment to deal with residual space in Raft logs for idle ranges here:

// RaftLogQueueStaleThreshold is the minimum threshold for stale raft log
// entries. A stale entry is one which all replicas of the range have
// progressed past and thus is no longer needed and can be truncated.
RaftLogQueueStaleThreshold = 100

Perhaps when performing such a truncation, we could do a little more work to recompute the exact length. I hope it's never going to be worth it though.
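
To make the restart scenario above concrete, a hypothetical clamp like the one below (illustrative only, not the actual code) is why the tracked size would bottom out at zero rather than going negative:

// updateRaftLogSize applies a (possibly negative) truncation delta to the
// in-memory Raft log size. If a restart has reset the counter, a later
// truncation may subtract bytes that were counted before the restart, so the
// result is clamped at zero; from then on the tracked size under-counts.
func updateRaftLogSize(size, delta int64) int64 {
    size += delta
    if size < 0 {
        size = 0
    }
    return size
}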


pkg/cmd/roachtest/tpcc.go, line 362 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps revert the changes in this file which I made to debug tpcc-bench stuff.

@benesch might have an opinion about disabling the merge queue for tpcc bench runs until the merge issues are sorted out.

Not intentional. Sorry for not catching that.


pkg/storage/replica.go, line 6584 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

There are a bunch of log changes such as this that should be reverted.

Not intentional. Sorry for not catching that.


pkg/storage/replica.go, line 6617 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Revert?

Not intentional. Sorry for not catching that.


pkg/storage/replica.go, line 6624 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Revert?

Not intentional. Sorry for not catching that.

@tbg tbg (Member, Author) commented Oct 26, 2018

Another one just passed (and yet another is within 20 minutes of completion and looking good):

--- PASS: restore2TB/nodes=10 (8099.52s)
PASS

Really hoping to see fewer import tests fail tonight.

@petermattis petermattis (Collaborator) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_proposal.go, line 573 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Are you concerned about anything in particular? This seems fine to me. We could compute this upstream, but it doesn't seem better, and we don't need this to be equivalent between replicas anyway.
One problem could be that we might be removing more than we added. For example, if a bunch of data gets sideloaded and then the cluster restarts, the first truncation after the restart will get lots of bytes here "for free". This is likely going to result in a Raft log size of zero, and it would be a little off (too small) from that point on. (This isn't a problem specific to sideloaded payloads; it's basically by design.) This can only result in a de facto size that is off by at most a factor of two (i.e. truncating 2x as late as configured), so it's probably not a big deal.
It'll be painful to get all of this "correct". We can't simply recompute the size of the remaining log here as there might be more sideloaded payloads. Besides, sideloaded payloads for older terms may coexist with new ones.

I just wrote a comment to deal with residual space in Raft logs for idle ranges here:

// RaftLogQueueStaleThreshold is the minimum threshold for stale raft log
// entries. A stale entry is one which all replicas of the range have
// progressed past and thus is no longer needed and can be truncated.
RaftLogQueueStaleThreshold = 100

Perhaps when performing such a truncation, we could do a little more work to recompute the exact length. I hope it's never going to be worth it though.

No particular concerns, but this is the only place where we're computing RaftLogDelta below Raft.

@tbg tbg (Member, Author) commented Oct 26, 2018

Ack. I'm going to merge but would like to hear if @nvanbenschoten has any concerns and would rather move this upstream of Raft.

bors r=petermattis

craig bot pushed a commit that referenced this pull request Oct 26, 2018

31914: storage: fix Raft log size accounting r=petermattis a=tschottdorf

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig craig bot (Contributor) commented Oct 26, 2018

Build succeeded

@craig craig bot merged commit e808caf into cockroachdb:master Oct 26, 2018
@nvanbenschoten nvanbenschoten (Member) left a comment

:lgtm: I don't think this can be moved upstream into Raft because etcd/raft has no concept of log truncation. My understanding is that it assumes that logs are persisted forever.

Reviewed 4 of 9 files at r1, 4 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica_proposal.go, line 573 at r1 (raw file):

Besides, sideloaded payloads for older terms may coexist with new ones.

That's interesting. Don't we run into the same issue that we do here when a new term overwrites old sideloaded entries? Should we be doing something similar to this here:

firstPurge := rd.Entries[0].Index // first new entry written
purgeTerm := rd.Entries[0].Term - 1
lastPurge := prevLastIndex // old end of the log, include in deletion
for i := firstPurge; i <= lastPurge; i++ {
    err := r.raftMu.sideloaded.Purge(ctx, i, purgeTerm)
    if err != nil && errors.Cause(err) != errSideloadedFileNotFound {
        const expl = "while purging index %d"
        return stats, expl, errors.Wrapf(err, expl, i)

@tbg tbg deleted the fix/truncate-addsstable branch October 26, 2018 15:20
@tbg tbg (Member, Author) left a comment

Sorry, I didn't mean "into etcd/raft" but "into proposal evaluation".

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_proposal.go, line 573 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Besides, sideloaded payloads for older terms may coexist with new ones.

That's interesting. Don't we run into the same issue that we do here when a new term overwrites old sideloaded entries? Should we be doing something similar to this here:

firstPurge := rd.Entries[0].Index // first new entry written
purgeTerm := rd.Entries[0].Term - 1
lastPurge := prevLastIndex // old end of the log, include in deletion
for i := firstPurge; i <= lastPurge; i++ {
    err := r.raftMu.sideloaded.Purge(ctx, i, purgeTerm)
    if err != nil && errors.Cause(err) != errSideloadedFileNotFound {
        const expl = "while purging index %d"
        return stats, expl, errors.Wrapf(err, expl, i)

Absolutely, let me look into that.

@nvanbenschoten nvanbenschoten (Member)

Sorry, I didn't mean "into etcd/raft" but "into proposal evaluation".

Ah, I see. I don't have any big concerns about what we do here. If computing this during proposal evaluation means that we would need to peek at each Raft log entry (in Go), then I'd say it's a non-starter. If we could just look directly at the sideloaded storage, that might be a workable solution. It would be nice to avoid looking at the file size on each replica beneath Raft and hoping they all match, even if it is OK for them to diverge. It would also be nice to keep this out of the single-threaded Raft loop.
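
A rough sketch of the alternative floated here, computing the delta once during proposal evaluation by querying the sideloaded storage directly (hypothetical interface and names, not the actual CockroachDB API):

// sideloadedSizer reports how many sideloaded bytes a truncation below the
// given index would free. Hypothetical interface, for illustration only.
type sideloadedSizer interface {
    BytesIfTruncatedTo(index uint64) (int64, error)
}

// proposedTruncationDelta computes the full size delta up front, during
// proposal evaluation, so every replica applies the same number instead of
// each one measuring file sizes beneath Raft in the apply loop.
func proposedTruncationDelta(sl sideloadedSizer, entryBytes int64, index uint64) (int64, error) {
    freed, err := sl.BytesIfTruncatedTo(index)
    if err != nil {
        return 0, err
    }
    return -(entryBytes + freed), nil
}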

tbg added a commit to tbg/cockroach that referenced this pull request Oct 26, 2018
storage: adjust raft log size correctly when replacing sideloaded entries

This follows up on a comment of @nvanbenschoten in cockroachdb#31914.
Unfortunately, it seems really hairy to come up with a test for this,
so a bit of refactoring will be needed.

Release note: None
craig bot pushed a commit that referenced this pull request Oct 26, 2018
31881: exec: distinct manages its own scratch column r=jordanlewis a=jordanlewis

Previously, distinct relied on its input batch to have a scratch boolean
column for working. It's unnecessary - instead, manage the scratch
boolean directly during construction.

Release note: None

31926: storage: adjust raft log size correctly when replacing sideloaded entries r=nvanbenschoten a=tschottdorf

This follows up on a comment of @nvanbenschoten in #31914 which highlighted yet
another potential (though hopefully rare) source of raft log size not being
reduced correctly.

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Nov 16, 2018
storage: adjust raft log size correctly when replacing sideloaded entries

This follows up on a comment of @nvanbenschoten in cockroachdb#31914.
Unfortunately, it seems really hairy to come up with a test for this,
so a bit of refactoring will be needed.

Release note: None