
server: add a configuration to enable GC of system.rangelog #30913

Merged
merged 1 commit on Oct 11, 2018

Conversation

Contributor

@mvijaykarthik mvijaykarthik commented Oct 3, 2018

The system.rangelog table currently grows without bound. The rate
of growth is slow (as long as there is no replica rebalancing
thrashing), but it can still become a problem in long-running
clusters.

This commit adds cluster settings to specify the interval and TTL
for rows in system.rangelog.
By default, the TTL of system.rangelog is set to 30 days.

Fixes #21260

Release note: Add configuration to enable GC of system.rangelog

@mvijaykarthik mvijaykarthik requested a review from a team October 3, 2018 10:38
@cockroach-teamcity
Member

This change is Reviewable

@mvijaykarthik mvijaykarthik force-pushed the gc_rangelog branch 2 times, most recently from b809190 to 1ad4130 on October 3, 2018 14:34
Contributor

@benesch benesch left a comment

This is awesome! @tschottdorf, are you the right person to review?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 137 at r1 (raw file):

		"if non zero, range log entries older than this duration are deleted periodically "+
			"based on storage.rangelog.gc_interval",
		0,

I think we could probably default this to something like 30d. @bdarnell do you have thoughts?

Member

@bdarnell bdarnell left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 137 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I think we could probably default this to something like 30d. @bdarnell do you have thoughts?

Yeah, I think this should be on by default. 30d seems reasonable (or maybe the threshold should be based on the size of the log, to cover cases when the log blows up suddenly due to rebalancer thrashing?)

Member

@BramGruneir BramGruneir left a comment

This will be running on every server, and there's no need for that. Can we run this only on a specific node? Say, check which node is the leaseholder for the rangelog and only try the delete there.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

Collaborator

@petermattis petermattis left a comment

Glad this is getting done. I'll echo @BramGruneir's comment.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 89 at r1 (raw file):

const (
	// defaultRangeLogGCInterval is the default interval to run gc on rangelog.
	defaultRangeLogGCInteval = 30 * time.Minute

s/Inteval/Interval/g (you've got it spelled right in the comment!)

Contributor Author

@mvijaykarthik mvijaykarthik left a comment

Updated it to run on lease holder of range 1.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 89 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/Inteval/Interval/g (you've got it spelled right in the comment!)

Done.


pkg/server/server.go, line 137 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Yeah, I think this should be on by default. 30d seems reasonable (or maybe the threshold should be based on the size of the log, to cover cases when the log blows up suddenly due to rebalancer thrashing?)

Done.

Member

@tbg tbg left a comment

I gave this a coarse review. This already looks pretty good! A little bit of work is left to do to make sure this doesn't cause problems. Thanks so far @mvijaykarthik!

Reviewed 5 of 5 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 41 at r2 (raw file):

	"github.com/cockroachdb/cmux"

Remove this newline.


pkg/server/server.go, line 131 at r2 (raw file):

	// RangeLogTTL is the TTL for rows in system.rangelog. If non zero, range log
	// entries are periodically garbage collectd. The period is given by

collected


pkg/server/server.go, line 144 at r2 (raw file):

	RangeLogGCInterval = settings.RegisterDurationSetting(
		"server.rangelog.gc_interval",
		"interval for running gc on rangelog. If storage.rangelog.ttl is non zero, "+

Shouldn't this simply be a function of server.rangelog.ttl (i.e. not a cluster var)? I think it should be computed as min(10*time.Minute, RangeLogTTL). We don't want to wait 30 days to make an attempt, or deployments which see restarts every <30 days will never GC the range log. We also can't just blindly run at startup because leaseholdership is often not known then (on top of generating silly errors that might end up in the logs).
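
For illustration, a minimal sketch of the min-style interval computation suggested here; the helper name and the main wrapper are illustrative, not code from the PR.

package main

import (
	"fmt"
	"time"
)

// gcInterval caps the wait between GC attempts so that a long TTL (say 30
// days) does not keep short-lived deployments from ever running the GC loop.
func gcInterval(rangeLogTTL time.Duration) time.Duration {
	const maxInterval = 10 * time.Minute
	if rangeLogTTL > 0 && rangeLogTTL < maxInterval {
		return rangeLogTTL
	}
	return maxInterval
}

func main() {
	fmt.Println(gcInterval(30 * 24 * time.Hour)) // 10m0s
	fmt.Println(gcInterval(2 * time.Minute))     // 2m0s
}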


pkg/server/server.go, line 1071 at r2 (raw file):

// It returns the node currently responsible for performing GC, the number of
// rows affected and error (if any).
func (s *Server) gcRangeLog(

please extract all of this new code into a new file server_rangelog_gc.go.


pkg/server/server.go, line 1074 at r2 (raw file):

	ctx context.Context, cutoffTimestamp time.Time,
) (roachpb.NodeID, int, error) {
	const selectLeaseHolderStmt = `SELECT lease_holder from crdb_internal.ranges where range_id=1`

Careful, last I checked this was really expensive -- it sends an RPC to all nodes, collects information about all ranges, queries all leaseholders in a loop, and then throws it all away just to give you your boolean -- @changangela, I think you worked on this recently in #30846. What I say above is still mostly true, except for the loop to get the leaseholder, right? That would be way too expensive.

I think you can use

repl, err := s.node.stores.GetReplicaForRangeID(roachpb.RangeID(1))
repl.IsFirstRange() && repl.OwnsValidLease(s.clock.Now())

for a much cheaper check.


pkg/server/server.go, line 1075 at r2 (raw file):

) (roachpb.NodeID, int, error) {
	const selectLeaseHolderStmt = `SELECT lease_holder from crdb_internal.ranges where range_id=1`
	const deleteStmt = `DELETE FROM system.rangelog WHERE timestamp <= $1 LIMIT 1000`

This is a queue-like workload and as such we should give it both a lower and upper bound so that the query reads

DELETE FROM system.rangelog WHERE timestamp <= $1 AND timestamp >= $2 LIMIT 1000

and in each iteration update the $2 with the last returned result (and you return the final $2 from the method so that it can be passed in again for the next run).

When the loop runs for the first time, I think it's OK if it uses a $2 of zero (hard to avoid).

Also make sure that you exercise the case in which you insert north of 1000 entries to be GC'ed, if you haven't already.
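
For illustration, a sketch of the bounded, batched delete loop described above, written against database/sql rather than the server's internal SQL executor; the batch size, package name, and the way the lower bound is advanced are simplifying assumptions, not the PR's exact code.

package rangeloggc

import (
	"context"
	"database/sql"
	"time"
)

const rangeLogGCBatchSize = 1000

// gcRangeLogBatch deletes rows whose timestamps lie in [lowerBound, cutoff]
// in batches, and returns the lower bound to pass into the next run so that
// later runs do not re-scan the tombstones this one leaves behind.
func gcRangeLogBatch(
	ctx context.Context, db *sql.DB, lowerBound, cutoff time.Time,
) (time.Time, int64, error) {
	// The LIMIT must match rangeLogGCBatchSize.
	const stmt = `DELETE FROM system.rangelog
	              WHERE timestamp <= $1 AND timestamp >= $2 LIMIT 1000`
	var total int64
	for {
		res, err := db.ExecContext(ctx, stmt, cutoff, lowerBound)
		if err != nil {
			return lowerBound, total, err
		}
		n, err := res.RowsAffected()
		if err != nil {
			return lowerBound, total, err
		}
		total += n
		if n < rangeLogGCBatchSize {
			// Fewer rows than the batch size: everything up to the cutoff is
			// gone, so the cutoff becomes the next run's lower bound.
			return cutoff, total, nil
		}
	}
}

The review comment also asks for the lower bound to be advanced after every batch using the last deleted timestamp; this sketch only advances it between runs, which is the simplest form of the same idea.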


pkg/server/server.go, line 1081 at r2 (raw file):

Rangelog

RangeLog (for both knobs).


pkg/server/server.go, line 1136 at r2 (raw file):

func (s *Server) startRangeLogGC(ctx context.Context) {
	s.stopper.RunWorker(ctx, func(ctx context.Context) {
		intervalChangeCh := make(chan time.Duration)

(This is a comment on this whole method)

Unfortunately it's really difficult to write a bug-free and robust implementation of a timer loop that's sensitive to cluster settings in the "optimal" way. Could you structure this exactly like #25227 which had a much much simpler loop (and also went through a couple of iterations in review)? You can start a server that overrides the default to a low value.

Here you have the problem that the cluster setting has an "on" default (30d), so naively it won't work. Instead, you can just override the default for the GC duration to something low by passing a suitable *cluster.Settings into TestServerArgs. Just look for a test that passes a non-nil *cluster.Settings into a TestServerArgs for inspiration. You can set the value with myClusterVar.Set(&st.SV, 3*time.Millisecond).
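
As a rough standalone shape of the simpler ticker-style loop being asked for (the real version hangs off the server's stopper and is driven by the cluster settings; the names here are illustrative):

package rangeloggc

import (
	"context"
	"log"
	"time"
)

// runGCLoop wakes up every period and invokes gc until the context is
// cancelled. Keeping the wake-up period fixed (and short relative to the TTL)
// avoids having to react to cluster-setting changes inside the loop itself.
func runGCLoop(ctx context.Context, period time.Duration, gc func(context.Context) error) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := gc(ctx); err != nil {
				log.Printf("system log GC failed: %v", err)
			}
		case <-ctx.Done():
			return
		}
	}
}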


pkg/storage/log_test.go, line 401 at r2 (raw file):

}

func TestLogGC(t *testing.T) {

I'd prefer if this lived close to the code that you're testing, in pkg/server. Any reason to not do that?


pkg/storage/log_test.go, line 518 at r2 (raw file):

	a.Equal(2000, rangeLogRowCount())

	// GC everything

Stylistic nit: add . to every sentence.

Contributor

@changangela changangela left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 1074 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Careful, last I checked this was really expensive -- it sends an RPC to all nodes, collects information about all ranges, queries all leaseholders in a loop, and then throws it all away just to give you your boolean -- @changangela, I think you worked on this recently in #30846. What I say above is still mostly true, except for the loop to get the leaseholder, right? That would be way too expensive.

I think you can use

repl, err := s.node.stores.GetReplicaForRangeID(roachpb.RangeID(1))
repl.IsFirstRange() && repl.OwnsValidLease(s.clock.Now())

for a much cheaper check.

Since the ranges virtual table is now just a view over ranges_no_leases, the optimizer should be able to apply the filter first and then do the RPC request for only the one leaseholder.

Contributor

@changangela changangela left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 1074 at r2 (raw file):

Previously, changangela (Angela Chang) wrote…

Since the ranges virtual table is now just a view over ranges_no_leases, the optimizer should be able to apply the filter first and then do the RPC request for only the one leaseholder.

Oh, but yeah unfortunately it still populates the entirety of the ranges_no_leases virtual table.

@mvijaykarthik mvijaykarthik force-pushed the gc_rangelog branch 2 times, most recently from 010799c to 9318ee1 on October 8, 2018 19:14
Contributor Author

@mvijaykarthik mvijaykarthik left a comment

I have also added GC for system.eventlog, since it was a very minor change to include it as well.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 41 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Remove this newline.

Done.


pkg/server/server.go, line 131 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

collected

Done.


pkg/server/server.go, line 144 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Shouldn't this simply be a function of server.rangelog.ttl (i.e. not a cluster var)? I think it should be computed as min(10*time.Minute, RangeLogTTL). We don't want to wait 30 days to make an attempt, or deployments which see restarts every <30 days will never GC the range log. We also can't just blindly run at startup because leaseholdership is often not known then (on top of generating silly errors that might end up in the logs).

I have kept this at 10m. If people accidentally set a low TTL, the GC loop may hamper system performance. I mentioned this in the documentation of this setting.


pkg/server/server.go, line 1071 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

please extract all of this new code into a new file server_rangelog_gc.go.

Done.


pkg/server/server.go, line 1074 at r2 (raw file):

Previously, changangela (Angela Chang) wrote…

Oh, but yeah unfortunately it still populates the entirety of the ranges_no_leases virtual table.

Done. It's not in the same transaction, though. That shouldn't matter correctness-wise, and it should be fine performance-wise as long as the lease doesn't bounce around.


pkg/server/server.go, line 1075 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This is a queue-like workload and as such we should give it both a lower and upper bound so that the query reads

DELETE FROM system.rangelog WHERE timestamp <= $1 AND timestamp >= $2 LIMIT 1000

and in each iteration update the $2 with the last returned result (and you return the final $2 from the method so that it can be passed in again for the next run).

When the loop runs for the first time, I think it's OK if it uses a $2 of zero (hard to avoid).

Also make sure that you exercise the case in which you insert north of 1000 entries to be GC'ed, if you haven't already.

Done. A test case is present with 2000 rows.


pkg/server/server.go, line 1081 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…
Rangelog

RangeLog (for both knobs).

Done.


pkg/server/server.go, line 1136 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

(This is a comment on this whole method)

Unfortunately it's really difficult to write a bug-free and robust implementation of a timer loop that's sensitive to cluster settings in the "optimal" way. Could you structure this exactly like #25227 which had a much much simpler loop (and also went through a couple of iterations in review)? You can start a server that overrides the default to a low value.

Here you have the problem that the cluster setting has an "on" default (30d), so naively it won't work. Instead, you can just override the default for the GC duration to something low by passing a suitable *cluster.Settings into TestServerArgs. Just look for a test that passes a non-nil *cluster.Settings into a TestServerArgs for inspiration. You can set the value with myClusterVar.Set(&st.SV, 3*time.Millisecond).

Done. Changed to a ticker. Tweaking this interval ideally shouldn't be required.


pkg/storage/log_test.go, line 401 at r2 (raw file):
I'm using the following to insert test rangelog events:

store.LogReplicaChangeTest(

It's not usable from the server package; that's why I added it here.


pkg/storage/log_test.go, line 518 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Stylistic nit: add . to every sentence.

Done.

@mvijaykarthik mvijaykarthik force-pushed the gc_rangelog branch 2 times, most recently from 4fa9659 to 759bb39 on October 9, 2018 13:04
Member

@tbg tbg left a comment

Looks pretty good already. Some more comments but getting close!

Reviewed 6 of 6 files at r3.
Dismissed @petermattis from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server.go, line 144 at r2 (raw file):

Previously, mvijaykarthik wrote…

I have kept this as 10m. If people accidentally set a low TTL, the gc loop may hamper system performance. Mentioned this in the documentation of this setting.

SGTM.


pkg/server/server_systemlog_gc.go, line 45 at r3 (raw file):

	repl, err := s.node.stores.GetReplicaForRangeID(roachpb.RangeID(1))
	if err != nil {
		return 0, err

You want to handle the case in which there isn't a replica for r1 more gracefully to avoid spurious errors.


pkg/server/server_systemlog_gc.go, line 48 at r3 (raw file):

	}

	if repl.IsFirstRange() && repl.OwnsValidLease(s.clock.Now()) {

I'd make that an early return, so you can de-dent all the code below. Same comment as above.


pkg/server/server_systemlog_gc.go, line 56 at r3 (raw file):

		// A timestamp lower-bound is used in the delete query to avoid hitting
		// tombstones.
		// This lower bound is periodically updated after every successful delete

Sorry if that wasn't clear, but I was also expecting you to return this bound from this method. A large enough range tombstone minefield will cause a noticeable CPU spike and affect unrelated requests' p99. Ideally at some point range tombstones are less of a problem, but they are one today. The only case in which we'll eat the tombstones is when the server first starts and runs this loop.


pkg/server/server_systemlog_gc.go, line 118 at r3 (raw file):

	s.stopper.RunWorker(ctx, func(ctx context.Context) {
		period := 10 * time.Minute

make this a const somewhere so that you can also pretty-print it into the description of the cluster setting (where you refer to the 10min interval).


pkg/server/server_systemlog_gc.go, line 120 at r3 (raw file):

		period := 10 * time.Minute
		if storeKnobs, ok := s.cfg.TestingKnobs.Store.(*storage.StoreTestingKnobs); ok && storeKnobs.SystemLogsGCPeriod != 0 {
			period = storeKnobs.SystemLogsGCPeriod

Ah, I see. That's why you were able to hard-code the 10min interval. Works for me.


pkg/server/server_systemlog_gc.go, line 144 at r3 (raw file):

						if rowsAffected > 0 {
							log.Infof(ctx, "garbage collected %d rows from %s", rowsAffected, table)

Make this V(1). In a sufficiently busy cluster this will log every 10min, and that's not useful.


pkg/storage/log_test.go, line 401 at r2 (raw file):

Previously, mvijaykarthik wrote…

I'm using the below to insert test rangelog events.

store.LogReplicaChangeTest(

It not usable from server package. That's why I added it here

I'd prefer if you took what you needed from func (s *Store) insertRangeLogEvent( to insert dummy events.


pkg/storage/log_test.go, line 578 at r3 (raw file):

			)

			tc.setting.Override(&s.ClusterSettings().SV, time.Nanosecond)

Please don't use Override after the server has been started; it's flaky since your custom value can be overridden by a Gossip update. Use SET CLUSTER SETTING instead. (I gave you bad advice here, sorry about that).
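
A small sketch of driving the setting through SQL in a test, as suggested; it assumes a plain database/sql connection to the test server, and the helper name and TTL value are illustrative.

package rangeloggc

import (
	"database/sql"
	"testing"
)

// setShortRangeLogTTL lowers the rangelog TTL via SQL rather than overriding
// the in-memory setting after the server has started, so a later Gossip
// update cannot clobber the value.
func setShortRangeLogTTL(t *testing.T, db *sql.DB) {
	t.Helper()
	if _, err := db.Exec(`SET CLUSTER SETTING server.rangelog.ttl = '50ms'`); err != nil {
		t.Fatal(err)
	}
}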

Contributor Author

@mvijaykarthik mvijaykarthik left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/server/server_systemlog_gc.go, line 45 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You want to handle the case in which there isn't a replica for r1 more gracefully to avoid spurious errors.

Done.


pkg/server/server_systemlog_gc.go, line 48 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd make that an early return, so you can de-dent all the code below. Same comment as above.

Done.


pkg/server/server_systemlog_gc.go, line 56 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Sorry if that wasn't clear, but I was also expecting you to return this bound from this method. A large enough range tombstone minefield will cause a noticeable CPU spike and affect unrelated requests' p99. Ideally at some point range tombstones are less of a problem, but they are one today. The only case in which we'll eat the tombstones is when the server first starts and runs this loop.

Done. I think it will also hit tombstones if the lease holder for range 1 changes.


pkg/server/server_systemlog_gc.go, line 118 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

make this a const somewhere so that you can also pretty-print it into the description of the cluster setting (where you refer to the 10min interval).

Done.


pkg/server/server_systemlog_gc.go, line 144 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Make this V(1). In a sufficiently busy cluster this will log every 10min, and that's not useful.

Done.


pkg/storage/log_test.go, line 401 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd prefer if you took what you needed from func (s *Store) insertRangeLogEvent( to insert dummy events.

Done.


pkg/storage/log_test.go, line 578 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Please don't use Override after the server has been started, this is flaky since your custom value can be overridden by a Gossip update. Use SET CLUSTER SETTING instead. (I gave you bad advice here, sorry about that).

Done.

Member

@tbg tbg left a comment

:lgtm: after my comments (none of which are substantial) are addressed. Thanks @mvijaykarthik and good work!

Reviewed 6 of 6 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/server.go, line 41 at r4 (raw file):

	"github.com/cockroachdb/cmux"

stray newline.


pkg/server/server_systemlog_gc.go, line 44 at r4 (raw file):

		"server.rangelog.ttl",
		fmt.Sprintf(
			"if non zero, range log entries older than this duration are deleted periodically (%v period).",

are deleted every %s


pkg/server/server_systemlog_gc.go, line 55 at r4 (raw file):

		"server.eventlog.ttl",
		fmt.Sprintf(
			"if non zero, event log entries older than this duration are deleted periodically (%v period).",

are deleted every %s


pkg/server/server_systemlog_gc.go, line 68 at r4 (raw file):

// performs gc.
// The system log table is expected to have a "timestamp" column.
// It returns the highest timestamp gcd, number of rows affected and error

If it didn't GC anything, it should return timestampUpperBound, so I would just say that it returns a lower bound to be used in the next iteration.

Note that there's also a race here: if the ttl is very low, it's possible that events could be inserted at timestamps below the cutoff and so they would be missed.

I think that's OK. Perhaps worth a comment to suggest not setting the TTL lower than, say, a day.


pkg/server/server_systemlog_gc.go, line 71 at r4 (raw file):

// (if any).
func (s *Server) gcSystemLog(
	ctx context.Context, table string, timestampLowerBound, timestampUpperBound time.Time,

just for paranoia, let's make sure that if timestampLowerBound >= timestampUpperBound, this method works. I think it does (won't remove anything and might regress the cutoff, but that's ok) but please also take a look.


pkg/server/server_systemlog_gc.go, line 77 at r4 (raw file):

	if err != nil {
		if log.V(1) {
			log.Warning(ctx, "unable to get descriptor for range 1")

Don't want to log the error? Also this would be Infof. In a large cluster most nodes won't have a replica of r1 and will see RangeNotFoundError here.


pkg/server/server_systemlog_gc.go, line 79 at r4 (raw file):

			log.Warning(ctx, "unable to get descriptor for range 1")
		}
		return time.Time{}, 0, nil

You want to return the old timestamp bound to the caller.


pkg/server/server_systemlog_gc.go, line 86 at r4 (raw file):

			log.Warningf(ctx, "not running gc on system.%v as node is not lease holder of range 1", table)
		}
		return time.Time{}, 0, nil

ditto


pkg/server/server_systemlog_gc.go, line 149 at r4 (raw file):

is the timestamp below which events are gc'ed.


pkg/server/server_systemlog_gc_test.go, line 80 at r4 (raw file):

					txn.OrigTimestamp().GoTime(),
					testRangeID,
					1, // storeID.

optional nit: no dot here (inline comment). Go style guide 🤷‍♂️


pkg/server/server_systemlog_gc_test.go, line 104 at r4 (raw file):

	// GC up to maxTs1.
	tm, rowsGCd, err := ts.GCSystemLog(ctx, table, timeutil.Unix(0, 0), maxTs1)

nit: maxTS1, maxTS2, ... (Golang style guide wants abbreviations capitalized, like HTTP not Http).

Contributor Author

@mvijaykarthik mvijaykarthik left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/server/server.go, line 41 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

stray newline.

Ah sorry. My editor auto-formats imports and it keeps adding this newline every time I edit this file.


pkg/server/server_systemlog_gc.go, line 44 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

are deleted every %s

Done.


pkg/server/server_systemlog_gc.go, line 55 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

are deleted every %s

Done.


pkg/server/server_systemlog_gc.go, line 68 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

If it didn't GC anything, it should return timestampUpperBound, so I would just say that it returns a lower bound to be used in the next iteration.

Note that there's also a race here: if the ttl is very low, it's possible that events could be inserted at timestamps below the cutoff and so they would be missed.

I think that's OK. Perhaps worth a comment to suggest not setting the TTL lower than, say, a day.

Done.


pkg/server/server_systemlog_gc.go, line 71 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

just for paranoia, let's make sure that if timestampLowerBound >= timestampUpperBound, this method works. I think it does (won't remove anything and might regress the cutoff, but that's ok) but please also take a look.

I added a test case for the equality case. I'll also add one for the > case.


pkg/server/server_systemlog_gc.go, line 77 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Don't want to log the error? Also this would be Infof. In a large cluster most nodes won't have a replica of r1 and will see RangeNotFoundError here.

Done.


pkg/server/server_systemlog_gc.go, line 79 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You want to return the old timestamp bound to the caller.

Done.


pkg/server/server_systemlog_gc.go, line 86 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

ditto

Done.


pkg/server/server_systemlog_gc.go, line 149 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

is the timestamp below which events are gc'ed.

Done.


pkg/server/server_systemlog_gc_test.go, line 80 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

optional nit: no dot here (inline comment). Go style guide 🤷‍♂️

Done.

Member

@tbg tbg left a comment

Thanks @mvijaykarthik! I had a few edits that were so minor that I went ahead and patched them into your branch; I hope that's ok.

bors r+

Reviewed 6 of 6 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@craig
Contributor

craig bot commented Oct 11, 2018

Build failed

The system.rangelog table currently grows without bound. The rate
of growth is slow (as long as there is no replica rebalancing
thrashing), but it can still become a problem in long-running
clusters.

This commit adds cluster settings to specify the interval and TTL
for rows in system.rangelog.
By default, the TTL of system.rangelog is set to 30 days, and that
for system.eventlog to 90 days.

Fixes cockroachdb#21260.

Release note (sql change): the range log and system events logs will
automatically purge records older than 30 and 90 days, respectively.
This can be adjusted via the server.rangelog.ttl and server.eventlog.ttl
cluster settings.
@tbg
Member

tbg commented Oct 11, 2018

Overlooked a silly lint error. Second time's the charm.

bors r+

@mvijaykarthik
Contributor Author

Thanks @tschottdorf!

@craig
Contributor

craig bot commented Oct 11, 2018

Build failed

@tbg
Member

tbg commented Oct 11, 2018

Flaked on #31246

@tbg
Member

tbg commented Oct 11, 2018

Just triggered the merge for the PR that fixes that, so:

bors r+

craig bot pushed a commit that referenced this pull request Oct 11, 2018
30913: server: add a configuration to enable GC of system.rangelog r=tschottdorf a=mvijaykarthik

The system.rangelog table currently grows without bound. The rate
of growth is slow (as long as there is no replica rebalancing
thrashing), but it can still become a problem in long-running
clusters.

This commit adds cluster settings to specify the interval and TTL
for rows in system.rangelog.
By default, the TTL of system.rangelog is set to 30 days.

Fixes #21260

Release note: Add configuration to enable GC of system.rangelog

31239: sql: attempt to deflake distsql physical planner tests r=tschottdorf a=jordanlewis

Make sure the range cache is populated before verifying things about it.

This seems like a hack, but otherwise I think these will just keep flaking.

Closes #25808.
Closes #31235.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
@craig
Contributor

craig bot commented Oct 11, 2018

Build succeeded
