sql: use a new ClearRange RPC to improve DROP|TRUNCATE efficiency #20601

Merged
merged 1 commit into cockroachdb:master from spencerkimball:fast-drop-table on Dec 14, 2017

Conversation

spencerkimball (Member)

Previously, dropping or truncating a table would result in
significant write amplification as each key in the table was
individually tombstoned, and then individually deleted on
eventual GC.

This change adds a new field to the table descriptor indicating
a GC deadline, after which the contents of the table will be deleted
using a new KV ClearRange API call. This efficiently uses the lower
level storage rocksdb::DeleteRange, which creates a range tombstone
across a key span, usually the entire range.

The ClearRange API call also updates the range's last GC timestamp
to prevent historical reads from incorrectly returning empty results
when reading subsequent to a ClearRange.

A dropped table's GC deadline is now visible in the
crdb_internal.tables table.

Release note (sql change): dramatic improvement in efficiency when
DROP|TRUNCATE TABLE is used.
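
As a rough sketch of the fast path (identifiers taken from hunks quoted in the review below; illustrative only, not the exact diff), the tableDeleter ends up issuing a raw ClearRange request over the table's key span once the GC deadline has passed:

// Once the GC deadline has passed, add a raw ClearRange request spanning
// the table's data instead of issuing per-key tombstones.
crr := &roachpb.ClearRangeRequest{
	Span: roachpb.Span{
		Key:    resume.Key,
		EndKey: resume.EndKey,
	},
}
td.b.AddRawRequest(crr)
if _, err := td.finalize(ctx, traceKV); err != nil {
	return resume, err
}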

@cockroach-teamcity (Member)

This change is Reviewable

@spencerkimball (Member, Author)

@vivekmenezes @tschottdorf I definitely need you both to review this. @petermattis if you have time.

@spencerkimball (Member, Author)

Actually, scratch that. @petermattis need your review on this as well for the C++ changes.

@knz (Contributor) left a comment


I recommend a SHOW KV TRACE FOR ... SQL logic test to confirm the change is observable in traces.

// would not set the GC deadline, but would still expect the data
// to remain historically queryable until the zone's TTL expires.
if td.tableDesc().GCDeadline == 0 {
log.VEventf(ctx, 2, "backwards-compatible DelRange %s - %s", resume.Key, resume.EndKey)
knz (Contributor):

Not sure the string "backwards-compatible" is needed / useful.
At any rate you need to update the condition in sql/show_trace.go ((p *planner) ShowTrace) accordingly.

spencerkimball (Member, Author):

Reverted the "backwards-compatible" bit.

if timeutil.Since(timeutil.Unix(0, td.tableDesc().GCDeadline)) < 0 {
panic(fmt.Sprintf("not past the GC deadline: %s", td.tableDesc()))
}
log.VEventf(ctx, 2, "ClearRange %s - %s", resume.Key, resume.EndKey)
knz (Contributor):

And this needs to be added to (p *planner) ShowTrace too.

spencerkimball (Member, Author):

Added to ShowTrace. Unfortunately it's really difficult to see either of these because both are run through an asynchronous pathway which doesn't show up in SHOW KV TRACE....

@nvanbenschoten (Member)

Reviewed 18 of 20 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/roachpb/method.go, line 41 at r2 (raw file):

	// the value has been deleted.
	Delete
	// DeleteRange creates tombstone valuess for keys which fall between

s/valuess/values/


pkg/sql/drop_table.go, line 315 at r2 (raw file):

	// If the table is not interleaved, use the GCDeadline mechanism
	// to schedule usage of the more efficient ClearRange pathway.

Let's complete this by commenting on why we can't use the GCDeadline mechanism for interleaved tables. Soon enough, we're going to want to when we detect ON DELETE CASCADE (@BramGruneir).


pkg/sql/drop_table.go, line 325 at r2 (raw file):

			return err
		}
		tableDesc.GCDeadline = timeutil.Now().UnixNano() + int64(zoneCfg.GC.TTLSeconds)*1E9

Consider s/1E9/time.Second.Nanoseconds()/
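
For reference, the suggested form would read roughly as follows (a sketch based on the hunk quoted above):

// Deadline = now + the zone's GC TTL, in nanoseconds since the epoch to
// match the units of the descriptor's GCDeadline field.
tableDesc.GCDeadline = timeutil.Now().UnixNano() +
	int64(zoneCfg.GC.TTLSeconds)*time.Second.Nanoseconds()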


pkg/sql/drop_table.go, line 325 at r2 (raw file):

			return err
		}
		tableDesc.GCDeadline = timeutil.Now().UnixNano() + int64(zoneCfg.GC.TTLSeconds)*1E9

Do we make sure that this isn't overridden anywhere?


pkg/sql/tablewriter.go, line 789 at r2 (raw file):

// the underlying data quickly. Unlike DeleteRange, ClearRange doesn't
// leave tombstone data and instead actually removes all of the data,
// eventually freeing the space on the underlying storage engine.

"eventually".. how eventual? After RocksDB compactions?


pkg/storage/batcheval/cmd_clear_range.go, line 73 at r2 (raw file):
Consider something like:

cArgs.Stats.SysCount = 0 // no change
cArgs.Stats.SysBytes = 0 // no change

It makes things a bit easier to skim


pkg/storage/batcheval/cmd_clear_range.go, line 99 at r2 (raw file):

	stateLoader := MakeStateLoader(cArgs.EvalCtx)
	pd.Replicated.State = &storagebase.ReplicaState{}
	if newThreshold != (hlc.Timestamp{}) {

nit: should we just move this to the block above? Seems like we're effectively checking the same condition twice.


Comments from Reviewable

@spencerkimball (Member, Author)

Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/roachpb/method.go, line 41 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/valuess/values/

Done.


pkg/sql/drop_table.go, line 315 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's complete this by commenting on why we can't use the GCDeadline mechanism for interleaved tables. Soon enough, we're going to want to when we detect ON DELETE CASCADE (@BramGruneir).

Done.


pkg/sql/drop_table.go, line 325 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider s/1E9/time.Second.Nanoseconds()/

Done.


pkg/sql/drop_table.go, line 325 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we make sure that this isn't overridden anywhere?

What do you mean overridden?


pkg/sql/tablewriter.go, line 789 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"eventually".. how eventual? After RocksDB compactions?

I cleaned this up. No sense here in commenting on compaction.


pkg/storage/batcheval/cmd_clear_range.go, line 73 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider something like:

cArgs.Stats.SysCount = 0 // no change
cArgs.Stats.SysBytes = 0 // no change

It makes things a bit easier to skim

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 99 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: should we just move this to the block above? Seems like we're effectively checking the same condition twice.

Done.


Comments from Reviewable

@nvanbenschoten (Member)

Review status: 17 of 21 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/sql/drop_table.go, line 325 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

What do you mean overridden?

Is it possible to set a GCDeadline on a table descriptor and then set a later deadline on the same descriptor? I'd expect the same mechanism that prevents us from dropping a table twice would also ensure that this isn't possible, but I'd like to be sure.


Comments from Reviewable

@lgo (Contributor)

lgo commented Dec 10, 2017

Review status: 17 of 21 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


pkg/sql/tablewriter.go, line 810 at r3 (raw file):

	// would not set the GC deadline, but would still expect the data
	// to remain historically queryable until the zone's TTL expires.
	if td.tableDesc().GCDeadline == 0 {

We may want to add a similar TODO here as the one you've added before GCDeadline is set. Both places need to be changed to accomplish the TODO for handling interleaved tables with ON DELETE CASCADE.

(Ugh, this is also a bit chaotic. There is now a slow, fast, and faster path in two functions, for which the path is chosen by two conditions that are far apart. I don't have any good suggestions for how to prevent that. Making 3 different delete functions and consolidating the conditions into helpers in this file doesn't seem like a great alternative, so I'm not asking for anything here.)


pkg/sql/tablewriter.go, line 925 at r3 (raw file):

		log.VEventf(ctx, 2, "DelRange %s - %s", resume.Key, resume.EndKey)
	}
	td.b.DelRange(resume.Key, resume.EndKey, false /* returnKeys */)

We should be able to use ClearRange here to speed up index truncation (for ALTER TABLE ... DROP INDEX schema changes). I'm happy with putting a TODO here for that. cc: @vivekmenezes

I don't know if a user can query from an index AOST after it has been dropped. If they can, we'll have to also do a similar GCDeadline on an IndexDescriptor. If they can't, then this should be a simple change.


Comments from Reviewable

@petermattis (Collaborator)

Review status: 17 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/roachpb/api.proto, line 249 at r3 (raw file):

// transactional intents and writes tombstones to the deleted
// keys. ClearRange is used when permanently dropping or truncating
// table data.

What bad things will happen if a ClearRange is executed on a range that is seeing concurrent transactional writes? I assume all of the transactions will be aborted, but does that actually occur? Are the stats properly maintained?


pkg/sql/drop_table.go, line 325 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is it possible to set a GCDeadline on a table descriptor and then set a later deadline on the same descriptor? I'd expect the same mechanism that prevents us from dropping a table twice would also ensure that this isn't possible, but I'd like to be sure.

Also, it is probably worth a bit of paranoia to ensure that GCDeadline is some reasonable time in the future. If zoneCfg.GC.TTLSeconds were 1, I think we'd run into problems with the ClearRange happening concurrently with transactions.


pkg/sql/schema_changer.go, line 1068 at r3 (raw file):

							// instead of the standard delay.
							if table.Dropped() && table.GCDeadline != 0 {
								schemaChanger.execAfter = timeutil.Unix(0, table.GCDeadline)

This might be increasing execAfter significantly since the default is 6m. Is that problematic? I'm not too familiar with this code and don't understand what else this loop does.


pkg/sql/tablewriter.go, line 836 at r3 (raw file):

	}
	td.b.AddRawRequest(crr)
	if _, err := td.finalize(ctx, traceKV); err != nil {

This will fail in mixed-version clusters. I think we need to protect this code path with a version upgrade check.


pkg/storage/batcheval/cmd_clear_range.go, line 50 at r3 (raw file):

// Note that "correct" use of this command is only possible for key
// spans consisting of user data that we know is not being written to
// or queried any more, such as after a DROP or TRUNCATE table.

Is there anything else we can do to enforce "correct" usage? For example, can we check the timestamp cache to verify user keys have not been accessed recently?


Comments from Reviewable

@petermattis (Collaborator)

:lgtm: The C++ bits look good.


Review status: 17 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


Comments from Reviewable

@lgo (Contributor)

lgo commented Dec 10, 2017

Review status: 17 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/sql/schema_changer.go, line 1068 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This might be increasing execAfter significantly since the default is 6m. Is that problematic? I'm not too familiar with this code and don't understand what else this loop does.

I think this should be fine. If this wasn't a drop table mutation, it would have an impact. Any future mutations would be delayed until after this time.

Basically, when a gossip update for a table descriptor occurs or a schema change finishes, a timer is reset (timer = s.newTimer()) to fire at the minimum execAfter of each table's next schema change. If the drop table is the next schema change, it should always be the last one also.


Comments from Reviewable

@spencerkimball (Member, Author)

Review status: 17 of 21 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/roachpb/api.proto, line 249 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What bad things will happen if a ClearRange is executed on a range that is seeing concurrent transactional writes? I assume all of the transactions will be aborted, but does that actually occur? Are the stats properly maintained?

  • If there were concurrent writes, they'd be sequenced to either be before or after the ClearRange has entered the command queue and completed.
  • Stats would be properly maintained.
  • If the ClearRange summarily disposed of a txn'al intent, the transaction would experience an inconsistency.

However, the above points are irrelevant. ClearRange is never intended for use on a range of keys which can have any activity. This is enforced by the schema changer, which guarantees that there are no active transactions with a lease on the table.

I've added a comment here indicating that this method must only be called on key spans which are guaranteed to be inactive and will never see future writes.


pkg/sql/drop_table.go, line 325 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Also, it is probably worth a bit of paranoia to ensure that GCDeadline is some reasonable time in the future. If zoneCfg.GC.TTLSeconds were 1, I think we'd run into problems with the ClearRange happening concurrently with transactions.

@nvanbenschoten Yes, it is not possible. A table cannot be dropped twice and this is the only code path which updates GCDeadline. However, even if GCDeadline were to be updated at a future point (I'd like to support this), it would not lead to any problems with correctness. This is why I've taken care to update the GC threshold on affected ranges when ClearRange is invoked.

The TTL which we use to GC is advisory and subject to change for any range in the system. Readers that might have been counting on historical reads at some timestamp which is suddenly past an aggressively reduced TTL will get a GC threshold error, which is the correct outcome.

@petermattis that should never be a problem because of table leases. I expect many operators to set a table's or database's TTL to 0 and then drop.


pkg/sql/schema_changer.go, line 1068 at r3 (raw file):

Previously, lego (Joey Pereira) wrote…

I think this should be fine. If this wasn't a drop table mutation, it would have an impact. Any future mutations would be delayed until after this time.

Basically, when a gossip update for a table descriptor occurs or a schema change finishes, a timer is reset (timer = s.newTimer()) to fire at the minimum execAfter of each table's next schema change. If the drop table is the next schema change, it should always be the last one also.

Should be fine. Each one of these schema changers for a table is run independently based on the shortest execAfter deadline amongst the pending changers.

Separately, the 6m thing seems arbitrary and I think it should be reconsidered. I don't believe it achieves anything meaningful. To the extent it is a good idea, it should maybe be jittered...?


pkg/sql/tablewriter.go, line 810 at r3 (raw file):

Previously, lego (Joey Pereira) wrote…

We may want to add a similar TODO here as the one you've added before GCDeadline is set. Both places need to be changed to accomplish the TODO for handling interleaved tables with ON DELETE CASCADE.

(Ugh, this is also a bit chaotic. There is now a slow, fast, and faster path in two functions, for which the path is chosen by two conditions that are far apart. I don't have any good suggestions for how to prevent that. Making 3 different delete functions and consolidating the conditions into helpers in this file doesn't seem like a great alternative, so I'm not asking for anything here.)

This code will remain the same, even after we support this faster path with interleaved tables. The point of this isn't to deal with interleaved tables, but with the migration between cockroach v1.1 and v2.0. If a v1.1 server initiates a drop, it won't set the GCDeadline, which means we need to still rely on DeleteRange instead of ClearRange and the eventual range GC.

Yes, supporting this legacy cruft gets ugly. I've moved this into a new method named legacyDeleteAllRowsFast to manage the complexity a bit.

...and lo and behold making this change reminds me that this requires a new feature version, requiring the full cluster to upgrade in order to make use of this new faster drop/truncate path. So thank you for complaining.
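
A minimal sketch of the resulting split, assuming a helper with the legacyDeleteAllRowsFast name mentioned above (its exact signature is an assumption):

// A v1.1 node that initiated the drop never sets GCDeadline, so fall back
// to the tombstone-based DelRange path for compatibility.
if td.tableDesc().GCDeadline == 0 {
	return td.legacyDeleteAllRowsFast(ctx, resume, limit, traceKV)
}
// Otherwise the deadline must already have passed before we ClearRange.
if timeutil.Since(timeutil.Unix(0, td.tableDesc().GCDeadline)) < 0 {
	log.Fatalf(ctx, "not past the GC deadline: %s", td.tableDesc())
}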


pkg/sql/tablewriter.go, line 836 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This will fail in mixed-version clusters. I think we need to protect this code path with a version upgrade check.

Done.


pkg/sql/tablewriter.go, line 925 at r3 (raw file):

Previously, lego (Joey Pereira) wrote…

We should be able to use ClearRange here to speed up index truncation (for ALTER TABLE ... DROP INDEX schema changes). I'm happy with putting a TODO here for that. cc: @vivekmenezes

I don't know if a user can query from an index AOST after it has been dropped. If they can, we'll have to also do a similar GCDeadline on an IndexDescriptor. If they can't, then this should be a simple change.

Adding a TODO. I've spent too long already this weekend understanding a small bit of SQL land. Probably best to get some miles on this fast path before diving into index truncation. My guess is that we'll need to do the same thing as with tables because I don't see any reason we'd be disallowed from reading from indexes historically. But take that with a grain of salt.


pkg/storage/batcheval/cmd_clear_range.go, line 50 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is there anything else we can do to enforce "correct" usage? For example, can we check the timestamp cache to verify user keys have not been accessed recently?

Not sure what the threshold would be when examining the timestamp cache. In any case, I'm not too worried about concurrent reads because we update the GC threshold value – we should correctly return an error for any subsequent reads which arrive after the ClearRange. We're trusting the table drop schema changer to make subsequent writes impossible. I'm open to other suggestions.


Comments from Reviewable

@lgo (Contributor)

lgo commented Dec 10, 2017

Review status: 16 of 22 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


pkg/sql/tablewriter.go, line 925 at r3 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Adding a TODO. I've spent too long already this weekend understanding a small bit of SQL land. Probably best to get some miles on this fast path before diving into index truncation. My guess is that we'll need to do the same thing as with tables because I don't see any reason we'd be disallowed from reading from indexes historically. But take that with a grain of salt.

👏 kudos for becoming a SQL intern for the weekend! :)


Comments from Reviewable

@spencerkimball (Member, Author)

Review status: 15 of 23 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


pkg/sql/tablewriter.go, line 925 at r3 (raw file):

Previously, lego (Joey Pereira) wrote…

👏 kudos for becoming a SQL intern for the weekend! :)

It's been very rewarding!

I still want to tackle the task of post-hoc adjustments to the gc deadline in order to allow operators to free up the space faster if that becomes an issue.


Comments from Reviewable

@bdarnell (Member)

:lgtm: with some naming nits.


Reviewed 14 of 20 files at r1, 1 of 3 files at r2, 1 of 4 files at r3, 7 of 7 files at r4.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 258 at r4 (raw file):

  Span header = 1 [(gogoproto.nullable) = false, (gogoproto.embed) = true];
  util.hlc.Timestamp gc_threshold = 2 [(gogoproto.nullable) = false,

Document this field. Why is it different from the timestamp at which the command executes?


pkg/settings/cluster/cockroach_versions.go, line 104 at r4 (raw file):

		// VersionClearRange is https://github.com/cockroachdb/cockroach/pull/20601.
		Key:     VersionClearRange,
		Version: roachpb.Version{Major: 1, Minor: 1, Unstable: 5},

@solongordon 's #20587 is also creating this version; whichever one commits second will have to change to 1.1-6.


pkg/sql/tablewriter.go, line 820 at r4 (raw file):

			EndKey: resume.EndKey,
		},
		GCThreshold: hlc.Timestamp{WallTime: td.tableDesc().GCDeadline},

Why is it GCThreshold in ClearRangeRequest and GCDeadline in the table descriptor? GCThreshold and GCDeadline sound awfully similar; if they're not the same they should have names that indicate their meanings. (Isn't GCDeadline really GCAfter?)


pkg/storage/batcheval/cmd_clear_range.go, line 104 at r4 (raw file):

	// Clear the key span.
	return pd, batch.ClearRange(from, to)

According to comments in clearRangeData, it's bad to call ClearRange for small amounts of data, so it falls back to individual deletions in this case. Do we need to do the same here?


Comments from Reviewable

@tbg (Member)

tbg commented Dec 11, 2017

In master...tschottdorf:experiment/fastdrops I made ClearRange transactional because we have this limitation about not allowing range-spanning transactions. I don't see that restriction lifted here, yet you claim you're not using a Txn. Either you're really using one and getting away with it or you're not using one and none of the tests exercises a multi-range table drop. The first one will be obvious once my comments are addressed but we need a test that drops a table that has been split (this should be easy with logictests and the ALTER TABLE SPLIT AT syntax).

I think you're secretly using a txn, because of this code

func (td *tableDeleter) init(txn *client.Txn) error {
	td.txn = txn
	td.b = txn.NewBatch()
	return nil
}

the batch of which you are then using.

On the plus side, looking at the range-spanning-enforcement code

			if ba.Txn == nil && ba.IsPossibleTransaction() && ba.ReadConsistency != roachpb.INCONSISTENT {
				// TODO(nvanbenschoten): if this becomes an issue for RangeLookup
				// scans, we could look into changing IsPossibleTransaction for
				// ScanRequest/ReverseScanRequest to detect RangeLookups.
				responseCh <- response{pErr: roachpb.NewError(&roachpb.OpRequiresTxnError{})}
				return
			}

it seems that this would do the right thing, since IsPossibleTransaction is false.

Basically LGTM, though some work is left to do in the comments.


Review status: all files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.


pkg/roachpb/method.go, line 45 at r4 (raw file):

removes all values (including all of their versions)


pkg/sql/drop_table.go, line 318 at r4 (raw file):

	// If the table is not interleaved and the ClearRange feature is
	// enabled in the cluster, use the GCDeadline mechanism to schedule
	// usage of the more efficient ClearRange pathway.  ClearRange will

nit: double space


pkg/sql/drop_table.go, line 326 at r4 (raw file):

	// able to use this faster mechanism.
	if !tableDesc.IsInterleaved() &&
		p.ExecCfg().Settings.Version.IsMinSupported(cluster.VersionClearRange) {

IsActive


pkg/sql/drop_test.go, line 53 at r4 (raw file):

// Ensure that the zone config matches.

Also should be a line down, inside the expected != nil


pkg/sql/drop_test.go, line 67 at r4 (raw file):

			return err
		}
		if e := expected; !reflect.DeepEqual(*e, cfg) {

NYC, but don't we auto-generate an Equal method on all protos? Should be able to write e.Equal(cfg).


pkg/sql/schema_changer.go, line 1068 at r3 (raw file):

Separately, the 6m thing seems arbitrary and I think it should be reconsidered.

+1. I brought this up in a past review in which I touched this code and the impression I got was that this could easily go away (or be reduced to basically nothing). I think there's some interaction with synchronous schema changes which we don't want picked up by the async method here? @vivekmenezes would know.


pkg/sql/schema_changer_test.go, line 2512 at r4 (raw file):

	// Add a zone config.
	cfg := config.DefaultZoneConfig()
	cfg.GC.TTLSeconds = 0 // Set TTL so the data is deleted immediately.

nit: remove trailing dot


pkg/sql/tablewriter.go, line 812 at r4 (raw file):

	// Assert that GC deadline is in the past.
	if timeutil.Since(timeutil.Unix(0, td.tableDesc().GCDeadline)) < 0 {
		panic(fmt.Sprintf("not past the GC deadline: %s", td.tableDesc()))

log.Fatalf(ctx, "not past the GC deadline: %s", td.tableDesc())


pkg/sql/tablewriter.go, line 820 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Why is it GCThreshold in ClearRangeRequest and GCDeadline in the table descriptor? GCThreshold and GCDeadline sound awfully similar; if they're not the same they should have names that indicate their meanings. (Isn't GCDeadline really GCAfter?)

I'm not quite sure whether setting GCThreshold is the right thing to do here in the first place. It certainly doesn't hurt, but does it achieve anything meaningful? Since ClearRange removes all versions older or newer, and given the constraint that for the sake of avoiding HLC-overbumping bugs we don't want to "invent" timestamps, if you chose a timestamp here you could still have an in-flight write at a higher timestamp that commits before the ClearRange and promptly gets wiped out (but isn't covered by the GCThreshold).

I think the right thing to do is to not send a GCThreshold along, but to change the evaluation function of ClearRangeRequest so that it sets a GCThreshold of evalCtx.Clock().Now(). This in turn guarantees that the timestamp is larger than any other previously committed write and we actually protect against any incorrect reads.
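
A sketch of that suggestion inside the ClearRange eval function, assuming the clock is plumbed through the EvalContext as noted below and that ReplicaState carries a GCThreshold field:

// Bump the range's GC threshold to the node's current clock reading so that
// later historical reads below this timestamp fail with a threshold error
// rather than silently observing an empty span.
var pd result.Result
gcThreshold := cArgs.EvalCtx.GetGCThreshold()
gcThreshold.Forward(cArgs.EvalCtx.Clock().Now())
pd.Replicated.State = &storagebase.ReplicaState{}
pd.Replicated.State.GCThreshold = gcThreshold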


pkg/sql/tablewriter.go, line 930 at r4 (raw file):

		log.VEventf(ctx, 2, "DelRange %s - %s", resume.Key, resume.EndKey)
	}
	// TODO(vivekmenezes): adapt index deletion to use the same GC

Seems worth doing this sooner rather than later. Does doing that change anything in the design of (the SQL parts of) this feature? Would we just add a similar deadline on the index descriptor?


pkg/storage/batcheval/cmd_clear_range.go, line 59 at r4 (raw file):

) (result.Result, error) {
	log.VEventf(ctx, 2, "ClearRange %+v", cArgs.Args)

Check that there's no Txn.


pkg/storage/batcheval/cmd_clear_range.go, line 76 at r4 (raw file):

		statsCurrent := cArgs.EvalCtx.GetMVCCStats()
		cArgs.Stats.Subtract(statsCurrent)
		cArgs.Stats.SysCount = 0 // no change

I'd prefer if you set the zeroes on statsCurrent before subtraction.


pkg/storage/batcheval/cmd_clear_range.go, line 78 at r4 (raw file):

		cArgs.Stats.SysCount = 0 // no change
		cArgs.Stats.SysBytes = 0 // no change
	} else {

Stats yoga always makes me suspicious. Change this code to assert that both are equal in race builds, i.e. something like

fast := desc.StartKey.Equal(args.Key) && desc.EndKey.Equal(args.EndKey)
var subtractStats enginepb.MVCCStats
var assertStats enginepb.MVCCStats

if fast {
  subtractStats = cArgs.EvalCtx.GetMVCCStats()
  subtractStats.SysCount, subtractStats.SysBytes = 0, 0 // no changes
  if util.RaceEnabled {
    assertStats = subtractStats
  }
}
if !fast || util.RaceEnabled {
  /* do the iterator thing without actually subtracting */
}

if util.RaceEnabled {
  if !assertStats.Equal(subtractStats) {
    log.Fatalf(ctx, "fast-path MVCCStats computation gave wrong result: diff(fast, slow) = %s", pretty.Diff(assertStats, subtractStats))
  }
}

cArgs.Stats.Subtract(subtractStats)

pkg/storage/batcheval/cmd_clear_range.go, line 94 at r4 (raw file):

	var pd result.Result
	gcThreshold := cArgs.EvalCtx.GetGCThreshold()
	gcThreshold.Forward(args.GCThreshold)

You'd use cArgs.EvalCtx.Clock().Now() here instead. (you'll have to plumb the clock but it's easy).


pkg/storage/batcheval/cmd_clear_range.go, line 104 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

According to comments in clearRangeData, it's bad to call ClearRange for small amounts of data, so it falls back to individual deletions in this case. Do we need to do the same here?

I think simple thresholding might work well here. If subtractStats.Total() exceeds some magic number (512KB?) then use a range deletion.


Comments from Reviewable

@vivekmenezes (Contributor)

Review status: all files reviewed at latest revision, 21 unresolved discussions, all commit checks successful.


pkg/sql/schema_changer.go, line 1068 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Separately, the 6m thing seems arbitrary and I think it should be reconsidered.

+1. I brought this up in a past review in which I touched this code and the impression I got was that this could easily go away (or be reduced to basically nothing). I think there's some interaction with synchronous schema changes which we don't want picked up by the async method here? @vivekmenezes would know.

Yeah the 6 minute interval is pretty arbitrary. It was put in to allow the synchronous path to grab the schema change lease before anyone else. #20488 has been filed.


Comments from Reviewable

@vivekmenezes (Contributor) left a comment

Nicely done! My main comment is to not change the TableDescriptor at all and use ModificationTime + TTL to know when to delete the data.

// the timestamp after which the table's contents will be cleared
// from the cluster. In nanoseconds since the epoch.
optional int64 gc_deadline = 29 [(gogoproto.nullable) = false,
(gogoproto.customname) = "GCDeadline"];
vivekmenezes (Contributor):

There isn't a need for this. What you can do is use the ModificationTime on the table and ClearRange when (!UpVersion && CurrentTime > ModificationTime + TTL)
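
A sketch of that alternative (descriptor field names here are assumptions drawn from this comment, not the actual change):

// canClearRange derives the point after which the data may be cleared from
// ModificationTime + TTL, avoiding a new descriptor field.
func canClearRange(table *sqlbase.TableDescriptor, ttlSeconds int32) bool {
	deadline := table.ModificationTime.WallTime +
		int64(ttlSeconds)*time.Second.Nanoseconds()
	return !table.UpVersion && timeutil.Now().UnixNano() > deadline
}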

// instead of the standard delay.
if table.Dropped() && table.GCDeadline != 0 {
schemaChanger.execAfter = timeutil.Unix(0, table.GCDeadline)
}
vivekmenezes (Contributor):

this is safe to do!

td.b.AddRawRequest(crr)
if _, err := td.finalize(ctx, traceKV); err != nil {
return resume, err
}
vivekmenezes (Contributor):

Is this expected to be fast? The reason I ask is because we don't want it getting in the way of user traffic.

@@ -889,6 +927,8 @@ func (td *tableDeleter) deleteIndexFast(
if traceKV {
log.VEventf(ctx, 2, "DelRange %s - %s", resume.Key, resume.EndKey)
}
// TODO(vivekmenezes): adapt index deletion to use the same GC
// deadline / ClearRange fast path that table deletion uses.
vivekmenezes (Contributor):

noted!

@spencerkimball (Member, Author)

Yes, we're definitely using a txn. Are you saying we shouldn't because it's just confusing semantics as ClearRange doesn't behave even remotely transactionally? I think that's correct – it is confusing and doesn't seem right to allow ClearRange within a txn. Modified to run in a non-transactional batch.
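
Roughly, the non-transactional form would look like the following sketch (db stands in for a *client.DB handle; this assumes the kv client Batch API rather than reproducing the exact diff):

// Send ClearRange in its own non-transactional batch rather than through the
// schema changer's txn, since the request isn't transactional.
b := &client.Batch{}
b.AddRawRequest(&roachpb.ClearRangeRequest{
	Span: roachpb.Span{Key: resume.Key, EndKey: resume.EndKey},
})
if err := db.Run(ctx, b); err != nil {
	return resume, err
}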


Review status: all files reviewed at latest revision, 25 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 258 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Document this field. Why is it different from the timestamp at which the command executes?

Actually I think that suggestion is better. I'm removing gc_threshold from the proto.


pkg/roachpb/method.go, line 45 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

removes all values (including all of their versions)

Done.


pkg/settings/cluster/cockroach_versions.go, line 104 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

@solongordon 's #20587 is also creating this version; whichever one commits second will have to change to 1.1-6.

Blegh


pkg/sql/drop_table.go, line 318 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: double space

Done.


pkg/sql/drop_table.go, line 326 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

IsActive

Done.


pkg/sql/drop_test.go, line 53 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

// Ensure that the zone config matches.

Also should be a line down, inside the expected != nil

Done.


pkg/sql/drop_test.go, line 67 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

NYC, but don't we auto-generate an Equal method on all protos? Should be able to write e.Equal(cfg).

Done.


pkg/sql/schema_changer.go, line 1069 at r4 (raw file):

Previously, vivekmenezes wrote…

this is safe to do!

Thanks for weighing in.


pkg/sql/schema_changer_test.go, line 2512 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: remove trailing dot

Done.


pkg/sql/tablewriter.go, line 812 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

log.Fatalf(ctx, "not past the GC deadline: %s", td.tableDesc())

Done.


pkg/sql/tablewriter.go, line 820 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'm not quite sure whether setting GCThreshold is the right thing to do here in the first place. It certainly doesn't hurt, but does it achieve anything meaningful? Since ClearRange removes all versions older or newer, and given the constraint that for the sake of avoiding HLC-overbumping bugs we don't want to "invent" timestamps, if you chose a timestamp here you could still have an in-flight write at a higher timestamp that commits before the ClearRange and promptly gets wiped out (but isn't covered by the GCThreshold).

I think the right thing to do is to not send a GCThreshold along, but to change the evaluation function of ClearRangeRequest so that it sets a GCThreshold of evalCtx.Clock().Now(). This in turn guarantees that the timestamp is larger than any other previously committed write and we actually protect against any incorrect reads.

I've removed GCThreshold from the ClearRangeRequest. I'm instead just setting the GC threshold on the range to the request batch timestamp. Actually, you're right – better to use the current clock's now() method.


pkg/sql/tablewriter.go, line 825 at r4 (raw file):

Previously, vivekmenezes wrote…

Is this expected to be fast? The reason I ask is because we don't want it getting in the way of user traffic.

The expectation is that this will be very fast.


pkg/sql/tablewriter.go, line 930 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Seems worth doing this sooner rather than later. Does doing that change anything in the design of (the SQL parts of) this feature? Would we just add a similar deadline on the index descriptor?

I think it would be analogous, but I haven't looked into the index descriptor code carefully to see if that's true. I agree we should do this asap. @vivekmenezes, whose plate should we put this on?


pkg/sql/tablewriter.go, line 931 at r4 (raw file):

Previously, vivekmenezes wrote…

noted!

Done.


pkg/sql/sqlbase/structured.proto, line 632 at r4 (raw file):

Previously, vivekmenezes wrote…

There isn't a need for this. What you can do is use the ModificationTime on the table and ClearRange when (!UpVersion && CurrentTime > ModificationTime +TTL)

The GCDeadline has the significant advantage of clarity. A derived GC deadline has the advantage of respecting a change in zone TTLSeconds, although we'd need to be careful to ping the schema changer manager to recompute execAfter settings per schema changer and reset the timer if necessary. However, making this explicit feels a little less fragile.


pkg/storage/batcheval/cmd_clear_range.go, line 59 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Check that there's no Txn.

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 76 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I'd prefer if you set the zeroes on statsCurrent before subtraction.

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 78 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Stats yoga always makes me suspicious. Change this code to assert that both are equal in race builds, i.e. something like

fast := desc.StartKey.Equal(args.Key) && desc.EndKey.Equal(args.EndKey)
var subtractStats enginepb.MVCCStats
var assertStats enginepb.MVCCStats

if fast {
  subtractStats = cArgs.EvalCtx.GetMVCCStats()
  subtractStats.SysCount, subtractStats.SysBytes = 0, 0 // no changes
  if util.RaceEnabled {
    assertStats = subtractStats
  }
}
if !fast || util.RaceEnabled {
  /* do the iterator thing without actually subtracting */
}

if util.RaceEnabled {
  if !assertStats.Equal(subtractStats) {
    log.Fatalf(ctx, "fast-path MVCCStats computation gave wrong result: diff(fast, slow) = %s", pretty.Diff(assertStats, subtractStats))
  }
}

cArgs.Stats.Subtract(subtractStats)

Done. Good idea.


pkg/storage/batcheval/cmd_clear_range.go, line 94 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You'd use cArgs.EvalCtx.Clock().Now() here instead. (you'll have to plumb the clock but it's easy).

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 104 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I think simple thresholding might work well here. If subtractStats.Total() exceeds some magic number (512KB?) then use a range deletion.

Done and added a unittest to verify.


Comments from Reviewable

@spencerkimball spencerkimball force-pushed the fast-drop-table branch 2 times, most recently from 672815c to 0e37895 on December 12, 2017 00:16
@vivekmenezes (Contributor)

Review status: 15 of 28 files reviewed at latest revision, 25 unresolved discussions, all commit checks successful.


pkg/sql/tablewriter.go, line 930 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I think it would be analogous, but I haven't looked into the index descriptor code carefully to see if that's true. I agree we should do this asap. @vivekmenezes, who's plate should we put this on?

You can put it on my plate. Once you've merged this change I'll send you a PR for it. I suspect I'll just Clear the index data rather than wait on it. One problem with waiting is that the current schema changer processes schema changes one at a time, so waiting would hold up all other schema changes on the table. Since the index is derived data there isn't a great danger in clearing out the index.


Comments from Reviewable

@vivekmenezes (Contributor)

Review status: 15 of 28 files reviewed at latest revision, 22 unresolved discussions, all commit checks successful.


pkg/sql/sqlbase/structured.proto, line 632 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

The GCDeadline has the significant advantage of clarity. A derived GC deadline has the advantage of respecting a change in zone TTLSeconds, although we'd need to be careful to ping the schema changer manager to recompute execAfter settings per schema changer and reset the timer if necessary. However, making this explicit feels a little less fragile.

I'm okay with keeping it in. Not a big deal really


Comments from Reviewable

@spencerkimball (Member, Author)

@tschottdorf does this still require additional review?

@tbg (Member)

tbg commented Dec 13, 2017

:lgtm: with a few minor comments!


Review status: 15 of 28 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


pkg/sql/drop_test.go, line 217 at r5 (raw file):

		SQLSchemaChanger: &sql.SchemaChangerTestingKnobs{
			// Turn on quick garbage collection.
			AsyncExecQuickly: true,

Do you still need this? Would be good to see that a 0 TTL drops the table right away even in the real world where this knob isn't set.


pkg/sql/drop_test.go, line 595 at r5 (raw file):

		SQLSchemaChanger: &sql.SchemaChangerTestingKnobs{
			// Turn on quick garbage collection.
			AsyncExecQuickly: true,

Do you still need this? Would be good to see that a 0 TTL drops the table right away even in the real world where this knob isn't set.


pkg/sql/tablewriter.go, line 930 at r4 (raw file):

Previously, vivekmenezes wrote…

You can put it on my plate. Once you've merged this change I'll send you a PR for it. I suspect I'll just Clear the index data rather than wait on it. One problem with waiting is that the current schema changer processes schema changes one at a time, so waiting would hold up all other schema changes on the table. Since the index is derived data there isn't a great danger in clearing out the index.

I filed #20696 to track this. I think it needs a bit of discussion due to time travel complications with this approach. Not insurmountable, but out of scope for this PR.


pkg/storage/batcheval/cmd_clear_range.go, line 104 at r5 (raw file):

	// clearRangeBytesThreshold, clear the individual values manually,
	// instead of using a range tombstone (inefficient for small ranges).
	if delta := statsDelta.KeyBytes + statsDelta.ValBytes; delta < clearRangeBytesThreshold {

if total := statsDelta.Total(); ...


pkg/storage/batcheval/cmd_clear_range.go, line 106 at r5 (raw file):

	if delta := statsDelta.KeyBytes + statsDelta.ValBytes; delta < clearRangeBytesThreshold {
		log.VEventf(ctx, 2, "delta=%d < threshold=%d; using non-range clear", delta, clearRangeBytesThreshold)
		return pd, batch.Iterate(

Don't return pd on error. (It wouldn't cause problems, but looks fishy).


pkg/storage/batcheval/cmd_clear_range.go, line 135 at r5 (raw file):

	fast := desc.StartKey.Equal(from.Key) && desc.EndKey.Equal(to.Key)
	if fast {
		// Note that it is valid to use the full range MVCC stats, as

You need to also declare a full range-local span for this method to make this statement true. You implicitly do because you declare GCThreshold, but I think it's good to be explicit here (since contention is of no concern anyway).


pkg/storage/batcheval/cmd_clear_range.go, line 155 at r5 (raw file):

		if fast {
			if !delta.Equal(computed) {
				log.Fatalf(ctx, "fast-path MVCCStats copmutation gave wrong result: diff(fast, computed) = %s",

a cop-mutation?

✂-1


pkg/storage/batcheval/cmd_clear_range_test.go, line 64 at r5 (raw file):

	var value roachpb.Value
	value.SetString(valueStr) // 10KiB
	halfFull := clearRangeBytesThreshold / len(valueStr) / 2

nit: clearRangeBytesThreshold / (2*len(valueStr)) avoids worrying about precedence here.


pkg/storage/batcheval/cmd_clear_range_test.go, line 129 at r5 (raw file):

			newStats.SysBytes, newStats.SysCount = 0, 0 // ignore these values
			delta.SysBytes, delta.SysCount = 0, 0       // these too, as GC threshold is updated
			newStats.Add(delta)

delta.AgeTo(newStats.LastUpdateNanos) for completeness. (delta will be ahead anyway).


Comments from Reviewable

@solongordon (Contributor)

Review status: 15 of 28 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed.


pkg/settings/cluster/cockroach_versions.go, line 104 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Blegh

Sorry, guess I won the race here.


Comments from Reviewable

@spencerkimball (Member, Author)

Review status: 8 of 28 files reviewed at latest revision, 19 unresolved discussions.


pkg/settings/cluster/cockroach_versions.go, line 104 at r4 (raw file):

Previously, solongordon wrote…

Sorry, guess I won the race here.

Done.


pkg/sql/drop_test.go, line 217 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Do you still need this? Would be good to see that a 0 TTL drops the table right away even in the real world where this knob isn't set.

We still need the knob for other schema changes which are tested. I've removed it here.


pkg/sql/drop_test.go, line 595 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Do you still need this? Would be good to see that a 0 TTL drops the table right away even in the real world where this knob isn't set.

Removed this instance as well.


pkg/sql/sqlbase/structured.proto, line 632 at r4 (raw file):

Previously, vivekmenezes wrote…

I'm okay with keeping it in. Not a big deal really

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 104 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

if total := statsDelta.Total(); ...

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 106 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Don't return pd on error. (It wouldn't cause problems, but looks fishy).

Done.


pkg/storage/batcheval/cmd_clear_range.go, line 135 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You need to also declare a full range-local span for this method to make this statement true. You implicitly do because you declare GCThreshold, but I think it's good to be explicit here (since contention is of no concern anyway).

But range-local stats are counted only as SysCount and SysBytes, which are explicitly ignored. I'll add to the comment.


pkg/storage/batcheval/cmd_clear_range.go, line 155 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

a cop-mutation?

✂-1

Nicely done.


pkg/storage/batcheval/cmd_clear_range_test.go, line 64 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: clearRangeBytesThreshold / (2*len(valueStr)) avoids worrying about precedence here.

Done.


pkg/storage/batcheval/cmd_clear_range_test.go, line 129 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

delta.AgeTo(newStats.LastUpdateNanos) for completeness. (delta will be ahead anyway).

Done.


Comments from Reviewable

@spencerkimball spencerkimball merged commit 255a75a into cockroachdb:master Dec 14, 2017
@spencerkimball spencerkimball deleted the fast-drop-table branch December 14, 2017 18:34
@tbg (Member)

tbg commented Dec 14, 2017

🎉


Reviewed 12 of 20 files at r1, 1 of 3 files at r2, 2 of 7 files at r4, 13 of 13 files at r5, 11 of 11 files at r6.
Review status: all files reviewed at latest revision, 12 unresolved discussions, all commit checks successful.


pkg/storage/batcheval/cmd_clear_range.go, line 106 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Not done.


Comments from Reviewable

spencerkimball added a commit to spencerkimball/cockroach that referenced this pull request Dec 14, 2017
@spencerkimball (Member, Author)

pkg/storage/batcheval/cmd_clear_range.go, line 106 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Not done.

Shoot. Sending another PR.


Comments from Reviewable
