kv: don't be pessimistic about txn that can't refresh #30074

Open
wants to merge 1 commit into master from andreimatei:span-refresher-no-pessimism

Conversation

5 participants
@andreimatei
Member

andreimatei commented Sep 11, 2018

Before this patch, the span refresher had code to:
// If the transaction will retry and the refresh spans are
// exhausted, return a non-retryable error indicating that the
// transaction is too large and should potentially be split.
// We do this to avoid endlessly retrying a txn likely to refail.

This was uncharacteristically pessimistic for cockroach. There's no
particular indication that the txn in question will need a refresh upon
restart. We don't engage in such judgements anywhere else. This patch
removes this behavior.
Instead of removing it, another option would be to eagerly generate a
retryable error. However, that would probably not be a useful thing to
do given the behavior of the other layers: the server side generally
tries to let a txn run as long as possible before forcing a restart in
the hope that it will lay down more intents, and the SQL layer
implements a policy of eager restarts while the txn can still be
"auto-retried".

Relatedly, I think the protection that we had was deficient because it
didn't handle the txn's RefreshTimestamp, only its OrigTimestamp.
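
For illustration only, a variant of the check that also accounted for a prior refresh might forward the comparison baseline past the refreshed timestamp before concluding that the txn "will retry". The snippet below is a sketch and not part of this patch; the surrounding variables and exact field names are assumptions:

// Sketch only: consider that the txn "will retry" only if its provisional
// timestamp has moved past both the original timestamp and any previously
// refreshed timestamp. Field names are assumed for illustration.
baseTS := br.Txn.OrigTimestamp
baseTS.Forward(br.Txn.RefreshedTimestamp)
willRetry := baseTS != br.Txn.Timestamp
if sr.refreshInvalid && willRetry {
    // ... return the "too large" error, as the old code did ...
}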

Release note: None

@andreimatei andreimatei requested a review from cockroachdb/core-prs as a code owner Sep 11, 2018


kv: don't be pessimistic about txn that can't refresh

@andreimatei andreimatei force-pushed the andreimatei:span-refresher-no-pessimism branch from 6cbd20a to 67a6045 Sep 11, 2018

@andreimatei

Member

andreimatei commented Sep 11, 2018

Spencer, @nvanbenschoten and I submit this for your consideration.

// We do this to avoid endlessly retrying a txn likely to refail.
if sr.refreshInvalid && (br.Txn.OrigTimestamp != br.Txn.Timestamp) {
    return nil, roachpb.NewErrorWithTxn(
        errors.New("transaction is too large to complete; try splitting into pieces"), br.Txn,
    )
}


@spencerkimball

spencerkimball Sep 11, 2018

Member

Note that this was added when we had workloads that created massive transactions by executing statements like DELETE FROM d.t WHERE timestamp > x. These would retry endlessly because they could never finish as long as their timestamp kept being pushed forward. I would not be in favor of removing this protection. Perhaps you could enhance it instead.

@andreimatei
Member

andreimatei left a comment

Well, we have all sorts of workloads where transactions can end up retrying endlessly, and we don't have similar protections anywhere else. The answer was always a combination of:
a) the client can stop retrying when they've had it (assuming the retries are under the client's control)
b) the client can close the connection to signal cancelation when retries are not under the client's control (and, separately, perhaps we should add some support for the pgwire cancelation protocol); see the sketch after this list
c) we should find ways to give clients more policy control on a per-session / per-query / per-txn basis around timeouts/deadlines/resource caps (#5924).
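
To make (a) and (b) concrete, here is a rough client-side sketch in Go using database/sql: the client bounds its own retries and sets a context deadline, which (assuming, as discussed, that dropping the connection or context actually cancels the right server-side context) stops the work instead of letting it spin forever. The DSN, table, predicate, and retry budget are placeholders for illustration, not recommendations:

package main

import (
    "context"
    "database/sql"
    "fmt"
    "time"

    _ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

func main() {
    // Placeholder connection string; adjust for a real cluster.
    db, err := sql.Open("postgres", "postgresql://root@localhost:26257/d?sslmode=disable")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // (a) The client decides how many times it is willing to retry.
    // (b) The context deadline (or closing the connection) signals
    //     cancelation instead of retrying forever.
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
    defer cancel()

    cutoff := time.Now().Add(-24 * time.Hour) // placeholder for the "x" in Spencer's example
    for attempt := 1; attempt <= 5; attempt++ {
        _, err := db.ExecContext(ctx, `DELETE FROM d.t WHERE ts > $1`, cutoff)
        if err == nil {
            return // finished
        }
        fmt.Printf("attempt %d failed: %v\n", attempt, err)
    }
    fmt.Println("giving up; consider splitting the delete into smaller pieces")
}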

So, as I see it, removing this code would be in line with more general policies. No?

You suggest enhancing it - but how? (other than perhaps keeping track of the RefreshTimestamp)


@spencerkimball

Member

spencerkimball commented Sep 11, 2018

I agree with you from a high-level perspective, but it was pretty shitty to watch this case in practice, where customers were running a delete command that never finished. To be fair, I think this was before the client cutting the connection would cancel the right context.

@nvanbenschoten

Member

nvanbenschoten commented Sep 11, 2018

I find myself agreeing with Andrei on this for most of the reasons he listed, but also because I'm not convinced this is even an effective way to stop endless retries. Transactions could just as easily remain below this refresh span size limit and continually fail to refresh, leading to endless client-side retries.

It seems to me that the refresh span size limit is being used for a secondary purpose here - to make a determination about when a transaction has performed so much work that retrying will likely hit more errors. I don't think that's a reasonable thing to base off of the size of the keys that need to be refreshed. At a minimum, it should be based on the number of refresh spans instead. Barring that, I don't think we should try to be smart here and make a determination on this.
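
For what it's worth, a count-based variant of the check might look roughly like this; the field names (refreshReads, refreshWrites) and the limit are assumptions for illustration, not the interceptor's actual API:

// Hypothetical sketch: gate the "too large" error on the number of refresh
// spans rather than on their accumulated key size. Names are assumed.
const maxRefreshSpanCount = 10000 // illustrative limit, not a real setting

numSpans := len(sr.refreshReads) + len(sr.refreshWrites)
if numSpans > maxRefreshSpanCount && br.Txn.OrigTimestamp != br.Txn.Timestamp {
    return nil, roachpb.NewErrorWithTxn(
        errors.New("transaction is too large to complete; try splitting into pieces"), br.Txn,
    )
}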

@spencerkimball

Member

spencerkimball commented Sep 11, 2018

As long as we're sure that killing the client that started the DELETE FROM in my example will stop the retries (like I said, I'm pretty sure this wasn't true before but is now), I'm fine with this change.

@bdarnell
Member

bdarnell left a comment

I don't think that's a reasonable thing to base off of the size of the keys that need to be refreshed.

It's based on the size of the refresh spans and the fact that it had to refresh once, which gives us a reasonable probability that the same thing will happen again.

I don't like "just kill the client" as a solution here - if it's a statement being retried on the server side the user has no way of telling whether it's making progress or just spinning. The fact that we have a lot of places that do unbounded retries is not an excuse for removing one of the safeguards we do have (for something that has been observed in practice). I'd prefer to leave this in place until we can migrate to a higher-level retry limit.

Did you actually run into this or are you just trying to clean up the code?


@andreimatei
Member

andreimatei left a comment

Did you actually run into this or are you just trying to clean up the code?

Both. Nathan quoted two instances of users running into this and sneakily sniped me into starting this conversation (he was supposed to send this PR :P ). Of course, just because people ran into it is not exactly an argument for getting rid of it; this protection was put here precisely so that people would run into it.
But it's just too arbitrary for my taste. It doesn't cover the "small keys" case that Nathan is talking about, and besides, if we were to keep a protection like this here, it should at least allow for one restart (sketched below): if the txn has never restarted before, then at the point where this fires the transaction is still "making progress" (to the extent that this concept can be defined), since it's laying down intents that might prevent other reads from pushing it the next time around.
In conclusion, this is definitely not the place in the code, or the mechanism, that I would use to set a precedent for short-circuiting.
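
Concretely, the "allow one restart" refinement could look something like the sketch below; treating Epoch > 0 as "has restarted before" follows roachpb.Transaction, but the snippet is illustrative only and not part of this PR:

// Illustrative sketch: only short-circuit once the txn has already restarted
// at least once (Epoch > 0); a first attempt is still laying down intents and
// is arguably still making progress.
if sr.refreshInvalid && br.Txn.Epoch > 0 && br.Txn.OrigTimestamp != br.Txn.Timestamp {
    return nil, roachpb.NewErrorWithTxn(
        errors.New("transaction is too large to complete; try splitting into pieces"), br.Txn,
    )
}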


@andreimatei
Member

andreimatei left a comment

Offline compromise: go ahead with the change but don't backport to 2.1 because of Ben's "strong bias towards the status quo" :)

