kv: don't be pessimistic about txn that can't refresh #30074

Merged
merged 1 commit into cockroachdb:master from the span-refresher-no-pessimism branch on Mar 18, 2019

Conversation

@andreimatei andreimatei (Contributor) commented Sep 11, 2018

Before this patch, the span refresher had code to:
// If the transaction will retry and the refresh spans are
// exhausted, return a non-retryable error indicating that the
// transaction is too large and should potentially be split.
// We do this to avoid endlessly retrying a txn that's likely to refail.

This was uncharacteristically pessimistic for cockroach. There's no
particular indication that the txn in question will need a refresh upon
restart. We don't engage in such judgements anywhere else. This patch
removes this behavior.
Instead of removing it, another option would be to eagerly generate a
retryable error. However, that would probably not be a useful thing to
do given the behavior of the other layers: the server side generally
tries to let a txn run as long as possible before forcing a restart in
the hope that it will lay down more intents, and the SQL layer
implements a policy of eager restarts while the txn can still be
"auto-retried".

Relatedly, I think the protection that we had was deficient because it
didn't handle the txn's RefreshTimestamp, only its OrigTimestamp.

Release note: None
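
As an illustration of the RefreshTimestamp point above, here is a minimal sketch of what a refresh-aware variant of the removed check might look like. It reuses the names from the snippet quoted in the review below (sr.refreshInvalid, br.Txn, roachpb.NewErrorWithTxn); the RefreshedTimestamp field and the overall shape are assumptions for illustration, not code from this patch.

// Hypothetical sketch (names assumed, not from the patch): compute the txn's
// effective read timestamp, taking a prior successful refresh into account,
// before concluding that another refresh is needed and impossible.
readTS := br.Txn.OrigTimestamp
if readTS.Less(br.Txn.RefreshedTimestamp) {
	readTS = br.Txn.RefreshedTimestamp
}
if sr.refreshInvalid && readTS != br.Txn.Timestamp {
	return nil, roachpb.NewErrorWithTxn(
		errors.New("transaction is too large to complete; try splitting into pieces"), br.Txn,
	)
}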

@andreimatei andreimatei (Contributor, Author) commented

Spencer, @nvanbenschoten and I submit this for your consideration.

// We do this to avoid endlessly retrying a txn that's likely to refail.
if sr.refreshInvalid && (br.Txn.OrigTimestamp != br.Txn.Timestamp) {
	return nil, roachpb.NewErrorWithTxn(
		errors.New("transaction is too large to complete; try splitting into pieces"), br.Txn,
	)
}
@spencerkimball spencerkimball (Member) commented on this code

Note that this was added when we had workloads which were creating massive transactions as a result of executing statements like DELETE FROM d.t WHERE timestamp > x. These would retry endlessly because they could never finish so long as the timestamp was moved forward sufficiently. I would not be in favor of removing this protection. Perhaps you could enhance it instead.

@andreimatei andreimatei (Contributor, Author) left a comment

Well, we have all sorts of workloads where transactions can end up retrying endlessly. We don't have any other similar protections elsewhere. The answer was always a combination of:
a) the client can stop retrying when they've had it (assuming the retries are under the client's control)
b) the client can close the connection to signal cancelation when retries are not under the client's control (see the sketch after this list). And separately, perhaps we should add some support for the pgwire cancelation protocol.
c) we should find ways to give clients more policy control on a per-session / per-query / per-txn basis around timeouts/deadlines/resource caps (#5924).
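
A minimal client-side sketch of option b), assuming a Go client over database/sql with the lib/pq driver; the connection string, table, and query are hypothetical and not part of this PR. The point is only that the client decides when to give up, by letting its context expire or by dropping the connection:

package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // driver choice is an assumption
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/d?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Bound the potentially huge DELETE with a client-side deadline. When the
	// deadline fires (or the client closes the connection), ExecContext returns
	// and, per the discussion in this thread, the cancelation propagates so the
	// server stops retrying on the client's behalf.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if _, err := db.ExecContext(ctx, `DELETE FROM d.t WHERE ts > $1`, time.Now().Add(-time.Hour)); err != nil {
		log.Printf("DELETE gave up: %v", err)
	}
}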

So, as I see it, removing this code would be in line with more general policies. No?

You suggest enhancing it - but how? (other than perhaps keeping track of the RefreshTimestamp)

@spencerkimball spencerkimball (Member) commented

I agree with you from a high level perspective, but it was pretty shitty to watch this case in practice, where customers were running a delete command which never finished. To be fair, I think this was before the client cutting the connection would cancel the right context.

@nvanbenschoten nvanbenschoten (Member) commented

I find myself agreeing with Andrei on this for most of the reasons he listed, but also because I'm not convinced this is even an effective way to stop endless retries. Transactions could just as easily remain below this refresh span size limit and continually fail to refresh, leading to endless client-side retries.

It seems to me that the refresh span size limit is being used for a secondary purpose here - to make a determination about when a transaction has performed so much work that retrying will likely hit more errors. I don't think that's a reasonable thing to base off of the size of the keys that need to be refreshed. At a minimum, it should be based on the number of refresh spans instead. Barring that, I don't think we should try to be smart here and make a determination on this.
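
For concreteness, a sketch of the count-based condition being suggested, purely illustrative; maxRefreshSpanCount and sr.refreshSpans are assumed names, not the interceptor's actual fields:

// Hypothetical: invalidate the refresh set based on how many spans would need
// to be refreshed, rather than on how many bytes their keys occupy.
const maxRefreshSpanCount = 1000

if len(sr.refreshSpans) > maxRefreshSpanCount {
	sr.refreshInvalid = true
	sr.refreshSpans = nil
}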

@spencerkimball spencerkimball (Member) commented

As long as we're sure that killing the client that started the DELETE FROM in my example will stop the retries (like I said, I'm pretty sure this wasn't true before but is now), I'm fine with this change.

@bdarnell bdarnell (Member) left a comment

> I don't think that's a reasonable thing to base off of the size of the keys that need to be refreshed.

It's based on the size of the refresh spans and the fact that it had to refresh once, which gives us reasonable probability that the same thing will happen again.

I don't like "just kill the client" as a solution here - if it's a statement being retried on the server side the user has no way of telling whether it's making progress or just spinning. The fact that we have a lot of places that do unbounded retries is not an excuse for removing one of the safeguards we do have (for something that has been observed in practice). I'd prefer to leave this in place until we can migrate to a higher-level retry limit.

Did you actually run into this or are you just trying to clean up the code?

@andreimatei andreimatei (Contributor, Author) left a comment

> Did you actually run into this or are you just trying to clean up the code?

Both. Nathan quoted two instances of users running into this and sneakily sniped me into starting this conversation (he was supposed to send this PR :P ). Of course, just because people ran into it is not exactly an argument for getting rid of it - this protection was put here so that people would run into it.
But it's just too arbitrary for my taste. It doesn't cover the "small keys" case that Nathan is talking about, and besides, if we were to keep something like this protection here, it should at least allow for one restart (sketched below): if the txn never restarted before, then at the point where this fires, the transaction is still "making progress" (to the extent that this concept can be defined) - it's laying down intents that might prevent other reads from pushing it the next time around.
In conclusion, this would definitely not be the point in the code, or the mechanism, that I would use to set a precedent for short circuits.
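
A sketch of the "allow for one restart" idea, assuming the transaction's epoch is bumped on each restart; the condition is illustrative, not a proposed patch:

// Hypothetical: only short-circuit once the txn has already restarted at
// least once (Epoch > 0); on the first pass it is still laying down intents
// and arguably making progress.
if sr.refreshInvalid && br.Txn.Epoch > 0 && br.Txn.OrigTimestamp != br.Txn.Timestamp {
	return nil, roachpb.NewErrorWithTxn(
		errors.New("transaction is too large to complete; try splitting into pieces"), br.Txn,
	)
}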

@andreimatei andreimatei (Contributor, Author) left a comment

Offline compromise: go ahead with the change but don't backport to 2.1 because of Ben's "strong bias towards the status quo" :)

@andreimatei andreimatei (Contributor, Author) left a comment

Rebased. @bdarnell, can I merge this?

@bdarnell bdarnell (Member) commented

Not until the release branch is cut. The point of the previous compromise was to get this change in at the beginning of a development cycle, not to get it into 19.1. Now that it's sat untouched and untested for 6 months, it'll have to wait for the next release.

@andreimatei andreimatei (Contributor, Author) commented

The code that it deletes is equally untested.
Working with imperfect people such as myself, there's a real risk it will keep slipping :)

@andreimatei andreimatei (Contributor, Author) left a comment

bors r+

craig bot pushed a commit that referenced this pull request Mar 18, 2019
30074: kv: don't be pessimistic about txn that can't refresh r=andreimatei a=andreimatei

35870: workload,workloadccl: inject stats for TPC-C workload fixture r=rytaft a=rytaft

This commit adds logic to pre-calculate stats for any TPC-C workload,
and inject the stats using `ALTER TABLE ... INJECT STATISTICS`. The
benefit of injecting stats is it allows the benchmark to run without the
overhead of performing the initial stats collection. Stats may still
be refreshed throughout the benchmark, but they will not be running
constantly. This is a much closer approximation to how a workload would
be run on a day-to-day basis in production.

Release note (performance improvement): Improved performance of the TPC-C
benchmark by pre-calculating statistics and injecting them during IMPORT.

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@craig craig bot (Contributor) commented Mar 18, 2019

Build succeeded

@craig craig bot merged commit 2ceda27 into cockroachdb:master Mar 18, 2019
@andreimatei andreimatei deleted the span-refresher-no-pessimism branch March 19, 2019 00:04