-
Notifications
You must be signed in to change notification settings - Fork 278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve experience on pg serialization errors #1552
Merged
vroldanbet
merged 5 commits into
main
from
improve-experience-on-pg-serialization-errors
Oct 4, 2023
Merged
Improve experience on pg serialization errors #1552
vroldanbet
merged 5 commits into
main
from
improve-experience-on-pg-serialization-errors
Oct 4, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
2 times, most recently
from
September 29, 2023 08:36
71426b5
to
dc9dc2a
Compare
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
2 times, most recently
from
September 29, 2023 09:05
bdcf9d5
to
f6c1619
Compare
github-actions
bot
added
the
area/tooling
Affects the dev or user toolchain (e.g. tests, ci, build tools)
label
Sep 29, 2023
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
2 times, most recently
from
September 29, 2023 09:36
1f2df0f
to
4a3dc5d
Compare
vroldanbet
force-pushed
the
add-new-indexes
branch
from
September 29, 2023 10:55
9fe5986
to
84250fd
Compare
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
from
September 29, 2023 11:02
4a3dc5d
to
01e33be
Compare
vroldanbet
added a commit
to authzed/api
that referenced
this pull request
Oct 2, 2023
to support authzed/spicedb#1552
vroldanbet
force-pushed
the
add-new-indexes
branch
from
October 2, 2023 12:51
84250fd
to
b3b5531
Compare
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
from
October 2, 2023 12:54
d9ca0de
to
4f81511
Compare
vroldanbet
added a commit
to authzed/api
that referenced
this pull request
Oct 2, 2023
to support authzed/spicedb#1552
vroldanbet
force-pushed
the
add-new-indexes
branch
3 times, most recently
from
October 3, 2023 09:31
9025669
to
1e81e68
Compare
For folks running schema migrations, which typically involve deleting relationships of one of the relations affected in their schema, we recommend using DeleteRelationship API with limit enabled not to overload the database in situations where relations have millions of relationships. DeleteRelationships with limits is fundamentally different in that it has to do a SELECT first because Postgres does not support DELETE ... LIMIT statements. That particular SELECT statement was found to be unindexed and be extremely slow in SpiceDB clusters with millions of relationships. This commit adds a new index to support that access pattern and make deletes+limit faster and less impactful overall.⚠️ This index uses the xid8 type and also requires Postgres 15 for the query planner to do the right thing.
- they tend to be logged by the `pgx` driver with potentially very large SQL statements - they are not retried, nor there is a backoff. There are also other types of errors that are not being retried, like `conn closed` (which pgx considers safe to retry) - they return to the client as unknown instead of the recommended aborted gRPC code - no information is logged on retries in `debug` level - logs can get extremely verbose due to large SQL and args Serialization errors can also manifest as a transaction rollback, and so code is added to retry and wrap the error accordingly, in which case gRPC `aborted` is returned. Additionally, retries after serialization can lead to longer requests, and longer requests can hit the deadline. Sometimes a cancellation leads to the pgx connection being asynchronously closed, losing information about the original cause that closed it. Unfortunately, there isn't an exported type for this, so it's detected via the string. The error is turned into a multi-error that has cancellation. In a local load test, evidence shows that in all cases the original reason was cancellation. A call to `pgx.SafeToRetry` is added to handle various pgx internal errors (like `conn closed`) and determine if they can be retried. This has helped retry errors it didn't use to retry before. Retries also now have a backoff, since the default maximum of 10 can knock down the database. The retry function is refactored, and the minimum wait is reduced since the total backoff grows pretty quickly, and will easily make the client hit the typical deadlines. The log message for retries is moved to Debug since serialization errors can be pretty common so it shouldn't be considered a warning. I also detected that the backoff function wasn't respecting the context cancellation, so even when the client had canceled, the goroutine would be still asleep. The function was changed to cancel immediately after the client signal.
when there is a lot of data to be GC, it would be useful to have richer understanding of what's happening via logs. This commit changes the completed log messages to info since this is something operators would want to monitor under normal operations, and also enriches the error message to understand where is it stuck deleting
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
from
October 3, 2023 15:53
4f81511
to
23386fa
Compare
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
from
October 3, 2023 15:53
23386fa
to
a225dc7
Compare
josephschorr
reviewed
Oct 3, 2023
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
from
October 3, 2023 16:12
a225dc7
to
a54d26e
Compare
with authzed/api#88 a new error reason is exposed to signal datastore serialization errors.
vroldanbet
force-pushed
the
improve-experience-on-pg-serialization-errors
branch
from
October 3, 2023 16:30
a54d26e
to
a4cc965
Compare
josephschorr
approved these changes
Oct 3, 2023
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
github-merge-queue
bot
removed this pull request from the merge queue due to failed status checks
Oct 3, 2023
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Labels
area/datastore
Affects the storage system
area/dependencies
Affects dependencies
area/tooling
Affects the dev or user toolchain (e.g. tests, ci, build tools)
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
ℹ️ Depends on authzed/authzed-go#147 and authzed/api#88 merging first
I tried to improve two aspects of serialization errors:
Aborted
instead ofUnknown
)This greatly improved the odds of transactions to succeed.
Details
serialization errors are not well handled in the Postgres datastore:
pgx
driver with potentially very large SQL statementsconn closed
(which pgx considers safe to retry)debug
levelSerialization errors can also manifest as a transaction rollback, and so code is added to retry and wrap the error accordingly, in which case gRPC
aborted
is returned.Additionally, retries after serialization can lead to longer requests, and longer requests can hit the
deadline. Sometimes a cancellation leads to the pgx connection being asynchronously closed, losing information about the original cause that closed it. Unfortunately, there isn't an exported type for this, so it's detected via the string. The error is turned into a multi-error that has cancellation. In a local load test, evidence shows that in all cases the original reason was cancellation.
A call to
pgx.SafeToRetry
is added to handle various pgx internal errors (likeconn closed
) and determine if they can be retried. This has helped retry errors it didn't use to retry before.Retries also now have a backoff, since the default maximum of 10 can knock down the database. The retry function is refactored, and the minimum wait is reduced since the total backoff grows pretty quickly, and will easily
make the client hit the typical deadlines. The log message for retries is moved to Debug since serialization errors can be pretty common so it shouldn't be considered a warning.
I also detected that the backoff function wasn't respecting the context cancellation, so even when the client had canceled, the goroutine would be still asleep. The function was changed to cancel immediately after the client signal.