Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve experience on pg serialization errors #1552

Merged
merged 5 commits into from
Oct 4, 2023

Conversation

vroldanbet
Copy link
Contributor

@vroldanbet vroldanbet commented Sep 28, 2023

ℹ️ Depends on authzed/authzed-go#147 and authzed/api#88 merging first

I tried to improve two aspects of serialization errors:

  • retrying them on the server side to spare the client doing it themselves
  • do exponential backoff
  • better logging and more concise logging
  • better gRPC response (Aborted instead of Unknown)

This greatly improved the odds of transactions to succeed.

Details

serialization errors are not well handled in the Postgres datastore:

  • they tend to be logged by the pgx driver with potentially very large SQL statements
  • they are not retried, nor there is a backoff. There are also other types of errors that are not being retried, like conn closed (which pgx considers safe to retry)
  • they return to the client as unknown instead of the recommended aborted gRPC code
  • no information is logged on retries in debug level
  • logs can get extremely verbose due to large SQL and args

Serialization errors can also manifest as a transaction rollback, and so code is added to retry and wrap the error accordingly, in which case gRPC aborted is returned.

Additionally, retries after serialization can lead to longer requests, and longer requests can hit the
deadline. Sometimes a cancellation leads to the pgx connection being asynchronously closed, losing information about the original cause that closed it. Unfortunately, there isn't an exported type for this, so it's detected via the string. The error is turned into a multi-error that has cancellation. In a local load test, evidence shows that in all cases the original reason was cancellation.

A call to pgx.SafeToRetry is added to handle various pgx internal errors (like conn closed) and determine if they can be retried. This has helped retry errors it didn't use to retry before.

Retries also now have a backoff, since the default maximum of 10 can knock down the database. The retry function is refactored, and the minimum wait is reduced since the total backoff grows pretty quickly, and will easily
make the client hit the typical deadlines. The log message for retries is moved to Debug since serialization errors can be pretty common so it shouldn't be considered a warning.

I also detected that the backoff function wasn't respecting the context cancellation, so even when the client had canceled, the goroutine would be still asleep. The function was changed to cancel immediately after the client signal.

@vroldanbet vroldanbet changed the base branch from main to add-new-indexes September 28, 2023 22:36
@github-actions github-actions bot added the area/datastore Affects the storage system label Sep 28, 2023
@vroldanbet vroldanbet self-assigned this Sep 28, 2023
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch 2 times, most recently from 71426b5 to dc9dc2a Compare September 29, 2023 08:36
@vroldanbet vroldanbet marked this pull request as ready for review September 29, 2023 08:36
@vroldanbet vroldanbet requested a review from a team as a code owner September 29, 2023 08:36
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch 2 times, most recently from bdcf9d5 to f6c1619 Compare September 29, 2023 09:05
@github-actions github-actions bot added the area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) label Sep 29, 2023
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch 2 times, most recently from 1f2df0f to 4a3dc5d Compare September 29, 2023 09:36
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch from 4a3dc5d to 01e33be Compare September 29, 2023 11:02
vroldanbet added a commit to authzed/api that referenced this pull request Oct 2, 2023
@github-actions github-actions bot added the area/dependencies Affects dependencies label Oct 2, 2023
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch from d9ca0de to 4f81511 Compare October 2, 2023 12:54
vroldanbet added a commit to authzed/api that referenced this pull request Oct 2, 2023
@vroldanbet vroldanbet force-pushed the add-new-indexes branch 3 times, most recently from 9025669 to 1e81e68 Compare October 3, 2023 09:31
Base automatically changed from add-new-indexes to main October 3, 2023 14:59
For folks running schema migrations, which typically involve
deleting relationships of one of the relations affected
in their schema, we recommend using DeleteRelationship
API with limit enabled not to overload the database in
situations where relations have millions of relationships.

DeleteRelationships with limits is fundamentally different
in that it has to do a SELECT first because Postgres does not
support DELETE ... LIMIT statements.

That particular SELECT statement was found to be unindexed
and be extremely slow in SpiceDB clusters with millions of relationships.
This commit adds a new index to support that access pattern
and make deletes+limit faster and less impactful overall.

⚠️This index uses the xid8 type and also requires Postgres 15
for the query planner to do the right thing.
- they tend to be logged by the `pgx` driver with potentially very
  large SQL statements
- they are not retried, nor there is a backoff. There are also other
  types of errors that are not being retried, like `conn closed`
  (which pgx considers safe to retry)
- they return to the client as unknown instead of the recommended
  aborted gRPC code
- no information is logged on retries in `debug` level
- logs can get extremely verbose due to large SQL and args

Serialization errors can also manifest as a transaction rollback,
and so code is added to retry and wrap the error accordingly,
in which case gRPC `aborted` is returned.

Additionally, retries after serialization can lead to longer
requests, and longer requests can hit the deadline. Sometimes
a cancellation leads to the pgx connection being asynchronously
closed, losing information about the original cause that closed it.
Unfortunately, there isn't an exported type for this, so it's
detected via the string. The error is turned into a multi-error
that has cancellation. In a local load test, evidence shows that
in all cases the original reason was cancellation.

A call to `pgx.SafeToRetry` is added to handle various pgx internal
errors (like `conn closed`) and determine if they can be retried.
This has helped retry errors it didn't use to retry before.

Retries also now have a backoff, since the default maximum of 10
can knock down the database. The retry function is refactored, and
the minimum wait is reduced since the total backoff grows pretty quickly,
and will easily make the client hit the typical deadlines. The log
message for retries is moved to Debug since serialization errors can
be pretty common so it shouldn't be considered a warning.

I also detected that the backoff function wasn't respecting the
context cancellation, so even when the client had canceled, the
goroutine would be still asleep. The function was changed to cancel
immediately after the client signal.
when there is a lot of data to be GC, it
would be useful to have richer understanding
of what's happening via logs. This commit
changes the completed log messages to info
since this is something operators would want
to monitor under normal operations, and
also enriches the error message to understand
where is it stuck deleting
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch from 4f81511 to 23386fa Compare October 3, 2023 15:53
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch from 23386fa to a225dc7 Compare October 3, 2023 15:53
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch from a225dc7 to a54d26e Compare October 3, 2023 16:12
with authzed/api#88 a new error
reason is exposed to signal datastore serialization
errors.
@vroldanbet vroldanbet force-pushed the improve-experience-on-pg-serialization-errors branch from a54d26e to a4cc965 Compare October 3, 2023 16:30
@vroldanbet vroldanbet added this pull request to the merge queue Oct 3, 2023
Copy link
Member

@josephschorr josephschorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 3, 2023
@vroldanbet vroldanbet added this pull request to the merge queue Oct 4, 2023
Merged via the queue into main with commit 21f67a5 Oct 4, 2023
25 checks passed
@vroldanbet vroldanbet deleted the improve-experience-on-pg-serialization-errors branch October 4, 2023 06:50
@github-actions github-actions bot locked and limited conversation to collaborators Oct 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/datastore Affects the storage system area/dependencies Affects dependencies area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants