Skip to content

Conversation

@vroldanbet
Copy link
Contributor

This PR modifies the backup create command to allow an incomplete backup to be resumed.

A new marker file is added with the last written bulk export cursor. This was not added to the OCF file because OCF container is not meant for in-place updates, but streaming append operations. The marker is updated every time a bulk export page is successfully written.

For convenience, the command now also supports creating a backup without a file name; in this case, it will derive the backup file name from the Zed context name. A new page-limit flag has also been added to allow users to specify the number of relationships to ingest per page. This may be useful as a mechanism to throttle the number of relationships read from the datastore.

The new logic detects several scenarios like:

  • A backup exists, but no marker exists (meaning it has completed)
  • Backup does not exist, but the marker is left behind (it gets truncated)

The marker file will be removed only when the backup completes.

This new logic does not guarantee that relationships will not be duplicated. When terminated gracefully, the system should write the last page received, close, and flush the files. But if the process was abruptly terminated (i.e. SIGKILL) it could lead to relationships being written to the OCF file, but the marker not being updated.

@vroldanbet vroldanbet requested a review from josephschorr April 22, 2025 17:33
@vroldanbet vroldanbet self-assigned this Apr 22, 2025
@vroldanbet vroldanbet force-pushed the backup-improvements branch from 26afc6f to f33f408 Compare April 22, 2025 17:38
Copy link
Contributor

@tstirrat15 tstirrat15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, had a question

@vroldanbet vroldanbet force-pushed the backup-improvements branch from f33f408 to 4eb56d4 Compare April 22, 2025 17:43
derives the file name from the zed context name
adds a flag that allows the user defining
the number of relationships they'd like to be sent
per bulk export page
@vroldanbet vroldanbet force-pushed the backup-improvements branch from 4eb56d4 to 01a2035 Compare April 22, 2025 17:44
this commit changes the backup create command
so that a canceled backup can be resumed.

A new marker file is added with the last written
bulk export cursor. This was not added to the OCF
file because OCF container is not meant for in-place
updates, but streaming append operations. The marker
is updated every time a bulk export page is successfully
written.

For convenience the command now also supports creating
a backup without a file name: in that case it will
derive the backup file name from the zed context name.

The new logic detects several scenarios like:
- a backup exists, but no marker exists (meaning it completed)
- backup does not exist, but marker is left behind (it gets truncated)

The marker file will be removed only when the backup completes.

This new logic does not guarantee relationships may not get duplicated.
When terminated gracefully the system should write the last page received
and close and flush the files. But if the process was abruptly terminated
(e.g. SIGKILL) it could lead to relationships being written to the OCF
file, but the marker not being updated.
Copy link
Contributor

@alecmerdler alecmerdler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a blocker, but I was able to get an error by running against SpiceDB seeded using the test data:

ERR terminated with errors error="error receiving relationships: %!w(<nil>): set flag --max-message-size=9793576 to increase the maximum allowable size

- print backup stats always, regardless termination reason
- improve handling of ResourceExhausted error, instead of fixing
  all exit paths, wrap the command function so the error is always
  evaluated.
- fixes malformed error message due to error variable overlap
- made resilient to existing backup with empty marker file
@vroldanbet
Copy link
Contributor Author

Not a blocker, but I was able to get an error by running against SpiceDB seeded using the test data:

@alecmerdler this is a known issue, which can be addressed by setting the --max-message-size in case of large schemas, or --page-limit (introduced in this PR) to control the size of the pages in a single gRPC message.

I fixed the confusing error message in cea939a, thanks for reporting it!

module github.com/authzed/zed

go 1.23.8
go 1.24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in a separate pull request?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no because I needed it for t.Context()

Copy link
Contributor

@alecmerdler alecmerdler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix!

@vroldanbet vroldanbet merged commit ce8c112 into main Apr 23, 2025
11 checks passed
@vroldanbet vroldanbet deleted the backup-improvements branch April 23, 2025 14:23
@github-actions github-actions bot locked and limited conversation to collaborators Apr 23, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants