Implement take backup api in gateway #9901
There will be a new actuator under I thought about having a general response type, but honestly I'm not sure everything fits. For example, what would be the common response type for a I think for the So, to summarize, a new endpoint under
Backup Actuator Spec
openapi: "3.0.2"
info:
title: Backups API
version: "1.0"
description: |
Management endpoint to query, take, and delete backups of Zeebe.
servers:
- url: "{schema}://{host}:{port}/actuator/backups"
description: Test server
variables:
host:
default: localhost
description: Management server hostname
port:
default: "9600"
description: Management server port
schema:
default: http
description: Management server schema
paths:
/{id}:
get:
summary: Monitors backup
description: |
Aggregates the complete status of a backup across all partitions.
parameters:
- $ref: '#/components/parameters/BackupId'
responses:
'200':
description: |
The aggregated status of a given backup across all partition leaders.
content:
application/json:
schema:
$ref: '#/components/schemas/BackupStatus'
examples:
in-progress:
$ref: '#/components/examples/in-progress'
completed:
$ref: '#/components/examples/completed'
'404':
$ref: '#/components/responses/NotFound'
'500':
$ref: '#/components/responses/ServerError'
'501':
$ref: '#/components/responses/NotImplementedError'
'502':
$ref: '#/components/responses/GatewayError'
'503':
$ref: '#/components/responses/NotImplementedError'
'504':
$ref: '#/components/responses/GatewayTimeoutError'
post:
summary: Take backup
description: |
Asynchronously start a backup operation with the given ID. To monitor the state of the
operation, it's recommended that you poll the `GET /id` endpoint at a low, periodic interval.
Backups taken via this can later be deleted by sending a `DELETE /id`.
parameters:
- $ref: '#/components/parameters/BackupId'
responses:
'202':
description: |
A new backup with the given ID was created, and the operation is started, but is not
necessarily completed.
content:
application/json:
schema:
$ref: '#/components/schemas/BackupStatus'
examples:
in-progress:
$ref: '#/components/examples/in-progress'
'500':
$ref: '#/components/responses/ServerError'
'501':
$ref: '#/components/responses/NotImplementedError'
'502':
$ref: '#/components/responses/GatewayError'
'503':
$ref: '#/components/responses/NotImplementedError'
'504':
$ref: '#/components/responses/GatewayTimeoutError'
delete:
summary: Delete backup
description: |
Deletes the backup with the given ID from the configured backup store.
parameters:
- $ref: '#/components/parameters/BackupId'
responses:
'202':
description: |
Indicates that the backup has been successfully deleted. This may be an asynchronous
operation based on the configured backup store.
'404':
$ref: '#/components/responses/NotFound'
'500':
$ref: '#/components/responses/ServerError'
'501':
$ref: '#/components/responses/NotImplementedError'
'502':
$ref: '#/components/responses/GatewayError'
'503':
$ref: '#/components/responses/NotImplementedError'
'504':
$ref: '#/components/responses/GatewayTimeoutError'
components:
parameters:
BackupId:
name: id
in: path
description: ID of the backup
required: true
style: simple
schema:
$ref: '#/components/schemas/BackupId'
responses:
NotFound:
description: |
Indicates that no backup with the given ID exists to be deleted. This can sometimes be
temporary when the backup operation was just started, but it may indicate errors if this
is consistently failing, or if no partitions report a status different than
`DOES_NOT_EXIST`.
content:
application/json:
schema:
$ref: '#/components/schemas/BackupStatus'
examples:
not-found:
$ref: '#/components/examples/not-found'
not-found-yet:
$ref: '#/components/examples/not-found-yet'
ServerError:
description: |
An error occurred in the gateway, most likely while trying to communicate with one of the
partition leaders. You should check the gateway logs for more.
$ref: '#/components/responses/Error'
GatewayError:
description: |
An error occurred on one of the partition leaders.
It's possible that the backup was started on some partitions; in this case, it will
eventually be marked as failed. You should try again with a new, higher backup
ID.
$ref: '#/components/responses/Error'
NotImplementedError:
description: |
One of the partition leaders is of an older version and does not support taking backups.
This can happen if you're in the middle of a rolling upgrade, and you should simply
retry later. However, if this always happens, then you should double check your cluster
configuration and make sure all nodes in your cluster are on a version which supports
taking backups.
$ref: '#/components/responses/Error'
GatewayTimeoutError:
description: |
Indicates one or more requests to a partition leader timed out, and the response cannot be
completely aggregated. You can find more details about each partition request in the
response body.
$ref: '#/components/responses/Error'
Error:
description: |
A general error for an aggregated request.
content:
"application/problem+json":
schema:
$ref: '#/components/schemas/Error'
examples:
default:
summary: Complete failure
value:
type: "/backups/3"
title: "some title for the error situation"
status: 501
detail: "some description for the error situation"
instance: "/actuator/backups/3"
partial-error:
summary: Partial failure
value:
type: "/backups/3"
title: "Request timed out with leader of partition 2"
status: 504
detail: |
Backup status request timed out between zeebe-gateway-0 and leader of partition 2
zeebe-broker-1 after 30 seconds.
instance: "/actuator/backups/3"
partitions:
- id: 1
status: IN_PROGRESS
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 1
brokerId: 1
brokerVersion: 8.0.6
- type: "/backups/3/partitions/2"
title: "Request timed out after 30 seconds"
status: 504
detail: |
Backup status request timed out between zeebe-gateway-0 and leader of
partition 2 zeebe-broker-1 after 30 seconds
instance: "/actuator/backups/3"
schemas:
Error:
title: Error
type: object
allOf:
- $ref: 'https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml#/Problem'
properties:
partitions:
readOnly: true
type: array
items:
anyOf:
- $ref: '#/components/schemas/PartitionBackupStatus'
- $ref: 'https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml#/Problem'
BackupId:
title: Backup ID
description: The ID of the backup
type: number
example: 1
minimum: 0
BackupStatus:
title: Backup Status
description: The status of the backup
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupId'
status:
readOnly: true
allOf:
- $ref: '#/components/schemas/StatusCode'
partitions:
readOnly: true
description: |
Detailed list of the status of the backup per partition. It should always contain all
partitions known to the cluster.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionBackupStatus'
required:
- id
- status
- partitions
PartitionBackupStatus:
title: Backup Status per Partition
description: The status of the backup for a given partition.
type: object
properties:
id:
readOnly: true
description: The ID of the partition.
type: number
example: 1
minimum: 1
status:
readOnly: true
allOf:
- $ref: '#/components/schemas/StatusCode'
createdAt:
description: The timestamp at which the backup was started on this partition.
readOnly: true
type: string
format: date-time
example: "2022-09-15T13:10:38.176514094Z"
lastUpdatedAt:
description: |
The timestamp at which the backup was last updated on this partition, e.g. changed
status from IN_PROGRESS to COMPLETED.
readOnly: true
type: string
format: date-time
example: "2022-09-15T13:10:38.176514094Z"
descriptor:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupDescriptor'
required:
- id
- status
StatusCode:
title: Status code
description: The status of the backup.
type: string
enum:
- DOES_NOT_EXIST
- IN_PROGRESS
- COMPLETED
- FAILED
example: IN_PROGRESS
BackupDescriptor:
title: Backup Descriptor
description: |
Context information about the specific backup and what it contains for a given partition.
type: object
properties:
snapshotId:
description: The ID of the snapshot which is included in this backup.
type: string
readOnly: true
example: 238632143-55-690906332-690905294
checkpointPosition:
description: The position of the checkpoint for this backup.
type: number
readOnly: true
example: 10
brokerId:
description: The ID of the broker from which the backup was taken for this partition.
type: number
readOnly: true
example: 0
minimum: 0
brokerVersion:
description: The version of the broker from which the backup was taken for this partition.
type: string
readOnly: true
example: 8.0.5
required:
- snapshotId
- checkpointPosition
- brokerId
- brokerVersion
examples:
in-progress:
summary: Status of an in progress backup
description: |
Status response of a backup which is still in progress, with an ID of 1, across two partitions.
value:
id: 1
status: IN_PROGRESS
partitions:
- id: 1
status: IN_PROGRESS
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 1
brokerId: 1
brokerVersion: 8.0.6
- id: 2
status: IN_PROGRESS
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 1
brokerId: 1
brokerVersion: 8.0.6
completed:
summary: Status of completed backup
description: |
Status response for a completed backup with ID 2, on a cluster with 2 partitions.
value:
id: 2
status: COMPLETED
partitions:
- id: 1
status: COMPLETED
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 2
brokerId: 1
brokerVersion: 8.0.6
- id: 2
status: COMPLETED
descriptor:
snapshotId: 238632143-55-690906332-690905294
checkpointPosition: 2
brokerId: 1
brokerVersion: 8.0.6
not-found:
summary: Status response of a non-existent backup
description: |
Status response for a backup which does not exist anywhere in the cluster.
value:
id: 2
status: DOES_NOT_EXIST
partitions:
- id: 1
status: DOES_NOT_EXIST
- id: 2
status: DOES_NOT_EXIST
not-found-yet:
summary: Status response of yet to be found backup
description: |
Status response for a backup which was only partially successful, where one partition
does not have it yet.
value:
id: 2
status: DOES_NOT_EXIST
partitions:
- id: 1
status: DOES_NOT_EXIST
- id: 2
status: IN_PROGRESS
descriptor:
snapshotId: 238632143-55-690906332-690905294
checkpointPosition: 2
brokerId: 1
brokerVersion: 8.0.6
You can more easily visualize it using an online editor, or by installing an IntelliJ extension. I'm not yet 100% sure about the descriptions I put for the response types, as I didn't implement this, and I'm also not 100% about the usage of Let me know what you think.
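For anyone who wants to script against this, here is a rough sketch of the polling flow the `POST /{id}` description recommends (take a backup, then poll `GET /{id}`). It is only an illustration under assumptions: the helper names `fetch_status` and `wait_for_backup` are hypothetical, the base URL is built from the defaults in the `servers` block, and it assumes every response carries a JSON body as drafted in the spec.

```python
import json
import time
import urllib.error
import urllib.request

# Built from the spec's `servers` defaults (http, localhost, 9600); adjust as needed.
BASE_URL = "http://localhost:9600/actuator/backups"


def fetch_status(backup_id, base_url=BASE_URL):
    """GET /{id}; returns (http_status, parsed_json_body). Hypothetical helper."""
    try:
        with urllib.request.urlopen(f"{base_url}/{backup_id}") as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # Per the draft spec, 404 and 5xx responses also carry a JSON body.
        return err.code, json.load(err)


def wait_for_backup(backup_id, fetch=fetch_status, interval_s=5.0,
                    max_attempts=60, sleep=time.sleep):
    """Poll until the aggregated status is COMPLETED or FAILED.

    A 404 / DOES_NOT_EXIST is treated as "not yet", since the spec notes it can
    be temporary right after the backup was started.
    """
    for _ in range(max_attempts):
        http_status, body = fetch(backup_id)
        status = body.get("status")
        if http_status == 200 and status in ("COMPLETED", "FAILED"):
            return status
        sleep(interval_s)
    raise TimeoutError(f"backup {backup_id} did not finish after {max_attempts} polls")
```

`fetch` and `sleep` are injectable so the loop can be exercised without a running gateway; whether errors such as 502 or 504 should abort the loop instead of being retried is left open here.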
I figured it makes sense to have an OpenAPI spec already since I expect people will automate using this, but we could also omit it for now and go with simple documentation in our own docs.
Right, you did link it to me, I just forgot about it, sorry 🙈 I think we can generally align with them. Some things that stand out:
I'll update it tomorrow, and start work on the take backup API (as I think we can sort of verify it via the actual backup store, even without the status API).
This PR would already help in testing it.
So differences between both:
We cannot allow string IDs. But Operate and Optimize can, because Elasticsearch supports strings for snapshot names.
Would it be OK if we start with generic errors and iteratively improve and add specific errors? Then we could also suggest that they add specific errors later. And we can already add specific ones.
A more detailed status would be useful for debugging. For example, if the status is failed, it would be useful to know which partition failed and what the reason was. Users probably cannot do anything about it other than retrying with a new backupId, but it will be useful for investigations.
I don't have a strong opinion on it. Maybe we could align with them on this and use
I'm happy to have a simpler error message, but I think your comment applies to it as well:
This is also important for errors, no? We can ignore I won't insist though, I don't feel that strongly about it 😄
Let's then quickly discuss what the expected errors are, and what users can do about them.
Let's go with this for now, and I think it will most likely be adjusted a bit as I implement it and write tests for it 👍
Backup management API
openapi: "3.0.2"
info:
title: Backups API
version: "1.0"
description: |
Management endpoint to query, take, and delete backups of Zeebe.
servers:
- url: "{schema}://{host}:{port}/actuator/backups"
description: Test server
variables:
host:
default: localhost
description: Management server hostname
port:
default: "9600"
description: Management server port
schema:
default: http
description: Management server schema
paths:
/{id}:
get:
summary: Monitors backup
description: |
Aggregates the complete status of a backup across all partitions.
parameters:
- $ref: '#/components/parameters/BackupId'
responses:
'200':
$ref: '#/components/responses/BackupStatus'
'404':
$ref: '#/components/responses/BackupStatusNotFound'
'500':
$ref: '#/components/responses/BackupStatusError'
post:
summary: Take backup
description: |
Asynchronously start a backup operation with the given ID. To monitor the state of the
operation, it's recommended that you poll the `GET /id` endpoint at a low, periodic interval.
Backups taken via this can later be deleted by sending a `DELETE /id`.
The ID returned in the response is the actual ID of your backup. As backups are logically
ordered by ID, ascending, each successive backup must use a higher ID than the last. If you
use one that is lower than the latest backup, that ID will be returned. You should query
the status of that backup and decide if you need to take a new backup with a higher ID than
that one.
parameters:
- $ref: '#/components/parameters/BackupId'
responses:
'202':
$ref: '#/components/responses/TakeBackupSuccess'
'500':
$ref: '#/components/responses/TakeBackupError'
delete:
summary: Delete backup
description: |
Deletes the backup with the given ID from the configured backup store.
parameters:
- $ref: '#/components/parameters/BackupId'
responses:
'202':
$ref: '#/components/responses/DeleteBackupSuccess'
'404':
$ref: '#/components/responses/DeleteBackupNotFound'
'500':
$ref: '#/components/responses/DeleteBackupError'
components:
parameters:
BackupId:
name: id
in: path
description: ID of the backup
required: true
style: simple
schema:
$ref: '#/components/schemas/BackupId'
responses:
BackupStatusNotFound:
description: |
Indicates that no backup with the given ID exists across all partitions. This can sometimes
be temporary when the backup operation was just started, but it may indicate errors if this
is consistently failing, or if no partitions report a status different than
`DOES_NOT_EXIST`.
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
examples:
not-found:
$ref: '#/components/examples/backup-status-not-found'
BackupStatus:
description: |
The aggregated status of the request. The aggregated status is computed from each
partition specific backup status as:
- If all are `COMPLETED`, then the overall status is `COMPLETED`.
- If one is `FAILED`, then the overall status is `FAILED`.
- Otherwise, if one is `DOES_NOT_EXIST`, then the overall status is `DOES_NOT_EXIST`.
- Otherwise, if one is `IN_PROGRESS`, then the overall status is `IN_PROGRESS`.
content:
application/json:
schema:
$ref: '#/components/schemas/BackupStatus'
TakeBackupSuccess:
description: |
Returned when a backup operation was successfully started on all partitions. Note however
that the response body may contain a backup ID which is different than the given ID.
This can happen if the given ID is lower than the latest backup ID.
You should always use the ID returned in the response body thereafter.
content:
application/json:
schema:
$ref: '#/components/schemas/TakeBackupSuccess'
examples:
success:
$ref: '#/components/examples/take-backup-success'
DeleteBackupSuccess:
description: |
Returned when a backup deletion was successfully started on all partitions. Note that the
backup may not have been found on some partitions; the response will contain those partition
IDs on which the backup delete operation was successfully started.
content:
application/json:
schema:
$ref: '#/components/schemas/DeleteBackupSuccess'
examples:
success:
$ref: '#/components/examples/delete-backup-success'
partial-success:
$ref: '#/components/examples/partial-delete-backup-success'
DeleteBackupNotFound:
description: |
Returned when no partition knows of any backup with this ID. If you believe there is
indeed a backup with that ID, you will have to delete it directly from storage,
bypassing Zeebe.
NOTE: if a backup is partially present in some partitions, you will receive a 202, not a
404.
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
examples:
not-found:
$ref: '#/components/examples/delete-backup-not-found'
BackupStatusError:
description: |
Returned when an error occurred while trying to get the status of a backup. This may contain
a partial status, but will always contain at least one failure.
content:
application/json:
schema:
$ref: '#/components/schemas/BackupStatusError'
examples:
status-partial-failure:
$ref: '#/components/examples/status-partial-failure'
TakeBackupError:
description: |
Returned when a failure occurred when requesting to take a new backup.
content:
application/json:
schema:
$ref: '#/components/schemas/TakeBackupError'
examples:
partial-failure:
$ref: '#/components/examples/take-backup-partial-failure'
DeleteBackupError:
description: |
An error occurred in the gateway, most likely while trying to communicate with one of the
partition leaders. You should check the gateway logs for more.
content:
application/json:
schema:
$ref: '#/components/schemas/DeleteBackupError'
examples:
failure:
$ref: '#/components/examples/delete-backup-partial-failure'
schemas:
BackupId:
title: Backup ID
description: |
The ID of the backup. The ID of the backup must be a positive numerical value. As backups
are logically ordered by their IDs (ascending), each successive backup must use a higher
ID than the previous one.
type: number
example: 1
minimum: 0
PartitionId:
title: ID of the partition
description: |
The ID of a partition. This is always a positive number greater than or equal to 1.
type: number
minimum: 1
example: 3
PartitionBackupStatus:
title: Backup Status per Partition
description: The status of the backup for a given partition.
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/PartitionId'
status:
readOnly: true
allOf:
- $ref: '#/components/schemas/StatusCode'
createdAt:
description: The timestamp at which the backup was started on this partition.
readOnly: true
type: string
format: date-time
example: "2022-09-15T13:10:38.176514094Z"
lastUpdatedAt:
description: |
The timestamp at which the backup was last updated on this partition, e.g. changed
status from IN_PROGRESS to COMPLETED.
readOnly: true
type: string
format: date-time
example: "2022-09-15T13:10:38.176514094Z"
descriptor:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupDescriptor'
required:
- id
- status
StatusCode:
title: Status code
description: The status of the backup.
type: string
enum:
- DOES_NOT_EXIST
- IN_PROGRESS
- COMPLETED
- FAILED
example: IN_PROGRESS
BackupDescriptor:
title: Backup Descriptor
description: |
Context information about the specific backup and what it contains for a given partition.
type: object
properties:
snapshotId:
description: The ID of the snapshot which is included in this backup.
type: string
readOnly: true
example: 238632143-55-690906332-690905294
checkpointPosition:
description: The position of the checkpoint for this backup.
type: number
readOnly: true
example: 10
brokerId:
description: The ID of the broker from which the backup was taken for this partition.
type: number
readOnly: true
example: 0
minimum: 0
brokerVersion:
description: The version of the broker from which the backup was taken for this partition.
type: string
readOnly: true
example: 8.0.5
required:
- snapshotId
- checkpointPosition
- brokerId
- brokerVersion
BackupStatus:
title: Backup Status
description: The status of the backup
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupId'
status:
readOnly: true
allOf:
- $ref: '#/components/schemas/StatusCode'
partitions:
readOnly: true
description: |
Detailed list of the status of the backup per partition. It should always contain all
partitions known to the cluster.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionBackupStatus'
required:
- id
- status
- partitions
DeleteBackupSuccess:
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupId'
partitions:
readOnly: true
description: |
List of partition IDs where the backup was successfully deleted.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionId'
required:
- id
- partitions
TakeBackupSuccess:
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupId'
partitions:
readOnly: true
description: |
List of partition IDs where the backup was successfully started.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionId'
required:
- id
- partitions
Error:
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/BackupId'
failure:
readOnly: true
type: string
example: |
Failed to take backup with ID 3.
failures:
readOnly: true
type: array
items:
type: object
properties:
id:
readOnly: true
allOf:
- $ref: '#/components/schemas/PartitionId'
failure:
readOnly: true
type: string
description: |
A message describing the reason why the request failed for a given partition.
example: |
Request to zeebe-broker-1 timed out after 30 seconds.
required:
- id
- failure
required:
- id
BackupStatusError:
title: Backup Status Error
type: object
allOf:
- $ref: '#/components/schemas/Error'
- type: object
properties:
partitions:
readOnly: true
description: |
Status information for partitions which returned a successful response.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionBackupStatus'
required:
- partitions
TakeBackupError:
title: Backup Creation Error
type: object
allOf:
- $ref: '#/components/schemas/Error'
- type: object
properties:
partitions:
readOnly: true
description: |
List of partition IDs where the backup was successfully started.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionId'
required:
- partitions
DeleteBackupError:
title: Backup Deletion Error
type: object
allOf:
- $ref: '#/components/schemas/Error'
- type: object
properties:
partitions:
readOnly: true
description: |
List of partition IDs where the backup was successfully deleted.
type: array
items:
allOf:
- $ref: '#/components/schemas/PartitionId'
required:
- partitions
examples:
in-progress:
summary: Status of an in progress backup
description: |
Status response of a backup which is still in progress, with an ID of 1, across two partitions.
value:
id: 1
status: IN_PROGRESS
partitions:
- id: 1
status: IN_PROGRESS
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 1
brokerId: 1
brokerVersion: 8.0.6
- id: 2
status: IN_PROGRESS
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 1
brokerId: 1
brokerVersion: 8.0.6
completed:
summary: Status of completed backup
description: |
Status response for a completed backup with ID 2, on a cluster with 2 partitions.
value:
id: 2
status: COMPLETED
partitions:
- id: 1
status: COMPLETED
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 2
brokerId: 1
brokerVersion: 8.0.6
- id: 2
status: COMPLETED
descriptor:
snapshotId: 238632143-55-690906332-690905294
checkpointPosition: 2
brokerId: 1
brokerVersion: 8.0.6
backup-status-not-found:
summary: Non-existent backup
description: |
Cannot get the status of backup with ID 3 as no partition is aware of such a backup.
value:
id: 3
failure: |
Failed to get status of backup with ID 3 across all partitions (out of 3 partitions).
take-backup-success:
summary: Cluster-wide success
description: |
All partition leaders have started, or were already, taking a backup with the ID
returned in the response body.
value:
id: 3
partitions:
- id: 1
- id: 2
- id: 3
delete-backup-success:
summary: Cluster-wide success
description: |
All partition leaders have started deleting the backup with the given ID.
value:
id: 3
partitions:
- id: 1
- id: 2
- id: 3
partial-delete-backup-success:
summary: Partial deletion
description: |
The backup with ID 3 existed only for partition 2 and 3, and was successfully deleted. As
it did not exist on partition 1, it is not returned in the partitions list.
value:
id: 3
partitions:
- id: 2
- id: 3
complete-failure:
summary: Complete failure
description: |
No requests could be sent to any of the partition leaders, so no aggregated backup
status will be available, and no partition information will be available.
value:
id: 3
message: |
The topology is currently incomplete, meaning no cluster-wide requests can be
sent. Try again later. If this persists, check your cluster topology using any
Zeebe client.
failed-backup:
summary: Failed backup
description: |
In a cluster of two partitions, one partition leader completed a backup, but the
other failed. This means the complete backup is failed and cannot be used.
value:
id: 3
message: |
Backup status request timed out between zeebe-gateway-0 and leader of partition 2
zeebe-broker-1 after 30 seconds.
status: FAILED
partitions:
- id: 1
status: COMPLETED
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 3
brokerId: 1
brokerVersion: 8.0.6
- id: 2
status: FAILED
descriptor:
snapshotId: 334818341-25-597614652-631601425
checkpointPosition: 3
brokerId: 2
brokerVersion: 8.0.6
status-partial-failure:
summary: Partial failure
description: |
In a cluster of two partitions, one partition leader returned a successful status
response, but the other never answered and the request timed out. The aggregated
status cannot be computed in this case, but partial information is still available.
value:
id: 3
failure: |
Failed to get the status for backup 3 on partitions [2] (out of 2 partitions).
failures:
- id: 2
failure: |
Request to zeebe-broker-1 timed out after 30 seconds.
partitions:
- id: 1
status: IN_PROGRESS
descriptor:
snapshotId: 238878141-55-691634857-691606445
checkpointPosition: 3
brokerId: 1
brokerVersion: 8.0.6
take-backup-partial-failure:
summary: One partition failed to take a backup
description: |
In a cluster of two partitions, one partition leader did not answer a request to take a
backup. This is safe to retry.
value:
id: 3
failure: |
Failed to take backup with ID 3 on partitions [1] (out of 2 partitions).
failures:
- id: 1
failure: |
Request to zeebe-broker-1 timed out after 30 seconds.
partitions:
- id: 2
delete-backup-partial-failure:
summary: One partition failed to delete a backup
description: |
In a cluster of two partitions, one partition leader did not answer a request to delete a
backup. This is safe to retry.
value:
id: 3
failure: |
Failed to delete backup with ID 3 on partitions [2] (out of 2 partitions).
failures:
- id: 2
failure: |
Request to zeebe-broker-1 timed out after 30 seconds.
partitions:
- id: 1
delete-backup-not-found:
summary: Non-existent backup
description: |
Cannot delete a backup with ID 3 as no partition is aware of such a backup.
value:
id: 3
failure: |
Failed to delete backup with ID 3 across all partitions (out of 3 partitions).
I mostly aligned with the Optimize/Operate approach:
Status
If one partition returns an error (e.g. timeout, connection, etc.), we return a 500. The successful requests information is present in the returned payload under On success, you get the backup status (see schema above), and a 200. If all partitions return
Take
If all partitions are successful, we return a 200, with the list of partitions which were successful, and the new backup ID (which may be higher). If one partition fails, we return a 500, with the list of partitions which were successful, the new backup ID (which may be higher), and a list of Again I opted for a model where if one error occurs, then we return an error code so the user can decide to retry more easily. As requests are idempotent, this isn't too big of a deal.
Delete
If at least one partition deletes something, we return a 200. However, if any partition returns an error, then we return 500 (again with the list of successful/failed partitions and details). We only return 404 if no partitions found the backup.
Further improvements
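For reference, the aggregation rules listed in the `BackupStatus` response description of the spec can be sketched as a small function. This is only an illustration of those four rules, not the gateway implementation; the name `aggregate_status` is hypothetical, and it assumes every partition reports one of the `StatusCode` enum values.

```python
def aggregate_status(partition_statuses):
    """Compute the overall backup status from the per-partition statuses.

    Follows the order given in the `BackupStatus` response description:
    all COMPLETED -> COMPLETED, any FAILED -> FAILED,
    else any DOES_NOT_EXIST -> DOES_NOT_EXIST, else IN_PROGRESS.
    """
    statuses = list(partition_statuses)
    if not statuses:
        # Assumption: an empty list is a caller error, the spec does not cover it.
        raise ValueError("at least one partition status is required")
    if all(status == "COMPLETED" for status in statuses):
        return "COMPLETED"
    if "FAILED" in statuses:
        return "FAILED"
    if "DOES_NOT_EXIST" in statuses:
        return "DOES_NOT_EXIST"
    # Only COMPLETED and IN_PROGRESS remain, with at least one IN_PROGRESS.
    return "IN_PROGRESS"
```

For example, a backup that is `IN_PROGRESS` on one partition and `DOES_NOT_EXIST` on the other aggregates to `DOES_NOT_EXIST`, and one `FAILED` partition is enough to mark the whole backup as `FAILED`.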
Depends on #9726
Tasks: