Implement take backup api in gateway #9901

deepthidevaki · 2022-07-27T09:20:41Z

Depends on #9726

Define user facing api
Gateway should trigger backups on all partitions
Collect the responses from all partitions and send it back to the user

Tasks:

Coordinate take backup operation in gateway #10281
Implement actuator endpoints for take backup api

npepinpe · 2022-09-15T14:44:05Z

There will be a new actuator under /actuator/backups. The endpoint will have a route POST /actuator/backups/<backupId>, which will take a new backup with the given ID. This can later be extended with DELETE /actuator/backups/<backupId> and GET /actuator/backups/<backupId>, following REST conventions. In the future, if we ever need more context or parameters, they can be sent as a JSON body. Unless we expect to change the IDs, I think this is a good approach. If the IDs may change in the future (e.g. strings, composites, etc.), then we should simply pack everything in the body - but I doubt this will happen.

I thought about having a general response type, but honestly I'm not sure everything fits. For example, what would be the common response type for a DELETE and a POST?

I think for the take and status operations, we can return something similar to the BackupStatus. There it makes sense to return the current status. For delete, we would return no body, e.g. 204
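As a rough illustration of the routes proposed above, here is a minimal Python sketch that only builds the requests without sending them. The base URL and the `backup_request` helper are assumptions made for this example; they are not part of Zeebe.

```python
from urllib.request import Request

# Hypothetical base URL. Host and port are assumptions for illustration;
# 9600 is the management port default used in the OpenAPI spec in this comment.
BASE_URL = "http://localhost:9600/actuator/backups"


def backup_request(method: str, backup_id: int, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a request for one of the proposed routes."""
    if method not in ("POST", "GET", "DELETE"):
        raise ValueError(f"unsupported method: {method}")
    if backup_id < 0:
        raise ValueError("backup ID must not be negative")
    return Request(f"{base_url}/{backup_id}", method=method)


take = backup_request("POST", 3)      # take a new backup with ID 3
status = backup_request("GET", 3)     # query the status of backup 3
delete = backup_request("DELETE", 3)  # delete backup 3
```

All three operations share the same path and differ only in the HTTP method, which is what makes a common response type across them hard to define.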

So, to summarize: a new actuator under /actuator/backups, with the following endpoints, as documented in this OpenAPI spec:

Backup Actuator Spec

openapi: "3.0.2"
info:
  title: Backups API
  version: "1.0"
  description: |
    Management endpoint to query, take, and delete backups of Zeebe.
servers:
  - url: "{schema}://{host}:{port}/actuator/backups"
    description: Test server
    variables:
      host:
        default: localhost
        description: Management server hostname
      port:
        default: "9600"
        description: Management server port
      schema:
        default: http
        description: Management server schema

paths:
  /{id}:
    get:
      summary: Monitors backup
      description: |
        Aggregates the complete status of a backup across all partitions.
      parameters:
        - $ref: '#/components/parameters/BackupId'
      responses:
        '200':
          description: |
            The aggregated status of a given backup across all partition leaders.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/BackupStatus'
              examples:
                in-progress:
                  $ref: '#/components/examples/in-progress'
                completed:
                  $ref: '#/components/examples/completed'
        '404':
          $ref: '#/components/responses/NotFound'
        '500':
          $ref: '#/components/responses/ServerError'
        '501':
          $ref: '#/components/responses/NotImplementedError'
        '502':
          $ref: '#/components/responses/GatewayError'
        '503':
          $ref: '#/components/responses/NotImplementedError'
        '504':
          $ref: '#/components/responses/GatewayTimeoutError'
    post:
      summary: Take backup
      description: |
        Asynchronously start a backup operation with the given ID. To monitor the state of the
        operation, it's recommended that you poll the `GET /id` endpoint at a low, periodic interval.

        Backups taken via this can later be deleted by sending a `DELETE /id`.
      parameters:
        - $ref: '#/components/parameters/BackupId'
      responses:
        '202':
          description: |
            A new backup with the given ID was created, and the operation is started, but is not
            necessarily completed.
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/BackupStatus'
              examples:
                in-progress:
                  $ref: '#/components/examples/in-progress'
        '500':
          $ref: '#/components/responses/ServerError'
        '501':
          $ref: '#/components/responses/NotImplementedError'
        '502':
          $ref: '#/components/responses/GatewayError'
        '503':
          $ref: '#/components/responses/NotImplementedError'
        '504':
          $ref: '#/components/responses/GatewayTimeoutError'
    delete:
      summary: Delete backup
      description: |
        Deletes the backup with the given ID from the configured backup store.
      parameters:
        - $ref: '#/components/parameters/BackupId'
      responses:
        '202':
          description: |
            Indicates that the backup has been successfully deleted. This may be an asynchronous
            operation based on the configured backup store.
        '404':
          $ref: '#/components/responses/NotFound'
        '500':
          $ref: '#/components/responses/ServerError'
        '501':
          $ref: '#/components/responses/NotImplementedError'
        '502':
          $ref: '#/components/responses/GatewayError'
        '503':
          $ref: '#/components/responses/NotImplementedError'
        '504':
          $ref: '#/components/responses/GatewayTimeoutError'

components:
  parameters:
    BackupId:
      name: id
      in: path
      description: ID of the backup
      required: true
      style: simple
      schema:
        $ref: '#/components/schemas/BackupId'

  responses:
    NotFound:
      description: |
        Indicates that no backup with the given ID exists to be deleted. This can sometimes be
        temporary when the backup operation was just started, but it may indicate errors if this
        is consistently failing, or if no partitions report a status different than
        `DOES_NOT_EXIST`.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/BackupStatus'
          examples:
            not-found:
              $ref: '#/components/examples/not-found'
            not-found-yet:
              $ref: '#/components/examples/not-found-yet'
    ServerError:
      description: |
        An error occurred in the gateway, most likely while trying to communicate with one of the
        partition leaders. You should check the gateway logs for more.
      $ref: '#/components/responses/Error'
    GatewayError:
      description: |
        An error occurred on one of the partition leaders.

        It's possible that the backup was started on some partitions; in this case, it will
        eventually be marked as failed. You should try again with a new, higher backup
        ID.
      $ref: '#/components/responses/Error'
    NotImplementedError:
      description: |
        One of the partition leaders is of an older version and does not support taking backups.
        This can happen if you're in the middle of a rolling upgrade, and you should simply
        retry later. However, if this always happens, then you should double check your cluster
        configuration and make sure all nodes in your cluster are of version which supports
        taking backups.
      $ref: '#/components/responses/Error'
    GatewayTimeoutError:
      description: |
        Indicates one or more requests to a partition leader timed out, and the response cannot be
        completely aggregated. You can find more details about each partition request in the
        response body.
      $ref: '#/components/responses/Error'
    Error:
      description: |
        A general error for an aggregated request.
      content:
        "application/problem+json":
          schema:
            $ref: '#/components/schemas/Error'
          examples:
            default:
              summary: Complete failure
              value:
                type: "/backups/3"
                title: "some title for the error situation"
                status: 501
                detail: "some description for the error situation"
                instance: "/actuator/backups/3"
            partial-error:
              summary: Partial failure
              value:
                type: "/backups/3"
                title: "Request timed out with leader of partition 2"
                status: 504
                detail: |
                  Backup status request timed out between zeebe-gateway-0 and leader of partition 2
                  zeebe-broker-1 after 30 seconds.
                instance: "/actuator/backups/3"
                partitions:
                  - id: 1
                    status: IN_PROGRESS
                    descriptor:
                      snapshotId: 238878141-55-691634857-691606445
                      checkpointPosition: 1
                      brokerId: 1
                      brokerVersion: 8.0.6
                  - type: "/backups/3/partitions/2"
                    title: "Request timed out after 30 seconds"
                    status: 504
                    detail: |
                      Backup status request timed out between zeebe-gateway-0 and leader of
                      partition 2 zeebe-broker-1 after 30 seconds
                    instance: "/actuator/backups/3"

  schemas:
    Error:
      title: Error
      type: object
      allOf:
        - $ref: 'https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml#/Problem'
      properties:
        partitions:
          readOnly: true
          type: array
          items:
            anyOf:
              - $ref: '#/components/schemas/PartitionBackupStatus'
              - $ref: 'https://opensource.zalando.com/restful-api-guidelines/models/problem-1.0.1.yaml#/Problem'
    BackupId:
      title: Backup ID
      description: The ID of the backup
      type: number
      example: 1
      minimum: 0
    BackupStatus:
      title: Backup Status
      description: The status of the backup
      type: object
      properties:
        id:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupId'
        status:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/StatusCode'
        partitions:
          readOnly: true
          description: |
            Detailed list of the status of the backup per partition. It should always contain all
            partitions known to the cluster.
          type: array
          items:
            allOf:
              - $ref: '#/components/schemas/PartitionBackupStatus'
      required:
        - id
        - status
        - partitions
    PartitionBackupStatus:
      title: Backup Status per Partition
      description: The status of the backup for a given partition.
      type: object
      properties:
        id:
          readOnly: true
          description: The ID of the partition.
          type: number
          example: 1
          minimum: 1
        status:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/StatusCode'
        createdAt:
          description: The timestamp at which the backup was started on this partition.
          readOnly: true
          type: string
          format: date-time
          example: "2022-09-15T13:10:38.176514094Z"
        lastUpdatedAt:
          description: |
            The timestamp at which the backup was last updated on this partition, e.g. changed
            status from IN_PROGRESS to COMPLETED.
          readOnly: true
          type: string
          format: date-time
          example: "2022-09-15T13:10:38.176514094Z"
        descriptor:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupDescriptor'
      required:
        - id
        - status
    StatusCode:
      title: Status code
      description: The status of the backup.
      type: string
      enum:
        - DOES_NOT_EXIST
        - IN_PROGRESS
        - COMPLETED
        - FAILED
      example: IN_PROGRESS
    BackupDescriptor:
      title: Backup Descriptor
      description: |
        Context information about the specific backup and what it contains for a given partition.
      type: object
      properties:
        snapshotId:
          description: The ID of the snapshot which is included in this backup.
          type: string
          readOnly: true
          example: 238632143-55-690906332-690905294
        checkpointPosition:
          description: The position of the checkpoint for this backup.
          type: number
          readOnly: true
          example: 10
        brokerId:
          description: The ID of the broker from which the backup was taken for this partition.
          type: number
          readOnly: true
          example: 0
          minimum: 0
        brokerVersion:
          description: The version of the broker from which the backup was taken for this partition.
          type: string
          readOnly: true
          example: 8.0.5
      required:
        - snapshotId
        - checkpointPosition
        - brokerId
        - brokerVersion
  examples:
    in-progress:
      summary: Status of an in progress backup
      description: |
        Status response of a backup which is still in progress, with an ID of 1, across two partitions.
      value:
        id: 1
        status: IN_PROGRESS
        partitions:
          - id: 1
            status: IN_PROGRESS
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 1
              brokerId: 1
              brokerVersion: 8.0.6
          - id: 2
            status: IN_PROGRESS
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 1
              brokerId: 1
              brokerVersion: 8.0.6
    completed:
      summary: Status of completed backup
      description: |
        Status response for a completed backup with ID 2, on a cluster with 2 partitions.
      value:
        id: 2
        status: COMPLETED
        partitions:
          - id: 1
            status: COMPLETED
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 2
              brokerId: 1
              brokerVersion: 8.0.6
          - id: 2
            status: COMPLETED
            descriptor:
              snapshotId: 238632143-55-690906332-690905294
              checkpointPosition: 2
              brokerId: 1
              brokerVersion: 8.0.6
    not-found:
      summary: Status response of a non-existent backup
      description: |
        Status response for a backup which does not exist anywhere in the cluster.
      value:
        id: 2
        status: DOES_NOT_EXIST
        partitions:
          - id: 1
            status: DOES_NOT_EXIST
          - id: 2
            status: DOES_NOT_EXIST
    not-found-yet:
      summary: Status response of yet to be found backup
      description: |
        Status response for a backup which was only partially successful, where one partition
        does not have it yet.
      value:
        id: 2
        status: DOES_NOT_EXIST
        partitions:
          - id: 1
            status: DOES_NOT_EXIST
          - id: 2
            status: IN_PROGRESS
            descriptor:
              snapshotId: 238632143-55-690906332-690905294
              checkpointPosition: 2
              brokerId: 1
              brokerVersion: 8.0.6

You can more easily visualize it using an online editor, or installing an IntelliJ extension.

I'm not yet 100% sure about the descriptions I put for the response types, as I didn't implement this, and I'm also not 100% sure about the usage of problem+json (extended to add the list of per-partition requests on partial failure), but I'd like to try it and see.
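To make the extended problem+json idea more concrete, here is a small sketch of how a client might consume a partial failure. It assumes that entries in `partitions` carrying an `id` field are per-partition statuses and that all other entries are nested problems; that discriminator and the `split_partitions` helper are assumptions for illustration only.

```python
from typing import List, Tuple


def split_partitions(problem: dict) -> Tuple[List[dict], List[dict]]:
    """Separate per-partition statuses from nested problem entries."""
    statuses, problems = [], []
    for entry in problem.get("partitions", []):
        # Assumption: only status entries carry a partition "id".
        (statuses if "id" in entry else problems).append(entry)
    return statuses, problems


# Abbreviated from the "partial-error" example in the spec above.
body = {
    "type": "/backups/3",
    "title": "Request timed out with leader of partition 2",
    "status": 504,
    "instance": "/actuator/backups/3",
    "partitions": [
        {"id": 1, "status": "IN_PROGRESS"},
        {
            "type": "/backups/3/partitions/2",
            "title": "Request timed out after 30 seconds",
            "status": 504,
        },
    ],
}

statuses, problems = split_partitions(body)
```

One thing this shows: the nested problem entry has no dedicated partition ID field, so the client would have to recover it from `type`. That may be worth revisiting if this format is kept.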

Let me know what you think.

npepinpe · 2022-09-15T14:45:32Z

I figured it makes sense to have an OpenAPI spec already since I expect people will automate using this, but we could also omit it for now and go with simple documentation in our own docs.

deepthidevaki · 2022-09-15T15:03:47Z

Thanks @npepinpe. I didn't read it in detail, but overall it looks nice. I will read it tomorrow.
Just FYI, here is the Operate and Optimize API spec. We could try to align with it if possible. I don't think we can provide the exact same API, but aligning whenever possible would be good.

npepinpe · 2022-09-15T15:09:52Z

Right, you did link it to me, I just forgot about it, sorry 🙈 I think we can generally align with them. Some things that stand out:

the backup ID is a string?
All errors are packed under 500, mostly because they don't have our separation between gateway/broker. I think it's probably fine to keep them split for us.
I couldn't find a reference to the error response format
We don't do authorization, so I guess we can ignore that now
They return 200 for asynchronous results, I think 202 is more appropriate, but I wouldn't insist too much

I'll update it tomorrow, and start work on the take backup API (as I think we can sort of verify it via the actual backup store, even without the status API).

deepthidevaki · 2022-09-15T15:17:01Z

I'll update it tomorrow, and start work on the take backup API (as I think we can sort of verify it via the actual backup store, even without the status API).

This PR would already help in testing it.

npepinpe · 2022-09-16T07:28:36Z

So differences between both:

Errors are simple { "message": "..." } payloads.
Backup IDs are strings (?)
All non-client errors are 500 (:shrug: I like more specific ones for error handling, but I'm not going to insist all that much)
Some authorization/authentication stuff we don't have to care too much about (Spring Boot would take care of it for us)
Status payload is simplified, but I guess that makes sense since there it's only about whether a snapshot exists, and for us we do a little more aggregation logic. Or do you think ours could also be simplified? 🤔
There's a PARTIAL status which is used when a snapshot is in progress, instead of IN_PROGRESS. I'm neutral about it, happy to use this instead 🤷
The DELETE endpoint will return a success even if some partitions would return a NOT_FOUND, as long as others did delete the snapshot. I think this makes sense, but it would be nice to not lose the information, e.g. by returning it in the form of additional details/context.
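The last point, succeeding on partial deletion without losing the per-partition information, could look roughly like the sketch below. The outcome labels `DELETED` and `NOT_FOUND` and the shape of the returned payload are hypothetical and only meant to illustrate the idea.

```python
def aggregate_delete(outcomes: dict) -> dict:
    """Aggregate per-partition delete outcomes.

    `outcomes` maps partition ID -> "DELETED" or "NOT_FOUND"
    (hypothetical labels, not taken from any existing API).
    """
    deleted = sorted(p for p, o in outcomes.items() if o == "DELETED")
    not_found = sorted(p for p, o in outcomes.items() if o == "NOT_FOUND")
    return {
        # Success as long as at least one partition deleted the backup.
        "success": bool(deleted),
        "deleted": deleted,
        # Kept as extra context instead of being dropped.
        "notFound": not_found,
    }


partial = aggregate_delete({1: "DELETED", 2: "NOT_FOUND"})
missing = aggregate_delete({1: "NOT_FOUND", 2: "NOT_FOUND"})
```

Here `partial` is reported as a success while still listing partition 2 as not found, and `missing` is the case where the backup was not found anywhere.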

deepthidevaki · 2022-09-16T07:39:00Z

So differences between both:

1. Errors are simple `{ "message": "..." }` payloads.

2. Backup IDs are strings (?)

We cannot allow string IDs. But Operate and Optimize can, because Elasticsearch supports strings for snapshot names.

3. All non-client errors are 500 (:shrug: I like more specific ones for error handling, but I'm not going to insist all that much)

Would it be OK if we start with generic errors and iteratively improve and add specific errors? Then we could also suggest that they add specific errors later. And we can already add specific ones.

5. Status payload is simplified, but I guess that makes sense since there it's only about whether a snapshot exists, and for us we do a little more aggregation logic. Or do you think ours could also be simplified? 🤔

More detailed status would be useful for debugging. For example, if the status is failed, it would be useful to know which partition failed and what the reason was. Users probably cannot do anything about it other than retrying with a new backupId. But it will be useful for investigations.

6. There's a `PARTIAL` status which is used when a snapshot is in progress, instead of `IN_PROGRESS`. I'm neutral about it, happy to use this instead 🤷

I don't have a strong opinion on it. Maybe we could align with them on this and use PARTIAL.
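To illustrate the debugging argument for a detailed status: with a per-partition payload shaped like the `BackupStatus` schema proposed earlier in this thread, finding the failed partitions is a one-liner. The `failed_partitions` helper and the sample payload are made up for this sketch.

```python
def failed_partitions(backup_status: dict) -> list:
    """Return the IDs of all partitions whose backup status is FAILED."""
    return [p["id"] for p in backup_status["partitions"] if p["status"] == "FAILED"]


# Made-up sample payload following the proposed BackupStatus shape.
sample = {
    "id": 4,
    "status": "FAILED",
    "partitions": [
        {"id": 1, "status": "COMPLETED"},
        {"id": 2, "status": "FAILED"},
        {"id": 3, "status": "IN_PROGRESS"},
    ],
}

failed = failed_partitions(sample)
```

With a simplified payload that only carries the overall status, this information would have to come from the logs instead.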

npepinpe · 2022-09-16T08:32:15Z

I'm happy to have a simpler error message, but I think your comment applies to it as well:

More detailed status would be useful for debugging. For example, if the status is failed, it would be useful to know which partition failed and what the reason was. Users probably cannot do anything about it other than retrying with a new backupId. But it will be useful for investigations.

This is also important for errors, no? We can ignore problem+json for now, I admit it seemed nice but it's also just because I want to try it out on a small scale before advocating it. Just a message is sort of OK, but it kind of means there's no way to react to errors other than a human reading the message 🤷

I won't insist though, I don't feel that strongly about it 😄

deepthidevaki · 2022-09-16T08:47:21Z

Just a message is sort of OK, but it kind of means there's no way to react to errors other than a human reading the message

Let's then quickly discuss what are the expected errors, and what users can do about it.

npepinpe · 2022-09-18T18:46:43Z

Let's go with this for now, and I think it will most likely be adjusted a bit as I implement it and write tests for it 👍
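The status aggregation rule given in the `BackupStatus` response description of the spec below can be sketched as follows. Treating an empty partition list or an unknown status value as an error is an assumption of this sketch; the spec does not say what should happen in those cases.

```python
def aggregate_status(partition_statuses) -> str:
    """Compute the overall backup status from the per-partition statuses."""
    statuses = list(partition_statuses)
    if not statuses:
        # Assumption: the spec does not define the result for zero partitions.
        raise ValueError("at least one partition status is required")
    if all(s == "COMPLETED" for s in statuses):
        return "COMPLETED"
    # Precedence for the remaining cases, in the order given in the spec.
    for candidate in ("FAILED", "DOES_NOT_EXIST", "IN_PROGRESS"):
        if candidate in statuses:
            return candidate
    # Assumption: anything else is an unknown status value.
    raise ValueError(f"unknown status in {statuses}")
```

For example, `aggregate_status(["DOES_NOT_EXIST", "IN_PROGRESS"])` yields `DOES_NOT_EXIST`, because `DOES_NOT_EXIST` takes precedence over `IN_PROGRESS`.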

Backup management API

openapi: "3.0.2"
info:
  title: Backups API
  version: "1.0"
  description: |
    Management endpoint to query, take, and delete backups of Zeebe.
servers:
  - url: "{schema}://{host}:{port}/actuator/backups"
    description: Test server
    variables:
      host:
        default: localhost
        description: Management server hostname
      port:
        default: "9600"
        description: Management server port
      schema:
        default: http
        description: Management server schema

paths:
  /{id}:
    get:
      summary: Monitors backup
      description: |
        Aggregates the complete status of a backup across all partitions.
      parameters:
        - $ref: '#/components/parameters/BackupId'
      responses:
        '200':
          $ref: '#/components/responses/BackupStatus'
        '404':
          $ref: '#/components/responses/BackupStatusNotFound'
        '500':
          $ref: '#/components/responses/BackupStatusError'
    post:
      summary: Take backup
      description: |
        Asynchronously start a backup operation with the given ID. To monitor the state of the
        operation, it's recommended that you poll the `GET /id` endpoint at a low, periodic interval.
        Backups taken via this can later be deleted by sending a `DELETE /id`.

        The ID returned in the response is the actual ID of your backup. As backups are logically
        ordered by ID, ascending, each successive backup must use a higher ID than the last. If you
        use one that is lower than the latest backup, that ID will be returned. You should query
        the status of that backup and decide if you need to take a new backup with a higher ID than
        that one.
      parameters:
        - $ref: '#/components/parameters/BackupId'
      responses:
        '202':
          $ref: '#/components/responses/TakeBackupSuccess'
        '500':
          $ref: '#/components/responses/TakeBackupError'
    delete:
      summary: Delete backup
      description: |
        Deletes the backup with the given ID from the configured backup store.
      parameters:
        - $ref: '#/components/parameters/BackupId'
      responses:
        '202':
          $ref: '#/components/responses/DeleteBackupSuccess'
        '404':
          $ref: '#/components/responses/DeleteBackupNotFound'
        '500':
          $ref: '#/components/responses/DeleteBackupError'

components:
  parameters:
    BackupId:
      name: id
      in: path
      description: ID of the backup
      required: true
      style: simple
      schema:
        $ref: '#/components/schemas/BackupId'

  responses:
    BackupStatusNotFound:
      description: |
        Indicates that no backup with the given ID exists across all partitions. This can sometimes
        be temporary when the backup operation was just started, but it may indicate errors if this
        is consistently failing, or if no partitions report a status different than
        `DOES_NOT_EXIST`.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
          examples:
            not-found:
              $ref: '#/components/examples/backup-status-not-found'
    BackupStatus:
      description: |
        The aggregated status of the request. The aggregated status is computed from each
        partition specific backup status as:
          - If all are `COMPLETED`, then the overall status is `COMPLETED`.
          - If one is `FAILED`, then the overall status is `FAILED`.
          - Otherwise, if one is `DOES_NOT_EXIST`, then the overall status is `DOES_NOT_EXIST`.
          - Otherwise, if one is `IN_PROGRESS`, then the overall status is `IN_PROGRESS`.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/BackupStatus'
    TakeBackupSuccess:
      description: |
        Returned when a backup operation was successfully started on all partitions. Note however
        that the response body may contain a backup ID which is different than the given ID.
        This can happen if the given ID is lower than the latest backup ID.

        You should always use the ID returned in the response body thereafter.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/TakeBackupSuccess'
          examples:
            success:
              $ref: '#/components/examples/take-backup-success'
    DeleteBackupSuccess:
      description: |
        Returned when a backup deletion was successfully started on all partitions. Note that the
        backup may not have been found on some partitions; the response will contain those partition
        IDs on which the backup delete operation was successfully started.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/DeleteBackupSuccess'
          examples:
            success:
              $ref: '#/components/examples/delete-backup-success'
            partial-success:
              $ref: '#/components/examples/partial-delete-backup-success'
    DeleteBackupNotFound:
      description: |
        Returned when no partition knows of any backup with this ID. If you believe there is
        indeed a backup with that ID, you will have to delete it directly from storage,
        bypassing Zeebe.

        NOTE: if a backup is partially present in some partitions, you will receive a 202, not a
        404.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
          examples:
            not-found:
              $ref: '#/components/examples/delete-backup-not-found'
    BackupStatusError:
      description: |
        Returned when an error occurred while trying to get the status of a backup. This may contain
        a partial status, but will always contain at least one failure.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/BackupStatusError'
          examples:
            status-partial-failure:
              $ref: '#/components/examples/status-partial-failure'
    TakeBackupError:
      description: |
        Returned when a failure occurred when requesting to take a new backup.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/TakeBackupError'
          examples:
            partial-failure:
              $ref: '#/components/examples/take-backup-partial-failure'
    DeleteBackupError:
      description: |
        An error occurred in the gateway, most likely while trying to communicate with one of the
        partition leaders. You should check the gateway logs for more.
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/DeleteBackupError'
          examples:
            failure:
              $ref: '#/components/examples/delete-backup-partial-failure'

  schemas:
    BackupId:
      title: Backup ID
      description: |
        The ID of the backup. The ID of the backup must be a positive numerical value. As backups
        are logically ordered by their IDs (ascending), each successive backup must use a higher
        ID than the previous one.
      type: number
      example: 1
      minimum: 0
    PartitionId:
      title: ID of the partition
      description: |
        The ID of a partition. This is always a positive number greater than or equal to 1.
      type: number
      minimum: 1
      example: 3
    PartitionBackupStatus:
      title: Backup Status per Partition
      description: The status of the backup for a given partition.
      type: object
      properties:
        id:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/PartitionId'
        status:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/StatusCode'
        createdAt:
          description: The timestamp at which the backup was started on this partition.
          readOnly: true
          type: string
          format: date-time
          example: "2022-09-15T13:10:38.176514094Z"
        lastUpdatedAt:
          description: |
            The timestamp at which the backup was last updated on this partition, e.g. changed
            status from IN_PROGRESS to COMPLETED.
          readOnly: true
          type: string
          format: date-time
          example: "2022-09-15T13:10:38.176514094Z"
        descriptor:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupDescriptor'
      required:
        - id
        - status
    StatusCode:
      title: Status code
      description: The status of the backup.
      type: string
      enum:
        - DOES_NOT_EXIST
        - IN_PROGRESS
        - COMPLETED
        - FAILED
      example: IN_PROGRESS
    BackupDescriptor:
      title: Backup Descriptor
      description: |
        Context information about the specific backup and what it contains for a given partition.
      type: object
      properties:
        snapshotId:
          description: The ID of the snapshot which is included in this backup.
          type: string
          readOnly: true
          example: 238632143-55-690906332-690905294
        checkpointPosition:
          description: The position of the checkpoint for this backup.
          type: number
          readOnly: true
          example: 10
        brokerId:
          description: The ID of the broker from which the backup was taken for this partition.
          type: number
          readOnly: true
          example: 0
          minimum: 0
        brokerVersion:
          description: The version of the broker from which the backup was taken for this partition.
          type: string
          readOnly: true
          example: 8.0.5
      required:
        - snapshotId
        - checkpointPosition
        - brokerId
        - brokerVersion
    BackupStatus:
      title: Backup Status
      description: The status of the backup
      type: object
      properties:
        id:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupId'
        status:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/StatusCode'
        partitions:
          readOnly: true
          description: |
            Detailed list of the status of the backup per partition. It should always contain all
            partitions known to the cluster.
          type: array
          items:
            allOf:
              - $ref: '#/components/schemas/PartitionBackupStatus'
      required:
        - id
        - status
        - partitions
    DeleteBackupSuccess:
      type: object
      properties:
        id:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupId'
        partitions:
          readOnly: true
          description: |
            List of partition IDs where the backup was successfully deleted.
          type: array
          items:
            allOf:
              - $ref: '#/components/schemas/PartitionId'
      required:
        - id
        - partitions
    TakeBackupSuccess:
      type: object
      properties:
        id:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupId'
        partitions:
          readOnly: true
          description: |
            List of partition IDs where the backup was successfully started.
          type: array
          items:
            allOf:
              - $ref: '#/components/schemas/PartitionId'
      required:
        - id
        - partitions
    Error:
      type: object
      properties:
        id:
          readOnly: true
          allOf:
            - $ref: '#/components/schemas/BackupId'
        failure:
          readOnly: true
          type: string
          example: |
            Failed to take backup with ID 3.
        failures:
          readOnly: true
          type: array
          items:
            type: object
            properties:
              id:
                readOnly: true
                allOf:
                  - $ref: '#/components/schemas/PartitionId'
              failure:
                readOnly: true
                type: string
                description: |
                  A message describing the reason why the request failed for a given partition.
                example: |
                  Request to zeebe-broker-1 timed out after 30 seconds.
            required:
              - id
              - failure
      required:
        - id
    BackupStatusError:
      title: Backup Status Error
      type: object
      allOf:
        - $ref: '#/components/schemas/Error'
        - type: object
          properties:
            partitions:
              readOnly: true
              description: |
                Status information for partitions which returned a successful response.
              type: array
              items:
                allOf:
                  - $ref: '#/components/schemas/PartitionBackupStatus'
          required:
            - partitions
    TakeBackupError:
      title: Backup Creation Error
      type: object
      allOf:
        - $ref: '#/components/schemas/Error'
        - type: object
          properties:
            partitions:
              readOnly: true
              description: |
                List of partition IDs where the backup was successfully started.
              type: array
              items:
                allOf:
                  - $ref: '#/components/schemas/PartitionId'
          required:
            - partitions
    DeleteBackupError:
      title: Backup Deletion Error
      type: object
      allOf:
        - $ref: '#/components/schemas/Error'
        - type: object
          properties:
            partitions:
              readOnly: true
              description: |
                List of partition IDs where the backup was successfully deleted.
              type: array
              items:
                allOf:
                  - $ref: '#/components/schemas/PartitionId'
          required:
            - partitions

  examples:
    in-progress:
      summary: Status of an in progress backup
      description: |
        Status response of a backup which is still in progress, with an ID of 1, across two partitions.
      value:
        id: 1
        status: IN_PROGRESS
        partitions:
          - id: 1
            status: IN_PROGRESS
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 1
              brokerId: 1
              brokerVersion: 8.0.6
          - id: 2
            status: IN_PROGRESS
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 1
              brokerId: 1
              brokerVersion: 8.0.6
    completed:
      summary: Status of completed backup
      description: |
        Status response for a completed backup with ID 2, on a cluster with 2 partitions.
      value:
        id: 2
        status: COMPLETED
        partitions:
          - id: 1
            status: COMPLETED
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 2
              brokerId: 1
              brokerVersion: 8.0.6
          - id: 2
            status: COMPLETED
            descriptor:
              snapshotId: 238632143-55-690906332-690905294
              checkpointPosition: 2
              brokerId: 1
              brokerVersion: 8.0.6
    backup-status-not-found:
      summary: Non-existent backup
      description: |
        Cannot get the status of backup with ID 3 as no partition is aware of such a backup.
      value:
        id: 3
        failure: |
          Failed to get status of backup with ID 3 across all partitions (out of 3 partitions).
    take-backup-success:
      summary: Cluster-wide success
      description: |
        All partition leaders have started taking, or were already taking, a backup with the ID
        returned in the response body.
      value:
        id: 3
        partitions:
          - id: 1
          - id: 2
          - id: 3
    delete-backup-success:
      summary: Cluster-wide success
      description: |
        All partition leaders have started deleting the backup with the given ID.
      value:
        id: 3
        partitions:
          - id: 1
          - id: 2
          - id: 3
    partial-delete-backup-success:
      summary: Partial deletion
      description: |
        The backup with ID 3 existed only on partitions 2 and 3, and was successfully deleted. As
        it did not exist on partition 1, that partition is not returned in the partitions list.
      value:
        id: 3
        partitions:
          - id: 2
          - id: 3
    complete-failure:
      summary: Complete failure
      description: |
        No requests could be sent to any of the partition leaders, so no aggregated backup
        status will be available, and no partition information will be available.
      value:
        id: 3
        failure: |
          The topology is currently incomplete, meaning no cluster-wide requests can be
          sent. Try again later. If this persists, check your cluster topology using any
          Zeebe client.
    failed-backup:
      summary: Failed backup
      description: |
        In a cluster of two partitions, one partition leader completed a backup, but the
        other failed. This means the backup as a whole has failed and cannot be used.
      value:
        id: 3
        message: |
          Backup status request timed out between zeebe-gateway-0 and leader of partition 2
          zeebe-broker-1 after 30 seconds.
        status: FAILED
        partitions:
          - id: 1
            status: COMPLETED
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 3
              brokerId: 1
              brokerVersion: 8.0.6
          - id: 2
            status: FAILED
            descriptor:
              snapshotId: 334818341-25-597614652-631601425
              checkpointPosition: 3
              brokerId: 2
              brokerVersion: 8.0.6
    status-partial-failure:
      summary: Partial failure
      description: |
        In a cluster of two partitions, one partition leader returned a successful status
        response, but the other never answered and the request timed out. The aggregated
        status cannot be computed in this case, but partial information is still available.
      value:
        id: 3
        failure: |
          Failed to get the status for backup 3 on partitions [2] (out of 2 partitions).
        failures:
          - id: 2
            failure: |
              Request to zeebe-broker-1 timed out after 30 seconds.
        partitions:
          - id: 1
            status: IN_PROGRESS
            descriptor:
              snapshotId: 238878141-55-691634857-691606445
              checkpointPosition: 3
              brokerId: 1
              brokerVersion: 8.0.6
    take-backup-partial-failure:
      summary: One partition failed to take a backup
      description: |
        In a cluster of two partitions, one partition leader did not answer a request to take a
        backup. This is safe to retry.
      value:
        id: 3
        failure: |
          Failed to take backup with ID 3 on partitions [1] (out of 2 partitions).
        failures:
          - id: 1
            failure: |
              Request to zeebe-broker-1 timed out after 30 seconds.
        partitions:
          - id: 2
    delete-backup-partial-failure:
      summary: One partition failed to delete a backup
      description: |
        In a cluster of two partitions, one partition leader did not answer a request to delete a
        backup. This is safe to retry.
      value:
        id: 3
        failure: |
          Failed to delete backup with ID 3 on partitions [2] (out of 2 partitions).
        failures:
          - id: 2
            failure: |
              Request to zeebe-broker-1 timed out after 30 seconds.
        partitions:
          - id: 1
    delete-backup-not-found:
      summary: Non-existent backup
      description: |
        Cannot delete a backup with ID 3 as no partition is aware of such a backup.
      value:
        id: 3
        failure: |
          Failed to delete backup with ID 3 across all partitions (out of 3 partitions).

I mostly aligned with the Optimize/Operate approach:

Status

If any partition returns an error (e.g. timeout, connection failure), we return a 500. Information from the successful requests is present in the returned payload under partitions, and the errors under failures. I opted for this to clearly indicate when something goes wrong, so the user can retry if they choose. A further improvement would be to add an error code to each failure, so it's easier to decide whether to retry.

If all partitions failed, for example, partitions would be empty and failures would have an entry for each partition. If an error occurs in the gateway itself, then both partitions and failures would be empty, and the error message would describe what went wrong. Again, a more specific error code would help, but that can be done in the future.

On success, you get the backup status (see schema above), and a 200.

If all partitions return DOES_NOT_EXIST, we return a 404.
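
The rules above for the status endpoint can be sketched roughly as follows. This is only an illustration, not the gateway's actual implementation: `PartitionResult`, `status_response`, and the message used for the gateway-level error are hypothetical names chosen for this example, and the computation of the aggregated `status` field is left out because it is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartitionResult:
    """Outcome of a status request to one partition leader (hypothetical type)."""
    partition_id: int
    status: Optional[str] = None   # e.g. "IN_PROGRESS", "COMPLETED", "DOES_NOT_EXIST"
    failure: Optional[str] = None  # set when the request to the partition failed


def status_response(backup_id, results):
    """Return (http_code, body) following the rules described above."""
    if not results:
        # Error in the gateway itself: no partitions, no failures, only a message.
        # The message text is a placeholder for this sketch.
        return 500, {"id": backup_id, "failure": "No partition could be queried."}

    ok = [r for r in results if r.failure is None]
    failed = [r for r in results if r.failure is not None]
    partitions = [{"id": r.partition_id, "status": r.status} for r in ok]

    if failed:
        failed_ids = [r.partition_id for r in failed]
        return 500, {
            "id": backup_id,
            "failure": f"Failed to get the status for backup {backup_id} on partitions "
                       f"{failed_ids} (out of {len(results)} partitions).",
            "failures": [{"id": r.partition_id, "failure": r.failure} for r in failed],
            "partitions": partitions,
        }

    if all(r.status == "DOES_NOT_EXIST" for r in ok):
        return 404, {
            "id": backup_id,
            "failure": f"Failed to get status of backup with ID {backup_id} across all "
                       f"partitions (out of {len(results)} partitions).",
        }

    # Success: the aggregated "status" field is intentionally omitted in this sketch.
    return 200, {"id": backup_id, "partitions": partitions}
```

Note that a single failed partition takes precedence over everything else: even if every other partition reports DOES_NOT_EXIST, the sketch returns a 500, which matches the "one error means an error code" model.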

Take

If all partitions are successful, we return a 200, with the list of partitions which were successful, and the new backup ID (which may be higher than the requested one). If any partition fails, we return a 500, with the list of partitions which were successful, the new backup ID (which may be higher), and a list of failures which contains details about which partitions failed and why. Similarly, a specific error code would help here.

Again, I opted for a model where, if any error occurs, we return an error code so the user can decide to retry more easily. As requests are idempotent, this isn't too big of a deal.
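
Because the take request is idempotent, a client can simply repeat it when it receives a 500. The following is a minimal client-side sketch under that assumption. `take_backup_with_retry` and the injected `send_take_backup` callable are hypothetical; in practice the callable would issue `POST /actuator/backups/<backupId>` against the management port and return the HTTP code and parsed body.

```python
import time


def take_backup_with_retry(send_take_backup, backup_id, max_attempts=3, delay_seconds=0.0):
    """Call send_take_backup(backup_id) until it stops returning 500.

    send_take_backup is a hypothetical callable returning (http_code, body).
    Retrying is safe only because the take request is idempotent.
    Assumes max_attempts >= 1.
    """
    code, body = None, None
    for attempt in range(1, max_attempts + 1):
        code, body = send_take_backup(backup_id)
        if code != 500:
            # Success, or an error that a plain retry is not expected to fix.
            return code, body
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    # Still failing after the last attempt: hand the last error back to the caller.
    return code, body
```

The body of the final 500 response still lists the partitions that did start the backup, so a caller that gives up can report exactly which partitions are missing.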

Delete

If at least one partition deletes something and none returns an error, we return a 200. However, if any partition returns an error, then we return a 500 (again with the list of successful/failed partitions and details). We only return a 404 if no partition found the backup.
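
As a rough sketch of the delete rules, assuming each partition's outcome has already been reduced to one of three labels: the labels `"deleted"`, `"not_found"`, and `"error"`, as well as both function names, are assumptions made for this example and do not come from the spec.

```python
def delete_http_code(outcomes):
    """Map per-partition delete outcomes to an HTTP status code.

    outcomes maps partition ID -> "deleted", "not_found", or "error"
    (hypothetical labels for this sketch).
    """
    values = list(outcomes.values())
    if "error" in values:
        return 500  # any error wins, even if other partitions deleted the backup
    if "deleted" in values:
        return 200  # at least one partition deleted something, none failed
    return 404      # no partition knew about the backup


def deleted_partitions(outcomes):
    """Build the partitions list: only partitions that actually deleted the backup."""
    return [{"id": pid} for pid, outcome in sorted(outcomes.items()) if outcome == "deleted"]
```

For the partial deletion example in the spec (backup present on partitions 2 and 3 only), this gives a 200 with partitions 2 and 3 in the list and partition 1 left out.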

Further improvements

We could differentiate between 500 and 502, e.g. 500 when an error occurred in the gateway itself, and 502 when an error occurred during the request to one or more of the partitions.
Error details could have their own codes as well for better handling.
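
The proposed 500/502 split could look something like the sketch below. This is purely illustrative of the idea and not something the API does; `error_http_code` and its parameters are hypothetical.

```python
def error_http_code(gateway_error, partition_failures):
    """Pick 500 vs 502 as proposed above (hypothetical helper).

    gateway_error: an error message if the gateway itself failed, else None.
    partition_failures: list of per-partition failure entries (may be empty).
    """
    if gateway_error is not None:
        return 500  # the gateway itself could not process the request
    if partition_failures:
        return 502  # the gateway worked, but a request to a partition failed
    raise ValueError("no error to report")
```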

deepthidevaki mentioned this issue Jul 27, 2022

Zeebe can backup its data to an external storage without downtime and restore from it #9606

Closed

58 tasks

deepthidevaki mentioned this issue Aug 29, 2022

Implement delete backup api in gateway #10209

Closed

5 tasks

npepinpe self-assigned this Sep 15, 2022

npepinpe mentioned this issue Sep 20, 2022

Introduce take backup management API #10411

Merged

10 tasks

ghost closed this as completed in 50a1716 Sep 21, 2022

Zelldon added the version:8.1.0 Marks an issue as being completely or in parts released in 8.1.0 label Oct 4, 2022

This issue was closed.
