Feature: enhance cold backup and restore function #1081

Open
hycdong opened this issue Jul 27, 2022 · 0 comments
Labels
type/enhancement Indicates new feature requests

hycdong commented Jul 27, 2022

Background

Pegasus currently supports cold backup and restore functions, but both of them have some disadvantages.

For cold backup, Pegasus supports periodic backup through policies. Users can create a policy with backup-related parameters, such as provider and interval time, and apply this policy to several tables. Pegasus has also supported onetime backup since release 2.3.0.
However, the backup function has the following disadvantages:

  • Periodic backup cannot start accurately at the configured start time.
  • When the periodic backup interval is less than 1 day, periodic backup is triggered unexpectedly.
  • A user-defined provider path is not supported for periodic backup.
  • Once a backup is started, it cannot be canceled. When a backup fails, it keeps retrying until it succeeds, even across meta server restarts.
  • The current backup incurs heavy I/O while copying the checkpoint.
  • The path layout on the provider makes it hard to find one table's backups.
  • The backup code is not friendly to read and maintain.

For restore, Pegasus supports two data_version values. Tables created in release 1.x are V0, and tables created in release 2.x are V1. The restore process creates an empty table and then applies the backup checkpoint. This causes a compatibility problem: a release 2.x table cannot apply a V0 checkpoint, and attempting to do so leads to a coredump that makes the cluster unusable. As a result, restore needs to check the table data_version to be robust.

New backup design

The enhanced version of backup, backup v2 for short, will solve all the problems above and provide a simple backup function.

Components

The meta backup function consists of three parts:

  • Backup engine - interacts with the replica servers
  • Periodic backup context - manages a table's periodic backup policy and its backups
    • the meta server will have a timer to check whether a periodic backup should be triggered
    • for the first triggered backup, the server will check it against start_time, whose format is like "15:00"
    • for later backups, the server will compare the last backup start time with the periodic backup interval
    • a periodic backup policy cannot be modified, but it can be deleted and recreated
  • Backup service - manages the backups of all tables in the cluster, including onetime backup and periodic backup. It also exposes the RPC interface to admin-cli and shell
    • add table periodic backup policy
    • query periodic backup policy
    • disable/enable periodic backup policy
    • delete periodic backup policy
    • start onetime backup
    • query backup (onetime and periodic)
    • cancel backup (onetime and periodic)
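
The trigger rules for periodic backup listed above can be sketched as follows. This is a minimal illustration, not the Pegasus implementation: the function name `should_trigger_backup`, its parameters, and the `check_period` window (standing in for how often the meta server timer fires) are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_trigger_backup(
    now: datetime,
    start_time: str,
    interval: timedelta,
    last_backup_start: Optional[datetime],
    check_period: timedelta = timedelta(minutes=1),  # assumed timer period
) -> bool:
    """Decide whether the timer should start a periodic backup (hypothetical helper)."""
    if last_backup_start is None:
        # First backup: fire only inside the timer window that begins at the
        # configured "HH:MM" time of day, so it starts close to start_time.
        hour, minute = (int(part) for part in start_time.split(":"))
        scheduled = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        return scheduled <= now < scheduled + check_period
    # Later backups: fire once a full interval has elapsed since the last start.
    return now - last_backup_start >= interval
```

Because later backups depend only on the last start time and the interval, an interval shorter than one day needs no special handling in this sketch.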

Main flow

[image: backup main flow diagram]

  • when receiving a start backup request, the engine will turn its backup status into checkpointing and send requests to the replica servers
  • each replica will turn its state into checkpointing, and then into checkpointed after the checkpoint is generated successfully
  • when the status of all partitions is checkpointed, meta will turn its status into uploading
  • each replica will turn its state into uploading, and then into succeed after the checkpoint is uploaded successfully; the backup checkpoint directory will be deleted after a while
  • when the status of all partitions is succeed, meta will turn its status into succeed and consider the backup successful
  • if any error happens during the whole process, the backup fails
  • if a cancel backup request is received, a backup in the checkpointing or uploading status will be canceled
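
The flow above amounts to a small state machine on the meta side. The sketch below is one way to express it, under stated assumptions: the names `BackupStatus`, `next_meta_status` and `cancel` are hypothetical, a single enum is shared by meta and replicas for brevity, and `FAILED` and `CANCELED` are assumed names for the failure and cancel outcomes.

```python
from enum import Enum
from typing import Iterable


class BackupStatus(Enum):
    CHECKPOINTING = "checkpointing"
    CHECKPOINTED = "checkpointed"  # used by replicas only in the flow above
    UPLOADING = "uploading"
    SUCCEED = "succeed"
    FAILED = "failed"              # assumed name
    CANCELED = "canceled"          # assumed name


TERMINAL = (BackupStatus.SUCCEED, BackupStatus.FAILED, BackupStatus.CANCELED)


def next_meta_status(current: BackupStatus,
                     partitions: Iterable[BackupStatus]) -> BackupStatus:
    """Return the meta-side status after inspecting every partition's state."""
    states = list(partitions)
    if current in TERMINAL:
        return current
    if any(s is BackupStatus.FAILED for s in states):
        return BackupStatus.FAILED  # any error fails the whole backup
    if (current is BackupStatus.CHECKPOINTING and states
            and all(s is BackupStatus.CHECKPOINTED for s in states)):
        return BackupStatus.UPLOADING
    if (current is BackupStatus.UPLOADING and states
            and all(s is BackupStatus.SUCCEED for s in states)):
        return BackupStatus.SUCCEED
    return current  # still waiting for some partitions


def cancel(current: BackupStatus) -> BackupStatus:
    """Only an in-progress (checkpointing or uploading) backup can be canceled."""
    if current in (BackupStatus.CHECKPOINTING, BackupStatus.UPLOADING):
        return BackupStatus.CANCELED
    return current
```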

Backup paths

Path on remote storage (zk)

<cluster_root>/backup/<app_id>/once/<timestamp>/<backup_item>
<cluster_root>/backup/<app_id>/periodic/<policy_context>
                                       /<timestamp>/<backup_item>

Path on remote backup provider (such as HDFS)

<root>/<cluster_name>/<app_name>_<app_id>/<timestamp>/<pidx>/chkpt_<ip>_<port>
                                                            /meta
                                                            /backup_info
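
As an illustration of the provider layout above, the following sketch builds the checkpoint directory of one partition from its components. The function names and the sample values are hypothetical, and the placement of the meta and backup_info entries is deliberately left out.

```python
def table_backup_root(root: str, cluster_name: str, app_name: str, app_id: int) -> str:
    """Directory that holds every backup of one table: <root>/<cluster_name>/<app_name>_<app_id>."""
    return f"{root.rstrip('/')}/{cluster_name}/{app_name}_{app_id}"


def checkpoint_dir(root: str, cluster_name: str, app_name: str, app_id: int,
                   timestamp: int, pidx: int, ip: str, port: int) -> str:
    """Checkpoint directory of one partition: .../<timestamp>/<pidx>/chkpt_<ip>_<port>."""
    table_dir = table_backup_root(root, cluster_name, app_name, app_id)
    return f"{table_dir}/{timestamp}/{pidx}/chkpt_{ip}_{port}"


# Example with made-up values:
print(checkpoint_dir("/backup", "c1", "temp", 2, 1658880000000, 0, "10.0.0.1", 34801))
```

Since every backup of a table shares the `table_backup_root` prefix, listing that directory is enough to find all backups of one table.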

New restore

Restore v2 does not change the design. It only adds a data version check, refactors the code, and stays compatible with the old backup path on the backup provider.
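
A data version check of the kind described could look like the sketch below. It is an assumption-laden illustration: the names `DataVersionMismatch` and `check_restore_compatible` are hypothetical, and the sketch conservatively rejects any mismatch, which is stricter than the single incompatibility named above (a V1 table applying a V0 checkpoint).

```python
class DataVersionMismatch(Exception):
    """Raised when a checkpoint cannot be applied to the target table (hypothetical)."""


def check_restore_compatible(table_data_version: int, checkpoint_data_version: int) -> None:
    """Fail fast before applying a checkpoint, instead of crashing later."""
    if table_data_version != checkpoint_data_version:
        raise DataVersionMismatch(
            f"table data_version is V{table_data_version}, "
            f"but checkpoint data_version is V{checkpoint_data_version}"
        )
```

Rejecting the request up front turns the coredump case into an ordinary error that can be reported to the caller.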

Pull request merge plan

  • Add a new branch called backup-restore-dev. All pull requests will first be merged into this branch, and finally into the master branch.
  • Remove all old backup and restore code first, because the new code is hugely different from the old implementation.
  • This feature is NOT planned for 2.4.0 but for the next release, so it will not block the release process.