Disaster Recovery - Implement step function to restore backup Dynamo Tables from PITR in Primary region #892

@landonshumway-ia

Description

This ticket involves creating step functions that a developer with proper permissions can run to restore data from our backups into the primary region Dynamo Tables. Using step functions automates the most complex and error-prone part of disaster recovery, which reduces our RTO and makes a successful recovery more likely.

The main use case for these step functions is restoring our tables to a specific point in time when a disaster in the primary region causes major data loss or corruption that requires a rollback (e.g. a development bug during a migration, or states uploading large amounts of corrupt license data).

This ticket involves implementing the top-level step function, which calls all the other step functions so that the DR process can be triggered from a single place.

The step function will perform the following:

  • Throttle ALL lambdas in the system by setting their concurrency limit to 0, effectively freezing all our computing resources during the DR run.
  • Restore the DynamoDB PITR recovery points into new tables, which will then be passed into the step functions that perform a hard reset (to be completed as part of Disaster Recovery - Implement Step functions to perform synchronization (hard reset) on non-ssn tables #987).
  • Call the step function that performs the hard reset.
  • Un-throttle all the lambdas once the DR process is complete.
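The four steps above can be sketched as an Amazon States Language definition built in Python. This is only an illustration under stated assumptions, not the implementation for this ticket: the input fields `lambdaFunctionNames` and `restoreDateTime`, the `-dr-restored` table suffix, and the example table names and ARN are all hypothetical. It uses Step Functions AWS SDK integrations to call `PutFunctionConcurrency`, `RestoreTableToPointInTime`, and `DeleteFunctionConcurrency`, and the `startExecution.sync:2` integration to run the hard reset step function.

```python
import json

SDK = "arn:aws:states:::aws-sdk"  # prefix for Step Functions AWS SDK integrations


def concurrency_map(action, extra_params, next_state=None):
    """Map state that applies one Lambda concurrency API call per function name."""
    state = {
        "Type": "Map",
        "ItemsPath": "$.lambdaFunctionNames",  # hypothetical input field
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "SetConcurrency",
            "States": {
                "SetConcurrency": {
                    "Type": "Task",
                    "Resource": f"{SDK}:lambda:{action}",
                    "Parameters": {"FunctionName.$": "$", **extra_params},
                    "End": True,
                }
            },
        },
        "ResultPath": None,  # discard results, pass the original input through
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_definition(table_names, hard_reset_arn):
    """Return the top-level DR flow as a dict: throttle, restore, reset, un-throttle."""
    restore_branches = [
        {
            "StartAt": f"Restore_{name}",
            "States": {
                f"Restore_{name}": {
                    "Type": "Task",
                    "Resource": f"{SDK}:dynamodb:restoreTableToPointInTime",
                    "Parameters": {
                        "SourceTableName": name,
                        # Hypothetical naming scheme for the restored copy.
                        "TargetTableName": f"{name}-dr-restored",
                        # Assumption: the timestamp arrives in the execution input.
                        "RestoreDateTime.$": "$.restoreDateTime",
                    },
                    "End": True,
                }
            },
        }
        for name in table_names
    ]
    return {
        "StartAt": "ThrottleLambdas",
        "States": {
            "ThrottleLambdas": concurrency_map(
                "putFunctionConcurrency",
                {"ReservedConcurrentExecutions": 0},
                next_state="RestoreTables",
            ),
            "RestoreTables": {
                "Type": "Parallel",
                "Branches": restore_branches,
                "ResultPath": "$.restoreResults",
                "Next": "HardReset",
            },
            "HardReset": {
                "Type": "Task",
                "Resource": "arn:aws:states:::states:startExecution.sync:2",
                "Parameters": {
                    "StateMachineArn": hard_reset_arn,
                    "Input": {"restoredTables.$": "$.restoreResults"},
                },
                "ResultPath": None,
                "Next": "UnthrottleLambdas",
            },
            "UnthrottleLambdas": concurrency_map("deleteFunctionConcurrency", {}),
        },
    }


if __name__ == "__main__":
    # Placeholder table names and ARN, for illustration only.
    definition = build_definition(
        ["example-table-a", "example-table-b"],
        "arn:aws:states:us-east-1:123456789012:stateMachine:example-hard-reset",
    )
    print(json.dumps(definition, indent=2))
```

Two gaps in this sketch would need handling in a real flow. `RestoreTableToPointInTime` returns while the new table is still being created, so the restore branch would have to wait for the table to become active before the hard reset starts. `DeleteFunctionConcurrency` removes the reserved concurrency setting entirely, so any lambda that had its own reserved concurrency before the DR run would need that value recorded and reapplied.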

Notes

We will be adding a directory named disasterRecovery in backend, which will contain documentation for running the DR cutover.

The step function to restore S3 buckets will not be implemented in phase 1. We currently only have the provider records bucket, which will be versioned and will not initially hold many objects, so it can be restored through manual intervention if needed.

This work depends on our retention policy implementation #246.

Questions

Assumptions

Estimate

5

Tasks

  • TODO
  • Automated tests
  • API Docs
  • Postman collection
  • PR opened with labels / reviewers / assignee / linked-issue

Implementation Notes

Metadata

Labels

No labels

Type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
