This ticket involves creating step functions which a developer with proper permissions can run to restore data from our backups into the primary region DynamoDB tables. Using step functions will automate the most complex and error-prone part of disaster recovery, reducing our recovery time objective (RTO) and improving the likelihood of a successful recovery.
The main use case for these step functions will be to restore our tables to specific points in time in the event that a disaster in the primary region causes major loss or corruption of data that requires rollback (e.g., a development bug during a migration, or states uploading large amounts of corrupt license data).
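As a rough illustration of what a single restore step might call, the sketch below wraps the DynamoDB `restore_table_to_point_in_time` operation. It assumes point-in-time recovery is enabled on the source table; if the backups are AWS Backup recovery points instead, a different call would apply. The `-dr-restore` suffix and the helper names are hypothetical and not part of this ticket. Note that this operation restores into a new table, so moving the data back into the primary table would still be a separate step.

```python
from datetime import timezone


def build_restore_request(source_table, restore_time, suffix="-dr-restore"):
    """Build the kwargs for a point-in-time restore into a new table.

    The target table naming scheme (source name + suffix) is an assumption
    made for this sketch only.
    """
    if restore_time.tzinfo is None:
        # Refuse naive datetimes so the restore point is never ambiguous.
        raise ValueError("restore_time must be timezone-aware")
    return {
        "SourceTableName": source_table,
        "TargetTableName": f"{source_table}{suffix}",
        "RestoreDateTime": restore_time.astimezone(timezone.utc),
    }


def restore_table(dynamodb_client, source_table, restore_time):
    """Start the restore and return the ARN of the new table."""
    request = build_restore_request(source_table, restore_time)
    response = dynamodb_client.restore_table_to_point_in_time(**request)
    return response["TableDescription"]["TableArn"]
```

In practice `dynamodb_client` would be something like `boto3.client("dynamodb")`. Passing the client in as a parameter keeps the helper easy to exercise with a stub in automated tests.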
This ticket involves implementing the top level step function that calls all the other step functions to simplify the DR trigger process.
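One way a parent state machine can call child state machines is the Step Functions `states:startExecution.sync:2` service integration, which waits for each child execution to finish before moving on. The sketch below builds such a definition in Amazon States Language as a plain Python dict. The step names and ARNs passed in are hypothetical placeholders, and running the children sequentially is an assumption rather than a decision recorded in this ticket.

```python
import json


def build_top_level_definition(child_arns):
    """Chain child state machines sequentially in one parent definition.

    child_arns maps a state name to the ARN of the child state machine it
    should run; insertion order determines execution order.
    """
    if not child_arns:
        raise ValueError("at least one child state machine is required")

    names = list(child_arns)
    states = {}
    for index, name in enumerate(names):
        state = {
            "Type": "Task",
            # Run the child and wait for it to complete.
            "Resource": "arn:aws:states:::states:startExecution.sync:2",
            "Parameters": {
                "StateMachineArn": child_arns[name],
                # Forward the parent's input to the child unchanged.
                "Input.$": "$",
            },
            # Discard the child's output so every step sees the same input.
            "ResultPath": None,
        }
        if index + 1 < len(names):
            state["Next"] = names[index + 1]
        else:
            state["End"] = True
        states[name] = state

    return {
        "Comment": "Top-level DR orchestration (illustrative sketch)",
        "StartAt": names[0],
        "States": states,
    }


if __name__ == "__main__":
    # Hypothetical child state machines, for illustration only.
    example = build_top_level_definition({
        "RestoreTableA": "arn:aws:states:us-east-1:111111111111:stateMachine:restore-a",
        "RestoreTableB": "arn:aws:states:us-east-1:111111111111:stateMachine:restore-b",
    })
    print(json.dumps(example, indent=2))
```

A real definition would likely also need error handling (for example, a `Catch` that routes failures to a cleanup state), which is omitted here to keep the sketch short.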
The step function will perform the following:
Throttle ALL lambdas in the system by setting their reserved concurrency to 0, effectively freezing all our computing resources during the DR run.
Un-throttle all the lambdas once the DR process is complete.
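The two steps above could be sketched roughly as follows, using the Lambda API's `put_function_concurrency`, `get_function_concurrency`, and `delete_function_concurrency` operations. This is a minimal sketch, not the planned implementation. Two things here are assumptions that go beyond the ticket: the `exclude` parameter (for any functions the DR run itself might need) and recording each function's previous reserved concurrency so it can be put back afterwards.

```python
def throttle_all(lambda_client, exclude=frozenset()):
    """Set reserved concurrency to 0 on every function not in `exclude`.

    Returns a dict of function name -> previous reserved concurrency
    (None when the function had no reserved concurrency configured).
    """
    previous = {}
    paginator = lambda_client.get_paginator("list_functions")
    for page in paginator.paginate():
        for function in page["Functions"]:
            name = function["FunctionName"]
            if name in exclude:
                continue
            current = lambda_client.get_function_concurrency(FunctionName=name)
            # The key is absent when no reserved concurrency is set.
            previous[name] = current.get("ReservedConcurrentExecutions")
            lambda_client.put_function_concurrency(
                FunctionName=name,
                ReservedConcurrentExecutions=0,
            )
    return previous


def unthrottle(lambda_client, previous):
    """Restore the settings recorded by throttle_all."""
    for name, limit in previous.items():
        if limit is None:
            # No prior setting: remove the limit entirely.
            lambda_client.delete_function_concurrency(FunctionName=name)
        else:
            lambda_client.put_function_concurrency(
                FunctionName=name,
                ReservedConcurrentExecutions=limit,
            )
```

Here `lambda_client` would typically be `boto3.client("lambda")`. Inside a step function, the `previous` mapping would need to be carried in the execution state (or persisted elsewhere) so the un-throttle step can read it, including when the restore fails partway through.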
Notes
We will be adding a directory in backend named disasterRecovery which will contain documentation for running the DR cutover.
The step function to restore S3 buckets will not be implemented in phase 1. We currently only have the provider records bucket, which will be versioned and will not initially hold many objects, so it can be restored through manual intervention if needed.
Depends on our retention policy implementation (#246).
Questions
Assumptions
Estimate
5
Tasks
TODO
Automated tests
API Docs
Postman collection
PR opened with labels / reviewers / assignee / linked-issue
Implementation Notes