This ticket involves creating Step Functions that a developer with the proper permissions can run to restore data from our point-in-time recovery (PITR) backup tables into the primary-region DynamoDB tables. Using Step Functions will automate the most complex and error-prone part of disaster recovery, reducing our RTO and improving the likelihood of a successful recovery.
The main use case for these Step Functions is restoring our tables to a specific point in time when a disaster in the primary region causes major loss or corruption of data that requires a rollback (e.g., a development bug during a migration, or states uploading large amounts of corrupt license data).
This ticket is focused on creating the Step Functions, one for each table, that perform the hard reset. Given a source table ARN and a destination table ARN, the Step Function first deletes all records from the destination table and then copies all records from the source table into the destination table.
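The delete-then-copy flow described above could be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the function names `scan_all` and `hard_reset` and the `key_names` parameter are hypothetical, and the sketch assumes table objects that expose `scan` and `batch_writer` the way boto3 DynamoDB `Table` resources do.

```python
from typing import Any, Dict, Iterator, List


def scan_all(table: Any) -> Iterator[Dict[str, Any]]:
    """Yield every item in a table, following scan pagination."""
    kwargs: Dict[str, Any] = {}
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return
        # Resume the scan where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def hard_reset(source: Any, destination: Any, key_names: List[str]) -> Dict[str, int]:
    """Delete all items in `destination`, then copy all items from `source`.

    `key_names` (hypothetical parameter) lists the destination table's
    primary key attributes, e.g. ["pk", "sk"].
    """
    deleted = 0
    with destination.batch_writer() as batch:
        for item in scan_all(destination):
            # Only the primary key attributes are needed to delete an item.
            batch.delete_item(Key={name: item[name] for name in key_names})
            deleted += 1

    copied = 0
    with destination.batch_writer() as batch:
        for item in scan_all(source):
            batch.put_item(Item=item)
            copied += 1

    return {"deleted": deleted, "copied": copied}
```

With boto3, `source` and `destination` would be `boto3.resource("dynamodb").Table(...)` objects. For large tables, a real Step Function would probably need to split this work across several state transitions or Lambda invocations (passing the last evaluated key between them) rather than run it in one call; that chunking is left out here for brevity.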
As part of this, we will also need to set a resource policy on the Step Function that only allows execution by an individual assuming a specific DR role, which will need to be manually created in the management account. Edit: It turns out that Step Functions do not have resource-based policies. In light of this, we determined that rather than creating a DR role to guard against accidental runs, we will add a confirmation flag: the admin must pass in the name of the table they are trying to restore. Admins have the ability to change policies anyway, and the real objective of a separate role was to prevent these Step Functions from being run accidentally, which the flag accounts for.
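One way the confirmation flag check might look is sketched below. The input field names `destinationTableArn` and `tableNameRecoveryConfirmation`, and both function names, are assumptions made for illustration; the ticket does not specify them. The check relies only on the standard DynamoDB table ARN layout, `arn:aws:dynamodb:<region>:<account-id>:table/<table-name>`.

```python
from typing import Any, Dict


def table_name_from_arn(arn: str) -> str:
    """Extract the table name from a DynamoDB table ARN."""
    # The resource part is everything after the fifth colon.
    resource = arn.split(":", 5)[-1]
    prefix = "table/"
    if not arn.startswith("arn:") or not resource.startswith(prefix):
        raise ValueError(f"Not a DynamoDB table ARN: {arn!r}")
    # Drop any trailing sub-resource such as /index/... or /stream/...
    return resource[len(prefix):].split("/", 1)[0]


def confirm_restore(event: Dict[str, Any]) -> str:
    """Return the table name if the confirmation flag matches, else raise.

    Field names are hypothetical placeholders for the Step Function input.
    """
    table_name = table_name_from_arn(event["destinationTableArn"])
    confirmation = event.get("tableNameRecoveryConfirmation")
    if confirmation != table_name:
        raise ValueError(
            "Confirmation flag does not match the destination table name; "
            "refusing to run the restore."
        )
    return table_name
```

Running this check as the first state means a mistyped or missing table name fails the execution before any records are deleted.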
Note that due to the sensitive nature of SSN keys in DynamoDB, we will need to implement a different DR solution for that specific table. That effort has been split into #988.
Notes
Questions
Assumptions
Estimate
Tasks
Implementation Notes