[Spike] Improve reprocessing a request with new collections #1036

Open
pattisdr opened this issue Aug 4, 2022 · 0 comments
Labels: enhancement (New feature or request)

pattisdr (Contributor) commented Aug 4, 2022

Is your feature request related to a specific problem?

Investigate how to reprocess a privacy request in a way that maximizes retrieving and/or masking the relevant data when the graph has changed between runs.

The original version of reprocessing assumed the graph didn't change between retries, and focused on masking as much of the originally requested data as possible. If we mask some of the collections and then hit a failure, the current logic lets us mask the remaining collections using the data we retrieved and saved during the original access step, instead of re-querying the collections to figure out which data we should mask. Once data is masked in one collection, it can prevent us from being able to reach data in downstream collections, so we opt to use our temporarily saved data.

The side effect is that data related to newly added collections can be missed:

  • If the access step fails and a new collection is added before the retry, we potentially miss data from already-completed collections downstream of the new collection, plus any collections further downstream of that set (see the sketch after this list).
  • If the erasure step fails and a new collection is added, we potentially miss masking data in the new collection and in collections downstream of it.
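
To make the first case concrete, here's a small, hypothetical sketch; the graph, collection names, and retry rule below are invented for illustration and do not reflect the actual implementation:

```python
# Original access graph: Root -> A -> B -> C. The run fails after A and B
# complete; before the retry, a new collection X is added between A and B.
graph = {"Root": {"A"}, "A": {"X", "B"}, "X": {"B"}, "B": {"C"}, "C": set()}
completed = {"A", "B"}   # collections visited before the failure

# Current retry rule: visit only uncompleted or newly added collections.
visited_on_retry = {n for n in graph if n not in completed and n != "Root"}
print(visited_on_retry)  # {'X', 'C'} (set order may vary)

# B is skipped because it completed before X existed, so rows reachable only
# through the new X -> B edge, and data downstream of B derived from them,
# are never retrieved.
```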

Describe the solution you'd like

Investigate how to better retrieve and mask newly added data and its downstream collections when reprocessing, while still being able to execute the erasure step in full, even when some collections have already had their data destroyed.

Describe alternatives you've considered, if any

Changing run order

  • We currently don't care about the order in which erasures are run; they're simply run sequentially in an order determined by dask. Instead, run them in reverse of the access graph, so that nodes with the fewest descendants are masked first. Any time a request is reprocessed, switch to running everything in full: run the access graph left to right and the erasure graph right to left (see the sketch after this list).
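
A minimal sketch of that proposed ordering, assuming the access graph is available as a plain mapping from each collection to its upstream dependencies (the collection names are placeholders):

```python
from graphlib import TopologicalSorter

# Access graph as {collection: set of upstream dependencies}.
access_deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"A"}}

# The access step runs "left to right" (upstream before downstream)...
access_order = list(TopologicalSorter(access_deps).static_order())

# ...and the proposed erasure step runs "right to left", so the nodes with
# the fewest descendants are masked first.
erasure_order = list(reversed(access_order))

print(access_order)   # e.g. ['A', 'B', 'D', 'C']
print(erasure_order)  # e.g. ['C', 'D', 'B', 'A']
```

Masking the leaves first means a failure partway through leaves the upstream path intact, so a rerun of the access graph can still traverse it to reach whatever remains unmasked.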

Merging multiple access results to use for erasures

  • Another idea: if the access step fails, have a "retry" visit all collections in the graph instead of skipping completed ones, and start caching two versions of the results instead of one:
    • (A) The latest access results, which we format and return to the user
    • (B) A merged version of the access results, saved separately to carry out the erasure. Every time we run an access request, we replace (A) and merge the latest results into (B).
  • If the erasure step fails, have reprocessing rerun the access step instead of skipping it, then retry the entire erasure step using data from (B) (see the sketch after this list).
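
A sketch of that two-cache bookkeeping; the dict-of-rows shape and the merge rule (union of rows keyed by a primary key, latest run wins) are assumptions for illustration, not existing behavior:

```python
def merge_access_results(merged_b: dict, latest_a: dict, pk: str = "id") -> dict:
    """Fold the latest run's rows (A) into the accumulated erasure cache (B)."""
    for collection, rows in latest_a.items():
        by_key = {row[pk]: row for row in merged_b.get(collection, [])}
        for row in rows:
            by_key[row[pk]] = row        # latest run wins on conflicting rows
        merged_b[collection] = list(by_key.values())
    return merged_b

# First run retrieved one order; the rerun sees a new order plus a newly
# added "loyalty" collection. (A) is what the user receives; (B) drives erasure.
cache_b = {"orders": [{"id": 1, "email": "a@example.com"}]}
latest_a = {"orders": [{"id": 2, "email": "b@example.com"}],
            "loyalty": [{"id": 9, "email": "a@example.com"}]}
cache_b = merge_access_results(cache_b, latest_a)
# cache_b now holds rows from both runs, so the erasure step can still mask
# data whose source rows were only seen in an earlier run.
```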

Current reprocessing logic

  • A privacy request is run in two sequential steps, which I'll call the access step and the erasure step:
    • The access step builds a graph of all collections and visits each collection after all of its upstream dependencies have been visited. We temporarily cache the results from each collection, both to later format as a package for the user and to reuse when building queries to mask the relevant data. This cached data expires when the redis cache expires.
      • Example: Root -> A -> B -> C -> Done.
    • The optional erasure step waits until the entire access step is complete, and then uses the data cached from the access step to run individual erasures in a deterministic order. The erasure step is not really a graph: all collections could technically be run in parallel, although we run them in sequence.
      • Example:
        • A -> Done
        • B -> Done
        • C -> Done
  • Running the access step: We commit to retrieving data from each collection in the access step once, until the entire access step is complete. If the access step fails and we restart from failure, we skip already-retrieved collections but visit uncompleted and newly added collections to collect data. We keep one version of the saved results per privacy request.
  • Running the erasure step: Once the access step is complete, we don't revisit it at all. We try to mask data in collections using the data we saved from the access step, and we mask each collection just once. If the erasure step fails and we restart from failure, we skip collections that were already masked and try to mask the remainder. If new collections have been added in the meantime, then because we don't revisit the access step, in most cases we won't have the data needed to mask those new collections or the data downstream of them (a condensed sketch of this logic follows below).
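
A condensed sketch of the current logic described in the two bullets above; the function names, cache shape, and skip checks are invented for illustration:

```python
def run_access_step(ordered_collections, completed, cache, fetch):
    """Visit collections in dependency order, skipping completed ones."""
    for collection in ordered_collections:
        if collection in completed:
            continue                      # restart-from-failure skips these
        cache[collection] = fetch(collection)  # may raise -> a rerun resumes here
        completed.add(collection)

def run_erasure_step(ordered_collections, masked, cache, mask):
    """Mask each collection exactly once, using rows cached by the access step."""
    for collection in ordered_collections:
        if collection in masked:
            continue                      # already-masked collections are skipped
        # A collection added after the access step completed has no cached
        # rows here, so its data (and data downstream of it) is missed.
        mask(collection, cache.get(collection, []))
        masked.add(collection)
```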
@pattisdr pattisdr added the enhancement New feature or request label Aug 4, 2022
@pattisdr pattisdr self-assigned this Aug 18, 2022
@seanpreston seanpreston changed the title [Spike][Backend] Improve reprocessing a request with new collections [Spike] Improve reprocessing a request with new collections Sep 23, 2022