[Spike] Improve reprocessing a request with new collections #1036

Open
pattisdr opened this issue Aug 4, 2022 · 0 comments
Labels: enhancement (New feature or request)

pattisdr (Contributor) commented Aug 4, 2022

Is your feature request related to a specific problem?

Investigate how to reprocess a privacy request in a way that maximizes retrieving and/or masking the relevant data when the graph has changed between runs.

The original version of reprocessing assumed the graph didn't change between retries, and focused on masking as much of the originally requested data as possible. If we mask some of the collections and then hit a failure, the current logic lets us mask the remaining collections using the data we retrieved and saved during the original access step, instead of re-querying the collections to figure out which data we should mask. Once data is masked in one collection, it can prevent us from being able to reach data in downstream collections, so we opt to use our temporarily saved data.

The side effect is that data related to newly added collections can be missed:

  • If the access step fails and a new collection is added before the retry, we potentially miss data from already-completed collections downstream of the new collection, plus any collections further downstream of that set (see the sketch after this list).
  • If the erasure step fails and a new collection is added, we potentially miss masking data in the new collection and in collections downstream of it.
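
To make the first case concrete, here's a small, hypothetical sketch; the graph, collection names, and retry rule below are invented for illustration and do not reflect the actual implementation:

```python
# Original access graph: Root -> A -> B -> C. The run fails after A and B
# complete; before the retry, a new collection X is added between A and B.
graph = {"Root": {"A"}, "A": {"X", "B"}, "X": {"B"}, "B": {"C"}, "C": set()}
completed = {"A", "B"}   # collections visited before the failure

# Current retry rule: visit only uncompleted or newly added collections.
visited_on_retry = {n for n in graph if n not in completed and n != "Root"}
print(visited_on_retry)  # {'X', 'C'} (set order may vary)

# B is skipped because it completed before X existed, so rows reachable only
# through the new X -> B edge, and data downstream of B derived from them,
# are never retrieved.
```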

Describe the solution you'd like

Investigate how to better retrieve and mask newly added data and its downstream collections when reprocessing, while still being able to execute the erasure step in full, even when some collections have already had their data destroyed.

Describe alternatives you've considered, if any

Changing run order

  • We currently don't care about the order in which erasures are run; they're simply run sequentially in an order determined by dask. Instead, run them in reverse of the access graph, so that nodes with the fewest descendants are masked first. Any time a request is reprocessed, switch to running everything in full: run the access graph left to right and the erasure graph right to left (see the sketch after this list).
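
A minimal sketch of that proposed ordering, assuming the access graph is available as a plain mapping from each collection to its upstream dependencies (the collection names are placeholders):

```python
from graphlib import TopologicalSorter

# Access graph as {collection: set of upstream dependencies}.
access_deps = {"A": set(), "B": {"A"}, "C": {"B"}, "D": {"A"}}

# The access step runs "left to right" (upstream before downstream)...
access_order = list(TopologicalSorter(access_deps).static_order())

# ...and the proposed erasure step runs "right to left", so the nodes with
# the fewest descendants are masked first.
erasure_order = list(reversed(access_order))

print(access_order)   # e.g. ['A', 'B', 'D', 'C']
print(erasure_order)  # e.g. ['C', 'D', 'B', 'A']
```

Masking the leaves first means a failure partway through leaves the upstream path intact, so a rerun of the access graph can still traverse it to reach whatever remains unmasked.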

Merging multiple access results to use for erasures

  • Another idea: if the access step fails, have a "retry" visit all collections in the graph instead of skipping completed ones, and start caching two versions of the results instead of one:
    • (A) The latest access results, which we format and return to the user
    • (B) A merged version of the access results, saved separately to carry out the erasure. Every time we run an access request, we replace (A) and merge the latest results into (B).
  • If the erasure step fails, have reprocessing rerun the access step instead of skipping it, then retry the entire erasure step using data from (B) (see the sketch after this list).
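
A sketch of that two-cache bookkeeping; the dict-of-rows shape and the merge rule (union of rows keyed by a primary key, latest run wins) are assumptions for illustration, not existing behavior:

```python
def merge_access_results(merged_b: dict, latest_a: dict, pk: str = "id") -> dict:
    """Fold the latest run's rows (A) into the accumulated erasure cache (B)."""
    for collection, rows in latest_a.items():
        by_key = {row[pk]: row for row in merged_b.get(collection, [])}
        for row in rows:
            by_key[row[pk]] = row        # latest run wins on conflicting rows
        merged_b[collection] = list(by_key.values())
    return merged_b

# First run retrieved one order; the rerun sees a new order plus a newly
# added "loyalty" collection. (A) is what the user receives; (B) drives erasure.
cache_b = {"orders": [{"id": 1, "email": "a@example.com"}]}
latest_a = {"orders": [{"id": 2, "email": "b@example.com"}],
            "loyalty": [{"id": 9, "email": "a@example.com"}]}
cache_b = merge_access_results(cache_b, latest_a)
# cache_b now holds rows from both runs, so the erasure step can still mask
# data whose source rows were only seen in an earlier run.
```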

Current reprocessing logic

  • A privacy request is run in two sequential steps, which I'll call the access step and the erasure step:
    • The access step builds a graph of all collections and visits each collection after all of its upstream dependencies have been visited. We temporarily cache the results from each collection, both to later format as a package for the user and to reuse when building queries to mask the relevant data. This cached data expires when the redis cache expires.
      • Example: Root -> A -> B -> C -> Done.
    • The optional erasure step waits until the entire access step is complete, and then uses the data cached from the access step to run individual erasures in a deterministic order. The erasure step is not really a graph: all collections could technically be run in parallel, although we run them in sequence.
      • Example:
        • A -> Done
        • B -> Done
        • C -> Done
  • Running the access step: We commit to retrieving data from each collection in the access step once, until the entire access step is complete. If the access step fails and we restart from failure, we skip already-retrieved collections but visit uncompleted and newly added collections to collect data. We keep one version of the saved results per privacy request.
  • Running the erasure step: Once the access step is complete, we don't revisit it at all. We try to mask data in collections using the data we saved from the access step, and we mask each collection just once. If the erasure step fails and we restart from failure, we skip collections that were already masked and try to mask the remainder. If new collections have been added in the meantime, then because we don't revisit the access step, in most cases we won't have the data needed to mask those new collections or the data downstream of them (a condensed sketch of this logic follows below).
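
A condensed sketch of the current logic described in the two bullets above; the function names, cache shape, and skip checks are invented for illustration:

```python
def run_access_step(ordered_collections, completed, cache, fetch):
    """Visit collections in dependency order, skipping completed ones."""
    for collection in ordered_collections:
        if collection in completed:
            continue                      # restart-from-failure skips these
        cache[collection] = fetch(collection)  # may raise -> a rerun resumes here
        completed.add(collection)

def run_erasure_step(ordered_collections, masked, cache, mask):
    """Mask each collection exactly once, using rows cached by the access step."""
    for collection in ordered_collections:
        if collection in masked:
            continue                      # already-masked collections are skipped
        # A collection added after the access step completed has no cached
        # rows here, so its data (and data downstream of it) is missed.
        mask(collection, cache.get(collection, []))
        masked.add(collection)
```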
@pattisdr pattisdr added the enhancement New feature or request label Aug 4, 2022
@pattisdr pattisdr self-assigned this Aug 18, 2022
@seanpreston seanpreston changed the title [Spike][Backend] Improve reprocessing a request with new collections [Spike] Improve reprocessing a request with new collections Sep 23, 2022