This repository has been archived by the owner on Nov 30, 2022. It is now read-only.

Email Connector: Build Masking Instructions #1168

Merged
seanpreston merged 8 commits into main from fidesops_1135_generate_data_for_erasure_email on Sep 1, 2022

Conversation

@pattisdr pattisdr commented Aug 26, 2022

👉 The EmailConnector doesn't yet "send" an email. Documentation will be added for this feature in followup #1158

Purpose

Update EmailConnector.mask_data to cache details about the actions needed to mask a collection (how to locate the records, which fields to mask, and how to mask them).

The email is not yet sent, but in #1158 we'll look into pulling these details out of the cache to send one email per "email-based dataset" at the end of the privacy request, not per collection.

Changes

  • Add PrivacyRequest.cache_email_connector_template_contents to cache raw details about the actions needed, which we will email to a third party to complete.
  • Add PrivacyRequest.get_email_connector_template_contents_by_dataset to retrieve the cached email template details.
  • Pass the same original input_data into each "erasure" task that we already pass into each "access" task. Generally, erasure tasks work with rows retrieved from the access tasks, but for email-based tasks, we won't have any data. So we'll take the original data we would have used to perform an access request and supply that to the third party so they can use it to locate the relevant records.
  • Expand examples in contrived email dataset yaml for testing purposes
  • Add a new method TraversalNode.incoming_edges_from_same_dataset so we know any immediate upstream dependencies of the current collection that are also in the same email-related dataset. We won't have data from those upstream edges.
  • Make some of the classes/functions used to describe manual actions needed more generic so they can be shared by email-related actions. For example, rename PausedStep to CurrentStep, and StoppedCollection to CollectionActionRequired. cache_restart_details was renamed to cache_action_required.
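
A minimal sketch of the cache-then-group idea behind the two new PrivacyRequest methods, using an in-memory dict in place of Redis. The function names, the key format, and the dict-based payload are assumptions made for illustration; they are not the actual fidesops implementation.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for the Redis cache.
_cache: Dict[str, Dict[str, Any]] = {}


def cache_email_contents(
    request_id: str, collection_address: str, action_needed: List[Dict[str, Any]]
) -> None:
    """Store raw masking details under a key that embeds 'dataset:collection'."""
    # The key layout below is an assumption, chosen so entries can be
    # filtered by dataset with a simple prefix match.
    key = f"EMAIL_TEMPLATE__{request_id}__{collection_address}"
    _cache[key] = {"collection": collection_address, "action_needed": action_needed}


def get_email_contents_by_dataset(
    request_id: str, dataset: str
) -> Dict[str, Dict[str, Any]]:
    """Return the cached entries for one dataset, keyed by collection name."""
    prefix = f"EMAIL_TEMPLATE__{request_id}__{dataset}:"
    return {
        key[len(prefix):]: value
        for key, value in _cache.items()
        if key.startswith(prefix)
    }
```

Grouping by dataset at read time is what would let a single email cover every collection in an email-based dataset, rather than one email per collection.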

Checklist

  • Update CHANGELOG.md file
    • Merge in main so the most recent CHANGELOG.md file is being appended to
    • Add description within the Unreleased section in an appropriate category. Add a new category from the list at the top of the file if the needed one isn't already there.
    • Add a link to this PR at the end of the description with the PR number as the text. example: #1
  • Applicable documentation updated (guides, quickstart, postman collections, tutorial, fidesdemo, database diagram).
  • If docs updated (select one):
    • documentation complete, or draft/outline provided (tag docs-team to complete/review on this branch)
    • documentation issue created (tag docs-team to complete issue separately)
  • Good unit test/integration test coverage
  • This PR contains a DB migration. If checked, the reviewer should confirm with the author that the down_revision correctly references the previous migration before merging
  • The Run Unsafe PR Checks label has been applied, and checks have passed, if this PR touches any external services

Ticket

Fixes #1135

@pattisdr pattisdr changed the title from "[Draft] Email Connector: Build Masking Instructions" to "Email Connector: Build Masking Instructions" Aug 29, 2022
@pattisdr pattisdr marked this pull request as ready for review August 29, 2022 14:58
@seanpreston seanpreston self-assigned this Aug 29, 2022
Comment on lines +86 to +87
for edge in node.incoming_edges_from_same_dataset():
    append(locators, edge.f2.field_path.string_path, str(edge.f1))
Contributor Author

If an upstream dependency for the email connector is in the same email-based dataset, we won't have data to give them to help them query for it, so instead I'm caching the upstream field so they can find it themselves.
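
A rough sketch of how the locators could be assembled under this approach. The `append` helper is redefined here as a hypothetical stand-in, and edges are simplified to `(upstream_field_address, local_field_name)` tuples instead of real Edge objects.

```python
from typing import Any, Dict, List, Tuple


def append(target: Dict[str, List[Any]], key: str, value: Any) -> None:
    """Hypothetical helper: add a value to the list stored under key."""
    target.setdefault(key, []).append(value)


def build_locators(
    input_data: Dict[str, List[Any]],
    same_dataset_edges: List[Tuple[str, str]],
) -> Dict[str, List[Any]]:
    """Combine real input values with references to upstream fields."""
    locators: Dict[str, List[Any]] = {}
    # Values we actually have, for example identity data or rows from
    # collections outside the email-based dataset.
    for field_name, values in input_data.items():
        for value in values:
            append(locators, field_name, value)
    # For upstream collections in the same email-based dataset there is no
    # data, so record the address of the upstream field instead.
    for upstream_field_address, local_field_name in same_dataset_edges:
        append(locators, local_field_name, upstream_field_address)
    return locators
```

With no input data and a single same-dataset edge, this yields a locator such as `{"parent_id": ["email_dataset:daycare_customer:id"]}`, a pointer the third party resolves on their side.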

Comment on lines +98 to +103
mask_map[rule_field_path.string_path] = (
    rule.masking_strategy.get("strategy")
    if rule.masking_strategy
    else None
)

Contributor Author

I'm caching the masking strategy here, but we will probably just send a list of fields they should mask, not the strategy. I didn't think it hurt to cache the strategy still in case that is useful later.

Contributor

This is good. I could also imagine us precomputing the values that the third party should update the PII to.
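
For reference, a small sketch of the field-to-strategy map being discussed. Rules are modelled as plain dicts with hypothetical `field_paths` and `masking_strategy` keys, and the strategy name in the usage note is only illustrative; the real code works with Rule objects and field paths.

```python
from typing import Any, Dict, List, Optional


def build_mask_map(rules: List[Dict[str, Any]]) -> Dict[str, Optional[str]]:
    """Map each targeted field path to its masking strategy name, or None."""
    mask_map: Dict[str, Optional[str]] = {}
    for rule in rules:
        strategy = rule.get("masking_strategy")
        for field_path in rule["field_paths"]:
            # Keep the field even when no strategy is configured, so a plain
            # list of fields to mask can still be derived from the keys.
            mask_map[field_path] = strategy.get("strategy") if strategy else None
    return mask_map
```

If only the list of fields ends up in the email, `list(build_mask_map(rules))` would give it, while the cached strategy names remain available for later use.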

Comment on lines +101 to +103
locators={
    "parent_id": ["email_dataset:daycare_customer:id"]
},  # The only locator is on a separate collection on their end. We don't have data for it.
Contributor Author

@pattisdr pattisdr Aug 29, 2022

It is tricky when one email collection depends on another collection in the same email connector: we won't have any data from the upstream collection to give them to query with, so they need to locate it on their own. Here, we don't have any parent_ids to give them. They need to go to the daycare_customer collection, use the customer_id of 1, find the parent_id, and use that here.

Contributor

Interesting edge case! @mfbrown do you have any insight on whether this is likely to happen?
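
To make the edge case concrete, here is a sketch of the two-step lookup the third party would have to do on their side. The tables are in-memory lists, and the `children` collection name and all row values are made up for illustration.

```python
from typing import Any, Dict, List

# Hypothetical third-party tables; names and values are illustrative only.
daycare_customer: List[Dict[str, Any]] = [
    {"id": 10, "customer_id": 1},
    {"id": 11, "customer_id": 2},
]
children: List[Dict[str, Any]] = [
    {"id": 100, "parent_id": 10},
    {"id": 101, "parent_id": 11},
    {"id": 102, "parent_id": 10},
]


def find_rows(
    rows: List[Dict[str, Any]], field: str, values: List[Any]
) -> List[Dict[str, Any]]:
    """Return the rows whose field matches any of the given values."""
    return [row for row in rows if row[field] in values]


# Step 1: use the value we could supply (customer_id of 1) on the upstream
# collection.
parents = find_rows(daycare_customer, "customer_id", [1])
# Step 2: the cached locator only names daycare_customer.id, so the ids found
# in step 1 are fed into the dependent collection's parent_id field.
parent_ids = [row["id"] for row in parents]
matching_children = find_rows(children, "parent_id", parent_ids)
```

The second query cannot be prepared in advance on our side, which is why the cached locator holds a field address rather than concrete values.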

Comment on lines +130 to +139
"payment": CollectionActionRequired(
    step=CurrentStep.erasure,
    collection=CollectionAddress("email_dataset", "payment"),
    action_needed=[
        ManualAction(
            locators={"payer_email": ["customer-1@example.com"]},
            get=None,
            update=None,  # Nothing to mask on this collection
        )
    ],
Contributor Author

Even though there's nothing to mask on the payment collection, I am still caching details for all the "email" collections because their locators may be useful in helping them find data on other downstream collections.

Comment on lines 72 to +77
 class ManualAction(BaseSchema):
-    """Surface how to manually retrieve or mask data in a database-agnostic way
+    """Surface how to retrieve or mask data in a database-agnostic way

     "locators" are similar to the SQL "WHERE" information.
     "get" contains a list of fields that should be retrieved from the source
-    "update" is a dictionary of fields and the value they should be replaced with.
+    "update" is a dictionary of fields and the replacement value/masking strategy
Contributor Author

Reusing a structure created for the "manual" connector to cache details regarding further actions needed by some other party to continue processing of this request.
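
As a rough illustration of the shape being reused, here is a dataclass stand-in with both `get` and `update` optional. The class name and the `needs_masking` helper are hypothetical; the real ManualAction is a BaseSchema subclass.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ManualActionSketch:
    """Hypothetical stand-in for ManualAction; not the real schema."""

    locators: Dict[str, Any]  # similar to SQL "WHERE" information
    get: Optional[List[str]] = None  # fields to retrieve, if any
    update: Optional[Dict[str, Any]] = None  # field -> replacement/strategy

    @property
    def needs_masking(self) -> bool:
        # An action with only locators is still valid: it may exist purely
        # to help locate data for another collection downstream.
        return bool(self.update)


locate_only = ManualActionSketch(locators={"payer_email": ["customer-1@example.com"]})
with_update = ManualActionSketch(
    locators={"email": ["customer-1@example.com"]},
    update={"name": None},
)
```

Allowing both fields to be empty is what lets a collection with nothing to mask still contribute its locators.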

Comment on lines +53 to +68
"""Cache instructions for how to mask data in this collection.
One email will be sent for all collections in this dataset at the end of the privacy request execution.
"""

manual_action: ManualAction = self.build_masking_instructions(
    node, policy, input_data
)

logger.info("Caching action needed for collection: '%s'", node.address.value)
privacy_request.cache_email_connector_template_contents(
    step=CurrentStep.erasure,
    collection=node.address,
    action_needed=[manual_action],
)

return 0  # Fidesops itself does not mask this collection.
Contributor Author

This is a primary change of this PR - EmailConnector.mask_data caches some instructions on how to find and mask data for this particular collection.
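
A self-contained sketch of that flow: build the instructions, cache them on the privacy request, and report zero rows masked. `FakePrivacyRequest`, `mask_data_sketch`, and the dict-shaped instructions are hypothetical stand-ins so the example runs without fidesops.

```python
from typing import Any, Dict, List


class FakePrivacyRequest:
    """Hypothetical stand-in that records cached instructions in a dict."""

    def __init__(self) -> None:
        self.cached: Dict[str, Dict[str, Any]] = {}

    def cache_email_connector_template_contents(
        self, step: str, collection: str, action_needed: List[Dict[str, Any]]
    ) -> None:
        self.cached[collection] = {"step": step, "action_needed": action_needed}


def mask_data_sketch(
    privacy_request: FakePrivacyRequest,
    collection: str,
    input_data: Dict[str, List[Any]],
    fields_to_mask: List[str],
) -> int:
    """Cache masking instructions instead of masking anything directly."""
    instructions = {
        "locators": input_data,
        "get": None,
        # None when there is nothing to mask on this collection.
        "update": {field: None for field in fields_to_mask} or None,
    }
    privacy_request.cache_email_connector_template_contents(
        step="erasure", collection=collection, action_needed=[instructions]
    )
    return 0  # Nothing is masked by us; the third party does the work.
```

Returning 0 keeps the reported count of masked rows accurate, since the actual masking happens later on the third party's side.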

pattisdr commented Sep 1, 2022

Looking at test failure ^

@seanpreston seanpreston merged commit dd141ca into main Sep 1, 2022
@seanpreston seanpreston deleted the fidesops_1135_generate_data_for_erasure_email branch September 1, 2022 15:49
sanders41 pushed a commit that referenced this pull request Sep 22, 2022
* Pass in input_data to erasure requests, and not just access requests, so it can be used for the email connector, which won't have any rows returned from an access request.

* Add an EmailConnector.build_masking_instructions method with a draft of the data needed to instruct the user how to query, how to mask, and which fields to mask on their end.

* Have EmailConnector.mask_data cache the raw details of what needs to be masked in Redis. We'll use this to send one email at the end for each "email"-based dataset, instead of sending one email for each collection.

Reuse some of the caching code created for manual connectors / failed privacy requests where, as with the EmailConnector, some separate action is required on a given collection. Rename to make it more generic.

* Remove restriction that a ManualAction needs a get or update value. The manual action could just be locating data for another collection downstream.

Cache email template details, even if there are no actions needed on that specific collection.

* Update the expected number of collections in the email dataset.

* build_masking_instructions is not required to return a ManualAction.

* Reconcile this test with the work to make log send asynchronous.

Co-authored-by: Sean Preston <sean@ethyca.com>
Development

Successfully merging this pull request may close these issues.

Email Connector: Generate Data for Erasure Email
2 participants