pmap

This is not an official Google product.

Background

Privacy data management is the process of collecting, storing, using, and disposing of data in a way that protects the privacy of users. It is a critical part of any organization that collects or uses user data, and organizations are typically required to maintain compliance with policies set by regulatory bodies.

To maintain that compliance, an organization needs to know the following:

  • The requirements for what teams must do, driven by legal requirements or external commitments (also known as policy and compliance controls). This includes translating the comprehensive external legal requirements into requirements tailored to the organization's products and services.

  • Where the user data is stored or processed. This includes understanding the different systems and databases that store or process user data, as well as the physical locations where user data is stored or processed.

  • Which policy or compliance control applies to each system that stores or processes user data (also known as data mapping). This includes understanding how the organization's policies and compliance controls are applied to different systems and databases.

  • The visibility of privacy compliance. This includes being able to track and monitor the organization's compliance with its policies/controls and with applicable laws and regulations.

PMAP provides a solution for the first three problems. We are working on a solution for the fourth, visibility of privacy compliance.

Architecture

pmap architecture

  • Registration - Data owners and policy owners will register data mappings and policies/controls in a central GitHub repository.
  • GCS Snapshots - Snapshot the data mappings and policies/controls from GitHub to GCS with Workload Identity Federation.
  • Additional Processors - Extension point for validating and enriching data mappings.
  • Processing Service - The service responsible for ingesting, validating, and storing the data mappings and policies/controls.
  • Storage and Analysis - The data warehouse for processed data mappings and policies/controls, plus a UI for dashboarding.

Why GitHub

We chose GitHub because it preserves change history and supports multi-person review and approval, both of which are crucial in privacy data management.

Why BigQuery

We chose BigQuery for its analytics support:

  • It can visualize data to reveal meaningful insights.
  • It can join data from other data sources in the future, which supports privacy compliance monitoring.

Set Up

The central privacy/compliance eng team needs to complete the steps below.

Workload Identity Federation

Set up Workload Identity Federation and a service account with adequate conditions and permissions; see the guide here. Restrict human access to this service account: it should only be used by your PMAP instance.

-   The service account used to authenticate via Workload Identity Federation
    needs `roles/storage.objectCreator` to snapshot the data mappings and
    policies/controls from GitHub to GCS.

-   When creating the workload identity pool provider, make sure to map
    attributes such as `"attribute.job_workflow_ref":
    "assertion.job_workflow_ref"` and add the following attribute conditions:
    -   `attribute.event_name != \"pull_request_target\"` to reject
        workflows triggered from a forked repository.
    -   `attribute.repository_owner_id == \"${var.github_owner_id}\" &&
        attribute.repository_id == \"${var.github_repository_id}\"` to only
        allow workflows from your pmap repository.
    -   `matches(attribute.job_workflow_ref, \"abcxyz/pmap/*\")` to only
        allow trusted workflow jobs that come from the `abcxyz/pmap` source
        repo.
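
The requirements above could be expressed in Terraform roughly as follows. This is a minimal sketch, not the configuration shipped with pmap: the resource names, the pool, provider, and service account IDs, and the variables `project_id` and `gcs_bucket_name` are assumptions for illustration. Only `github_owner_id`, `github_repository_id`, the `job_workflow_ref` mapping, and the three conditions come from the list above; the remaining attribute mappings are added on the assumption that every attribute used in a condition must also be mapped.

```hcl
# Hypothetical variable names; adjust to your own configuration.
variable "project_id" { type = string }
variable "gcs_bucket_name" { type = string }
variable "github_owner_id" { type = string }
variable "github_repository_id" { type = string }

resource "google_iam_workload_identity_pool" "pmap" {
  project                   = var.project_id
  workload_identity_pool_id = "pmap-github-pool" # assumed ID
}

resource "google_iam_workload_identity_pool_provider" "pmap" {
  project                            = var.project_id
  workload_identity_pool_id          = google_iam_workload_identity_pool.pmap.workload_identity_pool_id
  workload_identity_pool_provider_id = "pmap-github-provider" # assumed ID

  # Map each GitHub OIDC claim that the conditions below refer to.
  attribute_mapping = {
    "google.subject"                = "assertion.sub"
    "attribute.event_name"          = "assertion.event_name"
    "attribute.repository_owner_id" = "assertion.repository_owner_id"
    "attribute.repository_id"       = "assertion.repository_id"
    "attribute.job_workflow_ref"    = "assertion.job_workflow_ref"
  }

  # The three conditions from the list above, combined with logical AND.
  attribute_condition = join(" && ", [
    "attribute.event_name != \"pull_request_target\"",
    "attribute.repository_owner_id == \"${var.github_owner_id}\"",
    "attribute.repository_id == \"${var.github_repository_id}\"",
    "matches(attribute.job_workflow_ref, \"abcxyz/pmap/*\")",
  ])

  oidc {
    # Issuer of GitHub Actions OIDC tokens.
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Service account the snapshot workflows authenticate as (assumed name).
resource "google_service_account" "snapshot" {
  project    = var.project_id
  account_id = "pmap-snapshot"
}

# Lets the snapshot service account write objects to the pmap bucket.
resource "google_storage_bucket_iam_member" "snapshot_object_creator" {
  bucket = var.gcs_bucket_name
  role   = "roles/storage.objectCreator"
  member = "serviceAccount:${google_service_account.snapshot.email}"
}
```

The binding that lets principals from the pool impersonate the service account is not shown in this sketch.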

GitHub Central Repository

The central privacy/compliance eng team can decide how to group data mappings and policies/controls, but at least one level of grouping is required: files containing data mappings or policies/controls must live in sub folders under the root of the central GitHub repository and can't be stored directly in the root.

You can use pmap-template to create the GitHub Central Repository.

Data Mapping

  • Presubmit workflows for sanity checks, see example here.

  • Postsubmit workflows to snapshot added_files and modified_files of data mappings to GCS, see example here.

  • Cron workflows to snapshot all data mapping files to GCS, see example here.

Policy and Control

  • Postsubmit workflows to snapshot added_files and modified_files of policies/controls to GCS, see example here.

  • Cron workflows to snapshot all policy/control files to GCS, see example here.

Infrastructure for pmap

  • You can use the provided Terraform module to set up the basic infrastructure needed for this service. Alternatively, refer to the provided module to see how to build your own Terraform from scratch.

module "pmap" {
  source = "git::https://github.com/abcxyz/pmap.git//terraform/e2e?ref=main" # Pin ref to the desired commit SHA instead of main.

  project_id = "YOUR_PROJECT_ID"

  gcs_bucket_name                  = "pmap"
  pmap_container_image             = "us-docker.pkg.dev/abcxyz-artifacts/docker-images/pmap:0.0.4-amd64"
  pmap_prober_image                = "us-docker.pkg.dev/abcxyz-artifacts/docker-images/pmap-prober:0.0.4-amd64"
  bigquery_table_delete_protection = true
  # This is used when searching global Cloud Resources like GCS bucket.
  pmap_specific_envvars            = { "PMAP_MAPPING_DEFAULT_RESOURCE_SCOPE" : "YOUR_DEFAULT_RESOURCE_SCOPE" }
  notification_channel_email       = "YOUR_NOTIFICATION_CHANNEL_EMAIL"
}
  • Make sure the service account used by the Cloud Run service for Data Mapping is granted roles/cloudasset.viewer at the scope set in PMAP_MAPPING_DEFAULT_RESOURCE_SCOPE, following the docs here.
# Look up the service account used by the Cloud Run service for Data Mapping
gcloud run services describe <NAME_OF_DATA_MAPPING_CLOUD_RUN_SERVICE> \
  --format "value(spec.template.spec.serviceAccountName)"
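
Once the service account is known, the role grant can also be written in Terraform. The following is a minimal sketch that assumes the default resource scope is a single project. Both variable names and the resource name are hypothetical. For a folder or organization scope, `google_folder_iam_member` or `google_organization_iam_member` would take the place of the project-level resource.

```hcl
# Hypothetical variables: the project used as the default resource scope,
# and the email of the Data Mapping Cloud Run service account.
variable "scope_project_id" { type = string }
variable "data_mapping_service_account_email" { type = string }

# Grants read access to Cloud Asset Inventory within the scope project.
resource "google_project_iam_member" "pmap_asset_viewer" {
  project = var.scope_project_id
  role    = "roles/cloudasset.viewer"
  member  = "serviceAccount:${var.data_mapping_service_account_email}"
}
```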

End User Workflows

Policy/Control Owner

  • Create a policy/control (e.g. a wipeout plan) by opening a PR that adds a YAML file under the sub folder that stores the policies/controls. See example here.

Data Owner

  • Register and annotate resources to associate them with their specific policies/controls by opening a PR that adds a mapping YAML file under the sub folder that stores the data mappings. The association between a resource and its policies/controls is made through the annotations field. See example here.

Data Governor (TODO)