This repository provides an AWS CloudFormation (CFn) template that builds a sample data pipeline. The pipeline identifies personal and sensitive data using the Sensitive Data Detection feature of AWS Glue Studio and applies a hashing algorithm to protect the identified columns using AWS Glue DataBrew, through an event-driven and serverless architecture.
We recommend that you use this template as a starting point for creating your own template, not for launching production-level environments. Before launching a template, always review the resources and policies that it will create and the permissions it requires.
By using this code, I agree that I'm solely responsible for any security issue caused by misconfiguration and/or bugs.
Although the data in the sample file follows the real formats, it is fictitious and randomly generated according to the rules for creating each document. Misuse of the data generated here is the sole responsibility of the user.
- Launch the AWS CloudFormation stack using the `cfn-demo-detect-and-handling-custom-pii.yaml` template file as the source.
To get the template, download the CFn template file here or clone the repository.
git clone git@github.com:aws-samples/detect-and-handling-custom-pii-with-aws-glue-studio-and-aws-glue-databrew.git
Note: Check the AWS account and Region before deploying the stack.
- The CFn template provides 12 parameters with default values: 5 custom sensitive data names (`1CustomSensitiveDataName`, `2CustomSensitiveDataName`, `3CustomSensitiveDataName`, `4CustomSensitiveDataName`, `5CustomSensitiveDataName`), their respective regular expressions (`1CustomSensitiveDataValue`, `2CustomSensitiveDataValue`, `3CustomSensitiveDataValue`, `4CustomSensitiveDataValue`, `5CustomSensitiveDataValue`) to detect some Personally Identifiable Information (PII) from Brazil (CPF, RG, CNPJ, CEP and Telefone), the percentage of rows to sample (`GlueSamplePortion`), and the percentage of rows that contain the sensitive data (`GlueDetectionThreshold`). You can edit these parameters if you prefer. Provide a value for the `SecretString` parameter, which will be base64 encoded as a secret (AWS Secrets Manager) and used for data hashing.
- After deploying the CFn stack, get the values of `AmazonS3BucketForDataInput`, `AWSStepFunctionsStateMachine` and `AmazonS3BucketForDataOuput` from the Outputs section. They will be useful in the next steps!
- If you want to use example data, download the sample synthetic file here, generated by 4devs, or use any other generator.
Note: The file must be a CSV with a semicolon (`;`) delimiter (semicolon was chosen in this solution because it causes fewer problems with decimal points and digit grouping, and does not appear in much text).
- Upload the file with sensitive data to the input bucket created before (`AmazonS3BucketForDataInput`).
- Wait until the AWS Step Functions state machine (`AWSStepFunctionsStateMachine`) finishes; you can watch the execution.
- After the execution completes, download the generated file from the output bucket created before (`AmazonS3BucketForDataOuput`).
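To build intuition for what the pipeline does with the sample portion, detection threshold, and secret, the following is a minimal local sketch in Python. It is not the implementation used by AWS Glue Studio or AWS Glue DataBrew. The CPF pattern, the function names, the use of fractions between 0 and 1 for the two thresholds, the "first N rows" sampling, and the choice of HMAC-SHA256 as the keyed hash are all assumptions made for illustration; the regular expressions and hashing configuration in the template may differ.

```python
import csv
import hashlib
import hmac
import io
import re

# Hypothetical CPF pattern (formatted as 000.000.000-00); the template's
# actual regular expressions may differ.
CPF_PATTERN = re.compile(r"\d{3}\.\d{3}\.\d{3}-\d{2}")

# Tiny synthetic sample using the semicolon delimiter described above.
SAMPLE = "name;cpf\nAna;123.456.789-09\nBruno;987.654.321-00\n"


def detect_sensitive_columns(rows, pattern, sample_portion=1.0, threshold=0.5):
    """Return columns whose share of matching sampled rows meets the threshold."""
    if not rows:
        return []
    # Assumption: sample the first N rows; a real service may sample differently.
    sample_size = max(1, int(len(rows) * sample_portion))
    sample = rows[:sample_size]
    flagged = []
    for column in sample[0].keys():
        hits = sum(
            1 for row in sample if pattern.fullmatch((row[column] or "").strip())
        )
        if hits / len(sample) >= threshold:
            flagged.append(column)
    return flagged


def hash_value(value, secret):
    """Keyed hash (HMAC-SHA256, an assumption) of one cell value, hex encoded."""
    return hmac.new(
        secret.encode("utf-8"), value.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def protect_csv(text, pattern, secret, sample_portion=1.0, threshold=0.5):
    """Read semicolon-delimited CSV text and hash every flagged column."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    if not rows:
        return text
    flagged = detect_sensitive_columns(rows, pattern, sample_portion, threshold)
    for row in rows:
        for column in flagged:
            row[column] = hash_value(row[column] or "", secret)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(rows[0].keys()), delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Example: the "cpf" column is flagged and replaced by hex digests.
print(protect_csv(SAMPLE, CPF_PATTERN, "my-example-secret"))
```

Because the hash is keyed, the same input value and secret always produce the same digest, so hashed columns can still be joined or grouped without exposing the original values.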
In the AWS CloudFormation User Guide, you can view more information about the following topics:
- Learn how to use templates to create AWS CloudFormation stacks using the AWS Management Console or AWS Command Line Interface (AWS CLI).
- To view all the supported AWS resources and their properties, see the Template Reference.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.