This project deploys a minimal ETL workload using AWS Glue. It loads data from an Aurora cluster and stores the ETL results in an S3 bucket in Parquet format.
The Glue job is simple: it replaces the value of the table's "content" column with "*" characters (for example, content: Hello => *****).
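The masking rule can be sketched in plain Python. This is an illustration only, not the project's Glue script: it assumes the mask keeps the original length (as the Hello => ***** example suggests), and the `id` field and the `mask_content` helper are hypothetical names.

```python
def mask_content(record: dict, column: str = "content", mask_char: str = "*") -> dict:
    """Return a copy of `record` with every character of `column` replaced by `mask_char`."""
    masked = dict(record)  # shallow copy so the input row is not mutated
    value = masked.get(column)
    if isinstance(value, str):
        # Assumption: one mask character per original character.
        masked[column] = mask_char * len(value)
    return masked


# Hypothetical sample rows; only "content" is named in this README.
rows = [{"id": 1, "content": "Hello"}, {"id": 2, "content": "Glue"}]
masked_rows = [mask_content(row) for row in rows]
print(masked_rows)
```

In the deployed job the same transformation would run on Glue's Spark runtime rather than on Python dictionaries.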
Set up your CDK environment before deploying (see Getting started with the AWS CDK), then run:
cdk deploy
First, create demo data in the Aurora cluster. The deployment includes a Lambda function that inserts 1000 records into the database. Invoke it with the command below.
aws lambda invoke --function-name create-demo-data /dev/null
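If you prefer to script this step, the same invocation could look roughly like the sketch below. It assumes boto3 is installed and credentials are configured; the `create_demo_data` wrapper is a hypothetical helper, and the client is passed in so the function can be exercised without AWS access.

```python
def create_demo_data(lambda_client, function_name: str = "create-demo-data") -> int:
    """Invoke the demo-data Lambda function synchronously and return the HTTP status code."""
    response = lambda_client.invoke(FunctionName=function_name)
    # Lambda sets "FunctionError" in the response when the function itself raised an error.
    if "FunctionError" in response:
        raise RuntimeError(f"{function_name} failed: {response['FunctionError']}")
    return response["StatusCode"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   create_demo_data(boto3.client("lambda"))
```

Each successful invocation should add another 1000 records, so call it once if you want the counts later in this walkthrough to match.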
Next, run the Glue job to perform the ETL. Go to the AWS Glue console (Jobs), select AwsGlueEtlSampleCdk, then click Action and Run job.
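As an alternative to the console, the job run can be started and polled from Python. This is a minimal sketch, assuming the job name matches the console entry AwsGlueEtlSampleCdk; `run_glue_job` is a hypothetical helper, and the polling interval is arbitrary.

```python
import time

# States after which a Glue job run no longer changes.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name: str, poll_seconds: float = 30.0) -> str:
    """Start `job_name` and block until the run reaches a terminal state; return that state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(run_glue_job(boto3.client("glue"), "AwsGlueEtlSampleCdk"))
```

Continue to the crawler step only if the returned state is SUCCEEDED.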
After the job succeeds, go to the AWS Glue console (Crawlers), select AwsGlueEtlSampleCdk, then click Run crawler.
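The crawler can be driven the same way. The sketch below assumes the crawler name matches the console entry AwsGlueEtlSampleCdk and that the crawler reports a RUNNING state right after it is started; `run_crawler` is a hypothetical helper.

```python
import time


def run_crawler(glue_client, crawler_name: str, poll_seconds: float = 15.0) -> str:
    """Start the crawler, wait until it is READY again, and return the last crawl status."""
    glue_client.start_crawler(Name=crawler_name)
    while True:
        crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
        # A crawler returns to READY once the crawl has finished (or was never running).
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(run_crawler(boto3.client("glue"), "AwsGlueEtlSampleCdk"))
```

A return value of SUCCEEDED indicates the table should now be visible in the Data Catalog.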
After the crawler succeeds, go to Athena (Query) and select AwsDataCatalog as the Data source and mydatabase as the Database. Then enter the following query in the box and click Run query.
SELECT * FROM mytable;
As you can see, the "content" column is masked with "*".
To get the number of records, run the query below.
SELECT COUNT(*) FROM mytable;
You will get 1000 as the result if you invoked the Lambda function once.
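The count check can also be automated against Athena. This is a rough sketch under several assumptions: boto3 and credentials are available, `output_location` is an S3 URI you supply for query results (not defined by this project), and `count_rows` is a hypothetical helper. The database and table names are the ones used in the queries above.

```python
import time


def count_rows(athena_client, output_location: str, database: str = "mydatabase",
               table: str = "mytable", poll_seconds: float = 2.0) -> int:
    """Run SELECT COUNT(*) on `table` through Athena and return the count as an int."""
    query_id = athena_client.start_query_execution(
        QueryString=f"SELECT COUNT(*) FROM {table}",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query leaves the QUEUED/RUNNING states.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    rows = athena_client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # rows[0] is the header row; rows[1] holds the single COUNT(*) value.
    return int(rows[1]["Data"][0]["VarCharValue"])


# Usage (requires boto3, AWS credentials, and an S3 location you own):
#   import boto3
#   print(count_rows(boto3.client("athena"), "s3://<your-results-bucket>/athena/"))
```

With a single invocation of the Lambda function, the returned value should be 1000.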
cdk destroy
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.