This is a baseline template for AWS CDK development with AWS Glue. This CDK template is built with AWS CDK v2 and AWS CDK Pipelines.
Typically, you have multiple accounts to manage and provision resources for your data pipeline. In this template, we assume the following three accounts:
- Pipeline account – This hosts the end-to-end pipeline
- Dev account – This hosts the data integration pipeline in the development environment
- Prod account – This hosts the data integration pipeline in the production environment
If you want, you can use the same account and the same Region for all three.
To start applying this end-to-end development lifecycle model to your data platform easily and quickly, we prepared a baseline template, aws-glue-cdk-baseline, using AWS CDK. The template is built on top of AWS CDK v2 and AWS CDK Pipelines. It provisions two kinds of stacks:
- AWS Glue app stack – This provisions the data integration pipeline: one in the dev account and one in the prod account
- Pipeline stack – This provisions the Git repository and CI/CD pipeline in the pipeline account
The AWS Glue app stack provisions the data integration pipeline, including the following resources:
- AWS Glue jobs
- AWS Glue job scripts
At the time of publishing this template, the AWS CDK has two versions of the AWS Glue module: @aws-cdk/aws-glue and @aws-cdk/aws-glue-alpha, containing L1 constructs and L2 constructs, respectively. The sample Glue app stack is defined using aws-glue-alpha, the L2 construct for AWS Glue, because it’s straightforward to define and manage AWS Glue resources. If you want to use the L1 construct, refer to Build, Test and Deploy ETL solutions using AWS Glue and AWS CDK based CI/CD pipelines.
The pipeline stack provisions the entire CI/CD pipeline, including the following resources:
- AWS IAM roles
- Amazon S3 bucket
- AWS CodeCommit
- AWS CodePipeline
- AWS CodeBuild
Every time the business requirement changes (such as adding data sources or changing data transformation logic), you make changes on the AWS Glue app stack and re-provision the stack to reflect your changes. You do this by committing your changes in the AWS CDK template to the CodeCommit repository; CodePipeline then applies the changes to AWS resources using AWS CloudFormation change sets.
In the following sections, we present the steps to set up the required environment and demonstrate the end-to-end development lifecycle.
Before you begin, make sure you have the following prerequisites:
- Python 3.9 or later
- AWS accounts for the pipeline account, dev account, and prod account
- AWS named profiles for the pipeline account, dev account, and prod account
- The AWS CDK Toolkit (cdk command) 2.87.0 or later
- Docker
- Visual Studio Code
- Visual Studio Code Dev Containers
To initialize the project, complete the following steps:
- Clone the baseline template to your workspace.
$ git clone git@github.com:aws-samples/aws-glue-cdk-baseline.git
$ cd aws-glue-cdk-baseline
- Create a Python virtual environment specific to the project on the client machine.
$ python3 -m venv .venv
We use a virtual environment in order to isolate the Python environment for this project and not install software globally.
- Activate the virtual environment according to your OS:
- On macOS and Linux, use the following code:
$ source .venv/bin/activate
- On a Windows platform, use the following code:
% .venv\Scripts\activate.bat
After this step, the subsequent steps run within the bounds of the virtual environment on the client machine and interact with the AWS account as needed.
- Install the required dependencies described in requirements.txt to the virtual environment:
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
- Edit the configuration file default-config.yaml based on your environments (replace each account ID with your own):
pipelineAccount:
  awsAccountId: 123456789101
  awsRegion: us-east-1
devAccount:
  awsAccountId: 123456789102
  awsRegion: us-east-1
prodAccount:
  awsAccountId: 123456789103
  awsRegion: us-east-1
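A typo in an account ID or Region only surfaces later, during bootstrap or deploy, so it can help to check the account section first. The following is a minimal sketch, not part of the template: it assumes the YAML has already been parsed into a dictionary (for example with a YAML library), and the function name and the regular expressions are illustrative assumptions.

```python
import re

ACCOUNT_KEYS = ("pipelineAccount", "devAccount", "prodAccount")
ACCOUNT_ID_RE = re.compile(r"\d{12}")
# Loose pattern for Region names such as us-east-1 or us-gov-west-1 (an assumption).
REGION_RE = re.compile(r"[a-z]{2}(-[a-z]+)+-\d")


def validate_accounts(config):
    """Return a list of problems found in the account section of the config."""
    problems = []
    for key in ACCOUNT_KEYS:
        entry = config.get(key)
        if not isinstance(entry, dict):
            problems.append(f"{key}: missing")
            continue
        # An unquoted account ID is typically parsed as a number, so compare as text.
        account_id = str(entry.get("awsAccountId", ""))
        if not ACCOUNT_ID_RE.fullmatch(account_id):
            problems.append(f"{key}.awsAccountId: expected 12 digits, got {account_id!r}")
        region = str(entry.get("awsRegion", ""))
        if not REGION_RE.fullmatch(region):
            problems.append(f"{key}.awsRegion: unexpected format {region!r}")
    return problems


# The sample values from default-config.yaml, as a parser would return them.
config = {
    "pipelineAccount": {"awsAccountId": 123456789101, "awsRegion": "us-east-1"},
    "devAccount": {"awsAccountId": 123456789102, "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": 123456789103, "awsRegion": "us-east-1"},
}
print(validate_accounts(config))  # [] when every entry looks well formed
```

If one of your account IDs starts with a zero, quoting it in the YAML file is the safer choice, because a numeric parse may not preserve the leading zero.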
- Run pytest to initialize the snapshot test files by running the following command:
$ python3 -m pytest --snapshot-update
Run the following commands to bootstrap your AWS environments.
- In the pipeline account, replace PIPELINE-ACCOUNT-NUMBER, REGION, and PIPELINE-PROFILE with your own values:
$ cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess
- In the dev account, replace DEV-ACCOUNT-NUMBER, REGION, DEV-PROFILE, and PIPELINE-ACCOUNT-NUMBER with your own values:
$ cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>
- In the prod account, replace PROD-ACCOUNT-NUMBER, REGION, PROD-PROFILE, and PIPELINE-ACCOUNT-NUMBER with your own values:
$ cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    --trust <PIPELINE-ACCOUNT-NUMBER>
When you use only one account for all environments, you can just run the cdk bootstrap command one time.
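The three bootstrap commands differ only in the target account, the profile, and whether --trust is passed. As a rough sketch (the helper names and the environment dictionaries are hypothetical, not part of the template), the commands can be generated from one list so that a shared account is bootstrapped only once:

```python
import shlex

ADMIN_POLICY = "arn:aws:iam::aws:policy/AdministratorAccess"


def bootstrap_command(account_id, region, profile, trust_account_id=None):
    """Build the cdk bootstrap argument list for one environment."""
    args = [
        "cdk", "bootstrap", f"aws://{account_id}/{region}",
        "--profile", profile,
        "--cloudformation-execution-policies", ADMIN_POLICY,
    ]
    # Only accounts other than the pipeline account need to trust it.
    if trust_account_id is not None and trust_account_id != account_id:
        args += ["--trust", trust_account_id]
    return args


def bootstrap_commands(environments, pipeline_account_id):
    """Return one command per distinct account/Region pair."""
    seen = set()
    commands = []
    for env in environments:
        target = (env["account"], env["region"])
        if target in seen:
            continue  # already covered by an earlier entry
        seen.add(target)
        commands.append(
            bootstrap_command(env["account"], env["region"], env["profile"], pipeline_account_id)
        )
    return commands


# Hypothetical profile names; the account IDs are the sample values from the config file.
environments = [
    {"account": "123456789101", "region": "us-east-1", "profile": "pipeline"},
    {"account": "123456789102", "region": "us-east-1", "profile": "dev"},
    {"account": "123456789103", "region": "us-east-1", "profile": "prod"},
]
for command in bootstrap_commands(environments, "123456789101"):
    print(shlex.join(command))
```

The sketch only prints the commands; review them before running them in a shell.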
Run the following command using the pipeline account to deploy the resources defined in the AWS CDK baseline template:
$ cdk deploy --profile <PIPELINE-PROFILE>
This creates the pipeline stack in the pipeline account and the AWS Glue app stack in the development account.
When the cdk deploy command is completed, let’s verify the pipeline using the pipeline account.
- Open the AWS CodePipeline console.
- Choose GluePipeline.
Then verify that GluePipeline has the following stages: Source, Build, UpdatePipeline, Assets, DeployDev, and DeployProd. Also verify that the first five stages (Source, Build, UpdatePipeline, Assets, and DeployDev) have succeeded and that DeployProd is in pending status. It can take about 15 minutes.
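The same check can be scripted instead of read from the console. The sketch below is an illustration under assumptions: it summarizes a response shaped like the one returned by the CodePipeline GetPipelineState API, the "NotStarted" label is made up here for stages with no execution yet, and the boto3 call is isolated in a helper that needs boto3 and valid credentials.

```python
EXPECTED_STAGES = ["Source", "Build", "UpdatePipeline", "Assets", "DeployDev", "DeployProd"]


def summarize_stages(pipeline_state):
    """Map each stage name to the status of its latest execution."""
    summary = {}
    for stage in pipeline_state.get("stageStates", []):
        # A stage that has never run has no latestExecution entry.
        status = stage.get("latestExecution", {}).get("status", "NotStarted")
        summary[stage["stageName"]] = status
    return summary


def first_five_succeeded(summary):
    """True when every stage before DeployProd reports Succeeded."""
    return all(summary.get(name) == "Succeeded" for name in EXPECTED_STAGES[:5])


def fetch_pipeline_state(pipeline_name, profile_name):
    """Query CodePipeline; requires boto3 and credentials for the pipeline account."""
    import boto3  # imported lazily so the pure helpers work without it

    session = boto3.Session(profile_name=profile_name)
    return session.client("codepipeline").get_pipeline_state(name=pipeline_name)


# Hand-written sample response; a stage waiting for approval usually shows InProgress.
sample_state = {
    "pipelineName": "GluePipeline",
    "stageStates": [
        {"stageName": name, "latestExecution": {"status": "Succeeded"}}
        for name in EXPECTED_STAGES[:5]
    ] + [{"stageName": "DeployProd", "latestExecution": {"status": "InProgress"}}],
}
summary = summarize_stages(sample_state)
print(summary)
print(first_five_succeeded(summary))
```

To run it against a real pipeline, pass the result of fetch_pipeline_state("GluePipeline", "<PIPELINE-PROFILE>") to summarize_stages.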
Now that the pipeline has been created successfully, you can also verify the AWS Glue app stack resource on the AWS CloudFormation console in the dev account.
At this step, the AWS Glue app stack is deployed only in the dev account. You can try to run the AWS Glue job ProcessLegislators to see how it works.
In the earlier step, you cloned the Git repository from GitHub. Although it is possible to configure the CDK template to work with GitHub, GitHub Enterprise, or Bitbucket, this time we use AWS CodeCommit. If you prefer one of those third-party Git providers, configure connections and edit [pipeline_stack.py](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/aws_glue_cdk_baseline/pipeline_stack.py) so that the variable source points to the target Git provider using [CodePipelineSource](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.pipelines.CodePipelineSource.html).
Because you already ran the cdk deploy command, the CodeCommit repository has already been created with all the required code and related files. The first step is to set up access to CodeCommit. The next step is to clone the CodeCommit repository to your local machine. Run the following commands:
$ mkdir aws-glue-cdk-baseline-codecommit
$ cd aws-glue-cdk-baseline-codecommit
$ git clone ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline
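The clone command above targets us-east-1. If your pipeline account uses a different Region, the host name changes with it. A small sketch of how the URL is composed (the helper name is hypothetical):

```python
def codecommit_clone_url(region, repository, protocol="ssh"):
    """Compose a CodeCommit clone URL for the given Region and repository."""
    if protocol not in ("ssh", "https"):
        raise ValueError(f"unsupported protocol: {protocol}")
    return f"{protocol}://git-codecommit.{region}.amazonaws.com/v1/repos/{repository}"


print(codecommit_clone_url("us-east-1", "aws-glue-cdk-baseline"))
# ssh://git-codecommit.us-east-1.amazonaws.com/v1/repos/aws-glue-cdk-baseline
```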
In the next step, we make changes in this local copy of the CodeCommit repository.
Now that the environment has been successfully created, you’re ready to start developing a data integration pipeline using this baseline template. Let’s walk through the end-to-end development lifecycle.
When you want to define your own data integration pipeline, you need to add more AWS Glue jobs and implement job scripts. For this tutorial, let’s assume that you want to add a new AWS Glue job with a new job script that reads multiple S3 locations and joins them.
First, implement and test the AWS Glue job and its job script in your local environment using Visual Studio Code.
Set up your development environment by following the steps in Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container. The following steps are required in the context of this template.
- Start Docker.
- Pull the Docker image that contains the local development environment with the AWS Glue ETL library:
$ docker pull public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01
- Run the following command to define the AWS named profile to use:
$ PROFILE_NAME="<DEV-PROFILE>"
- Run the following commands to make the baseline template available to the container:
$ cd aws-glue-cdk-baseline/
$ WORKSPACE_LOCATION=$(pwd)
- Run the Docker container:
$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01 pyspark
- Start Visual Studio Code.
- Choose Remote Explorer in the navigation pane, and choose the arrow icon of the workspace folder in the container public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01.
If the workspace folder is not shown, choose Open folder and select /home/glue_user/workspace.
Now you install the required dependencies described in requirements.txt to the container environment.
- Run the following commands in the terminal in Visual Studio Code:
$ pip install -r requirements.txt
$ pip install -r requirements-dev.txt
- Implement the code.
Now let’s make the required changes for a new AWS Glue job here.
[aws_glue_cdk_baseline/glue_app_stack.py](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/aws_glue_cdk_baseline/glue_app_stack.py)
Let’s add this new code block right after the existing job definition of ProcessLegislators in order to add the new Glue job JoinLegislators:
self.new_glue_job = glue.Job(self, "JoinLegislators",
    executable=glue.JobExecutable.python_etl(
        glue_version=glue.GlueVersion.V4_0,
        python_version=glue.PythonVersion.THREE,
        script=glue.Code.from_asset(
            path.join(path.dirname(__file__), "job_scripts/join_legislators.py")
        )
    ),
    description="a new example PySpark job",
    default_arguments={
        "--input_path_orgs": config[stage]['jobs']['JoinLegislators']['inputLocationOrgs'],
        "--input_path_persons": config[stage]['jobs']['JoinLegislators']['inputLocationPersons'],
        "--input_path_memberships": config[stage]['jobs']['JoinLegislators']['inputLocationMemberships']
    },
    tags={
        "environment": self.environment,
        "artifact_id": self.artifact_id,
        "stack_id": self.stack_id,
        "stack_name": self.stack_name
    }
)
Here, you added three job parameters for different S3 locations. In the following steps, you provide those locations through the Glue job parameters.
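The default_arguments entries follow one pattern: each job parameter is looked up under config[stage]['jobs'][job name]. A minimal sketch of that lookup in plain Python, assuming a hypothetical helper and a hand-written config dictionary, shows how a missing key can be reported before the stack is synthesized:

```python
def build_default_arguments(config, stage, job_name, mapping):
    """Translate config keys into Glue job arguments for one stage.

    mapping pairs each job parameter name with the config key that holds its value.
    """
    job_config = config[stage]["jobs"][job_name]
    missing = [key for key in mapping.values() if key not in job_config]
    if missing:
        raise KeyError(f"{stage}.jobs.{job_name} is missing: {', '.join(missing)}")
    return {f"--{param}": job_config[key] for param, key in mapping.items()}


# Parameter names and config keys as used by the JoinLegislators job definition.
JOIN_LEGISLATORS_MAPPING = {
    "input_path_orgs": "inputLocationOrgs",
    "input_path_persons": "inputLocationPersons",
    "input_path_memberships": "inputLocationMemberships",
}

# Illustrative config with placeholder locations.
config = {
    "dev": {
        "jobs": {
            "JoinLegislators": {
                "inputLocationOrgs": "s3://path_to_data_orgs",
                "inputLocationPersons": "s3://path_to_data_persons",
                "inputLocationMemberships": "s3://path_to_data_memberships",
            }
        }
    }
}
arguments = build_default_arguments(config, "dev", "JoinLegislators", JOIN_LEGISLATORS_MAPPING)
print(arguments)
```

The resulting dictionary has the same shape as the default_arguments value passed to glue.Job.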
Then, create a new job script, and a new unit test script for the new Glue job:
aws_glue_cdk_baseline/job_scripts/join_legislators.py
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions


class JoinLegislators:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
            params.append('input_path_orgs')
            params.append('input_path_persons')
            params.append('input_path_memberships')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
            self.input_path_orgs = args['input_path_orgs']
            self.input_path_persons = args['input_path_persons']
            self.input_path_memberships = args['input_path_memberships']
        else:
            jobname = "test"
            self.input_path_orgs = "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
            self.input_path_persons = "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
            self.input_path_memberships = "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
        self.job.init(jobname, args)

    def run(self):
        dyf = join_legislators(self.context, self.input_path_orgs, self.input_path_persons, self.input_path_memberships)
        df = dyf.toDF()
        df.printSchema()
        df.show()
        print(df.count())


def read_dynamic_frame_from_json(glue_context, path):
    return glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )


def join_legislators(glue_context, path_orgs, path_persons, path_memberships):
    orgs = read_dynamic_frame_from_json(glue_context, path_orgs)
    persons = read_dynamic_frame_from_json(glue_context, path_persons)
    memberships = read_dynamic_frame_from_json(glue_context, path_memberships)
    orgs = orgs.drop_fields(['other_names', 'identifiers']).rename_field('id', 'org_id').rename_field('name', 'org_name')
    dynamicframe_joined = Join.apply(orgs, Join.apply(persons, memberships, 'id', 'person_id'), 'org_id', 'organization_id').drop_fields(['person_id', 'org_id'])
    return dynamicframe_joined


if __name__ == '__main__':
    JoinLegislators().run()
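To make the shape of the joined output easier to picture, here is a plain-Python stand-in for the same sequence of steps (drop fields, rename, join persons to memberships, join the result to organizations, drop the join keys). It is only an illustration on tiny hand-written records: it does not use the AWS Glue library, the helper names are made up, and Glue's DynamicFrame semantics may differ in details such as handling of missing fields.

```python
def inner_join(left, right, left_key, right_key):
    """Equality join of two lists of dicts, merging each matching pair."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[left_key], []):
            joined.append({**row, **match})
    return joined


def drop_fields(rows, fields):
    """Remove the given fields from every record."""
    return [{k: v for k, v in row.items() if k not in fields} for row in rows]


def rename_field(rows, old, new):
    """Rename one field in every record."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


def join_legislators_plain(orgs, persons, memberships):
    """Mirror the order of operations used in the job script."""
    orgs = drop_fields(orgs, {"other_names", "identifiers"})
    orgs = rename_field(rename_field(orgs, "id", "org_id"), "name", "org_name")
    history = inner_join(persons, memberships, "id", "person_id")
    joined = inner_join(orgs, history, "org_id", "organization_id")
    return drop_fields(joined, {"person_id", "org_id"})


# Made-up sample records, far smaller than the real dataset.
orgs = [
    {"id": "o1", "name": "Senate", "identifiers": []},
    {"id": "o2", "name": "House", "other_names": []},
]
persons = [{"id": "p1", "name": "Alice"}, {"id": "p2", "name": "Bob"}]
memberships = [
    {"person_id": "p1", "organization_id": "o1", "role": "member"},
    {"person_id": "p1", "organization_id": "o2", "role": "member"},
    {"person_id": "p3", "organization_id": "o1", "role": "member"},
]
rows = join_legislators_plain(orgs, persons, memberships)
for row in rows:
    print(row)
```

Persons without a membership and memberships without a matching person drop out, which is why the unit test for the real job asserts on the row count of the joined result.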
aws_glue_cdk_baseline/job_scripts/tests/test_join_legislators.py
import pytest
import sys
import join_legislators
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)
    yield context


def test_counts(glue_context):
    dyf = join_legislators.join_legislators(glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/organizations.json",
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
        "s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
    assert dyf.toDF().count() == 10439
[default-config.yaml](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/default-config.yaml)
Add the following under both prod and dev:
JoinLegislators:
  inputLocationOrgs: "s3://awsglue-datasets/examples/us-legislators/all/organizations.json"
  inputLocationPersons: "s3://awsglue-datasets/examples/us-legislators/all/persons.json"
  inputLocationMemberships: "s3://awsglue-datasets/examples/us-legislators/all/memberships.json"
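Because the same job settings have to be added for both stages, it is easy to update one and forget the other. The following sketch, assuming the parsed config exposes jobs as config[stage]["jobs"] (the function name is hypothetical), reports jobs or keys that exist in one stage but not the other:

```python
def job_config_differences(config, stages=("dev", "prod")):
    """List job settings that are present in one stage but missing in another."""
    problems = []
    all_jobs = set()
    for stage in stages:
        all_jobs |= set(config.get(stage, {}).get("jobs", {}))
    for job in sorted(all_jobs):
        keys_by_stage = {}
        for stage in stages:
            jobs = config.get(stage, {}).get("jobs", {})
            if job not in jobs:
                problems.append(f"{stage}: job {job} is missing")
            else:
                keys_by_stage[stage] = set(jobs[job])
        # Compare each stage against the union of keys seen for this job.
        all_keys = set().union(*keys_by_stage.values())
        for stage, keys in keys_by_stage.items():
            for key in sorted(all_keys - keys):
                problems.append(f"{stage}: {job}.{key} is missing")
    return problems


# Illustrative config in which prod lacks one of the three locations.
config = {
    "dev": {"jobs": {"JoinLegislators": {
        "inputLocationOrgs": "s3://path_to_data_orgs",
        "inputLocationPersons": "s3://path_to_data_persons",
        "inputLocationMemberships": "s3://path_to_data_memberships",
    }}},
    "prod": {"jobs": {"JoinLegislators": {
        "inputLocationOrgs": "s3://path_to_data_orgs",
        "inputLocationPersons": "s3://path_to_data_persons",
    }}},
}
print(job_config_differences(config))
```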
[tests/unit/test_glue_app_stack.py](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/unit/test_glue_app_stack.py)
[tests/unit/test_pipeline_stack.py](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/unit/test_pipeline_stack.py)
[tests/snapshot/test_snapshot_glue_app_stack.py](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/snapshot/test_snapshot_glue_app_stack.py)
Add the following under "jobs" in the variable config in the above three files (there is no need to replace the S3 locations):
,
"JoinLegislators": {
    "inputLocationOrgs": "s3://path_to_data_orgs",
    "inputLocationPersons": "s3://path_to_data_persons",
    "inputLocationMemberships": "s3://path_to_data_memberships"
}
- Choose Run at the top right to run individual job scripts. If the Run button is not shown, install the Python extension into the container through Extensions in the navigation pane.
- For local unit testing, run the following commands in the terminal in Visual Studio Code:
$ cd aws_glue_cdk_baseline/job_scripts/
$ python3 -m pytest
Then you can verify that the newly added unit test passed successfully.
- Run pytest to initialize the snapshot test files by running the following commands:
$ cd ../../
$ python3 -m pytest --snapshot-update
Complete the following steps to deploy the AWS Glue app stack to the development environment and run integration tests there:
- Set up access to CodeCommit.
- Commit and push your changes to the AWS CodeCommit repository:
$ git add .
$ git commit -m "Add the second Glue job"
$ git push
You can see that the pipeline is successfully triggered.
No additional setup is required to run the integration test for the newly added Glue job. The integration test script [integ_test_glue_app_stack.py](https://github.com/aws-samples/aws-glue-cdk-baseline/blob/main/tests/integ/integ_test_glue_app_stack.py) runs all the jobs that have a specific tag, then verifies the state and duration of each run. If you want to change the condition or the threshold, you can edit the assertions at the end of the integ_test_glue_job method.
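As a rough picture of what such a check involves, the sketch below evaluates a job run description shaped like the JobRun object returned by the AWS Glue GetJobRun API. It is not the code from integ_test_glue_app_stack.py: the function names and the 600-second threshold are assumptions, and the polling helper needs boto3 and credentials for the dev account.

```python
def check_job_run(job_run, max_seconds):
    """Return the list of failed checks for one Glue job run description."""
    failures = []
    state = job_run.get("JobRunState")
    if state != "SUCCEEDED":
        failures.append(f"state is {state}, expected SUCCEEDED")
    duration = job_run.get("ExecutionTime", 0)  # reported in seconds
    if duration > max_seconds:
        failures.append(f"ran for {duration}s, limit is {max_seconds}s")
    return failures


def run_job_and_wait(job_name, profile_name, poll_seconds=30):
    """Start a Glue job and poll until it leaves the in-progress states."""
    import time

    import boto3  # imported lazily so check_job_run works without it

    glue = boto3.Session(profile_name=profile_name).client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        job_run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if job_run["JobRunState"] not in ("STARTING", "RUNNING", "STOPPING", "WAITING"):
            return job_run
        time.sleep(poll_seconds)


# Hand-written sample run descriptions and a hypothetical threshold.
print(check_job_run({"JobRunState": "SUCCEEDED", "ExecutionTime": 120}, max_seconds=600))
print(check_job_run({"JobRunState": "FAILED", "ExecutionTime": 900}, max_seconds=600))
```

An empty list means the run passed both checks; a test would typically assert on that.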
Complete the following steps to deploy the AWS Glue app stack to the production environment:
- On the GluePipeline page in the AWS CodePipeline console, choose Review under the DeployProd stage.
- Choose Approve.
Wait for the DeployProd stage to be completed, then you can verify the AWS Glue app stack resource in the prod account.
To clean up your resources, complete the following steps:
- Run the following command using the pipeline account:
$ cdk destroy --profile <PIPELINE-PROFILE>
- Delete the AWS Glue app stack in the dev account and prod account.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.