This repository demonstrates how to build a cost-effective, transient Amazon EMR on EC2 architecture using AWS Step Functions, Amazon EventBridge, and infrastructure as code. On a daily schedule, the solution provisions a short-lived EMR cluster, runs a Spark job that analyzes COVID-19 data, and terminates the cluster as soon as the job completes.
The workflow integrates CloudWatch Logs and Step Functions execution history to provide observability for monitoring, debugging, and troubleshooting job runs.
This pattern is ideal for customers who need to:
- Run daily Spark jobs with custom cluster configurations (AMIs, instance types, networking)
- Optimize costs by avoiding long-running EMR clusters
- Automate and schedule data processing workflows
- Maintain full observability and debugging capabilities
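Under the hood, the CloudFormation template defines a Step Functions state machine that creates the cluster, runs the Spark step, and terminates the cluster. Purely as an illustration of that transient-cluster pattern (not the template's actual definition, and with placeholder bucket names and instance settings), the same one-shot behavior can be expressed with a single EMR CLI call:

# Illustrative only: a self-terminating cluster that runs one Spark step.
# Bucket names, instance type, and instance count are placeholders.
aws emr create-cluster \
  --name "transient-covid19-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --log-uri s3://<LOG_BUCKET>/logs/ \
  --auto-terminate \
  --steps Type=Spark,Name=covid19-job,ActionOnFailure=TERMINATE_CLUSTER,Args=[s3://<INPUT_BUCKET>/scripts/covid19_processor.py]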
Prerequisites:

- AWS CLI installed and configured
- AWS account with appropriate permissions for EMR, Step Functions, EventBridge, S3, and CloudWatch
- For a quick start, you can use AWS CloudShell, which comes with the AWS CLI pre-installed; you can verify your setup with the commands below
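To confirm that the AWS CLI is available and that your credentials resolve to the expected account, run:

# Confirm the AWS CLI is installed
aws --version

# Confirm your credentials and the account you are deploying into
aws sts get-caller-identity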
Clone the sample project to your computer or to your AWS CloudShell environment:
git clone https://github.com/aws-samples/sample-emr-transient-cluster-step-functions-eventbridge.git
cd sample-emr-transient-cluster-step-functions-eventbridge
├── emr_transient_cluster_step_functions_eventbridge.yaml # CloudFormation template
├── covid19_processor.py # PySpark job for COVID-19 data analysis
The solution processes the COVID-19 public dataset from HealthData.gov.
The CloudFormation template deploys Amazon EMR release 6.15.0 by default. If you want a newer EMR 7.x release with more recent Spark, Hadoop, and other components, update the ReleaseLabel in the template before deployment:
- Open emr_transient_cluster_step_functions_eventbridge.yaml
- Find the line with "ReleaseLabel": "emr-6.15.0"
- Update it to your desired EMR version (e.g., "ReleaseLabel": "emr-7.6.0")
Set your AWS Region and deploy the CloudFormation stack:

export AWS_REGION=<YOUR_AWS_REGION>
export AWS_ACCOUNT_ID=<YOUR_AWS_ACCOUNT_ID>

aws cloudformation deploy \
--template-file emr_transient_cluster_step_functions_eventbridge.yaml \
--stack-name emr-transient-cluster-stack \
--capabilities CAPABILITY_IAM \
--region $AWS_REGION
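The aws cloudformation deploy command waits until the stack operation completes. To review every output the stack exposes (bucket names, state machine ARN, and so on), you can run:

# Display all of the stack outputs in one table
aws cloudformation describe-stacks \
  --stack-name emr-transient-cluster-stack \
  --query 'Stacks[0].Outputs' \
  --output table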
After deployment, upload the PySpark script to the created S3 bucket:

# Get the bucket name from CloudFormation outputs
INPUT_BUCKET_NAME=$(aws cloudformation describe-stacks \
--stack-name emr-transient-cluster-stack \
--query 'Stacks[0].Outputs[?OutputKey==`InputBucketName`].OutputValue' \
--output text)
# Upload the PySpark script
aws s3 cp covid19_processor.py s3://$INPUT_BUCKET_NAME/scripts/

Download the COVID-19 public dataset and upload it to your S3 input bucket:
# Download the COVID-19 dataset from HealthData.gov
curl -o covid19-dataset.csv "https://healthdata.gov/api/views/g62h-syeh/rows.csv?accessType=DOWNLOAD"
# Upload the dataset to S3 input bucket under raw/ folder
aws s3 cp covid19-dataset.csv s3://$INPUT_BUCKET_NAME/raw/

The Step Functions workflow is automatically triggered by EventBridge on a daily schedule.
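The rule name is generated by CloudFormation, so if you want to confirm the schedule (this assumes the template creates a standard EventBridge rule with a schedule expression), list the rules in your Region and look for the one created by the stack:

# List EventBridge rules with their schedule expressions and state
aws events list-rules \
  --query 'Rules[].{Name:Name,Schedule:ScheduleExpression,State:State}' \
  --output table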
You can also manually execute the workflow:

# Get the Step Functions ARN and bucket names from CloudFormation outputs
STATE_MACHINE_ARN=$(aws cloudformation describe-stacks \
--stack-name emr-transient-cluster-stack \
--query 'Stacks[0].Outputs[?OutputKey==`StateMachineArn`].OutputValue' \
--output text)
LOG_BUCKET_NAME=$(aws cloudformation describe-stacks \
--stack-name emr-transient-cluster-stack \
--query 'Stacks[0].Outputs[?OutputKey==`LogBucketName`].OutputValue' \
--output text)
OUTPUT_BUCKET_NAME=$(aws cloudformation describe-stacks \
--stack-name emr-transient-cluster-stack \
--query 'Stacks[0].Outputs[?OutputKey==`OutputBucketName`].OutputValue' \
--output text)
# Start execution with required input parameters
aws stepfunctions start-execution \
--state-machine-arn $STATE_MACHINE_ARN \
--name "manual-execution-$(date +%Y%m%d-%H%M%S)" \
--input "{\"LogUri\": \"s3://$LOG_BUCKET_NAME/logs/\", \"OutputS3Location\": \"s3://$OUTPUT_BUCKET_NAME/processed/\"}"
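To check on the execution you just started (or on the most recent scheduled run), list the state machine's executions and their status:

# List recent executions with their status (RUNNING, SUCCEEDED, FAILED)
aws stepfunctions list-executions \
  --state-machine-arn $STATE_MACHINE_ARN \
  --max-results 5 \
  --query 'executions[].{Name:name,Status:status,Started:startDate}' \
  --output table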
You can monitor the solution through:

- CloudWatch Logs: EMR cluster and Spark job logs are automatically collected
- Step Functions Console: Monitor workflow execution status and history
- EventBridge: View scheduled execution events and triggers
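After a successful execution, the processed output should appear under the OutputS3Location you passed in, and the EMR logs under the LogUri. You can browse both from the CLI:

# List the processed results written by the Spark job
aws s3 ls s3://$OUTPUT_BUCKET_NAME/processed/ --recursive

# List the EMR cluster and step logs
aws s3 ls s3://$LOG_BUCKET_NAME/logs/ --recursive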
When you are done, delete the stack to remove the resources:

aws cloudformation delete-stack \
--stack-name emr-transient-cluster-stack \
--region $AWS_REGION
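Depending on the deletion policy of the S3 buckets in the template, stack deletion may fail while the buckets still contain objects, or the buckets may be retained after the stack is deleted. If so, empty (and, if necessary, delete) them manually:

# Empty the buckets created by the stack
aws s3 rm s3://$INPUT_BUCKET_NAME --recursive
aws s3 rm s3://$OUTPUT_BUCKET_NAME --recursive
aws s3 rm s3://$LOG_BUCKET_NAME --recursive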
Services used:

- Primary: Amazon EMR on EC2
- Secondary: AWS Step Functions, Amazon EventBridge, Amazon S3, Amazon CloudWatch
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.
Authors:

- Senthil Kamala Rathinam (Solutions Architect)
- Shashi Makkapati (Sr. Solutions Architect)
