terraform-aws-batch-processing

Purpose

A collection of Terraform modules which creates a generic data processing pipeline. This is a recurring infrastructure pattern in both academic and commercial settings. The codebase is designed to be modularised, but care is needed when only a subset of the modules is used.

Description

The containerised workflow consists of three high-level steps: upload a file (image or text based), trigger an automatic workflow which processes the uploaded data, and return the resulting output. To capture this workflow in a modern serverless design, the following architecture was developed (see the schematic of the infrastructure).

  • An S3 bucket with intelligent storage options and a Lambda trigger is set up to receive the uploaded data. When data (a file, image, etc.) arrives in the S3 bucket, a Lambda function is triggered which starts an AWS Step Functions execution (a Terraform sketch of this wiring follows the list).
  • AWS Step Functions consists of three states:
    1. A DataSync task for copying the contents of the upload S3 bucket to EFS.
    2. An AWS Batch job which launches a containerised task on AWS Fargate.
    3. A DataSync task to copy the resultant output from the Batch job on EFS to a reports S3 bucket.
  • An S3 bucket, denoted Reports S3 in the schematic figure, provides storage for the processed output from the container.
  • AppStream 2.0 provides users with a virtual desktop within the VPC. This service is included so that quality control can be carried out on the containerised workflow.
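
The upload trigger can be pictured with a minimal Terraform sketch. This is not the repository's actual code: the resource names, the Lambda handler archive and its IAM policy for calling states:StartExecution are assumptions for illustration only.

```hcl
# Minimal sketch of the upload trigger (illustrative names throughout).
resource "aws_s3_bucket" "uploads" {
  bucket = "arc-batch-uploads-example"
}

resource "aws_iam_role" "lambda_exec" {
  name = "arc-batch-lambda-exec-example"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# Lambda that starts the Step Functions execution for each uploaded object.
resource "aws_lambda_function" "trigger_step_function" {
  function_name = "arc-batch-trigger-example"
  role          = aws_iam_role.lambda_exec.arn
  runtime       = "python3.11"
  handler       = "handler.lambda_handler"
  filename      = "lambda.zip" # assumed archive whose handler calls states:StartExecution
}

# Allow S3 to invoke the Lambda.
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowExecutionFromS3"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.trigger_step_function.function_name
  principal     = "s3.amazonaws.com"
  source_arn    = aws_s3_bucket.uploads.arn
}

# Fire the Lambda whenever a new object lands in the uploads bucket.
resource "aws_s3_bucket_notification" "uploads" {
  bucket = aws_s3_bucket.uploads.id

  lambda_function {
    lambda_function_arn = aws_lambda_function.trigger_step_function.arn
    events              = ["s3:ObjectCreated:*"]
  }

  depends_on = [aws_lambda_permission.allow_s3]
}
```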

Installation Steps

Prerequisites

Ensure that your IAM credentials can be used to authenticate the Terraform AWS provider. Details can be found in the Terraform documentation.

Steps to set up the infrastructure

  1. To enable AppStream 2.0, first create the appropriate stacks and fleets. This can be done through the management console. Make a note of the AppStream 2.0 image name.
  2. Clone this GitHub repo.
  3. Navigate to the cloned repo in the terminal and run the following commands:
  • terraform init
  • terraform validate
  • terraform apply
  4. When prompted for as2_image_name, pass the value from Step 1 (alternatively, supply it in a terraform.tfvars file, as shown below).
  5. A complete list of the AWS services which will be deployed appears. Check this before agreeing to the deployment (please be advised this will incur a cost).

If this is the first deployment, please note that it will take some time.
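
A terraform.tfvars file is one way to avoid the interactive prompt. The image name below is a placeholder; the remaining values simply restate the defaults from the Inputs table and only need to be included if you want to override them.

```hcl
# Example terraform.tfvars -- as2_image_name is a placeholder, replace it with
# the AppStream 2.0 image name noted in Step 1.
as2_image_name = "AppStream-WinServer2019-Example"

# Optional overrides (these match the defaults in the Inputs table).
region              = "eu-west-2"
container_image_url = "public.ecr.aws/docker/library/busybox:latest"
container_vcpu      = 1
container_memory    = 2048
```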

Details on Step Functions

As stated above, the Step Functions state machine consists of three main tasks:

  1. A DataSync task to copy data from the S3 uploads bucket to EFS.
  2. An AWS Batch job which launches a Fargate task.
  3. A DataSync task to copy data from EFS to the S3 reports bucket.

However, to ensure that each DataSync task finishes before moving to the next state, a poll-and-wait pattern has been implemented; a sketch of the idea follows.
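
The fragment below illustrates the poll-and-wait idea for a single DataSync state, expressed as Amazon States Language embedded in Terraform. It is not the repository's actual state machine: the variable names, state names and execution role are assumptions.

```hcl
# Illustrative poll-and-wait fragment (hypothetical names and variables).
variable "datasync_task_arn" { type = string }
variable "sfn_role_arn"      { type = string }

resource "aws_sfn_state_machine" "example" {
  name     = "arc-batch-poll-wait-example"
  role_arn = var.sfn_role_arn

  definition = jsonencode({
    StartAt = "StartDataSync"
    States = {
      # Kick off the DataSync task execution via the AWS SDK integration.
      StartDataSync = {
        Type       = "Task"
        Resource   = "arn:aws:states:::aws-sdk:datasync:startTaskExecution"
        Parameters = { TaskArn = var.datasync_task_arn }
        ResultPath = "$.execution"
        Next       = "WaitForDataSync"
      }
      # Pause before (re)checking the execution status.
      WaitForDataSync = {
        Type    = "Wait"
        Seconds = 30
        Next    = "CheckDataSync"
      }
      # Poll the current status of the task execution.
      CheckDataSync = {
        Type       = "Task"
        Resource   = "arn:aws:states:::aws-sdk:datasync:describeTaskExecution"
        Parameters = { "TaskExecutionArn.$" = "$.execution.TaskExecutionArn" }
        ResultPath = "$.status"
        Next       = "IsDataSyncDone"
      }
      # Proceed only once DataSync reports SUCCESS; otherwise keep waiting.
      IsDataSyncDone = {
        Type = "Choice"
        Choices = [{
          Variable     = "$.status.Status"
          StringEquals = "SUCCESS"
          Next         = "SubmitBatchJob"
        }]
        Default = "WaitForDataSync"
      }
      # Placeholder for the AWS Batch submission state that follows.
      SubmitBatchJob = { Type = "Succeed" }
    }
  })
}
```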

Test the deployment

The best way to test the deployment is to navigate to the uploads S3 bucket in the management console and upload an example file. This should automatically trigger the Step Functions workflow: open Step Functions in the management console and you should see that a running execution has started. Once the workflow has completed, the processed file will appear in the reports S3 bucket.

Cleaning up

To avoid incurring future charges, delete the deployed resources (for example by running terraform destroy).

Warning

Please be advised that to use AppStream 2.0 you will need to complete the console set-up at https://eu-west-2.console.aws.amazon.com/appstream2 before deploying the Terraform.

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.3.9 |
| aws | >= 4.9.0 |

Providers

No providers.

Modules

| Name | Source | Version |
|------|--------|---------|
| appstream | ./modules/appstream | n/a |
| batch | ./modules/batch | n/a |
| datasync | ./modules/datasync | n/a |
| efs | ./modules/efs | n/a |
| s3_reports | ./modules/s3_reports | n/a |
| s3_upload | ./modules/s3_upload | n/a |
| step_function | ./modules/step_function | n/a |
| vpc | ./modules/vpc | n/a |
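
As noted in the Purpose section, the modules are intended to be reusable individually, although care is needed when only a subset is deployed. The sketch below shows what instantiating a single submodule might look like; the input shown is hypothetical, so check the module's variables.tf (and its dependencies on the other modules) before reusing it on its own.

```hcl
# Illustrative only: consuming one submodule directly. The input below is a
# hypothetical example -- consult ./modules/s3_upload/variables.tf for the
# actual inputs this module expects.
module "s3_upload" {
  source = "./modules/s3_upload"

  solution_name = "arc-batch"
}
```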

Resources

No resources.

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| as2_desired_instance_num | Desired number of AS2 instances | number | 1 | no |
| as2_fleet_description | Fleet description | string | "ARC batch process fleet" | no |
| as2_fleet_display_name | Fleet display name | string | "ARC batch process fleet" | no |
| as2_fleet_name | Fleet name | string | "ARC-batch-fleet" | no |
| as2_image_name | AS2 image to deploy | string | n/a | yes |
| as2_instance_type | AS2 instance type | string | "stream.standard.medium" | no |
| as2_stack_description | Stack description | string | "ARC batch process stack" | no |
| as2_stack_display_name | Stack display name | string | "ARC batch process stack" | no |
| as2_stack_name | Stack name | string | "ARC-batch-stack" | no |
| compute_environments | Compute environments | string | "fargate" | no |
| compute_resources_max_vcpus | Max vCPUs for compute resources | number | 1 | no |
| container_image_url | Container image URL | string | "public.ecr.aws/docker/library/busybox:latest" | no |
| container_memory | Container memory resources | number | 2048 | no |
| container_vcpu | Container vCPU resources | number | 1 | no |
| efs_throughput_in_mibps | EFS provisioned throughput in MiB/s | number | 1 | no |
| efs_transition_to_ia_period | Lifecycle policy transition period to IA | string | "AFTER_7_DAYS" | no |
| region | The region to deploy into | string | "eu-west-2" | no |
| solution_name | Overall name for the solution | string | "arc-batch" | no |
| vpc_cidr_block | The CIDR block for the VPC | string | "10.0.0.0/25" | no |

Outputs

No outputs.