
aws-samples/sample-stepfunctions-s3-prefix-processor

Processing Amazon S3 objects at scale with AWS Step Functions Distributed Map S3 prefix

This sample application demonstrates iterating over Amazon S3 objects under a specified prefix using S3 ListObjectsV2 and processing them with AWS Step Functions Distributed Map.


Workflow

The following diagram shows the Step Functions workflow.

AWS Step Function workflow

  • The Step Functions state machine reads all the log files from the given S3 prefix using a Distributed Map state. For each log file entry, the state machine puts a metric into Amazon CloudWatch.
  • The state machine then stores hourly metric counts in an Amazon DynamoDB table.
  • The state machine then invokes an AWS Lambda function to perform metrics aggregation.
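The Distributed Map's S3 ListObjectsV2 item reader can be declared in Amazon States Language roughly as follows. This is an illustrative fragment, not the sample's exact definition: the state names and the `Pass` placeholder processor are hypothetical, and `<LogAnalyticsBucketName>` stands in for the deployed bucket name.

```json
{
  "ProcessLogFiles": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "<LogAnalyticsBucketName>",
        "Prefix": "logs/daily"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "STANDARD"
      },
      "StartAt": "PutMetric",
      "States": {
        "PutMetric": { "Type": "Pass", "End": true }
      }
    },
    "MaxConcurrency": 100,
    "End": true
  }
}
```

Each listed object is passed to the item processor as an entry with fields such as `Key`, `Size`, and `LastModified`; `MaxConcurrency` caps how many child executions run in parallel.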

Prerequisites

Quick Start

1. Clone the repository and navigate to the project root (all commands run from here)

Clone the GitHub repository and navigate to the project root folder.

git clone https://github.com/aws-samples/sample-stepfunctions-s3-prefix-processor.git
cd sample-stepfunctions-s3-prefix-processor

2. Deploy the application

Run the following commands to deploy the application.

sam deploy --guided

Enter the following details:

  • Stack name: The CloudFormation stack name (for example, stepfunctions-s3-prefix-processor)
  • AWS Region: A supported AWS Region (for example, us-east-1)
  • Keep the remaining options at their default values.

The outputs from sam deploy are used in the subsequent steps.

3. Generate the test data and upload it to S3 bucket

Run the following command to generate sample test data and upload it to the input S3 bucket.

python3 scripts/generate_logs.py
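For context, a minimal sketch of the kind of data such a script might produce is shown below. The log format, file name, and entry count here are hypothetical assumptions for illustration; the sample's actual scripts/generate_logs.py may differ.

```python
import random
from datetime import datetime, timedelta, timezone
from pathlib import Path

def generate_log_file(path: Path, entries: int = 100) -> None:
    """Write `entries` fake timestamped log lines to `path`.

    Hypothetical format (ISO timestamp, level, message); the sample's
    generator may use a different layout.
    """
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    levels = ["INFO", "WARN", "ERROR"]
    with path.open("w") as f:
        for i in range(entries):
            ts = start + timedelta(seconds=i * 37)
            f.write(f"{ts.isoformat()} {random.choice(levels)} request handled\n")

# Mirror the logs/daily layout that the upload step below syncs to S3.
out_dir = Path("logs/daily")
out_dir.mkdir(parents=True, exist_ok=True)
generate_log_file(out_dir / "app-2024-01-01.log")
```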

Run the following command to upload the log files to the logs/daily prefix of the S3 bucket. Replace <LogAnalyticsBucketName> with the value from the sam deploy output.

aws s3 sync logs/ s3://<LogAnalyticsBucketName>/logs/ --exclude '*' --include '*.log'

4. Test the Step Functions workflow

Run the following command to start an execution of the Step Functions state machine. Replace <StateMachineArn> with the value from the sam deploy output.

aws stepfunctions start-execution \
  --state-machine-arn <StateMachineArn> \
  --input '{}'

The Step Functions state machine iterates over all the log files under the logs/daily prefix and processes them in parallel. It updates the metrics in CloudWatch, stores hourly metric counts in the DynamoDB table, and finally invokes an AWS Lambda function to perform metrics aggregation.
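The hourly aggregation step can be sketched as a pure function that buckets timestamped log lines by hour. This is an illustrative helper under the assumption that log lines start with an ISO-8601 timestamp; the names and the line format are not taken from the sample's code.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(log_lines):
    """Bucket ISO-timestamped log lines by hour.

    Hypothetical helper mirroring the hourly metric counts the
    workflow stores in DynamoDB.
    """
    counts = Counter()
    for line in log_lines:
        # Assume the timestamp is the first whitespace-separated field.
        ts = datetime.fromisoformat(line.split(" ", 1)[0])
        counts[ts.strftime("%Y-%m-%dT%H:00")] += 1
    return dict(counts)

lines = [
    "2024-01-01T00:10:00+00:00 INFO ok",
    "2024-01-01T00:45:00+00:00 ERROR boom",
    "2024-01-01T01:05:00+00:00 INFO ok",
]
print(hourly_counts(lines))
# {'2024-01-01T00:00': 2, '2024-01-01T01:00': 1}
```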

5. Monitor the state machine execution

Run the following command to get the details of the execution. Replace <executionArn> with the value returned by the previous command.

aws stepfunctions describe-execution --execution-arn <executionArn>

Wait until the status shows SUCCEEDED.

6. Verify Results

Run the following command to check the processed output in the DynamoDB summary table. Replace <LogAnalyticsSummaryTableName> with the value from the sam deploy output.

aws dynamodb scan --table-name <LogAnalyticsSummaryTableName>

Run the following command to check the output of the Step Functions state machine execution.

aws stepfunctions describe-execution --execution-arn <executionArn> --query 'output' --output text

The output of the Step Functions state machine shows the daily summary insights that the Lambda function created from the log files.

Cleanup

Run the following commands to delete the resources deployed in this sample application.

# Empty S3 buckets before deletion (replace with actual bucket names)
aws s3 rm s3://<LogAnalyticsBucketName> --recursive

# Delete the SAM stack
sam delete

# Clean up local log files
rm -rf logs/

License

This library is licensed under the MIT-0 License. See the LICENSE file.
