aws-samples · navbalaraman · Sep 9, 2022 · Sep 9, 2022
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,21 @@
+.DS_Store
+.idea
+.mvn
+.terraform
+.vscode
+terraform.tfstate.backup
+.terraform.lock.hcl
+target
+out
+source/clicklogger/src/main/.DS_Store
+source/clicklogger/src/main/java/.DS_Store
+source/clicklogger/src/main/java/com/.DS_Store
+source/clicklogger/src/main/java/com/clicklogs/.DS_Store
+terraform/workspaces/us-east-1/terraform.tfstate
+terraform/workspaces/us-east-1/.terraform/*
+terraform/workspaces/us-east-1/.terraform*
+source/clicklogger/src/test/.DS_Store
+source/clicklogger/src/test/java/.DS_Store
+source/clicklogger/src/test/java/com/.DS_Store
+assets/.$emr-serverless-click-logs-from-web-application.drawio.bkp
+assets/.$emr-serverless-click-logs-from-web-application.drawio.dtmp
diff --git a/HELP.md b/HELP.md
@@ -0,0 +1,17 @@
+# Getting Started
+
+### Reference Documentation
+For further reference, please consider the following sections:
+
+* [Official Apache Maven documentation](https://maven.apache.org/guides/index.html)
+* [Spring Boot Maven Plugin Reference Guide](https://docs.spring.io/spring-boot/docs/2.4.2/maven-plugin/reference/html/)
+* [Create an OCI image](https://docs.spring.io/spring-boot/docs/2.4.2/maven-plugin/reference/html/#build-image)
+* [Spring Web](https://docs.spring.io/spring-boot/docs/2.4.2/reference/htmlsingle/#boot-features-developing-web-applications)
+
+### Guides
+The following guides illustrate how to use some features concretely:
+
+* [Building a RESTful Web Service](https://spring.io/guides/gs/rest-service/)
+* [Serving Web Content with Spring MVC](https://spring.io/guides/gs/serving-web-content/)
+* [Building REST services with Spring](https://spring.io/guides/tutorials/bookmarks/)
+
diff --git a/LICENSES/MIT-0.txt b/LICENSES/MIT-0.txt
@@ -0,0 +1,15 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
diff --git a/README.md b/README.md
@@ -1,17 +1,216 @@
-## My Project
+# Running a Data Processing Job on EMR Serverless with AWS Step Functions and AWS Lambda using Terraform (By HashiCorp)
 
-TODO: Fill this README out!
 
-Be sure to:
+In this blog we showcase how to build and orchestrate a [Scala](https://www.scala-lang.org/) Spark Application using [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) , AWS Step Functions and [Terraform By HashiCorp](https://www.terraform.io/). In this end to end solution we execute a Spark job on EMR Serverless which processes sample click-stream data in Amazon S3 bucket and stores the aggregation results in Amazon S3. 
+
+With EMR Serverless, customers don’t have to configure, optimize, secure, or operate clusters to run applications. You will continue to get the benefits of [Amazon EMR](https://aws.amazon.com/emr/), such as open source compatibility, concurrency, and optimized runtime performance for popular data frameworks. EMR Serverless is suitable for customers who want ease in operating applications using open-source frameworks. It offers quick job startup, automatic capacity management, and straightforward cost controls.
+
+There are several ‘infrastructure as code’ frameworks available today, to help customers define their infrastructure, such as the AWS CDK or Terraform. Terraform, an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency, is an infrastructure as code tool similar to AWS CloudFormation that allows you to create, update, and version your AWS infrastructure. Terraform provides friendly syntax (similar to [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html)) along with other features like planning (visibility to see the changes before they actually happen), graphing, ability to create templates to break infra configurations into smaller chunks which allows better  maintenance and reusability. We will leverage the capabilities and features of Terraform to build an API based ingestion process into AWS. Let’s get started!
+
+We will provide the Terraform infrastructure definition and the source code for an AWS Lambda using which sample customer user clicks for online website inputs are ingested into an [Amazon Kinesis Data Firehose](https://aws.amazon.com/kinesis/data-firehose/). The solution leverages Firehose’s capability to convert the incoming data into a Parquet file (an open-source file format for Hadoop) before pushing it to [Amazon S3](https://aws.amazon.com/s3/) using [AWS Glue](https://aws.amazon.com/glue/) catalog. The generated output S3 Parquet file logs are then processed by an EMR Serverless process and outputs a report detailing aggregate click stream statistics in S3 bucket. The EMR serverless operation is triggered using [AWS Step Functions](https://aws.amazon.com/step-functions). The sample architecture and code will be spun up as below.
+
+Provided samples have the source code for building the infrastructure using Terraform for running the Amazon EMR Application. Setup scripts are provided to create the sample ingestion using AWS Lambda for incoming application logs.  Similar ingestion pattern sample was terraformed in an earlier [blog](https://aws.amazon.com/blogs/developer/provision-aws-infrastructure-using-terraform-by-hashicorp-an-example-of-web-application-logging-customer-data/). 
+
+Overview of the steps and the AWS Services used in this solution:
 
-* Change the title in this README
-* Edit your repository description on GitHub
+* Java source build – Provided application code is packaged & built using Apache Maven
+* Terraform commands are used to deploy the infrastructure in AWS.
+* [Amazon EMR Serverless](https://aws.amazon.com/emr/serverless/) Application - provides the option to submit a Spark job.
+* [AWS Lambda](https://aws.amazon.com/lambda/):
+    * Ingestion Lambda – This lambda processes the incoming request and pushes the data into Firehose stream.
+    * EMR Start Job Lambda - This lambda starts the EMR Serverless application, the EMR job process converts the ingested user click logs into output in another S3 bucket.
+* [AWS Step Functions](https://aws.amazon.com/step-functions)  triggers the EMR Start Job Lambda which submits the application to EMR Serverless for processing of the ingested log files.
+* [Amazon Simple Storage Service](https://aws.amazon.com/s3/) (Amazon S3)  
+    * Firehose Delivery Bucket - Stores the ingested application logs in parquet file format
+    * Loggregator Source Bucket - Stores the scala code/jar for EMR job execution
+    * Loggregator Output Bucket - EMR processed output is stored in this bucket
+    * EMR Serverless logs Bucket - Stores EMR process application logs
+* Sample AWS Invoke commands (run as part of initial set up process) inserts the data using the Ingestion Lambda and Firehose stream converts the incoming stream into a Parquet file and stored in an S3 bucket
 
-## Security
+
+![Alt text](assets/emr-serverless-click-logs-from-web-application.drawio.png?raw=true "Title")
+### Prerequisites
 
-See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
+* [AWS Cli](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) - At the time of writing this article version 2.7.18 was used. This will be required to query aws emr-serverless cli commands from your local machine. Optionally all the AWS Services used in this blog can be viewed/operated from AWS Console also.
+* Make sure to have [Java](https://www.java.com/en/download/) installed, JDK/JRE 8 is set in the environment path of your machine. For instructions, see [Java Development Kit](https://www.java.com/en/download/)
+* [Apache Maven](https://maven.apache.org/download.cgi) – Java Lambdas are built using mvn packages and are deployed using Terraform into AWS
+* [Scala Build Tool](https://www.scala-sbt.org/download.html) (sbt) - Version 1.4.7 is used at the time of this article. Make sure to download and install based on your operating system needs.
+* Set up [Terraform](https://www.terraform.io/downloads). For steps, see Terraform downloads. Version 1.2.5 is used at the time of this article.
+* An [AWS Account](https://aws.amazon.com/free/)
 
-## License
+### Design Decisions
 
-This library is licensed under the MIT-0 License. See the LICENSE file.
+* We use AWS Step Functions and AWS Lambda in this use case to trigger the EMR Serverless Application. In real world, the data processing application could be long running and may exceed AWS Lambda’s execution timeout.  Tools like [Amazon Managed Workflows for Apache Airflow (MWAA)](https://aws.amazon.com/managed-workflows-for-apache-airflow/) can be used. Amazon Managed Apache airflow is a managed orchestration service makes it easier to set up and operate end-to-end data pipelines in the cloud at scale
+* AWS Lambda Code & EMR Serverless Log Aggregation code are developed using Java & Scala respectively. These can any done using any supported languages in these use cases.
+* AWS CLI V2 is required for querying Amazon EMR Serverless applications from command line. These can be viewed from AWS Console also. A sample CLI command provided below in the “Testing” section below.
 
+### Steps
+
+
+ Clone [this repository](https://github.com/aws-samples/aws-emr-serverless-using-terraform) and execute the below command to spin up the infrastructure and the application
+Provided “exec.sh” shell script builds the Java application jar (For the Lambda Ingestion), the Scala application Jar (For the EMR Processing) and deploys the AWS Infrastructure that is needed for this use case.
+
+Execute the below commands
+
+
+```
+$ chmod +x exec.sh
+$ ./exec.sh
+```
+
+
+To run the commands individually
+
+Set the application deployment region and account number. An example below. Modify as needed. 
+
+```
+$ APP_DIR=$PWD
+ $ APP_PREFIX=clicklogger
+ $ STAGE_NAME=dev
+ $ REGION=us-east-1
+ $ ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
+```
+
+Maven build AWS Lambda Application Jar & Scala Application package
+
+```
+$ cd $APP_DIR/source/clicklogger
+ $ mvn clean package
+$ sbt reload
+ $ sbt compile
+ $ sbt package
+```
+
+
+Deploy the AWS Infrastructure using Terraform
+
+```
+$ terraform init
+ $ terraform plan
+ $ terraform apply --auto-approve
+```
+
+### Testing
+
+
+
+ Once the application is built and deployed, you can also insert sample data for the EMR processing. An example as below. Note exec.sh has multiple sample insertions for AWS Lambda. The ingested logs will be used by the EMR Serverless Application job
+
+Below sample AWS CLI Invoke command inserts sample data for the application logs
+
+```
+aws lambda invoke --function-name clicklogger-dev-ingestion-lambda —cli-binary-format raw-in-base64-out —payload '{"requestid":"OAP-guid-001","contextid":"OAP-ctxt-001","callerid":"OrderingApplication","component":"login","action":"load","type":"webpage"}' out
+```
+
+Validate the Deployments
+
+* Output – Once the Lambda is successfully executed, you should see the output in S3 buckets as shown below 
+* Validate the saved ingested data as below
+    * Navigate to the bucket created as part of the stack.
+    * Select the file and view the file from “Select From” sub tab.  
+    * You should see something ingested stream got converted into parquet file. * 
+    * Select the file and view the data. A sample is shown below
+
+![Alt text](assets/s3_source_parquet_files.png?raw=true "Title")
+
+* Run AWS Step Function to validate the Serverless application
+        * Open AWS Console > AWS Step Function > Open "clicklogger-dev-state-machine". 
+        * The step function will show the steps that ran to trigger the AWS Lambda and EMR Serverless Application
+        * Start a new execution to trigger the AWS Lambda and EMR Serverless Application/Job
+        * Once the AWS Step Function is successful, navigate to Amazon S3 > clicklogger-dev-outputs-bucket- to see the output files. 
+        * These will be partitioned by year/month/date/response.md. A sample is shown below
+
+![Alt text](assets/s3_output_response_file.png?raw=true "Title")
+
+
+AWS CLI can be used to check the deployed AWS Serverless Application
+
+```
+$ aws emr-serverless list-applications \
+      | jq -r '.applications[] | select(.name=="clicklogger-dev-loggregrator-emr-<Your-Account-Number>").id'
+
+
+```
+
+![Alt text](assets/step_function_success.png?raw=true "Title")
+
+EMR Studio
+
+* Open AWS Console, Navigate to “EMR” > “Serverless” tab on the left pane.
+* Select “clicklogger-dev-studio” and click “Manage Applications”
+
+
+
+![Alt text](assets/EMRStudioApplications.png?raw=true "Title")
+
+![Alt text](assets/EMRServerlessApplication.png?raw=true "Title")
+
+Reviewing the Serverless Application Output:
+
+
+* Open AWS Console, Navigate to Amazon S3
+* Open the outputs S3 bucket. This will be like - us-east-1-clicklogger-dev-loggregator-output-<YOUR-ACCOUNT-NUMBER>
+* The EMR Serverless application writes the output based on the date partition as below
+    * 2022/07/28/response.md
+    * Output of the file will be like below
+
+```
+
+ |*createdTime*|*callerid*|*component*|*count*
+ |------------|-----------|-----------|-------
+ *07-28-2022*|OrderingApplication|checkout|2
+ *07-28-2022*|OrderingApplication|login|2
+ *07-28-2022*|OrderingApplication|products|2
+```
+
+## Cleanup
+
+
+Provided "./cleanup.sh" has the required steps to delete all the files from Amazon S3 buckets that were created as part of this blog. terraform destroy command will clean up the AWS infrastructure those were spun up as mentioned above
+
+
+```
+$ chmod +x cleanup.sh
+$ ./cleanup.sh
+```
+
+* To do the steps manually,
+
+S3 and created services can be deleted using CLI also. Execute the below commands (an example below, modify as needed):
+
+```
+
+
+# CLI Commands to delete the S3  
+
+aws s3 rb s3://clicklogger-dev-emr-serverless-logs-bucket-<your-account-number> --force
+aws s3 rb s3://clicklogger-dev-firehose-delivery-bucket-<your-account-number> --force
+aws s3 rb s3://clicklogger-dev-loggregator-output-bucket-<your-account-number> --force
+aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
+aws s3 rb s3://clicklogger-dev-loggregator-source-bucket-<your-account-number> --force
+
+# Destroy the AWS Infrastructure 
+terraform destroy --auto-approve
+
+
+```
+
+
+
+## Conclusion
+
+
+To recap, in this post we built, deployed & ran a data processing spark job in Amazon EMR Serverless that interacts with various AWS Services. The post walked through deploying a lambda packaged with Java using maven, a Scala application code for EMR Serverless Application triggered with AWS Step Functions with infrastructure as code. You may use any combination of applicable programming languages to build your lambda functions, EMR Job application. EMR Serverless can be triggered manually, automated or can be orchestrated using AWS Services like AWS Step Function, Amazon Managed Apache airflow, etc., 
+
+We encourage you to test this example and see for yourself how this overall application design works within AWS. Then, it will be just the matter of replacing your individual code base, package them and let the Amazon EMR Serverless handle the process efficiently.
+
+If you implement this example and run into any issues, or have any questions or feedback about this blog please provide your comments below!
+
+## References
+
+* [Terraform: Beyond the basics with AWS](https://aws.amazon.com/blogs/apn/terraform-beyond-the-basics-with-aws/) 
+* [Amazon EMR Serverless General Availability](https://aws.amazon.com/about-aws/whats-new/2022/06/amazon-emr-serverless-generally-available/)
+* [Amazon EMR Serverless Now Generally Available – Run Big Data Applications without Managing Servers](https://aws.amazon.com/blogs/aws/amazon-emr-serverless-now-generally-available-run-big-data-applications-without-managing-servers/)
+* [Provision AWS infrastructure using Terraform (By HashiCorp): an example of web application logging customer data](https://aws.amazon.com/blogs/developer/provision-aws-infrastructure-using-terraform-by-hashicorp-an-example-of-web-application-logging-customer-data/)
+
+
diff --git a/assets/AWSStepFunction.png b/assets/AWSStepFunction.png
diff --git a/assets/AWSStepFunction.png.license b/assets/AWSStepFunction.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/EMRServerlessApplication.png b/assets/EMRServerlessApplication.png
diff --git a/assets/EMRServerlessApplication.png.license b/assets/EMRServerlessApplication.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/EMRStudioApplications.png b/assets/EMRStudioApplications.png
diff --git a/assets/EMRStudioApplications.png.license b/assets/EMRStudioApplications.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/emr-serverless-click-logs-from-web-application.drawio.license b/assets/emr-serverless-click-logs-from-web-application.drawio.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/emr-serverless-click-logs-from-web-application.drawio.png b/assets/emr-serverless-click-logs-from-web-application.drawio.png
diff --git a/assets/emr-serverless-click-logs-from-web-application.drawio.png.license b/assets/emr-serverless-click-logs-from-web-application.drawio.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/s3_output_response_file.png b/assets/s3_output_response_file.png
diff --git a/assets/s3_output_response_file.png.license b/assets/s3_output_response_file.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/s3_source_parquet_files.png b/assets/s3_source_parquet_files.png
diff --git a/assets/s3_source_parquet_files.png.license b/assets/s3_source_parquet_files.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/step_function_caught.png b/assets/step_function_caught.png
diff --git a/assets/step_function_caught.png.license b/assets/step_function_caught.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/step_function_success.png b/assets/step_function_success.png
diff --git a/assets/step_function_success.png.license b/assets/step_function_success.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/assets/step_function_uncaught.png b/assets/step_function_uncaught.png
diff --git a/assets/step_function_uncaught.png.license b/assets/step_function_uncaught.png.license
@@ -0,0 +1,3 @@
+SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
+
+SPDX-License-Identifier: MIT-0
diff --git a/cleanup.sh b/cleanup.sh
@@ -0,0 +1,26 @@
+#! /bin/bash
+
+echo 'Cleaning up Deployed Infrastructure..'
+echo $PWD
+APP_DIR=$PWD
+APP_PREFIX=clicklogger
+STAGE_NAME=dev
+REGION=us-east-1
+
+ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
+echo $ACCOUNT_ID
+
+aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-emr-logs-$ACCOUNT_ID --force
+aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-firehose-delivery-$ACCOUNT_ID --force
+aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-loggregator-output-$ACCOUNT_ID --force
+aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-loggregator-source-$ACCOUNT_ID --force
+aws s3 rb s3://$REGION-$APP_PREFIX-$STAGE_NAME-emr-studio-$ACCOUNT_ID --force
+echo 'Deleted S3 contents'
+
+echo 'Terraform Destroy Resources'
+cd $APP_DIR/terraform/workspaces/$REGION
+terraform destroy --auto-approve
+
+cd $APP_DIR
+
+echo 'Completed Successfully!'
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		SPDX-FileCopyrightText: Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved

		SPDX-License-Identifier: MIT-0