SQL-Based ETL with Apache Spark on Amazon EKS

The 'SQL-Based Extraction, Transformation and Loading (ETL) with Apache Spark on Amazon EKS' guidance provides declarative data processing support, codeless extract-transform-load (ETL) capabilities, and workflow orchestration automation to help business users (such as analysts and data scientists) access their data and create meaningful insights without a need for manual IT processes.

This guidance abstracts common ETL activities, including formatting, partitioning, and transforming datasets, into configurable and productive data processes. This abstraction results in actionable insights derived more quickly to help you accelerate your data-driven business decisions. Additionally, this guidance uses an open-source Arc data processing framework, run on Amazon Elastic Kubernetes Service (Amazon EKS) and powered by Apache Spark and container technologies, to simplify Spark application development and deployment.

This guidance uses GitHub as the source repository to track ETL asset changes, such as JupyterHub notebook scripts and SQL script updates, allowing for application version control and standardized continuous integration and continuous delivery (CI/CD) deployments. It unifies analytical workloads and IT operations using standardized and automated processes, providing a simplified ETL deployment management capability for your organization’s DevOps team. These automated processes help you avoid unintentional human mistakes caused by manual, repetitive tasks.

Features and benefits

The Guidance for SQL-Based ETL with Apache Spark on Amazon EKS offers the following features:

  • Build, test, and debug ETL jobs in JupyterHub: Use JupyterHub, a web-based interactive integrated development environment (IDE) to simplify your ETL development experience. It includes a custom Arc kernel that enables you to define each ETL task or stage in separate blocks. The execution of each block produces data results and a task log. The log captures real-time data processing status and exception messages that can be valuable for debugging.

  • Use a SQL-first approach: Implement business logic and data quality checks in ETL pipeline development using Spark SQL. You can process and manipulate data in Spark using your existing SQL expertise (a minimal PySpark sketch of this approach follows this list).

  • Orchestrate jobs without code: Use Argo Workflows to schedule jobs and manage complex job run dependencies without the need to write code. Argo Workflows lets you declaratively define each job's target state, execution order, and relationships. It provides a user-friendly graphical dashboard to help you track workflow status and resource usage patterns. The job orchestration tool in this guidance is a switchable plug-in and can be replaced by another tool of your choice, such as Apache Airflow or Volcano.

  • Auto-deploy Docker images: Set up an AWS continuous integration and continuous delivery (CI/CD) pipeline to securely store the guidance's Arc Docker image in Amazon Elastic Container Registry (Amazon ECR).

  • Automate ETL artifact deployment: Integrate the Jupyter IDE with the guidance's GitHub repository to detect ETL application changes. After a one-off setup with your choice of CI/CD tool, a GitHub change triggers a file sync between your repository and an Amazon S3 artifact bucket (a rough sketch of such a sync also follows this list). As a result, Argo Workflows can refer to the latest or versioned application assets in S3, such as a Jupyter notebook (ETL job specification) file, and then orchestrate an ETL job run on demand, on a schedule, or in response to an event.
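
To illustrate the SQL-first approach above, here is a minimal PySpark sketch of the same idea: data is read, registered as a view, and transformed entirely with SQL. In the guidance itself this logic is expressed declaratively as Arc stages inside a Jupyter notebook; the file paths, column names, and view names below are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal PySpark sketch of a SQL-first transform (paths and names are hypothetical).
spark = SparkSession.builder.appName("sql-first-etl-sketch").getOrCreate()

# Extract: read raw CSV data and expose it to SQL as a temporary view.
raw = spark.read.option("header", True).csv("s3://my-etl-assets-bucket/raw/orders.csv")
raw.createOrReplaceTempView("orders_raw")

# Transform: business logic and a simple data quality filter written as plain SQL.
curated = spark.sql("""
    SELECT order_id,
           customer_id,
           CAST(amount AS DOUBLE) AS amount,
           to_date(order_ts)      AS order_date
    FROM   orders_raw
    WHERE  order_id IS NOT NULL
""")

# Load: write the curated dataset back to S3, partitioned by date.
curated.write.mode("overwrite").partitionBy("order_date") \
       .parquet("s3://my-etl-output-bucket/curated/orders/")
```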
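
The ETL artifact deployment step can be wired up with any CI/CD tool. As a rough illustration of the sync itself, the following sketch uploads notebook and SQL files to an S3 artifact bucket with boto3; the bucket name and local repository path are hypothetical placeholders.

```python
import pathlib
import boto3

# Hypothetical artifact bucket and local checkout of the ETL asset repository.
BUCKET = "my-etl-artifact-bucket"
REPO_DIR = pathlib.Path("./etl-assets")

s3 = boto3.client("s3")

# Upload every notebook and SQL script, preserving the repository layout as the S3 key.
for path in REPO_DIR.rglob("*"):
    if path.suffix in (".ipynb", ".sql"):
        key = path.relative_to(REPO_DIR).as_posix()
        s3.upload_file(str(path), BUCKET, key)
        print(f"synced {path} -> s3://{BUCKET}/{key}")
```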

Use cases

In a modern data lake architecture, data engineers typically must format, partition, and transform datasets before making the data available to data analysts and data scientists, who can then produce business insights from well-formed data. For organizations operating on SQL-based data management systems or tools, adapting to modern data lake practices built on Apache Spark slows the journey from data to insight. Increasingly, a business's success depends on its agility in transforming data into actionable insights, which requires efficient and automated data processes. This AWS Guidance addresses the gap between data consumption requirements and low-level data engineering activities.

To accelerate data innovation with faster time to insight, this AWS Guidance provides a codeless extract-transform-load (ETL) option driven by a SQL-centric architecture. By leveraging the open-source Arc data framework, powered by Apache Spark and container technologies, it enables you to build a modern data solution on an AWS managed container service with ease. Additionally, if you want to take advantage of the optimized Amazon EMR runtime for Apache Spark, plus other features offered by EMR, the Arc framework can be used with the Amazon EMR on EKS deployment option with no code changes.

Architecture overview

This section provides an architecture diagram and describes the components deployed with this Guidance.

Architecture steps

  1. Users interact with Extraction, Transformation and Loading (ETL) development and orchestration tools via Amazon CloudFront endpoints with Application Load Balancer origins, which provide secure connections between clients and ETL tools’ endpoints.
  2. Users interactively develop, test and schedule ETL jobs that process batch and stream data. The data traffic between ETL processes and data stores flows through Amazon Virtual Private Cloud (VPC) Endpoints (AWS PrivateLink) without leaving the AWS network.
  3. JupyterHub, Argo Workflows and the Kubernetes Operator for Apache Spark run as containers on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The JupyterHub IDE can integrate with a source code repository (GitHub) to track ETL asset changes made by users. The assets include JupyterHub notebook files, SQL scripts, and other artifacts to be run with the Arc ETL framework.
  4. ETL assets can be updated by users in the source code repository (GitHub), then uploaded to an Amazon S3 bucket. The synchronization process can be implemented by an automated CI/CD deployment process triggered by updates in the source code repository or performed manually.
  5. Users can change the Docker build source code uploaded from the code repository to the S3 ETL asset bucket. This activates an AWS CodeBuild/AWS CodePipeline CI/CD pipeline that automatically rebuilds the Arc ETL Framework container image and pushes it to the Amazon ECR private registry.
  6. Users schedule ETL jobs via Argo Workflows to run on the Amazon EKS cluster. These jobs automatically pull the Arc ETL Framework container image from Amazon ECR, download ETL assets from the S3 artifact bucket, and send application execution logs to Amazon CloudWatch. Access to all AWS services is secured via VPC endpoints.
  7. The JupyterHub IDE automatically retrieves login credentials from AWS Secrets Manager to validate user sign-in requests. Authenticated users can interactively develop and test their notebooks as ETL jobs in JupyterHub (a minimal credential-retrieval sketch follows these steps).
  8. Users can access the ETL output data stored in the S3 bucket in a transactional data lake format and query the Delta Lake tables through Amazon Athena, which is integrated with the AWS Glue Data Catalog (an Athena query sketch also follows these steps).
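
For step 7, credential retrieval from Secrets Manager is a standard boto3 call. A minimal sketch follows; the secret name and JSON field names are hypothetical, as the guidance creates its own secret at deploy time.

```python
import json
import boto3

# Retrieve the JupyterHub login credentials stored in AWS Secrets Manager
# (the secret name and field names are hypothetical).
secrets = boto3.client("secretsmanager")
response = secrets.get_secret_value(SecretId="jupyterhub-login-credentials")
credentials = json.loads(response["SecretString"])

username = credentials["username"]
password = credentials["password"]
```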
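
For step 8, the ETL output can be queried through Athena with boto3, as sketched below. The database, table, and query result location are hypothetical placeholders for the resources this guidance creates.

```python
import time
import boto3

athena = boto3.client("athena")

# Start a query against a table registered in the AWS Glue Data Catalog
# (database, table, and result location are hypothetical).
execution = athena.start_query_execution(
    QueryString="SELECT * FROM curated_orders LIMIT 10",
    QueryExecutionContext={"Database": "etl_demo_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"fetched {len(rows)} rows (including the header row)")
```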

AWS Services used in this Guidance

| AWS service | Description |
| --- | --- |
| Amazon Elastic Kubernetes Service (Amazon EKS) | Core service - hosts the guidance workloads |
| Amazon Virtual Private Cloud (VPC) | Core service - network security layer |
| Amazon Elastic Compute Cloud (EC2) | Core service - EC2 instances power the On-Demand and Spot-based EKS compute node groups that run container workloads |
| Amazon Elastic Container Registry (ECR) | Core service - hosts the container images for Spark jobs and the Arc ETL Framework |
| Amazon EMR on EKS | Auxiliary service - alternative way to configure and run ETL jobs on EKS |
| Amazon Athena | Core service - used for SQL querying of sample ETL job results in S3 |
| AWS Glue Data Catalog | Auxiliary service - exposes ETL-related data stores |
| Amazon S3 | Core service - object storage for users' ETL assets from GitHub |
| Amazon CloudFront | Auxiliary service - provides SSL entry points for the JupyterHub and Argo Workflows tools |
| Amazon CloudWatch | Auxiliary service - provides observability for core services |
| AWS Secrets Manager | Auxiliary service - manages user credentials for the JupyterHub IDE |
| AWS CodeBuild | Core service - CI/CD automation for building Arc ETL Framework images |
| AWS CodePipeline | Core service - CI/CD automation for pushing Arc ETL Framework images to the ECR registry |

Plan your deployment

Cost

You are responsible for the cost of the AWS services used while running this Guidance. As of February 2024, the estimated cost for running this Guidance with the default two-node Amazon EKS cluster in the US East (N. Virginia) Region is approximately $0.64 an hour or $467.76 per month. Refer to the AWS pricing webpage for each AWS service used in this Guidance.

Sample Cost table

The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.

| AWS service | Rate | Cost [USD] |
| --- | --- | --- |
| Amazon EKS cluster (no compute) | $0.10 per hour per cluster X 1 cluster | $73.00 |
| Amazon EC2 (On-Demand) | $0.1632 per hour X 1 m7g.xlarge instance | $119.14 |
| Amazon EC2 (Spot) | $0.1174 per hour X 1 r4.xlarge instance | $85.70 |
| Amazon EC2 (Graviton Spot) | $0.0966 per hour X 1 r7g.xlarge instance | $70.52 |
| Elastic Load Balancing | $0.0225 per Application Load Balancer per hour X 2 ALBs | $32.85 |
| Elastic Load Balancing | $0.008 per Load Balancer Capacity Unit (LCU) per hour X 2 ALBs | $11.68 |
| VPC endpoints | $0.01 per hour per VPC endpoint per Availability Zone (AZ) X 5 endpoints (Amazon S3, Amazon Athena, Amazon ECR, AWS KMS, and Amazon CloudWatch) X 2 AZs | $73.00 |
| VPC endpoints | $0.01 per GB of data processed X 10 GB per month | $0.10 |
| Amazon S3 (storage) | $0.023 per GB for the first 50 TB/month X 1 GB | $0.02 |
| AWS CodeBuild | $0.005 per build minute on general1.small X 150 minutes per month | $0.75 |
| AWS CodePipeline | $1.00 per active pipeline per month X 1 | $1.00 |
| Total estimated cost per month | | $467.76 |
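
As a quick check of the table's arithmetic, the sketch below sums the listed monthly line items and converts the total back to an hourly rate, assuming a 730-hour month as AWS pricing examples typically do. The figures are copied directly from the table above.

```python
# Sum the sample cost table's monthly line items (USD).
line_items = [
    73.00,   # Amazon EKS cluster
    119.14,  # EC2 On-Demand (m7g.xlarge)
    85.70,   # EC2 Spot (r4.xlarge)
    70.52,   # EC2 Graviton Spot (r7g.xlarge)
    32.85,   # ALB hours (2 ALBs)
    11.68,   # ALB LCUs (2 ALBs)
    73.00,   # VPC endpoint hours
    0.10,    # VPC endpoint data processing
    0.02,    # Amazon S3 storage
    0.75,    # AWS CodeBuild
    1.00,    # AWS CodePipeline
]

monthly_total = sum(line_items)        # ~467.76 USD
hourly_rate = monthly_total / 730      # 730 hours per month -> ~0.64 USD/hour
print(f"monthly: ${monthly_total:.2f}, hourly: ${hourly_rate:.2f}")
```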

Amazon CloudFront cost is not included in the estimation table, as its monthly Free Tier fully covers the usage. To avoid instance capacity issues, the additional instance types r5.xlarge and r5a.xlarge are included in the EC2 Spot Instance fleet, and r6g.xlarge and r6gd.xlarge are included in the Graviton Spot Instance fleet. Their pricing varies based on the time period your instances are running. For more information on Spot Instance pricing, refer to the Amazon EC2 Spot Instances Pricing page.

Deployment

Please see the detailed Implementation Guide here, with step-by-step deployment, validation, and extension instructions.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the MIT-0 License. See the LICENSE file.
