# Operational Data Processing Framework using AWS Glue and Apache Hudi
The Operational Data Processing Framework (ODP Framework) contains three components: 1/ File Manager, 2/ File Processor, and 3/ Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. The source code is organized into three folders, one for each component. If you customize and adopt this framework for your use cases, we recommend promoting these components to three separate code repositories in your version control system. You can consider the following repository names:
With this modular approach, you can independently deploy the components to your data lake environment by following your preferred CI/CD processes. As illustrated in the Data Lake Reference Architecture section, these components are deployed in conjunction with a Change Data Capture (CDC) solution. For the sake of completeness, we assume that AWS DMS is used to migrate data from operational databases to Amazon S3, but we skip its implementation specifics.
- Data Lake Reference Architecture
- ODP Framework Deep Dive and Deployment
- ODP Framework Demo
## Data Lake Reference Architecture
A data lake supports a variety of analytics and machine learning (ML) use cases involving internal and external data producers and consumers. We use a simplified, generic data lake reference architecture, illustrated in the diagram below. To ingest data from operational databases into the Amazon S3 staging bucket of the data lake, you can use either AWS Database Migration Service (AWS DMS) or any AWS Partner solution from AWS Marketplace that supports Change Data Capture (CDC). AWS Glue is used to create source-aligned and consumer-aligned datasets, with separate Glue jobs handling the feature engineering portion of ML engineering and operations. Amazon Athena is used for interactive querying, and AWS Lake Formation and the AWS Glue Data Catalog are used for governance.
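As a concrete illustration of the interactive-query path, the sketch below submits a query to Amazon Athena with boto3. The database name, query text, and S3 result location are placeholders for illustration only, not names taken from this repository.

```python
# Hypothetical sketch: running an interactive Athena query with boto3.
# Database, query, and output-location values are illustrative assumptions.

def build_athena_request(query: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for athena.start_query_execution()."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

def run_query(query: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution ID (requires AWS credentials)."""
    import boto3  # imported here so the request builder stays usable offline
    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(query, database, output_location)
    )
    return response["QueryExecutionId"]
```

In practice you would poll `get_query_execution` until the query reaches a terminal state before fetching results from the output location.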
In our architecture, we used AWS Database Migration Service (AWS DMS) to ingest data from operational data sources (ODS) into the S3 staging layer of the data lake, and AWS Glue to run the data ingestion and transformation pipelines. To populate the Raw zone of the data lake, we used Apache Hudi as an incremental data processing solution in conjunction with Apache Parquet. The Apache Hudi Connector for AWS Glue makes it easy to use Apache Hudi within the Glue ecosystem (Glue ETL and the Glue Data Catalog).
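To make the Hudi-based Raw zone load concrete, here is a minimal sketch of the write options a Glue (PySpark) job might pass when upserting CDC data into a Copy-on-Write Hudi table. The `hoodie.*` keys are standard Hudi datasource configurations; the table, database, key field, and S3 path values are illustrative assumptions, not names from the ODP Framework source.

```python
# Hypothetical sketch: Hudi write options for a Glue job loading the Raw zone.
# Table/database/field names below are placeholder assumptions.

def hudi_upsert_options(table_name: str, database: str, record_key: str,
                        precombine_field: str, partition_field: str) -> dict:
    """Build Hudi options for an upsert into a Copy-on-Write table,
    syncing table metadata to the Glue Data Catalog via Hive sync."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
        "hoodie.datasource.hive_sync.use_jdbc": "false",  # sync through Glue Data Catalog
    }

# Inside a Glue job, the staged CDC DataFrame would then be written roughly as:
# (staged_df.write.format("hudi")
#     .options(**hudi_upsert_options("orders", "raw_db", "order_id",
#                                    "updated_at", "order_date"))
#     .mode("append")
#     .save("s3://my-raw-zone-bucket/orders/"))
```

The precombine field lets Hudi keep the latest CDC record when multiple changes to the same key arrive in one batch, which is what makes the incremental load idempotent.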
## ODP Framework Deep Dive and Deployment
## ODP Framework Demo
Refer to ODP Framework Demo.
The following people were involved in the design, architecture, development, and testing of this solution:
- Srinivas Kandi, Data Architect, Amazon Web Services Inc.
- Ravi Itha, Principal Consultant, Amazon Web Services Inc.
This project is licensed under the Apache-2.0 License.