
Operational Data Processing Framework using AWS Glue and Apache Hudi

The Operational Data Processing Framework (ODP Framework) contains three components: 1/ File Manager, 2/ File Processor, and 3/ Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. The source code is organized into three folders, one per component. If you customize and adopt this framework for your use cases, we recommend promoting these components to three separate code repositories in your version control system. You can consider the following repository names:

  1. aws-glue-hudi-odp-framework-file-manager
  2. aws-glue-hudi-odp-framework-file-processor
  3. aws-glue-hudi-odp-framework-config-manager

With this modular approach, you can independently deploy the components to your data lake environment by following your preferred CI/CD processes. As illustrated in the Overall Architecture section, these components are deployed in conjunction with a Change Data Capture (CDC) solution. For the sake of completeness, we assume that AWS DMS is used to migrate data from operational databases to Amazon S3, but we skip its implementation specifics.


Data Lake Reference Architecture

A data lake solves a variety of analytics and machine learning (ML) use cases dealing with internal and external data producers and consumers. We use a simplified and generic data lake reference architecture, illustrated in the diagram below. To ingest data from operational databases into the Amazon S3 staging bucket of the data lake, you can use either AWS Database Migration Service (AWS DMS) or any AWS Partner solution from AWS Marketplace that supports Change Data Capture (CDC). AWS Glue is used to create source-aligned and consumer-aligned datasets, with separate Glue jobs for the feature engineering part of ML engineering and operations. Amazon Athena is used for interactive querying, and AWS Lake Formation and the Glue Data Catalog are used for governance.
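To make the zone layout concrete, the sketch below builds S3 prefixes for the staging and consumer-aligned zones described above. The bucket, zone, and table names are purely illustrative assumptions, not part of the framework.

```python
# Illustrative only: a hypothetical S3 prefix convention for the data lake
# zones (staging, raw, consumer-aligned). All names here are assumptions.
def zone_path(bucket: str, zone: str, database: str, table: str) -> str:
    """Build the S3 path for one table in a given data lake zone."""
    return f"s3://{bucket}/{zone}/{database}/{table}/"

# Example: the same operational table as it moves through the zones.
staging_path = zone_path("example-data-lake", "staging", "sales_db", "orders")
raw_path = zone_path("example-data-lake", "raw", "sales_db", "orders")
```

A consistent prefix convention like this makes it straightforward to derive target paths in Glue jobs from configuration rather than hard-coding them.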

In our architecture, we used AWS Database Migration Service (DMS) to ingest data from operational data sources (ODS) into the S3 staging layer of the data lake. We used AWS Glue to run data ingestion and transformation pipelines. To populate the Raw zone of the data lake, we used Apache Hudi as an incremental data processing solution in conjunction with Apache Parquet. The Apache Hudi Connector for AWS Glue makes it easy to use Apache Hudi within the Glue ecosystem (Glue ETL and the Glue Data Catalog).
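As a minimal sketch of how a Glue ETL job might configure a Hudi upsert into the Raw zone, the helper below assembles standard Apache Hudi write options (the kind passed to `df.write.format("hudi").options(**opts)` in PySpark). The table, field, and database names are hypothetical; the option keys are standard Hudi configuration properties.

```python
# A minimal sketch, not the framework's actual code: build the write
# options a Glue job could pass to df.write.format("hudi").options(**opts).
def hudi_upsert_options(table: str, record_key: str,
                        precombine_field: str, partition_field: str) -> dict:
    """Return standard Apache Hudi options for an upsert write."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        # Hive-style sync registers the table in the Glue Data Catalog.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": "raw_db",  # hypothetical name
        "hoodie.datasource.hive_sync.table": table,
    }

# Hypothetical example: upsert an "orders" table keyed by order_id,
# deduplicating on last_updated_ts and partitioning by order_date.
opts = hudi_upsert_options("orders", "order_id",
                           "last_updated_ts", "order_date")
```

The precombine field lets Hudi pick the latest record when a CDC stream delivers multiple updates for the same key, which is what makes the incremental pipeline idempotent.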


ODP Framework Deep Dive and Deployment

Refer to:

  1. File Manager README.
  2. File Processor README.

ODP Framework Demo

Refer to ODP Framework Demo.


The following people are involved in the design, architecture, development, and testing of this solution:

  1. Srinivas Kandi, Data Architect, Amazon Web Services Inc.
  2. Ravi Itha, Principal Consultant, Amazon Web Services Inc.


This project is licensed under the Apache-2.0 License.





Code of conduct

Security policy





No releases published


No packages published