Build a real-time incremental data load solution using open-source Debezium, Apache Iceberg and AWS Glue streaming

This repository provides an AWS Glue Studio notebook that builds a sample pipeline for loading real-time data changes into a transactional data lake using a change data capture (CDC) approach.

Use Agreement

We recommend that you use this notebook as a starting point for creating your own, not for launching production-level environments. Before launching, always review the resources and policies that it will create and the permissions it requires. By using this code, I agree that I am solely responsible for any security issue caused by misconfiguration and/or bugs.

Instructions

Creating the table for replication

Before starting, note that installing and configuring the source database and deploying and configuring Debezium Server are outside the scope of this project. As a prerequisite, it is assumed that you are ready to replicate a new table via Amazon Kinesis.

Create the MYTABLE table with the columns DATA_ID (INTEGER, primary key) and DATA (VARCHAR(50)), then insert some records. The code below creates the table in MySQL. Debezium also works with other database engines; see the Debezium documentation for details.

CREATE TABLE MYTABLE (
    DATA_ID INTEGER PRIMARY KEY,
    DATA VARCHAR(50)
);

INSERT INTO MYTABLE VALUES (1,'data1'),(2,'data2');

Note: After creating the table, configure the Debezium configuration file (application.properties) accordingly and start the Debezium Server service to stream data.
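Once Debezium Server is streaming, each row change arrives as a JSON change event. The sketch below shows how such an event could be unpacked for MYTABLE. It relies on the standard Debezium envelope fields (before, after, op, ts_ms); the function name parse_change_event and the sample event are hypothetical and are not taken from the notebook.

```python
import json

# Debezium operation codes mapped to readable names.
OPERATIONS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}


def parse_change_event(raw):
    """Return (operation, primary key, row) for one Debezium change event."""
    event = json.loads(raw)
    # When the JSON converter includes schemas, the envelope sits under "payload".
    envelope = event.get("payload", event)
    op = envelope["op"]
    # Deletes carry the last row state in "before"; other operations use "after".
    row = envelope["before"] if op == "d" else envelope["after"]
    return OPERATIONS[op], row["DATA_ID"], row


# Hypothetical event for the first row inserted into MYTABLE.
sample = json.dumps({
    "before": None,
    "after": {"DATA_ID": 1, "DATA": "data1"},
    "source": {"table": "MYTABLE"},
    "op": "c",
    "ts_ms": 1700000000000,
})

print(parse_change_event(sample))
```

Whether events arrive with or without the outer "payload" wrapper depends on how the converter is configured in application.properties, so the sketch accepts both shapes.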

Creating AWS Glue Studio notebook

Clone the repository or download the AWS Glue Studio Notebook file to your computer.

Access the AWS Glue Studio console from your account. Click on ETL jobs in the menu on the left, under Create job select Jupyter Notebook (1) and Upload and edit an existing notebook (2). Click the Choose file button (3) and select the demo-cdc-glue-streaming.ipynb file. To complete the creation, click “Create” (4).

Enter a name for the job, select Spark as the kernel, and select an IAM role that has AWS Glue service permissions and permission to read from the Kinesis stream. Then click Start notebook.
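As a rough illustration of the Kinesis read permission mentioned above, the sketch below builds an IAM policy document in Python. The list of actions is an assumption about what a streaming reader typically needs and is not taken from this repository; the helper name kinesis_read_policy and the stream ARN are hypothetical. Check the result against the AWS Glue and Kinesis documentation before attaching it to a role.

```python
import json


def kinesis_read_policy(stream_arn):
    """Build a read-only policy document for a single Kinesis data stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed set of read actions; adjust to your requirements.
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:ListShards",
                ],
                "Resource": stream_arn,
            }
        ],
    }


# Hypothetical ARN; replace with the ARN of your own stream.
policy = kinesis_read_policy("arn:aws:kinesis:us-east-1:111122223333:stream/my-stream")
print(json.dumps(policy, indent=2))
```

Scoping Resource to one stream ARN rather than "*" keeps the role limited to the stream that Debezium writes to.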

The notebook contains 21 steps: it starts by configuring AWS Glue parameters and importing libraries, then runs the CDC, and ends by stopping the session. Pay attention to item 2/, which requires replacing the <bucket_name> and <stream_name> values, and item 9/, which is optional and filters the stream by a date range.
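To make the CDC step more concrete, here is a minimal in-memory sketch of one common way to apply a micro-batch of change events: optionally filter by a timestamp range, keep only the latest event per primary key, then upsert or delete. This is not the notebook's code, which runs on Spark and writes to Iceberg; the function names, the event shape and the start_ms/end_ms parameters are assumptions made for illustration.

```python
def latest_per_key(events):
    """Keep only the most recent event for each primary key."""
    latest = {}
    for event in events:
        key = event["key"]
        if key not in latest or event["ts_ms"] >= latest[key]["ts_ms"]:
            latest[key] = event
    return latest


def apply_changes(table, events, start_ms=None, end_ms=None):
    """Apply change events to `table`, a dict keyed by primary key."""
    # Optional date-range filter on the event timestamp (milliseconds).
    in_range = [
        e for e in events
        if (start_ms is None or e["ts_ms"] >= start_ms)
        and (end_ms is None or e["ts_ms"] <= end_ms)
    ]
    for key, event in latest_per_key(in_range).items():
        if event["op"] == "d":
            table.pop(key, None)      # delete the row if it exists
        else:
            table[key] = event["row"]  # insert or update
    return table


# Target state after the initial load of MYTABLE.
table = {1: {"DATA_ID": 1, "DATA": "data1"}, 2: {"DATA_ID": 2, "DATA": "data2"}}

# Hypothetical micro-batch: two updates to row 1, a delete of row 2, a new row 3.
batch = [
    {"key": 1, "op": "u", "ts_ms": 100, "row": {"DATA_ID": 1, "DATA": "changed"}},
    {"key": 1, "op": "u", "ts_ms": 200, "row": {"DATA_ID": 1, "DATA": "changed again"}},
    {"key": 2, "op": "d", "ts_ms": 150, "row": {"DATA_ID": 2, "DATA": "data2"}},
    {"key": 3, "op": "c", "ts_ms": 300, "row": {"DATA_ID": 3, "DATA": "data3"}},
]

print(apply_changes(table, batch))
```

Reducing the batch to one event per key before applying it means each target row is touched at most once per batch, which is also a usual requirement for merge-style writes to a table format.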

Note: For a better understanding of the code, run one step at a time and read the results before running the next step.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

aws-samples/cdc-transactional-datalake-using-debezium-gluestreaming-iceberg
