Build a real-time incremental data load solution using open-source Debezium, Apache Iceberg and AWS Glue streaming


This repository provides an AWS Glue Studio notebook that builds a sample pipeline for loading real-time data changes into a transactional data lake using a CDC (Change Data Capture) approach.

Use Agreement

We recommend that you use this notebook as a starting point for creating your own, not for launching production-level environments. Before launching, always review the resources and policies it will create and the permissions it requires. By using this code, you agree that you are solely responsible for any security issue caused by misconfiguration and/or bugs.

Instructions

Creating the table for replication

Before starting, note that installing and configuring the source database, as well as deploying and configuring the Debezium Server, are outside the scope of this project. It is assumed as a prerequisite that you are ready to replicate a new table through Amazon Kinesis.
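Once Debezium Server is streaming to Kinesis, each change arrives as a JSON envelope whose payload carries `before`/`after` row images and an `op` code (`c` = create, `r` = snapshot read, `u` = update, `d` = delete). As a hedged illustration (field names follow Debezium's documented envelope; your connector configuration may flatten or rename them), a change event could be mapped to an upsert/delete action like this:

```python
import json

def to_action(event_json: str) -> dict:
    """Map a Debezium change event to a data-lake action.

    op codes: "c" create, "r" snapshot read, "u" update, "d" delete.
    """
    payload = json.loads(event_json)["payload"]
    if payload["op"] == "d":
        # Deletes carry the last row image in "before"; "after" is null.
        return {"action": "delete", "row": payload["before"]}
    return {"action": "upsert", "row": payload["after"]}

sample = json.dumps({
    "payload": {"op": "u",
                "before": {"DATA_ID": 1, "DATA": "data1"},
                "after": {"DATA_ID": 1, "DATA": "data1-updated"}}
})
print(to_action(sample))  # upsert with the updated row image
```

The notebook applies the same idea with Spark, but this pure-Python sketch shows the shape of the records you should expect on the stream.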

  1. Create a MYTABLE table with the columns DATA_ID (INTEGER, primary key) and DATA (VARCHAR(50)), and insert some records. Below is the code to create the table in MySQL; Debezium also works with other database engines, which you can read more about here.
CREATE TABLE MYTABLE (
    DATA_ID INTEGER PRIMARY KEY,
    DATA VARCHAR(50)
);

INSERT INTO MYTABLE VALUES (1,'data1'),(2,'data2');

Note: After creating the table, you must properly configure the Debezium configuration file (application.properties) and start the Debezium Server service to begin streaming data.
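As a rough sketch of that configuration (property names follow Debezium Server's documented MySQL source and Kinesis sink options; the region and every `<...>` value below are placeholders you must replace with your own), application.properties might look like:

```properties
# Sink: stream change events to Amazon Kinesis
debezium.sink.type=kinesis
debezium.sink.kinesis.region=us-east-1

# Source: MySQL connector watching MYTABLE
debezium.source.connector.class=io.debezium.connector.mysql.MySqlConnector
debezium.source.database.hostname=<database_host>
debezium.source.database.port=3306
debezium.source.database.user=<database_user>
debezium.source.database.password=<database_password>
debezium.source.database.server.id=1
debezium.source.topic.prefix=tutorial
debezium.source.table.include.list=<database_name>.MYTABLE
```

Check the Debezium Server documentation for the full list of options for your database engine and Debezium version.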

Creating AWS Glue Studio notebook

  1. Clone the repository or download the AWS Glue Studio notebook file to your computer.

  2. Access the AWS Glue Studio console from your account. Click ETL jobs in the left menu, then under Create job select Jupyter Notebook (1) and Upload and edit an existing notebook (2). Click the Choose file button (3) and select the demo-cdc-glue-streaming.ipynb file. To complete the creation, click Create (4).


  3. Add a name for the job, select Spark as the kernel, and select an IAM role with permissions for AWS Glue services and for reading from the Kinesis stream. Then click Start notebook.


  4. The notebook contains 21 steps, starting with configuring AWS Glue parameters and importing libraries, then running the CDC, and ending with stopping the session. Pay attention to item 2/, which requires replacing the <bucket_name> and <stream_name> values, and item 9/, which optionally filters the stream by a date range.


Note: For a better understanding of the code, run one step at a time and read the results before running the next step.
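The optional date-range filter in item 9/ can be illustrated with a small sketch. This is pure Python for clarity (in the notebook it would typically be a DataFrame filter), and it assumes the events expose Debezium's `ts_ms` epoch-milliseconds field; adjust to your actual payload shape:

```python
from datetime import datetime, timezone

def in_date_range(events, start: datetime, end: datetime):
    """Keep only change events whose ts_ms (epoch milliseconds)
    falls inside the [start, end] window."""
    start_ms = int(start.timestamp() * 1000)
    end_ms = int(end.timestamp() * 1000)
    return [e for e in events if start_ms <= e["payload"]["ts_ms"] <= end_ms]

events = [
    {"payload": {"ts_ms": 1700000000000, "op": "c"}},  # Nov 2023, inside range
    {"payload": {"ts_ms": 1500000000000, "op": "u"}},  # Jul 2017, before range
]
start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(len(in_date_range(events, start, end)))  # 1
```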

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
