Cross-Region Replication of a Kinesis Data Stream using Kinesis Data Analytics Studio (Apache Flink SQL)
Using the Apache Flink SQL API to connect to a Kinesis Data Stream in Region A and write the data into Region B for cross-region replication
It can be desirable to replicate data from a Kinesis Data Stream in Region A to Region B for many reasons, including disaster recovery resiliency, migration to another region, or simply making data available in both regions for separation of concerns.
In this code repository, we showcase how to replicate data between regions using Kinesis Data Analytics Studio, a managed, interactive Apache Flink SQL environment built on Apache Zeppelin.
- Active Amazon Web Services account
- Python 3.x for data generation
- Clone this repository
- Create a Kinesis Data Stream in Region A
- Create a Kinesis Data Stream in Region B
- Create a Kinesis Data Analytics Studio application and use the default create wizard
- Add the following permissions to your Kinesis Data Analytics Studio role:
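The policy document itself is not reproduced here; the sketch below illustrates the kind of permissions the Studio role needs, with placeholder account IDs, regions, and stream names (`SOURCE_STREAM`, `DESTINATION_STREAM` are hypothetical). Adjust the ARNs to match your own source and destination streams.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSourceStream",
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:DescribeStreamSummary",
        "kinesis:GetRecords",
        "kinesis:GetShardIterator",
        "kinesis:ListShards"
      ],
      "Resource": "arn:aws:kinesis:REGION_A:ACCOUNT_ID:stream/SOURCE_STREAM"
    },
    {
      "Sid": "WriteDestinationStream",
      "Effect": "Allow",
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:PutRecord",
        "kinesis:PutRecords"
      ],
      "Resource": "arn:aws:kinesis:REGION_B:ACCOUNT_ID:stream/DESTINATION_STREAM"
    }
  ]
}
```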
- Click on "Run" for the Kinesis Data Analytics Studio application and wait a few moments for the application to start.
- Open the Kinesis Data Analytics Studio application in Apache Zeppelin, and import the replicator.zpln file.
- Follow the instructions in the notebook in order to begin replicating data to the secondary region.
Screenshots (from the notebook): inserting records into the secondary region stream; counting records that have landed in the secondary stream; the PutRecords metric showing data arriving in the secondary stream.
To start sending data to your SOURCE stream, download send_data.py and run the file using the following command:

NOTE: Ensure you have modified the file to reference your source stream name and the correct region. The script uses the default AWS credentials configured for the user running it on your machine.
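The actual contents of send_data.py live in the repository; as a hedged sketch, a minimal generator along these lines would put JSON records onto the source stream with boto3 (the field names, stream name, and region below are illustrative placeholders, not the script's real values):

```python
import json
import random
import time
import uuid


def make_record(stream_time: float) -> dict:
    """Build one sample JSON-serializable record; field names are illustrative."""
    return {
        "id": str(uuid.uuid4()),
        "event_time": stream_time,
        "temperature": round(random.uniform(10.0, 35.0), 2),
    }


def send_records(stream_name: str, region: str, count: int = 100) -> None:
    """Send `count` records to the SOURCE stream via the Kinesis PutRecord API."""
    import boto3  # requires AWS credentials configured on this machine

    kinesis = boto3.client("kinesis", region_name=region)
    for _ in range(count):
        record = make_record(time.time())
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=record["id"],
        )
```

A typical invocation would call something like `send_records("my-source-stream", "us-east-1")` with your own stream name and Region A.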
Run each paragraph in the included Apache Zeppelin notebook (replicator.zpln) by clicking the Play button at the top right of each paragraph, or by pressing SHIFT + ENTER.
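As a sketch of what the notebook's paragraphs do (the table names, columns, stream names, and regions here are illustrative placeholders, not the repository's actual definitions), the Flink SQL typically declares a table over each stream and runs a continuous INSERT to replicate:

```sql
-- Source table over the stream in Region A (names and columns are placeholders)
CREATE TABLE source_stream (
  id          VARCHAR,
  event_time  DOUBLE,
  temperature DOUBLE
) WITH (
  'connector' = 'kinesis',
  'stream' = 'SOURCE_STREAM',
  'aws.region' = 'us-east-1',
  'scan.stream.initpos' = 'TRIM_HORIZON',
  'format' = 'json'
);

-- Sink table over the stream in Region B
CREATE TABLE destination_stream (
  id          VARCHAR,
  event_time  DOUBLE,
  temperature DOUBLE
) WITH (
  'connector' = 'kinesis',
  'stream' = 'DESTINATION_STREAM',
  'aws.region' = 'us-west-2',
  'format' = 'json'
);

-- Continuously copy every record across regions
INSERT INTO destination_stream
SELECT * FROM source_stream;
```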
You can use the last paragraph of the provided Zeppelin notebook to verify that data is being replicated across regions to your secondary Kinesis Data Stream. It performs a COUNT(*) over all records written to the stream, starting from the earliest offset.
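Assuming a table has been declared over the secondary stream (the name `destination_stream` below is an illustrative placeholder), the verification paragraph amounts to a query of this shape:

```sql
-- Counts every record read from the secondary stream, starting from the
-- earliest offset; the count keeps updating as new records arrive.
SELECT COUNT(*) AS replicated_records
FROM destination_stream;
```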
You can also review the Kinesis Data Stream incoming records metric on your destination stream to ensure that data is being delivered in a timely manner.
Although the Apache Zeppelin notebook is browser-based, you can close it at any time; the Kinesis Data Analytics application will continue to replicate data until it is stopped via the Amazon Web Services Management Console or API.
Additionally, this repository does not cover several important aspects of replication, such as scaling your Kinesis Data Analytics application. Consider these aspects before deploying this solution into a production streaming environment.
Stop any superfluous paragraphs that do not need to be running, so that your Kinesis Data Analytics application has enough Kinesis Processing Units (KPUs) available to perform the replication.
- Initial Release