Dataduct is a wrapper built on top of AWS Datapipeline which makes it easy to create ETL jobs. All jobs can be specified as a series of steps in a YAML file and would automatically be translated into datapipeline with appropriate pipeline objects.
Features include:
- Visualizing pipeline activities
- Extracting data from different sources such as RDS, S3, local files
- Transforming data using EC2 and EMR
- Loading data into redshift
- Transforming data inside redshift
- QA data between the source system and warehouse
It is easy to create custom steps to augment the DSL as per the requirements. As well as running a backfill with the command line interface.
An example ETL from RDS would look like:
name: example_upsert
frequency: daily
load_time: 01:00 # Hour:Min in UTC
steps:
- step_type: extract-rds
host_name: test_host
database: test_database
sql: |
SELECT *
FROM test_table;
- step_type: create-load-redshift
table_definition: tables/dev.test_table.sql
- step_type: upsert
source: tables/dev.test_table.sql
destination: tables/dev.test_table_2.sql
This would first perform an extraction from the RDS database with the
extract-rds
step using the COPY ACTIVITY
. Then load the data
into the dev.test_table
in redshift with the
create-load-redshift
. Then perform an upsert
with the data into
the test_table_2
.