Input and Output Nodes

In dataduct, data is shared between two activities using S3. After a step finishes, it saves its output to S3 for subsequent steps to read. Input and output nodes abstract this process: they represent the S3 directories in which the data is stored. A step's input node determines where in S3 it reads its input from, and its output node determines where it stores its output. In most cases, dataduct manages this chain of input and output nodes for you, but there are situations where you may want finer control over the process.

Input Nodes

By default, a step (except Extract- and Check-type steps) links its input node to the preceding step's output node. For example, in this pipeline snippet:

-   step_type: extract-local
    path: data/test_table1.tsv

-   step_type: create-load-redshift
    table_definition: tables/dev.test_table.sql

the output of the extract-local step is fed into the create-load-redshift step, so the pipeline loads the data in data/test_table1.tsv into the table defined by tables/dev.test_table.sql. This behaviour can be made explicit through the name and input_node properties.

# This pipeline has the same behaviour as the previous pipeline.
-   step_type: extract-local
    name: extract_data
    path: data/test_table1.tsv

-   step_type: create-load-redshift
    input_node: extract_data
    table_definition: tables/dev.test_table.sql

When a link from one step's output node to another step's input node is created, implicitly or explicitly, a dependency between the two steps is created automatically. This dependency can be made explicit through the depends_on property.

# This pipeline has the same behaviour as the previous pipeline.
-   step_type: extract-local
    name: extract_data
    path: data/test_table1.tsv

-   step_type: create-load-redshift
    input_node: extract_data
    depends_on: extract_data
    table_definition: tables/dev.test_table.sql

You can use input nodes to communicate between steps that are not next to each other.

-   step_type: extract-local
    name: extract_data
    path: data/test_table1.tsv

-   step_type: extract-local
    path: data/test_table2.tsv

# This step will use the output of the first extract-local step (test_table1.tsv)
-   step_type: create-load-redshift
    input_node: extract_data
    table_definition: tables/dev.test_table.sql

Without the use of input_node, the create-load-redshift step would have used the data from test_table2.tsv instead.

You can also use input nodes to reuse the output of a step.

-   step_type: extract-local
    name: extract_data
    path: data/test_table1.tsv

-   step_type: create-load-redshift
    input_node: extract_data
    table_definition: tables/dev.test_table1.sql

-   step_type: create-load-redshift
    input_node: extract_data
    table_definition: tables/dev.test_table2.sql

Sometimes, you may not want a step to have any input nodes. You can specify this by writing input_node: [].

-   step_type: extract-local
    name: extract_data
    path: data/test_table1.tsv

# This step will not receive any input data
-   step_type: transform
    input_node: []
    script: scripts/example_script.py
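
The linking rules described so far can be summarised in a small model. The sketch below is only an illustration of those rules, not dataduct's own code: the function names, the step_N labels given to unnamed steps, and the way Extract- and Check-type steps are recognised by their step_type string are all assumptions made for the example.

```python
def takes_default_input(step_type):
    """Assumption: extract- and check-type steps are recognised by name."""
    return not (step_type.startswith("extract") or "check" in step_type)


def resolve_input_nodes(steps):
    """Return (step label, list of input node names) for each step in order."""
    resolved = []
    previous = None
    for index, step in enumerate(steps):
        # Unnamed steps get a hypothetical label so they can be referred to.
        label = step.get("name", "step_%d" % index)
        if "input_node" in step:
            # Explicit setting: a single name, or a list (possibly empty).
            node = step["input_node"]
            inputs = list(node) if isinstance(node, list) else [node]
        elif previous is not None and takes_default_input(step["step_type"]):
            # Default: read from the preceding step's output node.
            inputs = [previous]
        else:
            inputs = []
        resolved.append((label, inputs))
        previous = label
    return resolved


pipeline = [
    {"step_type": "extract-local", "name": "extract_data",
     "path": "data/test_table1.tsv"},
    {"step_type": "extract-local", "path": "data/test_table2.tsv"},
    {"step_type": "create-load-redshift",
     "table_definition": "tables/dev.test_table2.sql"},
    {"step_type": "create-load-redshift", "input_node": "extract_data",
     "table_definition": "tables/dev.test_table1.sql"},
    {"step_type": "transform", "input_node": [],
     "script": "scripts/example_script.py"},
]

for label, inputs in resolve_input_nodes(pipeline):
    print(label, inputs)
```

In this model the third step falls back to the output of the step directly before it, the fourth step reaches back to extract_data by name, and the transform step receives nothing.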

If you are running your own script (e.g. through the Transform step), the input node's data can be found in the directory specified by the INPUT1_STAGING_DIR environment variable.

-   step_type: extract-local
    name: extract_data
    path: data/test_table1.tsv

# manipulate_data.py takes the name of the input directory variable as a
# script argument and resolves that string to the environment variable's value.
-   step_type: transform
    script: scripts/manipulate_data.py
    script_arguments:
    -   --input=INPUT1_STAGING_DIR
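
As a rough idea of what a script like manipulate_data.py might do with that argument, here is a minimal sketch, assuming Python 3. The function names are hypothetical, and the only behaviour taken from the text above is that the script receives the string INPUT1_STAGING_DIR and looks up the environment variable of that name.

```python
import argparse
import os


def resolve_staging_dir(variable_name, environ=None):
    """Return the directory stored in the named environment variable."""
    environ = os.environ if environ is None else environ
    if variable_name not in environ:
        raise KeyError("environment variable %s is not set" % variable_name)
    return environ[variable_name]


def list_input_files(argv=None, environ=None):
    """Parse --input, resolve it to a directory, and list the files in it."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--input", required=True,
        help="name of the variable holding the input directory")
    args = parser.parse_args(argv)
    input_dir = resolve_staging_dir(args.input, environ)
    return sorted(
        os.path.join(input_dir, entry) for entry in os.listdir(input_dir)
    )
```

A real script would call list_input_files() with no arguments, so that the command line and os.environ are used, and then process each returned path. The optional parameters are there only to make the function easy to try out locally.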

Output Nodes

Dataduct usually handles a step's output nodes automatically, saving the file into a default path in S3. You can set the default path through your dataduct configuration file. However, some steps also have an optional output_path property, allowing you to choose an S3 directory to store the step's output.

Transform Step and Output Nodes

Transform steps allow you to run your own scripts. If you want to save the results of your script, you can store data in the output node by writing to the directory specified by the OUTPUT1_STAGING_DIR environment variable.

# generate_data.py takes the name of the output directory variable as a
# script argument and resolves that string to the environment variable's value.
-   step_type: transform
    script: scripts/generate_data.py
    script_arguments:
    -   --output=OUTPUT1_STAGING_DIR

-   step_type: create-load-redshift
    table_definition: tables/dev.test_table.sql
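
The writing side of such a script could look roughly like the following sketch, assuming Python 3. The function name, the part-0.tsv file name, and the choice of tab-separated output are assumptions for illustration; the text above only establishes that the script writes into the directory named by OUTPUT1_STAGING_DIR.

```python
import csv
import os


def generate_data(variable_name, rows, environ=None, filename="part-0.tsv"):
    """Write rows as tab-separated values into the staging directory.

    variable_name is the string passed on the command line, for example
    OUTPUT1_STAGING_DIR; the directory is read from that variable.
    """
    environ = os.environ if environ is None else environ
    output_dir = environ[variable_name]
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return path
```

Whatever the script leaves in that directory is what the following step receives as its input.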

You may wish to output more than one set of data for several subsequent steps to use. You can do this through the output_node property.

-   step_type: transform
    script: scripts/generate_data.py
    script_arguments:
    -   --output=OUTPUT1_STAGING_DIR
    output_node:
    -   foo_data
    -   bar_data

-   step_type: create-load-redshift
    input_node: foo_data
    table_definition: tables/dev.test_table1.sql

-   step_type: create-load-redshift
    input_node: bar_data
    table_definition: tables/dev.test_table2.sql

In this case, the script must save data to subdirectories with names matching the output nodes. In the above example, generate_data.py must save data in OUTPUT1_STAGING_DIR/foo_data and OUTPUT1_STAGING_DIR/bar_data directories. If the subdirectory and output node names are mismatched, the output nodes will not be generated correctly.
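
One way to keep the subdirectory names and the output node names in step is to derive both from the same list. The sketch below assumes Python 3; the helper names and the part-0.tsv file name are hypothetical and are not part of dataduct.

```python
import os


def write_node_outputs(output_root, datasets):
    """Write each dataset under output_root/<output node name>/."""
    written = {}
    for node_name, lines in datasets.items():
        node_dir = os.path.join(output_root, node_name)
        os.makedirs(node_dir, exist_ok=True)
        path = os.path.join(node_dir, "part-0.tsv")
        with open(path, "w") as handle:
            handle.writelines(line + "\n" for line in lines)
        written[node_name] = path
    return written


def missing_node_dirs(output_root, output_nodes):
    """Return the output nodes that have no matching subdirectory."""
    return [
        node for node in output_nodes
        if not os.path.isdir(os.path.join(output_root, node))
    ]
```

Here output_root would be the value of the OUTPUT1_STAGING_DIR environment variable and the dictionary keys would be foo_data and bar_data. Calling missing_node_dirs before the script exits gives an early warning when a name has been mistyped.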