Running multiple jobs using one config file #167

frsyuki opened this Issue Apr 30, 2015 · 3 comments


@frsyuki
Contributor
frsyuki commented Apr 30, 2015

We often have this kind of use case:

  • Read src_accounts.csv file and write to accounts table on PostgreSQL
  • Read src_payments.csv file and write to payments table on PostgreSQL
  • Read src_requests.csv file and write to requests table on PostgreSQL

In this case, we need multiple configuration files and multiple embulk run commands.
The idea here is to add a simple multi-transaction feature to Embulk.

For example, the configuration file can be like this:

jobs:
    accounts:
    payments:
    requests:
in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
out:
    type: postgresql
    table: table_${job}
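
For instance, the accounts job above would effectively run with a configuration like the following (just a sketch, assuming ${job} is simply substituted with the job name):

in:
    type: file
    path_prefix: src_accounts
    parser:
        type: csv
out:
    type: postgresql
    table: table_accounts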

The discussion points are guess, transaction, next config diff, and resume.

As the result of running guess, we could output this config file:

jobs:
    accounts:
        in:
            parser:
                type: csv
                columns:
                    - {name: id, type: long}
                    - {name: surename, type: string}
                    - {name: givenname, type: string}
    payments:
        in:
            parser:
                type: csv
                columns:
                    - {name: id, type: long}
                    - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
    requests:
        in:
            parser:
                type: csv
                columns:
                    - {name: request_id, type: string}
                    - {name: account_id, type: long}
                    - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
out:
    type: postgresql
    table: table_${job}

As for transactions, one idea is to commit the jobs one by one once all of the jobs' tasks have completed.
For example, Embulk runs the transactions in this order:

  1. job accounts begins (1 task)
  2. job payments begins (2 tasks)
  3. job requests begins (2 tasks)
  4. job accounts task-1 completes
  5. job payments task-1 completes
  6. job requests task-1 completes
  7. job payments task-2 completes
  8. job requests task-2 completes
  9. job requests commits
  10. job payments commits
  11. job accounts commits

A concern is that requests and payments won't be rolled back when the commit of job accounts fails.

The DSL (#131) does a similar thing. Any thoughts?

@frsyuki frsyuki added the help wanted label Apr 30, 2015
@hito4t
Contributor
hito4t commented May 1, 2015

A concern is that requests and payments won't be rolled back when the commit of job accounts fails.

I think it is permissible because it will be a rare case (or we should implement commit so that it rarely fails).
It's enough if an appropriate error message is output when a commit fails.

As in the example above, inputs, parsers, and outputs of the same type will be used across multiple jobs.
(Conversely to the example, jobs that read multiple tables and write multiple files will also often be used.)
But as for filters, different types may be used.
For example, filters that validate input values based on the schema of the input may be used.

I also think it's useful to be able to use different strings for the file name and the table name (e.g. acc.csv -> ACCOUNT table).

My idea is as follows:

jobs:
    accounts:
        file_name: src_accounts
        table_name: table_accounts
        csv_columns:
            - {name: id, type: long}
            - {name: surename, type: string}
            - {name: givenname, type: string}
        filter_type: filter_accounts
    payments:
        file_name: src_payments
        table_name: table_payments
        csv_columns:
            - {name: id, type: long}
            - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
        filter_type: filter_payments
    requests:
        file_name: src_requests
        table_name: table_requests
        csv_columns:
            - {name: request_id, type: string}
            - {name: account_id, type: long}
            - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
        filter_type: filter_requests
in:
    type: file
    path_prefix: ${file_name}
    parser:
        type: csv
        columns: ${csv_columns}
filters:
    - type: ${filter_type}
out:
    type: postgresql
    table: ${table_name}
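
For example, the accounts job would then effectively resolve to the following configuration (just a sketch, assuming each ${...} placeholder is replaced with the job's value, including the list-valued ${csv_columns}):

in:
    type: file
    path_prefix: src_accounts
    parser:
        type: csv
        columns:
            - {name: id, type: long}
            - {name: surename, type: string}
            - {name: givenname, type: string}
filters:
    - type: filter_accounts
out:
    type: postgresql
    table: table_accounts
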
@frsyuki
Contributor
frsyuki commented May 1, 2015

It's a good point that each source file may want to use a different type of filter.

In my idea above, each job will use the global configuration plus its job-specific configuration. Job-specific configurations are merged into (and overwrite) the global configuration. For example, if the accounts job has a jobs.accounts.in.parser.columns setting, it overwrites in.parser.columns.
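
For example, the effective in section of the accounts job would be roughly the following (just a sketch, assuming a straightforward deep merge of the job-specific section over the global one):

in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
        columns:                                  # merged in from jobs.accounts.in.parser.columns
            - {name: id, type: long}
            - {name: surename, type: string}
            - {name: givenname, type: string}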

Another idea is a variables section that defines multiple variables:

jobs:
    accounts:
        variables:
            file_name: src_accounts
            table_name: table_accounts
        in:
            parser:
                type: csv
                columns:
                - {name: id, type: long}
                - {name: surename, type: string}
                - {name: givenname, type: string}
        filters:
            - {type: filter_accounts}
    payments:
        variables:
            file_name: src_payments
            table_name: table_payments
        in:
            parser:
                type: csv
                columns:
                - {name: id, type: long}
                - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
        filters:
            - {type: filter_payments}
    requests:
        variables:
            file_name: src_requests
            table_name: table_requests
        in:
            parser:
                type: csv
                columns:
                - {name: request_id, type: string}
                - {name: account_id, type: long}
                - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
        filters:
            - {type: filter_requests}
in:
    type: file
    path_prefix: ${file_name}
    parser:
        type: csv
out:
    type: postgresql
    table: ${table_name}

A concern here is that merging the job-specific configuration into the global configuration might be difficult to understand.
Another concern is that we can't control the order of filters. All filters in the job-specific configuration will be appended to the end of the global configuration.
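
To illustrate (the filter names below are hypothetical, just a sketch of the append behavior):

# global configuration (hypothetical)
filters:
    - {type: sanitize}

# jobs.accounts (hypothetical)
filters:
    - {type: filter_accounts}

# effective configuration: the job-specific filter is always appended at the end
filters:
    - {type: sanitize}
    - {type: filter_accounts}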

@hito4t
Contributor
hito4t commented May 1, 2015

If all configuration elements can be overwritten, we may not need variables.

A concern here is that merging the job-specific configuration into the global configuration might be difficult to understand.

The global configuration defines default values, not an actual job.
But it's confusing because the global configuration and a single job configuration have the same form.

I think we should introduce a new element to avoid confusion.
For example:

jobs:
    accounts:
        in:
            file_name: src_accounts
            parser:
                columns:
                - {name: id, type: long}
                - {name: surename, type: string}
                - {name: givenname, type: string}
        ...
default:
    in:
        type: file
        parser:
            type: csv
    out:
        type: postgresql
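
Under this scheme, the accounts job would effectively run with roughly the following configuration (just a sketch, assuming the default section only fills in keys the job doesn't set):

in:
    type: file                                    # from default
    file_name: src_accounts
    parser:
        type: csv                                 # from default
        columns:
            - {name: id, type: long}
            - {name: surename, type: string}
            - {name: givenname, type: string}
out:
    type: postgresql                              # from default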

Another concern is that we can't control the order of filters. All filters in the job-specific configuration will be appended to the end of the global configuration.

It's confusing because some may assume the filters will be appended, while others may assume the whole filters list will be overwritten (so the global filters will not be used).
The behavior should be defined explicitly.
(Sorry, I don't have a concrete proposal yet.)

@frsyuki frsyuki added the new feature label May 20, 2015