
[POC] Add ignore_duplicate_error option #89

Closed

Conversation

@mtsmfm (Contributor) commented Mar 5, 2019

Currently, this plugin can prevent duplication, but when it detects a duplicate it raises an error and can't proceed.
I think we can use BigQuery itself to remember which records have already been imported.

So I'd like to add an ignore_duplicate_error option.

I confirmed that this implementation works:

https://github.com/mtsmfm/rails-ci-result-importer/blob/e071b1dc4483b614997f6e682bba65c36b7261a9/Gemfile#L9

What do you think?

Known issue

This option leads to a "num_input_rows and num_output_rows does not match" error: https://github.com/mtsmfm/embulk-output-bigquery/blob/957f02727704624835a836ac1928ba328d73932f/lib/embulk/output/bigquery.rb#L393

Currently I set abort_on_error to true.
I think we could calculate how many records were already inserted and return a dummy response here, like:

if @task['ignore_duplicate_error'] && e.status_code == 409
  # call_api_to_calculate_num_output_rows is hypothetical; some API call
  # would be needed to count the rows that were already loaded.
  num_output_rows = client.call_api_to_calculate_num_output_rows
  # Wrap the statistics in nested OpenStructs so callers can keep reading
  # response.statistics.load.output_rows as they do for a real response.
  return JSON.parse({statistics: {load: {output_rows: num_output_rows}}}.to_json, object_class: OpenStruct)
else
...

but I'm not sure whether that's acceptable.

@sonots (Member) commented Mar 5, 2019

Do you mean “prevent_duplicate_insert false”?

@mtsmfm (Contributor, Author) commented Mar 6, 2019

> Do you mean “prevent_duplicate_insert false”?

No.

Let's say we have a recurring import job and the following input on day 1:

id,data
1,hi

And on day 2, we have the following:

id,data
1,hi
2,bye

I'd like to skip the duplicated row 1,hi but still insert 2,bye.

With the prevent_duplicate_insert: false option, the following data is created on BigQuery:

id,data
1,hi
1,hi
2,bye

With the prevent_duplicate_insert: true option, we don't get 2,bye:

id,data
1,hi

This is because embulk-output-bigquery raises an error on the day-2 load.

Combining prevent_duplicate_insert: true with ignore_duplicate_error: true will create:

id,data
1,hi
2,bye

@sonots (Member) commented Mar 6, 2019

Are you sure this PR meets your need (incremental update)?

In my understanding, BigQuery does not have unique constraints and does not respond with 409 for each row: https://cloud.google.com/bigquery/troubleshooting-errors
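
For illustration, a minimal sketch of that point using the google-cloud-bigquery gem (project, dataset, and table names are made up), showing that inserting the same row twice simply stores two copies:

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
table = bigquery.dataset("my_dataset").table("events")

# BigQuery has no unique constraints: both calls succeed, and the table
# ends up holding two identical 1,hi rows. No per-row 409 is returned.
2.times do
  table.insert [{ id: 1, data: "hi" }]
end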

@sonots (Member) commented Mar 6, 2019

I do incremental updates like this (sketched below):

  1. Get only new records from input data
  2. Copy the current BigQuery table to the new BigQuery table
  3. Append into the new BigQuery table.

The reason I do not append directly into the old table is to make it possible to rerun the bulk-load batch again.
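
For concreteness, here is a sketch of that flow with the google-cloud-bigquery gem; all names are illustrative and this is not the plugin's own code:

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
dataset  = bigquery.dataset "my_dataset"

# 1. Get only the new records from the input data (assumed to have been
#    extracted already into new_records.csv).

# 2. Copy the current table to a new table; the old table stays intact,
#    so a failed batch can simply be rerun from the top.
dataset.table("events").copy "my-project.my_dataset.events_new"

# 3. Append the new records into the new table only.
dataset.load "events_new", File.open("new_records.csv"),
             format: "csv", write: "append"

Because step 3 only ever writes to the fresh copy, rerunning the whole batch after a failure cannot double-append into the original table.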

@mtsmfm (Contributor, Author) commented Mar 6, 2019

Ah, sorry, I probably misunderstood the behavior of prevent_duplicate_insert: true 🙇
I thought that with this option the plugin creates a temporary file for each row to prevent duplicated insertion.
In my case, each row is more than 1 MB, so I now guess this plugin (or Embulk?) happened to create a tempfile per row, and as a result it appeared to work.

@mtsmfm closed this Mar 6, 2019

@mtsmfm (Contributor, Author) commented Mar 6, 2019

Sorry to bother you again 🙇
May I ask you to answer two questions?

  • Use case for the prevent_duplicate_insert: true option
  • How/when does this plugin (or Embulk?) split input data into multiple files?

@sonots (Member) commented Mar 6, 2019

> creates tempfile for each row

It does not create a file per row; if it did, too many files would be created.

> Use case for the prevent_duplicate_insert: true option

It is used to prevent inserting the same data with mode: append_direct on a rerun, probably used together with skip_file_generation: true.

(Actually, I am not the author who added this option; it existed before 0.3.x, and I have never used it.)

> How/when does this plugin (or Embulk?) split input data into multiple files?

It depends on your input plugin. This plugin creates one local file per output thread, in parallel. Some input plugins support loading data in parallel; in such cases, as many local files as there are threads may be created.

@sonots-zozo commented Mar 6, 2019

> Use case for the prevent_duplicate_insert: true option

It was described at https://www.slideshare.net/oreradio/embulk-plugin-gcs-bigquery/13.

However, one of its purposes (making a unique job id) is currently achieved by the line

job_id = "embulk_load_job_#{SecureRandom.uuid}"

As for its other purpose, making the load fail when the same data is loaded with the same configuration, that is probably difficult to achieve with this option, because temporary files may be split differently on each rerun, especially when multiple inputs, and thus multiple threads, are used.

That said, prevent_duplicate_insert exists just for backward compatibility.
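
To make the contrast concrete, here is a sketch of the two job-id strategies; the hash-based derivation is an assumption for illustration, not the plugin's exact code:

require "securerandom"
require "digest/md5"

path = "part-0000.csv" # an illustrative local chunk produced by the plugin

# Current behavior: a fresh id per run, so a rerun always loads again.
random_job_id = "embulk_load_job_#{SecureRandom.uuid}"

# prevent_duplicate_insert idea: derive the id from the data itself, so
# rerunning the same file reuses the same id and BigQuery rejects the
# whole load job with 409 (duplicate job), never individual rows.
dedup_job_id = "embulk_load_job_#{Digest::MD5.file(path).hexdigest}"

As noted above, this breaks down as soon as the files are chunked differently between runs.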

@sakama (Member) commented Mar 6, 2019

I am the original author of the prevent_duplicate_insert option.
This plugin has become much more powerful than what I implemented at first.

As @sonots mentioned above, this option might not be needed anymore.

@mtsmfm (Contributor, Author) commented Mar 9, 2019

@sonots @sakama Thank you for your kindness 🙏
