
[POC] Add ignore_duplicate_error option #89

Closed

Conversation

@mtsmfm (Contributor) commented Mar 5, 2019

Currently, this plugin can prevent duplication, but when it detects a duplicate it raises an error and can't proceed.
I think we can use BigQuery itself to remember which records have already been imported.

So I'd like to add an ignore_duplicate_error option.

I confirmed that this implementation works:

https://github.com/mtsmfm/rails-ci-result-importer/blob/e071b1dc4483b614997f6e682bba65c36b7261a9/Gemfile#L9

What do you think?

Known issue

This option leads to a "num_input_rows and num_output_rows does not match" error: https://github.com/mtsmfm/embulk-output-bigquery/blob/957f02727704624835a836ac1928ba328d73932f/lib/embulk/output/bigquery.rb#L393

Currently I set abort_on_error to true.
I think we could calculate how many records were already inserted and return a dummy response here, like:

if @task['ignore_duplicate_error'] && e.status_code == 409
  # call_api_to_calculate_num_output_rows is hypothetical; some API call
  # would be needed to count the rows that were already loaded.
  num_output_rows = client.call_api_to_calculate_num_output_rows
  # Wrap the statistics in nested OpenStructs so callers can keep reading
  # response.statistics.load.output_rows as they do for a real response.
  return JSON.parse({statistics: {load: {output_rows: num_output_rows}}}.to_json, object_class: OpenStruct)
else
...

but I'm not sure whether that's acceptable.

@sonots (Member) commented Mar 5, 2019

Do you mean “prevent_duplicate_insert false”?

@mtsmfm (Contributor, Author) commented Mar 6, 2019

> Do you mean “prevent_duplicate_insert false”?

No.

Let's say we have a recurring import job and the following input on day 1:

id,data
1,hi

And on day 2, we have the following:

id,data
1,hi
2,bye

I'd like to skip the duplicated row 1,hi but still insert 2,bye.

With the prevent_duplicate_insert: false option, the following data is created on BigQuery:

id,data
1,hi
1,hi
2,bye

With the prevent_duplicate_insert: true option, we don't get 2,bye:

id,data
1,hi

This is because embulk-output-bigquery raises an error on the day-2 load.

Combining prevent_duplicate_insert: true with ignore_duplicate_error: true will create:

id,data
1,hi
2,bye

@sonots (Member) commented Mar 6, 2019

Are you sure this PR meets your need (incremental update)?

In my understanding, BigQuery does not have unique constraints and does not respond with 409 for each row: https://cloud.google.com/bigquery/troubleshooting-errors
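
For illustration, a minimal sketch of that point using the google-cloud-bigquery gem (project, dataset, and table names are made up), showing that inserting the same row twice simply stores two copies:

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
table = bigquery.dataset("my_dataset").table("events")

# BigQuery has no unique constraints: both calls succeed, and the table
# ends up holding two identical 1,hi rows. No per-row 409 is returned.
2.times do
  table.insert [{ id: 1, data: "hi" }]
end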

@sonots (Member) commented Mar 6, 2019

I do incremental updates like this (sketched below):

  1. Get only new records from input data
  2. Copy the current BigQuery table to the new BigQuery table
  3. Append into the new BigQuery table.

The reason I do not append directly into the old table is to make it possible to rerun the bulk-load batch again.
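
For concreteness, here is a sketch of that flow with the google-cloud-bigquery gem; all names are illustrative and this is not the plugin's own code:

require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
dataset  = bigquery.dataset "my_dataset"

# 1. Get only the new records from the input data (assumed to have been
#    extracted already into new_records.csv).

# 2. Copy the current table to a new table; the old table stays intact,
#    so a failed batch can simply be rerun from the top.
dataset.table("events").copy "my-project.my_dataset.events_new"

# 3. Append the new records into the new table only.
dataset.load "events_new", File.open("new_records.csv"),
             format: "csv", write: "append"

Because step 3 only ever writes to the fresh copy, rerunning the whole batch after a failure cannot double-append into the original table.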

@mtsmfm (Contributor, Author) commented Mar 6, 2019

Ah, sorry, I probably misunderstood the behavior of prevent_duplicate_insert: true 🙇
I thought that with this option the plugin creates a temporary file for each row to prevent duplicated insertion.
In my case, each row is more than 1 MB, so I now guess this plugin (or Embulk?) happened to create a tempfile per row, and as a result it appeared to work.

@mtsmfm closed this Mar 6, 2019

@mtsmfm (Contributor, Author) commented Mar 6, 2019

Sorry to bother you again 🙇
May I ask you to answer two questions?

  • Use case for the prevent_duplicate_insert: true option
  • How/when does this plugin (or Embulk?) split input data into multiple files?

@sonots (Member) commented Mar 6, 2019

> creates tempfile for each row

It does not create a file per row; if it did, too many files would be created.

> Use case for the prevent_duplicate_insert: true option

It is used to prevent inserting the same data with mode: append_direct on a rerun, probably used together with skip_file_generation: true.

(Actually, I am not the author who added this option; it existed before 0.3.x, and I have never used it.)

> How/when does this plugin (or Embulk?) split input data into multiple files?

It depends on your input plugin. This plugin creates one local file per output thread, in parallel. Some input plugins support loading data in parallel; in such cases, as many local files as there are threads may be created.

@sonots-zozo commented Mar 6, 2019

> Use case for the prevent_duplicate_insert: true option

It was described at https://www.slideshare.net/oreradio/embulk-plugin-gcs-bigquery/13.

However, one of its purposes (making a unique job id) is currently achieved by the line

job_id = "embulk_load_job_#{SecureRandom.uuid}"

As for its other purpose, making the load fail when the same data is loaded with the same configuration, that is probably difficult to achieve with this option, because temporary files may be split differently on each rerun, especially when multiple inputs, and thus multiple threads, are used.

That said, prevent_duplicate_insert exists just for backward compatibility.
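
To make the contrast concrete, here is a sketch of the two job-id strategies; the hash-based derivation is an assumption for illustration, not the plugin's exact code:

require "securerandom"
require "digest/md5"

path = "part-0000.csv" # an illustrative local chunk produced by the plugin

# Current behavior: a fresh id per run, so a rerun always loads again.
random_job_id = "embulk_load_job_#{SecureRandom.uuid}"

# prevent_duplicate_insert idea: derive the id from the data itself, so
# rerunning the same file reuses the same id and BigQuery rejects the
# whole load job with 409 (duplicate job), never individual rows.
dedup_job_id = "embulk_load_job_#{Digest::MD5.file(path).hexdigest}"

As noted above, this breaks down as soon as the files are chunked differently between runs.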

@sakama (Member) commented Mar 6, 2019

I am the original author of the prevent_duplicate_insert option.
This plugin has become much more powerful than what I implemented at first.

As @sonots mentioned above, this option might not be needed anymore.

@mtsmfm (Contributor, Author) commented Mar 9, 2019

@sonots @sakama Thank you for your kindness 🙏
