[POC] Add ignore_duplicate_error option #89
Conversation
Do you mean `prevent_duplicate_insert: false`?
No. Let's say we have a recurring import job and the following input on day 1:
And on day 2, we have the following:
I'd like to prevent duplicated data. On BigQuery, the following data is created with …
If … with …, the job fails because embulk-output-bigquery raises an error.
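The overlap scenario above can be sketched in plain Ruby; the row values here are made up for illustration and are not the actual input data:

```ruby
# Hypothetical illustration of the recurring-import problem (not plugin
# code): day 2's input still contains all of day 1's rows.
day1_rows = [
  { id: 1, name: "a" },
  { id: 2, name: "b" },
]
day2_rows = [
  { id: 1, name: "a" },
  { id: 2, name: "b" },
  { id: 3, name: "c" },
]

# A plain append stores the overlapping rows twice.
appended = day1_rows + day2_rows

# What we actually want on the BigQuery side: each row only once.
deduplicated = appended.uniq
```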
Are you sure this PR meets your demand (incremental update)? In my understanding, BigQuery does not have unique constraints and does not respond with 409 for each row.
I do incremental updates like …
The reason why I do not directly append into the old table is to make it possible to rerun the bulk-load batch.
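One way to express that rerunnable incremental pattern is to load into a temporary table first and then copy over only the rows that are not yet in the destination. This is a sketch, not the plugin's actual code; the table and key-column names are hypothetical:

```ruby
# Build an idempotent "copy only new rows" statement: rerunning it
# after a partial failure inserts nothing that already exists.
def incremental_insert_sql(dest_table, temp_table, key_column)
  <<~SQL
    INSERT INTO `#{dest_table}`
    SELECT * FROM `#{temp_table}` t
    WHERE NOT EXISTS (
      SELECT 1 FROM `#{dest_table}` d
      WHERE d.#{key_column} = t.#{key_column}
    )
  SQL
end

sql = incremental_insert_sql("dataset.events", "dataset.events_tmp", "id")
```

Because the statement only moves rows missing from the destination, the whole bulk-load batch can be rerun safely.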
Ah, sorry, I probably misunderstood the behavior of …
Sorry to bother you again 🙇
It does not create a file for each row because, if it did, too many files would be created.
It is used to prevent inserting the same data with `mode: append_direct` on rerunning, probably used together with `skip_file_generation: true`. (Actually, I am not the author who added this option; it existed before 0.3.x, and I have never used it.)
It depends on your input plugin. This plugin creates a local file per output thread. Some input plugins support loading data in parallel; in that case, many local files (one per thread) may be created.
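The per-thread file layout can be sketched like this; the path template is made up for illustration and differs from the plugin's real naming:

```ruby
require 'tmpdir'

# One local buffer file per output thread (task), named by task index
# so parallel threads never write to the same file.
def local_file_path(dir, task_index)
  File.join(dir, format("embulk_output_bigquery_%04d.csv.gz", task_index))
end

dir = Dir.tmpdir
# With 4 output threads, 4 distinct local files are produced.
paths = (0...4).map { |i| local_file_path(dir, i) }
```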
It was described at https://www.slideshare.net/oreradio/embulk-plugin-gcs-bigquery/13. However, one of its purposes (making a unique job id) is currently achieved by the line …
As for its other purpose, making the load fail when the same data is loaded with the same configuration, that is probably difficult to achieve with this option, because temporary files may be created differently on each rerun, especially when multiple inputs, and thus multiple threads, are used. That said, …
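The "unique job id" idea mentioned above can be sketched as deriving the id from the content being loaded, so rerunning the same data yields the same id and BigQuery can reject the second job as a duplicate. The prefix and naming scheme here are assumptions, not the plugin's actual line:

```ruby
require 'digest/md5'

# A deterministic job id: same content in, same id out. Submitting a
# second load job with an identical id would be refused by BigQuery
# as a duplicate, which is what prevents double insertion.
def deterministic_job_id(prefix, content)
  "#{prefix}_#{Digest::MD5.hexdigest(content)}"
end

first_run  = deterministic_job_id("embulk_load", "id,name\n1,a\n")
second_run = deterministic_job_id("embulk_load", "id,name\n1,a\n")
```

This also shows why rerun detection is fragile here: if the temporary files are split differently across threads on the rerun, the hashed content changes and the ids no longer collide.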
I am the original author who implemented this option. As @sonots mentioned above, it might not be needed anymore.
Currently, this plugin can prevent duplication, but it raises an error and cannot proceed.
I think we can use BigQuery itself to remember which records have been imported, so I'd like to add an `ignore_duplicate_error` option. I confirmed this implementation works:
https://github.com/mtsmfm/rails-ci-result-importer/blob/e071b1dc4483b614997f6e682bba65c36b7261a9/Gemfile#L9
What do you think?
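The intended behavior of the proposed option can be sketched like this; the error class and message are stand-ins, not the plugin's actual error handling:

```ruby
# Stand-in for the duplicate-job error BigQuery reports on a rerun.
class DuplicateJobError < StandardError; end

# With ignore_duplicate_error: true, a duplicate load is treated as
# "already imported" and the run continues; with false, it aborts as
# the plugin does today.
def load_with_option(ignore_duplicate_error:)
  raise DuplicateJobError, "Already Exists: Job ..." # simulate the rerun
rescue DuplicateJobError
  raise unless ignore_duplicate_error
  :skipped
end

result = load_with_option(ignore_duplicate_error: true)
```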
Known issue
This option leads to a `num_input_rows and num_output_rows does not match` error (https://github.com/mtsmfm/embulk-output-bigquery/blob/957f02727704624835a836ac1928ba328d73932f/lib/embulk/output/bigquery.rb#L393). Currently I set `abort_on_error` to `true`.
I think we can calculate how many records were already inserted and return a dummy response here, like: … but I'm not sure whether that's acceptable.
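The dummy-response idea could look roughly like the following; the hash shape mimics a BigQuery load-job response, but the field names here are assumptions rather than the plugin's actual response objects:

```ruby
# When a load is skipped because the job already exists, report the
# file's row count as the number of output rows, so the plugin's
# num_input_rows / num_output_rows consistency check still passes.
def dummy_load_response(num_rows_in_file)
  {
    status: { state: "DONE" },
    statistics: { load: { output_rows: num_rows_in_file } },
  }
end

response = dummy_load_response(100)
```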