

[Bug]: BigQuery load jobs with WRITE_TRUNCATE disposition may truncate valid records. #24535

Closed
robertwb opened this issue Dec 5, 2022 · 5 comments · Fixed by #24536 or #25101

Comments

@robertwb
Contributor

robertwb commented Dec 5, 2022

What happened?

TriggerCopyJobs was written under the assumption that all copy jobs belonging to the same table are processed by the same bundle, but later changes to the code appear to have invalidated this precondition. While it still seems to hold most of the time, it may be violated for large writes, and the failure appears to be triggered more often on Runner v2 than on Runner v1.
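A minimal sketch (plain Python, not actual Beam code) of the failure mode described above. It assumes, per the report, that each bundle independently tracks whether it has already written to a destination table and issues WRITE_TRUNCATE on its first write; all names here are hypothetical, for illustration only.

```python
destination = []  # stands in for the destination BigQuery table

def run_bundle(partitions):
    first_write = True  # per-bundle state, not shared across bundles
    for rows in partitions:
        if first_write:
            destination.clear()  # WRITE_TRUNCATE: drops other bundles' rows
            first_write = False
        destination.extend(rows)  # later writes in this bundle append

run_bundle([[1, 2], [3, 4]])  # bundle A loads records 1, 2, 3, 4
run_bundle([[5, 6]])          # bundle B truncates first, wiping bundle A's rows
print(destination)            # [5, 6] -- records 1..4 were silently dropped
```

If all copy jobs for a table land in one bundle, only the very first one truncates and the result is correct; the bug appears exactly when that single-bundle precondition breaks.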

Issue Priority

Priority: 1

Issue Component

Component: io-py-gcp

@kennknowles
Member

Is this a regression? It sounds severe and most likely worth a cherry-pick.

@robertwb
Contributor Author

robertwb commented Dec 6, 2022 via email

@kennknowles
Member

I see it was first reported by a user in #23306. I think keeping the earlier bug is worthwhile, so I will dupe this one.

@Abacn
Contributor

Abacn commented Jan 20, 2023

The bug can still occur when the number of files is greater than 10,000, because the BigQuery load jobs are then conducted in multiple partitions:

_MAXIMUM_SOURCE_URIS = 10 * 1000

I ran a test that decreased this number to force the write to happen in multiple BigQuery load jobs, and still saw elements being dropped.
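A hedged sketch of the partitioning behavior described above: when the list of source files exceeds _MAXIMUM_SOURCE_URIS, the sink splits it into chunks and issues one load job per chunk. If more than one of those jobs ran with WRITE_TRUNCATE, each would wipe the rows loaded by the previous one. `partition_uris` is an illustrative helper, not the actual Beam function.

```python
_MAXIMUM_SOURCE_URIS = 10 * 1000

def partition_uris(uris, max_per_job=_MAXIMUM_SOURCE_URIS):
    """Split source URIs into per-load-job chunks (illustration only)."""
    return [uris[i:i + max_per_job] for i in range(0, len(uris), max_per_job)]

# 25,000 files -> 3 load jobs; at most the first may safely use WRITE_TRUNCATE,
# the rest must append, or the earlier jobs' records get truncated.
jobs = partition_uris([f"gs://bucket/file-{i}" for i in range(25000)])
print([len(j) for j in jobs])  # [10000, 10000, 5000]
```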

@Abacn
Contributor

Abacn commented Jan 20, 2023

Reopening for cherry-pick.
