Moving UUID definition in Python BQ IO #26002
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26002 +/- ##
==========================================
- Coverage 73.96% 71.38% -2.58%
==========================================
Files 706 739 +33
Lines 95473 97928 +2455
==========================================
- Hits 70616 69908 -708
- Misses 23541 26704 +3163
Partials 1316 1316
... and 153 files with indirect coverage changes
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment

retest this please

Run Python Examples_Direct

R: @ahmedabu98

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
```diff
  job_name=job_name,
  step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
```
It's critical that the unique_id here is equal to unique_id on line 2492. BigQuery Export read generally works in three steps:
- output to files in GCS
- read from those files
- delete GCS files
The unique id is used when creating those file names, and the same unique id is used to generate the name of the directory to delete after the read operation. Unique ID should be the same for both so that we can clean up properly.
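The export/read/cleanup flow described above can be sketched in plain Python. This is a local-filesystem stand-in for GCS, not the actual BigQuery IO code; `export_read_cleanup` and the file layout are illustrative names. The point is that one id, generated up front, names both the files that are written and the directory that is later deleted:

```python
import os
import shutil
import tempfile
import uuid

def export_read_cleanup(rows):
    # One unique id generated up front and reused by every step.
    unique_id = str(uuid.uuid4())[0:10]
    export_dir = os.path.join(tempfile.gettempdir(), f"bq_export_{unique_id}")
    os.makedirs(export_dir)

    # Step 1: "export" the rows to files named with the unique id.
    path = os.path.join(export_dir, "part-0.json")
    with open(path, "w") as f:
        f.write("\n".join(rows))

    # Step 2: read the exported files back.
    with open(path) as f:
        data = f.read().split("\n")

    # Step 3: delete the directory, located via the *same* unique id.
    # If cleanup regenerated its own id, this directory would be orphaned.
    shutil.rmtree(export_dir)
    return data, os.path.exists(export_dir)
```

If the cleanup step computed a fresh `str(uuid.uuid4())[0:10]` instead of reusing `unique_id`, the `rmtree` would target a path that was never created and the exported files would leak.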
What you could do here is create the unique_id beforehand in a DoFn then wrap it in a pvalue.AsSingleton and pass it to both the Read and Cleanup operations as a side-input. See an example here that does this.
The relevant issue mentions this is the way Java does it: it creates the value in a DoFn, then uses it as a side input for both read and cleanup.
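A minimal stand-in for that pattern, in plain Python rather than actual Beam code: `SingletonId` is a hypothetical class illustrating the `pvalue.AsSingleton` semantics the comment describes, i.e. the value is computed once at run time and every consumer observes the same value:

```python
import uuid

class SingletonId:
    """Hypothetical stand-in for a single-element PCollection wrapped in
    pvalue.AsSingleton: the id is computed lazily, exactly once, and the
    same value is handed to every step that takes it as a side input."""

    def __init__(self):
        self._value = None

    def get(self):
        # Computed on first access only; later calls return the cached value.
        if self._value is None:
            self._value = str(uuid.uuid4())[0:10]
        return self._value

ids = SingletonId()
read_id = ids.get()     # would name the exported files in the Read step
cleanup_id = ids.get()  # would locate the directory in the Cleanup step
```

Because both steps consume the same singleton, `read_id` and `cleanup_id` are guaranteed equal, which is the property the reviewer is asking for.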
```diff
  job_name=job_name,
  step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
```
Does doing it this way make a difference? I'm not so familiar with this code; I'd defer to a Python SDK expert here.
fixes #22733

The UUID generation is now done at pipeline creation time in the Python BigQuery IO. This should prevent collisions from occurring.