Moving UUID definition in Python BQ IO #26002
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26002 +/- ##
==========================================
- Coverage 73.96% 71.38% -2.58%
==========================================
Files 706 739 +33
Lines 95473 97928 +2455
==========================================
- Hits 70616 69908 -708
- Misses 23541 26704 +3163
Partials 1316 1316
... and 153 files with indirect coverage changes
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment

retest this please

Run Python Examples_Direct

R: @ahmedabu98

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control
```diff
  job_name=job_name,
  step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
```
It's critical that the unique_id here is equal to unique_id on line 2492. BigQuery Export read generally works in three steps:
- output to files in GCS
- read from those files
- delete GCS files
The unique id is used when creating those file names, and the same unique id is used to generate the name of the directory to delete after the read operation. Unique ID should be the same for both so that we can clean up properly.
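The export/read/cleanup flow described above can be sketched in plain Python. This is a local-filesystem stand-in for GCS, not the actual BigQuery IO code; `export_read_cleanup` and the file layout are illustrative names. The point is that one id, generated up front, names both the files that are written and the directory that is later deleted:

```python
import os
import shutil
import tempfile
import uuid

def export_read_cleanup(rows):
    # One unique id generated up front and reused by every step.
    unique_id = str(uuid.uuid4())[0:10]
    export_dir = os.path.join(tempfile.gettempdir(), f"bq_export_{unique_id}")
    os.makedirs(export_dir)

    # Step 1: "export" the rows to files named with the unique id.
    path = os.path.join(export_dir, "part-0.json")
    with open(path, "w") as f:
        f.write("\n".join(rows))

    # Step 2: read the exported files back.
    with open(path) as f:
        data = f.read().split("\n")

    # Step 3: delete the directory, located via the *same* unique id.
    # If cleanup regenerated its own id, this directory would be orphaned.
    shutil.rmtree(export_dir)
    return data, os.path.exists(export_dir)
```

If the cleanup step computed a fresh `str(uuid.uuid4())[0:10]` instead of reusing `unique_id`, the `rmtree` would target a path that was never created and the exported files would leak.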
What you could do here is create the unique_id beforehand in a DoFn then wrap it in a pvalue.AsSingleton and pass it to both the Read and Cleanup operations as a side-input. See an example here that does this.
The relevant issue mentions this is the way Java does it: it creates the value in a DoFn, then uses it as a side input for both read and cleanup.
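A minimal stand-in for that pattern, in plain Python rather than actual Beam code: `SingletonId` is a hypothetical class illustrating the `pvalue.AsSingleton` semantics the comment describes, i.e. the value is computed once at run time and every consumer observes the same value:

```python
import uuid

class SingletonId:
    """Hypothetical stand-in for a single-element PCollection wrapped in
    pvalue.AsSingleton: the id is computed lazily, exactly once, and the
    same value is handed to every step that takes it as a side input."""

    def __init__(self):
        self._value = None

    def get(self):
        # Computed on first access only; later calls return the cached value.
        if self._value is None:
            self._value = str(uuid.uuid4())[0:10]
        return self._value

ids = SingletonId()
read_id = ids.get()     # would name the exported files in the Read step
cleanup_id = ids.get()  # would locate the directory in the Cleanup step
```

Because both steps consume the same singleton, `read_id` and `cleanup_id` are guaranteed equal, which is the property the reviewer is asking for.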
```diff
  job_name=job_name,
  step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
```
Does doing it this way make a difference? I'm not so familiar with this code; I'd defer to a Python SDK expert here.
fixes #22733

The UUID generation is now done at pipeline creation time in the Python BigQuery IO. This should prevent collisions from occurring.