Conversation

@svetakvsundhar
Contributor

fixes #22733

UUID generation is now done at pipeline creation time in the Python BigQuery IO. This should prevent collisions from occurring.
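
In essence, the id is now generated inline when the transform is constructed, as in this simplified sketch (the one-line change is visible in the review diffs below; this is not the exact bigquery.py code):

```python
import uuid

# Simplified illustration of the change: a fresh UUID-derived id is produced
# at pipeline construction time instead of being passed in from elsewhere.
unique_id = str(uuid.uuid4())[0:10]
```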


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@svetakvsundhar changed the title from "UUID now set in PTransform." to "Moving UUID definition" on Mar 27, 2023
@svetakvsundhar changed the title from "Moving UUID definition" to "Moving UUID definition in Python BQ IO" on Mar 27, 2023
@codecov

codecov bot commented Mar 27, 2023

Codecov Report

Merging #26002 (f79cda8) into master (fe791f6) will decrease coverage by 2.58%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #26002      +/-   ##
==========================================
- Coverage   73.96%   71.38%   -2.58%     
==========================================
  Files         706      739      +33     
  Lines       95473    97928    +2455     
==========================================
- Hits        70616    69908     -708     
- Misses      23541    26704    +3163     
  Partials     1316     1316              
Flag Coverage Δ
python 79.97% <100.00%> (-4.03%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
sdks/python/apache_beam/io/gcp/bigquery.py 71.46% <100.00%> (+0.11%) ⬆️

... and 153 files with indirect coverage changes


@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@svetakvsundhar
Contributor Author

retest this please

@svetakvsundhar
Contributor Author

Run Python Examples_Direct

@svetakvsundhar
Contributor Author

R: @ahmedabu98

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

  job_name=job_name,
  step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
Contributor


It's critical that the unique_id here is equal to unique_id on line 2492. BigQuery Export read generally works in three steps:

  1. output to files in GCS
  2. read from those files
  3. delete GCS files

The unique id is used when creating those file names, and the same unique id is used to generate the name of the directory to delete after the read operation. Unique ID should be the same for both so that we can clean up properly.
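
For illustration, here is a rough sketch of how the same id ties the three steps together (the paths are hypothetical, not the actual layout used by bigquery.py):

```python
# Hypothetical path layout, for illustration only.
gcs_location = 'gs://temp-bucket/bq_export'
unique_id = 'abc123def0'

# Step 1: export files are written under a directory derived from the id.
export_pattern = f'{gcs_location}/{unique_id}/table-dump-*.avro'

# Step 2: the read consumes the same pattern.
read_pattern = export_pattern

# Step 3: cleanup deletes that same directory, so both ids must match.
cleanup_dir = f'{gcs_location}/{unique_id}/'
```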

Contributor


What you could do here is create the unique_id beforehand in a DoFn then wrap it in a pvalue.AsSingleton and pass it to both the Read and Cleanup operations as a side-input. See an example here that does this.

The relevant issue mentions that this is how the Java SDK does it: it creates the value in a DoFn and then uses it as a side input for both read and cleanup.
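
A minimal sketch of that pattern (the DoFn and the downstream wiring here are illustrative, not the actual bigquery.py transforms):

```python
import uuid

import apache_beam as beam
from apache_beam import pvalue


class _GenerateUniqueId(beam.DoFn):
  """Produces a single unique id at pipeline execution time."""
  def process(self, unused_element):
    yield str(uuid.uuid4())[0:10]


with beam.Pipeline() as p:
  # One impulse element yields exactly one unique id.
  unique_id = (
      p
      | 'Impulse' >> beam.Create([None])
      | 'GenerateUniqueId' >> beam.ParDo(_GenerateUniqueId()))

  # The same singleton is passed as a side input to both the read and the
  # cleanup transforms, so the export files and the cleanup directory agree.
  unique_id_si = pvalue.AsSingleton(unique_id)

  # Hypothetical downstream transforms, standing in for the real
  # export/read/cleanup steps:
  # rows = p | 'Read' >> _ExportAndRead(unique_id=unique_id_si)
  # _ = rows | 'Cleanup' >> _DeleteExportedFiles(unique_id=unique_id_si)
```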

  job_name=job_name,
  step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
Contributor


Does doing it this way make a difference? I'm not so familiar with this area; I'd defer to a Python SDK expert here.
