Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace StorageV1 client with GCS client (take 4) #28079

Merged
merged 1 commit into from
Sep 26, 2023

Conversation

BjornPrime
Copy link
Contributor

@BjornPrime BjornPrime commented Aug 21, 2023

Addresses #25676.

Now with improved error handling on batch requests (and test coverage of same).


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@codecov
Copy link

codecov bot commented Aug 21, 2023

Codecov Report

Merging #28079 (904550e) into master (042ac52) will decrease coverage by 0.36%.
Report is 37 commits behind head on master.
The diff coverage is 48.59%.

@@            Coverage Diff             @@
##           master   #28079      +/-   ##
==========================================
- Coverage   72.31%   71.96%   -0.36%     
==========================================
  Files         684      680       -4     
  Lines      101727   100163    -1564     
==========================================
- Hits        73567    72080    -1487     
+ Misses      26581    26505      -76     
+ Partials     1579     1578       -1     
Flag Coverage Δ
python 82.53% <48.59%> (-0.28%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...pache_beam/dataframe/pandas_top_level_functions.py 90.62% <100.00%> (+0.33%) ⬆️
sdks/python/apache_beam/internal/gcp/auth.py 79.72% <100.00%> (+0.56%) ⬆️
sdks/python/apache_beam/io/gcp/bigquery_tools.py 75.00% <ø> (ø)
sdks/python/apache_beam/io/gcp/bigtableio.py 74.04% <ø> (-0.96%) ⬇️
sdks/python/apache_beam/io/gcp/gcsfilesystem.py 89.78% <100.00%> (+1.12%) ⬆️
sdks/python/apache_beam/typehints/schemas.py 88.91% <100.00%> (-2.71%) ⬇️
sdks/python/apache_beam/version.py 100.00% <100.00%> (ø)
sdks/python/setup.py 0.00% <ø> (ø)
sdks/python/apache_beam/io/gcp/bigquery.py 69.92% <50.00%> (-0.25%) ⬇️
...ks/python/apache_beam/runners/interactive/utils.py 94.69% <90.00%> (-0.40%) ⬇️
... and 4 more

... and 14 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@BjornPrime BjornPrime changed the title Replace StorageV1 client with GCS client (take 3) Replace StorageV1 client with GCS client (take 4) Aug 21, 2023

return [(pair[0], pair[1], pair_to_status[pair]) for pair in src_dest_pairs]
for pair, resp in list(zip(current_pairs, current_batch._responses)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May mentioned in the past - current_paths and current_batch._responses has different length. Actually, for copy the response has 4x of the length as the current_paths. This is because copy involves multiple operations.

python zip does not care length match and this would miss potential error code beyond first N response (N=batch size)

May consider zip(current_batch._target_objects, current_batch._responses) they should have the same size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure I'm testing for this already, but I've added an assertion just to be sure

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a test like this

https://github.com/Abacn/beam/blob/4a84d1618ed23614ec663f2cda1758d280ed5da6/sdks/python/apache_beam/io/gcp/gcsio_test.py#L481-500

also added some print to gcsio.py

https://github.com/Abacn/beam/blob/4a84d1618ed23614ec663f2cda1758d280ed5da6/sdks/python/apache_beam/io/gcp/gcsio.py#L292

Output:

5 20
200 200 200 200
200 200 200 200
200 200 200 200
200 404 200 404
200 200 200 200
gs://clouddfe-yihu-test/temp1/gcs_0.txt gs://clouddfe-yihu-test/temp2/gcs_0.txt 200
gs://clouddfe-yihu-test/temp1/gcs_1.txt gs://clouddfe-yihu-test/temp2/gcs_1.txt 200
gs://clouddfe-yihu-test/temp1/gcs_2.txt gs://clouddfe-yihu-test/temp2/gcs_2.txt 200
gs://clouddfe-yihu-test/temp1/gcs_3.txt gs://clouddfe-yihu-test/temp2/gcs_3.txt 200
gs://clouddfe-yihu-test/temp1/gcs_4.txt gs://clouddfe-yihu-test/temp2/gcs_4.txt 200

As can be seen, the length current_pairs and current_batch._responses is different (4 fold), and the return value missed the 404 code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you're saying, and running the tests with the assertion included has confirmed it. I'll dig into the client more to see what we can do about it.

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

6 similar comments
@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python_Runners PreCommit 3.10

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

1 similar comment
@BjornPrime
Copy link
Contributor Author

Run Python 3.11 PostCommit

@BjornPrime BjornPrime marked this pull request as ready for review September 21, 2023 19:05
@github-actions
Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Copy link
Contributor

@Abacn Abacn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for persistent on this. I went through carefully and commented on potential issues I found.

@@ -224,23 +223,6 @@ def test_delete_dataset_retries_for_timeouts(self, patched_time_sleep):
wrapper._delete_dataset('', '')
self.assertTrue(client.datasets.Delete.called)

@unittest.skipIf(
Copy link
Contributor

@Abacn Abacn Sep 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test gets removed while the reason why it fails internally remains unknown. From a high level there might be still some regression regarding credential / anonymous credentials - to aware

I am fine with it for now but please do not remove the test until we understand why it fails in certain environment, could skip if for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with Pablo and we agreed that the test was no longer necessary. While its failure is curious, IIRC, we determined that the test was not ultimately testing something relevant to production, so we didn't actually need to worry about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests if user agent info is added to the request sent to bigquery. In various gcp component we use this info to track usage. This effort is active e.g. gcs counterpart #26770 #27211

That's possibly why it is saying not relevent to production, but does not mean it can be removed

We can make it "skipped" but please do not remove test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll skip it and open an issue for it, since you say it's important.

sdks/python/setup.py Outdated Show resolved Hide resolved
sdks/python/apache_beam/io/gcp/gcsio.py Show resolved Hide resolved
@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva for label python.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@Abacn
Copy link
Contributor

Abacn commented Sep 25, 2023

Run Python 3.11 PostCommit

@Abacn
Copy link
Contributor

Abacn commented Sep 26, 2023

PreCommit Portable Python workflow fails during clean up unrelated. Opened actions/setup-java#543

@Abacn Abacn merged commit 94812c0 into apache:master Sep 26, 2023
81 of 85 checks passed
damccorm added a commit that referenced this pull request Sep 28, 2023
damccorm added a commit that referenced this pull request Sep 28, 2023
BjornPrime added a commit to BjornPrime/my-beam that referenced this pull request Oct 17, 2023
@ghost ghost mentioned this pull request Nov 8, 2023
3 tasks
shunping pushed a commit to shunping/beam that referenced this pull request Nov 16, 2023
The two commits are merged into one:
* Reapply "Replace StorageV1 client with GCS client (apache#28079)" (apache#28721)
* added project parameter to apiclient
johnjcasey pushed a commit that referenced this pull request Nov 16, 2023
* Cherry pick two previous commits on migrating gcs client

The two commits are merged into one:
* Reapply "Replace StorageV1 client with GCS client (#28079)" (#28721)
* added project parameter to apiclient

* Initialze storage client with project from pipeline option.

---------

Co-authored-by: Bjorn Pedersen <bjornpedersen@google.com>
jto pushed a commit to jto/beam that referenced this pull request Nov 17, 2023
* Cherry pick two previous commits on migrating gcs client

The two commits are merged into one:
* Reapply "Replace StorageV1 client with GCS client (apache#28079)" (apache#28721)
* added project parameter to apiclient

* Initialze storage client with project from pipeline option.

---------

Co-authored-by: Bjorn Pedersen <bjornpedersen@google.com>
@jto jto mentioned this pull request Nov 17, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants