Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2572] Python SDK S3 Filesystem #9955

Merged
merged 13 commits into from
Dec 21, 2019

Conversation

MattMorgis
Copy link
Contributor

@MattMorgis MattMorgis commented Oct 31, 2019

Co-authored-by: Matthew Morgis matthew.morgis@gmail.com
Co-authored-by: Tamera Lanham t.lanham@elsevier.com

This adds an AWS S3 file system implementation to the python SDK.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

Co-authored-by: Matthew Morgis <matthew.morgis@gmail.com>
Co-authored-by: Tamera Lanham <t.lanham@elsevier.com>
@MattMorgis MattMorgis force-pushed the BEAM-2572/python-sdk-s3-filesystem branch from 20ea58a to 0db9aef Compare November 4, 2019 18:55
@MattMorgis
Copy link
Contributor Author

MattMorgis commented Nov 4, 2019

Hi,

We are running into trouble getting the unit tests to pass in the CI environment, and I think we can use help from a core team member.

We added a new set of extra dependencies when using this new S3 filesystem - we followed the same pattern that GCP did: https://github.com/apache/beam/pull/9955/files#diff-e9d0ab71f74dc10309a29b697ee99330R239

This allows the user to install with pip install beam[gcp] or pip install beam[aws] in our case.

Our unit tests are completely mocked out and do not require any of the AWS extra packages, however, we set it up behind a flag so you can bypass the mock and talk to a real S3 bucket over the wire. Because of this, the extra dependencies do need to installed when running these new unit tests.

Again, following the lead of how GCP implemented this, they also skip the unit tests if their extra dependencies are not installed: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsio_test.py#L240

Our question: How do we configure CI to install the AWS deps to run the tests?

I have poked around a bit and found one setting in tox.ini that appears to install both the test and gcp deps (https://github.com/apache/beam/blob/master/sdks/python/tox.ini#L200). Addtionally, at the root level of the project, (https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1799) I found a installGcpTest Gradle task that seems to also install both. This task only seems to be referenced inside of the test-suites/dataflow but not direct or portable.

Any guidance here would be greatly appreciated!

@MattMorgis
Copy link
Contributor Author

@aaltay
Copy link
Member

aaltay commented Nov 5, 2019

@MattMorgis thank you for the contribution. The general path of adding [aws] as an extra package sounds reasonable.

@chamikaramj could help with reviews or find a person to review.
@yifanzou could help with CI related questions.

@chamikaramj
Copy link
Contributor

R: @pabloem will you be able to review ?

Also cc: @lukecwik

@yifanzou
Copy link
Contributor

yifanzou commented Nov 8, 2019

@markflyhigh would you help on the python test environment setup?

@pabloem
Copy link
Member

pabloem commented Nov 11, 2019

I can review. Looking today.

Copy link
Member

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Matt, Tamera!
I believe that adding the dependency in the tox.ini file should make your tests work fine. Could you try adding the aws tag to tox.ini please?

Later on it might make sense to rename the suites to pyXX-all, or pyXX-cloud.

raise ValueError('Path %r must be S3 path.' % path)

prefix_len = len(S3FileSystem.S3_PREFIX)
last_sep = path[prefix_len:].rfind('/')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a lot of this code is duplicated, so it would be nice to deduplicate between the filesystems.... but you don't need to worry about it for now : P

from __future__ import absolute_import


class GetRequest():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like most of these messages could be implemented as namedtuple? You'd gain hashing and other utils, but also not a blocker.

sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py Outdated Show resolved Hide resolved
filename (str): S3 file path in the form ``s3://<bucket>/<object>``.
mode (str): ``'r'`` for reading or ``'w'`` for writing.
read_buffer_size (int): Buffer size to use during read operations.
mime_type (str): Mime type to set for write operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mime_type is ignored in this function. Maybe AWS always works with byte data? Should we verify that users sre requesting bytes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is addressed with the most recent changes, so the mime_type that the user sets will be reflected in the ContentType of the object in S3

@@ -46,8 +46,15 @@
# TODO(sourabhbajaj): Remove the GCP specific error code to a submodule
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you file a JIRA issue and replace it in this line to remove the S3/GCS-specific errors from this file? We don't need to move them now, but it'd be nice to track them for later.

@pabloem
Copy link
Member

pabloem commented Nov 22, 2019

There were some errors when creating the fake client:

14:34:21 ======================================================================
14:34:21 ERROR: test_delete_tree (apache_beam.io.aws.s3io_test.TestS3IO)
14:34:21 ----------------------------------------------------------------------
14:34:21 Traceback (most recent call last):
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/s3io_test.py", line 488, in test_delete_tree
14:34:21     self.assertTrue(self.aws.exists(path))
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/s3io.py", line 443, in exists
14:34:21     self.client.get_object_metadata(request)
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/clients/s3/fake_client.py", line 88, in get_object_metadata
14:34:21     return file_.get_metadata()
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/clients/s3/fake_client.py", line 49, in get_metadata
14:34:21     len(self.contents))
14:34:21 TypeError: __init__() takes exactly 6 arguments (5 given)

@pabloem
Copy link
Member

pabloem commented Nov 26, 2019

@MattMorgis @tamera-lanham There seems to be an issue with the fake client : ) would love to move this forward.

@MattMorgis
Copy link
Contributor Author

@pabloem I'll look into that and the tox.ini today or tomorrow

@pabloem
Copy link
Member

pabloem commented Dec 11, 2019

Hi! : ) it would be nice to get this in soon. Would you like us to jump on a call / any help pushing it through? : )

return messages.Item(self.etag,
self.key,
last_modified_datetime,
len(self.contents))
Copy link
Member

@pabloem pabloem Dec 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the problem is a missing argument here (mime_type). : ) @MattMorgis @tamera-lanham

Suggested change
len(self.contents))
len(self.contents), 'application/octet-stream')

Copy link
Contributor

@tamera-lanham tamera-lanham Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Sorry it's been a bit since I've been in this codebase. A couple notes:

  • In the constructor for messages.Item, mime_type is optional and defaults to None (skds/python/apache_beam/io/aws/clients/s3/messages.py:121), so it's surprising that this error is appearing. When I run the tests locally the behavior is what I'd expect - the mime_type gets set to None and the test passes, so I don't know why it isn't doing that in CI. I can't replicate the failure in my environment (which is Dockerized, if you want to try building it yourself!) Sorry! I may have only had that locally
  • The reason for using the default of None is because it was kind of onerous to build mime type support into the fake client. The only time mime type behavior is tested is in s3io_test.py, in test_file_mime_type, and that passes against the real client and is skipped for the fake. I like that metadata from the fake client comes out with a mime type of None, to indicate to whoever consumes the fake client that mime types aren't supported. Would you be ok with a default of None here instead of application/octet-stream? Whatever the default is, it shouldn't show up in any test anyway (except the one that we skip for the mock anyway, in which case we're skipping it)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's totally fine by me : ) - you're right, it seems that the default value for mime type in Item was not set to None. I am okay with letting it be None. Thanks for taking a look.

@pabloem
Copy link
Member

pabloem commented Dec 19, 2019

Run Python PreCommit

@tamera-lanham
Copy link
Contributor

I'm checking out the test failures and spot-fixing now, starting with the python 2.7 issues. Seems like most of those can be fixed fairly easily. There are a couple of oddball failures as well in some python 3 environments which might take longer to figure out. I'll commit again when I think I've got some fixes in.

Also, do you know if there's a way to download a whole test report instead of navigating Jenkins to see the results?

@pabloem
Copy link
Member

pabloem commented Dec 19, 2019

image

I'll also try and take a look. Sometimes our precommit tests are flaky, though it does seem like the failure is coming from thr s3io code. Thanks for pushing this forward @tamera-lanham : )

@pabloem
Copy link
Member

pabloem commented Dec 20, 2019

Build scan: https://scans.gradle.com/s/tc45x4bplbo6m - only docs are complaining now : )

Co-Authored-By: Pablo <pabloem@users.noreply.github.com>
Copy link
Member

@pabloem pabloem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New build scan: https://scans.gradle.com/s/zeusfpjug4inm
Docs still complaining. I misunderstood what the error was. There's a new proposed fix. I believe that should help : )

import boto3

except ImportError:
raise ImportError('Missing boto3 requirement')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I think another approach for this line could be:

Suggested change
raise ImportError('Missing boto3 requirement')
boto3 = None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is the real, non-mocked boto3 client, so it only makes sense to use it with boto3 installed. If we catch the ImportError and silence it this way the user will just get a more cryptic error later (something like 'NoneType' object has no attribute 'client') which could be harder to debug. If it's our usage of a built-in error that's causing the problem, could we just change the type of error we're raising?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also just ignore the original import error, that is just delete lines 27 and 28 entirely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could add this, and in the client __init__ method (where boto3 is actually called), add a line with something like:

assert boto3 is not None, 'Missing boto3 requirement'

This would prevent this import-time error, and let the tests pass, and give a reasonable error message during client intialization rather than at import. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works for me!

@pabloem
Copy link
Member

pabloem commented Dec 20, 2019

Looks like errors unrelated to the change. Let me clean up the GCP project that we use for testing

@pabloem
Copy link
Member

pabloem commented Dec 20, 2019

Run Python2_PVR_Flink PreCommit

@pabloem
Copy link
Member

pabloem commented Dec 20, 2019

Run Python PreCommit

@pabloem
Copy link
Member

pabloem commented Dec 21, 2019

lovely!

@pabloem pabloem merged commit 9c46f50 into apache:master Dec 21, 2019
@pabloem
Copy link
Member

pabloem commented Dec 21, 2019

Thanks so much @tamera-lanham @MattMorgis - y'all went the extra mile to write a good feature with testable code. Lots of people have wanted this feature added, so I'm very grateful to you two : )

@aaltay
Copy link
Member

aaltay commented Dec 21, 2019

Thank you all very much!

@BACtaki BACtaki mentioned this pull request Mar 30, 2020
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants