[BEAM-2572] Python SDK S3 Filesystem #9955
Conversation
Co-authored-by: Matthew Morgis <matthew.morgis@gmail.com> Co-authored-by: Tamera Lanham <t.lanham@elsevier.com>
(Force-pushed from 20ea58a to 0db9aef.)
Hi,

We are running into trouble getting the unit tests to pass in the CI environment, and I think we could use help from a core team member.

We added a new set of extra dependencies for this new S3 filesystem - we followed the same pattern that GCP did: https://github.com/apache/beam/pull/9955/files#diff-e9d0ab71f74dc10309a29b697ee99330R239. This allows the user to install with the `[aws]` extra.

Our unit tests are completely mocked out and do not require any of the AWS extra packages; however, we set them up behind a flag so you can bypass the mock and talk to a real S3 bucket over the wire. Because of this, the extra dependencies do need to be installed when running these new unit tests. Again following GCP's lead, we also skip the unit tests if the extra dependencies are not installed: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsio_test.py#L240

Our question: how do we configure CI to install the AWS deps to run the tests? I have poked around a bit and found one setting in … Any guidance here would be greatly appreciated!
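The skip-when-not-installed approach described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual Beam test code: the class and test names are hypothetical, and the `[aws]` extra name is taken from the discussion below.

```python
import unittest

# boto3 would be provided by the aws extra (e.g. installing with [aws]);
# the import is guarded so the test module still loads without it.
try:
    import boto3
except ImportError:
    boto3 = None


@unittest.skipIf(boto3 is None, 'AWS dependencies are not installed')
class S3IOTest(unittest.TestCase):
    # Hypothetical placeholder test; the real suite mocks out S3 unless a
    # flag tells it to talk to a real bucket over the wire.
    def test_roundtrip(self):
        self.assertEqual(b'data', b'data')
```

With this pattern, CI environments without the extra installed simply report the suite as skipped rather than erroring on import.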
@MattMorgis thank you for the contribution. The general path of adding [aws] as an extra package sounds reasonable. @chamikaramj could help with reviews or find a person to review.

@markflyhigh would you help with the Python test environment setup?

I can review. Looking today.
Thanks Matt, Tamera!
I believe that adding the dependency in the tox.ini file should make your tests work fine. Could you try adding the aws tag to tox.ini please?
Later on it might make sense to rename the suites to pyXX-all, or pyXX-cloud.
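For illustration, the kind of tox.ini change being suggested might look like the following. The environment name and extras list are assumptions, not Beam's actual configuration:

```ini
[testenv:py37-aws]
# Install the SDK with the aws extra so the S3 tests are not skipped.
extras =
    test
    aws
commands = pytest apache_beam/io/aws
```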
raise ValueError('Path %r must be S3 path.' % path)

prefix_len = len(S3FileSystem.S3_PREFIX)
last_sep = path[prefix_len:].rfind('/')
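To illustrate what the prefix/rfind logic above computes, here is a small standalone sketch. The function name and edge-case handling are illustrative, not the actual `S3FileSystem.split` implementation:

```python
S3_PREFIX = 's3://'


def split_path(path):
    """Split an S3 path into (head, tail) around the last '/'."""
    if not path.startswith(S3_PREFIX):
        raise ValueError('Path %r must be S3 path.' % path)
    prefix_len = len(S3_PREFIX)
    # Search for the last separator only after the 's3://' prefix, so the
    # '//' in the scheme is never mistaken for a path separator.
    last_sep = path[prefix_len:].rfind('/')
    if last_sep < 0:
        # No object part, e.g. 's3://bucket' splits into itself and ''.
        return path, ''
    last_sep += prefix_len
    return path[:last_sep], path[last_sep + 1:]
```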
a lot of this code is duplicated, so it would be nice to deduplicate between the filesystems.... but you don't need to worry about it for now : P
from __future__ import absolute_import


class GetRequest():
It seems like most of these messages could be implemented as namedtuple? You'd gain hashing and other utils, but also not a blocker.
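The namedtuple suggestion could look roughly like this; the field names are illustrative, not the actual message definitions:

```python
from collections import namedtuple

# GetRequest as a namedtuple instead of a plain class: equality, hashing,
# and a useful repr come for free.
GetRequest = namedtuple('GetRequest', ['bucket', 'object'])

a = GetRequest(bucket='my-bucket', object='path/to/key')
b = GetRequest(bucket='my-bucket', object='path/to/key')
```

Because namedtuple instances hash and compare by value, they can be used directly as dict keys or in sets, which plain classes only get with hand-written `__eq__` and `__hash__`.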
filename (str): S3 file path in the form ``s3://<bucket>/<object>``.
mode (str): ``'r'`` for reading or ``'w'`` for writing.
read_buffer_size (int): Buffer size to use during read operations.
mime_type (str): Mime type to set for write operations.
`mime_type` is ignored in this function. Maybe AWS always works with byte data? Should we verify that users are requesting bytes?
This is addressed with the most recent changes, so the `mime_type` that the user sets will be reflected in the `ContentType` of the object in S3.
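As a sketch of what "reflected in the ContentType" means in practice, a write path might assemble its boto3 `put_object` arguments like this. The helper name is hypothetical; `ContentType` is the real boto3/S3 keyword:

```python
def build_put_object_args(bucket, key, data,
                          mime_type='application/octet-stream'):
    # The user-supplied mime_type becomes the object's ContentType in S3,
    # matching the keyword accepted by boto3's put_object call.
    return {
        'Bucket': bucket,
        'Key': key,
        'Body': data,
        'ContentType': mime_type,
    }
```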
@@ -46,8 +46,15 @@
# TODO(sourabhbajaj): Remove the GCP specific error code to a submodule
Would you file a JIRA issue and reference it on this line, to track removing the S3/GCS-specific errors from this file? We don't need to move them now, but it'd be nice to track them for later.
[WIP] Mime type
style: linter
Co-Authored-By: Pablo <pabloem@users.noreply.github.com>
There were some errors when creating the fake client: …
@MattMorgis @tamera-lanham There seems to be an issue with the fake client : ) would love to move this forward.

@pabloem I'll look into that and the …

Hi! : ) it would be nice to get this in soon. Would you like us to jump on a call / any help pushing it through? : )
return messages.Item(self.etag,
                     self.key,
                     last_modified_datetime,
                     len(self.contents))
It seems that the problem is a missing argument here (`mime_type`). : ) @MattMorgis @tamera-lanham

Suggested change:

-    len(self.contents))
+    len(self.contents), 'application/octet-stream')
Hey! Sorry it's been a bit since I've been in this codebase. A couple notes:

1. In the constructor for `messages.Item`, `mime_type` is optional and defaults to `None` (sdks/python/apache_beam/io/aws/clients/s3/messages.py:121 - sorry, I may have only had that locally), so it's surprising that this error is appearing. When I run the tests locally the behavior is what I'd expect - the `mime_type` gets set to `None` and the test passes, so I don't know why it isn't doing that in CI. I can't replicate the failure in my environment (which is Dockerized, if you want to try building it yourself!)
2. The reason for using the default of `None` is that it was kind of onerous to build mime type support into the fake client. The only time mime type behavior is tested is in s3io_test.py, in `test_file_mime_type`, and that passes against the real client and is skipped for the fake. I like that metadata from the fake client comes out with a mime type of `None`, to indicate to whoever consumes the fake client that mime types aren't supported. Would you be ok with a default of `None` here instead of `application/octet-stream`? Whatever the default is, it shouldn't show up in any test anyway (except the one we already skip for the mock).
That's totally fine by me : ) - you're right, it seems that the default value for mime type in `Item` was not set to None. I am okay with letting it be None. Thanks for taking a look.
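The resolution agreed on here - `Item`'s `mime_type` defaulting to `None`, so that fake-client metadata signals mime types aren't supported - can be sketched as follows. This is a simplified stand-in, not the actual `messages.Item`:

```python
class Item(object):
    """Simplified object-metadata record; mime_type defaults to None."""

    def __init__(self, etag, key, last_modified, size, mime_type=None):
        self.etag = etag
        self.key = key
        self.last_modified = last_modified
        self.size = size
        # None signals that the (fake) client doesn't track mime types;
        # the real client fills this in from the object's ContentType.
        self.mime_type = mime_type


# Constructing without a mime_type, as the fake client does:
item = Item('abc123', 'path/to/key', None, 11)
```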
Run Python PreCommit

I'm checking out the test failures and spot-fixing now, starting with the Python 2.7 issues. Seems like most of those can be fixed fairly easily. There are a couple of oddball failures as well in some Python 3 environments which might take longer to figure out. I'll commit again when I think I've got some fixes in. Also, do you know if there's a way to download a whole test report instead of navigating Jenkins to see the results?

I'll also try and take a look. Sometimes our precommit tests are flaky, though it does seem like the failure is coming from the s3io code. Thanks for pushing this forward @tamera-lanham : )

Build scan: https://scans.gradle.com/s/tc45x4bplbo6m - only docs are complaining now : )
Co-Authored-By: Pablo <pabloem@users.noreply.github.com>
New build scan: https://scans.gradle.com/s/zeusfpjug4inm
Docs still complaining. I misunderstood what the error was. There's a new proposed fix. I believe that should help : )
try:
  import boto3
except ImportError:
  raise ImportError('Missing boto3 requirement')
Sorry. I think another approach for this line could be:

Suggested change:

-  raise ImportError('Missing boto3 requirement')
+  boto3 = None
This file is the real, non-mocked boto3 client, so it only makes sense to use it with boto3 installed. If we catch the `ImportError` and silence it this way, the user will just get a more cryptic error later (something like `'NoneType' object has no attribute 'client'`), which could be harder to debug. If it's our usage of a built-in error that's causing the problem, could we just change the type of error we're raising?
We can also just ignore the original import error, that is just delete lines 27 and 28 entirely.
You could add this, and in the client `__init__` method (where boto3 is actually called), add a line with something like:

assert boto3 is not None, 'Missing boto3 requirement'

This would prevent the import-time error, let the tests pass, and give a reasonable error message during client initialization rather than at import. Thoughts?
Works for me!
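The pattern agreed on above - set `boto3` to `None` at import time and assert in the client's `__init__` - can be sketched like this. The class name is hypothetical:

```python
try:
    import boto3  # optional dependency, installed via the aws extra
except ImportError:
    # Don't fail at import time; modules that never build a real client
    # (e.g. the fully mocked tests) can still be imported.
    boto3 = None


class RealS3Client(object):
    """Hypothetical wrapper that needs boto3 only when instantiated."""

    def __init__(self):
        # Fail here, with a clear message, rather than with a cryptic
        # "'NoneType' object has no attribute 'client'" later on.
        assert boto3 is not None, 'Missing boto3 requirement'
        self._client = boto3.client('s3')
```

This keeps the error actionable while letting test collection succeed in environments without the AWS extras.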
Looks like errors unrelated to the change. Let me clean up the GCP project that we use for testing.

Run Python2_PVR_Flink PreCommit

Run Python PreCommit

lovely!

Thanks so much @tamera-lanham @MattMorgis - y'all went the extra mile to write a good feature with testable code. Lots of people have wanted this feature added, so I'm very grateful to you two : )

Thank you all very much!
Co-authored-by: Matthew Morgis matthew.morgis@gmail.com
Co-authored-by: Tamera Lanham t.lanham@elsevier.com
This adds an AWS S3 file system implementation to the python SDK.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

- Choose reviewer(s) and mention them in a comment (R: @username).
- Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.

See the Contributor Guide for more tips on how to make the review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.