[BEAM-2572] Python SDK S3 Filesystem #9955

MattMorgis · 2019-10-31T19:47:25Z

Co-authored-by: Matthew Morgis matthew.morgis@gmail.com
Co-authored-by: Tamera Lanham t.lanham@elsevier.com

This adds an AWS S3 file system implementation to the Python SDK.

MattMorgis · 2019-11-04T20:01:28Z

Hi,

We are running into trouble getting the unit tests to pass in the CI environment, and I think we can use help from a core team member.

We added a new set of extra dependencies when using this new S3 filesystem - we followed the same pattern that GCP did: https://github.com/apache/beam/pull/9955/files#diff-e9d0ab71f74dc10309a29b697ee99330R239

This allows the user to install with pip install apache-beam[gcp], or pip install apache-beam[aws] in our case.
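For readers unfamiliar with the pattern, the extras mechanism mentioned above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of sdks/python/setup.py: the variable names are placeholders, and no version pins are shown.

```python
# Hypothetical sketch of an optional dependency group ("extra").
# The names below are placeholders, not the real setup.py variables.
AWS_REQUIREMENTS = ['boto3']

EXTRAS_REQUIRE = {
    # Installed only when the user asks for the [aws] extra.
    'aws': AWS_REQUIREMENTS,
}

# In a real setup.py this dict would be handed to setuptools:
#   setuptools.setup(..., extras_require=EXTRAS_REQUIRE)
```

Keeping boto3 out of the core requirements means users who never touch S3 do not have to install it.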

Our unit tests are completely mocked out and do not require any of the AWS extra packages. However, we set the mock up behind a flag so you can bypass it and talk to a real S3 bucket over the wire. Because of this, the extra dependencies do need to be installed when running these new unit tests.

Again, following the lead of how GCP implemented this, they also skip the unit tests if their extra dependencies are not installed: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/gcsio_test.py#L240
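A minimal sketch of that skip pattern, assuming boto3 is the optional dependency. The test class and method names here are hypothetical and are not taken from the PR:

```python
import unittest

# Guard the optional import so the module loads even without the extra.
try:
  import boto3
except ImportError:
  boto3 = None


@unittest.skipIf(boto3 is None, 'AWS dependencies are not installed')
class S3ClientSmokeTest(unittest.TestCase):
  """Hypothetical test case; it is skipped when boto3 is missing."""

  def test_boto3_exposes_client_factory(self):
    self.assertTrue(callable(boto3.client))
```

With this guard, an environment without the extra reports the tests as skipped instead of failing at import time.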

Our question: How do we configure CI to install the AWS deps to run the tests?

I have poked around a bit and found one setting in tox.ini that appears to install both the test and gcp deps (https://github.com/apache/beam/blob/master/sdks/python/tox.ini#L200). Additionally, at the root level of the project (https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L1799), I found an installGcpTest Gradle task that also seems to install both. This task only seems to be referenced inside test-suites/dataflow, but not direct or portable.

Any guidance here would be greatly appreciated!

MattMorgis · 2019-11-04T20:02:24Z

R: @pabloem @robertwb @aaltay @charlesccychen

aaltay · 2019-11-05T01:36:53Z

@MattMorgis thank you for the contribution. The general path of adding [aws] as an extra package sounds reasonable.

@chamikaramj could help with reviews or find a person to review.
@yifanzou could help with CI related questions.

chamikaramj · 2019-11-05T01:54:03Z

R: @pabloem will you be able to review?

Also cc: @lukecwik

yifanzou · 2019-11-08T18:41:22Z

@markflyhigh would you help on the python test environment setup?

pabloem · 2019-11-11T18:28:05Z

I can review. Looking today.

pabloem

Thanks Matt, Tamera!
I believe that adding the dependency in the tox.ini file should make your tests work fine. Could you try adding the aws tag to tox.ini please?

Later on it might make sense to rename the suites to pyXX-all, or pyXX-cloud.

pabloem · 2019-11-13T18:50:30Z

sdks/python/apache_beam/io/aws/s3filesystem.py

+      raise ValueError('Path %r must be S3 path.' % path)
+
+    prefix_len = len(S3FileSystem.S3_PREFIX)
+    last_sep = path[prefix_len:].rfind('/')


A lot of this code is duplicated, so it would be nice to deduplicate between the filesystems... but you don't need to worry about it for now : P
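One possible shape for that deduplication is a helper parameterized by the scheme prefix. The sketch below is only an illustration built around the two lines in the diff above; the function name split_path and its exact behavior are assumptions, not code from the PR:

```python
def split_path(path, prefix):
  """Splits a path such as 's3://bucket/dir/file' into (head, tail).

  Hypothetical shared helper: `prefix` would be 's3://' or 'gs://'.
  """
  if not path.startswith(prefix):
    raise ValueError('Path %r must start with %r.' % (path, prefix))

  prefix_len = len(prefix)
  # Search for the last separator only after the scheme prefix.
  last_sep = path[prefix_len:].rfind('/')
  if last_sep >= 0:
    last_sep += prefix_len

  if last_sep > 0:
    return path[:last_sep], path[last_sep + 1:]
  # No separator after the bucket name: there is no tail component.
  return path, ''
```

For example, split_path('s3://bucket/dir/file.txt', 's3://') yields ('s3://bucket/dir', 'file.txt'), and the same function covers a 'gs://' prefix.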

pabloem · 2019-11-13T19:06:05Z

sdks/python/apache_beam/io/aws/clients/s3/messages.py

+from __future__ import absolute_import
+
+
+class GetRequest():


It seems like most of these messages could be implemented as namedtuples? You'd gain hashing and other utils, but this is also not a blocker.
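A rough sketch of what the namedtuple suggestion could look like. Only the class name GetRequest appears in the diff; the field names bucket and object are assumptions made for illustration:

```python
import collections

# Field names are assumed; the PR excerpt does not show them.
GetRequest = collections.namedtuple('GetRequest', ['bucket', 'object'])

request = GetRequest(bucket='my-bucket', object='path/to/file.txt')
same = GetRequest('my-bucket', 'path/to/file.txt')

# Equality and hashing come for free, so requests can be compared in tests
# or used as dictionary keys.
requests_seen = {request, same}
```

Because the two requests compare equal and hash alike, requests_seen holds a single element.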

pabloem · 2019-11-13T19:14:44Z

sdks/python/apache_beam/io/aws/s3io.py

+      filename (str): S3 file path in the form ``s3://<bucket>/<object>``.
+      mode (str): ``'r'`` for reading or ``'w'`` for writing.
+      read_buffer_size (int): Buffer size to use during read operations.
+      mime_type (str): Mime type to set for write operations.


mime_type is ignored in this function. Maybe AWS always works with byte data? Should we verify that users are requesting bytes?

This is addressed with the most recent changes, so the mime_type that the user sets will be reflected in the ContentType of the object in S3.
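As a hedged illustration of how a mime type can reach S3: boto3's S3 upload calls such as put_object and create_multipart_upload accept Bucket, Key, and ContentType keyword arguments. The helper name and the default value below are assumptions, not the PR's code:

```python
def build_upload_kwargs(bucket, key, mime_type='application/octet-stream'):
  """Returns keyword arguments for an S3 upload call (hypothetical helper).

  The default mime type is an assumption chosen for this sketch.
  """
  return {
      'Bucket': bucket,
      'Key': key,
      # S3 stores this value as the object's Content-Type metadata.
      'ContentType': mime_type,
  }


kwargs = build_upload_kwargs('my-bucket', 'notes/readme.txt', 'text/plain')
# With boto3 installed and credentials configured, this could be used as:
#   boto3.client('s3').put_object(Body=b'hello', **kwargs)
```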

pabloem · 2019-11-13T19:19:34Z

sdks/python/apache_beam/utils/retry.py

@@ -46,8 +46,15 @@
 # TODO(sourabhbajaj): Remove the GCP specific error code to a submodule


Would you file a JIRA issue for removing the S3/GCS-specific errors from this file, and reference it in this TODO line? We don't need to move them now, but it'd be nice to track them for later.

pabloem · 2019-11-22T23:48:32Z

There were some errors when creating the fake client:

14:34:21 ======================================================================
14:34:21 ERROR: test_delete_tree (apache_beam.io.aws.s3io_test.TestS3IO)
14:34:21 ----------------------------------------------------------------------
14:34:21 Traceback (most recent call last):
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/s3io_test.py", line 488, in test_delete_tree
14:34:21     self.assertTrue(self.aws.exists(path))
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/s3io.py", line 443, in exists
14:34:21     self.client.get_object_metadata(request)
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/clients/s3/fake_client.py", line 88, in get_object_metadata
14:34:21     return file_.get_metadata()
14:34:21   File "/home/jenkins/jenkins-slave/workspace/beam_PreCommit_Python_Commit/src/sdks/python/test-suites/tox/py2/build/srcs/sdks/python/apache_beam/io/aws/clients/s3/fake_client.py", line 49, in get_metadata
14:34:21     len(self.contents))
14:34:21 TypeError: __init__() takes exactly 6 arguments (5 given)

pabloem · 2019-11-26T22:23:49Z

@MattMorgis @tamera-lanham There seems to be an issue with the fake client : ) would love to move this forward.

MattMorgis · 2019-12-04T14:07:00Z

@pabloem I'll look into that and the tox.ini today or tomorrow

pabloem · 2019-12-11T21:18:41Z

Hi! : ) it would be nice to get this in soon. Would you like us to jump on a call / any help pushing it through? : )

pabloem · 2019-12-17T02:15:37Z

sdks/python/apache_beam/io/aws/clients/s3/fake_client.py

+    return messages.Item(self.etag,
+                         self.key,
+                         last_modified_datetime,
+                         len(self.contents))


It seems that the problem is a missing argument here (mime_type). : ) @MattMorgis @tamera-lanham

Suggested change

- len(self.contents))
+ len(self.contents), 'application/octet-stream')

tamera-lanham · 2019-12-18

Hey! Sorry it's been a bit since I've been in this codebase. A couple notes:

In the constructor for messages.Item, mime_type is optional and defaults to None (sdks/python/apache_beam/io/aws/clients/s3/messages.py:121), so it's surprising that this error is appearing. When I run the tests locally the behavior is what I'd expect: the mime_type gets set to None and the test passes, so I don't know why it isn't doing that in CI. I can't replicate the failure in my environment (which is Dockerized, if you want to try building it yourself!). Sorry! I may have only had that default locally.

The reason for using the default of None is that it was kind of onerous to build mime type support into the fake client. The only time mime type behavior is tested is in s3io_test.py, in test_file_mime_type, and that passes against the real client and is skipped for the fake. I like that metadata from the fake client comes out with a mime type of None, to indicate to whoever consumes the fake client that mime types aren't supported. Would you be ok with a default of None here instead of application/octet-stream? Whatever the default is, it shouldn't show up in any test (except the mime type test, which we skip for the mock).

That's totally fine by me : ) - you're right, it seems that the default value for mime type in Item was not set to None. I am okay with letting it be None. Thanks for taking a look.
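The CI failure and the agreed fix can be reproduced in miniature. The classes below are simplified stand-ins for messages.Item, with assumed field names; they only show how a missing default turns a four-argument call into a TypeError:

```python
class StrictItem(object):
  """Stand-in without a default: mime_type must always be passed."""

  def __init__(self, etag, key, last_modified, size, mime_type):
    self.etag = etag
    self.key = key
    self.last_modified = last_modified
    self.size = size
    self.mime_type = mime_type


class Item(StrictItem):
  """Stand-in with the agreed default of None for mime_type."""

  def __init__(self, etag, key, last_modified, size, mime_type=None):
    super(Item, self).__init__(etag, key, last_modified, size, mime_type)


# The fake client passes four values, which only works with the default.
item = Item('etag-1', 'path/to/file.txt', 0, 3)

try:
  StrictItem('etag-1', 'path/to/file.txt', 0, 3)
  strict_call_failed = False
except TypeError:
  strict_call_failed = True
```

Leaving mime_type as None in the fake client's metadata also signals that the fake does not model mime types.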

pabloem · 2019-12-19T01:14:43Z

Run Python PreCommit

tamera-lanham · 2019-12-19T17:03:27Z

I'm checking out the test failures and spot-fixing now, starting with the python 2.7 issues. Seems like most of those can be fixed fairly easily. There are a couple of oddball failures as well in some python 3 environments which might take longer to figure out. I'll commit again when I think I've got some fixes in.

Also, do you know if there's a way to download a whole test report instead of navigating Jenkins to see the results?

pabloem · 2019-12-19T17:42:40Z

I'll also try and take a look. Sometimes our precommit tests are flaky, though it does seem like the failure is coming from the s3io code. Thanks for pushing this forward @tamera-lanham : )

pabloem · 2019-12-20T00:14:56Z

Build scan: https://scans.gradle.com/s/tc45x4bplbo6m - only docs are complaining now : )

pabloem

New build scan: https://scans.gradle.com/s/zeusfpjug4inm
Docs still complaining. I misunderstood what the error was. There's a new proposed fix. I believe that should help : )

pabloem · 2019-12-20T19:43:35Z

sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py

+  import boto3
+
+except ImportError:
+  raise ImportError('Missing boto3 requirement')


Sorry. I think another approach for this line could be:

Suggested change

- raise ImportError('Missing boto3 requirement')
+ boto3 = None

This file is the real, non-mocked boto3 client, so it only makes sense to use it with boto3 installed. If we catch the ImportError and silence it this way, the user will just get a more cryptic error later (something like 'NoneType' object has no attribute 'client'), which could be harder to debug. If it's our usage of a built-in error that's causing the problem, could we just change the type of error we're raising?

We can also just ignore the original import error, that is, just delete lines 27 and 28 entirely.

You could add this, and in the client __init__ method (where boto3 is actually called), add a line with something like:

assert boto3 is not None, 'Missing boto3 requirement'

This would prevent this import-time error, let the tests pass, and give a reasonable error message during client initialization rather than at import. Thoughts?

Works for me!
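Put together, the approach agreed on in this thread looks roughly like the sketch below. The class name Client and the attribute name are placeholders; only the assert line and the boto3 = None fallback come from the discussion:

```python
# Defer the failure: a missing boto3 no longer breaks module import.
try:
  import boto3
except ImportError:
  boto3 = None


class Client(object):
  """Hypothetical wrapper around the real (non-mocked) S3 client."""

  def __init__(self):
    # Fail here, with a clear message, instead of at import time.
    assert boto3 is not None, 'Missing boto3 requirement'
    self.client = boto3.client('s3')
```

Importing the module now succeeds in environments without the AWS extra, so test discovery and doc builds are not affected. One caveat: assert statements are stripped when Python runs with -O, so an explicit raise would be the stricter variant.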

pabloem · 2019-12-20T22:11:57Z

Looks like errors unrelated to the change. Let me clean up the GCP project that we use for testing

pabloem · 2019-12-20T22:19:49Z

Run Python2_PVR_Flink PreCommit

pabloem · 2019-12-20T22:19:55Z

Run Python PreCommit

pabloem · 2019-12-21T00:55:08Z

lovely!

pabloem · 2019-12-21T00:59:05Z

Thanks so much @tamera-lanham @MattMorgis - y'all went the extra mile to write a good feature with testable code. Lots of people have wanted this feature added, so I'm very grateful to you two : )

aaltay · 2019-12-21T01:01:39Z

Thank you all very much!

feature: python sdk S3 filesystem

0db9aef

MattMorgis force-pushed the BEAM-2572/python-sdk-s3-filesystem branch from 20ea58a to 0db9aef Compare November 4, 2019 18:55

chamikaramj requested a review from pabloem November 5, 2019 01:54

pabloem reviewed Nov 13, 2019

karlschriek mentioned this pull request Nov 14, 2019

Tfx samples without gcs. tensorflow/tfx#19

Closed

tamera-lanham and others added 5 commits November 22, 2019 14:28

feat: mime_type support in s3io.open

654ae5c

Merge pull request #27 from MattMorgis/mime-type

296c235

[WIP] Mime type

style: linter

865686a

Merge pull request #28 from MattMorgis/mime-type

74c4ee3

style: linter

Update sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py

466f587

Co-Authored-By: Pablo <pabloem@users.noreply.github.com>

MattMorgis and others added 2 commits December 16, 2019 08:38

WIP: trying py37 aws tests

bf40190

Merge branch 'master' into BEAM-2572/python-sdk-s3-filesystem

22eaed8

pabloem reviewed Dec 17, 2019

fix: fake_client mime_type test failure

0a76036

tamera-lanham added 2 commits December 19, 2019 15:19

fix: python 2.7 test failures

e949a48

style: linter

4305899

pabloem reviewed Dec 20, 2019

Update sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py

c586fd2

Co-Authored-By: Pablo <pabloem@users.noreply.github.com>

pabloem reviewed Dec 20, 2019

fix: error when boto3 is absent

1cb4dc1

pabloem merged commit 9c46f50 into apache:master Dec 21, 2019

BACtaki mentioned this pull request Mar 30, 2020

[BEAM-2572] Update documentation #11260

Merged

