[R] Using s3_bucket with non-AWS S3-compatible storage is confusing #33904

Closed
amoeba opened this issue Jan 28, 2023 · 3 comments · Fixed by #34009
Comments

@amoeba
Member

amoeba commented Jan 28, 2023

Describe the bug, including details regarding any error messages, version, and platform.

If I want to use arrow with non-AWS, S3-compatible storage, I might find the s3_bucket convenience function in our documentation [1, 2]. The documentation doesn't always mention non-AWS storage, but I might see in the help page for s3_bucket that I can pass options to S3FileSystem$create via the dots, and that those options include endpoint_override. I might then try the following:

```r
s3_bucket(
    "my-bucket",
    endpoint_override = "example.org",
    anonymous = TRUE
)
```

However, this results in the following error:

```
# Error: IOError: Bucket 'my-bucket' not found
# /Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/s3fs.cc:829  ResolveRegionUncached(bucket)
# /Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/s3fs.cc:357  ResolveS3BucketRegion(bucket)
# /Users/bryce/src/apache/arrow/cpp/src/arrow/filesystem/filesystem.cc:719  S3Options::FromUri(uri, out_path)
```

You can see that we're trying to resolve the bucket's AWS region, which fails as expected. Interestingly, the user who reported this issue to me found they could prevent this error by creating a bucket with the same name on AWS, at which point the region resolution would succeed (pointlessly), the S3FileSystem created would respect their endpoint_override, and future calls like $ls() would target the correct endpoint. This could cause significant frustration for non-AWS users.

If I go back to the docs and read further down on https://arrow.apache.org/docs/dev/r/articles/fs.html#file-systems-that-emulate-s3, I find that the recommended way to do what I want is directly through S3FileSystem$create:

```r
S3FileSystem$create(
    endpoint_override = "example.org",
    anonymous = TRUE
)
```
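
For completeness, here's roughly how the resulting filesystem would then be pointed at the bucket (a sketch: the names below are placeholders, and I'm assuming the `$path()` method that FileSystem objects expose for scoping to a prefix):

```r
library(arrow)

fs <- S3FileSystem$create(
    endpoint_override = "example.org",  # placeholder endpoint
    anonymous = TRUE
)

# Scope the filesystem to the bucket: listing and reads now go to
# example.org, and no AWS region resolution is attempted.
bucket <- fs$path("my-bucket")
bucket$ls()
```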

I was made aware through other channels while researching this issue that this syntax would also have worked (though it isn't sufficiently documented):

s3_bucket("s3://anonymous@my-bucket?endpoint_override=example.org")

With the increasing number of non-AWS providers of S3-compatible storage (e.g., Ceph, DigitalOcean), I think we'll only see more users of non-AWS, S3-compatible storage, and those users will run into confusion over (1) how things work and (2) how things are documented.

I'd like to:

  1. Do a pass over the documentation (online, and R help pages)
    a. To make the messaging clear that we support AWS, GCS, and S3-compatible storage providers
    b. Add more examples to s3_bucket to make it clearer how a user can go about doing that
  2. Change the behavior of s3_bucket to special-case passing endpoint_override, skipping region resolution and jumping directly to creating an S3FileSystem (see the rough sketch below).
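
A minimal sketch of what I mean by (2), purely illustrative: the helper name, the use of `do.call()` to forward the dots, and the assumption that `FileSystem$from_uri()` returns a list with an `fs` element are mine, not the actual implementation.

```r
# Illustrative sketch only -- not the real s3_bucket() body.
s3_bucket_sketch <- function(bucket, ...) {
  args <- list(...)
  if (length(args) > 0) {
    # Extra options such as endpoint_override or anonymous mean we should not
    # try to resolve the bucket's AWS region: build the filesystem directly.
    fs <- do.call(S3FileSystem$create, args)
  } else {
    # No extra options: keep the existing URI-based behavior.
    fs <- FileSystem$from_uri(paste0("s3://", bucket))$fs
  }
  # Return the filesystem scoped to the bucket, as s3_bucket() does today.
  fs$path(bucket)
}
```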

Component(s)

R

@kou kou changed the title Using s3_bucket with non-AWS S3-compatible storage is confusing [R] Using s3_bucket with non-AWS S3-compatible storage is confusing Jan 28, 2023
@PMassicotte

Good call @amoeba !

@paleolimbot
Member

I agree the existing documentation of that is confusing (I found the endpoint override thing by searching my own gists, which is hardly sufficient). I'd be grateful for any improvements you can come up with for the S3 interface...I don't have any experience using S3 (other than keeping our tests passing), so I'm a poor candidate to come up with improvements (but would be more than happy to review any PRs!).

cboettig added a commit to cboettig/arrow that referenced this issue Jan 29, 2023
`s3_bucket` has needless conditional logic that defaults to filesystem construction using `FileSystem$from_uri` and only falls back on `S3FileSystem$create()` when additional optional arguments are supplied.

This approach is more verbose, opaque, and not properly documented, leading to errors in the numerous edge cases described below. All of these problems are easily avoided by using the `S3FileSystem$create()` constructor directly (which also corresponds to the documentation for the S3 filesystem in pyarrow).

As you know, the S3 SDK is capable of reading user keys, secret keys, region, endpoint, and scheme from environment variables and from `.aws` config files, and users who have encountered the S3 system in other open-source software are very familiar with these uses.
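
(For concreteness, the conventional configuration looks roughly like the following; the values are placeholders, and whether every variable is honored by a given SDK build is an assumption on my part.)

```r
# Conventional AWS SDK configuration via environment variables (placeholders).
# With these set, S3FileSystem$create() needs no explicit key/secret/region
# arguments at all.
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "AKIAEXAMPLE",
  AWS_SECRET_ACCESS_KEY = "example-secret",
  AWS_DEFAULT_REGION    = "us-east-1"
)
fs <- arrow::S3FileSystem$create()
```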

The current implementation of `s3_bucket` instead needlessly seeks to construct the `s3://` URI notation, `s3://{access_key}:{secret_key}@?scheme={scheme}&endpoint_override={uri_escaped_endpoint}`, and create the filesystem object using `FileSystem$from_uri`:

https://github.com/apache/arrow/blob/4dd5cedb21d7b58d837bdb3c0d35a5cd80fd9f4b/r/R/filesystem.R#L463

This is not the correct URI construction, as it ignores additional options such as key, secret key, scheme, and endpoint override (and possibly others) that the user has passed to the s3_bucket() function. This creates errors and unpredictable behavior for the user -- for instance, observe that:

```r
s3 <- s3_bucket("neon4cast-scores", endpoint_override="data.ecoforecast.org", anonymous=TRUE)
```

This works as anticipated -- but only because a bucket with the exact same name requested from the MinIO-based https://data.ecoforecast.org/ also happens to exist on Amazon's default endpoint, s3.amazonaws.com. This allows the `from_uri()` method to erroneously succeed, connecting to the wrong endpoint, before being replaced by a connection to the correct endpoint triggered by the presence of `...`. (It took me some time to track down this behavior, as I and many colleagues have used `s3_bucket()` with independent endpoints for over a year without realizing why some bucket names would mysteriously fail while others worked.)

I do not think this problem is isolated to the use of `endpoint_override`. I don't think the `from_uri` construction is preferable to `S3FileSystem$create()` even when a user is working against the default s3.amazonaws.com endpoints, and in any event it requires more care that the URI is constructed with respect to user-facing options (including respecting `AWS_REGION` as documented in the SDK). I believe this would address apache#33904, but I want to emphasize that _no special conditional handling_ is needed here to support endpoint_override or other options -- the AWS SDK is already well designed to handle alternative endpoints and all of these other options; we just let the standard S3FileSystem$create method do its job.

I think the change I propose here would make the code base robust to these edge cases, improve performance by reducing redundant API calls, and require no change to any of the documentation or function API.
@amoeba
Member Author

amoeba commented Jan 29, 2023

FYI all: @cboettig filed a PR and makes the case for dramatically simplifying the function body in his top-level comment. See #33918 (comment).

paleolimbot added a commit that referenced this issue Feb 10, 2023
I think this is another option to address #33918.

This retains the existing behavior when no additional arguments (region, endpoint, etc) are supplied.  

If any optional arguments are supplied, this approach will bypass the `$from_uri` call. I think this is reasonable, given that the `from_uri()` call is already implicitly ignoring optional arguments, which is what gives rise to the current bugs.

* Closes: #33904

Lead-authored-by: Carl Boettiger <cboettig@gmail.com>
Co-authored-by: Carl Boettiger <cobettig@gmail.com>
Co-authored-by: Carl <cboettig@gmail.com>
Co-authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@fishandwhistle.net>
@paleolimbot paleolimbot added this to the 12.0.0 milestone Feb 10, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023

fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023