[C++/Python] Add support for S3 Bucket Versioning #32797

@asfimport

Arrow offers a reasonably capable S3 interface, but it lacks support for S3 buckets that have versioning enabled. For background on S3 bucket versioning, see:

https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html

If Arrow is interacting with a bucket where versioning is enabled, a single S3 key can have multiple versions of content stored under the same key name. Currently, Arrow cannot:

  1. Access a specific version of an S3 key rather than only the latest version. There is no way to specify the VersionId parameter of S3's GetObject API.

  2. Report the VersionId created when a new S3 key is uploaded to a bucket.

Along with S3, GCS also supports versioned buckets.

https://cloud.google.com/storage/docs/object-versioning

The FileSystem interface has a few shortcomings when it comes to remote filesystems that support versioning:

1. The parameters of open_input_stream() and open_input_file() do not easily lend themselves to adding a "version" parameter, because it would be passed to every other filesystem implementation, and most filesystems do not support versioning.

2. Upon completion of an S3 multipart upload (i.e., close() on an S3FileSystem output stream), there is currently no way for the user to determine the VersionId or ETag of the S3 key that was created. This matters because, when there are multiple concurrent writers to S3, each writer should be able to identify the S3 key it wrote.
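To make the concurrent-writer concern concrete, here is a minimal in-memory sketch of a versioned bucket. It is a toy model, not an S3 or Arrow API: the class `ToyVersionedBucket`, its methods, and the `v1`, `v2` style version identifiers are all assumptions made for illustration.

```python
import itertools


class ToyVersionedBucket:
    """In-memory stand-in for a versioned bucket (hypothetical, not an S3 client)."""

    def __init__(self):
        # key -> list of (version_id, content), oldest first
        self._versions = {}
        self._counter = itertools.count(1)

    def put(self, key, content):
        # Every write adds a new version instead of overwriting the old one.
        version_id = f"v{next(self._counter)}"
        self._versions.setdefault(key, []).append((version_id, content))
        return version_id

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if not versions:
            raise KeyError(key)
        if version_id is None:
            # No version requested: return the most recent write.
            return versions[-1][1]
        for vid, content in versions:
            if vid == version_id:
                return content
        raise KeyError((key, version_id))


bucket = ToyVersionedBucket()
# Two writers target the same key; each gets back its own version id.
id_a = bucket.put("data/part-0.parquet", b"from writer A")
id_b = bucket.put("data/part-0.parquet", b"from writer B")

latest = bucket.get("data/part-0.parquet")      # whichever write landed last
mine = bucket.get("data/part-0.parquet", id_a)  # writer A's own content
```

Without the version id returned by `put`, writer A could only read `latest` and would have no way to tell whether it is looking at its own data.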

Proposed solutions to enable S3 bucket versioning:

1. To allow library callers to read specific versions of an S3 key, extend only the S3FileSystem interface with two new API calls:

open_input_stream_with_version()

open_input_file_with_version()

Both behave like their namesakes in the normal FileSystem interface but take an additional "version" parameter, which is a string representation of the VersionId returned by S3 when the S3 key is created. If either function is called with an empty string as the version, the latest version of the S3 key is returned.

I'm a bit reluctant to add these specialized functions to the S3FileSystem interface alone, but I also don't think it is appropriate to change the parameter list of open_input_stream() and open_input_file() for all filesystems for functionality that only a small number of filesystems implement.
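The empty-string rule for the proposed functions could be sketched roughly as follows. This is only an illustration of how a version string might map onto a GetObject request: the helper names `split_s3_path` and `get_object_params` are hypothetical, the path is assumed to be in `bucket/key` form, and the version id in the usage lines is made up.

```python
def split_s3_path(path):
    """Split an assumed 'bucket/key' path into its two parts."""
    bucket, sep, key = path.partition("/")
    if not bucket or not sep or not key:
        raise ValueError(f"expected 'bucket/key', got {path!r}")
    return bucket, key


def get_object_params(path, version=""):
    """Build GetObject request parameters for a hypothetical
    open_input_stream_with_version(path, version) call."""
    bucket, key = split_s3_path(path)
    params = {"Bucket": bucket, "Key": key}
    if version:
        # A non-empty version pins the read to that VersionId.
        params["VersionId"] = version
    # An empty version leaves VersionId out, so the latest version is read.
    return params


latest_request = get_object_params("my-bucket/data/file.parquet")
pinned_request = get_object_params("my-bucket/data/file.parquet", "example-version-id")
```

Keeping the version out of the request when the string is empty means the new functions would degrade to the existing behavior of open_input_stream() and open_input_file().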

2. Allow callers to call ReadMetadata() on an S3FileSystem output stream, after the stream has been closed, to retrieve the metadata of the S3 key that was written. The metadata will likely include both a VersionId and an ETag.
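One possible shape for the second proposal, as a hedged sketch: the stream remembers whatever the store reports when the upload completes and exposes it only after close(). The class `ToyS3OutputStream`, the method name `read_metadata`, and the `fake_complete_upload` callback with its made-up values are assumptions for illustration, not Arrow's implementation.

```python
class ToyS3OutputStream:
    """Hypothetical output stream that keeps the upload result after close()."""

    def __init__(self, complete_upload):
        # complete_upload: callable that takes the written bytes and returns
        # the metadata dict the store reports when the upload completes.
        self._complete_upload = complete_upload
        self._chunks = []
        self._metadata = None
        self.closed = False

    def write(self, data):
        if self.closed:
            raise ValueError("write to closed stream")
        self._chunks.append(bytes(data))

    def close(self):
        if self.closed:
            return
        # Completing the upload is the point where VersionId and ETag become known.
        self._metadata = dict(self._complete_upload(b"".join(self._chunks)))
        self.closed = True

    def read_metadata(self):
        if not self.closed:
            raise ValueError("metadata is only available after close()")
        return dict(self._metadata)


def fake_complete_upload(body):
    # Stand-in for the store's response; both values are invented.
    return {"VersionId": "example-version-1", "ETag": f'"fake-etag-{len(body)}"'}


stream = ToyS3OutputStream(fake_complete_upload)
stream.write(b"hello ")
stream.write(b"world")
stream.close()
metadata = stream.read_metadata()
```

Raising before close() reflects that, for a multipart upload, the identifiers do not exist until the upload has been completed.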

Reporter: Rusty Conover

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-17544. Please see the migration documentation for further details.
