object_store: allow setting content-type per request #5329

Closed
flokli opened this issue Jan 24, 2024 · 7 comments · Fixed by #5650
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@flokli

flokli commented Jan 24, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'd like to use object_store to store blobs (and stay somewhat cloud-provider agnostic).

Depending on the size/nature of the blob, I either break it further down, or store it as-is. I need to distinguish these two cases when reading the data back in, and I'd like to use the widely-supported content-type field for this (it's either application/octet-stream, or some index content-type).

The decision of which content-type to use can't be made statically, but needs to happen per request. Which means the existing content-type mechanism (default, or map from extension to content-type) is not sufficient.

Describe the solution you'd like
Add an optional content_type field to PutOptions. In case it's not None, use that content type in favor of any of the existing logic.

Additionally, add an optional content_type field to ObjectMeta. Most backends already send the content type anyways, so populating it should be quite limited in cost, and no explicit different client config should be necessary.
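A minimal sketch of the precedence being asked for, assuming the existing logic checks the extension map first and then the client default. `PutOptionsSketch`, `ClientConfigSketch`, and `resolve_content_type` are hypothetical names used only for illustration; they are not the crate's actual types.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Hypothetical stand-in for the proposed per-request options; this is not
/// the real `PutOptions` from object_store.
#[derive(Debug, Default)]
struct PutOptionsSketch {
    /// Explicit content type for this request, if any.
    content_type: Option<String>,
}

/// Hypothetical stand-in for the existing client-level configuration:
/// an optional default plus a map from file extension to content type.
#[derive(Debug, Default)]
struct ClientConfigSketch {
    default_content_type: Option<String>,
    content_type_for_extension: HashMap<String, String>,
}

/// Resolve the content type for a single put.
/// Order (an assumption of this sketch): per-request value, then the
/// extension map, then the client default.
fn resolve_content_type<'a>(
    opts: &'a PutOptionsSketch,
    config: &'a ClientConfigSketch,
    location: &str,
) -> Option<&'a str> {
    // A per-request value always wins over the static configuration.
    if let Some(explicit) = opts.content_type.as_deref() {
        return Some(explicit);
    }
    // Otherwise fall back to the existing static logic.
    let by_extension = Path::new(location)
        .extension()
        .and_then(|ext| ext.to_str())
        .and_then(|ext| config.content_type_for_extension.get(ext));
    by_extension
        .or(config.default_content_type.as_ref())
        .map(String::as_str)
}
```

With `content_type: None` the function behaves exactly like the static configuration, so callers that never set the field would see no change.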

Describe alternatives you've considered

Using custom metadata, as introduced in #4999. This however is distinct from content-type and requires an additional request for some backends (AWS), while content-type is sent inline alongside the data.

@flokli flokli added the enhancement Any new improvement worthy of a entry in the changelog label Jan 24, 2024
@tustvold
Contributor

tustvold commented Jan 24, 2024

It looks like #4754 was closed by accident: #4999 only added a workaround for object tagging, whereas the ticket tracks supporting arbitrary object metadata. It also enumerates some of the challenges, including lack of support in listing APIs and unclear semantics for LocalFilesystem.

I would strongly encourage you to encode such metadata in the object path rather than relying on metadata attributes, because the different stores do not handle metadata in anything approaching a consistent manner.

@flokli
Author

flokli commented Jan 24, 2024

Encoding this metadata in the object path itself means I need to effectively check two different keys :-/

The listing API already doesn't guarantee order. Personally, I would be fine if, on backends where the listing endpoint doesn't return content types, we did not populate the field for list and only populated it on a get request. These are things that can be mentioned in the documentation.

Regarding LocalFilesystem, we could be using xattrs for this where available, and just return an error if PutOptions has a Some(_) content_type (so setting a content-type is requested) and it's either not supported or fails.

This shouldn't affect behaviour where the content-type isn't set explicitly, and also shouldn't swallow desired content types without any notice.
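To make those semantics concrete, here is a rough sketch under stated assumptions. The trait `ContentTypeSink`, the error type `PutError`, and the in-memory `MemorySink` are all hypothetical; a real LocalFilesystem implementation would write an xattr instead, which is left out here to keep the example dependency-free.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum PutError {
    /// A content type was requested but the backend cannot store one.
    ContentTypeUnsupported,
    /// The backend supports content types but the write failed.
    ContentTypeWriteFailed(String),
}

/// Hypothetical abstraction over "can this backend persist a content type?".
trait ContentTypeSink {
    fn supports_content_type(&self) -> bool;
    fn write_content_type(&mut self, content_type: &str) -> Result<(), String>;
}

/// Apply the requested content type, if any.
/// - `None`: nothing happens, so existing behaviour is unchanged.
/// - `Some(_)` on an unsupported backend: explicit error, never silently dropped.
/// - `Some(_)` on a supported backend: the write result is propagated.
fn apply_content_type<S: ContentTypeSink>(
    sink: &mut S,
    requested: Option<&str>,
) -> Result<(), PutError> {
    match requested {
        None => Ok(()),
        Some(_) if !sink.supports_content_type() => Err(PutError::ContentTypeUnsupported),
        Some(ct) => sink
            .write_content_type(ct)
            .map_err(PutError::ContentTypeWriteFailed),
    }
}

/// In-memory stand-in for a filesystem with or without xattr support.
struct MemorySink {
    supported: bool,
    stored: Option<String>,
}

impl ContentTypeSink for MemorySink {
    fn supports_content_type(&self) -> bool {
        self.supported
    }

    fn write_content_type(&mut self, content_type: &str) -> Result<(), String> {
        self.stored = Some(content_type.to_string());
        Ok(())
    }
}
```

The point of the `None` arm is that callers who never ask for a content type cannot hit the new error path.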

@flokli flokli changed the title from "object_store: allow setting content-type per request type" to "object_store: allow setting content-type per request" Jan 24, 2024
@tustvold
Contributor

tustvold commented Jan 24, 2024

> Encoding this metadata in the object path itself means I need to effectively check two different keys

Fair enough, if you aren't listing and are instead going to the path directly, this would be harder.

> we would not populate the field for list, and only on a get request

One option might be to include the UserMetadata on GetResult instead of ObjectMeta 🤔

> Regarding LocalFilesystem, we could be using xattrs for this where available

Yes, that would be the plan

Edit: I'll try to find some time later this week to prototype some things

@flokli
Author

flokli commented Jan 24, 2024

Awesome, feel free to poke me on a draft PR or something, happy to give feedback and test things out!

@Xuanwo
Member

Xuanwo commented Jan 26, 2024

> One option might be to include the UserMetadata on GetResult instead of ObjectMeta 🤔

Hi, I think we should handle content_type separately from user metadata. Object storage services typically categorize metadata such as content-type and content-disposition as system metadata, offering distinct headers for setting custom user metadata.

Storing content_type within UserMetadata could lead to confusion between the standard HTTP header Content-Type and Amazon's specific header x-amz-meta-content-type (although that is unlikely to happen).

@tustvold
Contributor

tustvold commented Jan 26, 2024

Yes the design needs to support:

  • User metadata
  • Standard headers, e.g. Cache-Control, Content-Type
  • Store specific headers, e.g. AWS KMS headers

I am iterating on such a design currently
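For illustration only, one way the three categories could be kept apart is a small attribute enum that is rendered to headers per store. This is a sketch, not the design that was eventually merged: `AttributeSketch` and `s3_headers` are hypothetical names, and the only assumption about S3 is its `x-amz-meta-` prefix for user metadata.

```rust
use std::collections::BTreeMap;

/// Hypothetical attribute key covering the three categories listed above.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum AttributeSketch {
    /// Standard header: Content-Type.
    ContentType,
    /// Standard header: Cache-Control.
    CacheControl,
    /// User metadata, identified by a caller-chosen key.
    UserMetadata(String),
    /// Store-specific header, passed through verbatim.
    StoreSpecific(String),
}

impl AttributeSketch {
    /// Header name this attribute would map to on an S3-style store.
    fn s3_header_name(&self) -> String {
        match self {
            Self::ContentType => "Content-Type".to_string(),
            Self::CacheControl => "Cache-Control".to_string(),
            // User metadata gets the store's prefix, so it cannot collide
            // with a standard header of the same name.
            Self::UserMetadata(key) => format!("x-amz-meta-{key}"),
            Self::StoreSpecific(header) => header.clone(),
        }
    }
}

/// Render a set of attributes as (header name, value) pairs.
fn s3_headers(attrs: &BTreeMap<AttributeSketch, String>) -> Vec<(String, String)> {
    attrs
        .iter()
        .map(|(attr, value)| (attr.s3_header_name(), value.clone()))
        .collect()
}
```

Because `ContentType` and `UserMetadata("content-type")` are different keys, they render to `Content-Type` and `x-amz-meta-content-type` respectively and can coexist without ambiguity.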

tustvold added a commit to tustvold/arrow-rs that referenced this issue Apr 15, 2024
tustvold added a commit that referenced this issue Apr 16, 2024
* Add Attributes API (#5329)

* Clippy

* Emulator test tweaks
@tustvold tustvold added the object-store Object Store Interface label Apr 17, 2024
@tustvold
Contributor

label_issue.py automatically added labels {'object-store'} from #5650


3 participants