Support User-Defined Object Metadata #4754

Closed
tustvold opened this issue Aug 30, 2023 · 22 comments · Fixed by #4999 or #5915
Labels
enhancement Any new improvement worthy of a entry in the changelog object-store Object Store Interface

Comments

@tustvold
Contributor

tustvold commented Aug 30, 2023

This is a draft proposal, and likely needs more polish

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Many stores provide the ability to associate arbitrary user-defined attributes with objects; it would be useful to expose this.

Describe the solution you'd like

I would like to propose a new put_opts call, in a similar vein to the existing get_opts. This would take a PutOptions

pub struct PutOptions {
    pub metadata: HashMap<String, String>
}

Stores that can't store metadata should return an error if passed metadata, and ObjectMeta should be updated to include such metadata.
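As a minimal sketch of the "return an error if passed metadata" rule, assuming a synchronous stand-in rather than the crate's real async trait: `NoMetadataStore`, `PutError`, and the simplified `put_opts` signature below are hypothetical names introduced only for illustration.

```rust
use std::collections::HashMap;

/// Options for a put request, mirroring the struct proposed above.
#[derive(Debug, Default, Clone)]
pub struct PutOptions {
    pub metadata: HashMap<String, String>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum PutError {
    MetadataNotSupported,
}

/// Hypothetical backend that has nowhere to persist user metadata.
pub struct NoMetadataStore;

impl NoMetadataStore {
    /// Rejects any request that carries metadata instead of silently dropping it.
    pub fn put_opts(
        &self,
        _location: &str,
        _bytes: &[u8],
        opts: PutOptions,
    ) -> Result<(), PutError> {
        if opts.metadata.is_empty() {
            Ok(())
        } else {
            Err(PutError::MetadataNotSupported)
        }
    }
}
```

Failing loudly means a caller never believes metadata was stored when it was not; the trade-off is that callers must know which backend they are talking to.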

Unix systems can likely make use of xattr to store user metadata

We will likely need to restrict the key names in some manner
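One way such a restriction could look, purely as an assumption: the rule below (1 to 64 bytes, ASCII lowercase letters, digits, `-` or `_`) is not taken from any provider's documentation or from the crate, and `is_valid_metadata_key` is a hypothetical helper.

```rust
/// Hypothetical key rule for this sketch: 1..=64 bytes, consisting only of
/// ASCII lowercase letters, ASCII digits, '-' or '_'.
pub fn is_valid_metadata_key(key: &str) -> bool {
    !key.is_empty()
        && key.len() <= 64
        && key
            .bytes()
            .all(|b| b.is_ascii_lowercase() || b.is_ascii_digit() || b == b'-' || b == b'_')
}
```

A deliberately narrow character set like this is more likely to be accepted unchanged by several backends, at the cost of rejecting keys that some individual stores would allow.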

Describe alternatives you've considered

Additional context

#4498 also calls for some sort of put_opts style API

#4753 would benefit from this functionality

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Aug 30, 2023
@tustvold
Contributor Author

A further wrinkle is that many of the listing APIs do not return this metadata

@thinkharderdev
Contributor

We need this somewhat urgently (can hack around it for now but would like to unhack it asap) so I can work on this.

@tustvold
Contributor Author

tustvold commented Oct 20, 2023

Can you perhaps expand on your use-case? I'm not sure about the API as originally proposed by this ticket, and was instead considering providing a mechanism similar to what we provide for content type.

@thinkharderdev
Contributor

We need to read/write object tags from S3 (and soon other cloud providers). I was planning on spending some time looking at the relevant cloud provider APIs and seeing what a reasonable way to do this would be. I know with S3 at least it's a little bit annoying, as you can set tags in the PutObject calls but neither GetObject nor ListObjects returns the tags.

@tustvold
Contributor Author

read/write objects tag

As in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html or metadata - https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingMetadata.html

They're separate things, and part of why I'm not sure about exposing this

We need to read/write object tags from S3

  • Can you provide any context on why you need to read tags?
  • Are the tags you wish to write static or do they vary based on request?
  • If they vary do they do so based on path or extension in a predictable manner?

@thinkharderdev
Contributor

Object tagging as in https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-tagging.html

Can you provide any context on why you need to read tags?

We use tags to drive retention policies

Are the tags you wish to write static or do they vary based on request?

There is a static set of tags but which tags get applied to any given object is dynamic

If they vary do they do so based on path or extension in a predictable manner?

No, it would not be possible to do this based on some static rules. It would have to be a mechanism that allows tagging of individual put requests.

I'm also a little hesitant to try and abstract this as there are a lot of subtle differences between APIs so it would be a little bit hard to make sure the default ObjectStore implementations work across providers. That said, adding a maximally flexible API interface at least allows custom implementations that can do whatever they want. So something as simple as what you proposed in the ticket might be ok even if the exact semantics are not consistent across different object storage APIs.

Alternatively, maybe we could punt on the whole issue by providing a canonical way to extend the ObjectStore interface. Something like (and just spitballing here :))

use std::any::Any;

pub trait ObjectStoreExt {
  // Just need to allow for downcasting to the concrete type
  fn as_any(&self) -> &dyn Any;
}


pub trait ObjectStore {
  type Ext: ObjectStoreExt;

  fn extension(&self) -> Self::Ext;
}

Then there could be standard extensions in the default impl:

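pub struct AwsObjectStoreExt { /* fields elided */ }

impl AwsObjectStoreExt {
  // Bodies elided; only the shape of the API matters here
  pub async fn get_tags(&self, path: &Path) -> Result<HashMap<String, String>> { todo!() }

  pub async fn put_tags(&self, path: &Path, tags: &HashMap<String, String>) { todo!() }
}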
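The downcasting idea above can be sketched with only the standard library. This is an assumption-laden simplification: it drops the associated `Ext` type in favour of returning `&dyn Any` directly (so the trait stays usable as a trait object), it is synchronous, and `FakeS3`, `LocalFs`, and the in-memory tag map are hypothetical stand-ins, not real store types.

```rust
use std::any::Any;
use std::collections::HashMap;

/// Provider-specific operations (hypothetical; only the name follows the comment above).
#[derive(Default)]
pub struct AwsObjectStoreExt {
    tags: HashMap<String, HashMap<String, String>>,
}

impl AwsObjectStoreExt {
    pub fn get_tags(&self, path: &str) -> Option<&HashMap<String, String>> {
        self.tags.get(path)
    }

    pub fn put_tags(&mut self, path: &str, tags: HashMap<String, String>) {
        self.tags.insert(path.to_string(), tags);
    }
}

/// Object-safe hook for reaching provider-specific functionality.
pub trait ObjectStore {
    fn extension(&self) -> &dyn Any;
}

/// Hypothetical S3-like store that exposes the AWS extension.
pub struct FakeS3 {
    pub ext: AwsObjectStoreExt,
}

impl ObjectStore for FakeS3 {
    fn extension(&self) -> &dyn Any {
        &self.ext
    }
}

/// Hypothetical local store with no extension; it hands back the unit type.
pub struct LocalFs;

impl ObjectStore for LocalFs {
    fn extension(&self) -> &dyn Any {
        &()
    }
}
```

Callers holding a `Box<dyn ObjectStore>` use `downcast_ref::<AwsObjectStoreExt>()` on the result and get `None` from stores that do not provide that extension.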

@tustvold
Contributor Author

I'm also a little hesitant to try and abstract this as there are a lot of subtle differences between APIs

Yeah, GCS doesn't even have a notion of tags, only metadata 😄

Alternatively, maybe we could punt on the whole issue by providing a canonical way to extend the ObjectStore interface.

I mean it isn't ideal but we do provide https://docs.rs/object_store/latest/object_store/aws/struct.AmazonS3.html#method.credentials and https://docs.rs/object_store/latest/object_store/aws/struct.AwsAuthorizer.html which would let you fairly easily construct your own requests, including https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectTagging.html

@thinkharderdev
Contributor

Yeah, GCS doesn't even have a notion of tags, only metadata

Right, but it may not really be an issue as long as the semantics are internally consistent within a provider. When it's unclear where to put the metadata (like in the case of AWS) that should be manageable through configuration.

It's annoying that semantics are different between providers but that is what it is. I think something like:

pub struct PutOptions {
    pub metadata: HashMap<String, String>
}

pub struct ObjectMeta {
    /// The full path to the object
    pub location: Path,
    /// The last modified time
    pub last_modified: DateTime<Utc>,
    /// The size in bytes of the object
    pub size: usize,
    /// The unique identifier for the object
    ///
    /// <https://datatracker.ietf.org/doc/html/rfc9110#name-etag>
    pub e_tag: Option<String>,
    /// A version indicator for this object
    pub version: Option<String>,
    /// Key/Value metadata for this object
    pub metadata: HashMap<String,String>
}

trait ObjectStore {

  async fn put_opt(&self, location: &Path, bytes: Bytes, options: PutOptions) -> Result<PutResult>;

  async fn get_metadata(&self, location: &Path) -> Result<HashMap<String, String>>;
}

where ObjectStore::get_metadata can be used to fetch metadata which isn't included in regular Get or List requests (like with S3).
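To make the split between listing and metadata retrieval concrete, here is a small in-memory sketch. It is synchronous and uses `String` paths for brevity; `InMemory` and its `list` method are hypothetical and only mimic a store whose list API omits metadata.

```rust
use std::collections::{BTreeMap, HashMap};

#[derive(Debug, Default, Clone)]
pub struct PutOptions {
    pub metadata: HashMap<String, String>,
}

/// In-memory stand-in for a store (hypothetical, synchronous for brevity).
#[derive(Default)]
pub struct InMemory {
    objects: BTreeMap<String, (Vec<u8>, HashMap<String, String>)>,
}

impl InMemory {
    /// Stores the bytes together with whatever metadata the caller supplied.
    pub fn put_opt(&mut self, location: &str, bytes: Vec<u8>, options: PutOptions) {
        self.objects
            .insert(location.to_string(), (bytes, options.metadata));
    }

    /// Listing returns only paths and sizes, mimicking list APIs that omit metadata.
    pub fn list(&self) -> Vec<(String, usize)> {
        self.objects
            .iter()
            .map(|(path, (bytes, _))| (path.clone(), bytes.len()))
            .collect()
    }

    /// Separate call for metadata; `None` if the object does not exist.
    pub fn get_metadata(&self, location: &str) -> Option<&HashMap<String, String>> {
        self.objects.get(location).map(|(_, metadata)| metadata)
    }
}
```

With a real remote store the separate call would cost an extra request per object, which is the overhead the surrounding discussion is weighing.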

@tustvold
Contributor Author

retention policy

Are you referring to https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html or some custom system? I'm mainly interested in the importance of being able to read them, as writing has a lot more potential options for achieving it that don't leak into the ObjectStore trait

Right, but it may not really be an issue as long as the semantics are internally consistent within a provider

Apart from this crate goes to great lengths to try to provide an API that is consistent across providers... 😅

@thinkharderdev
Contributor

Are you referring to https://docs.aws.amazon.com/AmazonS3/latest/userguide/intro-lifecycle-rules.html or some custom system?

Both. The data is in customer buckets and we add tags so they can manage their own retention. How they do that is up to them, we just provide the tags.

Currently we only need to write them. We can obviously work around that (and will in the immediate term) without involving the ObjectStore trait but it would be nice if we didn't have to as associating metadata with objects is used in a lot of applications.

Apart from this crate goes to great lengths to try to provide an API that is consistent across providers... 😅

Yeah, agreed but the APIs are what they are :). So we can either provide a consistent API which always works the same across providers by always doing additional API calls to grab metadata/tags (which seems like a bad idea). Or we can make the semantics around metadata depend on the provider.

Or of course we can do neither and just say that if we can't provide consistent semantics because of provider API differences then it's not going to be exposed in the ObjectStore interface. But IMO that ship has already sailed. We have ObjectStore::append even though S3 and GCS don't support append operations at all and on Azure you can only append to objects that were created as append blobs to begin with.

@tustvold
Contributor Author

We have ObjectStore::append even though S3 and GCS don't support append operations at all

This is not something we should be following, I fought very hard to not include that, and I am increasingly of the opinion we should remove it.

Or we can make the semantics around metadata depend on the provider.

Or a third option is to make these details specified at the point of creation of the ObjectStore, e.g. via some middleware system or otherwise. That way if people have requirements outside the ObjectStore trait, they can plug in at that point.

@thinkharderdev
Contributor

That way if people have requirements outside the ObjectStore trait, they can plug in at that point

This would all be much easier if we didn't also have to deal with local filesystems :)

I'm leaning more and more towards some sort of extension mechanism. Either exposing the inner client so you can just make arbitrary API calls outside the ObjectStore interface or an extension type that can expose "extra" API operations.

@tustvold
Contributor Author

tustvold commented Oct 20, 2023

I think adding a tags block to PutOptions that is simply ignored by backends that don't support it seems harmless to me.

I'm in the process of adding conditional put support and so will sequence this after that

@tustvold
Contributor Author

tustvold commented Oct 26, 2023

Turns out Azure doesn't even support this consistently... But then again Azure does seem to specialize in inconsistent APIs...

Specified feature is not yet supported for hierarchical namespace accounts

Edit: https://learn.microsoft.com/en-us/azure/storage/blobs/storage-feature-support-in-storage-accounts

@tustvold
Contributor Author

tustvold commented Oct 26, 2023

Having played around with this I'm unsure how to support this consistently: stores have different restrictions on what values are valid, and support for this across the stores is wildly inconsistent, even between stores from the same provider...

Taking a step back, could your use-case encode the lifecycle details in the path of the object instead?

@thinkharderdev
Contributor

Taking a step back, could your use-case encode the lifecycle details in the path of the object instead?

No, ultimately it's not up to us (this was a solution in place before us and would be monumentally complex to change).

Having played around with this I'm unsure how to support this consistently: stores have different restrictions on what values are valid, and support for this across the stores is wildly inconsistent, even between stores from the same provider...

Why is this a problem? If a user adds incorrect metadata (values which are not allowed for whatever reason by the particular provider) then they get an error. It's no different than (for example) writing a multipart file to S3, in which case chunks need to be at least 5 MiB (except for the last one). But the same limitation obviously wouldn't apply to local file systems. So at some level you have to know which provider you are using and what the individual semantics are.

@tustvold
Contributor Author

It's no different than (for example) writing a multipart file to S3, in which case chunks need to be at least 5 MiB (except for the last one). But the same limitation obviously wouldn't apply to local file systems. So at some level you have to know which provider you are using and what the individual semantics are.

Because in general we try to hide these incompatibilities from you, you can't write to funky paths, the chunking for multipart upload is done for you, etc... We could add TagSets to the crate, and I have a mostly complete PR that does this, but it just seems strange to add something to the ObjectStore trait that is supported by only 1 and a half stores...
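As an illustration of hiding the part-size rule from callers, the sketch below splits a buffer so that every part except the last meets S3's documented 5 MiB minimum. `split_parts` and `MIN_PART_SIZE` are hypothetical names; this is not how the crate implements multipart upload.

```rust
/// 5 MiB, the S3 minimum size for every part except the last.
pub const MIN_PART_SIZE: usize = 5 * 1024 * 1024;

/// Splits `data` into upload parts, raising a too-small requested size to the minimum.
/// Every returned part except possibly the last has exactly the effective part size.
pub fn split_parts(data: &[u8], requested_part_size: usize) -> Vec<&[u8]> {
    let part_size = requested_part_size.max(MIN_PART_SIZE);
    data.chunks(part_size).collect()
}
```

Clamping inside the helper means a caller who asks for tiny parts still produces a valid upload instead of receiving a provider error.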

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 26, 2023
@thinkharderdev
Contributor

It's no different than (for example) writing a multipart file to S3, in which case chunks need to be at least 5 MiB (except for the last one). But the same limitation obviously wouldn't apply to local file systems. So at some level you have to know which provider you are using and what the individual semantics are.

Because in general we try to hide these incompatibilities from you, you can't write to funky paths, the chunking for multipart upload is done for you, etc... We could add TagSets to the crate, and I have a mostly complete PR that does this, but it just seems strange to add something to the ObjectStore trait that is supported by only 1 and a half stores...

Right, and I think it's a good idea to try and hide the incompatibilities, but if the only way to do that is to not add the functionality at all then it may be better to just expose the incompatibilities and let users deal with it. I guess the "proper" way to do this would be through traits. You could have the base ObjectStore trait expose the minimal API surface area that every provider can implement, and then have other traits for stuff not supported by all providers (ObjectAppend, ObjectMetadata, etc). This is a little awkward for upstream projects like DataFusion which tend to pass around Arc<dyn ObjectStore>, but maybe this can be handled dynamically as well. So maybe something like

trait ObjectAppend: ObjectStore {
  async fn append(&self, location: &Path, bytes: Bytes) -> Result<()>;
}

trait ObjectStore {
   // ... regular methods

   fn as_append(&self) -> Option<Arc<dyn ObjectAppend>>;

   // or, once RPIT in traits lands on stable:
   // fn as_append(&self) -> Option<&impl ObjectAppend>;
}
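A runnable approximation of this capability-trait pattern, under stated assumptions: it is synchronous, returns a borrowed `&dyn ObjectAppend` rather than an `Arc`, and `AppendableStore`, `ReadOnlyStore`, and `try_append` are hypothetical names used only to show the dynamic lookup through `Arc<dyn ObjectStore>`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Optional capability, implemented only by stores that can append.
pub trait ObjectAppend: ObjectStore {
    fn append(&self, location: &str, bytes: &[u8]);
}

/// Minimal base trait every store implements.
pub trait ObjectStore {
    fn get(&self, location: &str) -> Option<Vec<u8>>;

    /// Defaults to `None`, meaning the capability is not supported.
    fn as_append(&self) -> Option<&dyn ObjectAppend> {
        None
    }
}

/// Hypothetical store that supports appends.
#[derive(Default)]
pub struct AppendableStore {
    data: Mutex<HashMap<String, Vec<u8>>>,
}

impl ObjectStore for AppendableStore {
    fn get(&self, location: &str) -> Option<Vec<u8>> {
        self.data.lock().unwrap().get(location).cloned()
    }

    fn as_append(&self) -> Option<&dyn ObjectAppend> {
        Some(self)
    }
}

impl ObjectAppend for AppendableStore {
    fn append(&self, location: &str, bytes: &[u8]) {
        let mut data = self.data.lock().unwrap();
        data.entry(location.to_string())
            .or_default()
            .extend_from_slice(bytes);
    }
}

/// Hypothetical store without append support; it relies on the default `as_append`.
pub struct ReadOnlyStore;

impl ObjectStore for ReadOnlyStore {
    fn get(&self, _location: &str) -> Option<Vec<u8>> {
        None
    }
}

/// Appends if the store supports it; returns whether the append happened.
pub fn try_append(store: &Arc<dyn ObjectStore>, location: &str, bytes: &[u8]) -> bool {
    match store.as_append() {
        Some(appender) => {
            appender.append(location, bytes);
            true
        }
        None => false,
    }
}
```

Code that only holds an `Arc<dyn ObjectStore>` can probe for the capability at runtime, which addresses the DataFusion concern raised above without widening the base trait.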

@tustvold
Contributor Author

Yeah, that's the approach we've taken for functionality that is disjoint, e.g. the MultiPartStore and Signer traits. This is a bit of a funny one because it is additive to existing functionality, which makes adding a separate trait a bit cumbersome, as you'll have to duplicate your write logic.

My current plan is to proceed with the approach in #4999. Provided we add a config option to ignore tags, I think we'll be fine, and it will allow people to always write the tags and just have them ignored if not supported
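The "always write tags, ignore them if unsupported" behaviour could look roughly like the sketch below. Everything here is an assumption for illustration: `TagSet`, `Config`, `tagging_header`, and the hand-rolled percent-encoding are hypothetical and are not the API added in #4999.

```rust
/// Hypothetical set of tags, kept as a URL-encoded `key=value&key=value` string.
#[derive(Debug, Default, Clone)]
pub struct TagSet(String);

impl TagSet {
    /// Appends one percent-encoded key/value pair.
    pub fn push(&mut self, key: &str, value: &str) {
        if !self.0.is_empty() {
            self.0.push('&');
        }
        percent_encode(key, &mut self.0);
        self.0.push('=');
        percent_encode(value, &mut self.0);
    }

    pub fn encoded(&self) -> &str {
        &self.0
    }
}

/// Percent-encodes every byte outside the URL unreserved set.
fn percent_encode(input: &str, out: &mut String) {
    for b in input.bytes() {
        if b.is_ascii_alphanumeric() || matches!(b, b'-' | b'_' | b'.' | b'~') {
            out.push(b as char);
        } else {
            out.push_str(&format!("%{:02X}", b));
        }
    }
}

/// Hypothetical per-store configuration with an opt-out for tagging.
pub struct Config {
    pub disable_tagging: bool,
}

/// Returns the value to send with a put request, or `None` when tagging is
/// disabled or there are no tags, so callers can always pass their tags.
pub fn tagging_header(config: &Config, tags: &TagSet) -> Option<String> {
    if config.disable_tagging || tags.encoded().is_empty() {
        None
    } else {
        Some(tags.encoded().to_string())
    }
}
```

Putting the switch in store configuration keeps call sites identical across backends; only the code that constructs the store needs to know whether tags are supported.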

tustvold added a commit to tustvold/arrow-rs that referenced this issue Oct 27, 2023
tustvold added a commit that referenced this issue Oct 30, 2023
* Object tagging (#4754)

* Allow disabling tagging

* Rename to disable_tagging
@tustvold tustvold added the object-store Object Store Interface label Nov 2, 2023
@tustvold
Contributor Author

tustvold commented Nov 2, 2023

label_issue.py automatically added labels {'object-store'} from #4999

@criccomini
Contributor

criccomini commented Jun 18, 2024

Checking in here. I would like to refocus this ticket on User-Defined Metadata (not tags) as the title suggests. Much of the discussion is around object tags, which are a separate thing.

For User-Defined Metadata, I would like to implement a new Attribute called Metadata(String) that allows users to specify attributes in their put requests.

For get requests, I propose we expose the user-defined metadata the same way as other attributes, as part of the Attribute object. This could be somewhat confusing to users since there's already a meta: ObjectMeta in GetResult. I am open to alternative suggestions, but my proposed approach mirrors the way other attributes behave.
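A rough sketch of the proposed shape, assuming a simple map from attribute to value: the `Attributes` wrapper, the `ContentType` variant, and the `user_metadata` helper are hypothetical illustrations and may differ from what the eventual patch implements.

```rust
use std::collections::HashMap;

/// Attribute keys; `Metadata(String)` carries the user-defined key name.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Attribute {
    ContentType,
    /// User-defined metadata entry, keyed by name.
    Metadata(String),
}

/// Hypothetical attribute map attached to put requests and get results.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Attributes(HashMap<Attribute, String>);

impl Attributes {
    /// Inserts a value, returning the previous value for that attribute if any.
    pub fn insert(&mut self, key: Attribute, value: impl Into<String>) -> Option<String> {
        self.0.insert(key, value.into())
    }

    pub fn get(&self, key: &Attribute) -> Option<&str> {
        self.0.get(key).map(String::as_str)
    }

    /// Iterates over only the user-defined entries as (name, value) pairs.
    pub fn user_metadata(&self) -> impl Iterator<Item = (&str, &str)> {
        self.0.iter().filter_map(|(key, value)| match key {
            Attribute::Metadata(name) => Some((name.as_str(), value.as_str())),
            _ => None,
        })
    }
}
```

Because user metadata travels in the same map as the built-in attributes, put and get paths need no new parameters; the cost is the naming overlap with `ObjectMeta` noted above.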

If no one objects, I would be happy to try and submit a patch for this. I talked to @Xuanwo about this briefly on Twitter and it sounds like no one is actively working on it.

@criccomini
Contributor

I've posted a PR for user-defined metadata here:

#5915
