Tracking issues of RFC: Object Versioning #2611

Open
2 of 8 tasks
Tracked by #4841
suyanhanx opened this issue Jul 9, 2023 · 23 comments

Comments

@suyanhanx
Member

suyanhanx commented Jul 9, 2023

Some storage services, such as Amazon S3, have built-in support for versioning.

This is achieved through a feature called ObjectVersion, which allows the same object to exist in multiple versions and be accessed separately even after deletion. With this feature, users can ensure the safety of their data by rolling back to previous versions in case of unintended deletions or changes.
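To make the behavior concrete, here is a minimal in-memory sketch of how a versioning-enabled bucket behaves. It is a toy model, not a client for any real service: `VersionedBucket`, its methods, and the counter-based version IDs are all assumptions for illustration (real services return opaque IDs). It shows that a plain delete only adds a delete marker, so older versions stay readable by ID.

```python
from __future__ import annotations

import itertools
from dataclasses import dataclass, field


@dataclass
class Version:
    version_id: str
    data: bytes | None  # None marks a delete marker


@dataclass
class VersionedBucket:
    """Toy in-memory model of a versioning-enabled bucket (hypothetical)."""

    _objects: dict[str, list[Version]] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def _next_id(self) -> str:
        # Real services return opaque IDs; a counter keeps the sketch readable.
        return f"v{next(self._ids)}"

    def put(self, key: str, data: bytes) -> str:
        version = Version(self._next_id(), data)
        self._objects.setdefault(key, []).append(version)
        return version.version_id

    def delete(self, key: str) -> str:
        # A plain delete hides the object behind a marker; old versions remain.
        marker = Version(self._next_id(), None)
        self._objects.setdefault(key, []).append(marker)
        return marker.version_id

    def get(self, key: str, version_id: str | None = None) -> bytes | None:
        versions = self._objects.get(key, [])
        if version_id is None:
            # Without an ID, the newest entry wins (possibly a delete marker).
            return versions[-1].data if versions else None
        for version in versions:
            if version.version_id == version_id:
                return version.data
        return None


bucket = VersionedBucket()
first = bucket.put("photo.jpg", b"first")
bucket.put("photo.jpg", b"second")
bucket.delete("photo.jpg")
print(bucket.get("photo.jpg"))         # None
print(bucket.get("photo.jpg", first))  # b'first'
```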


To implement object versioning in OpenDAL, the following tasks need to be done:

@Xuanwo
Member

Xuanwo commented Jul 11, 2023

cc @drmingdrmer, would you like to try this feature after we release it?

@Xuanwo changed the title from "Tracking issues of object versioning" to "Tracking issues of RFC: Object Versioning" on Jul 11, 2023
@drmingdrmer

> cc @drmingdrmer, would you like to try this feature after we release it?

Hm... I do not see any feature in my schedule that will be using versioning.

@prabirshrestha

I'm working on a Nextcloud alternative (still several months from open sourcing) using OpenDAL and I'm very interested to see this feature land. I'm primarily interested in the local file system though. For example, I have servers running ZFS and BTRFS (Synology NAS), so being able to do native versioning on these would be great.

https://gist.github.com/CMCDragonkai/1a4860671145b295fe7a4d8bc3968e87
https://kb.synology.com/en-us/DSM/help/SynologyDrive/drive_file_management?version=7#historical

@Xuanwo
Member

Xuanwo commented Jul 25, 2023

> I'm working on a Nextcloud alternative (still several months from open sourcing) using OpenDAL and I'm very interested to see this feature land.

Thanks for using OpenDAL! And looking forward to your project!

> I'm primarily interested in the local file system though.

I am considering adding version support for local file systems. However, I have encountered a problem: POSIX file systems do not include concepts related to versions. This means that I cannot use the POSIX file system API to read or delete a specific version of a file in the file system.

@prabirshrestha

I don't think POSIX or any other OS will ever support a generic version interface, since some systems use the concept of versioning while others use snapshots. ZFS and BTRFS use snapshots. I personally have snapshots running on my machines every 15 minutes, with old snapshots automatically rotated out. You can see a sample config for my dev Arch Linux machine here and for my Ubuntu server here. Then, with tools like httm, we can look at the different versions.

To start, you could assume that admins take care of snapshotting, while OpenDAL provides a way to view versions and to fetch a file at a particular version. A later feature could be to implement snapshot creation natively in OpenDAL.

// Hypothetical types, sketching the idea only:
let zfs = SnapshotFs::new(ZfsSnapshotManager::new(), OsFs::new());
let btrfs = SnapshotFs::new(BtrFSSnapshotManager::new(), OsFs::new());

Snapshots and versioning in S3 and other file systems seem related to me, so it would be good to think about how both could be covered here.
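As a rough illustration of the "admins take snapshots, the library only reads them" idea, the sketch below lists the versions of one file across ZFS snapshots. It relies on ZFS exposing snapshots under `<dataset mountpoint>/.zfs/snapshot/<name>/`. The function names are hypothetical, and Btrfs has no fixed snapshot location, so it would need a configured directory instead.

```python
from __future__ import annotations

from pathlib import Path


def list_snapshot_versions(dataset_root: Path, rel_path: str) -> list[tuple[str, Path]]:
    """Return (snapshot_name, path) for every snapshot that contains rel_path."""
    snapshot_dir = dataset_root / ".zfs" / "snapshot"
    if not snapshot_dir.is_dir():
        return []
    versions = []
    # Sorted by snapshot name; this is chronological only if the names
    # embed a sortable timestamp, which is an assumption of this sketch.
    for snap in sorted(snapshot_dir.iterdir()):
        candidate = snap / rel_path
        if candidate.is_file():
            versions.append((snap.name, candidate))
    return versions


def read_version(dataset_root: Path, rel_path: str, snapshot: str) -> bytes:
    """Read the file content as it was when the given snapshot was taken."""
    return (dataset_root / ".zfs" / "snapshot" / snapshot / rel_path).read_bytes()
```

In this model the snapshot name plays the role of the version ID, and listing versions is a directory walk rather than a service call.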

I want to use my app to store important data and photos, so having some sort of versioning is critical. For now I have been thinking that, as the admin, I can just revert files myself, but being able to expose this natively in the app, if OpenDAL makes it easy, would be great so I don't need to be the middleman :).

@PsiACE
Member

PsiACE commented Sep 24, 2023

https://x.com/criccomini/status/1705263488489394470


I think once we support object versioning and expose the corresponding methods in Python, we can provide support.

@PsiACE
Member

PsiACE commented Sep 24, 2023

cc @criccomini Because we lacked clear user requirements, progress on this feature has been slow. If you are willing to share some suggestions, we should be able to release an initial implementation quickly.

@criccomini

criccomini commented Sep 24, 2023

Sure! I'll give you a concrete example. I want to build a storage layer for https://github.com/recap-build/recap. I'd like to provide four operations: ls, get, put, delete.

    def ls(self, path: str | None = None, clock: int | None = None) -> list[str]: ...
    def write(
        self,
        path: str,
        val: str,
        clock: int | None = None,
    ) -> None: ...
    def read(
        self,
        path: str,
        clock: int | None = None,
    ) -> str | None: ...
    def delete(self, path: str, clock: int | None = None) -> None: ...

The path param is a path like /foo/bar/baz/blah or /foo/bar/baz/blah/some_file.json. ls should list all child objects of a path (not nested, just the immediate children).

All four operations should support a clock attribute. For read operations (ls, get) the response should return values "as of" the clock point in time. For write operations (put, delete), the clock should be used to persist or delete the value "as of" the clock point in time.

The clock attribute should be an int that supports either UTC timestamps in milliseconds or monotonically increasing version numbers (1, 2, 3, 4...).
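The "as of" rule can be sketched as a lookup over the sorted clocks recorded for one path. This is only an illustration: the helper name `as_of` is hypothetical, and treating `clock=None` as "latest" is an assumption. Because it only compares integers, it works the same for millisecond timestamps and for version counters.

```python
from __future__ import annotations

from bisect import bisect_right


def as_of(clocks: list[int], clock: int | None) -> int | None:
    """Pick the newest entry of a sorted clock list that is <= clock.

    clock=None is treated as "latest". Returns None if nothing existed yet.
    """
    if not clocks:
        return None
    if clock is None:
        return clocks[-1]
    # bisect_right returns the number of entries <= clock.
    index = bisect_right(clocks, clock)
    return clocks[index - 1] if index else None


versions = [1695404868067, 1695424005390, 1695424006572]
print(as_of(versions, 1695424005390))  # 1695424005390
print(as_of(versions, 1695404868066))  # None
print(as_of(versions, None))           # 1695424006572
```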

When all files in a "directory" are deleted, the "directory" should automatically disappear. This mimics object stores like S3.

I'd like this to work for S3, GCS, Azure Blob Store, and local FS.

Local FS is particularly complex since it doesn't have versioning. I experimented with this a bit. What I had was a completely flat single directory, with a URL-quoted path and a clock suffix as the name of each file:

file%3A%2F%2F%2Fa%2Fb%2Fc%2Fd.1695404868067
file%3A%2F%2F%2Fa%2Fb%2Fc%2Fd.1695424005390
file%3A%2F%2F%2Fa%2Fb%2Fc%2Fd.1695424006572
file%3A%2F%2F%2Ffoo%2Fbar%2Fbaz.1695404235945
file%3A%2F%2F%2Ffoo%2Fbar.1695403655860
file%3A%2F%2F%2Ffoo%2Fbar.1695403659922
file%3A%2F%2F%2Ffoo%2Fbaraaa.1695404637710
file%3A%2F%2F%2Fsomewhere.1695404860466

NOTE: In the example I'm using URLs as the paths, but that is not a requirement for OpenDAL.

This implementation works, and the write operations are O(1). Read operations slow down as the number of files increases, but that is fine for my use case. Caching and binary search (bisect) could be used to speed up the read operations, but I didn't bother with that.

An implementation where the clock is ignored on local FS would be acceptable to me if you decide implementing local versioning would be too complex.
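The flat-directory layout described above could look roughly like the sketch below. It is a reconstruction under stated assumptions, not the author's actual code: the class name, the one-byte `V`/`D` payload prefix used to record deletions as tombstones, and the use of the current time in milliseconds when no clock is given are all made up for illustration. Reads and listings scan the whole directory, matching the "reads slow down as files grow" trade-off.

```python
from __future__ import annotations

import os
import time
from urllib.parse import quote, unquote


class FlatVersionedDir:
    """Sketch: one flat directory, file name = URL-quoted path + "." + clock."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _put(self, path: str, payload: bytes, clock: int | None) -> None:
        # Each write creates one new file, so no existing data is rewritten.
        clock = int(time.time() * 1000) if clock is None else clock
        name = f"{quote(path, safe='')}.{clock}"
        with open(os.path.join(self.root, name), "wb") as f:
            f.write(payload)

    def write(self, path: str, val: str, clock: int | None = None) -> None:
        self._put(path, b"V" + val.encode(), clock)  # "V" prefix: a value

    def delete(self, path: str, clock: int | None = None) -> None:
        self._put(path, b"D", clock)  # "D": tombstone (assumption of this sketch)

    def _clocks(self) -> dict[str, list[int]]:
        # Linear scan of the directory: decode every name into path and clock.
        found: dict[str, list[int]] = {}
        for name in os.listdir(self.root):
            quoted, _, clock = name.rpartition(".")
            found.setdefault(unquote(quoted), []).append(int(clock))
        return found

    def read(self, path: str, clock: int | None = None) -> str | None:
        visible = [c for c in self._clocks().get(path, []) if clock is None or c <= clock]
        if not visible:
            return None
        name = f"{quote(path, safe='')}.{max(visible)}"
        with open(os.path.join(self.root, name), "rb") as f:
            raw = f.read()
        return None if raw[:1] == b"D" else raw[1:].decode()

    def ls(self, path: str | None = None, clock: int | None = None) -> list[str]:
        prefix = (path or "").rstrip("/") + "/"
        children = set()
        for stored in self._clocks():
            if stored.startswith(prefix) and self.read(stored, clock) is not None:
                # Keep only the first segment below the prefix (immediate child).
                children.add(prefix + stored[len(prefix):].split("/", 1)[0])
        return sorted(children)
```

Because `ls` only reports children that still have a live descendant as of the clock, a "directory" disappears once everything under it has been deleted.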

@Xuanwo
Member

Xuanwo commented Sep 25, 2023

Hi, @criccomini, thank you for sharing!

It is possible to build that API on top of OpenDAL, but it is probably unrelated to the version API proposed here.

Let's take S3 as an example: the object version ID is generated on the S3 side, for example 3HL4kqtJlcpXroDTDmjVBH40Nrjfkd. Therefore, it is not possible to write with a clock and use that clock as the version.

However, as you suggested for fs, it is possible to achieve the same thing by encoding the clock in the file path and working in a similar manner. The benefit of using OpenDAL is that you only need to write that logic once. 😆

@criccomini

Yep, yep. Makes sense. Thanks for the reply. :)

@Xuanwo
Member

Xuanwo commented Dec 19, 2023

Hi @suyanhanx, are you still interested in implementing this issue?

@suyanhanx
Member Author

> Hi @suyanhanx, are you still interested in implementing this issue?

Yes. Any additional info?

@Xuanwo
Member

Xuanwo commented Dec 19, 2023

> Yes. Any additional info?

We can implement support for S3 first and design some behavior tests.

@suyanhanx
Member Author

> > Yes. Any additional info?
>
> We can implement support for S3 first and design some behavior tests.

Let's do this.

@Xuanwo
Member

Xuanwo commented Jul 1, 2024

The current design does not account for list, delete, and undelete. I feel we need to refine the RFC before continuing with the work.

@meteorgan
Contributor

Regarding "Add version(bool) in List to include version during list or not": why not use Metakey::Version?

@Xuanwo
Member

Xuanwo commented Aug 28, 2024

> Add version(bool) in List to include version during list or not, why not use Metakey::Version?

Wow, nice question.

Metakey is designed for querying metadata in a list. It works by checking the requested keys against the metadata returned by the list operation (different services may return different metadata). If the requested metadata is not all present, stat is used to fetch the full metadata.

The problem is that most storage services don't support querying metadata during the list operation. So, the mechanism is just a best effort and doesn't work well, leading most users to ignore it or use it incorrectly. I'm considering removing it in favor of other methods.
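The best-effort mechanism described here can be sketched with a toy backend. Everything in this block is hypothetical (`ToyBackend`, `list_with`, the field names) and it is not OpenDAL's implementation; it only shows why a requested key that the list call does not return costs one extra stat per entry.

```python
from __future__ import annotations


class ToyBackend:
    """Hypothetical service whose list call returns only part of the metadata."""

    def __init__(self, objects: dict[str, dict[str, object]], listed_keys: set[str]) -> None:
        self.objects = objects
        self.listed_keys = listed_keys  # fields the list API happens to return
        self.stat_calls = 0

    def list(self) -> list[tuple[str, dict[str, object]]]:
        return [
            (path, {k: v for k, v in meta.items() if k in self.listed_keys})
            for path, meta in sorted(self.objects.items())
        ]

    def stat(self, path: str) -> dict[str, object]:
        self.stat_calls += 1
        return dict(self.objects[path])


def list_with(backend: ToyBackend, wanted: set[str]) -> list[tuple[str, dict[str, object]]]:
    """Reuse what list returned; otherwise pay for one stat call per entry."""
    entries = []
    for path, meta in backend.list():
        if not wanted.issubset(meta):
            meta = backend.stat(path)
        entries.append((path, meta))
    return entries
```

With a backend that lists only `size`, asking for `{"size"}` needs no stat calls, while asking for `{"version"}` triggers one stat per listed object.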

As for version, it's a strong API-level argument: storage services may need to call a different API altogether, such as ListObjectVersions instead of ListObjects. It makes more sense to me to make it a separate argument.
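A small sketch of that distinction, using made-up data and function names rather than any real client: the flag decides which request is issued and how many entries come back, which a metadata hint on an ordinary listing cannot express.

```python
from __future__ import annotations

# Toy data: key -> version IDs, oldest first. The IDs are made up.
OBJECTS = {
    "a.txt": ["v1", "v2"],
    "b.txt": ["v3"],
}


def list_objects(prefix: str) -> list[tuple[str, str | None]]:
    """Stand-in for a ListObjects-style call: one entry per key, no version."""
    return [(key, None) for key in sorted(OBJECTS) if key.startswith(prefix)]


def list_object_versions(prefix: str) -> list[tuple[str, str | None]]:
    """Stand-in for a ListObjectVersions-style call: one entry per version."""
    return [
        (key, version_id)
        for key in sorted(OBJECTS)
        if key.startswith(prefix)
        for version_id in OBJECTS[key]
    ]


def list_entries(prefix: str, versions: bool = False) -> list[tuple[str, str | None]]:
    # The flag selects a different request, not just extra fields per entry.
    return list_object_versions(prefix) if versions else list_objects(prefix)


print(list_entries(""))                 # [('a.txt', None), ('b.txt', None)]
print(list_entries("", versions=True))  # [('a.txt', 'v1'), ('a.txt', 'v2'), ('b.txt', 'v3')]
```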

@meteorgan
Contributor

> > Add version(bool) in List to include version during list or not, why not use Metakey::Version?
>
> Wow, nice question.
>
> Metakey is designed for querying metadata in a list. It works by checking the requested keys against the metadata returned by the list operation (different services may return different metadata). If the requested metadata is not all present, stat is used to fetch the full metadata.
>
> The problem is that most storage services don't support querying metadata during the list operation. So, the mechanism is just a best effort and doesn't work well, leading most users to ignore it or use it incorrectly. I'm considering removing it in favor of other methods.
>
> As for version, it's a strong API-level argument: storage services may need to call a different API altogether, such as ListObjectVersions instead of ListObjects. It makes more sense to me to make it a separate argument.

OK. I'd like to implement it.

@meteorgan
Contributor

If versioning is enabled, should we return the version ID after writing an object? If not, we would need to call stat or list to retrieve it.

@meteorgan
Contributor

> If versioning is enabled, should we return the version ID after writing an object? If not, we would need to call stat or list to retrieve it.

Hi @Xuanwo, what do you think about this?

@Xuanwo
Member

Xuanwo commented Sep 19, 2024

> If versioning is enabled, should we return the version ID after writing an object? If not, we would need to call stat or list to retrieve it.

Yep, that's another API change: returning the object metadata when Write::close() is called. This is useful even when versioning is not enabled, since users can get the content length, ETag, and last-modified time without an extra call.
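The shape of such a change might look like the toy writer below. It is purely illustrative and not OpenDAL's API: `ToyWriter`, `Metadata`, the SHA-256 stand-in for an ETag, and the `v<N>` version IDs are all assumptions. The point is only that `close()` already knows everything a follow-up stat would return.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Metadata:
    content_length: int
    etag: str
    last_modified: datetime
    version: str | None  # None when versioning is disabled


class ToyWriter:
    """Hypothetical writer whose close() reports what was written."""

    def __init__(self, store: dict[str, list[bytes]], path: str, versioning: bool) -> None:
        self._store = store
        self._path = path
        self._versioning = versioning
        self._chunks: list[bytes] = []

    def write(self, chunk: bytes) -> None:
        self._chunks.append(chunk)

    def close(self) -> Metadata:
        data = b"".join(self._chunks)
        versions = self._store.setdefault(self._path, [])
        versions.append(data)
        return Metadata(
            content_length=len(data),
            etag=hashlib.sha256(data).hexdigest(),  # stand-in; real ETags are service-defined
            last_modified=datetime.now(timezone.utc),
            version=f"v{len(versions)}" if self._versioning else None,
        )
```

A caller could then read `writer.close().version` directly instead of issuing a stat after the write.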

Would you be interested in submitting an RFC for this? I'm happy to review it and help you implement it.

@meteorgan
Contributor

> > If versioning is enabled, should we return the version ID after writing an object? If not, we would need to call stat or list to retrieve it.
>
> Yep, that's another API change: returning the object metadata when Write::close() is called. This is useful even when versioning is not enabled, since users can get the content length, ETag, and last-modified time without an extra call.
>
> Would you be interested in submitting an RFC for this? I'm happy to review it and help you implement it.

I'd love to do it! :) But I want to finish this RFC first. In the versioning tests, I'll temporarily use stat to retrieve the version id.

@Xuanwo
Member

Xuanwo commented Sep 20, 2024

> I'd love to do it! :) But I want to finish this RFC first. In the versioning tests, I'll temporarily use stat to retrieve the version id.

Thanks a lot!
