Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking issues of OpenDAL API changes #356

Closed
3 tasks
Xuanwo opened this issue Apr 27, 2024 · 10 comments
Closed
3 tasks

Tracking issues of OpenDAL API changes #356

Xuanwo opened this issue Apr 27, 2024 · 10 comments

Comments

@Xuanwo
Copy link
Member

Xuanwo commented Apr 27, 2024

Hi, iceberger! OpenDAL's coming v0.46 release will have API changes that could affect our project.

New Features

OpenDAL Reader now has concurrent support

OpenDAL Reader now has concurrent support that can read multiple chunks concurrently.

let r = op.reader_with("test.txt").concurrent(4).chunk(4 * 1024 * 1024).await?;
let buf = r.read(0..16 * 1024 * 1024).await?;

The buf here will be fetched in 4 concurrent requests.

To read non-contiguous buffers, please use our fetch API:

let r = op.reader_with("test.txt").concurrent(4).chunk(4 * 1024 * 1024).await?;
let bufs = r.fetch(vec![0..1024 * 1024, 1024..3 * 1024 * 1024]).await?;

OpenDAL will merge close ranges and read them concurrently.

The detailed upgrade guide could be found here.

OpenDAL v0.46 is not related yet so those changes are still possible to be altered. I will try my best to keep this issue update.

API Changes

I list the major changes that we need to take care:

OpenDAL Reader doens't impl AsyncRead + AsyncSeek anymore

OpenDAL's Reader now transformed into range based read.

let r = op.reader("test.txt").await?;
let buf = r.read(1024..2048).await?;

Users can transform into AsyncRead + AsyncSeek by using into_futures_async_read:

let r = op.reader("test.txt").await?;
let reader: FuturesAsyncReader = r.into_futures_async_read(0..4096);

But please note:

  • opendal::Reader adopts zero-cost abstraction, no extra bytes copy and allocation happened.
  • opendal::FuturesAsyncReader is the same as our old reader, it might have extra bytes copy.

OpenDAL Writer doens't impl AsyncWrite anymore

Just like Reader, opendal::Writer doesn't impl AsyncWrite anymore. Users could use opendal's native Buffer for both contiguous and non-contiguous buffers support.

let w = op.writer("test.txt").await?;

// Buffer can be created from continues bytes.
w.write("hello, world").await?;
// Buffer can also be created from non-continues bytes.
w.write(vec![Bytes::from("hello,"), Bytes::from("world!")]).await?;

// Make sure file has been written completely.
w.close().await?;

Users can transform into AsyncWrite by using into_futures_async_write:

let w = op.writer("test.txt").await?;
let writer: FuturesAsyncWriter = r.into_futures_async_write();

Tasks

Although it's possible to simply convert opendal's Reader and Writer into AsyncXxx-based structures, I aim to prepare Iceberg for the most efficient IO methods. In the near future, we will support compilation-based IO and vectorization. The AsyncXxx-based traits do not integrate well with these methods.

I believe only read side needs to do some changes. write side should be simple to update.

Related

@Xuanwo
Copy link
Member Author

Xuanwo commented Apr 27, 2024

cc @Fokko @liurenjie1024 @ZENOTME to take a look.

@liurenjie1024
Copy link
Collaborator

I agree that we should refactor the Input/Output API for best performance, but we should be careful not to expose opendal's api to end user.

@liurenjie1024
Copy link
Collaborator

Also cc @tustvold Since we had a related discussion here: #172

@Xuanwo
Copy link
Member Author

Xuanwo commented Apr 28, 2024

I agree that we should refactor the Input/Output API for best performance, but we should be careful not to expose opendal's api to end user.

Yep, I have a plan to refactor the FileIO as a trait so that users can provide their own implemations via opendal or object_store.

I have met a use case that users will parse storage related input by themselves and don't uses iceberg's own args.

@liurenjie1024
Copy link
Collaborator

I agree that we should refactor the Input/Output API for best performance, but we should be careful not to expose opendal's api to end user.

Yep, I have a plan to refactor the FileIO as a trait so that users can provide their own implemations via opendal or object_store.

I have met a use case that users will parse storage related input by themselves and don't uses iceberg's own args.

I'm ok with this change, but it's breaking. Let's see what others think. cc @Fokko @sdd @marvinlanhenke @tustvold

@tustvold
Copy link

tustvold commented Apr 28, 2024

I'm not seeing a concrete proposal on which to comment, but the broad theme of moving closer to the actual object store APIs is IMO a good idea.

That being said, I feel I should point out that such an abstraction already exists, is natively supported by both the Rust arrow and Datafusion projects, and already has OpenDAL bindings, namely object_store... Perhaps this is therefore already a solved problem?

I can understand why Iceberg Java felt the need to provide FileIO, as the ecosystem there is still very wedded to filesystem APIs, but the Rust ecosystem has largely avoided this mistake?

I dunno I am far from impartial, but the motivation for object_store is to provide this exact abstraction

@marvinlanhenke
Copy link
Contributor

I'm not seeing a concrete proposal on which to comment, but the broad theme of moving closer to the actual object store APIs is IMO a good idea.

That being said, I feel I should point out that such an abstraction already exists, is natively supported by both the Rust arrow and Datafusion projects, and already has OpenDAL bindings, namely object_store... Perhaps this is therefore already a solved problem?

...with the disclaimer, that I have little to no experience regarding the intricacies of object store APIs;

I (also) think we should avoid "reinventing the wheel" here and use already existing solutions; with that in mind if FileIO becomes a trait and we can use an existing implementation e.g. opendal or object_store - I have little to no worries about this proposed change...

...hope this makes any sense at all.

@tustvold
Copy link

if FileIO becomes a trait and we can use an existing implementation

I guess I was suggesting to just not have a FileIO trait, and just use ObjectStore which is already an object-safe trait

@liurenjie1024
Copy link
Collaborator

I think a FileIO trait would still be necessary for two reasons:

  1. It's more familiar to iceberg community, which makes this easier to maintain and align with other implementations such as java/python.
  2. Unlike object_store which provides direct interface to storage, FileIO could be chained, for example when we need to deal with encryption. This way we can hide complexity from end user.

@Xuanwo
Copy link
Member Author

Xuanwo commented May 19, 2024

Closing now. I thought we would have huge breaking API changes, but it turned out to be a simple, small PR. I'll track the optimization in other issues instead.

@Xuanwo Xuanwo closed this as completed May 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants