
Revisit Design of ObjectStore::put_multipart #5458

Closed · tustvold opened this issue Mar 3, 2024 · 12 comments · Fixed by #5500
Labels: enhancement (Any new improvement worthy of an entry in the changelog), object-store (Object Store Interface), parquet (Changes to the parquet crate)

Comments

tustvold (Contributor) commented Mar 3, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, streaming uploads are supported by ObjectStore::put_multipart, which returns an AsyncWrite providing a push-based interface for writing data.

However, this approach is not without its issues.

Describe the solution you'd like

#4971 added a MultiPartStore abstraction that more closely mirrors the APIs exposed by object stores, avoiding these issues. If we could devise a way to implement this interface for LocalFileSystem, we could then "promote" it into the ObjectStore trait and deprecate put_multipart. This would provide maximum flexibility to users, while keeping with this crate's objective of hewing closely to the APIs of the stores themselves.

The key observation that makes this possible is that we already recommend MultiPartStore be used with fixed-size chunks for compatibility with R2. We could therefore require this for LocalFileSystem, in turn allowing it to support out-of-order / parallel writes, as the file offsets can be determined from the part index.
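To spell out the arithmetic this relies on: with a fixed part size, a part's byte offset in the destination file follows directly from its index, so parts can be written out of order. (A minimal illustration; `part_offset` is a hypothetical helper, not crate API.)

```rust
/// Hypothetical helper: with fixed-size parts, the byte offset of a part
/// in the destination file is fully determined by its index
fn part_offset(part_index: u64, part_size: u64) -> u64 {
    part_index * part_size
}
```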

#5431 and #4857 added BufWriter and BufReader; these would be retained to preserve compatibility with the tokio ecosystem and to provide a more idiomatic API on top of this.
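For context, here is a minimal sketch of how BufWriter bridges to the tokio ecosystem, assuming the 0.9-era API in which BufWriter::new takes a store and a path and the result implements AsyncWrite:

```rust
use std::sync::Arc;

use object_store::buffered::BufWriter;
use object_store::memory::InMemory;
use object_store::path::Path;
use object_store::ObjectStore;
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store: Arc<dyn ObjectStore> = Arc::new(InMemory::new());
    let path = Path::from("data/example.bin");

    // BufWriter implements tokio::io::AsyncWrite, buffering writes and
    // transparently switching to a multipart upload for larger objects
    let mut writer = BufWriter::new(Arc::clone(&store), path);
    writer.write_all(b"hello world").await?;
    writer.shutdown().await?; // completes the upload
    Ok(())
}
```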

Describe alternatives you've considered

I briefly considered a put_stream API; however, this doesn't resolve many of the above issues.

We could also just implement MultiPartStore for LocalFileSystem, whilst retaining the current put_multipart. This would allow downstreams to opt in to the lower-level API if they so wished.

We could also modify put_multipart to return something other than AsyncWrite, possibly something closer to PutPart.

Additional context

Many of the stores also support composing objects from others; this might be something to consider in this design - #4966

FYI @wjones127 @Xuanwo @alamb @roeap

tustvold added the enhancement label Mar 3, 2024
tustvold (Contributor, Author) commented Mar 3, 2024

One downside of moving to a multipart upload API is that it would force unnecessary buffering in cases where the underlying store lacks a minimum part size, e.g. LocalFileSystem. More thought is needed 🤔

alamb (Contributor) commented Mar 3, 2024

> One downside of moving to a multipart upload API is that it would force unnecessary buffering in cases where the underlying store lacks a minimum part size, e.g. LocalFileSystem. More thought is needed 🤔

Maybe we could do something like GetResult and allow a special-cased write for local files, if a writer wanted to implement a specialized local file path 🤔
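To make the analogy concrete: GetResult exposes a payload enum whose File variant lets local reads bypass the streaming path entirely. A write-side counterpart might look something like the following; all of these names are invented for illustration and do not exist in the crate:

```rust
use std::fs::File;
use std::io::Write;
use std::path::PathBuf;

/// Hypothetical write-side analogue of GetResultPayload
enum PutPayloadSink {
    /// Local stores could hand the file back directly, letting callers
    /// write without intermediate buffering
    File(File, PathBuf),
    /// Remote stores would fall back to a buffered multipart writer
    Buffered(Box<dyn Write + Send>),
}
```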

The idea of implementing MultiPartStore for LocalFileSystem makes a lot of sense to me (it could be implemented very efficiently, as you point out). Tuning write buffer sizes (aka part sizes in a multipart upload) is likely to be object store and system specific, so the buffering doesn't seem like a fundamental problem to me.

tustvold (Contributor, Author) commented Mar 5, 2024

So the major challenge with providing a multipart API for LocalFileSystem is that there is no obvious way to transport the part size in use. This presents a problem for determining the offset of the final part, which will likely be smaller than the fixed part size.

There are a few options here, but none of them is particularly great:

  • Encode the part size in a separate file, but this adds a file read/write to every part write
  • Use a mechanism like xattr, but this would limit platform support
  • Encode the part size in the MultipartId (see the sketch after this list), but this would require specifying the part size when creating the upload
  • Encode the part size in the file itself, but this would be fragile and hard to coordinate with parallel uploads
  • Keep multipart uploads as separate files, but this would complicate listing and retrieval logic and break compatibility with non-ObjectStore-based systems
  • Concatenate the parts once the upload finishes; this would be simple but slow
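For illustration, the MultipartId option above could be as simple as the following; this is a hypothetical encoding, not actual crate behaviour:

```rust
/// Hypothetical: smuggle the part size into the MultipartId returned
/// when the upload is created
fn encode_multipart_id(upload_uuid: &str, part_size: u64) -> String {
    format!("{upload_uuid}:{part_size}")
}

/// Recover the upload identifier and part size from the encoded id
fn decode_multipart_id(id: &str) -> Option<(&str, u64)> {
    let (uuid, size) = id.rsplit_once(':')?;
    Some((uuid, size.parse().ok()?))
}
```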

Taking a step back, I think there are two kinds of users of multipart uploads:

  1. Users who just want to stream data to durable storage
  2. Users doing a chunked upload of an existing data set

Users in the second category are extremely unlikely to care about LocalFileSystem support, as they could just use the filesystem directly. As such, I suspect they are adequately served by MultiPartStore. I therefore think we can focus our efforts on the first category of user, providing an efficient way to stream data, in order, to durable storage.

I'm therefore leaning towards replacing put_multipart with:

```rust
trait ObjectStore {
    fn upload(&self, path: &Path) -> Result<Box<dyn Upload>>;
    ...
}

pub struct UploadOptions {
    /// A hint as to the size of the object to be uploaded; implementations
    /// may use this to select an appropriate IO size
    pub size_hint: Option<usize>,
    /// Implementations may perform chunked uploads in parallel; use this
    /// to restrict the concurrency
    pub max_concurrency: Option<usize>,
}

#[async_trait]
pub trait Upload {
    /// Enqueue data to be uploaded
    fn write(&mut self, data: &[u8]) -> Result<()> { ... }

    /// Enqueue `data` to be uploaded
    fn put(&mut self, data: Bytes) -> Result<()> {
        self.write(&data)
    }

    /// Flush as much data as possible to durable storage
    ///
    /// Returns the offset up to which data has been made durable
    ///
    /// Some implementations may have IO size restrictions making this best effort
    async fn flush(&mut self) -> Result<usize> { ... }

    /// Flush all written data, complete this upload, and return the [`PutResult`]
    async fn shutdown(&mut self) -> Result<PutResult> { ... }

    /// Abort this upload
    async fn abort(&mut self) -> Result<()> { ... }
}
```
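To illustrate how a caller might drive this, here is a sketch against the hypothetical trait above; the batch source and the 10 MiB flush threshold are invented for the example:

```rust
// Sketch only: `Upload`, `upload`, and `PutResult` are the proposed
// (not yet existing) API from the snippet above
async fn stream_to_store(
    store: &dyn ObjectStore,
    path: &Path,
    batches: Vec<Vec<u8>>,
) -> Result<PutResult> {
    let mut upload = store.upload(path)?;
    let mut buffered = 0usize;
    for batch in batches {
        buffered += batch.len();
        upload.write(&batch)?; // synchronous: just enqueues the data
        if buffered >= 10 * 1024 * 1024 {
            upload.flush().await?; // drives the actual IO
            buffered = 0;
        }
    }
    upload.shutdown().await // completes the upload and returns the PutResult
}
```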

There are a few things worth highlighting here:

  • The synchronous write will integrate well with synchronous writers, e.g. "Provide access to inner Write for parquet writers" (#5471)
  • Implementations can choose to use the cheaper Put instead of PutMultipart if data sizes are small
  • Part sizing is abstracted away from the user
  • Implementations are not constrained on IO granularity, e.g. LocalFileSystem can stream writes directly
  • The Upload interface should be significantly easier to implement than AsyncWrite

Xuanwo (Member) commented Mar 5, 2024

> I'm therefore leaning towards replacing put_multipart with

This design looks cool, but I have two concerns:

The current design requires users to call flush when the buffer gets too large, because we can't perform IO during write or put. This complicates things, as users can no longer write continuously like before.

Also, I'm unclear about how max_concurrency functions. Does this mean that flush could operate asynchronously in the background?

tustvold (Contributor, Author) commented Mar 5, 2024

> The current design requires users to call flush when the buffer gets too large, because we can't perform IO during write or put. This complicates things, as users can no longer write continuously like before.

In practice they have to do this anyway because of tokio-rs/tokio#4296 and #5366. In general the previous API was very difficult to actually use correctly, especially with the type of long-running synchronous operations that characterize arrow/parquet workloads.

> Also, I'm unclear about how max_concurrency functions. Does this mean that flush could operate asynchronously in the background?

The idea is if Upload has accumulated enough data to do so, it could upload multiple chunks in parallel, much like WriteMultipart does currently.

Effectively, aside from moving away from the problematic AsyncWrite abstraction, this doesn't materially alter the way IO is performed, other than removing the async backpressure mechanism that makes AsyncWrite such a pain to integrate with predominantly sync workloads.
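For illustration, bounding the number of in-flight part uploads could be structured along these lines; this is a generic tokio sketch, and `upload_part` is a stand-in for whatever performs a single chunk's IO, not the crate's implementation:

```rust
use tokio::task::JoinSet;

/// Hypothetical stand-in for the IO of a single part
async fn upload_part(idx: usize, chunk: Vec<u8>) -> std::io::Result<()> {
    let _ = (idx, chunk); // ... perform the PUT for part `idx` ...
    Ok(())
}

/// Upload chunks with at most `max_concurrency` requests in flight
async fn upload_chunks(
    chunks: Vec<Vec<u8>>,
    max_concurrency: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut tasks = JoinSet::new();
    for (idx, chunk) in chunks.into_iter().enumerate() {
        // once the limit is reached, wait for one upload to finish
        while tasks.len() >= max_concurrency {
            tasks.join_next().await.expect("set is non-empty")??;
        }
        tasks.spawn(upload_part(idx, chunk));
    }
    // drain the remaining in-flight uploads
    while let Some(result) = tasks.join_next().await {
        result??;
    }
    Ok(())
}
```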

wjones127 (Member) commented

The design seems reasonable overall.

In Lance, our write pattern at the moment looks like:

```text
write col 1
...
write col N
flush
(maybe return control to caller)
write col 1
...
write col N
flush
```

Because I am calling flush often, I don't think I'd miss the backpressure from write. However, what I think I might miss is being able to initiate requests during write calls. I wonder if it would make sense to have some sort of poll_flush() method? Obviously it has some of the stability concerns from #5366, but I think, given a warning, it could be safe enough.

Also, is there a maximum buffer size enforced? Does reaching it make write() fail? Or is it up to the user to limit how much they are buffering? (Which I think they could do easily by tracking bytes.)

I'm thinking that, if implemented, the way I would use this API is to put the writer on a background tokio task. Then I could run the IO calls in the background and implement backpressure over some channel. This brings up the question: is it safe to call write while the future returned by flush() has begun but is incomplete? Ideally I would like to be able to enqueue more data before I've completely drained the currently queued buffer.
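A sketch of that background-task pattern, using a bounded channel for backpressure; `Upload` here is the hypothetical trait proposed earlier (with a Send bound added so it can move onto the task):

```rust
use bytes::Bytes;
use tokio::sync::mpsc;

/// Sketch: run the (hypothetical) Upload on a background task and let a
/// bounded channel provide the backpressure that the synchronous `write`
/// no longer does. Callers `send(...).await`, which waits once full.
fn spawn_uploader(mut upload: Box<dyn Upload + Send>, queue_depth: usize) -> mpsc::Sender<Bytes> {
    let (tx, mut rx) = mpsc::channel::<Bytes>(queue_depth);
    tokio::spawn(async move {
        while let Some(data) = rx.recv().await {
            if upload.put(data).is_err() {
                return;
            }
            // drive the IO; a fuller version would only flush past a threshold
            if upload.flush().await.is_err() {
                return;
            }
        }
        let _ = upload.shutdown().await; // channel closed: finish the upload
    });
    tx
}
```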

tustvold (Contributor, Author) commented Mar 5, 2024

Hmm... good point. For that sort of workload something like put_stream would probably be the best option; it does have a certain elegant simplicity/symmetry to it 🤔

Something like

```rust
pub struct PutStreamOptions {
    /// A hint as to the size of the object to be uploaded; implementations
    /// may use this to select an appropriate IO size
    pub size_hint: Option<usize>,
    /// Implementations may perform chunked uploads in parallel; use this
    /// to restrict the concurrency
    pub max_concurrency: Option<usize>,
}

trait ObjectStore {
    async fn put_stream(&self, stream: BoxStream<'static, Result<Bytes>>) -> Result<PutResult>;
}
```
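A caller-side sketch against this hypothetical signature, building the stream with the futures crate:

```rust
use bytes::Bytes;
use futures::stream::{self, StreamExt};

// Hypothetical usage of the put_stream sketch above
async fn example(store: &impl ObjectStore) -> Result<PutResult> {
    let chunks = vec![Bytes::from("part 1"), Bytes::from("part 2")];
    let stream = stream::iter(chunks.into_iter().map(Ok)).boxed();
    store.put_stream(stream).await
}
```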

tustvold (Contributor, Author) commented Mar 6, 2024

Had a good sync with @alamb, and I think we've devised a way to support the original vision of exposing a multipart API for stores, including LocalFileSystem. Apologies for the noise.

alamb (Contributor) commented Mar 7, 2024

> Had a good sync with @alamb, and I think we've devised a way to support the original vision of exposing a multipart API for stores, including LocalFileSystem. Apologies for the noise.

To avoid leaving anyone in suspense, as I recall the basic idea is, at first, to require, for file-backed object stores, that each part of a multipart upload is the same size (except for the last one). In this way, when writing multiple "parts" to a file, we can calculate a priori the offset in the file at which each part will go.

If a user tries to upload a part that is not the required size, the API will error with a clear message. We can eventually perhaps extend the implementation to handle differently sized chunks (with copying as part of the finalize).

https://docs.rs/object_store/latest/object_store/multipart/trait.MultiPartStore.html#tymethod.put_part

I may be misremembering this.
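For concreteness, here is a sketch of what a fixed-part-size part write could look like for LocalFileSystem under the scheme described above; this is a Unix-only illustration (using write_all_at), not the actual implementation:

```rust
use std::fs::File;
use std::io::{Error, ErrorKind, Result};
use std::os::unix::fs::FileExt;

/// Every part except the last must be exactly `part_size` bytes,
/// so each part's byte offset follows from its index
fn write_part(
    file: &File,
    part_size: u64,
    part_index: u64,
    data: &[u8],
    is_last: bool,
) -> Result<()> {
    if !is_last && data.len() as u64 != part_size {
        return Err(Error::new(
            ErrorKind::InvalidInput,
            format!("part {part_index} must be exactly {part_size} bytes"),
        ));
    }
    file.write_all_at(data, part_index * part_size)
}
```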

tustvold (Contributor, Author) commented

The design evolved a bit to accommodate reality, but it is largely in line with the spirit of the original proposal - #5500

PTAL and let me know what you think; I'm quite pleased with how it came out.

tustvold added a commit that referenced this issue Mar 19, 2024
…tipartStore (#5458) (#5500)

* Replace AsyncWrite with Upload trait (#5458)
* Make BufWriter abortable
* Flesh out cloud implementations
* Review feedback
* Misc tweaks and fixes
* Format
* Replace multi-part with multipart
* More docs
* Clippy
* Rename to MultipartUpload
* Rename ChunkedUpload to WriteMultipart
* Doc tweaks
* Apply suggestions from code review
* Docs
* Format

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
tustvold (Contributor, Author) commented

label_issue.py automatically added labels {'object-store'} from #5500

tustvold added the object-store and parquet labels Apr 17, 2024
tustvold (Contributor, Author) commented

label_issue.py automatically added labels {'parquet'} from #5485
