Replace AsyncWrite with Upload trait and rename MultiPartStore to MultipartStore (#5458) #5500

Merged
merged 16 commits into apache:master on Mar 19, 2024

Conversation

tustvold
Contributor

@tustvold tustvold commented Mar 13, 2024

Which issue does this PR close?

Closes #5458
Closes #5443
Closes #5526

Rationale for this change

What changes are included in this PR?

This changes put_multipart to instead return a Box<dyn MultipartUpload> that provides an API much closer to that of the stores themselves. This gives users much more control over buffering and concurrency, whilst also opening up new possibilities for integrating object_store into systems such as parquet.
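
For illustration, a minimal sketch of the resulting call pattern (assuming the MultipartUpload trait and put_multipart signature described above; the InMemory store and exact payload types are illustrative and may differ from the released API):

    use bytes::Bytes;
    use object_store::{memory::InMemory, path::Path, ObjectStore};

    async fn example() -> object_store::Result<()> {
        let store = InMemory::new();
        let path = Path::from("data/large.bin");

        // put_multipart now returns a Box<dyn MultipartUpload> rather than an AsyncWrite
        let mut upload = store.put_multipart(&path).await?;

        // Each put_part returns an owned future; the caller decides how much to
        // buffer and how many parts to keep in flight
        upload.put_part(Bytes::from(vec![0u8; 5 * 1024 * 1024]).into()).await?;
        upload.put_part(Bytes::from(vec![1u8; 1024]).into()).await?;

        // Completing the upload makes the object visible
        upload.complete().await?;
        Ok(())
    }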

Are there any user-facing changes?

@github-actions github-actions bot added the object-store Object Store Interface label Mar 13, 2024
@tustvold tustvold added the api-change Changes to the arrow API label Mar 13, 2024
Comment on lines -554 to -566
/// <div class="warning">
/// It is recommended applications wait for any in-flight requests to complete by calling `flush`, if
/// there may be a significant gap in time (> ~30s) before the next write.
/// These gaps can include times where the function returns control to the
/// caller while keeping the writer open. If `flush` is not called, futures
/// for in-flight requests may be left unpolled long enough for the requests
/// to time out, causing the write to fail.
/// </div>
Contributor Author

This warning is no longer necessary 🎉

async fn abort_multipart(&self, location: &Path, multipart_id: &MultipartId) -> Result<()> {
let _permit = self.semaphore.acquire().await.unwrap();
self.inner.abort_multipart(location, multipart_id).await
async fn upload(&self, location: &Path) -> Result<Box<dyn Upload>> {
Contributor Author

@tustvold tustvold Mar 13, 2024

We can actually properly limit/throttle uploads now, as the public APIs now mirror the underlying requests 🎉

This will also enable things like reliably running IO on a separate tokio runtime
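
As a rough sketch of what that could enable (not part of this PR; it assumes the final MultipartUpload trait name and that put_part returns an owned, Send + 'static future as described in this review):

    use bytes::Bytes;
    use object_store::{MultipartUpload, Result};
    use tokio::runtime::Handle;

    // Spawn a single part upload onto a dedicated IO runtime. This is only
    // possible because put_part returns an owned BoxFuture<'static, _> that
    // does not borrow the upload and can therefore be sent to another runtime.
    async fn put_part_on_io_runtime(
        io_runtime: &Handle,
        upload: &mut dyn MultipartUpload,
        data: Bytes,
    ) -> Result<()> {
        let part = upload.put_part(data.into());
        io_runtime.spawn(part).await.expect("IO task panicked")
    }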

Contributor

I know you prefer shorter names, but here I recommend calling this multipart_upload and MultipartUpload to align with the term used by the cloud providers. Upload is pretty generic

https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpu-upload-object.html
https://cloud.google.com/storage/docs/xml-api/post-object-complete

Or maybe keep the existing put_multipart name (though I realize reusing the same name may be more confusing on upgrade)

Contributor Author

I guess I was trying to avoid being too specific, as such a concept doesn't exist in LocalFileSystem, and Azure calls it something different

///
/// [`Sink`]: futures::sink::Sink
#[derive(Debug)]
pub struct ChunkedUpload {
Contributor Author

This is not only significantly simpler than the WriteUpload it replaces, but also avoids a number of its issues.

I think it is also a nice example of the flexibility of the Upload API: if a downstream wants to handle chunking/concurrency differently, they're entirely able to do so.

///
/// Back pressure can optionally be applied to producers by calling
/// [`Self::wait_for_capacity`] prior to calling this method
pub fn write(&mut self, mut buf: &[u8]) {
Contributor Author

It is worth highlighting that this is a synchronous method, and so users could wrap this in a type and feed it directly into a synchronous writer such as parquet's ArrowWriter. If back pressure is required they could potentially call wait_for_capacity between row groups.
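
To illustrate the idea (not code from this PR): a minimal sketch of such a wrapper, assuming the chunked-upload type's synchronous write method as described here (named ChunkedUpload at this point in the review and WriteMultipart in the released crate):

    use std::io;
    use object_store::WriteMultipart;

    /// Adapter exposing the synchronous `write` as `std::io::Write`, so the
    /// upload can be fed directly to a synchronous writer such as parquet's
    /// ArrowWriter.
    struct SyncUploadWriter(WriteMultipart);

    impl io::Write for SyncUploadWriter {
        fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
            // Buffers the data and spawns part uploads internally once enough
            // bytes have accumulated; this call itself never blocks.
            self.0.write(buf);
            Ok(buf.len())
        }

        fn flush(&mut self) -> io::Result<()> {
            // Completion is async (`finish`); if back pressure is needed, the
            // async `wait_for_capacity` can be awaited between row groups.
            Ok(())
        }
    }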

Contributor

It seems like this API basically requires copying the data. I wonder if there is some way to allow users to pass in an owned buffer already, like

    pub fn write(&mut self, buf: impl Into<Buffer>) {
    ...
    }

And then internally slicing up the Buffer to ensure the correct sizes

I realize that put currently requires a single contiguous buffer (per part), so maybe this copy isn't that big a problem. However it seems a pity that we require the copy 🤔

    async fn put(&self, location: &Path, bytes: Bytes) -> Result<PutResult> {
        self.put_opts(location, bytes, PutOptions::default()).await
    }

Contributor Author

@tustvold tustvold Mar 14, 2024

I will file a ticket for supporting non-contiguous write payloads

Edit: #5514

/// ```
///
/// [R2]: https://developers.cloudflare.com/r2/objects/multipart-objects/#limitations
fn put_part(&mut self, data: Bytes) -> UploadPart;
Contributor Author

The design of this is somewhat subtle.

A part index would run into issues for LocalFileSystem, as we would need to know the offset to write the chunk to. If we permit out of order writes, e.g. writing the final partial chunk first, the chunk size would need to be provided when the Upload is created. Aside from being an unfortunate API, this would create non-trivial behaviour differences between stores that respect this config and those that ignore it.

Instead by taking a mutable borrow and providing the data to be written, we prohibit out of order writes. The final piece is we return a BoxFuture<'static, Result<()>> instead of this being an async fn, i.e. returning BoxFuture<'_, Result<()>>. This allows multiple UploadPart to be created and polled in parallel, without the borrow checker complaining about concurrent mutable borrows.

This is strictly a more flexible interface than the current AsyncWrite API, and whilst it still doesn't permit out of order writes, use-cases that require this can use MultiPartStore.
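
To make the borrow-checker point concrete, a hedged sketch (not from the PR) of two parts being created from the same upload and awaited concurrently; it assumes the trait shape described above, using the final MultipartUpload name:

    use bytes::Bytes;
    use futures::future::try_join;
    use object_store::{MultipartUpload, Result};

    async fn upload_two_parts_concurrently(
        upload: &mut dyn MultipartUpload,
        a: Bytes,
        b: Bytes,
    ) -> Result<()> {
        // Each call takes a short-lived &mut borrow and returns an owned
        // BoxFuture<'static, Result<()>> that no longer borrows `upload` ...
        let p1 = upload.put_part(a.into());
        let p2 = upload.put_part(b.into());
        // ... so both futures can be polled in parallel without overlapping
        // mutable borrows.
        try_join(p1, p2).await?;
        Ok(())
    }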

Contributor

This API requires having the entire upload in a single contiguous Bytes buffer, right? It doesn't seem to allow for incremental streaming writes, the way the current AsyncWrite does.

Maybe we could support both a put_part like this (all data at once) as well as a put_part_stream (that takes data as a stream of Bytes) 🤔

Even if the intention is that most people will use the ChunkedUpload API, I think it is still worth considering non-contiguous writes

Contributor Author

@tustvold tustvold Mar 14, 2024

The underlying implementations in fact all require a contiguous buffer, either for the HTTP request or to spawn the blocking IO. I did spend a decent amount of time seeing if we could avoid this, but it requires some pretty intrusive refactoring to basically extract our own Bytes abstraction. Ultimately this is still strictly better than the current API

I'll file a ticket for supporting non-contiguous payloads in ObjectStore

Edit: filed #5514

Contributor

Instead by taking a mutable borrow and providing the data to be written, we prohibit out of order writes. The final piece is we return a BoxFuture<'static, Result<()>> instead of this being an async fn, i.e. returning BoxFuture<'_, Result<()>>. This allows multiple UploadPart to be created and polled in parallel, without the borrow checker complaining about concurrent mutable borrows.

I think this should be part of the docstring for future reference.

Comment on lines 30 to 31
///
/// Returns a stream
Contributor Author

Suggested change
///
/// Returns a stream

fn put_part(&mut self, data: Bytes) -> UploadPart;

/// Complete the multipart upload
async fn complete(&mut self) -> Result<PutResult>;
Contributor Author

Unlike MultiPartStore, we expect Upload to handle the PartId internally. I think this makes for a more intuitive interface and avoids issues relating to the behaviour when PartIds are provided out of order.
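
For comparison, a hedged sketch of the lower-level MultipartStore flow, where the caller does track PartIds explicitly (method names follow this crate's MultipartStore trait; exact signatures may differ):

    use bytes::Bytes;
    use object_store::multipart::MultipartStore;
    use object_store::path::Path;
    use object_store::Result;

    async fn manual_multipart(store: &dyn MultipartStore, path: &Path) -> Result<()> {
        let id = store.create_multipart(path).await?;

        // The caller must keep every returned PartId, in order, to complete the
        // upload (real stores also require all parts except the last be >= 5 MiB)
        let p0 = store.put_part(path, &id, 0, Bytes::from_static(b"part 0").into()).await?;
        let p1 = store.put_part(path, &id, 1, Bytes::from_static(b"part 1").into()).await?;

        store.complete_multipart(path, &id, vec![p0, p1]).await?;
        Ok(())
    }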

Contributor

Is there any use case where the PartId is needed by the client (e.g. is it something that S3 exposes that someone might want access to?)

Contributor Author

@tustvold tustvold Mar 18, 2024

There isn't an obvious way this information could be used by a user of the Upload trait, but if a user needed this information they could use the lower level MultiPartStore trait

async fn abort_multipart(&self, location: &Path, multipart_id: &MultipartId) -> Result<()>;
/// Client should prefer [`ObjectStore::put`] for small payloads, as streaming uploads
/// typically require multiple separate requests. See [`Upload`] for more information
async fn upload(&self, location: &Path) -> Result<Box<dyn Upload>>;
Contributor Author

Unlike before, we don't expose a notion of MultipartId; in hindsight that made for a confusing API. We instead encourage users to configure automatic cleanup of incomplete uploads, which is the more reliable mechanism anyway

@alamb alamb changed the title Replace AsyncWrite with Upload trait (#5458) [ObjectStore] Replace AsyncWrite with Upload trait (#5458) Mar 14, 2024
Comment on lines +33 to +36
/// Most stores require that all parts excluding the last are at least 5 MiB, and some
/// further require that all parts excluding the last be the same size, e.g. [R2].
/// Clients wanting to maximise compatibility should therefore perform writes in
/// fixed size blocks larger than 5 MiB.
Contributor

Could we provide an API to read these attributes from a DynObjectStore? Otherwise I don't see how anyone would ever be able to use DynObjectStore and put_part in combination.

Contributor Author

We could do, or writers could just use 5 MiB chunks

Contributor

@alamb alamb left a comment

I think this API looks really nice @tustvold -- thank you 🙏

cc @wjones127 @devinjdangelo @wiedld

/// # async fn test() {
/// #
/// let mut upload: Box<&dyn Upload> = todo!();
/// let mut p1 = upload.put_part(vec![0; 10 * 1024 * 1024].into());
Contributor

I didn't quite follow the discussion about mut borrows below, but this example seems to demonstrate it is possible to upload parts in parallel, which is a key use case

pub type UploadPart = BoxFuture<'static, Result<()>>;

#[async_trait]
pub trait Upload: Send + std::fmt::Debug {
Contributor

For other reviewers: this is the key new API

What is the correct description of this trait? Something like

Suggested change
pub trait Upload: Send + std::fmt::Debug {
/// Represents an in-progress multi-part upload
///
/// (Is this right? Can we actually abort a multi-part upload? )
/// Cancel behavior: On drop, the multi-part upload is aborted
pub trait Upload: Send + std::fmt::Debug {

}

/// Flush final chunk, and await completion of all in-flight requests
pub async fn finish(mut self) -> Result<PutResult> {
Contributor

I believe this PutResult is what @ashtuchkin is asking for in #5443

@crepererum
Contributor

BTW: I really like the overall implementation!

@@ -277,7 +43,7 @@ impl<T: PutPart> std::fmt::Debug for WriteMultiPart<T> {
/// [`ObjectStore::put_multipart`]: crate::ObjectStore::put_multipart
/// [`LocalFileSystem`]: crate::local::LocalFileSystem
#[async_trait]
pub trait MultiPartStore: Send + Sync + 'static {
pub trait MultipartStore: Send + Sync + 'static {
Contributor Author

This is a drive-by fix that has annoyed me for a while: this crate routinely uses multipart as a single word, so this capitalization was inconsistent

@tustvold tustvold changed the title [ObjectStore] Replace AsyncWrite with Upload trait (#5458) Replace AsyncWrite with Upload trait and rename MultiPartStore to MultipartStore (#5458) Mar 18, 2024
@tustvold tustvold marked this pull request as ready for review March 18, 2024 04:09
@tustvold
Contributor Author

I think this is now ready to go. I may refine it in subsequent PRs prior to release, but I am happy with where the core API is

///
/// It is implementation defined behaviour to call [`MultipartUpload::abort`]
/// on an already completed or aborted [`MultipartUpload`]
async fn abort(&mut self) -> Result<()>;
Contributor Author

I am very tempted to just not include this, in favour of encouraging people to configure appropriate lifecycle rules instead, but am curious what other people think

Contributor

This came up downstream as well in DataFusion apache/datafusion#9648 (comment)

My opinion is that we should include it in the API and appropriately caveat / explain why it is not always possible. I offered a suggestion above

Contributor

@alamb alamb left a comment

Thank you @tustvold. TL;DR: I think this PR looks really nice -- thank you for pushing it along. Overall this change looks good to me. I left several suggestions that I think are important to improve the comments, but if necessary we can do them as follow-on PRs

Comment on lines 76 to 79
/// If a [`MultipartUpload`] is dropped without calling [`MultipartUpload::complete`],
/// some implementations will automatically reap any uploaded parts. However,
/// this is not always possible, e.g. for S3 and GCS. [`MultipartUpload::abort`] can
/// therefore be invoked to perform this cleanup.
Contributor

I think the term reap, while more literary, has a greater potential to be misunderstood. I also think it would help to explain why it is not always possible.

Here is a suggestion

Suggested change
/// If a [`MultipartUpload`] is dropped without calling [`MultipartUpload::complete`],
/// some implementations will automatically reap any uploaded parts. However,
/// this is not always possible, e.g. for S3 and GCS. [`MultipartUpload::abort`] can
/// therefore be invoked to perform this cleanup.
/// If a [`MultipartUpload`] is dropped without calling [`MultipartUpload::complete`],
/// some object stores will automatically clean up any previously uploaded parts.
/// However, some stores such as S3 and GCS do not perform automatic cleanup and
/// in such cases, [`MultipartUpload::abort`] manually invokes this cleanup.

@@ -269,12 +269,11 @@
//!
//! # Multipart Upload
//!
//! Use the [`ObjectStore::put_multipart`] method to atomically write a large amount of data,
//! with implementations automatically handling parallel, chunked upload where appropriate.
//! Use the [`ObjectStore::put_multipart`] method to atomically write a large amount of data
Contributor

Can we also add an example using BufWriter here (to drive people to use that unless they really need to use the multipart API themselves)? It has the nice property that it does put / put_multipart dynamically.
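
For reference, a rough sketch of what such an example might look like (assuming the existing object_store::buffered::BufWriter API; this is not text from the PR):

    use std::sync::Arc;
    use object_store::{buffered::BufWriter, memory::InMemory, path::Path, ObjectStore};
    use tokio::io::AsyncWriteExt;

    async fn example() -> std::io::Result<()> {
        let store: Arc<dyn ObjectStore> = Arc::new(InMemory::new());
        let path = Path::from("data/out.bin");

        // BufWriter buffers writes in memory: small objects end up as a single
        // put, larger ones transparently switch to a multipart upload.
        let mut writer = BufWriter::new(Arc::clone(&store), path);
        writer.write_all(b"hello object_store").await?;
        writer.shutdown().await?; // completes the put or multipart upload
        Ok(())
    }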

invalid_state("when writer is already complete.")
}
let s = Arc::clone(&self.state);
maybe_spawn_blocking(move || {
Contributor

nice

Ok((id, writer))
let upload = store.put_multipart(&path).await?;
let mut chunked = WriteMultipart::new(upload);
chunked.write(&buffer);
Contributor

this write results in a second buffer copy, right? That double buffering is unfortunate

tustvold and others added 3 commits March 20, 2024 10:54
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@tustvold
Contributor Author

I intend to follow this up with a PR with some further tweaks to the BufWriter / WriteMultipart structures, but I'd like to get the changes to ObjectStore in, and then iterate on top of this

@tustvold tustvold merged commit 96c4c0b into apache:master Mar 19, 2024
13 checks passed