Use object_store:BufWriter instead of put_multipart #9614

Closed
tustvold opened this issue Mar 15, 2024 · 5 comments · Fixed by #9648
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@tustvold
Contributor

Is your feature request related to a problem or challenge?

Currently we use put_multipart for streaming writes in many places. For files smaller than 10 MiB this is wasteful: a multipart upload takes 3 requests (create, upload part, complete) where a single put would suffice.

Describe the solution you'd like

object_store 0.9.1 added BufWriter (https://docs.rs/object_store/latest/object_store/buffered/struct.BufWriter.html), which automatically switches between Put and PutMultipart based on the amount of data that has been written.
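
To make the request-count argument concrete, here is a toy, std-only model of a writer that switches between a single put and a multipart upload. It is only a sketch under stated assumptions: `AdaptiveWriter`, `Request`, and `capacity` are hypothetical names invented for illustration, and this is not how object_store implements BufWriter.

```rust
/// One simulated request to an object store (hypothetical, illustration only).
#[derive(Debug, PartialEq)]
enum Request {
    Put(usize),
    CreateMultipart,
    UploadPart(usize),
    CompleteMultipart,
}

/// Toy adaptive writer: buffers data and only starts a multipart upload
/// once more than `capacity` bytes have been buffered.
struct AdaptiveWriter {
    capacity: usize,
    buffer: Vec<u8>,
    multipart: bool,
    requests: Vec<Request>,
}

impl AdaptiveWriter {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, buffer: Vec::new(), multipart: false, requests: Vec::new() }
    }

    /// Append data; upload full parts whenever the buffer exceeds `capacity`.
    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        let cap = self.capacity;
        while self.buffer.len() > cap {
            if !self.multipart {
                // First overflow: switch from the single-put path to multipart.
                self.multipart = true;
                self.requests.push(Request::CreateMultipart);
            }
            let part: Vec<u8> = self.buffer.drain(..cap).collect();
            self.requests.push(Request::UploadPart(part.len()));
        }
    }

    /// Flush remaining data and return the requests that were issued.
    fn finish(mut self) -> Vec<Request> {
        if self.multipart {
            if !self.buffer.is_empty() {
                self.requests.push(Request::UploadPart(self.buffer.len()));
            }
            self.requests.push(Request::CompleteMultipart);
        } else {
            // Everything fit in the buffer: a single put is enough.
            self.requests.push(Request::Put(self.buffer.len()));
        }
        self.requests
    }
}
```

With a capacity of 10 bytes, writing 3 bytes and calling `finish` records a single `Put(3)`, while writing 25 bytes records a create, three parts (10, 10 and 5 bytes) and a complete. An unconditional multipart upload would have spent three requests on the 3-byte case.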

Describe alternatives you've considered

We could implement our own adaptive logic in the write path within DataFusion.

Additional context

A future version of object_store is likely to significantly change put_multipart, and using BufWriter will limit the impact of this - apache/arrow-rs#5500

@tustvold tustvold added the enhancement New feature or request label Mar 15, 2024
@tustvold
Contributor Author

FYI @devinjdangelo @alamb

@tustvold tustvold added good first issue Good for newcomers help wanted Extra attention is needed labels Mar 15, 2024
@alamb
Contributor

alamb commented Mar 15, 2024

Related ticket for cleaning up parallel parquet writer is #9493

@yyy1000
Contributor

yyy1000 commented Mar 16, 2024

Can I take this to get familiar with datasource related code?
Currently my plan is:

  1. Create BufWriter using the given object_store and path
  2. Remove put_multipart method and call AsyncArrowWriter::try_new using the new BufWriter

That's my initial plan after investigation, hope to hear your feedback. :)

@alamb
Contributor

alamb commented Mar 16, 2024

@yyy1000 good luck -- this ticket will require some API exploration / potential changes so it will likely be a bit tricky.

I think your suggested plan sounds good.

It would also be interesting if you could capture any experience or improvements that would make BufWriter easier to use from DataFusion.

@tustvold
Contributor Author

Your plan sounds good and should be relatively straightforward
