Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Custom Buffer Implementation with Bytes in Parquet #1474

Closed
tustvold opened this issue Mar 22, 2022 · 4 comments · Fixed by #5055
Closed

Replace Custom Buffer Implementation with Bytes in Parquet #1474

tustvold opened this issue Mar 22, 2022 · 4 comments · Fixed by #5055
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The parquet crate currently rolls its own ByteBufferPtr, BufferPtr, Buffer, etc... and this feels like functionality that could be provided by an upstream crate.

Describe the solution you'd like

The bytes crate has wide ecosystem adoption and supports most of the necessary functionality. It does not support memory tracking, but I'm not sure this is being used given the util::memory module was made experimental and therefore not public in #1134 and there hasn't been any wailing nor gnashing of teeth resulting from this.

Describe alternatives you've considered

Continue to use custom buffer abstractions.

Tagging @sunchao as I believe he added this code all the way back in 2018 😅

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label Mar 22, 2022
@sunchao
Copy link
Member

sunchao commented Mar 23, 2022

Thanks. Yes these are pretty out-dated and I think it will be good to replace them. Have you thought about just use MutableBuffer from Arrow side too?

@tustvold
Copy link
Contributor Author

Have you thought about just use MutableBuffer from Arrow side too?

I hadn't, but I'm not sure it is a good fit

  • I would like to use the ecosystem standard abstraction if possible
  • arrow is currently an optional dependency of parquet
  • MutableBuffer is really designed for arrays of primitives, not raw byte arrays
  • I have a long-term hope to eventually phase out MutableBuffer and replace it with a typed construction that is easier to use without unsafe. Something with a similar interface to the ScalarBuffer I added to parquet might be a candidate

@nevi-me
Copy link
Contributor

nevi-me commented Mar 23, 2022

If possible, we could use Arrow's buffer based on the arrow feature, then use some abstraction (I'd be fine with bytes) for the other cases. The perf cliff is whenever we create multiple small ByteBuffer instances (e.g. representing vec!["hello", "there"]as 2 instances instead of a singleByteBuffer` with offsets into the 2 values. I think having a single buffer per page/row group would be helpful.

The upside of using Arrow's buffer is minimising/eliminating data copies. I was able to improve the Arrow side here (#820), and see @alamb's comment (#820 (comment)).

I have a long-term hope to eventually phase out MutableBuffer and replace it with a typed construction that is easier to use without unsafe. Something with a similar interface to the ScalarBuffer I added to parquet might be a candidate

This would be great, as it seems that a lot of the safety (and some perf) issues lie there.

@alamb
Copy link
Contributor

alamb commented Mar 23, 2022

I think using Bytes would have many advantages (such as potentially avoiding copies from data that are read from remote object stores. etc)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants