Better memory limiting in parquet ArrowWriter #5484

Closed
alamb opened this issue Mar 7, 2024 · 2 comments · Fixed by #5485
Labels
enhancement Any new improvement worthy of an entry in the changelog · parquet Changes to the parquet crate

Comments

alamb commented Mar 7, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
@DDtKey suggested in #5457 (review):

Describe the solution you'd like

I still think it would be nice to have an additional config (or method) to "enforce flush on buffer size", to encapsulate this logic rather than leaving it to the user's code 🤔

The idea is to add an additional option to force the writer to flush when its buffered data hits a certain limit.

Describe alternatives you've considered

The challenge is how to enforce buffer limiting without slowing down encoding. One idea would be to check memory usage after encoding each RecordBatch. This would be imprecise (the writer could go over the limit), as noted by @tustvold, but the overage would be bounded by the size of one RecordBatch (which the user can control).
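The check-after-each-batch idea can be sketched with a toy writer (this is not the arrow-rs API; the struct, field names, and byte-size accounting here are all hypothetical, chosen only to show the bounded-overshoot behavior):

```rust
// Hypothetical sketch: track bytes buffered for the open row group and
// flush once a target limit is crossed. The check happens only *after*
// a batch is written, so the buffer may overshoot the target by at most
// the size of that one batch.

struct LimitedWriter {
    buffered: usize,            // bytes buffered for the current row group
    target_buffer_size: usize,  // flush threshold, e.g. 10 * 1024 * 1024
    flushed_groups: Vec<usize>, // sizes of the row groups flushed so far
}

impl LimitedWriter {
    fn new(target_buffer_size: usize) -> Self {
        Self { buffered: 0, target_buffer_size, flushed_groups: Vec::new() }
    }

    /// Buffer one batch's worth of bytes, then check the limit.
    fn write_batch(&mut self, batch_bytes: usize) {
        self.buffered += batch_bytes;
        if self.buffered >= self.target_buffer_size {
            self.flush();
        }
    }

    /// Close out the current row group, if it holds any data.
    fn flush(&mut self) {
        if self.buffered > 0 {
            self.flushed_groups.push(self.buffered);
            self.buffered = 0;
        }
    }
}

fn main() {
    // Target of 10 "bytes", batches of 4: the first flush happens at 12,
    // overshooting the target by one batch (4), never more.
    let mut w = LimitedWriter::new(10);
    for _ in 0..5 {
        w.write_batch(4);
    }
    w.flush(); // close out the final partial row group
    println!("{:?}", w.flushed_groups); // prints [12, 8]
}
```

The design trade-off is visible in the numbers: precision is sacrificed (12 > 10) to avoid checking memory mid-batch, and the error stays bounded by one RecordBatch.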

This might look like adding something like this to the ArrowWriter:

let mut writer = ArrowWriter::try_new(&mut buffer, to_write.schema(), None)
    .unwrap()
    // flush when buffered parquet data exceeds 10MB
    .with_target_buffer_size(10 * 1024 * 1024);

Since not all of the parquet writers buffer their data like this, I don't think it makes sense to put the buffer size on the WriterProperties struct.

Additional context
@tustvold documented the current behavior better in #5457

@alamb alamb added the enhancement (Any new improvement worthy of an entry in the changelog) label Mar 7, 2024
tustvold commented Mar 8, 2024

I think we should probably just remove the confusing buffer_size argument and instead flush on row group persist. I'll file a PR to do this.

tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2024
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 8, 2024
tustvold added a commit that referenced this issue Mar 13, 2024
* Remove confusing buffer_size from AsyncArrowWriter (#5484)

* Review feedback
@tustvold tustvold added the parquet Changes to the parquet crate label Mar 15, 2024
@tustvold
label_issue.py automatically added labels {'parquet'} from #5457

mwylde added a commit to ArroyoSystems/arrow-rs that referenced this issue May 9, 2024