GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files #37854
Conversation
Questions:
Based on the #28218 results, I think so.
When the scanner has multiple scanner threads and a prefetch depth (i.e. Prefetch Fragment Count), would this cause huge memory consumption?
Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
                                            const std::vector<int>& column_indices,
                                            std::unique_ptr<RecordBatchReader>* out) {
  RETURN_NOT_OK(BoundsCheck(row_groups, column_indices));

  if (reader_properties_.pre_buffer()) {
    // PARQUET-1698/PARQUET-1820: pre-buffer row groups/column chunks if enabled
    BEGIN_PARQUET_CATCH_EXCEPTIONS
    reader_->PreBuffer(row_groups, column_indices, reader_properties_.io_context(),
                       reader_properties_.cache_options());
    END_PARQUET_CATCH_EXCEPTIONS
  }

Here, PreBuffer will try to buffer the required RowGroups if necessary, and the memory will not be released until the read is finished. It's different from … Even when the policy is …
@mapleFU could you post those questions and remarks on the issue? (that might have a wider audience that can answer those, as many people already commented there)
Done. I think enabling this is OK; maybe we should also document how to handle high memory usage in the docs.
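As a side note for that documentation, here is a minimal sketch of a possible workaround, assuming the public pyarrow.dataset API (the path is a placeholder): pre-buffering can be turned off explicitly when memory usage matters more than throughput.

import pyarrow.dataset as ds

# Sketch: trade some throughput for lower peak memory by disabling pre-buffering.
# "path/to/parquet_dir" is a placeholder path.
scan_options = ds.ParquetFragmentScanOptions(pre_buffer=False)
parquet_format = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)

dataset = ds.dataset("path/to/parquet_dir", format=parquet_format)
table = dataset.to_table()  # issues one read per column chunk instead of coalesced pre-buffered reads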
python/pyarrow/_dataset_parquet.pyx
@@ -666,7 +666,7 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         Disabled by default.
     buffer_size : int, default 8192
         Size of buffered stream, if enabled. Default is 8KB.
-    pre_buffer : bool, default False
+    pre_buffer : bool, default True
         If enabled, pre-buffer the raw Parquet data instead of issuing one
         read per column chunk. This can improve performance on high-latency
         filesystems.
Possibly we should improve the docstring to also mention that you should disable this if you are concerned with memory usage over throughput? (Also, possibly make it clear that "high-latency filesystems" is likely to mean object stores like S3, GCS, etc.)
Yep, will do
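One possible wording for that expanded docstring, sketched from the two suggestions above (not the text that was actually committed):

pre_buffer : bool, default True
    If enabled, pre-buffer the raw Parquet data instead of issuing one
    read per column chunk. This can improve performance on high-latency
    filesystems such as object stores (e.g. S3, GCS) by coalescing small
    reads into larger ones, at the cost of higher peak memory usage.
    Disable it if you care more about memory usage than throughput.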
@lidavidm do you have any thoughts on the question in #37854 (comment)?
Oops, I missed that.
ArrowReaderProperties arrow_properties = default_arrow_reader_properties();
arrow_properties.set_pre_buffer(true);
arrow_properties.set_pre_buffer(false);
arrow_properties.set_cache_options(::arrow::io::CacheOptions::Defaults())
This might cause the compile to fail; I submitted a fix: #38069
Oh, sorry for completely forgetting to check all the failed builds ... 🤦
….cc` compile (#38069)

### Rationale for this change
This was introduced in a previous patch. This patch fixes the compile. (#37854)

### What changes are included in this PR?
A one-line fix.

### Are these changes tested?
No.

### Are there any user-facing changes?
No.

* Closes: #38068

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
So this PR introduced a failure in the "AMD64 Ubuntu 22.04 C++ ASAN UBSAN" build (https://github.com/apache/arrow/actions/runs/6430392691/job/17462667620?pr=38069#logs), related to the LazyCache coalesced reads. See details below. I assume this is an existing bug, given that this PR only changed the default of an option a user could already set before; but changing the default of course makes it more visible. A potential short-term option is to only change …
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit d7017dd. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
It seems that this also broke R Windows CI: https://github.com/apache/arrow/actions/runs/6428740951/job/17457269209
Rationale for this change

Enabling pre_buffer can give a significant speed-up on filesystems like S3, while it doesn't give a noticeable slowdown on local filesystems, based on benchmarks in the issue. Therefore simply enabling it by default seems the best default.

The option was already enabled by default in the pyarrow.parquet.read_table interface; this PR aligns the defaults when using pyarrow.dataset directly.
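For illustration, a short sketch of the now-aligned defaults (the S3 paths are placeholders):

import pyarrow.parquet as pq
import pyarrow.dataset as ds

# pyarrow.parquet.read_table already defaulted to pre_buffer=True ...
table_a = pq.read_table("s3://bucket/data.parquet")

# ... and with this change, scanning through pyarrow.dataset uses the same default,
# so no explicit ParquetFragmentScanOptions(pre_buffer=True) is needed anymore.
table_b = ds.dataset("s3://bucket/data/", format="parquet").to_table()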