[GLUTEN-12010][CH] Pass the correct values to ParquetInputFormat#12011
Conversation
Pass the correct values to ParquetInputFormat
|
Run Gluten Clickhouse CI on x86 |
| import org.apache.gluten.backendsapi.clickhouse.CHConfig._ | ||
| conf.setCHConfig( | ||
| "timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID), | ||
| "local_engine.settings.log_processors_profiles" -> "true") |
There was a problem hiding this comment.
The default value of the log_processors_profiles is true, don't need to set again.
There was a problem hiding this comment.
Better leave this here.
There was a problem hiding this comment.
Pull request overview
Updates the ClickHouse backend’s Parquet read path to use the configured Parquet input settings (rather than hardcoded defaults) when constructing/feeding the Parquet input formats.
Changes:
- Thread
format_settings.parquet.max_block_sizeinto the localParquetInputFormatwrapper and use it for row-index-only batch generation. - Select
min_bytes_for_seekbased on whether the underlying read is remote vs local, and pass it into the native Parquet input formats. - Remove the default ClickHouse setting
local_engine.settings.log_processors_profiles = truefrom backend initialization.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| cpp-ch/local-engine/Storages/SubstraitSource/ParquetFormatFile.cpp | Uses configured Parquet max_block_size for row-index-only reads and uses remote vs local min_bytes_for_seek when creating Parquet input formats. |
| backends-clickhouse/src/main/scala/org/apache/gluten/backendsapi/clickhouse/CHListenerApi.scala | Changes default ClickHouse backend initialization config by removing log_processors_profiles. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| assert(outputHeader.columns()); | ||
| assert(row_index_reader); | ||
| // TODO: rebase-25.12, format_settings_.parquet.max_block_size | ||
| Columns cols{row_index_reader->readBatch(8192)}; | ||
| Columns cols{row_index_reader->readBatch(max_block_size)}; | ||
| size_t rows = cols[0]->size(); | ||
| return Chunk(std::move(cols), rows); |
There was a problem hiding this comment.
This change alters the batch size used when only meta columns are requested (readHeader.columns()==0) to depend on format_settings.parquet.max_block_size. There doesn’t appear to be a unit/integration test covering that row-index-only path with a non-default max_block_size; adding one would help prevent regressions (e.g., ensuring chunk sizes follow the setting rather than a hardcoded constant).
| // Add configs | ||
| import org.apache.gluten.backendsapi.clickhouse.CHConfig._ | ||
| conf.setCHConfig( | ||
| "timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID), | ||
| "local_engine.settings.log_processors_profiles" -> "true") | ||
| "timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID)) | ||
| conf.setCHSettings("spark_version", SPARK_VERSION) |
There was a problem hiding this comment.
The PR description/title is about passing correct values to ParquetInputFormat, but this hunk also removes the default CH setting local_engine.settings.log_processors_profiles = true. If this behavior change is intended, it should be called out in the PR description (or moved to a separate PR) since it changes runtime logging/profiling defaults for the ClickHouse backend.
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
Updates the ClickHouse backend’s Parquet read path to use the configured Parquet input settings (rather than hardcoded defaults) when constructing/feeding the Parquet input formats.
Changes:
What changes are proposed in this pull request?
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?
Related issue: #12010