[GLUTEN-12010][CH] Pass the correct values to ParquetInputFormat by zzcclp · Pull Request #12011 · apache/gluten

zzcclp · 2026-04-29T10:23:27Z

Updates the ClickHouse backend’s Parquet read path to use the configured Parquet input settings (rather than hardcoded defaults) when constructing/feeding the Parquet input formats.

Changes:

Thread format_settings.parquet.max_block_size into the local ParquetInputFormat wrapper and use it for row-index-only batch generation.
Select min_bytes_for_seek based on whether the underlying read is remote vs local, and pass it into the native Parquet input formats.
Remove the default ClickHouse setting local_engine.settings.log_processors_profiles = true from backend initialization.
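The remote-vs-local selection described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `SeekSettings` struct, its field names, and `selectMinBytesForSeek` are hypothetical stand-ins, and the real values come from ClickHouse's own settings objects.

```cpp
#include <cstddef>

// Hypothetical container for the two thresholds (assumption: the actual
// settings types and field names in ClickHouse/Gluten differ).
struct SeekSettings
{
    std::size_t local_min_bytes_for_seek;   // threshold for local file reads
    std::size_t remote_min_bytes_for_seek;  // threshold for remote (e.g. object store) reads
};

// Pick the threshold that matches the kind of underlying read, instead of
// passing one hardcoded value to the Parquet input format.
std::size_t selectMinBytesForSeek(const SeekSettings & settings, bool is_remote)
{
    return is_remote ? settings.remote_min_bytes_for_seek
                     : settings.local_min_bytes_for_seek;
}
```

The point of the split is that a seek on remote storage is usually far more expensive than on a local disk, so the two cases may warrant different thresholds.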

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Related issue: #12010

Pass the correct values to ParquetInputFormat

github-actions · 2026-04-29T10:23:57Z

Run Gluten Clickhouse CI on x86

zzcclp · 2026-04-29T10:26:32Z

    import org.apache.gluten.backendsapi.clickhouse.CHConfig._
    conf.setCHConfig(
-      "timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID),
-      "local_engine.settings.log_processors_profiles" -> "true")


The default value of log_processors_profiles is already true, so it does not need to be set again.

Better leave this here.

Copilot

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `cpp-ch/local-engine/Storages/SubstraitSource/ParquetFormatFile.cpp` | Uses configured Parquet `max_block_size` for row-index-only reads and uses remote vs local `min_bytes_for_seek` when creating Parquet input formats. |
| `backends-clickhouse/src/main/scala/org/apache/gluten/backendsapi/clickhouse/CHListenerApi.scala` | Changes default ClickHouse backend initialization config by removing `log_processors_profiles`. |

Copilot · 2026-04-29T10:27:32Z

            assert(outputHeader.columns());
            assert(row_index_reader);
-            // TODO: rebase-25.12, format_settings_.parquet.max_block_size
-            Columns cols{row_index_reader->readBatch(8192)};
+            Columns cols{row_index_reader->readBatch(max_block_size)};
            size_t rows = cols[0]->size();
            return Chunk(std::move(cols), rows);


This change alters the batch size used when only meta columns are requested (readHeader.columns()==0) to depend on format_settings.parquet.max_block_size. There doesn’t appear to be a unit/integration test covering that row-index-only path with a non-default max_block_size; adding one would help prevent regressions (e.g., ensuring chunk sizes follow the setting rather than a hardcoded constant).
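A check of the kind suggested here could look roughly like the sketch below. It is only an illustration under stated assumptions: `RowIndexReader` is a hypothetical stand-in for the reader in the diff (only the `readBatch(n)` call shape is taken from the hunk), and `chunkSizes` is a helper invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the row index reader used in the diff.
class RowIndexReader
{
public:
    explicit RowIndexReader(std::size_t total_rows) : total_rows_(total_rows) {}

    // Returns up to batch_size consecutive row indices; empty once exhausted.
    std::vector<std::uint64_t> readBatch(std::size_t batch_size)
    {
        const std::size_t n = std::min(batch_size, total_rows_ - next_row_);
        std::vector<std::uint64_t> batch(n);
        for (std::size_t i = 0; i < n; ++i)
            batch[i] = next_row_ + i;
        next_row_ += n;
        return batch;
    }

private:
    std::size_t total_rows_;
    std::size_t next_row_ = 0;
};

// Drains the reader with the configured max_block_size and records the size
// of every chunk, so a test can assert that chunk sizes follow the setting.
std::vector<std::size_t> chunkSizes(std::size_t total_rows, std::size_t max_block_size)
{
    RowIndexReader reader(total_rows);
    std::vector<std::size_t> sizes;
    while (true)
    {
        const auto batch = reader.readBatch(max_block_size);
        if (batch.empty())
            break;
        sizes.push_back(batch.size());
    }
    return sizes;
}
```

With a non-default block size, every chunk except possibly the last should contain exactly `max_block_size` rows; a regression to a hardcoded constant would show up as a different chunk count.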

Copilot · 2026-04-29T10:27:33Z

    // Add configs
    import org.apache.gluten.backendsapi.clickhouse.CHConfig._
    conf.setCHConfig(
-      "timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID),
-      "local_engine.settings.log_processors_profiles" -> "true")
+      "timezone" -> conf.get("spark.sql.session.timeZone", TimeZone.getDefault.getID))
    conf.setCHSettings("spark_version", SPARK_VERSION)


The PR description/title is about passing correct values to ParquetInputFormat, but this hunk also removes the default CH setting local_engine.settings.log_processors_profiles = true. If this behavior change is intended, it should be called out in the PR description (or moved to a separate PR) since it changes runtime logging/profiling defaults for the ClickHouse backend.

github-actions · 2026-04-29T14:08:23Z

Run Gluten Clickhouse CI on x86

github-actions · 2026-04-29T14:16:22Z

Run Gluten Clickhouse CI on x86

zhanglistar

LGTM

[GLUTEN-12010][CH] Pass the correct values to ParquetInputFormat

ae687f0

Pass the correct values to ParquetInputFormat

github-actions Bot added the CLICKHOUSE label Apr 29, 2026

zzcclp requested a review from Copilot April 29, 2026 10:23

zzcclp requested a review from zhanglistar April 29, 2026 10:23

Copilot started reviewing on behalf of zzcclp April 29, 2026 10:24

zzcclp commented Apr 29, 2026

Copilot AI reviewed Apr 29, 2026

revert

819c55b

add comment

6143c2c

zzcclp requested a review from lgbo-ustc April 30, 2026 02:10

zhanglistar approved these changes Apr 30, 2026

zzcclp merged commit 93de3c8 into apache:main Apr 30, 2026
7 checks passed

zzcclp merged 3 commits into apache:main from zzcclp:gluten-12010
