Skip to content

[refactor](be) Simplify new parquet complex reader assembly#64078

Merged
suxiaogang223 merged 20 commits into
apache:refact_reader_branchfrom
suxiaogang223:codex/simplify-file-reader-schema
Jun 4, 2026
Merged

[refactor](be) Simplify new parquet complex reader assembly#64078
suxiaogang223 merged 20 commits into
apache:refact_reader_branchfrom
suxiaogang223:codex/simplify-file-reader-schema

Conversation

@suxiaogang223
Copy link
Copy Markdown
Member

What changed

  • Simplified the file reader schema layout and documented the intent.
  • Removed the parquet shape-only reader wrapper and let unprojected nested children advance through their original reader skip path.
  • Refactored new parquet MAP/LIST nested assembly toward local reader-owned Dremel traversal.
  • Localized MAP-only repeated assembly helpers in MapColumnReader.
  • Simplified nested scalar batch state by removing values_written and omitting value_indices for dense nested leaf batches.
  • Updated complex column refactor documentation with the current Phase 3/4 status.

Why

This keeps Doris new parquet complex column handling closer to the intended reader layering: LIST owns ColumnArray assembly, MAP owns ColumnMap assembly, and shared nested helpers only keep the state that multiple readers actually need.

Validation

  • Local git diff --check.
  • Fedora /home/socrates/code/doris: BUILD_TYPE=DEBUG ./build.sh --be passed after each code step.
  • UT not run in this round.

Notes

  • PR target: apache/doris refact_reader_branch.
  • Head branch: suxiaogang223:codex/simplify-file-reader-schema.

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: New parquet reader kept dead projected schema helpers and TableReader stored file block layout as full SchemaField trees. This made the file-reader schema boundary harder to reason about and duplicated complex projected type rebuild logic.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Local: git diff --check
    - Local: verified no remaining _fill_projected_schema_field/_get_projected_schema_field/block_schema/FILE_NAME = 2 references under be/src/format and be/test/format
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Clarify table reader comments after replacing block_schema with file_block_layout.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Remove the ShapeOnlyColumnReader wrapper from the new Parquet reader tree. Unprojected STRUCT children now keep their original reader and advance through child_output_indices=-1 skip semantics, while row position reader is split into its own component. Update the DuckDB comparison/refactor docs and internal layering status accordingly.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh on changed BE files.
    - Ran git diff --check and git diff --cached --check.
    - Verified no ShapeOnlyColumnReader/shape_only_column_reader references remain in BE source/tests.
    - Fedora DEBUG BE build from the previous branch revision completed successfully as cache warm-up; this commit itself has not yet been compiled on Fedora.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Keep Doris new Parquet MAP materialization centered on MapColumnReader and ColumnMap instead of blindly mapping it to LIST<STRUCT>. Update the refactor docs accordingly and start Phase 2 by extracting MapColumnReader child resolution and output finalization helpers without changing MAP read semantics.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh on map_column_reader.cpp.
    - Ran git diff --check.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Continue the Doris-aware MAP reader refactor by sharing the scalar and struct MAP value aligned read/skip logic through local helpers, while keeping MapColumnReader and ColumnMap materialization unchanged. Mark the first Phase 2 helper step complete in the refactor document.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran build-support/clang-format.sh on map_column_reader.cpp.
    - Ran git diff --check.
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Refactor the new parquet MAP reader so Map<K, List<V>> uses a dedicated list-value entry helper while keeping Doris ColumnMap materialization and value-list driven nested shape handling.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Update the complex column refactor feasibility document to reflect the completed MAP entry/value stream refactor phase.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: Yes
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Inline LIST Dremel level assembly in the new parquet ListColumnReader read and skip paths, removing its dependency on the generic repeated sink templates while preserving overflow behavior.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Remove the generic LIST repeated sink templates that became unused after inlining LIST level assembly in the new parquet ListColumnReader.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Remove values_written from NestedScalarBatch and keep the value count local to Arrow leaf batch construction and overflow compaction.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Ran git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Localize MAP-only repeated level assembly and value stream helpers in the new parquet MapColumnReader after LIST Dremel assembly was inlined, reducing the public nested reader helper surface. Update the complex column refactor document with the current Phase 3 and Phase 4 status.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Avoid materializing NestedScalarBatch value_indices for dense nested leaf batches without repetition filtering, and keep overflow tail compaction in the same dense representation. Update the complex column refactor document with the Phase 4 status.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

suxiaogang223 and others added 8 commits June 3, 2026 23:03
…ng tasks

Remove completed analysis documents (SchemaField merge, DuckDB comparison,
projection/filtering analysis). Consolidate remaining work into a single
document: NestedScalarBatch value_indices removal plan, test gaps, and
known limitations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Keep LIST Dremel traversal local to ListColumnReader while extracting the repeated stream mechanics into a LIST-local helper, add comments explaining LIST/MAP local traversal boundaries, and update the follow-up value cursor refactor plan.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64078

Problem Summary: Update the new parquet reader layering document to match the current LIST/MAP local helper structure and remaining work.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- docs/doris-new-parquet-reader-internal-layering.md
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64078

Problem Summary: Record the planned Arrow page-level skip path for the new parquet reader layering document.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check -- docs/doris-new-parquet-reader-internal-layering.md
- Behavior changed: No
- Does this need documentation: No
Add four independent task documents:
- page-level-skip-plan.md: Arrow PageReader::set_data_page_filter() approach
- observability-profile-plan.md: Scheduler/reader/nested/adapter metrics
- ut-coverage-plan.md: Nullable boundaries, filter interaction, edge cases
- complex-column-predicate-and-stats-filtering.md: Nested field filter
  localization and statistics pruning for complex types

Remove doris-new-parquet-reader-internal-layering.md — P3 (schema
change) already handled by TableReader/ColumnMapper. Remaining tasks
consolidated into the new documents.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64078

Problem Summary: Replace NestedScalarBatch value_indices with per-stream NestedScalarValueCursor for LIST, MAP, and STRUCT nested scalar value consumption.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - build-support/clang-format.sh be/src/format/new_parquet/reader/nested_column_reader.h be/src/format/new_parquet/reader/nested_column_reader.cpp be/src/format/new_parquet/reader/arrow_leaf_reader_adapter.cpp be/src/format/new_parquet/reader/list_column_reader.cpp be/src/format/new_parquet/reader/map_column_reader.cpp be/src/format/new_parquet/reader/struct_column_reader.cpp
    - rg -n 'value_indices|nested_scalar_value_index' be/src/format/new_parquet/reader
    - git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64078

Problem Summary: Fix the MutableColumnPtr null check used by NestedScalarValueCursor so the new parquet reader compiles.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
    - rg -n 'values_column != nullptr|value_indices|nested_scalar_value_index' be/src/format/new_parquet/reader
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: apache#64078

Problem Summary: Remove the completed complex column cursor refactor feasibility document after the planned value_indices cleanup is implemented.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - git diff --check
    - rg -n 'value_indices|nested_scalar_value_index' be/src/format/new_parquet/reader docs
- Behavior changed: No
- Does this need documentation: No
@suxiaogang223 suxiaogang223 marked this pull request as ready for review June 3, 2026 16:37
@suxiaogang223 suxiaogang223 merged commit d6487e8 into apache:refact_reader_branch Jun 4, 2026
18 of 20 checks passed
@suxiaogang223 suxiaogang223 deleted the codex/simplify-file-reader-schema branch June 4, 2026 07:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants