Remove getOrAddPerField calls in processBatch #16118

Merged: ChrisHegarty merged 5 commits into apache:main from Tim-Brooks:columnar_less_hash_lookups on May 25, 2026

@Tim-Brooks
Contributor

The schema-validation pass in processBatch already resolves a PerField
for every column via getOrAddPerField. The row-oriented and
column-oriented passes were each re-resolving the same PerField by name,
costing two extra hash lookups per column per batch.

Cache the validated PerField into docFields[] keyed by the column's
original position, and have both downstream passes read from that
cache. processRowColumns now keeps a local int[] rowPfIndices mapping
row-mode slots back to their original column position, so the inner
loop can index docFields[] directly. docFields is grown in the
validation pass instead of inside processRowColumns.
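
For readers unfamiliar with the surrounding code, the idea can be sketched roughly as below. This is a simplified illustration, not the code in this PR: `PerFieldCacheSketch`, the stand-in `PerField` class, the `rowMode` flags, the `valuesSeen` counter, and the `hashLookups` counter are all hypothetical, and only the names `getOrAddPerField`, `processBatch`, `docFields`, and `rowPfIndices` come from the description above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the caching described above; not the actual Lucene code. */
final class PerFieldCacheSketch {

  /** Stand-in for per-field indexing state (assumption: the real PerField holds far more). */
  static final class PerField {
    final String name;
    int valuesSeen;

    PerField(String name) {
      this.name = name;
    }
  }

  private final Map<String, PerField> fieldsByName = new HashMap<>();
  private PerField[] docFields = new PerField[2];

  /** Counts by-name resolutions so the saving is observable (illustration only). */
  int hashLookups;

  PerField getOrAddPerField(String name) {
    hashLookups++;
    return fieldsByName.computeIfAbsent(name, PerField::new);
  }

  void processBatch(String[] columnNames, boolean[] rowMode, int rowCount) {
    // Validation pass: grow docFields here, then resolve each column exactly once
    // and cache the result at the column's original position.
    if (docFields.length < columnNames.length) {
      docFields = Arrays.copyOf(docFields, columnNames.length);
    }
    int rowSlots = 0;
    for (int col = 0; col < columnNames.length; col++) {
      docFields[col] = getOrAddPerField(columnNames[col]);
      if (rowMode[col]) {
        rowSlots++;
      }
    }

    // Row-oriented pass: rowPfIndices maps each row-mode slot back to its
    // original column position, so the inner loop indexes docFields directly.
    int[] rowPfIndices = new int[rowSlots];
    int slot = 0;
    for (int col = 0; col < columnNames.length; col++) {
      if (rowMode[col]) {
        rowPfIndices[slot++] = col;
      }
    }
    for (int row = 0; row < rowCount; row++) {
      for (int s = 0; s < rowSlots; s++) {
        docFields[rowPfIndices[s]].valuesSeen++;
      }
    }

    // Column-oriented pass: read the cached PerField instead of re-resolving by name.
    for (int col = 0; col < columnNames.length; col++) {
      if (!rowMode[col]) {
        docFields[col].valuesSeen += rowCount;
      }
    }
  }
}
```

In this sketch each batch performs one by-name lookup per column (in the validation pass) rather than one in every pass that touches the column.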

@github-actions github-actions Bot added this to the 10.5.0 milestone May 25, 2026
Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

@ChrisHegarty ChrisHegarty merged commit 807bb4e into apache:main May 25, 2026
13 checks passed
ChrisHegarty pushed a commit that referenced this pull request May 25, 2026