fix: Avoid panicing when stats are not available for a file group split#23277

Open
mkleen wants to merge 6 commits into
apache:mainfrom
mkleen:missing_statistics_partitioned

mkleen commented Jul 1, 2026

Contributor

Which issue does this PR close?

Rationale for this change

The query from the issue:

SELECT   (((Cast(id AS BIGINT) % 1024) + 1024) % 1024) AS computed_bucket
FROM     profile
ORDER BY computed_bucket,
         Cast(id AS BIGINT) limit 10;

panics:

thread 'main' panicked at .../datafusion-datasource-54.0.0/src/statistics.rs:100:48:
index out of bounds: the len is 0 but the index is 0

The underlying issue is that the current code panics when files are split by statistics and no statistics are available for the column that defines the sort order (in this case, computed_bucket).

What changes are included in this PR?

  • Fix in MinMaxStatistics to check if there are stats available for a given column
  • Test
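As a rough illustration of the kind of check described above, the sketch below replaces a direct index into per-column statistics with a bounds-checked lookup that returns an error. It is not the code from this PR: `ColumnStats`, `StatsError`, and `min_max_for_column` are hypothetical stand-ins that use only the standard library, and the real fix lives in `MinMaxStatistics` and returns a DataFusion error.

```rust
/// Hypothetical stand-in for the min/max statistics of one column.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStats {
    min: i64,
    max: i64,
}

/// Hypothetical error type used only for this sketch.
#[derive(Debug, PartialEq)]
struct StatsError(String);

/// Look up the min/max pair for `column_index`.
///
/// `stats[column_index]` would panic with "index out of bounds" when the
/// slice is empty; `get` returns `None` instead, which is mapped to an error
/// that names the requested index and the number of columns with statistics.
fn min_max_for_column(
    stats: &[ColumnStats],
    column_index: usize,
) -> Result<(i64, i64), StatsError> {
    stats
        .get(column_index)
        .map(|s| (s.min, s.max))
        .ok_or_else(|| {
            StatsError(format!(
                "statistics not found for column index {column_index}; \
                 statistics are available for {} column(s)",
                stats.len()
            ))
        })
}
```

With an empty slice, which mirrors the "len is 0 but the index is 0" case in the panic message, `min_max_for_column(&[], 0)` yields an `Err` rather than aborting the query.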

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Jul 1, 2026
@mkleen mkleen force-pushed the missing_statistics_partitioned branch from f8c0e85 to cba5503 Compare July 1, 2026 10:26
@mkleen mkleen changed the title fix: Avoid panicing when min/max stats are not available for a group split fix: Avoid panicing when stats are not available for a group split Jul 1, 2026
@mkleen mkleen changed the title fix: Avoid panicing when stats are not available for a group split fix: avoid panicing when stats are not available for a group split Jul 1, 2026
@mkleen mkleen changed the title fix: avoid panicing when stats are not available for a group split fix: Avoid panicing when stats are not available for a group split Jul 1, 2026
@mkleen mkleen marked this pull request as ready for review July 1, 2026 10:54
@mkleen mkleen force-pushed the missing_statistics_partitioned branch from 56c9cef to 3490ba9 Compare July 1, 2026 12:41
@mkleen mkleen changed the title fix: Avoid panicing when stats are not available for a group split fix: Avoid panicing when stats are not available for a file group split Jul 1, 2026

@comphead comphead left a comment

Contributor

Thanks @mkleen! WDYT, would it be possible to add the test as an SLT test as well?

Comment thread datafusion/datasource/src/statistics.rs Outdated

@notfilippo notfilippo left a comment

Member

Thanks! We are seeing this issue as well in our DF54 upgrade. The fix matches our patch downstream.

Comment thread datafusion/datasource/src/statistics.rs Outdated

mkleen commented Jul 3, 2026

Contributor Author

@comphead @geoffreyclaude @notfilippo Thanks for the review! I addressed all of your feedback, but I am having trouble reproducing the panic in a plain SLT. I will give it another try. The problem is the data creation.

@alamb alamb left a comment

Contributor

Error rather than panic seems better to me.

It is not clear to me that this condition can happen during normal operation (e.g. it looks like maybe it happens due to a bug in partition reporting).

Thanks @mkleen

Comment thread datafusion/datasource/src/statistics.rs Outdated
{
    Ok((partition_value.clone(), partition_value.clone()))
} else {
    Err(plan_datafusion_err!("statistics not found"))

Contributor

Should this be an internal error (rather than a planning error)?

Also, it might be nice to print the requested statistics value (partition_value) and the total number of columns in the error message to help debug issues:

                            Err(plan_datafusion_err!("statistics not found for partition {partition_value}, expected at most {}", s.column_statistics.len() - 1))

alamb commented Jul 3, 2026

Contributor

> Thanks! We are seeing this issue as well in our DF54 upgrade. The fix matches our patch downstream.

Added as candidate for backport to 54.1.0

@notfilippo do you have other issues you have seen during upgrade that you think we should backport?

@notfilippo

Member

> have other issues

@alamb thanks for the question! As of now, nothing else has surfaced. If something else comes up, I'll update the backport issue thread or create a new issue.

@mkleen mkleen force-pushed the missing_statistics_partitioned branch 2 times, most recently from e576273 to 6083050 Compare July 3, 2026 14:55
@mkleen mkleen force-pushed the missing_statistics_partitioned branch from 6083050 to 279342b Compare July 3, 2026 15:14

Labels

datasource Changes to the datasource crate

Development

Successfully merging this pull request may close these issues.

Panic in DataFusion 54.0.0 when ordering Parquet scan by computed projection alias

5 participants