
Add blog post for ieee754 total order and nan count #183

Open
wgtmac wants to merge 1 commit into apache:production from wgtmac:ieee754_blog

Conversation

Member

wgtmac commented May 29, 2026

Collaborator

alamb left a comment

Love it -- I had a suggestion for a specific example that is a little less esoteric than the all-NaN example

---
title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts"
date: 2026-05-29
description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance."
Collaborator

Suggested change
description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance."
description: "How the Apache Parquet Community resolved potentially ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts"


## Why Floating-Point Statistics Need Special Handling

For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match.
Collaborator

I think this section would be stronger with a specific example, such as two floating-point columns, one with NaNs and one without, and a predicate like `where x > 1.0`.

My understanding is that for the column without NaNs, the statistics can prove that all rows match, so the predicate can be skipped during execution.

However, in the column that contains a NaN, not all rows match.

Something like this:

100.0
200.0
NaN    <-- needs to be filtered out
300.0

Previously, most Parquet writers would write stats like

min: 100.0
max: 300.0

And a clever engine might conclude that all rows match (for example, via the optimization described by @xudong963 in https://datafusion.apache.org/blog/2026/03/20/limit-pruning/), which in this case is incorrect.

The engine needs to know whether any NaNs appear in the data.
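
The scenario in this comment can be reproduced with a minimal Python sketch. It assumes a writer that computes `min`/`max` while skipping NaN; the helper name `legacy_stats` is hypothetical and is not taken from any Parquet library.

```python
import math

def legacy_stats(values):
    """Hypothetical writer behavior: compute min/max while ignoring NaN."""
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)

column = [100.0, 200.0, float("nan"), 300.0]
lo, hi = legacy_stats(column)

# An engine that trusts the bounds alone concludes that every row
# satisfies `x > 1.0`, because the minimum is above the threshold.
all_rows_match_per_stats = lo > 1.0

# Evaluating the predicate row by row disagrees:
# NaN > 1.0 is False, so one row has to be filtered out.
matching_rows = [v for v in column if v > 1.0]

print(lo, hi)                           # 100.0 300.0
print(all_rows_match_per_stats)         # True
print(len(matching_rows), len(column))  # 3 4
```

The bounds look clean, yet only three of the four rows pass the filter, which is the gap a NaN count is meant to close.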


Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data.

Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block.
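
Both properties can be checked directly in Python, whose `float` is an IEEE 754 double on common platforms. The snippet below is only an illustration; the `bits` helper is a hypothetical name introduced here.

```python
import math
import struct

def bits(x: float) -> str:
    """Return the 64-bit IEEE 754 pattern of a Python float as hex."""
    return struct.pack(">d", x).hex()

neg_zero, pos_zero, nan = -0.0, 0.0, float("nan")

# Signed zeros compare equal but differ in the sign bit.
print(neg_zero == pos_zero)            # True
print(bits(neg_zero), bits(pos_zero))  # 8000000000000000 0000000000000000

# Every ordered comparison with NaN is False, including equality with itself.
print(1.0 < nan, 1.0 > nan, nan == nan)  # False False False

# As a result, a naive min depends on argument order once NaN is involved.
print(min(1.0, nan), min(nan, 1.0))    # 1.0 nan
```

The last line shows why a writer that feeds NaN into an ordinary `min` or `max` can produce different bounds for the same data, depending only on the order in which values are visited.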
Collaborator

As above, I think this would be clearer with a motivating example

Collaborator

alamb commented May 29, 2026

FYI @JFinis and @etseidl -- perhaps you want to take over this post (or rewrite it into something more in your voice)

author: "[Jan Finis](https://github.com/JFinis), [Ed Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)"
Contributor

Thanks @wgtmac, but as @alamb knows well, it would be better for all concerned to not include me as an author. Doing so will delay releasing this by months 😮 😭

Happy to help edit though.

Contributor

etseidl left a comment

Just flushing a few suggestions.

I also am a bit uncomfortable with some of the hyperbole 😅 (e.g. "blazing speed", "small but mighty").

Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data.

Contributor

Suggested change
Parquet's approach to dealing with this ambiguity has been to mandate that a `min` of `0.0` always be written as `-0.0`, and a `max` of `0.0` always be written as `+0.0`, regardless of any sign bits that may be present in the actual data. Readers are advised that `-0.0` may be present even if the `min` is `+0.0`, and `+0.0` may be present even if the max is `-0.0`.
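
The zero-bound rule described in this suggestion can be sketched in a few lines of Python. The function name `normalize_zero_bounds` is hypothetical, and the sketch only illustrates the rule as stated here rather than any particular writer's code.

```python
import math

def normalize_zero_bounds(min_value: float, max_value: float):
    """Store a zero min as -0.0 and a zero max as +0.0, whatever sign the data had."""
    if min_value == 0.0:   # True for both -0.0 and +0.0
        min_value = -0.0
    if max_value == 0.0:   # True for both -0.0 and +0.0
        max_value = 0.0
    return min_value, max_value

# Bounds computed from data whose zeros happened to carry "unexpected" signs.
lo, hi = normalize_zero_bounds(0.0, -0.0)

# copysign exposes the sign bit, which `==` cannot see.
print(math.copysign(1.0, lo), math.copysign(1.0, hi))  # -1.0 1.0
```

Widening the bounds this way keeps every writer's output identical for zero-valued extremes, at the cost that a reader cannot tell from the bounds alone which zero was actually present.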

Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block.

Contributor

Suggested change
To date, Parquet has followed the latter approach, forbidding the inclusion of `NaN` in the statistics. [PR #196](https://github.com/apache/parquet-format/pull/196) provides a detailed overview of the problems inherent in this approach. For instance, consider a page with a `max` statistic of `0.0` that also contains a `NaN`. Suppose a query engine that considers `NaN` to be greater than all other values runs a query with a predicate like `x > 1.0`. If the engine examines the statistics, it will see that the `max` is `0.0` and might improperly skip that page, even though it contains at least one row that satisfies the predicate. Without knowing whether `NaN` is present, the engine cannot safely perform this type of page pruning for floating-point columns.
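
The page-skipping hazard in this paragraph can be shown with a short Python sketch. It assumes an engine in which NaN sorts above every number; the helper `nan_largest_gt` is a hypothetical stand-in for such an engine's comparison and does not correspond to a real API.

```python
import math

def nan_largest_gt(value: float, threshold: float) -> bool:
    """`value > threshold` in a hypothetical engine that ranks NaN above all numbers."""
    return math.isnan(value) or value > threshold

page = [-1.0, 0.0, float("nan")]

# Statistics written with NaN excluded, as the current rules require.
page_max = max(v for v in page if not math.isnan(v))

# Pruning on the bound alone: max is 0.0, so `x > 1.0` looks impossible.
skip_by_stats = not (page_max > 1.0)

# Evaluating the engine's predicate on the actual rows finds a match.
rows_that_match = [v for v in page if nan_largest_gt(v, 1.0)]

print(page_max)              # 0.0
print(skip_by_stats)         # True
print(len(rows_that_match))  # 1
```

The page would be skipped even though one of its rows satisfies the predicate under that engine's semantics.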


These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.
Contributor

Suggested change
These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.
[PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249) exposed another critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.

These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.

Contributor

Suggested change


The resulting specification elegantly marries two concepts: an optional `nan_count` field for both `Statistics` and `ColumnIndex`, and the `IEEE_754_TOTAL_ORDER` column order.
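
For readers unfamiliar with the IEEE 754 total order, the following Python sketch shows one common way to realize it for doubles: reinterpret the bits as a signed integer and flip the magnitude bits of negative values (the same trick Rust's `f64::total_cmp` uses). The names `total_order_key` and `from_bits` are hypothetical, and this is an illustration of the ordering itself, not of how any Parquet implementation computes statistics.

```python
import struct

def total_order_key(x: float) -> int:
    """Map a double to an integer whose natural order matches IEEE 754 totalOrder."""
    (bits,) = struct.unpack(">q", struct.pack(">d", x))  # signed 64-bit view
    # Negative-signed values: flip the 63 magnitude bits so that
    # larger magnitudes sort lower.
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits

def from_bits(hex_pattern: str) -> float:
    """Build a double from an explicit bit pattern (used here for signed NaNs)."""
    (value,) = struct.unpack(">d", bytes.fromhex(hex_pattern))
    return value

neg_nan = from_bits("fff8000000000000")
pos_nan = from_bits("7ff8000000000000")
values = [1.0, pos_nan, 0.0, float("-inf"), -0.0, neg_nan, float("inf"), -1.0]

ordered = sorted(values, key=total_order_key)
print(ordered)
```

Under this ordering every value has a definite position: negative NaN comes first, then `-inf`, the negative numbers, `-0.0`, `+0.0`, the positive numbers, `+inf`, and finally positive NaN.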

`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional in older files, readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`.
Contributor

Suggested change
`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional in older files, readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`.
`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional (or completely missing from older files), readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`.


Readers, conversely, should treat a missing `nan_count` with caution. Absence doesn't mean zero; it means "unknown," so `NaN` values may lurk inside. When `nan_count` is present, readers can pair it with `min` and `max` bounds to make hyper-efficient, safe pruning decisions.
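
The distinction between a missing `nan_count` and a `nan_count` of `0` can be sketched as a three-state check. The function below is hypothetical: it assumes the `min` bound covers only non-NaN values, that the column has no nulls, and that the engine evaluates `NaN > threshold` as false.

```python
from typing import Optional

def all_rows_match_gt(min_value: float, nan_count: Optional[int], threshold: float) -> bool:
    """Return True only when the statistics prove every row satisfies `x > threshold`."""
    if nan_count is None:
        # Missing means "unknown": NaN values may be present, so nothing is proven.
        return False
    # With a known count, the bound is conclusive only if no NaN is present.
    return nan_count == 0 and min_value > threshold

print(all_rows_match_gt(100.0, None, 1.0))  # False: NaN presence unknown
print(all_rows_match_gt(100.0, 0, 1.0))     # True: no NaN, min above threshold
print(all_rows_match_gt(100.0, 1, 1.0))     # False: one NaN row fails the predicate
```

Treating `None` and `0` identically would reintroduce the incorrect "all rows match" conclusion for older files.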

Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros.
Contributor

Suggested change
Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros.
Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros, but also in the case of inequalities not involving `NaN`.

Member Author

wgtmac commented Jun 2, 2026

Thank you all for the review! I'll try to address all comments as soon as I can.



Development

Successfully merging this pull request may close these issues.

Blog on Floating Point Statistics Improvement (order + Nan counts)

3 participants