Add blog post for ieee754 total order and nan count #183
alamb
left a comment
Love it -- I have a suggestion for a specific example that is a little less esoteric than the all-NaN values example
| --- | ||
| title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts" | ||
| date: 2026-05-29 | ||
| description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance." |
| description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance." | |
| description: "How the Apache Parquet Community resolved potentially ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts" |
|
|
||
| ## Why Floating-Point Statistics Need Special Handling | ||
|
|
||
| For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match. |
I think this section would be stronger with a specific example: say, two columns of floating-point values, one with NaNs and one without, and a predicate like `where x > 1.0`.
My understanding is that the column without NaNs could be proven to match all rows, so the predicate can be skipped during execution.
However, the column containing a NaN doesn't match all rows.
Something like this:
100.0
200.0
NaN    <-- needs to be filtered out
300.0
Previously, most Parquet writers would write stats like:
min: 100.0
max: 300.0
A clever engine might then conclude that all rows match (for example, via the optimization described by @xudong963 in https://datafusion.apache.org/blog/2026/03/20/limit-pruning/), which in this case is incorrect.
The engine needs to know whether any NaNs appear in the data.
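A minimal Python sketch of the example above, assuming a writer that leaves NaN out of min/max. The helpers `legacy_min_max` and `all_rows_match_gt` are hypothetical names for illustration, not part of any Parquet library:

```python
import math

def legacy_min_max(values):
    """Min/max over non-NaN values, mimicking writers that skip NaN."""
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)

def all_rows_match_gt(stat_min, threshold, nan_count=None):
    """Can `x > threshold` be proven true for every row from statistics alone?

    nan_count=None means "unknown", so the proof must fail.
    """
    return stat_min > threshold and nan_count == 0

column = [100.0, 200.0, float("nan"), 300.0]
stat_min, stat_max = legacy_min_max(column)

# Rows that actually satisfy x > 1.0 (NaN > 1.0 evaluates to False).
matching = [v for v in column if v > 1.0]

# Looking only at min, every row appears to match, which is wrong here.
naive_claim = stat_min > 1.0

# With an explicit NaN count the engine can tell the proof does not hold.
nan_count = sum(math.isnan(v) for v in column)
safe_claim = all_rows_match_gt(stat_min, 1.0, nan_count)
```

Here `naive_claim` is true although only three of the four rows match, while `safe_claim` is false because the NaN count is nonzero.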
|
|
||
| Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data. | ||
|
|
||
| Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block. |
As above, I think this would be clearer with a motivating example
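One possible motivating snippet, sketched in Python, shows both behaviors the paragraph describes: comparisons against NaN are always false, and the two zeros compare equal despite different bit patterns. The helper `bits64` is a hypothetical name used only for this illustration:

```python
import struct

nan = float("nan")

# Every comparison with NaN is False, including equality with itself.
nan_comparisons = [1.0 < nan, 1.0 > nan, 1.0 == nan, nan == nan]

def bits64(x):
    """Return the raw IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# The two zeros compare equal under ordinary comparison...
zeros_equal = (-0.0 == 0.0)

# ...but their encodings differ in the sign bit.
zeros_same_bits = bits64(-0.0) == bits64(0.0)
```

Because `min` and `max` are built from these comparisons, two writers can legitimately disagree on the statistics for the same data unless the format pins down an ordering.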
| author: "[Jan Finis](https://github.com/JFinis), [Ed Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)" |
etseidl
left a comment
Just flushing a few suggestions.
I also am a bit uncomfortable with some of the hyperbole 😅 (e.g. "blazing speed", "small but mighty").
| Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data. | ||
|
|
| Parquet's approach to dealing with this ambiguity has been to mandate that a `min` of `0.0` always be written as `-0.0`, and a `max` of `0.0` always be written as `+0.0`, regardless of any sign bits that may be present in the actual data. Readers are advised that `-0.0` may be present even if the `min` is `+0.0`, and `+0.0` may be present even if the max is `-0.0`. | |
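The signed-zero rule in this suggestion can be sketched in a few lines of Python. This is an illustration only; `normalize_zero_bounds` is a hypothetical helper, not a function from any Parquet implementation:

```python
import math

def normalize_zero_bounds(stat_min, stat_max):
    """Apply the signed-zero rule to already-computed min/max statistics.

    A zero min is written as -0.0 and a zero max as +0.0, regardless of
    the sign bits present in the actual data.
    """
    if stat_min == 0.0:   # true for both -0.0 and +0.0
        stat_min = -0.0
    if stat_max == 0.0:
        stat_max = 0.0
    return stat_min, stat_max

# Data whose observed min was +0.0 and observed max was -0.0.
lo, hi = normalize_zero_bounds(0.0, -0.0)
lo_sign = math.copysign(1.0, lo)   # sign of the written min
hi_sign = math.copysign(1.0, hi)   # sign of the written max
```

The written bounds are then the widest possible pair of zeros, which is why readers cannot infer from them which zero actually occurs in the data.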
| Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block. | ||
|
|
| To date Parquet has followed the latter approach, forbidding the inclusion of `NaN` in the statistics. [PR #196](https://github.com/apache/parquet-format/pull/196) provides a detailed overview of the problems inherent in this approach. For instance, consider a page with a max statistic of `0.0` that also contains a `NaN`. A query engine that considers `NaN` to be greater than all values attempts a query with a predicate like `x > 1.0`. If the engine examines the statistics, it will see that the `max` is `0.0`, so it might improperly skip that page, even though it contains at least one row that satisfies the predicate. Without knowledge of the presence or absence of `NaN`, the engine cannot safely perform this type of page pruning for floating point columns. | |
| These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata. |
| These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata. | |
| [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249) exposed another critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata. |
|
|
||
| The resulting specification elegantly marries two concepts: an optional `nan_count` field for both `Statistics` and `ColumnIndex`, and the `IEEE_754_TOTAL_ORDER` column order. | ||
|
|
||
| `nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional in older files, readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`. |
| `nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional in older files, readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`. | |
| `nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional (or completely missing from older files), readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`. |
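To make the missing-versus-zero distinction concrete, here is a hedged Python sketch of a reader-side check for the predicate `x > threshold`. The function `can_skip_page_gt` and its `nan_is_largest` flag are assumptions made for illustration (the flag models an engine that orders NaN above all other values); neither is a real Parquet API:

```python
def can_skip_page_gt(stat_max, threshold, nan_count=None, nan_is_largest=True):
    """Decide whether a page can be skipped for the predicate `x > threshold`.

    stat_max excludes NaN. nan_count=None means the field is missing.
    nan_is_largest models an engine that sorts NaN above every other value.
    """
    if stat_max > threshold:
        return False          # some non-NaN value may match
    if not nan_is_largest:
        return True           # NaN never satisfies x > threshold in this engine
    # Only an explicit zero proves that no NaN is present; a missing
    # count (None) or a positive count forces the page to be read.
    return nan_count == 0

missing = can_skip_page_gt(0.0, 1.0, nan_count=None)   # unknown: keep the page
zero = can_skip_page_gt(0.0, 1.0, nan_count=0)         # proven NaN-free: skip
some = can_skip_page_gt(0.0, 1.0, nan_count=1)         # NaN present: keep
```

The key design point is that `None` and `0` take different branches: treating a missing count as zero would reintroduce the incorrect pruning.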
|
|
||
| Readers, conversely, should treat a missing `nan_count` with caution. Absence doesn't mean zero; it means "unknown," so `NaN` values may lurk inside. When `nan_count` is present, readers can pair it with `min` and `max` bounds to make hyper-efficient, safe pruning decisions. | ||
|
|
||
| Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros. |
| Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros. | |
| Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros, but also in the case of inequalities not involving `NaN`. |
|
Thank you all for the review! I'll try to address all comments as soon as I can.