Add blog post for ieee754 total order and nan count by wgtmac · Pull Request #183 · apache/parquet-site

wgtmac · 2026-05-29T05:51:30Z

Closes Blog on Floating Point Statistics Improvement (order + Nan counts) #182

alamb

Love it -- I had a suggestion for a specific example that is a little less esoteric than the all nan values example

alamb · 2026-05-29T19:52:11Z

+---
+title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts"
+date: 2026-05-29
+description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance."


Suggested change

description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance."

description: "How the Apache Parquet Community resolved potentially ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts"

alamb · 2026-05-29T19:57:10Z

+
+## Why Floating-Point Statistics Need Special Handling
+
+For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match.


I think this section would be stronger with a specific example: for instance, two floating-point columns, one with NaNs and one without, and a predicate like `where x > 1.0`.

My understanding is that the column without NaNs can be proven to match all rows, so the predicate can be skipped during execution.

However, the column that contains a NaN does not match all rows.

Something like this:

100.0
200.0
NaN    <-- needs to be filtered out
300.0

Previously, most Parquet writers would write stats like

min: 100.0 max: 300.0

A clever engine might then conclude that all rows match (for example, via the optimization described by @xudong963 in https://datafusion.apache.org/blog/2026/03/20/limit-pruning/), which in this case is incorrect.

The engine needs to know whether any NaNs appear in the data.
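A minimal Python sketch of this scenario, in case it helps the post. The helper names `naive_stats` and `all_rows_match_gt` are hypothetical and not taken from any Parquet implementation; the sketch only assumes that the writer drops NaN when computing min/max and that `nan_count=None` stands for a file that carries no NaN information.

```python
import math

def naive_stats(values):
    """Min/max computed with NaN ignored, as in the example above."""
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)

def all_rows_match_gt(min_value, threshold, nan_count=None):
    """Return True only if every row provably satisfies `x > threshold`.

    nan_count=None models statistics without NaN information (unknown).
    """
    if nan_count is None or nan_count > 0:
        return False  # a NaN row fails `x > threshold`
    return min_value > threshold

values = [100.0, 200.0, float("nan"), 300.0]
lo, hi = naive_stats(values)

# Trusting min/max alone suggests every row matches `x > 1.0` ...
unsafe = lo > 1.0
# ... but the NaN row does not actually satisfy the predicate.
actual = all(v > 1.0 for v in values)

# With an explicit NaN count the engine can tell the difference.
nan_count = sum(math.isnan(v) for v in values)
safe = all_rows_match_gt(lo, 1.0, nan_count)
```

Here `unsafe` is true while `actual` and `safe` are false, which is the mismatch the example is meant to show.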

alamb · 2026-05-29T19:57:53Z

+
+Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data.
+
+Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block.
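Both problems can be reproduced in a few lines of Python. The sketch below also shows one way to impose a strict order on doubles: a sort key that follows the IEEE 754 totalOrder predicate. The helper names `bits` and `total_order_key` are hypothetical, and this is an illustration of the ordering rather than the code any Parquet writer uses.

```python
import math
import struct

def bits(x):
    """Raw 64-bit pattern of a double, as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def total_order_key(x):
    """Unsigned sort key that follows the IEEE 754 totalOrder predicate."""
    b = bits(x)
    if b >> 63:
        # Sign bit set: flip every bit so larger magnitudes sort lower.
        return b ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so these values sort above all negatives.
    return b | (1 << 63)

# Force a positive sign so the NaN sorts last; a negative NaN would sort first.
nan = math.copysign(float("nan"), 1.0)

zeros_equal = (-0.0 == 0.0)                  # equal under normal comparison
zeros_same_bits = bits(-0.0) == bits(0.0)    # yet the bit patterns differ
nan_comparisons = (1.0 < nan, 1.0 > nan, nan == nan)

ordered = sorted([0.0, nan, -0.0, 1.0, -1.0], key=total_order_key)
```

Under this key the list sorts as `-1.0`, `-0.0`, `+0.0`, `1.0`, `NaN`, so the two zeros and the NaN each get a well-defined position.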


As above, I think this would be clearer with a motivating example.

alamb · 2026-05-29T19:59:36Z

FYI @JFinis and @etseidl -- perhaps you want to take over this post (or rewrite it into something more in your voice)

etseidl · 2026-05-29T20:39:09Z

+title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts"
+date: 2026-05-29
+description: "How Apache Parquet resolves ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts for better query performance."
+author: "[Jan Finis](https://github.com/JFinis), [Ed Seidl](https://github.com/etseidl), [Gang Wu](https://github.com/wgtmac)"


Thanks @wgtmac, but as @alamb knows well, it would be better for all concerned to not include me as an author. Doing so will delay releasing this by months 😮 😭

Happy to help edit though.

etseidl

Just flushing a few suggestions.

I also am a bit uncomfortable with some of the hyperbole 😅 (e.g. "blazing speed", "small but mighty").

etseidl · 2026-05-29T23:18:44Z

+For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match.
+
+Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data.
+


Suggested change

Parquet's approach to dealing with this ambiguity has been to mandate that a `min` of `0.0` always be written as `-0.0`, and a `max` of `0.0` always be written as `+0.0`, regardless of any sign bits that may be present in the actual data. Readers are advised that `-0.0` may be present even if the `min` is `+0.0`, and `+0.0` may be present even if the max is `-0.0`.
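A small sketch of the writer-side rule described in this suggestion, assuming a hypothetical helper `normalize_zero_bounds` that runs after the raw min and max have been computed:

```python
import math

def normalize_zero_bounds(min_value, max_value):
    """Write a zero min as -0.0 and a zero max as +0.0."""
    if min_value == 0.0:   # true for both -0.0 and +0.0
        min_value = -0.0
    if max_value == 0.0:
        max_value = 0.0
    return min_value, max_value

def has_sign_bit(x):
    """True if the sign bit is set, which distinguishes -0.0 from +0.0."""
    return math.copysign(1.0, x) < 0

# Raw statistics with the "wrong" zero signs are normalized on write.
lo, hi = normalize_zero_bounds(0.0, -0.0)
```

After normalization `lo` carries the sign bit and `hi` does not, so the written bounds cover both zeros whichever one the data actually contains.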

etseidl · 2026-05-29T23:37:58Z

+Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data.
+
+Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block.
+


Suggested change

To date Parquet has followed the latter approach, forbidding the inclusion of `NaN` in the statistics. [PR #196](https://github.com/apache/parquet-format/pull/196) provides a detailed overview of the problems inherent in this approach. For instance, consider a page with a max statistic of `0.0` that also contains a `NaN`. A query engine that considers `NaN` to be greater than all values attempts a query with a predicate like `x > 1.0`. If the engine examines the statistics, it will see that the `max` is `0.0`, so it might improperly skip that page, even though it contains at least one row that satisfies the predicate. Without knowledge of the presence or absence of `NaN`, the engine cannot safely perform this type of page pruning for floating point columns.
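The page-skipping case in this suggestion could be sketched as follows. The function `can_skip_page_gt` is hypothetical, and it assumes an engine in which `NaN` compares greater than every other value, as in the scenario described.

```python
from typing import Optional

def can_skip_page_gt(max_value: float, threshold: float,
                     nan_count: Optional[int]) -> bool:
    """Decide whether a page can be skipped for `x > threshold`.

    Assumes NaN sorts above all other values in this engine, so any
    NaN row satisfies the predicate. nan_count=None means "unknown".
    """
    if nan_count is None or nan_count > 0:
        return False  # a NaN may be present and would match
    return max_value <= threshold

# Page with max 0.0 and no NaN information: skipping is not safe.
unknown = can_skip_page_gt(0.0, 1.0, None)
# Same page, known to contain one NaN: must be read.
with_nan = can_skip_page_gt(0.0, 1.0, 1)
# Same page, known to contain no NaN: safe to skip.
without_nan = can_skip_page_gt(0.0, 1.0, 0)
```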

etseidl · 2026-05-29T23:40:26Z

+
+Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block.
+
+These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.


Suggested change

These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.

[PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249) exposed another critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.

etseidl · 2026-05-29T23:40:52Z

+Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. Conversely, if a writer simply ignores `NaN` values, readers are left in the dark about whether any `NaN`s actually exist in the data block.
+
+These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. A perfect example of this was highlighted in [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249), which exposed a critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata.
+


Suggested change

etseidl · 2026-05-29T23:46:04Z

+
+The resulting specification elegantly marries two concepts: an optional `nan_count` field for both `Statistics` and `ColumnIndex`, and the `IEEE_754_TOTAL_ORDER` column order.
+
+`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional in older files, readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`.


Suggested change

`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional in older files, readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`.

`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional (or completely missing from older files), readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`.
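The missing-versus-zero distinction might be easier to follow with a short sketch. The function names below are hypothetical; `None` stands in for an absent `nan_count` field.

```python
from typing import Optional

def may_contain_nan(nan_count: Optional[int]) -> bool:
    """A missing nan_count means "unknown", so NaN must be assumed possible."""
    return nan_count is None or nan_count > 0

def can_skip_for_is_nan(nan_count: Optional[int]) -> bool:
    """Pruning check for a predicate that only matches NaN rows."""
    return not may_contain_nan(nan_count)

missing = may_contain_nan(None)   # field absent: NaN might be present
zero = may_contain_nan(0)         # field present and zero: no NaN
some = may_contain_nan(3)         # field present and non-zero: NaN present
```

Only an explicit `0` lets a reader rule NaN out; an absent field gives no such guarantee.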

etseidl · 2026-05-29T23:50:51Z

+
+Readers, conversely, should treat a missing `nan_count` with caution. Absence doesn't mean zero; it means "unknown," so `NaN` values may lurk inside. When `nan_count` is present, readers can pair it with `min` and `max` bounds to make hyper-efficient, safe pruning decisions.
+
+Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros.


Suggested change

Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros.

Implementations that haven't adopted the new rules yet should continue handling older files conservatively, particularly when queries involve `NaN` or signed zeros, but also in the case of inequalities not involving `NaN`.

wgtmac · 2026-06-02T08:30:53Z

Thank you all for the review! I'll try to address all comments as soon as I can.

add blog post for ieee754 total order and nan count

06c5e56

wgtmac mentioned this pull request May 29, 2026

Blog on Floating Point Statistics Improvement (order + Nan counts) #182

Open


