Skip to content

Incorrect filtering results when collect_statistics = true and column statistics are partially disabled #18241

@xhwhis

Description

@xhwhis

Describe the bug

When datafusion.execution.collect_statistics = true is enabled, and Parquet statistics are only partially collected (i.e., some columns have statistics disabled), filters using IS NULL or IS NOT NULL on the non-statistics columns produce incorrect results.
In this case, queries that should return rows instead return no results, even though the data clearly exists in the Parquet file.

To Reproduce

use DataFusion CLI v50.2.0

> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.001 seconds.

> COPY (
  SELECT
    column1 as id,
    column2 as value,
  FROM (VALUES (1, 10), (2, 20), (3, NULL))
)
TO 'test/data.parquet'
OPTIONS (
  'statistics_enabled::value' 'none',
);
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.008 seconds.

> CREATE EXTERNAL TABLE t (
  id INT,
  value INT,
)
STORED AS PARQUET
LOCATION 'test';
0 row(s) fetched.
Elapsed 0.001 seconds.

> select * from t where value is null;
+----+-------+
| id | value |
+----+-------+
+----+-------+
0 row(s) fetched.
Elapsed 0.007 seconds.

> select * from 'test/data.parquet' where value is not null;
+----+-------+
| id | value |
+----+-------+
+----+-------+
0 row(s) fetched.
Elapsed 0.002 seconds.

> select * from 'test/data.parquet';
+----+-------+
| id | value |
+----+-------+
| 1  | 10    |
| 2  | 20    |
| 3  | NULL  |
+----+-------+
3 row(s) fetched.
Elapsed 0.002 seconds.

> select * from t;
+----+-------+
| id | value |
+----+-------+
| 1  | 10    |
| 2  | 20    |
| 3  | NULL  |
+----+-------+
3 row(s) fetched.
Elapsed 0.002 seconds.

select path_in_schema,stats_min,stats_max,stats_null_count,stats_min_value,stats_max_value from parquet_metadata('test/data.parquet');
+----------------+-----------+-----------+------------------+-----------------+-----------------+
| path_in_schema | stats_min | stats_max | stats_null_count | stats_min_value | stats_max_value |
+----------------+-----------+-----------+------------------+-----------------+-----------------+
| "id"           | 1         | 3         | 0                | 1               | 3               |
| "value"        | NULL      | NULL      | NULL             | NULL            | NULL            |
+----------------+-----------+-----------+------------------+-----------------+-----------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

Expected behavior

Filtering with IS NULL / IS NOT NULL should still correctly evaluate against the file data even if statistics for a column are missing.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions