-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
When datafusion.execution.collect_statistics = true is enabled, and Parquet statistics are only partially collected (i.e., some columns have statistics disabled), filters using IS NULL or IS NOT NULL on the non-statistics columns produce incorrect results.
In this case, queries that should return rows instead return no results, even though the data clearly exists in the Parquet file.
To Reproduce
use DataFusion CLI v50.2.0
> set datafusion.execution.collect_statistics = true;
0 row(s) fetched.
Elapsed 0.001 seconds.
> COPY (
SELECT
column1 as id,
column2 as value,
FROM (VALUES (1, 10), (2, 20), (3, NULL))
)
TO 'test/data.parquet'
OPTIONS (
'statistics_enabled::value' 'none',
);
+-------+
| count |
+-------+
| 3 |
+-------+
1 row(s) fetched.
Elapsed 0.008 seconds.
> CREATE EXTERNAL TABLE t (
id INT,
value INT,
)
STORED AS PARQUET
LOCATION 'test';
0 row(s) fetched.
Elapsed 0.001 seconds.
> select * from t where value is null;
+----+-------+
| id | value |
+----+-------+
+----+-------+
0 row(s) fetched.
Elapsed 0.007 seconds.
> select * from 'test/data.parquet' where value is not null;
+----+-------+
| id | value |
+----+-------+
+----+-------+
0 row(s) fetched.
Elapsed 0.002 seconds.
> select * from 'test/data.parquet';
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 2 | 20 |
| 3 | NULL |
+----+-------+
3 row(s) fetched.
Elapsed 0.002 seconds.
> select * from t;
+----+-------+
| id | value |
+----+-------+
| 1 | 10 |
| 2 | 20 |
| 3 | NULL |
+----+-------+
3 row(s) fetched.
Elapsed 0.002 seconds.
select path_in_schema,stats_min,stats_max,stats_null_count,stats_min_value,stats_max_value from parquet_metadata('test/data.parquet');
+----------------+-----------+-----------+------------------+-----------------+-----------------+
| path_in_schema | stats_min | stats_max | stats_null_count | stats_min_value | stats_max_value |
+----------------+-----------+-----------+------------------+-----------------+-----------------+
| "id" | 1 | 3 | 0 | 1 | 3 |
| "value" | NULL | NULL | NULL | NULL | NULL |
+----------------+-----------+-----------+------------------+-----------------+-----------------+
2 row(s) fetched.
Elapsed 0.002 seconds.
Expected behavior
Filtering with IS NULL / IS NOT NULL should still correctly evaluate against the file data even if statistics for a column are missing.
Additional context
No response
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working