Skip to content

INNER JOIN with dictionary keys fails when run on parquet with pushdown_filters = true. #20696

@alamb

Description

@alamb

Describe the bug

A multi-column INNER JOIN with dictionary-encoded string keys fails at runtime when scanning Parquet tables with datafusion.execution.parquet.pushdown_filters = true.

The failure is:

Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))

This appears related to hash join dynamic filter (InList) pushdown over dictionary columns.

To Reproduce

Environment:

  • datafusion-cli 52.2.0

Steps:

  1. Save and run the attached SQL script with datafusion-cli from a clean directory:
datafusion-cli -q -f repro_datafusion_cli_multi_column_dictionary_join.sql
  1. Script contents:
-- Required to reproduce the failure path.
SET datafusion.execution.parquet.pushdown_filters = true;

CREATE TABLE h2o AS
SELECT
  to_timestamp_nanos(time_ns) AS time,
  arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
  arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
  temp
FROM (
  VALUES
    (200, 'CA', 'LA', 90.0),
    (250, 'MA', 'Boston', 72.4),
    (100, 'MA', 'Boston', 70.4),
    (350, 'CA', 'LA', 90.0)
) AS t(time_ns, state, city, temp);

CREATE TABLE o2 AS
SELECT
  to_timestamp_nanos(time_ns) AS time,
  arrow_cast(state, 'Dictionary(Int32, Utf8)') AS state,
  arrow_cast(city, 'Dictionary(Int32, Utf8)') AS city,
  temp,
  reading
FROM (
  VALUES
    (250, 'MA', 'Boston', 53.4, 51.0),
    (100, 'MA', 'Boston', 50.4, 50.0)
) AS t(time_ns, state, city, temp, reading);

CREATE EXTERNAL TABLE h2o_parquet_tbl STORED AS PARQUET LOCATION 'h2o_parquet';
CREATE EXTERNAL TABLE o2_parquet_tbl STORED AS PARQUET LOCATION 'o2_parquet';

SELECT h2o_parquet_tbl.temp AS h2o_temp, o2_parquet_tbl.temp AS o2_temp, o2_parquet_tbl.reading
FROM h2o_parquet_tbl
INNER JOIN o2_parquet_tbl ON h2o_parquet_tbl.time = o2_parquet_tbl.time
    AND h2o_parquet_tbl.state = o2_parquet_tbl.state
    AND h2o_parquet_tbl.city = o2_parquet_tbl.city
WHERE h2o_parquet_tbl.time >= '1970-01-01T00:00:00.000000050Z'
    AND h2o_parquet_tbl.time <= '1970-01-01T00:00:00.000000300Z';
  1. Observed output:
+-------+
| count |
+-------+
| 4     |
+-------+
+-------+
| count |
+-------+
| 2     |
+-------+
Parquet error: External: Compute error: Error evaluating filter predicate: ArrowError(InvalidArgumentError("Can't compare arrays of different types"), Some(""))

Expected behavior

The query should succeed and return (which it does if run directly from the input 0or if SET datafusion.execution.parquet.pushdown_filters = false is removed

+----------+---------+---------+
| h2o_temp | o2_temp | reading |
+----------+---------+---------+
| 70.4     | 50.4    | 50.0    |
| 72.4     | 53.4    | 51.0    |
+----------+---------+---------+

Additional context

  • If SET datafusion.execution.parquet.pushdown_filters = false, the query succeeds and returns the expected 2 rows.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions