Error with CASE and DictionaryArrays: ArrowError(InvalidArgumentError("arguments need to have the same data type")) #2873

@alamb

Describe the bug
For a DictionaryArray column `col`, evaluating an expression like

CASE 
  WHEN col IS NULL THEN '' 
  ELSE col
END

generates an error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError(InvalidArgumentError("arguments need to have the same data type"))', src/main.rs:45:82
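
A minimal sketch of the mismatch this message suggests, under stated assumptions: the literal '' has type Utf8, while a DictionaryArray<Int32Type> of strings has type Dictionary(Int32, Utf8), so the THEN and ELSE branches disagree. The `DataType` enum, `check_same_type`, and `unpack_dictionary` below are hypothetical stand-ins written for illustration. They are not the arrow-rs or DataFusion API, and the coercion shown is only one possible way to reconcile the branches, not necessarily the fix the project adopts.

```rust
// Toy model of the branch type check. All names here are hypothetical
// and are NOT part of arrow-rs or DataFusion.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Mimics a kernel that refuses to combine branches of different types.
fn check_same_type(then_ty: &DataType, else_ty: &DataType) -> Result<DataType, String> {
    if then_ty == else_ty {
        Ok(then_ty.clone())
    } else {
        Err("arguments need to have the same data type".to_string())
    }
}

/// One possible coercion: replace a dictionary type with its value type.
fn unpack_dictionary(ty: &DataType) -> DataType {
    match ty {
        DataType::Dictionary(_, value) => (**value).clone(),
        other => other.clone(),
    }
}

fn main() {
    // Assumed type of the literal '' in the THEN branch.
    let then_ty = DataType::Utf8;
    // Assumed type of the dictionary-encoded column in the ELSE branch.
    let else_ty = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));

    // Without coercion the branch types differ, so the check fails.
    println!("{:?}", check_same_type(&then_ty, &else_ty));

    // After unpacking the dictionary, both branches are Utf8 and the check succeeds.
    let coerced = unpack_dictionary(&else_ty);
    println!("{:?}", check_same_type(&then_ty, &coerced));
}
```

In this model the first check returns an `Err` carrying the same message as the panic, and the second returns `Ok(Utf8)`.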

To Reproduce

use std::sync::Arc;

use datafusion::arrow::datatypes::Int32Type;
use datafusion::prelude::*;
use datafusion::arrow::array::DictionaryArray;
use datafusion::datasource::MemTable;
use datafusion::logical_plan::{LogicalPlanBuilder, provider_as_source, when};
use datafusion::physical_plan::collect;
use datafusion::error::Result;
use datafusion::arrow::{self, record_batch::RecordBatch};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    let host: DictionaryArray<Int32Type> = vec![Some("host1"), None, Some("host2")].into_iter().collect();

    let batch = RecordBatch::try_from_iter(vec![
        ("host", Arc::new(host) as _),
    ]).unwrap();

    let t = MemTable::try_new(batch.schema(), vec![vec![batch]]).unwrap();


    let expr = when(col("host").is_null(), lit(""))
        .otherwise(col("host"))
        .unwrap();

    let projection = None;
    let builder = LogicalPlanBuilder::scan(
        "cpu_load_short",
        provider_as_source(Arc::new(t)),
        projection
    ).unwrap();


    let logical_plan = builder
        .project(vec![expr])
        .unwrap()
        .build()
        .unwrap();

    // create the physical plan and execute it
    let physical_plan = ctx.create_physical_plan(&logical_plan).await.unwrap();
    let results: Vec<RecordBatch> = collect(physical_plan, ctx.task_ctx()).await.unwrap();

    // format the results
    println!("Results:\n\n{}", arrow::util::pretty::pretty_format_batches(&results).unwrap());
    Ok(())
}

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError(InvalidArgumentError("arguments need to have the same data type"))', src/main.rs:45:82
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/panicking.rs:142:14
   2: core::result::unwrap_failed
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/result.rs:1785:5
   3: core::result::Result<T,E>::unwrap
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/result.rs:1078:23
   4: rust_arrow_playground::main::{{closure}}
             at ./src/main.rs:45:37
   5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/future/mod.rs:91:19
   6: tokio::park::thread::CachedParkThread::block_on::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/park/thread.rs:263:54
   7: tokio::coop::with_budget::{{closure}}
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:102:9
   8: std::thread::local::LocalKey<T>::try_with
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/thread/local.rs:445:16
   9: std::thread::local::LocalKey<T>::with
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/std/src/thread/local.rs:421:9
  10: tokio::coop::with_budget
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:95:5
  11: tokio::coop::budget
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/coop.rs:72:5
  12: tokio::park::thread::CachedParkThread::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/park/thread.rs:263:31
  13: tokio::runtime::enter::Enter::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/enter.rs:151:13
  14: tokio::runtime::thread_pool::ThreadPool::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/thread_pool/mod.rs:90:9
  15: tokio::runtime::Runtime::block_on
             at /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.19.2/src/runtime/mod.rs:482:43
  16: rust_arrow_playground::main
             at ./src/main.rs:49:5
  17: core::ops::function::FnOnce::call_once
             at /rustc/a8314ef7d0ec7b75c336af2c9857bfaf43002bfc/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Expected behavior

The test should pass and produce this output:

+------------------------------------------------------------------------------------+
| CASE WHEN #cpu_load_short.host IS NULL THEN Utf8("") ELSE #cpu_load_short.host END |
+------------------------------------------------------------------------------------+
| host1                                                                              |
|                                                                                    |
| host2                                                                              |
+------------------------------------------------------------------------------------+

Additional context

This test used to pass. The last commit at which it passed was 57f47ab.

It appears to fail starting with da392f4 (which came in via #2819), which makes sense given the change.

Found while debugging an upgrade in IOx: https://github.com/influxdata/influxdb_iox/pull/5079

Labels: bug