Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datafusion: unreachable code reached when parsing statistics with missing columns #1374

Closed
cmackenzie1 opened this issue May 16, 2023 · 10 comments · Fixed by #1413
Closed

Datafusion: unreachable code reached when parsing statistics with missing columns #1374

cmackenzie1 opened this issue May 16, 2023 · 10 comments · Fixed by #1413
Labels
bug Something isn't working

Comments

@cmackenzie1
Copy link
Contributor

cmackenzie1 commented May 16, 2023

Environment

Delta-rs version: 0.11.0, 0.10.0

Binding: rust

Environment:

  • Cloud provider:
  • OS:
  • Other:

Bug

What happened:

Reached https://github.com/apache/arrow-datafusion/blob/37b2c53f281b9550034e7e69f5acf1ae666a0da7/datafusion/common/src/scalar.rs#L2472 when querying table with datafusion. It looks like the issue may have been reached from

ScalarValue::iter_to_array(values).ok()
when pruning files based on timestamp values in the min/max statistics.

What you expected to happen:

Query executes successfully and returns the matching results.

How to reproduce it:

More details:

Stack trace

{"timestamp":"2023-05-16T23:11:47.052290Z","level":"INFO","fields":{"message":"plan: Limit: skip=0, fetch=1000\n  Filter: t.EdgeStartTimestamp BETWEEN TimestampMicrosecond(1681603200000000, None) AND TimestampMicrosecond(1681668000000000, None)\n    Filter: t.date BETWEEN Utf8(\"2023-04-16\") AND Utf8(\"2023-04-16\")\n      TableScan: t"},"target":"loghouse::handlers::query"}
thread 'tokio-runtime-worker' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-common-24.0.0/src/scalar.rs:2472:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/std/src/panicking.rs:579:5
   1: core::panicking::panic_fmt
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:64:14
   2: core::panicking::panic
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/panicking.rs:114:5
   3: datafusion_common::scalar::ScalarValue::iter_to_null_array::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-common-24.0.0/src/scalar.rs:2472:26
   4: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/iter/adapters/map.rs:84:21
   5: core::iter::traits::iterator::Iterator::fold
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/iter/traits/iterator.rs:2477:21
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/iter/adapters/map.rs:124:9
   7: <core::iter::adapters::peekable::Peekable<I> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/iter/adapters/peekable.rs:114:9
   8: datafusion_common::scalar::ScalarValue::iter_to_null_array
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-common-24.0.0/src/scalar.rs:2468:13
   9: datafusion_common::scalar::ScalarValue::iter_to_array
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-common-24.0.0/src/scalar.rs:2233:31
  10: deltalake::operations::transaction::state::AddContainer::get_prune_stats
             at /Users/cole/.cargo/git/checkouts/delta-rs-6ac7a11b145995c6/d7acb5c/rust/src/operations/transaction/state.rs:204:9
  11: <deltalake::operations::transaction::state::AddContainer as datafusion::physical_optimizer::pruning::PruningStatistics>::max_values
             at /Users/cole/.cargo/git/checkouts/delta-rs-6ac7a11b145995c6/d7acb5c/rust/src/operations/transaction/state.rs:241:9
  12: deltalake::operations::transaction::state::<impl datafusion::physical_optimizer::pruning::PruningStatistics for deltalake::table_state::DeltaTableState>::max_values
             at /Users/cole/.cargo/git/checkouts/delta-rs-6ac7a11b145995c6/d7acb5c/rust/src/operations/transaction/state.rs:300:9
  13: datafusion::physical_optimizer::pruning::build_statistics_record_batch
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/physical_optimizer/pruning.rs:394:36
  14: datafusion::physical_optimizer::pruning::PruningPredicate::prune
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/physical_optimizer/pruning.rs:162:13
  15: deltalake::delta_datafusion::<impl datafusion::datasource::datasource::TableProvider for deltalake::delta::DeltaTable>::scan::{{closure}}
             at /Users/cole/.cargo/git/checkouts/delta-rs-6ac7a11b145995c6/d7acb5c/rust/src/delta_datafusion.rs:450:34
  16: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  17: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/physical_plan/planner.rs:535:88
  18: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  19: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/physical_plan/planner.rs:806:96
  20: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  21: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/physical_plan/planner.rs:1126:79
  22: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  23: <datafusion::physical_plan::planner::DefaultPhysicalPlanner as datafusion::physical_plan::planner::PhysicalPlanner>::create_physical_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/physical_plan/planner.rs:428:21
  24: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  25: <datafusion::execution::context::DefaultQueryPlanner as datafusion::execution::context::QueryPlanner>::create_physical_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/execution/context.rs:1282:13
  26: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  27: datafusion::execution::context::SessionState::create_physical_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/execution/context.rs:1840:13
  28: datafusion::dataframe::DataFrame::create_physical_plan::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/dataframe.rs:99:60
  29: datafusion::dataframe::DataFrame::collect::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-24.0.0/src/dataframe.rs:663:47
  30: loghouse::handlers::query::query::{{closure}}
             at /Users/cole/bitbucket/DATA/loghouse/src/handlers/query.rs:153:37
  31: <F as axum::handler::Handler<(M,T1,T2,T3),S,B>>::call::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/handler/mod.rs:213:52
  32: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  33: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  34: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  35: <axum::handler::future::IntoServiceFuture<F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/macros.rs:42:17
  36: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  37: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  38: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  39: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  40: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  41: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  42: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  43: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/util/oneshot.rs:97:38
  44: <axum::routing::route::RouteFuture<B,E> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/routing/route.rs:161:61
  45: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  46: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  47: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  48: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  49: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  50: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  51: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  52: axum::middleware::from_fn::Next<B>::run::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/middleware/from_fn.rs:335:35
  53: loghouse::metrics::middleware::{{closure}}
             at /Users/cole/bitbucket/DATA/loghouse/src/metrics/mod.rs:71:33
  54: <axum::middleware::from_fn::FromFn<F,S,I,(T1,)> as tower_service::Service<http::request::Request<B>>>::call::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/middleware/from_fn.rs:300:44
  55: <axum::middleware::from_fn::ResponseFuture as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/middleware/from_fn.rs:381:9
  56: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  57: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  58: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  59: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  60: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  61: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  62: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  63: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  64: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  65: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  66: <futures_util::future::try_future::MapErr<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  67: <tower::util::map_err::MapErrFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  68: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  69: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  70: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  71: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  72: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  73: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  74: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/core/src/future/future.rs:125:9
  75: <tower::util::oneshot::Oneshot<S,Req> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/util/oneshot.rs:97:38
  76: <axum::routing::route::RouteFuture<B,E> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/axum-0.6.18/src/routing/route.rs:161:61
  77: <tower_http::limit::future::ResponseFuture<F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-http-0.4.0/src/limit/future.rs:56:53
  78: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  79: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  80: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  81: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  82: <futures_util::future::try_future::MapOk<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  83: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  84: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  85: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
  86: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/future/map.rs:55:37
  87: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  88: <futures_util::future::try_future::MapErr<Fut,F> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/lib.rs:91:13
  89: <tower::util::map_err::MapErrFuture<F,N> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.13/src/macros.rs:38:17
  90: <F as futures_core::future::TryFuture>::try_poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-core-0.3.28/src/future.rs:82:9
  91: <futures_util::future::try_future::into_future::IntoFuture<Fut> as core::future::future::Future>::poll
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-util-0.3.28/src/future/try_future/into_future.rs:34:9
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
@cmackenzie1 cmackenzie1 added the bug Something isn't working label May 16, 2023
@cmackenzie1
Copy link
Contributor Author

Apologies for the issue title - not really sure what to call it here

@roeap
Copy link
Collaborator

roeap commented May 17, 2023

Would it be possible for you to provide a repro example, or print the values we are passing into ScalarValue::iter_to_arrway, or to identify the datatype of the column where this is happening? Looking into the datafusion code the most likely thing that is happen is that we are passing a ScalarValue::Null as a first value, while it should be a more specific type like e.g. ScalarValue::Float32(None). Along most code paths we should actually create that correct value, but for some we see to not :)

@cmackenzie1
Copy link
Contributor Author

@roeap I'll try and get a reproducible delta table created today and add it as a test case.

@cmackenzie1
Copy link
Contributor Author

cmackenzie1 commented May 18, 2023

I was able to recreate this in a test and I believe it is related / duplicate of #1372. When running test on 2077e6a it errors out while on 1b01f9f it succeeds.

Here is the test case for ref: main...cmackenzie1:delta-rs:cole/issue-1374

Edit: err, hold that thought. I am still seeing this error on some tables of mine. Will report back

cmackenzie1 added a commit to cmackenzie1/delta-rs that referenced this issue May 18, 2023
@cmackenzie1
Copy link
Contributor Author

cmackenzie1 commented May 19, 2023

Ok, I think I've narrowed it down and there is two bugs at play- with the first one being solved in #1372.

This one happens in

values
.get(&column.name)
.and_then(|f| to_correct_scalar_value(f.as_value()?, &data_type))
.unwrap_or(ScalarValue::Null)

What I have been observing is that sometimes the statistics are sometimes missing entire keys, which causes the unwrap_or() condition after trying to get the value from the HashMap with get (since it returns None).

In this scenario, the table is loaded by a combination of parquet checkpoint + JSON and the field missing in the stats is EdgeStartTimestamp which is defined as

SchemaField::new(
    "EdgeStartTimestamp".to_string(),
    SchemaDataType::primitive("timestamp".to_string()),
    true,
    HashMap::new(),
),

I've injected some println!() statements into the code to show the statistics map and then the value retrieved by get.

statistics: {"ClientRequestURI": Value(String("/")), "EdgeResponseStatus": Value(Number(200)), "EdgeResponseBytes": Value(Number(219)), "ClientRequestMethod": Value(String("GET")), "ClientIP": Value(String("0.0.0.0")), "RayID": Value(String("7c86f1ee89b32a2e")), "ClientRequestHost": Value(String("exmaple.com"))}
column.name: EdgeStartTimestamp, stat: None

statistics: {"ClientRequestMethod": Value(String("GET")), "EdgeEndTimestamp": Value(String("2023-05-16T22:10:15.000000Z")), "EdgeResponseStatus": Value(Number(200)), "ClientRequestHost": Value(String("exmaple.com")), "EdgeResponseBytes": Value(Number(484)), "EdgeStartTimestamp": Value(String("2023-05-16T22:10:15.000000Z")), "ClientIP": Value(String("0.0.0.0")), "ClientRequestURI": Value(String("/")), "RayID": Value(String("7c86f31dd8892a14"))}
column.name: EdgeStartTimestamp, stat: Some(Value(String("2023-05-16T22:10:15.000000Z")))

I am not sure what the behavior should be in this scenario, but I imagine that being unable to prune the file via statistics would just mean the file is kept for querying.

@cmackenzie1 cmackenzie1 changed the title Datafusion: unreachable code reached Datafusion: unreachable code reached when parsing statistics with missing columns May 19, 2023
@cmackenzie1
Copy link
Contributor Author

cmackenzie1 commented May 19, 2023

Looking into the PruningStatistics trait from datafusion, I believe the expected result is an ArrayRef of Option<ScalarValue> of all the same type instead and then None is used to represent null instead of ScalarValue::Null? Something like

// [true, Null, false]
let scalars = vec![
  ScalarValue::Boolean(Some(true)),
  ScalarValue::Boolean(None),
  ScalarValue::Boolean(Some(false)),
];

https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72

@roeap
Copy link
Collaborator

roeap commented May 20, 2023

I believe the expected result is an ArrayRef of Option of all the same type instead and then None is used to represent null instead of ScalarValue::Null

Arrived at the same conclusion. I guess the PruningStatistics should just return None all together if we do not have any column stats for that column, or even if stats for a file are missing? Nut sure on the specifics of the pushdown, but otherwise we might erroneously skip a file on IS NOT NULL if both min and mux stats are null?

@cmackenzie1
Copy link
Contributor Author

but otherwise we might erroneously skip a file on IS NOT NULL if both min and mux stats are null?

Wouldn't those be covered by the nullCounts stats?

Also, I think this raises another issue: why are the add.stats_parsed fields sometimes null for a given column?

SELECT
    add.path,
    add.stats_parsed.minvalues.edgestarttimestamp,
    add.stats_parsed.maxvalues.edgestarttimestamp
FROM read_parquet('./testdata/http_requests/_delta_log/00000000000000000400.checkpoint.parquet')
WHERE add.path = 'date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet'

|                                        path                                         | edgestarttimestamp | edgestarttimestamp |
|-------------------------------------------------------------------------------------|--------------------|--------------------|
| date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet |                    |                    |
SELECT
    min(EdgeStartTimestamp),
    max(EdgeStartTimestamp)
FROM read_parquet('./testdata/http_requests/date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet')

| min(EdgeStartTimestamp) | max(EdgeStartTimestamp) |
|-------------------------|-------------------------|
| 2023-05-16 22:09:26     | 2023-05-16 22:11:04     |
SELECT count(*)
FROM read_parquet('./testdata/http_requests/date=2023-05-16/part-00000-de989f29-71ff-4906-be26-3c36fb604c6a-c000.snappy.parquet')
WHERE EdgeStartTimestamp IS NULL

| count_star() |
|--------------|
| 0            |

@roeap
Copy link
Collaborator

roeap commented May 24, 2023

I believe the stats_parsed field is always just optional, an delta does not necessarily collect stats for all columns, in which case it would always be be null for columns where there are no stats collected.

@roeap
Copy link
Collaborator

roeap commented May 24, 2023

Wouldn't those be covered by the nullCounts stats?

Well, not entirely sure how that would exactly work out. But if there are conditions defined these would tell datafusion to skip the file (i.e. min and max are NULL and thus none would match IS NOT NULL), I dont think df re-validates this by comparing null count and num rows to realize that there are in fact non null rows in the file.

cmackenzie1 added a commit to cmackenzie1/delta-rs that referenced this issue May 30, 2023
Test table `issue_1374` was created by hand to have 2 data files where
only one file has the `min_values` for the statistics in the
`checkpoint.parquet` file set to null in order to trigger the bug.
There is no other significance to the table other than to demonstrate
issue delta-io#1374.

```
internal error: entered unreachable code
thread 'test_issue_1374' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-common-24.0.0/src/scalar.rs:2472:26
```
cmackenzie1 added a commit to cmackenzie1/delta-rs that referenced this issue May 30, 2023
Test table `issue_1374` was created by hand to have 2 data files where
only one file has the `min_values` for the statistics in the
`checkpoint.parquet` file set to null in order to trigger the bug.
There is no other significance to the table other than to demonstrate
issue delta-io#1374.

```
internal error: entered unreachable code
thread 'test_issue_1374' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-common-24.0.0/src/scalar.rs:2472:26
```
cmackenzie1 added a commit to cmackenzie1/delta-rs that referenced this issue May 30, 2023
Test table `issue_1374` was created by hand to have 2 data files where
only one file has the `min_values` for the statistics in the
`checkpoint.parquet` file set to null in order to trigger the bug.
There is no other significance to the table other than to demonstrate
issue delta-io#1374.

```
internal error: entered unreachable code
thread 'test_issue_1374' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-common-24.0.0/src/scalar.rs:2472:26
```
cmackenzie1 added a commit to cmackenzie1/delta-rs that referenced this issue May 30, 2023
Test table `issue_1374` was created by hand to have 2 data files where
only one file has the `min_values` for the statistics in the
`checkpoint.parquet` file set to null in order to trigger the bug.
There is no other significance to the table other than to demonstrate
issue delta-io#1374.

```
internal error: entered unreachable code
thread 'test_issue_1374' panicked at 'internal error: entered unreachable code', /Users/cole/.cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-common-24.0.0/src/scalar.rs:2472:26
```
roeap pushed a commit that referenced this issue Jun 2, 2023
# Description
Switch the `get_prune_stats` functions to use `None` to represent `null`
instead of `ScalarValue::Null` as `ArrayRef` must be of all the same
type.

# Related Issue(s)

- closes #1374 

# Documentation


https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
roeap pushed a commit to roeap/delta-rs that referenced this issue Jun 2, 2023
# Description
Switch the `get_prune_stats` functions to use `None` to represent `null`
instead of `ScalarValue::Null` as `ArrayRef` must be of all the same
type.

# Related Issue(s)

- closes delta-io#1374 

# Documentation


https://github.com/apache/arrow-datafusion/blob/dd5e1dbbfd20539b40ae65acb8883f7e392cba92/datafusion/core/src/physical_optimizer/pruning.rs#L54-L72

---------

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants