Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
To someone new to datafusion, it may not be clear that running the tests successfully requires setting the PARQUET_TEST_DATA and ARROW_TEST_DATA environment variables.
So today, here is what happens:
git clone https://github.com/apache/arrow-datafusion
cd arrow-datafusion
cargo test -p datafusion
Which results in many errors like:
---- physical_plan::windows::tests::window_function_input_partition stdout ----
thread 'physical_plan::windows::tests::window_function_input_partition' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.2.0/../testing/data` not found
HINT: try running `git submodule update --init`', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.2.0/src/util/test_util.rs:81:21
Even after running git submodule update --init as the hint suggests, the tests still fail. Instead, you need to set:
export ARROW_TEST_DATA=testing/data
export PARQUET_TEST_DATA=parquet-testing/data
cargo test -p datafusion
Describe the solution you'd like
I would like the tests to automatically try the default locations, as above, if ARROW_TEST_DATA and PARQUET_TEST_DATA are not set.
The tests should pass successfully with only these commands:
git clone https://github.com/apache/arrow-datafusion
cd arrow-datafusion
git submodule update --init
cargo test -p datafusion
The arrow-rs crate already does this (here and here), but now that arrow-rs and datafusion are no longer in the same workspace, it has stopped working.
Perhaps we can simply take the code from arrow-rs and port it to run in datafusion rather than calling arrow::util::test_util.
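As a rough illustration of the fallback described above, here is a minimal sketch of what such a helper might look like. It is not the arrow-rs code. The names resolve_data_dir and get_data_dir, the error wording, and the idea of passing the repository root as `base` are all assumptions made for this example. Inside datafusion, `base` could perhaps be derived from env!("CARGO_MANIFEST_DIR").

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the arrow-rs implementation): pick a test data
/// directory from an optional environment value, falling back to a
/// submodule directory under `base` when the value is unset or empty.
fn resolve_data_dir(
    env_value: Option<&str>,
    base: &Path,
    submodule_dir: &str,
) -> Result<PathBuf, String> {
    // 1. A non-empty environment value takes priority.
    if let Some(raw) = env_value {
        let trimmed = raw.trim();
        if !trimmed.is_empty() {
            let candidate = PathBuf::from(trimmed);
            if candidate.is_dir() {
                return Ok(candidate);
            }
            return Err(format!(
                "the data dir `{}` from the environment is not a directory",
                candidate.display()
            ));
        }
    }

    // 2. Otherwise try the default location, e.g. `<base>/testing/data`.
    let fallback = base.join(submodule_dir);
    if fallback.is_dir() {
        Ok(fallback)
    } else {
        Err(format!(
            "the default data dir `{}` was not found\nHINT: try running `git submodule update --init`",
            fallback.display()
        ))
    }
}

/// Hypothetical wrapper that reads the named variable from the process
/// environment, e.g. get_data_dir("ARROW_TEST_DATA", root, "testing/data").
fn get_data_dir(var: &str, base: &Path, submodule_dir: &str) -> Result<PathBuf, String> {
    resolve_data_dir(env::var(var).ok().as_deref(), base, submodule_dir)
}
```

With something like this, a test could call get_data_dir("PARQUET_TEST_DATA", root, "parquet-testing/data") and only need the submodules to be checked out, while an explicitly exported variable would still override the default.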
Describe alternatives you've considered
None