Automatically find PARQUET_TEST_DATA and ARROW_TEST_DATA #467

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

For someone new to datafusion, it may not be clear that running the tests successfully requires setting the PARQUET_TEST_DATA and ARROW_TEST_DATA environment variables.

So today, here is what happens:

git clone https://github.com/apache/arrow-datafusion
cd arrow-datafusion
cargo test -p datafusion

Which results in many errors like:

---- physical_plan::windows::tests::window_function_input_partition stdout ----
thread 'physical_plan::windows::tests::window_function_input_partition' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.2.0/../testing/data` not found
HINT: try running `git submodule update --init`', /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-4.2.0/src/util/test_util.rs:81:21

Even after running git submodule update --init as the hint suggests, the tests still fail. Instead, you need to set:

export ARROW_TEST_DATA=testing/data
export PARQUET_TEST_DATA=parquet-testing/data
cargo test -p datafusion

Describe the solution you'd like
I would like the tests to automatically try the default locations, as above, if ARROW_TEST_DATA and PARQUET_TEST_DATA are not set.

The tests should pass successfully with only these commands:

git clone https://github.com/apache/arrow-datafusion
cd arrow-datafusion
git submodule update --init
cargo test -p datafusion

The arrow-rs crate already does this (here and here), but now that arrow-rs and datafusion are no longer in the same workspace, it has stopped working.

Perhaps we can simply take the code from arrow-rs and port it to run in datafusion, rather than calling arrow::util::test_util.
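
To make the proposal concrete, here is a minimal sketch of what a datafusion-local lookup could look like. It is not the arrow-rs code: the function names resolve_data_dir and arrow_test_data, and the default path ../testing/data relative to the crate's manifest directory, are assumptions for illustration only. The idea is to honor the environment variable when it is set and non-empty, and otherwise fall back to the submodule checkout inside the datafusion repository.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not the arrow-rs implementation): pick a test data
/// directory from an optional environment value, falling back to
/// `base.join(default_dir)` when the value is missing or empty.
fn resolve_data_dir(
    env_value: Option<String>,
    base: &Path,
    default_dir: &str,
) -> Result<PathBuf, String> {
    // 1. An explicitly configured, non-empty value always wins.
    if let Some(value) = env_value {
        let trimmed = value.trim();
        if !trimmed.is_empty() {
            let path = PathBuf::from(trimmed);
            return if path.is_dir() {
                Ok(path)
            } else {
                Err(format!("configured data dir `{}` not found", path.display()))
            };
        }
    }

    // 2. Otherwise try the default location, e.g. the git submodule checkout.
    let path = base.join(default_dir);
    if path.is_dir() {
        Ok(path)
    } else {
        Err(format!(
            "default data dir `{}` not found\nHINT: try running `git submodule update --init`",
            path.display()
        ))
    }
}

/// Hypothetical wrapper for ARROW_TEST_DATA. The relative default path is an
/// assumption about where the submodule sits relative to the crate.
fn arrow_test_data() -> Result<PathBuf, String> {
    // CARGO_MANIFEST_DIR is set by cargo when building and running tests.
    let base = env::var("CARGO_MANIFEST_DIR").unwrap_or_else(|_| ".".to_string());
    resolve_data_dir(
        env::var("ARROW_TEST_DATA").ok(),
        Path::new(&base),
        "../testing/data",
    )
}
```

A matching wrapper for PARQUET_TEST_DATA would differ only in the variable name and default path. Because the fallback is resolved relative to the datafusion crate rather than the arrow crate in the cargo registry, it should not depend on the two projects sharing a workspace.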

Describe alternatives you've considered
None
