Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10967: [Rust] Add functions for test data to mod arrow::util::test_util #8967

Closed
wants to merge 4 commits into from

Conversation

mqy
Copy link
Contributor

@mqy mqy commented Dec 19, 2020

If we could get test data dirs at runtime, both env vars ARROW_TEST_DATA and PARQUET_TEST_DATA become optional: no need to set them unless the testing data is not in pre-defined location.

This PR adds two similar public functions arrow_test_data and parquet_test_data to mod arrow::util::test_util, each behaves like this:

  • return data dir from user defined env if defined and corresponding dir exists.
  • return default data dir by joining env CARGO_MANIFEST_DIR and relative pre-defined data data dirs.
  • panic on error.

Possible panic errors from arrow_test_data():

- failed to get arrow data dir: the data dir `non/existing` defined by env `ARROW_TEST_DATA` not found
- failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../testing/data` not found

Possible panic errors from parquet_test_data():

- failed to get parquet data dir: the data dir `non/existing` defined by env `PARQUET_TEST_DATA` not found
- failed to get parquet data dir: env `PARQUET_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../cpp/submodules/parquet-testing/data` not found

Existing codes can be updated in this way :

let testdata = env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined");
// change to 
let testdata = arrow::util::test_util::arrow_test_data();

@mqy mqy changed the title Arrow 10967 optional env Arrow-10967: [Rust] Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional Dec 19, 2020
@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@mqy mqy changed the title Arrow-10967: [Rust] Make env vars ARROW_TEST_DATA and PARQUET_TEST_DATA optional ARROW-10967: [Rust] Make env vars optional Dec 19, 2020
@github-actions
Copy link

@codecov-io
Copy link

codecov-io commented Dec 19, 2020

Codecov Report

Merging #8967 (1242059) into master (0519c4c) will increase coverage by 0.01%.
The diff coverage is 96.49%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8967      +/-   ##
==========================================
+ Coverage   82.64%   82.65%   +0.01%     
==========================================
  Files         200      200              
  Lines       49730    49787      +57     
==========================================
+ Hits        41098    41153      +55     
- Misses       8632     8634       +2     
Impacted Files Coverage Δ
rust/arrow/src/util/test_util.rs 90.90% <96.49%> (+15.90%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9bab12f...1242059. Read the comment docs.

#[test]
fn read_generated_files() {
let testdata = env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined");
Copy link
Member

@andygrove andygrove Dec 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you post example output for when an example of test is run and the test data cannot be found? I expected the unwrap does show the detailed error message but just want to be sure since we're removing these explicit messages here.

Copy link
Contributor Author

@mqy mqy Dec 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove Make sense, let me refined the errors.

@andygrove
Copy link
Member

Thanks @mqy I like this approach.

@mqy
Copy link
Contributor Author

mqy commented Dec 20, 2020

@andygrove finally I force pushed just the newly added mod. PR desc was updated too.

@mqy mqy force-pushed the ARROW-10967_optional_env branch 4 times, most recently from 3ebffce to e8b74ac Compare December 21, 2020 07:23
@mqy mqy closed this Dec 21, 2020
@mqy mqy reopened this Dec 21, 2020
@mqy mqy force-pushed the ARROW-10967_optional_env branch 3 times, most recently from a769d4e to f05badb Compare December 21, 2020 07:47
@mqy mqy changed the title ARROW-10967: [Rust] Make env vars optional ARROW-10967: [Rust] Add mod arrow::util::test_data_dir Dec 21, 2020
@mqy mqy closed this Dec 21, 2020
@mqy mqy reopened this Dec 21, 2020
@mqy mqy force-pushed the ARROW-10967_optional_env branch 4 times, most recently from e608972 to 9574571 Compare December 21, 2020 14:37
@mqy mqy requested a review from andygrove December 22, 2020 03:16
@mqy mqy changed the title ARROW-10967: [Rust] Add mod arrow::util::test_data_dir ARROW-10967: [Rust] Add functions arrow::util::test_util::{arrow_test_data, parquet_test_data} Dec 22, 2020
@mqy mqy changed the title ARROW-10967: [Rust] Add functions arrow::util::test_util::{arrow_test_data, parquet_test_data} ARROW-10967: [Rust] Add functions for test data to mod arrow::util::test_util Dec 22, 2020
@mqy
Copy link
Contributor Author

mqy commented Dec 22, 2020

Could you describe (or better illustrate in the PR) how these functions are intended to be used? It is difficult to evaluate whether the API fits a use-case if they are not called anywhere.

Thanks! New changes:

  • rename function names to lower case
  • moved the new codes to test_util.rs
  • fix and test
  • update PR title and content

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @mqy -- I really like the idea of being able to run cargo test without having to specify an environment variable (while still allowing the directories to be overridden by environment variable)

I am not sure I should admit it, but this is how I run the arrow tests now :

cd /Users/alamb/Software/arrow/rust/datafusion  && PARQUET_TEST_DATA=`pwd`/../../cpp/submodules/parquet-testing/data ARROW_TEST_DATA=`pwd`/../../testing/data cargo test

Which is a mess, so for me, the premise of this PR would be a big improvement

I wonder if we can use a Cargo environment variable rather than calling out to git to find the appropriate paths. For example, I think env!("CARGO_MANIFEST_DIR") is the directory containing Cargo.toml.

I am not opposed to calling out to git either, as long as there is a reasonable error message that explains what environment variables to set when that fails (e.g. git isn't installed or something)-- I didn't test this.

I would personally be happy to accept this PR (based on calling git or something else) if simply running cargo test would pass all tests without having to set the *_TEST_DATA variables ( I think you would just need to make the changes you suggested of "Existing codes can be updated in this way :")

I tried locally and I still get errors:

(arrow_dev) alamb@MacBook-Pro:~/Software/arrow/rust$ cargo test --all
...
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'ARROW_TEST_DATA not defined: NotPresent', arrow/src/ipc/reader.rs:988:52
note: panic did not contain expected string
      panic message: `"ARROW_TEST_DATA not defined: NotPresent"`,
 expected substring: `"Big Endian is not supported for Decimal!"`
...

@mqy
Copy link
Contributor Author

mqy commented Dec 22, 2020

panic message: "ARROW_TEST_DATA not defined: NotPresent"

@alamb thanks for review.

I guess if you had unset ARROW_TEST_DATA, env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined"); should panic with that error.

I'll double check this.

@mqy
Copy link
Contributor Author

mqy commented Dec 22, 2020

I wonder if we can use a Cargo environment variable rather than calling out to git to find the appropriate paths. For example, I think env!("CARGO_MANIFEST_DIR") is the directory containing Cargo.toml.

@alamb The Cargo Book :: Environment Variables states that:

CARGO_MANIFEST_DIR — The directory containing the manifest for **the package** being built (the package containing the build script). Also note that this is the value of the current working directory of the build script when it starts.

I've verified with unit test in the "arrow" package, that env!("CARGO_MANIFEST_DIR") equals to env::current_dir().unwrap().display().to_string() -- the dir "rust/arrow".

Actually, the most simplest implementation to get "git top level dir" is to append "../../" to current dir at present.

I have tried various ways to get the git top level dir:

  • find ".git" file or dir from parents of current dir
  • find dir "rust" from parents of current dir
  • use git command -- this is current solution

I prefer that arrow_test_data() and parquet_test_dir() SHOULD NOT depend on the actual dir structure: workspace dir "rust/" or any existing package dir ("parquet/" for example), or possible deeper nesting dirs. The major reason is: these data dirs are defined (controlled) by git submodule for all sub projects, it's none of the business of underlying dir structures in "apache arrow".

Thank you for your review and comments!

@mqy
Copy link
Contributor Author

mqy commented Dec 22, 2020

I would personally be happy to accept this PR (based on calling git or something else) if simply running cargo test would pass all tests without having to set the *_TEST_DATA variables ( I think you would just need to make the changes you suggested of "Existing codes can be updated in this way :")

Actually, in previous commits (that was reverted), I had updated other codes, but sometimes this PR fails in CI tests, caused by code merge, typical error is about unknown env: std::env was delete by this PR.

@alamb If I understood correctly, are you suggesting me update existing codes that call env::("*_TEST_DIR") ?
If so, I'll commit files (including to README.md) for review.

@mqy
Copy link
Contributor Author

mqy commented Dec 22, 2020

as long as there is a reasonable error message that explains what environment variables to set when that fails (e.g. git isn't installed or something)

Good point, let me update the error message.

@andygrove
Copy link
Member

One observation about the use of git is that it won't help when we are validating a release candidate based on the source tarball.

I think it would be better if we just relied on cargo env vars if that is possible.

@mqy
Copy link
Contributor Author

mqy commented Dec 23, 2020

@andygrove @alamb @jorgecarleitao

I updated code and PR: drop git, add env CARGO_MANIFEST_DIR. Good news: it should be better than the pure env solution and the codes get cleaner than ever.

Coverage (amd64, stable) fails at test encodings::encoding::tests::test_bool ... FAILED, but I can't re-produce this failure on local macOS with cargo +stable test. I think this failure is not introduced in this PR.

@alamb
Copy link
Contributor

alamb commented Dec 23, 2020

Coverage (amd64, stable) fails at test encodings::encoding::tests::test_bool ... FAILED, but I can't re-produce this failure on local macOS with cargo +stable test. I think this failure is not introduced in this PR.

I think this is https://issues.apache.org/jira/browse/ARROW-10943 -- it is intermittently failing. I will retrigger the tests

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think this is looking good. Thanks again @mqy

I thought it was so good, in fact, that I whipped up a small PR based on this work to migrate the tests over to use arrow_test_data and parquet_test_data -- and thereby scratching an itch I have had for a long time about the environment variables being cumbersome.

#8996 is based on the code in this PR


env::set_var(udf_env, non_existing_str);
let res = get_data_dir(udf_env, existing_str);
debug_assert!(res.is_err());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would personally recommend using assert! here and below rather than debug assert -- but it probably makes no practical difference since we would run these tests only on debug builds.

Suggested change
debug_assert!(res.is_err());
debug!(res.is_err());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated again: replaced debug_assert with assert.

@alamb that's great, I appreciate all the kindness, guidance!

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good. Thanks a lot for taking the time for this. I look forward to not have to set these export every time I want to run these. :)

@nevi-me nevi-me closed this in be72a2b Dec 23, 2020
nevi-me pushed a commit that referenced this pull request Dec 27, 2020
This PR is based on @mqy 's great work in: #8967.

(If we want to take this PR, we can either merge it in to https://github.com/apache/arrow/pull/8967/files# directly or I can make a new independent PR when that is merged).

## Rationale
The outcome is that developers can now simply run  `cargo test` in a typical checkout without having to mess with environment variables. I think this will lower the barrier to entry for people to contribute.

## Changes
1. Code from #8967 to encode heuristics of where to check for test data
1. Remove all references to ARROW_TEST_DATA and PARQUET_TEST_DATA and uses the test_util methods instead
2. Update the comments / error messages in test_util

## Example Error Handling

Error handling: here is what happens with a fresh checkout and no git modules checked out  and no environment variables set:

```
cargo test -p arrow
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/private/tmp/arrow/rust/arrow/../../testing/data` not found
HINT: try running `git submodule update --init`', arrow/src/util/test_util.rs:81:21
```

Here is an example of what happens when `ARROW_TEST_DATA` is pointing somewhere non existent

```
ARROW_TEST_DATA=blargh cargo test -p arrow
...
--- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: the data dir `blargh` defined by env ARROW_TEST_DATA not found', arrow/src/util/test_util.rs:81:21
```

Closes #8996 from alamb/alamb/tests_without_env

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
This PR is based on @mqy 's great work in: apache/arrow#8967.

(If we want to take this PR, we can either merge it in to https://github.com/apache/arrow/pull/8967/files# directly or I can make a new independent PR when that is merged).

## Rationale
The outcome is that developers can now simply run  `cargo test` in a typical checkout without having to mess with environment variables. I think this will lower the barrier to entry for people to contribute.

## Changes
1. Code from apache/arrow#8967 to encode heuristics of where to check for test data
1. Remove all references to ARROW_TEST_DATA and PARQUET_TEST_DATA and uses the test_util methods instead
2. Update the comments / error messages in test_util

## Example Error Handling

Error handling: here is what happens with a fresh checkout and no git modules checked out  and no environment variables set:

```
cargo test -p arrow
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/private/tmp/arrow/rust/arrow/../../testing/data` not found
HINT: try running `git submodule update --init`', arrow/src/util/test_util.rs:81:21
```

Here is an example of what happens when `ARROW_TEST_DATA` is pointing somewhere non existent

```
ARROW_TEST_DATA=blargh cargo test -p arrow
...
--- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: the data dir `blargh` defined by env ARROW_TEST_DATA not found', arrow/src/util/test_util.rs:81:21
```

Closes #8996 from alamb/alamb/tests_without_env

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…est_util

If we could get test data dirs at runtime,  both env vars `ARROW_TEST_DATA` and `PARQUET_TEST_DATA` become **optional**: no need to set them unless the testing data is not in pre-defined location.

This PR adds two similar public functions `arrow_test_data` and `parquet_test_data` to mod `arrow::util::test_util`, each behaves like this:

- return data dir from user defined env if defined and corresponding dir exists.
- return default data dir by joining env `CARGO_MANIFEST_DIR` and relative pre-defined data data dirs.
- panic on error.

Possible panic errors from `arrow_test_data()`:

```
- failed to get arrow data dir: the data dir `non/existing` defined by env `ARROW_TEST_DATA` not found
- failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../testing/data` not found
```

Possible panic errors from `parquet_test_data()`:

```
- failed to get parquet data dir: the data dir `non/existing` defined by env `PARQUET_TEST_DATA` not found
- failed to get parquet data dir: env `PARQUET_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../cpp/submodules/parquet-testing/data` not found
```

Existing codes can be updated in this way :
```
let testdata = env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined");
// change to
let testdata = arrow::util::test_util::arrow_test_data();
```

Closes apache#8967 from mqy/ARROW-10967_optional_env

Authored-by: mqy <meng.qingyou@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR is based on @mqy 's great work in: apache#8967.

(If we want to take this PR, we can either merge it in to https://github.com/apache/arrow/pull/8967/files# directly or I can make a new independent PR when that is merged).

## Rationale
The outcome is that developers can now simply run  `cargo test` in a typical checkout without having to mess with environment variables. I think this will lower the barrier to entry for people to contribute.

## Changes
1. Code from apache#8967 to encode heuristics of where to check for test data
1. Remove all references to ARROW_TEST_DATA and PARQUET_TEST_DATA and uses the test_util methods instead
2. Update the comments / error messages in test_util

## Example Error Handling

Error handling: here is what happens with a fresh checkout and no git modules checked out  and no environment variables set:

```
cargo test -p arrow
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/private/tmp/arrow/rust/arrow/../../testing/data` not found
HINT: try running `git submodule update --init`', arrow/src/util/test_util.rs:81:21
```

Here is an example of what happens when `ARROW_TEST_DATA` is pointing somewhere non existent

```
ARROW_TEST_DATA=blargh cargo test -p arrow
...
--- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: the data dir `blargh` defined by env ARROW_TEST_DATA not found', arrow/src/util/test_util.rs:81:21
```

Closes apache#8996 from alamb/alamb/tests_without_env

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants