ARROW-10967: [Rust] Add functions for test data to mod arrow::util::test_util #8967
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format?
Codecov Report
@@            Coverage Diff             @@
##           master    #8967      +/-   ##
==========================================
+ Coverage   82.64%   82.65%   +0.01%
==========================================
  Files         200      200
  Lines       49730    49787      +57
==========================================
+ Hits        41098    41153      +55
- Misses       8632     8634       +2
rust/arrow/src/ipc/reader.rs
Outdated
#[test]
fn read_generated_files() {
    let testdata = env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined");
Could you post example output for when a test is run and the test data cannot be found? I expect the unwrap to show a detailed error message, but I just want to be sure since we're removing these explicit messages here.
@andygrove Makes sense, let me refine the errors.
Thanks @mqy I like this approach.
@andygrove finally I force pushed just the newly added mod. PR desc was updated too.
Thanks! New changes:
Thank you for this @mqy -- I really like the idea of being able to run `cargo test` without having to specify an environment variable (while still allowing the directories to be overridden by environment variable).
I am not sure I should admit it, but this is how I run the arrow tests now:
cd /Users/alamb/Software/arrow/rust/datafusion && PARQUET_TEST_DATA=`pwd`/../../cpp/submodules/parquet-testing/data ARROW_TEST_DATA=`pwd`/../../testing/data cargo test
Which is a mess, so for me, the premise of this PR would be a big improvement.
I wonder if we can use a Cargo environment variable rather than calling out to git to find the appropriate paths. For example, I think `env!("CARGO_MANIFEST_DIR")` is the directory containing `Cargo.toml`.
I am not opposed to calling out to `git` either, as long as there is a reasonable error message that explains what environment variables to set when that fails (e.g. `git` isn't installed or something) -- I didn't test this.
I would personally be happy to accept this PR (based on calling `git` or something else) if simply running `cargo test` would pass all tests without having to set the `*_TEST_DATA` variables (I think you would just need to make the changes you suggested of "Existing codes can be updated in this way :").
I tried locally and I still get errors:
(arrow_dev) alamb@MacBook-Pro:~/Software/arrow/rust$ cargo test --all
...
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'ARROW_TEST_DATA not defined: NotPresent', arrow/src/ipc/reader.rs:988:52
note: panic did not contain expected string
panic message: `"ARROW_TEST_DATA not defined: NotPresent"`,
expected substring: `"Big Endian is not supported for Decimal!"`
...
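To illustrate the `CARGO_MANIFEST_DIR` idea above, here is a minimal sketch, not code from this PR. The helper names `manifest_dir` and `default_data_dir` are hypothetical, and `option_env!` is used instead of `env!` only so that the snippet also compiles outside Cargo (falling back to `.`).

```rust
use std::path::PathBuf;

/// Hypothetical helper: the directory containing `Cargo.toml`, as recorded by
/// Cargo at compile time. Falls back to "." when not built by Cargo.
fn manifest_dir() -> &'static str {
    option_env!("CARGO_MANIFEST_DIR").unwrap_or(".")
}

/// Hypothetical helper: joins a base directory with a path relative to it,
/// e.g. "../../testing/data" for the arrow test data in a typical checkout.
/// The result is not normalized, so the ".." components stay in the path.
fn default_data_dir(base: &str, relative: &str) -> PathBuf {
    PathBuf::from(base).join(relative)
}
```

A test could then call `default_data_dir(manifest_dir(), "../../testing/data")` and check that the directory exists before using it, without calling out to `git`.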
@alamb thanks for the review. I guess if you had I'll double check this.
@alamb The Cargo Book :: Environment Variables states that:
I've verified with unit test in the "arrow" package, that Actually, the simplest implementation to get "git top level dir" is to append "../../" to current dir at present. I have tried various ways to get the
I prefer that Thank you for your review and comments!
Actually, in previous commits (that were reverted), I had updated other code, but sometimes this PR fails in CI tests, caused by code merge, typical error is about unknown @alamb If I understood correctly, are you suggesting me update existing codes that call
Good point, let me update the error message.
One observation about the use of git is that it won't help when we are validating a release candidate based on the source tarball. I think it would be better if we just relied on cargo env vars if that is possible.
@andygrove @alamb @jorgecarleitao I updated code and PR: drop Coverage (amd64, stable) fails at
I think this is https://issues.apache.org/jira/browse/ARROW-10943 -- it is intermittently failing. I will retrigger the tests
👍 I think this is looking good. Thanks again @mqy
I thought it was so good, in fact, that I whipped up a small PR based on this work to migrate the tests over to use `arrow_test_data` and `parquet_test_data` -- and thereby scratching an itch I have had for a long time about the environment variables being cumbersome.
#8996 is based on the code in this PR
rust/arrow/src/util/test_util.rs
Outdated
env::set_var(udf_env, non_existing_str);
let res = get_data_dir(udf_env, existing_str);
debug_assert!(res.is_err());
I would personally recommend using `assert!` here and below rather than `debug_assert!` -- but it probably makes no practical difference since we would run these tests only on debug builds.
Suggested change:
- debug_assert!(res.is_err());
+ assert!(res.is_err());
Updated again: replaced `debug_assert` with `assert`.
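For context on the suggestion above: `debug_assert!` is only checked when debug assertions are enabled (the default for `cargo test` without `--release`), while `assert!` is always checked. A minimal sketch, where `failing_lookup` is a hypothetical stand-in for the `get_data_dir` call and not the PR's code:

```rust
/// Hypothetical stand-in for a data dir lookup that fails.
fn failing_lookup() -> Result<String, String> {
    Err(String::from("data dir not found"))
}

/// Returns true when `debug_assert!` checks are compiled in.
fn debug_assertions_enabled() -> bool {
    cfg!(debug_assertions)
}

/// Uses `assert!`, so the check is performed in release builds as well.
fn check_lookup_fails() {
    let res = failing_lookup();
    assert!(res.is_err());
}
```

With `debug_assert!` in place of `assert!`, the same check would be skipped whenever `debug_assertions_enabled()` returns false.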
@alamb that's great, I appreciate all the kindness and guidance!
Really good. Thanks a lot for taking the time for this. I look forward to not having to set these `export`s every time I want to run these. :)
This PR is based on @mqy's great work in: #8967. (If we want to take this PR, we can either merge it in to https://github.com/apache/arrow/pull/8967/files# directly or I can make a new independent PR when that is merged).

## Rationale
The outcome is that developers can now simply run `cargo test` in a typical checkout without having to mess with environment variables. I think this will lower the barrier to entry for people to contribute.

## Changes
1. Code from #8967 to encode heuristics of where to check for test data
2. Remove all references to ARROW_TEST_DATA and PARQUET_TEST_DATA and uses the test_util methods instead
3. Update the comments / error messages in test_util

## Example Error Handling
Error handling: here is what happens with a fresh checkout and no git modules checked out and no environment variables set:
```
cargo test -p arrow
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/private/tmp/arrow/rust/arrow/../../testing/data` not found
HINT: try running `git submodule update --init`', arrow/src/util/test_util.rs:81:21
```
Here is an example of what happens when `ARROW_TEST_DATA` is pointing somewhere non existent
```
ARROW_TEST_DATA=blargh cargo test -p arrow
...
--- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: the data dir `blargh` defined by env ARROW_TEST_DATA not found', arrow/src/util/test_util.rs:81:21
```
Closes #8996 from alamb/alamb/tests_without_env

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
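The outputs above come from a `#[should_panic(expected = ...)]` test, which is why a panic with a different message (such as a missing data dir) is reported as a failure rather than a pass. A minimal sketch of that mechanism, assuming a hypothetical `read_big_endian_decimal` stand-in for the real reader code:

```rust
/// Hypothetical stand-in for the reader: panics with the message the test
/// expects to see.
fn read_big_endian_decimal() {
    panic!("Big Endian is not supported for Decimal!");
}

/// Passes only if the function panics AND the panic message contains the
/// expected substring. If the test data lookup panicked first with another
/// message, the harness would report "panic did not contain expected string".
#[cfg(test)]
#[test]
#[should_panic(expected = "Big Endian is not supported for Decimal!")]
fn read_decimal_be_file_should_panic() {
    read_big_endian_decimal();
}
```

This is why a clear data dir error message matters here: it is the text shown when the expected panic never gets a chance to happen.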
…est_util

If we could get test data dirs at runtime, both env vars `ARROW_TEST_DATA` and `PARQUET_TEST_DATA` become **optional**: no need to set them unless the testing data is not in pre-defined location.

This PR adds two similar public functions `arrow_test_data` and `parquet_test_data` to mod `arrow::util::test_util`, each behaves like this:
- return data dir from user defined env if defined and corresponding dir exists.
- return default data dir by joining env `CARGO_MANIFEST_DIR` and relative pre-defined data dirs.
- panic on error.

Possible panic errors from `arrow_test_data()`:
```
- failed to get arrow data dir: the data dir `non/existing` defined by env `ARROW_TEST_DATA` not found
- failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../testing/data` not found
```
Possible panic errors from `parquet_test_data()`:
```
- failed to get parquet data dir: the data dir `non/existing` defined by env `PARQUET_TEST_DATA` not found
- failed to get parquet data dir: env `PARQUET_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../cpp/submodules/parquet-testing/data` not found
```
Existing codes can be updated in this way :
```
let testdata = env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined");
// change to
let testdata = arrow::util::test_util::arrow_test_data();
```
Closes apache#8967 from mqy/ARROW-10967_optional_env

Authored-by: mqy <meng.qingyou@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
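The lookup order described above could be sketched roughly as follows. This is not the PR's actual implementation: `resolve_data_dir` and `arrow_test_data_sketch` are hypothetical names, the `String` return type is an assumption, and `option_env!` is used only so the snippet compiles outside Cargo. Passing the env value in as a parameter keeps the core logic testable without mutating the process environment.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Hypothetical sketch of the lookup order:
/// 1. a non-empty user-defined value wins, but its dir must exist;
/// 2. otherwise the pre-defined default dir is used, if it exists.
fn resolve_data_dir(
    env_name: &str,
    env_value: Option<&str>,
    default_dir: &Path,
) -> Result<PathBuf, String> {
    if let Some(dir) = env_value.filter(|d| !d.is_empty()) {
        let path = PathBuf::from(dir);
        return if path.is_dir() {
            Ok(path)
        } else {
            Err(format!(
                "the data dir `{}` defined by env `{}` not found",
                dir, env_name
            ))
        };
    }
    if default_dir.is_dir() {
        Ok(default_dir.to_path_buf())
    } else {
        Err(format!(
            "env `{}` is undefined or has empty value, and the pre-defined data dir `{}` not found",
            env_name,
            default_dir.display()
        ))
    }
}

/// Hypothetical wrapper mirroring the "panic on error" behaviour.
fn arrow_test_data_sketch() -> String {
    // Assumption: fall back to "." when not built by Cargo.
    let manifest_dir = option_env!("CARGO_MANIFEST_DIR").unwrap_or(".");
    let default_dir = Path::new(manifest_dir).join("../../testing/data");
    let env_value = env::var("ARROW_TEST_DATA").ok();
    match resolve_data_dir("ARROW_TEST_DATA", env_value.as_deref(), &default_dir) {
        Ok(path) => path.display().to_string(),
        Err(err) => panic!("failed to get arrow data dir: {}", err),
    }
}
```

Returning a `Result` from the inner function and panicking only in the wrapper keeps both error cases easy to assert on in unit tests.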
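If we could get test data dirs at runtime, both env vars `ARROW_TEST_DATA` and `PARQUET_TEST_DATA` become optional: no need to set them unless the testing data is not in pre-defined location.
This PR adds two similar public functions `arrow_test_data` and `parquet_test_data` to mod `arrow::util::test_util`, each behaves like this:
- return data dir from user defined env if defined and corresponding dir exists.
- return default data dir by joining env `CARGO_MANIFEST_DIR` and relative pre-defined data dirs.
- panic on error.
Possible panic errors from `arrow_test_data()`:
- failed to get arrow data dir: the data dir `non/existing` defined by env `ARROW_TEST_DATA` not found
- failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../testing/data` not found
Possible panic errors from `parquet_test_data()`:
- failed to get parquet data dir: the data dir `non/existing` defined by env `PARQUET_TEST_DATA` not found
- failed to get parquet data dir: env `PARQUET_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `../../cpp/submodules/parquet-testing/data` not found
Existing codes can be updated in this way : replace `env::var("ARROW_TEST_DATA").expect("ARROW_TEST_DATA not defined")` with `arrow::util::test_util::arrow_test_data()`.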