Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11026 [Rust]: Run tests without requiring environment variables #8996

Closed
wants to merge 7 commits into from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Dec 23, 2020

This PR is based on @mqy 's great work in: #8967.

(If we want to take this PR, we can either merge it in to https://github.com/apache/arrow/pull/8967/files# directly or I can make a new independent PR when that is merged).

Rationale

The outcome is that developers can now simply run cargo test in a typical checkout without having to mess with environment variables. I think this will lower the barrier to entry for people to contribute.

Changes

  1. Code from ARROW-10967: [Rust] Add functions for test data to mod arrow::util::test_util #8967 to encode heuristics of where to check for test data
  2. Remove all references to ARROW_TEST_DATA and PARQUET_TEST_DATA and uses the test_util methods instead
  3. Update the comments / error messages in test_util

Example Error Handling

Error handling: here is what happens with a fresh checkout and no git modules checked out and no environment variables set:

cargo test -p arrow
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/private/tmp/arrow/rust/arrow/../../testing/data` not found
HINT: try running `git submodule update --init`', arrow/src/util/test_util.rs:81:21

Here is an example of what happens when ARROW_TEST_DATA is pointing somewhere non existent

ARROW_TEST_DATA=blargh cargo test -p arrow
...
--- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: the data dir `blargh` defined by env ARROW_TEST_DATA not found', arrow/src/util/test_util.rs:81:21

@github-actions
Copy link

Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestions, this looks great @alamb @mqy

rust/README.md Outdated
The following Env vars are required to run `cargo test`, examples, etc.
By default, `cargo test` will look for these directories at their
standard location. The following Env vars can be used to override the
location shoud you choose
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
location shoud you choose
location should you choose.

rust/arrow-flight/src/arrow.flight.protocol.rs Outdated Show resolved Hide resolved
rust/arrow/src/util/test_util.rs Outdated Show resolved Hide resolved
}
};
let mut pathbuf =
PathBuf::from_str(&arrow::util::test_util::parquet_test_data()).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, better to move the parquet test data fn into this crate, because arrow is an optional dependency

Copy link
Contributor

@mqy mqy Dec 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @nevi-me

Actually I had asked similar question "where to put them?", finally I decided leaving this question to reviewers, because the most important thing at that time is coding.

At present, with this PR, thanks to @alamb, we finally got a chance to discuss this issue. Actually I don't mind how to improve or split the code in #8967 at all, as long as the fundamental purpose could be achieved, and they do not mess up the existing testing code too much :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nevi-me and @mqy -- I tried to move parquet_test_data into https://github.com/apache/arrow/blob/master/rust/parquet/src/util/test_common/file_util.rs -- however, the code quickly got messy because parquet::util is not publically exported and thus I can't use functions defined there in places (like datafusion) outside the parquet crate. Furthermore, the test_utils are only compiled in test config, but several datafusion examples use the parquet test data but they are not compiled in test config.

I can think of several possibilities:

  1. Leave the parquet_test_data function in the arrow crate as it is in this PR
  2. Make a copy of parquet_test_data in the parquet crate
  3. Make the parquet util module public and export test_util in all configurations

Given that this function is used in tests and the other options seem messy to me, I suggest number 1 (though perhaps I am being lazy)

Copy link
Contributor

@mqy mqy Dec 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the first choice right now. Perhaps these functions should stay in a top crate parallel to arrow crate, as common utilities for testing or something else. We may create that crate later.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Dec 23, 2020
@alamb alamb changed the title ARROW-10967 [Rust]: Run tests without requiring environment variables ARROW-11026 [Rust]: Run tests without requiring environment variables Dec 24, 2020
@github-actions
Copy link

@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 24, 2020
@codecov-io
Copy link

codecov-io commented Dec 24, 2020

Codecov Report

Merging #8996 (84ceb2b) into master (8fa68b0) will increase coverage by 0.00%.
The diff coverage is 96.15%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #8996   +/-   ##
=======================================
  Coverage   82.87%   82.88%           
=======================================
  Files         201      201           
  Lines       49729    49734    +5     
=======================================
+ Hits        41213    41220    +7     
+ Misses       8516     8514    -2     
Impacted Files Coverage Δ
rust/arrow/src/util/test_util.rs 90.90% <ø> (ø)
rust/datafusion/examples/flight_server.rs 0.00% <0.00%> (ø)
rust/datafusion/src/datasource/csv.rs 81.25% <ø> (ø)
rust/arrow/src/ipc/reader.rs 82.86% <100.00%> (ø)
rust/arrow/src/ipc/writer.rs 83.25% <100.00%> (ø)
rust/datafusion/src/datasource/parquet.rs 96.92% <100.00%> (ø)
rust/datafusion/src/execution/dataframe_impl.rs 93.47% <100.00%> (ø)
rust/datafusion/src/physical_plan/csv.rs 82.78% <100.00%> (ø)
rust/datafusion/src/physical_plan/parquet.rs 80.46% <100.00%> (+0.15%) ⬆️
rust/datafusion/src/physical_plan/planner.rs 77.46% <100.00%> (ø)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fa68b0...84ceb2b. Read the comment docs.

@mqy
Copy link
Contributor

mqy commented Dec 24, 2020

@alamb those two envs also stay in rust/arrow/README.md
How about remove them from rust/arrow/README.md, or link user to to rust/README.md ?

@alamb
Copy link
Contributor Author

alamb commented Dec 26, 2020

How about remove them from rust/arrow/README.md, or link user to to rust/README.md ?

Good idea @mqy -- done.

@alamb
Copy link
Contributor Author

alamb commented Dec 26, 2020

@nevi-me -- I have not been able to find a satisfactory way to avoid using the common parquet_test_data function from the arrow crate in the parquet crate. Since they are only used in tests, I think this is ok -- however, if you would like a different organization please let me know and I can do it as a follow on PR

@nevi-me
Copy link
Contributor

nevi-me commented Dec 27, 2020

Hey @alamb, it's fine, we can merge the changes as they are. Thanks for trying to move the function.

@nevi-me nevi-me closed this in ecb760f Dec 27, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR is based on @mqy 's great work in: apache#8967.

(If we want to take this PR, we can either merge it in to https://github.com/apache/arrow/pull/8967/files# directly or I can make a new independent PR when that is merged).

## Rationale
The outcome is that developers can now simply run  `cargo test` in a typical checkout without having to mess with environment variables. I think this will lower the barrier to entry for people to contribute.

## Changes
1. Code from apache#8967 to encode heuristics of where to check for test data
1. Remove all references to ARROW_TEST_DATA and PARQUET_TEST_DATA and uses the test_util methods instead
2. Update the comments / error messages in test_util

## Example Error Handling

Error handling: here is what happens with a fresh checkout and no git modules checked out  and no environment variables set:

```
cargo test -p arrow
---- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: env `ARROW_TEST_DATA` is undefined or has empty value, and the pre-defined data dir `/private/tmp/arrow/rust/arrow/../../testing/data` not found
HINT: try running `git submodule update --init`', arrow/src/util/test_util.rs:81:21
```

Here is an example of what happens when `ARROW_TEST_DATA` is pointing somewhere non existent

```
ARROW_TEST_DATA=blargh cargo test -p arrow
...
--- ipc::reader::tests::read_decimal_be_file_should_panic stdout ----
thread 'ipc::reader::tests::read_decimal_be_file_should_panic' panicked at 'failed to get arrow data dir: the data dir `blargh` defined by env ARROW_TEST_DATA not found', arrow/src/util/test_util.rs:81:21
```

Closes apache#8996 from alamb/alamb/tests_without_env

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants