Dataframe doctests in the main branch are taking very long to run (over 60 seconds) #5347

Closed
iajoiner opened this issue Feb 20, 2023 · 2 comments · Fixed by #9402
Labels
bug Something isn't working

Comments

@iajoiner
Contributor

iajoiner commented Feb 20, 2023

Describe the bug
Doctests in dataframe.rs take a very long time to run on the main branch. The tests also appear to consume so many resources that it is hard to submit this issue, or even switch to another tab, while they are running.

test src/dataframe.rs - dataframe::DataFrame (line 62) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::aggregate (line 189) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::cache (line 860) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::collect (line 471) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::collect_partitioned (line 550) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::count (line 438) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::distinct (line 287) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::except (line 719) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::execute_stream (line 530) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::execute_stream_partitioned (line 569) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::explain (line 658) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::filter (line 169) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::intersect (line 696) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::join (line 330) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::join_on (line 371) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::limit (line 221) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::registry (line 678) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::repartition (line 417) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::schema (line 590) has been running for over 60 seconds
test src/dataframe.rs - dataframe::DataFrame::select (line 124) has been running for over 60 seconds

To Reproduce
Steps to reproduce the behavior:
cargo test --doc DataFrame
Expected behavior
The tests should run faster and should not cause my machine to hang.
Additional context
I'm on a fairly new and capable Ubuntu 22.04/AMD64 machine.

@iajoiner iajoiner added the bug Something isn't working label Feb 20, 2023
@Jefffrey
Contributor

Jefffrey commented Mar 4, 2023

I think this may be an issue with Rust doctest in general: rust-lang/rust#75341

Not sure what can be done here. Options might be to reduce the number of doctests (not really ideal) or to make it possible to omit the doctests from the default cargo test run.

@devinjdangelo
Contributor

These tests lock up because running them in parallel, which is cargo's default behavior, uses a very large amount of memory. My system peaked at over 100GB of memory utilization 🤯 ! I took a look through the dataframe doc tests, and I don't see any inherent reason for such extreme memory usage. I believe @Jefffrey is correct that the cause is Rust loading many copies of a large debug binary into memory.
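As a rough illustration of why parallelism matters here: if every concurrently running doctest holds about the same amount of memory, peak usage grows linearly with the number of test threads. The sketch below only does that multiplication. The helper name `estimate_peak_mb` and the per-test figure of 3000 MB are hypothetical, chosen for illustration and not measured from this repository.

```shell
# Rough estimate only: assumes each concurrently running doctest holds
# about the same amount of memory. The per-test figure passed in below
# is a made-up number, not a measurement.
estimate_peak_mb() {
  threads=$1
  per_test_mb=$2
  echo $(( threads * per_test_mb ))
}

estimate_peak_mb 32 3000   # 32 doctests running at once
estimate_peak_mb 2 3000    # same doctests, capped at 2 threads
```

Under these assumed numbers, lowering the thread count from 32 to 2 cuts the estimated peak by a factor of 16, which is why capping parallelism is a plausible workaround even though the tests themselves are unchanged.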

I think a reasonable workaround to improve the developer experience would be to find a way to make cargo default to running these specific tests with a maximum parallelism somewhere in the 1-4 range, which should work on most systems.

You can do this manually by running cargo test --doc dataframe -- --test-threads 1
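A minimal sketch of how the thread cap could be scripted, assuming a POSIX shell. The helper `pick_threads` is hypothetical, and the cap of 4 is simply the upper end of the 1-4 range suggested above. The script prints the cargo command instead of running it, so it is safe to try outside the repository.

```shell
# pick_threads is a hypothetical helper: it returns the number of cores,
# but never more than a small cap (4, the top of the suggested 1-4 range).
pick_threads() {
  cores=$1
  cap=4
  if [ "$cores" -lt "$cap" ]; then
    echo "$cores"
  else
    echo "$cap"
  fi
}

# Detect the number of online CPUs; fall back to 1 if getconf is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
threads=$(pick_threads "$cores")

# Print the command rather than executing it.
echo "cargo test --doc dataframe -- --test-threads $threads"
```

On a machine with 16 cores this prints the command with `--test-threads 4`; on a 2-core machine it prints `--test-threads 2`. Copy the printed line into a terminal at the crate root to run the doctests with that limit.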
