Stack overflows when planning tpcds 22 in debug mode #4786

Closed
alamb opened this issue Jan 1, 2023 · 4 comments
Labels
bug Something isn't working

Comments

alamb commented Jan 1, 2023

Describe the bug
While we fixed some stack overflows in #4065, DataFusion still overflows its stack when planning some complex queries in debug mode.

This happens on the CI builders.

To Reproduce
Un-ignore the tests in the tpcds_planning suite:

diff --git a/datafusion/core/tests/tpcds_planning.rs b/datafusion/core/tests/tpcds_planning.rs
index 7359f3906..1e3cea8be 100644
--- a/datafusion/core/tests/tpcds_planning.rs
+++ b/datafusion/core/tests/tpcds_planning.rs
@@ -343,7 +343,6 @@ async fn tpcds_logical_q63() -> Result<()> {
     create_logical_plan(63).await
 }
 
-#[ignore] // thread 'q64' has overflowed its stack]
 #[tokio::test]
 async fn tpcds_logical_q64() -> Result<()> {
     create_logical_plan(64).await
@@ -851,7 +850,6 @@ async fn tpcds_physical_q63() -> Result<()> {
     create_physical_plan(63).await
 }
 
-#[ignore] // thread 'q64' has overflowed its stack
 #[tokio::test]
 async fn tpcds_physical_q64() -> Result<()> {
     create_physical_plan(64).await

Run on my machine (MacOS) like:

RUST_MIN_STACK=1000000 cargo test --test tpcds_planning
...
running 198 tests
test tpcds_logical_q22 ... ok

thread 'tpcds_logical_q1' has overflowed its stack
fatal runtime error: stack overflow
error: test failed, to rerun pass `-p datafusion --test tpcds_planning`
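
For readers unfamiliar with `RUST_MIN_STACK`: it sets the minimum stack size for threads spawned through the Rust standard library, and the same limit can be set per thread with `std::thread::Builder::stack_size`. The following is a minimal sketch of that mechanism only; `depth` and `run_with_stack` are hypothetical stand-ins for a deeply recursive planner pass and are not part of DataFusion.

```rust
use std::thread;

/// Hypothetical stand-in for a deeply recursive pass:
/// every call adds one frame to the thread's stack.
fn depth(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        1 + depth(n - 1)
    }
}

/// Runs `depth(n)` on a new thread with an explicit stack size in bytes.
/// (Assumed helper for illustration only.)
fn run_with_stack(stack_bytes: usize, n: u64) -> u64 {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || depth(n))
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}

fn main() {
    // A generous 64 MiB stack leaves plenty of room for 50,000 small frames,
    // even in an unoptimized debug build.
    let result = run_with_stack(64 * 1024 * 1024, 50_000);
    println!("recursion depth reached: {result}");
}
```

Raising the stack size only moves the limit; a sufficiently deep plan would still overflow, which is why removing the recursion is the more durable fix.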

Expected behavior
No stack overflow

Additional context

alamb added the bug label Jan 1, 2023

alamb commented Jan 1, 2023

Prior to the fixes for #4065, these queries would overflow the stack during SQL planning.

After the fixes for #4065, the overflow occurs in the push_down_filter pass.

You can see what the problem is by running under lldb:

$ RUST_MIN_STACK=1000000 rust-lldb /Users/alamb/Software/target-df2/debug/deps/tpcds_planning-c781740f1011efdc
...
(lldb) type category enable Rust
(lldb) target create "/Users/alamb/Software/target-df2/debug/deps/tpcds_planning-c781740f1011efdc"
Current executable set to '/Users/alamb/Software/target-df2/debug/deps/tpcds_planning-c781740f1011efdc' (x86_64).
(lldb) r
Process 93091 launched: '/Users/alamb/Software/target-df2/debug/deps/tpcds_planning-c781740f1011efdc' (x86_64)

running 198 tests
test tpcds_logical_q22 ... ok
Process 93091 stopped
* thread #2, name = 'tpcds_logical_q1', stop reason = EXC_BAD_ACCESS (code=2, address=0x70000488f3a8)
    frame #0: 0x000000010399f23e tpcds_planning-c781740f1011efdc`__rust_probestack + 23
tpcds_planning-c781740f1011efdc`:
->  0x10399f23e <+23>: testq  %rsp, 0x8(%rsp)
    0x10399f243 <+28>: subq   $0x1000, %r11             ; imm = 0x1000 
    0x10399f24a <+35>: cmpq   $0x1000, %r11             ; imm = 0x1000 
    0x10399f251 <+42>: ja     0x10399f237               ; <+16>
Target 0: (tpcds_planning-c781740f1011efdc) stopped.
(lldb) bt
* thread #2, name = 'tpcds_logical_q1', stop reason = EXC_BAD_ACCESS (code=2, address=0x70000488f3a8)
  * frame #0: 0x000000010399f23e tpcds_planning-c781740f1011efdc`__rust_probestack + 23
    frame #1: 0x000000010106b50e tpcds_planning-c781740f1011efdc`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_rewriter..ExprRewritable$GT$::rewrite::h980731ca505c0427 at expr_rewriter.rs:101
    frame #2: 0x000000010103b165 tpcds_planning-c781740f1011efdc`datafusion_expr::expr_rewriter::rewrite_boxed::haff2d36daa5dd0d6 at expr_rewriter.rs:312:26
    frame #3: 0x000000010106bd0c tpcds_planning-c781740f1011efdc`_$LT$datafusion_expr..expr..Expr$u20$as$u20$datafusion_expr..expr_rewriter..ExprRewritable$GT$::rewrite::h980731ca505c0427 at expr_rewriter.rs:131:21
    frame #4: 0x00000001010b3021 tpcds_planning-c781740f1011efdc`datafusion_expr::utils::from_plan::he7efef4d1b95ebfa at utils.rs:523:29
    frame #5: 0x00000001007055ae tpcds_planning-c781740f1011efdc`datafusion_optimizer::utils::optimize_children::h54aa06b4ed6a900f at utils.rs:51:5
    frame #6: 0x00000001006bf761 tpcds_planning-c781740f1011efdc`_$LT$datafusion_optimizer..push_down_filter..PushDownFilter$u20$as$u20$datafusion_optimizer..optimizer..OptimizerRule$GT$::try_optimize::hcc5af41853d9aa51 at push_down_filter.rs:745:17
    frame #7: 0x0000000100705367 tpcds_planning-c781740f1011efdc`datafusion_optimizer::utils::optimize_children::h54aa06b4ed6a900f at utils.rs:48:25
    frame #8: 0x00000001006bf761 tpcds_planning-c781740f1011efdc`_$LT$datafusion_optimizer..push_down_filter..PushDownFilter$u20$as$u20$datafusion_optimizer..optimizer..OptimizerRule$GT$::try_optimize::hcc5af41853d9aa51 at push_down_filter.rs:745:17
    frame #9: 0x0000000100705367 tpcds_planning-c781740f1011efdc`datafusion_optimizer::utils::optimize_children::h54aa06b4ed6a900f at utils.rs:48:25
    frame #10: 0x00000001006bf000 tpcds_planning-c781740f1011efdc`_$LT$datafusion_optimizer..push_down_filter..PushDownFilter$u20$as$u20$datafusion_optimizer..optimizer..OptimizerRule$GT$::try_optimize::hcc5af41853d9aa51 at push_down_filter.rs:535:33
    frame #11: 0x0000000100705367 tpcds_planning-c781740f1011efdc`datafusion_optimizer::utils::optimize_children::h54aa06b4ed6a900f at utils.rs:48:25
    frame #12: 0x00000001006bf000 tpcds_planning-c781740f1011efdc`_$LT$datafusion_optimizer..push_down_filter..PushDownFilter$u20$as$u20$datafusion_optimizer..optimizer..OptimizerRule$GT$::try_optimize::hcc5af41853d9aa51 at push_down_filter.rs:535:33
    frame #13: 0x0000000100705367 tpcds_planning-c781740f1011efdc`datafusion_optimizer::utils::optimize_children::h54aa06b4ed6a900f at utils.rs:48:25
    frame #14: 0x00000001006bf000 tpcds_planning-c781740f1011efdc`_$LT$datafusion_optimizer..push_down_filter..PushDownFilter$u20$as$u20$datafusion_optimizer..optimizer..OptimizerRule$GT$::try_optimize::hcc5af41853d9aa51 at push_down_filter.rs:535:33
    frame #15: 0x0000000100705367 tpcds_planning-c781740f1011efdc`datafusion_optimizer::utils::optimize_children::h54aa06b4ed6a900f at utils.rs:48:25
    frame #16: 0x00000001006bf000 tpcds_planning-c781740f1011efdc`_$LT$datafusion_optimizer..push_down_filter..PushDownFilter$u20$as$u20$datafusion_optimizer..optimizer..OptimizerRule$GT$::try_optimize::hcc5af41853d9aa51 at push_down_filter.rs:535:33
...

alamb commented Jan 1, 2023

I believe the good work @jackwener has done to remove recursion from the optimizer rules may help this one.

Specifically, I think #4465 might fix this particular issue

mingmwang commented

@jackwener @alamb
Is this still an issue? It looks like the push_down_filter rule is implemented as a pre-order (top-down) traversal, which can easily lead to a stack overflow.
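
To illustrate why a recursive top-down traversal is prone to this, here is a minimal sketch contrasting a recursive pre-order walk with an iterative one that keeps pending nodes in a heap-allocated `Vec`. The `Node` type, the arena layout, and all function names are hypothetical and chosen for illustration; this is not how DataFusion's optimizer is implemented.

```rust
/// Hypothetical plan node stored in an arena; children are indices into the arena.
struct Node {
    children: Vec<usize>,
}

/// Recursive pre-order walk: call-stack depth grows with the depth of the plan.
fn count_recursive(arena: &[Node], root: usize) -> usize {
    1 + arena[root]
        .children
        .iter()
        .map(|&child| count_recursive(arena, child))
        .sum::<usize>()
}

/// Iterative pre-order walk: pending nodes live on the heap,
/// so the call stack stays flat regardless of plan depth.
fn count_iterative(arena: &[Node], root: usize) -> usize {
    let mut pending = vec![root];
    let mut visited = 0;
    while let Some(idx) = pending.pop() {
        visited += 1;
        // Push children in reverse so the leftmost child is visited first.
        pending.extend(arena[idx].children.iter().rev().copied());
    }
    visited
}

/// Builds a degenerate plan: node i has the single child i + 1.
fn chain(depth: usize) -> Vec<Node> {
    (0..depth)
        .map(|i| Node {
            children: if i + 1 < depth { vec![i + 1] } else { vec![] },
        })
        .collect()
}

fn main() {
    // Both walks agree on a shallow plan.
    let shallow = chain(100);
    assert_eq!(count_recursive(&shallow, 0), count_iterative(&shallow, 0));

    // Only the iterative walk is used on the very deep plan; the recursive
    // version would need one stack frame per level.
    let deep = chain(1_000_000);
    println!("visited {} nodes", count_iterative(&deep, 0));
}
```

The arena layout is used here so that dropping the deep plan does not itself recurse; a chain of nested `Box` nodes would hit the same stack limit on drop.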

jackwener commented

> @jackwener @alamb Is this still an issue? It looks like the push_down_filter rule is implemented as a pre-order (top-down) traversal, which can easily lead to a stack overflow.

I tested it, and it looks like it has already been fixed.
