@xudong963
Member

@xudong963 xudong963 commented Dec 11, 2021

Which issue does this PR close?

Closes #1434

Rationale for this change

Use SafeRecursion to support limited recursion and avoid stack overflow. maybe_grow supports growing the stack unconditionally, spilling over to the heap once the stack has hit its limit.
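
To make the "more stack space" idea concrete, here is a minimal std-only sketch. It is not the stacker crate's maybe_grow and not the code in this PR; it only shows the simpler analogue of running a deep recursion on a thread whose stack size was chosen up front. The names `run_with_stack` and `depth_sum` are hypothetical.

```rust
use std::thread;

/// Toy recursive function standing in for a deep tree walk (hypothetical).
fn depth_sum(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        1 + depth_sum(n - 1)
    }
}

/// Run `f` on a helper thread that gets `stack_bytes` of stack.
/// Assumption: a fixed, larger stack is acceptable; unlike maybe_grow,
/// nothing grows on demand, so a deep enough input can still overflow.
fn run_with_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}
```

Calling `run_with_stack(64 << 20, || depth_sum(100_000))` runs the recursion with 64MB of stack. The limit is only moved, not removed.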

What changes are included in this PR?

Converts a panic! into an Error in some (very important) cases
Allows the user to set a higher stack limit (but overflows are still possible)

Are there any user-facing changes?

No

@xudong963
Member Author

xudong963 commented Dec 11, 2021

I think there are still some places that need to be wrapped by maybe_grow, such as match expr in the physical plan, but I want to support them in the next ticket. In this ticket, the skeleton is laid.

PTAL @alamb @houqp @Dandandan


/// Bytes available in the current stack
pub const STACKER_RED_ZONE: usize = {
64 << 10 // 64KB
Member Author

I picked this constant by feel; I'd like to hear your thoughts.


/// Allocate a new stack of at least stack_size bytes.
pub const STACKER_SIZE: usize = {
4 << 20 // 4MB
Member Author

ditto

SqlToRel { schema_provider }
SqlToRel {
schema_provider,
project_recursion: ProtectRecursion::new_with_limit(2048),
Member Author

@xudong963 xudong963 Dec 11, 2021

I picked this constant by feel; I'd like to hear your thoughts.
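
For readers following the thread, a depth-limited recursion guard could look roughly like the sketch below. This is not the PR's ProtectRecursion type; `DepthLimit`, `TooDeep`, the toy `Expr`, and `or_chain` are all hypothetical names used only to show how a limit turns a would-be stack overflow into an ordinary error.

```rust
/// Hypothetical stand-in for a recursion guard; not the PR's `ProtectRecursion`.
struct DepthLimit {
    limit: usize,
}

/// Error returned when the recursion goes deeper than the configured limit.
#[derive(Debug, PartialEq)]
struct TooDeep {
    limit: usize,
}

/// Toy expression tree, not DataFusion's real `Expr`.
enum Expr {
    Literal(bool),
    Or(Box<Expr>, Box<Expr>),
}

impl DepthLimit {
    fn new_with_limit(limit: usize) -> Self {
        Self { limit }
    }

    /// Evaluate `expr`, returning an error instead of recursing past `limit`.
    fn eval(&self, expr: &Expr) -> Result<bool, TooDeep> {
        self.eval_at(expr, 1)
    }

    fn eval_at(&self, expr: &Expr, depth: usize) -> Result<bool, TooDeep> {
        if depth > self.limit {
            return Err(TooDeep { limit: self.limit });
        }
        match expr {
            Expr::Literal(b) => Ok(*b),
            Expr::Or(l, r) => {
                Ok(self.eval_at(l, depth + 1)? || self.eval_at(r, depth + 1)?)
            }
        }
    }
}

/// Build a left-deep chain of `n` OR nodes over `false` literals.
fn or_chain(n: usize) -> Expr {
    let mut e = Expr::Literal(false);
    for _ in 0..n {
        e = Expr::Or(Box::new(e), Box::new(Expr::Literal(false)));
    }
    e
}
```

With a limit of 16, `or_chain(10)` evaluates normally while `or_chain(100)` is rejected with `TooDeep`. A suitable limit depends on frame size and available stack, which is why a value such as 2048 is a judgment call.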

@jimexist
Member

I wonder if it makes sense to just limit number of or statements instead?

@xudong963
Member Author

I wonder if it makes sense to just limit number of or statements instead?

First, Statement represents a SQL AST node, so I don't think it's reasonable to limit the depth of the recursion stack by limiting the number of Statements.

Second, in some cases it's preferable to grow the stack unconditionally rather than return an error or panic.

Contributor

@alamb alamb left a comment

I ran out of time today to look at this issue carefully, but will do so tomorrow

Member

@houqp houqp left a comment

This is an acceptable short term workaround, but I think it would be more efficient and more elegant if we rewrite these recursive procedures into iterative procedures.

@xudong963
Member Author

This is an acceptable short term workaround, but I think it would be more efficient and more elegant if we rewrite these recursive procedures into iterative procedures.

Do you mean to use push-based model?

@xudong963 xudong963 force-pushed the safe_recursion branch 2 times, most recently from 6459001 to f0593d1, on December 13, 2021 14:32
Contributor

@alamb alamb left a comment

I previously fixed a similar issue in #1047

For the proposed approach (grow the stack if needed), I think this PR is quite elegant -- nicely done @xudong963

However, I think this PR doesn't actually fix the stack overflow, rather it:

  1. Converts a panic! into an Error in some (very important) cases
  2. Allows the user to set a higher stack limit (but overflows are still possible)

The idea of DataFusion protecting itself from pathological input cases is 👍. The thing I worry about with the approach in this PR is that it relies on us annotating with maybe_grow every place in the code (both those that exist today and those that may be added in the future) that does a recursive walk of the tree. It seems almost inevitable that we will end up missing some, and I do think this code will likely add a non-trivial overhead to DataFusion.

I wonder if it makes sense to just limit number of or statements instead?

I think the idea from @jimexist was to limit the depth of any Expr tree by running a checking function like

pub fn check_depth(plan: LogicalPlan) {
  // recursively check all Exprs in all LogicalPlans 
  // for depth greater than 50 (or something)
}
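
One way to flesh out that idea is sketched below. It uses a toy `Expr` enum rather than DataFusion's real Expr or LogicalPlan types, and the `children` helper and the error message are assumptions for illustration. The check itself uses an explicit stack on the heap, so it cannot overflow on the very input it is meant to reject.

```rust
/// Toy expression tree; a stand-in for DataFusion's `Expr`, not the real type.
enum Expr {
    Leaf,
    BinaryOr(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Direct children of this node (hypothetical helper).
    fn children(&self) -> Vec<&Expr> {
        match self {
            Expr::Leaf => vec![],
            Expr::BinaryOr(l, r) => vec![l.as_ref(), r.as_ref()],
        }
    }
}

/// Return an error if any node sits deeper than `max_depth` (root is depth 1).
fn check_depth(root: &Expr, max_depth: usize) -> Result<(), String> {
    // Each entry pairs a node with its depth in the tree.
    let mut stack = vec![(root, 1usize)];
    while let Some((node, depth)) = stack.pop() {
        if depth > max_depth {
            return Err(format!("expression depth exceeds limit of {}", max_depth));
        }
        for child in node.children() {
            stack.push((child, depth + 1));
        }
    }
    Ok(())
}
```

A chain of 100 nested OR nodes has depth 101, so `check_depth(&expr, 50)` rejects it and `check_depth(&expr, 101)` accepts it.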

I think @houqp's plan of "rewriting to not be recursive" is the best -- I will add a comment describing it shortly

@alamb
Contributor

alamb commented Dec 13, 2021

This is an acceptable short term workaround, but I think it would be more efficient and more elegant if we rewrite these recursive procedures into iterative procedures.

Do you mean to use push-based model?

@xudong963 , I think what @houqp is suggesting is to rewrite code that is recursive to not be recursive.

The pattern for Datafusion probably looks like taking code like

fn visit(expr: &Expr) {
  for child in expr.children() {
    visit(child);
  }
  // do actual expr logic
}

And changing it so the state is tracked with a structure on the heap rather than on the call stack. I think VecDeque is a good one for Rust:

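use std::collections::VecDeque;

fn visit(expr: &Expr) {
  let mut worklist = VecDeque::new();
  worklist.push_back(expr);
  // pop_front returns None once the worklist is empty
  while let Some(parent) = worklist.pop_front() {
    for child in parent.children() {
      worklist.push_back(child);
    }
    // do actual expr logic on parent
    // (note: parents are now handled before their children)
  }
}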

(aka avoid the call back to visit)
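
If the "actual expr logic" must still run after a node's children, as it does in the recursive version, one possible variation is an explicit stack with a per-entry flag. The sketch below assumes a toy `Expr` enum instead of DataFusion's real type; `visit_post_order` and `column_names` are hypothetical names.

```rust
/// Toy expression tree, not DataFusion's real `Expr`.
enum Expr {
    Column(&'static str),
    Or(Box<Expr>, Box<Expr>),
}

/// Visit every node children-first (post-order) without recursion.
/// Each stack entry records whether the node's children were already pushed.
fn visit_post_order<'a>(root: &'a Expr, mut on_node: impl FnMut(&'a Expr)) {
    let mut stack = vec![(root, false)];
    while let Some((node, children_done)) = stack.pop() {
        if children_done {
            on_node(node); // the "actual expr logic" runs here
            continue;
        }
        // Revisit this node once its children have been handled.
        stack.push((node, true));
        if let Expr::Or(left, right) = node {
            // Push right first so the left child is processed first.
            stack.push((right.as_ref(), false));
            stack.push((left.as_ref(), false));
        }
    }
}

/// Example use: collect column names in post-order.
fn column_names(root: &Expr) -> Vec<&'static str> {
    let mut names = Vec::new();
    visit_post_order(root, |e| {
        if let Expr::Column(name) = e {
            names.push(*name);
        }
    });
    names
}
```

The stack lives on the heap, so the depth of the expression tree is bounded by available memory rather than by the thread's stack size.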

@xudong963
Member Author

  • Converts a panic! into an Error in some (very important) cases
  • Allows the user to set a higher stack limit (but overflows are still possible)

Very nice summary. You said exactly what I was thinking 👍

The thing I worry about with the approach in this PR is that it relies on us annotating with maybe_grow every place in the code (both those that exist today and those that may be added in the future) that does a recursive walk of the tree. It seems almost inevitable that we will end up missing some, and I do think this code will likely add a non-trivial overhead to DataFusion.

Yes, I agree. So I think we could check the depth of exprs using safe_recursion in datafusion/src/sql/planner.rs; then we can avoid using maybe_grow in the follow-up process.

@xudong963
Member Author

This is an acceptable short term workaround, but I think it would be more efficient and more elegant if we rewrite these recursive procedures into iterative procedures.

Do you mean to use push-based model?

@xudong963 , I think what @houqp is suggesting is to rewrite code that is recursive to not be recursive.

The pattern for Datafusion probably looks like taking code like

fn visit(expr: &Expr) {
  for child in expr.children() {
    visit(child);
  }
  // do actual expr logic
}

And changing it so the state is tracked with a structure on the heap rather than on the call stack. I think VecDeque is a good one for Rust:

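use std::collections::VecDeque;

fn visit(expr: &Expr) {
  let mut worklist = VecDeque::new();
  worklist.push_back(expr);
  // pop_front returns None once the worklist is empty
  while let Some(parent) = worklist.pop_front() {
    for child in parent.children() {
      worklist.push_back(child);
    }
    // do actual expr logic on parent
    // (note: parents are now handled before their children)
  }
}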

(aka avoid the call back to visit)

Opened issue #1444 to track this.

@xudong963 xudong963 changed the title from "Fix: stack overflow" to "Process stack overflow panic elegantly" Dec 14, 2021
@xudong963 xudong963 marked this pull request as draft December 14, 2021 16:45
@alamb alamb added the stale-pr label Feb 15, 2022
@alamb
Contributor

alamb commented Feb 15, 2022

Marking as stale pr -- will close it in a week or two unless we plan to keep working on it

@xudong963
Member Author

Marking as stale pr -- will close it in a week or two unless we plan to keep working on it

I'll close this PR directly because I plan to fix the problem via #1444

@xudong963 xudong963 closed this Feb 16, 2022
@xudong963 xudong963 deleted the safe_recursion branch February 16, 2022 01:51
Labels

sql SQL Planner

Development

Successfully merging this pull request may close these issues.

Query with 100 OR conditions overflows stack

4 participants