Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing stack overflow issue for large tree #1441

Merged
merged 7 commits into from
Feb 9, 2023

Conversation

wanmeihuali
Copy link
Contributor

This is a patch for Issue-1433.

  • The root cause for Issue-1433 is that smart pointers, especially shared_ptr, is not a good choice to manage data structures such as tree or graph, because they can easily cause stack overview during destruction.
  • In this patch, we manually release the root node of the tree data structure during a BFS when a destructor is called. In this way, the previous recursive release for tree nodes is now sequential. The algorithm is carefully designed so that it won't break the edges between nodes, and in this way keeps the same behavior even when some tree nodes are held by extra codes.
  • We tested the modified code with the given example in [Issue-1433], and the issue is fixed with the patch.
  • Does it solve all similar issues? We only fixed the three tree data structures we used in our use case(which is a bit more complex version of the bug reproduction code in the issue) in this patch. Considering there are tons of similar tree data structures in Gtsam, it is likely that a similar issue will happen again when other people use other tree data structures, and these issues should be able to be fixed in a similar way. However, fixing all data structures is a large work and we won't do it in this patch.
  • To fully get rid of the stack overflow issue, one suggestion is to adopt some garbage collection library and have a global replacement for shared_ptr/make_shared. Of course, it is also necessary to make sure the garbage collection library does not use any recursive implementation. In this way, all similar issues will be solved without a large modification of the existing code, Considering that the garbage collection solution may cause some performance loss, a compiling flag can be used to control whether to use a smart pointer or garbage collection and let the user make the decision based on their use case.

@dellaert
Copy link
Member

dellaert commented Feb 5, 2023

Thanks @wanmeihuali !!! @ProfFan let's wait to run CI here as I'm trying to debug some other issues now.

@ProfFan
Copy link
Collaborator

ProfFan commented Feb 6, 2023

CI is failing, @wanmeihuali could you rebase to develop and see if it still fails?

@wanmeihuali
Copy link
Contributor Author

Yep, seems like a network or dep version issue. Just merged the latest develop. XD

@dellaert
Copy link
Member

dellaert commented Feb 6, 2023

Very interesting! Note, there are not tons of more tree data structures in gtsam, these are really the most important ones :-). I must say, though, I don’t fully understand the mechanism yet. Pushing the nodes into the BFS queue is one thing, but I’m assuming there is also an assumption that something will happen when the queue is deleted? That’s not really explained.

I also don’t understand why you do a clear on nodes_ first. Will that not be a problem if reference counts go to zero? You probably have very good answer is for all these questions, it’s just a matter of getting them into the code comments for future reference :-)

@wanmeihuali
Copy link
Contributor Author

wanmeihuali commented Feb 7, 2023

Very interesting! Note, there are not tons of more tree data structures in gtsam, these are really the most important ones :-). I must say, though, I don’t fully understand the mechanism yet. Pushing the nodes into the BFS queue is one thing, but I’m assuming there is also an assumption that something will happen when the queue is deleted? That’s not really explained.

I also don’t understand why you do a clear on nodes_ first. Will that not be a problem if reference counts go to zero? You probably have very good answer is for all these questions, it’s just a matter of getting them into the code comments for future reference :-)

Yep it's quite tricky, I write some mermaid to show how a tree is released by this algorithm. The releasing is actually happening when a node is popped from the queue and then leaves the scope. For BFS, the queue is always empty after all. @dellaert
Initial state:

graph TD
    T[Tree Data Structure]
    T-->A((A))
    A-->B((B))
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z[Context]
    Z--hold by other data structure-->D
Loading

Expect behavior:

  • A, B, and E are released without recursive.
  • D and G are not released, and G is still the child of D.
graph TD
    Q[BFS Queue]
    T[Tree Data Structure]
    T-->A((A))
    A-->B((B))
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z[Context]
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    T[Tree Data Structure]
    Q==push into queue==>A((A))
    T-.set to nullptr.->A
    A-->B((B))
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z[Context]
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    T[Tree Data Structure]
    Q-->A((A))
    A-->B((B))
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z[Context]
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    Q-.popping front.->A((A))
    Z==getting front==>A
    A-->B((B))
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    Z-->A((A))
    A-->B((B))
    Q==BFS: adding children to queue==>B
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    Z-.leaving scope-.->A((A, ref=0))
    A-->B((B))
    Q-->B
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    A((A))-.A get released.->B((B))
    style A stroke:#333,stroke-width:4px
    Q--holds B so B will not be released recursively-->B
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    Q-.popping front.->B((B))
    Z==getting front==>B
    B-->D((D))
    B-->E((E))
    D-->G((G))
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    Z-->B
    B((B))-->D((D))
    B-->E((E))
    D-->G((G))
    Q==BFS: adding children to queue==>D
    Q==BFS: adding children to queue==>E
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    Z-.leaving scope.->B
    B((B, ref=0))-->D((D))
    B-->E((E))
    D-->G((G))
    Q==BFS: adding children to queue==>D
    Q==BFS: adding children to queue==>E
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    B((B))-.B released.->D((D))
    style B stroke:#333,stroke-width:4px
    B-.B released.->E((E))
    D-->G((G))
    Q-->D
    Q-->E
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D))-->G((G))
    Q-.pop front.->D
    Q-->E((E))
    Z--hold by other data structure-->D
    Z==getting front==>D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D))-->G((G))
    Q-->E((E))
    Q==BFS: adding children to queue==>G
    Z--hold by other data structure-->D
    Z-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D, ref=1))-->G((G))
    Q-->E((E))
    Q-->G
    Z--hold by other data structure-->D
    Z-.leaving scope.->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D))-->G((G))
    Q-.pop front.->E((E))
    Z==getting front==>E
    Q-->G
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D))-->G((G))
    Z-.leaving scope.->E((E, ref=0))
    style E stroke:#333,stroke-width:4px
    Q-->G
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D))-->G((G))
    Q-.pop front.->G
    Z==getting front==>G
    Z--hold by other data structure-->D
Loading
graph TD
    Q[BFS Queue]
    Z[Context]
    D((D))-->G((G, ref=1))
    Z-.leaving scope.->G
    Z--hold by other data structure-->D
Loading

@dellaert
Copy link
Member

dellaert commented Feb 7, 2023

Oh, that helps a lot! I will approve the PR, but I will also leave some comments as to where you could add some of that explanation in the code.

Copy link
Member

@dellaert dellaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing!! I do feel strongly that things are not cluttering up the header if possible, so will “request changes” for now and re-review on push…

gtsam/inference/BayesTree.h Outdated Show resolved Hide resolved
gtsam/inference/BayesTree.h Outdated Show resolved Hide resolved
gtsam/inference/BayesTree.h Outdated Show resolved Hide resolved
gtsam/inference/BayesTree.h Outdated Show resolved Hide resolved
gtsam/inference/ClusterTree.h Outdated Show resolved Hide resolved
gtsam/inference/EliminationTree.h Outdated Show resolved Hide resolved
Copy link
Member

@dellaert dellaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool, a great addition to enable GTSAM to do large scale inference!

Copy link
Collaborator

@ProfFan ProfFan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wanmeihuali :)

@dellaert dellaert merged commit b71ed8f into borglab:develop Feb 9, 2023
dellaert added a commit that referenced this pull request Feb 12, 2023
Cherry-pick pull request #1441 from wanmeihuali/hotfix/stack_overflow
@ShuangLiu1992
Copy link
Contributor

ShuangLiu1992 commented May 5, 2023

current develop branch seems to have missed this commit for gtsam/inference/BayesTreeCliqueBase.h

wanmeihuali@1d06f1e

it was changed back in
Commit e6f2cd4
by Kuangyuan.Sun

does the the commit message "ent" mean it's not needed?

@wanmeihuali
Copy link
Contributor Author

current develop branch seems to have missed this commit for gtsam/inference/BayesTreeCliqueBase.h

wanmeihuali@1d06f1e

@ShuangLiu1992 Yep, As I mentioned I only fixed the issue in my use case... Because Clique Tree is much smaller than the original tree, it did not trigger stack overflow for my case.
You may ask @ProfFan and @dellaert for another patch.

@wanmeihuali
Copy link
Contributor Author

@ShuangLiu1992 I just got some time to check it, gtsam/inference/BayesTreeCliqueBase.h is the base class for the node in BayesTree, so the patch should already fix the issue if you are using BayesTree.h... You can check the image I attached for how it works. If you still got a stackoverflow, maybe you are using another tree data structure with BayesTreeCliqueBase?

@ShuangLiu1992
Copy link
Contributor

@wanmeihuali Thanks for taking the time to check. I will double check. On some platforms we are constrained with a stack size of 1mb so the stackover flow might have came from somewhere else in the library. Debug builds are too big to fit on the target machine and Release builds crashes without giving any useful information.....

@ProfFan
Copy link
Collaborator

ProfFan commented May 24, 2023

1MB of stack is a bit too small for current GTSAM but you can try changing some stack variables used to unique_ptrs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants