Fixing stack overflow issue for large tree #1441
…olution a better solution for ISUEE-1433
Thanks @wanmeihuali !!! @ProfFan let's wait to run CI here as I'm trying to debug some other issues now. |
CI is failing, @wanmeihuali could you rebase to develop and see if it still fails? |
Yep, seems like a network or dep version issue. Just merged the latest develop. XD |
Very interesting! Note, there are not many more tree data structures in GTSAM; these are really the most important ones :-). I must say, though, I don’t fully understand the mechanism yet. Pushing the nodes into the BFS queue is one thing, but I’m assuming there is also an assumption that something will happen when the queue is deleted? That’s not really explained. I also don’t understand why you do a clear on nodes_ first. Will that not be a problem if reference counts go to zero? You probably have very good answers for all these questions; it’s just a matter of getting them into the code comments for future reference :-) |
Yep, it's quite tricky. I wrote some Mermaid diagrams to show how a tree is released by this algorithm. The release actually happens when a node is popped from the queue and then leaves scope. With BFS, the queue is always empty at the end. @dellaert
graph TD
T[Tree Data Structure]
T-->A((A))
A-->B((B))
B-->D((D))
B-->E((E))
D-->G((G))
Z[Context]
Z--hold by other data structure-->D
Expected behavior:
graph TD
Q[BFS Queue]
T[Tree Data Structure]
T-->A((A))
A-->B((B))
B-->D((D))
B-->E((E))
D-->G((G))
Z[Context]
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
T[Tree Data Structure]
Q==push into queue==>A((A))
T-.set to nullptr.->A
A-->B((B))
B-->D((D))
B-->E((E))
D-->G((G))
Z[Context]
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
T[Tree Data Structure]
Q-->A((A))
A-->B((B))
B-->D((D))
B-->E((E))
D-->G((G))
Z[Context]
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
Q-.popping front.->A((A))
Z==getting front==>A
A-->B((B))
B-->D((D))
B-->E((E))
D-->G((G))
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
Z-->A((A))
A-->B((B))
Q==BFS: adding children to queue==>B
B-->D((D))
B-->E((E))
D-->G((G))
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
Z-.leaving scope.->A((A, ref=0))
A-->B((B))
Q-->B
B-->D((D))
B-->E((E))
D-->G((G))
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
A((A))-.A get released.->B((B))
style A stroke:#333,stroke-width:4px
Q--holds B so B will not be released recursively-->B
B-->D((D))
B-->E((E))
D-->G((G))
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
Q-.popping front.->B((B))
Z==getting front==>B
B-->D((D))
B-->E((E))
D-->G((G))
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
Z-->B
B((B))-->D((D))
B-->E((E))
D-->G((G))
Q==BFS: adding children to queue==>D
Q==BFS: adding children to queue==>E
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
Z-.leaving scope.->B
B((B, ref=0))-->D((D))
B-->E((E))
D-->G((G))
Q==BFS: adding children to queue==>D
Q==BFS: adding children to queue==>E
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
B((B))-.B released.->D((D))
style B stroke:#333,stroke-width:4px
B-.B released.->E((E))
D-->G((G))
Q-->D
Q-->E
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
D((D))-->G((G))
Q-.pop front.->D
Q-->E((E))
Z--hold by other data structure-->D
Z==getting front==>D
graph TD
Q[BFS Queue]
Z[Context]
D((D))-->G((G))
Q-->E((E))
Q==BFS: adding children to queue==>G
Z--hold by other data structure-->D
Z-->D
graph TD
Q[BFS Queue]
Z[Context]
D((D, ref=1))-->G((G))
Q-->E((E))
Q-->G
Z--hold by other data structure-->D
Z-.leaving scope.->D
graph TD
Q[BFS Queue]
Z[Context]
D((D))-->G((G))
Q-.pop front.->E((E))
Z==getting front==>E
Q-->G
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
D((D))-->G((G))
Z-.leaving scope.->E((E, ref=0))
style E stroke:#333,stroke-width:4px
Q-->G
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
D((D))-->G((G))
Q-.pop front.->G
Z==getting front==>G
Z--hold by other data structure-->D
graph TD
Q[BFS Queue]
Z[Context]
D((D))-->G((G, ref=1))
Z-.leaving scope.->G
Z--hold by other data structure-->D
|
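The release order traced in the diagrams above can be sketched in C++ roughly as follows. This is a minimal sketch under stated assumptions, not the code from this PR: `Node`, `alive`, and `releaseBreadthFirst` are hypothetical names, and the `alive` counter exists only to make the releases observable.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; not GTSAM's clique or cluster class.
struct Node {
  std::vector<std::shared_ptr<Node>> children;
  static std::size_t alive;  // live-instance counter, only here to observe releases
  Node() { ++alive; }
  ~Node() { --alive; }
};
std::size_t Node::alive = 0;

// Release every tree in `roots` breadth-first instead of recursively.
void releaseBreadthFirst(std::vector<std::shared_ptr<Node>>& roots) {
  std::queue<std::shared_ptr<Node>> bfs;
  for (auto& root : roots) {
    bfs.push(std::move(root));  // the queue takes over the tree's reference
  }
  roots.clear();

  while (!bfs.empty()) {
    // Take the front node out of the queue; `current` now holds that reference.
    std::shared_ptr<Node> current = std::move(bfs.front());
    bfs.pop();
    if (!current) continue;  // tolerate empty root slots

    // Copy the child pointers into the queue so each child gains a second owner.
    for (const auto& child : current->children) {
      bfs.push(child);
    }
    // `current` leaves scope at the end of this iteration. If it was the last
    // owner, the node is destroyed here; its children survive because the
    // queue still holds them, so the destructor does not recurse.
  }
}
```

A node that is also held elsewhere (D in the diagrams) is simply skipped over: its reference count never reaches zero, so it and its subtree stay intact until the other owner lets go.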
Oh, that helps a lot! I will approve the PR, but I will also leave some comments as to where you could add some of that explanation in the code. |
Amazing!! I do feel strongly that things should not clutter up the header if possible, so will “request changes” for now and re-review on push…
Very cool, a great addition to enable GTSAM to do large scale inference!
Thanks @wanmeihuali :)
Cherry-pick pull request #1441 from wanmeihuali/hotfix/stack_overflow
The current develop branch seems to have missed this commit for gtsam/inference/BayesTreeCliqueBase.h. It was changed back; does the commit message "ent" mean it's not needed? |
@ShuangLiu1992 Yep, as I mentioned, I only fixed the issue for my use case... Because the clique tree is much smaller than the original tree, it did not trigger a stack overflow in my case. |
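The link between tree size and stack usage can be illustrated with a small experiment. The sketch below is an illustration under stated assumptions, not GTSAM code: `ChainNode`, `makeChain`, and `destructorNestingFor` are hypothetical names, and the destructor clears `children` explicitly only so that the nesting can be counted (the implicit member destruction nests in the same way).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical node type for illustration; not a GTSAM class.
struct ChainNode {
  std::vector<std::shared_ptr<ChainNode>> children;
  static std::size_t depth;     // destructors currently on the call stack
  static std::size_t maxDepth;  // deepest nesting observed so far
  ~ChainNode() {
    ++depth;
    maxDepth = std::max(maxDepth, depth);
    children.clear();  // destroys the children while this destructor is still running
    --depth;
  }
};
std::size_t ChainNode::depth = 0;
std::size_t ChainNode::maxDepth = 0;

// Build a single-child chain with `n` nodes (n >= 1) and return its root.
std::shared_ptr<ChainNode> makeChain(std::size_t n) {
  auto root = std::make_shared<ChainNode>();
  ChainNode* tail = root.get();
  for (std::size_t i = 1; i < n; ++i) {
    tail->children.push_back(std::make_shared<ChainNode>());
    tail = tail->children.back().get();
  }
  return root;
}

// Destroy a chain of `n` nodes the default (recursive) way and
// return how deeply the destructor calls were nested.
std::size_t destructorNestingFor(std::size_t n) {
  ChainNode::maxDepth = 0;
  {
    auto root = makeChain(n);
  }  // root is released here, which triggers the recursive teardown
  return ChainNode::maxDepth;
}
```

For a chain of `n` nodes the nesting equals `n`, so stack usage during default destruction grows with tree depth. That is consistent with a shallow tree being released safely while a much deeper one overflows a small stack.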
@ShuangLiu1992 I just got some time to check it. gtsam/inference/BayesTreeCliqueBase.h is the base class for the nodes in BayesTree, so the patch should already fix the issue if you are using BayesTree.h... You can check the image I attached to see how it works. If you still get a stack overflow, maybe you are using another tree data structure with BayesTreeCliqueBase? |
@wanmeihuali Thanks for taking the time to check. I will double-check. On some platforms we are constrained to a stack size of 1 MB, so the stack overflow might have come from somewhere else in the library. Debug builds are too big to fit on the target machine, and Release builds crash without giving any useful information... |
1MB of stack is a bit too small for current GTSAM but you can try changing some stack variables used to |
This is a patch for Issue-1433.