
Stack overflow for large graph #1433

Closed
wanmeihuali opened this issue Jan 31, 2023 · 0 comments


wanmeihuali commented Jan 31, 2023

Description

Because GTSAM uses shared_ptr to manage its tree structures, a stack overflow caused by recursive destruction is a common issue when releasing these data structures (see link).
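The failure mode, and the manual-release idea discussed below, can be illustrated with a minimal plain-Python sketch (Node and release_iteratively are hypothetical stand-ins, not GTSAM types):

```python
# Sketch (plain Python, not GTSAM): a deep chain of nodes. Dropping the
# head reference tears the chain down one node at a time, which in C++
# with shared_ptr means one destructor frame per node; an iterative
# release detaches links in a loop so teardown never recurses deeply.
class Node:
    def __init__(self, child=None):
        self.child = child

def release_iteratively(root):
    node = root
    while node is not None:
        nxt = node.child
        node.child = None  # detach, so each node dies with depth-1 work
        node = nxt

root = None
for _ in range(200_000):  # comparable in spirit to a 170k-state tree
    root = Node(root)
release_iteratively(root)
print("released without deep recursion")
```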

The failure usually manifests as a segmentation fault:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007f7487cdd085 in gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster::~Cluster (this=0x477b22a0, __in_chrg=<optimized out>) at /root/development/gtsam/gtsam/inference/ClusterTree.h:49
49          virtual ~Cluster() {}
(gdb) bt
#0  0x00007f7487cdd085 in gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster::~Cluster (this=0x477b22a0, __in_chrg=<optimized out>) at /root/development/gtsam/gtsam/inference/ClusterTree.h:49
#1  0x00007f7487cdd279 in __gnu_cxx::new_allocator<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster>::destroy<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster> (this=0x477b22a0, __p=0x477b22a0) at /usr/include/c++/10/ext/new_allocator.h:156
#2  0x00007f7487cdd23d in std::allocator_traits<std::allocator<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster> >::destroy<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster> (__a=..., __p=0x477b22a0) at /usr/include/c++/10/bits/alloc_traits.h:531
#3  0x00007f7487cdcf57 in std::_Sp_counted_ptr_inplace<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster, std::allocator<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x477b2290) at /usr/include/c++/10/bits/shared_ptr_base.h:560

A simple workaround is to increase the stack size:

ulimit -s unlimited

However, this is not available in environments such as Kubernetes.
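For completeness, the stack limits can also be inspected, and the soft limit sometimes raised up to the hard limit, from inside a Python process via the standard resource module. This is only a sketch, and it is subject to the same container restrictions:

```python
import resource

# Inspect the current stack limits (values are in bytes, or
# resource.RLIM_INFINITY for "unlimited").
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft:", soft, "hard:", hard)

# An unprivileged process may raise its soft limit up to the hard limit,
# but in a restricted container the hard limit itself may be capped.
try:
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, OSError) as exc:
    print("could not raise stack limit:", exc)
```

Note that the main thread's stack is sized at process start, so raising the limit in-process mainly affects threads and child processes created afterwards.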

Steps to reproduce

  1. Install gtsam (current version 4.1.1) via pip:
pip3 install gtsam
  2. Run the following code in Python:
"""
A simple example to replicate the segmentation fault caused by a large
number of state variables.
"""

import gtsam
import numpy as np
from gtsam import (BetweenFactorPose3, NonlinearFactorGraph, Point3, Pose3,
                   PriorFactorPose3, Rot3, Values, noiseModel)


def X(idx: int) -> int:
    return gtsam.symbol('X', idx)


graph = NonlinearFactorGraph()
initial_estimates = Values()

# For a 4 h bag there will be around 4*60*60*20 poses.
# Based on experiments, the code runs with 170,000 states and fails with 180,000 states.
num_state = 170000
# Create the poses list
poses = [Pose3(Rot3().Rx(np.radians(i+np.random.normal(0, 1))),
               Point3(i, 0, 0)) for i in range(num_state)]
# This is to generate a simple pose graph
for i in range(num_state):
    initial_estimates.insert(X(i), poses[i])
    # Add a prior on every 10th pose
    if i % 10 == 0:
        prior_factor = PriorFactorPose3(X(i), poses[i], noiseModel.Diagonal.Sigmas(
            np.array([0.01, 0.01, 0.01, 0.01, 0.01,  0.01])))
        graph.add(prior_factor)
    # Add an odometry factor for each consecutive pose pair
    if i == 0:
        continue
    from_pose, to_pose = poses[i-1], poses[i]
    T_to_from = Pose3(Rot3().Rx(np.radians(1)), Point3(1, 0, 0))
    odom_cov = noiseModel.Diagonal.Sigmas(np.array([1, 1, 1, 1, 0.01,  0.01]))
    odom_factor = BetweenFactorPose3(X(i-1), X(i), T_to_from, odom_cov)
    graph.add(odom_factor)
params = gtsam.LevenbergMarquardtParams()
params.setVerbosityLM("SUMMARY")
params.setAbsoluteErrorTol(1e-20)
optimizer = gtsam.LevenbergMarquardtOptimizer(graph, initial_estimates, params)
result = optimizer.optimize()
print("Optimization Ends")
  3. A segmentation fault occurs during the optimization.
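One application-level workaround (a sketch, not part of GTSAM) is to run the optimization in a worker thread with an enlarged stack, since worker-thread stacks are sized by threading.stack_size() rather than by the main-thread ulimit:

```python
import threading

def run_with_big_stack(fn, stack_bytes=64 * 1024 * 1024):
    """Run fn in a worker thread whose stack is stack_bytes large."""
    result = {}
    # stack_size() applies to threads created after this call.
    threading.stack_size(stack_bytes)
    worker = threading.Thread(target=lambda: result.update(value=fn()))
    worker.start()
    worker.join()
    return result["value"]

# Usage sketch: wrap the optimize call from the script above, e.g.
# result = run_with_big_stack(optimizer.optimize)
print(run_with_big_stack(lambda: sum(range(10))))  # prints 45
```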

Expected behavior

The segmentation fault should not happen; the optimization should run to completion.

Environment

The issue was found in the following environment:

Ubuntu 18.04, Python, installed via pip, inside a Kubernetes-managed Docker container.

However, we believe it exists on all platforms.

Possible Solution

It is possible to release the tree data structure manually, instead of relying on the default destructor, so that resources are not released recursively.
I made the following PR which fixes the issue in our use case.

However, this PR only fixes the data structure we use. There are likely other tree data structures in GTSAM that can cause a similar issue.

Also, the manual release solution may change behavior if the user keeps a node alive in order to retain a subtree after the original data structure is destroyed, e.g.:

node = nullptr
{
    let t be the tree data structure
    node = t.root
}  // leaving scope destructs t
node_children = node.children  // original implementation: children still exist; with the fix: children are cleared

We are not sure if such differences may cause crashes in other use cases.
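The behavioral difference can be demonstrated with a hypothetical Python sketch (Node, Tree, and manual_release are illustrative stand-ins, not GTSAM classes): if destruction clears each node's children iteratively, a node the user kept alive loses its subtree.

```python
class Node:
    def __init__(self):
        self.children = []

class Tree:
    def __init__(self, root):
        self.root = root

    def manual_release(self):
        # Iterative teardown: detach children level by level instead of
        # letting destructors recurse.
        stack = [self.root]
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            node.children = []  # the user-visible difference

root = Node()
root.children.append(Node())
kept = root                 # user keeps a reference to a node
tree = Tree(root)
tree.manual_release()       # simulating destruction with the fix applied
print(len(kept.children))   # prints 0: the kept node's subtree is gone
```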

Additional Information

The default stack size on Linux is typically 8 MB. You may not be able to reproduce this issue if you have already increased the stack size in your environment.

We've discussed this issue with @yetongumich.
