
Stack overflow for large graph #1433

Closed
wanmeihuali opened this issue Jan 31, 2023 · 0 comments


wanmeihuali commented Jan 31, 2023

Description

Because GTSAM uses shared_ptr to manage its tree structures, a stack overflow caused by recursive destruction is a common issue when releasing these data structures (see link).
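The failure mode, and the manual-release idea discussed below, can be illustrated with a minimal plain-Python sketch (Node and release_iteratively are hypothetical stand-ins, not GTSAM types):

```python
# Sketch (plain Python, not GTSAM): a deep chain of nodes. Dropping the
# head reference tears the chain down one node at a time, which in C++
# with shared_ptr means one destructor frame per node; an iterative
# release detaches links in a loop so teardown never recurses deeply.
class Node:
    def __init__(self, child=None):
        self.child = child

def release_iteratively(root):
    node = root
    while node is not None:
        nxt = node.child
        node.child = None  # detach, so each node dies with depth-1 work
        node = nxt

root = None
for _ in range(200_000):  # comparable in spirit to a 170k-state tree
    root = Node(root)
release_iteratively(root)
print("released without deep recursion")
```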

The failure usually manifests as a segmentation fault:

Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007f7487cdd085 in gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster::~Cluster (this=0x477b22a0, __in_chrg=<optimized out>) at /root/development/gtsam/gtsam/inference/ClusterTree.h:49
49          virtual ~Cluster() {}
(gdb) bt
#0  0x00007f7487cdd085 in gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster::~Cluster (this=0x477b22a0, __in_chrg=<optimized out>) at /root/development/gtsam/gtsam/inference/ClusterTree.h:49
#1  0x00007f7487cdd279 in __gnu_cxx::new_allocator<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster>::destroy<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster> (this=0x477b22a0, __p=0x477b22a0) at /usr/include/c++/10/ext/new_allocator.h:156
#2  0x00007f7487cdd23d in std::allocator_traits<std::allocator<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster> >::destroy<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster> (__a=..., __p=0x477b22a0) at /usr/include/c++/10/bits/alloc_traits.h:531
#3  0x00007f7487cdcf57 in std::_Sp_counted_ptr_inplace<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster, std::allocator<gtsam::ClusterTree<gtsam::GaussianFactorGraph>::Cluster>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x477b2290) at /usr/include/c++/10/bits/shared_ptr_base.h:560

A simple workaround is to increase the stack size:

ulimit -s unlimited

However, this is not available in environments such as Kubernetes.
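For completeness, the stack limits can also be inspected, and the soft limit sometimes raised up to the hard limit, from inside a Python process via the standard resource module. This is only a sketch, and it is subject to the same container restrictions:

```python
import resource

# Inspect the current stack limits (values are in bytes, or
# resource.RLIM_INFINITY for "unlimited").
soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft:", soft, "hard:", hard)

# An unprivileged process may raise its soft limit up to the hard limit,
# but in a restricted container the hard limit itself may be capped.
try:
    resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
except (ValueError, OSError) as exc:
    print("could not raise stack limit:", exc)
```

Note that the main thread's stack is sized at process start, so raising the limit in-process mainly affects threads and child processes created afterwards.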

Steps to reproduce

  1. Install gtsam (current version 4.1.1) via pip:
pip3 install gtsam
  2. Run the following code in Python:
"""
A simple example to replicate the segmentation fault caused by a large
number of state variables.
"""

import gtsam
import numpy as np
from gtsam import (BetweenFactorPose3, NonlinearFactorGraph, Point3, Pose3,
                   PriorFactorPose3, Rot3, Values, noiseModel)


def X(idx: int) -> int:
    return gtsam.symbol('X', idx)


graph = NonlinearFactorGraph()
initial_estimates = Values()

# For a 4 h bag there will be around 4*60*60*20 poses.
# Based on experiments, the code runs with 170,000 states and fails with 180,000 states.
num_state = 170000
# Create the poses list
poses = [Pose3(Rot3().Rx(np.radians(i+np.random.normal(0, 1))),
               Point3(i, 0, 0)) for i in range(num_state)]
# This is to generate a simple pose graph
for i in range(num_state):
    initial_estimates.insert(X(i), poses[i])
    # Add a prior on every 10th pose
    if i % 10 == 0:
        prior_factor = PriorFactorPose3(X(i), poses[i], noiseModel.Diagonal.Sigmas(
            np.array([0.01, 0.01, 0.01, 0.01, 0.01,  0.01])))
        graph.add(prior_factor)
    # Add an odometry factor for each consecutive pose pair
    if i == 0:
        continue
    from_pose, to_pose = poses[i-1], poses[i]
    T_to_from = Pose3(Rot3().Rx(np.radians(1)), Point3(1, 0, 0))
    odom_cov = noiseModel.Diagonal.Sigmas(np.array([1, 1, 1, 1, 0.01,  0.01]))
    odom_factor = BetweenFactorPose3(X(i-1), X(i), T_to_from, odom_cov)
    graph.add(odom_factor)
params = gtsam.LevenbergMarquardtParams()
params.setVerbosityLM("SUMMARY")
params.setAbsoluteErrorTol(1e-20)
optimizer = gtsam.LevenbergMarquardtOptimizer(graph, initial_estimates, params)
result = optimizer.optimize()
print("Optimization Ends")
  3. A segmentation fault occurs during the optimization.
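One application-level workaround (a sketch, not part of GTSAM) is to run the optimization in a worker thread with an enlarged stack, since worker-thread stacks are sized by threading.stack_size() rather than by the main-thread ulimit:

```python
import threading

def run_with_big_stack(fn, stack_bytes=64 * 1024 * 1024):
    """Run fn in a worker thread whose stack is stack_bytes large."""
    result = {}
    # stack_size() applies to threads created after this call.
    threading.stack_size(stack_bytes)
    worker = threading.Thread(target=lambda: result.update(value=fn()))
    worker.start()
    worker.join()
    return result["value"]

# Usage sketch: wrap the optimize call from the script above, e.g.
# result = run_with_big_stack(optimizer.optimize)
print(run_with_big_stack(lambda: sum(range(10))))  # prints 45
```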

Expected behavior

The segmentation fault should not happen; the optimization should run to completion.

Environment

The issue was found in the following environment:

Ubuntu 18.04, Python, installed via pip, inside a Kubernetes-managed Docker container.

However, we believe it exists on all platforms.

Possible Solution

It is possible to release the tree data structure manually, instead of relying on the default destructor, so that resources are not released recursively.
I made the following PR which fixes the issue in our use case.

However, this PR only fixes the data structure we use. There are likely other tree data structures in GTSAM that can cause a similar issue.

Also, the manual release solution may change behavior if the user keeps a node alive in order to retain a subtree after the original data structure is destroyed, e.g.:

node = nullptr
{
    let t be the tree data structure
    node = t.root
}  // leaving scope destructs t
node_children = node.children  // original implementation: children still exist; with the fix: children are cleared

We are not sure if such differences may cause crashes in other use cases.
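The behavioral difference can be demonstrated with a hypothetical Python sketch (Node, Tree, and manual_release are illustrative stand-ins, not GTSAM classes): if destruction clears each node's children iteratively, a node the user kept alive loses its subtree.

```python
class Node:
    def __init__(self):
        self.children = []

class Tree:
    def __init__(self, root):
        self.root = root

    def manual_release(self):
        # Iterative teardown: detach children level by level instead of
        # letting destructors recurse.
        stack = [self.root]
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            node.children = []  # the user-visible difference

root = Node()
root.children.append(Node())
kept = root                 # user keeps a reference to a node
tree = Tree(root)
tree.manual_release()       # simulating destruction with the fix applied
print(len(kept.children))   # prints 0: the kept node's subtree is gone
```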

Additional Information

The default stack size on Linux is typically 8 MB. You may not be able to reproduce this issue if you have already increased the stack size in your environment.

We've discussed this issue with @yetongumich.
