MXNet segfault error when calling scheduler.run() 2+ times in same Python session #61
Name: mxnet |
As I previously mentioned, I also get MXNet segfault when I run the image-classification tutorial on Mac:
produces the error below:
Segmentation fault: 11
Stack trace:
Segmentation fault: 11
std::__1::iterator_traitsmxnet::NDArray**::reference>::value), void>::type std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >::assignmxnet::NDArray**(mxnet::NDArray**, mxnet::NDArray**) + 30295
Segmentation fault: 11
Stack trace:
Segmentation fault: 11
Stack trace:
std::__1::iterator_traitsmxnet::NDArray**::reference>::value), void>::type std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >::assignmxnet::NDArray**(mxnet::NDArray**, mxnet::NDArray**) + 30295
Segmentation fault: 11
Segmentation fault: 11
Stack trace:
Segmentation fault: 11
Stack trace:
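A trace like this one lists only native frames, so it does not say which Python call was running when the crash happened. One way to get that information is Python's standard `faulthandler` module. The following is a minimal sketch, not something used in this thread: the helper name `enable_crash_traceback` and the log file name are made up for illustration, and because MXNet installs its own segfault handler, the extra output may or may not appear depending on which handler runs first.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Ask Python to dump the traceback of every thread to log_path on a fatal signal.

    Hypothetical helper for illustration. The returned file object must stay
    open (keep a reference to it) for as long as the handler is enabled.
    """
    log_file = open(log_path, "w")
    # Covers SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback("crash_traceback.log")
    # ... run the code that crashes here, e.g. the tutorial script ...
```

The same handler can be switched on without code changes by starting the interpreter with `python -X faulthandler` or by setting the `PYTHONFAULTHANDLER` environment variable.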
After discussion with @zhreshold, this is an error in the MXNet release on macOS. Fully resolving it will need help from the MXNet maintainers.
Verified: as Hang said, this can be fixed by building MXNet from source (https://mxnet.apache.org/get_started/osx_setup). Hopefully someone can get this fixed soon.
@zhanghang1989 : Below is the MXNet error I get when I call scheduler.run() two times in the same Python session. Here are the steps to reproduce:
Check out the tabular branch: https://github.com/awslabs/autogluon/tree/tabular
git checkout tabular
Install the tabular module by following the steps in tabular/README: https://github.com/awslabs/autogluon/blob/tabular/tabular/README.md
Verify your installation worked by running the simple example in:
https://github.com/awslabs/autogluon/blob/tabular/autogluon/task/predict_table_column/examples/example_tabular_predictions.py
Note that this example does not do any HPO and does not use ag.schedulers at all.
You can run this example many times in a row inside the same Python session without any segfault issue.
Then run the more advanced example:
https://github.com/awslabs/autogluon/blob/tabular/autogluon/task/predict_table_column/examples/example_advanced_tabular.py
This example should also work (it may produce tons of warnings, but should not produce any MXNet segfault). This example demonstrates doing HPO during task.fit() by leveraging the ag.scheduler and internally calls scheduler.run() one time. The key line of code that does this is: https://github.com/awslabs/autogluon/blob/tabular/autogluon/task/predict_table_column/examples/example_advanced_tabular.py#L30
predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, hyperparameter_tune=True, num_trials=10, time_limits=10*60, nn_options=nn_options)
Calling this line twice in the same Python session triggers the segfault:
`predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, hyperparameter_tune=True,
                     num_trials=10, time_limits=10*60, nn_options=nn_options)
predictor = task.fit(train_data=train_data, label=label_column, output_directory=savedir, hyperparameter_tune=True,
                     num_trials=10, time_limits=10*60, nn_options=nn_options)
`
0%| | 0/10 [00:00<?, ?it/s]
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 ??? 0x0000000000000000 0x0 + 0
[bt] (3) 4 libBLAS.dylib 0x00007fff4fac5d44 APL_sgemm + 806
[bt] (4) 5 libBLAS.dylib 0x00007fff4fa504c2 cblas_sgemm + 1592
[bt] (5) 6 libmxnet.so 0x000000011654e8b5 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 14421
[bt] (6) 7 libmxnet.so 0x000000011654b5f8 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 1432
[bt] (7) 8 libmxnet.so 0x000000011654b363 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 771
[bt] (8) 9 libmxnet.so 0x000000011774dca9 mxnet::imperative::PushFComputeEx(std::__1::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::Resource, std::__1::allocatormxnet::Resource > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&)::'lambda'(mxnet::RunContext)::operator()(mxnet::RunContext) const + 217
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 ??? 0x0000000000000000 0x0 + 0
[bt] (3) 4 libBLAS.dylib 0x00007fff4fac5d44 APL_sgemm + 806
[bt] (4) 5 libBLAS.dylib 0x00007fff4fa504c2 cblas_sgemm + 1592
[bt] (5) 6 libmxnet.so 0x000000011654e8b5 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 14421
[bt] (6) 7 libmxnet.so 0x000000011654b5f8 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 1432
[bt] (7) 8 libmxnet.so 0x000000011654b363 mxnet::op::FullyConnectedComputeExCPU(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 771
[bt] (8) 9 libmxnet.so 0x000000011774dca9 mxnet::imperative::PushFComputeEx(std::__1::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::engine::Var*, std::__1::allocatormxnet::engine::Var* > const&, std::__1::vector<mxnet::Resource, std::__1::allocatormxnet::Resource > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&)::'lambda'(mxnet::RunContext)::operator()(mxnet::RunContext) const + 217
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 ??? 0x000000010e3eea00 0x0 + 4533971456
[bt] (3) 4 libmxnet.so 0x00000001161c4ad3 std::__1::map<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, mxnet::NDArrayFunctionReg*, std::__1::less<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const, mxnet::NDArrayFunctionReg*> > >::__find_equal_key(std::__1::__tree_node_base<void*>&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&) + 867
[bt] (4) 5 libmxnet.so 0x0000000117479f3a void mxnet::op::FillComputeZerosExmshadow::cpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 666
[bt] (5) 6 libmxnet.so 0x0000000117685b62 SetNDInputsOutputs(nnvm::Op const, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 3330
[bt] (6) 7 libmxnet.so 0x00000001176853c8 SetNDInputsOutputs(nnvm::Op const*, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 1384
[bt] (7) 8 libmxnet.so 0x00000001176861d0 MXImperativeInvokeEx + 176
[bt] (8) 9 _ctypes.cpython-37m-darwin.so 0x000000010f609367 ffi_call_unix64 + 79
Segmentation fault: 11
Stack trace:
[bt] (0) 1 libmxnet.so 0x0000000117e062b0 mxnet::Storage::Get() + 4880
[bt] (1) 2 libsystem_platform.dylib 0x00007fff7f0f3b5d _sigtramp + 29
[bt] (2) 3 Python 0x000000010e1a6fdd member_set + 52
[bt] (3) 4 libmxnet.so 0x00000001161c4ad3 std::__1::map<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator >, mxnet::NDArrayFunctionReg*, std::__1::less<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > >, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const, mxnet::NDArrayFunctionReg*> > >::__find_equal_key(std::__1::__tree_node_base<void*>&, std::__1::basic_string<char, std::__1::char_traits, std::__1::allocator > const&) + 867
[bt] (4) 5 libmxnet.so 0x0000000117479f3a void mxnet::op::FillComputeZerosExmshadow::cpu(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocatormxnet::OpReqType > const&, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray > const&) + 666
[bt] (5) 6 libmxnet.so 0x0000000117685b62 SetNDInputsOutputs(nnvm::Op const, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 3330
[bt] (6) 7 libmxnet.so 0x00000001176853c8 SetNDInputsOutputs(nnvm::Op const*, std::__1::vector<mxnet::NDArray*, std::__1::allocatormxnet::NDArray* >, std::__1::vector<mxnet::NDArray, std::__1::allocatormxnet::NDArray* >, int, void const*, int*, int, int, void***) + 1384
[bt] (7) 8 libmxnet.so 0x00000001176861d0 MXImperativeInvokeEx + 176
[bt] (8) 9 _ctypes.cpython-37m-darwin.so 0x000000010f609367 ffi_call_unix64 + 79
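Since the crash only appears on the second call within one Python session, one way to narrow it down (and possibly to work around it) is to run each call in its own fresh interpreter. The sketch below is an assumption-laden illustration, not a confirmed fix. `run_isolated`, `crashed_with_segfault` and `FIT_SNIPPET` are hypothetical names, and the snippet string is a placeholder for a script that calls `task.fit(..., hyperparameter_tune=True)` once. It relies on the POSIX convention that a child killed by signal N reports return code -N.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=None):
    """Run a Python snippet in a brand-new interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def crashed_with_segfault(result):
    """On POSIX, a child terminated by SIGSEGV has returncode == -SIGSEGV."""
    return result.returncode == -signal.SIGSEGV


# Placeholder workload (assumption): in practice this string would hold the
# script that performs a single task.fit(...) call with HPO enabled.
FIT_SNIPPET = "print('fit finished')"

if __name__ == "__main__":
    for attempt in range(2):
        result = run_isolated(FIT_SNIPPET)
        if crashed_with_segfault(result):
            print(f"attempt {attempt}: child process segfaulted")
        else:
            print(f"attempt {attempt}: exit code {result.returncode}, "
                  f"output {result.stdout.strip()!r}")
```

If both isolated runs succeed while two calls in one session crash, that would point at state left behind by the first `scheduler.run()` rather than at the model code itself.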