
Hybridization SequenceLast Bug #13968

Closed
stephenrawls opened this issue Jan 23, 2019 · 7 comments

Comments

@stephenrawls
Contributor

Description

I am experiencing a crash related to hybridization and the SequenceLast operator. It crashes consistently for me, but others on the team haven't been able to reproduce it.

Environment info (Required)

----------Python Info----------
Version      : 3.7.2
Compiler     : Clang 9.0.0 (clang-900.0.39.2)
Build        : ('default', 'Dec 27 2018 07:35:45')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.1
Directory    : /usr/local/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /usr/local/lib/python3.7/site-packages/mxnet
Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : dca90487206c.ant.amazon.com
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Sun Oct 28 22:30:19 PDT 2018; root:xnu-3789.73.27~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0341 sec, LOAD: 0.6150 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0352 sec, LOAD: 0.2739 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0428 sec, LOAD: 0.2802 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0420 sec, LOAD: 0.3961 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0305 sec, LOAD: 0.8088 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0337 sec, LOAD: 0.1836 sec.

I'm using the Python package.

Unfortunately I don't have a minimal reproducible example yet. Every time I try, the bug does not occur in my condensed example.

The problem I have is that I am seeing the following error message:

 File ".../lib/python3.4/site-packages/mxnet/base.py", line 251, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [23:04:29] .../src/src/imperative/imperative_utils.cc:58: Check failed: !ndinputs.back()->is_none() node_264 0
Stack trace returned 10 entries:
[bt] (0) .../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x7ff375cd4e0d]
[bt] (1) .../lib/libmxnet.so(mxnet::imperative::RunGraph(bool, nnvm::IndexedGraph const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> >, unsigned long, unsigned long, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> >&&, std::vector<unsigned int, std::allocator<unsigned int> >&&, std::vector<mxnet::OpStatePtr, std::allocator<mxnet::OpStatePtr> >*, std::vector<mxnet::DispatchMode, std::allocator<mxnet::DispatchMode> > const&, bool)+0x5a4) [0x7ff375e4a924]
[bt] (2) .../lib/libmxnet.so(mxnet::Imperative::Backward(std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, bool, bool, bool)+0x22fc) [0x7ff375e3969c]
[bt] (3) .../lib/libmxnet.so(MXAutogradBackwardEx+0x2e0) [0x7ff375cfe640]

When I print out the symbolic debug string for my network and go to "node_264", it leads me to the following section of my model code:

    last_real_tag = mx.nd.SequenceLast(
        tags_t, sequence_length=mx.nd.cast(valid_lengths, dtype=tags_t.dtype),
        use_sequence_length=True, axis=0)

The things that I think are important here are:
(1) The tags_t array is an integer sequence that we generate from ground-truth data, so this SequenceLast call shouldn't be trying to flow any gradients back through it. (Note that the stack trace is in the backward pass.)
(2) The model we run is a HybridBlock, but this particular code is part of our loss function, which is not a HybridBlock. It takes the output from the model, combines it with the ground-truth values from this SequenceLast operation, and calculates our loss.
(3) Everything worked fine when our model was a regular Gluon Block. It also works fine as a HybridBlock as long as we don't hybridize() it. But when I do hybridize() the model, I consistently see this crash (although other team members have yet to run into it). A rough sketch of this setup is shown below.
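
This is not a reproduction of the crash (I haven't been able to condense one); the names Model, loss_fn, tokens, scores and the layer sizes are purely illustrative. The shape of the code is: a hybridized HybridBlock produces per-step scores, while a plain Python (non-hybrid) loss function combines them with SequenceLast over the ground-truth tags:

    import mxnet as mx
    from mxnet import autograd, gluon

    class Model(gluon.HybridBlock):
        # Illustrative model: embed integer tokens and project to per-tag scores.
        def __init__(self, vocab_size=10, hidden=8, **kwargs):
            super(Model, self).__init__(**kwargs)
            with self.name_scope():
                self.embed = gluon.nn.Embedding(vocab_size, hidden)
                self.proj = gluon.nn.Dense(vocab_size, flatten=False)

        def hybrid_forward(self, F, tokens):
            # tokens: (seq_len, batch) -> scores: (seq_len, batch, vocab_size)
            return self.proj(self.embed(tokens))

    def loss_fn(scores, tags_t, valid_lengths):
        # Plain (non-hybrid) loss: pick the ground-truth tag and the model score
        # at each sequence's last valid step, then compute a cross-entropy loss.
        last_real_tag = mx.nd.SequenceLast(
            tags_t, sequence_length=mx.nd.cast(valid_lengths, dtype=tags_t.dtype),
            use_sequence_length=True, axis=0)
        # last_real_tag = last_real_tag.detach()   # <- the workaround described below
        last_scores = mx.nd.SequenceLast(
            scores, sequence_length=valid_lengths, use_sequence_length=True, axis=0)
        return gluon.loss.SoftmaxCrossEntropyLoss()(last_scores, last_real_tag)

    model = Model()
    model.initialize()
    model.hybridize()   # the crash only shows up for us after hybridize()

    tokens = mx.nd.floor(mx.nd.random.uniform(0, 10, shape=(5, 2)))   # (seq_len, batch)
    tags_t = mx.nd.floor(mx.nd.random.uniform(0, 10, shape=(5, 2)))   # ground-truth tags
    valid_lengths = mx.nd.array([5, 3])

    with autograd.record():
        loss = loss_fn(model(tokens), tags_t, valid_lengths)
    loss.backward()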

We currently have a workaround of adding the following statement directly after the SequenceLast operation:

last_real_tag = last_real_tag.detach()

When this is in place I never encounter the crash I highlighted above.
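
In context (same hypothetical names as the snippet above), the workaround looks like this, detaching the SequenceLast output so autograd never tries to flow a gradient back through it:

    last_real_tag = mx.nd.SequenceLast(
        tags_t, sequence_length=mx.nd.cast(valid_lengths, dtype=tags_t.dtype),
        use_sequence_length=True, axis=0)
    last_real_tag = last_real_tag.detach()   # drop this output from the autograd graph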

Additionally, I have another, hackier workaround: instead of SequenceLast, I use a combination of take and diag to perform the same function. When I do that I also never see the crash. A rough sketch of that approach follows.
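
Roughly, that alternative looks like the following (again with hypothetical shapes: tags_t is (seq_len, batch) with time on axis 0, valid_lengths is (batch,), and it assumes mx.nd.diag is available in your MXNet build):

    # Gather the row at each sequence's last valid step; take() along axis 0
    # yields a (batch, batch) matrix, and diag() keeps the matching batch entry.
    last_idx = valid_lengths - 1
    rows = mx.nd.take(tags_t, last_idx, axis=0)   # rows[i, j] = tags_t[valid_lengths[i] - 1, j]
    last_real_tag = mx.nd.diag(rows)              # shape (batch,)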

I'm happy to provide further info as needed; apologies for not being able to condense this down into an actual reproducible crash.

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels:

@frankfliu
Contributor

frankfliu commented Jan 23, 2019

@mxnet-label-bot add [Gluon, operator]

@szha
Member

szha commented Jan 27, 2019

@stephenrawls thanks for reporting the bug. If it's hard to reproduce the error, you may consider setting this environment variable: MXNET_ENGINE_TYPE=NaiveEngine. This causes all computation to be synchronous, and the stack trace should reveal where the error is raised.
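
For example, you can export it in the shell before running, or set it at the very top of the script (it has to be set before mxnet is imported to take effect):

    # Set the engine type before importing mxnet so it takes effect.
    import os
    os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

    import mxnet as mx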

@stephenrawls
Contributor Author

@szha thanks for the tip. The stack trace reveals that the error happens in the backward pass, at a line checking that a certain array is non-empty, which is corroborated by our workaround of detaching the problematic variable from the computation graph. (And since that variable shouldn't have any gradient flowing through it anyway, detaching it does not affect our calculations and is an effective workaround.)

But I'll play around with MXNET_ENGINE_TYPE and see if I can get a more reproducible error.

@piyushghai
Contributor

@stephenrawls Can you provide a minimal reproducible example so that we can help nail down this issue?

@piyushghai
Contributor

Hi @stephenrawls, were you able to get a reproducible example for this error?

@stephenrawls
Contributor Author

No, I wasn't, and our codebase has moved on, so I'm closing the issue. Thanks.
