Name conflict when serializing LSTMCell #12783

Closed
lostella opened this issue Oct 10, 2018 · 13 comments

Comments

@lostella
Contributor

commented Oct 10, 2018

Description

A name conflict occurs when serializing a custom HybridBlock that contains an LSTMCell, either directly or inside a HybridSequentialRNNCell. As a result, deserialization with mx.gluon.SymbolBlock.imports fails.

Environment info (Required)

----------Python Info----------
Version      : 3.6.6
Compiler     : GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)
Build        : ('default', 'Aug 31 2018 16:33:25')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 18.0
Directory    : /Users/[...]/.virtualenvs/[...]/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.0
Directory    : /Users/[...]/.virtualenvs/[...]/lib/python3.6/site-packages/mxnet
Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : 8c85902e415b.ant.amazon.com
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0040 sec, LOAD: 1.1197 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0444 sec, LOAD: 0.9828 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0513 sec, LOAD: 0.8610 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0674 sec, LOAD: 1.2629 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0046 sec, LOAD: 1.4241 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0432 sec, LOAD: 0.2483 sec.

Package used (Python/R/Scala/Julia): I'm using Python

Error Message:

Traceback (most recent call last):
  File "2018-10-10-serialization-issue.py", line 32, in <module>
    ctx=mx.Context.default_ctx
  File "/Users/[...]/.virtualenvs/[...]/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1023, in imports
    ret = SymbolBlock(sym, inputs)
  File "/Users/[...]/.virtualenvs/[...]/lib/python3.6/site-packages/mxnet/gluon/block.py", line 1051, in __init__
    for j in i.get_internals():
  File "/Users/[...]/.virtualenvs/[...]/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 93, in <genexpr>
    return (self[i] for i in self.list_outputs())
  File "/Users/[...]/.virtualenvs/[...]/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 517, in __getitem__
    raise ValueError('There are multiple outputs with name \"%s\"' % index)
ValueError: There are multiple outputs with name "myblock0_lstm0__plus0_output"

Minimum reproducible example

https://gist.github.com/lostella/261fd5d08dfb5e2054c4d01a7e2bc88e

import mxnet as mx

class MyBlock(mx.gluon.HybridBlock):
    def __init__(self):
        super().__init__()
        with self.name_scope():
            self.lstm = mx.gluon.rnn.HybridSequentialRNNCell()
            for layer in range(3):
                self.lstm.add(mx.gluon.rnn.LSTMCell(hidden_size=20))

    def hybrid_forward(self, F, seq):
        outputs, state = self.lstm.unroll(inputs=seq, length=10, layout="NTC", merge_outputs=True)
        return outputs

block = MyBlock()
block.initialize()
block.hybridize()

input = mx.nd.random_normal(shape=(32, 10, 5))
output = block(input)

block.export(path="./model", epoch=0)
symbol = mx.gluon.SymbolBlock.imports(
    symbol_file="./model-symbol.json",
    input_names=[f"data"],
    param_file="./model-0000.params",
    ctx=mx.Context.default_ctx
)

Steps to reproduce

  1. Copy-paste the MWE into a Python script
  2. Run the script

@piyushghai

Contributor

commented Oct 10, 2018

Thank you for filing this issue.
@mxnet-label-bot [Bug, Gluon]

@piyushghai

Contributor

commented Oct 10, 2018

@lostella I tried running the example code you provided, but I hit the following error when instantiating the block object. Could you take another look at your example code? :)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-db4f3e2308e1> in <module>()
----> 1 block = MyBlock()
      2 block.initialize()
      3 block.hybridize()

<ipython-input-2-07390cb3caa7> in __init__(self)
      1 class MyBlock(mx.gluon.HybridBlock):
      2     def __init__(self):
----> 3         super().__init__()
      4         with self.name_scope():
      5             self.lstm = mx.gluon.rnn.HybridSequentialRNNCell()

TypeError: super() takes at least 1 argument (0 given)
@piyushghai

Contributor

commented Oct 10, 2018

@mxnet-label-bot [Pending Requester Info]

@lostella

Contributor Author

commented Oct 10, 2018

It runs fine with Python 3 (see my environment details above). Attaching it as a gist as well:

https://gist.github.com/lostella/261fd5d08dfb5e2054c4d01a7e2bc88e

@piyushghai

Contributor

commented Oct 10, 2018

Aah, my bad. It seems my Jupyter Notebook was running Python 2 as the default kernel.

@piyushghai

Contributor

commented Oct 10, 2018

@lostella Here are my findings on the issue:

The symbol names are not saved properly in the generated symbol.json file.
More specifically, unrolling replicates the same LSTM layer n times, where n is the unroll sequence length, and the replicated nodes end up with the same name.
If you look closely at the name in the error, myblock0_lstm0__plus0_output, it contains a double '_', which suggests that something is missing there.
On further investigation, I found that the time-step marker (t0, t1, etc.) indicating the unroll step is missing. To verify a quick fix, I opened symbol.json and manually added the time-step markers in the places that the imports method complained about.
I also had to fix the suffix of the activation layers, whose names have the form myblock0_lstm<layer_number>_activation<time-step>, so that the time-step value is correct.
e.g. myblock0_lstm2_activation0
After fixing the symbol.json file, the imports call worked.

I will now investigate the root cause of this issue to fix it in code.

Attached are gists of my working .ipynb notebook, and the corrected symbol.json file.

https://gist.github.com/piyushghai/ad18f1290ec05d96ef5e9631474ae553
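
For anyone who wants to locate the colliding names without reading symbol.json by hand, here is a minimal sketch using only the Python standard library. It assumes the exported file has a top-level "nodes" list whose entries carry "name" and "op" fields; the helper names and the tiny example graph are hypothetical and only stand in for a real export.

```python
import json
from collections import Counter


def duplicate_node_names(graph):
    """Return the sorted node names that appear more than once in a symbol graph dict."""
    names = [node["name"] for node in graph.get("nodes", [])]
    return sorted(name for name, count in Counter(names).items() if count > 1)


def duplicate_node_names_in_file(path):
    """Same check, reading the graph from an exported symbol JSON file."""
    with open(path) as f:
        return duplicate_node_names(json.load(f))


# Hand-written graph standing in for model-symbol.json; the op values are illustrative.
example_graph = {
    "nodes": [
        {"op": "null", "name": "data", "inputs": []},
        {"op": "elemwise_add", "name": "myblock0_lstm0__plus0", "inputs": []},
        {"op": "elemwise_add", "name": "myblock0_lstm0__plus0", "inputs": []},
        {"op": "Activation", "name": "myblock0_lstm0_activation0", "inputs": []},
    ]
}

print(duplicate_node_names(example_graph))
```

On a real export, calling duplicate_node_names_in_file("./model-symbol.json") should list the names that need a per-step marker before the imports call can succeed.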

@lostella

Contributor Author

commented Oct 11, 2018

I simplified the MWE: https://gist.github.com/lostella/9a790fd89726c1741a1fcf4194a5dac6

It seems like it's ultimately an LSTMCell problem.

import mxnet as mx

class MyBlock(mx.gluon.HybridBlock):
    def __init__(self):
        super().__init__()
        with self.name_scope():
            self.lstmcell = mx.gluon.rnn.LSTMCell(hidden_size=20)

    def hybrid_forward(self, F, seq):
        outputs, state = self.lstmcell.unroll(inputs=seq, length=10, layout="NTC", merge_outputs=True)
        return outputs

block = MyBlock()
block.initialize()
block.hybridize()

input = mx.nd.random_normal(shape=(32, 10, 5))
output = block(input)

block.export(path="./model", epoch=0)
symbol = mx.gluon.SymbolBlock.imports(
    symbol_file="./model-symbol.json",
    input_names=[f"data"],
    param_file="./model-0000.params",
    ctx=mx.Context.default_ctx
)

@lostella lostella changed the title Name conflict when serializing HybridSequentialRNNCell Name conflict when serializing LSTMCell Oct 11, 2018

@szha

Member

commented Oct 11, 2018

For the first piece of code, the problem is not using the container's name_scope. Since HybridSequentialRNNCell is a container block, you need to use its name_scope if you intend to properly export it as a symbol.

class MyBlock(mx.gluon.HybridBlock):
    def __init__(self):
        super().__init__()
        with self.name_scope():
            self.lstm = mx.gluon.rnn.HybridSequentialRNNCell()
            with self.lstm.name_scope():
                for layer in range(3):
                    self.lstm.add(mx.gluon.rnn.LSTMCell(hidden_size=20))

    def hybrid_forward(self, F, seq):
        outputs, state = self.lstm.unroll(inputs=seq, length=10, layout="NTC", merge_outputs=True)
        return outputs
@lostella

Contributor Author

commented Oct 11, 2018

Unfortunately, that does not seem to solve the issue. See also the simpler example in my previous comment, which does not involve HybridSequentialRNNCell.

@szha

Member

commented Oct 11, 2018

For the LSTM-only case, the problem is that some of the elementwise operations are not given explicit names. In LSTM there are three names that repeat:

      "name": "myblock0_lstm0__plus0",
      "name": "myblock0_lstm0__mul0",
      "name": "myblock0_lstm0__mul1",

They come from the plus0 here, and the mul0 and mul1 here

The fix should be to replace these operations with F.elemwise_X, with the proper prefix just like other operators.
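
To illustrate why explicit names help, here is a toy model in plain Python. It is not MXNet code: ToyNameScope, unroll_names and duplicates are hypothetical helpers, and the sketch assumes that automatic names come from a per-hint counter that restarts on every unrolled step, while an explicitly passed name carries a per-step marker such as t0_.

```python
from collections import Counter, defaultdict


class ToyNameScope:
    """Toy stand-in for a naming scope with a block prefix (hypothetical, not MXNet's API)."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._counters = defaultdict(int)

    def get(self, name, hint):
        # An explicit name is used as given; otherwise build one from the hint and a counter.
        if name is None:
            name = "%s%d" % (hint, self._counters[hint])
            self._counters[hint] += 1
        return self.prefix + name


def unroll_names(length, explicit):
    """Collect the names of three elementwise ops created at each unrolled step."""
    names = []
    for t in range(length):
        # Assumption: a fresh scope (and so fresh counters) is used for every step.
        scope = ToyNameScope("myblock0_lstm0_")
        step = "t%d_" % t
        for hint, label in (("_mul", "mul0"), ("_mul", "mul1"), ("_plus", "plus0")):
            names.append(scope.get(step + label if explicit else None, hint))
    return names


def duplicates(names):
    """Return the sorted names that occur more than once."""
    return sorted(n for n, c in Counter(names).items() if c > 1)


auto_dupes = duplicates(unroll_names(3, explicit=False))
named_dupes = duplicates(unroll_names(3, explicit=True))
print(auto_dupes)   # the three repeating names listed above
print(named_dupes)  # empty: every step gets its own names
```

Under these assumptions the automatically named operations collide across steps, while the explicitly named ones stay unique, which is the effect the proposed fix relies on.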

@szha

Member

commented Oct 11, 2018

This problem exists in RNN and GRU as well, so all three need to be patched.

@vandanavk

Contributor

commented Oct 11, 2018

A similar error occurred in #11542

@lostella

Contributor Author

commented Oct 11, 2018

@szha yes, thanks. That's basically #12794.
