
Performance differs between 0.9.4 and 0.9.5 #5816

Closed
magic282 opened this issue Apr 13, 2017 · 17 comments

Comments

@magic282

Hi,

I found that the latest MXNet 0.9.5, which I cloned yesterday, performs very differently from 0.9.4.

The same code generates very different results even with the random seed fixed. I am doing an NMT task. With 0.9.4 (20170325_mxnet_x64_vc14_gpu.7z), the perplexity drops fast:

12:11:49 INFO:root:Epoch[0] Batch [100]	Speed: 274.49 samples/sec	Train-Perplexity=1263.752708
12:12:13 INFO:root:Epoch[0] Batch [200]	Speed: 267.21 samples/sec	Train-Perplexity=83.131303
12:12:36 INFO:root:Epoch[0] Batch [300]	Speed: 269.23 samples/sec	Train-Perplexity=55.345375
12:12:59 INFO:root:Epoch[0] Batch [400]	Speed: 278.62 samples/sec	Train-Perplexity=43.107419
12:13:21 INFO:root:Epoch[0] Batch [500]	Speed: 299.71 samples/sec	Train-Perplexity=33.375421
12:13:42 INFO:root:Epoch[0] Batch [600]	Speed: 297.82 samples/sec	Train-Perplexity=27.546479
12:14:04 INFO:root:Epoch[0] Batch [700]	Speed: 294.17 samples/sec	Train-Perplexity=23.244679
12:14:27 INFO:root:Epoch[0] Batch [800]	Speed: 282.12 samples/sec	Train-Perplexity=20.813584
12:14:49 INFO:root:Epoch[0] Batch [900]	Speed: 287.42 samples/sec	Train-Perplexity=17.785522
12:15:10 INFO:root:Epoch[0] Batch [1000]	Speed: 299.02 samples/sec	Train-Perplexity=15.125730
12:15:34 INFO:root:Epoch[0] Batch [1100]	Speed: 271.39 samples/sec	Train-Perplexity=14.950460
12:15:56 INFO:root:Epoch[0] Batch [1200]	Speed: 284.24 samples/sec	Train-Perplexity=12.570355
12:16:10 INFO:root:Epoch[0] Train-Perplexity=10.155641
12:16:10 INFO:root:Epoch[0] Time cost=286.972

After 10 epochs, the perplexity drops to less than 2.0.

However, for 0.9.5 (cloned yesterday: 6179974):

12:26:23 INFO:root:Epoch[0] Batch [100]	Speed: 286.60 samples/sec	Train-Perplexity=1085.541582
12:26:46 INFO:root:Epoch[0] Batch [200]	Speed: 281.46 samples/sec	Train-Perplexity=240.302835
12:27:08 INFO:root:Epoch[0] Batch [300]	Speed: 282.88 samples/sec	Train-Perplexity=210.892676
12:27:30 INFO:root:Epoch[0] Batch [400]	Speed: 292.07 samples/sec	Train-Perplexity=191.809733
12:27:51 INFO:root:Epoch[0] Batch [500]	Speed: 314.08 samples/sec	Train-Perplexity=170.977854
12:28:11 INFO:root:Epoch[0] Batch [600]	Speed: 312.02 samples/sec	Train-Perplexity=158.543727
12:28:32 INFO:root:Epoch[0] Batch [700]	Speed: 308.72 samples/sec	Train-Perplexity=146.653304
12:28:54 INFO:root:Epoch[0] Batch [800]	Speed: 296.55 samples/sec	Train-Perplexity=147.445385
12:29:15 INFO:root:Epoch[0] Batch [900]	Speed: 302.09 samples/sec	Train-Perplexity=139.550649
12:29:35 INFO:root:Epoch[0] Batch [1000]	Speed: 313.66 samples/sec	Train-Perplexity=129.206405
12:29:58 INFO:root:Epoch[0] Batch [1100]	Speed: 285.21 samples/sec	Train-Perplexity=123.492447
12:30:19 INFO:root:Epoch[0] Batch [1200]	Speed: 297.63 samples/sec	Train-Perplexity=111.161471
12:30:32 INFO:root:Epoch[0] Train-Perplexity=92.921033
12:30:32 INFO:root:Epoch[0] Time cost=273.647

After 10 epochs, the perplexity is still about 40.

So I infer that something went wrong between the two versions.
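
(For reference, the Train-Perplexity values compared above are, as is standard, the exponential of the mean per-token cross-entropy; a minimal NumPy sketch, not the logging code MXNet actually uses:)

import numpy as np

def perplexity(token_log_probs):
    # token_log_probs: the model's log-likelihood of each target token
    return float(np.exp(-np.mean(token_log_probs)))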

@sxjscience
Member

@magic282 What's the optimizer you are using? We've fixed a bug in RMSProp and changed it to be non-centered.
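
(For context, a minimal NumPy sketch of the two RMSProp variants, not MXNet's actual implementation: the non-centered update normalizes by a running mean of squared gradients only, while the centered one also subtracts the squared running mean of the gradients:)

import numpy as np

def rmsprop_update(w, g, state, lr=1e-3, rho=0.9, eps=1e-8, centered=False):
    # state = {"sq": np.zeros_like(w), "mean": np.zeros_like(w)}
    state["sq"] = rho * state["sq"] + (1 - rho) * g * g
    if centered:
        # centered variant: normalize by an estimate of Var[g] = E[g^2] - E[g]^2
        state["mean"] = rho * state["mean"] + (1 - rho) * g
        denom = np.sqrt(state["sq"] - state["mean"] ** 2 + eps)
    else:
        denom = np.sqrt(state["sq"] + eps)
    return w - lr * g / denom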

@magic282
Author

@sxjscience I am using Adam.

@mli
Member

mli commented Apr 13, 2017

That's interesting. Can you please do a git bisect to locate which PR makes the difference?

@magic282
Author

@mli Since it is too slow to build under Windows (it takes 30 minutes on my server), I just tested some nightly builds:

0404 good
0405 good
0406 error

0409 error
0410 error
0411 bad
0412 bad
0413 bad

where 'error' means my code crashed on that build and 'bad' means the performance problem appeared.
The error is:

Traceback (most recent call last):
  File "D:/users/a/Projects/mxnmt/nmt/main.py", line 46, in <module>
    train()
  File "D:\users\a\Projects\mxnmt\nmt\trainer.py", line 102, in train
    num_epoch=10,
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\base_module.py", line 471, in fit
    self.forward_backward(data_batch)
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\base_module.py", line 194, in forward_backward
    self.forward(data_batch, is_train=True)
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\bucketing_module.py", line 388, in forward
    self._curr_module.forward(data_batch, is_train=is_train)
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\module.py", line 538, in forward
    self._exec_group.forward(data_batch, is_train)
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\executor_group.py", line 369, in forward
    _load_data(data_batch, self.data_arrays, self.data_layouts)
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\executor_group.py", line 45, in _load_data
    _load_general(batch.data, targets, major_axis)
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\module\executor_group.py", line 36, in _load_general
    d_dst_copy = nd.crop(d_src, begin=tuple(begin), end=tuple(end))
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\_ctypes\ndarray.py", line 164, in generic_ndarray_function
    c_array(ctypes.c_char_p, [c_str(val) for val in vals])))
  File "C:\Anaconda3\lib\site-packages\mxnet-0.9.5-py3.5.egg\mxnet\base.py", line 78, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Cannot find argument 'data', Possible Arguments:
----------------
begin : Shape(tuple), required
    starting coordinates
end : Shape(tuple), required
    ending coordinates
, in operator slice(name="", data="<NDArray 64x10 @cpu(0)>", end="(64, 10)", begin="(0, 0)")

So my guess is that something merged after 0405 introduced bugs.
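
(The traceback looks like a frontend/backend mismatch: the installed egg passes the array to the slice operator as a keyword parameter named data, while the newer backend accepts only begin and end as parameters and expects the array as an NDArray input. A minimal sketch of the accepted call shape, using the dimensions from the error message and assuming the newer frontend exposes slice directly:)

import mxnet as mx

x = mx.nd.zeros((64, 10))
# the array is a positional NDArray input; only begin/end are parameters
y = mx.nd.slice(x, begin=(0, 0), end=(64, 10))
print(y.shape)  # (64, 10)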

@tangjiasheng

@magic282 Is it a language model task? If so, which dataset do you use? PTB?

@magic282
Author

@mli

 MINGW64 /d/mxnet (master)
$ git bisect start

 MINGW64 /d/mxnet (master|BISECTING)
$ git bisect bad

 MINGW64 /d/mxnet (master|BISECTING)
$ git bisect good 1f850e1afcc3933ffec27bcc9c56eb5f9596b16c
Bisecting: 38 revisions left to test after this (roughly 5 steps)
[22482aecf99ed6e1bc82803845aa53dbddb835dc] Add error for NDArray bool casting (#5745)

 MINGW64 /d/mxnet ((22482ae...)|BISECTING)
$ git bisect bad
Bisecting: 18 revisions left to test after this (roughly 4 steps)
[e3bb3d4eb1835176bbb2b2bde07e074a0fbc4dc8] fix argument parsing (#5693)

 MINGW64 /d/mxnet ((e3bb3d4...)|BISECTING)
$ git bisect bad
Bisecting: 9 revisions left to test after this (roughly 3 steps)
[316dd3e0cc2ec91134ff976934f0bdcced8e84e5] Update the cloud.md (#5670)

 MINGW64 /d/mxnet ((316dd3e...)|BISECTING)
$ git bisect bad
Bisecting: 4 revisions left to test after this (roughly 2 steps)
[e39df3ecadd451596eda09a92068b3c56db93155] simple changes to mxnet.image.ImageIter to make it more newbie friendly (#5278)

 MINGW64 /d/mxnet ((e39df3e...)|BISECTING)
$ git bisect bad
Bisecting: 1 revision left to test after this (roughly 1 step)
[b35148ec8bb0e483f97493d587b764088c92702b] Example ssd, major update  (#5643)

 MINGW64 /d/mxnet ((b35148e...)|BISECTING)
$ git bisect bad
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[138344683e65c87af20250e3f4cdcc5a72ac3cc5] modify mx.nd.reshape documentation (#5609)

 MINGW64 /d/mxnet ((1383446...)|BISECTING)
$ git bisect good
b35148ec8bb0e483f97493d587b764088c92702b is the first bad commit
commit b35148ec8bb0e483f97493d587b764088c92702b
Author: Joshua Z. Zhang <cheungchih@gmail.com>
Date:   Sun Apr 2 19:41:36 2017 -0500

    Example ssd, major update  (#5643)

    * add detection augmenter cc

    * temporal bak

    * new detection recordio iter.cc

    * record parser

    * detiter

    * barebones done for detiter

    * add padding, mirror, colorjitter, etc..

    * add normalization

    * fix label padding width bug

    * more documents

    * crop after padding

    * update contrib multibox layers for ssd

    * update example ssd

    * fix lint

    * fix numpy concat

    * modify caffe converter

    * minor fix readme

:040000 040000 79a259d6894e30208c6cf1e43090ca338431b00c 5038548d38a6748a1dc6466b707f8f395b93f995 M      example
:040000 040000 40b06a175814d216c64260e6891f5dd172275bab 4a1612e2c297295215a3a97c5de5f6409aa0632e M      src

The last three 'bad' marks mean build failure; the other 'bad' marks mean the performance problem.

@alanhuang1990

Did you use clip_gradient? @magic282

@mli
Member

mli commented Apr 19, 2017

The SSD commit b35148e only adds I/O, contrib operators, and the SSD examples; I think it should not affect the results.

Also, the last good commit 1383446 only updated documentation, so it should also be fine.

Is it possibly due to numerical instability? Say, if you repeat it multiple times, are some of the runs good and some not?
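
(One way to check this is to repeat the run under several explicit seeds and compare the curves; a minimal sketch, with the model-building step left to your own training script:)

import numpy as np
import mxnet as mx

for seed in (1, 2, 3):
    mx.random.seed(seed)   # parameter initialization and samplers
    np.random.seed(seed)   # NumPy-side data preparation
    # rebuild the module and data iterators here, train one epoch,
    # and compare the resulting Train-Perplexity across seeds
    print("starting run with seed", seed)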

@magic282
Author

@alanhuang1990 Yes, I use clip_gradient.
@mli No, every time I run it, I get the same result. I am also not sure which commit is responsible, because three of the commits caused compile errors.
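
(For reference, clip_gradient is a parameter of the optimizer itself; a minimal sketch with hypothetical values:)

import mxnet as mx

# clip_gradient clips each gradient element to [-5.0, 5.0] before the update
optimizer = mx.optimizer.Adam(learning_rate=0.001, clip_gradient=5.0)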

@alanhuang1990

@magic282 You can try the old Python implementation of the Adam optimizer. I have the same convergence problem in this version using RMSProp, but the old optimizer implementation works well.
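
(For anyone trying this, a minimal NumPy sketch of the textbook Adam update from Kingma & Ba, for comparison; not necessarily identical to either MXNet implementation:)

import numpy as np

def adam_update(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m, v: running first/second moment estimates; t: 1-based step count
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v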

@magic282
Author

@alanhuang1990 So it's the optimizer's fault?

@alanhuang1990

@magic282 Not sure. You can try it on your problem.

@mli
Member

mli commented Apr 20, 2017 via email

@magic282
Author

@mli Hi Mu,

I removed the files that caused the compile errors, but the build still reports errors:

3>  io.cc
3>c1xx : fatal error C1083: Cannot open source file: 'D:\mxnet\src\io\image_det_aug_default.cc': No such file or directory
3>c1xx : fatal error C1083: Cannot open source file: 'D:\mxnet\src\io\image_aug_default.cc': No such file or directory
3>  iter_csv.cc
3>  iter_image_det_recordio.cc
3>D:\mxnet\src\executor\graph_executor.cc(698): warning C4804: '>': unsafe use of type 'bool' in operation
3>  iter_image_recordio.cc
3>  iter_image_recordio_2.cc
3>c1xx : fatal error C1083: Cannot open source file: 'D:\mxnet\src\io\iter_image_recordio_2.cc': No such file or directory

Any hints about solving this?

@magic282
Author

Sorry, the previous error was not correct. I redid the CMake process; the following is the error message:

...
3>  c_api_graph.cc
3>  graph.cc
3>  node.cc
3>  op.cc
3>  pass.cc
3>  symbolic.cc
3>  gradient.cc
3>  infer_shape_type.cc
3>  order_mutation.cc
3>  place_device.cc
3>  plan_memory.cc
3>  saveload_json.cc
3>  c_api_error.cc
3>  c_api_symbolic.cc
3>  c_api_error.cc
3>  c_api_symbolic.cc
3>     Creating library D:/mxnet/build/Release/libmxnet.lib and object D:/mxnet/build/Release/libmxnet.exp
3>iter_image_det_recordio.obj : error LNK2019: unresolved external symbol "class std::vector<struct dmlc::ParamFieldInfo,class std::allocator<struct dmlc::ParamFieldInfo> > __cdecl mxnet::io::ListDefaultDetAugParams(void)" (?ListDefaultDetAugParams@io@mxnet@@YA?AV?$vector@UParamFieldInfo@dmlc@@V?$allocator@UParamFieldInfo@dmlc@@@std@@@std@@XZ) referenced in function "void __cdecl mxnet::io::`dynamic initializer for '__make_DataIteratorReg_ImageDetRecordIter__''(void)" (??__E__make_DataIteratorReg_ImageDetRecordIter__@io@mxnet@@YAXXZ)
3>iter_image_recordio.obj : error LNK2019: unresolved external symbol "class std::vector<struct dmlc::ParamFieldInfo,class std::allocator<struct dmlc::ParamFieldInfo> > __cdecl mxnet::io::ListDefaultAugParams(void)" (?ListDefaultAugParams@io@mxnet@@YA?AV?$vector@UParamFieldInfo@dmlc@@V?$allocator@UParamFieldInfo@dmlc@@@std@@@std@@XZ) referenced in function "void __cdecl mxnet::io::`dynamic initializer for '__make_DataIteratorReg_ImageRecordIter_v1__''(void)" (??__E__make_DataIteratorReg_ImageRecordIter_v1__@io@mxnet@@YAXXZ)
3>D:\mxnet\build\Release\libmxnet.dll : fatal error LNK1120: 2 unresolved externals
========== Build: 2 succeeded, 1 failed, 0 up-to-date, 0 skipped ==========

@magic282
Author

magic282 commented May 8, 2017

Hi, I did the git bisect on a Linux server and got the result:

a@b:~/mxnet$ git bisect log
git bisect start
# bad: [3cca692f8c79085afcf69e1c248f48d0197e248f] [doc] use debug mode to build (#6151)
git bisect bad 3cca692f8c79085afcf69e1c248f48d0197e248f
# good: [1f850e1afcc3933ffec27bcc9c56eb5f9596b16c] Adds mx.nd.moveaxis (#5640)
git bisect good 1f850e1afcc3933ffec27bcc9c56eb5f9596b16c
# bad: [d8febd1502ae44d35f3ad7e92caa1bc7afdf04f2] resnet v1 in fp16 (#5897)
git bisect bad d8febd1502ae44d35f3ad7e92caa1bc7afdf04f2
# bad: [caa3253f8432e7e92c8c93837a0c6416f00f4809] Debug error caused by missplaced import find_mxnet in googlenet (#5784)
git bisect bad caa3253f8432e7e92c8c93837a0c6416f00f4809
# bad: [d99a7bfbe4117b9e2f88be464b6f70898faf25d6] fix formatting issues with Reshape documentation (#5686)
git bisect bad d99a7bfbe4117b9e2f88be464b6f70898faf25d6
# bad: [5a144585281d58f4e51e3aba14abf04fce33a569] Adds GetDType() signature to ndarray.h and fixes typo in implementation. (#5680)
git bisect bad 5a144585281d58f4e51e3aba14abf04fce33a569
# bad: [fa37125fcd1f754bc2f021880cd74d3621445e12] compile with macOS support (#4260)
git bisect bad fa37125fcd1f754bc2f021880cd74d3621445e12
# bad: [d2f40d884a320a6c24cf60ff57c34a1649d2cdeb] Implement 3D deconvolution (cuDNN) (#5615)
git bisect bad d2f40d884a320a6c24cf60ff57c34a1649d2cdeb
# bad: [b35148ec8bb0e483f97493d587b764088c92702b] Example ssd, major update  (#5643)
git bisect bad b35148ec8bb0e483f97493d587b764088c92702b
# good: [138344683e65c87af20250e3f4cdcc5a72ac3cc5] modify mx.nd.reshape documentation (#5609)
git bisect good 138344683e65c87af20250e3f4cdcc5a72ac3cc5
# first bad commit: [b35148ec8bb0e483f97493d587b764088c92702b] Example ssd, major update  (#5643)

Here 'bad' means either a compile error or the performance issue.
It seems to be the same as the result I got under Windows.
The first bad commit may not introduce the performance issue itself, but it (or the submodule) causes a compile error, and after the compile error disappeared, the performance became different.

@sxjscience
Member

Closing it now. @magic282 Feel free to reopen if you have other findings.
