
Bug with fusion #17105

Closed
zburning opened this issue Dec 18, 2019 · 4 comments · Fixed by #17114

@zburning (Contributor)

Description

I'm running the script at https://github.com/dmlc/gluon-nlp/tree/master/scripts/language_model/run_glue.py. The command line to reproduce:
python run_glue.py --task MRPC --gpu 8 --batch_size 32

I have the following observation:

  1. With model.hybridize():
     With mxnet-cu100 >= 1.6.0b20191102, the time cost of the first batch is more than 283 s with a single GPU and more than 1000 s with 4 GPUs.
     With mxnet-cu100 == 1.6.0b20191101, the time cost of the first batch is about 11 s with a single GPU.

  2. With model.hybridize() and os.environ['MXNET_USE_FUSION'] = '0' (see the sketch after this list):
     For both mxnet-cu100 == 1.6.0b20191102 and 1.6.0b20191215, the time cost of the first batch is around 11 s.

  3. Without hybridize():
     The performance is nearly the same.
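
For reference, a minimal sketch of how the fusion toggle and first-batch timing can be exercised. This is not the actual run_glue.py setup; the tiny Dense model and shapes below are placeholders, and the environment variable must be set before mxnet is imported to take effect:

import os
# MXNET_USE_FUSION must be set before importing mxnet; '1' (default) enables
# pointwise-operator fusion, '0' disables it.
os.environ['MXNET_USE_FUSION'] = '0'

import time
import mxnet as mx
from mxnet.gluon import nn

# Placeholder model standing in for the BERT model built by run_glue.py.
ctx = mx.gpu(0)
net = nn.HybridSequential()
net.add(nn.Dense(768, activation='relu'), nn.Dense(2))
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True)

x = mx.nd.random.uniform(shape=(32, 768), ctx=ctx)

# The first forward pass builds the computational graph (and, with fusion
# enabled, runs the fusion graph pass and compiles the fused kernels), so its
# wall-clock time is where the slowdown shows up.
start = time.time()
net(x).wait_to_read()
print('first batch: %.1f s' % (time.time() - start))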

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
zburning added the Bug label Dec 18, 2019
leezu added the R1.6.0 label Dec 18, 2019
ptrendx self-assigned this Dec 18, 2019

ptrendx commented Dec 18, 2019

Some increase in the time of the first batch is expected with fusion enabled, but that is definitely excessive. Just to confirm: by mxnet-cu100 >= 1.6.0b20191102, do you mean that you tested both builds, from 11/02 and 12/15? There was a change, #16783, that went into master on November 12 (and was backported to 1.6 on November 15) that sped up the compilation by caching the results.

Anyway, will look into it.


ptrendx commented Dec 18, 2019

Ok, I can reproduce it. The first observation is that it is not actually the compilation that accounts for most of the time (so probably the fusion graph pass?).

I will dig further into it. At least when comparing the second epoch between the fusion-enabled and fusion-disabled runs, fusion is faster (83.8 s vs 87 s) ;-)
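
Not from this thread, but one way to attribute where the first batch spends its time is MXNet's built-in profiler; a minimal sketch, again with a placeholder model rather than the actual run_glue.py model (the fusion graph pass itself runs during graph construction and may not appear as a profiled operator, so this mainly separates operator/kernel time from the rest):

import mxnet as mx
from mxnet.gluon import nn

# Placeholder model, not the BERT model from run_glue.py.
ctx = mx.gpu(0)
net = nn.HybridSequential()
net.add(nn.Dense(768, activation='relu'), nn.Dense(2))
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True)

# Collect operator-level timings for the first batch only.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='first_batch_profile.json')
mx.profiler.set_state('run')

x = mx.nd.random.uniform(shape=(32, 768), ctx=ctx)
net(x).wait_to_read()

mx.profiler.set_state('stop')
print(mx.profiler.dumps())  # aggregated per-operator statistics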


ptrendx commented Dec 19, 2019

I created the PR with a fix for this. Locally the graph pass now takes ~4.5 s on a single GPU, down from over 200 s. (Since you are not using multiple processes in that script, the time for multiple GPUs would unfortunately still scale linearly; changing that would require much bigger changes to MXNet.)

@zburning (Contributor, Author)
Thank you for your great work!
