
Bug with fusion #17105

Closed
zburning opened this issue Dec 18, 2019 · 4 comments · Fixed by #17114

@zburning (Contributor)

Description

I'm running the script at https://github.com/dmlc/gluon-nlp/tree/master/scripts/language_model/run_glue.py. The command line to reproduce:
python run_glue.py --task MRPC --gpu 8 --batch_size 32

I have the following observation:

  1. With model.hybridize():
     With mxnet-cu100 >= 1.6.0b20191102, the time cost of the first batch is more than 283 s with a single GPU and more than 1000 s with 4 GPUs.
     With mxnet-cu100 == 1.6.0b20191101, the time cost of the first batch is about 11 s with a single GPU.

  2. With model.hybridize() and os.environ['MXNET_USE_FUSION'] = '0' (see the sketch after this list):
     For both mxnet-cu100 == 1.6.0b20191102 and 1.6.0b20191215, the time cost of the first batch is around 11 s.

  3. Without hybridize():
     The performance is nearly the same.
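
For reference, a minimal sketch of how the fusion toggle and first-batch timing can be exercised. This is not the actual run_glue.py setup; the tiny Dense model and shapes below are placeholders, and the environment variable must be set before mxnet is imported to take effect:

import os
# MXNET_USE_FUSION must be set before importing mxnet; '1' (default) enables
# pointwise-operator fusion, '0' disables it.
os.environ['MXNET_USE_FUSION'] = '0'

import time
import mxnet as mx
from mxnet.gluon import nn

# Placeholder model standing in for the BERT model built by run_glue.py.
ctx = mx.gpu(0)
net = nn.HybridSequential()
net.add(nn.Dense(768, activation='relu'), nn.Dense(2))
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True)

x = mx.nd.random.uniform(shape=(32, 768), ctx=ctx)

# The first forward pass builds the computational graph (and, with fusion
# enabled, runs the fusion graph pass and compiles the fused kernels), so its
# wall-clock time is where the slowdown shows up.
start = time.time()
net(x).wait_to_read()
print('first batch: %.1f s' % (time.time() - start))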

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)

To Reproduce

(If you developed your own code, please provide a short script that reproduces the error. For existing examples, please provide link.)

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here
zburning added the Bug label Dec 18, 2019
leezu added the R1.6.0 label Dec 18, 2019
ptrendx self-assigned this Dec 18, 2019

ptrendx commented Dec 18, 2019

Some increase in the time of the first batch is expected with fusion enabled, but that is definitely excessive. Just to confirm: by mxnet-cu100 >= 1.6.0b20191102, do you mean that you tested both builds, from 11/02 and 12/15? There was a change, #16783, that went into master on November 12 (and was backported to 1.6 on November 15) that sped up the compilation by caching the results.

Anyway, will look into it.


ptrendx commented Dec 18, 2019

Ok, I can reproduce it. The first observation is that it is not actually the compilation that accounts for most of the time (so probably the fusion graph pass?).

I will dig further into it. At least when comparing the second epoch between the fusion-enabled and fusion-disabled runs, fusion is faster (83.8 s vs 87 s) ;-)
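
Not from this thread, but one way to attribute where the first batch spends its time is MXNet's built-in profiler; a minimal sketch, again with a placeholder model rather than the actual run_glue.py model (the fusion graph pass itself runs during graph construction and may not appear as a profiled operator, so this mainly separates operator/kernel time from the rest):

import mxnet as mx
from mxnet.gluon import nn

# Placeholder model, not the BERT model from run_glue.py.
ctx = mx.gpu(0)
net = nn.HybridSequential()
net.add(nn.Dense(768, activation='relu'), nn.Dense(2))
net.initialize(ctx=ctx)
net.hybridize(static_alloc=True)

# Collect operator-level timings for the first batch only.
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='first_batch_profile.json')
mx.profiler.set_state('run')

x = mx.nd.random.uniform(shape=(32, 768), ctx=ctx)
net(x).wait_to_read()

mx.profiler.set_state('stop')
print(mx.profiler.dumps())  # aggregated per-operator statistics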


ptrendx commented Dec 19, 2019

I created the PR with a fix for this. Locally the graph pass now takes ~4.5 s on a single GPU, down from over 200 s. (Since you are not using multiple processes in that script, the time for multiple GPUs would unfortunately still scale linearly; changing that would require much bigger changes to MXNet.)

@zburning (Contributor, Author)
Thank you for your great work!
