Fix OpPerf in Master #17735

ChaiBapchya · 2020-03-02T02:13:27Z

Description

Change 1 (Fix for rmsprop_update and rmsprop_alex_update)

PRs #17449 and #17400 were merged independently, so the optimizer refactor was left incomplete: neither PR accounted for the changes made by the other.

#17449 added a set of variables for large tensors, while #17400 renamed two variables from gamma1 and gamma2 to rho and momentum.

This PR resolves that conflict.
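
A minimal sketch of the kind of mismatch described above, assuming the standard and large-tensor defaults live in two dictionaries keyed by operator argument name. The dictionary names, the values, and the `missing_keys` helper are hypothetical and are not the actual contents of default_params.py.

```python
# Hypothetical defaults tables; names and values are illustrative only.
standard_defaults = {"lr": [0.1], "rho": [0.9], "momentum": [0.9]}
# Large-tensor table that still uses the pre-rename argument names.
large_tensor_defaults = {"lr": [0.1], "gamma1": [0.9], "gamma2": [0.9]}


def missing_keys(reference, other):
    """Return the keys of `reference` that `other` does not define, sorted."""
    return sorted(set(reference) - set(other))


# Keys only the large-tensor table has (stale names after the rename).
stale = missing_keys(large_tensor_defaults, standard_defaults)
# Keys only the standard table has (new names the other table lacks).
absent = missing_keys(standard_defaults, large_tensor_defaults)
print(stale)
print(absent)
```

A check of this shape, run over both tables, would flag the renamed arguments on either side before a benchmark run fails on them.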

Change 2 (Fix for BatchNorm)

Running the entire OpPerf suite with CUDA=ON and CUDNN=ON showed that BatchNorm fails with the following error:

<function BatchNorm at 0x7f46869e3bf8>
Traceback (most recent call last):
  File "incubator-mxnet/benchmark/opperf/opperf.py", line 213, in <module>
    sys.exit(main())
  File "incubator-mxnet/benchmark/opperf/opperf.py", line 193, in main
    benchmark_results = run_all_mxnet_operator_benchmarks(ctx=ctx, dtype=dtype, profiler=profiler, int64_tensor=int64_tensor, warmup=warmup, runs=runs)
  File "incubator-mxnet/benchmark/opperf/opperf.py", line 99, in run_all_mxnet_operator_benchmarks
    mxnet_operator_benchmark_results.append(run_nn_basic_operators_benchmarks(ctx=ctx, dtype=dtype, profiler=profiler, int64_tensor=int64_tensor, warmup=warmup, runs=runs))
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/nd_operations/nn_basic_operators.py", line 143, in run_nn_basic_operators_benchmarks
    mx_nn_basic_op_results = run_op_benchmarks(mx_nn_basic_ops, dtype, ctx, profiler, int64_tensor, warmup, runs)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 211, in run_op_benchmarks
    warmup=warmup, runs=runs)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 178, in run_performance_test
    benchmark_result = _run_nd_operator_performance_test(op, inputs, run_backward, warmup, runs, kwargs_list, profiler)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/benchmark_utils.py", line 115, in _run_nd_operator_performance_test
    _, _ = benchmark_helper_func(op, warmup, **kwargs_list[0])
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/profiler_utils.py", line 200, in cpp_profile_it
    res = func(*args, **kwargs)
  File "/home/ubuntu/incubator-mxnet/benchmark/opperf/utils/ndarray_utils.py", line 60, in nd_forward_backward_and_profile
    nd.waitall()
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 206, in waitall
    check_call(_LIB.MXNDArrayWaitAll())
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: Traceback (most recent call last):
  File "/home/ubuntu/incubator-mxnet/src/operator/nn/./cudnn/cudnn_batch_norm-inl.h", line 62
MXNetError: Check failed: param.eps >= 1e-5 (1e-08 vs. 1e-05) : CuDNN requires eps to be no less than 1e-05
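
The condition that fails in the traceback can be mirrored on the Python side before an operator is invoked. This is only a sketch: `CUDNN_MIN_EPS` and `check_batchnorm_eps` are hypothetical names, and the threshold is taken directly from the error message above rather than from any MXNet API.

```python
# Threshold taken from the error message: eps must be no less than 1e-05.
CUDNN_MIN_EPS = 1e-5


def check_batchnorm_eps(eps, min_eps=CUDNN_MIN_EPS):
    """Return eps unchanged, or raise ValueError if it is below min_eps."""
    if eps < min_eps:
        raise ValueError(
            "eps=%g is below the cuDNN minimum of %g" % (eps, min_eps))
    return eps
```

With this helper, a value of 1e-08 (the one reported in the failure) is rejected early, while 1e-05 passes.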

Change 3 (Fix for lamb_update_*)

The wd parameter was incorrectly omitted from default_params. This PR adds it back.
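
To illustrate why a missing default matters, here is a sketch of a lookup that builds keyword arguments for an operator from a defaults table and reports any argument left without a value. The `build_kwargs` helper, the argument list, and the values are assumptions made for illustration, not the OpPerf implementation.

```python
def build_kwargs(arg_names, defaults):
    """Map each argument name to its default; fail loudly on any gap."""
    missing = [name for name in arg_names if name not in defaults]
    if missing:
        raise KeyError("no default for: " + ", ".join(missing))
    return {name: defaults[name] for name in arg_names}


# Hypothetical argument list for a lamb_update-style operator.
lamb_args = ["beta1", "beta2", "epsilon", "wd"]
# Hypothetical defaults, first without and then with an entry for wd.
defaults_without_wd = {"beta1": 0.9, "beta2": 0.999, "epsilon": 1e-6}
defaults_with_wd = dict(defaults_without_wd, wd=0.0)
```

Calling `build_kwargs(lamb_args, defaults_without_wd)` raises a `KeyError` naming `wd`, whereas the table that includes `wd` yields a complete set of keyword arguments.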

Checklist

Essentials

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Code is well-documented:
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

ChaiBapchya · 2020-03-02T07:27:32Z

Ran the entire opperf suite after making the changes

Command

python incubator-mxnet/benchmark/opperf/opperf.py --ctx gpu --output-format md --output-file mxnet_opperf_gpu_fix_opperf_lamb.md --warmup 1 --runs 1

python incubator-mxnet/benchmark/opperf/opperf.py --ctx cpu --output-format md --output-file mxnet_opperf_cpu_fix_opperf_lamb.md --warmup 1 --runs 1

CPU & GPU

https://gist.github.com/ChaiBapchya/517445961bcd57da6b7b5d23fa5d3dc0

leezu · 2020-03-02T17:40:07Z

Restarted windows CI

connorgoggins

LGTM! Thanks for your help rectifying these issues.

ChaiBapchya · 2020-03-09T20:42:44Z

@mxnet-label-bot update [pr-awaiting-merge]

* cudnn expects eps to be >e105 and gamma1,gamma2 renamed in other PR
* fix lamb_update_* ops
* add var name for wd

ChaiBapchya added 2 commits March 2, 2020 01:54

cudnn expects eps to be >e105 and gamma1,gamma2 renamed in other PR

e01b92a

fix lamb_update_* ops

80a7aac

connorgoggins mentioned this pull request Mar 2, 2020

[Large Tensor] Implemented LT flag for OpPerf testing #17449

Merged

connorgoggins approved these changes Mar 2, 2020

apeforest reviewed Mar 2, 2020

benchmark/opperf/rules/default_params.py (outdated, resolved)

add var name for wd

a527bb0

lanking520 added the pr-awaiting-merge (Review and CI is complete. Ready to Merge) label Mar 9, 2020

leezu merged commit 0aa2c78 into apache:master Mar 9, 2020

MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020

Fix OpPerf in Master (apache#17735)

d558d72

anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020

Fix OpPerf in Master (apache#17735)

868f9cc
