Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Operator Performance Regression on CPU #15429

Open
6 tasks done
roywei opened this issue Jul 2, 2019 · 23 comments
Open
6 tasks done

Operator Performance Regression on CPU #15429

roywei opened this issue Jul 2, 2019 · 23 comments

Comments

@roywei
Copy link
Member

roywei commented Jul 2, 2019

Follow up on dev list discussion:

https://lists.apache.org/thread.html/154ef1e4010671e7375c7a7cbedb413d5a4a3677321488440fb32a3a@%3Cdev.mxnet.apache.org%3E

We have found some operators to have performance regression using the operator benchmark module here:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf

@sandeep-krishnamurthy has helped to run the benchmark and this is the training mode result:
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50

The above result is using training mode (autograd.record()) and calculating both forward and backward time.

As most users use CPU for inference only, to further investigate the impact on inference I have run the scripts using inference mode

Please find the inference and training mode results here:
https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit?usp=sharing

I have calculated the regression percentage and sorted them, thanks to @aaronmarkham for providing the first version.

Although there are variances on perf numbers between runs, we observe the following commonly used operators be slower consistently.

We need to look into them and fix if root caused.

  • Dropout
  • relu
  • LeakyReLU
  • dot
  • element wise ops (mul, div, sub)
  • broadcast ops (mul, sub)

Some ops regression seems only to happen on mxnet-mkl version (refer to 4th sheet of the google sheet)

Environment:

AWS C5.18xLarge
Deep Learning Base AMI (Ubuntu) Version 18.1
Python 3.6

MXNet versions:

with MKLDNN
pip install mxnet-mkl==1.5.0b20190627 
pip install mxnet-mkl==1.4.1

without MKLDNN
pip install mxnet==1.5.0b20190627
pip install mxnet==1.4.1

Note: nightly 20190627 contains the latest commit in v.1.5.x

Scripts:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
Notes: you need to modify the scripts a bit to run

  1. requires python 3.6, requires to add your scripts path to PYTHONPATH, follow instructions to run benchmark on all operators.
  2. To run operators in inference mode, you need to set False at this line
    and change run_backward to False in all files under:
    https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf/nd_operations
    for example here.
@mxnet-label-bot
Copy link
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@roywei
Copy link
Member Author

roywei commented Jul 2, 2019

cc @PatricZhao @szha @ddavydenko

@roywei roywei changed the title Operator Performance regression on CPU Operator Performance Regression on CPU Jul 2, 2019
@roywei
Copy link
Member Author

roywei commented Jul 2, 2019

script command:
python incubator-mxnet/benchmark/opperf/opperf.py --output-format md --output-file mxnet-{version}-infer.md

raw script output:
https://drive.google.com/drive/folders/1OE-bedTFd-HN0Og6vW8XGWkswrRrdyW0?usp=sharing

@pengzhao-intel
Copy link
Contributor

Maybe this is the same thing with #15430. @ciyongch is WIP

@ciyongch
Copy link
Contributor

ciyongch commented Jul 2, 2019

Yes, #15430 is also caused by op perf regression. I will try to reproduce the case and focus the inference firstly.

@ciyongch
Copy link
Contributor

ciyongch commented Jul 2, 2019

@roywei I've collected some performance data of Dropout, relu, and dot on C5.18xlarge, and saw some big run to run variance on dot and Dropout as below table without binding cores as table 1.
After binding cores to a socket via below command before running benchmarking, I got a stable and better results as table 2.

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18

I ran 5 times with 10 warmup + 100 runs for each time. Not sure if you're setting the env variables when running the benchmark. And I suggest to re-run the ops for several times to see if it's a real degradation.

As for relu operator got worse perf on v1.5, I found v1.5 included the commit of c645591 (#14262), which added "IsNan(a)" check for data which introduced some overhead. Will add more data and more operator results later.

Table1

Operator Avg Forward Time (ms) 1.4.1 Avg Forward Time (ms) 1.5.0 Regression data shape
dot 0.1791 0.1576 12.0% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.2505 0.1146 54.3% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.1798 0.2531 -40.8% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.1529 0.2786 -82.2% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.4024 0.0763 81.0% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.1824 0.3437 -88.4% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.1107 0.2491 -125.0% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.4099 0.2401 41.4% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 1.4348 0.2439 83.0% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.3124 0.2922 6.5% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
Dropout 0.0551 0.0477 13.4% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.0446 0.0462 -3.6% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.078 0.0567 27.3% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.0311 0.0797 -156.3% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.3707 0.0671 81.9% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}

Table 2

Operator Avg Forward Time (ms) 1.4.1 Avg Forward Time (ms) 1.5.0 Regression data shape
dot 0.0079 0.0087 -10.1% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.0076 0.0089 -17.1% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.0079 0.009 -13.9% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.008 0.0093 -16.3% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.0078 0.0089 -14.1% {'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot 0.0946 0.0895 5.4% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.0937 0.0889 5.1% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.093 0.0895 3.8% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.097 0.0898 7.4% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot 0.0908 0.089 2.0% {'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
Dropout 0.0238 0.0235 1.3% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.0244 0.0234 4.1% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.024 0.0234 2.5% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.024 0.0235 2.1% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout 0.022 0.0234 -6.4% {'mode': 'always', 'data': (10000, 10), 'p': 0.5}

@roywei
Copy link
Member Author

roywei commented Jul 2, 2019

Thanks @ciyongch , setting the environment variables did reduce the variances.
I have updated the document(1st sheet): https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit#gid=196553607

With the current data Dot and Dropout is not a big concern now. Relu's regression is something we have to accept as otherwise it could lead to nans and bugs.

@ciyongch
Copy link
Contributor

ciyongch commented Jul 3, 2019

@roywei I looked at the table, except a few ops with big variances, most of the degraded ops listed in the table dropped into mshadow. I did some further analysis, and found that the performance regression is caused by the commit of 6e94643 (dmlc/mshadow#371), which changed the data type for index_t. Seems the quick fix in that PR didn't work for these ops, and a better solution is required for the type changing.

Here's some perf data collected on my local machine (SKX-8180), you can have a try on your platform :)

Operator Avg Forward Time (ms) 1.4.1 Avg Forward Time (ms) 1.5.0 Regression Avg Forward Time (ms) 1.5.0 (revert 6e94643) Updated Regression data shape
relu 0.0047 0.0107 127.7% 0.0057 21.3% {'data': (10000, 1)}
relu 0.0065 0.0143 120.0% 0.005 -23.1% {'data': (10000, 1)}
relu 0.005 0.0144 188.0% 0.0051 2.0% {'data': (10000, 1)}
relu 0.0052 0.0115 121.2% 0.0066 26.9% {'data': (10000, 1)}
relu 0.005 0.0138 176.0% 0.0061 22.0% {'data': (10000, 1)}
logical_not 0.005 0.0221 342.0% 0.0057 14.0% {'data': (10000, 1)}
logical_not 0.0051 0.0214 319.6% 0.0056 9.8% {'data': (10000, 1)}
logical_not 0.0054 0.0215 298.1% 0.0056 3.7% {'data': (10000, 1)}
logical_not 0.0051 0.0214 319.6% 0.0057 11.8% {'data': (10000, 1)}
logical_not 0.0051 0.0216 323.5% 0.0056 9.8% {'data': (10000, 1)}
softsign 0.0054 0.0166 207.4% 0.0059 9.3% {'data': (10000, 1)}
softsign 0.0053 0.0166 213.2% 0.006 13.2% {'data': (10000, 1)}
softsign 0.0058 0.0163 181.0% 0.006 3.4% {'data': (10000, 1)}
softsign 0.0053 0.0162 205.7% 0.006 13.2% {'data': (10000, 1)}
softsign 0.0054 0.0164 203.7% 0.0062 14.8% {'data': (10000, 1)}

@TaoLv
Copy link
Member

TaoLv commented Jul 3, 2019

cc @apeforest

@sandeep-krishnamurthy
Copy link
Contributor

Thanks @ciyongch for diving deep and help in identifying root cause of the issues.

Can you please help me understand why we see variance only for few operators without setting environment variables for KMP and OMP? Our users don't do these environment variable setup so is it something we need to worry about that these operators will perform badly and indeterministic without that setting?

@apeforest
Copy link
Contributor

@TaoLv @ciyongch Thanks for the investigation. I will try to verify it on my machine.

@ciyongch
Copy link
Contributor

ciyongch commented Jul 4, 2019

@sandeep-krishnamurthy As most of the MXNet's ops will do the parallelism via openmp on CPU, from performance perspective, binding CPU cores will reduce the performance degradation due to cache misses and get better thread scheduling.
Without binding the cores, the performance variance could be bigger due to thread/context switch among those available cores, and plus the time of benchmarking on an individual op is usually small, like several to a hundred microsecond as showed in the above table (in the case of 10 warmup + 100 runs for each op).
BTW, as op level performance tuning is not always reflect its real performance in a real model (for example, data is more easily/likely cached in op benchmarking, which results the better performance compared to a real model scenario). It's better if the future CI will combine some basic/general models benchmarking to determine the regression. What do you think?

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

I agree that what impact actuall user experience is the final speed of the model inference/training as operators get fused and there are other performance improvement technics applied on overall model. Currently we don't have enough data to say these ops regression we have will impact actuall model speed.

Regarding the OP regression, we are focusing on root causing regression of broadcast ops, rest ops should not block 1.5.0 release.

We found out no matter what's the flag (int32/64), there is around 15% regression on broadcast ops, on both mxnet-mkl and mxnet pip packages between 1.4.1 and 1.5.0. I'm still root causing it.

  mxnet-mkl mxnet-mkl mxnet-mkl mxnet-mkl mxnet-mkl   mxnet mxnet  
  1.4.1 (int64) 1.5.0 int32 1.5.0 int64 1.5 int 32 vs 1.4.1 1.5 int 64 vs 1.4.1   1.5.0 mxnet 1.4.1 mxnet regression
broadcast_add 0.00242 0.00286 0.0029 18% 20%   0.0021 0.0016 31%
broadcast_div 0.00244 0.00286 0.00296 17% 21%   0.0021 0.0017 24%
broadcast_equal 0.0024 0.00294 0.00294 23% 23%   0.0021 0.0016 31%
broadcast_greater 0.00244 0.00282 0.00294 16% 20%   0.0021 0.0016 31%
broadcast_greater_equal 0.00244 0.00284 0.00286 16% 17%   0.002 0.0017 18%
broadcast_hypot 0.0025 0.00292 0.00302 17% 21%   0.0021 0.0016 31%
broadcast_lesser 0.00242 0.00288 0.00298 19% 23%   0.002 0.0015 33%
broadcast_lesser_equal 0.00246 0.00288 0.00288 17% 17%   0.0021 0.0017 24%
broadcast_logical_and 0.0024 0.0029 0.00292 21% 22%   0.0021 0.0016 31%
broadcast_logical_or 0.0025 0.00288 0.00288 15% 15%   0.0021 0.0016 31%
broadcast_logical_xor 0.00242 0.0029 0.00296 20% 22%   0.0021 0.0016 31%
broadcast_maximum 0.00248 0.00288 0.00284 16% 15%   0.0021 0.0016 31%
broadcast_minimum 0.0025 0.00286 0.00284 14% 14%   0.002 0.0015 33%
broadcast_mod 0.00262 0.00294 0.00302 12% 15%   0.0021 0.0016 31%
broadcast_mul 0.00244 0.00288 0.00298 18% 22%   0.0021 0.0017 24%
broadcast_not_equal 0.00258 0.00296 0.00294 15% 14%   0.002 0.0016 25%
broadcast_power 0.00296 0.00336 0.00348 14% 18%   0.0023 0.0019 21%
broadcast_sub 0.0025 0.00296 0.003 18% 20%   0.002 0.0016 25%

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

I have ran a search for perf regression on broadcast_add and there are 2 regressions between 0412-0425, 0619-0621

mxnet-mkl version broadcast_add speed commit
1.5.0b20190411 0.003
1.5.0b20190412 0.003 30d479f
1.5.0b20190425 0.0034 587d480
1.5.0b20190426 0.0034
1.5.0b20190619 0.0033 ccbbf6b
1.5.0b20190620 0.0035 2de0db0
1.5.0b20190621 0.0037 8b5f376

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

@sandeep-krishnamurthy @Zha0q1
#15240 This PR introduced regression from 0619-0621.
broadcast_add speed before this PR: 0.004, after this PR: 0.0047
Note speed is different than pip installed version due to build flags but the regression is there.
build: sudo make -j$(nproc) USE_MKLDNN=1
script:

import mxnet as mx
from mxnet import nd
mx.random.seed(10)
from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test(nd.broadcast_add, run_backward=False, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'lhs': [(1024, 1024), (10000, 10), (10000, 1)], 'rhs': [(1024, 1024), (10000, 10), (10000, 1)]} ],
                               warmup=10, runs=500)

print("speed:", add_res[0]['broadcast_add'][0]['avg_time_forward_broadcast_add'])

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

another regression from PR #14661
broadcast_add speed:
before 0.0033
after: 0.0039

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

Update: all benchmark results before this comment maybe inaccurate as the profiler code between 1.4.1 and 1.5 is different. and the opperf script is using profiler.

There is no much regression (accounting variance also) between 1.4.1 and 1.5.0 on broadcast ops if using python time(). it may due to profiler change between versions.
Result on C5.18xlarge.

  1.4.1 1.4.1 1.4.1 average 1.5.0 1.5.0 1.5.0 1.5.0 1.5.0 average
broadcast_add 0.055 0.059 0.056624 0.0567635 0.058 0.06 0.057 0.057 0.058183
elemwise_add 0.104 0.104460 0.104491 0.10444 0.121223 0.127 0.124 0.124 0.125145

Thanks to @reminisce for helping out and providing the scripts.

import time
import mxnet as mx

mx.random.seed(0)

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))

repeat = 10000

mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    c = mx.nd.broadcast_add(a, b)
    # c = mx.nd.elemwise_add(a, b)
    c.wait_to_read()
elapse = time.time() - start

print("elapse time: %fms" % (elapse * 1000 / repeat))

@apeforest
Copy link
Contributor

apeforest commented Jul 5, 2019

#15240 may introduced extra runtime in profiler in mxnet 1.5. Using timeit module, broadcast_add between mxnet1.4 and mxnet1.5 does not show significant difference:

import timeit
SETUP_CODE = '''
import mxnet as mx
from mxnet import nd
mx.random.seed(10)
def binary_op(lhs, rhs):
    return nd.broadcast_add(lhs=lhs, rhs=rhs,  dtype='float32', ctx=mx.cpu())

lhs = nd.random.uniform(shape=(1024, 1024), dtype='float32', ctx=mx.cpu())
rhs = nd.random.uniform(shape=(1, 1024), dtype='float32', ctx=mx.cpu())
res = binary_op(lhs, rhs)
'''

TEST_CODE = '''
res.wait_to_read()
'''
# timeit.repeat statement
times = timeit.repeat(setup = SETUP_CODE,
                      stmt = TEST_CODE,
                      repeat = 3,
                      number = 10000)

# priniting minimum exec. time
print('time: {}'.format(min(times)))

Results using script above

  1.4.1 1.5.0
broadcast_add 0.0073 0.0071
elemwise_add 0.0079 0.0072
sigmoid 0.0079 0.0075
hard_sigmoid 0.0072 0.0069

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

There is also no significant regression on BatchNorm op between 1.4.1 and 1.5.0
Also the speed actually improves if turning int64 flag on.

  1.4.1 (int64) runs 1.5.0 (int 32) runs 1.5.0 (int 64) runs
BatchNorm 2.609942 2.63 2.031
2nd run 2.621809 2.63 2.054
3rd run 2.621809 2.611 2.041
average 2.60764 2.61148 2.042131

script:

import mxnet as  mx
import time


mx.random.seed(0)

data = mx.nd.random.uniform(0, 256, (32, 3, 256, 256))
beta = mx.nd.random.uniform(shape=(3,))
gamma = mx.nd.random.uniform(shape=(3,))
mean = mx.nd.random.uniform(shape=(3,))
var = mx.nd.random.uniform(shape=(3,))
repeat = 1000

mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    #c = mx.nd.broadcast_add(a, b)
    c = mx.nd.BatchNorm(data=data, gamma=gamma, beta=beta, moving_mean=mean, moving_var=var)
    # c = mx.nd.elemwise_add(a, b)
    c.wait_to_read()
elapse = time.time() - start

print("elapse time: %fms" % (elapse * 1000 / repeat))

@roywei
Copy link
Member Author

roywei commented Jul 5, 2019

Conclusion

  1. No significant regression for broadcast ops and batchnorm. This issue should no block 1.5.0 release.
  2. Regression on ReLU is expected as it's necessary as stated in Fix NaN value comparisons in relu, max and min ops #14262
  3. We should use timeit or time module for benchmark. Or use the same profiler module to make sure it's apple to apple comparison
  4. Need better benchmark methods to reduce the variance between runs to clearly identify if there is a regession, otherwise it's hard to root cause and fix. Right now we use fixed seed and set OMP_NUM_THREADS

@ciyongch
Copy link
Contributor

ciyongch commented Jul 8, 2019

@roywei does it mean the operator profiling results with current profiler module is not accurate?

@apeforest
Copy link
Contributor

@ciyongch The profiler was implemented in 1.5 and there are changes in the profiler code which were not in 1.4.1 release: #15240. This could have made difference in runtime as the runtime in each operator is very sensitive.

Going forward, we should enhance the profiler module and make it available in CI so we can capture the performance degradation in time.

@ciyongch
Copy link
Contributor

ciyongch commented Jul 9, 2019

@apeforest Thanks for the info, as we saw the current perf degradation only happened for some ops or some certain shapes with v1.5, any suggestion to do profiling for a real model? Could we still rely on the profiling result with latest MXNet code base (with #15240) so far?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants