Operator Performance Regression on CPU #15429

roywei · 2019-07-02T06:36:02Z

Follow up on dev list discussion:

https://lists.apache.org/thread.html/154ef1e4010671e7375c7a7cbedb413d5a4a3677321488440fb32a3a@%3Cdev.mxnet.apache.org%3E

We have found some operators to have performance regression using the operator benchmark module here:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf

@sandeep-krishnamurthy has helped to run the benchmark and this is the training mode result:
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50

The above result is using training mode (autograd.record()) and calculating both forward and backward time.

As most users use CPU for inference only, to further investigate the impact on inference I have run the scripts using inference mode

Please find the inference and training mode results here:
https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit?usp=sharing

I have calculated the regression percentage and sorted them, thanks to @aaronmarkham for providing the first version.

Although there are variances on perf numbers between runs, we observe the following commonly used operators be slower consistently.

We need to look into them and fix if root caused.

Some ops regression seems only to happen on mxnet-mkl version (refer to 4th sheet of the google sheet)

Environment:

AWS C5.18xLarge
Deep Learning Base AMI (Ubuntu) Version 18.1
Python 3.6

MXNet versions:

with MKLDNN
pip install mxnet-mkl==1.5.0b20190627 
pip install mxnet-mkl==1.4.1

without MKLDNN
pip install mxnet==1.5.0b20190627
pip install mxnet==1.4.1

Note: nightly 20190627 contains the latest commit in v.1.5.x

Scripts:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
Notes: you need to modify the scripts a bit to run

requires python 3.6, requires to add your scripts path to PYTHONPATH, follow instructions to run benchmark on all operators.
To run operators in inference mode, you need to set False at this line
and change run_backward to False in all files under:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf/nd_operations
for example here.

The text was updated successfully, but these errors were encountered:

mxnet-label-bot · 2019-07-02T06:36:05Z

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

roywei · 2019-07-02T06:49:25Z

cc @PatricZhao @szha @ddavydenko

roywei · 2019-07-02T07:44:06Z

script command:
python incubator-mxnet/benchmark/opperf/opperf.py --output-format md --output-file mxnet-{version}-infer.md

raw script output:
https://drive.google.com/drive/folders/1OE-bedTFd-HN0Og6vW8XGWkswrRrdyW0?usp=sharing

pengzhao-intel · 2019-07-02T07:53:03Z

Maybe this is the same thing with #15430. @ciyongch is WIP

ciyongch · 2019-07-02T08:41:22Z

Yes, #15430 is also caused by op perf regression. I will try to reproduce the case and focus the inference firstly.

ciyongch · 2019-07-02T14:24:19Z

@roywei I've collected some performance data of Dropout, relu, and dot on C5.18xlarge, and saw some big run to run variance on dot and Dropout as below table without binding cores as table 1.
After binding cores to a socket via below command before running benchmarking, I got a stable and better results as table 2.

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=18

I ran 5 times with 10 warmup + 100 runs for each time. Not sure if you're setting the env variables when running the benchmark. And I suggest to re-run the ops for several times to see if it's a real degradation.

As for relu operator got worse perf on v1.5, I found v1.5 included the commit of c645591 (#14262), which added "IsNan(a)" check for data which introduced some overhead. Will add more data and more operator results later.

Table1

Operator	Avg Forward Time (ms) 1.4.1	Avg Forward Time (ms) 1.5.0	Regression	data shape
dot	0.1791	0.1576	12.0%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.2505	0.1146	54.3%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.1798	0.2531	-40.8%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.1529	0.2786	-82.2%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.4024	0.0763	81.0%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.1824	0.3437	-88.4%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.1107	0.2491	-125.0%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.4099	0.2401	41.4%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	1.4348	0.2439	83.0%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.3124	0.2922	6.5%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
Dropout	0.0551	0.0477	13.4%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.0446	0.0462	-3.6%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.078	0.0567	27.3%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.0311	0.0797	-156.3%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.3707	0.0671	81.9%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}

Table 2

Operator	Avg Forward Time (ms) 1.4.1	Avg Forward Time (ms) 1.5.0	Regression	data shape
dot	0.0079	0.0087	-10.1%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.0076	0.0089	-17.1%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.0079	0.009	-13.9%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.008	0.0093	-16.3%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.0078	0.0089	-14.1%	{'lhs': (1000, 1), 'rhs': (100, 1000), 'transpose_b': True, 'transpose_a': True}
dot	0.0946	0.0895	5.4%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.0937	0.0889	5.1%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.093	0.0895	3.8%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.097	0.0898	7.4%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
dot	0.0908	0.089	2.0%	{'lhs': (1000, 10), 'rhs': (1000, 10), 'transpose_b': True}
Dropout	0.0238	0.0235	1.3%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.0244	0.0234	4.1%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.024	0.0234	2.5%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.024	0.0235	2.1%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}
Dropout	0.022	0.0234	-6.4%	{'mode': 'always', 'data': (10000, 10), 'p': 0.5}

roywei · 2019-07-02T23:15:51Z

Thanks @ciyongch , setting the environment variables did reduce the variances.
I have updated the document(1st sheet): https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit#gid=196553607

With the current data Dot and Dropout is not a big concern now. Relu's regression is something we have to accept as otherwise it could lead to nans and bugs.

ciyongch · 2019-07-03T13:35:54Z

@roywei I looked at the table, except a few ops with big variances, most of the degraded ops listed in the table dropped into mshadow. I did some further analysis, and found that the performance regression is caused by the commit of 6e94643 (dmlc/mshadow#371), which changed the data type for index_t. Seems the quick fix in that PR didn't work for these ops, and a better solution is required for the type changing.

Here's some perf data collected on my local machine (SKX-8180), you can have a try on your platform :)

Operator	Avg Forward Time (ms) 1.4.1	Avg Forward Time (ms) 1.5.0	Regression	Avg Forward Time (ms) 1.5.0 (revert 6e94643)	Updated Regression	data shape
relu	0.0047	0.0107	127.7%	0.0057	21.3%	{'data': (10000, 1)}
relu	0.0065	0.0143	120.0%	0.005	-23.1%	{'data': (10000, 1)}
relu	0.005	0.0144	188.0%	0.0051	2.0%	{'data': (10000, 1)}
relu	0.0052	0.0115	121.2%	0.0066	26.9%	{'data': (10000, 1)}
relu	0.005	0.0138	176.0%	0.0061	22.0%	{'data': (10000, 1)}
logical_not	0.005	0.0221	342.0%	0.0057	14.0%	{'data': (10000, 1)}
logical_not	0.0051	0.0214	319.6%	0.0056	9.8%	{'data': (10000, 1)}
logical_not	0.0054	0.0215	298.1%	0.0056	3.7%	{'data': (10000, 1)}
logical_not	0.0051	0.0214	319.6%	0.0057	11.8%	{'data': (10000, 1)}
logical_not	0.0051	0.0216	323.5%	0.0056	9.8%	{'data': (10000, 1)}
softsign	0.0054	0.0166	207.4%	0.0059	9.3%	{'data': (10000, 1)}
softsign	0.0053	0.0166	213.2%	0.006	13.2%	{'data': (10000, 1)}
softsign	0.0058	0.0163	181.0%	0.006	3.4%	{'data': (10000, 1)}
softsign	0.0053	0.0162	205.7%	0.006	13.2%	{'data': (10000, 1)}
softsign	0.0054	0.0164	203.7%	0.0062	14.8%	{'data': (10000, 1)}

TaoLv · 2019-07-03T14:27:25Z

cc @apeforest

sandeep-krishnamurthy · 2019-07-03T15:24:59Z

Thanks @ciyongch for diving deep and help in identifying root cause of the issues.

Can you please help me understand why we see variance only for few operators without setting environment variables for KMP and OMP? Our users don't do these environment variable setup so is it something we need to worry about that these operators will perform badly and indeterministic without that setting?

apeforest · 2019-07-03T17:14:49Z

@TaoLv @ciyongch Thanks for the investigation. I will try to verify it on my machine.

ciyongch · 2019-07-04T07:58:23Z

@sandeep-krishnamurthy As most of the MXNet's ops will do the parallelism via openmp on CPU, from performance perspective, binding CPU cores will reduce the performance degradation due to cache misses and get better thread scheduling.
Without binding the cores, the performance variance could be bigger due to thread/context switch among those available cores, and plus the time of benchmarking on an individual op is usually small, like several to a hundred microsecond as showed in the above table (in the case of 10 warmup + 100 runs for each op).
BTW, as op level performance tuning is not always reflect its real performance in a real model (for example, data is more easily/likely cached in op benchmarking, which results the better performance compared to a real model scenario). It's better if the future CI will combine some basic/general models benchmarking to determine the regression. What do you think?

roywei · 2019-07-05T00:52:57Z

I agree that what impact actuall user experience is the final speed of the model inference/training as operators get fused and there are other performance improvement technics applied on overall model. Currently we don't have enough data to say these ops regression we have will impact actuall model speed.

Regarding the OP regression, we are focusing on root causing regression of broadcast ops, rest ops should not block 1.5.0 release.

We found out no matter what's the flag (int32/64), there is around 15% regression on broadcast ops, on both mxnet-mkl and mxnet pip packages between 1.4.1 and 1.5.0. I'm still root causing it.

	mxnet-mkl	mxnet-mkl	mxnet-mkl	mxnet-mkl	mxnet-mkl	mxnet	mxnet
	1.4.1 (int64)	1.5.0 int32	1.5.0 int64	1.5 int 32 vs 1.4.1	1.5 int 64 vs 1.4.1	1.5.0 mxnet	1.4.1 mxnet	regression
broadcast_add	0.00242	0.00286	0.0029	18%	20%	0.0021	0.0016	31%
broadcast_div	0.00244	0.00286	0.00296	17%	21%	0.0021	0.0017	24%
broadcast_equal	0.0024	0.00294	0.00294	23%	23%	0.0021	0.0016	31%
broadcast_greater	0.00244	0.00282	0.00294	16%	20%	0.0021	0.0016	31%
broadcast_greater_equal	0.00244	0.00284	0.00286	16%	17%	0.002	0.0017	18%
broadcast_hypot	0.0025	0.00292	0.00302	17%	21%	0.0021	0.0016	31%
broadcast_lesser	0.00242	0.00288	0.00298	19%	23%	0.002	0.0015	33%
broadcast_lesser_equal	0.00246	0.00288	0.00288	17%	17%	0.0021	0.0017	24%
broadcast_logical_and	0.0024	0.0029	0.00292	21%	22%	0.0021	0.0016	31%
broadcast_logical_or	0.0025	0.00288	0.00288	15%	15%	0.0021	0.0016	31%
broadcast_logical_xor	0.00242	0.0029	0.00296	20%	22%	0.0021	0.0016	31%
broadcast_maximum	0.00248	0.00288	0.00284	16%	15%	0.0021	0.0016	31%
broadcast_minimum	0.0025	0.00286	0.00284	14%	14%	0.002	0.0015	33%
broadcast_mod	0.00262	0.00294	0.00302	12%	15%	0.0021	0.0016	31%
broadcast_mul	0.00244	0.00288	0.00298	18%	22%	0.0021	0.0017	24%
broadcast_not_equal	0.00258	0.00296	0.00294	15%	14%	0.002	0.0016	25%
broadcast_power	0.00296	0.00336	0.00348	14%	18%	0.0023	0.0019	21%
broadcast_sub	0.0025	0.00296	0.003	18%	20%	0.002	0.0016	25%

roywei · 2019-07-05T06:21:07Z

I have ran a search for perf regression on broadcast_add and there are 2 regressions between 0412-0425, 0619-0621

mxnet-mkl version	broadcast_add speed	commit
1.5.0b20190411	0.003
1.5.0b20190412	0.003	`30d479f`
1.5.0b20190425	0.0034	`587d480`
1.5.0b20190426	0.0034
1.5.0b20190619	0.0033	`ccbbf6b`
1.5.0b20190620	0.0035	`2de0db0`
1.5.0b20190621	0.0037	`8b5f376`

roywei · 2019-07-05T07:54:04Z

@sandeep-krishnamurthy @Zha0q1
#15240 This PR introduced regression from 0619-0621.
broadcast_add speed before this PR: 0.004, after this PR: 0.0047
Note speed is different than pip installed version due to build flags but the regression is there.
build: sudo make -j$(nproc) USE_MKLDNN=1
script:

import mxnet as mx
from mxnet import nd
mx.random.seed(10)
from benchmark.opperf.utils.benchmark_utils import run_performance_test

add_res = run_performance_test(nd.broadcast_add, run_backward=False, dtype='float32', ctx=mx.cpu(),
                               inputs=[{'lhs': [(1024, 1024), (10000, 10), (10000, 1)], 'rhs': [(1024, 1024), (10000, 10), (10000, 1)]} ],
                               warmup=10, runs=500)

print("speed:", add_res[0]['broadcast_add'][0]['avg_time_forward_broadcast_add'])

roywei · 2019-07-05T08:55:53Z

another regression from PR #14661
broadcast_add speed:
before 0.0033
after: 0.0039

roywei · 2019-07-05T18:12:05Z

Update: all benchmark results before this comment maybe inaccurate as the profiler code between 1.4.1 and 1.5 is different. and the opperf script is using profiler.

There is no much regression (accounting variance also) between 1.4.1 and 1.5.0 on broadcast ops if using python time(). it may due to profiler change between versions.
Result on C5.18xlarge.

	1.4.1	1.4.1	1.4.1	average	1.5.0	1.5.0	1.5.0	1.5.0	1.5.0 average
broadcast_add	0.055	0.059	0.056624	0.0567635	0.058	0.06	0.057	0.057	0.058183
elemwise_add	0.104	0.104460	0.104491	0.10444	0.121223	0.127	0.124	0.124	0.125145

Thanks to @reminisce for helping out and providing the scripts.

import time
import mxnet as mx

mx.random.seed(0)

a = mx.nd.random.uniform(shape=(1024, 1024))
b = mx.nd.random.uniform(shape=(1024, 1024))

repeat = 10000

mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    c = mx.nd.broadcast_add(a, b)
    # c = mx.nd.elemwise_add(a, b)
    c.wait_to_read()
elapse = time.time() - start

print("elapse time: %fms" % (elapse * 1000 / repeat))

apeforest · 2019-07-05T18:12:58Z

#15240 may introduced extra runtime in profiler in mxnet 1.5. Using timeit module, broadcast_add between mxnet1.4 and mxnet1.5 does not show significant difference:

import timeit
SETUP_CODE = '''
import mxnet as mx
from mxnet import nd
mx.random.seed(10)
def binary_op(lhs, rhs):
    return nd.broadcast_add(lhs=lhs, rhs=rhs,  dtype='float32', ctx=mx.cpu())

lhs = nd.random.uniform(shape=(1024, 1024), dtype='float32', ctx=mx.cpu())
rhs = nd.random.uniform(shape=(1, 1024), dtype='float32', ctx=mx.cpu())
res = binary_op(lhs, rhs)
'''

TEST_CODE = '''
res.wait_to_read()
'''
# timeit.repeat statement
times = timeit.repeat(setup = SETUP_CODE,
                      stmt = TEST_CODE,
                      repeat = 3,
                      number = 10000)

# priniting minimum exec. time
print('time: {}'.format(min(times)))

Results using script above

	1.4.1	1.5.0
broadcast_add	0.0073	0.0071
elemwise_add	0.0079	0.0072
sigmoid	0.0079	0.0075
hard_sigmoid	0.0072	0.0069

roywei · 2019-07-05T21:14:16Z

There is also no significant regression on BatchNorm op between 1.4.1 and 1.5.0
Also the speed actually improves if turning int64 flag on.

	1.4.1 (int64) runs	1.5.0 (int 32) runs	1.5.0 (int 64) runs
BatchNorm	2.609942	2.63	2.031
2nd run	2.621809	2.63	2.054
3rd run	2.621809	2.611	2.041
average	2.60764	2.61148	2.042131

script:

import mxnet as  mx
import time


mx.random.seed(0)

data = mx.nd.random.uniform(0, 256, (32, 3, 256, 256))
beta = mx.nd.random.uniform(shape=(3,))
gamma = mx.nd.random.uniform(shape=(3,))
mean = mx.nd.random.uniform(shape=(3,))
var = mx.nd.random.uniform(shape=(3,))
repeat = 1000

mx.nd.waitall()
start = time.time()
for _ in range(repeat):
    #c = mx.nd.broadcast_add(a, b)
    c = mx.nd.BatchNorm(data=data, gamma=gamma, beta=beta, moving_mean=mean, moving_var=var)
    # c = mx.nd.elemwise_add(a, b)
    c.wait_to_read()
elapse = time.time() - start

print("elapse time: %fms" % (elapse * 1000 / repeat))

roywei · 2019-07-05T21:27:20Z

Conclusion

No significant regression for broadcast ops and batchnorm. This issue should no block 1.5.0 release.
Regression on ReLU is expected as it's necessary as stated in Fix NaN value comparisons in relu, max and min ops #14262
We should use timeit or time module for benchmark. Or use the same profiler module to make sure it's apple to apple comparison
Need better benchmark methods to reduce the variance between runs to clearly identify if there is a regession, otherwise it's hard to root cause and fix. Right now we use fixed seed and set OMP_NUM_THREADS

ciyongch · 2019-07-08T01:39:47Z

@roywei does it mean the operator profiling results with current profiler module is not accurate?

apeforest · 2019-07-08T17:04:20Z

@ciyongch The profiler was implemented in 1.5 and there are changes in the profiler code which were not in 1.4.1 release: #15240. This could have made difference in runtime as the runtime in each operator is very sensitive.

Going forward, we should enhance the profiler module and make it available in CI so we can capture the performance degradation in time.

ciyongch · 2019-07-09T01:36:35Z

@apeforest Thanks for the info, as we saw the current perf degradation only happened for some ops or some certain shapes with v1.5, any suggestion to do profiling for a real model? Could we still rely on the profiling result with latest MXNet code base (with #15240) so far?

roywei changed the title ~~Operator Performance regression on CPU~~ Operator Performance Regression on CPU Jul 2, 2019

lanking520 added Operator Performance labels Jul 2, 2019

roywei mentioned this issue Jul 3, 2019

[WIP]add inference mode for op perf benchmark #15453

Closed

roywei mentioned this issue Jul 23, 2019

Performance regression for MXNet 1.5.0 #15640

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Operator Performance Regression on CPU #15429

Operator Performance Regression on CPU #15429

roywei commented Jul 2, 2019 •

edited

mxnet-label-bot commented Jul 2, 2019

roywei commented Jul 2, 2019

roywei commented Jul 2, 2019

pengzhao-intel commented Jul 2, 2019

ciyongch commented Jul 2, 2019

ciyongch commented Jul 2, 2019

roywei commented Jul 2, 2019

ciyongch commented Jul 3, 2019

TaoLv commented Jul 3, 2019 •

edited

sandeep-krishnamurthy commented Jul 3, 2019

apeforest commented Jul 3, 2019

ciyongch commented Jul 4, 2019

roywei commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019

roywei commented Jul 5, 2019

roywei commented Jul 5, 2019 •

edited

apeforest commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019

ciyongch commented Jul 8, 2019

apeforest commented Jul 8, 2019

ciyongch commented Jul 9, 2019

Operator Performance Regression on CPU #15429

Operator Performance Regression on CPU #15429

Comments

roywei commented Jul 2, 2019 • edited

mxnet-label-bot commented Jul 2, 2019

roywei commented Jul 2, 2019

roywei commented Jul 2, 2019

pengzhao-intel commented Jul 2, 2019

ciyongch commented Jul 2, 2019

ciyongch commented Jul 2, 2019

roywei commented Jul 2, 2019

ciyongch commented Jul 3, 2019

TaoLv commented Jul 3, 2019 • edited

sandeep-krishnamurthy commented Jul 3, 2019

apeforest commented Jul 3, 2019

ciyongch commented Jul 4, 2019

roywei commented Jul 5, 2019 • edited

roywei commented Jul 5, 2019 • edited

roywei commented Jul 5, 2019

roywei commented Jul 5, 2019

roywei commented Jul 5, 2019 • edited

Update: all benchmark results before this comment maybe inaccurate as the profiler code between 1.4.1 and 1.5 is different. and the opperf script is using profiler.

apeforest commented Jul 5, 2019 • edited

roywei commented Jul 5, 2019 • edited

roywei commented Jul 5, 2019

Conclusion

ciyongch commented Jul 8, 2019

apeforest commented Jul 8, 2019

ciyongch commented Jul 9, 2019

roywei commented Jul 2, 2019 •

edited

TaoLv commented Jul 3, 2019 •

edited

roywei commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019 •

edited

apeforest commented Jul 5, 2019 •

edited

roywei commented Jul 5, 2019 •

edited