
[mxnet 2.0] [item 2.4] Turning on large tensor support by default #17331

Open
apeforest opened this issue Jan 15, 2020 · 25 comments

@apeforest
Contributor

apeforest commented Jan 15, 2020

Description

Currently, MXNet only supports tensors with fewer than 2^31 elements. To support large tensors, users need to recompile MXNet with the USE_INT64_TENSOR_SIZE compiler flag set to ON.
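For reference, this is roughly how the flag takes effect; the exact definitions live in mshadow, so treat the sketch below as an illustration of the mechanism rather than a copy of the source.

```c++
// Illustrative sketch (not verbatim): USE_INT64_TENSOR_SIZE is forwarded to the
// MSHADOW_INT64_TENSOR_SIZE macro, which selects the width of index_t, the type
// used for tensor sizes and element offsets throughout MXNet.
#include <cstdint>

#if MSHADOW_INT64_TENSOR_SIZE == 1
typedef int64_t index_t;   // tensors may exceed 2^31 - 1 elements
#else
typedef int32_t index_t;   // current default: sizes limited to the int32 range
#endif
```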

Large tensors are often used in applications such as recommendation systems with sparse embedding matrices, and in graph neural network frameworks such as DGL.

To provide a better user experience, we would like to turn on this compiler flag by default so that the MXNet binary releases will support large tensors out of the box.

RFC: https://lists.apache.org/thread.html/df53b8c26e9e0433378dd803baba9fec4dd922728a5ce9135dc164b3@%3Cdev.mxnet.apache.org%3E

Current Status:

Large tensor support is already implemented in the MXNet backend and C API. Over 80 operators have been tested, and more are being tested.

There was performance degradation in a few operators, such as transpose; it has been fixed (#16104).

Model Inference Performance

int64/int32 P50 records the 50th-percentile inference runtime for each build.
% Diff: runtime speedup of the int64 build vs. the int32 build.
Thus a positive value means inference time is reduced when int64 is used as the tensor index type.
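For concreteness, the reported difference appears to be computed as

$$\text{Diff} = \left(1 - \frac{t^{\,\text{P50}}_{\text{int64}}}{t^{\,\text{P50}}_{\text{int32}}}\right)\times 100\%$$

e.g. for the first row, 1 - 47.34/49.47 ≈ 4.29%, matching the reported value.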

| Model | Mode | int64 P50 (ms) | int32 P50 (ms) | Diff (%) |
| --- | --- | --- | --- | --- |
| resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29% |
| resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22% |
| resnext50 | gluon | 17.14539528 | 18.05592 | 5.04% |
| resnext50 | module | 10.05506516 | 9.636641 | -4.34% |
| nin | gluon | 2.574443817 | 2.608061 | 1.29% |
| nin | module | 2.432107925 | 2.737761 | 11.16% |
| resnet18 | gluon | 3.895759583 | 3.638268 | -7.08% |
| resnet18 | module | 2.954959869 | 3.182888 | 7.16% |
| wavernn | gluon | 262.9389763 | 256.5546 | -2.49% |
| caffenet | gluon | 2.930879593 | 3.087759 | 5.08% |
| caffenet | module | 3.169536591 | 3.225327 | 1.73% |
| vgg19 | gluon | 14.18304443 | 13.89098 | -2.10% |
| vgg19 | module | 13.80157471 | 14.33492 | 3.72% |
| maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13% |
| maskrcnn | module | 1943.515778 | 1926.38 | -0.89% |
| superres | gluon | 17.39168167 | 18.00895 | 3.43% |
| superres | module | 16.98470116 | 17.26198 | 1.61% |
| resnet101 | gluon | 18.73707771 | 18.4412 | -1.60% |
| resnet101 | module | 16.66593552 | 14.78386 | -12.73% |
| vgg16 | gluon | 12.403965 | 16.2611 | 23.72% |
| vgg16 | module | 17.93074608 | 11.83605 | -51.49% |
| yolov3 | gluon | 22.96686172 | 23.01311 | 0.20% |
| yolov3 | module | 18.57829094 | 20.05506 | 7.36% |
| ssd | gluon | 17.17400551 | 16.73698 | -2.61% |
| ssd | module | 13.98611069 | 14.00757 | 0.15% |
| rnn | gluon | 28.2740593 | 28.92017 | 2.23% |
| rnn | module | 19.32096481 | 28.63479 | 32.53% |
| a3c | gluon | 0.928401947 | 0.94223 | 1.47% |
| a3c | module | 0.673055649 | 0.858545 | 21.61% |
| squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22% |
| squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46% |
| resnet152 | gluon | 25.8705616 | 27.65441 | 6.45% |
| resnet152 | module | 20.5206871 | 21.03257 | 2.43% |
| resnet34 | gluon | 6.978273392 | 7.166862 | 2.63% |
| resnet34 | module | 5.693674088 | 5.653858 | -0.70% |
| squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04% |
| squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12% |
| resnext101 | gluon | 29.1929245 | 27.65107 | -5.58% |
| resnext101 | module | 15.9804821 | 17.51709 | 8.77% |
| bert | gluon | 44.32678223 | 43.77675 | -1.26% |
| bert | module | 43.85828972 | 45.38655 | 3.37% |
| resnet50 | gluon | 10.39171219 | 10.31256 | -0.77% |
| resnet50 | module | 9.351491928 | 8.312941 | -12.49% |
| fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86% |
| fasterrcnn | module | 702.3141384 | 703.7232 | 0.20% |
| inception | gluon | 7.934331894 | 8.714437 | 8.95% |
| inception | module | 5.178928375 | 5.363703 | 3.44% |
| Average | gluon | n/a | n/a | 0.69% |
| Average | module | n/a | n/a | -0.37% |

Model Training Performance

| Model | int64 Samples/Second | int32 Samples/Second | Percentage Change |
| --- | --- | --- | --- |
| xception | 67.51961 | 68.61849 | -1.60% |
| resnet50_v2 | 299.0174 | 299.1728 | -0.05% |
| gnmt | 7.65 | 7.675 | -0.33% |
| vgg16 | 228.4218 | 230.0739 | -0.72% |
| bert | 38.1 | 46.7 | -18.42% |
| yolo3_darknet53_custom | 31.6145 | 40.65 | -22.23% |
| inceptionv3 | 225.4025 | 227.1884 | -0.79% |
| se_resnet152_v1 | 123.7371 | 124.1493 | -0.33% |
| word_language_model | 15651.19 | 15524.71 | 0.81% |
| *mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 6.50% |
| resnet101_v1 | 176.6355 | 177.3132 | -0.38% |
| squeezenet1.0 | 790.7722 | 790.1395 | 0.08% |
| mobilenetv2_0.75 | 680.4143 | 672.2202 | 1.22% |
| ssd | 66.2365 | 67.56 | -1.96% |
| Average | | | -3.44% |

* measures speed instead of throughput

What Caused Performance Drop in BERT

Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation to the broadcast_axis operator (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).

Running the operator-level profiler, we could identify a 2.2x performance drop in the broadcast_axis operator.

w/o USE_INT64_TENSOR_SIZE flag:

```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
```

w/ USE_INT64_TENSOR_SIZE flag:

```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
```

Why is the broadcast_axis Operator Affected

There are too many div/mul/mod ALU operations on the indices, which changed from int32 to int64:

```c++
template<typename OP>
struct broadcast_kernel {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    size_t in_stride = 1;
    size_t out_stride = 1;
    index_t idx = i;
    index_t in_idx = i;
    // Walk the dimensions from innermost to outermost, converting the flat
    // output index i into the corresponding flat input index; broadcast
    // dimensions (in_shape[iter] == 1) contribute nothing to in_idx.
    for (int iter = ndim - 1; iter >= 0; --iter) {
      // The % and / below run once per dimension for every output element;
      // with 64-bit index_t these become noticeably more expensive.
      size_t dim_idx = idx % out_shape[iter];
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        in_idx += dim_idx * in_stride;
      }
      idx /= out_shape[iter];
      in_stride *= in_shape[iter];
      out_stride *= out_shape[iter];
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};
```

TODO

@apeforest apeforest self-assigned this Jan 15, 2020
@apeforest apeforest added this to To do in MXNet 2.0 via automation Jan 16, 2020
@apeforest apeforest changed the title [mxnet 2.0] Turning on large tensor support by default [mxnet 2.0] [item 2.4] Turning on large tensor support by default Jan 16, 2020
@apeforest apeforest moved this from To do to In progress in MXNet 2.0 Jan 21, 2020
@ChaiBapchya
Contributor

ChaiBapchya commented Jan 30, 2020

Add LT support to ops found via OpPerf:

- NN optimizers and 1 activation: #17444 [Merged]
- Random, Sample, PDF ops: #17445 [Merged]

@ChaiBapchya
Contributor

ChaiBapchya commented Jan 30, 2020

- [OpPerf] Indexing Ops: #16253 [Merged]
- [OpPerf] Neural Network Loss Ops: #17482 [Merged]
- [OpPerf] Consolidate array manipulation related operators: #17487

@JonTanS
Contributor

JonTanS commented Feb 5, 2020

Inference benchmarks comparing LT_MKL with just MKL enabled.
All times are in ms.
% Diff is calculated as 1 - (P50 with LT / P50 without LT).
A positive number means a speed increase, a negative number means a speed decrease.

| Model | Mode | P50 w/ LT (ms) | P50 No LT (ms) | Percentage Difference |
| --- | --- | --- | --- | --- |
| resnext101_64x4d | gluon | 47.34253883 | 49.46685 | 4.29% |
| resnext101_64x4d | module | 28.83672714 | 28.48792 | -1.22% |
| resnext50 | gluon | 17.14539528 | 18.05592 | 5.04% |
| resnext50 | module | 10.05506516 | 9.636641 | -4.34% |
| nin | gluon | 2.574443817 | 2.608061 | 1.29% |
| nin | module | 2.432107925 | 2.737761 | 11.16% |
| resnet18 | gluon | 3.895759583 | 3.638268 | -7.08% |
| resnet18 | module | 2.954959869 | 3.182888 | 7.16% |
| wavernn | gluon | 262.9389763 | 256.5546 | -2.49% |
| caffenet | gluon | 2.930879593 | 3.087759 | 5.08% |
| caffenet | module | 3.169536591 | 3.225327 | 1.73% |
| vgg19 | gluon | 14.18304443 | 13.89098 | -2.10% |
| vgg19 | module | 13.80157471 | 14.33492 | 3.72% |
| maskrcnn | gluon | 2340.852737 | 2391.741 | 2.13% |
| maskrcnn | module | 1943.515778 | 1926.38 | -0.89% |
| superres | gluon | 17.39168167 | 18.00895 | 3.43% |
| superres | module | 16.98470116 | 17.26198 | 1.61% |
| resnet101 | gluon | 18.73707771 | 18.4412 | -1.60% |
| resnet101 | module | 16.66593552 | 14.78386 | -12.73% |
| vgg16 | gluon | 12.403965 | 16.2611 | 23.72% |
| vgg16 | module | 17.93074608 | 11.83605 | -51.49% |
| yolov3 | gluon | 22.96686172 | 23.01311 | 0.20% |
| yolov3 | module | 18.57829094 | 20.05506 | 7.36% |
| ssd | gluon | 17.17400551 | 16.73698 | -2.61% |
| ssd | module | 13.98611069 | 14.00757 | 0.15% |
| rnn | gluon | 28.2740593 | 28.92017 | 2.23% |
| rnn | module | 19.32096481 | 28.63479 | 32.53% |
| a3c | gluon | 0.928401947 | 0.94223 | 1.47% |
| a3c | module | 0.673055649 | 0.858545 | 21.61% |
| squeezenetv10 | gluon | 4.072666168 | 4.251957 | 4.22% |
| squeezenetv10 | module | 3.686189651 | 3.818274 | 3.46% |
| resnet152 | gluon | 25.8705616 | 27.65441 | 6.45% |
| resnet152 | module | 20.5206871 | 21.03257 | 2.43% |
| resnet34 | gluon | 6.978273392 | 7.166862 | 2.63% |
| resnet34 | module | 5.693674088 | 5.653858 | -0.70% |
| squeezenetv11 | gluon | 3.037929535 | 3.165722 | 4.04% |
| squeezenetv11 | module | 2.890110016 | 2.983332 | 3.12% |
| resnext101 | gluon | 29.1929245 | 27.65107 | -5.58% |
| resnext101 | module | 15.9804821 | 17.51709 | 8.77% |
| bert | gluon | 44.32678223 | 43.77675 | -1.26% |
| bert | module | 43.85828972 | 45.38655 | 3.37% |
| resnet50 | gluon | 10.39171219 | 10.31256 | -0.77% |
| resnet50 | module | 9.351491928 | 8.312941 | -12.49% |
| fasterrcnn | gluon | 1041.807413 | 1061.532 | 1.86% |
| fasterrcnn | module | 702.3141384 | 703.7232 | 0.20% |
| inception | gluon | 7.934331894 | 8.714437 | 8.95% |
| inception | module | 5.178928375 | 5.363703 | 3.44% |
| drmm | gluon | 837.1179104 | 614.3708 | -36.26% |
| drmm | module | 830.9795856 | 607.6496 | -36.75% |

Average Percentage Change over all numbers:
Gluon: 0.69%
Module: -0.37%

@JonTanS
Contributor

JonTanS commented Feb 6, 2020

Training benchmarks comparing LT_MKL with just MKL enabled.
Speed is measured in seconds per epoch.
GPU memory is measured in MB.

Note: for samples/second, higher is better (the opposite of the time-based columns), so I have multiplied those percentages by -1. A quick explanation: with the raw formula, a positive percentage change would mean fewer samples/second and a negative change would mean more samples/second; flipping the sign keeps positive meaning better across all columns.
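In formula form, and consistent with the numbers in the table below:

$$\text{Diff}_{\text{time}} = \left(1 - \frac{x_{\text{LT}}}{x_{\text{no LT}}}\right)\times 100\%, \qquad \text{Diff}_{\text{throughput}} = \left(\frac{x_{\text{LT}}}{x_{\text{no LT}}} - 1\right)\times 100\%$$

so a positive value always means the LT build did better (e.g. bert samples/second: 38.1/46.7 - 1 ≈ -18.4%).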

| Model | Speed P50 LT (sec/epoch) | Speed P50 No LT (sec/epoch) | GPU Memory LT (MB) | GPU Memory No LT (MB) | Samples/Second P50 LT | Samples/Second P50 No LT | Speed % Change | GPU Memory % Change | Samples/Second % Change |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| xception | 19247.12517 | 18935.02989 | 15304 | 15320 | 67.51961 | 68.61849 | -1.65% | 0.10% | -1.60% |
| resnet50_v2 | 4342.953992 | 4342.899322 | 6892 | 6762 | 299.0174 | 299.1728 | 0.00% | -1.92% | -0.05% |
| gnmt | N/A | N/A | 4244 | 4112 | 7.65 | 7.675 | | -3.21% | -0.33% |
| vgg16 | 5680.658345 | 5641.058277 | 9480 | 9496 | 228.4218 | 230.0739 | -0.70% | 0.17% | -0.72% |
| bert | 20.66 | 16.8 | 4684 | 4050 | 38.1 | 46.7 | -22.98% | -15.65% | -18.42% |
| yolo3_darknet53_custom | 517.4205 | 454.908 | 7304 | 12436 | 31.6145 | 40.65 | -13.74% | 41.27% | -22.23% |
| inceptionv3 | 5765.122603 | 5723.867063 | 8318 | 8304 | 225.4025 | 227.1884 | -0.72% | -0.17% | -0.79% |
| se_resnet152_v1 | 10497.33863 | 10465.23692 | 11290 | 10568 | 123.7371 | 124.1493 | -0.31% | -6.83% | -0.33% |
| word_language_model | 141.125 | 142.3 | 8846 | 7426 | 15651.19 | 15524.71 | 0.83% | -19.12% | 0.81% |
| mobilenet0.25_cifar10 | 56.6609205 | 60.5992765 | 1234 | 1134 | N/A | N/A | 6.50% | -8.82% | |
| resnet101_v1 | 7354.353666 | 7329.202738 | 8118 | 8022 | 176.6355 | 177.3132 | -0.34% | -1.20% | -0.38% |
| squeezenet1.0 | 1677.752777 | 1678.684668 | 3770 | 3590 | 790.7722 | 790.1395 | 0.06% | -5.01% | 0.08% |
| mobilenetv2_0.75 | 1938.194231 | 1968.429737 | 5078 | 5008 | 680.4143 | 672.2202 | 1.54% | -1.40% | 1.22% |
| ssd | 424.28 | 254.9485 | 4702 | 4592 | 66.2365 | 67.56 | -66.42% | -2.40% | -1.96% |

Average Percentage Change:
Speed: -7.53%
GPU Memory: -1.73%
Samples / Second: -3.44%

@eric-haibin-lin
Member

@jonatan1626 thanks for the update. Does -22.98% mean 22.98% slower?

@JonTanS
Contributor

JonTanS commented Feb 6, 2020

@eric-haibin-lin Yes, I am calculating this as 1 - (LT+MKL value / MKL value).
For samples/sec I do the above and then multiply by -1.

@apeforest
Contributor Author

> @eric-haibin-lin Yes, I am calculating this as 1 - (LT+MKL value / MKL value).
> For samples/sec I do the above and then multiply by -1.

In your description "A negative percentage change means there are more samples/second." Doesn't that mean negative percentage is faster?

@JonTanS
Contributor

JonTanS commented Feb 6, 2020

@apeforest Oh sorry, I'm multiplying only the samples/second column by -1 to keep the meaning consistent with everything else. The rest of the columns already follow the convention that a positive percentage is an improvement and a negative percentage a degradation.

For example if MKL_LT gives 66 samples/sec and MKL gives 70 samples/sec that will be:
1-(66/70) or 6%. Because it's positive, we think that it's better but actually it's worse because the throughput has gone down.

On the other hand if MKL_LT gives 74 samples/sec and MKL gives 70 samples/sec that will be:
1-(74/70) or -5%. Because it's negative, we think it's worse but actually it's better because our throughput has gone up.

So I multiply by -1 to give it the same meaning as the rest of the percentages, where positive is better and negative is worse.

@szha
Member

szha commented Feb 17, 2020

The slowdown for BERT (-22.98%) is quite significant. We will need to mitigate this before moving forward.

@apeforest
Contributor Author

Thanks to @JonTanS for running the profiler, we have pinpointed the performance degradation to the broadcast_axis operator (from 138 ms to 177 ms) and MXNDArraySyncCopyToCPU (from 592 ms to 679 ms).

Running the operator-level profiler, we could also identify the performance drop in broadcast_axis alone.

w/o USE_INT64_TENSOR_SIZE flag:

```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 2.7753}]}]
```

w/ USE_INT64_TENSOR_SIZE flag:

```
[{'broadcast_axis': [{'inputs': {'data': (1, 1024, 1), 'axis': (0, 2), 'size': (1024, 8)}, 'max_storage_mem_alloc_gpu/0': 16777.2168, 'avg_time_forward_broadcast_axis': 6.3178}]}]
```

Also, as I look into the implementation of the broadcast_axis operator, many modulo and multiplication operations on the indices are involved. The next step will be to find an optimal implementation of broadcast_axis that reduces the ALU work on indices in the kernel; one possible direction is sketched below.
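A minimal sketch of one such direction (this is not the actual fix that later landed, just an illustration): keep the index arithmetic in a templated index type and dispatch to 32-bit math whenever the output size fits in int32, so the 64-bit div/mod cost is only paid for genuinely large tensors.

```c++
// Hypothetical variant of the kernel above with a templated index type.
template<typename OP, typename IdxType>
struct broadcast_kernel_fast {
  template<typename IType, typename OType>
  MSHADOW_XINLINE static void Map(index_t i,
                                  IType *input,
                                  OType *output,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> in_shape,
                                  mshadow::Shape<MXNET_SPECIAL_MAX_NDIM> out_shape,
                                  const OpReqType req,
                                  const uint32_t ndim) {
    IdxType idx = static_cast<IdxType>(i);
    IdxType in_idx = static_cast<IdxType>(i);
    IdxType in_stride = 1;
    IdxType out_stride = 1;
    for (int iter = ndim - 1; iter >= 0; --iter) {
      // Same index math as before, but the div/mod run on IdxType, which can be
      // int32_t whenever the whole tensor is known to fit in the int32 range.
      const IdxType dim_idx = idx % static_cast<IdxType>(out_shape[iter]);
      in_idx -= dim_idx * out_stride;
      if (in_shape[iter] != 1) {
        in_idx += dim_idx * in_stride;
      }
      idx /= static_cast<IdxType>(out_shape[iter]);
      in_stride *= static_cast<IdxType>(in_shape[iter]);
      out_stride *= static_cast<IdxType>(out_shape[iter]);
    }
    KERNEL_ASSIGN(output[i], req, OP::Map(input[in_idx]));
  }
};

// Dispatch sketch: use 32-bit index math when every index fits in int32.
// if (out_size <= std::numeric_limits<int32_t>::max()) {
//   Kernel<broadcast_kernel_fast<OP, int32_t>, xpu>::Launch(/* ... */);
// } else {
//   Kernel<broadcast_kernel_fast<OP, int64_t>, xpu>::Launch(/* ... */);
// }
```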


@access2rohit
Contributor

access2rohit commented May 2, 2020

@szha @eric-haibin-lin @apeforest

Results with the current master plus the new broadcast_axis changes, from a single-GPU training run on a p3.16xl.

BERT run command:

```
python3 run_pretraining.py --data='./part-0000.train' --data_eval='./part-0000.train' --num_steps 100 --lr 1e-4 --optimizer lamb --accumulate 1 --raw --gpus 0 --num_dataset_workers 2 --num_batch_workers 1 --circle_length 1 --total_batch_size 4 --total_batch_size_eval 4 --log_interval 10
```

Results:

| Code Version | Throughput avg (samples/sec) | Throughput p50 | Throughput p90 | Total time (training only, ignoring evaluation steps) |
| --- | --- | --- | --- | --- |
| master LT | 24.38k | 25.50k | 28.47k | 134.8 sec |
| master | 25.90k | 25.90k | 27.82k | 131.9 sec |
| new LT | 25.87k | 25.80k | 28.00k | 127.3 sec |
| new | 25.92k | 25.80k | 27.80k | 131.5 sec |

"new" refers to mxnet code with optimized broadcast_axis.
"master" refers to mxnet master branch code
"LT" refers to of the build was done after enabling large tensor.

@apeforest
Contributor Author

apeforest commented May 6, 2020

@access2rohit This result is a little surprising. In the earlier benchmark results provided by @JonTanS, there is a ~18% degradation in BERT training when the large tensor (LT) compiler flag is turned on:

> bert 38.1 46.7 -18.42%

However, from your result, even without your latest speedup in the broadcast_axis operator, there is very little difference with the LT flag on:

> master LT 24.38k 25.50k 28.47k 134.8 sec
> master 25.90k 25.90k 27.82k 131.9 sec

Could you provide more insights?

@access2rohit
Contributor

@apeforest The profiling done by @JonTanS was done a while back, using mxnet-1.6 in November. These results use the current master branch of MXNet, and the BERT scripts have changed too. If there are newer settings for running BERT on a single node, they are not available on the GluonNLP site. If @eric-haibin-lin or @szhengac can verify whether my BERT setup is correct and provide proper tuning params for running BERT on a single node, I will re-run the benchmarks and update the results here.

@access2rohit
Contributor

PR #17882 fixes the regression in SSD. Following are the new results for the SSD run:

| Code | SSD 1-epoch time (sec) | % Speedup/Slowdown w.r.t. Master (large tensor disabled) |
| --- | --- | --- |
| Master (large tensor disabled) | 226 | 0 |
| Master (large tensor enabled) | 335 | 48.23% slowdown |
| Master + CPU-optimized broadcast_axis (large tensor disabled) | 130 | 42.5% speedup |
| Master + CPU-optimized broadcast_axis (large tensor enabled) | 184 | 18.5% speedup |

@access2rohit
Contributor

access2rohit commented Jun 27, 2020

@apeforest @sandeep-krishnamurthy @szha @zheng-da

PRs to enable large tensor support by default in master are divided into two stages:
Stage 1: Unix CPU/GPU and Windows CPU/GPU #18625
Stage 2: All remaining platforms #18626

Once the above two PRs are merged, MXNet will support large tensors on CPU/GPU (depending on global memory) on master.

@access2rohit
Contributor

access2rohit commented Jul 10, 2020

Currently, large tensor support works for all operators implemented in MXNet, and MKLDNN also supports int64. The CUDA kernels written inside MXNet, both generic (cpu/gpu) and GPU-specific, support large tensors depending on device memory.

BLAS and LAPACK libraries were not considered while defining the scope of the project. Currently, the following BLAS and LAPACK implementations are supported inside MXNet:

- openBLAS (default)
- MKL
- ATLAS
- Apple Accelerate

Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed to cast pointers, since int64_t is long int* as opposed to MKL's long long int*). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t.
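To make the wrapper change concrete, here is a hypothetical sketch; it is not MXNet's actual linalg wrapper, and the MXNET_USE_MKL_ILP64 flag, the saxpy_ declaration, and the INTERFACE64=1 openBLAS build option are assumptions for illustration.

```c++
#include <cstdint>

#if MXNET_USE_MKL_ILP64
#include <mkl.h>  // in ILP64 builds MKL_INT is long long int and mkl.h declares saxpy_
#else
// openBLAS built with a 64-bit integer interface (typically INTERFACE64=1).
extern "C" void saxpy_(const int64_t* n, const float* alpha,
                       const float* x, const int64_t* incx,
                       float* y, const int64_t* incy);
#endif

// The MXNet-facing wrapper keeps int64_t in its signature either way.
inline void linalg_saxpy(int64_t n, float alpha,
                         const float* x, int64_t incx,
                         float* y, int64_t incy) {
#if MXNET_USE_MKL_ILP64
  // int64_t is long int on LP64 Linux while MKL_INT is long long int; both are
  // 64 bits wide, but the pointer types differ, hence the reinterpret_cast.
  saxpy_(reinterpret_cast<const MKL_INT*>(&n), &alpha,
         x, reinterpret_cast<const MKL_INT*>(&incx),
         y, reinterpret_cast<const MKL_INT*>(&incy));
#else
  saxpy_(&n, &alpha, x, &incx, y, &incy);
#endif
}
```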

Initially, openBLAS can be supported, since it is used by default and in the PyPI wheels as well, thus not breaking any default behaviour for customers. Users on other BLAS and LAPACK implementations won't face issues as long as they don't use large tensors. Additional error messages will be added for the case where a large tensor is used but the BLAS implementation is not openBLAS, until that BLAS library is made to work with MXNet's large tensor support.

NOTE: currently openBLAS works correctly with smaller inputs (within the int32 range) but will truncate parameters passed with higher values, resulting in either a SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a larger script).

@sandeep-krishnamurthy @leezu @szha @zheng-da

@sandeep-krishnamurthy
Contributor

Thanks @access2rohit for the summary.

Is the plan for enabling Large Tensor Support in the following order?

  1. Make openBLAS compatible with Large Tensor support and merge the PR for enabling Large Tensor Support, so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.
  2. Next, we work on enabling MKL bindings capable of Large Tensor Support, as a separate PR, so that users building custom MXNet builds with MKL as BLAS will get the Large Tensor functionality.
  3. We need to debate ATLAS and Accelerate BLAS support; we can pick up this discussion once we get the above two major steps done.

Do you see this order of execution okay @access2rohit @leezu @szha @zheng-da ?

@szha
Member

szha commented Jul 10, 2020

Has the large tensor for numpy array been supported?

@sandeep-krishnamurthy
Contributor

@access2rohit can correct me, but a few of them are supported, as they use the same kernels under the hood. The scope of this issue was mainly NDArray when it got started. After these are done, the remaining numpy ops will also be supported.

@access2rohit
Contributor

> Make openBLAS compatible with Large Tensor support and merge the PR for enabling Large Tensor Support, so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.

Yes.

@access2rohit
Contributor

access2rohit commented Jul 10, 2020

> Has the large tensor for numpy array been supported?

Upon inspecting the numpy files inside MXNet: they use index_t for iterating over elements in their own kernels, and they reuse the NDArray kernels for the rest, where we ensured index_t is used where required. For the kernels that use BLAS, I will update them in the same PR that makes the MXNet openBLAS wrappers int64-compatible.
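A minimal illustration of the index_t pattern mentioned above (a hypothetical kernel, not taken from the numpy sources): as long as loop variables, sizes, and offsets use index_t rather than int, the same code compiles to 32-bit or 64-bit indexing depending on the build flag.

```c++
#include <mshadow/base.h>  // index_t is selected by MSHADOW_INT64_TENSOR_SIZE

using mshadow::index_t;

// Hypothetical elementwise kernel: safe for tensors with more than 2^31 - 1
// elements in int64 builds because the counter and size are index_t, not int.
void scale_inplace(float* data, index_t size, float alpha) {
  for (index_t i = 0; i < size; ++i) {
    data[i] *= alpha;
  }
}
```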

@leezu
Contributor

leezu commented Jul 10, 2020

> NOTE: currently openBLAS works correctly with smaller inputs (within the int32 range) but will truncate parameters passed with higher values, resulting in either a SIGSEGV (mostly) or garbage values (which will eventually cause a SIGSEGV in a larger script).

I'm a little concerned that we don't have a correct integration of BLAS and LAPACK: the BLAS kernels will get potential crashes or corrupt results. But I think @sandeep-krishnamurthy's point

> Make openBLAS compatible with Large Tensor support and merge the PR for enabling Large Tensor Support, so that default PyPI users of MXNet can already benefit from the new capability. This will actually cover the largest user base of MXNet.

refers to fixing this? If so, I'm fine with the order of execution. Thank you @access2rohit for the hard work on this feature.

@access2rohit
Contributor

> Upon investigation, openBLAS needs to be built with a specific flag to support int64_t signatures, and MKL supports long long int signatures (in which case reinterpret_cast<>() is needed to cast pointers, since int64_t is long int* as opposed to MKL's long long int*). Additionally, the LAPACK and BLAS wrappers need to be updated from int to int64_t.

@leezu Yes, that's what I meant.

@szha
Member

szha commented Jul 10, 2020

I think the numpy frontend hasn't supported large tensors yet. I started working on it here #18368 but I haven't found the time to finish migrating all the tests. @access2rohit would you be able to help out and take that over?

@szha szha assigned access2rohit and unassigned apeforest Jul 23, 2020
@szha szha moved this from In progress to Done in MXNet 2.0 Sep 10, 2021