This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Performance Regression on CUDA10 #14725

Closed
stu1130 opened this issue Apr 17, 2019 · 11 comments

@stu1130
Contributor

stu1130 commented Apr 17, 2019

Description

I have observed a performance regression with an LSTM model. Throughput drops from around 950 samples/sec to around 780 samples/sec. I compared mxnet-cu100mkl against mxnet-cu92mkl; both are the 1.4.0 release, not nightly builds.

Environment info (Required)

Deep Learning AMI (Ubuntu) Version 22.0 with P3.16xlarge

Steps to reproduce


  1. Launch a DL AMI or DL Base AMI on p3.16xlarge (the benchmark will only use 1 GPU)
  2. sudo apt-get install python3-pip && sudo pip3 install virtualenv
  3. python3 -m venv mxnet-cu100mkl (or mxnet-cu92mkl)
  4. source mxnet-cu100mkl/bin/activate (or mxnet-cu92mkl/bin/activate)
  5. pip2 install psutil --user && pip2 install --upgrade pandas --user && pip2 install mxnet-cu100mkl (or mxnet-cu92mkl)
  6. git clone https://github.com/awslabs/deeplearning-benchmark.git
  7. cd deeplearning-benchmark
  8. python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 10 --metrics-suffix test --kvstore local
  9. To switch CUDA versions, run sudo rm /usr/local/cuda && sudo ln -s /usr/local/cuda-9.2 /usr/local/cuda
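A quick sanity check (not part of the original steps, just a convenience sketch) can confirm which MXNet wheel and which CUDA toolkit symlink are active before each run:

# Sanity check: print the installed MXNet version and the CUDA toolkit
# that /usr/local/cuda currently points to.
import subprocess

import mxnet as mx

print("mxnet version:", mx.__version__)  # e.g. 1.4.0 from mxnet-cu100mkl or mxnet-cu92mkl
print(subprocess.check_output(["/usr/local/cuda/bin/nvcc", "--version"]).decode())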

Sample result
cu100mkl result:
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 340.001303, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 632.3600240369404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 787.0409230769234, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 534.887, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1131.2, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 796, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 361.069259, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 4132.82, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1414.0}

cu92mkl result:
INFO:root:{'mxnet.mkl_lstm_ptb_symbolic.validation_perplexity.test': 328.856996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_std.test': 633.2544512279404, 'mxnet.mkl_lstm_ptb_symbolic.speed.test': 916.6658846153846, 'mxnet.mkl_lstm_ptb_symbolic.total_training_time.test': 458.66499999999996, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_mean.test': 1132.8, 'mxnet.mkl_lstm_ptb_symbolic.cpu_memory_usage.test': 736, 'mxnet.mkl_lstm_ptb_symbolic.train_perplexity.test': 351.549512, 'mxnet.mkl_lstm_ptb_symbolic.uptime_in_seconds.test': 9839.61, 'mxnet.mkl_lstm_ptb_symbolic.gpu_memory_usage_max.test': 1416.0}

More data points
G3dn.16xlarge - cu100mkl average throughput: 445
G3dn.16xlarge - cu92mkl average throughput: 444
On G3dn (Tesla M60) I don't see the performance regression.
P3.16xlarge - cu100 average throughput: 779
P3.16xlarge - cu92 average throughput: 930

Update suggested by @KellenSunderland
nvprof --profile-child-processes -s python2 benchmark_driver.py --framework mxnet --task-name mkl_lstm_ptb_symbolic --num-gpus 1 --epochs 1 --metrics-suffix test --kvstore local > nvprof_output_cu100mkl 2>&1


cu100mkl

 ==22332== NVPROF is profiling process 22332, command: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==22332== Profiling application: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==22332== Warning: 29 API trace records have same start and end timestamps.
This can happen because of short execution duration of CUDA APIs and low timer resolution on the underlying operating system.
==22332== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   32.96%  11.4365s    139948  81.719us  60.160us  1.9775ms  volta_sgemm_128x64_nt
                   26.94%  9.34508s    136024  68.701us  63.808us  83.135us  volta_sgemm_32x32_sliced1x4_nn
                   11.46%  3.97589s    150520  26.414us  24.319us  34.656us  volta_sgemm_64x32_sliced1x4_tn
                    3.53%  1.22311s      1308  935.10us  361.25us  2.0452ms  volta_sgemm_128x64_nn
                    2.30%  797.69ms    373330  2.1360us  2.0150us  14.400us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.13%  739.36ms    349570  2.1150us  1.9840us  14.464us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.13%  738.49ms       687  1.0750ms  980.92us  1.2799ms  volta_sgemm_32x128_tn
                    1.63%  564.77ms      1308  431.78us  163.55us  972.06us  void mshadow::cuda::AddTakeGradKernel<int=6, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, float, long, long, int)
                    1.40%  484.33ms    133408  3.6300us  2.8800us  16.319us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::plusto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    1.32%  456.59ms    151933  3.0050us  1.7920us  164.10us  void mshadow::cuda::MapPlanKernel<mshadow::sv::plusto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    1.26%  435.71ms    291784  1.4930us  1.3750us  14.048us  [CUDA memset]
                    1.16%  402.49ms       566  711.11us  570.72us  1.7457ms  volta_sgemm_128x64_tn
                    1.14%  396.27ms     14388  27.541us  1.3760us  96.448us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_9SGDKernelEJPfS4_S4_ffffNS_9OpReqTypeEEEEviDpT0_
                    1.08%  376.09ms      1422  264.48us  110.75us  579.74us  void mshadow::cuda::SoftmaxKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, long)
                    1.02%  352.83ms    225780  1.5620us  1.5030us  15.008us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=6, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    1.00%  345.93ms    207960  1.6630us  1.5990us  13.696us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=11, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.91%  316.27ms    207960  1.5200us  1.3750us  13.824us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op5rightEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    0.87%  300.73ms    205344  1.4640us  1.3110us  13.472us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op4leftEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    0.83%  288.70ms    228510  1.2630us     992ns  13.760us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3mulELi1EEEJPfS7_S7_EEEviDpT0_
                    0.75%  260.61ms    150520  1.7310us  1.5680us  14.080us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=7, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.68%  236.88ms    138640  1.7080us  1.6310us  13.664us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=12, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.67%  230.86ms    153250  1.5060us     992ns  14.304us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
                    0.61%  213.10ms    133408  1.5970us  1.3750us  14.368us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPfNS_9OpReqTypeES4_S4_EEEviDpT0_
                    0.38%  131.73ms     14616  9.0120us  1.0240us  108.19us  [CUDA memcpy DtoD]
                    0.37%  127.57ms      1308  97.532us  40.031us  214.02us  void mshadow::cuda::SoftmaxGradKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>>(mshadow::gpu, int=2, float, long)
                    0.33%  112.78ms      1299  86.822us  32.863us  164.67us  void mshadow::cuda::MapPlanKernel<mshadow::sv::multo, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.28%  96.707ms      6540  14.787us  2.8160us  116.96us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::saveto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    0.18%  62.853ms       169  371.91us  347.84us  465.53us  volta_sgemm_128x128_tn
                    0.17%  57.555ms      5460  10.541us  9.8240us  13.824us  void mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>(int, int, bool, float const *, mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>*, mshadow::Shape<int=2>, mshadow::Shape, mshadow::Shape, mshadow::Shape, int, bool)
                    0.15%  50.856ms      1332  38.180us  1.2480us  46.112us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.14%  47.081ms      5471  8.6050us  1.5030us  14.016ms  [CUDA memcpy DtoH]
                    0.12%  42.434ms      5608  7.5660us  1.5030us  5.8883ms  [CUDA memcpy HtoD]
                    0.03%  10.385ms      1422  7.3030us  3.5200us  13.824us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_7TakeGPUILb1EEEJPfS5_S5_llEEEviDpT0_
                    0.02%  6.6979ms      2730  2.4530us  1.5680us  10.912us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_4pickILi2ELb1EEEJPfS5_PiiiN7mshadow5ShapeILi2EEES9_EEEviDpT0_
                    0.02%  6.3452ms      5688  1.1150us     991ns  12.768us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_
                    0.01%  4.1586ms      2730  1.5230us  1.1840us  10.592us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, int>, int>, mshadow::expr::Plan<mshadow::expr::TypecastExp<int, float, mshadow::Tensor<mshadow::gpu, int=1, float>, int=1>, int>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=1)
                    0.01%  3.6414ms      2730  1.3330us     992ns  7.9680us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op2eqELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.4687ms      2730  1.2700us  1.0560us  13.024us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3logELi1EEEJPfS7_EEEviDpT0_
                    0.01%  3.2744ms      2730  1.1990us     960ns  14.112us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op6rminusELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.0465ms      2730  1.1150us     992ns  13.184us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op7maximumELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  2.1137ms         9  234.85us  232.29us  236.06us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::plusto, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
                    0.01%  2.1085ms         9  234.28us  232.29us  236.29us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::multo, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
      API calls:   56.38%  44.7969s   3145400  14.242us  6.1280us  23.314ms  cudaLaunchKernel
                   19.23%  15.2759s    336467  45.400us  3.7720us  4.0563ms  cudaStreamSynchronize
                    5.44%  4.32640s      1503  2.8785ms  155.14us  3.87303s  cudaMemGetInfo
                    4.45%  3.53308s    291784  12.108us  4.8700us  95.429ms  cudaMemsetAsync
                    3.03%  2.40724s        18  133.74ms     884ns  635.37ms  cudaFree
                    2.67%  2.12090s    291776  7.2680us  2.0380us  2.3136ms  cudaEventQuery
                    2.35%  1.86598s   1993278     936ns       0ns  1.0557ms  cudaPeekAtLastError
                    2.23%  1.77185s     25683  68.989us  8.3060us  23.490ms  cudaMemcpy2DAsync
                    1.38%  1.10031s   1152674     954ns       0ns  975.49us  cudaGetLastError
                    1.10%  877.91ms    291778  3.0080us     981ns  12.349ms  cudaEventRecord
                    0.89%  704.22ms    288598  2.4400us     624ns  10.880ms  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.42%  333.73ms    166750  2.0010us      73ns  1.1830ms  cudaSetDevice
                    0.24%  194.41ms     67011  2.9010us     365ns  1.2253ms  cudaGetDevice
                    0.12%  93.806ms      1547  60.637us  10.066us  2.2598ms  cudaMalloc
                    0.04%  30.262ms      2256  13.413us     467ns  1.6574ms  cuDeviceGetAttribute
                    0.01%  9.0654ms        24  377.73us  239.21us  574.83us  cuDeviceTotalMem
                    0.01%  5.5531ms         4  1.3883ms  954.64us  1.7572ms  cudaGetDeviceProperties
                    0.00%  2.2392ms        24  93.299us  71.233us  155.01us  cuDeviceGetName
                    0.00%  1.5461ms       912  1.6950us     518ns  73.179us  cudaFuncSetAttribute
                    0.00%  1.4418ms       216  6.6750us     830ns  383.84us  cudaEventCreateWithFlags
                    0.00%  1.3181ms         2  659.06us  35.784us  1.2823ms  cudaHostAlloc
                    0.00%  841.92us        12  70.159us  17.552us  539.11us  cudaMemcpy
                    0.00%  812.45us        22  36.929us  13.082us  92.853us  cudaStreamCreateWithFlags
                    0.00%  610.40us         4  152.60us  62.885us  301.45us  cudaStreamCreate
                    0.00%  481.01us         8  60.126us  13.224us  343.28us  cudaStreamCreateWithPriority
                    0.00%  277.53us       202  1.3730us     658ns  23.334us  cudaDeviceGetAttribute
                    0.00%  48.017us        32  1.5000us     635ns  4.4500us  cuDeviceGet
                    0.00%  28.351us         8  3.5430us  2.8770us  5.5370us  cuDeviceGetPCIBusId
                    0.00%  21.491us         6  3.5810us  1.7350us  6.8090us  cudaEventCreate
                    0.00%  21.416us        24     892ns     522ns  1.5600us  cuDeviceGetUuid
                    0.00%  6.6090us         2  3.3040us  3.0640us  3.5450us  cudaHostGetDevicePointer
                    0.00%  6.2010us         5  1.2400us     631ns  2.1050us  cuDeviceGetCount
                    0.00%  4.8960us         4  1.2240us     709ns  2.1750us  cudaGetDeviceCount
                    0.00%  4.5440us         2  2.2720us  2.0340us  2.5100us  cudaDeviceGetStreamPriorityRange
                    0.00%  4.3850us         2  2.1920us  1.7190us  2.6660us  cuInit
                    0.00%  3.8450us         2  1.9220us  1.3320us  2.5130us  cuDriverGetVersion

cu92mkl

==21941== NVPROF is profiling process 21941, command: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==21941== Profiling application: python word_language_model/lstm_bucketing.py --num-hidden 650 --num-embed 650 --gpus 1 --epochs 1 --kv-store local
==21941== Warning: Found 55 invalid records in the result.
==21941== Warning: This can happen if device ran out of memory or if a device kernel was stopped due to an assertion.
==21941== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   33.70%  9.45188s    136024  69.486us  65.248us  87.616us  volta_sgemm_32x32_sliced1x4_nn
                   16.70%  4.68282s    139948  33.461us  17.664us  2.2807ms  volta_sgemm_128x64_nt
                   15.80%  4.43089s    150520  29.437us  25.984us  39.232us  volta_sgemm_32x32_sliced1x4_tn
                    4.36%  1.22354s      1308  935.43us  361.79us  2.3943ms  volta_sgemm_128x64_nn
                    4.17%  1.16848s      1422  821.72us  346.91us  1.9999ms  volta_sgemm_128x128_tn
                    2.85%  800.01ms    373330  2.1420us  2.0150us  13.888us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.66%  745.25ms    349570  2.1310us  1.9840us  14.208us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::expr::SliceExp<mshadow::Tensor<mshadow::gpu, int=3, float>, mshadow::gpu, float, int=3, int=2>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=3, float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=3)
                    2.02%  565.53ms      1308  432.36us  163.46us  1.0837ms  void mshadow::cuda::AddTakeGradKernel<int=6, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, float, long, long, int)
                    1.70%  477.35ms    133408  3.5780us  2.9120us  14.623us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::plusto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    1.63%  456.01ms    151933  3.0010us  2.0150us  163.81us  void mshadow::cuda::MapPlanKernel<mshadow::sv::plusto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    1.41%  395.84ms     14388  27.511us  1.3120us  99.583us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_9SGDKernelEJPfS4_S4_ffffNS_9OpReqTypeEEEEviDpT0_
                    1.33%  371.83ms      1422  261.48us  109.66us  580.57us  void mshadow::cuda::SoftmaxKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=2, long)
                    1.26%  352.15ms    225780  1.5590us  1.5030us  13.088us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=6, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    1.24%  346.55ms    207960  1.6660us  1.5990us  12.320us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=11, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    1.13%  315.67ms    207960  1.5170us  1.3750us  13.888us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op5rightEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    1.07%  300.31ms    205344  1.4620us  1.3110us  14.304us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_19backward_grad_tunedINS0_10mshadow_op4leftEEELi1EEEJPfPKfSB_SB_EEEviDpT0_
                    1.03%  289.23ms    228510  1.2650us     992ns  13.600us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3mulELi1EEEJPfS7_S7_EEEviDpT0_
                    0.92%  257.58ms    150520  1.7110us  1.5360us  13.344us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=7, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=1>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.84%  235.80ms    138640  1.7000us  1.5990us  13.792us  void op_generic_tensor_kernel<int=2, float, float, float, int=256, cudnnGenericOp_t=12, cudnnNanPropagation_t=0, cudnnDimOrder_t=0, int=2>(cudnnTensorStruct, float*, cudnnTensorStruct, float const *, cudnnTensorStruct, float const *, float, float, float, float, dimArray, reducedDivisorArray)
                    0.82%  229.41ms    153250  1.4960us     992ns  13.728us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op4plusELi1EEEJPfS7_S7_EEEviDpT0_
                    0.80%  224.70ms    133408  1.6840us  1.4710us  13.824us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_3SumEJPfNS_9OpReqTypeES4_S4_EEEviDpT0_
                    0.47%  132.46ms     14616  9.0620us  1.0880us  120.67us  [CUDA memcpy DtoD]
                    0.45%  127.20ms      1308  97.248us  40.288us  213.15us  void mshadow::cuda::SoftmaxGradKernel<int=8, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>>(mshadow::gpu, int=2, float, long)
                    0.40%  112.83ms      1299  86.860us  32.864us  164.86us  void mshadow::cuda::MapPlanKernel<mshadow::sv::multo, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.37%  104.37ms      6540  15.958us  3.2640us  120.22us  void mshadow::cuda::MapRedKeepLowestKernel<mshadow::sv::saveto, mshadow::red::sum, int=5, float, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, float>, float>, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>>(mshadow::gpu, int=1, float, mshadow::Shape<int=2>)
                    0.21%  58.328ms      5460  10.682us  10.080us  17.088us  void mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>(int, int, bool, float const *, mxnet::op::broadcast::reduce_kernel<mshadow::red::sum, int=2, float, mxnet::op::mshadow_op::identity, int=1>*, mshadow::Shape<int=2>, mshadow::Shape, mshadow::Shape, mshadow::Shape, int, bool)
                    0.18%  50.957ms      1332  38.256us  1.2480us  44.320us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2)
                    0.17%  46.791ms      5471  8.5520us  1.5030us  13.900ms  [CUDA memcpy DtoH]
                    0.16%  44.861ms      5608  7.9990us  1.5040us  5.6511ms  [CUDA memcpy HtoD]
                    0.04%  10.481ms      1422  7.3700us  3.3600us  14.272us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_7TakeGPUILb1EEEJPfS5_S5_llEEEviDpT0_
                    0.03%  7.1928ms      2730  2.6340us  2.1430us  14.304us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS0_4pickILi2ELb1EEEJPfS5_PiiiN7mshadow5ShapeILi2EEES9_EEEviDpT0_
                    0.02%  6.2377ms      5688  1.0960us     991ns  9.6000us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS1_10set_to_intILi0EEELi1EEEJPfEEEviDpT0_
                    0.01%  4.1941ms      2730  1.5360us  1.1840us  12.704us  void mshadow::cuda::MapPlanKernel<mshadow::sv::saveto, int=8, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=1, int>, int>, mshadow::expr::Plan<mshadow::expr::TypecastExp<int, float, mshadow::Tensor<mshadow::gpu, int=1, float>, int=1>, int>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=1)
                    0.01%  3.6781ms      2730  1.3470us     992ns  7.9680us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op2eqELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.4240ms      2730  1.2540us  1.0560us  8.5440us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op3logELi1EEEJPfS7_EEEviDpT0_
                    0.01%  3.2767ms      2730  1.2000us     992ns  11.456us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op7maximumELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  3.0875ms      2730  1.1300us     992ns  11.648us  _ZN5mxnet2op8mxnet_op20mxnet_generic_kernelINS1_11op_with_reqINS0_10mshadow_op6rminusELi1EEEJPfS7_fEEEviDpT0_
                    0.01%  2.8087ms      1821  1.5420us  1.3760us  13.760us  [CUDA memset]
                    0.01%  2.1258ms         9  236.20us  234.02us  239.55us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::plusto, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::Broadcast1DExp<mshadow::Tensor<mshadow::gpu, int=1, float>, float, int=2, int=1>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
                    0.01%  2.1201ms         9  235.57us  234.24us  238.05us  void mshadow::cuda::MapPlanLargeKernel<mshadow::sv::multo, int=8, int=1024, mshadow::expr::Plan<mshadow::Tensor<mshadow::gpu, int=2, float>, float>, mshadow::expr::Plan<mshadow::expr::ScalarExp<float>, float>>(mshadow::gpu, long, mshadow::Shape<int=2>, int=2, int)
      API calls:   63.12%  42.1805s   3145400  13.410us  5.3740us  25.822ms  cudaLaunchKernel
                   18.65%  12.4655s    336467  37.048us  3.8090us  3.6356ms  cudaStreamSynchronize
                    6.36%  4.24941s      1498  2.8367ms  154.77us  3.89087s  cudaMemGetInfo
                    3.36%  2.24470s        18  124.71ms     728ns  566.12ms  cudaFree
                    2.92%  1.95056s   1993245     978ns       2ns  1.4790ms  cudaPeekAtLastError
                    2.70%  1.80358s     25683  70.224us  8.0210us  20.948ms  cudaMemcpy2DAsync
                    1.76%  1.17849s   1152100  1.0220us       3ns  1.1467ms  cudaGetLastError
                    0.49%  327.96ms    166750  1.9660us      68ns  993.66us  cudaSetDevice
                    0.29%  191.04ms     67011  2.8500us     221ns  910.93us  cudaGetDevice
                    0.16%  110.12ms      1542  71.413us  9.3680us  18.735ms  cudaMalloc
                    0.07%  48.392ms      1821  26.574us  9.4630us  9.4157ms  cudaMemsetAsync
                    0.04%  27.123ms      2256  12.022us     458ns  575.77us  cuDeviceGetAttribute
                    0.02%  16.629ms      1813  9.1710us  4.4710us  62.413us  cudaEventQuery
                    0.01%  8.4049ms        24  350.20us  240.68us  575.37us  cuDeviceTotalMem
                    0.01%  7.6088ms      1815  4.1920us  2.1630us  44.168us  cudaEventRecord
                    0.01%  5.3368ms         4  1.3342ms  965.25us  1.4861ms  cudaGetDeviceProperties
                    0.00%  2.1962ms        24  91.507us  69.921us  158.19us  cuDeviceGetName
                    0.00%  2.1156ms       216  9.7940us     833ns  1.2832ms  cudaEventCreateWithFlags
                    0.00%  1.2896ms         2  644.81us  40.351us  1.2493ms  cudaHostAlloc
                    0.00%  1.1971ms         4  299.28us  51.603us  761.62us  cudaStreamCreate
                    0.00%  841.37us        12  70.113us  13.792us  535.50us  cudaMemcpy
                    0.00%  799.59us       384  2.0820us     772ns  97.788us  cudaFuncSetAttribute
                    0.00%  739.83us        22  33.628us  14.586us  100.29us  cudaStreamCreateWithFlags
                    0.00%  473.78us         8  59.223us  13.159us  335.79us  cudaStreamCreateWithPriority
                    0.00%  247.13us       202  1.2230us     642ns  3.6050us  cudaDeviceGetAttribute
                    0.00%  99.833us        32  3.1190us     527ns  16.689us  cuDeviceGet
                    0.00%  21.441us         6  3.5730us  1.7490us  6.7460us  cudaEventCreate
                    0.00%  21.325us         8  2.6650us  2.1230us  4.6420us  cuDeviceGetPCIBusId
                    0.00%  6.8890us         2  3.4440us  3.3850us  3.5040us  cudaHostGetDevicePointer
                    0.00%  4.9550us         5     991ns     470ns  1.4300us  cuDeviceGetCount
                    0.00%  4.8480us         2  2.4240us  2.0320us  2.8160us  cudaDeviceGetStreamPriorityRange
                    0.00%  4.5770us         2  2.2880us  1.6370us  2.9400us  cuInit
                    0.00%  4.5240us         4  1.1310us     678ns  1.7230us  cudaGetDeviceCount
                    0.00%  4.3320us         2  2.1660us  1.2450us  3.0870us  cuDriverGetVersion

MXNet Profiler API result for 1 epoch
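For reference, aggregate statistics like the tables below can be collected with the MXNet profiler API. A minimal sketch, assuming the standard mx.profiler interface in MXNet 1.4 wrapped around the training loop:

import mxnet as mx

# Enable the profiler with aggregate statistics before building and running the model.
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='lstm_profile.json')
mx.profiler.set_state('run')

# ... build the module and run one epoch of training here ...

mx.nd.waitall()                # make sure all asynchronous work has finished
mx.profiler.set_state('stop')
print(mx.profiler.dumps())     # prints the aggregate "Profile Statistics" tables shown below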
cu100mkl

Profile Statistics.
	Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                         193       81507.3906       26000.0000      328184.1875      151092.0938
Memory: gpu/0                       66993      486890.5625           7.6800      486936.6250      243464.4844

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArraySyncCopyFromCPU               24         165.8120           0.0640          89.2050           6.9088
MXNDArrayReshape64                   2730          11.1690           0.0020           0.0400           0.0041
MXExecutorForward                    1422        1260.3890           0.1420          11.0660           0.8863
MXExecutorBackwardEx                 1308        1223.9100           0.1730           9.1420           0.9357
MXNDArraySlice                       5574          13.2250           0.0000           0.0500           0.0024
MXNDArrayFree                       41162         366.8830           0.0000           0.2150           0.0089
MXNDArrayGetContext                 10958          10.5630           0.0000           0.1010           0.0010
MXSymbolListAttr                        3           5.6350           0.9650           2.7530           1.8783
MXImperativeInvokeEx                50115        1389.6400           0.0070           1.7010           0.0277
MXNDArrayGetDType                   11015          10.1360           0.0000           0.0400           0.0009
MXNet C API Calls                  195096         195.0960           0.0010         195.0960          97.5475
MXNet C API Concurrency            390192           0.0000           0.0000           0.0010           0.0005
MXExecutorSimpleBind                    6        3734.5410          15.9950        3542.6250         622.4235
MXSymbolCreateAtomicSymbol            5736          36.1470           0.0000           0.1140           0.0063
MXNDArraySyncCopyToCPU               5460       80959.1719           0.0850         363.2810          14.8277
MXSymbolGetAttr                         6           0.0140           0.0010           0.0040           0.0023
MXNDArrayGetStorageType               171           0.1540           0.0000           0.0050           0.0009
MXNDArrayCreateEx                    8258          73.3280           0.0010           3.3300           0.0089
MXExecutorOutputs                       6           0.0160           0.0020           0.0030           0.0027
MXExecutorPrint                         6          28.1480           1.7110          10.1980           4.6913
MXNDArrayGetShape                   51136          52.0260           0.0000           0.0440           0.0010

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_plus_scalar                           22          24.1390           0.0160           4.5600           1.0972
CopyGPU2CPU                            22         176.3340           0.0570          20.1660           8.0152
SoftmaxOutput                        2844         812.7720           0.1220           1.4340           0.2858
broadcast_mul                        5460         210.2000           0.0170           0.6040           0.0385
broadcast_add                        5460         208.4420           0.0150           0.8420           0.0382
log                                  5460         159.6020           0.0150           0.4100           0.0292
add_n                              266816        9588.1748           0.0110           2.0040           0.0359
_backward_Activation               693200       13628.1191           0.0120           2.3540           0.0197
_backward_mul                      415920        9997.9492           0.0110           2.0910           0.0240
_backward_copy                      71936         479.4360           0.0030           0.4750           0.0067
sgd_update                          28776        6515.3848           0.0130           2.8010           0.2264
_backward_FullyConnected           279896       37841.1094           0.0860           2.1650           0.1352
Reshape                              5688         146.5510           0.0030           1.4950           0.0258
sum                                 10920         549.3320           0.0250           0.7170           0.0503
_zeros                              11398         529.2270           0.0100           8.5110           0.0464
Activation                         752600       21769.1758           0.0120           3.0950           0.0289
_equal_scalar                        5460         224.6300           0.0160           1.3360           0.0411
_rminus_scalar                       5460         233.0980           0.0190           0.2030           0.0427
CopyCPU2CPU                            88         304.1030           0.0050          27.1770           3.4557
SetupExec                           22932          48.5340           0.0000           0.0970           0.0021
Concat                               2844         431.9350           0.0450           1.2210           0.1519
_full                                  10           0.0810           0.0020           0.0160           0.0081
_backward_SliceChannel             141256        5605.5112           0.0230           1.3650           0.0397
_slice_assign_scalar                    4           0.2660           0.0190           0.1140           0.0665
pick                                 5460         282.8300           0.0180           3.2620           0.0518
_copy                               20928        1165.9010           0.0150           1.3830           0.0557
_random_uniform                        12          74.5140           2.9510          12.7100           6.2095
_backward_SoftmaxOutput              2616         578.9200           0.0920           1.2720           0.2213
SyncCopyGPU2CPU                     10920         582.0080           0.0200           0.1760           0.0533
ResourceParallelRandomSetSeed               2          19.5030           9.7440           9.7590           9.7515
elemwise_mul                       451560        9896.3320           0.0100           2.4910           0.0219
FullyConnected                     303884       21313.8770           0.0450           4.1210           0.0701
_maximum_scalar                      5460         193.6250           0.0140           1.3690           0.0355
DeleteVariable                      65696         385.6180           0.0010           1.4810           0.0059
elemwise_add                       301040        5272.2681           0.0100           1.7210           0.0175
SliceChannel                       153364        6224.5142           0.0230          12.3770           0.0406
_div_scalar                            22          22.9760           0.0100           5.3440           1.0444
expand_dims                         75260        1019.0590           0.0030           1.1830           0.0135
_backward_Concat                     2616         539.6740           0.0490          32.9550           0.2063
CopyCPU2GPU                         11192         781.9070           0.0130           5.8030           0.0699
SetValueOp                             48           3.2530           0.0500           0.1220           0.0678
Embedding                            2844         161.1000           0.0190           1.7600           0.0566
_copyto_GPU2GPU                     10920         567.0010           0.0130           0.2170           0.0519
_backward_Embedding                  2616        1285.4290           0.2150           1.8690           0.4914
WaitForVar                          10968          89.9160           0.0020           0.3290           0.0082

cu92mkl

Profile Statistics.
	Note that counter items are counter values and not time units.
Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: cpu/0                         193       81094.4297       26000.0000      328184.1875      151092.0938
Memory: gpu/0                       66993      486890.5625           7.6800      486936.6250      243464.4844

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArraySyncCopyFromCPU               24         176.6150           0.0330          69.3230           7.3590
MXNDArrayReshape64                   2730          12.2640           0.0020           0.0410           0.0045
MXExecutorForward                    1422        1316.2700           0.1480          14.0690           0.9256
MXExecutorBackwardEx                 1308        1286.3680           0.1690           5.8390           0.9835
MXNDArraySlice                       5574          13.4970           0.0000           0.0590           0.0024
MXNDArrayFree                       41162         396.0320           0.0000           3.7350           0.0096
MXNDArrayGetContext                 10958          10.7350           0.0000           0.0380           0.0010
MXSymbolListAttr                        3           4.0050           1.2550           1.4000           1.3350
MXImperativeInvokeEx                50115        1499.5780           0.0070           2.2110           0.0299
MXNDArrayGetDType                   11015           9.8290           0.0000           0.0370           0.0009
MXNet C API Calls                  195096         195.0960           0.0010         195.0960          97.5475
MXNet C API Concurrency            390192           0.0000           0.0000           0.0010           0.0005
MXExecutorSimpleBind                    6        3615.5139           8.9550        3481.5950         602.5857
MXSymbolCreateAtomicSymbol            5736          30.6770           0.0010           0.1260           0.0053
MXNDArraySyncCopyToCPU               5460       72482.7344           0.0850         249.1380          13.2752
MXSymbolGetAttr                         6           0.0140           0.0010           0.0030           0.0023
MXNDArrayGetStorageType               171           0.1090           0.0000           0.0030           0.0006
MXNDArrayCreateEx                    8258          67.8010           0.0010           3.1960           0.0082
MXExecutorOutputs                       6           0.0150           0.0020           0.0040           0.0025
MXExecutorPrint                         6          25.9240           0.9360          12.1280           4.3207
MXNDArrayGetShape                   51136          55.6040           0.0000           0.0500           0.0011

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_plus_scalar                           22          47.2760           0.0080          10.5180           2.1489
CopyGPU2CPU                            22         156.4810           0.0340          20.2620           7.1128
broadcast_add                        5460         198.4270           0.0160           3.6910           0.0363
_backward_Activation               693200       13550.5508           0.0130           1.9900           0.0195
_backward_mul                      415920       10045.2979           0.0110           2.0190           0.0242
_backward_copy                      71936         495.9330           0.0030           0.6440           0.0069
SoftmaxOutput                        2844         806.0880           0.1250           0.6040           0.2834
Reshape                              5688         145.6910           0.0030           1.0440           0.0256
add_n                              266816        9710.6777           0.0120           1.8540           0.0364
log                                  5460         162.8830           0.0150           2.2690           0.0298
broadcast_mul                        5460         218.6930           0.0170           1.6100           0.0401
sgd_update                          28776        6457.4722           0.0130           2.4300           0.2244
_backward_FullyConnected           279896       23458.5215           0.0410           4.9550           0.0838
_backward_Concat                     2616         520.1320           0.0490          24.3040           0.1988
CopyCPU2GPU                         11192         753.7490           0.0140           7.5330           0.0673
_backward_Embedding                  2616        1299.2860           0.2160           5.2340           0.4967
WaitForVar                          10968          92.5920           0.0020           0.2480           0.0084
_rminus_scalar                       5460         209.0210           0.0170           1.5620           0.0383
CopyCPU2CPU                            88         308.8270           0.0040          40.2790           3.5094
SetupExec                           22932          59.5050           0.0000           3.3250           0.0026
Concat                               2844         438.0390           0.0470           0.6820           0.1540
_full                                  10           4.8940           0.0070           2.4050           0.4894
_backward_SliceChannel             141256        5496.1040           0.0230           3.3710           0.0389
_slice_assign_scalar                    4           9.3150           0.0330           4.6260           2.3287
pick                                 5460         291.0040           0.0180           5.0680           0.0533
_copy                               20928        1201.1370           0.0160           0.7570           0.0574
_random_uniform                        12          97.6060           2.5460          23.4520           8.1338
sum                                 10920         555.5390           0.0260           4.9930           0.0509
_zeros                              11398         543.3210           0.0110          14.9830           0.0477
_equal_scalar                        5460         238.2550           0.0170           9.3850           0.0436
Activation                         752600       21566.7363           0.0120           3.8930           0.0287
ResourceParallelRandomSetSeed               2          19.7140           9.8530           9.8610           9.8570
Embedding                            2844         169.3260           0.0180           7.2920           0.0595
SetValueOp                             48           4.7550           0.0320           0.9550           0.0991
_copyto_GPU2GPU                     10920         581.7290           0.0140          11.1590           0.0533
SliceChannel                       153364        6120.6851           0.0230           9.5310           0.0399
FullyConnected                     303884       19919.7012           0.0440           2.4720           0.0656
elemwise_mul                       451560        9846.6777           0.0110           2.3670           0.0218
_maximum_scalar                      5460         186.1150           0.0150           1.6400           0.0341
DeleteVariable                      65696         389.9880           0.0010           3.7150           0.0059
elemwise_add                       301040        5299.9009           0.0110           1.6230           0.0176
SyncCopyGPU2CPU                     10920         570.1200           0.0240           0.3210           0.0522
_backward_SoftmaxOutput              2616         576.9860           0.0930           2.0890           0.2206
_div_scalar                            22          27.2660           0.0070           7.1150           1.2394
expand_dims                         75260         835.0670           0.0030           0.1240           0.0111
@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Performance

@frankfliu
Contributor

@mxnet-label-bot add [performance, cuda]

@KellenSunderland
Contributor

Good catch, that's a big drop. Would you be able to run nvprof --summary before each run and paste the output here?

@anirudh2290
Member

@KellenSunderland I was looking at this with @stu1130. From the MXNet profiler it looks like _backward_FullyConnected is taking more time. It could be that the cuBLAS GEMM (called internally by FC backward) regressed for certain input shapes. Agreed that nvprof would be worth looking at.
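One way to test that hypothesis directly would be to time just the weight-gradient GEMM (gwmat = grad^T * data) outside of the FullyConnected operator. A minimal sketch, assuming mx.nd.dot exercises the same cuBLAS SGEMM path that _backward_FullyConnected hits (the kernel cuBLAS actually selects may differ):

import time
import mxnet as mx

ctx = mx.gpu(0)
grad = mx.nd.random.uniform(shape=(32, 2600), ctx=ctx)  # output gradient (batch, num_hidden)
data = mx.nd.random.uniform(shape=(32, 650), ctx=ctx)   # layer input (batch, num_input)

# gwmat = grad^T * data, the weight-gradient GEMM inside FC backward
for _ in range(10):                                      # warm up
    mx.nd.dot(grad, data, transpose_a=True)
mx.nd.waitall()

iters = 1000
start = time.time()
for _ in range(iters):
    gwmat = mx.nd.dot(grad, data, transpose_a=True)      # shape (2600, 650)
mx.nd.waitall()
elapsed = time.time() - start
print("avg per GEMM: {:.3f} us".format(elapsed / iters * 1e6))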

@KellenSunderland
Contributor

Good investigation so far, your explanation makes sense @anirudh2290 and @stu1130. I think that's enough information to get NVIDIA started on verifying the regression and looking for root causes and a fix.

If we wanted to investigate further there is one more step we could potentially run to really lock down exactly why this kernel regressed. I've found API logging from cuDNN and cuBLAS to be quite useful. If you want to take a look at them, we could re-run without nvprof and add the env vars:

export CUBLAS_LOGINFO_DBG=1
export CUBLAS_LOGDEST_DBG=/tmp/cublas_api_logs.txt

Then grep /tmp/cublas_api_logs.txt for volta_sgemm_128x64_nt calls and see how many parameter variants we call that kernel with. We could then create a minimal reproducible call to cublas that uses those params (hopefully there aren't many of them) and show the arguments for which the cublas lib regresses.
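A rough sketch of how one might summarize the distinct GEMM configurations in that log follows; the log-line pattern it matches is an assumption about the cuBLAS logging format and will likely need adjusting for the actual cuBLAS version:

# Hypothetical helper: count distinct (call, m, n, k) combinations in the cuBLAS API log.
import re
from collections import Counter

configs = Counter()
with open("/tmp/cublas_api_logs.txt") as f:
    for line in f:
        call = re.search(r"(cublas\w*[Gg]emm\w*)", line)
        # Assumed format: dimensions appear as "m:650 ... n:32 ... k:2600" (or "m=...").
        dims = re.search(r"m[:=]\s*(\d+).*?n[:=]\s*(\d+).*?k[:=]\s*(\d+)", line)
        if call and dims:
            configs[(call.group(1),) + dims.groups()] += 1

for cfg, count in configs.most_common():
    print(cfg, count)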

@anirudh2290
Member

Thanks for the useful suggestion @KellenSunderland. @stu1130 obtained the logs and didn't see any volta_sgemm_128x64_nt calls. We see cublasSgemmEx_internal calls and we tried to grep for a configuration of m=128 n=64 but weren't able to find anything. I am not sure which cublas call maps to the volta_sgemm_128x64_nt kernel, and we have also asked NVIDIA for help. Let us know if you have any ideas.

@stu1130
Contributor Author

stu1130 commented Apr 23, 2019

Update for what @anirudh2290 & I have done so far

  1. Based on the MXNet profiler results (added to the description above), we found that the average time of _backward_FullyConnected increased from 0.0838 ms (CUDA 9.2) to 0.1352 ms (CUDA 10).
  2. We added the following logging below https://github.com/apache/incubator-mxnet/blob/68efc1598240bbe36b91d6660489519431795a5d/src/operator/nn/fully_connected-inl.h#L168
LOG(INFO) << grad.shape_ << " " << data.shape_ << " " << gwmat.shape_;
  3. I wrote a Python script to collect the grad shape, data shape and gwmat shape (a minimal sketch of such a script appears at the end of this comment).
# grad shape/data shape/gwmat shape   count for 1 epoch 
('640,10000', '640,650', '10000,650'): 465
('32,2600', '32,650', '2600,650'): 138640
('960,10000', '960,650', '10000,650'): 440
('1600,10000', '1600,650', '10000,650'): 48
('1280,10000', '1280,650', '10000,650'): 192
('320,10000', '320,650', '10000,650'): 154
('1920,10000', '1920,650', '10000,650'): 9
  4. @anirudh2290 wrote a script to reproduce it:
import time
import random
import mxnet as mx
import numpy as np


mx.random.seed(1234)
np.random.seed(1234)
random.seed(1234)

a = mx.nd.random.uniform(shape=(32, 650), ctx=mx.gpu(0))
b = mx.nd.random.uniform(shape=(2600, 650), ctx=mx.gpu(0))
c = mx.nd.random.uniform(shape=(2600,), ctx=mx.gpu(0))

data = mx.sym.var("data")
weight = mx.sym.var("weight")
bias = mx.sym.var("bias")
out = mx.sym.FullyConnected(data, weight, bias, flatten=True, no_bias=False, num_hidden=2600)
out = mx.sym.make_loss(out)
ex = out.bind(args={"data": a, "weight": b, "bias": c},
              args_grad={"data": mx.nd.zeros(a.shape, ctx=mx.gpu(0)),
                         "weight": mx.nd.zeros(b.shape, ctx=mx.gpu(0)),
                         "bias": mx.nd.zeros(c.shape, ctx=mx.gpu(0))},
              ctx=mx.gpu(0))
mx.nd.waitall()
total_cost = 0
num = 100
warmup = 5
for i in range(num):
    ex.forward(is_train=True, data=a, weight=b, bias=c)
    start = time.time()
    ex.backward()
    ex.outputs[0].asnumpy()
    ex.grad_arrays[0].asnumpy()
    mx.nd.waitall()
    end = time.time() - start
    if i < warmup:
        # warm-up iterations run but are not counted in the average
        continue
    total_cost += end
print("Average: {}".format(total_cost / (num - warmup)))

And we got the result

('640,10000', '640,650', '10000,650')
CUDA 10: Average: 0.00515265703201294
CUDA 9.2: Average: 0.005957033634185791
('32,2600', '32,650', '2600,650')
CUDA 10: Average: 0.0008404040336608886
CUDA 9.2: Average: 0.0008698630332946777
('960,10000', '960,650', '10000,650')
CUDA 10: Average: 0.02021113634109497
CUDA 9.2: Average: 0.01884469509124756
('1600,10000', '1600,650', '10000,650')
CUDA 10: Average: 0.030648748874664306
CUDA 9.2: Average: 0.02739755630493164
('1280,10000', '1280,650', '10000,650')
CUDA 10: Average: 0.023832192420959474
CUDA 9.2: Average: 0.02541511058807373
('320,10000', '320,650', '10000,650')
CUDA 10: Average: 0.003631892204284668
CUDA 9.2: Average: 0.0030696773529052734
('1920,10000', '1920,650', '10000,650')
CUDA 10: Average: 0.03472873210906982
CUDA 9.2: Average: 0.03499695301055908
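As referenced in step 3 above, a minimal sketch of the shape-collection script could look like the following. It assumes each instrumented log line ends with the three shapes printed by the LOG(INFO) statement, e.g. "... (640,10000) (640,650) (10000,650)"; the exact log format is an assumption:

# Hypothetical log parser for step 3: count the (grad, data, gwmat) shape triples
# printed by the LOG(INFO) statement added to fully_connected-inl.h.
import re
import sys
from collections import Counter

shape_re = re.compile(r"\((\d+,\d+)\)\s+\((\d+,\d+)\)\s+\((\d+,\d+)\)\s*$")

counts = Counter()
with open(sys.argv[1]) as f:      # path to the captured MXNet training log
    for line in f:
        m = shape_re.search(line)
        if m:
            counts[m.groups()] += 1

for shapes, n in counts.most_common():
    print(shapes, n)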

@stu1130
Contributor Author

stu1130 commented Apr 23, 2019

Reran the minimal reproducible script shown above with the shape combination grad (32,2600), data (32,650), gwmat (2600,650). Set num = 100000 and collected the result along with nvprof -s:

# CUDA 10.0 mxnet-cu100mkl
GPU activities:   35.43%  7.72535s     99995  77.257us  71.167us  83.295us  volta_sgemm_32x32_sliced1x4_nn
                   29.99%  6.53866s     99995  65.389us  62.623us  72.287us  volta_sgemm_128x64_nt
                   16.11%  3.51241s    199990  17.562us  7.2000us  49.471us  [CUDA memcpy DtoH]
                   13.52%  2.94757s     99995  29.477us  27.872us  34.559us  volta_sgemm_64x32_sliced1x4_tn
...
Average: 0.001091881209394027
Total: 109.18266153335571
------------------------------------------------------------------------------
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   44.88%  7.94254s     99995  79.429us  75.583us  84.703us  volta_sgemm_32x32_sliced1x4_nn
                   19.34%  3.42300s    199990  17.115us  7.2950us  58.656us  [CUDA memcpy DtoH]
                   17.95%  3.17554s     99995  31.757us  29.952us  38.655us  volta_sgemm_32x32_sliced1x4_tn
                   12.94%  2.28917s     99995  22.892us  20.927us  29.280us  volta_sgemm_128x64_nt
...
Average: 0.0009327297395715428
Total: 93.26831030845642

We can see that volta_sgemm_128x64_nt on CUDA 10 takes almost 3 times longer than on CUDA 9.2 (65.4 us vs 22.9 us on average). The total time is nevertheless similar because volta_sgemm_32x32_sliced1x4_nn dominates the execution time in this case, which is not true for the LSTM model.
Results for the other shape combination runs are below.

  • Note that all of them were run only 100 times (num = 100)
shape combination ('640,10000', '640,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:   54.06%  256.89ms       190  1.3521ms  133.44us  11.855ms  [CUDA memcpy DtoH]
                   15.03%  71.430ms        95  751.90us  707.26us  837.63us  volta_sgemm_128x64_nn
                   14.59%  69.329ms        95  729.77us  690.62us  809.12us  volta_sgemm_128x64_nt
                   12.93%  61.456ms        95  646.91us  568.61us  730.72us  volta_sgemm_128x64_tn
                    1.33%  6.3357ms        95  66.691us  65.983us  67.487us
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   56.68%  295.08ms       190  1.5531ms  133.41us  13.890ms  [CUDA memcpy DtoH]
                   14.10%  73.427ms        95  772.92us  705.98us  839.36us  volta_sgemm_128x64_nn
                   13.10%  68.185ms        95  717.74us  655.04us  775.13us  volta_sgemm_128x64_nt
                   13.00%  67.684ms        95  712.47us  640.54us  778.40us  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
shape combination ('960,10000', '960,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:   79.61%  1.38828s       190  7.3068ms  289.41us  18.374ms  [CUDA memcpy DtoH]
                    6.66%  116.15ms        95  1.2226ms  1.2185ms  1.2268ms  volta_sgemm_128x64_nn
                    6.42%  111.92ms        95  1.1781ms  1.1731ms  1.1876ms  volta_sgemm_128x64_nt
                    5.98%  104.26ms        95  1.0975ms  1.0937ms  1.1239ms  volta_sgemm_32x128_tn
                    0.54%  9.4320ms        95  99.284us  98.719us  99.839us
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   80.65%  1.45573s       190  7.6617ms  270.69us  18.452ms  [CUDA memcpy DtoH]
                    6.43%  116.12ms        95  1.2223ms  1.2162ms  1.2279ms  volta_sgemm_128x64_nn
                    6.03%  108.86ms        95  1.1459ms  1.1444ms  1.1490ms  volta_sgemm_128x64_nt
                    5.61%  101.24ms        95  1.0657ms  1.0618ms  1.0718ms  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
shape combination ('1600,10000', '1600,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:   81.71%  2.63850s       190  13.887ms  534.24us  32.062ms  [CUDA memcpy DtoH]
                    6.48%  209.16ms        95  2.2017ms  2.1817ms  2.3835ms  volta_sgemm_128x64_nn
                    5.67%  183.19ms        95  1.9283ms  1.9143ms  1.9402ms  volta_sgemm_128x64_nt
                    5.01%  161.71ms        95  1.7023ms  1.6479ms  1.7094ms  volta_sgemm_128x64_tn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   81.32%  2.57187s       190  13.536ms  505.56us  32.412ms  [CUDA memcpy DtoH]
                    6.63%  209.82ms        95  2.2086ms  2.1808ms  2.3828ms  volta_sgemm_128x64_nn
                    5.71%  180.48ms        95  1.8998ms  1.8987ms  1.9037ms  volta_sgemm_128x64_nt
                    5.19%  164.03ms        95  1.7266ms  1.7225ms  1.7318ms  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
shape combination ('1280,10000', '1280,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:   82.14%  2.12338s       190  11.176ms  451.01us  26.345ms  [CUDA memcpy DtoH]
                    5.94%  153.49ms        95  1.6157ms  1.6120ms  1.6218ms  volta_sgemm_128x64_nn
                    5.71%  147.61ms        95  1.5538ms  1.5445ms  1.5637ms  volta_sgemm_128x64_nt
                    5.06%  130.83ms        95  1.3772ms  1.3723ms  1.3809ms  volta_sgemm_32x128_tn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   82.17%  2.08856s       190  10.992ms  467.93us  26.238ms  [CUDA memcpy DtoH]
                    5.99%  152.15ms        95  1.6016ms  1.5995ms  1.6043ms  volta_sgemm_128x64_nn
                    5.70%  144.78ms        95  1.5240ms  1.5219ms  1.5288ms  volta_sgemm_128x64_nt
                    4.98%  126.54ms        95  1.3320ms  1.3283ms  1.3399ms  volta_sgemm_128x128_tn
------------------------------------------------------------------------------
shape combination ('320,10000', '320,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:   58.78%  186.02ms       190  979.07us  65.055us  6.4167ms  [CUDA memcpy DtoH]
                   14.23%  45.045ms        95  474.16us  457.02us  492.99us  volta_sgemm_128x64_nt
                   12.23%  38.704ms        95  407.41us  348.99us  540.25us  volta_sgemm_128x128_tn
                   11.78%  37.298ms        95  392.61us  359.84us  424.28us  volta_sgemm_128x64_nn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   62.66%  207.77ms       190  1.0935ms  65.087us  6.6399ms  [CUDA memcpy DtoH]
                   11.83%  39.221ms        95  412.86us  349.63us  540.79us  volta_sgemm_128x128_tn
                   11.48%  38.053ms        95  400.56us  360.64us  424.86us  volta_sgemm_128x64_nn
                   11.14%  36.947ms        95  388.92us  354.17us  423.29us  volta_sgemm_128x64_nt
------------------------------------------------------------------------------
shape combination ('1920,10000', '1920,650', '10000,650')
# CUDA 10.0 mxnet-cu100mkl
GPU activities:   82.80%  3.29555s       190  17.345ms  664.22us  40.348ms  [CUDA memcpy DtoH]
                    5.70%  227.01ms        95  2.3896ms  2.3852ms  2.3959ms  volta_sgemm_128x64_nn
                    5.52%  219.56ms        95  2.3111ms  2.2841ms  2.3172ms  volta_sgemm_128x64_nt
                    4.81%  191.49ms        95  2.0157ms  1.9679ms  2.0367ms  volta_sgemm_128x64_tn
# CUDA 9.2 mxnet-cu92mkl
GPU activities:   80.50%  2.80315s       190  14.753ms  664.44us  36.470ms  [CUDA memcpy DtoH]
                    6.52%  227.09ms        95  2.3904ms  2.3850ms  2.3956ms  volta_sgemm_128x64_nn
                    6.21%  216.39ms        95  2.2778ms  2.2743ms  2.2854ms  volta_sgemm_128x64_nt
                    5.44%  189.46ms        95  1.9943ms  1.9876ms  2.0082ms  volta_sgemm_128x128_tn

@rakababa

rakababa commented May 2, 2019

I am using mxnet-cu100 for CUDA 10, but I am seeing that irrespective of using mx.cpu() or mx.gpu(), the system always ends up using the GPU. Does anyone know why?

@stu1130
Contributor Author

stu1130 commented May 8, 2019

Tested the same model with CUDA 10.1 and got a throughput of around 1027.52625, which is similar to the 1005.25185 we got on CUDA 9.2.

@stu1130
Contributor Author

stu1130 commented May 16, 2019

After syncing up with NVIDIA, the solution is to either upgrade to CUDA 10.1 or downgrade to CUDA 9.2.
