# Operator Optimization

## Hardware and Software Configuration

- Hardware: Alibaba Cloud ECS general-purpose instance family with high clock speed, ecs.hfg7.2xlarge
- CPU: 8 cores
- Baseline version: TensorFlow v1.15.5
- Optimized version: DeepRec
- GCC version: 7.5.0

## Performance Data

| Op Name | Input Tensor Shape | Baseline Latency (ms) | Optimized Latency (ms) | Speedup |
| --- | --- | --- | --- | --- |
| Select | condition: (1024, 64), x: (1024, 64), y: (1024, 64) | 2.080 | 0.564 | 3.68x |
| DynamicStitch | indices: (40, 2500), data: (40, 2500, 64) | 82.14 | 24.77 | 3.31x |
| Transpose | data: (1024, 64) | 1.504 | 0.366 | 4.11x |
| Tile | input: (512, 50), multiples: (2, 50) | 1.68 | 0.125 | 13.44x |
| BiasAddGrad | data: (51200, 512) | 26.84 | 1.67 | 16.07x |
| SparseSegmentMean | data: (51200, 128), indices: (51200), seg index: (51200) | 1.93 | 0.445 | 4.34x |
| Unique | | | | |
| Gather | | | | |
| BiasAdd | | | | |
| Where | | | | |
| DynamicPartition | | | | |
| SparseConcat | | | | |
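
The latencies above were measured on the configuration listed earlier. For illustration only, a single-op latency of this kind can be estimated with a warm-up-plus-averaging timing loop; the sketch below uses a placeholder scalar kernel and the (1024, 64) Select shape, not the actual benchmark harness:

```c++
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Placeholder scalar kernel standing in for the operator under test.
static void select_kernel(const std::vector<char>& cond, const std::vector<float>& x,
                          const std::vector<float>& y, std::vector<float>& out) {
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = cond[i] ? x[i] : y[i];
}

int main() {
  const std::size_t n = 1024 * 64;  // matches the (1024, 64) Select benchmark shape
  std::vector<char> cond(n, 1);
  std::vector<float> x(n, 1.0f), y(n, 0.0f), out(n);

  const int warmup = 10, iters = 100;
  for (int i = 0; i < warmup; ++i) select_kernel(cond, x, y, out);

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) select_kernel(cond, x, y, out);
  const auto end = std::chrono::steady_clock::now();

  const double ms =
      std::chrono::duration<double, std::milli>(end - start).count() / iters;
  std::printf("average latency: %.3f ms\n", ms);
  return 0;
}
```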

## Case Study: Select

The computation performed by the Select operator:

![Select computation](select.png)
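
For reference, a minimal plain-C++ sketch of the element-wise Select semantics with same-shape inputs, as in the benchmark table above; the function name and flat row-major layout are illustrative, not DeepRec code:

```c++
#include <cstddef>

// Reference semantics of element-wise Select with same-shape inputs:
// out[i] = cond[i] ? x[i] : y[i].
void select_reference(const bool* cond, const float* x, const float* y,
                      float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = cond[i] ? x[i] : y[i];
  }
}
```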

TensorFlow original implementation: Broadcast + element-wise Select

```c++
template <typename Device, typename T, int NDIMS>
struct BCastSelectFunctorBase {
  void operator()(const Device& d,
                  typename TTypes<T, NDIMS>::Tensor output_tensor,
                  typename TTypes<bool, NDIMS>::ConstTensor cond_tensor,
                  typename TTypes<T, NDIMS>::ConstTensor then_tensor,
                  typename TTypes<T, NDIMS>::ConstTensor else_tensor,
                  typename Eigen::array<Eigen::DenseIndex, NDIMS> cond_bcast,
                  typename Eigen::array<Eigen::DenseIndex, NDIMS> then_bcast,
                  typename Eigen::array<Eigen::DenseIndex, NDIMS> else_bcast) {
    output_tensor.device(d) = cond_tensor.broadcast(cond_bcast)
                                  .select(then_tensor.broadcast(then_bcast),
                                          else_tensor.broadcast(else_bcast));
  }
};
```
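
For a standalone illustration of this broadcast-plus-select pattern (not TensorFlow source; the shapes and names are assumptions), the Eigen snippet below expands a per-row condition to the full output shape before selecting element-wise, which is the redundant work the optimizations below remove:

```c++
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  const Eigen::DenseIndex rows = 1024, cols = 64;

  // Per-row condition and same-shape then/else operands.
  Eigen::Tensor<bool, 2> cond(rows, 1);
  Eigen::Tensor<float, 2> then_t(rows, cols), else_t(rows, cols);
  cond.setConstant(true);
  then_t.setConstant(1.0f);
  else_t.setConstant(0.0f);

  // The (rows, 1) condition is expanded to (rows, cols) before the
  // element-wise select; this is the broadcast the optimized versions avoid.
  Eigen::array<Eigen::DenseIndex, 2> cond_bcast{{1, cols}};
  Eigen::Tensor<float, 2> out =
      cond.broadcast(cond_bcast).select(then_t, else_t);

  return out(0, 0) == 1.0f ? 0 : 1;
}
```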

PAI-TF (merged into the community): Row Select, which removes the redundant broadcast operations in the original TensorFlow implementation.

```c++
// Per-row selection: c[i] is read once, and the whole row (batch_size
// elements starting at offset) is copied from either t or e.
if (c[i]) {
  for (size_t j = 0; j < batch_size; ++j) {
    output[offset + j] = t[offset + j];
  }
} else {
  for (size_t j = 0; j < batch_size; ++j) {
    output[offset + j] = e[offset + j];
  }
}
```
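
Reading the condition once per row avoids materializing the broadcast condition tensor over the full output shape and turns the inner loop into a contiguous row copy.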

DeepRec: vectorized Row Select, which uses AVX-512 masked load/store instructions to further optimize the row copy and improves the performance of this operator by 3.68x.

```c++
// Per-row mask: all-ones selects rows from e, all-zeros keeps rows from t.
__mmask16 cmask = (c[i] == false) ? 0xffff : 0x0000;
size_t ofs = 0;

// Full 16-float (512-bit) chunks of the row.
for (size_t j = 0; j < quotient; ++j) {
    __m512 src = _mm512_loadu_ps(t + offset + ofs);
    // Lanes enabled in cmask are overwritten with values from e.
    __m512 tmp = _mm512_mask_loadu_ps(src, cmask, e + offset + ofs);
    _mm512_storeu_ps(output + offset + ofs, tmp);
    ofs += float_alignment;
}

// Tail of the row: restrict loads and stores to the remaining lanes.
if (remainder != 0) {
    __mmask16 mask = (remainder >= float_alignment)
        ? 0xffff : 0xffff >> (float_alignment - remainder);
    cmask &= mask;
    __m512 src = _mm512_mask_loadu_ps(_mm512_setzero_ps(), mask, t + offset + ofs);
    __m512 tmp = _mm512_mask_loadu_ps(src, cmask, e + offset + ofs);
    _mm512_mask_storeu_ps(output + offset + ofs, mask, tmp);
}
```
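
A minimal driver sketch for the fragment above, assuming `float_alignment` is 16 (one `__m512` register holds 16 floats) and that `quotient`/`remainder` split each row of `batch_size` floats; the wrapper function and its signature are illustrative, not DeepRec's actual code:

```c++
#include <immintrin.h>
#include <cstddef>

// Illustrative wrapper (not DeepRec's actual function); requires AVX-512F,
// e.g. compile with -mavx512f. c holds one bool per row; t, e and output are
// row-major [num_rows x batch_size] float buffers.
void row_select_avx512(const bool* c, const float* t, const float* e,
                       float* output, std::size_t num_rows, std::size_t batch_size) {
  const std::size_t float_alignment = 16;  // floats per 512-bit register
  const std::size_t quotient = batch_size / float_alignment;
  const std::size_t remainder = batch_size % float_alignment;

  for (std::size_t i = 0; i < num_rows; ++i) {
    const std::size_t offset = i * batch_size;
    __mmask16 cmask = (c[i] == false) ? 0xffff : 0x0000;  // all-ones selects e
    std::size_t ofs = 0;

    for (std::size_t j = 0; j < quotient; ++j) {
      __m512 src = _mm512_loadu_ps(t + offset + ofs);
      __m512 tmp = _mm512_mask_loadu_ps(src, cmask, e + offset + ofs);
      _mm512_storeu_ps(output + offset + ofs, tmp);
      ofs += float_alignment;
    }

    if (remainder != 0) {
      // remainder < float_alignment here, so only the low `remainder` lanes are active.
      __mmask16 mask = static_cast<__mmask16>(0xffff >> (float_alignment - remainder));
      cmask &= mask;
      __m512 src = _mm512_mask_loadu_ps(_mm512_setzero_ps(), mask, t + offset + ofs);
      __m512 tmp = _mm512_mask_loadu_ps(src, cmask, e + offset + ofs);
      _mm512_mask_storeu_ps(output + offset + ofs, mask, tmp);
    }
  }
}
```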