HPC-on-matrix

The purpose of this project is to show how high-performance computing (i.e., parallel and heterogeneous computing) can accelerate common matrix-operation algorithms as well as some artificial intelligence techniques. It contains demos of matrix Euclidean distance, matrix multiplication, 2D convolution, and 3D stencil computation.

GEMM-based Matrix Euclidean Distance (GED)

  • Environment

    • Windows 10
    • VS2022 Release x64
    • AMD Ryzen 5 3500X (~4.0 GHz, L1 = 64 KB, multiprocessing disabled)
    • DDR4 16 GB 3000 MHz
  • Implemented GED:

    | Methods       | CHUNK | SSE/AVX | Time (ms) | Validation | Speed-up | % of peak performance |
    | ------------- | :---: | :-----: | --------: | ---------- | -------: | --------------------: |
    | CPU serial    |       |         |     20.29 | pass       |       1x |                  4.22 |
    | CPU chunk-avx |   ✓   |    ✓    |      2.51 | pass       |    8.09x |                 34.14 |
  • General settings:

    • Size of Matrix1: (M, K)=(1024, 1024), Matrix2: (K, N)=(1024, 25)
  • Single-core peak performance (GFLOPS) for different matrix sizes: (figure in the repository)

  • Demonstration of AVX GEMM: (figure in the repository)
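The point of the GEMM-based formulation is the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b): the distance matrix reduces to two cheap norm passes plus one (M, K) * (K, N) matrix product, so any optimized GEMM (such as the chunk-avx variant above) can carry the hot loop. Below is a minimal serial sketch; the names are illustrative, and the plain inner loop stands in for the optimized GEMM.

```cpp
#include <cstddef>
#include <vector>

// D(i, j) = ||A_i - B_j||^2 for rows A_i of A (M x K, row-major) and
// columns B_j of B (K x N, row-major), matching the shapes above.
void gemm_euclidean_dist2(const float* A, const float* B, float* D,
                          std::size_t M, std::size_t N, std::size_t K) {
    std::vector<float> a2(M, 0.0f), b2(N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)              // ||A_i||^2
        for (std::size_t k = 0; k < K; ++k)
            a2[i] += A[i * K + k] * A[i * K + k];
    for (std::size_t j = 0; j < N; ++j)              // ||B_j||^2
        for (std::size_t k = 0; k < K; ++k)
            b2[j] += B[k * N + j] * B[k * N + j];
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float dot = 0.0f;                        // GEMM hot loop: A * B
            for (std::size_t k = 0; k < K; ++k)
                dot += A[i * K + k] * B[k * N + j];
            D[i * N + j] = a2[i] + b2[j] - 2.0f * dot;
        }
}
```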

Matrix Multiply

  • General Matrix Multiply:

    • CPU Entrance: general_matrix_multiply/general_matrix_multiply/
    • GPU Entrance: GeneralMatrixMultiply_GPN/GeneralMatrixMultiply_GPN/
  • Environment:

    • Windows 10
    • VS2017 Release x64 (all compiler auto-optimization disabled)
    • Intel Core i7-4720HQ (4 cores / 8 threads, L1 = 32 KB, L3 = 6144 KB)
    • DDR3L 8 GB 1600 MHz
    • GeForce GTX 960M (640 CUDA cores)
  • Implemented Matrix Multiply (MM):

    | Methods                          | OpenMP | CHUNK | SSE/AVX | CUDA | NEON | Time (ms) | Validation | Speed-up | % of peak performance |
    | -------------------------------- | :----: | :---: | :-----: | :--: | :--: | --------: | ---------- | -------: | --------------------: |
    | CPU serial                       |        |       |         |      |      |   10757.1 | pass       |       1x |                       |
    | CPU chunk-serial                 |        |   ✓   |         |      |      |   4464.14 | pass       |    2.41x |                       |
    | CPU omp                          |   ✓    |       |         |      |      |   2307.25 | pass       |    4.66x |                       |
    | CPU chunk-omp                    |   ✓    |   ✓   |         |      |      |   1487.28 | pass       |    7.23x |                       |
    | CPU chunk-avx                    |        |   ✓   |    ✓    |      |      |   978.398 | pass       |   10.99x |                       |
    | CPU chunk-avx-omp                |   ✓    |   ✓   |    ✓    |      |      |   270.828 | pass       |   39.72x |                       |
    | GPU AB                           |        |       |         |  ✓   |      |     32.91 | pass       |  326.86x |                       |
    | GPU ABT                          |        |       |         |  ✓   |      |    166.89 | pass       |   64.46x |                       |
    | GPU ATB                          |        |       |         |  ✓   |      |     31.74 | pass       |  338.91x |                       |
    | GPU AB (shared memory)           |        |       |         |  ✓   |      |     20.80 | pass       |  517.17x |                       |
    | ARM Cortex-A15 (L1 cache tiling) |        |   ✓   |         |      |  ✓   |           | pass       |          |                 70.56 |
  • General settings:

    • Size of Matrix1 * Matrix2: (1024, 1024) * (1024, 1024)
    • Chunk size: 64 * 64 elements
    • OMP threads: 8
    • AVX SIMD width: 8 * float (32-bit)
    • AVX rows per chunk pass: 2 (depends on how many AVX registers the machine has; a simplified sketch follows this list)
    • CUDA: Block(32, 32)
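As a reference for how these settings combine, here is a minimal sketch of the chunk + AVX + OpenMP variant. The 1024-sized matrices and 64 * 64 chunks match the settings above; the function name, the single-row inner kernel (the repository processes 2 rows per pass), and the use of FMA are simplifying assumptions.

```cpp
#include <immintrin.h>

constexpr int N = 1024;   // square matrices, as in the settings above
constexpr int T = 64;     // 64 * 64 chunk

// C += A * B; all matrices are row-major N x N floats (N divisible by T, T by 8).
void matmul_chunk_avx_omp(const float* A, const float* B, float* C) {
    #pragma omp parallel for                      // each thread owns a band of C rows
    for (int ii = 0; ii < N; ii += T)
        for (int jj = 0; jj < N; jj += T)
            for (int kk = 0; kk < N; kk += T)     // chunks sized to stay cache-resident
                for (int i = ii; i < ii + T; ++i)
                    for (int k = kk; k < kk + T; ++k) {
                        __m256 a = _mm256_set1_ps(A[i * N + k]);   // broadcast A element
                        for (int j = jj; j < jj + T; j += 8) {
                            __m256 c = _mm256_loadu_ps(&C[i * N + j]);
                            __m256 b = _mm256_loadu_ps(&B[k * N + j]);
                            // FMA needs AVX2; split into mul + add on AVX-only CPUs
                            _mm256_storeu_ps(&C[i * N + j], _mm256_fmadd_ps(a, b, c));
                        }
                    }
}
```

Parallelizing only the outermost `ii` loop keeps each thread writing to a disjoint band of C rows, so no synchronization is needed inside the chunk loops.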

3D Stencil computing

  • 3D Stencil computing:

    • CPU/GPU Entrance: Stencil/Stencil/
  • Environment:

    • Windows 10
    • VS2017 Release x64 (all compiler auto-optimization disabled)
    • Intel Core i7-4720HQ (4 cores / 8 threads, L3 = 6144 KB)
    • DDR3L 8 GB 1600 MHz
    • GeForce GTX 960M (640 CUDA cores)
  • Implemented 3D Stencil computing:

    | Methods    | OpenMP | CUDA | Time (ms) | Validation | Speed-up |
    | ---------- | :----: | :--: | --------: | ---------- | -------: |
    | CPU serial |        |      |     11352 | pass       |       1x |
    | GPU        |        |  ✓   |     70.47 | pass       |  161.09x |
  • General settings:

    • Size of Tensor: (512, 512, 512)
    • Stencil radius: 6
    • Kernel size: 13 points per axis (2 * 6 + 1; see the serial sketch after this list)
    • CUDA: Block(32, 32)
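To make these settings concrete: with radius R = 6 the kernel spans 2R + 1 = 13 points along each axis. A serial reference sketch of such a stencil follows; the axis-aligned star shape and all names are assumptions, since the exact stencil form is not spelled out here.

```cpp
#include <cstddef>

constexpr int NZ = 512, NY = 512, NX = 512;  // tensor size from the settings
constexpr int R  = 6;                        // stencil radius (kernel size 13)

inline std::size_t idx(int z, int y, int x) {
    return ((std::size_t)z * NY + y) * NX + x;
}

// w[0] weights the centre point; w[1..R] weight the six arms at each distance.
void stencil3d_serial(const float* in, float* out, const float* w) {
    for (int z = R; z < NZ - R; ++z)
        for (int y = R; y < NY - R; ++y)
            for (int x = R; x < NX - R; ++x) {
                float acc = w[0] * in[idx(z, y, x)];
                for (int r = 1; r <= R; ++r)      // walk the six arms of the star
                    acc += w[r] * (in[idx(z, y, x - r)] + in[idx(z, y, x + r)] +
                                   in[idx(z, y - r, x)] + in[idx(z, y + r, x)] +
                                   in[idx(z - r, y, x)] + in[idx(z + r, y, x)]);
                out[idx(z, y, x)] = acc;
            }
}
```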
  • Reference:

  • X86 CPU's dilemma:

    • A 3D stencil accesses data consecutively along three dimensions, but the cache on x86 is one-dimensional. Consider 3D tiling in the L1 cache: a 32 KB L1 holds 8 * 1024 32-bit floats, so a cubic tile has an edge length of 16 (the largest power-of-two cube that fits). Tiling in the 256 KB L2 cache instead gives a cubic tile edge of 32. Here is the problem: an x86 cache line is 64 bytes (16 floats), so each tile row spans only one or two cache lines; the hardware prefetcher keeps pulling in lines that lie outside the tile, so prefetching cannot be used properly and may even hurt performance.

    • Whether the tiling targets L1 or L2, the tile is too small to exploit the capacity of the x86 cache hierarchy, which is why any 3D stencil implementation on x86 ends up limited by L3-cache and RAM performance.

    • The same problem appears on an NVIDIA GPU when trying to keep 3D data in shared memory. A way to solve it is given in the reference above, and my code shows exactly how to do it; a simplified sketch of the idea follows.
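Below is a minimal sketch of that technique, commonly called 2.5D tiling or "z-streaming": only one x/y slab of the volume sits in shared memory, while the z-direction neighbours travel through per-thread registers. For brevity it assumes a radius-1 7-point stencil with uniform weights rather than the radius-6 kernel benchmarked above, and all identifiers are illustrative, not the repository's.

```cuda
#define TILE 32

// 7-point, radius-1 stencil over an nx * ny * nz volume (x fastest).
// Launch with dim3 block(TILE, TILE) and a grid covering nx * ny;
// the z = 0 and z = nz - 1 boundary planes are left untouched.
__global__ void stencil7_z_stream(const float* in, float* out,
                                  int nx, int ny, int nz) {
    __shared__ float slab[TILE][TILE];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    bool inside   = (x < nx && y < ny);                      // may load
    bool interior = (x > 0 && x < nx - 1 && y > 0 && y < ny - 1);  // may write

    // z neighbours are streamed through registers instead of shared memory.
    float behind = inside ? in[(size_t)y * nx + x]            : 0.f;  // z - 1
    float cur    = inside ? in[((size_t)ny + y) * nx + x]     : 0.f;  // z

    for (int z = 1; z < nz - 1; ++z) {
        float infront = inside ? in[((size_t)(z + 1) * ny + y) * nx + x] : 0.f;
        __syncthreads();                        // previous slab no longer read
        slab[threadIdx.y][threadIdx.x] = cur;   // publish this thread's centre
        __syncthreads();
        if (interior) {
            // x/y neighbours: shared memory inside the tile, global on the rim
            float xm = threadIdx.x > 0        ? slab[threadIdx.y][threadIdx.x - 1]
                                              : in[((size_t)z * ny + y) * nx + (x - 1)];
            float xp = threadIdx.x < TILE - 1 ? slab[threadIdx.y][threadIdx.x + 1]
                                              : in[((size_t)z * ny + y) * nx + (x + 1)];
            float ym = threadIdx.y > 0        ? slab[threadIdx.y - 1][threadIdx.x]
                                              : in[((size_t)z * ny + (y - 1)) * nx + x];
            float yp = threadIdx.y < TILE - 1 ? slab[threadIdx.y + 1][threadIdx.x]
                                              : in[((size_t)z * ny + (y + 1)) * nx + x];
            out[((size_t)z * ny + y) * nx + x] =
                (behind + infront + xm + xp + ym + yp + cur) / 7.0f;
        }
        behind = cur;      // slide the register window one plane along z
        cur    = infront;
    }
}
```

This keeps only a 2D tile in shared memory, so the per-block footprint stays small no matter how deep the volume is along z.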

2D Convolution

  • 2D Convolution:

    • CPU Entrance: 2D_Convolution/2D_Convolution/
    • GPU Entrance: 2D_Convolution_GPU/2D_Convolution_GPU/
  • Environment:

    • Windows 10
    • VS2017 Release x64 (all compiler auto-optimization disabled)
    • Intel Core i7-4720HQ (4 cores / 8 threads, L3 = 6144 KB)
    • DDR3L 8 GB 1600 MHz
    • GeForce GTX 960M (640 CUDA cores)
  • Implemented 2D Convolution:

    | Methods                    | OpenMP | CHUNK | SSE/AVX | CUDA | NEON | Time (ms) | Validation | Speed-up   | % of peak performance |
    | -------------------------- | :----: | :---: | :-----: | :--: | :--: | --------: | ---------- | ---------: | --------------------: |
    | CPU serial                 |        |       |         |      |      |   1459.82 | pass       |         1x |                       |
    | CPU serial-loop-unroll     |        |       |         |      |      |   1218.51 | pass       |       1.2x |                       |
    | CPU avx                    |        |       |    ✓    |      |      |   327.698 | pass       |      4.45x |                       |
    | CPU chunk-avx              |        |   ✓   |    ✓    |      |      |    284.75 | pass       |      5.13x |                       |
    | CPU omp-avx                |   ✓    |       |    ✓    |      |      |    89.729 | pass       |     16.27x |                       |
    | GPU                        |        |       |         |  ✓   |      |    86.954 | pass       |     16.79x |                       |
    | GPU constant_kernel        |        |       |         |  ✓   |      |     77.04 | pass       |     18.95x |                       |
    | GPU constant_kernel_Tiling |        |       |         |  ✓   |      |    74.927 | pass       |     19.48x |                       |
    | ARM Cortex-A15             |        |       |         |      |  ✓   |           | pass       | 4.2 GFLOPS |                       |
  • General settings:

    • Size of Matrix: (4096, 4096)
    • Size of Kernel: 5 x 5
    • CUDA: Block(32, 32)
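To illustrate the constant_kernel variant: the 5 x 5 kernel is tiny and read by every thread, so placing it in CUDA __constant__ memory lets the reads be served as broadcasts from the constant cache instead of repeated global-memory loads. The sketch below makes that idea concrete; the identifiers are illustrative, not the repository's actual ones.

```cuda
#define KW 5            // kernel width, matching the 5 x 5 setting above
#define KR (KW / 2)     // kernel radius

__constant__ float d_kernel[KW * KW];   // filter lives in constant memory

// One thread per output pixel; border pixels are skipped for brevity.
__global__ void conv2d_constant(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < KR || x >= w - KR || y < KR || y >= h - KR) return;
    float acc = 0.0f;
    for (int ky = -KR; ky <= KR; ++ky)
        for (int kx = -KR; kx <= KR; ++kx)
            acc += in[(y + ky) * w + (x + kx)] * d_kernel[(ky + KR) * KW + (kx + KR)];
    out[y * w + x] = acc;
}

// Host side: copy the filter once, then launch with Block(32, 32):
//   cudaMemcpyToSymbol(d_kernel, h_kernel, sizeof(float) * KW * KW);
```

The constant_kernel_Tiling variant additionally stages the input tile (plus halo) in shared memory; only the filter placement is shown here.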
