RDNA3 not being utilised to its full potential #5

Open
muziqaz opened this issue Mar 25, 2023 · 8 comments

@muziqaz

muziqaz commented Mar 25, 2023

Hi,
I'm nearly done testing all of my AMD GPUs, comparing them between the OpenCL and HIP environments, and today it was the 7900xtx's turn. Here are the results and a comparison vs the 6900xt:

6900xt MBA (23.04 TFLOPS)

| Test | OpenCL (ns/day) | HIP (ns/day) | Diff. % |
|---|---|---|---|
| gbsa | 967.65 | 1644.17 | 69.91% |
| rf | 831.189 | 1410.187 | 69.66% |
| pme | 398.526 | 1046.064 | 162.48% |
| apoa1rf | 342.568 | 505.241 | 47.49% |
| apoa1pme | 183.49 | 381.036 | 107.66% |
| apoa1ljpme | 127.05 | 300.048 | 136.17% |
| amoebagk | 2.4 | 37.444 | 1460.17% |
| amoebapme | 12.021 | 16.261 | 35.27% |

7900xtx Nitro+ (61+ TFLOPS)

| Test | OpenCL (ns/day) | HIP (ns/day) | Diff. % | HIP gain, 7900xtx vs 6900xt |
|---|---|---|---|---|
| gbsa | 1075.82 | 1812.23 | 68.45% | 10.22% |
| rf | 912.438 | 1503.63 | 64.79% | 6.63% |
| pme | 415.988 | 1103.77 | 165.34% | 5.52% |
| apoa1rf | 437.261 | 645.718 | 47.67% | 27.8% |
| apoa1pme | 231.098 | 521 | 125.45% | 36.73% |
| apoa1ljpme | 164.816 | 400.924 | 143.26% | 33.62% |
| amoebagk | 4.22695 | 42.958 | 916.29% | 14.73% |
| amoebapme | 17.0797 | 23.0998 | 35.25% | 42.06% |

Not much of an improvement going from the 6900xt. I'll try to get AMD's attention on this.
I will most likely post the rest of the GPU test results in the other hip/openmm area on Monday.
The conda env build was standard. I have no knowledge of how to play around with FFT backends, but I think that wouldn't change the outcome too much compared to VkFFT.
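
For anyone who wants to re-derive the percentage columns, here is a minimal Python sketch. It assumes the percentages are plain relative gains, `(new / old - 1) * 100`, which matches the posted numbers. The helper name `gain_pct` is made up for illustration, and "61+" TFLOPS is taken as 61.

```python
# HIP results in ns/day, copied from the two tables in this post.
HIP_6900XT = {
    "gbsa": 1644.17, "rf": 1410.187, "pme": 1046.064, "apoa1rf": 505.241,
    "apoa1pme": 381.036, "apoa1ljpme": 300.048, "amoebagk": 37.444,
    "amoebapme": 16.261,
}
HIP_7900XTX = {
    "gbsa": 1812.23, "rf": 1503.63, "pme": 1103.77, "apoa1rf": 645.718,
    "apoa1pme": 521.0, "apoa1ljpme": 400.924, "amoebagk": 42.958,
    "amoebapme": 23.0998,
}


def gain_pct(new, old):
    """Relative gain of `new` over `old`, in percent (hypothetical helper)."""
    return (new / old - 1.0) * 100.0


# Measured HIP gain of the 7900xtx over the 6900xt, per benchmark.
for name, old in HIP_6900XT.items():
    print(f"{name:12s} {gain_pct(HIP_7900XTX[name], old):6.2f}%")

# Gain suggested by the quoted peak figures (23.04 vs "61+" TFLOPS, taken as 61).
print(f"{'paper FLOPS':12s} {gain_pct(61.0, 23.04):6.2f}%")
```

The last line puts the gap in one number: the quoted peak figures suggest a gain of roughly 165%, while the measured HIP gains sit between about 5% and 42%.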

@ex-rzr
Contributor

ex-rzr commented Mar 25, 2023

  1. Most of these cases are probably too small to utilize it. Could you show results for the amber20 tests (amber20-cellulose and amber20-stmv)?
  2. There are some recent changes in the repo that are not included in the conda package. I wonder how they perform (if you are able to build from sources on your machine).
  3. Are you sure that performance at boost frequency is relevant for the comparison? https://www.amd.com/en/products/graphics/amd-radeon-rx-7900xtx says: "Boost Clock Frequency is the maximum frequency achievable on the GPU running a bursty workload. Boost clock achievability, frequency, and sustainability will vary based on several factors, including but not limited to: thermal conditions and variation in applications and workloads. GD-151". Can you check with `watch -n 1 rocm-smi` what the frequency really is for various tests?
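
If a log of the clocks over a whole benchmark run is more convenient than watching the terminal, the same polling can be sketched in Python. This is only a rough equivalent of `watch -n 1 rocm-smi`: the helper `sample` is hypothetical, it simply captures whatever the command prints, and it assumes `rocm-smi` is on the PATH when used as in the commented example.

```python
import subprocess
import time


def sample(cmd, count=5, interval_s=1.0):
    """Run `cmd` `count` times, `interval_s` seconds apart; return each stdout."""
    outputs = []
    for i in range(count):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        outputs.append(result.stdout)
        if i < count - 1:
            time.sleep(interval_s)  # wait before the next sample
    return outputs


# Example use while a benchmark runs in another terminal (needs ROCm installed):
#
#     with open("clocks.log", "w") as log:
#         for text in sample(["rocm-smi"], count=60, interval_s=1.0):
#             log.write(text)
```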

@ex-rzr
Contributor

ex-rzr commented Mar 25, 2023

I forgot to add:

> I have no knowledge of how to play around with FFT backends, but I think that wouldn't change the outcome too much compared to VkFFT

VkFFT is the fastest so there is likely no real reason to try other FFT backends, unless you want to (see https://github.com/amd/openmm-hip#fft-backends)

@muziqaz
Author

muziqaz commented Mar 25, 2023

  1. The tests would be too small for 6900xt too, or Radeon 7 :) OpenCL shows the same behaviour in FAH. I get just a bit better performance than the 6900xt in various workloads (large atom counts too). It seems neither OpenCL nor HIP can utilise the dual issue "pipe" available in the RDNA3 arch. All the amber attempts failed on all of my tests due to some modules missing (scipy).
  2. Unfortunately, the 7900xtx is back in my Windows system; this was just a rare occasion to complete the tests on all of my GPUs on Linux. The card is a bit of a brick and hard to get in and out of the case. Depending on available free time, I might try compiling a few things, but I'm very rusty in Linux, so reading manuals takes more time than compiling things :D
  3. My card is folding at 3GHz stable :) Even with clocks at the same level as the 6900xt, RDNA3 should blow it out of the water easily. I bought the Sapphire Nitro+, which is waay overbuilt compared to MBA models.
    It has been suggested that the RDNA3 arch requires specific driver and API level optimisations to expose all the available resources.

@ex-rzr
Contributor

ex-rzr commented Mar 25, 2023

> The tests would be too small for 6900xt too, or Radeon 7 :) OpenCL shows the same behaviour in FAH

In my experience, only large cases like amber20-cellulose (400k atoms) and amber20-stmv (1M atoms) reflect the relative performance of different GPUs, i.e. performance scales with more compute units/higher frequency/etc. Smaller cases scale worse, because the latency of launching kernels, scheduling work groups on the GPU, etc. is sometimes higher than the kernels' actual work.
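
The effect of fixed overhead on scaling can be illustrated with a toy model. This is a sketch under made-up assumptions, not a measurement: each step costs a fixed overhead plus work divided by throughput, and every number below (overhead, rates) is hypothetical. Only the 400k and 1M atom counts come from the comment above.

```python
def step_time_us(atoms, atoms_per_us, overhead_us):
    """Toy cost of one MD step: fixed launch/scheduling overhead plus compute."""
    return overhead_us + atoms / atoms_per_us


def speedup(atoms, slow_rate, fast_rate, overhead_us):
    """How much faster the 'fast' GPU finishes a step of the given size."""
    slow = step_time_us(atoms, slow_rate, overhead_us)
    fast = step_time_us(atoms, fast_rate, overhead_us)
    return slow / fast


# Hypothetical numbers: the fast GPU has 2x the raw throughput,
# and both GPUs pay the same 200 us of overhead per step.
SLOW_RATE, FAST_RATE, OVERHEAD_US = 1000.0, 2000.0, 200.0

for atoms in (25_000, 400_000, 1_000_000):
    s = speedup(atoms, SLOW_RATE, FAST_RATE, OVERHEAD_US)
    print(f"{atoms:>9,d} atoms: {s:.2f}x")
```

In this model the observed speedup only approaches the 2x throughput ratio as the system grows, which is the pattern described for small versus large benchmark cases.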

> All the amber attempts failed on all of my tests due to some modules missing (scipy).

Yeah, this dependency is not installed with openmm automatically, as it's used only for these benchmarks. You can try to install it with `conda install scipy` (or `pip3 install scipy`).
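
A quick way to confirm which modules are missing before launching the amber20 benchmarks is to ask Python directly. This is a small sketch using the standard library only; `missing_modules` is a hypothetical helper, and the module list is an assumption based on this thread (it is not the benchmark script's full dependency list).

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules mentioned in this thread; extend the list as needed.
missing = missing_modules(["openmm", "scipy"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules are available.")
```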

> It seems neither OpenCL nor HIP can utilise the dual issue "pipe" available in the RDNA3 arch

I didn't run OpenMM on RDNA3, but I saw that the HIP compiler generates dual issue instructions. I just wouldn't expect too much, as not every pair of instructions in every kernel can be encoded using it.
If 61+ TFLOPS means the performance of FMA with dual issue, then it's a completely theoretical peak, because I doubt that most real OpenMM kernels have 100% of their instructions dual-issued, that's impossible :) (I would not be surprised by <20-30%)
I guess 61+ TFLOPS can be achieved for something like matrix-matrix multiplication of really large matrices, because such kernels indeed have a lot of instructions that can benefit from the dual issue feature.
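
The arithmetic behind the two quoted peak figures can be sketched as follows. Assumptions not stated in this thread: 64 FP32 lanes per CU, boost clocks of 2.25 GHz (6900xt) and 2.5 GHz (7900xtx), chosen because they reproduce the 23.04 and 61+ TFLOPS figures, and a deliberately crude model in which `dual_issue_fraction` is the share of cycles where a second instruction is co-issued. The function name is made up for illustration.

```python
LANES_PER_CU = 64       # assumed FP32 lanes per compute unit
FLOPS_PER_FMA = 2       # one fused multiply-add counts as two FLOPs


def peak_tflops(compute_units, clock_ghz, dual_issue_fraction=0.0):
    """Peak FP32 TFLOPS under a crude model of dual issue.

    dual_issue_fraction = 0.0 means no pairing at all, 1.0 means every
    cycle issues two FMAs per lane (the marketing peak).
    """
    flops_per_cycle = (compute_units * LANES_PER_CU * FLOPS_PER_FMA
                       * (1.0 + dual_issue_fraction))
    return flops_per_cycle * clock_ghz / 1000.0


print(f"6900xt,  no dual issue:   {peak_tflops(80, 2.25):.2f}")
print(f"7900xtx, 100% dual issue: {peak_tflops(96, 2.5, 1.0):.2f}")
print(f"7900xtx, 25% dual issue:  {peak_tflops(96, 2.5, 0.25):.2f}")
print(f"7900xtx, no dual issue:   {peak_tflops(96, 2.5, 0.0):.2f}")
```

Under these assumptions, a kernel that pairs instructions only 20-30% of the time has a ceiling far below the headline figure, even before memory and latency effects are considered.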

> Depending on available free time, I might try compiling a few things, but I'm very rusty in Linux, so reading manuals takes more time than compiling things

That would be great. OpenMM (and OpenMM-HIP) has quite simple build instructions, I hope they'll work for you without issues.

> Unfortunately, the 7900xtx is back in my Windows system

Sad. Anyway, thanks for benchmarking. I hope you'll get a chance to run amber20 tests on this and other GPUs.

> It has been suggested that the RDNA3 arch requires specific driver and API level optimisations to expose all the available resources.

I'm not aware of it, do you know any details? Dual issue, for example, is a matter of how the compiler generates code; it does generate it, but I can't say how effectively. Perhaps the suggestion about drivers was for games? There, shaders are basically compiled by the driver's compiler, unlike ROCm, where the compiler is part of the ROCm distribution.
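
One rough way to see how often the compiler actually pairs instructions is to count them in a text ISA listing of a kernel. The sketch below assumes a listing with one instruction per line, where RDNA3 dual-issue pairs use mnemonics starting with `v_dual_` and other vector ALU instructions start with `v_`. The helper name and the four-line sample listing are made up for illustration and are not taken from any OpenMM kernel.

```python
import re

# Matches the leading mnemonic of a vector ALU instruction, e.g. "v_add_f32_e32".
VALU_MNEMONIC = re.compile(r"\s*(v_\w+)")


def dual_issue_share(isa_text):
    """Return (dual, total): vector ALU lines that are v_dual_* pairs, and all
    vector ALU lines. A v_dual_* line is counted once although it holds two ops."""
    dual = total = 0
    for line in isa_text.splitlines():
        match = VALU_MNEMONIC.match(line)
        if match is None:
            continue  # scalar, memory, or non-instruction line
        total += 1
        if match.group(1).startswith("v_dual_"):
            dual += 1
    return dual, total


# Made-up sample listing, for illustration only.
SAMPLE = """
    v_dual_fmac_f32 v0, v4, v5 :: v_dual_mul_f32 v1, v6, v7
    v_add_f32_e32 v2, v0, v1
    s_waitcnt lgkmcnt(0)
    v_fma_f32 v3, v2, v2, v0
"""

dual, total = dual_issue_share(SAMPLE)
print(f"{dual} of {total} vector ALU lines are dual-issue pairs")
```

A static count like this says nothing about how often each instruction executes, so it is only a first hint, not a performance measurement.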

@muziqaz
Author

muziqaz commented Mar 25, 2023

I ran through a variety of FAH projects with the 7900xtx, and it is consistently 15-20% faster than the 6900xt. The 6900xt folds at 2.3GHz or so, the 7900xtx at 2.95-3GHz. The 7900xtx has much higher clocks and also more CUs (96 vs 80), which would be utilised by OpenCL/OpenMM regardless. But then again, the 7900xtx has separate shader clocks (2.2GHz or something). So the increase we see right now might be due to the CU count going from 80 to 96, which kinda makes sense. But those CUs have more resources in themselves, thus the crazy increase in FLOPS. Even ignoring those FLOPS, the 7900xtx should be much faster than the 6900xt. And I understand we need large systems for any high end GPU. nVidia has a similar issue, but they worked out their CUDA thingy quite well, and their cards are still crazy fast even with relatively low atom counts. They saw quite a jump in FAH performance going from Turing to Ampere, and then more progress with Ada. Obviously nothing close to what their CEO tells everyone in the slides, but still.
Regarding dual issue SIMDs, nVidia moved to a similar setup with Ampere a few years back. With that arch they have one pipe which does fp32 only, while the other pipe does either fp32 or int. In FAH, the workload uses both as 2 fp32 pipes, since FAH doesn't need integer calcs. So I'm thinking it might be similar with RDNA3.

Regardless of that, I saw a tremendous perf increase going from OpenCL to HIP. And that is across a lot of AMD GPUs. Hopefully things start moving with a HIP Fahcore :)
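
The expectation in this comment can be put into numbers with a very simple model: speedup proportional to CU count times clock. This is a sketch only; it takes the CU counts (80 and 96) and folding clocks (about 2.3 GHz and 3.0 GHz) from the comment and ignores dual issue, memory bandwidth, and everything else. `expected_speedup` is a hypothetical helper.

```python
def expected_speedup(cus_new, ghz_new, cus_old, ghz_old):
    """Naive scaling estimate: throughput proportional to CUs x clock."""
    return (cus_new * ghz_new) / (cus_old * ghz_old)


# CU count only (clocks treated as equal).
cu_only = expected_speedup(96, 1.0, 80, 1.0)
# CU count and the reported folding clocks.
cu_and_clock = expected_speedup(96, 3.0, 80, 2.3)

print(f"CU count only:      {cu_only:.2f}x")
print(f"CU count and clock: {cu_and_clock:.2f}x")
```

The CU-only estimate (1.20x) lines up with the reported 15-20% gain in FAH, while the estimate that also credits the higher clock (about 1.57x) does not, which matches the suspicion voiced in the comment.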

@muziqaz
Author

muziqaz commented Mar 27, 2023

I know that's not the 7900xtx, but here is the Radeon 7 running amber20:

| Radeon 7 | OpenCL (ns/day) | HIP (ns/day) | Diff. % |
|---|---|---|---|
| amber20-dhfr | 329.067 | 754.694 | 129.34% |
| amber20-cellulose | 19.1284 | 55.9463 | 192.48% |
| amber20-stmv | 5.94076 | 21.1691 | 256.34% |

I believe the 7900xtx would see a similar increase, but it would still be within 20% of the 6900xt.

@DanielWicz

Is this project even alive?

@muziqaz
Author

muziqaz commented May 12, 2023

> Is this project even alive?

As far as I understand, HIP works as a plugin, and those interested can build OpenMM/HIP environments within conda and build what they want. This is on Linux.
On Windows, AMD hasn't released the SDK yet, so nothing can be tested, but hopefully soon.
On the folding@home side, I believe it would be possible to build a fahcore based on HIP OpenMM on Linux. But I think we will hold off until the Windows SDK is out.
