
Clang performance regression since 40800 due to GCC 14 tool chain #3036

Closed
marioroy opened this issue Feb 12, 2024 · 8 comments

marioroy commented Feb 12, 2024

I have an application that, when re-compiled, shows a significant performance regression.

The left and right columns show the times on CL 40750 and CL 40800, respectively. The "get properties" stage does parallel IO (chunking), inserting into an emhash7::HashMap or phmap::parallel_flat_hash_map container. The other stages of the application have similar times { map container to vector, vector stable sort, and write stdout }.

$ ./llil4emh in/big* in/big* in/big* >/dev/null
llil4emh (fixed string length=12) start
use OpenMP
use boost sort        
get properties         7.446 secs       8.415 secs
emhash to vector       0.682 secs       0.681 secs
vector stable sort     1.105 secs       1.104 secs
write stdout           0.552 secs       0.554 secs
total time            10.012 secs      10.975 secs
    count lines     970195200        970195200
    count unique    200483043        200483043

$ ./llil4map in/big* in/big* in/big* >/dev/null
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         9.080 secs      10.231 secs
phmap to vector        0.696 secs       0.695 secs
vector stable sort     1.111 secs       1.109 secs
write stdout           0.551 secs       0.557 secs
total time            11.439 secs      12.594 secs
    count lines     970195200        970195200
    count unique    200483043        200483043

Were you expecting a performance regression with GCC 14? There is no improvement after updating to CL 40830.

$ clang -v
clang version 17.0.5
Target: x86_64-generic-linux
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/11
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/12
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/13
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/14
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/14
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /opt/cuda, version 

Selecting GCC 13 by passing --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 to clang++ resolves the issue.

marioroy (Author) commented:

I ran serially without OpenMP. Same thing: "get properties" is slower using the GCC 14 toolchain (right column).

clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 ...
clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/14 ...
$ ./llil4map in/big* in/big* in/big* >/dev/null
llil4map (fixed string length=12) start
don't use OpenMP
use boost sort
get properties       228.544 secs     233.647 secs
phmap to vector        0.967 secs       0.964 secs
vector stable sort    22.299 secs      22.317 secs
write stdout           3.306 secs       3.363 secs
total time           255.117 secs     260.292 secs
    count lines     970195200        970195200
    count unique    200483043        200483043

fenrus75 (Contributor) commented Feb 12, 2024 via email


marioroy commented Feb 12, 2024

OpenMP has been removed from LLVM since CL 39970. To restore clang/clang++ OpenMP functionality, I have a script that installs the missing OpenMP bits (headers and libs).

  1. Build LLVM 17 with OpenMP enabled and install to /opt/llvm-17 (a possible configure line is sketched after the script below).
  2. Run the fix script to copy the missing OpenMP headers and libs into the /usr tree.
#!/bin/bash
# Copy the OpenMP cmake files, headers, and libs from the /opt/llvm-17 build into the /usr tree.
if [[ -d /opt/llvm-17/lib64/clang/17 && -d /usr/lib64/clang/17 ]]; then
  cd /usr/lib64/cmake
  sudo cp -a /opt/llvm-17/lib64/cmake/openmp .

  cd /usr/lib64/clang/17/include
  sudo cp -a /opt/llvm-17/lib64/clang/17/include/omp*.h .

  cd /usr/lib64
  sudo cp -a /opt/llvm-17/lib64/libarcher.so .
  sudo cp -a /opt/llvm-17/lib64/libompd.so .
  sudo cp -a /opt/llvm-17/lib64/libomp.so .
  sudo ln -sf libomp.so libiomp5.so
fi
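
For step 1, a configure along these lines should work; the options here are illustrative, not necessarily the exact build that was used:

# run from an llvm-project 17.x checkout; the lib64 paths above suggest
# -DLLVM_LIBDIR_SUFFIX=64 may also be wanted
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/opt/llvm-17 \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="openmp"
ninja -C build install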

That restored my sanity after the CL team removed OpenMP functionality from LLVM :)


marioroy commented Feb 12, 2024

The source for llil4map.cc resides at https://gist.github.com/marioroy/862fa2fc6aa3b6f523f7a6ef9dd8d157.

The parallel chunk "get_properties" routine begins at line 190 of the gist. Read IO is done serially; otherwise, threads append to the hash map concurrently (a rough sketch follows the numbers below). On my machine, LLVM with OpenMP functionality is amazing.

CL 40750   25.17x get_properties parallel speedup over serial
CL 40800   22.84x, due to clang++ using the GCC 14 tool chain
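
For a rough picture of that routine's shape, here is a minimal sketch (not the gist's code; the function name and chunk size are illustrative): threads take turns reading a chunk inside a critical section, then insert into the shared maps concurrently.

#include <fstream>
#include <string>
#include <vector>

// Sketch only: serialized chunked reads, concurrent per-chunk processing.
// Compile with -fopenmp.
void get_properties_sketch(std::ifstream& fin)
{
   #pragma omp parallel
   {
      std::vector<std::string> chunk;
      for (;;) {
         chunk.clear();
         #pragma omp critical(read_input)   // read IO is serialized
         {
            std::string line;
            while (chunk.size() < 65536 && std::getline(fin, line))
               chunk.emplace_back(std::move(line));
         }
         if (chunk.empty()) break;
         // process the chunk here: hash-map inserts run concurrently
      }
   }
}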

marioroy (Author) commented:

One can specify which GCC tool chain clang++ uses, which restores the prior performance.

clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 ...

fenrus75 (Contributor) commented:

It would be really weird that LLVM depends on this, but also quite interesting to figure out what the issue is (perf top might show how cycles are spent differently between the two), so that when GCC 14 becomes THE gcc (in a couple of weeks, when it is released) we don't get things wrong.

marioroy (Author) commented:

I tried again. Testing was done on Clear 41120, LTS kernel 6.1.69-1331.ltsprev.

I captured data using xcapture (via run_xcapture.sh) from https://0x.tools/. The data suggests more futex locking behind the scenes compared with gcc13.

clang++ using gcc14 toolchain:

$ grep llil4emh /tmp/gcc14/2024-02-25.21.csv | cut -d, -f10-11 | sort | uniq -c | sort -rn
    216 ./llil4emh,
     91 ./llil4emh,->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
     80 ./llil4emh,->__x64_sys_futex()->futex_wait()
     31 ./llil4emh,->__x64_sys_futex()->futex_wake()->wake_up_q()->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
     18 ./llil4emh,->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
     14 ./llil4emh,->native_send_call_func_single_ipi()
      3 ./llil4emh,->pick_next_task()->pick_next_task_fair()->update_curr()
      2 ./llil4emh,->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      2 ./llil4emh,->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      2 ./llil4emh,->asm_exc_page_fault()->exc_page_fault()->do_user_addr_fault()->lock_mm_and_find_vma()
      1 ./llil4emh,->sched_clock_cpu()
      1 ./llil4emh,->futex_wait()
      1 ./llil4emh,->asm_common_interrupt()->common_interrupt()->irqentry_exit()->irqentry_exit_to_user_mode()->exit_to_user_mode_prepare()

clang++ using gcc13 toolchain:

$ grep llil4emh /tmp/gcc13/2024-02-25.21.csv | cut -d, -f10-11 | sort | uniq -c | sort -rn
    310 ./llil4emh,
     22 ./llil4emh,->__x64_sys_futex()->futex_wait()
      9 ./llil4emh,->__x64_sys_futex()->futex_wake()->wake_up_q()->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      6 ./llil4emh,->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      5 ./llil4emh,->pick_next_task()->pick_next_task_fair()->update_curr()
      4 ./llil4emh,->asm_exc_page_fault()->exc_page_fault()->do_user_addr_fault()->lock_mm_and_find_vma()
      3 ./llil4emh,->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      3 ./llil4emh,->__do_sys_sched_yield()->do_sched_yield()
      1 ./llil4emh,->update_process_times()->scheduler_tick()->task_tick_fair()
      1 ./llil4emh,->update_curr()
      1 ./llil4emh,->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      1 ./llil4emh,->sched_clock_cpu()
      1 ./llil4emh,->asm_sysvec_apic_timer_interrupt()->sysvec_apic_timer_interrupt()->irqentry_exit()->irqentry_exit_to_user_mode()->exit_to_user_mode_prepare()


marioroy commented Mar 25, 2024

I compared various mutex implementations. The issue is resolved at the application level.

fast_mutex and spinlock_mutex can be found here: https://github.com/mvorbrodt/blog/blob/master/src/mutex.hpp

 // std::mutex     L[NUM_MAPS];
 // omp_lock_t     L[NUM_MAPS];
 // fast_mutex     L[NUM_MAPS];
    spinlock_mutex L[NUM_MAPS];
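
For context, a spinlock mutex of that general shape can be built on std::atomic_flag; this is only a sketch, not the code from that repository.

#include <atomic>

// Minimal spinlock sketch: busy-wait on an atomic flag. Meets BasicLockable,
// so it works with std::lock_guard just like std::mutex.
class spinlock_mutex {
   std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
   void lock() noexcept {
      while (flag.test_and_set(std::memory_order_acquire))
         ;   // spin
   }
   void unlock() noexcept {
      flag.clear(std::memory_order_release);
   }
};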

Selecting the gcc13 or gcc14 toolchain:

clang++ -o llil4emh \
    --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 \
    -std=c++20 -fopenmp -Wall -O3 llil4emh.cc

clang++ -o llil4emh \
    --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/14 \
    -std=c++20 -fopenmp -Wall -O3 llil4emh.cc

Testing involves 963 mutexes. OpenMP threads pick one based on the hash value % number of maps (sketched after the timings below). The program outputs the time to populate the maps in parallel (get_properties).

GCC 13 std::mutex       7.456 secs
GCC 14 std::mutex       8.282 secs

GCC 13 omp_lock_t       8.353 secs
GCC 14 omp_lock_t       8.408 secs

GCC 13 fast_mutex       8.049 secs
GCC 14 fast_mutex       8.651 secs

GCC 13 spinlock_mutex   7.061 secs
GCC 14 spinlock_mutex   7.012 secs
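
The locking pattern the table compares looks roughly like this; insert_key and the std::unordered_map container are illustrative stand-ins, not the emhash7/phmap code used above.

#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative only: one lock per map, selected by hash % NUM_MAPS.
constexpr std::size_t NUM_MAPS = 963;
static std::unordered_map<std::string, int> maps[NUM_MAPS];
static std::mutex locks[NUM_MAPS];   // or spinlock_mutex / omp_lock_t, as timed above

// Called concurrently from OpenMP threads.
void insert_key(const std::string& key)
{
   std::size_t slot = std::hash<std::string>{}(key) % NUM_MAPS;
   std::lock_guard guard(locks[slot]);   // contention only with same-slot keys
   ++maps[slot][key];
}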

There are ~1 billion lines read, of which ~200 million are unique.

$ ./llil4emh in/big* in/big* in/big* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort
get properties         7.012 secs
emhash to vector       0.661 secs
vector stable sort     1.095 secs
write stdout           0.690 secs
total time             9.683 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689
