
Clang performance regression since 40800 due to GCC 14 tool chain #3036

Closed
marioroy opened this issue Feb 12, 2024 · 8 comments

marioroy commented Feb 12, 2024

I have an application that, when re-compiled, shows a significant performance regression.

The left and right columns show the times on CL 40750 and CL 40800, respectively. The "get properties" stage does parallel IO (chunking), inserting into an emhash7::HashMap or phmap::parallel_flat_hash_map container. The other stages of the application have similar times { map container to vector, vector stable sort, and write stdout }.

$ ./llil4emh in/big* in/big* in/big* >/dev/null
llil4emh (fixed string length=12) start
use OpenMP
use boost sort        
get properties         7.446 secs       8.415 secs
emhash to vector       0.682 secs       0.681 secs
vector stable sort     1.105 secs       1.104 secs
write stdout           0.552 secs       0.554 secs
total time            10.012 secs      10.975 secs
    count lines     970195200        970195200
    count unique    200483043        200483043

$ ./llil4map in/big* in/big* in/big* >/dev/null
llil4map (fixed string length=12) start
use OpenMP
use boost sort
get properties         9.080 secs      10.231 secs
phmap to vector        0.696 secs       0.695 secs
vector stable sort     1.111 secs       1.109 secs
write stdout           0.551 secs       0.557 secs
total time            11.439 secs      12.594 secs
    count lines     970195200        970195200
    count unique    200483043        200483043

Were you expecting a performance regression with GCC 14? There is no improvement after updating to CL 40830.

$ clang -v
clang version 17.0.5
Target: x86_64-generic-linux
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/11
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/12
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/13
Found candidate GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/14
Selected GCC installation: /usr/bin/../lib64/gcc/x86_64-generic-linux/14
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /opt/cuda, version 

Selecting GCC 13 by passing --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 to clang++ resolves the issue.

marioroy (Author) commented:

I ran serially without OpenMP. Same thing: "get properties" is slower using the GCC 14 toolchain (right column).

clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 ...
clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/14 ...
$ ./llil4map in/big* in/big* in/big* >/dev/null
llil4map (fixed string length=12) start
don't use OpenMP
use boost sort
get properties       228.544 secs     233.647 secs
phmap to vector        0.967 secs       0.964 secs
vector stable sort    22.299 secs      22.317 secs
write stdout           3.306 secs       3.363 secs
total time           255.117 secs     260.292 secs
    count lines     970195200        970195200
    count unique    200483043        200483043

fenrus75 (Contributor) commented Feb 12, 2024 via email


marioroy commented Feb 12, 2024

OpenMP has been removed from LLVM since CL 39970. To restore clang/clang++ OpenMP functionality, I have a script that installs the missing OpenMP bits (headers and libs).

  1. Build LLVM 17 with OpenMP enabled and install to /opt/llvm-17 (a possible configure line is sketched after the script below).
  2. Run the fix script to copy the missing OpenMP headers and libs into the /usr tree.
#!/bin/bash
# Copy the OpenMP cmake files, headers, and libs from the /opt/llvm-17 build into the /usr tree.
if [[ -d /opt/llvm-17/lib64/clang/17 && -d /usr/lib64/clang/17 ]]; then
  cd /usr/lib64/cmake
  sudo cp -a /opt/llvm-17/lib64/cmake/openmp .

  cd /usr/lib64/clang/17/include
  sudo cp -a /opt/llvm-17/lib64/clang/17/include/omp*.h .

  cd /usr/lib64
  sudo cp -a /opt/llvm-17/lib64/libarcher.so .
  sudo cp -a /opt/llvm-17/lib64/libompd.so .
  sudo cp -a /opt/llvm-17/lib64/libomp.so .
  sudo ln -sf libomp.so libiomp5.so
fi
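
For step 1, a configure along these lines should work; the options here are illustrative, not necessarily the exact build that was used:

# run from an llvm-project 17.x checkout; the lib64 paths above suggest
# -DLLVM_LIBDIR_SUFFIX=64 may also be wanted
cmake -S llvm -B build -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/opt/llvm-17 \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="openmp"
ninja -C build install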

That restored my sanity after the CL team removed OpenMP functionality from LLVM :)


marioroy commented Feb 12, 2024

The source for llil4map.cc resides at https://gist.github.com/marioroy/862fa2fc6aa3b6f523f7a6ef9dd8d157.

The parallel chunk "get_properties" routine begins at line 190 of the gist. Read IO is done serially; otherwise, threads append to the hash map concurrently (a rough sketch follows the numbers below). On my machine, LLVM with OpenMP functionality is amazing.

CL 40750   25.17x get_properties parallel speedup over serial
CL 40800   22.84x, due to clang++ using the GCC 14 tool chain
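
For a rough picture of that routine's shape, here is a minimal sketch (not the gist's code; the function name and chunk size are illustrative): threads take turns reading a chunk inside a critical section, then insert into the shared maps concurrently.

#include <fstream>
#include <string>
#include <vector>

// Sketch only: serialized chunked reads, concurrent per-chunk processing.
// Compile with -fopenmp.
void get_properties_sketch(std::ifstream& fin)
{
   #pragma omp parallel
   {
      std::vector<std::string> chunk;
      for (;;) {
         chunk.clear();
         #pragma omp critical(read_input)   // read IO is serialized
         {
            std::string line;
            while (chunk.size() < 65536 && std::getline(fin, line))
               chunk.emplace_back(std::move(line));
         }
         if (chunk.empty()) break;
         // process the chunk here: hash-map inserts run concurrently
      }
   }
}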

marioroy (Author) commented:

One can specify which GCC tool chain clang++ uses, which restores the prior performance.

clang++ --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 ...

fenrus75 (Contributor) commented:

It would be really weird that LLVM depends on this, but also quite interesting to figure out what the issue is (perf top might show how cycles are spent differently between the two), so that when GCC 14 becomes THE gcc (in a couple of weeks, when it is released) we don't get things wrong.

marioroy (Author) commented:

I tried again. Testing was done on Clear 41120, LTS kernel 6.1.69-1331.ltsprev.

I captured data using xcapture (via run_xcapture.sh) from https://0x.tools/. The data suggests more futex locking behind the scenes compared with gcc13.

clang++ using gcc14 toolchain:

$ grep llil4emh /tmp/gcc14/2024-02-25.21.csv | cut -d, -f10-11 | sort | uniq -c | sort -rn
    216 ./llil4emh,
     91 ./llil4emh,->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
     80 ./llil4emh,->__x64_sys_futex()->futex_wait()
     31 ./llil4emh,->__x64_sys_futex()->futex_wake()->wake_up_q()->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
     18 ./llil4emh,->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
     14 ./llil4emh,->native_send_call_func_single_ipi()
      3 ./llil4emh,->pick_next_task()->pick_next_task_fair()->update_curr()
      2 ./llil4emh,->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      2 ./llil4emh,->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      2 ./llil4emh,->asm_exc_page_fault()->exc_page_fault()->do_user_addr_fault()->lock_mm_and_find_vma()
      1 ./llil4emh,->sched_clock_cpu()
      1 ./llil4emh,->futex_wait()
      1 ./llil4emh,->asm_common_interrupt()->common_interrupt()->irqentry_exit()->irqentry_exit_to_user_mode()->exit_to_user_mode_prepare()

clang++ using gcc13 toolchain:

$ grep llil4emh /tmp/gcc13/2024-02-25.21.csv | cut -d, -f10-11 | sort | uniq -c | sort -rn
    310 ./llil4emh,
     22 ./llil4emh,->__x64_sys_futex()->futex_wait()
      9 ./llil4emh,->__x64_sys_futex()->futex_wake()->wake_up_q()->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      6 ./llil4emh,->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      5 ./llil4emh,->pick_next_task()->pick_next_task_fair()->update_curr()
      4 ./llil4emh,->asm_exc_page_fault()->exc_page_fault()->do_user_addr_fault()->lock_mm_and_find_vma()
      3 ./llil4emh,->try_to_wake_up()->ttwu_queue_wakelist()->__smp_call_single_queue()->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      3 ./llil4emh,->__do_sys_sched_yield()->do_sched_yield()
      1 ./llil4emh,->update_process_times()->scheduler_tick()->task_tick_fair()
      1 ./llil4emh,->update_curr()
      1 ./llil4emh,->send_call_function_single_ipi()->native_send_call_func_single_ipi()
      1 ./llil4emh,->sched_clock_cpu()
      1 ./llil4emh,->asm_sysvec_apic_timer_interrupt()->sysvec_apic_timer_interrupt()->irqentry_exit()->irqentry_exit_to_user_mode()->exit_to_user_mode_prepare()


marioroy commented Mar 25, 2024

I compared various mutex implementations. The issue is resolved at the application level.

fast_mutex and spinlock_mutex can be found here: https://github.com/mvorbrodt/blog/blob/master/src/mutex.hpp

 // std::mutex     L[NUM_MAPS];
 // omp_lock_t     L[NUM_MAPS];
 // fast_mutex     L[NUM_MAPS];
    spinlock_mutex L[NUM_MAPS];
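
For context, a spinlock mutex of that general shape can be built on std::atomic_flag; this is only a sketch, not the code from that repository.

#include <atomic>

// Minimal spinlock sketch: busy-wait on an atomic flag. Meets BasicLockable,
// so it works with std::lock_guard just like std::mutex.
class spinlock_mutex {
   std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
   void lock() noexcept {
      while (flag.test_and_set(std::memory_order_acquire))
         ;   // spin
   }
   void unlock() noexcept {
      flag.clear(std::memory_order_release);
   }
};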

Selecting the gcc13 or gcc14 toolchain:

clang++ -o llil4emh \
    --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/13 \
    -std=c++20 -fopenmp -Wall -O3 llil4emh.cc

clang++ -o llil4emh \
    --gcc-install-dir=/usr/lib64/gcc/x86_64-generic-linux/14 \
    -std=c++20 -fopenmp -Wall -O3 llil4emh.cc

Testing involves 963 mutexes. OpenMP threads pick one based on the hash value % number of maps (sketched after the timings below). The program outputs the time to populate the maps in parallel (get_properties).

GCC 13 std::mutex       7.456 secs
GCC 14 std::mutex       8.282 secs

GCC 13 omp_lock_t       8.353 secs
GCC 14 omp_lock_t       8.408 secs

GCC 13 fast_mutex       8.049 secs
GCC 14 fast_mutex       8.651 secs

GCC 13 spinlock_mutex   7.061 secs
GCC 14 spinlock_mutex   7.012 secs
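
The locking pattern the table compares looks roughly like this; insert_key and the std::unordered_map container are illustrative stand-ins, not the emhash7/phmap code used above.

#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative only: one lock per map, selected by hash % NUM_MAPS.
constexpr std::size_t NUM_MAPS = 963;
static std::unordered_map<std::string, int> maps[NUM_MAPS];
static std::mutex locks[NUM_MAPS];   // or spinlock_mutex / omp_lock_t, as timed above

// Called concurrently from OpenMP threads.
void insert_key(const std::string& key)
{
   std::size_t slot = std::hash<std::string>{}(key) % NUM_MAPS;
   std::lock_guard guard(locks[slot]);   // contention only with same-slot keys
   ++maps[slot][key];
}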

There are ~1 billion lines read, of which ~200 million are unique.

$ ./llil4emh in/big* in/big* in/big* | cksum
llil4emh (fixed string length=12) start
use OpenMP
use boost sort
get properties         7.012 secs
emhash to vector       0.661 secs
vector stable sort     1.095 secs
write stdout           0.690 secs
total time             9.683 secs
    count lines     970195200
    count unique    200483043
2057246516 1811140689
