SIMT lowering support for SG transpose B case and row/col reductions. #1

charithaintc · 2025-08-15T17:30:15Z

This PR contains changes needed in upstream to support load + transpose B optimization and row/col reduction support.

When LLVM_LINK_LLVM_DYLIB is ON, `check-bolt` target reports unit test failures:

    BOLT-Unit :: Core/./CoreTests/failed_to_discover_tests_from_gtest
    BOLT-Unit :: Profile/./ProfileTests/failed_to_discover_tests_from_gtest

The reason is that when llvm-lit runs a unit-test executable:

    /path/to/CoreTests --gtest_list_tests '--gtest_filter=-*DISABLED_*'

an assertion is triggered with the following message:

    LLVM ERROR: Option 'default' already exists!

This assertion triggers when the initializer of defaultListDAGScheduler defined at SelectionDAGISel.cpp:219 is called as a statically-linked function after already being called during the initialization of libLLVM.

The issue can be traced down to LLVMTestingSupport library which depends on libLLVM as neither COMPONENT_LIB nor DISABLE_LLVM_LINK_LLVM_DYLIB is specified in a call to `add_llvm_library(LLVMTestingSupport ...)`.

Specifying DISABLE_LLVM_LINK_LLVM_DYLIB for LLVMTestingSupport makes Clang unit test fail and COMPONENT_LIB is probably inappropriate for a testing-specific library, thus as a workaround, added Error.cpp source from LLVMTestingSupport directly to the list of source files of CoreTests target (as it depends on `llvm::detail::TakeError(llvm::Error)`) and removed LLVMTestingSupport from the list of dependencies of ProfileTests.

…DEPENDS (llvm#152431)

…lvm#152156) With this new A320 in-order core, we follow adding the FeatureUseFixedOverScalableIfEqualCost feature to A510 and A520 (llvm#132246), which reaps the same code generation benefits of preferring fixed over scalable when the cost is equal.

So when we have:

```
void foo(float* a, float* b, float* dst, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

When compiling without the feature enabled, we get:

```
...
ld1b    { z0.b }, p0/z, [x0, x10]
ld1b    { z2.b }, p0/z, [x1, x10]
add     x12, x0, x10
ldr     z1, [x12, #1, mul vl]
add     x12, x1, x10
ldr     z3, [x12, #1, mul vl]
fadd    z0.s, z2.s, z0.s
add     x12, x2, x10
fadd    z1.s, z3.s, z1.s
dech    x11
st1b    { z0.b }, p0, [x2, x10]
incb    x10, all, mul #2
str     z1, [x12, #1, mul vl]
...
```

When compiling with, we get:

```
...
ldp     q0, q1, [x12, #-16]
ldp     q2, q3, [x11, #-16]
subs    x13, x13, #8
fadd    v0.4s, v2.4s, v0.4s
fadd    v1.4s, v3.4s, v1.4s
add     x11, x11, #32
add     x12, x12, #32
stp     q0, q1, [x10, #-16]
add     x10, x10, #32
...
```

Add a device function to check if a device queue is empty. If liboffload tries to create an event for an empty queue, we create an "empty" event that is already complete. This allows `olCreateEvent`, `olSyncEvent` and `olWaitEvent` to run quickly for empty queues.
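The empty-queue shortcut described above can be sketched as follows. This is only an illustration under stated assumptions: `Queue`, `Event`, `isQueueEmpty`, `createEvent`, and `needsWait` are hypothetical names, not the liboffload or plugin API, and a real queue would track pending work in a device-specific way.

```cpp
// Hypothetical model: a queue is "empty" when it has no pending work.
struct Queue {
  unsigned pendingOps;
};

// Hypothetical event: `complete` is true when there is nothing to wait for.
struct Event {
  bool complete;
};

// Stand-in for the device-side "is this queue empty?" check.
constexpr bool isQueueEmpty(const Queue& q) { return q.pendingOps == 0; }

// For an empty queue, hand back an event that is already complete,
// so later sync/wait calls can return immediately.
constexpr Event createEvent(const Queue& q) { return Event{isQueueEmpty(q)}; }

// A sync or wait only has to block when the event is not yet complete.
constexpr bool needsWait(const Event& e) { return !e.complete; }

static_assert(!needsWait(createEvent(Queue{0})), "empty queue: no wait");
static_assert(needsWait(createEvent(Queue{2})), "busy queue: must wait");
```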

... so we don't have to create Pointer instances when we don't need them.

llvm#152457) Judging from the reaction to llvm#152302, we are not ready to make this a fatal error. Remove the specific version number, and update the libc message to match the others' wording.

This fixes llvm#152097.

This commit fixes two instances of a (somewhat) recently enabled assertion. One with a test, the other I can't reproduce (might be dead code) but certainly looks like an instance of the same problem. The PR that introduced the regression: llvm#117558

With this patch, the AVR backend is usable again for TinyGo.

This patch extends llvm#149095 for EOR and ORR. It uses a simple partition scheme to try to find two suitable disjoint bitmasks that can be used with EOR/ORR to reconstruct the original mask. Fixes: llvm#148987.
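The identity behind the split can be sketched as below: if two masks are disjoint and together cover the original mask, applying them one after the other with EOR or ORR gives the same result as applying the original mask once. This is a minimal sketch, not the patch's partition scheme. `splitAt` is a hypothetical helper that simply cuts the mask at a bit position, and it does not check whether each half is encodable as an AArch64 logical immediate, which the real transformation would need.

```cpp
#include <cstdint>

struct MaskPair {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Hypothetical split: bits below `bit` go to `lo`, the remaining bits to `hi`.
// The two halves are disjoint by construction and OR back to `mask`.
constexpr MaskPair splitAt(std::uint64_t mask, unsigned bit) {
  const std::uint64_t lowBits = (bit >= 64) ? ~0ULL : ((1ULL << bit) - 1);
  return MaskPair{mask & lowBits, mask & ~lowBits};
}

// Two-step forms that should match a single EOR/ORR with the full mask.
constexpr std::uint64_t eorTwoStep(std::uint64_t x, MaskPair p) {
  return (x ^ p.lo) ^ p.hi;
}
constexpr std::uint64_t orrTwoStep(std::uint64_t x, MaskPair p) {
  return (x | p.lo) | p.hi;
}

constexpr std::uint64_t kMask = 0x00F0000000000F0FULL;
constexpr std::uint64_t kValue = 0x123456789ABCDEF0ULL;
constexpr MaskPair kParts = splitAt(kMask, 32);

static_assert((kParts.lo & kParts.hi) == 0, "halves are disjoint");
static_assert((kParts.lo | kParts.hi) == kMask, "halves cover the mask");
static_assert(eorTwoStep(kValue, kParts) == (kValue ^ kMask), "EOR matches");
static_assert(orrTwoStep(kValue, kParts) == (kValue | kMask), "ORR matches");
```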

To avoid noise in PRs such as in llvm#146383.

…lvm#151940) We need to reject plans that contain recipes with invalid costs. LICM can move recipes with invalid costs out of the loop region, where the main cost computation then misses them. Extend the check for recipes with invalid costs, which currently covers only the middle block, to include all skeleton blocks.

Fixes llvm#144358
Fixes llvm#151664

PR: llvm#151940

…xpr (llvm#152363) Closes llvm#152324. Part of llvm#30794.

This PR adds `constexpr` support for the following AVX512 integer reduction intrinsics:
- `_mm512_reduce_add_epi32`
- `_mm512_reduce_add_epi64`
- `_mm512_reduce_mul_epi32`
- `_mm512_reduce_mul_epi64`
- `_mm512_reduce_and_epi32`
- `_mm512_reduce_and_epi64`
- `_mm512_reduce_or_epi32`
- `_mm512_reduce_or_epi64`
- `_mm512_reduce_max_epi32`
- `_mm512_reduce_max_epi64`
- `_mm512_reduce_min_epi32`
- `_mm512_reduce_min_epi64`
- `_mm512_reduce_max_epu32`
- `_mm512_reduce_max_epu64`
- `_mm512_reduce_min_epu32`
- `_mm512_reduce_min_epu64`

---------

Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
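To illustrate what compile-time evaluation of a horizontal reduction looks like, here is a small portable sketch. It does not use the intrinsics or Clang's builtins: `Lanes32`, `reduceAdd32`, and `reduceMax32` are hypothetical stand-ins that model a 512-bit vector as sixteen 32-bit lanes, and the wraparound handling is an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in for a 512-bit vector viewed as sixteen signed 32-bit lanes.
using Lanes32 = std::array<std::int32_t, 16>;

// Sum all lanes. Accumulating in an unsigned type keeps overflow well
// defined (modular) inside a constant expression.
constexpr std::int32_t reduceAdd32(const Lanes32& v) {
  std::uint32_t acc = 0;
  for (std::size_t i = 0; i < v.size(); ++i)
    acc += static_cast<std::uint32_t>(v[i]);
  return static_cast<std::int32_t>(acc);
}

// Largest signed lane.
constexpr std::int32_t reduceMax32(const Lanes32& v) {
  std::int32_t best = v[0];
  for (std::size_t i = 1; i < v.size(); ++i)
    if (v[i] > best)
      best = v[i];
  return best;
}

constexpr Lanes32 kLanes = {1, 2,  3,  4,  5,  6,  7,  8,
                            9, 10, 11, 12, 13, 14, 15, 16};

// Both reductions are usable in constant expressions.
static_assert(reduceAdd32(kLanes) == 136, "1 + 2 + ... + 16");
static_assert(reduceMax32(kLanes) == 16, "largest lane");
```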

Auto-generate checks for llvm#151925. Also update some naming to make more consistent with other tests.

…#152138) This commit converts RetainCountChecker to the new checker family framework that was introduced in the commit 6833076.

This commit also performs some minor cleanup around the parts that had to be changed, but lots of technical debt still remains in this old codebase.

…lvm#151995) Make it easier for us to add ABI versions. Close llvm#144332

Added by llvm#150846. Checks the size of a structure, which is only correct for 64-bit systems.

…ntrinsics to be used in constexpr (llvm#152435) Fixed llvm#152313 --------- Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>

Previously, specializing the GraphWriter class required a full class specialization. This change introduces CRTP for GraphWriter, allowing for partial specialization.

This change is in support of printing the module dependency graph as part of the RFC for driver-managed module builds, for which we want to print the graph nodes in a more human-readable format by:
- Printing descriptive IDs instead of pointer addresses as node labels.
- Printing the full node labels separately from the node relations to avoid clutter.

With this approach, only GraphWriter::writeNodes() needs to be specialized (aside from DOTGraphTraits).

RFC for driver-managed module builds: https://discourse.llvm.org/t/rfc-modules-support-simple-c-20-modules-use-from-the-clang-driver-without-a-build-system
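The CRTP idea can be sketched in isolation as follows. This is not LLVM's `GraphWriter`: `Node`, `Graph`, `GraphWriterBase`, `DefaultWriter`, and `ModuleGraphWriter` are hypothetical names, and the output format is only loosely DOT-like. The sketch shows how a derived writer can replace the node-printing step alone while reusing the rest of the base class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical graph model: each node has an ID and successor indices.
struct Node {
  std::string id;
  std::vector<std::size_t> successors;
};
using Graph = std::vector<Node>;

// CRTP base: writeGraph() dispatches to Derived::writeNodes() at compile time.
template <typename Derived>
class GraphWriterBase {
public:
  explicit GraphWriterBase(const Graph& g) : graph(g) {}

  std::string writeGraph() {
    std::string out = "digraph {\n";
    static_cast<Derived&>(*this).writeNodes(out);
    out += "}\n";
    return out;
  }

  // Default: index-based labels, edges interleaved with their source node.
  void writeNodes(std::string& out) {
    for (std::size_t i = 0; i < graph.size(); ++i) {
      out += "  n" + std::to_string(i) + ";\n";
      for (std::size_t s : graph[i].successors)
        out += "  n" + std::to_string(i) + " -> n" + std::to_string(s) + ";\n";
    }
  }

protected:
  const Graph& graph;
};

// Uses the inherited default writeNodes().
class DefaultWriter : public GraphWriterBase<DefaultWriter> {
public:
  using GraphWriterBase::GraphWriterBase;
};

// Overrides only writeNodes(): descriptive IDs, labels before relations.
class ModuleGraphWriter : public GraphWriterBase<ModuleGraphWriter> {
public:
  using GraphWriterBase::GraphWriterBase;

  void writeNodes(std::string& out) {
    for (const Node& n : graph)
      out += "  \"" + n.id + "\";\n";
    for (const Node& n : graph)
      for (std::size_t s : n.successors)
        out += "  \"" + n.id + "\" -> \"" + graph[s].id + "\";\n";
  }
};

// Usage: the same two-node graph rendered by the customized writer.
inline std::string renderExample() {
  Graph g;
  g.push_back(Node{"A", {1}});
  g.push_back(Node{"B", {}});
  return ModuleGraphWriter(g).writeGraph();
}

inline void checkExample() {
  assert(renderExample() ==
         "digraph {\n  \"A\";\n  \"B\";\n  \"A\" -> \"B\";\n}\n");
}
```

Because the dispatch is resolved through the template parameter, no virtual call is involved, and a writer that does not define `writeNodes` falls back to the base version.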

Desc is only used once and we can get that information from the Block as well.

@durga4github

[NVPTX] Add Prefetch tensormap intrinsics

This PR adds prefetch intrinsics with the relevant tensormap_space.

* Lit tests are added as part of prefetch.ll
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst.

For more information, refer to the PTX ISA for the prefetch intrinsic: [Prefetch Tensormap](https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-prefetch-prefetchu)

@durga4github @schwarzschild-radius

Changes: The original patch, landed as 1336675, was reverted due to a bug in LoopVectorize resulting in a crash. The bug has now been fixed by 95c32bf ([VPlan] Return invalid cost if any skeleton block has invalid costs), and this reland is identical to the original patch.

Inspired by llvm#152478

…ops. (llvm#148424) Adds `linalg-morph-ops` pass to convert an op from one representation to another: named-op <--> category_op (elementwise, contraction, ..) <--> generic

e.g.

```mlir
%exp = linalg.exp ins(%A : tensor<16x8xf32>) outs(%B : tensor<16x8xf32>) -> tensor<16x8xf32>
```

After `mlir-opt -linalg-morph-ops=named-to-category ..`

```mlir
%0 = linalg.elementwise kind=#linalg.elementwise_kind<exp> ins(%arg0 : tensor<16x8xf32> ..
```

Note: this is a generalization of two existing passes:
- `--linalg-generalize-named-ops` is the path `named-op --> generic-op`
- `--linalg-specialize-generic-ops` is the path `named-op <-- generic-op`

email: quic_mabsar@quicinc.com

…ows (llvm#152318) Currently flang-rt assumes that LLVM was always built with the dynamic MSVC runtime. This may not be the case if the user has specified a different runtime with -DCMAKE_MSVC_RUNTIME_LIBRARY. Since this flag is implied by -DLLVM_ENABLE_RPMALLOC=On, which is used by the Windows release script, this is causing that script to fail.

Fixes llvm#151920

Split out from llvm#150248: Use the size of the alloca instead of the size passed to the lifetime intrinsic. As a bonus, this handles dynamic allocas correctly (see the added test) instead of doing a memset with size -1...

…llvm#152478) Adds missing C++ run lines to test files containing `constexpr` tests.

Also adds missing 32/64-bit test coverage to the following tests:
- `clang/test/CodeGen/X86/avx512-reduceIntrin.c`
- `clang/test/CodeGen/X86/avx512-reduceMinMaxIntrin.c`
- `clang/test/CodeGen/X86/avx512vpopcntdq-builtins.c`
- `clang/test/CodeGen/X86/avx512vpopcntdqvl-builtins.c`

Additionally, fixes a `_mm512_popcnt_epi64` `constexpr` test that incorrectly assumed 32-bit integers, leading to incorrect bit counts. This change updates the test result to assume 64-bit integers.
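The kind of mismatch that the `_mm512_popcnt_epi64` fix addresses can be shown with plain integers: the same bits give different per-lane counts depending on whether they are read as one 64-bit lane or two 32-bit lanes. This is a portable sketch only. `popcount64` and `popcount32` are hypothetical helpers, not the intrinsic or the test's actual code.

```cpp
#include <cstdint>

// Count set bits by repeatedly clearing the lowest set bit.
constexpr unsigned popcount64(std::uint64_t v) {
  unsigned n = 0;
  while (v != 0) {
    v &= v - 1;
    ++n;
  }
  return n;
}

constexpr unsigned popcount32(std::uint32_t v) { return popcount64(v); }

// One 64-bit lane: high half 0x3 (two bits set), low half 0x1 (one bit set).
constexpr std::uint64_t kLane = 0x0000000300000001ULL;

// Read as a single 64-bit lane, the expected count is 3.
static_assert(popcount64(kLane) == 3, "64-bit lane");

// Read as two 32-bit lanes, the expected counts are 1 and 2 instead.
static_assert(popcount32(static_cast<std::uint32_t>(kLane)) == 1, "low half");
static_assert(popcount32(static_cast<std::uint32_t>(kLane >> 32)) == 2,
              "high half");
```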

We currently log every single test that we run in premerge. This leads to gigantic logs (200k+ lines on Linux) that can be difficult to parse through. Having an indicator of progress is nice, especially for the LLVM tests, but is not strictly necessary and not often used (I imagine). Having a progress indicator from lit that works in CI cases is on my TODO list. For the rare cases where someone does need to see the list of tests that run, the JUnit XML emitted by lit is available in the artifacts.

…lvm#152007) This allows not having the END CRITICAL directive in certain situations. Update semantic checks and symbol resolution.

…m#152466) `add_conformance_test` checks for libc and prints a warning if it is not found. However, this warning ends up being printed once for each test, spamming the cmake log. Moving it up to the folder cmake allows it to be reported only once.

llvm#152813) We'll remove the size estimator afterwards; this change is to get the `ml-*` build bots green after the aforementioned PR.

We never used the size estimator again after the initial DQN-based training. Should we want to again, we now have IR2Vec, which the old estimator was approximating in functionality.

Summary: Small fix that just ignores all the extra lanes if we're running the server from a platform that potentially has more.

Without linker relaxation enabled for a particular relocatable file or section (e.g., using .option norelax), the assembler will not generate R_RISCV_ALIGN relocations for alignment directives. This becomes problematic in a two-stage linking process:

```
ld -r a.o b.o -o ab.o // b.o is norelax. Its alignment information is lost in ab.o.
ld ab.o -o ab
```

When ab.o is linked into an executable, the preceding relaxed section (a.o's content) might shrink. Since there's no R_RISCV_ALIGN relocation in b.o for the linker to act upon, the `.word 0x3a393837` data in b.o may end up unaligned in the final executable.

To address the issue, this patch inserts NOP bytes and synthesizes an R_RISCV_ALIGN relocation at the beginning of a text section when the alignment >= 4.

For simplicity, when RVC is disabled, we synthesize an ALIGN relocation (addend: 2) for a 4-byte aligned section, allowing the linker to trim the excess 2 bytes.

See also https://sourceware.org/bugzilla/show_bug.cgi?id=33236

Pull Request: llvm#151639

…ation) (llvm#152028) Fix llvm#151080

This patch defines a couple of helper functions so that we can convert four loops to range-based for loops.

llvmgnsyncbot and others added 30 commits August 7, 2025 07:29

[gn build] Port 6897ca4

0b3ee20

[DAG] canCreateUndefOrPoison - add FP_EXTEND (llvm#152249)

6ce68d3

Fixes llvm#152141

[libclc] Add missing clc/lib/ptx-nvidiacl/SOURCES to CMAKE_CONFIGURE_…

d618c36

…DEPENDS (llvm#152431)

[clang][bytecode] Refactor Check* functions (llvm#152300)

c869ef6

... so we don't have to create Pointer instances when we don't need them.

[llvm][cmake] Remove version number from runtimes in PROJECTS warnings (

6cd6de5

llvm#152457) Judging from the reaction to llvm#152302, we are not ready to make this a fatal error. Remove the specific version number, and update the libc message to match the others' wording.

[AArch64] Allow splitting bitmasks for EOR/ORR. (llvm#150394)

565f707

This patch extends llvm#149095 for EOR and ORR. It uses a simple partition scheme to try to find two suitable disjoint bitmasks that can be used with EOR/ORR to reconstruct the original mask. Fixes: llvm#148987.

[DA][NFC] clang-format DependenceAnalysis (llvm#151505)

04196ba

To avoid noise in PRs such as in llvm#146383.

[LV] Auto-generate checks for sve-low-trip-count.ll.

47944d0

Auto-generate checks for llvm#151925. Also update some naming to make more consistent with other tests.

[Clang][NFC] Enumerate Clang ABI versions in a separate header file (l…

109040a

…lvm#151995) Make it easier for us to add ABI versions. Close llvm#144332

[llvm][MC][test] Disable many-instructons.s on 32-bit systems

246990d

Added by llvm#150846. Checks the size of a structure, which is only correct for 64-bit systems.

[Headers][X86] Update AVX/AVX512 float/double add/sub/mul/div/unpck i…

e1d6753

…ntrinsics to be used in constexpr (llvm#152435) Fixed llvm#152313 --------- Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>

[clang][bytecode][NFC] Remove unnecessary local variable (llvm#152468)

4784585

Desc is only used once and we can get that information from the Block as well.

[X86] avx512dq-builtins.c - add C/C++ and 32/64-bit test coverage

f24c50a

Inspired by llvm#152478

[flang][OpenMP] Make OpenMPCriticalConstruct follow block structure (l…

e368b53

…lvm#152007) This allows not having the END CRITICAL directive in certain situations. Update semantic checks and symbol resolution.

arsenm and others added 10 commits August 9, 2025 08:33

AMDGPU: Add missing static to cl::opt (llvm#152747)

0a0f077

[gn build] Port 2c4b876

96dd26c

[gn build] Port 4d3feae

8bfb54b

[flang][cuda] Set correct bind(c) name for __popc (llvm#152795)

1acb101

[libc] Fix server code when GPU is acting as the server

0c13988

Summary: Small fix that just ignores all the extra lanes if we're running the server from a platform that potentially has more.

[AVR] Fix Avr indvar detection and strength reduction (missed optimiz…

97f0ff0

…ation) (llvm#152028) Fix llvm#151080

[ADT] Use range-based for loops in SmallPtrSet.h (NFC) (llvm#152821)

92164fa

This patch defines a couple of helper functions so that we can convert four loops to range-based for loops.

fix

a5547e2

charithaintc closed this Aug 15, 2025

charithaintc deleted the sg_b_transpose_and_reduction_support_draft branch August 15, 2025 17:33

charithaintc restored the sg_b_transpose_and_reduction_support_draft branch August 15, 2025 17:33

charithaintc mentioned this pull request Aug 15, 2025

SIMT lowering support for SG transpose B case and row/col reductions. #1 #2

Open
