This repository fast-follows the NVIDIA CUTLASS repository, adding SYCL support for Intel GPUs.
For SYCL build and setup instructions, refer to the SYCL build documentation.
CUTLASS-SYCL is a modular, header-only C++ template framework for high-performance GEMM and fused epilogue kernels. It applies hierarchical tiling, composable policy abstractions, and efficient data-movement primitives to build flexible, reusable building blocks for dense linear algebra. The SYCL implementation brings these optimizations to Intel GPUs with kernels tuned for modern execution units and memory hierarchies, and adds mixed-precision and epilogue-fusion pathways designed to simplify integrating advanced quantization and post-processing into custom pipelines.
To support a wide variety of applications, CUTLASS-SYCL provides extensive support for mixed-precision computation on Intel hardware, with specialized data-movement and multiply-accumulate abstractions for FP64, FP32, FP16, BF16, 8-bit floating-point types (E5M2 and E4M3 for FP8), narrow integer types (4- and 8-bit signed and unsigned integers with support for zero-point quantization), and mixed-precision operations with tensor-wise, channel-wise, and group-wise quantization. CUTLASS-SYCL demonstrates optimal matrix-multiply operations targeting Intel's programmable, high-throughput execution units implemented in the Intel Data Center GPU Max/Flex Series (Intel Xe architecture, codename Ponte Vecchio) and Intel Arc B580 GPUs.
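As a rough illustration of these data types, the sketch below shows the numeric types CUTLASS exposes for mixed-precision work. It is a host-only example under the assumption that the header-only `include/` directory compiles standalone with a C++17 compiler; it is not tied to any particular GEMM configuration.

```cpp
// Sketch: CUTLASS numeric types commonly used in mixed-precision kernels.
// Assumes only the header-only include/ directory is on the include path.
#include <cutlass/numeric_types.h>
#include <cstdio>

int main() {
  cutlass::half_t       h  = cutlass::half_t(1.5f);       // FP16
  cutlass::bfloat16_t   bf = cutlass::bfloat16_t(1.5f);   // BF16
  cutlass::float_e4m3_t e4 = cutlass::float_e4m3_t(0.5f); // FP8 (E4M3)
  cutlass::float_e5m2_t e5 = cutlass::float_e5m2_t(0.5f); // FP8 (E5M2)
  cutlass::int4b_t      i4 = cutlass::int4b_t(3);         // 4-bit signed integer

  std::printf("half=%f bf16=%f e4m3=%f e5m2=%f int4=%d\n",
              float(h), float(bf), float(e4), float(e5), int(i4));
  return 0;
}
```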
See the Quick Start Guide to get started quickly.
See the functionality docs for a more comprehensive list of kernel-level features, data types, and instructions supported by CUTLASS-SYCL on each GPU architecture, along with the minimum requirements for each.
Base NVIDIA CUTLASS Versions for CUTLASS-SYCL releases:
CUTLASS SYCL | NVIDIA CUTLASS |
---|---|
0.1 | 3.9 |
0.2 | 3.9.2 |
0.3 | 3.9.2 |
0.5 | 4.2.0 |
- Xe Rearchitecture (#477): Complete redesign of Xe CuTe atoms with new architecture
  - New MMA atoms for improved performance
  - Enhanced 2D copy atoms (loads, stores, prefetch with VNNI/transpose support)
  - New 2D copy helpers (low-level `make_block_2d_copy` and high-level `make_block_2d_copy_{A,B,C}`)
  - Generic and optimized reorder atoms for {int4, uint4, int8, uint8, e2m1, e4m3, e5m2} -> {half, bfloat16}
  - Requires IGC version v2.18.5 or later
- G++ Host Compiler Support (#490): Support for G++ 13 as host compiler
  - Migrated `syclcompat` to this repository as `cutlasscompat` for better compatibility
  - Fixed compilation issues when using G++ instead of clang++
  - Added new CI workflow for testing G++ host compiler builds
  - Enhanced build system to support the `-DDPCPP_HOST_COMPILER=g++` option
- Grouped GEMM for Mixed Dtype (#457): Extended grouped GEMM support to mixed-precision operations
  - Added support for BF16 + S8 mixed dtype grouped GEMM
  - Added support for FP16 + U4 mixed dtype grouped GEMM
  - New examples: `10_bmg_grouped_gemm_bf16_f16_s8.cpp` and `10_bmg_grouped_gemm_f16_u4.cpp`
See the CHANGELOG for details of all past releases and updates.
CUTLASS-SYCL supports the newly introduced core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe in CUTLASS-SYCL is a collection of C++ SYCL template abstractions for
defining and operating on hierarchically multidimensional layouts of threads and data.
CuTe provides `Layout` and `Tensor` objects that compactly package the type,
shape, memory space, and layout of data, while performing the complicated indexing for the user.
This lets programmers focus on the logical descriptions of their algorithms while
CuTe does the mechanical bookkeeping for them. With these tools, we can quickly design,
implement, and modify all dense linear algebra operations.
The core abstractions of CuTe are hierarchically multidimensional layouts which can be composed with data arrays to represent tensors. The representation of layouts is powerful enough to represent nearly everything we need to implement efficient dense linear algebra. Layouts can also be combined and manipulated via functional composition, on which we build a large set of common operations such as tiling and partitioning.
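As a small host-side sketch of these abstractions (assuming the `cute/` headers compile standalone with a C++17 compiler; the shapes and names here are purely illustrative), a `Layout` and a `Tensor` can be built and composed like this:

```cpp
// Sketch: building and inspecting a CuTe Layout and Tensor on the host.
#include <cute/tensor.hpp>

int main() {
  using namespace cute;

  // An 8x4 column-major layout: shape (8,4), strides (1,8).
  auto layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                            make_stride(Int<1>{}, Int<8>{}));

  // Wrap a plain array with that layout to obtain a Tensor view.
  float data[32];
  auto tensor = make_tensor(&data[0], layout);
  tensor(2, 3) = 1.0f;   // CuTe performs the index-to-offset mapping.

  // Layouts can be tiled/partitioned via functional composition,
  // e.g. splitting the 8x4 layout into 4x2 tiles.
  auto tiled = zipped_divide(layout, make_shape(Int<4>{}, Int<2>{}));

  print(layout); print("\n");
  print(tiled);  print("\n");
  return 0;
}
```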
CUTLASS-SYCL adopts CuTe throughout the GEMM hierarchy in its templates. This greatly simplifies the design and improves code composability and readability. More documentation specific to CuTe can be found in its dedicated documentation directory.
Minimum requirements:
- Architecture: Intel Data Center GPU Max Series (codename: Ponte-Vecchio)
- Compiler: Must support at least C++17
- DPC++ Compiler Version: oneAPI 2025.1 and onwards
- Intel Compute Runtime: 25.13 (with Intel Graphics Compiler 2.10.10)
CUTLASS-SYCL runs successfully on the following Intel GPUs.
GPU | Intel GPU Architecture |
---|---|
Intel Data Center GPU Max Series | Xe-HPC |
Intel Arc GPU B580 Graphics | Xe2 |
We regularly test the following setups in CI.
Platform | Operating System | DPC++ Compiler | G++ | Intel Compute Runtime | Intel Graphics Compiler |
---|---|---|---|---|---|
Xe-HPC | Ubuntu 22.04 | 2025.2+ | G++13 | 25.18 | 2.11 |
Xe2 | Ubuntu 25.04 | 2025.2+ | G++13 | 25.35 | 2.18 |
The target architecture information is passed to CUTLASS-SYCL via the CMake flag `DPCPP_SYCL_TARGET`.

cmake .. -DDPCPP_SYCL_TARGET="intel_gpu_pvc"

or

cmake .. -DDPCPP_SYCL_TARGET="intel_gpu_bmg_g21"
Please refer to the functionality documentation for details on which kernels require which target architectures.
CUTLASS is described in the following documents and the accompanying Doxygen documentation.
- Quick Start Guide - basics of building and running CUTLASS
- Functionality - summarizes functionality available in CUTLASS
- Efficient GEMM in CUDA - describes how GEMM kernels may be implemented efficiently in CUDA
- CUTLASS 3.x Design - describes the CUTLASS 3.x design, its benefits, and how CuTe enables us to write much more composable components
- GEMM API 3.x - describes the CUTLASS 3.x GEMM model and C++ template concepts
- Implicit GEMM Convolution - describes 2-D and 3-D convolution in CUTLASS
- Code Organization - describes the organization and contents of the CUTLASS project
- Terminology - describes terms used in the code
- Programming Guidelines - guidelines for writing efficient modern CUDA C++
- Fundamental types - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
- Layouts - describes layouts of matrices and tensors in memory
- Tile Iterators - describes C++ concepts for iterating over tiles of matrices in memory
- CUTLASS Utilities - additional templates used to facilitate rapid development
CUTLASS-SYCL is a header-only template library and does not need to be built to be used by other
projects. Client applications should target CUTLASS-SYCL's `include/` directory in their include paths.
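For instance, a client translation unit that only adds `include/` to its include path can use the header-only types directly; the following is a minimal sketch (no libraries are linked, and the build flags discussed below still apply for device code):

```cpp
// Sketch: a client translation unit that only needs CUTLASS-SYCL's include/
// directory on its include path; nothing has to be pre-built or linked.
#include <cutlass/cutlass.h>
#include <cstdio>

int main() {
  cutlass::Status status = cutlass::Status::kSuccess;
  std::printf("CUTLASS status: %s\n", cutlass::cutlassGetStatusString(status));
  return 0;
}
```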
CUTLASS-SYCL unit tests, examples, and utilities can be built with CMake. The minimum version of CMake is given in the Quickstart guide. Make sure you have Intel oneAPI DPC++ compiler installed and the environment is properly set up.
$ source /opt/intel/oneapi/setvars.sh
Create a build directory within the CUTLASS-SYCL project, then run CMake. You need to specify
the target Intel GPU architecture using the `DPCPP_SYCL_TARGET` flag.
For Intel Data Center GPU Max Series (Ponte Vecchio), use `intel_gpu_pvc`.
For Intel Arc GPU B580 Graphics, use `intel_gpu_bmg_g21`.
$ mkdir build && cd build
$ CC=icx CXX=icpx cmake .. -G Ninja -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET="intel_gpu_pvc" # compiles for Intel Data Center GPU Max Series
Or for Intel Arc GPU B580 Graphics:
$ CC=icx CXX=icpx cmake .. -G Ninja -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_SYCL_TARGET="intel_gpu_bmg_g21" # compiles for Intel Arc GPU B580 Graphics
To compile with G++ as the host compiler, add the flag `-DDPCPP_HOST_COMPILER=g++-13` to the cmake command. Please note that the build system must be able to find `g++-13` in your PATH.
$ CC=icx CXX=icpx cmake .. -G Ninja -DCUTLASS_ENABLE_SYCL=ON -DDPCPP_HOST_COMPILER=g++-13 -DDPCPP_SYCL_TARGET="intel_gpu_bmg_g21" # compiles for Intel Arc GPU B580 Graphics with G++ as host compiler
From the `build/` directory, compile and run the CUTLASS-SYCL unit tests by building the target `test_unit` with make.
The unit tests are organized as several binaries mirroring the top-level namespaces of CUTLASS-SYCL,
and they may be executed in parallel via make's `-j` command-line argument.
$ make test_unit -j
...
...
...
[----------] Global test environment tear-down
[==========] XXX tests from YY test cases ran. (ZZZZ ms total)
[ PASSED ] XXX tests.
All tests should pass on supported Intel GPU platforms, though the exact number of tests may vary over time.
CUTLASS-SYCL is arranged as a header-only library along with Utilities, Tools, Examples, and unit tests.
A detailed explanation of the source code organization may be found in the CUTLASS-SYCL documentation, but several main components are summarized below.
include/ # client applications should target this directory in their build's include paths
cutlass/ # SYCL Templates for Linear Algebra Subroutines and Solvers - headers only
arch/ # direct exposure of Intel GPU architecture features (including instruction-level GEMMs)
conv/ # code specialized for convolution on Intel GPUs
epilogue/ # code specialized for the epilogue of gemm/convolution using SYCL
gemm/ # code specialized for general matrix product computations with SYCL
layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory
platform/ # SYCL-capable Standard Library components for Intel GPUs
reduction/ # bandwidth-limited reduction kernels optimized for Intel GPU architectures
thread/ # SYCL workgroup and subgroup code for Intel GPU execution units
transform/ # code specialized for layout, type, and domain transformations using SYCL
* # core vocabulary types, containers, and basic numeric operations
cute/ # CuTe Layout, layout algebra, MMA/Copy atoms, tiled MMA/Copy for SYCL
algorithm/ # Definitions of core operations such as copy, gemm, and operations on cute::tuples
arch/ # Intel GPU architecture wrapper structs for copy and math instructions
atom/ # Meta-information for Intel GPU operators and SYCL kernels
mma_atom.hpp # cute::Mma_Atom and cute::TiledMma for Intel GPU architectures
copy_atom.hpp # cute::Copy_Atom and cute::TiledCopy optimized for SYCL
*xe*.hpp # Intel Xe architecture specific meta-information for copy and math operations
* # Core library types such as Shape, Stride, Layout, Tensor, and associated operations
examples/ # CUTLASS SDK examples apply CUTLASS templates to implement basic computations
tools/
library/ # CUTLASS-SYCL Instance Library - contains instantiations of all supported CUTLASS-SYCL templates
include/
cutlass/
library/
profiler/ # CUTLASS Profiler - SYCL support not yet available
# (command-line utility for executing operations)
util/ # CUTLASS-SYCL Utilities - contains numerous helper classes for
include/ # managing tensors in Intel GPU device memory, reference
cutlass/ # implementations for SYCL GEMM, random initialization
util/ # of tensors, and I/O for Intel GPU environments.
The `test/unit/` directory consists of unit tests implemented with Google Test that demonstrate
basic usage of Core API components and complete tests of the CUTLASS GEMM computations.
Instructions for building and running the unit tests are described in the Quickstart guide.
CUTLASS-SYCL is released by Intel Corporation as Open Source software under the 3-clause "New" BSD license.
The official list of CUTLASS-SYCL developers and contributors is available here: CONTRIBUTORS.
Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2025 Intel Corporation. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.