
dsl-learn/cutile-learn

cuTile learning

Tutorials

Benchmark

All benchmarks were run with Torch 2.9.1, Triton 3.5.1, cuTile (cuda-tile) 1.0.0, and tileiras, using CUDA compilation tools 13.1 (V13.1.80).
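When comparing your own numbers against these results, it helps to record which package versions you actually ran. The snippet below is a minimal sketch, not part of this repository: it assumes the pip distribution names are `torch`, `triton`, and `cuda-tile` (the last taken from the parenthetical above), and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names are assumptions based on the versions listed above.
PACKAGES = ("torch", "triton", "cuda-tile")


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

The CUDA compilation tools version is not covered by this sketch; `nvcc --version` reports it separately.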

Currently, I only have results from an RTX 5090 (sm_120); the data is in benchmark/5090. Contributions from Blackwell B200 (sm_100) users are very welcome!

5090 Transformers Inference

This benchmark uses NVIDIA/TileGym/tree/main/modeling/transformers; the profile data is in the profile-data repository.

Transformers Inference

5090 attention fwd

5090 attention

5090 softmax

softmax-performance

5090 layer norm

5090-layer-norm

5090 matmul

5090 matmul

My Zhihu articles

How would you evaluate cuTile? (BobHuang's answer)

A brief analysis of the cuTile execution flow

Documents

GitHub repositories

NVIDIA/cutile-python

NVIDIA/TileGym

YouTube videos

Deep Dive: How to Use cuTile Python

THE FUTURE IS TILED: using cuTile and CUDA Tile IR to write portable, high-performance GPU Kernels
