Skip to content

geohot/tt-twitch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Tenstorrent example from twitch

Grayskull has 120 cores in a grid.

See https://private-user-images.githubusercontent.com/3885633/313425522-6ea0cefc-6109-4579-8470-7a620f45b314.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTA2MjI3MzQsIm5iZiI6MTcxMDYyMjQzNCwicGF0aCI6Ii8zODg1NjMzLzMxMzQyNTUyMi02ZWEwY2VmYy02MTA5LTQ1NzktODQ3MC03YTYyMGY0NWIzMTQucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI0MDMxNiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNDAzMTZUMjA1MzU0WiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MzVkNWExZTIzYzFkMTNlMGIzZTk3M2JhZDEyMjQ4MDU4OTg0YWRkZGM1MmRiZTYyZTBlOWI1YjE5ZTk0ZDQ0ZiZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QmYWN0b3JfaWQ9MCZrZXlfaWQ9MCZyZXBvX2lkPTAifQ.GzHj-U062blQf_DaBHeAUNkl89GeSwMx9cNJaqSeXQ0

Each core has 5 RISC-V processors

# TRISC0, TRISC1, TRISC2 = 3 compute kernels, UNPACK, MATH, PACK respectively
# NCRISC (1) = read kernel (ex: reader_binary_diff_lengths)
# BRISC (0) = write kernel (ex: writer_unary)

The driver has two parts, kmd and umd (user mode driver)
Debug with TT_METAL_LOGGER_LEVEL=1

There's low level dispatch firmware running by default on all the cores.
* brisc.cc, ncrisc.cc, trisc.cc

There's a command queue (impl/dispatch/command_queue.cpp)
* EnqueueProgram
* EnqueueWriteBuffer
* EnqueueReadBuffer

The command queue is entirely software defined, there's a producer running on (7,11) and a consumer running on (1,11)

You can manually dispatch kernels to specific cores (in the command queue even).
Each core has 1MB of L1, and this can be used for either CircularBuffer or long standing L1

# math
"Grayskull supports full fp16/BFLOAT at 92 TFLOPs;"
766 GFLOPS/core, 2048 muladds / clock. 1300 MHz
block-based 8-bit floating point at 368 FLOPs.

120*2048*1300 MFLOPS = 320 TFLOPS (fp8, half that for bf16)
32x32 MAC = 32*32*2 = 2048

bfloat16 accumulator in grayskull
FP32 accumulator in wormhole

See matmul.h. Math is custom RISC-V instructions (ckernel_ops.h)

About

tenstorrent kernel from twitch

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published