# Introduction to TT-Metal

## General Overview

**TT-Metalium** (AKA **TT-Metal** or **Metal**) is Tenstorrent's low-level SDK for programming Tensix processors. It sits at the foundation of Tenstorrent's software stack:

- **TT-Forge / TT-MLIR**: High-level compilation frameworks for deploying neural networks.
- **TTNN**: Library of kernels that implements common Machine Learning operations.
- **TT-Metalium (Metalium)**: ⬅ **This tutorial focuses here** - Low-level programming interface for Tensix hardware.
- **TT-LLK (Low Level Kernels)**: Hardware-specific kernel implementations.

Metal offers a set of abstractions between the host system (e.g., a desktop with an x86 CPU) and the device. Its most important capability is letting programmers write **C++ programs**, called **kernels**, and execute them on Tenstorrent hardware.

This is similar to CUDA in spirit (programming language + runtime), but with a different architectural model emphasizing explicit data movement and local memory.

## Introduction to TT-Metalium


For production use, refer to the official installation guide: [INSTALLING.md](https://github.com/tenstorrent/tt-metal/blob/main/INSTALLING.md)

The provided VMs should have a `tt-metal` folder with several directories such as:

```
- ttnn/ # TT-NN, contains library of operations and Kernels 
- tt_metal/ # Metalium-specific code, also contains TT-LLK as sub-repository
- tt_stl/ # Standard library
- tools/ # Debugging and profiling tools (e.g. tracy, triage)
- tests/ # Python scripts for internal testing
- docs/ # TT-Metalium and TT-NN Documentation
```

In [None]:
%%bash
cd tt-metal/
ls

## TT-Metalium Setup and Build

We can build TT-Metalium using the following command:

In [None]:
%%bash
./build_metal.sh --build-tests --debug -e --enable-profiler

In this case, we also include several build options, which do the following:

- `--build-tests`: builds the TT-Metalium and TT-NN tests (`tests/` directory).
- `--debug`: sets the build target to 'Debug' mode, without optimization flags for host code (kernel code is compiled separately).
- `-e`: exports the compile commands (`compile_commands.json`), which is useful for IDEs such as Visual Studio Code.
- `--enable-profiler`: enables the device profiler.

Building TT-Metalium from scratch takes a few minutes. 

## Testing

Before running any test, set the environment variables so that the correct Python scripts are used, and define the path to TT-Metalium.
The following commands assume that `tt-metal` (where TT-Metalium has been installed) is the current working directory.

In [None]:
%%bash
source python_env/bin/activate
export PYTHONPATH=$(pwd)
export TT_METAL_HOME=$(pwd)

Having built TT-Metalium and TT-NN, we can test basic operations such as matrix multiplication (matmul) with the following Python command:

In [None]:
%%bash
python -m pytest ./tests/ttnn/unit_tests/operations/matmul/test_matmul.py::test_tutorial_matmul

This command ensures that matmul operations execute correctly.
These tests use the pytest library, hence the `-m pytest` argument.

## Tenstorrent Programming Model Overview


### Architecture: A Grid of Specialized Cores

Tenstorrent chips are a massively parallel grid of interconnected cores, with different types of cores serving specialized functions:
- **Tensix Cores**: The main compute units (where our kernels run).
- **DRAM Banks**: Interface with off-chip memory (DRAM).
- **Ethernet Cores**: For multi-chip communication and scaling operations and models to multiple chips.
- **ARC/PCIe Cores**: Host I/O interfaces.

In this tutorial, we will implement kernels on Tenstorrent hardware, focusing on Tensix Cores.

![Wormhole Architecture](https://raw.githubusercontent.com/tenstorrent/tt-metal/refs/heads/main/docs/source/common/images/tenstorrent-wormhole-logical-noc-diagram.webp)

Each Tensix core contains the following components:
- **5 Baby RISC-V CPUs** that fetch instructions, determine control flow, and configure data movement. 
  - 2 Data Movement cores (RISC-V 0 and 1): Handle NoC (Network-on-Chip) transfers.
  - 3 Compute cores (Unpack, Math, Pack): Handle computation pipeline.

These Baby RISC-V cores interact with the following components on Tensix cores:
- **Matrix engine** (also known as **FPU**) - Specializes in matrix operations such as matrix multiplication and convolutions.
- **Vector engine** (also known as **SFPU**) - Computes general purpose operations like element-wise, activation functions, data shuffling, etc.
- **SRAM** (**L1**) - Local memory for each core (1.5MB for Wormhole and Blackhole). Unlike a CPU cache, this memory is managed manually by the programmer, similar to shared memory in CUDA.
- **2 NoC Interfaces** - Read and write data across the chip, and are responsible for inter-core communication.

A detailed introduction of the programming model can be found in the [Metalium Guide](https://github.com/tenstorrent/tt-metal/blob/main/METALIUM_GUIDE.md).

### Programming Philosophy: Bottom-Up Design

Operations on Tenstorrent hardware are typically designed from the **bottom up**:
1. We start from a kernel on a single Tensix core.
2. We schedule the kernel on multiple Tensix cores within the same chip. In some cases, this requires synchronizations between Tensix cores.
3. We scale the kernel to multiple devices. This is critical for the development of larger LLMs.

To become familiar with TT-Metalium, we'll start from the bottom and study kernels on a single Tensix core.

### Pipeline Dataflow Across Tensix Cores

The typical data flow within each Tensix core follows a **pipeline pattern** through three different kernels:
1. **Reader Kernel** (Data Movement): Reads data from DRAM into circular buffers. 
2. **Compute Kernel**: Processes data from circular buffers and performs computations.
3. **Writer Kernel** (Data Movement): Writes results from circular buffers back to DRAM.

![Kernel Pipeline](https://raw.githubusercontent.com/tenstorrent/tt-metal/refs/heads/main/docs/source/common/images/tenstorrent-circular-buffer-send-data-cross-kernel-or-itself.webp)

**Circular buffers** communicate and synchronize data between kernels.
They act as FIFO (First-In-First-Out) data structures, and enable the reader kernel to fetch new data while the compute kernel processes previous data.

This paradigm offers flexibility with data movement while allowing computation to overlap with data transfer.
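
To make the overlap concrete, here is a minimal host-side sketch of the Reader → Compute → Writer pattern in plain C++. It is a toy model, not the TT-Metalium API: `ToyCircularBuffer`, `run_toy_pipeline`, the capacity of 2 items, and the "double the value" compute step are all assumptions made for illustration. Each loop iteration lets every stage make progress if its buffers allow it, which is how a bounded FIFO lets the stages work on different items at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy bounded FIFO standing in for a circular buffer (hypothetical model,
// not the TT-Metalium API). Capacity is counted in items ("tiles").
class ToyCircularBuffer {
public:
    explicit ToyCircularBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }
    bool has_space() const { return items_.size() < capacity_; }  // producer may write
    bool has_data() const { return !items_.empty(); }             // consumer may read
    void push_back(std::uint32_t v) {
        assert(has_space());
        items_.push_back(v);
    }
    std::uint32_t pop_front() {
        assert(has_data());
        std::uint32_t v = items_.front();
        items_.pop_front();
        return v;
    }

private:
    std::size_t capacity_;
    std::deque<std::uint32_t> items_;
};

// Steps a reader, a compute stage and a writer cooperatively. On each loop
// iteration, every stage advances by at most one item if its buffers allow it.
std::vector<std::uint32_t> run_toy_pipeline(const std::vector<std::uint32_t>& dram_in) {
    ToyCircularBuffer cb_in(2), cb_out(2);
    std::vector<std::uint32_t> dram_out;
    std::size_t next_read = 0;
    while (dram_out.size() < dram_in.size()) {
        // Writer: drain one result if one is available.
        if (cb_out.has_data()) dram_out.push_back(cb_out.pop_front());
        // Compute: consume one input and produce one output (here: double it).
        if (cb_in.has_data() && cb_out.has_space()) cb_out.push_back(cb_in.pop_front() * 2);
        // Reader: fetch the next input if there is room in the input buffer.
        if (next_read < dram_in.size() && cb_in.has_space()) cb_in.push_back(dram_in[next_read++]);
    }
    return dram_out;
}
```

Calling `run_toy_pipeline({1, 2, 3})` returns the doubled values in their original order, which is the FIFO guarantee. On real hardware the three stages run concurrently on separate processors; the cooperative loop here only imitates that behavior.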

### Host and Device Components

Developing kernels on Tenstorrent hardware requires code written for both the **device** and **host**.

**1. Host code (C++)**:
- Runs on your CPU.
- Opens and initializes the device.
- Allocates memory buffers (DRAM and L1).
- Configures circular buffers.
- Compiles and loads kernels.
- Sets runtime arguments (e.g. tell the kernel where to read and where to write data).
- Enqueues kernels for execution.
- Retrieves results from the device to the host.

**2. Device code (C++ kernels)**:
- Runs on Tensix cores.
- Reader kernel: Fetches data from DRAM via the NoC (Network-on-Chip).
- Compute kernel: Performs operations using the Matrix and Vector engines (FPU and SFPU).
- Writer kernel: Writes results back to DRAM via the NoC.

This is similar to other parallel programming APIs like **CUDA**, **OpenCL**, or **SYCL**, where you have host orchestration code and device kernels.

### Typical Project Structure

By convention, TT-Metalium examples follow this file organization:

```
example/
├── example_host.cpp          # Host program
└── kernels/
    ├── dataflow/
    │   ├── reader.cpp        # Reader kernel (RISC-V 0)
    │   └── writer.cpp        # Writer kernel (RISC-V 1)
    └── compute/
        └── compute.cpp       # Compute kernel (Unpack/Math/Pack)
```

## Programming Examples: Adding Two Integers in RISC-V


Having presented the basic principles of the Tenstorrent programming model, we will now look at specific examples.
The most basic example of a Tenstorrent kernel is the addition of two integers on a single Baby RISC-V.

First, we build TT-Metalium with its programming examples:

In [None]:
%%bash
./build_metal.sh --build-tests --debug --build-programming-examples -e --enable-profiler

With these options, the programming examples are built in `build_Debug_tracy/programming_examples`.

Verify that the programming examples have been built by running:

In [None]:
%%bash
ls build_Debug_tracy/programming_examples/

The following programming examples will be displayed:

```
- contributed
- distributed
- metal_example_add_2_integers_in_compute
- metal_example_add_2_integers_in_riscv
- metal_example_custom_sfpi_add
- metal_example_custom_smoothstep
- metal_example_eltwise_binary
- metal_example_eltwise_sfpu
- ...
```

### Adding Integers on Baby RISC-V

As stated in previous sections, operations are designed around a Reader/Compute/Writer pipeline to maximize performance and make full use of the Matrix and Vector engines.
While Baby RISC-Vs are relatively slow and meant to coordinate data transfers, they still support arithmetic operations, and can be used to demonstrate a basic addition kernel such as the [following example](https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/programming_examples/add_2_integers_in_riscv/add_2_integers_in_riscv.md).

The `metal_example_add_2_integers_in_riscv` example shows:
- How to set up and execute a basic kernel with two inputs. 
- How to read data from DRAM on a reader kernel.
- The addition of two integers on the Baby RISC-V.
- How to write back output data to DRAM. 


### File Organization

Having introduced the general outline of our example, we can verify its file structure:

In [None]:
%%bash
ls tt_metal/programming_examples/add_2_integers_in_riscv/


This should give the following:
```
add_2_integers_in_riscv/
├── add_2_integers_in_riscv.cpp          # Host program
└── kernels/
    └── reader_writer_add_in_riscv.cpp   # Device kernel
```

In this case, we are using only 1 of the 5 Baby RISC-V cores, which means that we only need a single kernel.

This example can be run using the following command:

In [None]:
%%bash
./build_Debug_tracy/programming_examples/metal_example_add_2_integers_in_riscv

### Understanding the Program Structure

This simple example shows the key components of TT-Metalium programs. Let's break down the individual parts:


#### Host Program (`add_2_integers_in_riscv.cpp`)

The host program configures data structures and coordinates the entire operation:

1. **Device Setup**: Creates a 1x1 `MeshDevice` (description of a single hardware device) and gets a command queue for submitting work
2. **Buffer Allocation**: Creates 6 buffers total:
   - 3 DRAM buffers (src0, src1, dst)
   - 3 L1 buffers for temporary storage during computation
3. **Data Upload**: Writes input values (14 and 7) to DRAM buffers (Host -> Device)
4. **Kernel Creation**: Compiles the reader/writer kernel for a single Tensix core (core 0)
5. **Runtime Arguments**: Passes buffer addresses to the kernel
6. **Execution**: Enqueues the program and waits for completion
7. **Result Retrieval**: Reads back the result (21) from DRAM to host


#### Device Kernel (`kernels/reader_writer_add_in_riscv.cpp`)

The kernel runs on a Baby RISC-V core and performs three tasks:

```cpp
void kernel_main() {
    // 1. Get runtime arguments (buffer addresses)
    uint32_t src0_dram = get_arg_val<uint32_t>(0);
    uint32_t src1_dram = get_arg_val<uint32_t>(1);
    uint32_t dst_dram = get_arg_val<uint32_t>(2);
    uint32_t src0_l1 = get_arg_val<uint32_t>(3);
    uint32_t src1_l1 = get_arg_val<uint32_t>(4);
    uint32_t dst_l1 = get_arg_val<uint32_t>(5);
    // ... (elided here: computing the NoC addresses src0_dram_noc_addr, src1_dram_noc_addr
    // and dst_dram_noc_addr from the DRAM buffer addresses, e.g. with InterleavedAddrGen)

    // 2. Read data from DRAM into L1 (local SRAM)
    // Whereas DRAM is outside the chip, L1 memory is located inside the Tensix core
    // and has both lower latency and significantly higher bandwidth.
    noc_async_read(src0_dram_noc_addr, src0_l1, sizeof(uint32_t));
    noc_async_read(src1_dram_noc_addr, src1_l1, sizeof(uint32_t));
    noc_async_read_barrier();  // Wait for transfers to complete

    // 3. Perform addition on RISC-V core.
    // This reads the inputs from L1, computes the addition, and stores the result back to L1
    uint32_t* dat0 = (uint32_t*)src0_l1;
    uint32_t* dat1 = (uint32_t*)src1_l1;
    uint32_t* out0 = (uint32_t*)dst_l1;
    (*out0) = (*dat0) + (*dat1);  // 14 + 7 = 21
    
    // 4. Write result back to DRAM (operation is complete)
    noc_async_write(dst_l1, dst_dram_noc_addr, sizeof(uint32_t));
    noc_async_write_barrier();
}
```

**Some important concepts:**
- The **NoC (Network-on-Chip)** is the network of connections on the chip that are used to move (read/write) data from/to any location on the chip.
- Reads and writes are asynchronous: `noc_async_read` and `noc_async_write` do not block, which allows the core to perform computation in parallel. Barriers are then needed to ensure completion.
- **L1 staging**: The RISC-V core cannot access DRAM data directly and must first copy data to L1.
- **InterleavedAddrGen**: To maximize kernel performance, TT-Metalium uses specific data allocation patterns to place data in DRAM (e.g. interleaving memory across multiple banks). To ease the computation of DRAM addresses for the programmer, TT-Metalium provides helper classes such as **InterleavedAddrGen**.
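
The "issue now, complete at the barrier" behavior and the L1 staging step can be imitated on the host with a few lines of plain C++. This is a simplified sketch: `ToyNoc` and `toy_add_kernel` are hypothetical names, and the model deliberately defers every copy until the barrier, whereas real transfers progress in the background and are only guaranteed to be complete once the barrier returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy NoC: a transfer is queued when issued and performed at the barrier.
// (Hypothetical model, not the TT-Metalium API.)
class ToyNoc {
public:
    // Non-blocking: only records the copy that has to happen.
    void async_copy(const std::uint32_t* src, std::uint32_t* dst) {
        pending_.push_back([src, dst] { *dst = *src; });
    }
    // Blocking: returns once every queued transfer has been performed.
    void barrier() {
        for (auto& transfer : pending_) transfer();
        pending_.clear();
    }
    std::size_t in_flight() const { return pending_.size(); }

private:
    std::vector<std::function<void()>> pending_;
};

// Mirrors the kernel steps: stage inputs in "L1", add, write the result back.
std::uint32_t toy_add_kernel(std::uint32_t src0_dram, std::uint32_t src1_dram) {
    ToyNoc noc;
    std::uint32_t src0_l1 = 0, src1_l1 = 0, dst_l1 = 0, dst_dram = 0;

    noc.async_copy(&src0_dram, &src0_l1);  // DRAM -> L1
    noc.async_copy(&src1_dram, &src1_l1);  // DRAM -> L1
    assert(noc.in_flight() == 2);          // the L1 copies must not be read yet
    noc.barrier();                         // now both inputs are in L1

    dst_l1 = src0_l1 + src1_l1;            // compute on the staged L1 data

    noc.async_copy(&dst_l1, &dst_dram);    // L1 -> DRAM
    noc.barrier();                         // result is visible in DRAM
    return dst_dram;
}
```

With the inputs from the example, `toy_add_kernel(14, 7)` returns 21. Reading `src0_l1` before the first `barrier()` would still observe the initial value 0 in this model, which is the class of bug the barrier prevents.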



## From one Baby RISC-V to Big accelerator engines: adding two integers in Compute

Having looked at the configuration and execution of a kernel on a single Baby RISC-V processor, we will now look at how to use all 5 RISC-V processors in a pipeline.
Indeed, the Baby RISC-V cores are not highly performant by themselves.
For optimal performance, we want to:
1) Design a pipeline of operations to hide latency and increase throughput
2) Perform computation on the significantly more powerful vector and matrix engines. These can only be accessed from compute kernels.

The following example performs the same addition between two inputs. But instead of doing everything on a single Baby RISC-V processor, we will have three kernels:
- A reader kernel that reads data from DRAM
- A compute kernel that performs the addition
- A writer kernel that writes the result back to DRAM

A more in-depth description of this example can be found [here](https://github.com/tenstorrent/tt-metal/blob/main/tt_metal/programming_examples/add_2_integers_in_compute/add_2_integers_in_compute.md).
A similar example with a more complex operation can also be found [here](https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/tt_metal/examples/eltwise_binary.html).


Before checking the code, we can check the structure of this new example: 

In [None]:
%%bash
ls tt_metal/programming_examples/add_2_integers_in_compute/

```
add_2_integers_in_compute/
    - CMakeLists.txt
    - add_2_integers_in_compute.cpp
    - kernels/
        - dataflow/
            - reader_binary_1_tile.cpp
            - writer_1_tile.cpp
        - compute/
            - add_2_tiles.cpp
```

Unlike the previous example (`add_2_integers_in_riscv`), which only had a host file and a single kernel source file, we now have 4 source files:
- A single host file to configure and execute the kernels (`add_2_integers_in_compute.cpp`)
- Three kernels: a reader kernel (`reader_binary_1_tile.cpp`), a writer kernel (`writer_1_tile.cpp`) and a compute kernel (`add_2_tiles.cpp`)

Having checked the structure, we can test the program with the following command:

In [None]:
%%bash
./build_Debug_tracy/programming_examples/metal_example_add_2_integers_in_compute


### Kernel Structure 

Now let's understand the three-kernel architecture for the addition example. Unlike the previous RISC-V example, this one properly utilizes the compute engines (FPU/SFPU) for better performance.

#### Reader Kernel (`kernels/dataflow/reader_binary_1_tile.cpp`)

Here, the role of the reader kernel is to fetch both inputs from DRAM and send their data to the compute kernel.
To do this, we use `noc_async_read_tile()`, the tile-based counterpart of the `noc_async_read()` primitive from the previous `add_2_integers_in_riscv` example.
But unlike the single-kernel example, the reader puts data into **Circular Buffers**.

Two primitives are needed to write data into a circular buffer:
- `cb_reserve_back()`: Reserves a **tile** worth of space at the back of the circular buffer. If the buffer is full, it waits until the buffer has at least 1 tile worth of available space.
- `cb_push_back()`: Notifies the compute kernel that data has been written to the circular buffer and moves the tail of the circular buffer by 1 tile.

To copy the data to the reserved space in the circular buffer, we can use the address of the circular buffer (`get_write_ptr(cb)`) as destination for `noc_async_read_tile()`. Indeed, circular buffers are stored in L1, and their data can be written to.

As indicated here, both primitives use **tiles** rather than bytes or number of elements as their unit. This is for performance reasons: on one hand, communication through circular buffers has an overhead that we want to minimize, and on the other hand, the matrix and vector engines in the compute kernel use tiles as their unit of work.
In TT-Metalium, a tile typically contains `32*32 = 1024` elements.
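
As a quick worked example of the tile arithmetic, the sketch below computes how many bytes one 32×32 tile occupies for a given element size, and how many tiles are needed to hold a given number of elements. The helper names (`tile_size_bytes`, `tiles_needed`) are hypothetical, and rounding up to whole tiles is an assumption about how data that does not fill the last tile would be handled.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTileHeight = 32;
constexpr std::size_t kTileWidth = 32;
constexpr std::size_t kTileElems = kTileHeight * kTileWidth;  // 1024 elements

// Bytes occupied by one tile for a given element size.
constexpr std::size_t tile_size_bytes(std::size_t bytes_per_element) {
    return kTileElems * bytes_per_element;
}

// Number of tiles needed to hold num_elements (the last tile may be partly empty).
constexpr std::size_t tiles_needed(std::size_t num_elements) {
    return (num_elements + kTileElems - 1) / kTileElems;
}

// A tile of 32-bit integers occupies 4 KiB; a tile of 16-bit values occupies 2 KiB.
static_assert(tile_size_bytes(sizeof(std::uint32_t)) == 4096, "32-bit tile is 4096 bytes");
static_assert(tile_size_bytes(2) == 2048, "16-bit tile is 2048 bytes");
```

This also shows why a circular buffer sized for a handful of tiles fits comfortably in the 1.5MB of L1: even at 4 bytes per element, a tile takes only 4096 bytes.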


```cpp
void kernel_main() {
    // Read parameters from the kernel arguments
    uint32_t in0_addr = get_arg_val<uint32_t>(0); // DRAM address of input0 
    uint32_t in1_addr = get_arg_val<uint32_t>(1); // DRAM address of input1
    
    // Circular buffers to write into. 
    // Each circular buffer has an unique identifier. 
    constexpr uint32_t cb_in0 = tt::CBIndex::c_0;
    constexpr uint32_t cb_in1 = tt::CBIndex::c_1;
    
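    // Read the tiles from DRAM into the circular buffers.
    // Note: `in0` and `in1` are the address generators for the two DRAM buffers, built from
    // in0_addr and in1_addr; their construction is elided in this excerpt.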
    cb_reserve_back(cb_in0, 1);
    uint32_t cb_in0_addr = get_write_ptr(cb_in0);
    noc_async_read_tile(0, in0, cb_in0_addr);  // read
    noc_async_read_barrier();                  // wait until the read is done
    cb_push_back(cb_in0, 1);                   // mark the tile as ready.

    // same process for the second input (different circular buffer and input buffer)
    cb_reserve_back(cb_in1, 1);
    uint32_t cb_in1_addr = get_write_ptr(cb_in1);
    noc_async_read_tile(0, in1, cb_in1_addr);
    noc_async_read_barrier();
    cb_push_back(cb_in1, 1);
}
```


#### Writer Kernel (`kernels/dataflow/writer_1_tile.cpp`)

On the other end of the pipeline, the writer kernel is the mirrored version of the reader kernel.
Instead of reading from DRAM into a circular buffer, the writer kernel reads from the output circular buffer and writes to DRAM.

But while the reader uses the `cb_reserve_back()` and `cb_push_back()` primitives, the writer uses two other circular buffer primitives:
- `cb_wait_front()`: Waits until the circular buffer contains at least a tile worth of data. If the circular buffer is empty, it waits until the producer calls `cb_push_back()`.
- `cb_pop_front()`: Frees a tile from the circular buffer and moves the head by 1 tile. If the circular buffer was full and the producer was waiting in `cb_reserve_back()`, the producer is then able to reserve a tile and continue.

Just like with the reader, these two primitives are only used to synchronize data exchanges, here between the producer (compute kernel) and the consumer (writer kernel).
As the underlying data is stored in L1, it can be sent directly to DRAM using `noc_async_write_tile()`, with the read address of the circular buffer (`get_read_ptr(cb)`).

```cpp
void kernel_main() {
    uint32_t dst_addr = get_arg_val<uint32_t>(0);

    // The circular buffer that we are going to read from and write to DRAM
    constexpr uint32_t cb_out0 = tt::CBIndex::c_16;
    const uint32_t tile_size_bytes = get_tile_size(cb_out0);

    
    // Make sure there is a tile in the circular buffer
    cb_wait_front(cb_out0, 1);
    uint32_t cb_out0_addr = get_read_ptr(cb_out0);
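    // Write the tile to DRAM. Note: `dst` is the address generator for the output buffer,
    // built from dst_addr; its construction is elided in this excerpt.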
    noc_async_write_tile(0, dst, cb_out0_addr);
    noc_async_write_barrier();  // This will wait until the write is done. As an alternative,
                                // noc_async_write_flushed() can be faster because it waits
                                // until the write request is sent. In that case, you have to
                                // use noc_async_write_barrier() at least once at the end of
                                // data movement kernel to make sure all writes are done.
    // Mark the tile as consumed
    cb_pop_front(cb_out0, 1);
}
```
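
The four circular buffer primitives boil down to a small amount of bookkeeping: a count of pushed tiles, a count of popped tiles, and read/write addresses that wrap around the buffer. The sketch below models that bookkeeping in plain C++. It is a toy model under stated assumptions: `ToyCbState` and its members are hypothetical names, addresses are plain integers standing in for L1 addresses, the blocking calls are replaced by `can_*` predicates, and counter overflow is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of circular-buffer bookkeeping (hypothetical; the real
// implementation and its storage layout may differ).
class ToyCbState {
public:
    ToyCbState(std::uint32_t base_addr, std::uint32_t tile_bytes, std::uint32_t capacity_tiles)
        : base_(base_addr), tile_bytes_(tile_bytes), capacity_(capacity_tiles) {
        assert(capacity_tiles > 0);
    }

    std::uint32_t tiles_in_use() const { return pushed_ - popped_; }

    // Producer side: reserve needs free space, push publishes the tiles.
    bool can_reserve_back(std::uint32_t n) const { return tiles_in_use() + n <= capacity_; }
    std::uint32_t write_ptr() const { return base_ + (pushed_ % capacity_) * tile_bytes_; }
    void push_back(std::uint32_t n) {
        assert(can_reserve_back(n));
        pushed_ += n;
    }

    // Consumer side: wait needs published tiles, pop frees them for the producer.
    bool can_wait_front(std::uint32_t n) const { return tiles_in_use() >= n; }
    std::uint32_t read_ptr() const { return base_ + (popped_ % capacity_) * tile_bytes_; }
    void pop_front(std::uint32_t n) {
        assert(can_wait_front(n));
        popped_ += n;
    }

private:
    std::uint32_t base_;
    std::uint32_t tile_bytes_;
    std::uint32_t capacity_;
    std::uint32_t pushed_ = 0;  // total tiles published by the producer
    std::uint32_t popped_ = 0;  // total tiles released by the consumer
};
```

For a buffer of 2 tiles at base address `0x1000`, the write address goes `0x1000`, then `0x1000 + tile_bytes`, then wraps back to `0x1000`, while `can_reserve_back(1)` stays false until the consumer pops a tile. This is the situation in which a real `cb_reserve_back()` would block.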


#### Compute Kernel (`kernels/compute/add_2_tiles.cpp`)

The compute kernel is the most complex. Although written as a single file, it's **compiled into three separate binaries** that run on three different RISC-V cores within the Tensix:

```cpp
namespace NAMESPACE {
void MAIN {
    uint32_t n_tiles = get_arg_val<uint32_t>(0);
    
    constexpr auto cb_in0 = tt::CBIndex::c_0;
    constexpr auto cb_in1 = tt::CBIndex::c_1;
    constexpr auto cb_out0 = tt::CBIndex::c_16;
    constexpr uint32_t dst_reg = 0;
    
    // Initialize the compute engines (runs on all 3 processors)
    binary_op_init_common(cb_in0, cb_in1, cb_out0);  // Configure unpacker/packer
    add_tiles_init(cb_in0, cb_in1);                  // Configure FPU for addition

    for (uint32_t i = 0; i < n_tiles; i++) {
        cb_wait_front(cb_in0, 1);   // [Unpack processor] Wait for input data
        cb_wait_front(cb_in1, 1);
        
        tile_regs_acquire();        // [Math processor] Acquire Dst registers
        add_tiles(cb_in0, cb_in1, 0, 0, dst_reg);  // [Unpack+Math] Add tiles using FPU
        tile_regs_commit();         // [Math processor] Transfer Dst to packer
        
        cb_pop_front(cb_in0, 1);    // [Unpack processor] Release input tiles
        cb_pop_front(cb_in1, 1);
        
        cb_reserve_back(cb_out0, 1); // [Pack processor] Reserve output space
        tile_regs_wait();            // [Pack processor] Wait for Dst access
        pack_tile(dst_reg, cb_out0); // [Pack processor] Write Dst to circular buffer
        tile_regs_release();         // [Pack processor] Release Dst registers
        
        cb_push_back(cb_out0, 1);   // [Pack processor] Signal output ready
    }
}
}
```

**The three processors work together:**
1. **Unpack processor**: Manages input circular buffers and copies data into internal registers (`SrcA`/`SrcB`) for the Matrix engine (FPU)
2. **Math processor**: Controls the Matrix and Vector engines (FPU and SFPU), which write their results to internal registers (`Dst` registers).
3. **Pack processor**: Takes results from Matrix and Vector engines and writes them to output circular buffers
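
One way to picture "one source file, three binaries" is to list which calls of the loop body each processor actually executes. The sketch below does this in plain C++, using a template parameter as a stand-in for the per-processor compilation. The `Role` enum and `calls_for_role` are hypothetical; the assignment of calls to processors is taken from the comments in the kernel above, and the real build selects code per processor by other means.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class Role { Unpack, Math, Pack };

// Returns the calls that the binary built for role R would execute during
// one loop iteration of the addition kernel (toy model of the source above).
template <Role R>
std::vector<std::string> calls_for_role() {
    const bool unpack = (R == Role::Unpack);
    const bool math = (R == Role::Math);
    const bool pack = (R == Role::Pack);

    std::vector<std::string> calls;
    if (unpack)         calls.push_back("cb_wait_front");
    if (math)           calls.push_back("tile_regs_acquire");
    if (unpack || math) calls.push_back("add_tiles");
    if (math)           calls.push_back("tile_regs_commit");
    if (unpack)         calls.push_back("cb_pop_front");
    if (pack)           calls.push_back("cb_reserve_back");
    if (pack)           calls.push_back("tile_regs_wait");
    if (pack)           calls.push_back("pack_tile");
    if (pack)           calls.push_back("tile_regs_release");
    if (pack)           calls.push_back("cb_push_back");

    assert(!calls.empty());  // every role has some work to do
    return calls;
}
```

For instance, `calls_for_role<Role::Math>()` yields `tile_regs_acquire`, `add_tiles`, `tile_regs_commit`: the Math binary never touches a circular buffer, which is why the kernel reads as one sequential program even though three processors execute different slices of it.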

Because the same code is executed by three different Baby RISC-V processors, several synchronization primitives are required to avoid data races on internal registers:
- `tile_regs_acquire()`: Math and Unpack processors claim ownership of the internal registers (`Dst` registers) for computation
- `tile_regs_commit()`: Math and Unpack processors release ownership of the registers
- `tile_regs_wait()`: Pack processor waits until the internal registers are ready
- `tile_regs_release()`: Pack processor releases ownership of the internal registers
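
The four `tile_regs_*` calls can be read as a small ownership protocol over the `Dst` registers. The following plain C++ sketch models it as a state machine. It is a simplified illustration: `ToyDstRegs`, `DstState` and `toy_add_and_pack` are hypothetical names, a single integer stands in for the register contents, and out-of-order calls trip an assertion instead of blocking as they would on hardware.

```cpp
#include <cassert>
#include <cstdint>

// Ownership states of the Dst registers in this toy model.
enum class DstState { Free, OwnedByMath, Committed, OwnedByPack };

class ToyDstRegs {
public:
    // Math side
    void acquire() { assert(state_ == DstState::Free); state_ = DstState::OwnedByMath; }
    void write(std::uint32_t v) { assert(state_ == DstState::OwnedByMath); value_ = v; }
    void commit() { assert(state_ == DstState::OwnedByMath); state_ = DstState::Committed; }

    // Pack side
    void wait() { assert(state_ == DstState::Committed); state_ = DstState::OwnedByPack; }
    std::uint32_t read() const { assert(state_ == DstState::OwnedByPack); return value_; }
    void release() { assert(state_ == DstState::OwnedByPack); state_ = DstState::Free; }

    DstState state() const { return state_; }

private:
    DstState state_ = DstState::Free;
    std::uint32_t value_ = 0;
};

// One iteration of the handshake: math produces a value, pack consumes it.
std::uint32_t toy_add_and_pack(ToyDstRegs& dst, std::uint32_t a, std::uint32_t b) {
    dst.acquire();      // tile_regs_acquire()
    dst.write(a + b);   // add_tiles(...) writes its result to Dst
    dst.commit();       // tile_regs_commit()

    dst.wait();         // tile_regs_wait()
    const std::uint32_t packed = dst.read();  // pack_tile(...) reads Dst
    dst.release();      // tile_regs_release()
    return packed;
}
```

After `toy_add_and_pack(dst, 14, 7)` the function returns 21 and the registers are back in the `Free` state, ready for the next loop iteration. Skipping `commit()` before `wait()` would fail the assertion, which corresponds to the packer reading registers that the math side has not finished writing.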

Moreover, the unpack and pack processors re-use the aforementioned circular buffer primitives to receive inputs from the reader kernel and send outputs to the writer kernel.

**Why initialize?**
In our example, the element-wise addition is performed by the Matrix engine. 
But before we call the operation, we need to configure the Matrix engine as well as the packer and unpackers. 

To do this, we use the following 'init' functions, which tell the hardware how to interpret and process the data:
- `binary_op_init_common()`: Configures the unpacker/packer for the data formats in the circular buffers
- `add_tiles_init()`: Configures the Matrix engine to perform addition (vs. multiply, etc.)

Having set up the Matrix engine and unpacker/packer, we can perform addition using `add_tiles()`. 
This reads one tile from each of the two input circular buffers, adds them together, and writes the result to an internal register (the 0-th `Dst` register).

### Summary

In this section, you learned:
- How to set up a data-processing pipeline: Reader → Compute → Writer working in parallel via circular buffers
- How to use the Circular Buffer API: `cb_reserve_back`/`cb_push_back` (producer) and `cb_wait_front`/`cb_pop_front` (consumer)
- How compute kernels work: One source file → three binaries (Unpack/Math/Pack cores)
- How internal registers of compute kernels are managed and synchronized using `tile_regs_*` functions
- About tiles, which are 32×32 grids and serve as the fundamental unit of computation in TT-Metalium kernels
- How to set up the Matrix engine to perform basic element-wise operations

## Exercise: Implementation of a Kernel

Implementing a model requires many operations, and just as many kernels. To simplify this work, TT-NN provides a wide range of operations.
However, it is sometimes useful to implement a kernel yourself (if it does not exist yet, or to tune it for a specific use case).

In this section, we invite you to re-implement an element-wise kernel that computes the hypotenuse of two inputs: `sqrt(a^2 + b^2)`.
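
When working on the exercise, it helps to have a host-side reference to compare the device output against. Below is a minimal sketch in plain C++. The function names `hypot_reference` and `all_close` are hypothetical, and the use of `float` with a tolerance-based comparison is an assumption, since device data formats usually have less precision than the host's floating-point types.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise reference: out[i] = sqrt(a[i]^2 + b[i]^2).
std::vector<float> hypot_reference(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]);
    }
    return out;
}

// True if every element of x is within a relative tolerance of y
// (absolute tolerance for expected values smaller than 1).
bool all_close(const std::vector<float>& x, const std::vector<float>& y, float tol) {
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float scale = std::fmax(1.0f, std::fabs(y[i]));
        if (std::fabs(x[i] - y[i]) > tol * scale) return false;
    }
    return true;
}
```

For example, `hypot_reference({3.0f, 5.0f}, {4.0f, 12.0f})` gives 5 and 13. In the host program, the values read back from the device can be passed to `all_close` together with this reference, using a tolerance suited to the data format in use.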

To make things easier, we can copy the previous programming example to a new directory.

In [None]:
%%bash
cp -r tt_metal/programming_examples/add_2_integers_in_compute/ tt_metal/programming_examples/my_first_binary_kernel/

To be able to compile this example, we add it to `tt_metal/programming_examples/CMakeLists.txt`:

```
...
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/custom_sfpi_add)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/custom_sfpi_smoothstep)
> add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/my_first_binary_kernel)
```

Then rename the target in `tt_metal/programming_examples/my_first_binary_kernel/CMakeLists.txt`, replacing `metal_example_add_2_integers_in_compute` with `metal_my_first_binary_kernel`.

We can then recompile TT-Metalium with this new example.

In [None]:
%%bash
./build_metal.sh --build-tests --debug --build-programming-examples -e --enable-profiler


To implement `hypot`, we only need to modify the compute kernel to replace addition with hypotenuse computation.


```cpp
#include <cstdint>
#include "compute_kernel_api/eltwise_binary.h"
#include "compute_kernel_api/tile_move_copy.h"

namespace NAMESPACE {
void MAIN {
    constexpr auto cb_in0 = tt::CBIndex::c_0;
    constexpr auto cb_in1 = tt::CBIndex::c_1;
    constexpr auto cb_out0 = tt::CBIndex::c_16;

    // The following sequence of operations is compiled onto the 3 compute cores (Unpack, Math, Pack) in the Tensix
    // core. They work together to process the two input tiles and store the result in the output circular buffer,
    // where it is picked up by the writer kernel and written back to DRAM.

    // Metalium API Calls                              Involved Processors
    // TODO: Use matching init
    // ...

    // wait for a tile to be ready in the input CBs
    cb_wait_front(cb_in0, 1);  // Unpack
    cb_wait_front(cb_in1, 1);  // Unpack

    // acquire 8 tile registers to perform the addition
    tile_regs_acquire();  // Math

    // ---- TODO ----
    // TODO: Hypotenuse computation goes here
    // ...
    // ---- END of TODO ----

    // signal the packer
    tile_regs_commit();  // Math

    // packer waits here
    tile_regs_wait();  // Pack
    // Copy the result from tile registers to the
    // output circular buffer (also called packing)
    pack_tile(0, cb_out0);  // Pack
    // packer releases
    tile_regs_release();  // Pack

    cb_pop_front(cb_in0, 1);  // Unpack
    cb_pop_front(cb_in1, 1);  // Unpack

    cb_push_back(cb_out0, 1);  // Pack
}
}  // namespace NAMESPACE
```

A full description of the available Kernel APIs, such as `add_tiles()`, can be found [here](https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/tt_metal/apis/kernel_apis.html).

__Hint:__ Useful functions include `add_binary_tile()` (SFPU version), `square_tile()`, `sqrt_tile()` and `copy_tile()`, which all operate on `Dst` registers.

## Useful Resources

### Related Repositories

- **tt-isa-documentation**: Documentation of Tenstorrent's hardware-specific Open-Source Instruction Set Architecture  
  https://github.com/tenstorrent/tt-isa-documentation/

- **tt-metal**: Main repository with examples and documentation  
  https://github.com/tenstorrent/tt-metal

- **tt-forge**: High-level neural network compiler framework  
  https://github.com/tenstorrent/tt-forge-fe

### Key Documentation

- **Metalium Guide**: Comprehensive architecture and programming guide: [METALIUM_GUIDE.md](https://github.com/tenstorrent/tt-metal/blob/main/METALIUM_GUIDE.md)
- **API Reference**: https://docs.tenstorrent.com/tt-metal/latest/tt-metalium/tt_metal/apis/
- **Technical Reports**: Performance studies and architecture deep-dives in `tt-metal/tech_reports/`
  - Matrix multiplication performance (GEMM_FLOPS)
  - Convolution networks (CNNs)
  - Attention mechanisms (FlashAttention)
  - Multi-chip scaling (TT-Fabric)

### Programming Examples

The `tt_metal/programming_examples/` directory contains many useful examples:
- **hello_world_***: Minimal examples for data movement and compute
- **matmul_***: Single-core and multi-core matrix multiplication
- **eltwise_***: Element-wise operations (binary, SFPU)
- **custom_sfpi_***: Custom vector operations on the SFPU
- **distributed/**: Multi-chip programming examples

### Debugging Tools

**Print debugging**: Use `DPRINT` macro in kernels
```cpp
DPRINT << "Value: " << my_value << "\n";
```
Then set the environment variable to see output:
```bash
export TT_METAL_DPRINT_CORES=0,0  # Print from core (0,0)
```

**Tracy profiler**: Enable profiling during build
```bash
./build_metal.sh --build-tests --release -e --enable-tracy
```

### Getting Help

- **Documentation**: https://docs.tenstorrent.com/
- **GitHub Issues**: https://github.com/tenstorrent/tt-metal/issues
- **Discord Community**: Join the Tenstorrent Discord for community support

