## TTNN

### Introduction
You don’t need to be proficient in C++ or think about circular buffers to program Tenstorrent devices, just like you don’t need to know CUDA to use PyTorch or C to use NumPy!

Meet ttnn - a library for tensor manipulation.

### Source

Like the rest of the Tenstorrent stack, it is open source!

You can find the code [here](https://github.com/tenstorrent/tt-metal/tree/main/ttnn)

### Docs
You can also find more documentation and API reference [here](https://docs.tenstorrent.com/tt-metal/latest/ttnn/index.html)

## Usage

TTNN's internals are implemented in C++, but the main way to interact with it is from Python.

The library is already installed in your environments!

In [None]:
import ttnn

Start by opening the Tenstorrent device you will be using

In [None]:
device = ttnn.open_device(device_id=0)

### Tensor creation and movement

You can create a tensor on host

In [None]:
host_tensor = ttnn.full([10, 15], 1.0)
host_tensor

then move it to device

In [None]:
device_tensor = ttnn.to_device(host_tensor, device)
device_tensor

or directly create the tensor on device

In [None]:
device_tensor_2 = ttnn.rand([10,15], device=device)
device_tensor_2

You can also create a ttnn tensor from a torch tensor!

In [None]:
import torch


In [None]:
torch_tensor = torch.rand([10,15])

In [None]:
host_ttnn_from_torch = ttnn.from_torch(torch_tensor)
host_ttnn_from_torch

In [None]:
device_ttnn_from_torch = ttnn.from_torch(torch_tensor, device=device, layout=ttnn.TILE_LAYOUT)
device_ttnn_from_torch

Sending tensors back is also just as easy

In [None]:
device_tensor = ttnn.rand([10,15], device=device)

In [None]:
host_tensor = ttnn.from_device(device_tensor)
host_tensor

And moving tensors back to torch

In [None]:
torch_tensor = ttnn.to_torch(device_tensor)
torch_tensor

### Tensor layout

As you may remember from earlier sections, Tenstorrent devices operate efficiently on tiled data.

Many operations require the inputs to be tilized, not row-major. 

You can change the layout, or choose it when creating or moving the tensor.

The tensor gets padded to fill the tiles, but this is transparent to the user.
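
To make the padding concrete, here is a minimal sketch in plain Python of how a logical shape rounds up to whole tiles. It assumes 32x32 tiles; `padded_shape` and `tile_count` are illustrative helpers, not part of the ttnn API.

```python
import math

# Assumption: tiles are 32x32 elements.
TILE_HEIGHT = 32
TILE_WIDTH = 32


def padded_shape(shape):
    """Round the last two dimensions up to a whole number of tiles."""
    *batch, height, width = shape
    padded_h = math.ceil(height / TILE_HEIGHT) * TILE_HEIGHT
    padded_w = math.ceil(width / TILE_WIDTH) * TILE_WIDTH
    return [*batch, padded_h, padded_w]


def tile_count(shape):
    """Number of tiles needed to hold a tensor of the given logical shape."""
    *batch, height, width = padded_shape(shape)
    tiles_per_matrix = (height // TILE_HEIGHT) * (width // TILE_WIDTH)
    return tiles_per_matrix * math.prod(batch)


print(padded_shape([10, 15]))    # [32, 32]
print(tile_count([10, 15]))      # 1
print(padded_shape([100, 100]))  # [128, 128]
print(tile_count([100, 100]))    # 16
```

A `[10, 15]` tensor therefore occupies a single tile, most of which is padding, while its reported shape stays `[10, 15]`.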

Tensors are usually created in row-major layout

In [None]:
print(ttnn.full([3,4], 1.0).layout)
print(ttnn.full([3,4], 1.0, device=device).layout)

except in some cases

In [None]:
ttnn.rand([10,15], device=device).layout

and maintain their layout when moved to device

In [None]:
host_tensor = ttnn.full([3,4], 1.0)
print(host_tensor.layout)
device_tensor = ttnn.to_device(host_tensor, device)
print(device_tensor.layout)

unless explicitly converted

In [None]:
device_tensor = ttnn.to_layout(device_tensor, ttnn.TILE_LAYOUT)
print(device_tensor.layout)

Torch tensors are row-major, but you can tilize during the conversion

In [None]:
torch_tensor = torch.rand([10,15])
print(ttnn.from_torch(torch_tensor).layout)
print(ttnn.from_torch(torch_tensor, device=device).layout)
print(ttnn.from_torch(torch_tensor, device=device, layout=ttnn.TILE_LAYOUT).layout)

### Tensor operations

_Note_:

Most operations are only supported on device, not on host.



To find out more about controlling operation math fidelity and limitations, such as TF32-like matrix multiplication of FP32 inputs, see [details](https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/matrix_engine/matrix_engine.md)

Many operations you know and love from PyTorch are already here!

In [None]:
x = ttnn.arange(start=0, end=100, device=device, layout=ttnn.TILE_LAYOUT)
x = ttnn.divide(x, 100)
x = x.reshape([1,100])

In [None]:
y = ttnn.rand([1, 100], device=device)
y

In [None]:
x+y
x*y
x-y
ttnn.divide(x, y)

In [None]:
ttnn.sin(x)
ttnn.cos(x)
ttnn.exp(x)
ttnn.log(x)
ttnn.sqrt(x)
ttnn.pow(x, 2)



In [None]:
ttnn.sort(y)

In [None]:
ttnn.concat([x, y], dim=1)

Slicing also works!

In [None]:
x[:, 50:100]

And many, many more!

You can find the full set of supported operations [here](https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api.html#operations)

## Compilation and compilation cache

You may notice that operations are slow the first time you run them and fast on subsequent runs.

In [None]:
import time
x = ttnn.rand([1000, 1000], device=device)
start = time.time()
y = ttnn.softmax(x, dim=1)
# Because we're not reading back the result,
# we need to synchronize to measure the actual execution time,
# not just the time taken to dispatch the operation
ttnn.synchronize_device(device)
end = time.time()
print(f"First iteration: {end - start} seconds")
start = time.time()
y = ttnn.softmax(x, dim=1)
ttnn.synchronize_device(device)
end = time.time()
print(f"Second iteration: {end - start} seconds")


This is because when you first run an operation for a given tensor shape, the underlying tt-metal kernel gets compiled. The following runs re-use the same binary.

If a compile-time argument changes, such as tensor shape, a new compilation is needed

In [None]:
# Same operation, different shape
x = ttnn.rand([1337, 1337], device=device)
start = time.time()
y = ttnn.softmax(x, dim=1)
ttnn.synchronize_device(device)
end = time.time()
print(f"First iteration: {end - start} seconds")
start = time.time()
y = ttnn.softmax(x, dim=1)
ttnn.synchronize_device(device)
end = time.time()
print(f"Second iteration: {end - start} seconds")
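
The caching behaviour can be pictured with a toy model in plain Python. This is only a conceptual sketch, not how tt-metal actually stores its binaries: `KernelCache` is a hypothetical class that keys a "compiled binary" on the operation name and tensor shape, so a repeated shape is a cache hit and a new shape triggers another compilation.

```python
class KernelCache:
    """Toy compile cache keyed by operation name and tensor shape."""

    def __init__(self):
        self._binaries = {}
        self.compilations = 0

    def get(self, op_name, shape):
        key = (op_name, tuple(shape))
        if key not in self._binaries:
            # Cache miss: this is where the slow compilation would happen.
            self.compilations += 1
            dims = "x".join(str(d) for d in shape)
            self._binaries[key] = f"binary<{op_name}:{dims}>"
        return self._binaries[key]


cache = KernelCache()
cache.get("softmax", [1000, 1000])  # miss: compiles
cache.get("softmax", [1000, 1000])  # hit: reuses the binary
cache.get("softmax", [1337, 1337])  # miss: new shape, compiles again
print(cache.compilations)           # 2
```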

## Direct SRAM (L1) control

As explained in previous sections, tt-metal and TTNN give the user full control over moving data into and out of the faster but limited SRAM, also known as L1.

In [None]:
dram_tensor = ttnn.rand([4096,2048], device=device)
dram_tensor.memory_config()

In [None]:
sram_tensor = ttnn.to_memory_config(dram_tensor, ttnn.L1_MEMORY_CONFIG)
sram_tensor.memory_config()

In [None]:
# warmup, compilation
ttnn.sum(dram_tensor, dim = 0)
ttnn.sum(sram_tensor, dim = 0)
ttnn.synchronize_device(device)
start = time.time()
for _ in range(10):
    ttnn.sum(dram_tensor, dim = 0)
ttnn.synchronize_device(device)
end = time.time()
print(f"DRAM Time taken: {end - start} seconds")
start = time.time()
for _ in range(10):
    ttnn.sum(sram_tensor, dim = 0)
ttnn.synchronize_device(device)
end = time.time()
print(f"SRAM Time taken: {end - start} seconds")
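
The warm-up, loop, and synchronize pattern above can be wrapped in a small helper. The sketch below is plain Python and uses a stand-in workload so it runs anywhere; `benchmark` and its parameters are illustrative names, not part of ttnn.

```python
import time


def benchmark(fn, sync=lambda: None, warmup=1, iterations=10):
    """Return the mean seconds per call of fn, excluding warm-up runs."""
    for _ in range(warmup):
        fn()
    sync()  # make sure warm-up work (including compilation) has finished
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    sync()  # wait for queued work before stopping the clock
    return (time.perf_counter() - start) / iterations


# Stand-in workload so the sketch runs without a device.
mean_seconds = benchmark(lambda: sum(range(10_000)))
print(f"{mean_seconds:.6f} seconds per call")
```

In this notebook, `fn` could be `lambda: ttnn.sum(dram_tensor, dim=0)` and `sync` could be `lambda: ttnn.synchronize_device(device)`.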

When doing a series of operations, deallocate tensors manually to free up memory. This is especially important for the limited L1.

In [None]:
ttnn.deallocate(sram_tensor)

For even better performance, you can shard the L1 tensor to keep the data closer to the cores processing it - learn more [here](https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/tensor.html#tensor-sharding)

In [None]:
# Width sharding: split the tensor's columns across an 8x8 core grid
sharded_config = ttnn.create_sharded_memory_config(
    shape=dram_tensor.shape,
    core_grid=ttnn.CoreGrid(x=8, y=8),
    strategy=ttnn.ShardStrategy.WIDTH,
)
sharded_tensor = ttnn.to_memory_config(dram_tensor, sharded_config)
ttnn.sum(sharded_tensor, dim = 0)
ttnn.synchronize_device(device)
start = time.time()
for _ in range(10):
    res = ttnn.sum(sharded_tensor, dim = 0)
ttnn.synchronize_device(device)
end = time.time()
print(f"Sharded Time taken: {end - start} seconds")
ttnn.deallocate(sharded_tensor)

# Height sharding: split the tensor's rows across an 8x8 core grid
sharded_config = ttnn.create_sharded_memory_config(
    shape=dram_tensor.shape,
    core_grid=ttnn.CoreGrid(x=8, y=8),
    strategy=ttnn.ShardStrategy.HEIGHT,
)
sharded_tensor = ttnn.to_memory_config(dram_tensor, sharded_config)
ttnn.sum(sharded_tensor, dim = 0)
ttnn.synchronize_device(device)
start = time.time()
for _ in range(10):
    res = ttnn.sum(sharded_tensor, dim = 0)
ttnn.synchronize_device(device)
end = time.time()
print(f"Sharded 2 Time taken: {end - start} seconds")
ttnn.deallocate(sharded_tensor)
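
To see what each core holds under the two strategies, here is a simplified model in plain Python. It assumes the sharded dimension divides evenly across the cores and ignores any other constraints the real implementation may impose; `shard_shape` is a hypothetical helper, not a ttnn function.

```python
def shard_shape(tensor_shape, num_cores, strategy):
    """Return the (height, width) slice held by each core in a simplified model."""
    height, width = tensor_shape
    if strategy == "WIDTH":
        # Columns are split across cores; every core keeps all rows.
        if width % num_cores != 0:
            raise ValueError("width must divide evenly across cores")
        return (height, width // num_cores)
    if strategy == "HEIGHT":
        # Rows are split across cores; every core keeps all columns.
        if height % num_cores != 0:
            raise ValueError("height must divide evenly across cores")
        return (height // num_cores, width)
    raise ValueError(f"unknown strategy: {strategy}")


num_cores = 8 * 8  # matches CoreGrid(x=8, y=8)
print(shard_shape((4096, 2048), num_cores, "WIDTH"))   # (4096, 32)
print(shard_shape((4096, 2048), num_cores, "HEIGHT"))  # (64, 2048)
```

With width sharding each core holds complete columns of the `[4096, 2048]` tensor, so a sum over `dim=0` only needs data that is already local to that core. With height sharding each core holds a band of rows, so the same reduction has to combine partial results from different cores, which may explain a difference between the two timings.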

Manual control over L1 lets you keep intermediate results in fast memory between operations, without moving them back to DRAM and without fusing the operations.

In [None]:
x = ttnn.rand([32, 128], device=device, memory_config=ttnn.L1_MEMORY_CONFIG)

In [None]:
w1 = ttnn.rand([128, 128], device=device, memory_config=ttnn.L1_MEMORY_CONFIG)
w2 = ttnn.rand([128, 128], device=device, memory_config=ttnn.L1_MEMORY_CONFIG)

In [None]:
x1 = ttnn.linear(x, w1, memory_config=ttnn.L1_MEMORY_CONFIG)
print(x1.memory_config())
x2 = ttnn.relu(x1) # automatically maintains L1 config
print(x2.memory_config())
x3 = ttnn.linear(x2, w2, memory_config=ttnn.L1_MEMORY_CONFIG)
print(x3.memory_config())

ttnn.deallocate(x1)
ttnn.deallocate(x2)
ttnn.deallocate(x3)


## TTNN neural network operations

TTNN provides neural network operations as pure functions, similar to `torch.nn.functional`. This lets you structure your neural network module classes however you like!

In [None]:
input_ids = ttnn.from_torch(torch.randint(0, 1000, (2, 32)), dtype=ttnn.uint32, device=device)
emb_weight = ttnn.rand((1, 1, 1000, 512), dtype=ttnn.bfloat16, device=device)
x = ttnn.embedding(input_ids, emb_weight, layout=ttnn.TILE_LAYOUT)  # [2, 32, 512]
x = ttnn.reshape(x, (2, 1, 32, 512))
# LayerNorm
x = ttnn.layer_norm(x, epsilon=1e-5)
# Linear: 512 -> 2048 -> 512
w1 = ttnn.rand((1, 1, 512, 2048), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
x = ttnn.relu(ttnn.linear(x, w1))
w2 = ttnn.rand((1, 1, 2048, 512), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
x = ttnn.linear(x, w2)

For more operations, like an efficient SDPA implementation, see [here](https://docs.tenstorrent.com/tt-metal/latest/ttnn/ttnn/api.html)

## Inference only

You may notice we did not mention autograd - TTNN is focused on inference.

Support for training is being developed in a separate framework - have you seen our talk?