# 0 - setup
In this guide, we will implement a simple “Hello World” style NKI kernel and run it on a NeuronDevice (Trainium, Inferentia2, or a later device). We will show how to invoke an NKI kernel standalone through NKI baremetal mode and through an ML framework (PyTorch). Before diving into the kernel implementation, let’s make sure your environment is set up correctly for running NKI kernels.

## Environment Setup
You need a [Trn1](https://aws.amazon.com/ec2/instance-types/trn1/) or [Inf2](https://aws.amazon.com/ec2/instance-types/inf2/) instance set up on AWS to run NKI kernels on a NeuronDevice. Once logged into the instance, follow the steps below to ensure you have all the required packages installed in your Python environment.

NKI is shipped as part of the Neuron compiler package. To make sure you have the latest compiler package, follow the installation instructions in the [Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/index.html).

You can verify that NKI is available in your compiler installation by running the following command:

In [1]:
import neuronxcc.nki

This attempts to import the NKI package. It will error out if NKI is not included in your Neuron compiler version or if the Neuron compiler is not installed. The import might take about a minute the first time you run it. Whenever possible, we recommend using local instance NVMe volumes instead of EBS for executable code.

If you intend to run NKI kernels without any ML framework for quick prototyping, you will also need NumPy installed.

To call NKI kernels from PyTorch, you also need `torch_neuronx` installed. For an installation guide, see PyTorch Neuron Setup. You can verify that `torch_neuronx` is installed by running the following command:

In [2]:
import torch_neuronx
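
The two import checks above can be combined into a single non-fatal check. The following is a minimal sketch that uses only the Python standard library; the helper name `find_missing_packages` and the `REQUIRED_PACKAGES` tuple are illustrative and not part of any Neuron package. It only confirms that the top-level packages can be located, so it does not replace `import neuronxcc.nki`, which is still needed to confirm that NKI itself is included in your compiler version.

```python
import importlib.util

# Top-level packages this guide relies on; adjust the tuple for your own setup.
REQUIRED_PACKAGES = ("neuronxcc", "numpy", "torch_neuronx")


def find_missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised for malformed or partially initialized modules; treat as missing.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


missing = find_missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Because `find_spec` locates a package without executing it, this check avoids the slow first import described earlier.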

## Implementing your first NKI kernel
In the current NKI release, all input and output tensors must be passed into the kernel as device memory (HBM) tensors on a NeuronDevice. The body of the kernel typically consists of three main phases:

1. Load the inputs from device memory to on-chip memory (SBUF).
2. Perform the desired computation.
3. Store the outputs from on-chip memory to device memory.

For more details on the above terms, see [NKI Programming Model](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/programming_model.html).

In [3]:
import neuronxcc.nki.language as nl

def nki_tensor_add_kernel(a_input, b_input, c_output):
    """
    NKI kernel to compute element-wise addition of two input tensors
    """

    # Check all input/output tensor shapes are the same for element-wise operation
    assert a_input.shape == b_input.shape == c_output.shape

    # Check size of the first dimension does not exceed on-chip memory tile size limit,
    # so that we don't need to tile the input to keep this example simple
    assert a_input.shape[0] <= nl.tile_size.pmax

    # Load the inputs from device memory to on-chip memory
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # Specify the computation (in our case: a + b)
    c_tile = nl.add(a_tile, b_tile)

    # Store the result to c_output from on-chip memory to device memory
    nl.store(c_output, value=c_tile)
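
To make the three phases concrete without a NeuronDevice, the kernel above can be mirrored on the host in plain NumPy. This is only an illustrative emulation, not NKI code: `tensor_add_reference` is a hypothetical helper, the copies merely stand in for SBUF tiles, and the constant `PMAX = 128` is an assumption about the value of `nl.tile_size.pmax`.

```python
import numpy as np

# Assumption: 128 mirrors nl.tile_size.pmax on current NeuronDevices.
PMAX = 128


def tensor_add_reference(a_input, b_input, c_output):
    """Host-side NumPy emulation of the load / compute / store phases."""
    # Same checks as the NKI kernel
    assert a_input.shape == b_input.shape == c_output.shape
    assert a_input.shape[0] <= PMAX

    # "Load": copy the inputs into separate buffers, standing in for SBUF tiles
    a_tile = np.array(a_input, copy=True)
    b_tile = np.array(b_input, copy=True)

    # Compute: element-wise addition
    c_tile = a_tile + b_tile

    # "Store": write the result back into the caller's output buffer
    c_output[...] = c_tile


a = np.ones((4, 3), dtype=np.float16)
b = np.ones((4, 3), dtype=np.float16)
c = np.zeros((4, 3), dtype=np.float16)
tensor_add_reference(a, b, c)
print(c)
```

A host-side reference like this is also handy later for checking the values a real kernel run produces.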

## NKI baremetal

To run the above `nki_tensor_add_kernel` kernel in baremetal mode, we can decorate the function with `@baremetal` as follows:


```python
@baremetal
def nki_tensor_add_kernel(a_input, b_input, c_output):
```

See the [nki.baremetal](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/api/generated/nki.baremetal.html) API doc for the input arguments available to the decorator. `nki.baremetal` expects the input and output tensors of the NKI kernel to be NumPy arrays. To invoke the kernel, we first initialize the two input tensors `a` and `b` and the output tensor `c` as NumPy arrays. In this scenario, it’s not necessary to zero out the output tensor, as it will be completely overwritten by the result of the addition. However, in some cases a kernel might overwrite only part of the output tensor, and the user might want to reset it beforehand to avoid garbage data. Finally, we call the NKI kernel just like any other Python function.
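
The point about partially overwritten outputs can be illustrated on the host with NumPy alone. In this sketch, `partial_write` is a hypothetical stand-in for a kernel that fills only the first rows of its output; it is not an NKI API.

```python
import numpy as np


def partial_write(out, value, rows):
    """Hypothetical stand-in for a kernel that writes only the first `rows` rows."""
    out[:rows] = value
    return out


# A buffer that still holds leftover data from earlier use
stale = np.full((4, 3), 7, dtype=np.float16)
# A buffer that was reset before the call
clean = np.zeros((4, 3), dtype=np.float16)

partial_write(stale, 2, rows=2)
partial_write(clean, 2, rows=2)

print(stale)  # the last two rows still contain the old values
print(clean)  # the last two rows are zero
```

Resetting the buffer costs one extra initialization but makes untouched regions easy to recognize when inspecting results.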

In [4]:
from neuronxcc.nki import baremetal

# Note that this is the same as decorating the kernel definition:
#
# @baremetal
# def nki_tensor_add_kernel(a_input, b_input, c_output):
nki_tensor_add_kernel_baremetal = baremetal(nki_tensor_add_kernel)

import numpy as np

a = np.ones((4, 3), dtype=np.float16)
b = np.ones((4, 3), dtype=np.float16)
c = np.zeros((4, 3), dtype=np.float16)

# Run NKI kernel on a NeuronDevice
nki_tensor_add_kernel_baremetal(a, b, c)

print(c)
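
Rather than inspecting the printed array by eye, the result can be compared against a host-side NumPy computation. The sketch below assumes a hypothetical helper `matches_reference` and uses loose tolerances chosen for `float16`; so that the snippet runs without a NeuronDevice, `c_from_device` is computed on the host as a stand-in for the array the kernel fills in.

```python
import numpy as np


def matches_reference(a, b, c, rtol=1e-3, atol=1e-3):
    """Return True if c equals a + b within float16-friendly tolerances."""
    # Compare in float32 so the reference itself is not limited by float16 rounding
    expected = a.astype(np.float32) + b.astype(np.float32)
    return bool(np.allclose(c.astype(np.float32), expected, rtol=rtol, atol=atol))


a = np.ones((4, 3), dtype=np.float16)
b = np.ones((4, 3), dtype=np.float16)
c_from_device = a + b  # stand-in for the output array filled in by the kernel

print(matches_reference(a, b, c_from_device))
```

In a real run, pass the `c` array from the cell above in place of `c_from_device`.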


## PyTorch

To run the above `nki_tensor_add_kernel` kernel using PyTorch, we can decorate the function with `@nki_jit` as follows:

```python
@nki_jit
def nki_tensor_add_kernel(a_input, b_input, c_output):
```

The kernel caller code is very similar to that of NKI baremetal mode, except that the input and output tensors must now be initialized as PyTorch `device` tensors.

In [5]:
import torch
from torch_xla.core import xla_model as xm
from torch_neuronx import nki_jit

# Note that this is the same as decorating the kernel definition:
#
# @nki_jit
# def nki_tensor_add_kernel(a_input, b_input, c_output):
nki_tensor_add_kernel_pytorch = nki_jit(nki_tensor_add_kernel)

device = xm.xla_device()

a = torch.ones((4, 3), dtype=torch.float16).to(device=device)
b = torch.ones((4, 3), dtype=torch.float16).to(device=device)
c = torch.zeros((4, 3), dtype=torch.float16).to(device=device)

nki_tensor_add_kernel_pytorch(a, b, c)

print(c)  # printing acts as an implicit XLA barrier/mark-step, which triggers XLA compilation

## Release the NeuronCore for the next notebook

Before moving to the next notebook, we need to release the NeuronCore. If we don't, the next notebook will not be able to acquire the resources it needs. You can also stop the kernel via the GUI.

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)