# Advanced Topics in Inference APIs

This tutorial covers more advanced topics in the Inference APIs. The main topics are:
* How to specify an NPU device, including *NPU core fusion*
* The asynchronous, non-blocking inference API

## Prerequisites
To follow this tutorial, please install the following prerequisites.

First, install the NPU driver, firmware, and runtime by following the instructions in the [FuriosaAI Driver, Firmware, Runtime Installation Guide](https://furiosa-ai.github.io/docs/latest/ko/software/installation.html).

Then, install the following Python packages:
```sh
pip install furiosa-sdk matplotlib mnist
```
Alternatively, run the following command to install the dependencies for all notebook examples at once:
```sh
pip install -r requirements.txt
```

Next, check that your NPU device is ready:

In [1]:
!furiosactl info

+------+--------+----------------+-------+--------+--------------+
| NPU  | Name   | Firmware       | Temp. | Power  | PCI-BDF      |
+------+--------+----------------+-------+--------+--------------+
| npu0 | warboy | 1.7.0, 0a4411e |  38°C | 2.22 W | 0000:49:00.0 |
+------+--------+----------------+-------+--------+--------------+

Then, verify that the SDK is ready by running the following command. If you see an error here, please follow the instructions in:
* [FuriosaAI Driver, Firmware, Runtime Installation Guide](https://furiosa-ai.github.io/docs/v0.5.0/ko/software/installation.html)
* [Setting up a Python Environment](https://furiosa-ai.github.io/docs/v0.5.0/ko/software/python-sdk.html#python)

In [2]:
!python -c "from furiosa import runtime;print(runtime.__full_version__)"

libfuriosa_hal.so --- v0.11.0, built @ 43c901f
Furiosa SDK Runtime 0.10.0-dev (rev: e80482f4) (libnux 0.9.0 062c7dd1f 2023-04-12T20:55:14Z)


## How to Specify an NPU Device

You may need to specify an NPU device for your application in the following cases:
* Case A: you have more than one NPU device
* Case B: you want to use individual PEs (processing elements) separately for smaller DNN applications, or use a single fused PE

The FuriosaAI SDK provides a couple of ways to specify the NPU device that your application uses. This section explains them.

### Understanding NPU IDs

NPU IDs are used across all FuriosaAI SDK components, so you need to understand how an NPU device is represented as a single NPU ID string.

`npu0`, `npu1`, ..., `npuN` each represent a single NPU device. The trailing number starts at 0 and increases sequentially as you add more NPUs to your machine. Each NPU device contains 2 individual PEs, represented as `pe0` and `pe1`.

An NPU ID can identify both a specific NPU device and specific PE(s) on it. For example, if you have 2 NPU devices, all available individual PEs are represented by:
* `npu0pe0`
* `npu0pe1`
* `npu1pe0`
* `npu1pe1`

On Warboy, you can fuse the 2 PEs belonging to the same NPU. Fused PEs are represented by:
* `npu0pe0-1`
* `npu1pe0-1`
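
The naming scheme above can be sketched as a small parser. This is only an illustration of the ID format described in this section: `parse_npu_id` and `NpuId` are hypothetical names and are not part of the FuriosaAI SDK.

```python
import re
from typing import List, NamedTuple

# Hypothetical helper for illustration; not part of the FuriosaAI SDK.
# Matches "npuN", "npuNpeM", and "npuNpeM-K".
_NPU_ID = re.compile(r"^npu(\d+)(?:pe(\d+)(?:-(\d+))?)?$")


class NpuId(NamedTuple):
    npu: int        # device index, e.g. 0 for "npu0"
    pes: List[int]  # PE indices; empty when the ID names only the device


def parse_npu_id(text: str) -> NpuId:
    """Split an NPU ID string such as "npu0pe0-1" into its parts."""
    match = _NPU_ID.match(text)
    if match is None:
        raise ValueError(f"not an NPU ID: {text!r}")
    npu, first, last = match.groups()
    if first is None:
        return NpuId(int(npu), [])
    start = int(first)
    end = int(last) if last is not None else start
    if end < start:
        raise ValueError(f"invalid PE range in {text!r}")
    return NpuId(int(npu), list(range(start, end + 1)))


print(parse_npu_id("npu0pe0"))    # a single PE on the first device
print(parse_npu_id("npu1pe0-1"))  # two fused PEs on the second device
```

Under this sketch, a fused ID simply yields more than one PE index for the same device.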

### Using a Shell Environment Variable to Specify an NPU Device

All FuriosaAI SDK components recognize the shell environment variable `NPU_DEVNAME`. If you set `NPU_DEVNAME` in your shell, your application will use the NPU device it names. For example:

```sh
export NPU_DEVNAME="npu0pe0"
```

Please note that an NPU device is occupied while an application is using it, so you cannot run multiple applications with the same `NPU_DEVNAME` setting at the same time.

### Using Session Option

In the Python SDK, `Session` is the core class for running inference, and it accepts various options. One of them is `device`, which lets you specify the NPU device for the session. If you are not familiar with `Session`, see [Getting Started With Python SDK](GettingStartedWithPythonSDK.ipynb).

For example, you can specify an NPU device when you create a `Session` object:
```python
from furiosa.runtime import session
sess = session.create('mnist-8.onnx', device="npu0pe0")
```

Please note that a device specified through the `device` option of a session overrides the shell environment variable `NPU_DEVNAME`.
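
The precedence between the two mechanisms can be pictured with the short sketch below. `resolve_device` is a hypothetical function written only to illustrate the rule stated above; it is not how the SDK is implemented, and what the SDK does when neither value is set is not covered here.

```python
import os
from typing import Mapping, Optional


def resolve_device(option: Optional[str] = None,
                   env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Hypothetical helper: mimic the precedence rule, nothing more."""
    if option:
        # An explicit `device` option wins over the environment variable.
        return option
    # Otherwise fall back to NPU_DEVNAME; None means neither was given.
    return env.get("NPU_DEVNAME")


print(resolve_device("npu0pe1", {"NPU_DEVNAME": "npu0pe0"}))  # npu0pe1
print(resolve_device(None, {"NPU_DEVNAME": "npu0pe0"}))       # npu0pe0
```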

## Asynchronous Inference APIs

The asynchronous inference API allows a user application to handle multiple inference requests in a single thread.

To use it, call `session.create_async()`, which creates both a `submitter` and a `queue` instance, as follows:

In [3]:
from furiosa.runtime import session

model_path = "models/MNIST_MobileNet_v2_uint8_quant_without_avgpool_softmax.tflite"

submitter, queue = session.create_async(model_path, 
                                        worker_num=1, 
                                        # Determine how many asynchronous requests you can submit
                                        # without blocking.
                                        input_queue_size=100,
                                        output_queue_size=100)

libfuriosa_hal.so --- v0.11.0, built @ 43c901f
Saving the compilation log into /home/hyunsik/.local/state/furiosa/logs/compile-20230413184105-357mhm.log
Using furiosa-compiler 0.9.0 (rev: 062c7dd1f built at 2023-04-12T20:55:14Z)
[1/6] 🔍   Compiling from tflite to dfg
Done in 0.014001096s
[2/6] 🔍   Compiling from dfg to ldfg

2023-04-13T23:41:05.755788Z  INFO nux::npu: Npu (npu0pe0-1) is being initialized
2023-04-13T23:41:05.759270Z  INFO nux: NuxInner create with pes: [PeId(0)]

Done in 19.000982s
[3/6] 🔍   Compiling from ldfg to cdfg
Done in 0.000484564s
[4/6] 🔍   Compiling from cdfg to gir
Done in 0.003669258s
[5/6] 🔍   Compiling from gir to lir
Done in 0.001117423s
[6/6] 🔍   Compiling from lir to enf
Done in 0.008298314s
✨  Finished in 19.029139s


A `submitter` provides APIs to submit inference requests, and a `queue` provides APIs to receive the completed inference requests.

In [4]:
submitter.inputs()

[TensorDesc(shape=(1, 28, 28, 1), dtype=UINT8, format=NHWC, size=784, len=784)]

In [5]:
submitter.outputs()

[TensorDesc(shape=(1, 10), dtype=UINT8, format=??, size=10, len=10)]

In [6]:
from furiosa.runtime import tensor
import numpy as np
import mnist
import random

train_images = mnist.train_images()

# Submit the inference requests asynchronously
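for _ in range(5):
    # Pick a random training image
    idx = random.randint(0, len(train_images) - 1)
    # Shape the image to match the model input: (1, 28, 28, 1), uint8
    image = np.array(train_images[idx:idx+1].reshape(1, 28, 28, 1), np.uint8)
    # The context is returned with the result, so it identifies the request
    submitter.submit(image, context=idx)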

In [7]:
# Receive the results asynchronously
for i in range(0, 5):
    context, outputs = queue.recv(100) # 100 is the timeout. If it is None, queue.recv() blocks.
    print(f"Context: {context}, Predict: {np.argmax(outputs[0].numpy())}")

Context: 49148, Predict: 1
Context: 856, Predict: 1
Context: 17638, Predict: 1
Context: 32495, Predict: 1
Context: 4727, Predict: 2
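
Because each result comes back with the context that was passed to `submit()`, one convenient pattern is to index results by context. The sketch below assumes only what the cells above show, namely that `recv()` returns a `(context, outputs)` pair. `FakeQueue` and `collect_by_context` are hypothetical names, and the fake queue exists only so the example runs without an NPU. The tutorial does not state whether results arrive in submission order; keying by context avoids depending on that.

```python
from collections import deque
from typing import Any, Dict, Hashable, Iterable, Tuple


class FakeQueue:
    """Stand-in for the real `queue`; hypothetical, for illustration only."""

    def __init__(self, items: Iterable[Tuple[Hashable, Any]]):
        self._items = deque(items)

    def recv(self, timeout=None) -> Tuple[Hashable, Any]:
        # Returns (context, outputs), the shape used in the cell above.
        return self._items.popleft()


def collect_by_context(queue, count: int, timeout=None) -> Dict[Hashable, Any]:
    """Receive `count` results and index them by their submit-time context."""
    results = {}
    for _ in range(count):
        context, outputs = queue.recv(timeout)
        results[context] = outputs
    return results


fake = FakeQueue([(856, "out-b"), (49148, "out-a")])
results = collect_by_context(fake, count=2, timeout=100)
print(results[49148])  # out-a
```

With the real `queue` from `session.create_async()`, the same function could be called in place of the loop above, under the assumption stated in the lead-in.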


You need to close `queue` and `submitter` after you use the asynchronous session.

In [8]:
if queue:
    queue.close()
if submitter:
    submitter.close()

2023-04-13T23:41:25.541872Z  INFO nux::npu: NPU (npu0pe0-1) has been destroyed
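
To make sure both objects are closed even if an exception is raised between creation and cleanup, one option is the standard library's `contextlib.ExitStack`. The sketch below relies only on the `close()` methods shown above. `FakeHandle` is a hypothetical stand-in so the example runs without an NPU; with the real objects you would register `submitter.close` and `queue.close` instead.

```python
from contextlib import ExitStack


class FakeHandle:
    """Stand-in exposing only close(); hypothetical, for illustration only."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def close(self):
        self.log.append(self.name)


closed = []
fake_submitter = FakeHandle("submitter", closed)
fake_queue = FakeHandle("queue", closed)

with ExitStack() as stack:
    # Callbacks run in reverse registration order on exit,
    # so the queue is closed first and the submitter second.
    stack.callback(fake_submitter.close)
    stack.callback(fake_queue.close)
    # ... submit requests and receive results here ...

print(closed)  # ['queue', 'submitter']
```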
