# Introduction to AMD Xilinx Vitis AI

AMD Xilinx Vitis AI is a development environment (stack) that covers the whole process of implementing neural networks on embedded FPGA platforms.

The environment consists of:
- Vitis AI Quantizer - converts a floating-point model into a quantized model.

    Both Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) are available.

    PTQ determines the INT8 quantization parameters of weights, biases, inputs and feature maps.

    A small part of the training dataset is used during quantization.

    Vitis AI supports popular frameworks such as PyTorch, TensorFlow and Caffe.
    
- Vitis AI Optimizer - provides tools for network optimization, e.g. pruning.

- Vitis AI Compiler - compiles the quantized model into a representation that the accelerator can execute.

- Vitis AI Runtime (VART) - a driver library for communicating with the accelerator.

- Vitis AI Deep Learning Processor Unit (DPU) - a sequential, general-purpose accelerator implemented in reconfigurable logic (FPGA).
    
    The DPU can execute:
    
    - convolution (1D to 3D): standard, depthwise, transposed
    - upsampling: bilinear, nearest neighbor
    - max / average pooling
    - elementwise addition and multiplication
    - activations: ReLU, ReLU6, LeakyReLU, softmax
    - sigmoid and hyperbolic tangent (available only on some hardware platforms)
    
    The DPU is generated with the appropriate tools (Vitis HLS / Vivado).
    Its final configuration can be adjusted at generation time:
    - available operations (depthwise convolution, elementwise multiplication, LeakyReLU, softmax, average pooling)
    - resource usage: DSP, dRAM
    - energy-saving mode

This laboratory focuses on the quantization and compilation of a PyTorch neural network model with Vitis AI.
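
To make the idea of INT8 quantization parameters more concrete, the sketch below quantizes a few values with a symmetric scheme and a power-of-two scale. This scheme is an assumption chosen for illustration, and the names `choose_fix_pos`, `quantize` and `dequantize` are hypothetical helpers, not part of the Vitis AI API. The scheme that the quantizer actually applies should be checked against the Vitis AI documentation.

```python
import math


def choose_fix_pos(values, bits=8):
    """Pick the number of fractional bits so the largest magnitude still fits."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return bits - 1
    q_max = 2 ** (bits - 1) - 1
    # Largest power-of-two scale 2**fix_pos with max_abs * scale <= q_max.
    return math.floor(math.log2(q_max / max_abs))


def quantize(values, fix_pos, bits=8):
    """Scale, round and clamp floating-point values to signed integers."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    scale = 2.0 ** fix_pos
    return [min(q_max, max(q_min, round(v * scale))) for v in values]


def dequantize(q_values, fix_pos):
    """Map integers back to floating point to inspect the rounding error."""
    scale = 2.0 ** fix_pos
    return [q / scale for q in q_values]


weights = [0.75, -0.5, 0.1, 1.5]      # illustrative values, not real weights
fix_pos = choose_fix_pos(weights)
q = quantize(weights, fix_pos)
restored = dequantize(q, fix_pos)

print("fix_pos:", fix_pos)
print("quantized:", q)
print("restored:", restored)
```

In this sketch the value `0.1` is not restored exactly, which is the kind of rounding error that calibration on a small data subset tries to keep low.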

## Part 1 - FLOATING-POINT TRAINING

1. Instantiate the evaluation loader (batch size = 1) with test data.

Instantiate the MiniResNet model.

Print the model parameters.

Use functions from the `local_utils` module.

In [1]:
import torch
import matplotlib.pyplot as plt
import local_utils as lu

eval_loader =  lu.get_test_dataset(1)
print("len(eval_loader) =", len(eval_loader))

net = lu.MiniResNet()

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 14201930.49it/s]


Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 24427443.80it/s]


Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 6182918.75it/s]


Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 17255913.74it/s]

Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw

len(eval_loader) = 10000





2. Instantiate train and test loaders with batch size = 64.

Use functions from the `local_utils` module.

In [2]:
BATCH_SIZE = 64

train_loader = lu.get_train_dataset(BATCH_SIZE)
test_loader = lu.get_test_dataset(BATCH_SIZE)

print("len(train_loader) =", len(train_loader))
print("len(test_loader) =", len(test_loader))

loader = train_loader
for X, y in loader:
    print(X.shape)
    print(y.shape)
    break

len(train_loader) = 938
len(test_loader) = 157
torch.Size([64, 1, 28, 28])
torch.Size([64])
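
The loader lengths printed above can be checked with simple arithmetic. The sketch assumes that the loaders wrap the standard MNIST splits (60000 training and 10000 test images) and keep the last, incomplete batch. `num_batches` is a hypothetical helper, not a function from `local_utils`.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of the given size."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(num_batches(60000, 64))  # train loader, batch size 64
print(num_batches(10000, 64))  # test loader, batch size 64
print(num_batches(10000, 1))   # evaluation loader, batch size 1
```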


3. Train the network with:
- SGD optimizer
- learning rate 0.1
- update period of 5
- 5 epochs
- accuracy metric

Plot history.
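
As a reminder of what an accuracy metric computes, here is a minimal pure-Python sketch. The internals of `lu.AccuracyMetic` are not shown in this notebook, so `batch_accuracy` is only a hypothetical stand-in that works on plain lists instead of tensors.

```python
def batch_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Four samples, three classes; the last sample is misclassified.
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]]
labels = [1, 0, 2, 2]
acc = batch_accuracy(logits, labels)
print("accuracy:", acc)
```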

In [4]:
metric = lu.AccuracyMetic()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
print(device)

net.to(device)
net, history = lu.training(net, train_loader, test_loader, criterion, metric, optimizer, 5, 5, device=device)


cuda
Epoch 1 / 5: STARTED
TRAINING
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


938it [00:11, 80.07it/s] 


VALIDATION
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


157it [00:00, 158.26it/s]


After epoch 1: loss=1.7432 acc=0.7183 val_loss=1.5947 val_acc=0.8685
Epoch 1 / 5: FINISHED

Epoch 2 / 5: STARTED
TRAINING
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


938it [00:10, 90.64it/s]


VALIDATION
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


157it [00:00, 167.18it/s]


After epoch 2: loss=1.5836 acc=0.8779 val_loss=1.5793 val_acc=0.8806
Epoch 2 / 5: FINISHED

Epoch 3 / 5: STARTED
TRAINING
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


938it [00:09, 96.71it/s] 


VALIDATION
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


157it [00:00, 167.22it/s]


After epoch 3: loss=1.5731 acc=0.8872 val_loss=1.5710 val_acc=0.8890
Epoch 3 / 5: FINISHED

Epoch 4 / 5: STARTED
TRAINING
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


938it [00:09, 94.48it/s] 


VALIDATION
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


157it [00:00, 164.45it/s]


After epoch 4: loss=1.4929 acc=0.9695 val_loss=1.4811 val_acc=0.9811
Epoch 4 / 5: FINISHED

Epoch 5 / 5: STARTED
TRAINING
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


938it [00:10, 91.15it/s]


VALIDATION
Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


157it [00:00, 164.15it/s]

After epoch 5: loss=1.4768 acc=0.9851 val_loss=1.4779 val_acc=0.9841
Epoch 5 / 5: FINISHED






4. Extract the model state dict and save it to the file `model_weights.pth`.

Note: `_use_new_zipfile_serialization=False` is needed for backward compatibility with the older version of PyTorch in the Vitis AI docker environment.

In [5]:
torch.save(net.state_dict(), "model_weights.pth", _use_new_zipfile_serialization=False)
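
The save and load round trip can be tried in isolation with the sketch below. It uses an in-memory buffer instead of a file and a tiny hypothetical model (`make_model`) in place of `lu.MiniResNet`, so it only illustrates the mechanism.

```python
import io

import torch


def make_model():
    """Hypothetical stand-in for lu.MiniResNet; any nn.Module behaves the same."""
    return torch.nn.Sequential(
        torch.nn.Flatten(),
        torch.nn.Linear(28 * 28, 10),
    )


source = make_model()

# Save only the parameters, using the legacy (non-zipfile) format.
buffer = io.BytesIO()
torch.save(source.state_dict(), buffer, _use_new_zipfile_serialization=False)

# Load into a freshly built model; map_location selects the target device.
buffer.seek(0)
state = torch.load(buffer, map_location="cpu")
target = make_model()
target.load_state_dict(state)

# Both models should now produce identical outputs for the same input.
x = torch.randn(2, 1, 28, 28)
same_output = torch.equal(source(x), target(x))
print("identical outputs:", same_output)
```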

## Part 2 - EVALUATION - host device

5. Instantiate the `MiniResNet` network with the same input shape.

Load the state dict from the `model_weights.pth` file and initialize the network with it (`load_state_dict`, with `map_location=device`).

Evaluate the model on the `eval_loader` dataset with `local_utils.train_test_pass`.

Print the loss, accuracy, execution time, number of processed images and throughput (fps).

Run the experiment for both the 'cpu' and 'cuda' devices.

In [8]:
# CUDA - GPU
net = lu.MiniResNet()
net.load_state_dict(torch.load("model_weights.pth", map_location="cuda"))

tm = lu.TimeMeasurement("Host-GPU", len(eval_loader))
with tm:
    net, loss, acc = lu.train_test_pass(net, eval_loader, criterion, metric, optimizer, "cuda")

print(repr(tm))
print("loss:", loss)
print("acc:", acc)

Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


10000it [00:16, 623.78it/s]

TimeMeasurement(context="Host-GPU","0:0:16.0:0.0", frames=10000, throughput=625.0)
loss: 1.4778948688983917
acc: 0.9841





In [9]:
# CPU
net = lu.MiniResNet()
net.load_state_dict(torch.load("model_weights.pth", map_location="cpu"))

tm = lu.TimeMeasurement("Host-CPU", len(eval_loader))
with tm:
    net, loss, acc = lu.train_test_pass(net, eval_loader, criterion, metric, optimizer, "cpu")

print(repr(tm))
print("loss:", loss)
print("acc:", acc)

Running on platform: Linux-5.15.0-56-generic-x86_64-with-glibc2.10, machine: x86_64, python_version: 3.8.8, processor: x86_64, system: Linux, 


10000it [00:15, 633.57it/s]

TimeMeasurement(context="Host-CPU","0:0:15.0:0.0", frames=10000, throughput=666.6666666666666)
loss: 1.4778948688983917
acc: 0.9841
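
The throughput figures printed above match the number of frames divided by the elapsed time (10000 / 16 = 625.0 and 10000 / 15 ≈ 666.67), which suggests that the elapsed time was taken in whole seconds. The sketch below shows the same calculation with a finer-grained timer. `SimpleTimer` and `frames_per_second` are hypothetical names, and the actual implementation of `lu.TimeMeasurement` may differ.

```python
import time


def frames_per_second(frames, seconds):
    """Throughput as the number of processed frames per second."""
    return frames / seconds


class SimpleTimer:
    """Hypothetical stand-in for lu.TimeMeasurement: times a `with` block."""

    def __init__(self, context, frames):
        self.context = context
        self.frames = frames
        self.elapsed = None

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions raised inside the block

    @property
    def throughput(self):
        return frames_per_second(self.frames, self.elapsed)


# Dummy workload standing in for the evaluation pass.
with SimpleTimer("demo", frames=1000) as tm:
    total = sum(i * i for i in range(1000))

print(tm.context, "elapsed [s]:", tm.elapsed, "throughput [fps]:", tm.throughput)
print(frames_per_second(10000, 16))
```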





## Part 3 - AMD Xilinx Vitis AI environment

This part is a guide on how to run AMD Xilinx Vitis AI (abbreviated as Vitis AI or VAI):

1. Open a console in the directory with the `lab_11` files

(or create a new cell and write the commands with `!` at the beginning of the line, e.g. `!ls`).

2. Start the docker container (you will first be asked to accept the Xilinx terms of VAI usage) by running the script:

`./docker_run.sh xilinx/vitis-ai:1.4.916`

The script pulls the docker image of the Vitis AI environment, version 1.4.916, if necessary, and starts a bash terminal.

This operation may take some time.

3. Your terminal is now inside the VAI container.

The current directory in the container (`/workspace`) is mapped to the directory from which you started the VAI container (the `lab_11` directory).

4. Activate the VAI conda environment dedicated to the PyTorch library:

`conda activate vitis-ai-pytorch`

5. Run the Jupyter server inside the container:

`jupyter notebook --no-browser --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password=''`
 
6. Save this file.

7. Open the link to Jupyter's browser interface and run `notebook_quntize.ipynb`.