# Image Classification w/ TensorFlow Model
This tutorial demonstrates the steps required to prepare and deploy a trained TensorFlow model for FPGA acceleration.  
We will prepare a trained Inception v1 (GoogLeNet) model, and then run a single inference.  

# Model Preparation (Offline Process, Performed Once):


## Phase 1: Compile The Model  
* A network graph (protobuf) is compiled
* The network is optimized
* FPGA instructions are generated
  * These instructions are required to run the network in "one-shot" and to minimize data movement

## Phase 2: Quantize The Model
* The quantizer generates a JSON file holding the scaling parameters for quantizing floats to INT16 or INT8
* This is required because FPGAs take advantage of fixed-point precision to achieve faster inference
  * While floating-point precision is useful during model training, it is not required for high-speed, high-accuracy inference
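
To make the idea of a scaling parameter concrete, here is a minimal NumPy sketch of symmetric fixed-point quantization. It is an illustration only: the function names `quantize_symmetric` and `dequantize` are hypothetical, and the xfDNN quantizer may compute its scaling parameters differently.

```python
import numpy as np

def quantize_symmetric(values, bitwidth):
    """Map floats to signed integers using a single scale for the whole tensor."""
    if bitwidth not in (8, 16):
        raise ValueError("bitwidth must be 8 or 16")
    qmax = 2 ** (bitwidth - 1) - 1                 # 127 for INT8, 32767 for INT16
    max_abs = float(np.max(np.abs(values)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dtype = np.int8 if bitwidth == 8 else np.int16
    quantized = np.clip(np.round(values / scale), -qmax, qmax).astype(dtype)
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return quantized.astype(np.float32) * scale

# Assumed example data, not taken from the tutorial's model
weights = np.array([-1.5, -0.25, 0.0, 0.4, 1.2], dtype=np.float32)
for bits in (8, 16):
    q, scale = quantize_symmetric(weights, bits)
    error = np.max(np.abs(dequantize(q, scale) - weights))
    print("INT{:d}: scale={:.6f}, max abs error={:.6f}".format(bits, scale, error))
```

The round-trip error is bounded by half the scale, which is why the 16-bit setting tracks the float values more closely than the 8-bit one.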
    
# Model Deployment (Online Process, Typically Performed Iteratively):  
    
## Phase 3: Deploy The Model
Once you have the outputs of the compiler and quantizer, you will use the xfDNN deployment APIs to:
1. Open a handle for FPGA communication
2. Load weights, biases, and quantization parameters to the FPGA DDR
3. Allocate storage for FPGA inputs (such as images to process)
4. Allocate storage for FPGA outputs (the activation of the final layer run on the FPGA)
5. Execute the network
6. Run fully connected layers on the CPU
7. Run Softmax on CPU
8. Print the result (or send the result for further processing)
9. When you are done, close the handle to the FPGA

### Step 1. Import required packages, check environment.

In [None]:
# Import some things
from __future__ import print_function  # Must come before any other statement in the cell
import os,sys,cv2

import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

# Bring in Xilinx ML Suite Compiler, Quantizer, PyXDNN
from xfdnn.tools.compile.bin.xfdnn_compiler_tensorflow import TFFrontend as xfdnnCompiler
from xfdnn.tools.quantize.quantize_tf import tf_Quantizer as xfdnnQuantizer
import xfdnn.rt.xdnn as xdnn
import xfdnn.rt.xdnn_io as xdnn_io

import ipywidgets

import warnings
warnings.simplefilter("ignore", UserWarning)

print("Current working directory: %s" % os.getcwd())
print("Running on host: %s" % os.uname()[1])
print("Running w/ LD_LIBRARY_PATH: %s" %  os.environ["LD_LIBRARY_PATH"])
print("Running w/ XILINX_OPENCL: %s" %  os.environ["XILINX_OPENCL"])
print("Running w/ XCLBIN_PATH: %s" %  os.environ["XCLBIN_PATH"])
print("Running w/ PYTHONPATH: %s" %  os.environ["PYTHONPATH"])
print("Running w/ SDACCEL_INI_PATH: %s" %  os.environ["SDACCEL_INI_PATH"])

id = !whoami

# Make sure there is no error in this cell
# The xfDNN runtime depends upon the above environment variables

config = {} # Config dict
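
The cell above fails with a `KeyError` on the first unset variable. As an optional alternative, the following sketch reports every missing variable at once. The helper name `missing_env_vars` is hypothetical; the variable names are the ones printed above.

```python
import os

REQUIRED_VARS = ["LD_LIBRARY_PATH", "XILINX_OPENCL", "XCLBIN_PATH",
                 "PYTHONPATH", "SDACCEL_INI_PATH"]

def missing_env_vars(names, env=None):
    """Return the entries of `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```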

### Step 2. Use a config dictionary to pass parameters.

Here, we will set up and use a config dictionary to simplify handling of the arguments. For this first example, we will attempt to classify a picture of a dog. 

In [None]:
config["platform"] = None
platforms = ["alveo-u200","alveo-u200-ml","alveo-u250","aws","nimbix","1525","1525-ml"]

def setPlatform(platform):
    global config
    config["platform"] = platform

print ("Please select your hardware platform")
ipywidgets.interact(setPlatform,platform=platforms)

In [None]:
print ("Running on platform: %s" % config["platform"])
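
If you prefer to set the platform in code rather than through the widget, a small validation helper can catch typos early. This is a sketch only: `choose_platform` and `example_config` are hypothetical names, and the platform list is copied from the cell above.

```python
PLATFORMS = ["alveo-u200", "alveo-u200-ml", "alveo-u250",
             "aws", "nimbix", "1525", "1525-ml"]

def choose_platform(requested, platforms=PLATFORMS):
    """Return `requested` if it is a known platform, otherwise raise ValueError."""
    if requested not in platforms:
        raise ValueError("Unknown platform {!r}; expected one of: {}".format(
            requested, ", ".join(platforms)))
    return requested

# Stand-in dictionary so the notebook's own `config` is left untouched
example_config = {"platform": choose_platform("alveo-u200")}
print("Running on platform: %s" % example_config["platform"])
```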

In [None]:
# Choose an image to run, display it for reference
config["images"] = ["../examples/classification/dog.jpg"] # Image of interest (Must provide as a list)

img = cv2.imread(config["images"][0])
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img)
plt.title(config["images"])
plt.show()

### Step 3. Define an xfdnnCompiler instance and pass it arguments.  
To simplify handling of arguments, we continue to use a config dictionary. Take a look at the dictionary entries below. 

The arguments that need to be passed are: 
- `protobuf` - TensorFlow protobuf (`.pb`) representation of the network
- `netcfg` - Filename in which to save the micro-instructions produced by the compiler, which are needed for deployment
- `memory` - Parameter to set the on-chip memory for the target xDNN overlay. This example will target an overlay with 5 MB of cache. 
- `dsp` - Parameter to set the size of the target xDNN overlay. This example uses an overlay of size 32x56 DSPs. 
- `finalnode` - Output node of the TensorFlow graph  

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////  
Memory and DSP are critical arguments that correspond to the hardware accelerator you plan to load onto the FPGA.  
The memory and dsp parameters can be extracted from the name of the FPGA programming file (the "xclbin").  
Don't worry about this detail for now. Just know that if you change the xclbin, you have to recheck these parameters.  
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////  
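
As an illustration of reading parameters out of an overlay name, the sketch below parses a name of the form `xdnn_v2_32x56_2pe_16b_6mb_bank21`. The helper `parse_overlay_name` is hypothetical, and the meaning assigned to each field (DSP array dimensions, processing elements, bit width, memory in MB) is an assumption based on the name alone. Confirm the values against the ML Suite documentation before passing them to the compiler.

```python
import re

# Assumed naming pattern: <rows>x<cols>_<n>pe_<bits>b_<mem>mb
OVERLAY_PATTERN = re.compile(
    r"(?P<rows>\d+)x(?P<cols>\d+)_(?P<pe>\d+)pe_(?P<bits>\d+)b_(?P<mem>\d+)mb")

def parse_overlay_name(name):
    """Extract the numeric fields from an overlay name as a dict of ints."""
    match = OVERLAY_PATTERN.search(name)
    if match is None:
        raise ValueError("Unrecognized overlay name: {}".format(name))
    return {key: int(value) for key, value in match.groupdict().items()}

fields = parse_overlay_name("xdnn_v2_32x56_2pe_16b_6mb_bank21")
print(fields)
```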

The xfDNN Compiler reads the TensorFlow network graph and generates a sequence of instructions for the xfDNN Deploy APIs to execute on the FPGA.  

During this process the xfDNN Compiler performs computational graph traversal, node merging and optimization, memory allocation and optimization and, finally, micro-instruction generation.
  

In [None]:
# Compiler Arguments
config["model"] = "GoogLeNet"
config["protobuf"] = "../models/tensorflow/bvlc_googlenet_without_lrn/fp32/bvlc_googlenet_without_lrn_test.pb"
#config["outmodel"] = "work/optimized_model" # String for naming optimized model NOT YET SUPPORTED
config["netcfg"] = "work/fpga.cmds" # Compiler will generate FPGA instructions
config["memory"] = 5 # Available on-chip SRAM
config["dsp"] = 56 # Width of Systolic Array
config["finalnode"] = "prob" # Terminal node in your tensorflow graph

compiler = xfdnnCompiler(
    networkfile=config["protobuf"],      # Protobuf filename: input file
    #anew=config["outmodel"],            # String for intermediate protobuf NOT YET SUPPORTED
    generatefile=config["netcfg"],       # Script filename: output file
    memory=config["memory"],             # Available on chip SRAM within xclbin
    dsp=config["dsp"],                   # Rows in DSP systolic array within xclbin # keep defaults 
    finalnode=config["finalnode"],       # Terminal node in your tensorflow graph
    weights=True                         # Instruct Compiler to generate a weights directory for runtime
)

# Invoke compiler
try:
    compiler.compile()
    
    # The compiler extracts the floating point weights from the protobuf.
    # The weights directory is stored in the work dir, named after the
    # protobuf file with the suffix '_data' appended.
    config["datadir"] = "work/" + os.path.basename(config["protobuf"]) + "_data"
        
    if os.path.exists(config["datadir"]) and os.path.exists(config["netcfg"]+".json"):
        print("Compiler successfully generated JSON and the data directory: {:s}".format(config["datadir"]))
    else:
        print("Compiler failed to generate the JSON or data directory: {:s}".format(config["datadir"]))
        raise RuntimeError("Compiler output is missing")
        
    print("**********\nCompilation Successful!\n")
    
    import json
    with open(config["netcfg"]+".json") as f:  # Close the file once it has been read
        data = json.load(f)
    print("Network Operations Count: {:d}".format(data['ops']))
    print("DDR Transfers (bytes): {:d}".format(data['moveops']))
    
except Exception as e:
    print("Failed to complete compilation:",e)

### Step 4. Create Quantizer Instance and run it.

To simplify handling of arguments, a config dictionary is used. Take a look at the dictionary below.

The arguments that need to be passed are:
- `model_file` - Filename of the TensorFlow protobuf model to quantize. In this example it is the same file that was passed to the compiler.
- `quantizecfg` - Output JSON filename of quantization scaling parameters. 
- `bitwidths` - Desired precision from quantizer. This is to set the precision for [image data, weight bitwidth, conv output]. All three values need to be set to the same setting. The valid options are `16` for Int16 and `8` for Int8.  
- `img_mean` - Depending on network training, subtract image mean if available.
- `calibration_size` - Number of images the quantizer will use to calculate the dynamic range. 
- `calibration_directory` - Location of dir of images used for the calibration process. 

Below is an example with all the parameters filled in. `channel_swap`, `raw_scale`, `img_mean`, and `input_scale` are image preprocessing arguments specific to a given model.

In [None]:
config["img_mean"] = [104.007, 116.669, 122.679] # Mean of the training set
config["quantizecfg"] = "work/quantization_params.json" # Quantizer will generate quantization params
config["calibration_directory"] = "../xfdnn/tools/quantize/calibration_directory" # Directory of images for quantizer
config["calibration_size"] = 15 # Number of calibration images quantizer will use
config["bitwidths"] = [16,16,16] # Supported quantization precision
config["img_raw_scale"] = 255.0 # Raw scale of input pixels, i.e. 0 <-> 255
config["img_input_scale"] = 1.0 # Input multiplier, Images are scaled by this factor after mean subtraction


quantizer = xfdnnQuantizer(
    model_file=config["protobuf"],          # Protobuf filename: input file
    quantize_config=config["quantizecfg"],  # Quant filename: output file
    bitwidths=config["bitwidths"],          # Fixed Point precision: 8b or 16b
    cal_size=config["calibration_size"],    # Number of calibration images to use
    img_mean=config["img_mean"],            # Image mean per channel to caffe transformer
    cal_dir=config["calibration_directory"] # Directory containing calibration images
)

# Invoke quantizer
try:
    quantizer.quantize(inputName = "data", outputName = "prob")

    import json
    with open(config["quantizecfg"]) as f:  # Close the file once it has been read
        data = json.load(f)
    print("**********\nSuccessfully produced quantization JSON file for %d layers.\n"%len(data['network']))
except Exception as e:
    print("Failed to quantize:",e)
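
To see why the number of calibration images matters, the sketch below tracks a simple dynamic-range estimate (the largest absolute value seen so far) as more samples are processed. It uses random stand-in data and a hypothetical helper, `running_dynamic_range`. The xfDNN quantizer may use a different statistic, so treat this as an illustration of the concept only.

```python
import numpy as np

def running_dynamic_range(samples):
    """Return the largest absolute value seen after each calibration sample."""
    ranges = []
    current = 0.0
    for sample in samples:
        current = max(current, float(np.max(np.abs(sample))))
        ranges.append(current)
    return ranges

# Random stand-in "activations" for 15 calibration images (assumed data)
rng = np.random.RandomState(0)
samples = [rng.normal(0.0, 1.0, size=(3, 32, 32)) for _ in range(15)]
ranges = running_dynamic_range(samples)
print("After 1 image:   {:.3f}".format(ranges[0]))
print("After 15 images: {:.3f}".format(ranges[-1]))
```

The estimate can only grow as images are added, so too few calibration images risk underestimating the range and clipping values at inference time.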

## Phase 3. Deploy The Model.
Next, we need to utilize the xfDNN APIs to deploy our network to the FPGA. We will walk through the deployment APIs, step by step: 
1. Open a handle for FPGA communication
2. Load weights, biases, and quantization parameters to the FPGA DDR
3. Allocate storage for FPGA inputs (such as images to process)
4. Allocate storage for FPGA outputs (the activation of the final layer run on the FPGA)
5. Execute the network
6. Run fully connected layers on the CPU
7. Run Softmax on CPU
8. Print the result (or send the result for further processing)
9. When you are done, close the handle to the FPGA

First, we will create the handle to communicate with the FPGA and choose which FPGA overlay to run the inference on. For this lab, we will use the `xdnn_v2_32x56_2pe_16b_6mb_bank21` overlay. You can learn about other overlay options in the ML Suite Tutorials [here][].  

[here]: https://github.com/Xilinx/ml-suite
        
### Step 5. Open a handle for FPGA communication.

In [None]:
# Create a handle with which to communicate to the FPGA
# The actual handle is managed by xdnn
config["xclbin"] = "../overlaybins/" + config["platform"] + "/overlay_3.xclbin" # Chosen Hardware Overlay
## NOTE: If you change the xclbin, we likely need to change some arguments provided to the compiler
## Specifically, the DSP array width, and the memory arguments

ret, handles = xdnn.createHandle(config['xclbin'])

if ret:                                                             
    print("ERROR: Unable to create handle to FPGA")
else:
    print("INFO: Successfully created handle to FPGA")
    
# If this step fails, most likely the FPGA is locked by another user, or there is some setup problem with the hardware

### Step 6. Apply quantization scaling and transfer model weights to the FPGA. 

In [None]:
# Quantize, and transfer the weights to FPGA DDR

# config["datadir"] = "work/" + config["caffemodel"].split("/")[-1]+"_data" # From Compiler
config["scaleA"] = 10000 # Global scaler for weights (Must be defined)
config["scaleB"] = 30 # Global scaler for bias (Must be defined)
config["PE"] = 0 # Run on Processing Element 0 - Different xclbins have a different number of Elements
config["batch_sz"] = 1 # We will load 1 image at a time from disk
config["in_shape"] = (3,224,224) # We will resize images to 224x224

#(weightsBlob, fcWeight, fcBias ) = pyxfdnn_io.loadWeights(config)
fpgaRT = xdnn.XDNNFPGAOp(handles,config)
(fcWeight, fcBias) = xdnn_io.loadFCWeightsBias(config)

### Step 7. Allocate space in host memory for inputs, load images from disk, and prepare images. 

In [None]:
# Allocate space in host memory for inputs, Load images from disk
batch_array = np.empty(((config['batch_sz'],) + config['in_shape']), dtype=np.float32, order='C')
img_paths = xdnn_io.getFilePaths(config['images'])

for i in range(0, len(img_paths), config['batch_sz']):  # range works in both Python 2 and 3
    pl = []
    for j, p in enumerate(img_paths[i:i + config['batch_sz']]):
        batch_array[j, ...], _ = xdnn_io.loadImageBlobFromFile(p, config['img_raw_scale'], config['img_mean'], 
                                                                  config['img_input_scale'], config['in_shape'][2], 
                                                                  config['in_shape'][1])
        pl.append(p)
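
The image preparation above is handled by `xdnn_io.loadImageBlobFromFile`. For intuition, here is a NumPy-only sketch of the kind of preprocessing implied by the `img_raw_scale`, `img_mean`, and `img_input_scale` settings. The function `preprocess` is hypothetical, it skips resizing, and the exact order of operations inside the library is an assumption.

```python
import numpy as np

def preprocess(image_hwc, raw_scale, mean, input_scale):
    """Turn an HxWx3 uint8 image into a 3xHxW float32 blob."""
    blob = image_hwc.astype(np.float32) / 255.0       # pixels in [0, 1]
    blob *= raw_scale                                 # rescale to 0..raw_scale
    blob -= np.asarray(mean, dtype=np.float32)        # per-channel mean subtraction
    blob *= input_scale                               # multiplier applied after the mean
    return np.ascontiguousarray(blob.transpose(2, 0, 1))  # HWC -> CHW

# Synthetic all-white image standing in for a decoded, already resized picture
fake_image = np.full((224, 224, 3), 255, dtype=np.uint8)
blob = preprocess(fake_image, 255.0, [104.007, 116.669, 122.679], 1.0)
print(blob.shape, blob.dtype)
```

The resulting layout matches `config["in_shape"]`, so a blob like this could be copied into one slot of `batch_array`.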

### Step 8. Allocate space in host memory for outputs.

In [None]:
# Allocate space in host memory for outputs
if config["model"] == "GoogLeNet":
    config["fpgaoutsz"] = 1024 # Number of elements in the activation of the last layer run on the FPGA
elif config["model"] == "ResNet50":
    config["fpgaoutsz"] = 2048 # Number of elements in the activation of the last layer run on the FPGA

config["outsz"] = 1000 # Number of elements output by FC layers (1000 used for imagenet)

fpgaOutput = np.empty((config['batch_sz'], config['fpgaoutsz'],), dtype=np.float32, order='C') # Space for fpga output
fcOutput = np.empty((config['batch_sz'], config['outsz'],), dtype=np.float32, order='C') # Space for output of inner product

### Step 9. Write optimized micro-code to the xDNN Processing Engine on the FPGA. 

In [None]:
# Write FPGA Instructions to FPGA and Execute the network!
fpgaRT.execute(batch_array, fpgaOutput)

### Step 10. Execute the Fully Connected Layers on the CPU.

In [None]:
# Compute the inner product
xdnn.computeFC(fcWeight, fcBias, fpgaOutput, config['batch_sz'], config['outsz'], config['fpgaoutsz'], fcOutput)
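
Conceptually, a fully connected layer is a matrix product plus a bias. The NumPy sketch below shows the computation on random stand-in data. The function `fully_connected` is hypothetical, and the weight layout of `(outputs, inputs)` is an assumption that may not match how `xdnn.computeFC` stores its weights.

```python
import numpy as np

def fully_connected(activations, weight, bias):
    """activations: (batch, n_in); weight: (n_out, n_in); bias: (n_out,)."""
    return activations.dot(weight.T) + bias

rng = np.random.RandomState(0)
x = rng.rand(1, 1024).astype(np.float32)     # one image, 1024 features (fpgaoutsz for GoogLeNet)
w = rng.rand(1000, 1024).astype(np.float32)  # 1000 output classes (outsz)
b = rng.rand(1000).astype(np.float32)
y = fully_connected(x, w, b)
print(y.shape)
```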

### Step 11. Execute the Softmax layers.

In [None]:
# Compute the softmax to convert the output to a vector of probabilities
softmaxOut = xdnn.computeSoftmax(fcOutput)
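
For reference, softmax can be written in a few lines of NumPy. This sketch uses the common trick of subtracting the row maximum before exponentiating to avoid overflow. The function `softmax` here is an illustration and is not claimed to match the internals of `xdnn.computeSoftmax`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax that stays finite for large inputs."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Assumed example scores for a batch of two inputs
scores = np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
probabilities = softmax(scores)
print(probabilities)
```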

### Step 12. Output the classification prediction scores.

In [None]:
# Print the classification given the labels synset_words.txt (Imagenet classes)
config["labels"] = "../examples/classification/synset_words.txt"
labels = xdnn_io.get_labels(config['labels'])
xdnn_io.printClassification(softmaxOut, pl, labels)

#Print Original Image for Reference 
img = cv2.imread(config["images"][0])
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img)
plt.title(config["images"])
plt.show()
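
If you want the top predictions as data rather than printed text (for example, to send them on for further processing), a sketch like the following could be used. The helper `top_k` and the sample labels and probabilities are hypothetical.

```python
import numpy as np

def top_k(probabilities, labels, k=5):
    """Return the k most likely (label, probability) pairs, best first."""
    order = np.argsort(probabilities)[::-1][:k]
    return [(labels[i], float(probabilities[i])) for i in order]

# Assumed example values standing in for one row of the softmax output
example_labels = ["cat", "dog", "fox", "wolf"]
example_probs = np.array([0.05, 0.70, 0.10, 0.15])
for label, prob in top_k(example_probs, example_labels, k=3):
    print("{:.2f} {}".format(prob, label))
```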

### Step 13. Close the handle.

In [None]:
xdnn.closeHandle()
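
Because another user cannot use the FPGA while it is locked, it helps to guarantee that the handle is closed even when an earlier step raises. The sketch below shows one way to do that with a context manager. `managed_handle` is a hypothetical helper, and the open and close functions are stand-ins so the example runs without hardware.

```python
from contextlib import contextmanager

@contextmanager
def managed_handle(open_fn, close_fn):
    """Call open_fn, yield its result, and always call close_fn afterwards."""
    resource = open_fn()
    try:
        yield resource
    finally:
        close_fn()

events = []

def fake_open():
    events.append("open")
    return "handle"

def fake_close():
    events.append("close")

try:
    with managed_handle(fake_open, fake_close) as handle:
        events.append("use " + handle)
        raise RuntimeError("simulated failure during inference")
except RuntimeError:
    events.append("error handled")

print(events)
```

In this notebook, `open_fn` could wrap the `xdnn.createHandle(config["xclbin"])` call from Step 5 and `close_fn` could be `xdnn.closeHandle`.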

## That's it!