<a href="https://colab.research.google.com/github/dfeddema/Nvidia-RH/blob/main/bert_squad_tf_finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" width=60 height=60 align="left"/>
<img src="https://info.nvidia.com/rs/156-OFN-742/images/Red_Hat_new_BW.jpg" width=100 height=100 align="left"/>

<br><br>

# BERT Question Answering Fine-Tuning & Deployment Using Triton Inference Server in an OCP Kubernetes Cluster

## Overview

This notebook demonstrates how to: 

1. Fine-tune BERT on the SQuAD QA dataset on a Red Hat OpenShift (OCP) Kubernetes cluster
2. Optimize the model with NVIDIA TensorRT
3. Deploy the fine-tuned BERT QA TensorFlow (TF) and TensorRT (TRT) models with NVIDIA Triton Inference Server on the OCP Kubernetes cluster
4. Observe inference metrics on Prometheus and Grafana

## 1. Requirements

- NVIDIA GPU 
  - A100 or V100 or T4
- OpenShift Platform
- AWS S3 or GCP storage bucket 


## 2. Links

**Nvidia NGC resources**


* BERT Tensorflow: 

  https://ngc.nvidia.com/catalog/resources/nvidia:bert_for_tensorflow 

* BERT pre-trained checkpoint: 
  
  https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_ckpt_large_pretraining_amp_lamb 

* BERT fine-tuned on QA checkpoint: 

  https://ngc.nvidia.com/catalog/models/nvidia:bert_tf_ckpt_large_qa_squad11_amp_384

* TensorRT

  https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt

* Triton Inference Server

  https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver



**Red Hat OpenShift resources**

* Red Hat OpenShift link

  http://openshift.com/

* OpenShift operators

  https://www.openshift.com/learn/topics/operators




## 3. BERT Question Answering Task

Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks such as Question Answering, Sentiment Analysis and Named Entity Recognition.

The original paper can be found here: https://arxiv.org/abs/1810.04805.

<img src="https://drive.google.com/uc?export=view&id=1rcQBXaiJhEqaUNxaXTulA1Nb4FLhoxbj">

NVIDIA's BERT model is an optimized version of Google's official implementation, leveraging *mixed precision arithmetic* and *Tensor Cores* on A100, V100 and T4 GPUs for faster training times while maintaining target accuracy.


**Pre-training vs Fine-tuning**

<img src="https://drive.google.com/uc?export=view&id=1YKbZedBjiJZwUUxA4Ru9L2PVhglgc9lV">

A language model like BERT requires pre-training to be able to encode a given type of text into representations that capture the underlying meaning of sentences. Once the meaning of words in the language is encoded into the model, one can fine-tune the model to solve a particular problem like QA.

Compared to pre-training, fine-tuning is generally far less computationally demanding. We therefore start from a pre-trained BERT checkpoint from NGC and fine-tune it to answer questions about paragraphs from the SQuAD dataset.


Based on the model size, we have the following two default configurations of BERT.

| **Model** | **Hidden layers** | **Hidden unit size** | **Attention heads** | **Feedforward filter size** | **Max sequence length** | **Parameters** |
|:---------:|:----------:|:----:|:---:|:--------:|:---:|:----:|
|BERTBASE |12 encoder| 768| 12|4 x  768|512|110M|
|BERTLARGE|24 encoder|1024| 16|4 x 1024|512|330M|
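
The parameter counts in the table follow from the layer count and hidden size. The sketch below is a rough estimate, assuming a WordPiece vocabulary of 30,522 tokens and two segment types (neither value is stated in this notebook) and counting only the encoder plus pooler; the function name is illustrative. It lands close to the rounded figures in the table.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate number of trainable parameters in a BERT encoder with pooler."""
    ff = 4 * hidden  # feed-forward filter size, 4 x hidden as in the table
    # Token, position and segment embeddings, plus the embedding LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections (weights + biases), plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two dense layers (hidden -> ff -> hidden) with biases, plus LayerNorm.
    feed_forward = (hidden * ff + ff) + (ff * hidden + hidden) + 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f}M parameters")
print(f"BERT-large: {large / 1e6:.1f}M parameters")
```

Note that the number of attention heads does not change the count: the heads split the same projection matrices into smaller slices.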

We will use large pre-trained models available on NGC (NVIDIA GPU Cloud, https://ngc.nvidia.com).
There are many configurations available; in particular, we will download and use the following:

**bert_tf_large_fp16_384**

This model is pre-trained on the Wikipedia and BookCorpus datasets.
We will fine-tune it on the SQuAD 1.1 dataset.


## 3. Dataset

The **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD) comprises more than 100,000 question-answer pairs based on Wikipedia articles.

**Links**: 

- SQuAD download: https://rajpurkar.github.io/SQuAD-explorer/
- SQuAD paper: https://arxiv.org/pdf/1606.05250.pdf

<img src="https://drive.google.com/uc?export=view&id=157Frt0Z3cYaITldxirJqA6JhS9U2yq-t">


The dataset contains the following files:

* train-v1.1.json - Used for fine-tuning the model
* dev-v1.1.json - Used for validating the model
* evaluate-v1.1.py - Used for scoring the model's predictions on dev-v1.1.json
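
To get a feel for the JSON layout before training, the sketch below walks the nested `data` / `paragraphs` / `qas` structure used by SQuAD v1.1. The sample record is hand-written for illustration and is not taken from the dataset; the helper names are hypothetical.

```python
import json


def load_squad(path):
    """Load a SQuAD JSON file such as train-v1.1.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_qa_pairs(squad):
    """Yield (context, question, answers) for every question in the file."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield paragraph["context"], qa["question"], qa["answers"]


# Hand-written sample in the SQuAD v1.1 layout (illustrative only).
sample = {
    "version": "1.1",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "Triton serves models on GPUs.",
            "qas": [{
                "id": "q1",
                "question": "What does Triton serve?",
                "answers": [{"text": "models", "answer_start": 14}],
            }],
        }],
    }],
}

pairs = list(iter_qa_pairs(sample))
for context, question, answers in pairs:
    start = answers[0]["answer_start"]
    text = answers[0]["text"]
    # answer_start is a character offset into the context paragraph.
    print(question, "->", context[start:start + len(text)])
```

After downloading the dataset, `len(list(iter_qa_pairs(load_squad("train-v1.1.json"))))` gives the number of training questions.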


## 4. Fine-tuning BERT

Steps:

#### 1. Download the training scripts, model checkpoints and dataset from the links in the Links section

#### 2. Check the parameters in `<bert-training-code>/scripts/run_squad.sh`


```
batch_size=${1:-"8"} : Batch size of the model
learning_rate=${2:-"5e-6"} : Learning rate 
precision=${3:-"fp16"} : Floating point 16 precision
use_xla=${4:-"true"} : By default use XLA optimizations
num_gpu=${5:-"8"} : Number of GPUs to use for training
seq_length=${6:-"384"} : Maximum input sequence length
doc_stride=${7:-"128"} : When splitting up a long document into chunks, how much stride to take between chunks
bert_model=${8:-"large"} : BERT large model
```
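
The interaction between `seq_length` and `doc_stride` is easier to see with numbers. The following is a minimal sketch of the sliding-window splitting, assuming it follows the approach of the reference BERT SQuAD preprocessing (three positions reserved for the `[CLS]` and two `[SEP]` tokens); the function name and token counts are illustrative.

```python
def split_into_spans(num_doc_tokens, num_query_tokens, max_seq_length=384, doc_stride=128):
    """Return (start, length) windows that cover a tokenized document."""
    # Room left for the paragraph after the question and the 3 special tokens.
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reached the end of the document
        # Advance by the stride so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


# A 1000-token paragraph with a 13-token question.
spans = split_into_spans(num_doc_tokens=1000, num_query_tokens=13)
for start, length in spans:
    print(f"tokens {start}..{start + length - 1}")
```

A smaller stride produces more overlapping windows, so an answer near a window boundary is more likely to appear with full context in at least one of them, at the cost of more training examples.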


 
#### 3. We will kick off training with the run_squad.sh script, passing these additional arguments after the eight parameters above:


```
squad_version: 1.1
model_checkpoint: /home/model_checkpoints/model.ckpt-1000000
epochs: 1.5
```

The final command will look like this: 



`bash scripts/run_squad.sh 1 5e-6 fp16 false 1 384 128 large 1.1 /home/model_checkpoints/model.ckpt-1000000 1.5`
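
Because `run_squad.sh` takes its settings positionally, it is easy to swap two values by accident. The sketch below assembles the same command from named settings; the argument order is taken from the lists above, and the helper name is hypothetical.

```python
import shlex

# Positional order: the eight script parameters, then the three extra arguments.
ARG_ORDER = [
    "batch_size", "learning_rate", "precision", "use_xla", "num_gpu",
    "seq_length", "doc_stride", "bert_model",
    "squad_version", "model_checkpoint", "epochs",
]


def build_run_squad_command(**settings):
    """Return the run_squad.sh command line for the given named settings."""
    missing = [name for name in ARG_ORDER if name not in settings]
    if missing:
        raise ValueError(f"missing settings: {', '.join(missing)}")
    args = [str(settings[name]) for name in ARG_ORDER]
    return shlex.join(["bash", "scripts/run_squad.sh", *args])


command = build_run_squad_command(
    batch_size=1,
    learning_rate="5e-6",  # kept as a string so it is passed through unchanged
    precision="fp16",
    use_xla="false",
    num_gpu=1,
    seq_length=384,
    doc_stride=128,
    bert_model="large",
    squad_version="1.1",
    model_checkpoint="/home/model_checkpoints/model.ckpt-1000000",
    epochs="1.5",
)
print(command)
```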

#### 4. Now we are ready to package the above steps into a YAML manifest for deployment on the OpenShift cluster


## 5. Using OpenShift "oc" Commands to access the cluster from the CLI and Operator Installation from the OpenShift Web Console

Users can access the OpenShift cluster from either the CLI or the OpenShift Web Console. You can also use a combination of both, which we will demonstrate here.

Display your user name


In [None]:
! oc whoami

Display the nodes on this Openshift Cluster


In [None]:
! oc get nodes

Verify that the GPU worker node (with an NVIDIA T4) is labeled correctly. The output of the command below shows that the Node Feature Discovery (NFD) Operator has labeled this node to indicate it has an NVIDIA GPU. NFD labels the host with node-specific attributes such as PCI cards and kernel or OS version. The PCI label from NFD is used to schedule GPU workloads only on nodes that have a GPU (0x10DE is the PCI vendor ID for NVIDIA).


In [None]:
# copy the first worker node name into next command 
! oc describe node <worker-node-from-previous-line> | egrep 'Roles|pci'
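
The same check can be scripted against the JSON output of `oc get nodes -o json`. This is a minimal sketch: the exact NFD label key depends on how NFD is configured, so the code only assumes that the key starts with `feature.node.kubernetes.io/pci-`, contains the vendor ID `10de` and ends in `.present`. The node list below is a made-up sample, not real cluster output.

```python
NFD_PCI_PREFIX = "feature.node.kubernetes.io/pci-"
NVIDIA_VENDOR_ID = "10de"


def has_nvidia_pci_label(labels):
    """True if any NFD PCI label marks an NVIDIA device as present."""
    return any(
        key.startswith(NFD_PCI_PREFIX)
        and NVIDIA_VENDOR_ID in key
        and key.endswith(".present")
        and value == "true"
        for key, value in labels.items()
    )


def gpu_node_names(node_list):
    """Return the names of nodes carrying an NVIDIA PCI label."""
    return [
        item["metadata"]["name"]
        for item in node_list["items"]
        if has_nvidia_pci_label(item["metadata"].get("labels", {}))
    ]


# Made-up sample shaped like the output of `oc get nodes -o json`.
sample_nodes = {
    "items": [
        {"metadata": {"name": "worker-0", "labels": {
            "node-role.kubernetes.io/worker": "",
            "feature.node.kubernetes.io/pci-10de.present": "true",
        }}},
        {"metadata": {"name": "worker-1", "labels": {
            "node-role.kubernetes.io/worker": "",
        }}},
    ]
}

print(gpu_node_names(sample_nodes))
```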

Now switch over to your OpenShift Web Console. You can find its URL by running the command below, which shows the OCP 4 console route.

In [None]:
! oc get -n openshift-console route console

## 6. Fine-tuning BERT QA model  Step 1 

Download the fine-tuning training scripts, the model checkpoints for the BERT model and the Stanford Question Answering Dataset (SQuAD) to your local machine.




Download the training scripts for BERT fine-tuning

In [None]:
! wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/bert_for_tensorflow/versions/20.06.7/zip -O bert_for_tensorflow_20.06.7.zip

Download the model checkpoints for BERT fine-tuning


In [None]:
! wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/bert_for_tensorflow/versions/1/zip -O bert_for_tensorflow_1.zip


Download version 1.1 of the SQuAD data from https://rajpurkar.github.io/SQuAD-explorer/



In [None]:
! wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json  -O train-v1.1.json

In [None]:
! wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json   -O dev-v1.1.json

In [None]:
! wget https://github.com/allenai/bi-att-flow/archive/master.zip 

In [None]:
! ls -l bert_for_tensorflow_20.06.7.zip bert_for_tensorflow_1.zip train-v1.1.json dev-v1.1.json master.zip

## Fine-tuning BERT QA model Step 2 
Upload the fine-tuning scripts, BERT model checkpoints and SQuAD data to an S3 bucket.


In [None]:
!aws s3 cp bert_for_tensorflow_20.06.7.zip s3://openshift-bert-demo/bert-finetuning-scripts/bert_for_tensorflow_20.06.7.zip

In [None]:
!aws s3 cp bert_for_tensorflow_1.zip s3://openshift-bert-demo/bert-model-checkpoints/bert_for_tensorflow_1.zip

In [None]:
!aws s3 cp train-v1.1.json s3://openshift-bert-demo/squad/train-v1.1.json

In [None]:
!aws s3 cp dev-v1.1.json s3://openshift-bert-demo/squad/dev-v1.1.json

In [None]:
!aws s3 ls

## Fine-tuning BERT QA model Step 3
Create a persistent volume claim in OpenShift in the namespace where you plan to run your BERT training (fine-tuning) job.

Create a file on your local system containing this yaml. In this example we call the file "pvc.yaml"




```
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ocs-ml-data
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
  storageClassName: ocs-storagecluster-cephfs
```



Create the persistent volume claim in OCS


In [None]:
!oc create -f pvc.yaml

In [None]:
!oc get pvc --namespace default



You can fine-tune BERT to accomplish different tasks. In this step we fine-tune BERT for a Question/Answer task by training it on the SQuAD dataset (a labeled dataset). This transfer learning characteristic of BERT-style models enables them to be adapted to different NLP problems (e.g. sentiment analysis and named entity recognition).

Save the following pod manifest as `bert-fine-tuning.yaml`. When we submit it to our OpenShift cluster, the OpenShift scheduler will place the pod on a worker node with an NVIDIA GPU.



```
apiVersion: v1
kind: Pod
metadata:
  name: bert
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
    - name: bert
      image: "nvcr.io/nvidia/tensorflow:20.09-tf1-py3"
      command: ["/bin/bash", "-ec", "cd /home/scripts; export BERT_DIR=/home/scripts; export MODEL_DIR=/home/model_checkpoints;
        bash scripts/run_squad.sh 1 5e-6 fp16 false 1 384 128 large 1.1 /home/model_checkpoints/model.ckpt-1000000 1.5"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=5.0"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
      volumeMounts:
      - mountPath: /home
        name: ocs-ml-data
  volumes:
  - name: ocs-ml-data
    persistentVolumeClaim:
      # name of the persistent volume claim created above
      claimName: ocs-ml-data 
      readOnly: false

```



In [None]:
! cat bert-fine-tuning.yaml


Submit the fine-tuning job

In [None]:
! oc create -f bert-fine-tuning.yaml

View running pods in namespace 'default'


In [None]:
! oc get pods --namespace default

View logs from fine-tuning (training) pod


In [None]:
! oc logs -f bert
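
Fine-tuning can run for a long time, so it may be convenient to poll the pod phase instead of following the logs. This is a minimal sketch, assuming `oc` is on the `PATH` and logged in to the cluster; the helper names, polling interval and timeout are illustrative choices.

```python
import subprocess
import time


def pod_phase(name, namespace="default"):
    """Ask the cluster for the pod's current phase (e.g. Pending, Running)."""
    result = subprocess.run(
        ["oc", "get", "pod", name, "-n", namespace, "-o", "jsonpath={.status.phase}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_for_completion(get_phase, poll_seconds=30, timeout_seconds=6 * 3600, sleep=time.sleep):
    """Poll until the pod reaches a terminal phase and return that phase."""
    waited = 0
    while waited <= timeout_seconds:
        phase = get_phase()
        if phase in ("Succeeded", "Failed"):
            return phase
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"pod did not finish within {timeout_seconds} seconds")


# Example usage against the cluster (uncomment to run):
# final_phase = wait_for_completion(lambda: pod_phase("bert"))
# print("fine-tuning pod finished with phase:", final_phase)
```

Passing the phase lookup in as a function keeps the polling loop independent of the `oc` call, which makes it easy to try out without a cluster.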

## 7. TensorRT optimizations

### What is TensorRT?
TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks leveraging libraries, development tools and technologies in CUDA-X. 

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications. Reduced precision inference significantly reduces application latency in production.

Prominent features of TensorRT:

* Weight & Activation Precision Calibration: 

  Maximizes throughput by quantizing models to INT8 while preserving accuracy


* Layer & Tensor Fusion

  Optimizes use of GPU memory and bandwidth by fusing nodes in a kernel


* Kernel Auto-Tuning

  Selects best data layers and algorithms based on target GPU platform
  

* Dynamic Tensor Memory

  Minimizes memory footprint and re-uses memory for tensors efficiently
  

* Multi-Stream Execution

  Scalable design to process multiple input streams in parallel
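
To make the precision calibration idea concrete, here is a simplified sketch of symmetric per-tensor INT8 quantization in plain Python. It is an illustration of the general technique only: it uses the maximum absolute value as the range, whereas TensorRT chooses ranges through its own calibration process, which is not reproduced here. The weight values are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0, 0.003]  # made-up example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

for original, q, approx in zip(weights, quantized, restored):
    print(f"{original:+.4f} -> {q:+4d} -> {approx:+.4f}")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why a well-chosen range can preserve accuracy while storing each value in a single byte.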

Optimizing fine-tuned BERT QA model to TensorRT (TRT)

Steps:

#### 1. Clone TensorRT Github repository on OCP node: https://github.com/NVIDIA/TensorRT.git

#### 2. We'll use TensorRT container from NGC: https://ngc.nvidia.com/catalog/containers/nvidia:tensorrt 

  This container does not come preinstalled with all the python dependencies. Please install the dependencies by executing the following command from within the container: 

```
/opt/tensorrt/python/python_setup.sh 
```

#### 3. We'll be using /TensorRT/demo/BERT/builder.py script to build our optimized TensorRT engine with the following arguments:


```
# Flags passed to builder.py:
#   -m      fine-tuned BERT model checkpoint
#   -o      output path of the TRT engine file
#   -b      batch size
#   -s      sequence length
#   --fp32  precision
#   -c      config dir
# Comments cannot follow a line-continuation backslash, so they are listed here.
mkdir -p /home/engines && \
python3 builder.py \
  -m /home/bert-fine-tuned/model.ckpt-8144 \
  -o /home/engines/bert_large_128.engine \
  -b 1 \
  -s 128 \
  --fp32 \
  -c /home/bert-fine-tuned/

```

Now we are ready to package the above steps in a YAML manifest for deployment on OCP. We can reuse the manifest used for training by modifying the image and the command to run in the pod. In this example we save it as `trt_export.yaml`.



```
apiVersion: v1
kind: Pod
metadata:
  name: trt
  namespace: default
spec:
  restartPolicy: OnFailure
  containers:
    - name: trt
      image: "nvcr.io/nvidia/tensorrt:20.09-py3"
      command: ["/bin/bash", "-ec", " bash /opt/tensorrt/python/python_setup.sh; cd /home/TensorRT/demo/BERT; mkdir -p /home/engines && python3 builder.py -m /home/bert-fine-tuned/model.ckpt-8144 -o /home/engines/bert_large_128.engine -b 1 -s 128 --fp32 -c /home/bert-fine-tuned/;"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=5.0"
      securityContext:
        privileged: true
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
      volumeMounts:
      - mountPath: /home
        name: ocs-ml-data
  volumes:
  - name: ocs-ml-data
    persistentVolumeClaim:
      # name of the persistent volume claim created earlier
      claimName: ocs-ml-data
      readOnly: false
```




In [None]:
! oc create -f trt_export.yaml

Let's check the status of the pod

In [None]:
! oc get pods

Finally, let's check the logs and make sure the engine is created in the output directory

In [None]:
! oc logs trt

## 8. Model deployment in Triton Inference Server on OpenShift

Once the model is fine-tuned and optimized it can be deployed into production using Triton Inference Server. 

The Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any AI model at scale.

The inference server supports:
* All major DL frameworks like TensorFlow, PyTorch, ONNX, TensorRT 
* Low latency real time inferencing
* Dynamic batching to maximize GPU/CPU utilization
* GPU Metrics in Prometheus format
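
Metrics in Prometheus format are plain text, so they are easy to inspect even before Prometheus and Grafana are wired up. The sketch below parses such text into a dictionary. It assumes Triton's default metrics endpoint (port 8002, path `/metrics`); the sample lines and their values are made up for illustration and are not real server output.

```python
def parse_prometheus_text(text):
    """Return {metric_with_labels: value} for every sample line."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        metrics[name_and_labels] = float(value)
    return metrics


# Made-up sample in the Prometheus text exposition format.
sample_metrics = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="bert",version="1"} 12
# HELP nv_gpu_utilization GPU utilization rate
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.5
"""

for name, value in parse_prometheus_text(sample_metrics).items():
    print(name, "=", value)
```

Once the server is running, the same function can be applied to the text returned by `curl <triton-ip>:8002/metrics`.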


We'll deploy both the fine-tuned BERT QA TensorFlow model and the optimized TensorRT model in Triton on OCP.

Steps:

#### 1. Clone the Triton GitHub repository to your local machine
 
```
git clone https://github.com/triton-inference-server/server.git
```

#### 2. Deployment files are in \<Triton repo\>/deploy/single_server


```
Chart.yaml                    # Helm Chart for deployment
dashboard.json                # Grafana dashboard
values.yaml                   # Helm values file with info on image, model repository etc.
templates/deployment.yaml     # Deployment file
templates/service.yaml        # Service for exporting metrics 
```

Make sure to set the right values in values.yaml:


```
imageName: nvcr.io/nvidia/tritonserver:20.09-py3      # Nvidia Triton server docker image
modelRepositoryPath: /home/model_repository           # Model repository path can be link to S3 bucket, GS storage or local folder
```

Triton expects the model repository for a TensorFlow SavedModel in the specific layout shown below:



```
<model-repository-path>/ 
  <model-name>/                                       # bert                            
    config.pbtxt
    1/                                                # version
      model.savedmodel/
         <saved-model files>                          # variables
```
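
The layout above can be created with a few lines of Python before copying in the exported model. This is a minimal sketch: the helper name is hypothetical, it writes into a temporary directory rather than `/home/model_repository`, and the generated `config.pbtxt` is deliberately incomplete because the input and output tensor definitions depend on your exported model.

```python
import tempfile
from pathlib import Path


def create_model_repository(root, model_name="bert", version=1):
    """Create <root>/<model_name>/<version>/model.savedmodel and a stub config."""
    model_dir = Path(root) / model_name
    saved_model_dir = model_dir / str(version) / "model.savedmodel"
    saved_model_dir.mkdir(parents=True, exist_ok=True)
    # Stub only: add max_batch_size, input and output sections for your model.
    config = f'name: "{model_name}"\nplatform: "tensorflow_savedmodel"\n'
    (model_dir / "config.pbtxt").write_text(config)
    return saved_model_dir


repo_root = Path(tempfile.mkdtemp(prefix="model_repository_"))
saved_model_dir = create_model_repository(repo_root)

# Show the resulting tree relative to the repository root.
for path in sorted(repo_root.rglob("*")):
    print(path.relative_to(repo_root))
```

The SavedModel files exported after fine-tuning then go into the `model.savedmodel` directory.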

#### 3. Deploy Triton server using Helm


In [None]:
! helm install triton server/deploy/single_server/

Check if the pod is created and running

In [None]:
! oc get pods

Check the logs of the pod

In [None]:
! oc logs <triton-pod-name>  # paste the pod name from the previous command

Get external IP of load balancer

In [None]:
! oc get services 

Check the status of Triton Inference server

In [None]:
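# Each `!` command runs in its own subshell, so a shell `export` would not persist.
# Store the address in a Python variable and let IPython substitute it instead.
triton_ip = ""           # paste external IP of load balancer here

! curl -v {triton_ip}:8000/v2/health/ready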
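
The readiness check can also be done from Python, which is handy when a later cell should only run once the server is up. This is a minimal sketch using only the standard library; the helper names and the timeout are illustrative, and the port is the HTTP port used in the `curl` command above.

```python
import urllib.request


def ready_url(host, port=8000):
    """Build the URL of Triton's readiness endpoint."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host, port=8000, timeout=5):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP error responses.
        return False


# Example usage (replace the placeholder with the load balancer IP):
# print("Triton ready:", is_ready("<external-ip>"))
```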