### TAO remote client (Multi-class Classification)

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use Python-based AI toolkit for taking purpose-built AI models and customizing them with users' own data.

![image](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)


### The workflow in a nutshell

- Creating a dataset
- Uploading the VOC dataset to the service
- Running dataset convert
- Getting a pretrained model (PTM) from NGC
- Model Actions
    - Train
    - Evaluate
    - Prune, retrain
    - Export
    - Convert
    - Inference on TAO
    - Inference on TRT
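
The `--job` arguments used in the cells of this notebook imply an ordering between the model actions listed above: each action consumes the job ID of an earlier one. The sketch below encodes that ordering as a dependency map and derives one valid run order with the standard-library `graphlib` module (Python 3.9 or later). The action labels are illustrative names chosen for this example, not `tao-client` commands, and TRT inference is left out because this notebook has no run cell for it.

```python
from graphlib import TopologicalSorter

# Each action maps to the action whose job ID it consumes via --job.
# The labels are illustrative only; they are not tao-client commands.
DEPENDS_ON = {
    "train": [],
    "evaluate": ["train"],
    "prune": ["train"],
    "retrain": ["prune"],
    "evaluate_retrained": ["retrain"],
    "export_fp32": ["train"],
    "export_int8": ["train"],
    "convert": ["export_int8"],
    "inference_tao": ["train"],
}


def run_order(depends_on):
    """Return one valid execution order in which parents precede children."""
    return list(TopologicalSorter(depends_on).static_order())


order = run_order(DEPENDS_ON)
print(order)
```

Several orders are valid (for example, evaluate and prune can run in either order once train has finished), so the exact list printed is not significant; only the parent-before-child property is.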

### Table of contents

1. [Install TAO remote client](#head-1)
1. [Set the remote service base URL](#head-2)
1. [Access the shared volume](#head-3)
1. [Create the datasets](#head-4)
1. [List datasets](#head-5)
1. [Provide and customize dataset convert specs](#head-6)
1. [Run dataset convert](#head-7)
1. [Create a model experiment](#head-8)
1. [Find classification pretrained model](#head-9)
1. [Customize model metadata](#head-10)
1. [Provide train specs](#head-11)
1. [Run train](#head-12)
1. [Provide evaluate specs](#head-13)
1. [Run evaluate](#head-14)
1. [Provide prune specs](#head-15)
1. [Run prune](#head-16)
1. [Provide retrain specs](#head-17)
1. [Run retrain](#head-18)
1. [Run evaluate on retrained model](#head-18-1)
1. [Provide FP32 export specs](#head-19)
1. [Run FP32 export](#head-20)
1. [Provide Int8 export specs](#head-21)
1. [Run Int8 export](#head-22)
1. [Provide model convert specs](#head-23)
1. [Run model convert](#head-24)
1. [Provide TAO inference specs](#head-25)
1. [Run TAO inference](#head-26)
1. [Provide TRT inference specs](#head-27)
1. [Run TRT inference](#head-28)
1. [Delete experiment](#head-30)
1. [Delete datasets](#head-31)
1. [Unmount shared volume](#head-32)
1. [Uninstall TAO Remote Client](#head-33)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import os
import glob
import subprocess
import getpass
import uuid
import json

In [None]:
namespace = 'default'

### Install TAO remote client <a class="anchor" id="head-1"></a>

In [None]:
# SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# View the version of the TAO-Client
! tao-client --version

### FIXME


1. Assign the ip_address and port_number in FIXME 1 and FIXME 2 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
2. Set NGC API key in FIXME 3
3. Assign path of data directory in FIXME 4
4. Choose between default or custom dataset in FIXME 5

### Set the remote service base URL <a class="anchor" id="head-2"></a>

In [None]:
# Define the node_addr and port number
node_addr = "<ip_address>" # FIXME1 example: 10.137.149.22
node_port = "<port_number>" # FIXME2 example: 32334
# On the host machine, the node ip_address and port number can be obtained as follows:
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
%env BASE_URL=http://{node_addr}:{node_port}/{namespace}/api/v1
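
A common mistake at this step is running the cell with the `<ip_address>` or `<port_number>` placeholder still in place. The following is a minimal sketch, assuming a hypothetical helper named `build_base_url` (not part of `tao-client`), that composes the URL in the same shape as the `%env BASE_URL` line and fails early when the port is not a number.

```python
def build_base_url(node_addr, node_port, namespace="default"):
    """Compose the service URL in the same shape as the %env BASE_URL line."""
    # int() raises ValueError if the "<port_number>" placeholder was left unedited.
    port = int(node_port)
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"http://{node_addr}:{port}/{namespace}/api/v1"


# Example values taken from the FIXME comments in the cell above.
print(build_base_url("10.137.149.22", "32334"))
# http://10.137.149.22:32334/default/api/v1
```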

In [None]:
# FIXME: Set the ngc_api_key variable
ngc_api_key = "<ngc_api_key>" # FIXME3 example: zZYtczM5amdtdDcwNjk0cnA2bGU2bXQ3bnQ6NmQ4NjNhMDItMTdmZS00Y2QxLWI2ZjktNmE5M2YxZTc0OGyM

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'tao-client login --ngc-api-key {ngc_api_key}'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}
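
If the login fails, `subprocess.getoutput` returns an error message rather than JSON, and `json.loads` raises an exception that is hard to interpret. The sketch below shows one way to validate the response before using it. The helper name `parse_identity` is hypothetical; the only assumption about the response, taken from the cell above, is that it is a JSON object with `user_id` and `token` keys. The sample string is made up.

```python
import json


def parse_identity(raw_output):
    """Parse login output and check that the keys used by this notebook exist."""
    try:
        identity = json.loads(raw_output)
    except json.JSONDecodeError as err:
        raise RuntimeError(f"login did not return JSON: {raw_output!r}") from err
    if not isinstance(identity, dict):
        raise RuntimeError(f"unexpected login response: {identity!r}")
    missing = [key for key in ("user_id", "token") if key not in identity]
    if missing:
        raise RuntimeError(f"login response is missing keys: {missing}")
    return identity


# Made-up response used only to exercise the helper.
sample = '{"user_id": "demo-user", "token": "demo-token"}'
identity_demo = parse_identity(sample)
print(identity_demo["user_id"])  # demo-user
```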

### Access the shared volume <a class="anchor" id="head-3"></a>

In [None]:
# Get PVC ID
pvc_id = subprocess.getoutput(f'kubectl get pvc tao-toolkit-api-pvc -n {namespace} -o jsonpath="{{.spec.volumeName}}"')
print(pvc_id)

In [None]:
# Get NFS server info
provisioner = json.loads(subprocess.getoutput('helm get values nfs-subdir-external-provisioner -o json'))
nfs_server = provisioner['nfs']['server']
nfs_path = provisioner['nfs']['path']
print(nfs_server, nfs_path)

In [None]:
user = getpass.getuser()
home = os.path.expanduser('~')

! echo "Password for {user}"
password = getpass.getpass()

In [None]:
# Mount shared volume 
! mkdir -p ~/shared

command = "apt-get -y install nfs-common >> /dev/null"
! echo {password} | sudo -S -k {command}

command = f"mount -t nfs {nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id} ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

### Create the datasets <a class="anchor" id="head-4"></a>

This example uses the VOC image classification dataset. Make sure the tar file (`VOCtrainval_11-May-2012.tar`) has been downloaded into `DATA_DIR` (set in FIXME 4).

**If you use a custom dataset, it must follow the structure below, and you should skip the "Split dataset into train and val sets" cell.**
```
DATA_DIR
├── images_test
│   ├── class_name_1
│   │   ├── image_name_1.jpg
│   │   ├── image_name_2.jpg
│   │   ├── ...
│   │   ...
│   └── class_name_n
│       ├── image_name_3.jpg
│       ├── image_name_4.jpg
│       ├── ...
├── images_train
│   ├── class_name_1
│   │   ├── image_name_5.jpg
│   │   ├── image_name_6.jpg
│   │   ...
│   └── class_name_n
│       ├── image_name_7.jpg
│       ├── image_name_8.jpg
│       ├── ...
│
└── images_val
    ├── class_name_1
    │   ├── image_name_9.jpg
    │   ├── image_name_10.jpg
    │   ├── ...
    │   ...
    └── class_name_n
        ├── image_name_11.jpg
        ├── image_name_12.jpg
        ├── ...
```
- Each class-name folder must contain the images that belong to that class
- The same class-name folders must be present in images_test, images_train and images_val
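
The two rules above can be checked before uploading a custom dataset. The following is a minimal sketch, assuming hypothetical helpers named `class_folders` and `check_layout`; it reports, per split, which class folders are missing. The demo builds a throwaway directory tree with made-up class names.

```python
import os
import tempfile

SPLITS = ("images_train", "images_val", "images_test")


def class_folders(data_dir, split):
    """Return the set of class folder names found under data_dir/split."""
    split_dir = os.path.join(data_dir, split)
    return {
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name))
    }


def check_layout(data_dir):
    """Map each split to the class folders it lacks; an empty dict means consistent."""
    per_split = {split: class_folders(data_dir, split) for split in SPLITS}
    all_classes = set().union(*per_split.values())
    return {
        split: sorted(all_classes - found)
        for split, found in per_split.items()
        if all_classes - found
    }


# Demo: build a small tree, then remove one class folder from images_val.
with tempfile.TemporaryDirectory() as demo_dir:
    for split in SPLITS:
        for class_name in ("cat", "dog"):
            os.makedirs(os.path.join(demo_dir, split, class_name))
    os.rmdir(os.path.join(demo_dir, "images_val", "dog"))
    problems = check_layout(demo_dir)

print(problems)  # {'images_val': ['dog']}
```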

In [None]:
DATA_DIR = "classification_data" # FIXME4
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR

In [None]:
dataset_to_be_used = "default" # FIXME5 example: default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset

In [None]:
if dataset_to_be_used == "default":
    if not os.path.exists(os.path.join(DATA_DIR,"VOCtrainval_11-May-2012.tar")):
        print("Download VOC tar data into ", DATA_DIR)
    else:
        !tar -xf $DATA_DIR/VOCtrainval_11-May-2012.tar -C $DATA_DIR

In [None]:
if dataset_to_be_used == "default":
    !python3 -m pip install tqdm
    # Split dataset into train and val sets
    from os.path import join as join_path
    import os
    import glob
    import re
    import shutil

    DATA_DIR=os.environ.get('DATA_DIR')
    source_dir = join_path(DATA_DIR, "VOCdevkit/VOC2012")
    target_dir = join_path(DATA_DIR, "formatted")


    suffix = '_trainval.txt'
    classes_dir = join_path(source_dir, "ImageSets", "Main")
    images_dir = join_path(source_dir, "JPEGImages")
    classes_files = glob.glob(classes_dir+"/*"+suffix)
    for file in classes_files:
        # get the filename and make output class folder
        classname = os.path.basename(file)
        if classname.endswith(suffix):
            classname = classname[:-len(suffix)]
            target_dir_path = join_path(target_dir, classname)
            if not os.path.exists(target_dir_path):
                os.makedirs(target_dir_path)
        else:
            continue
        print(classname)


        with open(file) as f:
            content = f.readlines()


        for line in content:
            # Raw string avoids an invalid-escape warning for \s
            tokens = re.split(r'\s+', line)
            if tokens[1] == '1':
                # copy this image into target dir_path
                target_file_path = join_path(target_dir_path, tokens[0] + '.jpg')
                src_file_path = join_path(images_dir, tokens[0] + '.jpg')
                shutil.copyfile(src_file_path, target_file_path)

    from random import shuffle
    from tqdm import tqdm

    DATA_DIR=os.environ.get('DATA_DIR')
    SOURCE_DIR=os.path.join(DATA_DIR, 'formatted')
    TARGET_DIR=os.path.join(DATA_DIR,'split')
    # List the class directories (immediate subdirectories of SOURCE_DIR)
    dir_list = next(os.walk(SOURCE_DIR))[1]
    print(dir_list)
    # for each dir, create a new dir in split
    for dir_i in tqdm(dir_list):
        newdir_train = os.path.join(TARGET_DIR, 'images_train', dir_i)
        newdir_val = os.path.join(TARGET_DIR, 'images_val', dir_i)
        newdir_test = os.path.join(TARGET_DIR, 'images_test', dir_i)

        # Create the output folders if they do not exist yet
        os.makedirs(newdir_train, exist_ok=True)
        os.makedirs(newdir_val, exist_ok=True)
        os.makedirs(newdir_test, exist_ok=True)

        img_list = glob.glob(os.path.join(SOURCE_DIR, dir_i, '*.jpg'))
        # shuffle data
        shuffle(img_list)

        # First 70% of the shuffled images go to train
        for j in range(int(len(img_list)*0.7)):
            shutil.copy2(img_list[j], newdir_train)

        # Next 10% go to val
        for j in range(int(len(img_list)*0.7), int(len(img_list)*0.8)):
            shutil.copy2(img_list[j], newdir_val)

        # Remaining 20% go to test
        for j in range(int(len(img_list)*0.8), len(img_list)):
            shutil.copy2(img_list[j], newdir_test)

    print('Done splitting dataset.')
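
The split above cuts each shuffled class list at 70% and 80% of its length, truncating both cut points with `int()`. The sketch below isolates that arithmetic in a hypothetical helper, `split_counts`, so the resulting train/val/test sizes can be inspected for any class size.

```python
def split_counts(n_images, train_cut=0.7, val_cut=0.8):
    """Return (train, val, test) sizes using the same int() cut points as the cell above."""
    train_end = int(n_images * train_cut)
    val_end = int(n_images * val_cut)
    return train_end, val_end - train_end, n_images - val_end


print(split_counts(100))  # (70, 10, 20)
print(split_counts(3))    # (2, 0, 1)
```

Because both cut points are truncated, a very small class can end up with no validation images, as the second call shows. This is worth checking on a custom dataset with rare classes.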

In [None]:
# Check the dataset is present
!if [ ! -d $DATA_DIR/split/images_train ]; then echo 'train folder NOT found.'; else echo 'Found train images folder.';fi
!if [ ! -d $DATA_DIR/split/images_val ]; then echo 'val folder NOT found.'; else echo 'Found val images folder.';fi
!if [ ! -d $DATA_DIR/split/images_test ]; then echo 'test folder NOT found.'; else echo 'Found test images folder.';fi

In [None]:
train_dataset_id = subprocess.getoutput("tao-client classification dataset-create --dataset_type image_classification --dataset_format default")
print(train_dataset_id)

In [None]:
! rsync -ah --info=progress2 {DATA_DIR}/split/images_train ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
! echo DONE

In [None]:
eval_dataset_id = subprocess.getoutput("tao-client classification dataset-create --dataset_type image_classification --dataset_format default")
print(eval_dataset_id)

In [None]:
! rsync -ah --info=progress2 {DATA_DIR}/split/images_val ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
! echo DONE

In [None]:
# Creating classmap json
classmap = {}
for idx,folder in enumerate(sorted(os.listdir(os.path.join(DATA_DIR,"split","images_test")))):
    classmap[folder] = idx
with open(os.path.join(DATA_DIR,"split","classmap.json"), "w") as classmap_file:
    json.dump(classmap, classmap_file, indent=2)
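
The class map written above goes from class name to index. To turn a predicted index back into a name, the map can be inverted. The following is a small sketch with a hypothetical helper, `invert_classmap`, and a made-up three-class map.

```python
def invert_classmap(classmap):
    """Return an index -> class name mapping, rejecting duplicate indices."""
    inverse = {index: name for name, index in classmap.items()}
    if len(inverse) != len(classmap):
        raise ValueError("two class names share the same index")
    return inverse


# Made-up map in the same shape as classmap.json.
demo_classmap = {"aeroplane": 0, "bicycle": 1, "bird": 2}
index_to_name = invert_classmap(demo_classmap)
print(index_to_name[1])  # bicycle
```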

In [None]:
infer_dataset_id = subprocess.getoutput("tao-client classification dataset-create --dataset_type image_classification --dataset_format raw")
print(infer_dataset_id)

In [None]:
! rsync -ah --info=progress2 {DATA_DIR}/split/images_test ~/shared/users/{os.environ['USER']}/datasets/{infer_dataset_id}/
! rsync -ah --info=progress2 {DATA_DIR}/split/classmap.json ~/shared/users/{os.environ['USER']}/datasets/{infer_dataset_id}/
! echo DONE

### List datasets <a class="anchor" id="head-5"></a>

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

### Create a model experiment <a class="anchor" id="head-8"></a>

In [None]:
network_arch = "classification"
model_id = subprocess.getoutput(f"tao-client classification model-create --network_arch {network_arch} --encryption_key nvidia_tlt")
print(model_id)

### Find classification pretrained model <a class="anchor" id="head-9"></a>

In [None]:
pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

ptm_id = None
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        metadata = json.load(metadata_file)
    # ngc_path may be missing or null in some metadata files, so fall back to ""
    ngc_path = metadata.get("ngc_path") or ""
    metadata_architecture = metadata.get("network_arch")
    if metadata_architecture == network_arch and "pretrained_classification:resnet18" in ngc_path:
        ptm_id = metadata["id"]
        break

print(ptm_id)
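
The search above matches on two fields of each `metadata.json`: `network_arch` and a substring of `ngc_path`. The sketch below shows the same matching logic on in-memory records, using a hypothetical helper named `select_ptm`. The record IDs and `ngc_path` values are made up; only the field names come from the cell above.

```python
def select_ptm(records, network_arch, ngc_fragment):
    """Return the id of the first record matching the architecture and NGC path fragment."""
    for record in records:
        # Treat a missing or null ngc_path as an empty string.
        ngc_path = record.get("ngc_path") or ""
        if record.get("network_arch") == network_arch and ngc_fragment in ngc_path:
            return record["id"]
    return None


# Made-up records; the first one has no ngc_path at all.
demo_records = [
    {"id": "demo-user-model", "network_arch": "classification"},
    {"id": "demo-ptm-a", "network_arch": "classification",
     "ngc_path": "example/pretrained_classification:resnet50"},
    {"id": "demo-ptm-b", "network_arch": "classification",
     "ngc_path": "example/pretrained_classification:resnet18"},
]
print(select_ptm(demo_records, "classification", "pretrained_classification:resnet18"))
# demo-ptm-b
```

Returning `None` when nothing matches makes it easy to stop before writing an empty `ptm` value into the model metadata.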

### Customize model metadata <a class="anchor" id="head-10"></a>

In [None]:
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["train_datasets"] = [train_dataset_id]
metadata["eval_dataset"] = eval_dataset_id
metadata["inference_dataset"] = infer_dataset_id
metadata["ptm"] = ptm_id

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

### Provide train specs <a class="anchor" id="head-11"></a>

In [None]:
# Default train model specs
! tao-client classification model-train-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/train.json

In [None]:
# Customize train model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'train.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

specs["train_config"]["n_epochs"] = 2

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))
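
The load, modify and save pattern in the cell above is repeated for every spec file in this notebook. The following is a minimal sketch, assuming a hypothetical helper named `update_spec`, that applies the same pattern with dotted keys such as `train_config.n_epochs`. The demo spec is a made-up placeholder, not the real output of `model-train-defaults`.

```python
import json
import os
import tempfile


def update_spec(specs_path, updates):
    """Load a JSON spec, apply dotted-key updates, write it back, return the result."""
    with open(specs_path, "r") as specs_file:
        specs = json.load(specs_file)
    for dotted_key, value in updates.items():
        node = specs
        *parents, leaf = dotted_key.split(".")
        for key in parents:
            # Create intermediate objects if the spec does not have them yet.
            node = node.setdefault(key, {})
        node[leaf] = value
    with open(specs_path, "w") as specs_file:
        json.dump(specs, specs_file, indent=2)
    return specs


# Demo on a throwaway file with placeholder content.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_path = os.path.join(demo_dir, "train.json")
    with open(demo_path, "w") as demo_file:
        json.dump({"train_config": {"n_epochs": 80, "other_setting": 1}}, demo_file)
    updated = update_spec(demo_path, {"train_config.n_epochs": 2})
    with open(demo_path, "r") as demo_file:
        on_disk = json.load(demo_file)

print(updated["train_config"]["n_epochs"])  # 2
```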

### Run train <a class="anchor" id="head-12"></a>

In [None]:
train_job_id = subprocess.getoutput("tao-client classification model-train --id " + model_id)
print(train_job_id)

In [None]:
def my_tail(logs_dir, log_file):
    %env LOG_FILE={logs_dir}/{log_file}
    ! mkdir -p {logs_dir}
    ! [ ! -f "$LOG_FILE" ] && touch $LOG_FILE && chmod 666 $LOG_FILE
    ! tail -f -n +1 $LOG_FILE | while read LINE; do echo "$LINE"; [[ "$LINE" == "EOF" ]] && pkill -P $$ tail; done
    
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'logs')
log_file = f"{train_job_id}.txt"

my_tail(logs_dir, log_file)
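
`my_tail` relies on IPython magics and shell utilities, so it only works inside a notebook. The sketch below is a pure-Python alternative under the same assumption the shell version makes: the job log ends with a line that reads exactly `EOF`. The function name `follow_until_eof` and its parameters are hypothetical. Unlike `my_tail`, it waits for the file instead of creating it, and it does not emit the `EOF` line itself.

```python
import os
import tempfile
import time


def follow_until_eof(log_path, poll_seconds=1.0, timeout_seconds=None):
    """Yield complete lines from log_path as they appear; stop at a line equal to "EOF"."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds

    def check_deadline():
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError(f"no EOF marker seen in {log_path}")

    # The log file may not exist until the backend container starts.
    while not os.path.exists(log_path):
        check_deadline()
        time.sleep(poll_seconds)

    with open(log_path, "r") as log_file:
        pending = ""
        while True:
            chunk = log_file.readline()
            if not chunk:
                check_deadline()
                time.sleep(poll_seconds)
                continue
            pending += chunk
            if not pending.endswith("\n"):
                continue  # partial line; wait for the writer to finish it
            line, pending = pending.rstrip("\n"), ""
            if line == "EOF":
                return
            yield line


# Demo on a throwaway file with made-up log lines.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_log = os.path.join(demo_dir, "job.txt")
    with open(demo_log, "w") as handle:
        handle.write("epoch 1/2\nepoch 2/2\nEOF\n")
    lines = list(follow_until_eof(demo_log, poll_seconds=0.01, timeout_seconds=5))

print(lines)  # ['epoch 1/2', 'epoch 2/2']
```

In the notebook this could be used as `for line in follow_until_eof(os.path.join(logs_dir, log_file)): print(line)`. Passing a timeout avoids blocking the kernel indefinitely if a job never writes its end marker.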

### Provide evaluate specs <a class="anchor" id="head-13"></a>

In [None]:
# Default evaluate model specs
! tao-client classification model-evaluate-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/evaluate.json

In [None]:
# Customize evaluate model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'evaluate.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Change any spec if you wish

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run evaluate <a class="anchor" id="head-14"></a>

In [None]:
eval_job_id = subprocess.getoutput(f"tao-client classification model-evaluate --id {model_id} --job {train_job_id}")
print(eval_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{eval_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide prune specs <a class="anchor" id="head-15"></a>

In [None]:
# Default prune model specs
! tao-client classification model-prune-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/prune.json

### Run prune <a class="anchor" id="head-16"></a>

In [None]:
prune_job_id = subprocess.getoutput(f"tao-client classification model-prune --id {model_id} --job {train_job_id}")
print(prune_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{prune_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide retrain specs <a class="anchor" id="head-17"></a>

In [None]:
# Default retrain model specs
! tao-client classification model-retrain-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/retrain.json

In [None]:
# Customize retrain model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'retrain.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

specs["train_config"]["n_epochs"] = 2

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run retrain <a class="anchor" id="head-18"></a>

In [None]:
retrain_job_id = subprocess.getoutput(f"tao-client classification model-retrain --id {model_id} --job {prune_job_id}")
print(retrain_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{retrain_job_id}.txt"
my_tail(logs_dir, log_file)

### Run evaluate on retrained model <a class="anchor" id="head-18-1"></a>

In [None]:
eval2_job_id = subprocess.getoutput(f"tao-client classification model-evaluate --id {model_id} --job {retrain_job_id}")
print(eval2_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{eval2_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide FP32 export specs <a class="anchor" id="head-19"></a>

In [None]:
# Default export model specs
! tao-client classification model-export-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/export.json

In [None]:
# Customize export model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'export.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

specs["data_type"] = "fp32"

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run FP32 export <a class="anchor" id="head-20"></a>

In [None]:
fp32_export_job_id = subprocess.getoutput(f"tao-client classification model-export --id {model_id} --job {train_job_id}")
print(fp32_export_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{fp32_export_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide Int8 export specs <a class="anchor" id="head-21"></a>

In [None]:
# Default export model specs
! tao-client classification model-export-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/export.json

In [None]:
# Customize export model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'export.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

specs["data_type"] = "int8"
specs["batches"] = 10
specs["batch_size"] = 4

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))
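
Assuming `batches` and `batch_size` describe the INT8 calibration pass (an assumption; check the export documentation for your TAO version), the number of images consumed during calibration is their product. The short sketch below, with a hypothetical helper named `calibration_image_count`, makes that arithmetic explicit and optionally checks it against the number of images available.

```python
def calibration_image_count(batches, batch_size, available_images=None):
    """Return batches * batch_size, optionally checking it against the dataset size."""
    needed = batches * batch_size
    if available_images is not None and needed > available_images:
        raise ValueError(
            f"calibration needs {needed} images but only {available_images} are available"
        )
    return needed


# Values from the cell above: 10 batches of 4 images.
print(calibration_image_count(10, 4))  # 40
```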

### Run Int8 export <a class="anchor" id="head-22"></a>

In [None]:
int8_export_job_id = subprocess.getoutput(f"tao-client classification model-export --id {model_id} --job {train_job_id}")
print(int8_export_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{int8_export_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide model convert specs <a class="anchor" id="head-23"></a>

In [None]:
# Default convert model specs
! tao-client classification model-convert-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/convert.json

In [None]:
# Customize convert model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'convert.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

specs["t"] = "int8"
specs["b"] = 64
specs["m"] = 64
specs["d"] = "3,224,224"
specs["i"] = "nchw"
specs["o"] = "predictions/Softmax"

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run model convert <a class="anchor" id="head-24"></a>

In [None]:
convert_job_id = subprocess.getoutput(f"tao-client classification model-convert --id {model_id} --job {int8_export_job_id}")
print(convert_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{convert_job_id}.txt"
my_tail(logs_dir, log_file)

### Provide TAO inference specs <a class="anchor" id="head-25"></a>

In [None]:
# Default inference model specs
! tao-client classification model-inference-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/inference.json

In [None]:
# Customize TAO inference specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'inference.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Change any spec if you wish

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Run TAO inference <a class="anchor" id="head-26"></a>

In [None]:
tlt_inference_job_id = subprocess.getoutput(f"tao-client classification model-inference --id {model_id} --job {train_job_id}")
print(tlt_inference_job_id)

In [None]:
# Check status (the file won't exist until the backend Toolkit container is running -- can take several minutes)
log_file = f"{tlt_inference_job_id}.txt"
my_tail(logs_dir, log_file)

In [None]:
job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{tlt_inference_job_id}"
# You can find the predicted results here
!ls {job_dir}


### Provide TRT inference specs <a class="anchor" id="head-27"></a>

In [None]:
# Default inference model specs
! tao-client classification model-inference-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/inference.json

In [None]:
# Customize TRT inference specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'inference.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Change any spec if you wish

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

### Delete experiment <a class="anchor" id="head-30"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete datasets <a class="anchor" id="head-31"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{infer_dataset_id}
! echo DONE

### Unmount shared volume <a class="anchor" id="head-32"></a>

In [None]:
command = "umount ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

### Uninstall TAO Remote Client <a class="anchor" id="head-33"></a>

In [None]:
! pip3 uninstall -y nvidia-tao-client