### Notebook to demonstrate TAO-Remote Client AutoML workflow for License Plate Character Recognition

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. The Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)


### Learning Objective

This notebook uses AutoML to identify the optimal hyperparameters (e.g., learning rate, batch size, weight regularizer, number of layers) that improve accuracy or speed up convergence of AI models for a license plate recognition application.
- Take a pretrained model and choose the AutoML algorithm and parameters to start AutoML training.
- At the end of an AutoML run, you receive a config file that specifies the best-performing model, along with the binary model file to deploy in your application.


### The workflow in a nutshell

- Create train and eval datasets
- Upload datasets to the service
- Set AutoML algorithm configurations
- Override train config defaults
  - Add/Remove AutoML parameters
- Run AutoML


### AutoML Workflow

The user starts by selecting a model topology, then creates and uploads the datasets, configures parameters, trains with AutoML, and finally compares the resulting models.

![image](https://raw.githubusercontent.com/vpraveen-nv/model_card_images/main/api/automl_workflow.png)


### Table of contents

1. [Install TAO remote client](#head-1)
1. [Set the remote service base URL](#head-2)
1. [Access the shared volume](#head-3)
1. [Create the datasets](#head-4)
1. [List datasets](#head-5)
1. [Create a model experiment](#head-6)
1. [Find pretrained model](#head-7)
1. [Set AutoML related configurations](#head-8)
1. [Provide train specs](#head-9)
1. [Run AutoML train](#head-10)
1. [Get the best model from AutoML](#head-11)
1. [Delete experiment](#head-12)
1. [Delete datasets](#head-13)
1. [Unmount shared volume](#head-14)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

In [None]:
import os
import glob
import subprocess
import getpass
import uuid
import json

### FIXME

1. Choose between the default or a custom dataset in FIXME 1
2. Assign the ip_address and port_number in FIXME 2 and FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
3. Set the NGC API key in FIXME 4
4. Assign the path of the data directory in FIXME 5
5. Choose between the Bayesian and HyperBand automl_algorithm in FIXME 6

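Before running the remaining cells, it can help to confirm that no FIXME value still holds its `<placeholder>` text. The helper below is a minimal sketch and not part of TAO; `unfilled_placeholders` and the sample values passed to it are hypothetical.

```python
import re

# Matches placeholder values such as "<ip_address>" or "<ngc_api_key>".
PLACEHOLDER = re.compile(r"^<[^<>]+>$")


def unfilled_placeholders(settings: dict) -> list:
    """Return the names of settings whose string value still looks like a <placeholder>."""
    return [
        name
        for name, value in settings.items()
        if isinstance(value, str) and PLACEHOLDER.match(value.strip())
    ]


# Hypothetical usage with the FIXME variables from this notebook.
missing = unfilled_placeholders({
    "node_addr": "10.137.149.22",
    "node_port": "<port_number>",
    "ngc_api_key": "<ngc_api_key>",
    "DATA_DIR": "lprnet_data",
})
print(missing)  # ['node_port', 'ngc_api_key']
```
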
In [None]:
namespace = 'default'

In [None]:
# more information about lprnet can be found in https://docs.nvidia.com/tao/tao-toolkit/text/character_recognition/lprnet.html
model_name = "lprnet"
dataset_to_be_used = "default" # FIXME1 #default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset

### Install TAO remote client <a class="anchor" id="head-1"></a>

In [None]:
# SKIP this step IF you have already installed the TAO-Client wheel.
! pip3 install nvidia-tao-client

In [None]:
# View the version of the TAO-Client
! tao-client --version

### Set the remote service base URL <a class="anchor" id="head-2"></a>

In [None]:
# Define the node_addr and port number
node_addr = "<ip_address>" # FIXME2 example: 10.137.149.22
node_port = "<port_number>" # FIXME3 example: 32334
# On the host machine, the node ip_address and port number can be obtained as follows:
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
%env BASE_URL=http://{node_addr}:{node_port}/{namespace}/api/v1

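The `%env` cell above composes `BASE_URL` from three pieces. As a hedged sketch, the same composition can be done in plain Python with a little validation, so that an unreplaced `<port_number>` fails loudly instead of producing a malformed URL. `build_base_url` is a hypothetical helper, and it assumes an IPv4 address or hostname (an IPv6 address would need square brackets).

```python
from urllib.parse import urlsplit


def build_base_url(node_addr: str, node_port: str, namespace: str = "default") -> str:
    """Compose the TAO API base URL in the same shape as the %env cell above."""
    port = int(node_port)  # raises ValueError if the FIXME placeholder was not replaced
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    url = f"http://{node_addr}:{port}/{namespace}/api/v1"
    # Sanity check: the URL must parse back to the same host and port.
    parts = urlsplit(url)
    if parts.hostname != node_addr.lower() or parts.port != port:
        raise ValueError(f"could not build a valid URL from {node_addr!r}:{node_port!r}")
    return url


print(build_base_url("10.137.149.22", "32334"))
# http://10.137.149.22:32334/default/api/v1
```
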
In [None]:
# FIXME: Set the ngc_api_key variable
ngc_api_key = "<ngc_api_key>" # FIXME4 example: zZYtczM5amdtdDcwNjk0cnA2bGU2bXQ3bnQ6NmQ4NjNhMDItMTdmZS00Y2QxLWI2ZjktNmE5M2YxZTc0OGyM

# Exchange NGC_API_KEY for JWT
identity = json.loads(subprocess.getoutput(f'tao-client login --ngc-api-key {ngc_api_key}'))

%env USER={identity['user_id']}
%env TOKEN={identity['token']}

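`subprocess.getoutput()` returns stdout and stderr merged, so if `tao-client login` fails, the `json.loads` call above surfaces only a confusing `JSONDecodeError`. A minimal sketch of a more defensive parser follows; `parse_login_output` is hypothetical, and it relies only on the `user_id` and `token` keys that the cell above already reads.

```python
import json


def parse_login_output(raw: str) -> tuple:
    """Extract (user_id, token) from the JSON text printed by `tao-client login`."""
    try:
        identity = json.loads(raw)
    except json.JSONDecodeError as err:
        # Show the start of whatever the CLI printed instead (often an error message).
        raise RuntimeError(f"tao-client login did not return JSON: {raw[:200]!r}") from err
    if not isinstance(identity, dict):
        raise RuntimeError(f"unexpected login response: {identity!r}")
    missing = [key for key in ("user_id", "token") if key not in identity]
    if missing:
        raise RuntimeError(f"login response is missing {missing}")
    return identity["user_id"], identity["token"]


# Sample input shaped like the response the cell above expects (values are made up).
user_id, token = parse_login_output('{"user_id": "abc-123", "token": "eyJhbGciOi"}')
print(user_id)  # abc-123
```
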
### Access the shared volume <a class="anchor" id="head-3"></a>

In [None]:
# Get PVC ID
pvc_id = subprocess.getoutput(f'kubectl get pvc tao-toolkit-api-pvc -n {namespace} -o jsonpath="{{.spec.volumeName}}"')
print(pvc_id)

In [None]:
# Get NFS server info
provisioner = json.loads(subprocess.getoutput('helm get values nfs-subdir-external-provisioner -o json'))
nfs_server = provisioner['nfs']['server']
nfs_path = provisioner['nfs']['path']
print(nfs_server, nfs_path)

In [None]:
user = getpass.getuser()
home = os.path.expanduser('~')

! echo "Password for {user}"
password = getpass.getpass()

In [None]:
# Mount shared volume 
! mkdir -p ~/shared

command = "apt-get -y install nfs-common >> /dev/null"
! echo {password} | sudo -S -k {command}

command = f"mount -t nfs {nfs_server}:{nfs_path}/{namespace}-tao-toolkit-api-pvc-{pvc_id} ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

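The mount command above points at a subdirectory of the NFS export whose name combines the namespace, the PVC name, and the volume name. The sketch below builds that same source string in Python and rejects a PVC id that looks like `kubectl` error text. `pvc_export_path` is a hypothetical helper, and the naming pattern is taken from the mount cell above rather than from the provisioner's documentation.

```python
import posixpath


def pvc_export_path(nfs_server: str, nfs_path: str, namespace: str, pvc_id: str) -> str:
    """Build the NFS source string passed to `mount -t nfs` in the cell above.

    Assumes the directory naming used there:
    <nfs_path>/<namespace>-tao-toolkit-api-pvc-<pvc_id>
    """
    if not pvc_id or any(ch.isspace() for ch in pvc_id):
        # kubectl errors come back through subprocess.getoutput as multi-word text.
        raise ValueError(f"unexpected PVC id: {pvc_id!r}")
    subdir = f"{namespace}-tao-toolkit-api-pvc-{pvc_id}"
    return f"{nfs_server}:{posixpath.join(nfs_path, subdir)}"


# Illustrative values only.
print(pvc_export_path("10.0.0.5", "/mnt/nfs_share", "default", "pvc-1234"))
# 10.0.0.5:/mnt/nfs_share/default-tao-toolkit-api-pvc-pvc-1234
```
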
### Create the datasets <a class="anchor" id="head-4"></a>

We will be using the `OpenALPR benchmark dataset` for the tutorial. The following script will download the dataset automatically and convert it to the format used by TAO.

**If you use a custom dataset, it must follow this structure:**
```
DATA_DIR
├── train
│   ├── characters.txt
│   ├── image
│   │   ├── image_name_1.jpg
│   │   ├── image_name_2.jpg
│   │   ├── ...
│   └── label
│       ├── image_name_1.txt
│       ├── image_name_2.txt
│       ├── ...
└── val
    ├── characters.txt
    ├── image
    │   ├── image_name_11.jpg
    │   ├── image_name_12.jpg
    │   ├── ...
    └── label
        ├── image_name_11.txt
        ├── image_name_12.txt
        ├── ...
```
Each image and its label file must share the same base file name (for example, `image_name_1.jpg` and `image_name_1.txt`).

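For a custom dataset, a quick local check of the layout described above can catch missing labels before the upload. This is a minimal sketch, not a TAO tool; `check_split` is a hypothetical helper that only compares base file names and checks that the expected files and folders exist.

```python
from pathlib import Path


def check_split(split_dir) -> list:
    """Return a list of problems found in one split (train/ or val/) of DATA_DIR.

    Checks for characters.txt, image/ and label/, and one label .txt per image
    sharing the same base file name.
    """
    split_dir = Path(split_dir)
    problems = []
    if not (split_dir / "characters.txt").is_file():
        problems.append("missing characters.txt")
    image_dir, label_dir = split_dir / "image", split_dir / "label"
    if not image_dir.is_dir() or not label_dir.is_dir():
        return problems + ["missing image/ or label/ folder"]
    images = {p.stem for p in image_dir.iterdir() if p.is_file()}
    labels = {p.stem for p in label_dir.glob("*.txt")}
    problems += [f"no label for image {stem}" for stem in sorted(images - labels)]
    problems += [f"no image for label {stem}" for stem in sorted(labels - images)]
    return problems


# Hypothetical usage, once DATA_DIR is set in the next cell:
# for split in ("train", "val"):
#     print(split, check_split(os.path.join(DATA_DIR, split)) or "OK")
```
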
In [None]:
DATA_DIR = "lprnet_data" # FIXME5
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR

In [None]:
if dataset_to_be_used == "default":
    !python3 -m pip install --upgrade pip
    !python3 -m pip install "opencv-python>=3.4.0.12,<=4.5.5.64"
    !bash ../dataset_prepare/lprnet/download_and_prepare_data.sh $DATA_DIR

In [None]:
country = "us" # us/ch; us for United States, ch for China

In [None]:
if dataset_to_be_used == "default":
    character_file_link = "https://api.ngc.nvidia.com/v2/models/nvidia/tao/lprnet/versions/trainable_v1.0/files/{}_lp_characters.txt".format(country)
    !wget -q -O $DATA_DIR/train/characters.txt $character_file_link
    !cp $DATA_DIR/train/characters.txt $DATA_DIR/val/characters.txt 

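Because the character list is downloaded per country, it may be worth confirming that every plate label only uses characters from `characters.txt`. The sketch below is hypothetical and rests on two assumptions not stated in this notebook: `characters.txt` holds one character per line, and each label file holds the plate text on its first line.

```python
from pathlib import Path


def unknown_characters(split_dir) -> dict:
    """Map label file name -> sorted characters not listed in characters.txt.

    Assumptions: characters.txt has one character per line, and each label
    file stores the plate text on its first line.
    """
    split_dir = Path(split_dir)
    lines = (split_dir / "characters.txt").read_text().splitlines()
    allowed = {line.strip() for line in lines if line.strip()}
    report = {}
    for label in sorted((split_dir / "label").glob("*.txt")):
        text = label.read_text().splitlines()
        plate = text[0].strip() if text else ""
        bad = sorted(set(plate) - allowed)
        if bad:
            report[label.name] = bad
    return report


# Hypothetical usage:
# print(unknown_characters(os.path.join(DATA_DIR, "train")))
```
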
In [None]:
if model_name == "lprnet":
    ds_format = model_name

In [None]:
train_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type character_recognition --dataset_format {ds_format}")
print(train_dataset_id)

In [None]:
if model_name == "lprnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/train/image ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/train/label ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/train/characters.txt ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
! echo DONE

In [None]:
eval_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type character_recognition --dataset_format {ds_format}")
print(eval_dataset_id)

In [None]:
if model_name == "lprnet":
    ! rsync -ah --info=progress2 {DATA_DIR}/val/image ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/val/label ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
    ! rsync -ah --info=progress2 {DATA_DIR}/val/characters.txt ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
! echo DONE

### List datasets <a class="anchor" id="head-5"></a>

In [None]:
pattern = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets', '*', 'metadata.json')

datasets = []
for metadata_path in glob.glob(pattern):
    with open(metadata_path, 'r') as metadata_file:
        datasets.append(json.load(metadata_file))

print(json.dumps(datasets, indent=2))

### Create a model experiment <a class="anchor" id="head-6"></a>

In [None]:
network_arch = model_name.replace("-","_")
if network_arch == "lprnet":
    encode_key = "nvidia_tlt"
model_id = subprocess.getoutput(f"tao-client {model_name} model-create --network_arch {network_arch} --encryption_key {encode_key} ")
print(model_id)

### Assign train and eval datasets

In [None]:
metadata_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'metadata.json')

with open(metadata_path , "r") as metadata_file:
    metadata = json.load(metadata_file)

metadata["train_datasets"] = [train_dataset_id]
metadata["eval_dataset"] = eval_dataset_id

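Note that the cell above only updates the in-memory `metadata` dict; it is written back to `metadata.json` later, in the AutoML configuration cell. Since `dataset-create` output comes from `subprocess.getoutput`, an error message could end up stored as a dataset id. The sketch below, a hypothetical `with_datasets` helper, checks that each id names an existing folder under the shared datasets directory before assigning it, and returns a copy instead of mutating the input.

```python
import copy
import os


def with_datasets(metadata: dict, train_ids: list, eval_id: str, datasets_root: str) -> dict:
    """Return a copy of model metadata with the train/eval dataset ids assigned.

    Each id must be a single token and name an existing folder under datasets_root.
    """
    for dataset_id in [*train_ids, eval_id]:
        if not dataset_id or any(ch.isspace() for ch in dataset_id):
            raise ValueError(f"suspicious dataset id (CLI error text?): {dataset_id!r}")
        if not os.path.isdir(os.path.join(datasets_root, dataset_id)):
            raise FileNotFoundError(f"no dataset folder for {dataset_id!r}")
    updated = copy.deepcopy(metadata)
    updated["train_datasets"] = list(train_ids)
    updated["eval_dataset"] = eval_id
    return updated


# Hypothetical usage:
# root = os.path.join(home, 'shared', 'users', os.environ['USER'], 'datasets')
# metadata = with_datasets(metadata, [train_dataset_id], eval_dataset_id, root)
```
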
### Find pretrained model <a class="anchor" id="head-7"></a>

In [None]:
pattern = os.path.join(home, 'shared', 'users', '*', 'models', '*', 'metadata.json')

ptm_id = None
for ptm_metadata_path in glob.glob(pattern):
    with open(ptm_metadata_path, 'r') as metadata_file:
        ptm_metadata = json.load(metadata_file)
        ngc_path = ptm_metadata.get("ngc_path")
        ptm_country_info = ptm_metadata.get("additional_id_info")
        metadata_network_arch = ptm_metadata.get("network_arch")
        if metadata_network_arch == network_arch and ptm_country_info == country:
            ptm_id = ptm_metadata["id"]
            break

metadata["ptm"] = ptm_id
print(ptm_id)

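The search above can be expressed as a pure function over already-loaded metadata records, which makes its behavior easy to check: the first matching record wins (and `glob` returns paths in arbitrary order), and if nothing matches the result is `None`, which the cell above would store in `metadata["ptm"]` without complaint. `find_ptm_id` and the sample records are hypothetical; the keys are the ones the cell above reads.

```python
def find_ptm_id(metadata_records, network_arch, country):
    """Return the id of the first record matching network_arch and country, else None.

    Uses the same keys as the cell above: network_arch, additional_id_info, id.
    """
    for record in metadata_records:
        if (record.get("network_arch") == network_arch
                and record.get("additional_id_info") == country):
            return record["id"]
    return None


# Made-up records for illustration.
records = [
    {"id": "ptm-ch", "network_arch": "lprnet", "additional_id_info": "ch"},
    {"id": "ptm-us", "network_arch": "lprnet", "additional_id_info": "us"},
    {"id": "other", "network_arch": "detectnet_v2", "additional_id_info": None},
]
print(find_ptm_id(records, "lprnet", "us"))  # ptm-us
```

Raising an error when the result is `None` is usually preferable to training without a pretrained model by accident.
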
### View hyperparameters that are enabled for AutoML by default

In [None]:
# View default automl specs enabled
! tao-client {model_name} model-automl-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/automl_defaults.json

### Set AutoML related configurations <a class="anchor" id="head-8"></a>
Refer to the [LPRNet](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_action_specs.html#id21) parameter list to see the parameters LPRNet supports, and add parameters if necessary in addition to those enabled for AutoML by default.


In [None]:
# Choose the AutoML algorithm: "Bayesian" or "HyperBand".
automl_algorithm = "Bayesian" # FIXME6 example: Bayesian/HyperBand

metric = "kpi" # Don't change this; multiple metrics will be supported in the future
additional_automl_parameters = [] # Add any parameters from the LPRNet list linked above, in addition to the ones enabled by default
remove_default_automl_parameters = [] # Remove any hyperparameters that are enabled for AutoML by default

metadata["automl_algorithm"] = automl_algorithm
metadata["automl_enabled"] = True
metadata["metric"] = metric
metadata["automl_add_hyperparameters"] = str(additional_automl_parameters)
metadata["automl_remove_hyperparameters"] = str(remove_default_automl_parameters)

with open(metadata_path, "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)

print(json.dumps(metadata, indent=2))

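The AutoML entries written above can be gathered into one function that also rejects an unsupported algorithm name or a parameter that is both added and removed. Note that the cell above stores the add/remove lists as `str(list)`, which is Python's repr (for example `['a']`, with single quotes), not JSON; the sketch keeps that behavior unchanged. `automl_fields` and `hypothetical_param` are illustrative names only.

```python
def automl_fields(algorithm, add=(), remove=(), metric="kpi"):
    """Build the AutoML entries that the cell above writes into metadata.json."""
    if algorithm not in ("Bayesian", "HyperBand"):
        raise ValueError(f"automl_algorithm must be 'Bayesian' or 'HyperBand', got {algorithm!r}")
    overlap = set(add) & set(remove)
    if overlap:
        raise ValueError(f"parameters both added and removed: {sorted(overlap)}")
    return {
        "automl_algorithm": algorithm,
        "automl_enabled": True,
        "metric": metric,
        # Stored as str(list), exactly as in the cell above.
        "automl_add_hyperparameters": str(list(add)),
        "automl_remove_hyperparameters": str(list(remove)),
    }


fields = automl_fields("Bayesian", add=["hypothetical_param"])
print(fields["automl_add_hyperparameters"])  # ['hypothetical_param']
```

Usage would be `metadata.update(automl_fields(automl_algorithm, additional_automl_parameters, remove_default_automl_parameters))` before writing the file.
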
### Provide train specs <a class="anchor" id="head-9"></a>

In [None]:
# Default train model specs
! tao-client {model_name} model-train-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/train.json

In [None]:
# Customize train model specs
specs_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, 'specs', 'train.json')

with open(specs_path , "r") as specs_file:
    specs = json.load(specs_file)

# Apply changes for any of the parameters listed in the previous cell as required
# Example for lprnet (for each network the parameter key might be different)
specs["training_config"]["num_epochs"] = 24

with open(specs_path, "w") as specs_file:
    json.dump(specs, specs_file, indent=2)

print(json.dumps(specs, indent=2))

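Editing `specs["training_config"]["num_epochs"]` directly works, but a typo in a key silently creates a new entry that the trainer may ignore. A hedged alternative is a small setter that takes a dotted path and refuses unknown keys. `set_spec` is a hypothetical helper, and the starting value `120` below is made up for illustration.

```python
def set_spec(specs: dict, dotted_key: str, value) -> None:
    """Set a nested value such as "training_config.num_epochs" in a spec dict.

    Raises KeyError instead of creating a new key, so a typo in the parameter
    path is caught before the specs are written back to train.json.
    """
    *parents, leaf = dotted_key.split(".")
    node = specs
    for key in parents:
        node = node[key]  # KeyError if an intermediate section is missing
    if leaf not in node:
        raise KeyError(f"{dotted_key!r} not found in specs")
    node[leaf] = value


specs_example = {"training_config": {"num_epochs": 120}}
set_spec(specs_example, "training_config.num_epochs", 24)
print(specs_example["training_config"]["num_epochs"])  # 24
```
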
### Run AutoML train <a class="anchor" id="head-10"></a>

In [None]:
train_job_id = subprocess.getoutput(f"tao-client {model_name} model-train --id " + model_id)
print(train_job_id)

In [None]:
# Utility function to stream a log file; used by the next cell
def my_tail(logs_dir, log_file):
    %env LOG_FILE={logs_dir}/{log_file}
    ! mkdir -p {logs_dir}
    ! [ ! -f "$LOG_FILE" ] && touch $LOG_FILE && chmod 666 $LOG_FILE
    ! tail -f -n +1 $LOG_FILE | while read LINE; do echo "$LINE"; [[ "$LINE" == "EOF" ]] && pkill -P $$ tail; done

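`my_tail` relies on a shell pipeline (`tail -f`, a `while read` loop with `[[ ]]`, and `pkill`) to stop at the `EOF` marker. As a sketch under the assumption that the log ends with a newline-terminated `EOF` line, the same idea can be written in plain Python. `follow_until_eof` is hypothetical and not part of the TAO client.

```python
import time


def follow_until_eof(path, sentinel="EOF", poll_seconds=1.0):
    """Yield complete lines from a growing log file until a line equal to sentinel."""
    buffer = ""
    with open(path, "r") as log:
        while True:
            chunk = log.readline()
            if not chunk:
                time.sleep(poll_seconds)  # no new data yet; wait for the writer
                continue
            buffer += chunk
            if not buffer.endswith("\n"):
                continue  # partial line; read the rest on the next pass
            line, buffer = buffer.rstrip("\n"), ""
            if line == sentinel:
                return
            yield line


# Hypothetical usage inside a notebook cell:
# for line in follow_until_eof(os.path.join(logs_dir, log_file)):
#     print(line)
```
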
In [None]:
# Set poll_automl_stats to True to see AutoML progress stats, such as the estimated time left and the number of epochs remaining.
# Set poll_automl_stats to False to skip the stats and view the training logs instead. Viewing training logs is supported only for Bayesian.

# Training times for different models benchmarked on 1 GPU V100 machine can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

poll_automl_stats = True
if poll_automl_stats:
    import time
    from IPython.display import clear_output
    stats_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, train_job_id, "automl_metadata.json")
    controller_json_path = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id, train_job_id, "controller.json")
    while True:
        time.sleep(15)
        clear_output(wait=True)
        if os.path.exists(stats_path):
            try:
                with open(stats_path , "r") as stats_file:
                    stats_dict = json.load(stats_file)
                print(json.dumps(stats_dict, indent=2))
                if float(stats_dict["Number of epochs yet to start"]) == 0.0:
                    break
            except (json.JSONDecodeError):
                print("Stats computed are being written to file. Stats will be visible on screen in a few seconds")
else:
    # Print the log file - supported only for Bayesian (the file won't exist until the backend Toolkit container is running -- can take several minutes)
    if automl_algorithm == "Bayesian":
        import time
        logs_dir = os.path.join(home, 'shared', 'users', os.environ['USER'], 'models', model_id)
        max_recommendations = metadata.get("automl_max_recommendations", 20)
        for experiment_num in range(max_recommendations):
            log_file = f"{train_job_id}/experiment_{experiment_num}/log.txt"
            # Wait for this experiment's log file to appear, sleeping between checks instead of busy-looping
            while not os.path.exists(os.path.join(logs_dir, log_file)):
                time.sleep(15)
            print(f"\n\nViewing experiment {experiment_num}\n\n")
            my_tail(logs_dir, log_file)

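The wait loops in the cell above have no upper bound, so if an experiment never starts the cell keeps waiting indefinitely. A minimal sketch of a variant that sleeps between checks and gives up after a timeout follows; `wait_for_file` and its default timeout values are hypothetical choices, not TAO settings.

```python
import os
import time


def wait_for_file(path, timeout_seconds=3600.0, poll_seconds=10.0):
    """Block until path exists, sleeping between checks; return False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while not os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


# Hypothetical usage inside the Bayesian log-viewing loop:
# if not wait_for_file(os.path.join(logs_dir, log_file), timeout_seconds=1800):
#     print(f"Experiment {experiment_num} has not started after 30 minutes")
#     break
```

Using `time.monotonic()` keeps the deadline correct even if the system clock is adjusted while waiting.
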
### Get the best model from AutoML <a class="anchor" id="head-11"></a>

In [None]:
# The config and the weights of the best configuration are present in the best_model folder
# It takes a few seconds to copy the original AutoML experiment to the best_model folder
!python3 -m pip install pandas
import time
import pandas as pd

automl_job_dir = f"{home}/shared/users/{os.environ['USER']}/models/{model_id}/{train_job_id}"
best_model_path = f"{automl_job_dir}/best_model"

while True:
    if os.path.exists(best_model_path) and len(os.listdir(best_model_path)) > 0 and os.path.exists(f"{best_model_path}/controller.json"):
        # List the binary model file
        print("\nCheckpoints for the best performing experiment")
        if os.path.exists(best_model_path+"/weights") and len(os.listdir(best_model_path+"/weights")) > 0:
            print(f"Folder: {best_model_path}/weights")
            print("Files:", os.listdir(best_model_path+"/weights"))
        else:
            print(f"Folder: {best_model_path}")
            print("Files:", os.listdir(best_model_path))

        with open(f"{best_model_path}/controller.json", "r") as controller_file:
            experiment_artifacts = json.load(controller_file)
        data_frame = pd.DataFrame(experiment_artifacts)
        # Print experiment id/number and the corresponding result
        print("\nResults of all experiments")
        with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
            print(data_frame[["id","result"]])

        print("\nConfig/Spec file for the best performing experiment (recommendation_id.kitti with the maximum result value in the dataframe)")
        # List the recommendation config file of the best performing checkpoint (recommendation_id.kitti with the maximum result value in the dataframe)
        !ls {best_model_path}/*.kitti

        break
    # best_model is not ready yet; wait before checking again instead of busy-looping
    time.sleep(15)

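The cell above prints every experiment's `result` and leaves it to the reader to spot the maximum. A hedged sketch of picking the best experiment programmatically from the `controller.json` records follows; it assumes, as the cell does, that each record has `id` and `result` keys and that a larger result is better. Records with a missing or non-numeric result are skipped. `best_experiment` and the sample records are hypothetical.

```python
def best_experiment(controller_records, higher_is_better=True):
    """Return (id, result) of the best experiment in controller.json-style records."""
    scored = []
    for record in controller_records:
        try:
            scored.append((record["id"], float(record["result"])))
        except (KeyError, TypeError, ValueError):
            continue  # experiment without a usable result (e.g. still running)
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda pair: pair[1])


# Made-up records for illustration.
sample_records = [
    {"id": 0, "result": 0.81},
    {"id": 1, "result": 0.87},
    {"id": 2, "result": None},
]
print(best_experiment(sample_records))  # (1, 0.87)
```

In the notebook, this could be applied to `experiment_artifacts` after it is loaded from `controller.json`.
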
### Delete experiment <a class="anchor" id="head-12"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/models/{model_id}
! echo DONE

### Delete datasets <a class="anchor" id="head-13"></a>

In [None]:
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}
! rm -rf ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}
! echo DONE

### Unmount shared volume <a class="anchor" id="head-14"></a>

In [None]:
command = "umount ~/shared"
! echo {password} | sudo -S -k {command} && echo DONE

### Uninstall TAO Remote Client <a class="anchor" id="head-32"></a>

In [None]:
! pip3 uninstall -y nvidia-tao-client