### Notebook to demonstrate AutoML workflow for TAO Segmentation models

Transfer learning is the process of transferring learned features from one application to another. It is a commonly used training technique in which a model trained on one task is retrained for a different task. Train Adapt Optimize (TAO) Toolkit is a simple, easy-to-use, Python-based AI toolkit for taking purpose-built AI models and customizing them with your own data.

![image](https://developer.nvidia.com/sites/default/files/akamai/TAO/tlt-tao-toolkit-bring-your-own-model-diagram.png)

### Learning Objective

This notebook uses AutoML to identify the optimal hyperparameters (e.g., learning rate, batch size, weight regularizer, number of layers) for segmentation models, in order to obtain better accuracy or faster convergence.
- Take a pretrained model and choose an AutoML algorithm and parameters to start an AutoML train job.
- At the end of an AutoML run, you receive a config file that specifies the best-performing model, along with the binary model file to deploy in your application.


### AutoML Workflow

The user starts by selecting a model topology, then creates and uploads datasets, configures parameters, trains with AutoML, and finally compares the resulting models.

![image](https://raw.githubusercontent.com/vpraveen-nv/model_card_images/main/api/automl_workflow.png)


### Table of contents

1. [Create and upload datasets](#head-1)
1. [List the created datasets](#head-2)
1. [Dataset convert Action](#head-3)
1. [Create model](#head-4)
1. [List models](#head-5)
1. [Assign train, eval datasets](#head-6)
1. [Assign PTM](#head-7)
1. [Set AutoML related configurations](#head-8)
1. [Actions](#head-9)
1. [AutoML Train](#head-10)

### Requirements
Please find the server requirements [here](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_setup.html#)

### FIXME

1. Assign a model_name in FIXME 1
2. Assign a workdir in FIXME 2
3. Assign the ip_address and port_number in FIXME 3 ([info](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_rest_api.html))
4. Assign the ngc_api_key variable in FIXME 4
5. Choose between default and custom dataset in FIXME 5
6. Assign path of data_dir in FIXME 6
7. Choose between Bayesian and HyperBand automl_algorithm in FIXME 7
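
Before running the cells below, it can help to confirm that every FIXME placeholder was replaced. The following is a minimal sketch: `find_unset_placeholders` is a hypothetical helper, not part of the TAO API, and it only assumes that unset values are empty or still contain the angle-bracket placeholders used in this notebook.

```python
def find_unset_placeholders(settings):
    """Return the names of settings that are empty or still hold a <placeholder>."""
    unset = []
    for name, value in settings.items():
        if not value or ("<" in value and ">" in value):
            unset.append(name)
    return unset


# Example with the notebook's unedited default values
example_settings = {
    "model_name": "mask_rcnn",
    "workdir": "workdir_segmentation",
    "host_url": "http://<ip_address>:<port_number>",
    "ngc_api_key": "<ngc_api_key>",
}
print(find_unset_placeholders(example_settings))
```

An empty list means all of the checked values were filled in.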

In [None]:
import json
import os
import requests
import uuid
import time
from IPython.display import clear_output

In [None]:
# Define model_name, workspace, and other variables
# Available models (#FIXME 1):
# 1. mask_rcnn - https://docs.nvidia.com/tao/tao-toolkit/text/instance_segmentation/mask_rcnn.html
# 2. unet - https://docs.nvidia.com/tao/tao-toolkit/text/semantic_segmentation/unet.html

model_name = "mask_rcnn" # FIXME1 (Add the model name from the above mentioned list)
workdir = "workdir_segmentation" # FIXME2
host_url = "http://<ip_address>:<port_number>" # FIXME3 example: https://10.137.149.22:32334
# On the host machine, the node ip_address and port_number can be obtained as follows:
# ip_address: hostname -i
# port_number: kubectl get service ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}'
ngc_api_key = "<ngc_api_key>" # FIXME4 example: zZYtczM5amdtdDcwNjk0cnA2bGU2bXQ3bnQ6NmQ4NjNhMDItMTdmZS00Y2QxLWI2ZjktNmE5M2YxZTc0OGyM
dataset_to_be_used = "default" # FIXME5 #default/custom; default for the dataset used in this tutorial notebook; custom for a different dataset

In [None]:
# Exchange NGC_API_KEY for JWT
response = requests.get(f"{host_url}/api/v1/login/{ngc_api_key}")
user_id = response.json()["user_id"]
print("User ID",user_id)
token = response.json()["token"]
print("JWT",token)

# Set base URL
base_url = f"{host_url}/api/v1/user/{user_id}"
print("API Calls will be forwarded to",base_url)

headers = {"Authorization": f"Bearer {token}"}

In [None]:
# Creating workdir
if not os.path.isdir(workdir):
    os.makedirs(workdir)

### Create datasets <a class="anchor" id="head-1"></a>

**Instance Segmentation:**
We will be using the `COCO dataset` for instance segmentation with MaskRCNN. The `download_coco.sh` script from the `dataset_prepare` folder is used to download and unzip the coco2017 dataset from [here](https://cocodataset.org/#download)


**If using a custom dataset, it should follow this structure:**
```
DATA_DIR
├── annotations.json
├── images
    ├── image_name_1.jpg
    ├── image_name_2.jpg
    ├── ...

```
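
A custom instance segmentation dataset can be checked against this layout before it is archived and uploaded. The sketch below is illustrative only: `check_coco_layout` is a hypothetical helper, and it relies on the standard COCO convention that the annotation file lists each image under an `images` key with a `file_name` field.

```python
import json
import os


def check_coco_layout(data_dir):
    """Return a list of problems found in a COCO-style dataset folder."""
    problems = []
    annotations_path = os.path.join(data_dir, "annotations.json")
    images_dir = os.path.join(data_dir, "images")
    if not os.path.isfile(annotations_path):
        problems.append("annotations.json not found")
    if not os.path.isdir(images_dir):
        problems.append("images folder not found")
    if problems:
        return problems
    with open(annotations_path, "r") as f:
        annotations = json.load(f)
    # Every image listed in the annotation file should exist in images/
    for image in annotations.get("images", []):
        if not os.path.isfile(os.path.join(images_dir, image["file_name"])):
            problems.append(f"missing image: {image['file_name']}")
    return problems
```

An empty return value means the folder has the expected files and every annotated image is present.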

**Semantic Segmentation:**
We will be using the `ISBI Challenge: Segmentation of neuronal structures in EM stacks dataset` for the binary segmentation tutorial. Please access the open source repo [here](https://github.com/alexklibisz/isbi-2012/tree/master/data) to download the data. The data is in .tif format. Copy the train-labels.tif, train-volume.tif, test-volume.tif files to `DATA_DIR`.

**If using a custom dataset, it should follow this structure:**
```
DATA_DIR
├── images
│   ├── test
│   │   ├── image_0.png
│   │   ├── image_1.png
│   │   ├── ...
│   ├── train
│   │   ├── image_2.png
│   │   ├── image_3.png
│   │   ├── ...
│   └── val
│       ├── image_4.png
│       ├── image_5.png
│       ├── ...
└── masks
    ├── train
    │   ├── image_2.png
    │   ├── image_3.png
    │   ├── ...
    └── val
        ├── image_4.png
        ├── image_5.png
        ├── ...

```
The filenames of images and their corresponding masks must match.
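
The filename requirement can be checked before creating the archive. This is a minimal sketch, assuming the folder layout shown above; `find_unmatched_files` is a hypothetical helper, and it only inspects the `train` and `val` splits because the `test` split has no masks.

```python
import os


def find_unmatched_files(data_dir, splits=("train", "val")):
    """Map each split to the filenames present in only one of images/ or masks/."""
    unmatched = {}
    for split in splits:
        images = set(os.listdir(os.path.join(data_dir, "images", split)))
        masks = set(os.listdir(os.path.join(data_dir, "masks", split)))
        # Symmetric difference: names that lack a counterpart on the other side
        difference = sorted(images ^ masks)
        if difference:
            unmatched[split] = difference
    return unmatched
```

An empty dictionary means every image has a mask with the same filename, and vice versa.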

In [None]:
DATA_DIR = model_name # FIXME6
os.environ['DATA_DIR']= DATA_DIR
!mkdir -p $DATA_DIR

### Download dataset

In [None]:
if model_name == "mask_rcnn" and dataset_to_be_used == "default":
    !bash ../dataset_prepare/coco/download_coco.sh $DATA_DIR
    # Remove existing data
    !rm -rf $DATA_DIR/train2017/images
    !rm -rf $DATA_DIR/val2017/images
    # Rearrange data in the required format
    !mkdir -p $DATA_DIR/train2017/
    !mkdir -p $DATA_DIR/val2017/
    !mv $DATA_DIR/raw-data/train2017 $DATA_DIR/train2017/images
    !mv $DATA_DIR/raw-data/annotations/instances_train2017.json $DATA_DIR/train2017/annotations.json
    !mv $DATA_DIR/raw-data/val2017 $DATA_DIR/val2017/images
    !mv $DATA_DIR/raw-data/annotations/instances_val2017.json $DATA_DIR/val2017/annotations.json
    !cp ../dataset_prepare/coco/label_map.txt $DATA_DIR/train2017/
    !cp ../dataset_prepare/coco/label_map.txt $DATA_DIR/val2017/
    
# For unet, manually download the data from https://github.com/alexklibisz/isbi-2012/tree/master/data and place it in $DATA_DIR

### Verify the downloaded dataset

In [None]:
if model_name == "mask_rcnn":
    !if [ ! -d $DATA_DIR/train2017/images ]; then echo 'Images folder not found'; else echo 'Found images folder';fi
    !if [ ! -f $DATA_DIR/train2017/annotations.json ]; then echo 'annotations file not found'; else echo 'Found annotations file';fi
    !if [ ! -d $DATA_DIR/val2017/images ]; then echo 'Images folder not found'; else echo 'Found images folder';fi
    !if [ ! -f $DATA_DIR/val2017/annotations.json ]; then echo 'annotations file not found'; else echo 'Found annotations file';fi
if model_name == "unet" and dataset_to_be_used == "default":
    !if [ ! -f $DATA_DIR/train-volume.tif ]; then echo 'train-volume.tif file not found, please download.'; else echo 'Found train-volume.tif file.';fi
    !if [ ! -f $DATA_DIR/train-labels.tif ]; then echo 'train-labels.tif file not found, please download.'; else echo 'Found train-labels.tif file.';fi
    !if [ ! -f $DATA_DIR/test-volume.tif ]; then echo 'test-volume.tif file not found, please download.'; else echo 'Found test-volume.tif file.';fi

In [None]:
if model_name == "unet":
    if dataset_to_be_used == "default":
        !python3 -m pip install Pillow
        !bash ../dataset_prepare/unet/prepare_data.sh $DATA_DIR # creates images and masks from the tif files
    !tar -czf isbi_data.tar.gz -C $DATA_DIR .
elif model_name == "mask_rcnn":
    !tar -C $DATA_DIR/train2017 -czf coco_train.tar.gz images annotations.json
    !tar -C $DATA_DIR/val2017 -czf coco_val.tar.gz images annotations.json

In [None]:
if model_name == "unet":
    train_dataset_path = "isbi_data.tar.gz"
    eval_dataset_path = "isbi_data.tar.gz"
elif model_name == "mask_rcnn":
    train_dataset_path = "coco_train.tar.gz"
    eval_dataset_path= "coco_val.tar.gz"

In [None]:
# Create train dataset
if model_name == "unet":
    ds_type = "semantic_segmentation"
    ds_format = "unet"
elif model_name == "mask_rcnn":
    ds_type = "instance_segmentation"
    ds_format = "coco"
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/dataset"

response = requests.post(endpoint,data=data,headers=headers)

print(response)
print(response.json())

dataset_id = response.json()["id"]

In [None]:
# Update
dataset_information = {"name":"Train dataset",
                       "description":"My train dataset"}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/dataset/{dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)

print(response)
print(response.json())

In [None]:
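# Upload
endpoint = f"{base_url}/dataset/{dataset_id}/upload"

# Open the archive in a context manager so the file handle is closed after the upload
with open(train_dataset_path, "rb") as dataset_file:
    response = requests.post(endpoint, files=[("file", dataset_file)], headers=headers)

print(response)
print(response.json())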

In [None]:
# Create eval dataset
data = json.dumps({"type":ds_type,"format":ds_format})

endpoint = f"{base_url}/dataset"

response = requests.post(endpoint,data=data,headers=headers)

print(response)
print(response.json())

eval_dataset_id = response.json()["id"]

In [None]:
# Update
dataset_information = {"name":"Evaluation dataset",
                       "description":"My eval dataset"}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/dataset/{eval_dataset_id}"

response = requests.patch(endpoint, data=data, headers=headers)

print(response)
print(response.json())

In [None]:
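# Upload
endpoint = f"{base_url}/dataset/{eval_dataset_id}/upload"

# Open the archive in a context manager so the file handle is closed after the upload
with open(eval_dataset_path, "rb") as dataset_file:
    response = requests.post(endpoint, files=[("file", dataset_file)], headers=headers)

print(response)
print(response.json())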

### List the created datasets <a class="anchor" id="head-2"></a>

In [None]:
endpoint = f"{base_url}/dataset"

response = requests.get(endpoint, headers=headers)

print(response)
# print(response.json()) ## Uncomment for verbose list output
print("id\t\t\t\t\t type\t\t\t format\t\t name")
for rsp in response.json():
    print(rsp["id"],"\t",rsp["type"],"\t",rsp["format"],"\t\t",rsp["name"])

### Dataset convert Action <a class="anchor" id="head-3"></a>
#### Run dataset convert only for coco data format, skip to Create model for unet data format

In [None]:
# Get default spec schema
endpoint = f"{base_url}/dataset/{dataset_id}/specs/convert/schema"

response = requests.get(endpoint, headers=headers)

#print(response)
#print(response.json()) ## Uncomment for verbose schema

specs = response.json()["default"]

print(specs)

In [None]:
# Apply changes
specs["coco_config"]["num_shards"] = 256
specs["coco_config"]["tag"] = "train"
print(specs)

In [None]:
# Post spec
data = json.dumps(specs)

endpoint = f"{base_url}/dataset/{dataset_id}/specs/convert"

response = requests.post(endpoint,data=data,headers=headers)

print(response)
print(response.json())

In [None]:
# Run action
parent = None
actions = ["convert"]
data = json.dumps({"job":parent,"actions":actions})

endpoint = f"{base_url}/dataset/{dataset_id}/job"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(response.json())

ds_convert_id = response.json()[0]

In [None]:
# Monitor job status by repeatedly running this cell
job_id = ds_convert_id
endpoint = f"{base_url}/dataset/{dataset_id}/job/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(response.json())
    if response.json().get("status") in ["Done","Error"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
# Now, repeat the same for the eval dataset
# Get default spec schema
endpoint = f"{base_url}/dataset/{eval_dataset_id}/specs/convert/schema"

response = requests.get(endpoint, headers=headers)

print(response)
#print(response.json()) ## Uncomment for verbose schema
specs = response.json()["default"]

#print(specs)

In [None]:
## Apply changes
specs["coco_config"]["num_shards"] = 32
specs["coco_config"]["tag"] = "val"
print(specs)

In [None]:
# Post spec
data = json.dumps(specs)

endpoint = f"{base_url}/dataset/{eval_dataset_id}/specs/convert"

response = requests.post(endpoint,data=data,headers=headers)

print(response)
print(response.json())

In [None]:
# Run action
parent = None
actions = ["convert"]
data = json.dumps({"job":parent,"actions":actions})

endpoint = f"{base_url}/dataset/{eval_dataset_id}/job"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(response.json())

eval_ds_convert_id = response.json()[0]

In [None]:
# Monitor job status by repeatedly running this cell
job_id = eval_ds_convert_id
endpoint = f"{base_url}/dataset/{eval_dataset_id}/job/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(response.json())
    if response.json().get("status") in ["Done","Error"] or response.status_code not in (200,201):
        break
    time.sleep(15)

### Create model <a class="anchor" id="head-4"></a>

In [None]:
network_arch = model_name
encode_key = "tlt_encode"
data = json.dumps({"network_arch":network_arch,"encryption_key":encode_key})

endpoint = f"{base_url}/model"

response = requests.post(endpoint,data=data,headers=headers)

print(response)
print(response.json())

model_id = response.json()["id"]

### List models <a class="anchor" id="head-5"></a>

In [None]:
endpoint = f"{base_url}/model"

response = requests.get(endpoint, headers=headers)

print(response)
# print(response.json()) ## Uncomment for verbose list output

print("model id\t\t\t     network architecture")
for rsp in response.json():
    print(rsp["id"],rsp["network_arch"])

### Assign train, eval datasets <a class="anchor" id="head-6"></a>

- `train_datasets` takes a list of dataset IDs; here it holds the train dataset created above
- `eval_dataset` is set to the evaluation dataset created above

In [None]:
dataset_information = {"train_datasets":[dataset_id],
                       "eval_dataset":eval_dataset_id}
data = json.dumps(dataset_information)

endpoint = f"{base_url}/model/{model_id}"

response = requests.patch(endpoint, data=data, headers=headers)

print(response)
print(response.json())

### Assign PTM <a class="anchor" id="head-7"></a>

Search NGC for a pretrained model (mask_rcnn/unet) and assign it to the model.
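
The search amounts to filtering the model list by architecture and NGC path. The sketch below shows that filter as a standalone function; `find_ptm_id` and the sample metadata are hypothetical, and the guard for a missing or empty `ngc_path` is an assumption about entries that are not pretrained models.

```python
def find_ptm_id(models, network_arch, ngc_path_fragment):
    """Return the id of the first model matching architecture and NGC path, else None."""
    for model in models:
        # Treat a missing or null ngc_path as an empty string (assumption)
        ngc_path = model.get("ngc_path") or ""
        if model.get("network_arch") == network_arch and ngc_path_fragment in ngc_path:
            return model["id"]
    return None


# Made-up metadata for illustration only; real entries come from GET {base_url}/model
example_models = [
    {"id": "user-model", "network_arch": "unet", "ngc_path": None},
    {"id": "ptm-1", "network_arch": "unet",
     "ngc_path": "nvidia/tao/pretrained_semantic_segmentation:resnet18"},
]
print(find_ptm_id(example_models, "unet", "pretrained_semantic_segmentation:resnet18"))
```

Returning `None` when nothing matches makes it easy to stop before patching the model with an empty PTM.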

In [None]:
# Map each network to the pretrained model it should use.
# Query base_url+"/model" to get the details of all pretrained models, and update this map to match your experiment.
# For example, if you change the number of layers to 34, change the pretrained model name accordingly.
# print(base_url+"/model")
pretrained_map = {"mask_rcnn" : "pretrained_instance_segmentation:resnet50",
                  "unet" : "pretrained_semantic_segmentation:resnet18"}

In [None]:
# Get pretrained model for the selected network (mask_rcnn or unet)
model_list = f"{base_url}/model"
response = requests.get(model_list, headers=headers)

response_json = response.json()

# Search for ptm with given ngc path
ptm_id = None
for rsp in response_json:
    if network_arch == rsp["network_arch"] and pretrained_map[network_arch] in rsp["ngc_path"]:
        ptm_id = rsp["id"]
        print("Metadata for model with requested NGC Path")
        print(rsp)
        break
ptm = ptm_id

In [None]:
ptm_information = {"ptm":ptm}
data = json.dumps(ptm_information)

endpoint = f"{base_url}/model/{model_id}"

response = requests.patch(endpoint, data=data, headers=headers)

print(response)
#print(response.json())

### View hyperparameters that are enabled for AutoML by default

In [None]:
# Get default spec schema
endpoint = f"{base_url}/model/{model_id}/specs/train/schema"

response = requests.get(endpoint, headers=headers)

specs = response.json()["automl_default_parameters"]

print(json.dumps(specs, sort_keys=True, indent=4))

### Set AutoML related configurations <a class="anchor" id="head-8"></a>
Refer to the following links for the parameters supported by each network, and add parameters as needed on top of the AutoML parameters enabled by default: [Mask RCNN](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_action_specs.html#id25), 
[Unet](https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_action_specs.html#id41)
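
The AutoML settings are sent to the service as a JSON payload. The sketch below wraps the fields used in this notebook's configuration cell in a small function that rejects an unknown algorithm name; `build_automl_payload` is a hypothetical helper, and `hypothetical_param` is a placeholder rather than a real hyperparameter name.

```python
import json

SUPPORTED_ALGORITHMS = ("Bayesian", "HyperBand")


def build_automl_payload(algorithm, add_parameters=(), remove_parameters=()):
    """Build the JSON body used to enable AutoML on a model."""
    if algorithm not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"automl_algorithm must be one of {SUPPORTED_ALGORITHMS}")
    return json.dumps({
        "automl_enabled": True,
        "automl_algorithm": algorithm,
        "metric": "kpi",
        # The notebook passes the parameter lists as their string representation
        "automl_add_hyperparameters": str(list(add_parameters)),
        "automl_remove_hyperparameters": str(list(remove_parameters)),
    })


payload = build_automl_payload("Bayesian", add_parameters=["hypothetical_param"])
print(payload)
```

The resulting string can be passed as `data` in the PATCH request to the model endpoint.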

In [None]:
# Choose automl algorithm between "Bayesian" and "HyperBand".
automl_algorithm="Bayesian" # FIXME7 example: Bayesian/HyperBand

metric="kpi" #don't change, more metrics will be supported in the future

additional_automl_parameters = [] #Refer to parameter list mentioned in the above links and add any extra parameter in addition to the default enabled ones
remove_default_automl_parameters = [] #Remove any hyperparameters that are enabled by default for AutoML

automl_information = {"automl_enabled":True,
                      "automl_algorithm":automl_algorithm,
                      "metric":metric,
                      "automl_add_hyperparameters":str(additional_automl_parameters),
                      "automl_remove_hyperparameters":str(remove_default_automl_parameters)
                     }
data = json.dumps(automl_information)

endpoint = f"{base_url}/model/{model_id}"

response = requests.patch(endpoint, data=data, headers=headers)

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

### Actions <a class="anchor" id="head-9"></a>

For all actions:
1. Get default spec schema and derive the default values
2. Modify defaults if needed
3. Post spec dictionary to the service
4. Run model action
5. Monitor job using retrieve
6. Download results using job download endpoint (if needed)
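
Steps 1 to 5 can be sketched as a single function. This is an illustration under stated assumptions, not part of the TAO API: `run_action`, `overrides`, and `max_polls` are hypothetical names, `session` is any object exposing `get` and `post` like the `requests` module or a `requests.Session`, and the endpoint paths are the ones used in this notebook's cells. Step 6 (download) is not covered.

```python
import json
import time


def run_action(session, base_url, headers, kind, resource_id, action,
               overrides=None, poll_seconds=15, max_polls=None):
    """Run one action on a "dataset" or "model" and return the last job metadata."""
    root = f"{base_url}/{kind}/{resource_id}"
    # 1. Get default spec schema and derive the default values
    schema = session.get(f"{root}/specs/{action}/schema", headers=headers).json()
    specs = schema["default"]
    # 2. Modify defaults if needed (top-level keys only in this sketch)
    specs.update(overrides or {})
    # 3. Post spec dictionary to the service
    session.post(f"{root}/specs/{action}", data=json.dumps(specs), headers=headers)
    # 4. Run the action; the service returns a list of job ids
    payload = json.dumps({"job": None, "actions": [action]})
    job_id = session.post(f"{root}/job", data=payload, headers=headers).json()[0]
    # 5. Monitor the job until it is Done/Error, the request fails, or max_polls is hit
    polls = 0
    while True:
        response = session.get(f"{root}/job/{job_id}", headers=headers)
        polls += 1
        finished = response.json().get("status") in ("Done", "Error")
        failed = response.status_code not in (200, 201)
        if finished or failed or (max_polls is not None and polls >= max_polls):
            return response.json()
        time.sleep(poll_seconds)
```

For example, `run_action(requests, base_url, headers, "model", model_id, "train")` would follow the same sequence as the AutoML train cells, assuming `requests` is imported and the variables are defined as in this notebook.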

In [None]:
job_map = {}

### AutoML Train <a class="anchor" id="head-10"></a>

In [None]:
# Get default spec schema
endpoint = f"{base_url}/model/{model_id}/specs/train/schema"

response = requests.get(endpoint, headers=headers)
print(response)

#print(response.json()) ## Uncomment for verbose schema
specs = response.json()["default"]

print(json.dumps(specs, sort_keys=True, indent=4))

In [None]:
# Override any of the parameters listed in the previous cell as required
# The parameter keys differ per network; for example, in mask_rcnn the training duration is determined by num_epochs or total_steps
specs["num_epochs"] = 5
specs["num_examples_per_epoch"] = 5000 # For mask_rcnn, set this to the number of images in your dataset divided by the number of GPUs
# Example for unet
# specs["training_config"]["epochs"] = 50

In [None]:
# Post spec
data = json.dumps(specs)

endpoint = f"{base_url}/model/{model_id}/specs/train"

response = requests.post(endpoint,data=data,headers=headers)

print(response)
print(json.dumps(response.json(), sort_keys=True, indent=4))

In [None]:
# Run action
parent = None
actions = ["train"]
data = json.dumps({"job":parent,"actions":actions})

endpoint = f"{base_url}/model/{model_id}/job"

response = requests.post(endpoint, data=data, headers=headers)

print(response)
print(response.json())

job_map["train"] = response.json()[0]
print(job_map)

In [None]:
# Monitor automl job status by repeatedly running this cell
# Training times for different models benchmarked on 1 GPU V100 machine can be found here: https://docs.nvidia.com/tao/tao-toolkit/text/automl/automl.html#results-of-automl-experiments

job_id = job_map['train']
endpoint = f"{base_url}/model/{model_id}/job/{job_id}"

while True:    
    clear_output(wait=True)
    response = requests.get(endpoint, headers=headers)
    print(response)
    print(json.dumps(response.json(), sort_keys=True, indent=4))
    if response.json().get("status") in ["Done","Error"] or response.status_code not in (200,201):
        break
    time.sleep(15)

In [None]:
## To Stop an AutoML JOB
#    1. Stop the 'Monitor automl job status by repeatedly running this cell' cell (the cell right before this cell) manually
#    2. Uncomment the snippet in the next cell and run the cell

In [None]:
# job_id = job_map['train']
# endpoint = f"{base_url}/model/{model_id}/job/{job_id}/cancel"

# response = requests.post(endpoint, headers=headers)

# print(response)
# print(response.json())

In [None]:
## Resume AutoML

In [None]:
# Uncomment the below snippet if you want to resume an already stopped AutoML job and then run the 'Monitor automl job status by repeatedly running this cell' cell above (4th cell above from this cell)
# job_id = job_map['train']
# endpoint = f"{base_url}/model/{model_id}/job/{job_id}/resume"

# response = requests.post(endpoint, headers=headers)

# print(response)
# print(response.json())

In [None]:
# Download automl job contents once the above job shows "Done" status
# Download output of the AutoML train job (Note: will take time)
job_id = job_map["train"]
endpoint = f'{base_url}/model/{model_id}/job/{job_id}/download'

# Save
temptar = f'{job_id}.tar.gz'
with requests.get(endpoint, headers=headers, stream=True) as r:
    r.raise_for_status()
    with open(temptar, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)

print("Untarring")

# Untar to destination
tar_command = f'tar -xvf {temptar} -C {workdir}/'
os.system(tar_command)
os.remove(temptar)
print(f"Results at {workdir}/{job_id}")
model_downloaded_path = f"{workdir}/{job_id}"

In [None]:
# View best performing model's config, model file; Also view the results of all automl experiments
!python3 -m pip install pandas
import pandas as pd

best_model_path = f"{model_downloaded_path}/best_model"

if os.path.exists(best_model_path):        
    #List the binary model file
    print("\nCheckpoints for the best performing experiment")
    if os.path.exists(best_model_path+"/weights") and len(os.listdir(best_model_path+"/weights")) > 0:
        print(f"Folder: {best_model_path}/weights")
        print("Files:", os.listdir(best_model_path+"/weights"))
    else:
        print(f"Folder: {best_model_path}")
        print("Files:", os.listdir(best_model_path))

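    # Read the experiment records, closing the file once it is loaded
    with open(f"{best_model_path}/controller.json","r") as controller_file:
        experiment_artifacts = json.load(controller_file)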
    data_frame = pd.DataFrame(experiment_artifacts)
    # Print experiment id/number and the corresponding result
    print("\nResults of all experiments")
    with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.max_colwidth', None):
        print(data_frame[["id","result"]])

    print("\nConfig/Spec file for the best performing experiment (recommendation_id.kitti with the maximum result value in the dataframe)")
    # List the recommendation config file of the best performing checkpoint(recommendation_id.kitti with the maximum result value in the dataframe)
    !ls {best_model_path}/*.kitti 