# End-to-End Example: Training an Object Detection Model Using the NVIDIA PyTorch Container from NGC
 ----

Note: this object detection demo is based on https://github.com/pytorch/vision/tree/v0.11.3

This notebook walks you through each step of training a model using containers from the NGC Catalog. We chose the GPU-optimized PyTorch container as an example. The basics of working with Docker containers apply to all NGC containers.

We will show you how to:

* Install the Docker Engine on your system
* Pull a PyTorch container from the NGC Catalog using Docker
* Run the PyTorch container using Docker
* Part 2: Preprocess the satellite imagery object detection dataset using the SAHI library
* Part 3: Train an object detection model on satellite imagery using PyTorch and Jupyter Notebook
* Part 4: Run inference with the trained object detection model using the SAHI library

Let's get started!

---

### 1. Install the Docker Engine
Go to https://docs.docker.com/engine/install/ to install the Docker Engine on your system.


### 2. Download the PyTorch container from the NGC Catalog

Once the Docker Engine is installed on your machine, visit https://ngc.nvidia.com/catalog/containers and search for the PyTorch container. Click on the PyTorch card and copy the pull command.
<img src="https://raw.githubusercontent.com/kbojo/images/master/NGC.png">

Open a terminal on your machine, paste the pull command, and execute it to download the container.

```$ docker pull nvcr.io/nvidia/pytorch:21.11-py3```
    
The container starts downloading to your computer. A container image consists of many layers; all of them need to be pulled. 

### 3. Run the PyTorch container image

Once the container download has completed, run the following command to start the container:

```$ docker run -it --gpus all  -p 8888:8888 -v $PWD:/projects --network=host nvcr.io/nvidia/pytorch:21.11-py3```
<img src="https://raw.githubusercontent.com/kbojo/images/master/commandline1.png">

### 4. Install JupyterLab and open a notebook

Within the container, run the following commands:

```pip install torchvision==0.11.3 jupyterlab```

```jupyter lab --ip=0.0.0.0 --port=8888 --allow-root```

Open up your favorite browser and enter: http://localhost:8888/?token=*yourtoken*.
<img src="https://raw.githubusercontent.com/kbojo/images/master/commandline2.png">

You should see the JupyterLab application. Click on the plus icon to launch a new Python 3 notebook.

Follow along with the PyTorch object detection example provided below.

In [12]:
!pip install cython pycocotools matplotlib

# TL;DR: Run the training job on 8 GPUs
The cell below runs a multi-GPU training job. It trains an object detection model (Faster R-CNN) on a satellite imagery dataset with 61 classes (60 object categories plus background).
* Change the `nproc_per_node` argument to match the number of GPUs available on your server.

In [None]:
!torchrun --nproc_per_node=8 detection/train.py\
    --dataset coco --data-path=/run/determined/workdir/shared_fs/data/xview_dataset/ --model fasterrcnn_resnet50_fpn --epochs 26\
    --lr-steps 16 22 --aspect-ratio-group-factor 3
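
The torchvision detection reference documents its default learning rate of 0.02 as intended for 8 GPUs with 2 images per GPU, and suggests scaling it as `0.02/8*$NGPU` when the GPU count changes. The sketch below applies that rule of thumb. `scaled_lr` is a hypothetical helper and not part of the reference scripts, and the result is only a starting point that may still need tuning.

```python
def scaled_lr(num_gpus, base_lr=0.02, base_gpus=8):
    """Scale the learning rate linearly with the number of GPUs (rule of thumb)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return base_lr * num_gpus / base_gpus


# Print a suggested --lr value for a few GPU counts.
for gpus in (1, 2, 4, 8):
    print(f"--nproc_per_node={gpus} --lr {scaled_lr(gpus):.4f}")
```

The suggested value would be passed to `train.py` through its `--lr` argument alongside the matching `--nproc_per_node`.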

### 5. Preprocess Satellite Imagery Dataset with PyTorch
* TODO

In [1]:
# TODO 

### 6. Object Detection on Satellite Imagery with PyTorch (Single GPU)
Follow along and run the code to train a Faster R-CNN FPN model (ResNet-50 backbone) that detects objects in satellite imagery.

In [1]:
import sys
sys.path.insert(0,'detection')

In [2]:
# Import python dependencies
import datetime
import os
import time

import torch
import torch.utils.data
import torchvision
import torchvision.models.detection
import torchvision.models.detection.mask_rcnn

from coco_utils import get_coco, get_coco_kp

from group_by_aspect_ratio import GroupedBatchSampler, create_aspect_ratio_groups
from engine import train_one_epoch, evaluate

import presets
import utils
from train import *
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor



In [3]:
output_dir='output'
data_path='/run/determined/workdir/shared_fs/data/xview_dataset/'
dataset_name='coco'
model='fasterrcnn_resnet50_fpn'
device='cuda'
batch_size=8
epochs=26
workers=4
lr=0.02
momentum=0.9
weight_decay=1e-4
lr_scheduler='multisteplr'
lr_step_size=8
lr_steps=[16, 22]
lr_gamma=0.1
print_freq=20
resume=False
start_epoch=0
aspect_ratio_group_factor=3
rpn_score_thresh=None
trainable_backbone_layers=None
data_augmentation='hflip'
pretrained=True
test_only=False
sync_bn=False
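
With `lr_scheduler='multisteplr'`, the values `lr`, `lr_steps`, and `lr_gamma` above describe a step schedule: the learning rate is multiplied by `lr_gamma` each time a milestone epoch is reached. The pure-Python sketch below reproduces that closed form so you can preview the schedule without a GPU. `multistep_lr` is a hypothetical helper written for illustration, and it assumes the scheduler is stepped once per epoch.

```python
from bisect import bisect_right


def multistep_lr(epoch, base_lr=0.02, milestones=(16, 22), gamma=0.1):
    """Learning rate in effect during `epoch` under a multi-step decay schedule."""
    # Count the milestones already reached, then apply the decay that many times.
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


for epoch in (0, 15, 16, 21, 22, 25):
    print(f"epoch {epoch:2d}: lr = {multistep_lr(epoch):.4f}")
```

Under these assumptions the rate stays at 0.02 through epoch 15, drops to 0.002 at epoch 16, and drops to 0.0002 at epoch 22.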

In [5]:
# Import the dataset.
# Data loading code
print("Loading data")

dataset, num_classes = get_dataset(dataset_name, "train", get_transform(True, data_augmentation),
                                   data_path)
dataset_test, _ = get_dataset(dataset_name, "val", get_transform(False, data_augmentation), data_path)
print(dataset.num_classes)
print("Creating data loaders")
train_sampler = torch.utils.data.RandomSampler(dataset)
test_sampler = torch.utils.data.SequentialSampler(dataset_test)
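# Group images with similar aspect ratios into the same batch when a grouping
# factor is set; otherwise fall back to a plain batch sampler.
if aspect_ratio_group_factor >= 0:
    group_ids = create_aspect_ratio_groups(dataset, k=aspect_ratio_group_factor)
    train_batch_sampler = GroupedBatchSampler(train_sampler, group_ids, batch_size)
else:
    train_batch_sampler = torch.utils.data.BatchSampler(
        train_sampler, batch_size, drop_last=True)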

data_loader = torch.utils.data.DataLoader(
    dataset, batch_sampler=train_batch_sampler, num_workers=workers,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1,
    sampler=test_sampler, num_workers=workers,
    collate_fn=utils.collate_fn)

Loading data
loading annotations into memory...
Done (t=2.66s)
creating index...
index created!
self.catIdtoCls:  {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 12, 12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: 19, 19: 20, 20: 21, 21: 22, 22: 23, 23: 24, 24: 25, 25: 26, 26: 27, 27: 28, 28: 29, 29: 30, 30: 31, 31: 32, 32: 33, 33: 34, 34: 35, 35: 36, 36: 37, 37: 38, 38: 39, 39: 40, 40: 41, 41: 42, 42: 43, 43: 44, 44: 45, 45: 46, 46: 47, 47: 48, 48: 49, 49: 50, 50: 51, 51: 52, 52: 53, 53: 54, 54: 55, 55: 56, 56: 57, 57: 58, 58: 59, 59: 60}
self.num_classes:  61
{1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 7, 9: 8, 10: 9, 11: 10, 12: 11, 13: 12, 14: 13, 15: 14, 16: 15, 17: 16, 18: 17, 19: 18, 20: 19, 21: 20, 22: 21, 23: 22, 24: 23, 25: 24, 26: 25, 27: 26, 28: 27, 29: 28, 30: 29, 31: 30, 32: 31, 33: 32, 34: 33, 35: 34, 36: 35, 37: 36, 38: 37, 39: 38, 40: 39, 41: 40, 42: 41, 43: 42, 44: 43, 45: 44, 46: 45, 47: 46, 48: 47, 49: 48, 50: 49, 51: 50, 52: 51, 53: 
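
`create_aspect_ratio_groups` in the cell above assigns each image to an aspect-ratio bin so that a batch can be built from images of similar shape, which reduces padding. The sketch below illustrates the binning idea in pure Python: bin edges are spaced evenly in log2 space between 1/2 and 2, and an image's group is the index of the bin its width/height ratio falls into. `aspect_ratio_group` is a hypothetical helper that assumes `k > 0`; it mirrors the idea rather than the exact torchvision implementation.

```python
from bisect import bisect_right


def aspect_ratio_group(width, height, k=3):
    """Return the bin index for an image's aspect ratio (width / height)."""
    # 2k + 1 bin edges spaced evenly in log2 space between 1/2 and 2 (assumes k > 0).
    edges = [2 ** ((i - k) / k) for i in range(2 * k + 1)]
    return bisect_right(edges, width / height)


for w, h in [(400, 1000), (750, 1000), (1000, 1000), (3000, 1000)]:
    print((w, h), "-> group", aspect_ratio_group(w, h))
```

With `k=3` there are seven edges and therefore eight possible groups, from very tall images (group 0) to very wide ones (group 7).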

In [6]:
# Let's have a look at one of the images. The following code visualizes the images using the matplotlib library.



In [7]:
# Let's look again at the first ten images, but this time with the class names.



In [8]:
def build_frcnn_model(num_classes):
    # load a detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
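    # Note: the values below match torchvision's defaults. Assigning them as
    # attributes after construction does not reconfigure the model; to change
    # them, pass them as keyword arguments to fasterrcnn_resnet50_fpn(...).
    model.min_size=800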
    model.max_size=1333
    # RPN parameters
    model.rpn_pre_nms_top_n_train=2000
    model.rpn_pre_nms_top_n_test=1000
    model.rpn_post_nms_top_n_train=2000
    model.rpn_post_nms_top_n_test=1000
    model.rpn_nms_thresh=0.7
    model.rpn_fg_iou_thresh=0.7
    model.rpn_bg_iou_thresh=0.3
    model.rpn_batch_size_per_image=256
    model.rpn_positive_fraction=0.5
    model.rpn_score_thresh=0.0
    # Box parameters
    model.box_score_thresh=0.05
    model.box_nms_thresh=0.5
    model.box_detections_per_img=100
    model.box_fg_iou_thresh=0.5
    model.box_bg_iou_thresh=0.5
    model.box_batch_size_per_image=512
    model.box_positive_fraction=0.25
    return model

In [9]:
# Let's build the model:
print("Creating model")
kwargs = {
    "trainable_backbone_layers": trainable_backbone_layers
}
if "rcnn" in model:
    if rpn_score_thresh is not None:
        kwargs["rpn_score_thresh"] = rpn_score_thresh
model = build_frcnn_model(dataset.num_classes)
_=model.to(device)


Creating model
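
The `num_classes` value passed to `build_frcnn_model` includes the background class, which torchvision detection models place at label 0. This is consistent with the loader output above, where 60 category ids map to labels 1 through 60 and `num_classes` is 61. The sketch below shows one way such a mapping could be built. `build_label_map` is a hypothetical helper, and the actual dataset code may differ.

```python
def build_label_map(category_ids):
    """Map raw category ids to contiguous labels 1..N, reserving 0 for background."""
    cat_id_to_label = {cat_id: i + 1 for i, cat_id in enumerate(sorted(category_ids))}
    num_classes = len(cat_id_to_label) + 1  # +1 for the background class
    return cat_id_to_label, num_classes


cat_id_to_label, num_classes = build_label_map(range(60))
print(cat_id_to_label[0], cat_id_to_label[59], num_classes)
```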


In [10]:
# Set up the optimizer and learning-rate scheduler.
# The detection losses are computed inside the model itself during training.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    params, lr=lr, momentum=momentum, weight_decay=weight_decay)

lr_scheduler = lr_scheduler.lower()
if lr_scheduler == 'multisteplr':
    lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, 
                                                        milestones=lr_steps, 
                                                        gamma=lr_gamma)
elif lr_scheduler == 'cosineannealinglr':
    lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
else:
    raise RuntimeError("Invalid lr scheduler '{}'. Only MultiStepLR and CosineAnnealingLR "
                       "are supported.".format(lr_scheduler))


In [11]:
# Train the model:
# Train for `epochs` epochs. Losses are logged every `print_freq` iterations, and
# COCO-style evaluation metrics are reported after each epoch.
print("Start training")
start_time = time.time()
if output_dir:
    os.makedirs(output_dir, exist_ok=True)  # make sure the checkpoint directory exists
for epoch in range(start_epoch, epochs):
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
    lr_scheduler.step()
    if output_dir:
        # No argparse `args` object exists in this notebook, so it is not saved here.
        checkpoint = {
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'lr_scheduler': lr_scheduler.state_dict(),
            'epoch': epoch
        }
        utils.save_on_master(
            checkpoint,
            os.path.join(output_dir, 'model_{}.pth'.format(epoch)))
        utils.save_on_master(
            checkpoint,
            os.path.join(output_dir, 'checkpoint.pth'))

    # evaluate after every epoch
    evaluate(model, data_loader_test, device=device)

total_time = time.time() - start_time
total_time_str = str(datetime.timedelta(seconds=int(total_time)))
print('Training time {}'.format(total_time_str))

Start training


  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]


Epoch: [0]  [   0/2575]  eta: 2:47:07  lr: 0.000040  loss: 6.9348 (6.9348)  loss_classifier: 3.9348 (3.9348)  loss_box_reg: 0.5733 (0.5733)  loss_objectness: 2.2002 (2.2002)  loss_rpn_box_reg: 0.2265 (0.2265)  time: 3.8942  data: 0.4338  max mem: 7172
Epoch: [0]  [  20/2575]  eta: 2:30:39  lr: 0.000440  loss: 3.6758 (3.8337)  loss_classifier: 2.6189 (2.5724)  loss_box_reg: 0.4708 (0.4865)  loss_objectness: 0.2969 (0.6739)  loss_rpn_box_reg: 0.0792 (0.1009)  time: 3.5202  data: 0.0288  max mem: 7454


KeyboardInterrupt: 
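
The `lr` column in the log above starts far below the configured `lr=0.02` because the first epoch uses a warmup. The sketch below reproduces the two logged values, assuming a linear warmup from a factor of 1/1000 over 1000 iterations and assuming each log line is printed after that iteration's scheduler step. Both assumptions are consistent with the log but are not confirmed by the notebook itself, and `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(steps_done, base_lr=0.02, start_factor=1e-3, total_iters=1000):
    """Learning rate after `steps_done` scheduler steps of linear warmup."""
    progress = min(steps_done, total_iters) / total_iters
    return base_lr * (start_factor + (1.0 - start_factor) * progress)


# The log line for iteration i is assumed to be printed after i + 1 scheduler steps.
for iteration in (0, 20):
    print(f"iteration {iteration}: lr = {warmup_lr(iteration + 1):.6f}")
```

Under these assumptions the sketch prints 0.000040 for iteration 0 and 0.000440 for iteration 20, matching the log.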

In [None]:
# Let's see how the model performs on the test data:
