# HW3 Problem 1

by Haotian Zhang

<img src="https://dl.fbaipublicfiles.com/detectron2/Detectron2-Logo-Horz.png" width="500">

In this homework assignment, we will use Detectron2 (Facebook) for detection and segmentation tasks.

Detectron2 is Facebook AI Research's next-generation software system that implements state-of-the-art object detection algorithms. Here, we will go through some basic usage of Detectron2 and finish the following tasks:
* Run inference on images, with existing pre-trained detectron2 models
* Train your own models on two custom datasets: traffic sign & balloon 


## Getting Started

In [None]:
# First, let's install detectron2!
# install dependencies (the version spec is quoted so the shell does not treat ">" as a redirect):
!pip install pyyaml==5.1 'pycocotools>=2.0.1'

# Change "Runtime -> Change Runtime Type" and choose GPU/CPU
import torch, torchvision
print(torch.__version__, torch.cuda.is_available())
!gcc --version
# opencv is pre-installed on colab

In [None]:
# install detectron2: (Colab has CUDA 10.1 + torch 1.6)
# See https://detectron2.readthedocs.io/tutorials/install.html for instructions
assert torch.__version__.startswith("1.6")
!pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.6/index.html

# It may ask you to restart the runtime

In [None]:
# Some basic setup:
# Set up the detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np 
import os, json, cv2, random
from google.colab.patches import cv2_imshow

# import some common detectron2 utils
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog

## Run a pretrained Detectron2 model

We first download some images from the given URLs:

In [None]:
!wget http://images.cocodataset.org/val2017/000000007574.jpg -q -O input.jpg
im_input = cv2.imread("./input.jpg")
cv2_imshow(im_input)

In [None]:
!wget http://images.cocodataset.org/val2017/000000013923.jpg -q -O test1.jpg
im_test1 = cv2.imread("./test1.jpg")
cv2_imshow(im_test1)

In [None]:
!wget http://images.cocodataset.org/val2017/000000018380.jpg -q -O test2.jpg
im_test2 = cv2.imread("./test2.jpg")
cv2_imshow(im_test2)

We can see there are multiple objects in these images: bottles, tables, chairs, people, etc. Let us see if we can detect them all using a pre-trained model provided by Detectron2.


In [None]:
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set the score threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(im_input)

Let's take a look at the model output. 

In inference mode, the built-in model outputs a `list[dict]`, one dict for each image (`DefaultPredictor` takes a single image and returns that image's dict directly). For the object detection task, the dict contains the following fields:

*   "instances": Instances object with the following fields:
    * "pred_boxes": Storing N boxes, one for each detected instance.
    * "scores": a vector of N scores.
    * "pred_classes": a vector of N labels in range [0, num_categories).

For the full specification, please see https://detectron2.readthedocs.io/tutorials/models.html#model-output-format
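As a rough illustration of how these fields line up, the sketch below uses NumPy arrays with made-up values in place of a real `Instances` object. `filter_by_score` is a hypothetical helper, not part of detectron2; it only mimics the effect of a score threshold such as `SCORE_THRESH_TEST`, which keeps detections whose confidence is above the threshold.

```python
import numpy as np

# Made-up detections standing in for the fields of outputs["instances"].
pred_boxes = np.array([[10, 20, 110, 220],
                       [50, 60, 150, 160],
                       [5, 5, 40, 40]], dtype=np.float32)  # (N, 4) boxes as XYXY
scores = np.array([0.95, 0.40, 0.72], dtype=np.float32)   # (N,) confidences
pred_classes = np.array([0, 39, 56])                       # (N,) class ids


def filter_by_score(boxes, scores, classes, thresh):
    """Hypothetical helper: keep detections whose score is above `thresh`."""
    keep = scores > thresh
    return boxes[keep], scores[keep], classes[keep]


kept_boxes, kept_scores, kept_classes = filter_by_score(
    pred_boxes, scores, pred_classes, thresh=0.5)
print(kept_classes.tolist())  # [0, 56]
```

Raising the threshold to 0.9 in this toy example would leave only the first detection.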



In [None]:
# print(outputs)
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)

In [None]:
# We can use "Visualizer" to draw the predictions on the image
v = Visualizer(im_input[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2_imshow(out.get_image()[:, :, ::-1])

AWESOME!!! Great progress so far! We are able to detect the sink, microwave, bottle and even the refrigerator! At this point, we have used the pre-trained model to run inference on the given image, and 17 objects are detected in total. The image is taken from the [MS-COCO](https://cocodataset.org/#home) dataset, which has 80 object classes including person, bicycle, car, etc. You may find the id-category mapping [here](https://gist.github.com/AruniRC/7b3dadd004da04c80198557db5da4bda).

The model we just used is `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml`. Detectron2 provides many more than that: you can find a large number of models for different tasks in the [MODEL_ZOO](https://github.com/facebookresearch/detectron2/tree/master/configs). Why not try a different model and see what its output looks like?


* Q1 (5%): Object Detection. Use the same configuration `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml`, with a score threshold of 0.5 (`SCORE_THRESH_TEST=0.5`), to also run inference on the remaining two images (test1.jpg & test2.jpg) and view the outputs with bounding boxes.

* Q2: Object Detection. Use `COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml`, which has a ResNet-101 backbone, with a score threshold of 0.5 and view the outputs of all three images with bounding boxes. Looking at the outputs, what differences can you find compared with the `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml` model used in Q1? (e.g., number of objects, confidence scores, ...)

* Q3: Object Detection. Use `COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml` with a score threshold of 0.9 and view the outputs of all three images with bounding boxes.

* Q4 (5%): Instance Segmentation. The models we have tried in Q1-Q3 are Faster R-CNN models for object detection. Here, let’s try a Mask R-CNN model, `COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml`, with a score threshold of 0.5, to perform instance segmentation and view the outputs of all three images with segmentation masks. Compare the outputs of an object detection model with those of an instance segmentation model.






In [None]:
# Your code here (You may use multiple code and text segments to display your solutions.)
# Q1
# ...
# Q2
# ...
# Q3
# ...
# Q4

## Train Faster R-CNN on a traffic sign dataset

You have already used a model pre-trained on the MS COCO dataset. Why not try to train your own model? Here, let’s train an existing Faster R-CNN model from detectron2 on a custom dataset in a new format.

We use the [traffic sign dataset](https://www.dropbox.com/s/d8y6uc06027fpqo/traffic_sign_data.zip?dl=1). We’ll train a traffic sign detection model starting from an existing model pre-trained on the COCO dataset, available in detectron2’s model zoo. Note that the MS COCO dataset does not have the "traffic sign" category, but we'll be able to recognize this new class in a few minutes.


### Prepare the dataset

In [None]:
# download, decompress the data
!wget "https://www.dropbox.com/s/d8y6uc06027fpqo/traffic_sign_data.zip?dl=1" -O traffic_sign_data.zip
!unzip -q traffic_sign_data.zip > /dev/null

The traffic sign dataset comes in its own custom format, so we write a function to parse it and convert it into detectron2's standard format. See the `get_traffic_sign_dicts` function for more details.
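To make the box conversion concrete, here is a small worked example. It assumes each label row holds `class_id x_centre y_centre box_width box_height` with coordinates normalized by the image size, which is what the parsing code in `get_traffic_sign_dicts` implies; `normalized_row_to_xyxy` is a hypothetical helper name, and the row values are made up.

```python
def normalized_row_to_xyxy(row, width, height):
    """Convert one label row (assumed normalized centre/size format) to absolute XYXY."""
    class_id = int(row[0])
    xcentre = int(float(row[1]) * width)
    ycentre = int(float(row[2]) * height)
    bwidth = int(float(row[3]) * width)
    bheight = int(float(row[4]) * height)

    # Top-left corner, then add the box size to reach the bottom-right corner.
    xmin = int(xcentre - bwidth / 2)
    ymin = int(ycentre - bheight / 2)
    return class_id, [xmin, ymin, xmin + bwidth, ymin + bheight]


# Made-up row for a 640x480 image: a box centred in the image.
row = "0 0.5 0.5 0.25 0.5".split(" ")
class_id, bbox = normalized_row_to_xyxy(row, width=640, height=480)
print(class_id, bbox)  # 0 [240, 120, 400, 360]
```

The same box in `XYWH_ABS` mode would be `[240, 120, 160, 240]`: the first two numbers are unchanged, and the last two become the width and height.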

In [None]:
from detectron2.structures import BoxMode

def get_traffic_sign_dicts(data_root, txt_file):
    """Parse the traffic sign dataset into detectron2's standard list-of-dicts format."""
    dataset_dicts = []
    filenames = []
    csv_path = os.path.join(data_root, txt_file)
    with open(csv_path, "r") as f:
        for line in f:
            filenames.append(line.rstrip())
    
    for idx, filename in enumerate(filenames):
        record = {}

        image_path = os.path.join(data_root, filename)

        height, width = cv2.imread(image_path).shape[:2]

        record['file_name'] = image_path
        record['image_id'] = idx
        record['height'] = height
        record['width'] = width

        image_filename = os.path.basename(filename)
        image_name = os.path.splitext(image_filename)[0]
        annotation_path = os.path.join(data_root, 'labels', '{}.txt'.format(image_name))
        annotation_rows = []

        with open(annotation_path, "r") as f:
            for line in f:
                temp = line.rstrip().split(" ")
                annotation_rows.append(temp)

        objs = []
        # each row: class_id x_centre y_centre box_width box_height (coordinates normalized by image size)
        for row in annotation_rows:
            xcentre = int(float(row[1])*width)
            ycentre = int(float(row[2])*height)
            bwidth = int(float(row[3])*width)
            bheight = int(float(row[4])*height)

            xmin = int(xcentre - bwidth/2)
            ymin = int(ycentre - bheight/2)
            xmax = xmin  + bwidth
            ymax = ymin + bheight

            obj= {
                'bbox': [xmin, ymin, xmax, ymax],
                'bbox_mode': BoxMode.XYXY_ABS,
                # alternatively, we can use bbox_mode = BoxMode.XYWH_ABS
                # 'bbox': [xmin, ymin, bwidth, bheight],
                # 'bbox_mode': BoxMode.XYWH_ABS,
                'category_id': int(row[0]),
                'iscrowd': 0
            }

            objs.append(obj)
        record['annotations'] = objs
        dataset_dicts.append(record)
    return dataset_dicts

In [None]:
# Metadata configurations
data_root = "traffic_sign_data"
train_txt = "traffic_sign_train.txt"
test_txt = "traffic_sign_test.txt"

train_data_name = "traffic_sign_train"
test_data_name = "traffic_sign_test"

thing_classes = ["traffic-sign"]

output_dir = "./outputs"

def count_lines(fname):
    # count the lines in a file (returns 0 for an empty file)
    with open(fname) as f:
        return sum(1 for _ in f)

train_img_count = count_lines(os.path.join(data_root, train_txt))
print("There are {} samples in training data".format(train_img_count))

In [None]:
# Register the traffic_sign_train datasets
DatasetCatalog.register(name=train_data_name, 
                        func=lambda: get_traffic_sign_dicts(data_root, train_txt))
train_metadata = MetadataCatalog.get(train_data_name).set(thing_classes=thing_classes)

# Register the traffic_sign_test datasets
DatasetCatalog.register(name=test_data_name, 
                        func=lambda: get_traffic_sign_dicts(data_root, test_txt))
test_metadata = MetadataCatalog.get(test_data_name).set(thing_classes=thing_classes)

To verify the data loading is correct, let's visualize the annotations of randomly selected samples in the training set:

In [None]:
train_data_dict = get_traffic_sign_dicts(data_root, train_txt)

for d in random.sample(train_data_dict, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=train_metadata, scale=0.5)
    out = visualizer.draw_dataset_dict(d)
    cv2_imshow(out.get_image()[:, :, ::-1])

### Train!

Now, let's fine-tune a COCO-pretrained R50-FPN Faster R-CNN model `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml` on the traffic sign dataset.

In [None]:
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.DATASETS.TRAIN = (train_data_name,)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml") # let training initialize from the model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.001  # pick a good LR
cfg.SOLVER.MAX_ITER = 300    # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(thing_classes)  # only has one class (traffic-sign)
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # faster, and good enough for this toy dataset (default: 512)
cfg.OUTPUT_DIR = output_dir

In [None]:
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg) 
trainer.resume_or_load(resume=False)
trainer.train()

In [None]:
# Look at training curves in tensorboard:
%load_ext tensorboard
%tensorboard --logdir outputs/

### Inference & evaluation using the trained model


Now let's run inference with the trained model on the traffic sign test set. First, let's create a predictor using the model we just trained:

In [None]:
# cfg already contains everything we've set previously. Now we change it a little bit for inference:
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")  # path to the model we just trained
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

Then, we randomly select several samples to visualize the prediction results. 

In [None]:
from detectron2.utils.visualizer import ColorMode

test_data_dict = get_traffic_sign_dicts(data_root, test_txt)

for d in random.sample(test_data_dict, 3):
    im = cv2.imread(d["file_name"])
    outputs = predictor(im) 
    # print(outputs)
    v = Visualizer(im[:, :, ::-1],
                   metadata=test_metadata,
                   scale=0.5,
                   )
    out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(out.get_image()[:, :, ::-1])

We can also evaluate its performance using the AP metric implemented in the COCO API. For more details about AP, please refer to this [blog post](https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173).
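AP is built on Intersection over Union (IoU): a predicted box counts as a correct detection only if its IoU with a ground-truth box reaches a chosen threshold, and the headline COCO AP averages over IoU thresholds from 0.5 to 0.95. The sketch below computes IoU for two boxes in `[xmin, ymin, xmax, ymax]` form. `box_iou` is a hypothetical helper written for illustration; the COCO API uses its own implementation, whose conventions may differ in detail.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [xmin, ymin, xmax, ymax] (illustrative helper)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


gt = [0, 0, 100, 100]     # made-up ground-truth box
pred = [50, 0, 150, 100]  # made-up prediction shifted right by half a box
iou = box_iou(gt, pred)
print(round(iou, 3))  # 0.333
```

With an IoU of about 0.33, this prediction would not count as a match at the 0.5 threshold, no matter how high its confidence score is.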

In [None]:
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader

# create evaluator instance with coco evaluator 
evaluator = COCOEvaluator(test_data_name, cfg, False, output_dir="./outputs/")

# create validation data loader
val_loader = build_detection_test_loader(cfg, test_data_name)

# start validation
print(inference_on_dataset(trainer.model, val_loader, evaluator))

The AP is ~30%. You can also see the detailed metrics for small, medium and large objects. Not bad! Here are some things I want you to try yourself:

* Q5: Change the initial learning rate (`BASE_LR`) from `0.001` to `0.00025` and show the 4 training curves from TensorBoard. Keeping the rest of the configuration fixed, does it improve the AP or not? Explain why.

* Q6: Change the number of iterations (`MAX_ITER`) from `300` to `500` and show the 4 training curves from TensorBoard. Keeping the rest of the configuration fixed, does it improve the AP or not? What about `1000`? Explain why.

* Q7: Apply the data augmentation techniques mentioned in HW2 to the training set, show the 4 training curves from TensorBoard, and view the AP performance. Does it improve the AP or not? Explain why.
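One thing to keep in mind for detection is that geometric augmentations must transform the boxes together with the image. The sketch below shows this for a horizontal flip using plain NumPy. HW2 is not reproduced here, so horizontal flipping is only an assumed example of an augmentation, and `hflip_image_and_boxes` is a hypothetical helper rather than a detectron2 API.

```python
import numpy as np


def hflip_image_and_boxes(image, boxes):
    """Flip an HxWxC image left-right and mirror its XYXY boxes (illustrative helper)."""
    width = image.shape[1]
    flipped_image = image[:, ::-1].copy()

    boxes = np.asarray(boxes, dtype=np.float32)
    flipped_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    flipped_boxes[:, 0] = width - boxes[:, 2]
    flipped_boxes[:, 2] = width - boxes[:, 0]
    return flipped_image, flipped_boxes


# Made-up 640x480 image with a white strip on the left and one box.
image = np.zeros((480, 640, 3), dtype=np.uint8)
image[:, :10] = 255
boxes = [[100, 120, 260, 360]]

flipped_image, flipped_boxes = hflip_image_and_boxes(image, boxes)
print(flipped_boxes.tolist())  # [[380.0, 120.0, 540.0, 360.0]]
```

Photometric changes such as brightness or contrast jitter leave the boxes untouched, so only the image needs to be modified in those cases.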


In [None]:
# Your code here (You may use multiple code and text segments to display your solutions.)
# Q5
# ...
# Q6
# ...
# Q7


## Train Mask R-CNN on a balloon dataset

The above examples train Faster R-CNN on the traffic sign dataset to perform object detection. With a few lines of modification, we can train an instance segmentation model as well. Note that the traffic sign dataset only contains bounding box labels, with no segmentation masks, which is not enough to train a Mask R-CNN model. For this reason, we switch to another dataset: the [balloon segmentation dataset](https://github.com/matterport/Mask_RCNN/tree/master/samples/balloon), which has only one class: balloon.


In [None]:
# download the balloon dataset, decompress the data
!wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
!unzip balloon_dataset.zip > /dev/null

Write code to load and visualize the balloon dataset in a similar manner. You need to take a careful look at the label files and construct your `get_balloon_dicts` function so that it also loads the polygon mask information. If you load the dataset correctly, the visualized training samples will show each balloon's annotations overlaid on the image.
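As a starting point, the sketch below shows one way to turn a polygon outline into the fields detectron2 expects for an instance. It assumes each balloon outline is available as two parallel lists of x and y pixel coordinates; that is an assumption about the label file, so inspect it to confirm. `polygon_to_annotation` is a hypothetical helper, and the coordinates are made up.

```python
def polygon_to_annotation(px, py, category_id=0):
    """Build a partial annotation dict from polygon vertices (illustrative helper)."""
    # detectron2 stores a polygon as one flat list: [x0, y0, x1, y1, ...].
    poly = [coord for xy in zip(px, py) for coord in xy]
    return {
        # Tightest axis-aligned box around the polygon, in XYXY order.
        # In the notebook, also set 'bbox_mode': BoxMode.XYXY_ABS.
        "bbox": [min(px), min(py), max(px), max(py)],
        # A list of polygons, since one instance may consist of several parts.
        "segmentation": [poly],
        "category_id": category_id,
    }


# Made-up triangle outline.
px = [10, 50, 30]
py = [20, 20, 60]
annotation = polygon_to_annotation(px, py)
print(annotation["bbox"])          # [10, 20, 50, 60]
print(annotation["segmentation"])  # [[10, 20, 50, 20, 30, 60]]
```

Each image's record would then collect one such dict per balloon under `'annotations'`, in the same way `get_traffic_sign_dicts` does for boxes.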

In [None]:
from detectron2.structures import BoxMode

def get_balloon_dicts(img_dir):
    """
    Write your code to load the balloon dataset
    """
    pass    

* Q8: Fine-tune the pre-trained model `COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml` on the balloon dataset with the following configuration and show the TensorBoard visualization.
    * IMS_PER_BATCH = 2
    * BASE_LR = 0.00025
    * MAX_ITER = 300
    * ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
    * ROI_HEADS.NUM_CLASSES = 1

* Q9 (5%): Use your own trained model to run inference on the test set and plot at least 3 prediction results. Then, use the COCO API to report your test Average Precision (AP). If your model is trained correctly, you will see prediction results like the following figures.

In [None]:
# Your code here (You may use multiple code and text segments to display your solutions.)
# Q8
# ...
# Q9
# ...