# Overview

In the previous notebook we saw how to classify an image as a single object class.  That worked well, but we also saw that image classification can get confused when many objects appear in the camera view.

In this notebook we will work through multiple examples of how to use DIGITS and Caffe to detect objects in imagery.  The dataset we will be using is Common Objects in Context (COCO).  This dataset is provided by Microsoft and is a common benchmarking and academic dataset.  It also provides a good baseline for robotics applications because individual images contain many objects.  We will train a detection model with this dataset and deploy it to the TX-1 platform.

Fig 1 shows an example image containing a horse crossing sign:

![Horse Crossing](COCO_test2015_000000387637.jpg)
<h4 align="center">Figure 1: Horse Crossing Sign</h4> 


We are going to tackle a very interesting problem in this tutorial.  Rather than trying to identify the image as a single object, we are going to train a convolutional neural network (CNN) to localize various objects within the image.

## Object detection with DetectNet

One class of object detection approaches trains a CNN to simultaneously classify the most likely object present at each location within an image and predict the corresponding bounding box for that object through regression.  For example:

![yolo](yolo.png)

This approach has major benefits:

* Simple one-shot detection, classification and bounding box regression pipeline
* Very low latency
* Very low false alarm rates due to strong, voluminous background training data

In order to train this type of network, specialized training data is required in which all objects of interest are labelled with accurate bounding boxes.  This type of training data is much rarer and more costly to produce; however, if it is available for your object detection problem, this is almost certainly the best approach to take.  Figure 6 shows an example of a labelled training sample for a vehicle detection scenario.

![kespry example](kespry_example.png)
<h4 align="center">Figure 6: Labelled data for a three class object detection scenario</h4> 

The recent release of DIGITS 4 added the capability to train this class of model and provided a new "standard network" called DetectNet as an example.  We are going to use DetectNet to train a bottle detector on images from the COCO dataset.

The main challenge in training a single CNN for object detection and bounding box regression is handling the fact that different images can contain different numbers of objects.  In some cases you may even have an image with no objects at all.  DetectNet handles this problem by converting an image with any number of bounding box annotations to a fixed-dimensionality data representation that we directly attempt to predict with a CNN.  Figure 7 shows how data is mapped to this representation for a single-class object detection problem.

![detectnet data rep](detectnet_data.png)
<h4 align="center">Figure 7: DetectNet data representation</h4> 
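
To make the mapping concrete, here is a minimal pure-Python sketch of the idea. It is not DetectNet's actual implementation: the function name `boxes_to_grid`, the grid stride of 16 pixels, the rule that a cell is covered when its centre falls inside a box, and the choice to store box corners relative to the cell centre are all assumptions made for illustration.

```python
def boxes_to_grid(boxes, image_w, image_h, stride=16):
    """Map a variable-length list of boxes to fixed-size coverage and bbox grids.

    boxes: list of (x1, y1, x2, y2) in pixels.
    Returns (coverage, bbox) where coverage[r][c] is 0.0 or 1.0 and
    bbox[r][c] is [dx1, dy1, dx2, dy2] relative to the cell centre.
    The stride and the "centre inside box" rule are illustrative assumptions.
    """
    cols = image_w // stride
    rows = image_h // stride
    coverage = [[0.0] * cols for _ in range(rows)]
    bbox = [[[0.0, 0.0, 0.0, 0.0] for _ in range(cols)] for _ in range(rows)]
    for (x1, y1, x2, y2) in boxes:
        for r in range(rows):
            for c in range(cols):
                # Centre of grid cell (r, c) in pixel coordinates.
                cx = (c + 0.5) * stride
                cy = (r + 0.5) * stride
                if x1 <= cx <= x2 and y1 <= cy <= y2:
                    coverage[r][c] = 1.0
                    bbox[r][c] = [x1 - cx, y1 - cy, x2 - cx, y2 - cy]
    return coverage, bbox


# An image with no objects and an image with two objects
# both produce grids of exactly the same shape.
empty_cov, empty_bbox = boxes_to_grid([], 64, 64)
cov, box = boxes_to_grid([(0, 0, 20, 20), (40, 40, 60, 60)], 64, 64)
```

Whatever the number of annotations, the output size depends only on the image size and the stride, which is what lets a single network predict it directly.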

DetectNet is actually a fully convolutional network (FCN), configured to produce precisely this data representation as its output.  The bulk of the layers in DetectNet are identical to the well-known GoogLeNet network.  Figure 8 shows the DetectNet architecture for training.

![detectnet training architecture](detectnet_training.png)
<h4 align="center">Figure 8: DetectNet training architecture</h4> 

For the purposes of this lab we have already prepared the COCO dataset for this specific use case within the DIGITS ecosystem, so we can begin training right away.  First, however, let us review how the label files work.  DIGITS uses a label file format known as KITTI.  To begin with, we must understand the folder structure.  Notice that there is a one-to-one pairing of image file to label file, and both have exactly the same name except that one is a .png file and the other is a .txt file.

![Kitti Folder Structure](images/label_folders.png)
<h4 align="center">Figure 9: Kitti Folder Structure</h4> 
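
If you prepare your own dataset, a quick check of this one-to-one pairing can save a failed import. The sketch below is a hypothetical helper, not part of DIGITS; the function name `unpaired_files` and the list of accepted image extensions are assumptions.

```python
from pathlib import Path


def unpaired_files(image_dir, label_dir, image_exts=(".png", ".jpg", ".jpeg")):
    """Return (images_without_labels, labels_without_images) as sorted name stems."""
    image_stems = {p.stem for p in Path(image_dir).iterdir()
                   if p.suffix.lower() in image_exts}
    label_stems = {p.stem for p in Path(label_dir).iterdir()
                   if p.suffix.lower() == ".txt"}
    # Set differences expose files that have no partner in the other folder.
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Calling `unpaired_files(images_folder, labels_folder)` on your own pair of folders should return two empty lists when every image has a matching label file.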

The KITTI file itself is structured as shown below.

![Kitti File Structure](images/label_file.png)
<h4 align="center">Figure 10: Kitti File Structure</h4> 

For the COCO dataset, we have already curated all of the label files.  You can see a sample KITTI file below.  Notice how many of the fields from the official structure are zero and not used.  This is because DetectNet only needs simple 2D bounding boxes; the KITTI format can, however, be used for more complex tasks.

![Kitti File Sample](images/kitti_file.png)
<h4 align="center">Figure 11: Kitti File Sample</h4> 
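
As a rough illustration of reading such a file, the sketch below parses one label line and keeps only the class name and the 2D bounding box. It assumes the standard 15-field KITTI object layout (type, truncated, occluded, alpha, four bounding box values, then seven 3D fields); the `KittiObject` tuple, the function name, and the sample values are made up for this example and are not taken from the lab dataset.

```python
from collections import namedtuple

# Only the fields relevant to 2D detection are kept; the 3D fields are ignored.
KittiObject = namedtuple("KittiObject", "label truncated occluded alpha x1 y1 x2 y2")


def parse_kitti_line(line):
    """Parse one KITTI label line, keeping the class name and 2D bounding box."""
    fields = line.split()
    if len(fields) < 15:
        raise ValueError("expected at least 15 fields, got %d" % len(fields))
    return KittiObject(
        label=fields[0],
        truncated=float(fields[1]),
        occluded=int(float(fields[2])),
        alpha=float(fields[3]),
        x1=float(fields[4]),  # left edge in pixels
        y1=float(fields[5]),  # top edge in pixels
        x2=float(fields[6]),  # right edge in pixels
        y2=float(fields[7]),  # bottom edge in pixels
    )


# Illustrative line with made-up values; unused fields are left at zero.
sample = "bottle 0.0 0 0.0 100.0 50.0 140.0 170.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
obj = parse_kitti_line(sample)
print(obj.label, obj.x2 - obj.x1, obj.y2 - obj.y1)
```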

Now we will look at how to train DetectNet on this dataset.  A complete training run of DetectNet on this dataset takes several hours, so we have provided a trained model to experiment with.  Return to the main DIGITS screen and open the Models tab.  Open the "mscoco_bottle" model and clone it.  Make the following changes:

* select your newly created "coco_detectnet" dataset
* change the number of training epochs to 3  
* change the batch size to 10

Feel free to explore the network architecture visually by clicking the "Visualize" button.  

When you're ready to train, give the model a new name such as "mscoco_bottle_2" and click "Create".  Training this model for just 3 epochs will still take several minutes, but you should already see both the coverage and bounding box loss values decreasing for training and validation.  You will also see the mean Average Precision (mAP) score begin to rise.  mAP is a combined measure of how well the network detects the objects and how accurate its bounding box estimates are on the validation dataset.
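
Bounding box accuracy is commonly measured with intersection over union (IoU) between a predicted box and a ground-truth box. The sketch below shows that overlap measure only; the exact matching rule and threshold that DIGITS uses when computing mAP are not covered here, and the function name `iou` is an assumption.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A perfectly aligned prediction scores 1.0, a prediction that misses the object entirely scores 0.0, and a tighter box scores closer to 1.0.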

Once the model has finished training, return to the pre-trained "mscoco_bottle" model.  You will see that after 100 training epochs this model has not only converged to low training and validation loss values, but also reached a high mAP score.  Let's test this trained model against a validation image to see if it can find bottles.

Set the visualization method to "Bounding boxes" and paste in the following image path:  `/home/drcrook/db/coco/bottle/val/images/000000487333.jpg`.  Be sure to select the "Show visualizations and statistics" checkbox and then click "Test One".  You should see that DetectNet successfully detects the bottle and draws a bounding box, like this:

![detectnet success](images/detectnet_success.png)

Feel free to test other images from the `/home/drcrook/db/coco/bottle/val/images/` folder.  You will see that DetectNet is able to accurately detect most bottles with a tightly drawn bounding box and has a very low false alarm rate.  Furthermore, inference is extremely fast with DetectNet.