# RTML Lab 04: YOLO

In this lab, we'll explore a fascinating use of image classification deep neural networks to perform a different
task: object detection.

## Object Detection

If you could go back in time to the 1990s, you would find no cameras that could find faces in a photograph, and no
researcher with a way to count dogs in a video in real time. Everyone had to count the dogs manually.
Times were very tough.

The Holy Grail of computer vision research at the time was real time face detection. If we could find faces
in images fast enough, we could build systems that interact more naturally with human beings. But nobody had
a solution.

Things changed when Viola and Jones introduced the first real time face detector, the Haar-like cascade, in 2001.
This technique swept a detection window over the input image at multiple sizes, and subjected each local patch
to a cascade of simple rough classifiers. Each patch that made it to the end of the cascade of classifiers was
treated as a positive detection. After a set of candidate patches was identified, a cleanup
stage clustered neighboring detections into isolated detections.

This method and one cousin, the HOG detector, which was slower but a little more accurate, dominated during the 2000s
and on into the 2010s. These methods worked well enough when trained carefully on the specific environment they were
used in, but usually couldn't be transferred to a new environment.

With the introduction of AlexNet and the amazing advances in image classification, we could follow the direction
of R-CNN, to use a region proposal algorithm followed by a deep learning classifier to do object detection VERY slowly
but much more accurately than the old real time methods.

## What is YOLO?

However, it wasn't until YOLO that we had a deep learning model for object detection that could run in real time.
It took some clever insight to realize that everything, from feature extraction to bounding box estimation, could
actually be done in a single model that could be trained end-to-end to detect objects.

YOLO (You Only Look Once) uses only convolutional layers. This makes it a "fully convolutional network" or FCN.

YOLOv3 has 75 convolutional layers, with skip connections and upsampling layers. No pooling is used; instead, convolutional
layers with stride 2 do the downsampling, which avoids the loss of low-level features that pooling would cause.

Normally, the output of a convolutional layer is a feature map, which is then passed to later stages that make the predictions.
However, the innovation of YOLO was to use the feature map directly to predict bounding boxes and, for each bounding box, to
predict whether or not an object is at the center of the bounding box. Finally, a classifier is used for each bounding box
to indicate the content of the bounding box.

## YOLO v3 from "scratch"

Here we'll experiment with building up the YOLO v3 model in PyTorch. However, we won't train it ourselves; we'll
grab the weights from the original Darknet model by Joseph Redmon and friends.

This tutorial is based on https://github.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch.

A blog by the author: https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/

### Ground Truth Bounding Boxes

Here is how we present example images and corresponding object bounding boxes to the model.

The input image is divided into grid cells. The number of cells depends on the number of downsampling convolutional layers
and the stride of each of those layers. For example, if we use a 416$\times$416 input image size,
and we apply 5 conv layers with a stride of 2 each (for a total downsampling factor of $2^5 = 32$), we end up with a 13$\times$13
feature map, each cell of which corresponds to a region of 32$\times$32 pixels in the original image.
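
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch; the helper name `grid_size` is made up for illustration and is not part of the tutorial code.

```python
def grid_size(input_size, strides):
    """Return (cells per side, pixels per cell) after applying the given strides."""
    factor = 1
    for s in strides:
        factor *= s  # each strided layer multiplies the total downsampling factor
    if input_size % factor != 0:
        raise ValueError("input size must be a multiple of the downsampling factor")
    return input_size // factor, factor


cells, cell_px = grid_size(416, [2, 2, 2, 2, 2])
print(cells, cell_px)  # 13 32
```

The same helper shows why the input size should be a multiple of 32: otherwise the grid would not tile the image evenly.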

A ground truth box has a center (x and y position), a width, and a height. Normally the ground truth boxes would be
provided by a human annotator.

Each ground truth box's center must lie in some grid cell in the original image. Consider this example from the YOLO paper:

<img src="img/yolo05.png" title="GroundTruthBox" style="width: 400px;" />

The grid is represented by the black lines. The ground truth bounding box for the object is the yellow rectangle. The center
of this bounding box happens to be within the red-outlined grid cell.

The grid cell containing the center of a ground truth bounding box is given the responsibility during training to try to predict
the presence of the object.

In order to indicate the presence of the given object, the model outputs several parameters for a given candidate object:
 - $(t_x, t_y, t_w, t_h)$ indicate the box's location and size. During training, the targets for these outputs are the actual ground truth box parameters.
 - $p_o$ is an "objectness" score that indicates the likelihood that an object exists in the given bounding box. This output uses a sigmoid function.
   During training, the target for $p_o$ is set to 1 for the center grid cell (the red grid cell), and it is set to 0 for the neighboring grid cells.
 - $(p_1, p_2, \ldots, p_n)$ are class confidence scores. They indicate the probability of the detected object belonging to a particular class. During training,
   the targets are set to 1 for the ground truth object class and 0 for the other classes.

### Anchor Boxes

One problem that would occur in YOLO if you tried to directly learn the parameters mentioned above is unstable gradients during training.
In a way that is loosely analogous to how a residual block begins with an identity map and learns differences from identity, YOLO uses the idea of
anchor boxes introduced in Faster R-CNN. Instead of predicting $(t_x, t_y, t_w, t_h)$ directly, we predict how those parameters are *different from
the parameters of a typical bounding box, an anchor box*.
YOLO uses three anchor boxes per cell. At training time, once a ground truth bounding box's center is mapped to a grid cell, we find which of the anchors for that cell has the highest
IoU with the ground truth box; that anchor becomes responsible for predicting the object.
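
As a rough illustration of the matching step, here is a minimal sketch that picks the anchor with the highest IoU. It assumes the comparison is made on shape alone, treating the anchor and the ground truth box as if they shared the same center; the function names are hypothetical, and the anchor sizes are the three largest ones listed in `yolov3.cfg`.

```python
def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, given as (width, height)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(gt_wh, anchors):
    """Index of the anchor whose shape overlaps the ground truth box the most."""
    ious = [shape_iou(gt_wh, a) for a in anchors]
    return max(range(len(anchors)), key=lambda i: ious[i])


# Anchor sizes in pixels, taken from the anchors line of yolov3.cfg.
anchors = [(116, 90), (156, 198), (373, 326)]
print(best_anchor((150, 200), anchors))  # 1
```

A 150$\times$200 ground truth box is closest in shape to the 156$\times$198 anchor, so that anchor is the one selected.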

### So What Does YOLO Actually Predict?

First, let's understand that all predictions are relative to the grid cell. YOLO predicts the following:
- Offsets $(t_x, t_y)$ are specified relative to the top left corner of the grid cell, as a ratio between 0 and 1, using a sigmoid to limit the values.
- Width and height $(t_w, t_h)$ are specified relative to the dimensions of an anchor box.

Thus, YOLO does not predict absolute coordinates -- it predicts values that can then be used to compute the box's position and size in absolute coordinates.
This diagram gives the idea. We see that the absolute x coordinate of the box center, $b_x$, is the x coordinate $c_x$ of the grid cell's top left corner plus $\sigma(t_x)$ times the grid cell width. Similarly for $b_y$.
The absolute width of the predicted bounding box, $b_w$, is the width of the anchor box times $e^{t_w}$. Similarly for the height.

<img src="img/yolo06.png" title="GroundTruthBox" style="width: 640px;" />
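
The decoding step described above can be sketched in plain Python. This is an illustrative sketch rather than the code in `util.py`; the function name `decode_box` and its argument layout are assumptions, with the cell given by its column and row index and `stride` being the cell size in pixels.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell_col, cell_row, stride, anchor_wh):
    """Convert raw outputs (t_x, t_y, t_w, t_h) into a box in input-image pixels."""
    t_x, t_y, t_w, t_h = t
    # Center: top left corner of the cell plus a sigmoid offset, scaled to pixels.
    b_x = (cell_col + sigmoid(t_x)) * stride
    b_y = (cell_row + sigmoid(t_y)) * stride
    # Size: anchor dimensions scaled by an exponential factor.
    b_w = anchor_wh[0] * math.exp(t_w)
    b_h = anchor_wh[1] * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# All-zero outputs give the center of cell (6, 6) and exactly the anchor size.
print(decode_box((0.0, 0.0, 0.0, 0.0), 6, 6, 32, (116, 90)))
# (208.0, 208.0, 116.0, 90.0)
```

Because the sigmoid keeps the offset between 0 and 1, the predicted center can never leave the cell that is responsible for it.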

### Multi-scale prediction

Rather than a single grid size and grid cell size,
YOLOv3 detects objects at multiple sizes with downsampling factors of 32, 16, and 8. The largest objects are detected at the
first, coarsest scale, whereas mid-sized objects are detected at the intermediate scale, and small objects are detected at the finest
scale. The example below shows the three grid sizes relative to the image and an object:

<img src="img/yolo_Scales.png" title="GroundTruthBox" style="width: 640px;" />

### Preparation for Building YOLO in PyTorch

First of all, we will need OpenCV:

    pip3 install --upgrade pip
    pip install matplotlib opencv-python

Create a directory where the code for the detector will live.

In that directory, download util.py and darknet.py from https://github.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch.

In Jupyter you can download thusly:

In [1]:
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/darknet.py
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/util.py

--2021-02-06 00:55:49--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/darknet.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11533 (11K) [text/plain]
Saving to: ‘darknet.py’


2021-02-06 00:55:50 (80.7 MB/s) - ‘darknet.py’ saved [11533/11533]

--2021-02-06 00:55:50--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/util.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7432 (7.3K) [text/plain]
Saving to: ‘util.py’


2021-02-06 00:55:51 (4.99 MB/s) - ‘util.py’ saved [7432/7432]



### Take a Look at the YOLO Darknet Configuration File

Next, let's download the `yolov3.cfg` configuration file and take a look.

In [2]:
!mkdir -p cfg
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/cfg/yolov3.cfg
!mv yolov3.cfg cfg/yolov3.cfg

--2021-02-06 00:55:57--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/cfg/yolov3.cfg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8346 (8.2K) [text/plain]
Saving to: ‘yolov3.cfg’


2021-02-06 00:55:58 (23.4 MB/s) - ‘yolov3.cfg’ saved [8346/8346]



The configuration file looks like this:

```python
[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=16
width= 416

height = 416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

...

[shortcut]
from=-3
activation=linear

...

[yolo]
mask = 6,7,8
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 61

...

```

### Overview of the Configuration Blocks

The configuration blocks fall into a few categories:

 - Net: the global configuration at the top of the configuration file. It declares the size of input images, batch size, learning rate, and so on.
 
      ```python
        batch=64
        subdivisions=16
        width=608
        height=608
        channels=3
        momentum=0.9
        decay=0.0005
        angle=0
        saturation = 1.5
        exposure = 1.5
        hue=.1
        
      ```

 - Convolutional: convolutional layer. Note that this specification is a little more powerful than the PyTorch way of doing things, as options
   for batch normalization and the activation function (here leaky ReLU) are built in.
 
      ```python
        [convolutional]
        batch_normalize=1
        filters=32
        size=3
        stride=1
        pad=1
        activation=leaky
        
      ```

 - Shortcut: skip connections that implement residual blocks. `from=-3` means to add the feature maps output by the previous layer to those output by the layer three layers
   back. `activation=linear` means that no nonlinearity is applied to the sum (the identity activation).
  
      ```python
        [shortcut]
        from=-3           # Connect the layer three layers back to here.
        activation=linear
        
      ```
      
 - Upsample: bilinear upsampling of the previous layer's feature map by a factor of `stride`.

      ```python
        [upsample]
        stride=2

      ```
      
 - Route: The route layer deserves a bit of explanation. It has an attribute `layers`, which can have either one or two values.
 
      ```python
          [route]
          layers = -4

          [route]
          layers = -1, 61
        
      ```

   When the `layers` attribute has only one value, the route layer outputs the feature maps of the layer indexed by that value. In our example, it is -4, so the layer will output
   the feature maps from the layer four positions back from the route layer.

   When `layers` has two values, the route layer returns the concatenated feature maps of the layers indexed by its values. In our example it is -1, 61, so the layer will output
   the feature maps from the previous layer (-1) and from layer 61, concatenated along the channels (depth) dimension.
   
 - Yolo:
 
     ```python
          [yolo]
          mask = 0,1,2
          anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
          classes=80
          num=9
          jitter=.3
          ignore_thresh = .5
          truth_thresh = 1
          random=1

     ```
   Here we have a few important attributes:
     - anchors: describes the anchor boxes as (width, height) pairs. The model contains 9 anchors, but only those listed in the `mask` are used by a given yolo layer.

     - mask: the indices of the anchors that are used in this yolo layer
     
     - classes: number of object classes
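
To make the relationship between `anchors` and `mask` concrete, here is a small sketch that parses the two strings from a `[yolo]` block and returns the anchors that the layer actually uses. The helper name `select_anchors` is hypothetical; it is not how `darknet.py` names things.

```python
def select_anchors(anchors_str, mask_str):
    """Parse the `anchors` and `mask` strings of a [yolo] block."""
    values = [int(v) for v in anchors_str.split(",")]
    # Group the flat list into (width, height) pairs.
    pairs = [(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    mask = [int(m) for m in mask_str.split(",")]
    return [pairs[m] for m in mask]


anchors = "10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326"
print(select_anchors(anchors, "0,1,2"))  # [(10, 13), (16, 30), (33, 23)]
print(select_anchors(anchors, "6,7,8"))  # [(116, 90), (156, 198), (373, 326)]
```

Each yolo layer therefore sees the full list of nine anchors but works with only three of them.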


### Create a network from the config file

Go to `darknet.py`, and look at the `parse_cfg` function. The function reads the configuration file and stores each block as a dictionary in a list.

<img src="img/configfunc.JPG" title="configfunc" style="width: 640px;" />
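
For readers who want the idea without opening the file, the following is a simplified sketch of this kind of parser. It is not the code in `darknet.py`: the name `parse_cfg_text` is made up, and it takes the configuration as a string rather than a file path so that the example is self-contained.

```python
def parse_cfg_text(text):
    """Parse Darknet-style config text into a list of dicts, one per block."""
    blocks = []
    block = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            if block:
                blocks.append(block)  # a new header closes the previous block
            block = {"type": line[1:-1].strip()}
        else:
            key, value = line.split("=", 1)
            block[key.strip()] = value.strip()
    if block:
        blocks.append(block)
    return blocks


sample = """
[net]
# Testing
batch=1
width= 416

[convolutional]
filters=32
activation=leaky
"""
blocks = parse_cfg_text(sample)
print(blocks[0])             # {'type': 'net', 'batch': '1', 'width': '416'}
print(blocks[1]["filters"])  # 32
```

Note that all values are kept as strings, so callers have to convert them with `int()` or `float()` as needed.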

Then create the building blocks using the `create_modules` function. Take a look at it to understand how each block type is turned into a PyTorch module.

### Convolutional block

<img src="img/convolutionalblock.JPG" title="covolutionalblock" style="width: 600px;" />
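
As a point of comparison, a convolutional block of this kind can be sketched in PyTorch as follows. This is a simplified stand-in, not the code from `darknet.py`: the function name `conv_block` is hypothetical, and it assumes that `pad=1` means padding of `(size - 1) // 2` and that the activation is always leaky ReLU with slope 0.1.

```python
import torch
import torch.nn as nn


def conv_block(in_channels, filters, size, stride, batch_normalize=True):
    """Conv2d, optionally followed by BatchNorm2d, then LeakyReLU(0.1)."""
    pad = (size - 1) // 2  # assumed meaning of pad=1 in the cfg
    # The convolution needs no bias when batch normalization follows it.
    layers = [nn.Conv2d(in_channels, filters, size, stride, pad, bias=not batch_normalize)]
    if batch_normalize:
        layers.append(nn.BatchNorm2d(filters))
    layers.append(nn.LeakyReLU(0.1, inplace=True))
    return nn.Sequential(*layers)


block = conv_block(3, 32, size=3, stride=1)
down = conv_block(32, 64, size=3, stride=2)
x = torch.zeros(1, 3, 416, 416)
print(block(x).shape)        # torch.Size([1, 32, 416, 416])
print(down(block(x)).shape)  # torch.Size([1, 64, 208, 208])
```

The stride-2 block halves the spatial resolution, which is how the network downsamples without pooling.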

### Shortcut block

<img src="img/shortcutblock.JPG" title="shortcutblock" style="width: 600px;" />
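
The addition performed for a shortcut block can be illustrated with a toy cache of layer outputs. The tensors below are made-up placeholders; only the indexing pattern is the point.

```python
import torch

# Hypothetical cache of layer outputs, keyed by layer index.
outputs = {
    0: torch.full((1, 8, 4, 4), 1.0),
    1: torch.full((1, 8, 4, 4), 2.0),
    2: torch.full((1, 8, 4, 4), 3.0),
}

i = 3       # index of the shortcut layer itself
from_ = -3  # value of `from` in the cfg block
x = outputs[i - 1] + outputs[i + from_]  # previous layer + layer three back
print(x.shape, x[0, 0, 0, 0].item())  # torch.Size([1, 8, 4, 4]) 4.0
```

The two feature maps must have identical shapes, since the sum is elementwise.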

### Upsample block

<img src="img/upsampleblock.JPG" title="upsampleblock" style="width: 600px;" />
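
A minimal sketch of an upsample block in PyTorch is shown below. The interpolation mode is an assumption that follows the description earlier in this lab; other implementations may use nearest-neighbor interpolation instead.

```python
import torch
import torch.nn as nn

# stride=2 in the cfg corresponds to scale_factor=2 here.
up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
x = torch.zeros(1, 256, 13, 13)
print(up(x).shape)  # torch.Size([1, 256, 26, 26])
```

Upsampling a 13$\times$13 map to 26$\times$26 lets it be concatenated with an earlier 26$\times$26 feature map.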

### Route block

Why does it use an empty layer? The actual work will be done in the `forward()` function.

<img src="img/routeblock.JPG" title="routeblock" style="width: 600px;" />
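
The idea can be sketched as follows: a placeholder module keeps the module list aligned with the configuration blocks, while the routing itself happens on a cache of earlier outputs. The `route` helper and the toy `outputs` dictionary are hypothetical names used only for this illustration.

```python
import torch
import torch.nn as nn


class EmptyLayer(nn.Module):
    """Placeholder so that module indices line up with cfg block indices."""


def route(outputs, i, layers):
    """Return the routed feature map for layer i; `layers` holds 1 or 2 indices."""
    # Positive values are absolute layer indices; convert them to relative ones.
    rel = [l - i if l > 0 else l for l in layers]
    maps = [outputs[i + r] for r in rel]
    return maps[0] if len(maps) == 1 else torch.cat(maps, dim=1)


# Toy cache: layer k outputs 16 * (k + 1) channels on a 26x26 grid.
outputs = {k: torch.zeros(1, 16 * (k + 1), 26, 26) for k in range(4)}
print(route(outputs, 4, [-4]).shape)     # torch.Size([1, 16, 26, 26])
print(route(outputs, 4, [-1, 1]).shape)  # torch.Size([1, 96, 26, 26])
```

In the second call, the 64 channels of layer 3 and the 32 channels of layer 1 are concatenated into 96 channels.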

### YOLO block

<img src="img/yoloblock.JPG" title="yoloblock" style="width: 600px;" />

### Using the code

OK, let's try it out. Depending on what you already have installed, you may need to run

    # apt install libgl1-mesa-glx

for the next step to run.

In [3]:
import darknet

blocks = darknet.parse_cfg("cfg/yolov3.cfg")
print(darknet.create_modules(blocks))

({'type': 'net', 'batch': '1', 'subdivisions': '1', 'width': '416', 'height': '416', 'channels': '3', 'momentum': '0.9', 'decay': '0.0005', 'angle': '0', 'saturation': '1.5', 'exposure': '1.5', 'hue': '.1', 'learning_rate': '0.001', 'burn_in': '1000', 'max_batches': '500200', 'policy': 'steps', 'steps': '400000,450000', 'scales': '.1,.1'}, ModuleList(
  (0): Sequential(
    (conv_0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (batch_norm_0): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (leaky_0): LeakyReLU(negative_slope=0.1, inplace=True)
  )
  (1): Sequential(
    (conv_1): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (batch_norm_1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (leaky_1): LeakyReLU(negative_slope=0.1, inplace=True)
  )
  (2): Sequential(
    (conv_2): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
    

### Darknet class

Let's make our own version of the `Darknet` class in `darknet.py`.

The class has two main functions:

1. `forward()`: forward propagation, following the instructions in the dictionary modules
2. `load_weights()`: load a set of pretrained weights into the network


In [4]:
import numpy as np
import torch
import torch.nn as nn

import darknet
from util import *

class MyDarknet(nn.Module):
    def __init__(self, cfgfile):
        super(MyDarknet, self).__init__()
        # load the config file and create our model
        self.blocks = darknet.parse_cfg(cfgfile)
        self.net_info, self.module_list = darknet.create_modules(self.blocks)
        
    def forward(self, x, CUDA:bool):
        modules = self.blocks[1:]
        outputs = {}   #We cache the outputs for the route layer
        
        write = 0
        # run forward propagation. Follow the instruction from dictionary modules
        for i, module in enumerate(modules):        
            module_type = (module["type"])
            
            if module_type == "convolutional" or module_type == "upsample":
                # do convolutional network
                x = self.module_list[i](x)
    
            elif module_type == "route":
                # concat layers
                layers = module["layers"]
                layers = [int(a) for a in layers]
    
                if (layers[0]) > 0:
                    layers[0] = layers[0] - i
    
                if len(layers) == 1:
                    x = outputs[i + (layers[0])]
    
                else:
                    if (layers[1]) > 0:
                        layers[1] = layers[1] - i
    
                    map1 = outputs[i + layers[0]]
                    map2 = outputs[i + layers[1]]
                    x = torch.cat((map1, map2), 1)
                
    
            elif  module_type == "shortcut":
                from_ = int(module["from"])
                # residual network
                x = outputs[i-1] + outputs[i+from_]
    
            elif module_type == 'yolo':        
                anchors = self.module_list[i][0].anchors
                #Get the input dimensions
                inp_dim = int (self.net_info["height"])
        
                #Get the number of classes
                num_classes = int (module["classes"])
        
                #Transform 
                x = x.data
                # predict_transform is in util.py
                x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
                if not write:              #if no collector has been intialised. 
                    detections = x
                    write = 1
        
                else:       
                    detections = torch.cat((detections, x), 1)
        
            outputs[i] = x
        
        return detections


    def load_weights(self, weightfile):
        '''
        Load pretrained weight
        '''
        #Open the weights file (closed automatically when the block ends)
        with open(weightfile, "rb") as fp:
            #The first 5 values are header information 
            # 1. Major version number
            # 2. Minor Version Number
            # 3. Subversion number 
            # 4,5. Images seen by the network (during training)
            header = np.fromfile(fp, dtype = np.int32, count = 5)
            self.header = torch.from_numpy(header)
            self.seen = self.header[3]   
        
            #The rest of the file is the weights as a flat float32 array
            weights = np.fromfile(fp, dtype = np.float32)
        
        ptr = 0
        for i in range(len(self.module_list)):
            module_type = self.blocks[i + 1]["type"]
    
            #If module_type is convolutional load weights
            #Otherwise ignore.
            
            if module_type == "convolutional":
                model = self.module_list[i]
                #Blocks without batch norm have no batch_normalize key
                batch_normalize = int(self.blocks[i+1].get("batch_normalize", 0))
            
                conv = model[0]
                
                
                if (batch_normalize):
                    bn = model[1]
        
                    #Get the number of weights of Batch Norm Layer
                    num_bn_biases = bn.bias.numel()
        
                    #Load the weights
                    bn_biases = torch.from_numpy(weights[ptr:ptr + num_bn_biases])
                    ptr += num_bn_biases
        
                    bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases
        
                    bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases
        
                    bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases
        
                    #Cast the loaded weights into dims of model weights. 
                    bn_biases = bn_biases.view_as(bn.bias.data)
                    bn_weights = bn_weights.view_as(bn.weight.data)
                    bn_running_mean = bn_running_mean.view_as(bn.running_mean)
                    bn_running_var = bn_running_var.view_as(bn.running_var)
        
                    #Copy the data to model
                    bn.bias.data.copy_(bn_biases)
                    bn.weight.data.copy_(bn_weights)
                    bn.running_mean.copy_(bn_running_mean)
                    bn.running_var.copy_(bn_running_var)
                
                else:
                    #Number of biases
                    num_biases = conv.bias.numel()
                
                    #Load the weights
                    conv_biases = torch.from_numpy(weights[ptr: ptr + num_biases])
                    ptr = ptr + num_biases
                
                    #reshape the loaded weights according to the dims of the model weights
                    conv_biases = conv_biases.view_as(conv.bias.data)
                
                    #Finally copy the data
                    conv.bias.data.copy_(conv_biases)
                    
                #Let us load the weights for the Convolutional layers
                num_weights = conv.weight.numel()
                
                #Do the same as above for weights
                conv_weights = torch.from_numpy(weights[ptr:ptr+num_weights])
                ptr = ptr + num_weights
                
                conv_weights = conv_weights.view_as(conv.weight.data)
                conv.weight.data.copy_(conv_weights)


### Test Forward Propagation

Let's propagate a single image through the network and see what we get.

In [5]:
!wget https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png

--2021-02-06 00:58:21--  https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png
Resolving github.com (github.com)... 13.250.177.223
Connecting to github.com (github.com)|13.250.177.223|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/master/dog-cycle-car.png [following]
--2021-02-06 00:58:21--  https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/master/dog-cycle-car.png
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347445 (339K) [image/png]
Saving to: ‘dog-cycle-car.png’


2021-02-06 00:58:22 (1.84 MB/s) - ‘dog-cycle-car.png’ saved [347445/347445]



Here's code to load the image into memory and push it through the model:

In [6]:
import cv2
import numpy as np
import torch

def get_test_input():
    img = cv2.imread("dog-cycle-car.png")
    img = cv2.resize(img, (416,416))          # Resize to the input dimension
    img_ = img[:,:,::-1].transpose((2,0,1))   # BGR -> RGB | H x W x C -> C x H x W
    img_ = img_[np.newaxis,:,:,:]/255.0       # Add a batch dimension at axis 0 | Normalise
    img_ = torch.from_numpy(img_).float()     # Convert to a float tensor
    return img_

Go ahead and try it (noting that the pretrained weights haven't been loaded yet, so we don't expect any correct result):

In [7]:
from util import *

model = MyDarknet("cfg/yolov3.cfg")
inp = get_test_input()
pred = model(inp, False)
print (pred)

tensor([[[1.4784e+01, 1.7460e+01, 1.1424e+02,  ..., 5.5892e-01,
          3.9877e-01, 5.6024e-01],
         [2.0277e+01, 1.4719e+01, 1.1301e+02,  ..., 5.2226e-01,
          5.0744e-01, 4.7786e-01],
         [1.7056e+01, 1.5934e+01, 5.1666e+02,  ..., 5.4858e-01,
          4.3707e-01, 4.7734e-01],
         ...,
         [4.1254e+02, 4.1234e+02, 9.5190e+00,  ..., 5.2298e-01,
          5.2594e-01, 4.1804e-01],
         [4.1249e+02, 4.1238e+02, 2.5798e+01,  ..., 5.1811e-01,
          4.4765e-01, 4.2939e-01],
         [4.1262e+02, 4.1208e+02, 3.8786e+01,  ..., 6.1843e-01,
          5.2593e-01, 4.4060e-01]]])


### Understanding the output result

The prediction returned by the model is a tensor of shape $B \times \big((13\cdot 13 + 26\cdot 26 + 52 \cdot 52)\cdot 3\big) \times 85$. Why? We have
- $B$: the number of images in the batch
- $13\cdot 13$: number of elements (grid cells) in the coarsest feature map
- $26\cdot 26$: number of elements (grid cells) in the medium scale feature map
- $52\cdot 52$: number of elements (grid cells) in the finest scale feature map
- $3$: the number of anchor boxes per grid cell
- $85$: number of bounding box attributes (4 for bounding box, 1 for objectness, 80 for the COCO classes)
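
The counts in the list above can be verified with a short calculation. This is just a sketch; the function name `num_predictions` is made up for illustration.

```python
def num_predictions(input_size=416, strides=(32, 16, 8), anchors_per_cell=3):
    """Total number of boxes predicted per image across all scales."""
    return sum((input_size // s) ** 2 * anchors_per_cell for s in strides)


num_classes = 80
attrs = 4 + 1 + num_classes  # box, objectness, class scores
print(num_predictions(), attrs)  # 10647 85
```

So for a single 416$\times$416 image, the model outputs 10647 candidate boxes with 85 values each.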

### Download a pretrained weight file

Darknet stores weights as in this diagram:

<img src="img/weights.png" title="weight" style="width: 600px;" />
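
To get a feel for how such a file is read, here is a small sketch that writes a tiny stand-in file and reads it back the way `load_weights()` does: five `int32` header values first, then a flat `float32` array that is walked with a pointer. The file contents are made up for the demonstration and have nothing to do with the real `yolov3.weights`.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.weights")

    # Stand-in file: 5 int32 header values, then 6 float32 "weights".
    with open(path, "wb") as fp:
        np.array([0, 2, 0, 1000, 0], dtype=np.int32).tofile(fp)
        np.arange(6, dtype=np.float32).tofile(fp)

    with open(path, "rb") as fp:
        header = np.fromfile(fp, dtype=np.int32, count=5)
        weights = np.fromfile(fp, dtype=np.float32)

# Walk the flat array with a pointer: first 2 biases, then a 2x2 kernel.
ptr = 0
biases = weights[ptr:ptr + 2]
ptr += 2
kernel = weights[ptr:ptr + 4].reshape(2, 2)
ptr += 4

print(header.tolist())            # [0, 2, 0, 1000, 0]
print(biases.tolist())            # [0.0, 1.0]
print(kernel.shape, ptr == weights.size)  # (2, 2) True
```

Since the file stores no shapes, the reader must know the network architecture in order to slice the array correctly; a mismatch silently shifts every subsequent weight.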

In [8]:
!wget https://pjreddie.com/media/files/yolov3.weights

--2021-02-06 01:05:23--  https://pjreddie.com/media/files/yolov3.weights
Resolving pjreddie.com (pjreddie.com)... 128.208.4.108
Connecting to pjreddie.com (pjreddie.com)|128.208.4.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 248007048 (237M) [application/octet-stream]
Saving to: ‘yolov3.weights’


2021-02-06 01:08:05 (1.47 MB/s) - ‘yolov3.weights’ saved [248007048/248007048]



In [9]:
model.load_weights("yolov3.weights")

### Test with the sample image again

In [10]:
inp = get_test_input()
pred = model(inp, False)
print (pred)

tensor([[[8.5426e+00, 1.9015e+01, 1.1130e+02,  ..., 1.7306e-03,
          1.3874e-03, 9.2985e-04],
         [1.4105e+01, 1.8867e+01, 9.4014e+01,  ..., 5.9501e-04,
          9.2471e-04, 1.3085e-03],
         [2.1125e+01, 1.5269e+01, 3.5793e+02,  ..., 8.3609e-03,
          5.1067e-03, 5.8561e-03],
         ...,
         [4.1268e+02, 4.1069e+02, 3.7157e+00,  ..., 1.7185e-06,
          4.0955e-06, 6.5897e-07],
         [4.1132e+02, 4.1023e+02, 8.0353e+00,  ..., 1.3927e-05,
          3.2252e-05, 1.2076e-05],
         [4.1076e+02, 4.1318e+02, 4.9635e+01,  ..., 4.2174e-06,
          1.0794e-05, 1.8104e-05]]])


### From YOLO output tensor to *true* detections

The prediction tensor contains thousands of candidate boxes, most of which do not correspond to any object. We need to threshold them using the objectness score
output for each bounding box prediction, and then remove duplicate detections. The `write_results` function in `util.py` does just that.

    def write_results(prediction, confidence, num_classes, nms_conf = 0.4)

- prediction: prediction result tensor returned from the YOLO model
- confidence: objectness score threshold to apply to the set of detections
- num_classes: number of classes to expect
- nms_conf: NMS IoU threshold

NMS stands for "non-maximum suppression." The basic idea is that if you have two predicted bounding
boxes that overlap each other significantly, you should throw away the box with the lower confidence
score. Overlap is measured by IoU (Intersection over Union), which is just the ratio of the area
of intersection of the two regions to the area of the union of the two regions:

$$ IoU(R_1,R_2) = \frac{|R_1 \cap R_2|}{|R_1 \cup R_2|}. $$

The default of 0.4 means if the intersection is 40% or more of the union, the two bounding boxes
are overlapping enough that only one of the detections should survive.
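
The following is a minimal pure-Python sketch of IoU and greedy NMS, written to illustrate the idea rather than to reproduce `write_results`. It assumes boxes are given as `(x1, y1, x2, y2)` corners and ignores class labels; the function names and the toy boxes are made up.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.4):
    """Greedy NMS: return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) < iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes have an IoU of about 0.68, so the lower-scoring one is suppressed, while the third box overlaps nothing and survives.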

In [11]:
write_results(pred, 0.5, 80, nms_conf = 0.4)

tensor([[  0.0000,  61.5403, 100.8597, 307.2717, 303.1132,   0.9469,   0.9985,
           1.0000],
        [  0.0000, 253.8483,  66.1096, 378.0396, 118.0089,   0.9992,   0.8164,
           7.0000],
        [  0.0000,  71.0337, 163.2243, 175.7471, 382.2702,   0.9999,   0.9936,
          16.0000]])

### Show the resulting detections on top of an image

The model was trained on the COCO dataset, so download the class label file `coco.names`:

In [12]:
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/data/coco.names
!mkdir -p data
!mv coco.names data/coco.names

--2021-02-06 01:49:43--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/data/coco.names
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 625 [text/plain]
Saving to: ‘coco.names’


2021-02-06 01:49:44 (44.2 MB/s) - ‘coco.names’ saved [625/625]



In [13]:
def load_classes(namesfile):
    # Read one class name per line, skipping blank lines
    with open(namesfile, "r") as fp:
        names = [line.strip() for line in fp if line.strip()]
    return names

In [14]:
num_classes = 80
classes = load_classes("data/coco.names")
print(classes)

['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']


So we see that the three surviving bounding boxes above, whose class indices (the last column) are 1, 7, and 16, indicate a bicycle, a truck, and a dog.
Let's draw the detections on top of the input image for better visualization.
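
Before drawing anything, the mapping from detection rows to labels can be spelled out in a few lines. The rows below are copied (rounded) from the `write_results` output shown earlier; treating columns 1 to 4 as the box corners is an assumption based on how the values look, while the last column is the class index.

```python
# Rows copied from the write_results output (values rounded).
detections = [
    [0.0, 61.5, 100.9, 307.3, 303.1, 0.9469, 0.9985, 1.0],
    [0.0, 253.8, 66.1, 378.0, 118.0, 0.9992, 0.8164, 7.0],
    [0.0, 71.0, 163.2, 175.7, 382.3, 0.9999, 0.9936, 16.0],
]
# Only the first 17 COCO names are needed for this example.
classes = ['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train',
           'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
           'parking meter', 'bench', 'bird', 'cat', 'dog']

labels = [classes[int(det[-1])] for det in detections]
for label, det in zip(labels, detections):
    x1, y1, x2, y2 = det[1:5]
    print(f"{label}: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
```

The coordinates are in the 416$\times$416 network input frame, so they have to be rescaled before being drawn on the original image.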

We'll use some code based on Kathuria's `detect.py`. You can download the original as follows:

In [15]:
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/detect.py

--2021-02-06 01:53:13--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/detect.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7273 (7.1K) [text/plain]
Saving to: ‘detect.py’


2021-02-06 01:53:13 (33.2 MB/s) - ‘detect.py’ saved [7273/7273]



Here's our version. It will process the images in subdirectory `cocoimages` so let's make it and put our sample there:

In [19]:
!mkdir -p cocoimages
!cp dog-cycle-car.png cocoimages/

In [20]:
from __future__ import division
import time
import torch 
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
import cv2 
from util import *
import argparse
import os 
import os.path as osp
from darknet import Darknet
import pickle as pkl
import pandas as pd
import random

images = "cocoimages"
batch_size = 4
confidence = 0.5
nms_thesh = 0.4
start = 0
CUDA = torch.cuda.is_available()

num_classes = 80
classes = load_classes("data/coco.names")

#Set up the neural network

print("Loading network.....")
model = MyDarknet("cfg/yolov3.cfg")
model.load_weights("yolov3.weights")
print("Network successfully loaded")

model.net_info["height"] = 416
inp_dim = int(model.net_info["height"])
assert inp_dim % 32 == 0 
assert inp_dim > 32

#If there's a GPU available, put the model on GPU

if CUDA:
    model.cuda()

# Set the model in evaluation mode

model.eval()

read_dir = time.time()

# Detection phase

try:
    imlist = [osp.join(osp.realpath('.'), images, img) for img in os.listdir(images)]
except NotADirectoryError:
    imlist = []
    imlist.append(osp.join(osp.realpath('.'), images))
except FileNotFoundError:
    print ("No file or directory with the name {}".format(images))
    exit()
    
if not os.path.exists("des"):
    os.makedirs("des")

load_batch = time.time()
loaded_ims = [cv2.imread(x) for x in imlist]

im_batches = list(map(prep_image, loaded_ims, [inp_dim for x in range(len(imlist))]))
im_dim_list = [(x.shape[1], x.shape[0]) for x in loaded_ims]
im_dim_list = torch.FloatTensor(im_dim_list).repeat(1,2)


leftover = 0
if (len(im_dim_list) % batch_size):
    leftover = 1

if batch_size != 1:
    num_batches = len(imlist) // batch_size + leftover            
    im_batches = [torch.cat((im_batches[i*batch_size : min((i +  1)*batch_size,
                        len(im_batches))]))  for i in range(num_batches)]  

write = 0

if CUDA:
    im_dim_list = im_dim_list.cuda()
    
start_det_loop = time.time()
for i, batch in enumerate(im_batches):
    # Load the image 
    start = time.time()
    if CUDA:
        batch = batch.cuda()
    with torch.no_grad():
        prediction = model(Variable(batch), CUDA)

    prediction = write_results(prediction, confidence, num_classes, nms_conf = nms_thesh)

    end = time.time()

    if isinstance(prediction, int):

        for im_num, image in enumerate(imlist[i*batch_size: min((i +  1)*batch_size, len(imlist))]):
            im_id = i*batch_size + im_num
            print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size))
            print("{0:20s} {1:s}".format("Objects Detected:", ""))
            print("----------------------------------------------------------")
        continue

    prediction[:,0] += i*batch_size    # transform the attribute from index in batch to index in imlist

    if not write:                      # If we haven't initialised output
        output = prediction  
        write = 1
    else:
        output = torch.cat((output,prediction))

    for im_num, image in enumerate(imlist[i*batch_size: min((i +  1)*batch_size, len(imlist))]):
        im_id = i*batch_size + im_num
        objs = [classes[int(x[-1])] for x in output if int(x[0]) == im_id]
        print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size))
        print("{0:20s} {1:s}".format("Objects Detected:", " ".join(objs)))
        print("----------------------------------------------------------")

    if CUDA:
        torch.cuda.synchronize()       
try:
    output
except NameError:
    print ("No detections were made")
    exit()

im_dim_list = torch.index_select(im_dim_list, 0, output[:,0].long())

scaling_factor = torch.min(inp_dim/im_dim_list,1)[0].view(-1,1)

output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim_list[:,0].view(-1,1))/2
output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim_list[:,1].view(-1,1))/2

output[:,1:5] /= scaling_factor

for i in range(output.shape[0]):
    output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim_list[i,0])
    output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim_list[i,1])
    
output_recast = time.time()
class_load = time.time()
colors = [[255, 0, 0], [255, 0, 0], [255, 255, 0], [0, 255, 0], [0, 255, 255], [0, 0, 255], [255, 0, 255]]

draw = time.time()

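def write(x, results):
    # x is one detection row: image index, x1, y1, x2, y2, ..., class index
    # Convert corners to plain Python ints; recent OpenCV builds reject tensor elements
    c1 = tuple(int(v) for v in x[1:3])
    c2 = tuple(int(v) for v in x[3:5])
    img = results[int(x[0])]
    cls = int(x[-1])
    color = random.choice(colors)
    label = "{0}".format(classes[cls])
    # Bounding box outline
    cv2.rectangle(img, c1, c2, color, 1)
    # Filled rectangle behind the class label, sized to fit the text
    t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1, 1)[0]
    c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4
    cv2.rectangle(img, c1, c2, color, -1)
    cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1)
    return img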


list(map(lambda x: write(x, loaded_ims), output))

det_names = pd.Series(imlist).apply(lambda x: "{}/det_{}".format("des",x.split("/")[-1]))

list(map(cv2.imwrite, det_names, loaded_ims))

end = time.time()

print("SUMMARY")
print("----------------------------------------------------------")
print("{:25s}: {}".format("Task", "Time Taken (in seconds)"))
print()
print("{:25s}: {:2.3f}".format("Reading addresses", load_batch - read_dir))
print("{:25s}: {:2.3f}".format("Loading batch", start_det_loop - load_batch))
print("{:25s}: {:2.3f}".format("Detection (" + str(len(imlist)) +  " images)", output_recast - start_det_loop))
print("{:25s}: {:2.3f}".format("Output Processing", class_load - output_recast))
print("{:25s}: {:2.3f}".format("Drawing Boxes", end - draw))
print("{:25s}: {:2.3f}".format("Average time_per_img", (end - load_batch)/len(imlist)))
print("----------------------------------------------------------")


torch.cuda.empty_cache()

Loading network.....
Network successfully loaded
dog-cycle-car.png    predicted in  0.433 seconds
Objects Detected:    bicycle truck dog
----------------------------------------------------------
SUMMARY
----------------------------------------------------------
Task                     : Time Taken (in seconds)

Reading addresses        : 0.000
Loading batch            : 0.016
Detection (1 images)     : 1.737
Output Processing        : 0.000
Drawing Boxes            : 0.013
Average time_per_img     : 1.766
----------------------------------------------------------


Voilà! You have the YOLO result:

<img src="img/dogresult.png" title="weight" style="width: 600px;" />
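
The step in the script that maps boxes from network coordinates back to the original image is easy to get wrong, so here is a small plain-Python sketch of the same arithmetic for a single box. It assumes `prep_image` letterboxes the input (resizes it preserving aspect ratio and centers it on an `inp_dim`$\times$`inp_dim` canvas), which is what the subtraction and division applied to `output` imply. The function name `letterbox_to_original` is ours and is not part of `util.py`.

```python
def letterbox_to_original(box, inp_dim, im_w, im_h):
    """Map (x1, y1, x2, y2) from the padded square canvas back to the original image.

    Assumption: the image was scaled by a single factor and centered on an
    inp_dim x inp_dim canvas, with padding on the shorter side.
    """
    # Single scale factor that makes the longer side fit inp_dim
    scale = min(inp_dim / im_w, inp_dim / im_h)
    # Padding added on each side of the resized image
    pad_x = (inp_dim - scale * im_w) / 2
    pad_y = (inp_dim - scale * im_h) / 2

    x1, y1, x2, y2 = box
    # Remove the padding offset, then undo the scaling
    x1, x2 = (x1 - pad_x) / scale, (x2 - pad_x) / scale
    y1, y2 = (y1 - pad_y) / scale, (y2 - pad_y) / scale

    # Clamp to the image bounds, as the script does with torch.clamp
    x1, x2 = min(max(x1, 0.0), im_w), min(max(x2, 0.0), im_w)
    y1, y2 = min(max(y1, 0.0), im_h), min(max(y2, 0.0), im_h)
    return x1, y1, x2, y2


# A 768x576 image fills the full width of a 416x416 canvas and is padded by
# 52 pixels at the top and bottom, so this box covers the whole image.
print(letterbox_to_original((0, 52, 416, 364), 416, 768, 576))
# approximately (0.0, 0.0, 768.0, 576.0)
```

Checking one box by hand like this is a quick way to confirm that your preprocessing and your output rescaling agree with each other.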

## Independent exercise: YOLOv4

### Part I: Inference (due next week)

In the lab, we saw how the Darknet configuration file for YOLOv3 could be read in Python and mapped
to PyTorch modules.

For your independent work, do the same thing for YOLOv4. Download the `yolov4.cfg` file
from the [YOLOv4 GitHub repository](https://github.com/AlexeyAB/darknet) and modify your
`MyDarknet` class and utility code (`darknet.py`, `util.py`) as
necessary to map the structures to PyTorch.

The changes you'll have to make:

1. Implement the Mish activation function.
2. Add an option for a maxpool layer in the `create_modules` function and in your model's `forward()` method.
3. Enable a `[route]` module to concatenate more than two previous layers.
4. Load the pre-trained weights [provided by the authors](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights).
5. Scale inputs to 608$\times$608 and make sure you're passing input channels in RGB order, not OpenCV's BGR order.
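
As a starting point for the first item, Mish is defined as $\text{mish}(x) = x \cdot \tanh(\text{softplus}(x))$ with $\text{softplus}(x) = \ln(1 + e^x)$. The sketch below is a scalar reference version in plain Python that you could use to check your own module's outputs; it is not the lab's implementation. In your model you would apply the same formula elementwise to a tensor inside an `nn.Module`. Recent PyTorch releases (1.9 and later, as far as we know) also provide `torch.nn.Mish`.

```python
import math


def softplus(x):
    # Numerically stable ln(1 + exp(x)): avoids overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    # mish(x) = x * tanh(softplus(x))
    return x * math.tanh(softplus(x))


# Reference values for spot-checking a tensor implementation
for value in (-1.0, 0.0, 1.0):
    print(value, round(mish(value), 4))
```

Note that Mish is smooth and slightly negative for negative inputs, unlike the leaky ReLU used throughout YOLOv3.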

### Part II: Training (due in two weeks)

Train the YOLOv4 model on the COCO dataset (or another dataset if you have one available).
The purpose here is not to get the best possible model, which would require implementing all
of the "bag of freebies" training tricks described in the paper. Instead, you will implement
just some of them to get a feel for their importance.

1. Get a set of ImageNet pretrained weights for CSPDarknet53 [from the Darknet GitHub repository](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/csdarknet53-omega_final.weights).
2. Add a method to load the pretrained weights into the backbone portion of your PyTorch YOLOv4 model.
3. Implement a basic `train_yolo` function similar to the `train_model` function you developed in previous
   labs for classifiers that preprocesses the input with basic augmentation transformations, converts the
   anchor-relative outputs to bounding box coordinates, computes MSE loss for the bounding box coordinates,
   backpropagates the loss, and takes a step for the optimizer. Use the recommended IoU thresholds to determine
   which predicted bounding boxes to include in the loss. You will find many examples of how to do this
   online.
4. Train your model on COCO. Training on the full dataset to completion would take several days, so you can stop early after verifying
   the model is learning in the first few epochs.
5. Compute mAP for your model on the COCO validation set.
6. Implement the CIoU loss function and observe its effect on mAP.
7. (Optional) Train on COCO to completion and see how close you can get to the mAP reported in the paper.
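
For the CIoU item, the loss for one predicted box and one ground truth box is $1 - \text{IoU} + \rho^2 / c^2 + \alpha v$. Here $\rho$ is the distance between the box centers, $c$ is the diagonal of the smallest box enclosing both, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures the aspect ratio mismatch, and $\alpha = v / (1 - \text{IoU} + v)$. The plain-Python sketch below computes this for a single pair of boxes in corner format so you can verify a batched tensor version against it. The function name `ciou_loss` and the `eps` guard are our own choices.

```python
import math


def ciou_loss(box_p, box_g, eps=1e-9):
    """CIoU loss for one predicted and one ground truth box, both (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # Squared distance between the two box centers
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 1, 1), (0, 0, 1, 1)))   # close to 0 for identical boxes
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))   # greater than 1 for disjoint boxes
```

Unlike MSE on the coordinates, this loss still changes with the center distance when the boxes do not overlap at all, which is one reason to compare the two in your mAP experiments.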

There is some useful information on working with the COCO dataset as a
Torchvision Dataset in [this blog](https://medium.com/howtoai/pytorch-torchvision-coco-dataset-b7f5e8cad82).
For your work on this lab, the instructor will place the full COCO training and validation datasets on a shared network drive,
so that we don't waste storage on multiple copies. Once you have access, you can load the dataset as follows:

In [None]:
import torchvision.datasets as dset

path2data="./train2017"
path2json="./annotations/instances_train2017.json"

coco_train = dset.CocoDetection(root = path2data, annFile = path2json)

print('Number of samples: ', len(coco_train))
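
Each sample from `CocoDetection` is an `(image, target)` pair, where `target` is a list of annotation dictionaries. COCO stores each `bbox` as `[x, y, width, height]` with `(x, y)` at the top-left corner, so you will probably want to convert to corner format before computing IoU. The helper below is a minimal sketch of that conversion. The name `coco_targets_to_corners` is hypothetical, the sample annotations are made up for illustration, and skipping `iscrowd` regions and degenerate boxes is one reasonable choice rather than a requirement.

```python
def coco_targets_to_corners(annotations):
    """Convert COCO annotation dicts to (x1, y1, x2, y2) boxes and category ids."""
    boxes, labels = [], []
    for ann in annotations:
        # Crowd regions cover groups of objects; we skip them here (an assumption)
        if ann.get("iscrowd", 0):
            continue
        x, y, w, h = ann["bbox"]
        # Ignore degenerate boxes with no area
        if w <= 0 or h <= 0:
            continue
        boxes.append((x, y, x + w, y + h))
        labels.append(ann["category_id"])
    return boxes, labels


# Made-up annotations in the COCO layout, for illustration only
sample_target = [
    {"bbox": [10.0, 20.0, 30.0, 40.0], "category_id": 18, "iscrowd": 0},
    {"bbox": [0.0, 0.0, 100.0, 50.0], "category_id": 1, "iscrowd": 1},
]
print(coco_targets_to_corners(sample_target))
# ([(10.0, 20.0, 40.0, 60.0)], [18])
```

Keep in mind that COCO `category_id` values are not contiguous, so you will also need a mapping from them to the 0 to 79 class indices your network predicts.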
