# RTML Lab 04: YOLO

In this lab, we'll explore a fascinating use of image classification deep neural networks to perform a different
task: object detection.

Credits: parts of this lab are based on other authors' code and blog posts:

- YOLO v3 in PyTorch: [tutorial code](https://github.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch),
  [blog](https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/)

- YOLO v4 in PyTorch: Chaichan Poonperm (AIT Ph.D. student)


## Object Detection

If you could go back in time to the 1990s, you would find no cameras that could find faces in a photograph, and no
researcher with a way to count dogs in a video in real time. Everyone had to count the dogs manually.
Times were very tough.

The Holy Grail of computer vision research at the time was real time face detection. If we could find faces
in images fast enough, we could build systems that interact more naturally with human beings. But nobody had
a solution.

Things changed when Viola and Jones introduced the first real time face detector, the Haar-like cascade, in 2001.
This technique swept a detection window over the input image at multiple sizes, and subjected each local patch
to a cascade of simple rough classifiers. Each patch that made it to the end of the cascade of classifiers was
treated as a positive detection. After a set of candidate patches was identified, a cleanup
stage clustered neighboring detections into isolated detections.

This method and one cousin, the HOG detector, which was slower but a little more accurate, dominated during the 2000s
and on into the 2010s. These methods worked well enough when trained carefully on the specific environment they were
used in, but usually couldn't be transferred to a new environment.

With the introduction of AlexNet and the amazing advances in image classification, we could follow the direction
of R-CNN, to use a region proposal algorithm followed by a deep learning classifier to do object detection VERY slowly
but much more accurately than the old real time methods.

## What is YOLO?

However, it wasn't until YOLO that we had a deep learning model for object detection that could run in real time.
It took some clever insight to realize that everything, from feature extraction to bounding box estimation, could
actually be done with a single neural network that could be trained end-to-end to detect objects.

YOLO (You Only Look Once) uses only convolutional layers. This makes it a "fully convolutional network" or FCN.

YOLOv3 has 75 convolutional layers, with skip connections and upsampling layers. No pooling is used; instead, convolutional
layers with stride 2 perform the downsampling. Strided convolution rather than pooling is used to prevent the loss of fine-grained
information about the precise location of low-level features that would otherwise occur with pooling.
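To make the effect of strided convolution concrete, the standard convolution output-size formula can be evaluated directly. This is a minimal sketch: the helper name `conv_out_size` is ours, and the 3$\times$3 kernel with padding 1 is an assumption chosen to match a typical downsampling layer.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 416
sizes = [size]
for _ in range(5):              # five stride-2 convolutions in a row
    size = conv_out_size(size)
    sizes.append(size)

print(sizes)  # [416, 208, 104, 52, 26, 13]
```

Each stride-2 layer halves the spatial resolution, so five of them give an overall downsampling factor of 32.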

Normally, the output of a convolutional layer is a feature map. Applying convolutional layers to a possible detection window or
region of interest (ROI) in the image then classifying the ROI's feature map is a reasonable method for detection prediction that
is used in Fast R-CNN and Faster R-CNN.
However, the innovation of YOLO was to use the feature map directly to predict bounding boxes and, for each bounding box, to
predict whether or not an object is at the center of the bounding box. Finally, a classifier is used for each bounding box
to indicate the content of the bounding box.

## YOLO v3 from "scratch"

Early versions of YOLO were very fast but not nearly as accurate as their slower cousins. YOLO v3 included many of the
tricks and techniques used by other models, such as multiscale analysis, and it achieved both high accuracy and fast inference.

Here we'll experiment with building up the YOLO v3 model in PyTorch. However, we won't train it ourselves, as that would
require days of training; instead, we'll
grab the weights for our PyTorch YOLO v3 from the original Darknet model by Joseph Redmon and friends.

### Ground Truth Bounding Boxes

Here is how we present example images and corresponding object bounding boxes to the model.

The input image is divided into grid cells. The number of cells depends on the number of convolutional layers
and the stride of each of those convolutional layers. For example, if we use a 416$\times$416 input image size,
and we apply 5 conv layers with a stride of 2 each (for a total downsampling factor of 32), we end up with a 13$\times$13
feature map, each element of which corresponds to a region of size 32$\times$32 pixels in the original image.

A ground truth box has a center (x and y position), a width, and a height. Normally the ground truth boxes would be
provided by a human annotator.

Each ground truth box's center must lie in some grid cell in the original image. Consider this example from the YOLO paper:

<img src="img/yolo05.png" title="GroundTruthBox" style="width: 400px;" />

The grid is represented by the black lines. The ground truth bounding box for the object is the yellow rectangle. The center
of this bounding box happens to be within the red-outlined grid cell.

The grid cell containing the center of a ground truth bounding box is given the responsibility during training to try to predict
the presence of the object.
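Finding the responsible grid cell amounts to an integer division of the box center by the cell size. Here is a small sketch, assuming a stride of 32 and a made-up box center; `responsible_cell` is a hypothetical helper, not part of the tutorial code.

```python
def responsible_cell(center_x, center_y, stride=32):
    """Return (column, row) of the grid cell containing a box center given in pixels."""
    return int(center_x // stride), int(center_y // stride)

# Hypothetical ground truth box center in a 416x416 image
col, row = responsible_cell(200.0, 150.0)
print(col, row)  # 6 4
```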

In order to indicate the presence of the given object, the model outputs several parameters for a given candidate object:
 - $(t_x, t_y, t_w, t_h)$ indicate the box's location and size. During training, the targets for these outputs are the actual ground truth box parameters.
 - $p_o$ is an "objectness" score that indicates the likelihood that an object exists in the given bounding box. This output uses a sigmoid function.
   During training, the target for $p_o$ is set to 1 for the center grid cell (the red grid cell) and 0 for the neighboring grid cells.
 - $(p_1, p_2, \ldots, p_n)$ are class confidence scores. They indicate the probability of the detected object belonging to a particular class. The targets,
   obviously, are set to 1 for the ground truth object class and 0 for other classes during training.

### Anchor Boxes

One problem that would occur in YOLO if you tried to learn the parameters mentioned above directly is unstable gradients during training.
In a way that is loosely analogous to how a residual block begins with an identity map and learns differences from identity, YOLO v3 uses the idea of
anchor boxes originally introduced by the R-CNN team. Instead of predicting $(t_x, t_y, t_w, t_h)$ directly, we predict how those parameters are *different from
the parameters of a typical bounding box, an anchor box*.
YOLO v3 uses three anchor boxes per cell at each scale. At training time, once a ground truth bounding box's center is mapped to a grid cell, we find which of the anchors for
that cell has the highest IoU with the ground truth box. The prediction associated with that anchor is the one trained to detect the object.
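The anchor matching step can be sketched in a few lines. This sketch assumes the common simplification of comparing shapes only, that is, computing IoU as if the ground truth box and the anchor shared the same center; the helper names are ours, and the anchors are the last three listed in `yolov3.cfg`.

```python
def shape_iou(box_wh, anchor_wh):
    """IoU of two boxes that share a center, so only width and height matter."""
    w1, h1 = box_wh
    w2, h2 = anchor_wh
    inter = min(w1, w2) * min(h1, h2)
    union = w1 * h1 + w2 * h2 - inter
    return inter / union

def best_anchor(box_wh, anchors):
    """Index of the anchor with the highest shape IoU against the box."""
    ious = [shape_iou(box_wh, a) for a in anchors]
    return max(range(len(anchors)), key=lambda k: ious[k])

anchors = [(116, 90), (156, 198), (373, 326)]   # (width, height) in pixels
print(best_anchor((150, 200), anchors))  # 1
```

A 150$\times$200 ground truth box is closest in shape to the 156$\times$198 anchor, so that anchor is selected.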

### So What Does YOLO Actually Predict?

First, let's understand that all predictions are relative to the grid cell. YOLO predicts the following:
- Offsets $(t_x, t_y)$ are specified relative to the top left corner of the grid cell, as a ratio between 0 and 1, using a sigmoid to limit the values.
- Width and height $(t_w, t_h)$ are specified relative to the dimensions of an anchor box.

Thus, YOLO does not predict absolute coordinates -- it predicts values that can then be used to compute the box's position and size in absolute coordinates.
This diagram gives the idea. We see that the absolute center coordinate $b_x$ is the grid cell's offset $c_x$ plus $\sigma(t_x)$, measured in units of the grid cell width. Similarly for $b_y$.
The absolute width of the predicted bounding box is the width of the anchor box times $e^{t_w}$. Similarly for the height.

<img src="img/yolo06.png" title="GroundTruthBox" style="width: 640px;" />

Hopefully you can see the difference between the YOLO v3 bounding box predictions and the Faster R-CNN bounding box predictions. The offset of the center is
encoded relative to the grid cell containing the anchor box rather than the anchor box itself. The dimensions of the bounding box, however, similar to Faster R-CNN,
are predicted relative to the anchor box size.
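The decoding described above can be written out as a short function. This is a sketch under stated assumptions: `decode_box` is a hypothetical helper, the cell indices `c_x`, `c_y` are in grid units, the anchor size is in pixels, and the result is converted to pixels by multiplying the center by the stride.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, anchor_w, anchor_h, stride):
    """Turn raw network outputs into a box (center x, center y, width, height) in pixels."""
    b_x = (c_x + sigmoid(t_x)) * stride   # center stays inside cell column c_x
    b_y = (c_y + sigmoid(t_y)) * stride   # center stays inside cell row c_y
    b_w = anchor_w * math.exp(t_w)        # anchor width scaled by e^{t_w}
    b_h = anchor_h * math.exp(t_h)        # anchor height scaled by e^{t_h}
    return b_x, b_y, b_w, b_h

# All-zero outputs give the cell center and the unmodified anchor size
print(decode_box(0.0, 0.0, 0.0, 0.0, c_x=6, c_y=4, anchor_w=116, anchor_h=90, stride=32))
# (208.0, 144.0, 116.0, 90.0)
```

Because of the sigmoid, even very large or very small values of $t_x$ and $t_y$ cannot move the predicted center out of its grid cell.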

### Multi-scale prediction

Rather than a single grid size and grid cell size,
YOLO v3 detects objects at multiple sizes with downsampling factors of 32, 16, and 8. The largest objects are detected at the
first, coarsest scale, whereas mid-sized objects are detected at the intermediate scale, and small objects are detected at the finest
scale. The example below shows the three grid sizes relative to the image and an object:

<img src="img/yolo_Scales.png" title="GroundTruthBox" style="width: 640px;" />
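For a concrete sense of the three scales, the grid sizes can be tabulated for a given input size. This small sketch assumes a 416$\times$416 input; the variable names are ours.

```python
input_size = 416
grids = {stride: input_size // stride for stride in (32, 16, 8)}

for stride, grid in grids.items():
    print(f"stride {stride:2d}: {grid}x{grid} grid, "
          f"{grid * grid} cells of {stride}x{stride} pixels")
```

The coarsest grid has the fewest and largest cells, and the finest grid has sixteen times as many cells, each one quarter the width.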

### YOLO dataset format

A YOLO dataset contains two sets of files: image files (in any supported format) and label files (in TXT, JSON, or XML format).
Each image and its label file share the same base name. A label file contains data in the format

$$i, C_x, C_y, L_x, L_y$$

- $i$: Label index
- $C_x$: Center position in the horizontal ($x$-axis) direction, encoded in the range 0-1, where 0 means the left edge of the image and 1 means the right edge.
- $C_y$: Center position in the vertical ($y$-axis) direction, encoded in the range 0-1, where 0 means the top edge of the image and 1 means the bottom edge.
- $L_x$: Object width, encoded in the range 0-1, where 1 means the width of the image.
- $L_y$: Object height, encoded in the range 0-1, where 1 means the height of the image.

To calculate these values for an object, suppose $(W,H)$ is the actual size of a particular image in pixels,
$(O_x, O_y)$ is the actual position of the object's center in that image, in pixels, and $(l_x,l_y)$ is the actual size of the object, again in pixels.
The object label elements would be calculated as

$$C_x = \frac{O_x}{W}, \; C_y = \frac{O_y}{H} \\
L_x = \frac{l_x}{W}, \; L_y = \frac{l_y}{H}$$
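As a worked instance of these formulas, here is a sketch that converts a pixel-space box to label values. The function name `to_yolo_label` and the example image and box are made up for illustration.

```python
def to_yolo_label(class_index, center_x, center_y, box_w, box_h, img_w, img_h):
    """Convert a pixel-space box (center and size) to normalized YOLO label values."""
    return (class_index,
            center_x / img_w, center_y / img_h,   # C_x, C_y
            box_w / img_w, box_h / img_h)         # L_x, L_y

# Hypothetical 640x480 image with a class-16 object in a 200x120 box centered at (320, 240)
label = to_yolo_label(16, 320, 240, 200, 120, img_w=640, img_h=480)
print(label)  # (16, 0.5, 0.5, 0.3125, 0.25)
```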

### YOLOv3 Architecture

<img src="img/YOLOv3-Arch.png" title="YOLOv3" style="width: 960px;" />

### Preparation for Building YOLO in PyTorch

First of all, we will need OpenCV:

    pip3 install --upgrade pip
    pip install matplotlib opencv-python

Create a directory where the code for your detector will live.

In that directory, download util.py and darknet.py from https://github.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch.

In Jupyter you would download thusly:

In [1]:
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/darknet.py
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/util.py

--2021-02-06 00:55:49--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/darknet.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11533 (11K) [text/plain]
Saving to: ‘darknet.py’


2021-02-06 00:55:50 (80.7 MB/s) - ‘darknet.py’ saved [11533/11533]

--2021-02-06 00:55:50--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/util.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7432 (7.3K) [text/plain]
Saving to: ‘util.py’


2021-02-06 00:55:51 (4.99 MB/s) - ‘util.py’ saved [7432/7432]



If you're in a Docker container, just run the `wget` commands at the command line. Make sure your proxy environment variables are set correctly.

### Take a Look at the YOLO Darknet Configuration File

Next, let's download the `yolov3.cfg` configuration file and take a look. You could grab it from the canonical Darknet github repository
or any other place it's stored.

In [2]:
!mkdir -p cfg
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/cfg/yolov3.cfg
!mv yolov3.cfg cfg/yolov3.cfg

--2021-02-06 00:55:57--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/cfg/yolov3.cfg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8346 (8.2K) [text/plain]
Saving to: ‘yolov3.cfg’


2021-02-06 00:55:58 (23.4 MB/s) - ‘yolov3.cfg’ saved [8346/8346]



The configuration file looks like this:

    [net]
    # Testing
    batch=1
    subdivisions=1
    # Training
    # batch=64
    # subdivisions=16
    width= 416

    height = 416
    channels=3
    momentum=0.9
    decay=0.0005
    angle=0
    saturation = 1.5
    exposure = 1.5
    hue=.1

    learning_rate=0.001
    burn_in=1000
    max_batches = 500200
    policy=steps
    steps=400000,450000
    scales=.1,.1

    [convolutional]
    batch_normalize=1
    filters=32
    size=3
    stride=1
    pad=1
    activation=leaky

    ...

    [shortcut]
    from=-3
    activation=linear

    ...

    [yolo]
    mask = 6,7,8
    anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
    classes=80
    num=9
    jitter=.3
    ignore_thresh = .7
    truth_thresh = 1
    random=1

    [route]
    layers = -4

    [convolutional]
    batch_normalize=1
    filters=256
    size=1
    stride=1
    pad=1
    activation=leaky

    [upsample]
    stride=2

    [route]
    layers = -1, 61

    ...



### Overview of the Configuration Blocks

The configuration blocks fall into a few categories:

- Net: the global configuration at the top of the configuration file. It declares the size of input images, batch size, learning rate, and so on.

      batch=64
      subdivisions=16
      width=608
      height=608
      channels=3
      momentum=0.9
      decay=0.0005
      angle=0
      saturation = 1.5
      exposure = 1.5
      hue=.1


- Convolutional: convolutional layer. Note that this specification is a little more powerful than the PyTorch way of doing things, as options
  for batch normalization and the activation function (leaky ReLU in this case) are built in.
 
      [convolutional]
      batch_normalize=1
      filters=32
      size=3
      stride=1
      pad=1
      activation=leaky
        

- Shortcut: skip connections that implement residual blocks. -3 means to add the feature maps output by the previous layer to those output by the layer three layers
  back. Linear activation means identity (no nonlinear activation of the result).
  
      [shortcut]
      from=-3           # Connect the layer three layers back to here.
      activation=linear


- Upsample: Bilinear upsampling of the previous layer using a particular stride

      [upsample]
      stride=2


- Route: The route layer deserves a bit of explanation. It has an attribute `layers`, which can have either one or two values.
  
      [route]
      layers = -4

      [route]
      layers = -1, 61    
  
  When the layers attribute has only one value, it outputs the feature maps of the layer indexed by the value. In our example, it is -4, so the layer will output
  the feature maps from the 4th layer backwards from the route layer.

  When layers has two values, it returns the concatenated feature maps of the layers indexed by its values. In our example it is -1, 61, so the layer will output
  feature maps from the previous layer (-1) and the 61st layer, concatenated along the channels (depth) dimension.
   
- YOLO:
 
      [yolo]
      mask = 0,1,2
      anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
      classes=80
      num=9
      jitter=.3
      ignore_thresh = .5
      truth_thresh = 1
      random=1
  
  Here we have a few important attributes:
  
  - anchors: describes the anchor boxes. The model contains 9 anchors, but only those in the `mask` are used.

  - mask: which anchor indices will be used in this YOLO layer
     
  - classes: number of object classes
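The relationship between `anchors` and `mask` can be illustrated with a short sketch. The helper names `parse_anchors` and `select_anchors` are hypothetical; the attribute values are copied from the configuration excerpts above.

```python
def parse_anchors(anchors_value):
    """Turn the flat 'w,h, w,h, ...' string into a list of (width, height) pairs."""
    numbers = [int(v) for v in anchors_value.split(",")]
    return list(zip(numbers[0::2], numbers[1::2]))

def select_anchors(anchors_value, mask_value):
    """Keep only the anchors whose indices appear in the mask."""
    anchors = parse_anchors(anchors_value)
    mask = [int(v) for v in mask_value.split(",")]
    return [anchors[k] for k in mask]

anchors_value = "10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326"
print(select_anchors(anchors_value, "0,1,2"))  # [(10, 13), (16, 30), (33, 23)]
print(select_anchors(anchors_value, "6,7,8"))  # [(116, 90), (156, 198), (373, 326)]
```

The YOLO layer with `mask = 6,7,8` therefore works with the three largest anchors, and the one with `mask = 0,1,2` works with the three smallest.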


### Create a network from the config file

We are going to follow the general approach of some of the GitHub contributors who have developed PyTorch tools
to deal with Darknet models. In the file `darknet.py`, there's a `parse_cfg` function. The function will read
the Darknet configuration file and store each block as a dictionary, returning the blocks as a list.

<img src="img/configfunc.JPG" title="configfunc" style="width: 640px;" />
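To show the idea without opening `darknet.py`, here is a simplified sketch of such a parser. It is not the tutorial's `parse_cfg`: the name `parse_cfg_text` is ours, it works on a string rather than a file, and it assumes every `key=value` line appears after some `[block]` header.

```python
def parse_cfg_text(text):
    """Parse Darknet-style cfg text into a list of dicts, one per [block]."""
    blocks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                                  # skip blanks and comments
        if line.startswith("[") and line.endswith("]"):
            blocks.append({"type": line[1:-1].strip()})   # start a new block
        else:
            key, value = line.split("=", 1)
            blocks[-1][key.strip()] = value.strip()   # values are kept as strings
    return blocks

sample = """
[net]
# Testing
batch=1
width= 416

[convolutional]
batch_normalize=1
filters=32
activation=leaky
"""
blocks = parse_cfg_text(sample)
for block in blocks:
    print(block)
```

All values stay as strings, which is why later code converts them with `int(...)` before use.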

We'll then create PyTorch NN Modules for each of the blocks in the Darknet configuration as implemented in the
`create_modules` function. Take a look at this function for more understanding.

### Convolutional block

<img src="img/convolutionalblock.JPG" title="covolutionalblock" style="width: 600px;" />

### Shortcut block

<img src="img/shortcutblock.JPG" title="shortcutblock" style="width: 600px;" />

### Upsample block

<img src="img/upsampleblock.JPG" title="upsampleblock" style="width: 600px;" />

### Route block

Why does it use an empty layer? The actual work will be done in the `forward()` function.

<img src="img/routeblock.JPG" title="routeblock" style="width: 600px;" />

### YOLO block

<img src="img/yoloblock.JPG" title="yoloblock" style="width: 600px;" />

### Using the code

OK, let's try it out. Depending on what you already have installed, you may need to run

    # apt install libgl1-mesa-glx

for the next step to run.

In [None]:
import darknet

blocks = darknet.parse_cfg("cfg/yolov3.cfg")
print(darknet.create_modules(blocks))

### Darknet class

Let's make our own version of the `Darknet` class in `darknet.py`.

The class has two main methods:

1. `forward()`: forward propagation, executing each block of the parsed configuration in order
2. `load_weights()`: load a set of pretrained weights into the network


In [4]:
import numpy as np
import torch
import torch.nn as nn

import darknet
from util import *   # provides predict_transform

class MyDarknet(nn.Module):
    def __init__(self, cfgfile):
        super(MyDarknet, self).__init__()
        # load the config file and create our model
        self.blocks = darknet.parse_cfg(cfgfile)
        self.net_info, self.module_list = darknet.create_modules(self.blocks)
        
    def forward(self, x, CUDA:bool):
        modules = self.blocks[1:]
        outputs = {}   #We cache the outputs for the route layer
        
        write = 0
        # run forward propagation. Follow the instruction from dictionary modules
        for i, module in enumerate(modules):        
            module_type = (module["type"])
            
            if module_type == "convolutional" or module_type == "upsample":
                # do convolutional network
                x = self.module_list[i](x)
    
            elif module_type == "route":
                # concat layers
                layers = module["layers"]
                layers = [int(a) for a in layers]
    
                if (layers[0]) > 0:
                    layers[0] = layers[0] - i
    
                if len(layers) == 1:
                    x = outputs[i + (layers[0])]
    
                else:
                    if (layers[1]) > 0:
                        layers[1] = layers[1] - i
    
                    map1 = outputs[i + layers[0]]
                    map2 = outputs[i + layers[1]]
                    x = torch.cat((map1, map2), 1)
                
    
            elif  module_type == "shortcut":
                from_ = int(module["from"])
                # residual network
                x = outputs[i-1] + outputs[i+from_]
    
            elif module_type == 'yolo':        
                anchors = self.module_list[i][0].anchors
                #Get the input dimensions
                inp_dim = int (self.net_info["height"])
        
                #Get the number of classes
                num_classes = int (module["classes"])
        
                #Transform 
                # predict_transform is in util.py
                x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
                if not write:              #if no collector has been initialised. 
                    detections = x
                    write = 1
        
                else:       
                    detections = torch.cat((detections, x), 1)
        
            outputs[i] = x
        
        return detections


    def load_weights(self, weightfile):
        '''
        Load pretrained weight
        '''
        #Open the weights file; the with block closes it once everything is read
        with open(weightfile, "rb") as fp:
            #The first 5 values are header information 
            # 1. Major version number
            # 2. Minor Version Number
            # 3. Subversion number 
            # 4,5. Images seen by the network (during training)
            header = np.fromfile(fp, dtype = np.int32, count = 5)

            #The rest of the file is the weights, stored as float32
            weights = np.fromfile(fp, dtype = np.float32)

        self.header = torch.from_numpy(header)
        self.seen = self.header[3]   
        
        ptr = 0
        for i in range(len(self.module_list)):
            module_type = self.blocks[i + 1]["type"]
    
            #If module_type is convolutional load weights
            #Otherwise ignore.
            
            if module_type == "convolutional":
                model = self.module_list[i]
                #batch_normalize is absent from blocks that don't use batch norm
                batch_normalize = int(self.blocks[i+1].get("batch_normalize", 0))
            
                conv = model[0]
                
                
                if (batch_normalize):
                    bn = model[1]
        
                    #Get the number of weights of Batch Norm Layer
                    num_bn_biases = bn.bias.numel()
        
                    #Load the weights
                    bn_biases = torch.from_numpy(weights[ptr:ptr + num_bn_biases])
                    ptr += num_bn_biases
        
                    bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases
        
                    bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases
        
                    bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases
        
                    #Cast the loaded weights into dims of model weights. 
                    bn_biases = bn_biases.view_as(bn.bias.data)
                    bn_weights = bn_weights.view_as(bn.weight.data)
                    bn_running_mean = bn_running_mean.view_as(bn.running_mean)
                    bn_running_var = bn_running_var.view_as(bn.running_var)
        
                    #Copy the data to model
                    bn.bias.data.copy_(bn_biases)
                    bn.weight.data.copy_(bn_weights)
                    bn.running_mean.copy_(bn_running_mean)
                    bn.running_var.copy_(bn_running_var)
                
                else:
                    #Number of biases
                    num_biases = conv.bias.numel()
                
                    #Load the weights
                    conv_biases = torch.from_numpy(weights[ptr: ptr + num_biases])
                    ptr = ptr + num_biases
                
                    #reshape the loaded weights according to the dims of the model weights
                    conv_biases = conv_biases.view_as(conv.bias.data)
                
                    #Finally copy the data
                    conv.bias.data.copy_(conv_biases)
                    
                #Let us load the weights for the Convolutional layers
                num_weights = conv.weight.numel()
                
                #Do the same as above for weights
                conv_weights = torch.from_numpy(weights[ptr:ptr+num_weights])
                ptr = ptr + num_weights
                
                conv_weights = conv_weights.view_as(conv.weight.data)
                conv.weight.data.copy_(conv_weights)


### Test Forward Propagation

Let's propagate a single image through the network and see what we get.

In [5]:
!wget https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png

--2021-02-06 00:58:21--  https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png
Resolving github.com (github.com)... 13.250.177.223
Connecting to github.com (github.com)|13.250.177.223|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/master/dog-cycle-car.png [following]
--2021-02-06 00:58:21--  https://raw.githubusercontent.com/ayooshkathuria/pytorch-yolo-v3/master/dog-cycle-car.png
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 347445 (339K) [image/png]
Saving to: ‘dog-cycle-car.png’


2021-02-06 00:58:22 (1.84 MB/s) - ‘dog-cycle-car.png’ saved [347445/347445]



Here's code to load the image into memory and push it through the model:

In [6]:
import cv2
import numpy as np
import torch

def get_test_input():
    img = cv2.imread("dog-cycle-car.png")
    img = cv2.resize(img, (416,416))          # Resize to the input dimension
    img_ =  img[:,:,::-1].transpose((2,0,1))  # BGR -> RGB | H x W x C -> C x H x W
    img_ = img_[np.newaxis,:,:,:]/255.0       # Add a batch dimension at axis 0 | Normalise to [0, 1]
    img_ = torch.from_numpy(img_).float()     # Convert to a float tensor
    return img_

Go ahead and try it (noting that the model's weights are still randomly initialized, so we don't expect any correct result):

In [7]:
from util import *

model = MyDarknet("cfg/yolov3.cfg")
inp = get_test_input()
pred = model(inp, False)
print (pred)

tensor([[[1.4784e+01, 1.7460e+01, 1.1424e+02,  ..., 5.5892e-01,
          3.9877e-01, 5.6024e-01],
         [2.0277e+01, 1.4719e+01, 1.1301e+02,  ..., 5.2226e-01,
          5.0744e-01, 4.7786e-01],
         [1.7056e+01, 1.5934e+01, 5.1666e+02,  ..., 5.4858e-01,
          4.3707e-01, 4.7734e-01],
         ...,
         [4.1254e+02, 4.1234e+02, 9.5190e+00,  ..., 5.2298e-01,
          5.2594e-01, 4.1804e-01],
         [4.1249e+02, 4.1238e+02, 2.5798e+01,  ..., 5.1811e-01,
          4.4765e-01, 4.2939e-01],
         [4.1262e+02, 4.1208e+02, 3.8786e+01,  ..., 6.1843e-01,
          5.2593e-01, 4.4060e-01]]])


### Understanding the output result

The prediction tensor has shape $B \times (13\cdot 13 + 26\cdot 26 + 52 \cdot 52)\cdot 3 \times 85$, that is, $B \times 10647 \times 85$. Why? We have
- $B$: the number of images in the batch
- $13\cdot 13$: number of elements (grid cells) in the coarsest feature map
- $26\cdot 26$: number of elements (grid cells) in the medium scale feature map
- $52\cdot 52$: number of elements (grid cells) in the finest scale feature map
- $3$: the number of anchor boxes per grid cell
- $85$: number of bounding box attributes (4 for bounding box, 1 for objectness, 80 for the COCO classes)
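The counts above can be checked with a little arithmetic, and a single row of the tensor can be split into its parts. This sketch uses plain Python lists with made-up values; the attribute order (box, then objectness, then class scores) is assumed from the list above.

```python
num_anchors, num_classes = 3, 80

# One row per anchor per grid cell, summed over the three scales
rows = sum((416 // stride) ** 2 for stride in (32, 16, 8)) * num_anchors
attrs = 4 + 1 + num_classes
print(rows, attrs)  # 10647 85

def split_row(row):
    """Split one prediction row into box, objectness, and class scores."""
    return row[:4], row[4], row[5:]

# Made-up row: a box, an objectness score, and all-zero class scores except class 16
row = [208.0, 144.0, 116.0, 90.0, 0.9] + [0.0] * num_classes
row[5 + 16] = 0.99

box, objectness, class_scores = split_row(row)
best_class = max(range(num_classes), key=lambda k: class_scores[k])
print(box, objectness, best_class)  # [208.0, 144.0, 116.0, 90.0] 0.9 16
```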

### Download a pretrained weight file

Darknet stores weights as in this diagram:

<img src="img/weights.png" title="weight" style="width: 600px;" />
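Following the order in which `load_weights` reads the file, the number of float values consumed by one convolutional block can be computed as below. This is only a sketch with a hypothetical helper name; the layer sizes in the example come from the first convolutional block in the configuration file and a 3-channel input.

```python
def conv_block_floats(in_channels, out_channels, kernel_size, batch_normalize):
    """Number of float32 values one convolutional block occupies in the weights file."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    if batch_normalize:
        # BN biases, BN weights, running means, running variances, then conv weights
        return 4 * out_channels + weights
    # Without batch norm: conv biases, then conv weights
    return out_channels + weights

# First convolutional block: 3 -> 32 channels, 3x3 kernel, with batch norm
print(conv_block_floats(3, 32, 3, True))  # 992
```

Summing this quantity over all convolutional blocks should account for every value read after the five-value header.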

In [8]:
!wget https://pjreddie.com/media/files/yolov3.weights

--2021-02-06 01:05:23--  https://pjreddie.com/media/files/yolov3.weights
Resolving pjreddie.com (pjreddie.com)... 128.208.4.108
Connecting to pjreddie.com (pjreddie.com)|128.208.4.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 248007048 (237M) [application/octet-stream]
Saving to: ‘yolov3.weights’


2021-02-06 01:08:05 (1.47 MB/s) - ‘yolov3.weights’ saved [248007048/248007048]



In [9]:
model.load_weights("yolov3.weights")

### Test with the sample image again

In [10]:
inp = get_test_input()
pred = model(inp, False)
print (pred)

tensor([[[8.5426e+00, 1.9015e+01, 1.1130e+02,  ..., 1.7306e-03,
          1.3874e-03, 9.2985e-04],
         [1.4105e+01, 1.8867e+01, 9.4014e+01,  ..., 5.9501e-04,
          9.2471e-04, 1.3085e-03],
         [2.1125e+01, 1.5269e+01, 3.5793e+02,  ..., 8.3609e-03,
          5.1067e-03, 5.8561e-03],
         ...,
         [4.1268e+02, 4.1069e+02, 3.7157e+00,  ..., 1.7185e-06,
          4.0955e-06, 6.5897e-07],
         [4.1132e+02, 4.1023e+02, 8.0353e+00,  ..., 1.3927e-05,
          3.2252e-05, 1.2076e-05],
         [4.1076e+02, 4.1318e+02, 4.9635e+01,  ..., 4.2174e-06,
          1.0794e-05, 1.8104e-05]]])


### From YOLO output tensor to *true* detections

The prediction tensor contains thousands of candidate boxes, most of them not real detections. We need to threshold them using the objectness score
output for each bounding box prediction. The `write_results` function in `util.py` does just that.

    def write_results(prediction, confidence, num_classes, nms_conf = 0.4)

- prediction: prediction result tensor returned from the YOLO model
- confidence: objectness score threshold to apply to the set of detections
- num_classes: number of classes to expect
- nms_conf: NMS IoU threshold

NMS stands for "non-maxima suppression." The basic idea is that if you have two predicted bounding
boxes that overlap each other significantly, you should throw away the box with the lower confidence
score. Overlap is measured by IoU (Intersection over Union), which is just the ratio of the area
of intersection of the two regions to the area of the union of the two regions:

$$ IoU(R_1,R_2) = \frac{|R_1 \cap R_2|}{|R_1 \cup R_2|}. $$

The default of 0.4 means if the intersection is 40% or more of the union, the two bounding boxes
are overlapping enough that only one of the detections should survive.
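The idea can be sketched in plain Python. This is not the code in `util.py`: it is a simplified greedy version with hypothetical function names, it takes boxes as corner coordinates `(x1, y1, x2, y2)`, and it ignores class labels, whereas a full detector would typically apply NMS separately per class.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.4):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) < iou_threshold for j in keep):
            keep.append(k)
    return keep

# Made-up boxes: the first two overlap heavily, the third is far away
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The first two boxes have an IoU of about 0.68, so the lower-scoring one is suppressed, while the third box overlaps nothing and survives.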

In [11]:
write_results(pred, 0.5, 80, nms_conf = 0.4)

tensor([[  0.0000,  61.5403, 100.8597, 307.2717, 303.1132,   0.9469,   0.9985,
           1.0000],
        [  0.0000, 253.8483,  66.1096, 378.0396, 118.0089,   0.9992,   0.8164,
           7.0000],
        [  0.0000,  71.0337, 163.2243, 175.7471, 382.2702,   0.9999,   0.9936,
          16.0000]])

### Show the resulting detections on top of an image

The model was trained on the COCO dataset, so download the class label file `coco.names`:

In [12]:
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/data/coco.names
!mkdir -p data
!mv coco.names data/coco.names

--2021-02-06 01:49:43--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/data/coco.names
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 625 [text/plain]
Saving to: ‘coco.names’


2021-02-06 01:49:44 (44.2 MB/s) - ‘coco.names’ saved [625/625]



In [13]:
def load_classes(namesfile):
    # Read one class name per line; the file ends with a newline, so drop the empty last entry
    with open(namesfile, "r") as fp:
        names = fp.read().split("\n")[:-1]
    return names

In [14]:
num_classes = 80
classes = load_classes("data/coco.names")
print(classes)

['person', 'bicycle', 'car', 'motorbike', 'aeroplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'sofa', 'pottedplant', 'bed', 'diningtable', 'toilet', 'tvmonitor', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']


So we see that the three surviving bounding boxes above, outputting object types 1, 7, and 16, indicate a bicycle, a truck, and a dog.
Let's draw the detections on top of the input image for better visualization.
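Before drawing, it may help to read the detection rows as text. The sketch below copies the three rows printed by `write_results` above into plain lists. The column layout used here (image index in the batch, two box corners, objectness, class score, class index) is our reading of the output and should be treated as an assumption; only the last column is confirmed by the class indices discussed above.

```python
detections = [
    [0.0,  61.5403, 100.8597, 307.2717, 303.1132, 0.9469, 0.9985,  1.0],
    [0.0, 253.8483,  66.1096, 378.0396, 118.0089, 0.9992, 0.8164,  7.0],
    [0.0,  71.0337, 163.2243, 175.7471, 382.2702, 0.9999, 0.9936, 16.0],
]
names = {1: "bicycle", 7: "truck", 16: "dog"}   # subset of coco.names

labels = []
for image_index, x1, y1, x2, y2, objectness, class_score, class_index in detections:
    label = names[int(class_index)]
    labels.append(label)
    print(f"{label}: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f}), "
          f"class score {class_score:.2f}")
```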

We'll use some code based on Kathuria's `detect.py`. You can download the original as follows:

In [15]:
!wget https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/detect.py

--2021-02-06 01:53:13--  https://raw.githubusercontent.com/ayooshkathuria/YOLO_v3_tutorial_from_scratch/master/detect.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.8.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.8.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7273 (7.1K) [text/plain]
Saving to: ‘detect.py’


2021-02-06 01:53:13 (33.2 MB/s) - ‘detect.py’ saved [7273/7273]



Here's our version. It will process the images in subdirectory `cocoimages` so let's make it and put our sample there:

In [19]:
!mkdir -p cocoimages
!cp dog-cycle-car.png cocoimages/

In [20]:
from __future__ import division
import time
import torch 
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
import cv2 
from util import *
import argparse
import os 
import os.path as osp
from darknet import Darknet
import pickle as pkl
import pandas as pd
import random

images = "cocoimages"
batch_size = 4
confidence = 0.5
nms_thesh = 0.4
start = 0
CUDA = torch.cuda.is_available()

num_classes = 80
classes = load_classes("data/coco.names")

#Set up the neural network

print("Loading network.....")
model = MyDarknet("cfg/yolov3.cfg")
model.load_weights("yolov3.weights")
print("Network successfully loaded")

model.net_info["height"] = 416
inp_dim = int(model.net_info["height"])
assert inp_dim % 32 == 0 
assert inp_dim > 32

#If there's a GPU available, put the model on GPU

if CUDA:
    model.cuda()

# Set the model in evaluation mode

model.eval()

read_dir = time.time()

# Detection phase

try:
    imlist = [osp.join(osp.realpath('.'), images, img) for img in os.listdir(images)]
except NotADirectoryError:
    imlist = []
    imlist.append(osp.join(osp.realpath('.'), images))
except FileNotFoundError:
    print ("No file or directory with the name {}".format(images))
    exit()
    
if not os.path.exists("des"):
    os.makedirs("des")

load_batch = time.time()
loaded_ims = [cv2.imread(x) for x in imlist]

im_batches = list(map(prep_image, loaded_ims, [inp_dim for x in range(len(imlist))]))
im_dim_list = [(x.shape[1], x.shape[0]) for x in loaded_ims]
im_dim_list = torch.FloatTensor(im_dim_list).repeat(1,2)


leftover = 0
if (len(im_dim_list) % batch_size):
    leftover = 1

if batch_size != 1:
    num_batches = len(imlist) // batch_size + leftover            
    im_batches = [torch.cat((im_batches[i*batch_size : min((i +  1)*batch_size,
                        len(im_batches))]))  for i in range(num_batches)]  

write = 0

if CUDA:
    im_dim_list = im_dim_list.cuda()
    
start_det_loop = time.time()
for i, batch in enumerate(im_batches):
    # Load the image 
    start = time.time()
    if CUDA:
        batch = batch.cuda()
    with torch.no_grad():
        prediction = model(Variable(batch), CUDA)

    prediction = write_results(prediction, confidence, num_classes, nms_conf = nms_thesh)

    end = time.time()

    if type(prediction) == int:

        for im_num, image in enumerate(imlist[i*batch_size: min((i +  1)*batch_size, len(imlist))]):
            im_id = i*batch_size + im_num
            print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size))
            print("{0:20s} {1:s}".format("Objects Detected:", ""))
            print("----------------------------------------------------------")
        continue

    prediction[:,0] += i*batch_size    #transform the attribute from index in batch to index in imlist 

    if not write:                      #If we haven't initialised output
        output = prediction  
        write = 1
    else:
        output = torch.cat((output,prediction))

    for im_num, image in enumerate(imlist[i*batch_size: min((i +  1)*batch_size, len(imlist))]):
        im_id = i*batch_size + im_num
        objs = [classes[int(x[-1])] for x in output if int(x[0]) == im_id]
        print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size))
        print("{0:20s} {1:s}".format("Objects Detected:", " ".join(objs)))
        print("----------------------------------------------------------")

    if CUDA:
        torch.cuda.synchronize()       
try:
    output
except NameError:
    print ("No detections were made")
    exit()

im_dim_list = torch.index_select(im_dim_list, 0, output[:,0].long())

scaling_factor = torch.min(416/im_dim_list,1)[0].view(-1,1)

output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim_list[:,0].view(-1,1))/2
output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim_list[:,1].view(-1,1))/2

output[:,1:5] /= scaling_factor

for i in range(output.shape[0]):
    output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim_list[i,0])
    output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim_list[i,1])
    
output_recast = time.time()
class_load = time.time()
colors = [[255, 0, 0], [255, 0, 0], [255, 255, 0], [0, 255, 0], [0, 255, 255], [0, 0, 255], [255, 0, 255]]

draw = time.time()

def write(x, results):
    c1 = tuple(x[1:3].int())
    c2 = tuple(x[3:5].int())
    img = results[int(x[0])]
    cls = int(x[-1])
    color = random.choice(colors)
    label = "{0}".format(classes[cls])
    cv2.rectangle(img, c1, c2,color, 1)
    t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0]
    c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4
    cv2.rectangle(img, c1, c2,color, -1)
    cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1);
    return img


list(map(lambda x: write(x, loaded_ims), output))

det_names = pd.Series(imlist).apply(lambda x: "{}/det_{}".format("des",x.split("/")[-1]))

list(map(cv2.imwrite, det_names, loaded_ims))

end = time.time()

print("SUMMARY")
print("----------------------------------------------------------")
print("{:25s}: {}".format("Task", "Time Taken (in seconds)"))
print()
print("{:25s}: {:2.3f}".format("Reading addresses", load_batch - read_dir))
print("{:25s}: {:2.3f}".format("Loading batch", start_det_loop - load_batch))
print("{:25s}: {:2.3f}".format("Detection (" + str(len(imlist)) +  " images)", output_recast - start_det_loop))
print("{:25s}: {:2.3f}".format("Output Processing", class_load - output_recast))
print("{:25s}: {:2.3f}".format("Drawing Boxes", end - draw))
print("{:25s}: {:2.3f}".format("Average time_per_img", (end - load_batch)/len(imlist)))
print("----------------------------------------------------------")


torch.cuda.empty_cache()

Loading network.....
Network successfully loaded
dog-cycle-car.png    predicted in  0.433 seconds
Objects Detected:    bicycle truck dog
----------------------------------------------------------
SUMMARY
----------------------------------------------------------
Task                     : Time Taken (in seconds)

Reading addresses        : 0.000
Loading batch            : 0.016
Detection (1 images)     : 1.737
Output Processing        : 0.000
Drawing Boxes            : 0.013
Average time_per_img     : 1.766
----------------------------------------------------------


Voilà! You've got the YOLO result:

<img src="img/dogresult.png" title="weight" style="width: 600px;" />

### YOLOv4 implementation

YOLO v4 was developed based on YOLO v3 by a new group of authors, Alexey Bochkovskiy and colleagues, who took
over the development of Darknet and YOLO after [Joseph Redmon quit computer vision research](https://twitter.com/pjreddie/status/1230524770350817280?lang=en).

Take a look at the [YOLO v4 paper](https://arxiv.org/abs/2004.10934). The authors make many small and some large
improvements to YOLOv3 to achieve a higher frame rate and higher accuracy. Source code is available at the
[Darknet GitHub repository](https://github.com/AlexeyAB/darknet).

### YOLOv4 overall architecture

The YOLOv4 architecture looks like this:

<img src="img/Block-diagram-of-YOLOv4-object-detection-The-small-modules-included-are-CBM.png" title="YOLOV4" style="width: 960px;" />

*Source: https://www.researchgate.net/publication/351652673_A_real-time_deep_learning_forest_fire_monitoring_algorithm_based_on_an_improved_Pruned_KD_model/figures?lo=1*

As you can see, there are several modules in YOLOv4. For the backbone:

- **CBM**: Convolution $\rightarrow$ BatchNorm $\rightarrow$ Mish Activation
- **CBL**: Convolution $\rightarrow$ BatchNorm $\rightarrow$ Leaky ReLU Activation
- **CSPResNet**: Cross-stage partial residual network block. A CSPResNet block separates the input feature map block into two parts. The first part bypasses the computations in the block and becomes part of the input to the next block. The second part is passed through a residual block implemented with 2 CBMs.

For the neck:

- **SPP**: Spatial pyramid pooling layer
- **PAN**: Path aggregation network

And finally, for the head:

- **YOLO**: Same as in YOLO v3.

### Cross stage partial network backbone

Let's look at CSP in the backbone first. This figure shows the idea of a CSPDenseNet block compared to an ordinary dense block:

<img src="img/CSPdensenet.png" title="CSP" style="width: 800px;" />

Each layer in a dense block is just a convolution, batch norm, and activation function, applied to the concatenation of the outputs of all preceding layers in the block.

CSPResNet further replaces the dense block with a residual block (two CBMs as stated above).
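
The split-and-bypass idea can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: `csp_block` and `toy_residual` are hypothetical names, the channels are split with a plain slice, and the residual branch is a stand-in rather than two real CBM modules. Actual implementations may form the two parts differently.

```python
import numpy as np

def csp_block(x, inner_fn):
    """Split channels in two, transform only the second half, then re-join."""
    c = x.shape[0] // 2
    bypass, active = x[:c], x[c:]
    return np.concatenate([bypass, inner_fn(active)], axis=0)

def toy_residual(t):
    # Hypothetical stand-in for the two-CBM residual block: identity plus a transform.
    return t + np.tanh(t)

x = np.random.default_rng(0).normal(size=(8, 4, 4))  # (channels, height, width)
y = csp_block(x, toy_residual)
print(y.shape)
```

The first half of the channels passes through untouched, so only half of the feature maps pay the cost of the residual computation.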

### Spatial pyramid pooling in the neck

Next is the spatial pyramid pooling (SPP) layer. Take a look at the [SPP paper](https://arxiv.org/abs/1406.4729).
This layer is used for multiscale analysis of a set of feature maps produced by convolutional layers. An SPP layer produces
a fixed-size representation of the input regardless of the input feature map size.

Here is an example from the SPP paper:

<img src="img/SPP.jpeg" title="SPP" style="width: 800px;" />

*Source: https://arxiv.org/pdf/1406.4729.pdf, Figure 3*

In the figure, the 256-channel output of a convolutional layer is being pooled at three different resolutions: 4x4, 2x2, and 1x1. The resulting
feature maps are concatenated into a 1D vector that could then be analyzed with fully connected layers.
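
To make the "fixed-size representation" concrete, here is a minimal NumPy sketch of this kind of pooling. The function name `spatial_pyramid_pool` and the way bin edges are rounded are assumptions for illustration, not the paper's reference implementation.

```python
import math
import numpy as np

def spatial_pyramid_pool(fmap, levels=(4, 2, 1)):
    """Max-pool a (C, H, W) feature map into n x n bins per level and flatten."""
    c, h, w = fmap.shape
    pooled = []
    for n in levels:
        for i in range(n):
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                # One C-dimensional vector per bin
                pooled.append(fmap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

for size in [(13, 13), (10, 17)]:
    fmap = np.random.default_rng(0).normal(size=(256, *size))
    print(size, spatial_pyramid_pool(fmap).shape)
```

Both inputs give a vector of length 256 × (16 + 4 + 1) = 5376, even though their spatial sizes differ.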

In YOLO v4, however, the SPP module is implemented slightly differently,
using three different maxpool layers applied to the same input. For example, one of the
maxpool layers of the SPP in `yolov4.cfg` looks like this:

    [maxpool]
    stride=1
    size=13

In Darknet, this means a 13x13 overlapping maximum operation, with padding so that the output size is the same as the input size.
The other two maxpool operations are 9x9 and 5x5, each with stride 1, creating three sets of feature maps that are the same size
as the input, which are concatenated with the input and passed to the next stage of processing.
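
A small NumPy sketch of this variant follows. It assumes padding with negative infinity so padded cells never win the maximum, and the concatenation order is illustrative only; `same_size_maxpool` is a hypothetical helper, not Darknet code.

```python
import numpy as np

def same_size_maxpool(x, k):
    """k x k max pooling with stride 1, padded so the output H x W equals the input H x W."""
    p = k // 2
    c, h, w = x.shape
    padded = np.full((c, h + 2 * p, w + 2 * p), -np.inf)
    padded[:, p:p + h, p:p + w] = x
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = padded[:, i:i + k, j:j + k].max(axis=(1, 2))
    return out

x = np.random.default_rng(0).normal(size=(4, 19, 19))  # (channels, height, width)
spp = np.concatenate([same_size_maxpool(x, k) for k in (13, 9, 5)] + [x], axis=0)
print(spp.shape)
```

The spatial size is unchanged, while the channel count is four times that of the input: three pooled copies plus the input itself.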

### Path aggregation network in the neck

The last new module in YOLO v4 is the path aggregation network or PANet in the neck. PANet is a type of feature pyramid
network that extracts features from various points in the backbone.
Take a look at the [PANet paper](https://arxiv.org/abs/1803.01534). It was originally designed for instance segmentation.
From the YOLO v4 diagram above, we see that the PANet has two flows: upward, from the lowest resolution output of the SPP, and downward,
from the high resolution block of feature maps (76x76x256) from the middle of the CSPDarknet53 backbone. The low-to-high resolution
path on the left upscales the low resolution feature maps and combines them with the higher resolution feature maps, while the high-to-low
resolution path on the right uses strided convolution to downscale the high resolution information.
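
The shape bookkeeping of the two paths can be sketched with NumPy. This is a rough illustration only: the channel counts are hypothetical, slicing stands in for the strided convolution, and the convolutions that the real network applies between merges are omitted. The 76, 38, and 19 grid sizes assume a 608x608 input.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """Stand-in for a stride-2 convolution: keep every second row and column."""
    return x[:, ::2, ::2]

rng = np.random.default_rng(0)
low = rng.normal(size=(8, 19, 19))    # lowest-resolution maps (after the SPP)
mid = rng.normal(size=(8, 38, 38))
high = rng.normal(size=(8, 76, 76))   # highest-resolution backbone maps

# Low-to-high resolution path: upsample, then concatenate along channels.
mid_up = np.concatenate([mid, upsample2x(low)], axis=0)
high_up = np.concatenate([high, upsample2x(mid_up)], axis=0)

# High-to-low resolution path: downsample, then concatenate again.
mid_down = np.concatenate([downsample2x(high_up), mid_up], axis=0)
low_down = np.concatenate([downsample2x(mid_down), low], axis=0)
print(high_up.shape, mid_down.shape, low_down.shape)
```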

### Mish activation function in the CBM modules

Next, let's take a look at the newish activation function used in YOLOv4: Mish.
Mish is a smooth, non-monotonic, self-regularizing activation function built on SoftPlus.
It was inspired by the *swish* activation function.
Its output ranges from approximately -0.31 to $\infty$:

$$\begin{aligned}
\mathrm{SoftPlus}(x) &= \ln(1+e^x) \\
f(x) &= x \tanh(\mathrm{SoftPlus}(x)) = x \tanh(\ln(1+e^x))
\end{aligned}$$

Here's a graph:

<img src="img/mish_activation_function_graph.png" title="weight" style="width: 480px;" />

Compared to other activation functions, you can see that Mish is closest to Swish:

<img src="img/mish_activation_function_compare_with_other_activation_functions.png" title="weight" style="width: 480px;" />

### Mish activation function code

Create a file `mish.py` and add your `Mish` class as follows.

In [None]:
import torch 
from torch import tanh
import torch.nn as nn
import torch.nn.functional as F 

class Mish(nn.Module):
     def __init__(self):
         super().__init__()

     def forward(self, x):
         return x * tanh(F.softplus(x))
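
As a quick sanity check on the lower bound of about -0.31, the following sketch evaluates the same formula in plain Python (no PyTorch needed) on a grid of points. The grid range and step size are arbitrary choices for illustration.

```python
import math

def softplus(x):
    # Numerically stable ln(1 + e^x)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

# Scan [-10, 10] in steps of 0.001 and locate the smallest output
xs = [i / 1000.0 for i in range(-10000, 10001)]
x_min = min(xs, key=mish)
print(f"minimum of mish on [-10, 10]: {mish(x_min):.4f} at x = {x_min:.3f}")
```

The dip below zero for moderately negative inputs is what makes the function non-monotonic, while for large positive inputs it behaves almost like the identity.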


### PyTorch DarkNet code modifications

Because YOLOv4 extends YOLOv3 in several ways, we need to add or modify some modules in `darknet.py`,
in both the `create_modules()` function and the `forward()` method:

- Add the overlapping *max pooling* operation used in the SPP block
- Add multipath feature map concatenation to the *route layer*
- Add the *Mish* activation function

#### Max pooling

In [None]:
# create_module()
# max pooling layer
elif x["type"] == "maxpool":
    stride = int(x["stride"])
    size = int(x["size"])
    max_pool = nn.MaxPool2d(size, stride, padding=size//2)
    module.add_module("maxpool_{}".format(index), max_pool)

In [None]:
# forward()
if module_type in ["convolutional", "upsample", "maxpool"]:
    x = self.module_list[i](x)
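
To see why `padding=size // 2` keeps the feature map size unchanged for the odd kernel sizes used in the SPP block, here is a small sketch of the standard output-size formula for pooling (assuming no dilation and no ceil mode). The helper name `pool_output_size` is hypothetical.

```python
def pool_output_size(n, kernel, stride, padding):
    """Output length along one axis for pooling without dilation or ceil mode."""
    return (n + 2 * padding - kernel) // stride + 1

for kernel in (5, 9, 13):
    print(kernel, pool_output_size(19, kernel, stride=1, padding=kernel // 2))
```

Note that the same padding rule would give an output one element larger than the input for an even kernel size with stride 1, so this shortcut relies on the kernel sizes being odd.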

#### Route layer

We also need to handle the route layer in `create_modules()`, because concatenating several feature maps changes the number of output filters at the route module.

In [None]:
# create_module()
# route layer
elif x["type"] == "route":
    x["layers"] = x["layers"].split(',')
    filters = 0
    
    for i in range(len(x["layers"])):
        pointer = int(x["layers"][i])
        if pointer > 0:
            filters += output_filters[pointer]
        else:
            filters += output_filters[index + pointer]
            
    route = EmptyLayer()
    module.add_module("route_{0}".format(index), route)

In [None]:
# forward()
elif module_type == "route":
    layers = module["layers"]
    layers = [int(a) for a in layers]
    maps = []
    for l in range(0, len(layers)):
        if layers[l] > 0:
            layers[l] = layers[l] - i
        maps.append(outputs[i + layers[l]])
    x = torch.cat((maps), 1)
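
The filter bookkeeping for a multi-input route can be isolated into a small standalone function. This sketch follows the same convention as the code above (positive references are absolute layer indices, the others are relative to the current layer); the function name and the channel counts are hypothetical.

```python
def route_output_filters(layers, index, output_filters):
    """Sum the channel counts of the layers that a route block concatenates."""
    total = 0
    for pointer in layers:
        # Positive pointers are absolute indices; others are relative to `index`
        source = pointer if pointer > 0 else index + pointer
        total += output_filters[source]
    return total

# Hypothetical channel counts for layers 0..6; the route block is layer 7
output_filters = [32, 64, 64, 128, 128, 128, 256]
print(route_output_filters([-1, -3, 2], index=7, output_filters=output_filters))
```

Here the route concatenates layers 6, 4, and 2, so the next convolution must expect 256 + 128 + 64 = 448 input channels.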

#### Mish activation layer

Having introduced the Mish activation above, we now add the module in `create_modules()`:

In [None]:
elif activation == "mish":
    activn = Mish()
    module.add_module("mish_{0}".format(index), activn)

### Full create_modules() function

And now, here is the full code for the `create_modules()` function, which should
work with the YOLOv4 configuration file. We've marked each modification with a note in the code.

In [None]:
# Do not forget to import the Mish activation at the top of darknet.py
from mish import Mish

# .
# .
# .
# And now create_module() function
def create_modules(blocks):
    net_info = blocks[0]     #Captures the information about the input and pre-processing    
    module_list = nn.ModuleList()
    prev_filters = 3
    output_filters = []
    
    for index, x in enumerate(blocks[1:]):
        module = nn.Sequential()
    
        #check the type of block
        #create a new module for the block
        #append to module_list
        
        #If it's a convolutional layer
        if (x["type"] == "convolutional"):
            #Get the info about the layer
            activation = x["activation"]
            try:
                batch_normalize = int(x["batch_normalize"])
                bias = False
            except:
                batch_normalize = 0
                bias = True
        
            filters= int(x["filters"])
            padding = int(x["pad"])
            kernel_size = int(x["size"])
            stride = int(x["stride"])
        
            if padding:
                pad = (kernel_size - 1) // 2
            else:
                pad = 0
        
            #Add the convolutional layer
            conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias = bias)
            module.add_module("conv_{0}".format(index), conv)
        
            #Add the Batch Norm Layer
            if batch_normalize:
                bn = nn.BatchNorm2d(filters)
                module.add_module("batch_norm_{0}".format(index), bn)
        
            #Check the activation. 
            #It is Linear, Leaky ReLU, or Mish for YOLO
            if activation == "leaky":
                activn = nn.LeakyReLU(0.1, inplace = True)
                module.add_module("leaky_{0}".format(index), activn)
                
            # Mish activation modification here
            elif activation == "mish":
                activn = Mish()
                module.add_module("mish_{0}".format(index), activn)

        
        #If it's an upsampling layer
        #We use nearest-neighbor upsampling with a scale factor of 2
        elif (x["type"] == "upsample"):
            stride = int(x["stride"])
            upsample = nn.Upsample(scale_factor = 2, mode = "nearest")
            module.add_module("upsample_{}".format(index), upsample)
            
        # route layer modification here
        elif (x["type"] == "route"):
            x["layers"] = x["layers"].split(',')
            filters = 0

            for i in range(len(x["layers"])):
                pointer = int(x["layers"][i])
                if  pointer > 0:
                    filters += output_filters[pointer]
                else:
                    filters += output_filters[index + pointer]

            route = EmptyLayer()
            module.add_module("route_{0}".format(index), route)
    
        #shortcut corresponds to skip connection
        elif x["type"] == "shortcut":
            shortcut = EmptyLayer()
            module.add_module("shortcut_{}".format(index), shortcut)
            
        #Yolo is the detection layer
        elif x["type"] == "yolo":
            mask = x["mask"].split(",")
            mask = [int(x) for x in mask]
    
            anchors = x["anchors"].split(",")
            anchors = [int(a) for a in anchors]
            anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors),2)]
            anchors = [anchors[i] for i in mask]
    
            detection = DetectionLayer(anchors)
            module.add_module("Detection_{}".format(index), detection)
        
        # Max pooling layer modification here
        elif x["type"] == "maxpool":
            stride = int(x["stride"])
            size = int(x["size"])
            max_pool = nn.MaxPool2d(size, stride, padding=size // 2)
            module.add_module("maxpool_{}".format(index), max_pool)
                              
        module_list.append(module)
        prev_filters = filters
        output_filters.append(filters)
        
    return (net_info, module_list)

### Full Darknet class

Here is the correspondingly modified `MyDarknet` class:

In [None]:
class MyDarknet(nn.Module):
    def __init__(self, cfgfile):
        super().__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)
        
    def forward(self, x, CUDA):
        modules = self.blocks[1:]
        outputs = {}   #We cache the outputs for the route layer
        
        write = 0
        for i, module in enumerate(modules):        
            module_type = (module["type"])
            
            # max pooling modification: "maxpool" is handled like the other plain modules
            if module_type in ["convolutional", "upsample", "maxpool"]:
                x = self.module_list[i](x)
            
            # route layer modification
            elif module_type == "route":
                layers = module["layers"]
                layers = [int(a) for a in layers]
                maps = []
                for l in range(0, len(layers)):
                    if layers[l] > 0:
                        layers[l] = layers[l] - i
                    maps.append(outputs[i + layers[l]])
                x = torch.cat((maps), 1)
                
    
            elif  module_type == "shortcut":
                from_ = int(module["from"])
                x = outputs[i-1] + outputs[i+from_]
    
            elif module_type == 'yolo':        
                anchors = self.module_list[i][0].anchors
                #Get the input dimensions
                inp_dim = int (self.net_info["height"])
        
                #Get the number of classes
                num_classes = int (module["classes"])
        
                #Transform 
                x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
                if not write:              #if no collector has been intialised. 
                    detections = x
                    write = 1
        
                else:       
                    detections = torch.cat((detections, x), 1)
        
            outputs[i] = x
        
        return detections

### Modifications to inference code

Here is code for the new inference test. Put it in `run_yolov4.py`. We have the following modifications:
1. Resize the input image to $608\times 608$ using the function `letterbox_image`.
2. Change the image format from BGR to RGB
3. Change the normalization factor for bounding boxes from 416 to 608
4. Transform the output back to BGR in order to save to a file using OpenCV
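
The role of the 608 normalization factor is easier to see in isolation. The sketch below maps a box from the padded square network input back to the original image by subtracting the padding, dividing by the scale, and clamping. The function name `letterbox_to_original` is hypothetical and is not part of `util.py`.

```python
def letterbox_to_original(box, orig_w, orig_h, inp_dim=608):
    """Map (x1, y1, x2, y2) from the padded square input back to the original image."""
    scale = min(inp_dim / orig_w, inp_dim / orig_h)
    pad_x = (inp_dim - scale * orig_w) / 2
    pad_y = (inp_dim - scale * orig_h) / 2
    x1, y1, x2, y2 = box
    unmapped = [(x1 - pad_x) / scale, (y1 - pad_y) / scale,
                (x2 - pad_x) / scale, (y2 - pad_y) / scale]
    # Clamp to the original image bounds
    limits = [orig_w, orig_h, orig_w, orig_h]
    return [min(max(v, 0.0), lim) for v, lim in zip(unmapped, limits)]

# An 800x600 image is scaled by 0.76 and padded by 76 pixels at the top and bottom
print(letterbox_to_original((0, 76, 608, 532), orig_w=800, orig_h=600))
```

If the factor were left at 416 while the network input is 608, both the scale and the padding would be wrong, and the boxes would be drawn in the wrong place.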

Here is sample code for `run_yolov4.py`:

In [None]:
from __future__ import division
import time
import torch 
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
import cv2 
from util import *
import argparse
import os 
import os.path as osp
from darknet import Darknet
import pickle as pkl
import pandas as pd
import random

images = "cocoimages"
batch_size = 1
confidence = 0.5
nms_thesh = 0.4
start = 0
CUDA = torch.cuda.is_available()

num_classes = 80
classes = load_classes("data/coco.names")

#Set up the neural network

print("Loading network.....")
model = Darknet("cfg/yolov4.cfg")
model.load_weights("yolov4.weights")
print("Network successfully loaded")

model.net_info["height"] = 608
model.net_info["width"] = 608
inp_dim = int(model.net_info["height"])
assert inp_dim % 32 == 0 
assert inp_dim > 32

#If there's a GPU available, put the model on GPU

if CUDA:
    model.cuda()

# Set the model in evaluation mode

model.eval()

read_dir = time.time()

# Detection phase

try:
    imlist = [osp.join(osp.realpath('.'), images, img) for img in os.listdir(images)]
except NotADirectoryError:
    imlist = []
    imlist.append(osp.join(osp.realpath('.'), images))
except FileNotFoundError:
    print ("No file or directory with the name {}".format(images))
    exit()
    
if not os.path.exists("des"):
    os.makedirs("des")

load_batch = time.time()
# loaded_ims = [letterbox_image(cv2.imread(x), (inp_dim, inp_dim)) for x in imlist]

img = cv2.imread(imlist[0])

print(type(img), img.shape)
img = letterbox_image(img, (inp_dim, inp_dim))
cv2.imwrite('test.jpg', img)
img = cv2.imread('test.jpg')
print(img.shape)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
# img = torch.from_numpy(img).float().div(255.0).unsqueeze(0)
# print(img.shape)
# img = torch.from_numpy(img.transpose(2, 0, 1)).float().div(255.0).unsqueeze(0)
print(img.shape)

loaded_ims = [img]


im_batches = list(map(prep_image, loaded_ims, [inp_dim for x in range(len(imlist))]))


im_dim_list = [(x.shape[1], x.shape[0]) for x in loaded_ims]
im_dim_list = torch.FloatTensor(im_dim_list).repeat(1,2)
print(im_dim_list)

leftover = 0
if (len(im_dim_list) % batch_size):
    leftover = 1

if batch_size != 1:
    num_batches = len(imlist) // batch_size + leftover            
    im_batches = [torch.cat((im_batches[i*batch_size : min((i +  1)*batch_size,
                        len(im_batches))]))  for i in range(num_batches)]  

write = 0

if CUDA:
    im_dim_list = im_dim_list.cuda()
    
start_det_loop = time.time()
for i, batch in enumerate(im_batches):
    # Load the image 
    start = time.time()
    if CUDA:
        batch = batch.cuda()
    with torch.no_grad():
        prediction = model(Variable(batch), CUDA)

    prediction = write_results(prediction, confidence, num_classes, nms_conf = nms_thesh)

    end = time.time()

    if type(prediction) == int:

        for im_num, image in enumerate(imlist[i*batch_size: min((i +  1)*batch_size, len(imlist))]):
            im_id = i*batch_size + im_num
            print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size))
            print("{0:20s} {1:s}".format("Objects Detected:", ""))
            print("----------------------------------------------------------")
        continue

    prediction[:,0] += i*batch_size    #transform the attribute from index in batch to index in imlist 

    if not write:                      #If we haven't initialised output
        output = prediction  
        write = 1
    else:
        output = torch.cat((output,prediction))

    for im_num, image in enumerate(imlist[i*batch_size: min((i +  1)*batch_size, len(imlist))]):
        im_id = i*batch_size + im_num
        objs = [classes[int(x[-1])] for x in output if int(x[0]) == im_id]
        print("{0:20s} predicted in {1:6.3f} seconds".format(image.split("/")[-1], (end - start)/batch_size))
        print("{0:20s} {1:s}".format("Objects Detected:", " ".join(objs)))
        print("----------------------------------------------------------")

    if CUDA:
        torch.cuda.synchronize()       
try:
    output
except NameError:
    print ("No detections were made")
    exit()

im_dim_list = torch.index_select(im_dim_list, 0, output[:,0].long())

scaling_factor = torch.min(model.net_info["height"]/im_dim_list,1)[0].view(-1,1)

output[:,[1,3]] -= (inp_dim - scaling_factor*im_dim_list[:,0].view(-1,1))/2
output[:,[2,4]] -= (inp_dim - scaling_factor*im_dim_list[:,1].view(-1,1))/2

output[:,1:5] /= scaling_factor

for i in range(output.shape[0]):
    output[i, [1,3]] = torch.clamp(output[i, [1,3]], 0.0, im_dim_list[i,0])
    output[i, [2,4]] = torch.clamp(output[i, [2,4]], 0.0, im_dim_list[i,1])
    
output_recast = time.time()
class_load = time.time()
colors = [[255, 0, 0], [255, 0, 0], [255, 255, 0], [0, 255, 0], [0, 255, 255], [0, 0, 255], [255, 0, 255]]

draw = time.time()

def write(x, results):
    c1 = tuple(x[1:3].int())
    c2 = tuple(x[3:5].int())
    img = results[int(x[0])]
    print(img.shape)
    print(type(img))
    # img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    # cv2.imshow('555',img)
    # cv2.waitKey(0)
    # cv2.destroyAllWindows()
    cls = int(x[-1])
    color = random.choice(colors)
    label = "{0}".format(classes[cls])
    cv2.rectangle(img, c1, c2,color, 1)
    t_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_PLAIN, 1 , 1)[0]
    c2 = c1[0] + t_size[0] + 3, c1[1] + t_size[1] + 4
    cv2.rectangle(img, c1, c2,color, -1)
    
    cv2.putText(img, label, (c1[0], c1[1] + t_size[1] + 4), cv2.FONT_HERSHEY_PLAIN, 1, [225,255,255], 1);
    return img


list(map(lambda x: write(x, loaded_ims), output))

det_names = pd.Series(imlist).apply(lambda x: "{}/det_{}".format("des",x.split("/")[-1]))

# The image is in RGB order at this point; convert back to BGR before saving with OpenCV
list(map(cv2.imwrite, det_names, [cv2.cvtColor(loaded_ims[0], cv2.COLOR_RGB2BGR)]))
end = time.time()

print("SUMMARY")
print("----------------------------------------------------------")
print("{:25s}: {}".format("Task", "Time Taken (in seconds)"))
print()
print("{:25s}: {:2.3f}".format("Reading addresses", load_batch - read_dir))
print("{:25s}: {:2.3f}".format("Loading batch", start_det_loop - load_batch))
print("{:25s}: {:2.3f}".format("Detection (" + str(len(imlist)) +  " images)", output_recast - start_det_loop))
print("{:25s}: {:2.3f}".format("Output Processing", class_load - output_recast))
print("{:25s}: {:2.3f}".format("Drawing Boxes", end - draw))
print("{:25s}: {:2.3f}".format("Average time_per_img", (end - load_batch)/len(imlist)))
print("----------------------------------------------------------")


torch.cuda.empty_cache()

You can download the weights for the model, `yolov4.weights`, from [this link](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights).

You'll also find the `yolov4.cfg` file in the [YOLOv4 GitHub repository](https://github.com/AlexeyAB/darknet) (in the `cfg` folder).

### YOLOv4 training

Now that we've added the components our YOLOv3 Darknet configuration transformer was missing for YOLOv4,
let's work on training. Training differs from inference in several ways: we need
to augment the images and labels, and we also need to compute the new loss functions.

The process of training requires these steps:

1. Load backbone weights pretrained on ImageNet
2. Implement training data augmentation
3. Convert ground truth annotations to targets for the three YOLO heads
4. Implement the appropriate loss functions (CIOU loss for bounding boxes, weighted binary cross entropy for confidence scores, and binary cross entropy for class scores)
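
As a starting point for the bounding box term in step 4, here is a minimal sketch of a CIoU loss for a single pair of boxes in plain Python. It follows the usual definition (IoU minus a center-distance penalty and an aspect-ratio penalty); the function name, the `(x1, y1, x2, y2)` box format, and the `eps` guard are assumptions. A training implementation would operate on batched tensors instead.

```python
import math

def ciou_loss(box, gt, eps=1e-9):
    """1 - CIoU for two boxes given as (x1, y1, x2, y2)."""
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]

    # Intersection over union
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Squared centre distance over squared diagonal of the smallest enclosing box
    dx = (box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2
    rho2 = dx ** 2 + dy ** 2
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)

print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

Unlike a plain IoU loss, this still gives a useful signal when the boxes do not overlap, because the center-distance term keeps growing as the boxes move apart.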

We'll help you get started. Use the `albumentations` package for augmentation and the `pycocotools` package for
dealing with the COCO dataset.
Create a file named `train_yolov4.py` and add the code below.

#### Load pretrained weights

In [None]:
print("Loading network.....")
model = Darknet("cfg/yolov4.cfg")
model.load_weights("csdarknet53-omega_final.weights", backbone_only=True)
print("Network successfully loaded")

You will need to modify your `load_weights` method to load only the backbone
layer weights from the CSP Darknet 53 weight file.

#### Image augmentation

Use the Albumentations package for augmentations. Augmentation is easy when we're doing
classification, because the target doesn't change. Object detection augmentation, however,
requires transforming both the input image and the output labels according to the type of
augmentation (for example, a horizontal flip requires mirroring the x coordinates of all bounding boxes).
Albumentations is a library that does this for you easily. Take a look at the
[Albumentations documentation](https://albumentations.ai/docs/) for more detail.

In [None]:
import albumentations as A

...

# Training dataset transformation
train_transform = A.Compose([
    A.SmallestMaxSize(256),
    A.RandomCrop(width=224, height=224),
    # A.HorizontalFlip(p=0.5),
    # A.RandomBrightnessContrast(p=0.2),
], bbox_params=A.BboxParams(format='coco', label_fields=['category_ids']),
)

# Validation dataset transformation
eval_transform = A.Compose([
    A.SmallestMaxSize(256),
    A.CenterCrop(width=224, height=224),
], bbox_params=A.BboxParams(format='coco', label_fields=['category_ids']),
)
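
To illustrate what the library has to do to the labels, here is a hand-written sketch of a horizontal flip applied to one box in COCO format, `[x_min, y_min, width, height]`. The helper `hflip_coco_bbox` is hypothetical and only shows the arithmetic; in the lab, Albumentations performs this for you.

```python
def hflip_coco_bbox(bbox, image_width):
    """Mirror a COCO-format box [x_min, y_min, width, height] about the vertical axis."""
    x_min, y_min, width, height = bbox
    # The old right edge becomes the new left edge, measured from the other side
    return [image_width - x_min - width, y_min, width, height]

print(hflip_coco_bbox([10, 20, 50, 80], image_width=224))
```

The width, height, and vertical position are unchanged; only the left edge moves.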

#### Create CustomCoco dataset class

Create a new class ``CustomCoco`` to get transformed bounding boxes with their labels.
The bounding box transformation is in the ``__getitem__()`` method.

In [None]:
if self.transform is not None:
    bboxes = list(obj['bbox'] for obj in target)
    category_ids = list(obj['category_id'] for obj in target)
    transformed = self.transform(image=img, bboxes=bboxes, category_ids=category_ids)
    img = transformed['image']
    bboxes = torch.Tensor(transformed['bboxes'])
    cat_ids = torch.Tensor(transformed['category_ids'])
    labels, bboxes = self.__create_label(bboxes, cat_ids.type(torch.IntTensor))

#### Create Ground truth formatting

Creating ground truth target tensors is a bit tricky.
Refer to [argusswift's yolov4 code](https://github.com/argusswift/YOLOv4-pytorch).


In [None]:
bboxes = np.array(bboxes)
class_inds = np.array(class_inds)
anchors = ANCHORS # all the anchors, grouped by scale
strides = np.array(STRIDES) # stride of each YOLO head
train_output_size = IP_SIZE / strides # grid size at each scale
anchors_per_scale = NUM_ANCHORS # number of anchors per scale

# print(train_output_size)

label = [
    np.zeros(
        (
            int(train_output_size[i]),
            int(train_output_size[i]),
            anchors_per_scale,
            5 + NUM_CLASSES,
        )
    )
    for i in range(3)
]
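
As a quick sanity check on the shapes, the sketch below recomputes the grid size of each scale from the constants used in this lab (`IP_SIZE = 224`, strides 8, 16, and 32, three anchors per scale, 80 classes). Each grid cell and anchor stores 4 box coordinates, 1 objectness value, and 80 class scores.

```python
import numpy as np

IP_SIZE = 224
STRIDES = np.array([8, 16, 32])
NUM_ANCHORS = 3
NUM_CLASSES = 80

# One grid per stride: the input size divided by the stride.
grid_sizes = (IP_SIZE // STRIDES).tolist()
label_shapes = [(g, g, NUM_ANCHORS, 5 + NUM_CLASSES) for g in grid_sizes]

print(grid_sizes)    # [28, 14, 7]
print(label_shapes)  # [(28, 28, 3, 85), (14, 14, 3, 85), (7, 7, 3, 85)]
```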

Next, flatten the label tensor of each scale and convert each scale's ground-truth bounding boxes to tensors:

In [None]:
flatten_size_s = int(train_output_size[2]) * int(train_output_size[2]) * anchors_per_scale
flatten_size_m = int(train_output_size[1]) * int(train_output_size[1]) * anchors_per_scale
flatten_size_l = int(train_output_size[0]) * int(train_output_size[0]) * anchors_per_scale

label_s = torch.Tensor(label[2]).view(1, flatten_size_s, 5 + NUM_CLASSES).squeeze(0)
label_m = torch.Tensor(label[1]).view(1, flatten_size_m, 5 + NUM_CLASSES).squeeze(0)
label_l = torch.Tensor(label[0]).view(1, flatten_size_l, 5 + NUM_CLASSES).squeeze(0)

bboxes_s = torch.Tensor(bboxes_xywh[2])
bboxes_m = torch.Tensor(bboxes_xywh[1])
bboxes_l = torch.Tensor(bboxes_xywh[0])

Concatenate the three sets of bounding box tensors:

In [None]:
sbboxes, mbboxes, lbboxes = bboxes_xywh

labels = torch.cat([label_l, label_m, label_s], 0)
bboxes = torch.cat([bboxes_l, bboxes_m, bboxes_s], 0)
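
To check the size of the concatenated label tensor, the sketch below repeats the flatten-and-concatenate step on zero arrays. It uses NumPy's `reshape` and `concatenate` as stand-ins for the `view` and `torch.cat` calls above, and assumes the grid sizes 28, 14, and 7 that follow from a 224-pixel input.

```python
import numpy as np

grid_sizes = [28, 14, 7]             # 224 / strides (8, 16, 32)
anchors_per_scale, channels = 3, 85  # 85 = 4 box + 1 objectness + 80 classes

labels = [np.zeros((g, g, anchors_per_scale, channels)) for g in grid_sizes]
# Collapse (grid, grid, anchor) into a single row dimension per scale.
flat = [lab.reshape(-1, channels) for lab in labels]
all_labels = np.concatenate(flat, axis=0)

print([f.shape[0] for f in flat])  # [2352, 588, 147]
print(all_labels.shape)            # (3087, 85)
```

Every image therefore yields 3087 candidate rows, most of which are all zeros because they have no object assigned.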

Here is the full code you can use for your ``CustomCoco`` class.

In [None]:
import json
import math
import os
import random
import sys
from typing import Any, Callable, Optional, Tuple

import cv2
import numpy as np
import torch
from PIL import Image
from torchvision.datasets import CocoDetection

sys.path.append("..")
sys.path.append("../utils")
# from . import data_augment as dataAug
# from . import tools

ANCHORS = [
            [[12, 16], [19, 36], [40, 28]],
            [[36, 75], [76, 55], [72, 146]],
            [[142, 110], [192, 243], [459, 401]]
]

STRIDES = [8, 16, 32]

IP_SIZE = 224
NUM_ANCHORS = 3
NUM_CLASSES = 80

with open('coco_cats.json') as js:
    data = json.load(js)["categories"]

cats_dict = {}
for i in range(0, 80):
    cats_dict[str(data[i]['id'])] = i



class CustomCoco(CocoDetection):
    def __init__(
            self,
            root: str,
            annFile: str,
            transform: Optional[Callable] = None,
            target_transform: Optional[Callable] = None,
            transforms: Optional[Callable] = None,
    ) -> None:

        super(CocoDetection, self).__init__(root, transforms, transform, target_transform)
        from pycocotools.coco import COCO
        self.coco = COCO(annFile)
        self.ids = list(sorted(self.coco.imgs.keys()))


    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        """
        Args:
            index (int): Index

        Returns:
            tuple: Tuple (image, target). target is the object returned by ``coco.loadAnns``.
        """
        coco = self.coco
        img_id = self.ids[index]
        ann_ids = coco.getAnnIds(imgIds=img_id)
        target = coco.loadAnns(ann_ids)
        # self.target = target

        path = coco.loadImgs(img_id)[0]['file_name']

        img = Image.open(os.path.join(self.root, path)).convert('RGB')
        img = np.array(img)

        category_ids = [obj['category_id'] for obj in target]
        bboxes = [obj['bbox'] for obj in target]

        # NOTE: labels are only built when a transform is supplied, so the
        # dataset must be constructed with one (as in train_yolov4.py).
        if self.transform is not None:
            transformed = self.transform(image=img, bboxes=bboxes, category_ids=category_ids)
            # No trailing comma here: keep img an array rather than a 1-tuple
            img = transformed['image']
            bboxes = torch.Tensor(transformed['bboxes'])
            cat_ids = torch.Tensor(transformed['category_ids'])
            labels, bboxes = self.__create_label(bboxes, cat_ids.type(torch.IntTensor))

        return img, labels, bboxes

    def __len__(self) -> int:
        return len(self.ids)

    def __create_label(self, bboxes, class_inds):
        """
        Label assignment: assign every ground-truth box of one image to anchors.

        1. Take each box in turn, convert it from COCO format (x_min, y_min, w, h)
           to center format (cx, cy, w, h), and scale it by each stride.
        2. Compute the IoU between the box and the anchors of each detection layer.
           Every anchor with IoU above 0.3 is assigned the box. If no anchor on any
           layer exceeds 0.3, the single anchor with the highest IoU is assigned.

        Notes:
        1. The same ground-truth box may be assigned to multiple anchors, on the
           same or on different detection layers.
        2. The number of stored boxes may therefore exceed the number of
           ground-truth boxes.
        """
        # print("Class indices: ", class_inds)
        bboxes = np.array(bboxes)
        class_inds = np.array(class_inds)
        anchors = ANCHORS # all the anchors
        strides = np.array(STRIDES) # list of strides
        train_output_size = IP_SIZE / strides # image with different scales
        anchors_per_scale = NUM_ANCHORS # anchor per scale

        # print(train_output_size)

        label = [
            np.zeros(
                (
                    int(train_output_size[i]),
                    int(train_output_size[i]),
                    anchors_per_scale,
                    5 + NUM_CLASSES,
                )
            )
            for i in range(3)
        ]
        # for i in range(3):
            # label[i][..., 5] = 1.0

        # 150 bounding box ground truths per scale
        bboxes_xywh = [
            np.zeros((150, 4)) for _ in range(3)
        ]  # Darknet the max_num is 30
        bbox_count = np.zeros((3,))

        for i in range(len(bboxes)):
            bbox_coor = bboxes[i][:4]
            bbox_class_ind = cats_dict[str(class_inds[i])]
            # bbox_mix = bboxes[i][5]

            # onehot
            one_hot = np.zeros(NUM_CLASSES, dtype=np.float32)
            one_hot[bbox_class_ind] = 1.0
            # one_hot_smooth = dataAug.LabelSmooth()(one_hot, self.num_classes)

            # convert COCO "xywh" (top-left corner, width, height) to center "xywh"
            bbox_xywh = np.concatenate(
                [
                    (0.5 * bbox_coor[2:] + bbox_coor[:2]) ,
                    bbox_coor[2:],
                ],
                axis=-1,
            )
            # print("bbox_xywh: ", bbox_xywh)
            
            bbox_xywh_scaled = (
                1.0 * bbox_xywh[np.newaxis, :] / strides[:, np.newaxis]
            )

            # print("bbox_xywhscaled: ", bbox_xywh_scaled)

            iou = []
            exist_positive = False
            # Use a separate index for the scales so the outer box index i is not shadowed
            for s in range(3):
                anchors_xywh = np.zeros((anchors_per_scale, 4))
                anchors_xywh[:, 0:2] = (
                    np.floor(bbox_xywh_scaled[s, 0:2]).astype(np.int32) + 0.5
                )  # 0.5 for compensation

                # assign all anchors
                anchors_xywh[:, 2:4] = anchors[s]

                iou_scale = iou_xywh_numpy(
                    bbox_xywh_scaled[s][np.newaxis, :], anchors_xywh
                )
                iou.append(iou_scale)
                iou_mask = iou_scale > 0.3

                if np.any(iou_mask):
                    xind, yind = np.floor(bbox_xywh_scaled[s, 0:2]).astype(
                        np.int32
                    )

                    # Known issue: when several boxes map to the same anchor,
                    # the anchor ends up assigned to the last box.
                    label[s][yind, xind, iou_mask, 0:4] = bbox_xywh * strides[s]
                    label[s][yind, xind, iou_mask, 4:5] = 1.0
                    label[s][yind, xind, iou_mask, 5:] = one_hot

                    # Known issue: 150 is a fixed prior value and costs memory
                    bbox_ind = int(bbox_count[s] % 150)
                    bboxes_xywh[s][bbox_ind, :4] = bbox_xywh * strides[s]
                    bbox_count[s] += 1

                    exist_positive = True

            if not exist_positive:
                # check if a ground truth bb have the best anchor with any scale
                best_anchor_ind = np.argmax(np.array(iou).reshape(-1), axis=-1)
                best_detect = int(best_anchor_ind / anchors_per_scale)
                best_anchor = int(best_anchor_ind % anchors_per_scale)

                xind, yind = np.floor(
                    bbox_xywh_scaled[best_detect, 0:2]
                ).astype(np.int32)

                label[best_detect][yind, xind, best_anchor, 0:4] = bbox_xywh * strides[best_detect]
                label[best_detect][yind, xind, best_anchor, 4:5] = 1.0
                # label[best_detect][yind, xind, best_anchor, 5:6] = bbox_mix
                label[best_detect][yind, xind, best_anchor, 5:] = one_hot 

                bbox_ind = int(bbox_count[best_detect] % 150)
                bboxes_xywh[best_detect][bbox_ind, :4] = bbox_xywh * strides[best_detect]
                bbox_count[best_detect] += 1

        flatten_size_s = int(train_output_size[2]) * int(train_output_size[2]) * anchors_per_scale
        flatten_size_m = int(train_output_size[1]) * int(train_output_size[1]) * anchors_per_scale
        flatten_size_l = int(train_output_size[0]) * int(train_output_size[0]) * anchors_per_scale

        label_s = torch.Tensor(label[2]).view(1, flatten_size_s, 5 + NUM_CLASSES).squeeze(0)
        label_m = torch.Tensor(label[1]).view(1, flatten_size_m, 5 + NUM_CLASSES).squeeze(0)
        label_l = torch.Tensor(label[0]).view(1, flatten_size_l, 5 + NUM_CLASSES).squeeze(0)

        bboxes_s = torch.Tensor(bboxes_xywh[2])
        bboxes_m = torch.Tensor(bboxes_xywh[1])
        bboxes_l = torch.Tensor(bboxes_xywh[0])
        # label_sbbox, label_mbbox, label_lbbox = label
        sbboxes, mbboxes, lbboxes = bboxes_xywh
        # print("label")
        labels = torch.cat([label_l, label_m, label_s], 0)
        bboxes = torch.cat([bboxes_l, bboxes_m, bboxes_s], 0)
        return labels, bboxes

    def __create_label_old(self, bboxes, class_inds):
        # anno = annotation.strip().split(" ")
        label = np.zeros(
                (
                    IP_SIZE,
                    IP_SIZE,
                    NUM_ANCHORS,
                    5 + NUM_CLASSES,
                )
                )

        # x y w h obj cls
        # label[..., 5] = 1.0 # objness = 1

        bboxes_xywh = np.zeros((450, 4))
        bbox_count = 0


        for i in range(len(bboxes)):
            bbox_coor = bboxes[i][:4]
            # print(bbox_coor)

            # onehot
            one_hot = np.zeros(NUM_CLASSES, dtype=np.float32)
            one_hot[class_inds[i]] = 1.0
            bbox_xywh  = np.concatenate(
                [
                    (0.5 * np.array(bbox_coor[2:]) + np.array(bbox_coor[:2])),
                    np.array(bbox_coor[2:]),
                ],
                axis=-1,
            )

            # print(bbox_xywh)

            iou = []
            anchors_xywh = np.zeros((NUM_ANCHORS, 4))
            anchors_xywh[:, 0:2] = (
                np.floor(bbox_xywh[ 0:2]).astype(np.int32) + 0.5
            )  # 0.5 for compensation
            anchors_xywh[:, 2:4] = ANCHORS

            iou_scale = iou_xywh_numpy(
                bbox_xywh, anchors_xywh
            )
            iou.append(iou_scale)
            iou_mask = iou_scale > 0.3

            exist_positive = False

            if np.any(iou_mask):
                xind, yind = np.floor(bbox_xywh[0:2]).astype(
                    np.int32
                )
                label[yind, xind, iou_mask, 0:4] = bbox_xywh
                label[yind, xind, iou_mask, 4:5] = 1.0
                # label[yind, xind, iou_mask, 5:6] = bbox_mix
                label[yind, xind, iou_mask, 5:] = one_hot

                # bbox_ind = int(bbox_count[i] % 150)  # BUG: 150 is a fixed prior value and costs memory
                bboxes_xywh[bbox_count, :4] = bbox_xywh
                # bbox_count[i] += 1
                bbox_count += 1

                exist_positive = True


            if not exist_positive:
                best_anchor_ind = np.argmax(np.array(iou).reshape(-1), axis=-1)
                # best_detect = int(best_anchor_ind / anchors_per_scale)
                # best_anchor = int(best_anchor_ind % anchors_per_scale)

                xind, yind = np.floor(
                    bbox_xywh[best_anchor_ind, 0:2]
                ).astype(np.int32)

                label[yind, xind, best_anchor_ind, 0:4] = bbox_xywh
                label[yind, xind, best_anchor_ind, 4:5] = 1.0
                # label[yind, xind, best_anchor, 5:6] = bbox_mix
                label[yind, xind, best_anchor_ind, 5:] = one_hot

                bbox_ind = bbox_count
                bboxes_xywh[bbox_ind, :4] = bbox_xywh
                bbox_count += 1

        # print('output')
        # print(label.shape ,bboxes_xywh.shape)
        return label, bboxes_xywh
    
def iou_xywh_numpy(boxes1, boxes2):
    """
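    :param boxes1: boxes1 and boxes2 may have different shapes, but they must be broadcastable
    :param boxes2: the last dimension must hold coordinates stored as (x, y, w, h), where (x, y) is the bbox center
    :return: the IoU of boxes1 and boxes2, with the shape of the broadcast inputs minus the last dimension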
    """
    boxes1 = np.array(boxes1)
    boxes2 = np.array(boxes2)
    # print(boxes1, boxes2)

    boxes1_area = boxes1[..., 2] * boxes1[..., 3]
    boxes2_area = boxes2[..., 2] * boxes2[..., 3]

    # Compute the top-left and bottom-right corners of boxes1 and boxes2.
    # Stored as (xmin, ymin, xmax, ymax): (xmin, ymin) is the top-left corner, (xmax, ymax) the bottom-right corner
    boxes1 = np.concatenate([boxes1[..., :2] - boxes1[..., 2:] * 0.5,
                                boxes1[..., :2] + boxes1[..., 2:] * 0.5], axis=-1)
    boxes2 = np.concatenate([boxes2[..., :2] - boxes2[..., 2:] * 0.5,
                                boxes2[..., :2] + boxes2[..., 2:] * 0.5], axis=-1)

    # Compute the top-left and bottom-right corners of the intersection of boxes1 and boxes2
    left_up = np.maximum(boxes1[..., :2], boxes2[..., :2])
    right_down = np.minimum(boxes1[..., 2:], boxes2[..., 2:])

    # When the boxes do not intersect, (right_down - left_up) < 0, so the maximum with 0 makes their IoU 0
    inter_section = np.maximum(right_down - left_up, 0.0)
    inter_area = inter_section[..., 0] * inter_section[..., 1]
    union_area = boxes1_area + boxes2_area - inter_area
    IOU = 1.0 * inter_area / union_area
    return IOU


def CIOU_xywh_torch(boxes1,boxes2):
    '''
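    Compute the CIoU of two boxes or two batches of boxes.
    Boxes are given in center format and converted to corners internally.
    :param boxes1: [cx, cy, w, h] or
                [[cx, cy, w, h], [cx, cy, w, h], ...]
    :param boxes2: same layout as boxes1
    :return: CIoU values, one per box pair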
    '''
    # cx cy w h->xyxy
    boxes1 = torch.cat([boxes1[..., :2] - boxes1[..., 2:] * 0.5,
                        boxes1[..., :2] + boxes1[..., 2:] * 0.5], dim=-1)
    boxes2 = torch.cat([boxes2[..., :2] - boxes2[..., 2:] * 0.5,
                        boxes2[..., :2] + boxes2[..., 2:] * 0.5], dim=-1)

    boxes1 = torch.cat([torch.min(boxes1[..., :2], boxes1[..., 2:]),
                        torch.max(boxes1[..., :2], boxes1[..., 2:])], dim=-1)
    boxes2 = torch.cat([torch.min(boxes2[..., :2], boxes2[..., 2:]),
                        torch.max(boxes2[..., :2], boxes2[..., 2:])], dim=-1)

    # (x2 minus x1 = width)  * (y2 - y1 = height)
    boxes1_area = (boxes1[..., 2] - boxes1[..., 0]) * (boxes1[..., 3] - boxes1[..., 1])
    boxes2_area = (boxes2[..., 2] - boxes2[..., 0]) * (boxes2[..., 3] - boxes2[..., 1])

    # upper left of the intersection region (x,y)
    inter_left_up = torch.max(boxes1[..., :2], boxes2[..., :2])

    # bottom right of the intersection region (x,y)
    inter_right_down = torch.min(boxes1[..., 2:], boxes2[..., 2:])

    # if there is overlapping we will get (w,h) else set to (0,0) because it could be negative if no overlapping
    inter_section = torch.max(inter_right_down - inter_left_up, torch.zeros_like(inter_right_down))
    inter_area = inter_section[..., 0] * inter_section[..., 1]
    union_area = boxes1_area + boxes2_area - inter_area
    ious = 1.0 * inter_area / union_area

    # cal outer boxes
    outer_left_up = torch.min(boxes1[..., :2], boxes2[..., :2])
    outer_right_down = torch.max(boxes1[..., 2:], boxes2[..., 2:])
    outer = torch.max(outer_right_down - outer_left_up, torch.zeros_like(inter_right_down))
    outer_diagonal_line = torch.pow(outer[..., 0], 2) + torch.pow(outer[..., 1], 2)

    # cal center distance
    # center x center y
    boxes1_center = (boxes1[..., :2] +  boxes1[...,2:]) * 0.5
    boxes2_center = (boxes2[..., :2] +  boxes2[...,2:]) * 0.5

    # euclidean distance
    # x1-x2 square 
    center_dis = torch.pow(boxes1_center[...,0]-boxes2_center[...,0], 2) +\
                 torch.pow(boxes1_center[...,1]-boxes2_center[...,1], 2)

    # cal penalty term
    # cal width,height
    boxes1_size = torch.max(boxes1[..., 2:] - boxes1[..., :2], torch.zeros_like(inter_right_down))
    boxes2_size = torch.max(boxes2[..., 2:] - boxes2[..., :2], torch.zeros_like(inter_right_down))
    v = (4 / (math.pi ** 2)) * torch.pow(
            torch.atan((boxes1_size[...,0]/torch.clamp(boxes1_size[...,1],min = 1e-6))) -
            torch.atan((boxes2_size[..., 0] / torch.clamp(boxes2_size[..., 1],min = 1e-6))), 2)

    alpha = v / (1-ious+v)

    #cal ciou
    cious = ious - (center_dis / outer_diagonal_line + alpha*v)

    return cious

def calculate_APs(iou_threshold, batches, targets):
    # Skeleton only: loads the validation annotations and prepares the result
    # containers. The per-class AP computation is left for the exercises.
    from pycocotools.coco import COCO
    coco = COCO('/root/COCO/annotations/instances_val2017.json')
    ids = list(sorted(coco.imgs.keys()))

    # coco.anns maps annotation id -> annotation dict
    target = coco.anns
    print(len(target))
    print(type(target))
    number_of_classes = 80

    # coco.anns is keyed by annotation id, not image id, so fetch each
    # image's annotations through getAnnIds/loadAnns.
    for img_id in ids:
        ann_ids = coco.getAnnIds(imgIds=img_id)
        print(coco.loadAnns(ann_ids))

    APs = {}
    recalls = {}
    precisions = {}
    # TODO: fill these in for each of the 80 classes
    return APs, recalls, precisions
            

### Add MSE Loss

Let's come back to the `train_yolov4.py` file. To begin with, add an MSE loss to the code.
In your homework, you'll want to replace MSE with CIoU loss.
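
To get a feel for what CIoU adds, here is a plain-Python sketch for a single pair of boxes in center format `(cx, cy, w, h)`. `ciou_xywh` is a hypothetical scalar helper that follows the same terms as the `CIOU_xywh_torch` function shown earlier (IoU, center distance over the enclosing-box diagonal, and an aspect-ratio penalty); it assumes strictly positive widths and heights.

```python
import math


def ciou_xywh(box1, box2):
    """CIoU between two boxes given as (cx, cy, w, h) with w, h > 0."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box1, box2
    # Corner coordinates (xmin, ymin, xmax, ymax)
    b1 = (x1 - w1 / 2, y1 - h1 / 2, x1 + w1 / 2, y1 + h1 / 2)
    b2 = (x2 - w2 / 2, y2 - h2 / 2, x2 + w2 / 2, y2 + h2 / 2)
    # Plain IoU
    iw = max(min(b1[2], b2[2]) - max(b1[0], b2[0]), 0.0)
    ih = max(min(b1[3], b2[3]) - max(b1[1], b2[1]), 0.0)
    inter = iw * ih
    iou = inter / (w1 * h1 + w2 * h2 - inter)
    # Squared center distance over the squared diagonal of the enclosing box
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term and its weight
    v = (4 / math.pi ** 2) * (math.atan(w1 / h1) - math.atan(w2 / h2)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return iou - (rho2 / c2 + alpha * v)


same = ciou_xywh((5, 5, 2, 2), (5, 5, 2, 2))     # identical boxes
apart = ciou_xywh((0, 0, 2, 2), (10, 0, 2, 2))   # no overlap at all
print(same, apart)
```

Identical boxes give a CIoU of 1, so the loss `1 - CIoU` is 0. For the two disjoint boxes the plain IoU is 0 wherever they are, whereas CIoU is negative and depends on how far apart the centers are, so the loss still changes as the predicted box moves toward the target.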

In [None]:
optimizer.zero_grad()
with torch.set_grad_enabled(True):
    outputs = model(inputs, True)

    pred_xywh = outputs[..., 0:4] / 224
    pred_conf = outputs[..., 4:5]
    pred_cls = outputs[..., 5:]

    label_xywh = labels[..., :4] / 224

    # label_xywh = torch.cat([label_xy, label_wh], dim=-1)
    label_obj_mask = labels[..., 4:5]
    label_noobj_mask = (1.0 - label_obj_mask)
    
    lambda_coord = 0.001
    lambda_noobj = 0.05
    label_cls = labels[..., 5:]
    loss = nn.MSELoss()
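
The masks and weights above are meant to restrict the coordinate loss to cells that actually contain an object. One possible way to combine them is sketched below in NumPy. This is an element-wise formulation (it corresponds to using `nn.MSELoss(reduction='none')` in PyTorch before masking) and is not necessarily the one used in the starter code; `masked_coord_mse` is a hypothetical helper.

```python
import numpy as np


def masked_coord_mse(pred_xywh, label_xywh, obj_mask, lambda_coord=0.001):
    """Weighted sum of squared coordinate errors over cells that contain an object."""
    sq_err = (pred_xywh - label_xywh) ** 2  # shape (N, 4)
    # obj_mask has shape (N, 1) and broadcasts over the four coordinates.
    return lambda_coord * np.sum(obj_mask * sq_err)


pred = np.array([[0.5, 0.5, 0.2, 0.2], [0.1, 0.1, 0.9, 0.9]])
label = np.array([[0.4, 0.5, 0.2, 0.4], [0.0, 0.0, 0.0, 0.0]])
mask = np.array([[1.0], [0.0]])  # only the first row holds an object

loss_value = masked_coord_mse(pred, label, mask, lambda_coord=1.0)
print(loss_value)
```

Only the first row contributes (0.1 squared plus 0.2 squared, about 0.05); the second row is ignored even though its prediction is far from its all-zero label.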

Here is the full code to start with for your ``train_yolov4.py`` file:

In [None]:
from __future__ import division
# Star import first so that the explicit imports below take precedence
from util import *

import argparse
import os
import os.path as osp
import pickle as pkl
import random
import time
from copy import copy, deepcopy

import albumentations as A
import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch.autograd import Variable
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Subset
from torchvision import datasets, models, transforms

from custom_coco import CIOU_xywh_torch, CustomCoco, calculate_APs
from darknet import Darknet
from train import train_model, evaluate_model

# Set device to GPU or CPU
gpu = "1"
device = torch.device("cuda:{}".format(gpu) if torch.cuda.is_available() else "cpu")

train_preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

path2data_train="/root/COCO/train2017"
path2json_train="/root/COCO/annotations/instances_train2017.json"

path2data_val="/root/COCO/val2017"
path2json_val="/root/COCO/annotations/instances_val2017.json"



train_transform = A.Compose([
    A.SmallestMaxSize(256),
    A.RandomCrop(width=224, height=224),
    # A.HorizontalFlip(p=0.5),
    # A.RandomBrightnessContrast(p=0.2),
], bbox_params=A.BboxParams(format='coco', label_fields=['category_ids']),
)

eval_transform = A.Compose([
    A.SmallestMaxSize(256),
    A.CenterCrop(width=224, height=224),
], bbox_params=A.BboxParams(format='coco', label_fields=['category_ids']),
)

# raw_train_dataset = torchvision.datasets.CocoDetection(root = path2data_train,
                                # annFile = path2json_train, transform=none_train_transform)

# train_dataset = torchvision.datasets.CocoDetection(root = path2data_train,
                                # annFile = path2json_train, transform=train_transform)
BATCH_SIZE = 10
val_dataset = Subset(CustomCoco(root = path2data_val,
                                annFile = path2json_val, transform=eval_transform), list(range(0,20)))

def collate_fn(batch):
    return tuple(zip(*batch))

val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=BATCH_SIZE,
                                            shuffle=False, num_workers=1, collate_fn=collate_fn)

# If a GPU is available, put the model on the GPU
# CUDA = torch.cuda.is_available()


print("Loading network.....")
model = Darknet("cfg/yolov4.cfg")
model.load_weights("csdarknet53-omega_final.weights", backbone_only=True)
print("Network successfully loaded")


model.to(device)

criterion = nn.CrossEntropyLoss()
params_to_update = model.parameters()
optimizer = optim.Adam(params_to_update, lr=0.001)
for e in range(0, 40):
    running_loss = 0.0
    for inputs, labels, bboxes in val_dataloader:
        inputs = torch.from_numpy(np.array(inputs)).squeeze(1).permute(0,3,1,2).float()
        inputs = inputs.to(device)
        labels = torch.stack(labels).to(device)
        
        running_corrects = 0

        # zero the parameter gradients so that they do not
        # accumulate from one batch to the next
        optimizer.zero_grad()
        with torch.set_grad_enabled(True):
            outputs = model(inputs, True)
            # pred_xy = outputs[..., :2] / 224
            # pred_wh = torch.sqrt(outputs[..., 2:4] / 224)

            pred_xywh = outputs[..., 0:4] / 224
            # pred_xywh = torch.cat([pred_xy, pred_wh], dim=-1)
            pred_conf = outputs[..., 4:5]
            pred_cls = outputs[..., 5:]


            # label_xy = labels[..., :2] / 224
            # label_wh = torch.sqrt(labels[..., 2:4] / 224)

            label_xywh = labels[..., :4] / 224

            # label_xywh = torch.cat([label_xy, label_wh], dim=-1)
            label_obj_mask = labels[..., 4:5]
            label_noobj_mask = (1.0 - label_obj_mask)  # * (
                # iou_max < self.__iou_threshold_loss
            # ).float()
            lambda_coord = 0.001
            lambda_noobj = 0.05
            label_cls = labels[..., 5:]
            loss = nn.MSELoss()
            loss_bce = nn.BCELoss()

            # ciou = CIOU_xywh_torch(p_d_xywh, label_xywh).unsqueeze(-1)

            loss_coord = lambda_coord * label_obj_mask * loss(input=pred_xywh, target=label_xywh)
            loss_conf = (label_obj_mask * loss_bce(input=pred_conf, target=label_obj_mask)) + \
                        (lambda_noobj * label_noobj_mask * loss_bce(input=pred_conf, target=label_obj_mask))
            loss_cls = label_obj_mask * loss_bce(input=pred_cls, target=label_cls)

            loss_coord = torch.sum(loss_coord)
            loss_conf = torch.sum(loss_conf)
            loss_cls = torch.sum(loss_cls)

            # print(pred_xywh.shape, label_xywh.shape)

            ciou = CIOU_xywh_torch(pred_xywh, label_xywh)
            # print(ciou.shape)
            ciou = ciou.unsqueeze(-1)
            # print(ciou.shape)
            # print(label_obj_mask.shape)
            loss_ciou = torch.sum(label_obj_mask * (1.0 - ciou))
            # print(loss_coord)
            loss =  loss_ciou +  loss_conf + loss_cls
            loss.backward()
            optimizer.step()
            # statistics
            running_loss += loss.item() * inputs.size(0)
            # print('Running loss')
            # print(loss_coord, loss_conf, loss_cls)
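    # running_loss sums loss * batch size, so divide by the number of samples
    epoch_loss = running_loss / len(val_dataset)
    print('Epoch {}: loss {:.4f}'.format(e, epoch_loss))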

    print(calculate_APs(0.5, None, None))

In case we missed anything, the full code containing these modifications from YOLO v3 can be downloaded from [here](yolov4.zip).

## Exercises

1. Follow the YOLO v3 and YOLO v4 implementations in PyTorch and get inference working for both models
   using the pretrained Darknet weights. This will give you a good understanding of YOLO in general
   and many of the specific improvements made in YOLO v4. For YOLO v4, pay attention to the main points:

   1. Implementation of the mish activation function
   2. Option for the maxpool layer in the `create_modules` function and in your model's `forward()` method.
   3. Enabling a `[route]` module to concatenate more than two previous layers
   4. Loading the pre-trained weights [provided by the authors](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights)
   5. Scale inputs to 608$\times$608 and make sure you're passing input channels in RGB order, not OpenCV's BGR order.

2. Train your implementation of YOLO v4. The COCO dataset is available on the AIT DS&AI GPU servers at `/home/fidji/mdailey/Datasets/COCO`.
   Here the purpose is not to get the best possible model (that would require implementing all
   of the "bag of freebies" training tricks described in the paper), but just the ones described here, to
   get a feel for their importance.
   This involves the following:
   
   1. Get a set of ImageNet pretrained weights for CSPDarknet53 [from the Darknet GitHub repository](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v4_pre/csdarknet53-omega_final.weights)
   2. Add a method to load the pretrained weights into the backbone portion of your PyTorch YOLOv4 model.
   3. Implement a basic `train_yolo` function similar to the `train_model` function you developed in previous
      labs for classifiers that preprocesses the input with basic augmentation transformations, converts the
      anchor-relative outputs to bounding box coordinates, computes MSE loss for the bounding box coordinates,
      backpropagates the loss, and takes a step for the optimizer. Use the recommended IoU thresholds to determine
      which predicted bounding boxes to include in the loss. You will find many examples of how to do this
      online.
   4. Train your model on COCO. Training on the full dataset to completion would take several days, so you can stop early after verifying
      the model is learning in the first few epochs.
   5. Compute mAP for your model on the COCO validation set.
   6. Implement the CIoU loss function and observe its effect on mAP.
   7. (Optional) Train on COCO to completion and see how close you can get to the mAP reported in the paper.

   There is some useful information on working with the COCO dataset as a
   Torchvision Dataset in [this blog](https://medium.com/howtoai/pytorch-torchvision-coco-dataset-b7f5e8cad82).
   Use the instructor's copy of the COCO training and validation datasets from the shared network drive
   so that we don't use resources for multiple copies of the dataset. Once you have access, the dataset is ready to use as is.

3. The state of the art in fast single-stage object detection in 2022 appears to be YOLOR.
   Take a look at the [paper](https://arxiv.org/abs/2105.04206) and the [PyTorch source code](https://github.com/WongKinYiu/yolor).
   Execute the same exercise as above to modify your PyTorch Darknet tools to read the YOLOR configuration file and get the model working
   first for inference then for training. We will execute this task collaboratively and share experience on the class discussion board as
   we go along.
   
   Note that YOLOR is built on top of [Scaled YOLOv4](https://arxiv.org/abs/2011.08036), in which several different versions of the
   CSP Darknet backbone are created for different speed/accuracy tradeoffs. We probably want to start with YOLOR P6, which is based on
   YOLOv4-P6, which obtains 54.3% COCO mAP at 30 fps with resolution 1280$\times$1280.

   We want you to modify the YOLOv4 code you already developed to work with the YOLOR model. Don't just replace your code with the YOLOR
   repository code. Instead, try to trace the execution of the image preparation, `create_modules()` function, `forward()` method, and
   so on.
   
   You'll need to make at least the following modifications to make your code run the same as YOLOR:
   1. Allow flexible input image size for the letterboxing, with input size 1280 in the longest dimension and a multiple of 64 pixels in the
      other dimension.
   2. Since the YOLOR weight files are PyTorch pickle files, you'll need to name the convolutional and implicit latent parameter layers
      the same as in the YOLOR weight file so that you can load the state dictionary as with other PyTorch models.
   3. Utilize the YOLOR non-maximum suppression function, as it is much faster on the YOLOR output tensors than the one we gave
      you in this manual.
   4. Utilize the YOLOR conversion of model output tensors to bounding box coordinates, confidence scores, and class scores. The organization
      and post processing is different from our YOLOv4 configuration, so your `predict_transform()` function won't work as is.
      
   Good luck!
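
As a reference point for the mish activation mentioned in Exercise 1, here is a scalar sketch in plain Python. Mish is defined as `x * tanh(softplus(x))`, where `softplus(x) = ln(1 + e^x)`. The helper name and the cutoff of 20 for skipping the exponential are choices made for this illustration. In a PyTorch module, the same expression can be written as `x * torch.tanh(torch.nn.functional.softplus(x))`.

```python
import math


def mish(x):
    """Mish activation for a single float: x * tanh(softplus(x))."""
    # For large x, softplus(x) is essentially x, so skip exp() to avoid overflow.
    softplus = x if x > 20 else math.log1p(math.exp(x))
    return x * math.tanh(softplus)


for value in (-1.0, 0.0, 1.0, 30.0):
    print(value, round(mish(value), 4))
```

Unlike ReLU, mish is smooth and lets small negative values through (about -0.303 at x = -1), while behaving almost like the identity for large positive inputs.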
