In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Model

The 1-channel model built from `yolov5s.yaml` has the following architecture:



```
Model(
  (model): Sequential(
  # BEGIN: backbone -----------------------------------------------------------------------------------------
    (0): Focus()  # Focus wh information into c-space
    
    (1): Conv()  # Every "Conv" module computes SiLU(BatchNorm2d(Conv2d(x)))
    
    (2): C3()  # More involved (explained below): three Conv modules plus a stack of Bottleneck modules
    
    (3): Conv()
    
    (4): C3()
    
    (5): Conv()
    
    (6): C3()
    
    (7): Conv()
    
    (8): SPP()  # Spatial Pyramid Pooling (explained below)
    
    (9): C3()
  # END: backbone -------------------------------------------------------------------------------------------
  
    
  # BEGIN: head ---------------------------------------------------------------------------------------------
    (10): Conv()
    
    (11): Upsample(scale_factor=2.0, mode=nearest)  # This module is from PyTorch, but I'm also explaining it below
    
    (12): Concat()
    
    (13): C3()
    
    (14): Conv()
    
    (15): Upsample(scale_factor=2.0, mode=nearest)
    
    (16): Concat()
    
    (17): C3()
    
    (18): Conv()
    
    (19): Concat()
    
    (20): C3()
    
    (21): Conv()
    
    (22): Concat()
    
    (23): C3()
    
    (24): Detect(  # The detection is done on three scales of feature maps; also
                   #  this is where all the anchors live
      (m): ModuleList(
        (0): Conv2d(128, 255, kernel_size=(1, 1), stride=(1, 1))
        (1): Conv2d(256, 255, kernel_size=(1, 1), stride=(1, 1))
        (2): Conv2d(512, 255, kernel_size=(1, 1), stride=(1, 1))
      )
    )
  # END: head -----------------------------------------------------------------------------------------------
  )
)
```
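
The 255 output channels of each `Detect` convolution are consistent with the usual YOLO layout of anchors × (classes + 5). A quick check, assuming 3 anchors per scale and 80 classes (COCO-style defaults; neither number is stated in the printout above):

```python
# Assumed values, not taken from the printout: COCO-style defaults.
num_anchors = 3   # anchors per detection scale
num_classes = 80  # object classes
box_fields = 5    # x, y, w, h, objectness

out_channels = num_anchors * (num_classes + box_fields)
print(out_channels)  # 255
```

With a different number of classes, the same formula would give a different channel count for the three `Conv2d` layers.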

## Explanation of the `Focus`, `Conv`, `C3`, `SPP` and `Upsample` modules

### `Focus` wh information into c-space

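`Focus` moves every 2×2 spatial block into the channel dimension (so the channel count quadruples while width and height halve) and then applies a `Conv`. It's easiest to just paste in the implementation: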

```python
import torch
import torch.nn as nn


# Relies on the Conv module described in the next section
class Focus(nn.Module):
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):
        # ch_in, ch_out, kernel, stride, padding, groups
        super(Focus, self).__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(
            torch.cat([
                x[..., ::2, ::2],
                x[..., 1::2, ::2],
                x[..., ::2, 1::2],
                x[..., 1::2, 1::2]
            ], 1))
```
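
To see what the four strided slices do, here is a small sketch that applies only the slicing and concatenation step (without the trailing `Conv`) to a toy 4×4 input. The tensor `x` and the name `patches` are illustrative, not part of the model:

```python
import torch

# A tiny 1-channel "image", shape (batch, channels, 4, 4), holding 0..15
x = torch.arange(16.0).reshape(1, 1, 4, 4)

# The same four strided slices that Focus.forward concatenates
patches = torch.cat([
    x[..., ::2, ::2],    # even index on dim -2, even index on dim -1
    x[..., 1::2, ::2],   # odd index on dim -2, even index on dim -1
    x[..., ::2, 1::2],   # even index on dim -2, odd index on dim -1
    x[..., 1::2, 1::2],  # odd index on dim -2, odd index on dim -1
], dim=1)

print(patches.shape)  # torch.Size([1, 4, 2, 2])
```

Each output channel holds one of the four positions of every 2×2 block, so no pixel is discarded; the information is only rearranged.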

### `Conv`

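Each `Conv` module is a bias-free `Conv2d` followed by `BatchNorm2d` and a `SiLU` activation. For example, a 1×1 instance mapping 512 channels to 256 prints as: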

```
    Conv(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU()
    )
```
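
A minimal sketch of such a module, reconstructed from the printout above. `ConvSketch` is a hypothetical name, and the `k // 2` padding is an assumption of this sketch rather than something the printout shows:

```python
import torch
import torch.nn as nn


class ConvSketch(nn.Module):
    """Hypothetical stand-in for Conv: Conv2d -> BatchNorm2d -> SiLU."""

    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        # Padding of k // 2 keeps the spatial size for odd kernels at stride 1
        # (an assumption of this sketch).
        self.conv = nn.Conv2d(c1, c2, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2, eps=0.001, momentum=0.03)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


layer = ConvSketch(512, 256).eval()
with torch.no_grad():
    y = layer(torch.zeros(1, 512, 20, 20))
print(y.shape)  # torch.Size([1, 256, 20, 20])
```

The convolution has no bias because the following `BatchNorm2d` already learns a per-channel shift.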


### CSP Bottleneck with 3 convolutions (`C3`)

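A `C3` module holds three `Conv` modules and a `Sequential` stack of `Bottleneck` modules (a single one in this instance), each built from two further `Conv` modules: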

```
    C3(
      (cv1): Conv() # The one detailed above
      
      (cv2): Conv()
      
      (cv3): Conv()
      
      (m): Sequential(
        (0): Bottleneck(
          (cv1): Conv()
          
          (cv2): Conv()
        )
      )
    )
```
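
The printout lists the submodules but not how they are wired. The sketch below shows one plausible forward pass, based on my reading of the YOLOv5 source: `cv1` feeds the bottlenecks, `cv2` is a parallel branch, and `cv3` fuses the two. The names `conv`, `BottleneckSketch` and `C3Sketch`, the 0.5 channel ratio and the residual shortcut are all assumptions of this sketch:

```python
import torch
import torch.nn as nn


def conv(c1, c2, k=1):
    # Conv2d -> BatchNorm2d -> SiLU, mirroring the Conv module
    return nn.Sequential(
        nn.Conv2d(c1, c2, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c2),
        nn.SiLU(),
    )


class BottleneckSketch(nn.Module):
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cv1 = conv(c, c, 1)  # 1x1 conv
        self.cv2 = conv(c, c, 3)  # 3x3 conv
        self.shortcut = shortcut

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.shortcut else y  # optional residual connection


class C3Sketch(nn.Module):
    def __init__(self, c1, c2, n=1, shortcut=True):
        super().__init__()
        c_ = c2 // 2  # hidden channels (assumed ratio of 0.5)
        self.cv1 = conv(c1, c_)
        self.cv2 = conv(c1, c_)
        self.cv3 = conv(2 * c_, c2)
        self.m = nn.Sequential(*[BottleneckSketch(c_, shortcut) for _ in range(n)])

    def forward(self, x):
        # Branch 1: cv1, then the bottlenecks. Branch 2: cv2 only. cv3 fuses both.
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))


block = C3Sketch(64, 64, n=1).eval()
with torch.no_grad():
    out = block(torch.zeros(1, 64, 16, 16))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```

Under this wiring, only half of the hidden channels pass through the bottlenecks, which is the "cross stage partial" idea behind the name.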


### Spatial Pyramid Pooling (`SPP`)

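`SPP` places three stride-1 max-pooling layers (kernel sizes 5, 9 and 13) between two 1×1 `Conv` modules. `cv2` takes 1024 = 4 × 256 input channels because the output of `cv1` is concatenated with its three pooled versions: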

```
    SPP(
      (cv1): Conv(
        (conv): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(256, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU()
      )
      (cv2): Conv(
        (conv): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (bn): BatchNorm2d(512, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
        (act): SiLU()
      )
      (m): ModuleList(
        (0): MaxPool2d(kernel_size=5, stride=1, padding=2, dilation=1, ceil_mode=False)
        (1): MaxPool2d(kernel_size=9, stride=1, padding=4, dilation=1, ceil_mode=False)
        (2): MaxPool2d(kernel_size=13, stride=1, padding=6, dilation=1, ceil_mode=False)
      )
    )
```
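
A minimal sketch of how these pieces could fit together, with channel counts taken from the printout above. `SPPSketch` and `conv` are hypothetical names, and the forward pass reflects my reading of the YOLOv5 source rather than anything the printout shows:

```python
import torch
import torch.nn as nn


def conv(c1, c2):
    # 1x1 Conv2d -> BatchNorm2d -> SiLU, mirroring the Conv module
    return nn.Sequential(nn.Conv2d(c1, c2, 1, bias=False), nn.BatchNorm2d(c2), nn.SiLU())


class SPPSketch(nn.Module):
    def __init__(self, c1=512, c2=512, k=(5, 9, 13)):
        super().__init__()
        c_ = c1 // 2                            # 256 hidden channels
        self.cv1 = conv(c1, c_)
        self.cv2 = conv(c_ * (len(k) + 1), c2)  # 4 * 256 = 1024 inputs
        # Stride 1 with padding k // 2 keeps the spatial size for odd k
        self.m = nn.ModuleList([nn.MaxPool2d(kernel_size=x, stride=1, padding=x // 2) for x in k])

    def forward(self, x):
        x = self.cv1(x)
        # Concatenate the unpooled map with its three pooled versions
        return self.cv2(torch.cat([x] + [m(x) for m in self.m], dim=1))


spp = SPPSketch().eval()
with torch.no_grad():
    out = spp(torch.zeros(1, 512, 20, 20))
print(out.shape)  # torch.Size([1, 512, 20, 20])
```

Because every pooling layer preserves height and width, the four feature maps can be concatenated along the channel axis, mixing receptive fields of several sizes at the same resolution.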


### `Upsample`

Straight from the `torch.nn.Upsample` docstring:

    Upsamples a given multi-channel 1D (temporal), 2D (spatial) or 3D (volumetric) data.

    The input data is assumed to be of the form
    `minibatch x channels x [optional depth] x [optional height] x width`.
    (...)

    The algorithms available for upsampling are nearest neighbor and linear,
    bilinear, bicubic and trilinear for 3D, 4D and 5D input Tensor,
    respectively.

    One can either give a :attr:`scale_factor` or the target output :attr:`size` to
    calculate the output size. (You cannot give both, as it is ambiguous)
    (...)
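
As a quick illustration, here is the same configuration the model uses (`scale_factor=2.0`, `mode="nearest"`) applied to a toy 2×2 input. The tensor values are illustrative only:

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2.0, mode="nearest")

x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])  # shape (1, 1, 2, 2)
y = up(x)

print(y.shape)  # torch.Size([1, 1, 4, 4])
```

With nearest-neighbor mode, each input value is simply repeated in a 2×2 block, so the layer has no learnable parameters.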