# EfficientNet V1 From Scratch

In this notebook I'll build an EfficientNet from scratch, then attempt to reproduce the steps the paper follows to reach its final results.
To keep things manageable, I'll work through the project in these steps:

- [ ]  Implement an MBConv layer
- [ ]  Implement an EfficientNet baseline
- [ ]  Test EfficientNet on fastai's Imagenette using the paper configuration and compare with ResNet
- [ ]  Implement neural architecture search


EfficientNets are similar to ResNets in that they consist of bottleneck layers. However, EfficientNets use <strong>MBConv (Mobile Inverted Bottleneck Convolutional Blocks)</strong> followed by <strong>SEBlocks (Squeeze and Excitation Blocks)</strong>.

## MBConv block
### What is an MBConv block?

An MBConv block is the building block of MobileNetV2. Here is an excerpt from the MobileNetV2 paper:

> Our network pushes the state
of the art for mobile tailored computer vision models,
by significantly decreasing the number of operations and
memory needed while retaining the same accuracy.

An MBConv layer module is an <strong><em>inverted residual with linear bottleneck</em></strong>.

This definition needs a little bit of dissection. First let's take a look at a residual block.

![residual block](https://miro.medium.com/max/1140/1*D0F3UitQ2l5Q0Ak-tjEdJg.png)

A residual block (bottleneck layer) is a module that adds its input to the final output of the block. It uses ReLU non-linearities. In the ResNet architecture the number of channels generally grows with depth, while inside each block the bottleneck first reduces the number of channels and then increases it again.

![resnet](https://www.researchgate.net/publication/336642248/figure/fig1/AS:839151377203201@1577080687133/Original-ResNet-18-Architecture.png)

#### So how does an MBConv differ from a residual block?

1. The channel depth increases inside the block and decreases again at the output
2. ReLU is used only in the expansion part of the block, not on the output layer
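
The first difference can be made concrete by tracing channel counts through both kinds of block. This is a minimal sketch in plain PyTorch; the widths (256 and 64 for the residual bottleneck, 24 and 144 for the inverted one, i.e. an expansion factor of 6) are chosen for illustration and are not taken from this notebook, and `channel_trace` is a helper name made up here. Activations and normalization are left out because only the shapes matter.

```python
import torch
from torch import nn

def channel_trace(layers, x):
    """Return the channel count of the input and after each layer (illustrative helper)."""
    trace = [x.shape[1]]
    for layer in layers:
        x = layer(x)
        trace.append(x.shape[1])
    return trace

# Classic residual bottleneck: wide -> narrow -> narrow -> wide
bottleneck = [nn.Conv2d(256, 64, 1), nn.Conv2d(64, 64, 3, padding=1), nn.Conv2d(64, 256, 1)]
# Inverted residual: narrow -> wide -> wide -> narrow
inverted = [nn.Conv2d(24, 144, 1), nn.Conv2d(144, 144, 3, padding=1), nn.Conv2d(144, 24, 1)]

bottleneck_trace = channel_trace(bottleneck, torch.rand(1, 256, 8, 8))
inverted_trace = channel_trace(inverted, torch.rand(1, 24, 8, 8))
print(bottleneck_trace)  # [256, 64, 64, 256]
print(inverted_trace)    # [24, 144, 144, 24]
```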

<img src="https://production-media.paperswithcode.com/methods/Screen_Shot_2020-06-06_at_10.08.25_PM.png" width="400" align="center" />

### Why use an MBConv Layer?

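1. It needs fewer operations and less memory than a standard block while retaining the same accuracy
2. It reduces computation time, which matters on mobile hardware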
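
In MobileNetV2 much of the saving comes from making the 3x3 convolution depthwise, that is, one filter per channel instead of one filter per input-output channel pair. A back-of-the-envelope weight count shows the difference; it assumes 144 channels (an illustrative width) and ignores biases and batch norm, and `conv_weights` is a helper name made up for this sketch.

```python
def conv_weights(c_in, c_out, ks, groups=1):
    """Number of weights in a 2D convolution without bias."""
    return (c_in // groups) * c_out * ks * ks

channels = 144  # illustrative expanded width (24 * 6)
regular = conv_weights(channels, channels, ks=3)
depthwise = conv_weights(channels, channels, ks=3, groups=channels)
print(regular, depthwise, regular // depthwise)  # 186624 1296 144
```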

### Now let's implement it using fastai

In [1]:
from fastai.vision.all import *
from torch.nn.modules.activation import ReLU6

First let's build only the bottleneck convolutional layers. According to the paper, they should look like this:

<img src="https://miro.medium.com/max/1400/1*mFKsFp9fi8LDaflCu8Tyhg@2x.png" width="400px" align="center"/>

This translates to:
1. A 1x1 convolutional layer with a ReLU6 non-linearity that expands the depth of the input without changing its width and height
2. A 3x3 convolutional layer with a ReLU6 non-linearity that reduces the spatial dimensions of the previous output (when the stride is greater than 1) while keeping the expanded depth. In the paper this layer is depthwise
3. A 1x1 convolutional layer that shrinks the depth of the previous output, with no non-linearity applied

In [2]:
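class MBConvBlock(Module):
    def __init__(self, ni, nf, ks=3, stride=1, t=6):
        # NOTE: `t` (the expansion factor) is accepted but not used yet; the expansion is hard-coded to 6.
        self.convs = nn.Sequential(
            ConvLayer(ni, ni*6, ks=1, act_cls=ReLU6),                    # 1x1 expansion
            ConvLayer(ni*6, ni*6, ks=ks, stride=stride, act_cls=ReLU6),  # spatial conv (regular, not depthwise)
            ConvLayer(ni*6, nf, ks=1, act_cls=None)                      # 1x1 linear projection
        )
        
    def forward(self, x):
        return self.convs(x)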

To test this block, I'll make an input tensor to check the outputs.

In [3]:
w = torch.rand((1, 3))
w[..., None, None].shape

torch.Size([1, 3, 1, 1])

In [4]:
input_tensor = torch.rand((1, 3, 28, 28))
print(input_tensor.shape)

torch.Size([1, 3, 28, 28])


In [5]:
mbconv = MBConvBlock(3, 6, stride=2)

In [6]:
output_tensor = mbconv(input_tensor)
print(output_tensor.shape)

torch.Size([1, 6, 14, 14])


Now we have to add three things to this class:
1. An average pooling layer on the shortcut when a stride of 2 is used
2. A 1x1 convolutional layer that converts the input tensor depth to the output tensor depth so the two can be added
3. A squeeze-and-excitation module
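
For the first item, the pooled shortcut has to match the spatial size of the stride-2 convolution even when the input size is odd. The small check below, in plain PyTorch, suggests why `ceil_mode=True` is the right setting for that; it assumes a 3x3 convolution with padding 1 and an illustrative 7x7 input.

```python
import torch
from torch import nn

x = torch.rand(1, 3, 7, 7)  # odd spatial size on purpose

conv = nn.Conv2d(3, 3, kernel_size=3, stride=2, padding=1)
pool_ceil = nn.AvgPool2d(2, ceil_mode=True)
pool_floor = nn.AvgPool2d(2)  # default: ceil_mode=False

conv_hw = tuple(conv(x).shape[-2:])
ceil_hw = tuple(pool_ceil(x).shape[-2:])
floor_hw = tuple(pool_floor(x).shape[-2:])
print(conv_hw, ceil_hw, floor_hw)  # (4, 4) (4, 4) (3, 3)
```

With the default `ceil_mode=False` the shortcut would come out one pixel smaller than the convolution path, and the addition would fail.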

In [7]:
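class SEBlock(nn.Sequential):
    "Squeeze-and-excitation: returns one weight in (0, 1) per channel."
    def __init__(self, ni, r=8):
        super().__init__(*[
            nn.AdaptiveAvgPool2d(1),   # squeeze: global average per channel
            Flatten(),
            nn.Linear(ni, ni//r),      # reduce the channel count by ratio r
            nn.ReLU(),
            nn.Linear(ni//r, ni),      # restore the channel count
            nn.Sigmoid()
        ])

class MBConvBlock(Module):
    def __init__(self, ni, nf, ks=3, stride=1, t=6, r=16):
        # NOTE: `t` and `r` are accepted but not used: the expansion is hard-coded to 6
        # and SEBlock falls back to its own default r=8.
        self.convs = nn.Sequential(
            ConvLayer(ni, ni*6, ks=1, act_cls=ReLU6),
            ConvLayer(ni*6, ni*6, ks=ks, stride=stride, act_cls=ReLU6),
            ConvLayer(ni*6, nf, ks=1, act_cls=None)
        )
        # Shortcut path: pool when the main path downsamples, 1x1 conv when the channel count changes
        self.pool = noop if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.idconv = noop if ni == nf else ConvLayer(ni, nf, ks=1, act_cls=None)
        self.seblock = SEBlock(nf)
        
    def forward(self, x):
        conv_out = self.convs(x)
        w = self.seblock(conv_out)  # [batch, nf] channel weights
        # Broadcast the weights over height and width, then add the shortcut
        return conv_out*w[..., None, None] + self.idconv(self.pool(x))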

In [8]:
mbconv = MBConvBlock(3, 6, stride=2)
mbconv(input_tensor).shape



torch.Size([1, 6, 14, 14])
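
The `w[..., None, None]` indexing in `forward` reshapes the per-channel SE weights from `[batch, channels]` to `[batch, channels, 1, 1]` so that they broadcast over height and width. A small standalone check, using made-up weights rather than the output of a real `SEBlock`:

```python
import torch

feats = torch.ones(1, 3, 2, 2)        # [batch, channels, height, width]
w = torch.tensor([[0.0, 0.5, 1.0]])   # one weight per channel, shape [1, 3]

scaled = feats * w[..., None, None]   # w is viewed as [1, 3, 1, 1] and broadcasts
print(scaled.shape)                   # torch.Size([1, 3, 2, 2])
print(scaled[0, :, 0, 0].tolist())    # [0.0, 0.5, 1.0]
```

Every pixel of a channel is multiplied by that channel's weight, which is how the block rescales whole feature maps.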

Now that we have a working MBConv layer, let's build the baseline EfficientNet-B0.

At first I'll build it without any compound scaling capability.

<img src="https://miro.medium.com/max/1400/0*6ezHy0HX_lCrJGRS" width="500"/>

In [9]:
class EfficientNetB0(nn.Sequential):
    def __init__(self, n_out, layers):
        stem = ConvLayer(ni=3, nf=32, ks=3, stride=1)
        # Per-stage configuration: channel widths (stem, 7 stages, head), kernel sizes,
        # expansion factors and strides
        self.block_szs = [32, 16, 24, 40, 80, 112, 192, 320, 1280]
        self.block_ks = [3, 3, 5, 3, 5, 5, 3]
        self.block_ts = [1] + [6]*6
        self.strides = [2, 1, 2, 2, 2, 1, 2]
        
        # `layers[i]` is the number of MBConv blocks in stage i
        blocks = [self._make_layer(*o) for o in enumerate(layers)]
        super().__init__(*stem, *blocks, 
                         ConvLayer(self.block_szs[-2], self.block_szs[-1], ks=1),
                         nn.AdaptiveAvgPool2d(1), Flatten(),
                         nn.Linear(self.block_szs[-1], n_out))
        
    def _make_layer(self, i, n_layers):
        ni, nf = self.block_szs[i], self.block_szs[i+1]
        ks, t = self.block_ks[i], self.block_ts[i]
        stride = self.strides[i]
        
        # Only the first block of a stage changes the channel count;
        # note that every block in the stage receives the same `stride`.
        return nn.Sequential(*[
            MBConvBlock(ni if j == 0 else nf, nf, ks, stride, t)
            for j in range(n_layers)
        ])

Now let's make a model and see how it looks.

In [10]:
model = EfficientNetB0(10, layers=[1, 2, 2, 3, 3, 4, 1])
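
One way to see how the model looks without printing every layer is to trace the feature-map size through the stages. The sketch below is plain Python and mirrors the `strides` list and the `layers` argument used above. It assumes a 128x128 input and fastai's default `ConvLayer` padding of `(ks-1)//2`, under which a stride-2 block maps a size `n` to `ceil(n/2)`. The helper name `trace_sizes` is made up for this illustration.

```python
import math

strides = [2, 1, 2, 2, 2, 1, 2]   # per-stage stride, as in EfficientNetB0 above
layers = [1, 2, 2, 3, 3, 4, 1]    # blocks per stage, as passed to the constructor

def trace_sizes(size, strides, layers):
    """Spatial size after each stage when every block in a stage uses the stage stride."""
    sizes = []
    for stride, n_blocks in zip(strides, layers):
        for _ in range(n_blocks):
            size = math.ceil(size / stride)
        sizes.append(size)
    return sizes

sizes = trace_sizes(128, strides, layers)
print(sizes)  # [64, 64, 16, 2, 1, 1, 1]
```

Because `_make_layer` passes the same stride to every block of a stage, stages with several blocks downsample more than once, and the feature map is already 1x1 by the fifth stage.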

Now let's test this model on fastai's Imagenette, and then compare it to ResNet50.

In [11]:
path = untar_data(URLs.IMAGENETTE_160)

In [12]:
dls = ImageDataLoaders.from_folder(path, valid='val', 
    item_tfms=Resize(160), batch_tfms=[*aug_transforms(size=128, min_scale=0.5), Normalize.from_stats(*imagenet_stats)], bs=64)

In [13]:
eff_learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

In [14]:
eff_learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,1.828946,3.050071,0.235924,01:11
1,1.722414,1.688316,0.409936,01:06
2,1.391298,1.334921,0.572484,01:06
3,1.153258,1.112591,0.64535,01:06
4,0.989492,1.030246,0.660892,01:06


In [18]:
res_learn = cnn_learner(dls, resnet50, pretrained=False, metrics=accuracy)

In [19]:
res_learn.fit_one_cycle(5, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,2.543627,2.300174,0.340382,00:27
1,2.048752,1.691249,0.417325,00:27
2,1.732282,1.517511,0.515669,00:27
3,1.446873,1.332534,0.589554,00:27
4,1.22636,1.112935,0.661401,00:27


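After five epochs my EfficientNet-B0 reaches about the same accuracy as ResNet-50 (0.661 for both), but it takes about 66 seconds per epoch against 27 for ResNet-50. Since EfficientNet-B0 is supposed to be faster than ResNet-50, something is not right in my implementation. Likely candidates are the 3x3 convolution in `MBConvBlock`, which is a regular convolution rather than the depthwise one the paper uses, and the stem, which uses stride 1 here, so the early blocks run at the full input resolution.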
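
A rough way to start narrowing down where the time goes is to time individual layers in isolation. The helper below (`time_forward` is a name made up for this sketch) measures CPU forward passes only; GPU timing would additionally need synchronization. The layer sizes are illustrative, and the numbers depend heavily on hardware, so no expected output is shown.

```python
import time
import torch
from torch import nn

def time_forward(module, x, n_runs=5):
    """Average wall-clock seconds per forward pass on CPU (rough, illustrative)."""
    module.eval()
    with torch.no_grad():
        module(x)  # warm-up run, not timed
        start = time.perf_counter()
        for _ in range(n_runs):
            module(x)
    return (time.perf_counter() - start) / n_runs

x = torch.rand(8, 144, 32, 32)
candidates = {
    "regular 3x3": nn.Conv2d(144, 144, 3, padding=1, bias=False),
    "depthwise 3x3": nn.Conv2d(144, 144, 3, padding=1, groups=144, bias=False),
}

timings = {name: time_forward(m, x) for name, m in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s per forward pass")
```

The same helper can be pointed at a whole `MBConvBlock` or at single stages of the model to compare their cost.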