# EfficientNet from scratch [See Paper PDF](https://arxiv.org/pdf/1905.11946.pdf)

![](https://d3i71xaburhd42.cloudfront.net/e085a62d97b12eb5efc1a65fbdb87a5acbb75868/4-Table2-1.png)

# Author Observations:

### Observation 1:

Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.

### Observation 2:

In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth and resolution during ConvNet scaling.

![](https://1.bp.blogspot.com/-Cdtb97FtgdA/XO3BHsB7oEI/AAAAAAAAEKE/bmtkonwgs8cmWyI5esVo8wJPnhPLQ5bGQCLcBGAs/s1600/image4.png)

# EfficientNet-B0 Architecture

### Table showing details of B0 Architecture


![EfficientNet Architecture Img](https://miro.medium.com/max/1400/0*6ezHy0HX_lCrJGRS "EfficientNet-B0 Architecture")


### See what the sub-layer(s) look like
![](https://miro.medium.com/max/2000/1*rnhgFRXetwD8PvxhZIpwIA.png)


### Sub-layers are also consecutive layers! (Don't be scared of the term "sub-layers") :)
We only call them sub-layers because they all have the same structure.

![](https://www.researchgate.net/profile/Tashin-Ahmed/publication/344410350/figure/fig4/AS:1022373302128641@1620764198841/Architecture-of-EfficientNet-B0-with-MBConv-as-Basic-building-blocks.png "EfficientNet-B0 Architecture")


# What do the authors use to get amazing results?

**Ans** 
- SiLU
- Auto Augment
- Stochastic Depth
- Squeeze-and-excitation optimization
- Mobile inverted bottleneck MBConv

# Prerequisites

## Compound scaling method

**Compound Coefficient [φ]**

- It uniformly scales network width, depth and resolution in a principled way.

<div style="background-color: wheat; padding: 10px">
Depth: d = α^φ
Width: w = β^φ
Resolution: r = γ^φ

**Constraint**
s.t. α · β^2 · γ^2 ≈ 2

**Where**
α ≥ 1, β ≥ 1, γ ≥ 1

Here α, β, γ are the constants that can be determined by a small grid search.
</div>

- φ is a user-specified coefficient that controls how many more resources are available for model scaling. This matters when we have a fixed computational budget, e.g. when we are working with mobile or other small computational devices.


- **FLOPS** of a regular conv operation are proportional to d, w^2, r^2, i.e. doubling network depth will double FLOPS, but doubling network width or resolution will increase FLOPS by 4 times.


- Since conv ops usually dominate the computation cost in ConvNets, scaling a ConvNet with the above equations will increase total FLOPS by approximately (α · β^2 · γ^2)^φ, or ~ 2^φ.
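
The proportionality to d, w^2 and r^2 can be checked with a back-of-the-envelope count. The sketch below assumes a toy network of `depth` identical stride-1, same-padding conv layers with `width` input and output channels on a `resolution` x `resolution` feature map. `conv_stack_flops` is a hypothetical helper written for this illustration, not something taken from the paper.

```python
def conv_stack_flops(depth, width, resolution, kernel_size=3):
    """Approximate multiply-accumulate count of a toy conv stack.

    Assumes `depth` identical layers, each with `width` input and output
    channels, stride 1 and same padding, so the feature map stays
    `resolution` x `resolution` throughout.
    """
    per_layer = (resolution ** 2) * (kernel_size ** 2) * width * width
    return depth * per_layer


base = conv_stack_flops(depth=4, width=32, resolution=56)
depth_ratio = conv_stack_flops(depth=8, width=32, resolution=56) / base
width_ratio = conv_stack_flops(depth=4, width=64, resolution=56) / base
resolution_ratio = conv_stack_flops(depth=4, width=32, resolution=112) / base

print(depth_ratio, width_ratio, resolution_ratio)  # 2.0 4.0 4.0
```

Doubling the depth doubles the count, while doubling the width or the resolution quadruples it.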


**Q: How did they find the above values for α, β and γ?**

**Ans:** Using a grid search with φ fixed to 1, they found that ALPHA = 1.2, BETA = 1.1 and GAMMA = 1.15 are the best values for EfficientNet-B0 under the constraint α · β^2 · γ^2 ≈ 2. After that they fixed α, β, γ as constants and scaled up the baseline network with different φ to obtain EfficientNet-B1 to B7, as given in the table above.
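
As a rough illustration of what these constants mean, the sketch below computes the depth, width and resolution multipliers for a few values of φ, plus the resulting FLOPS factor. `compound_scale` is a hypothetical helper name, and integer φ values are used only to keep the printout short.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-search values quoted above


def compound_scale(phi):
    """Return the (depth, width, resolution) multipliers for a given phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    flops_factor = d * w ** 2 * r ** 2
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{flops_factor:.2f}")
```

With φ = 0 all multipliers are 1 (the B0 baseline), and each increment of φ multiplies the FLOPS estimate by roughly 2.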

EfficientNet's main building block is the **mobile inverted bottleneck MBConv**, to which they also add **squeeze-and-excitation optimization**.

## Mobile inverted bottleneck MBConv

## Squeeze-and-excitation optimization

## FLOPS (Optional)

In [17]:
'''
CALCULATE INCREASE IN TOTAL FLOPS
'''
import numpy as np
PHI = 1
ALPHA = 1.2
BETA = 1.1
GAMMA = 1.15

def calc_FLOPS():
    return np.power(ALPHA * np.power(BETA, 2) * np.power(GAMMA, 2), PHI)


print(f"INCREASE IN TOTAL FLOPS: {calc_FLOPS():.4f}")

INCREASE IN TOTAL FLOPS: 1.9203


## SiLU
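
SiLU (also known as Swish) multiplies the input by its sigmoid: silu(x) = x * sigmoid(x). Below is a minimal pure-Python sketch of that formula, for illustration only; the model in this notebook uses `nn.SiLU` instead.

```python
import math


def silu(x):
    """SiLU / Swish: x * sigmoid(x), written as x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"silu({x:+.1f}) = {silu(x):+.4f}")
```

Unlike ReLU, SiLU lets small negative values through and approaches the identity for large positive inputs.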

## Activation Map 
<a href="https://towardsdatascience.com/activation-maps-for-deep-learning-models-in-a-few-lines-of-code-ed9ced1e8d21">link</a>

# Implementation

## imports

In [96]:
import torch
import torch.nn as nn
import math

## Simple Conv Block

### Conv Block implementation

In [97]:
class CNNBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, groups=1):
        '''
        groups = in_channels gives a depthwise conv
        e.g. for a 10x256x256 input:
        WHEN groups = 1 THEN each conv kernel will be 10x3x3
        WHEN groups = in_channels THEN each conv kernel will be 1x3x3
        '''
        super(CNNBlock, self).__init__()
        self.cnn = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size,
            stride,
            padding,
            groups= groups,
            bias= False
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.silu = nn.SiLU() # SiLU <-> Swish
        
        
    def forward(self, x):
        return self.silu(self.bn(self.cnn(x)))

### Sanity check ✅

In [98]:
batch = 15
channels = 10
out_channels = 50
data = torch.rand(batch, channels, 256, 256)
result = CNNBlock(channels, out_channels, kernel_size=3, stride=1,
                  padding=0)(data)

# Delete variables so that they don't interfere with the later parts of the code
del batch, channels, out_channels, data

print("Correct ✅")

Correct ✅
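
To make the `groups` note in the `CNNBlock` docstring concrete, the sketch below computes the weight shape a 2-D convolution needs for a given `groups` value, following the `(out_channels, in_channels // groups, k, k)` layout that `nn.Conv2d` uses for its weight tensor. `conv_weight_shape` and `num_weights` are hypothetical helpers for illustration.

```python
def conv_weight_shape(in_channels, out_channels, kernel_size, groups=1):
    """Weight shape of a 2-D conv: (out_channels, in_channels // groups, k, k)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    return (out_channels, in_channels // groups, kernel_size, kernel_size)


def num_weights(shape):
    """Total number of weights in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


standard = conv_weight_shape(10, 10, 3, groups=1)    # each filter sees all 10 channels
depthwise = conv_weight_shape(10, 10, 3, groups=10)  # each filter sees 1 channel

print(standard, num_weights(standard))    # (10, 10, 3, 3) 900
print(depthwise, num_weights(depthwise))  # (10, 1, 3, 3) 90
```

For this example the depthwise conv needs 10 times fewer weights than the standard conv, which is why it is used inside the inverted residual block.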


## Squeeze and Excitation

### Figure shows Squeeze and Excitation

![Squeeze Excitation](https://www.researchgate.net/profile/Anabia-Sohail-2/publication/330511306/figure/fig8/AS:717351204966400@1548041263212/Squeeze-and-Excitation-block.ppm "How Squeeze excitation works")

### Squeeze and Excitation implementation

In [99]:
class SqueezeExcitation(nn.Module):
    def __init__(self, in_channels, red_channels):
        super(SqueezeExcitation, self).__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), # Squeeze: C x H x W -> C x 1 x 1
            nn.Conv2d(in_channels, red_channels, 1), # Excitation: reduce channels
            nn.SiLU(), # Activation
            nn.Conv2d(red_channels, in_channels, 1), # Excitation: restore channels
            nn.Sigmoid() # Per-channel gate in (0, 1)
        )
        
        
    def forward(self, x):
        return x * self.se(x)

### Sanity check ✅

In [100]:
in_channels = 10
red_channels = 50
batch = 1
data = torch.rand(batch, in_channels, 256, 256)
result = SqueezeExcitation(in_channels, red_channels)(data)

assert data.shape == result.shape, "Error: data and result shape does not match"

del in_channels, red_channels, batch, data, result

print("Correct ✅")

Correct ✅
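
The same computation can be written out by hand to see what the block does to a single feature map. This is a simplified NumPy sketch, assuming no biases and plain matrices in place of the 1x1 convs; `squeeze_excite`, `w_reduce` and `w_expand` are illustrative names, not part of the implementation above.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def squeeze_excite(x, w_reduce, w_expand):
    """Toy SE block on one C x H x W feature map (biases omitted)."""
    squeezed = x.mean(axis=(1, 2))         # global average pool: one value per channel
    hidden = w_reduce @ squeezed           # reduce C -> C_red
    hidden = hidden * sigmoid(hidden)      # SiLU
    gate = sigmoid(w_expand @ hidden)      # expand C_red -> C, squash to (0, 1)
    return x * gate[:, None, None], gate   # rescale every channel by its gate


rng = np.random.default_rng(0)
x = rng.random((8, 4, 4))                  # C=8, H=W=4
w_reduce = rng.standard_normal((2, 8))     # C_red = 2
w_expand = rng.standard_normal((8, 2))
out, gate = squeeze_excite(x, w_reduce, w_expand)
print(out.shape, gate.round(3))
```

Each channel is multiplied by a single learned weight between 0 and 1, so the output keeps the input shape.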


## Inverted Residual Block

### Figure shows Inverted Residual Block

![dd](https://miro.medium.com/max/612/1*BaxdP8RS5x_EVMNJSd1Urg.png)

### Inverted Residual Block implementation

In [134]:
class InvertedResidualBlock(nn.Module):
    def __init__(
        self,
        in_channels,
        out_channels,
        kernel_size,
        stride,
        padding,
        expand_ratio= 1,
        reduction = 4, # squeeze excitation
        survival_prob= 0.8, # for stochastic depth
    ):
        '''
        @param reduction:
            How much to squeeze the channels in SqueezeExcitation
            red_channels = in_channels / reduction
        @param expand_ratio:
            How much to widen the input channels at the start of the block
            No expansion is done if it is equal to 1
            hidden_dim = in_channels * expand_ratio
        @param survival_prob:
            Probability that the block is kept during training (stochastic depth)
            Works like Dropout, but for whole blocks instead of single activations
        '''
        super(InvertedResidualBlock, self).__init__()
        
        self.survival_prob = survival_prob
        
        # The residual connection is only possible if in_channels and out_channels are the same
        # and stride is 1, so that the input and output shapes match
        self.use_residual = in_channels == out_channels and stride == 1
        
        hidden_dim = in_channels * expand_ratio
        
        # Expand only when expand_ratio > 1 (i.e. hidden_dim differs from in_channels)
        self.expand = in_channels != hidden_dim
        
        reduced_dim = int(in_channels / reduction)
    
        ##---- Expansion part ----##
        # If expand_ratio > 1, widen the input to hidden_dim channels first
        # NOTE: the MBConv block of the paper expands with a 1x1 conv; this notebook uses 3x3
        if self.expand:
            self.expand_conv = CNNBlock(
                in_channels,
                hidden_dim,
                kernel_size=3,
                stride= 1,
                padding= 1
            )

        ##---- Depthwise conv, Squeeze Excitation and projection part ----##
        self.conv = nn.Sequential(
            # Depthwise conv (groups = hidden_dim)
            CNNBlock(hidden_dim, hidden_dim, kernel_size, stride, padding, groups= hidden_dim),
            # Squeeze Excitation rescales each channel using its global average (AdaptiveAvgPool) value
            SqueezeExcitation(hidden_dim, reduced_dim),
            # 1x1 pointwise conv projects back down to out_channels (no activation after it)
            nn.Conv2d(hidden_dim, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
            
            
    def stochastic_depth(self, x):
        '''
        Randomly drops the whole block for some samples in the batch during training
        "Block" here refers to the Inverted Residual Block
        '''
        if not self.training:
            return x
        
        
        binary_tensor = torch.rand(x.shape[0], 1, 1, 1, device=x.device) < self.survival_prob
        return torch.div(x, self.survival_prob) * binary_tensor
        
        
    def forward(self, inputs):
        # Expand the input if expand_ratio > 1
        x = self.expand_conv(inputs) if self.expand else inputs
        
        if self.use_residual:
            # stochastic_depth can return an all-zero tensor for a sample; adding
            # `inputs` then passes that sample through unchanged (the block is skipped)
            return self.stochastic_depth(self.conv(x)) + inputs
        else:
            return self.conv(x)

In [102]:
## Stochastic Depth explanation (How it drops layers/block)
x = torch.rand(4, 1, 2, 2)
binary_tensor = torch.rand(4, 1, 1, 1) < 0.9
print("-"*10, "Binary tensor")
print(binary_tensor)
print("-"*10, "Layers")
print(torch.div(x, 0.9) * binary_tensor)

---------- Binary tensor
tensor([[[[False]]],


        [[[ True]]],


        [[[ True]]],


        [[[ True]]]])
---------- Layers
tensor([[[[0.0000, 0.0000],
          [0.0000, 0.0000]]],


        [[[0.2576, 1.0633],
          [1.0067, 0.1463]]],


        [[[0.5148, 0.0567],
          [0.7111, 0.9424]]],


        [[[0.2869, 0.9715],
          [0.1948, 0.1353]]]])
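
The division by the survival probability in the demo above keeps the expected value of the block output unchanged during training, so nothing needs to be rescaled at inference time. Below is a small simulation sketch of that property, with a single scalar standing in for a block output; `stochastic_depth_scalar` is a hypothetical helper and the seed is arbitrary.

```python
import random


def stochastic_depth_scalar(value, survival_prob, rng):
    """Keep `value` with probability `survival_prob` (rescaled), else drop it to 0."""
    keep = rng.random() < survival_prob
    return value / survival_prob if keep else 0.0


rng = random.Random(0)  # fixed seed so the run is repeatable
value, survival_prob, trials = 1.0, 0.8, 100_000
mean = sum(stochastic_depth_scalar(value, survival_prob, rng)
           for _ in range(trials)) / trials
print(f"average over {trials} trials: {mean:.3f}")  # should be close to the original 1.0
```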


## EfficientNet

In [1]:
#  Data for EfficientNet-B0 baseline network
base_model = [
    # expand_ratio, channels, repeats, stride, kernel_size
    [1, 16, 1, 1, 3],
    [6, 24, 2, 2, 3],
    [6, 40, 2, 2, 5],
    [6, 80, 3, 2, 3],
    [6, 112, 3, 1, 5],
    [6, 192, 4, 2, 5],
    [6, 320, 1, 1, 3],
]

b = [0, 0, 0, 0, 0]

phi_values = {
    # tuple of: (phi_value, resolution, drop_rate)
    "b0": (0, 224, 0.2),  # depth_factor = alpha ** phi, width_factor = beta ** phi
    "b1": (0.5, 240, 0.2),
    "b2": (1, 260, 0.3),
    "b3": (2, 300, 0.3),
    "b4": (3, 380, 0.4),
    "b5": (4, 456, 0.4),
    "b6": (5, 528, 0.5),
    "b7": (6, 600, 0.5),
}
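
To see how a phi value turns the B0 table into a larger network, the sketch below applies the same rounding rules that the `EfficientNet` class in this notebook uses (repeats rounded up, channels rounded up to a multiple of 4). These rules are this notebook's choices and may differ from the official implementation; `scaled_stages` and `BASE_STAGES` are illustrative names.

```python
import math

ALPHA, BETA = 1.2, 1.1  # depth and width bases from the compound scaling section

# (expand_ratio, channels, repeats, stride, kernel_size), copied from base_model
BASE_STAGES = [
    (1, 16, 1, 1, 3), (6, 24, 2, 2, 3), (6, 40, 2, 2, 5), (6, 80, 3, 2, 3),
    (6, 112, 3, 1, 5), (6, 192, 4, 2, 5), (6, 320, 1, 1, 3),
]


def scaled_stages(phi):
    """Return [(out_channels, repeats), ...] using the rounding rules of this notebook."""
    depth_factor, width_factor = ALPHA ** phi, BETA ** phi
    stages = []
    for _, channels, repeats, _, _ in BASE_STAGES:
        out_channels = 4 * math.ceil(int(channels * width_factor) / 4)
        stages.append((out_channels, math.ceil(repeats * depth_factor)))
    return stages


for name, phi in (("b0", 0), ("b3", 2)):
    stages = scaled_stages(phi)
    print(name, stages, "total blocks:", sum(r for _, r in stages))
```

With phi = 0 the function returns the base table unchanged, which is a quick way to check the rounding logic.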

### EfficientNet Implementation

In [120]:
class EfficientNet(nn.Module):
    def __init__(self, version, num_classes):
        super(EfficientNet, self).__init__()
        
        '''
        Getting the factors that modify our layer depth, width and dropout rate
        These factors are multiplied by the no of channels and the no of layers
        (the resolution of each version is stored in phi_values)
        '''
        width_factor, depth_factor, dropout_rate = self.calculate_factors(version)
        
        # See how width factor is multiplied by no of channels(1280) for the last layer channels
        last_channels = math.ceil(1280 * width_factor)
        
        # Adaptive avg pool reduces each channel to a single number (its average)
        self.pool = nn.AdaptiveAvgPool2d(1)
        
        # Creating all the features for our network
        self.features = self.create_features(width_factor, depth_factor, last_channels)
        
        # Classifier: dropout followed by the last fully connected layer
        # dropout_rate is used only here; "stochastic_depth" uses the survival_prob of each block
        self.classifier = nn.Sequential(
            nn.Dropout(dropout_rate),
            nn.Linear(last_channels, num_classes),
        )
        
    def calculate_factors(self, version: str, alpha: float = 1.2, beta: float = 1.1) -> tuple:
        # phi, resolution and drop rate for the given version (KeyError for an unknown version)
        phi, resolution, drop_rate = phi_values[version]
        depth_factor = alpha**phi
        width_factor = beta**phi
        return width_factor, depth_factor, drop_rate
    
    
    def create_features(self, width_factor, depth_factor, last_channels):
        channels = int(32 * width_factor)
        image_channels = 3 # (R, G, B)
        
        # Adding the stem of the model to the features list
        features = [CNNBlock(image_channels, channels, kernel_size=3, stride=2, padding= 1)]
        
        # in_channels for the first block is the out_channels of the stem layer
        in_channels = channels
        
        for expand_ratio, channels, repeats, stride, kernel_size in base_model:
            # Round out_channels up to a multiple of 4 (the default SE reduction),
            # so that in_channels / reduction is a whole number
            out_channels = 4 * math.ceil(int(channels * width_factor) / 4)
            
            # Number of sub-layers we want in each layer
            layers_repeat = math.ceil(repeats * depth_factor)
            
            # Creating sub-layer(s) for current layer
            for layer in range(layers_repeat):
                features.append(
                    InvertedResidualBlock(
                        in_channels,
                        out_channels,
                        expand_ratio= expand_ratio,
                        # - Only the first sub-layer of each layer downsamples (uses the given stride)
                        # - From the 2nd sub-layer on the stride is 1, as we don't want to downsample again
                        # - Some layers don't downsample at all because their stride is already 1
                        stride= stride if layer==0 else 1,
                        
                        kernel_size= kernel_size,
                        # if k=1:pad=0, k=3:pad=1, k=5:pad=2 (FOR SAME CONV)
                        # QUE: If we are doing SAME CONV in each layer, how does the
                        # resolution of our inputs become smaller?
                        # ANS: Stride! It does the real work of changing the resolution.
                        padding= kernel_size//2,
                    )
                )
                
                # Don't forget to change in_channels for the next sub-layer!
                in_channels = out_channels
                
        '''
        This is the last layer of the model
        Remember we create last_channels = math.ceil(1280 * width_factor)
        ______________________________________________________________________
        Stage | Operator               | Resolution | #Channels | #Layers
        9     | Conv1x1 & Pooling & FC | 7 × 7      | 1280      | 1
        ----------------------------------------------------------------------
        '''
        features.append(
            CNNBlock(in_channels, last_channels, kernel_size= 1, stride=1, padding=0)
        )
        return nn.Sequential(*features)
    
    
    def forward(self, x):
        '''
        Pass the data through the conv layers, then average pool each channel down to a single number
        AdaptiveAvgPool(1)(batch_size, last_channels, H, W) = (batch_size, last_channels, 1, 1)
        '''
        x = self.pool(self.features(x))
        
        '''
        Flatten the data and pass it to the fully connected layer
        '''
        return self.classifier(x.view(x.shape[0], -1))
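
The question raised in the padding comments (how the resolution shrinks when every conv uses same padding) can be answered with the usual conv output-size formula, floor((n + 2p - k) / s) + 1. The sketch below traces a 224 x 224 B0 input through the stem and the seven stages; `conv_out_size` and `STAGE_STRIDES_KERNELS` are illustrative names, with the strides and kernel sizes copied from `base_model`.

```python
def conv_out_size(size, kernel_size, stride, padding):
    """Output height/width of a conv layer: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (stride, kernel_size) of each stage in base_model;
# only the first block of a stage uses the stride, the rest use stride 1
STAGE_STRIDES_KERNELS = [(1, 3), (2, 3), (2, 5), (2, 3), (1, 5), (2, 5), (1, 3)]

size = conv_out_size(224, kernel_size=3, stride=2, padding=1)  # stem conv
sizes = [size]
for stride, kernel_size in STAGE_STRIDES_KERNELS:
    size = conv_out_size(size, kernel_size, stride, padding=kernel_size // 2)
    sizes.append(size)

print(sizes)  # [112, 112, 56, 28, 14, 14, 7, 7]
```

With `padding = kernel_size // 2`, only the stride-2 layers halve the resolution, and the final 7 x 7 map matches the last row of the architecture table.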

In [137]:
def test():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    version = "b0"
    phi, res, drop_rate = phi_values[version]
    num_examples, num_classes = 4, 10
    x= torch.randn((num_examples, 3, res, res)).to(device)
    model = EfficientNet(
        version= version,
        num_classes= num_classes
    ).to(device)
    
    print(model(x).shape)
    
test()

torch.Size([4, 10])
