## **Search for new activation functions**
---
We generate a random population of activation functions, each represented as a mathematical formula.
* **Unary functions**: x, −x, |x|, x², x³, √x, βx, x + β, log(|x| + ε), exp(x), sin(x), cos(x), sinh(x), cosh(x), tanh(x), sinh⁻¹(x), tan⁻¹(x), sinc(x), max(x, 0), min(x, 0), σ(x), log(1 + exp(x)), exp(−x²), erf(x)

* **Binary functions**: x₁ + x₂, x₁ · x₂, x₁ − x₂, x₁/(x₂ + ε), max(x₁, x₂), min(x₁, x₂), σ(x₁) · x₂, exp(−β(x₁ − x₂)²), exp(−β|x₁ − x₂|), βx₁ + (1 − β)x₂
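
As a rough illustration of how such a population might be sampled, the sketch below draws candidates of the form b(u1(x), u2(x)) from a small subset of the primitives listed above. The composition rule, the chosen subset, the value of `EPS`, and the names `UNARY`, `BINARY`, and `random_candidate` are assumptions made for this example; the text above does not specify how formulas are assembled.

```python
import math
import random

EPS = 1e-7  # small constant standing in for ε (value assumed)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# A subset of the unary primitives listed above.
UNARY = {
    "x": lambda x: x,
    "-x": lambda x: -x,
    "|x|": abs,
    "x^2": lambda x: x * x,
    "tanh(x)": math.tanh,
    "max(x, 0)": lambda x: max(x, 0.0),
    "sigma(x)": sigmoid,
    "log(|x| + eps)": lambda x: math.log(abs(x) + EPS),
}

# A subset of the binary primitives listed above.
BINARY = {
    "x1 + x2": lambda a, b: a + b,
    "x1 * x2": lambda a, b: a * b,
    "x1 - x2": lambda a, b: a - b,
    "max(x1, x2)": max,
    "min(x1, x2)": min,
    "sigma(x1) * x2": lambda a, b: sigmoid(a) * b,
}


def random_candidate(rng):
    """Sample one formula b(u1(x), u2(x)); return (description, callable)."""
    n1, u1 = rng.choice(list(UNARY.items()))
    n2, u2 = rng.choice(list(UNARY.items()))
    nb, b = rng.choice(list(BINARY.items()))
    name = f"[{nb}] with x1 = {n1}, x2 = {n2}"
    return name, (lambda x: b(u1(x), u2(x)))


rng = random.Random(0)
population = [random_candidate(rng) for _ in range(5)]
for name, f in population:
    print(f"{name:55s} f(-1) = {f(-1.0):+.4f}   f(2) = {f(2.0):+.4f}")
```

In this toy search space, x · σ(x) is the candidate obtained by picking `sigma(x1) * x2` as the binary function and the identity for both unary functions.
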
<figure align="center">
    <img  width="800" height="250" src="https://miro.medium.com/v2/resize:fit:1400/1*ccoqEVXPa0lf01ibFPy9tA.png">
    <figcaption>The top novel activation functions found by the searches. Separated into two diagrams for visual clarity. Best viewed in color.</figcaption>
</figure>
To test the robustness of the top performing novel activation functions to different architectures, we run additional experiments using the preactivation ResNet-164 (RN), Wide ResNet 28-10 (WRN), and DenseNet 100-12 (DN) models.
<figure align="center">
    <img  width="800" height="250" src="https://miro.medium.com/max/1400/1*NkOt2sKckxOsXwWS3-KTqw.png">
</figure>
The results are shown in Tables 1 and 2. Two of the discovered activation functions, x·σ(βx) and max(x, σ(x)), consistently match or outperform ReLU on all three models.<br>
We focus on empirically evaluating the activation function

$$ f(x) = x \cdot \sigma(\beta x) $$

which we call **Swish**.

---

## **The swish activation function**
---
The swish activation function was introduced in 2017 as an alternative to the commonly used ReLU activation function. <br>
Like the ReLU function, the swish function is non-linear.

The mathematical definition of the swish function is:
$$
\mathrm{swish}(x) = x \cdot \mathrm{sigmoid}(\beta x)
$$
where β is either a constant or a trainable parameter and sigmoid is the standard sigmoid function, defined as:

$$ \mathrm{sigmoid}(z) = \frac{1}{1 + \exp(-z)} $$
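
One way to express this definition in PyTorch is a small module that holds β either as a fixed buffer or as a trainable parameter. The class name `Swish` and its `trainable` flag are choices made for this sketch, not part of PyTorch's API; for the fixed β = 1 case PyTorch already provides `torch.nn.functional.silu`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Swish(nn.Module):
    """swish(x) = x * sigmoid(beta * x), with beta constant or learned."""

    def __init__(self, beta=1.0, trainable=False):
        super().__init__()
        beta = torch.tensor(float(beta))
        if trainable:
            # Registered as a parameter, so the optimizer updates it.
            self.beta = nn.Parameter(beta)
        else:
            # Registered as a buffer: saved with the model but not trained.
            self.register_buffer("beta", beta)

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)


x = torch.linspace(-5.0, 5.0, steps=11)

# With beta fixed at 1, the module should agree with PyTorch's built-in SiLU.
fixed = Swish(beta=1.0)
print(torch.allclose(fixed(x), F.silu(x)))

# With a trainable beta, backpropagation produces a gradient for beta.
learned = Swish(beta=1.0, trainable=True)
learned(x).sum().backward()
print(learned.beta.grad is not None)
```
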

<br>

<figure align="center">
  <img width="460" height="300" src="https://media.geeksforgeeks.org/wp-content/uploads/20200516225910/swish.jpeg">
  <img width="460" height="300" src="https://media.arxiv-vanity.com/render-output/6513421/x4.png">
  <figcaption>If β = 1, Swish is equivalent to the Sigmoid-weighted Linear Unit (SiL) of Elfwing et al.<br>
If β = 0, Swish becomes the scaled linear function f(x) = x/2.<br>
As β → ∞, the sigmoid component approaches a 0-1 function, so Swish becomes like the ReLU function.<br></figcaption>
</figure>
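
The β = 0 and large-β cases in the caption can be checked numerically. The snippet below is a minimal sketch in plain Python; the helper names `sigmoid`, `swish`, and `relu` are defined here for illustration, and β = 50 is used as a stand-in for "large β".

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x, beta=1.0):
    return x * sigmoid(beta * x)


def relu(x):
    return max(x, 0.0)


xs = [-3.0, -1.0, -0.5, 0.5, 1.0, 3.0]

# beta = 0: sigmoid(0) = 1/2, so swish(x) reduces to x / 2.
gap_linear = max(abs(swish(x, beta=0.0) - x / 2) for x in xs)

# Large beta: sigmoid(beta * x) is close to a 0-1 step, so swish approaches ReLU.
gap_relu = max(abs(swish(x, beta=50.0) - relu(x)) for x in xs)

print(f"max |swish_0(x) - x/2|      = {gap_linear:.2e}")
print(f"max |swish_50(x) - relu(x)| = {gap_relu:.2e}")
```
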
<br>

Benefit of the swish function: 
*   The most striking difference between Swish and ReLU is the non-monotonic “bump” of Swish when x < 0. The shape of this bump can be controlled by changing the β parameter.
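
To make the bump concrete, the sketch below locates the minimum of Swish on a grid for a few values of β. It uses plain Python, and the grid bounds and step count are arbitrary choices for this example. Because x · σ(βx) equals (βx) · σ(βx) divided by β, increasing β should pull the minimum toward the origin and make it shallower in proportion to 1/β.

```python
import math


def swish(x, beta=1.0):
    """x * sigmoid(beta * x), written to avoid overflow for large |beta * x|."""
    z = beta * x
    if z >= 0:
        return x / (1.0 + math.exp(-z))
    e = math.exp(z)
    return x * e / (1.0 + e)


def bump_minimum(beta, lo=-10.0, hi=0.0, steps=100_000):
    """Grid search for the minimum of swish on [lo, hi]."""
    best_x, best_y = lo, swish(lo, beta)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        y = swish(x, beta)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y


for beta in (0.5, 1.0, 2.0):
    x_min, y_min = bump_minimum(beta)
    print(f"beta = {beta}: minimum {y_min:.4f} at x = {x_min:.4f}")
```

For β = 1 the minimum sits near x ≈ −1.28 with a value of about −0.28, which is the dip visible in the plots above.
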

Limitation of the swish function:
 *   It is not as computationally efficient as the ReLU function, because it requires a sigmoid computation for each input value. This can make neural network models that use the swish function slower to train and evaluate.

---

## **Experiment: Comparing ReLU with Swish**
---
### Simple classification problem
In this problem, we classify images of handwritten digits (0-9) from the MNIST dataset. The goal is to develop a model that classifies these images accurately. Both runs below use the same three-layer fully connected network, trained for one epoch, and differ only in the activation function.

#### **ReLU** activation function:



```python
import sys
from datetime import datetime
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(28*28, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Define a loss function and optimizer
net = SimpleNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters())

# Load the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(mnist_train, batch_size=64, shuffle=True)
test_loader = DataLoader(mnist_test, batch_size=64, shuffle=True)
print("Train Network (ReLU)")

# Train the network
tstart = datetime.now()
for epoch in range(1):
    for i, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Draw a text progress bar for the current epoch
        j = (i + 1) / len(train_loader)
        sys.stdout.write('\r')
        sys.stdout.write("epoch_%d ->[%-20s] %d%%" % (epoch,'='*int(20*j), 100*j))
        sys.stdout.flush()
tend = datetime.now()

# Evaluate the network on the test dataset
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        inputs, labels = data
        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('\nAccuracy of the network: {} %'.format(100 * correct / total))
print("TimeCost for train: ",tend-tstart)
```

```text
Train Network (ReLU)
Accuracy of the network: 92.68 %
TimeCost for train:  0:00:59.245795
```


#### **Swish** (SiLU) activation function:

PyTorch's `torch.nn.functional.silu` computes x · σ(x), that is, Swish with β fixed at 1.

```python
import sys
from datetime import datetime  # needed for the timing below when run standalone
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(28*28, 256)
        self.fc2 = nn.Linear(256, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)
        x = torch.nn.functional.silu(self.fc1(x))
        x = torch.nn.functional.silu(self.fc2(x))
        x = self.fc3(x)
        return x

# Define a loss function and optimizer
net = SimpleNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters())

# Load the MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = datasets.MNIST('./data', train=False, download=True, transform=transform)
train_loader = DataLoader(mnist_train, batch_size=64, shuffle=True)
test_loader = DataLoader(mnist_test, batch_size=64, shuffle=True)
print("Train Network (Swish)")

# Train the network
tstart = datetime.now()
for epoch in range(1):
    for i, data in enumerate(train_loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # Draw a text progress bar for the current epoch
        j = (i + 1) / len(train_loader)
        sys.stdout.write('\r')
        sys.stdout.write("epoch_%d ->[%-20s] %d%%" % (epoch,'='*int(20*j), 100*j))
        sys.stdout.flush()
tend = datetime.now()

# Evaluate the network on the test dataset
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        inputs, labels = data
        outputs = net(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('\nAccuracy of the network: {} %'.format(100 * correct / total))
print("TimeCost for train: ",tend-tstart)
```

```text
Train Network (Swish)
Accuracy of the network: 94.77 %
TimeCost for train:  0:01:12.321490
```
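
The two runs differ only in the activation, but each draws its own random initial weights, so part of the accuracy gap may come from initialization rather than from the activation function. A possible refinement, sketched below, passes the activation in as an argument and reseeds before building each model so that both start from identical weights. The names `make_net` and `SEED` are introduced here for illustration and do not appear in the original cells.

```python
import torch
import torch.nn as nn

SEED = 0  # arbitrary value, chosen only for reproducibility


def make_net(activation):
    """Same 784-256-64-10 network as in the experiment, with a pluggable activation."""
    torch.manual_seed(SEED)  # identical initial weights for every call
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 256),
        activation(),
        nn.Linear(256, 64),
        activation(),
        nn.Linear(64, 10),
    )


nets = {"ReLU": make_net(nn.ReLU), "Swish": make_net(nn.SiLU)}

# Both networks start from the same weights, so any difference in their
# outputs comes from the activation function alone.
batch = torch.randn(4, 1, 28, 28)  # stand-in for a batch of MNIST images
for name, net in nets.items():
    print(name, tuple(net(batch).shape))
```

Each network could then be trained with the same loop as in the experiment, ideally over several seeds and more than one epoch, before reading much into a two-point accuracy difference.
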
