# Activation Functions

In deep learning, our **objective** is almost always to find a **set of weights** that **minimizes the error.** Every operation these weights perform is a **linear operation**, so if we used them alone, no matter how many layers we stacked, we would end up with nothing more than a **multiple linear regression model.**

##### What’s the Problem with Linear Models?

Left as they are, linear models are not **flexible**: they can only model linear relationships, while most real-world data follows **non-linear patterns.** Hence, we need a way to make our model able to **learn non-linear patterns.**
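To make the limitation concrete, here is a minimal sketch in plain Python. The coefficients are arbitrary, hypothetical values chosen only for illustration. It shows that two chained linear operations collapse into a single linear operation, so stacking them adds no expressive power.

```python
# Two scalar linear "layers" with arbitrary (hypothetical) coefficients.
a1, b1 = 2.0, 1.0
a2, b2 = -3.0, 0.5

def two_linear_layers(x):
    """Apply layer 1, then layer 2, with no activation in between."""
    return a2 * (a1 * x + b1) + b2

# The same mapping written as ONE linear layer.
a_combined = a2 * a1
b_combined = a2 * b1 + b2

def one_linear_layer(x):
    return a_combined * x + b_combined

xs = [-2.0, -0.5, 0.0, 1.5, 4.0]
max_gap = max(abs(two_linear_layers(x) - one_linear_layer(x)) for x in xs)
print(max_gap)  # the two functions agree on every test point
```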

##### How do we do this?

After a set of linear operations, we take the output they produce ($Ax = \hat{y}$) and pass it as the **input** to a **non-linear activation function**.

Suppose we have a simple linear model $\hat{y}=ax+b$. Plotted against $x$, the values of $\hat{y}$ form a straight line.

Given this line ($\hat{y}$), we then apply a **non-linear activation function** to **transform** our linear model into a **fixed non-linear model** such as the ones below:

<img src="https://www.researchgate.net/profile/Hoon_Chung2/publication/309775740/figure/fig1/AS:538049215381504@1505292337270/The-most-common-nonlinear-activation-functions.png" alt="The most common nonlinear activation functions. | Download ..." style="zoom: 50%;" />

**Why is this a fixed non-linear operation?**

Because whatever formula we choose for our non-linear operation, it has **NO** weights of its own that try to learn an optimal non-linear representation. It always applies the same **fixed, single transformation.**
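As a minimal sketch of what "fixed" means, consider the plain-Python example below. Here tanh is chosen only for illustration, and `a` and `b` are hypothetical weights. All learnable parameters live in the linear part, while the activation has none.

```python
import math

def linear(x, a, b):
    """Learnable part: a and b are the weights that training would update."""
    return a * x + b

def activation(y_hat):
    """Fixed part: tanh has no parameters, so nothing here is learned."""
    return math.tanh(y_hat)

a, b = 1.5, -0.5  # hypothetical weights
outputs = [activation(linear(x, a, b)) for x in [-2.0, 0.0, 2.0]]
print(outputs)
```

Changing `a` or `b` changes what the activation receives, but the shape of the tanh curve itself never changes.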

##### Well, isn’t our purpose to find an optimal non-linear operation?

Yes and no. We do not learn the non-linear operation itself. Instead, we let our **linear weights** learn a **representation of the data** that, **once fed to the non-linear operation**, **captures the non-linear pattern.** Hence, the objective of our linear weights now becomes to **find a representation of the data that, once passed through the non-linear activation, correctly models the non-linear patterns.**
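One way to picture this is the hedged sketch below, written in plain Python. The weights are hand-picked rather than learned, purely for illustration. Two linear units followed by a fixed ReLU and a linear output reproduce the non-linear function $|x|$, even though the activation itself has no parameters.

```python
def relu(v):
    return v if v > 0 else 0.0

def tiny_network(x, w_hidden=(1.0, -1.0), w_out=(1.0, 1.0)):
    """Linear -> fixed ReLU -> linear. The weights are hand-picked for illustration."""
    hidden = [relu(w * x) for w in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

xs = [-3.0, -1.0, 0.0, 2.0]
ys = [tiny_network(x) for x in xs]
print(ys)  # matches [abs(x) for x in xs]
```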

##### How do we backpropagate these non-linear activations?

Because these activations are non-linear, their gradient is not a constant that we can read off as we can with a linear operation (the derivative of $3x$ is simply $3$); it depends on the value of the input. Hence, we will need **two things**:

1. The **input ($\hat{y}$) that was fed to the non-linear activation** and
2. The **derivative equation of the non-linear function.**

Given that we want to apply the non-linear operation to every input, we can classify these operations as element-wise. 

This has important implications for how we calculate our gradient.

**First**, as we learned in the "Linear Layer" tutorial, the dimension of the incoming gradient from the subsequent operation equals the dimension of the output of our non-linear operation.

Now, since the dimension of the output of the non-linear operation equals the dimension of its input, we can compute the chained gradient with a simple Hadamard product (element-wise multiplication) between the incoming gradient and the derivative of the non-linear operation evaluated at its input. In other words,

```input.shape == output.shape == incoming_grad.shape```
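The shape relationship above can be sketched in plain Python. This assumes sigmoid as the activation and a made-up incoming gradient, and uses lists in place of tensors. The backward step is just an element-wise product of the incoming gradient and the local derivative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_forward(inputs):
    return [sigmoid(v) for v in inputs]

def sigmoid_backward(inputs, incoming_grad):
    """Chain rule: incoming gradient times local derivative, element by element."""
    assert len(inputs) == len(incoming_grad)  # shapes must match
    local_grad = [sigmoid(v) * (1.0 - sigmoid(v)) for v in inputs]
    return [g * d for g, d in zip(incoming_grad, local_grad)]

inputs = [-1.0, 0.0, 2.0]
incoming_grad = [0.5, 1.0, -2.0]  # hypothetical gradient from the next layer
outputs = sigmoid_forward(inputs)
input_grad = sigmoid_backward(inputs, incoming_grad)
print(len(inputs) == len(outputs) == len(input_grad))
```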

**Second**, the fact that these operations have no weight parameters has two implications:

i) from a backward perspective, these operations are only intermediate variables and 

ii) we can just apply the derivative of the equation to each input such as shown below

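$$z = \sigma(y), \qquad y = x_0w_0+x_1w_1+x_2w_2+x_3w_3$$

Note that $\sigma$ is applied to the sum as a whole; because $\sigma$ is non-linear, it does not distribute over the individual terms. Hence, by the chain rule,

$$\frac{\partial z}{\partial y} = \sigma'(y), \qquad \frac{\partial z}{\partial w_i} = \frac{\partial z}{\partial y}\frac{\partial y}{\partial w_i} = \sigma'(y)\,x_i$$

When a layer produces several outputs $y_j$, the activation acts on each one separately, $z_j = \sigma(y_j)$, so each $\partial z_j / \partial y_j = \sigma'(y_j)$ is computed independently.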
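As a hedged numerical check of backpropagating through the activation, the sketch below uses plain Python, assumes sigmoid for $\sigma$, and uses made-up values for `x` and `w`. It compares the chain-rule gradient $\sigma'(y)\,x_i$ against central finite differences. It also prints $\sigma$ of the sum next to the sum of per-term $\sigma$ values, which are in general different numbers.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [1.0, -2.0, 0.5, 3.0]   # hypothetical inputs
w = [0.2, 0.4, -0.6, 0.1]   # hypothetical weights

def forward(weights):
    """z = sigma(y) with y = sum_i x_i * w_i."""
    y = sum(xi * wi for xi, wi in zip(x, weights))
    return sigmoid(y)

# Analytic gradient from the chain rule: dz/dw_i = sigma'(y) * x_i
y = sum(xi * wi for xi, wi in zip(x, w))
sigma_prime = sigmoid(y) * (1.0 - sigmoid(y))
analytic = [sigma_prime * xi for xi in x]

# Central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric.append((forward(w_plus) - forward(w_minus)) / (2 * eps))

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))

# sigma of the sum versus the sum of per-term sigmas
sum_of_sigmas = sum(sigmoid(xi * wi) for xi, wi in zip(x, w))
print(max_err, forward(w), sum_of_sigmas)
```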

Now that we have a general understanding of how to implement non-linear operations, the natural question is: **what are some common non-linear operations?**

##### What are some common non-linear operations?

Well, let's break this down bit by bit, starting with a summary:

* ReLU
* Sigmoid
* Tanh



# ReLU
ReLU is a piece-wise linear function, applied element-wise, that adds non-linearity to our model. The effect this simple function has had on the DL field has been astonishing.

ReLU's forward and backward passes can both be seen as "gates" that either block or pass the flow of values (forward) or gradients (backward).

During the forward pass, ReLU keeps the input as it is if it is greater than zero and otherwise sets it to zero.

```python
[x if x > 0 else 0 for x in input]
```

For the inputs that were "cut" to zero, the local gradient is zero, while for the rest it is 1. Hence, given that ReLU is an intermediate operation, it either blocks the incoming gradients or lets them "flow" through unchanged.
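The gating behaviour in both directions can be sketched with plain Python lists. The incoming gradient below is a made-up example, not the output of a real layer.

```python
def relu_forward(inputs):
    return [v if v > 0 else 0.0 for v in inputs]

def relu_backward(inputs, incoming_grad):
    """Pass the incoming gradient where input > 0, block it elsewhere."""
    return [g if v > 0 else 0.0 for v, g in zip(inputs, incoming_grad)]

inputs = [-1.5, 0.3, 2.0, -0.1]
incoming_grad = [0.7, -0.2, 1.1, 0.4]  # hypothetical gradient from the next layer

activated = relu_forward(inputs)
input_grad = relu_backward(inputs, incoming_grad)
print(activated)   # [0.0, 0.3, 2.0, 0.0]
print(input_grad)  # [0.0, -0.2, 1.1, 0.0]
```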

Such simple conditions make ReLU a "lightweight" operation, as its forward and backward passes take very little to compute.

These properties, together with its surprising effectiveness at modeling non-linearity, have made ReLU a very popular choice for most DL architectures.

Let us model this process in PyTorch:

In [1]:
import torch
import torch.nn as nn
torch.randn((2,2)).cuda()

tensor([[ 0.1198,  0.1402],
        [ 0.6488, -0.8397]], device='cuda:0')

In [3]:
# custom ReLU function 
# Remember that:
# input.shape == out.shape == incoming_gradient.shape

class ReLU_layer(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        # save the input for the backward() pass
        ctx.save_for_backward(input)  # stored internally as a tuple
        activated_input = torch.clamp(input, min=0)
        return activated_input

    @staticmethod
    def backward(ctx, output_grad_wrt_loss):
        """
        In the backward pass we receive a Tensor containing the
        gradient of the loss with respect to our f(x) output,
        and we now need to compute the gradient of the loss
        wrt the input.
        """
        # keep in mind that the local gradient of ReLU is binary = {0, 1}
        # hence, we either keep each element of output_grad_wrt_loss
        # or turn it to zero
        input, = ctx.saved_tensors
        input_grad_wrt_loss = output_grad_wrt_loss.clone()
        # block the gradient wherever the forward pass output zero (input <= 0)
        input_grad_wrt_loss[input <= 0] = 0
        return input_grad_wrt_loss

In [4]:
# Wrap ReLU_layer function in nn.module
class ReLU(nn.Module):
    def __init__(self):
        super().__init__()

        
    def forward(self, input):
        output = ReLU_layer.apply(input)
        return output
    

In [5]:
# test function with linear + relu layer
dummy_input= torch.ones((1,2)) # input 

# forward pass
linear = nn.Linear(2,3)
relu = ReLU()
linear2 = nn.Linear(3,1)

output1 = linear(dummy_input)
output2 = relu(output1)
output3 = linear2(output2)
output3

tensor([[0.2316]], grad_fn=<AddmmBackward>)

In [6]:
# backward pass
output3.backward()

In [7]:
# check computed gradients of 1st linear layer
list(linear.parameters())[0].grad

tensor([[0.1558, 0.1558],
        [0.0000, 0.0000],
        [0.0000, 0.0000]])

Now that we have validated our operation, let us apply it to a simple neural network that learns to map MNIST images to their digit labels.

In [8]:
class NeuralNet(nn.Module):
    def __init__(self, num_units = 128, activation = ReLU()):
        super().__init__()
        
        # fully-connected layers
        self.fc1 = nn.Linear(784,num_units)
        self.fc2 = nn.Linear(num_units , num_units//2)
        self.fc3 = nn.Linear(num_units // 2, 10)
        
        # init activation
        self.activation = activation
        
    def forward(self,x):
        
        # 1st layer
        output = self.activation(self.fc1(x))
        
        # 2nd layer
        output = self.activation(self.fc2(output))
        
        # 3rd layer
        output = self.fc3(output)
        
        return output
        

In [9]:
# import data and feed to DataLoader
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader

root = r'C:\Users\erick\PycharmProjects\untitled\3D_2D_GAN\MNIST_experimentation'
mnist = torchvision.datasets.MNIST(root = root,
                                   train = True,
                                   download = False,
                                   transform = transforms.Normalize(.5,.5)
                                   )
mnist

Dataset MNIST
    Number of datapoints: 60000
    Root location: C:\Users\erick\PycharmProjects\untitled\3D_2D_GAN\MNIST_experimentation
    Split: Train

In [10]:
# normalize data to mean = 0, std = 1 and extract labels
data = mnist.data.view(60000,-1).float()
X = (data - data.mean()) / data.std()
y = mnist.targets 
y

tensor([5, 0, 4,  ..., 5, 6, 8])

In [15]:
from skorch import NeuralNetClassifier
from torch import optim

net = NeuralNetClassifier(
    NeuralNet,
    max_epochs = 25,
    batch_size = 128,
    lr = .01,
    criterion = nn.CrossEntropyLoss,
    optimizer = optim.SGD,
    optimizer__momentum = .95,
    device = 'cuda',
    iterator_train__pin_memory = True)

In [16]:
net.fit(X,y)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4752       0.9411        0.1882  7.4694
      2        0.1566       0.9623        0.1273  7.3764
      3        0.1014       0.9664        0.1134  7.2953
      4        0.0727       0.9697        0.1035  7.2914
      5        0.0563       0.9741        0.0931  7.3200
      6        0.0430       0.9713        0.1002  7.1467
      7        0.0346       0.9708        0.1064  7.6154
      8        0.0283       0.9693        0.1277  7.5981
      9        0.0237       0.9698        0.1248  7.4488
     10        0.0203       0.9738        0.1124  7.5092
     11        0.0176       0.9714        0.1223  7.5436
     12        0.0132       0.9747        0.1126  7.5255
     13        0.0106       0.9745        0.1163  7.4018
     14        0.0112       0.9744        0.1171  7.5277
     15        0.0095       0.9752        0.1204  7.4607
     16        0.0076       0.9

<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=NeuralNet(
    (fc1): Linear(in_features=784, out_features=128, bias=True)
    (fc2): Linear(in_features=128, out_features=64, bias=True)
    (fc3): Linear(in_features=64, out_features=10, bias=True)
    (activation): ReLU()
  ),
)

In [20]:
# extract history
import pandas as pd
history = pd.DataFrame(net.history)
history

| | batches | dur | epoch | train_batch_count | train_loss | train_loss_best | valid_acc | valid_acc_best | valid_batch_count | valid_loss | valid_loss_best |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `[{'train_loss': 2.341770887374878, 'train_batc...` | 7.469449 | 1 | 375 | 0.475174 | True | 0.941108 | True | 94 | 0.18823 | True |
| 1 | `[{'train_loss': 0.11716476827859879, 'train_ba...` | 7.37642 | 2 | 375 | 0.156564 | True | 0.962349 | True | 94 | 0.127276 | True |
| 2 | `[{'train_loss': 0.07833997160196304, 'train_ba...` | 7.295328 | 3 | 375 | 0.101438 | True | 0.966431 | True | 94 | 0.113426 | True |
| 3 | `[{'train_loss': 0.05629928037524223, 'train_ba...` | 7.291368 | 4 | 375 | 0.072656 | True | 0.969679 | True | 94 | 0.103544 | True |
| 4 | `[{'train_loss': 0.039215993136167526, 'train_b...` | 7.319993 | 5 | 375 | 0.056315 | True | 0.974094 | True | 94 | 0.093118 | True |
| 5 | `[{'train_loss': 0.01480671763420105, 'train_ba...` | 7.146651 | 6 | 375 | 0.04303 | True | 0.971345 | False | 94 | 0.100192 | False |
| 6 | `[{'train_loss': 0.014157209545373917, 'train_b...` | 7.615449 | 7 | 375 | 0.034617 | True | 0.970762 | False | 94 | 0.106429 | False |
| 7 | `[{'train_loss': 0.028001882135868073, 'train_b...` | 7.598089 | 8 | 375 | 0.028333 | True | 0.969263 | False | 94 | 0.127717 | False |
| 8 | `[{'train_loss': 0.017053615301847458, 'train_b...` | 7.448816 | 9 | 375 | 0.023698 | True | 0.969763 | False | 94 | 0.12484 | False |
| 9 | `[{'train_loss': 0.006474234163761139, 'train_b...` | 7.509183 | 10 | 375 | 0.020254 | True | 0.973844 | False | 94 | 0.112428 | False |


In [23]:
# extract only epoch, train_loss, train_loss_best, valid_acc, valid_acc_best, valid_loss, and valid_loss_best
df = history.iloc[:, [2,4,5,6,7,9,10]]
df

| | epoch | train_loss | train_loss_best | valid_acc | valid_acc_best | valid_loss | valid_loss_best |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.475174 | True | 0.941108 | True | 0.18823 | True |
| 1 | 2 | 0.156564 | True | 0.962349 | True | 0.127276 | True |
| 2 | 3 | 0.101438 | True | 0.966431 | True | 0.113426 | True |
| 3 | 4 | 0.072656 | True | 0.969679 | True | 0.103544 | True |
| 4 | 5 | 0.056315 | True | 0.974094 | True | 0.093118 | True |
| 5 | 6 | 0.04303 | True | 0.971345 | False | 0.100192 | False |
| 6 | 7 | 0.034617 | True | 0.970762 | False | 0.106429 | False |
| 7 | 8 | 0.028333 | True | 0.969263 | False | 0.127717 | False |
| 8 | 9 | 0.023698 | True | 0.969763 | False | 0.12484 | False |
| 9 | 10 | 0.020254 | True | 0.973844 | False | 0.112428 | False |


In [24]:
# turn above data into a list of dictionary values so that we can feed
# to HiPlot
results = []
for row in df.iterrows():
    results.append(row[1].to_dict())
results

[{'epoch': 1,
  'train_loss': 0.47517417672773565,
  'train_loss_best': True,
  'valid_acc': 0.9411078717201167,
  'valid_acc_best': True,
  'valid_loss': 0.18822958134875006,
  'valid_loss_best': True},
 {'epoch': 2,
  'train_loss': 0.15656359225421515,
  'train_loss_best': True,
  'valid_acc': 0.9623490212411495,
  'valid_acc_best': True,
  'valid_loss': 0.12727606778018727,
  'valid_loss_best': True},
 {'epoch': 3,
  'train_loss': 0.10143763750204755,
  'train_loss_best': True,
  'valid_acc': 0.9664306538942108,
  'valid_acc_best': True,
  'valid_loss': 0.11342565075016081,
  'valid_loss_best': True},
 {'epoch': 4,
  'train_loss': 0.07265592684428113,
  'train_loss_best': True,
  'valid_acc': 0.9696793002915451,
  'valid_acc_best': True,
  'valid_loss': 0.10354374069092126,
  'valid_loss_best': True},
 {'epoch': 5,
  'train_loss': 0.056315004594910456,
  'train_loss_best': True,
  'valid_acc': 0.9740941274468972,
  'valid_acc_best': True,
  'valid_loss': 0.09311790618299197,
  'vali

In [35]:
# torch.save(results, 'relu_history.pt')
import hiplot as hip
# hip.Experiment.from_iterable(results).display() for a smaller visualization
hip.Experiment.from_iterable(results).display(force_full_width = True)

<hiplot.ipython.IPythonExperimentDisplayed at 0x2c19b5b1278>

There are several insights that we can gather from the above graph:

* As the number of epochs increases, the train loss generally drops below that of the previous epoch (with the exception of epoch 14).
* As the train loss decreases, the validation accuracy tends to climb.
* Epochs with a higher validation accuracy are more likely to be flagged as the best validation accuracy so far.
* Epochs that reach the best validation accuracy are also more likely to reach the best validation loss.

These insights suggest that the model may be overfitting once it goes beyond 15 epochs, as its learning slows down and the validation accuracy stops reaching new bests.
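One simple way to quantify this kind of trend is sketched below in plain Python. It uses only the validation losses of the ten epochs shown in the history table above, so it says nothing about later epochs. It finds the epoch with the lowest validation loss and counts how many epochs have passed since then, which is a common early-stopping signal.

```python
# (epoch, valid_loss) pairs copied from the first ten rows of the history table
valid_loss_by_epoch = [
    (1, 0.18823), (2, 0.127276), (3, 0.113426), (4, 0.103544), (5, 0.093118),
    (6, 0.100192), (7, 0.106429), (8, 0.127717), (9, 0.12484), (10, 0.112428),
]

best_epoch, best_loss = min(valid_loss_by_epoch, key=lambda pair: pair[1])

# Number of epochs since the last improvement in validation loss
epochs_without_improvement = valid_loss_by_epoch[-1][0] - best_epoch
print(best_epoch, best_loss, epochs_without_improvement)  # 5 0.093118 5
```

Within these ten epochs the train loss keeps falling while the validation loss does not improve after epoch 5, which is the usual signature of overfitting.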

In the ML world, it is important to be able to present our findings in a meaningful way that enables us to compare them with other state-of-the-art models.

For this reason, we will implement the above model again; this time, however, we will compare the performance of different activation functions.

We will be comparing the following:

* ReLU
* tanh
* leakyReLU

For brevity, we will use PyTorch's built-in modules for tanh and leakyReLU; their manual implementation is left as an exercise.
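As a starting point for that exercise, here is a hedged sketch of the scalar forward functions and derivatives in plain Python. It is not a `torch.autograd.Function`. The `negative_slope=0.01` default mirrors the value `nn.LeakyReLU` uses by default, and the gradient at exactly zero is set to `negative_slope` by assumption.

```python
import math

def tanh_forward(v):
    return math.tanh(v)

def tanh_grad(v):
    """d/dv tanh(v) = 1 - tanh(v)^2"""
    return 1.0 - math.tanh(v) ** 2

def leaky_relu_forward(v, negative_slope=0.01):
    return v if v > 0 else negative_slope * v

def leaky_relu_grad(v, negative_slope=0.01):
    """Gradient is 1 for positive inputs and negative_slope otherwise."""
    return 1.0 if v > 0 else negative_slope

samples = [-2.0, 0.0, 3.0]
leaky_out = [leaky_relu_forward(v) for v in samples]
leaky_grads = [leaky_relu_grad(v) for v in samples]
tanh_grads = [tanh_grad(v) for v in samples]
print(leaky_out, leaky_grads, tanh_grads)
```

Unlike ReLU, leaky ReLU never fully blocks the gradient: negative inputs still pass a small fraction of it backward.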



In [39]:
# grid search
from sklearn.model_selection import GridSearchCV
params = {
    'module__activation': [ReLU(), nn.Tanh(), nn.LeakyReLU()]
}

In [40]:
# we will run a GridSearch of the above parameters on 3-folds
gs = GridSearchCV(net, params, refit = False,cv = 3,scoring = 'accuracy')

In [41]:
gs.fit(X.numpy(),y.numpy())

Re-initializing module because the following parameters were re-set: activation.
Re-initializing optimizer.
Re-initializing module because the following parameters were re-set: activation.
Re-initializing optimizer because the following parameters were re-set: momentum.
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.6218       0.9190        0.2577  3.8483
      2        0.1946       0.9483        0.1678  3.7291
      3        0.1281       0.9563        0.1405  3.8284
      4        0.0918       0.9608        0.1324  4.0050
      5        0.0682       0.9628        0.1276  3.8801
      6        0.0523       0.9626        0.1330  3.8397
      7        0.0399       0.9655        0.1249  4.2604
      8        0.0319       0.9654        0.1265  4.0021
      9        0.0267       0.9659        0.1271  3.6260
     10        0.0242       0.9639        0.1444  3.8109
     11        0.0199       0.9625        0.1

     10        0.0192       0.9685        0.1072  2.1350
     11        0.0146       0.9714        0.0970  2.2160
     12        0.0113       0.9731        0.0931  2.1110
     13        0.0087       0.9730        0.0971  2.1090
     14        0.0069       0.9734        0.0973  2.1020
     15        0.0056       0.9731        0.0965  2.1530
     16        0.0045       0.9735        0.0956  2.1540
     17        0.0037       0.9729        0.0961  2.1340
     18        0.0031       0.9735        0.0970  2.1550
     19        0.0028       0.9734        0.0977  2.1180
     20        0.0025       0.9733        0.0978  2.1750
     21        0.0022       0.9733        0.0974  2.1480
     22        0.0020       0.9740        0.0975  2.2220
     23        0.0018       0.9736        0.0979  2.6770
     24        0.0017       0.9736        0.0982  2.5870
     25        0.0015       0.9738        0.0985  2.3720
Re-initializing module because the following parameters were re-set: activation.
Re-init

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=<class 'skorch.classifier.NeuralNetClassifier'>[initialized](
  module_=NeuralNet(
    (fc1): Linear(in_features=784, out_features=128, bias=True)
    (fc2): Linear(in_features=128, out_features=64, bias=True)
    (fc3): Linear(in_features=64, out_features=10, bias=True)
    (activation): ReLU()
  ),
),
             iid='warn', n_jobs=None,
             param_grid={'module__activation': [ReLU(), Tanh(),
                                                LeakyReLU(negative_slope=0.01)]},
             pre_dispatch='2*n_jobs', refit=False, return_train_score=False,
             scoring='accuracy', verbose=0)

In [70]:
# format the results such as to feed to HiPlot
# torch.save(gs.cv_results_,'activation_CV.pt')
import pandas as pd
results = pd.DataFrame(gs.cv_results_).iloc[:, [4,6,7,8,9,10,11]]
results.param_module__activation = results.param_module__activation.astype(str)
results.head()

| | param_module__activation | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|
| 0 | ReLU() | 0.976905 | 0.972199 | 0.974496 | 0.974533 | 0.001921 | 2 |
| 1 | Tanh() | 0.973205 | 0.970299 | 0.972646 | 0.97205 | 0.001259 | 3 |
| 2 | LeakyReLU(negative_slope=0.01) | 0.975705 | 0.973249 | 0.975096 | 0.974683 | 0.001044 | 1 |


In [71]:
data = []
for row in results.iterrows():
    data.append(row[1].to_dict())
data

[{'param_module__activation': 'ReLU()',
  'split0_test_score': 0.9769046190761848,
  'split1_test_score': 0.9721986099304966,
  'split2_test_score': 0.9744961744261639,
  'mean_test_score': 0.9745333333333334,
  'std_test_score': 0.001921471837924856,
  'rank_test_score': 2},
 {'param_module__activation': 'Tanh()',
  'split0_test_score': 0.9732053589282144,
  'split1_test_score': 0.9702985149257463,
  'split2_test_score': 0.9726458968845327,
  'mean_test_score': 0.97205,
  'std_test_score': 0.0012593262268930255,
  'rank_test_score': 3},
 {'param_module__activation': 'LeakyReLU(negative_slope=0.01)',
  'split0_test_score': 0.9757048590281944,
  'split1_test_score': 0.9732486624331217,
  'split2_test_score': 0.9750962644396659,
  'mean_test_score': 0.9746833333333333,
  'std_test_score': 0.0010444117399852609,
  'rank_test_score': 1}]

In [73]:
hip.Experiment.from_iterable(data).display(force_full_width = True)

<hiplot.ipython.IPythonExperimentDisplayed at 0x2c1a6aa2828>
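When reading the comparison, it can help to set the gap between the top two activations against the fold-to-fold spread. The sketch below does this in plain Python with the mean and standard deviation values copied from the results above. The "gap smaller than the standard deviation" test is only a rough rule of thumb assumed here, not a formal significance test.

```python
# (mean_test_score, std_test_score) copied from the 3-fold grid search results
cv_results = {
    "ReLU()": (0.974533, 0.001921),
    "Tanh()": (0.97205, 0.001259),
    "LeakyReLU(negative_slope=0.01)": (0.974683, 0.001044),
}

# Rank activations by mean accuracy, best first
ranked = sorted(cv_results.items(), key=lambda item: item[1][0], reverse=True)
(best_name, (best_mean, best_std)), (second_name, (second_mean, second_std)) = ranked[:2]

gap = best_mean - second_mean
# Rough rule of thumb: a gap smaller than the fold-to-fold spread is not a clear win
within_noise = gap < max(best_std, second_std)
print(best_name, second_name, round(gap, 6), within_noise)
```

On these numbers, LeakyReLU ranks first, but its lead over ReLU is much smaller than the variation between folds, so the ranking of the top two should be read with caution.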