In [1]:
import numpy as np

import torch
from torch import nn
from torch import optim

# Activation & Loss Functions

Between the layers of a neural network, we pass values derived from the previous layer. We call this set of values the activations of the layer. Before passing the values on, we apply an activation function to introduce non-linearity into the network.
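To see why the non-linearity matters, here is a minimal sketch showing that two linear layers stacked without an activation collapse into a single linear layer. The layer sizes, the seed, and the names `first`, `second` and `inputs` are arbitrary choices for illustration and are not used elsewhere in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed so the sketch is repeatable

first = nn.Linear(3, 4)
second = nn.Linear(4, 2)
inputs = torch.randn(5, 3)

# Two linear layers applied back to back, no activation in between
stacked = second(first(inputs))

# The same mapping written as ONE linear layer:
# W = W2 @ W1 and b = W2 @ b1 + b2
combined_weight = second.weight @ first.weight
combined_bias = second.weight @ first.bias + second.bias
collapsed = inputs @ combined_weight.T + combined_bias

# Inserting an activation between the layers breaks this equivalence
with_activation = second(torch.tanh(first(inputs)))

print(torch.allclose(stacked, collapsed, atol=1e-6))
print(torch.allclose(stacked, with_activation, atol=1e-6))
```

Without the activation, adding depth gives the network no extra expressive power, which is the motivation for the functions listed below.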

This notebook consolidates the activation functions I know of.

List of Activation Functions:
1. Sigmoid
    - 1 / (1 + e^-x)
    - Range: 0 to 1
2. Tanh
    - (e^x - e^-x) / (e^x + e^-x)
    - Range: -1 to 1
3. ReLU
4. Leaky ReLU
5. Softmax
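As a quick sanity check on the list above, the following sketch evaluates each function on a handful of sample inputs using PyTorch's built-in implementations. The input range and the `negative_slope` of 0.01 (PyTorch's default for Leaky ReLU) are choices made for illustration only.

```python
import torch
from torch import nn

z = torch.linspace(-5.0, 5.0, steps=11)  # sample inputs: -5, -4, ..., 5

sigmoid_out = torch.sigmoid(z)        # squashed into (0, 1)
tanh_out = torch.tanh(z)              # squashed into (-1, 1)
relu_out = torch.relu(z)              # max(0, z)
leaky_out = nn.functional.leaky_relu(z, negative_slope=0.01)  # small slope for z < 0
softmax_out = torch.softmax(z, dim=0)  # non-negative values that sum to 1

for name, out in [('sigmoid', sigmoid_out), ('tanh', tanh_out),
                  ('relu', relu_out), ('leaky_relu', leaky_out),
                  ('softmax', softmax_out)]:
    print('{}: min={:.4f}, max={:.4f}'.format(name, out.min().item(), out.max().item()))
```

Note that softmax differs from the others: it is applied across a whole vector rather than element by element, which is why its outputs sum to 1.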

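List of Loss Functions:
1. Cross-Entropy
    - −(y log(p) + (1−y) log(1−p))
    - This form is for binary classification, where y is the true label (0 or 1) and p is the predicted probability of class 1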
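The binary cross-entropy formula can be checked by hand against PyTorch's `nn.BCELoss`, which averages the per-sample losses by default. The probabilities and labels below are made-up values for illustration, not data from this notebook.

```python
import torch
from torch import nn

# Hypothetical predicted probabilities of class 1 and the true labels
probs = torch.tensor([0.9, 0.2, 0.6])
labels = torch.tensor([1.0, 0.0, 1.0])

# Formula from the list above, averaged over the samples
manual_bce = -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).mean()

# PyTorch's built-in version (expects probabilities, not raw scores)
builtin_bce = nn.BCELoss()(probs, labels)

print(manual_bce.item(), builtin_bce.item())
```

The loss is small when a confident prediction is right (0.9 for a true label of 1) and grows as the predicted probability moves away from the true label.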


## Generating Fake data

In [2]:
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
%matplotlib inline

## Sigmoid + Cross Entropy

In [3]:
x, y = make_classification(n_samples=1000, 
                           n_features=2, 
                           n_classes=2,
                           n_informative=2,
                           n_redundant=0,
                           flip_y=0.1,
                           class_sep=0.9
                          )

x = torch.from_numpy(x).float()
y = torch.from_numpy(y).long()

In [4]:
# Network Architecture
sigmoid_model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid()
)

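# loss function - CrossEntropy, optimiser - SGD
# nn.CrossEntropyLoss applies log-softmax internally, so the sigmoid outputs are treated as raw scores
criterion = nn.CrossEntropyLoss()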
optimiser = optim.SGD(sigmoid_model.parameters(), lr=0.1)

In [5]:
# Training the model
epochs = 1000
loss = []

for e in range(epochs):
    optimiser.zero_grad() # zero gradients
    
    output = sigmoid_model(x) # forward prop (call the module rather than .forward directly)
    running_loss = criterion(output, y) # calculate loss
    loss.append(running_loss.item()) # store loss as a plain float, detached from the graph
    running_loss.backward() # back prop
    optimiser.step() #  update weights
    
    if e % 10 == 0:
        print('{}/{} --- Loss: {}'.format(e+1, epochs, running_loss))

1/1000 --- Loss: 0.6427337527275085
11/1000 --- Loss: 0.6298125386238098
21/1000 --- Loss: 0.6181340217590332
31/1000 --- Loss: 0.6075857877731323
41/1000 --- Loss: 0.5980523228645325
51/1000 --- Loss: 0.5894248485565186
61/1000 --- Loss: 0.5816006064414978
71/1000 --- Loss: 0.5744912028312683
81/1000 --- Loss: 0.5680191516876221
91/1000 --- Loss: 0.5621114373207092
101/1000 --- Loss: 0.556709349155426
111/1000 --- Loss: 0.5517580509185791
121/1000 --- Loss: 0.5472099781036377
131/1000 --- Loss: 0.5430224537849426
141/1000 --- Loss: 0.5391596555709839
151/1000 --- Loss: 0.5355876088142395
161/1000 --- Loss: 0.5322769284248352
171/1000 --- Loss: 0.5292025804519653
181/1000 --- Loss: 0.5263410806655884
191/1000 --- Loss: 0.5236734747886658
201/1000 --- Loss: 0.5211797952651978
211/1000 --- Loss: 0.51884526014328
221/1000 --- Loss: 0.5166550874710083
231/1000 --- Loss: 0.5145966410636902
241/1000 --- Loss: 0.5126585960388184
251/1000 --- Loss: 0.5108311176300049
261/1000 --- Loss: 0.50910

In [6]:
print(output.max())
print(output.min())

tensor(0.9999, grad_fn=<MaxBackward1>)
tensor(8.3125e-05, grad_fn=<MinBackward1>)


After running our values through the sigmoid function, we can see that they are restricted to the range between 0 and 1.

In [7]:
output[:5]

tensor([[0.3863, 0.5450],
        [0.0499, 0.9405],
        [0.9589, 0.0344],
        [0.4205, 0.5515],
        [0.5080, 0.4148]], grad_fn=<SliceBackward>)

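Since each value lies between 0 and 1, it is tempting to interpret column 1 as the probability of class 0 and column 2 as the probability of class 1. Note, however, that the two sigmoid outputs are computed independently and the rows do not sum to 1 (the first row sums to about 0.93), so they are better read as per-class scores than as a proper probability distribution.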
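A quick check on how these values relate to probabilities: the sketch below takes the first two rows printed above, sums each row, and applies a softmax to obtain values that do sum to 1. It also confirms that `nn.CrossEntropyLoss` is equivalent to a log-softmax followed by `nn.NLLLoss`, which is why the loss treats the model outputs as raw scores. The `demo_targets` labels are invented for illustration.

```python
import torch
from torch import nn

# First two rows of the sigmoid model's output shown above
demo_scores = torch.tensor([[0.3863, 0.5450],
                            [0.0499, 0.9405]])
demo_targets = torch.tensor([1, 1])  # hypothetical labels, for illustration only

row_sums = demo_scores.sum(dim=1)                  # independent sigmoids need not sum to 1
demo_probs = torch.softmax(demo_scores, dim=1)     # softmax rows always sum to 1

# CrossEntropyLoss on raw scores == NLLLoss on log-softmax of those scores
ce = nn.CrossEntropyLoss()(demo_scores, demo_targets)
nll = nn.NLLLoss()(torch.log_softmax(demo_scores, dim=1), demo_targets)

print(row_sums)
print(demo_probs.sum(dim=1))
print(ce.item(), nll.item())
```

The predicted class is the same either way, because softmax preserves the ordering of the scores within each row.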

In [8]:
# Take the class with the larger score as the prediction
_, predictions = torch.max(output, dim=1)

In [10]:
# View standard classification metrics
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.91      0.83      0.87       503
           1       0.84      0.92      0.88       497

    accuracy                           0.87      1000
   macro avg       0.88      0.87      0.87      1000
weighted avg       0.88      0.87      0.87      1000



## Tanh

In [11]:
# Network Architecture
tanh_model = nn.Sequential(
    nn.Linear(2,2),
    nn.Tanh()
)

# loss function - CrossEntropy, optimiser - SGD
criterion = nn.CrossEntropyLoss()
optimiser = optim.SGD(tanh_model.parameters(), lr=0.1)

In [12]:
# Training the model
epochs = 1000
loss = []

for e in range(epochs):
    optimiser.zero_grad() # zero gradients
    
    output = tanh_model(x) # forward prop (call the module rather than .forward directly)
    running_loss = criterion(output, y) # calculate loss
    loss.append(running_loss.item()) # store loss as a plain float, detached from the graph
    running_loss.backward() # back prop
    optimiser.step() #  update weights
    
    if e % 10 == 0:
        print('{}/{} --- Loss: {}'.format(e+1, epochs, running_loss))

1/1000 --- Loss: 0.9230818748474121
11/1000 --- Loss: 0.7356969118118286
21/1000 --- Loss: 0.6006231307983398
31/1000 --- Loss: 0.5193090438842773
41/1000 --- Loss: 0.4716643989086151
51/1000 --- Loss: 0.44304099678993225
61/1000 --- Loss: 0.4247363209724426
71/1000 --- Loss: 0.41223159432411194
81/1000 --- Loss: 0.4032110571861267
91/1000 --- Loss: 0.3964187800884247
101/1000 --- Loss: 0.39113208651542664
111/1000 --- Loss: 0.3869064748287201
121/1000 --- Loss: 0.38345783948898315
131/1000 --- Loss: 0.3805941343307495
141/1000 --- Loss: 0.37818092107772827
151/1000 --- Loss: 0.37612324953079224
161/1000 --- Loss: 0.374348908662796
171/1000 --- Loss: 0.3728054165840149
181/1000 --- Loss: 0.3714505136013031
191/1000 --- Loss: 0.3702529966831207
201/1000 --- Loss: 0.36918750405311584
211/1000 --- Loss: 0.3682333827018738
221/1000 --- Loss: 0.36737409234046936
231/1000 --- Loss: 0.3665965795516968
241/1000 --- Loss: 0.36589014530181885
251/1000 --- Loss: 0.36524486541748047
261/1000 --- L

Judging from the training loss alone, the Tanh model converges to a noticeably smaller loss than the sigmoid model (about 0.37 versus about 0.51 at epoch 251).

In [13]:
print(output.max())
print(output.min())

tensor(1.0000, grad_fn=<MaxBackward1>)
tensor(-1.0000, grad_fn=<MinBackward1>)


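Unlike the sigmoid function, Tanh is bounded between -1 and 1. As such, its output cannot be interpreted as a class probability. The improved learning is often attributed to Tanh being centered around 0, unlike the sigmoid function. Here, the wider output range also matters: the loss applies a softmax to these outputs, and a larger possible gap between the two scores lets the softmax assign more probability to the correct class.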
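One way to see how the output range affects the loss values is to compute the smallest cross-entropy each model could reach for a single sample, assuming the correct class sits at the top of the activation's range and the other class at the bottom. This is a sketch of that best case only; the helper `best_case_loss` is a hypothetical name introduced here, not part of PyTorch.

```python
import math
import torch
from torch import nn

def best_case_loss(low, high):
    """Cross-entropy for one sample whose correct class scores `high` and the other `low`."""
    scores = torch.tensor([[high, low]])
    target = torch.tensor([0])  # the correct class is the first column
    return nn.CrossEntropyLoss()(scores, target).item()

# Sigmoid outputs lie in (0, 1): the score gap is at most 1
sigmoid_floor = best_case_loss(0.0, 1.0)   # equals log(1 + e^-1)

# Tanh outputs lie in (-1, 1): the score gap is at most 2
tanh_floor = best_case_loss(-1.0, 1.0)     # equals log(1 + e^-2)

print(sigmoid_floor, math.log(1 + math.exp(-1)))
print(tanh_floor, math.log(1 + math.exp(-2)))
```

Both training runs above stay above their respective floors, so the floors do not fully explain the gap, but they do mean the two loss curves are not directly comparable as a measure of model quality. The classification reports are a fairer comparison.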

In [14]:
# Take the class with the larger score as the prediction (Tanh outputs are not probabilities)
_, predictions = torch.max(output, dim=1)

In [15]:
# View classical classification scoring methods
print(classification_report(y, predictions))

              precision    recall  f1-score   support

           0       0.91      0.84      0.87       503
           1       0.85      0.91      0.88       497

    accuracy                           0.88      1000
   macro avg       0.88      0.88      0.88      1000
weighted avg       0.88      0.88      0.88      1000

