# Softmax activation

This note uses the MNIST dataset to illustrate the softmax activation function.

The input has shape $(N, M)$, where $N$ is the number of samples and $M$ is the number of features.
The output is one of the digits 0 to 9, which requires a weight matrix of shape $(M, 10)$. This is a "which one of the outputs is best?" question; [Activation functions](activation_functions.ipynb) discusses why softmax is a better fit than sigmoid in this case.
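
To make the shapes concrete, here is a minimal sketch with random data. The sizes ($N = 4$, and $M = 784$ for flattened 28×28 images) and the names `X`, `W`, and `scores` are illustrative assumptions, not part of the original note.

```python
import numpy as np

rng = np.random.default_rng(0)

N, M, K = 4, 784, 10          # assumed sizes: 4 samples, 28*28 pixels, 10 digits
X = rng.random((N, M))        # input batch, shape (N, M)
W = rng.normal(size=(M, K))   # weight matrix, shape (M, 10)

scores = X @ W                # one raw score per digit for every sample
print(scores.shape)           # shape of the score matrix: (N, K)
print(scores.argmax(axis=1))  # index of the best-scoring digit per sample
```

Each row of `scores` holds ten raw values; softmax turns such a row into a probability distribution over the ten digits.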


General definition of softmax:

$$ \sigma(z_j) = \frac{e^{z_j}}{\sum_{i=1}^{K}e^{z_i}}$$

Here, there are a total of $K$ outputs.


 

In [7]:
import numpy as np 
def softmax(x):
    return np.exp(x)/np.sum(np.exp(x))

np.random.seed(0)
output = np.random.randint(0,10, size=10)
sm_output = softmax(output)
print(sm_output)
print(np.sum(sm_output))


[1.54279046e-02 1.03952403e-04 2.08793983e-03 2.08793983e-03
 1.13997652e-01 8.42335048e-01 2.08793983e-03 1.54279046e-02
 7.68110139e-04 5.67560891e-03]
1.0
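
The cell above exponentiates the raw scores directly, which overflows for large inputs (`np.exp(1000)` is `inf`). A common remedy, sketched here as an assumption rather than as part of the original note, is to subtract the maximum before exponentiating; the result is unchanged because the common factor cancels in the ratio. The name `stable_softmax` and its `axis` parameter are hypothetical.

```python
import numpy as np

def stable_softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the max to avoid overflow in exp."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

# Large scores that would overflow without the shift.
big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))

# Batch of scores, shape (N, K): each row is normalized separately.
batch = np.array([[1.0, 3.0, 2.0],
                  [7.0, 1.0, 2.0]])
print(stable_softmax(batch, axis=1).sum(axis=1))
```

The `axis` argument matters for batched input: normalizing over the whole array, as the one-line version does, would mix probabilities across samples.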



## Logits and Cross Entropy

The raw output of the final linear layer is known as the logits; they are the input to the softmax. Cross entropy, as defined here, measures the difference between two distributions, the prediction $\hat{y}$ and the target $y$.

The sum of these differences forms the loss function.

![loss function](../figs/cross-entropy.png)
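
The figure is not reproduced in this text. Assuming it shows the standard definition, which is also what the code in the next cell computes, the cross entropy between a one-hot target $y$ and a prediction $\hat{y}$ over $K$ classes is:

$$ D(\hat{y}, y) = -\sum_{i=1}^{K} y_i \log \hat{y}_i $$

Because $y$ is one-hot, only the term for the true class survives, so the loss reduces to $-\log \hat{y}_c$, where $c$ is the index of the true class. The symbols $D$ and $c$ are introduced here for illustration only.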



In [16]:
import numpy as np 
def cross_entropy(Y, Y_pred):
    """ Note: there is no softmax applied to Y_pred here """
    return np.sum(-Y * np.log(Y_pred))


# one-hot encoding
Y = np.array([0, 1, 0])

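# This will NOT work: log(0) is -inf, and 0 * -inf evaluates to nan
# Y_pred1 = np.array([.2, .7, 0])
# Predictions are probability distributions, so each one sums to 1
Y_pred1 = np.array([.2, .7, .1])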
Y_pred2 = np.array([.7, .2, .1])

print(f"loss1 = {cross_entropy(Y, Y_pred1)}") # good prediction
print(f"loss2 = {cross_entropy(Y, Y_pred2)}") # bad prediction

loss1 = 0.35667494393873245
loss2 = 1.6094379124341003
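
The commented-out prediction in the cell above fails because it contains an exact zero: `np.log(0)` is `-inf`, and `0 * -inf` evaluates to `nan`. One common workaround, shown here as a sketch and not as the author's method, is to clip the predictions away from zero before taking the logarithm. The function name `safe_cross_entropy` and the `eps` value are assumptions.

```python
import numpy as np

def safe_cross_entropy(Y, Y_pred, eps=1e-12):
    """Cross entropy with predictions clipped to [eps, 1] so log never sees 0."""
    Y_pred = np.clip(Y_pred, eps, 1.0)
    return np.sum(-Y * np.log(Y_pred))

Y = np.array([0, 1, 0])          # one-hot target, true class is 1
Y_pred = np.array([.2, .7, 0])   # prediction containing an exact zero

loss = safe_cross_entropy(Y, Y_pred)
print(loss)  # finite, equal to -log(0.7)
```

Clipping changes the loss only when a prediction is smaller than `eps`, so well-behaved predictions give the same value as the unclipped version.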


## Cross entropy in PyTorch


In [27]:
import torch
import torch.nn as nn

# CrossEntropyLoss applies log-softmax and negative log-likelihood together
loss = nn.CrossEntropyLoss()

# The target is a class index, not a one-hot vector.
# For example, with three classes it must be 0, 1, or 2.
# Here the true class is 1.
Y = torch.tensor([1])

# The inputs are raw logits, not softmax probabilities
Y_pred1 = torch.tensor([[1.0, 3.0, 2.0]])
Y_pred2 = torch.tensor([[7.0, 1.0, 2.0]])

l1 = loss(Y_pred1, Y)
l2 = loss(Y_pred2, Y)
print(f"PyTorch loss1={l1.item()}")
print(f"PyTorch loss2={l2.item()}")


PyTorch loss1=0.4076060354709625
PyTorch loss2=6.009174346923828
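
`nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood of the target class. The first value above can be reproduced by hand; the following NumPy sketch (variable names are illustrative) applies softmax to the logits and takes the negative log of the probability assigned to class 1.

```python
import numpy as np

logits = np.array([1.0, 3.0, 2.0])
target = 1  # class index, as in the PyTorch example

probs = np.exp(logits) / np.sum(np.exp(logits))  # softmax over the logits
loss = -np.log(probs[target])                    # negative log-likelihood

print(loss)  # approximately 0.4076
```

This is why the PyTorch inputs must be logits: passing probabilities that have already been through a softmax would apply the normalization twice.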


## Batch Loss 

PyTorch supports predictions and targets in batches: that is, multiple predictions correspond to multiple targets, and by default the loss is averaged over the batch.


In [28]:
import torch
import torch.nn as nn

# CrossEntropyLoss applies log-softmax and negative log-likelihood together
loss = nn.CrossEntropyLoss()

# The targets are class indices, not one-hot vectors.
# With three classes, each entry must be 0, 1, or 2.
# Here the true classes of the three samples are 2, 0, and 1.
Y = torch.tensor([2, 0, 1])

# The inputs are raw logits, not softmax probabilities
Y_pred1 = torch.tensor([
    [0.1, 0.2, 0.9],
    [1.1, 0.1, 0.2],
    [0.2, 2.1, 0.1]
    ])  # good prediction: the largest logit in each row matches the target

l1 = loss(Y_pred1, Y)
print(f"PyTorch loss1={l1.item()}")


PyTorch loss1=0.4966353178024292
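
By default the batch loss is the mean of the per-sample losses (`reduction='mean'`). A NumPy sketch of that computation follows, under the assumption that no class weights or ignored indices are in play; the variable names are illustrative.

```python
import numpy as np

logits = np.array([[0.1, 0.2, 0.9],
                   [1.1, 0.1, 0.2],
                   [0.2, 2.1, 0.1]])
targets = np.array([2, 0, 1])

# Row-wise log-softmax: subtract log(sum(exp)) of each row.
log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))

# Pick the log-probability of the target class in every row.
per_sample = -log_probs[np.arange(len(targets)), targets]

print(per_sample)         # one loss per sample
print(per_sample.mean())  # approximately 0.4966
```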


## Apply to MNIST


In [30]:
# https://github.com/pytorch/examples/blob/master/mnist/main.py
from __future__ import print_function
from torch import nn, optim, cuda
from torch.utils import data
from torchvision import datasets, transforms
import torch.nn.functional as F
import time

# Training settings
batch_size = 64
device = 'cuda' if cuda.is_available() else 'cpu'
print(f'Training MNIST Model on {device}\n{"=" * 44}')

# MNIST Dataset
train_dataset = datasets.MNIST(root='./mnist_data/',
                               train=True,
                               transform=transforms.ToTensor(),
                               download=True)

test_dataset = datasets.MNIST(root='./mnist_data/',
                              train=False,
                              transform=transforms.ToTensor())

# Data Loader (Input Pipeline)
train_loader = data.DataLoader(dataset=train_dataset,
                               batch_size=batch_size,
                               shuffle=True)

test_loader = data.DataLoader(dataset=test_dataset,
                              batch_size=batch_size,
                              shuffle=False)


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.l1 = nn.Linear(784, 520)
        self.l2 = nn.Linear(520, 320)
        self.l3 = nn.Linear(320, 240)
        self.l4 = nn.Linear(240, 120)
        self.l5 = nn.Linear(120, 10)

    def forward(self, x):
        x = x.view(-1, 784)  # Flatten the data (n, 1, 28, 28)-> (n, 784)
        x = F.relu(self.l1(x))
        x = F.relu(self.l2(x))
        x = F.relu(self.l3(x))
        x = F.relu(self.l4(x))
        return self.l5(x)


model = Net()
model.to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.5)


def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 10 == 0:
            print('Train Epoch: {} | Batch Status: {}/{} ({:.0f}%) | Loss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test():
    model.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = model(data)
        # sum up the per-sample losses of the batch
        # (criterion alone returns the batch mean, which would skew the average below)
        test_loss += F.cross_entropy(output, target, reduction='sum').item()
        # get the index of the max logit, i.e. the predicted class
        pred = output.argmax(dim=1, keepdim=True)
        correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print(f'===========================\nTest set: Average loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_loader.dataset)} '
          f'({100. * correct / len(test_loader.dataset):.0f}%)')


if __name__ == '__main__':
    since = time.time()
    for epoch in range(1, 10):
        epoch_start = time.time()
        train(epoch)
        m, s = divmod(time.time() - epoch_start, 60)
        print(f'Training time: {m:.0f}m {s:.0f}s')
        test_start = time.time()  # time the test pass separately from training
        test()
        m, s = divmod(time.time() - test_start, 60)
        print(f'Testing time: {m:.0f}m {s:.0f}s')

    m, s = divmod(time.time() - since, 60)
    print(f'Total Time: {m:.0f}m {s:.0f}s\nModel was trained on {device}!')

Training MNIST Model on cpu
Train Epoch: 1 | Batch Status: 0/60000 (0%) | Loss: 2.296754
Train Epoch: 1 | Batch Status: 640/60000 (1%) | Loss: 2.296579
Train Epoch: 1 | Batch Status: 1280/60000 (2%) | Loss: 2.303263
Train Epoch: 1 | Batch Status: 1920/60000 (3%) | Loss: 2.297008
Train Epoch: 1 | Batch Status: 2560/60000 (4%) | Loss: 2.301572
Train Epoch: 1 | Batch Status: 3200/60000 (5%) | Loss: 2.293637
Train Epoch: 1 | Batch Status: 3840/60000 (6%) | Loss: 2.295901
Train Epoch: 1 | Batch Status: 4480/60000 (7%) | Loss: 2.286935
Train Epoch: 1 | Batch Status: 5120/60000 (9%) | Loss: 2.302372
Train Epoch: 1 | Batch Status: 5760/60000 (10%) | Loss: 2.300096
Train Epoch: 1 | Batch Status: 6400/60000 (11%) | Loss: 2.299518
Train Epoch: 1 | Batch Status: 7040/60000 (12%) | Loss: 2.293327
Train Epoch: 1 | Batch Status: 7680/60000 (13%) | Loss: 2.298049
Train Epoch: 1 | Batch Status: 8320/60000 (14%) | Loss: 2.291863
Train Epoch: 1 | Batch Status: 8960/60000 (15%) | Loss: 2.301373
Train Epoc