Cross Entropy
====

There's an excellent explanation of cross-entropy and related functions at https://machinelearningmastery.com/cross-entropy-for-machine-learning/ (Brownlee, 2019).

Brownlee gives a good explanation, with code, of cross-entropy from scratch. Let's first look at how it's implemented in PyTorch and how to use it.

# Cross-Entropy Loss (with torch)

In [255]:
import torch
from torch import nn

from torch import optim

In [185]:
criterion = nn.CrossEntropyLoss()

# Assuming batch-first layout, we have
# 5 data points with 2 real-valued outputs each.
# Each output represents the perceptron's prediction for one label.
last_layer = torch.randn(5, 2)
# Note: nn.CrossEntropyLoss applies log-softmax internally and is normally
# given raw scores (logits), so this sigmoid is not needed for the loss.
predictions = torch.sigmoid(last_layer)

# Correspondingly, we have 5 data points with 1 label each.
# Each label is represented by an integer class index.
truth = torch.LongTensor(5, 1).random_(0,2).squeeze(1)

In [186]:
last_layer

tensor([[-0.7833, -0.6955],
        [-0.3951, -1.2426],
        [ 0.5134, -1.4385],
        [-0.5044,  0.4975],
        [-1.4518, -1.5407]])

In [187]:
# For each data point, this is the
# sigmoid output per label.
predictions

tensor([[0.3136, 0.3328],
        [0.4025, 0.2240],
        [0.6256, 0.1918],
        [0.3765, 0.6219],
        [0.1897, 0.1764]])

In [188]:
# Each data point has one label, and our label space
# is made up of the labels 0 and 1.
truth

tensor([0, 0, 0, 1, 0])

In [189]:
loss = criterion(predictions, truth)

In [190]:
loss

tensor(0.6149)
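To see where this number comes from, here is a minimal from-scratch sketch in plain Python. It assumes the default behaviour of `nn.CrossEntropyLoss`: softmax over each row, negative natural log of the probability at the true label, then the mean over data points. The helper names `softmax` and `cross_entropy` are made up for illustration, and the inputs are the rounded values printed above, so the result only matches to about three decimals.

```python
import math

# Rounded values copied from the `predictions` and `truth` cells above.
predictions = [[0.3136, 0.3328],
               [0.4025, 0.2240],
               [0.6256, 0.1918],
               [0.3765, 0.6219],
               [0.1897, 0.1764]]
truth = [0, 0, 0, 1, 0]

def softmax(row):
    # Subtract the row maximum first for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(rows, labels):
    # Negative natural log of the softmax probability of the true label,
    # averaged over all data points.
    losses = [-math.log(softmax(row)[label]) for row, label in zip(rows, labels)]
    return sum(losses) / len(losses)

loss = cross_entropy(predictions, truth)
print(loss)  # close to the tensor(0.6149) shown above
```

Note that the softmax is applied on top of whatever is passed in, which here is already a sigmoid output.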

# Binary Cross-Entropy Loss (with torch)

In [89]:
criterion = nn.BCELoss()

# Assuming batch-first layout, we have
# 5 data points with 3 real-valued outputs each.
last_layer = torch.randn(5, 3)
predictions = torch.sigmoid(last_layer)

# Correspondingly, we have 5 data points
# with 3 boolean labels each.
truth = torch.LongTensor(5, 3).random_(0,2).float()

In [90]:
last_layer # Before activation function.

tensor([[-2.0245,  0.2092,  0.5374],
        [-0.6163,  1.1205,  0.4234],
        [ 0.7822,  0.8506,  0.2875],
        [ 1.0529,  1.7423,  0.5177],
        [ 0.5219, -0.5256, -1.3008]])

In [91]:
predictions # After activation function.

tensor([[0.1167, 0.5521, 0.6312],
        [0.3506, 0.7541, 0.6043],
        [0.6861, 0.7007, 0.5714],
        [0.7413, 0.8510, 0.6266],
        [0.6276, 0.3715, 0.2140]])

In [95]:
# This case is special in that each data point has 3 labels.
# nn.BCELoss accepts any float tensor of the same shape as the target,
# so it will compute a loss for arbitrary label values.
# Here, we're "cheating" by restricting the label space to 0s and 1s.
truth

tensor([[0., 0., 0.],
        [1., 0., 0.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 0., 0.]])

In [96]:
loss = criterion(predictions, truth)

In [97]:
loss

tensor(0.5796)
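The same value can be reproduced by hand. This is a small sketch in plain Python, assuming the default mean reduction of `nn.BCELoss` over every element; the helper name `binary_cross_entropy` is made up for illustration, and the inputs are the rounded values printed above.

```python
import math

# Rounded values copied from the `predictions` and `truth` cells above.
predictions = [[0.1167, 0.5521, 0.6312],
               [0.3506, 0.7541, 0.6043],
               [0.6861, 0.7007, 0.5714],
               [0.7413, 0.8510, 0.6266],
               [0.6276, 0.3715, 0.2140]]
truth = [[0., 0., 0.],
         [1., 0., 0.],
         [1., 1., 1.],
         [1., 1., 1.],
         [1., 0., 0.]]

def binary_cross_entropy(rows, targets):
    # One term per (data point, label) pair:
    #   -(t * ln(p) + (1 - t) * ln(1 - p))
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for row, target_row in zip(rows, targets)
             for p, t in zip(row, target_row)]
    # Mean over all 15 elements, not over the 5 rows.
    return sum(terms) / len(terms)

loss = binary_cross_entropy(predictions, truth)
print(loss)  # close to the tensor(0.5796) shown above
```

Unlike the cross-entropy case, no softmax is involved: each of the 3 labels is scored independently.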

# What happens when the space isn't just 0s or 1s?

In [98]:
criterion = nn.BCELoss()

# Assuming batch-first layout, we have
# 5 data points with 3 real-valued outputs each.
last_layer = torch.randn(5, 3)
predictions = torch.sigmoid(last_layer)

# Correspondingly, we have 5 data points
# with 3 integer labels each, drawn from 0 to 4.
truth = torch.LongTensor(5, 3).random_(0,5).float()

In [99]:
predictions

tensor([[0.3793, 0.7710, 0.5964],
        [0.9048, 0.8805, 0.4903],
        [0.2629, 0.6860, 0.6095],
        [0.5221, 0.2022, 0.6634],
        [0.1605, 0.3513, 0.6699]])

In [100]:
# Each data point still has 3 labels, but the values
# now range from 0 to 4, which is outside the [0, 1] range
# that nn.BCELoss assumes for its targets.
truth

tensor([[0., 4., 0.],
        [2., 2., 1.],
        [2., 1., 2.],
        [4., 4., 2.],
        [2., 2., 1.]])

In [101]:
loss = criterion(predictions, truth)

In [102]:
loss

tensor(0.5911)
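To get a feel for what a target outside [0, 1] does, here is a sketch of a single term of the binary cross-entropy formula, assuming the element-wise definition given in the `nn.BCELoss` documentation. The helper name `bce_term` is made up for illustration; the values 0.7710 and 4 are taken from the first row, second column of the cells above.

```python
import math

def bce_term(p, t):
    # Element-wise binary cross-entropy: -(t * ln(p) + (1 - t) * ln(1 - p))
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

# With a target inside [0, 1] the term is a proper, non-negative loss.
in_range = bce_term(0.7710, 1.0)

# With t = 4 the factor (1 - t) becomes -3, so the second part of the
# formula flips sign and the term can go negative.
out_of_range = bce_term(0.7710, 4.0)

print(in_range, out_of_range)
```

A negative term no longer behaves like a loss that bottoms out at zero, so targets should be kept within [0, 1] when using this criterion.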

# But how does that single scalar do backpropagation?

The scalar is just a summary =)

By default the criterion averages the loss over all the data points, which gives a scalar. Underneath, each data point has one term per label in the label space, so every data point actually contributes a vector of terms, and the gradients flow back through those.
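One way to look behind the scalar is the `reduction` argument of the criterion. The sketch below assumes the rounded `predictions` and `truth` values printed in this notebook; `reduction='none'` keeps one loss per data point, and the default (`'mean'`) averages them.

```python
import torch
from torch import nn

# Rounded values copied from the `predictions` and `truth` cells in this notebook.
predictions = torch.tensor([[0.3136, 0.3328],
                            [0.4025, 0.2240],
                            [0.6256, 0.1918],
                            [0.3765, 0.6219],
                            [0.1897, 0.1764]])
truth = torch.tensor([0, 0, 0, 1, 0])

# One loss value per data point, no averaging.
per_point = nn.CrossEntropyLoss(reduction='none')(predictions, truth)

# Default behaviour: the mean of the per-point losses.
mean_loss = nn.CrossEntropyLoss()(predictions, truth)

print(per_point)
print(mean_loss)
```

The mean of `per_point` equals `mean_loss`, which is the single scalar reported earlier.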

In [191]:
predictions

tensor([[0.3136, 0.3328],
        [0.4025, 0.2240],
        [0.6256, 0.1918],
        [0.3765, 0.6219],
        [0.1897, 0.1764]])

In [192]:
truth

tensor([0, 0, 0, 1, 0])

In [193]:
torch.nn.functional.one_hot(truth)

tensor([[1, 0],
        [1, 0],
        [1, 0],
        [0, 1],
        [1, 0]])

In [205]:
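import math

# Iterate through each data point and compute -t * log2(p) per label.
# Note: log2 gives values in bits; PyTorch's losses use the natural log.
for row_pred, row_truth in zip(predictions, torch.nn.functional.one_hot(truth)):
    row_entropy = [-1 * float(t * math.log2(p)) for p, t in zip(row_pred, row_truth)]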
    print(row_pred, '\t', row_truth)
    print(row_entropy)
    print()

tensor([0.3136, 0.3328]) 	 tensor([1, 0])
[1.6729376316070557, 0.0]

tensor([0.4025, 0.2240]) 	 tensor([1, 0])
[1.3129408359527588, 0.0]

tensor([0.6256, 0.1918]) 	 tensor([1, 0])
[0.676690399646759, 0.0]

tensor([0.3765, 0.6219]) 	 tensor([0, 1])
[0.0, 0.6853156685829163]

tensor([0.1897, 0.1764]) 	 tensor([1, 0])
[2.3980143070220947, 0.0]



In [237]:
import numpy as np

# XOR truth table: the four input pairs and their expected outputs.
X = xor_input = np.array([[0,0], [0,1], [1,0], [1,1]])
Y = xor_output = np.array([0,1,1,0]).T

X_pt = torch.tensor(X).float()
Y_pt = torch.tensor(Y, requires_grad=False).squeeze(0)



In [238]:
X_pt

tensor([[0., 0.],
        [0., 1.],
        [1., 0.],
        [1., 1.]])

In [239]:
Y_pt

tensor([0, 1, 1, 0])

In [280]:
hidden_dim = 5
num_data, input_dim = 4, 2
num_data, output_dim = 4, 2

model = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                      nn.Sigmoid(), 
                      nn.Linear(hidden_dim, output_dim),
                      nn.Sigmoid())

optimizer = optim.SGD(model.parameters(), lr=0.03)

In [281]:
predictions = model(X_pt)
predictions

tensor([[0.4089, 0.5076],
        [0.4006, 0.5109],
        [0.3846, 0.5483],
        [0.3770, 0.5506]], grad_fn=<SigmoidBackward>)

In [282]:
truth = Y_pt
truth

tensor([0, 1, 1, 0])

In [283]:
criterion = nn.CrossEntropyLoss()
loss = criterion(predictions, truth)

In [289]:
loss

tensor(0.6954, grad_fn=<NllLossBackward>)

In [292]:
list(model.parameters())

[Parameter containing:
 tensor([[ 0.6037, -0.0889],
         [ 0.6657,  0.2893],
         [-0.4809, -0.2565],
         [-0.3140,  0.6397],
         [ 0.4835,  0.2135]], requires_grad=True), Parameter containing:
 tensor([-0.2218,  0.2330,  0.6987, -0.1540,  0.1504], requires_grad=True), Parameter containing:
 tensor([[-0.2135, -0.2679,  0.2896, -0.0270,  0.0190],
         [ 0.0608,  0.2798, -0.4467, -0.3066,  0.3231]], requires_grad=True), Parameter containing:
 tensor([-0.3147,  0.1126], requires_grad=True)]

In [291]:
list(model.parameters())[0]

Parameter containing:
tensor([[ 0.6037, -0.0889],
        [ 0.6657,  0.2893],
        [-0.4809, -0.2565],
        [-0.3140,  0.6397],
        [ 0.4835,  0.2135]], requires_grad=True)

In [285]:
list(model.parameters())[0].grad is None  # No gradient before backward() is called.

True

In [286]:
loss.backward()

In [287]:
loss.grad  # None: loss is a non-leaf tensor, so its own gradient is not retained.

In [288]:
list(model.parameters())[0].grad

tensor([[ 3.5187e-04,  2.5118e-04],
        [ 8.1553e-05, -4.9206e-04],
        [-9.7865e-04, -9.6398e-04],
        [-4.5266e-04, -3.8973e-04],
        [ 1.8965e-04, -2.3537e-05]])
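The notebook stops once the gradients are populated. As a hedged sketch of the usual next step, the block below rebuilds the same XOR setup and runs a single SGD update. The fixed seed and the names `first_weight`, `before`, `grad_shape` and `changed` are assumptions added for illustration; the architecture, learning rate and criterion are the ones used above.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)  # Assumption: fixed seed, only to make the sketch repeatable.

# XOR inputs and integer class labels, as in the cells above.
X_pt = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y_pt = torch.tensor([0, 1, 1, 0])

model = nn.Sequential(nn.Linear(2, 5),
                      nn.Sigmoid(),
                      nn.Linear(5, 2),
                      nn.Sigmoid())
optimizer = optim.SGD(model.parameters(), lr=0.03)
criterion = nn.CrossEntropyLoss()

first_weight = list(model.parameters())[0]
before = first_weight.detach().clone()

optimizer.zero_grad()                # Clear any gradients from earlier steps.
loss = criterion(model(X_pt), Y_pt)  # Forward pass, scalar loss.
loss.backward()                      # Populate .grad on every parameter.
grad_shape = tuple(first_weight.grad.shape)
optimizer.step()                     # Update: w <- w - lr * w.grad

# The first layer's weights should have moved after the update.
changed = not torch.equal(before, first_weight.detach())
print(grad_shape, changed)
```

Repeating the last four steps in a loop is the usual shape of a training loop; a single step is shown here only to connect `loss.backward()` to an actual parameter change.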