# Loss functions

NOTE: This document covers how to use cross-entropy loss in PyTorch for multiclass and binary classification.

**In PyTorch, we usually pass the raw output of our network into the loss, not the output of the softmax function. This is because `nn.CrossEntropyLoss` applies (log-)softmax internally and computes the cross-entropy loss in the same step.**

The raw, unnormalized output of a network is usually called the *logits* or *scores*. 

Working with logits is preferable to working with probabilities (the direct outputs of a Softmax function) because probabilities are often very close to zero or one. In floating-point arithmetic, a very small probability can underflow to exactly 0 and a probability very close to one can round to exactly 1, so a later `log` returns `-inf` or loses precision ([read more here](https://docs.python.org/3/tutorial/floatingpoint.html)). Therefore, it is better to avoid doing calculations with probabilities directly. One approach is to use **log-probabilities**. In PyTorch, `nn.CrossEntropyLoss` expects logits. **It internally computes log-probabilities (the logarithm of Softmax's outputs, as `nn.LogSoftmax` does) and uses those values to compute the loss (as `nn.NLLLoss()` does)**.
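
To make the stability argument concrete, here is a minimal pure-Python sketch (no PyTorch needed). The helper names `naive_log_softmax` and `stable_log_softmax` and the extreme logits are hypothetical and chosen only for illustration. The stable version uses the standard log-sum-exp trick of subtracting the largest logit; it is meant as an illustration of the idea, not as a description of PyTorch's internal implementation.

```python
import math

def naive_log_softmax(logits):
    """Compute softmax first, then take the log (numerically fragile)."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # A probability that underflowed to 0.0 has no finite logarithm
    return [math.log(p) if p > 0.0 else float("-inf") for p in probs]

def stable_log_softmax(logits):
    """Stay in log space: z_i - logsumexp(z), with the max subtracted first."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

logits = [0.0, -800.0]  # hypothetical, deliberately extreme logits

naive = naive_log_softmax(logits)
stable = stable_log_softmax(logits)

print("naive :", naive)   # the second entry is -inf because exp(-800) underflows to 0.0
print("stable:", stable)  # the second entry stays finite
```

A loss built on the naive values would be infinite for an example whose true class is the second one, while the log-space version still gives a usable number.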

**Note:** In TensorFlow/Keras, the corresponding loss is `tf.keras.losses.CategoricalCrossentropy`, whose `from_logits` argument defaults to `False`.
- When softmax is used in your model, use `from_logits=False`
- When softmax is not applied in your model (i.e., your model outputs logits), use `from_logits=True`.

The second option (`from_logits=True`), which combines the computation of softmax and loss, is numerically more stable.

## `nn.CrossEntropyLoss`

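This loss expects two inputs: 1. logits of shape (N x d), where N is the number of examples and d is the number of classes, and 2. targets of shape (N), holding one class index per example.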

In [1]:
import torch
from torch import nn

# Assume that we have 3 examples. 
# And, our network outputs 5 numbers (logits) for each example.
output = torch.randn(3, 5, requires_grad=True)

# Targets (labels) are expected to be integer class indices (dtype torch.int64)
# Each target value must be in {0, 1, 2, 3, 4} because the network outputs 5 logits.
target = torch.tensor([1, 0, 4])

print(nn.CrossEntropyLoss()(output, target)) # logits and labels

tensor(1.7920, grad_fn=<NllLossBackward0>)


Note that `grad_fn` is `<NllLossBackward0>`: the last operation recorded by autograd is the negative log-likelihood loss, the same operation that `nn.NLLLoss()` performs.

In [2]:
print(output.dtype)
print(target.dtype)

print(output.shape)
print(target.shape)

torch.float32
torch.int64
torch.Size([3, 5])
torch.Size([3])
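
As a sanity check on what the loss actually computes, the sketch below reproduces `nn.CrossEntropyLoss` by hand: take the log-softmax of each row, pick the entry at the target index, negate, and average. The seed and the variable names (`picked`, `manual_loss`, `builtin_loss`) are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

logits = torch.randn(3, 5)         # N = 3 examples, d = 5 classes
target = torch.tensor([1, 0, 4])   # one class index per example

# Log-probabilities of every class for every example
log_probs = torch.log_softmax(logits, dim=1)

# Select the log-probability of the correct class in each row,
# negate it, and average over the batch (the default reduction)
picked = log_probs[torch.arange(len(target)), target]
manual_loss = -picked.mean()

builtin_loss = nn.CrossEntropyLoss()(logits, target)
print(manual_loss, builtin_loss)   # the two values should agree
```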


## `nn.CrossEntropyLoss` Usage Example

In [3]:
import torch
from torch import nn
from torch import optim

import torch.nn.functional as F
from torchvision import datasets, transforms

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                              ])
# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

In [4]:
print(len(trainloader))
print(len(trainloader.dataset))

938
60000


### Define the model

In [5]:
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10)) # No need to apply Softmax

criterion = nn.CrossEntropyLoss() # By default, reduction = 'mean'.
# Hence, loss.item() is the mean per-example loss over the mini-batch (sum of the losses divided by the batch size).

optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten each 1x28x28 MNIST image into a 784-element vector
        images = images.view(images.shape[0], -1)
    
        optimizer.zero_grad()
        
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()*images.shape[0]
    # Report the average per-example loss for this epoch
    print(f"Training loss: {running_loss/len(trainloader.dataset)}")

Training loss: 1.9108810361862183
Training loss: 0.8520315013249715
Training loss: 0.5252435129801433
Training loss: 0.4329351637363434
Training loss: 0.39127474762598674
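
The training loop multiplies `loss.item()` by the batch size before accumulating. A small sketch of why: with `reduction='mean'` the loss is already averaged over the mini-batch, so multiplying by the batch size recovers the summed loss, and dividing the running total by the dataset size gives an exact per-example average even though the last batch is smaller (60000 = 937 x 64 + 32). The tensors below are hypothetical stand-ins for one mini-batch.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

logits = torch.randn(4, 3)            # a hypothetical mini-batch of 4 examples, 3 classes
target = torch.tensor([0, 2, 1, 2])

mean_loss = nn.CrossEntropyLoss(reduction="mean")(logits, target)    # default
sum_loss = nn.CrossEntropyLoss(reduction="sum")(logits, target)
per_example = nn.CrossEntropyLoss(reduction="none")(logits, target)  # shape (4,)

# mean * batch_size recovers the summed loss that the loop accumulates
recovered_sum = mean_loss * logits.shape[0]
print(recovered_sum, sum_loss, per_example.sum())
```

Averaging the per-batch means directly would slightly over-weight the examples in the smaller final batch.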


----

## `nn.LogSoftmax` 

It performs `log_softmax(x) = ln( softmax(x) )`, and it is often used in multi-class classification tasks to transform raw scores (logits) into **normalized log probabilities**.



In [6]:
last_layer_outputs = torch.randn(3, 5, requires_grad=True)
outputs = nn.LogSoftmax(dim=1)(last_layer_outputs) # 3 samples, 5 logits (one per class) each
# Each row now holds normalized log-probabilities over the 5 classes
print(outputs)
# Exponentiating recovers the probabilities; each row sums to 1
print(torch.exp(outputs))
print(torch.exp(outputs))

tensor([[-1.5991, -1.2887, -3.1615, -0.8946, -2.6426],
        [-1.6974, -1.2098, -3.0073, -2.1276, -1.0497],
        [-1.6188, -3.0033, -2.4904, -1.9149, -0.6501]],
       grad_fn=<LogSoftmaxBackward0>)
tensor([[0.2021, 0.2756, 0.0424, 0.4088, 0.0712],
        [0.1832, 0.2983, 0.0494, 0.1191, 0.3500],
        [0.1981, 0.0496, 0.0829, 0.1474, 0.5220]], grad_fn=<ExpBackward0>)


In [7]:
target = torch.tensor([1, 0, 4]) # target values for each sample

print(nn.NLLLoss()(outputs, target)) # logSoftmax and labels

# You can get the probabilities using the exponential function:
print("Probabilities:", torch.exp(outputs))

tensor(1.2121, grad_fn=<NllLossBackward0>)
Probabilities: tensor([[0.2021, 0.2756, 0.0424, 0.4088, 0.0712],
        [0.1832, 0.2983, 0.0494, 0.1191, 0.3500],
        [0.1981, 0.0496, 0.0829, 0.1474, 0.5220]], grad_fn=<ExpBackward0>)
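
Because both `exp` and `log_softmax` preserve the ordering of the values within each row, the predicted class can be read from the logits, the log-probabilities, or the probabilities interchangeably. The sketch below checks this on random inputs; the seed and variable names are arbitrary.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

logits = torch.randn(3, 5)
log_probs = torch.log_softmax(logits, dim=1)
probs = torch.exp(log_probs)

row_sums = probs.sum(dim=1)                    # each row should sum to 1

# The most likely class is the same whichever representation is used
pred_from_logits = logits.argmax(dim=1)
pred_from_log_probs = log_probs.argmax(dim=1)
pred_from_probs = probs.argmax(dim=1)
print(row_sums, pred_from_logits)
```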


Note: 

In fact, `nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` (log(softmax(x))) and `nn.NLLLoss()` in one single class. Therefore, the output from the network that is passed into `nn.CrossEntropyLoss` needs to be the raw output of the network (called logits), not the output of the softmax function.
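
The equivalence, and the cost of getting it wrong, can be checked on a tiny hand-made example. The logits below are hypothetical: a model that is confidently correct on both examples. Passing softmax outputs into `nn.CrossEntropyLoss` normalizes twice and reports a much larger loss for the same predictions.

```python
import torch
from torch import nn

# Hypothetical logits: confidently correct for both examples
logits = torch.tensor([[10.0, 0.0, 0.0],
                       [0.0, 10.0, 0.0]])
target = torch.tensor([0, 1])

# The two equivalent formulations
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# Mistake: softmax applied before CrossEntropyLoss (softmax is applied twice)
ce_on_probs = nn.CrossEntropyLoss()(torch.softmax(logits, dim=1), target)

print(ce, nll)        # both close to zero
print(ce_on_probs)    # noticeably larger, although the predictions are unchanged
```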

## Use `nn.LogSoftmax` & `nn.NLLLoss()` to define a model and train it

An alternative to `nn.CrossEntropyLoss()` is to directly use these two functions. Here is an example:

In [8]:
import torch
from torch import nn
from torch import optim

import torch.nn.functional as F
from torchvision import datasets, transforms

### Prepare dataset

In [9]:
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                              ])
# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

In [10]:
print(len(trainloader))
print(len(trainloader.dataset))

938
60000


### Define the model

In [11]:
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1)) # In a custom forward(), F.log_softmax(x, dim=1) does the same

criterion = nn.NLLLoss() # By default, reduction = 'mean'.
# Hence, loss.item() is the mean per-example loss over the mini-batch (sum of the losses divided by the batch size).

optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten each 1x28x28 MNIST image into a 784-element vector
        images = images.view(images.shape[0], -1)
    
        optimizer.zero_grad()
        
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()*images.shape[0]
    # Report the average per-example loss for this epoch
    print(f"Training loss: {running_loss/len(trainloader.dataset)}")

Training loss: 1.9763235883076986
Training loss: 0.9304784668286642
Training loss: 0.5560508148034413
Training loss: 0.4427611797332764
Training loss: 0.39288225973447166


# Binary Classification

## Using `nn.CrossEntropyLoss`

In [12]:
import torch
from torch import nn
from torch import optim

import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import Dataset



class BinaryMNISTDataset(Dataset):
    def __init__(self, root, train=True, transform=None, target_transform=None, download=False):
        self.mnist_dataset = datasets.MNIST(root, train=train, 
                                            transform=transform, 
                                            target_transform=target_transform, 
                                            download=download)
    def __len__(self):
        return len(self.mnist_dataset)
    def __getitem__(self, index):
        image, label = self.mnist_dataset[index]
        # Binary task: digit 0 is the positive class (1); every other digit is class 0
        new_label = 1 if label == 0 else 0
        return image, new_label


transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = BinaryMNISTDataset(root="~/.pytorch/MNIST_data/", train=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

In [13]:
dataiter = iter(trainloader)
images, labels = next(dataiter)
print(images.shape)

torch.Size([64, 1, 28, 28])


In [14]:
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 2)) 

# dataiter = iter(trainloader)
# images, labels = next(dataiter)
# print(images.shape)                        # torch.Size([64, 1, 28, 28])
# images = images.view(images.shape[0], -1)  # torch.Size([64, 784])
# print(images.dtype)                        # torch.float32

Note that the network must output 2 logits. `nn.CrossEntropyLoss` expects one logit per class and targets that are class indices, which are 0 and 1 here. If the network outputs only one value, there is no logit for target class 1, and you will get an out-of-bounds index error.
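
A minimal sketch of that failure, using random tensors in place of real network outputs. The exact exception type and message may differ between PyTorch versions and devices, so the example catches both `IndexError` and `RuntimeError`; the `failed` flag is only there for illustration.

```python
import torch
from torch import nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([0, 1, 1, 0])

# Two logits per example: class indices 0 and 1 are both valid
two_logits = torch.randn(4, 2)
loss = criterion(two_logits, target)
print(loss)

# One logit per example: only class index 0 exists, so target 1 is invalid
one_logit = torch.randn(4, 1)
try:
    criterion(one_logit, target)
    failed = False
except (IndexError, RuntimeError) as err:
    failed = True
    print(f"{type(err).__name__}: {err}")
```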

In [15]:
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        images = images.view(images.shape[0], -1)
        optimizer.zero_grad()
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()*images.shape[0]
    # Report the average per-example loss for this epoch
    print(f"Training loss: {running_loss/len(trainloader.dataset)}")

Training loss: 0.19014976968566577
Training loss: 0.06397702364822229
Training loss: 0.045741876451671125
Training loss: 0.03884067624881864
Training loss: 0.03537146703427037


## Using `nn.BCELoss()`

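If you want your network to have only one output (logit), apply a sigmoid to it (`nn.Sigmoid()` in the model, or `torch.sigmoid()`) and use `nn.BCELoss()`, which expects probabilities rather than logits.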

In [16]:
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 1),
                      nn.Sigmoid()) 
criterion = nn.BCELoss()

optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        
        images = images.view(images.shape[0], -1)
        labels = labels.view(labels.shape[0], -1)
        
        optimizer.zero_grad()
        
        output = model(images)
        
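        # BCELoss needs floating-point targets with the same dtype as the predictions;
        # the labels arrive as int64, so both tensors are cast to float64 here
        loss = criterion(output.double(), labels.double())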
        
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()*images.shape[0]
    # Report the average per-example loss for this epoch
    print(f"Training loss: {running_loss/len(trainloader.dataset)}")

Training loss: 0.22231331993959397
Training loss: 0.08463542868081923
Training loss: 0.05626678491201148
Training loss: 0.0457951580032223
Training loss: 0.040296036641153186
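
The one-output and two-output formulations are closely related. The sketch below, on hypothetical random logits, checks two things. First, `nn.BCELoss` on sigmoid outputs matches `nn.BCEWithLogitsLoss` on the raw logit. `nn.BCEWithLogitsLoss` is not used elsewhere in this document; it is shown here only as the logit-based counterpart, in the same spirit as `nn.CrossEntropyLoss`. Second, a two-logit `nn.CrossEntropyLoss` with the class-0 logit fixed at 0 gives the same value, because softmax over `[0, z]` assigns probability `sigmoid(z)` to class 1.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

z = torch.randn(6, 1)                        # one logit per example
labels = torch.tensor([1, 0, 0, 1, 1, 0])    # integer class labels
float_labels = labels.float().view(-1, 1)    # BCE losses need float targets of matching shape

# Sigmoid + BCELoss versus the combined, logit-based loss
bce = nn.BCELoss()(torch.sigmoid(z), float_labels)
bce_logits = nn.BCEWithLogitsLoss()(z, float_labels)

# Two-logit view: fix the class-0 logit at 0, use z as the class-1 logit
two_logits = torch.cat([torch.zeros_like(z), z], dim=1)
ce = nn.CrossEntropyLoss()(two_logits, labels)

print(bce, bce_logits, ce)   # all three should agree up to rounding
```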
