# Loss functions

In PyTorch, we pass the raw output of our network into the loss, not the output of the softmax function. This raw output is usually called the *logits* or *scores*. 

We use the logits because softmax produces probabilities that are often very close to zero or one, and floating-point arithmetic handles those poorly: a very small probability can underflow to exactly zero (so its logarithm becomes negative infinity), and a value very close to one is rounded to exactly one ([read more here](https://docs.python.org/3/tutorial/floatingpoint.html)). It is usually best to avoid doing calculations with probabilities directly; typically we work with log-probabilities instead.
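The effect can be reproduced in a few lines of plain Python. This is only a sketch with made-up, deliberately extreme logits; the helper functions `softmax` and `log_softmax` are written here for illustration and are not PyTorch's implementation.

```python
import math

def softmax(logits):
    """Softmax with the usual max-subtraction trick to avoid overflow in exp."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Log-softmax computed directly in log space (log-sum-exp trick)."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_total for z in logits]

logits = [0.0, 1000.0]  # hypothetical, deliberately extreme scores

probs = softmax(logits)
print(probs)      # the first probability underflows to exactly 0.0

log_probs = log_softmax(logits)
print(log_probs)  # the first log-probability is still finite: -1000.0

# A probability extremely close to one is rounded to exactly one.
print(1.0 - 1e-20 == 1.0)
```

Taking the logarithm of `probs[0]` would fail (or give negative infinity in a tensor library), whereas the log-space computation keeps the information.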

**Note:** In TensorFlow/Keras, `tf.keras.losses.CategoricalCrossentropy` takes a `from_logits` argument, which defaults to `False`.
- When softmax is applied in your model (i.e., your model outputs probabilities), use `from_logits=False`.
- When softmax is not applied in your model (i.e., your model outputs logits), use `from_logits=True`.

The second option (`from_logits=True`), which combines the computation of softmax and the loss, is numerically more stable.

In PyTorch, `nn.CrossEntropyLoss` expects logits.

## Example of `nn.CrossEntropyLoss`

`nn.CrossEntropyLoss`: The input is expected to contain raw, unnormalized scores for each class.

In [5]:
import torch
from torch import nn

# Assume that we have three examples, and our network outputs 5 numbers (one score per class) for each example.
output = torch.randn(3, 5, requires_grad=True) # logits
# Assume that the labels of these examples are numeric (they are class indices)
target = torch.tensor([1, 0, 4])

print(nn.CrossEntropyLoss()(output, target)) # logits and labels

tensor(0.8991, grad_fn=<NllLossBackward>)


In [6]:
print(output.dtype)
print(target.dtype)

torch.float32
torch.int64
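To make explicit what the loss does with float logits and integer class indices, here is a minimal sketch that recomputes it by hand. It assumes the default settings (`reduction='mean'`, no class weights); the seed and the variable names are arbitrary choices for this illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

logits = torch.randn(3, 5)            # 3 examples, 5 class scores each
target = torch.tensor([1, 0, 4])      # class indices (int64)

# Score the network assigned to the correct class of each example, shape (3,)
correct_class_logits = logits[torch.arange(3), target]

# Per-example loss: log-sum-exp over all classes minus the correct-class logit
per_example = torch.logsumexp(logits, dim=1) - correct_class_logits
manual_loss = per_example.mean()

builtin_loss = nn.CrossEntropyLoss()(logits, target)
print(manual_loss, builtin_loss)      # the two values should agree
```

The integer targets are used purely as indices into each row of logits, which is why they are expected to be `torch.int64` rather than floats.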


## Example of `nn.LogSoftmax`

In [7]:
last_layer_outputs = torch.randn(3, 5, requires_grad=True)
outputs = nn.LogSoftmax(dim=1)(last_layer_outputs) # m = 3 samples, each with scores for n = 5 classes
# Each row holds log-probabilities over the classes (not a one-hot vector)
print(outputs)
# Exponentiating the log-probabilities recovers the probabilities; each row sums to 1
print(torch.exp(outputs))
print(torch.exp(outputs))

tensor([[-2.6232, -0.8465, -3.1938, -0.8882, -3.0773],
        [-0.7772, -1.8745, -2.8865, -1.3734, -2.5527],
        [-3.0232, -3.1845, -1.2186, -2.4182, -0.6439]],
       grad_fn=<LogSoftmaxBackward>)
tensor([[0.0726, 0.4289, 0.0410, 0.4114, 0.0461],
        [0.4597, 0.1534, 0.0558, 0.2532, 0.0779],
        [0.0486, 0.0414, 0.2957, 0.0891, 0.5252]], grad_fn=<ExpBackward>)


In [8]:
target = torch.tensor([1, 0, 4]) # target values for each sample

print(nn.NLLLoss()(outputs, target)) # logSoftmax and labels

# You can get the probabilities using the exponential function:
print("Probabilities:", torch.exp(outputs))

tensor(0.7559, grad_fn=<NllLossBackward>)
Probabilities: tensor([[0.0726, 0.4289, 0.0410, 0.4114, 0.0461],
        [0.4597, 0.1534, 0.0558, 0.2532, 0.0779],
        [0.0486, 0.0414, 0.2957, 0.0891, 0.5252]], grad_fn=<ExpBackward>)
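The value `0.7559` can be checked by hand. With the default `reduction='mean'`, `nn.NLLLoss` picks the log-probability at each sample's target index, negates it, and averages. The plain-Python sketch below copies the log-probabilities printed above (rounded to four decimals) and repeats that arithmetic.

```python
# Log-probabilities copied from the printed LogSoftmax output above
log_probs = [
    [-2.6232, -0.8465, -3.1938, -0.8882, -3.0773],
    [-0.7772, -1.8745, -2.8865, -1.3734, -2.5527],
    [-3.0232, -3.1845, -1.2186, -2.4182, -0.6439],
]
target = [1, 0, 4]

# Negative log-probability of the correct class for each sample
picked = [-row[t] for row, t in zip(log_probs, target)]
loss = sum(picked) / len(picked)

print(picked)          # [0.8465, 0.7772, 0.6439]
print(round(loss, 4))  # 0.7559
```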


**Note:** `nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` (that is, `log(softmax(x))`) and `nn.NLLLoss()` in a single class. Therefore, the output from the network that is passed into `nn.CrossEntropyLoss` needs to be the raw output of the network (the logits), not the output of the softmax function.
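A quick way to convince yourself of this equivalence is to compute the loss both ways on the same tensor. The sketch below uses random logits with an arbitrary seed; the variable names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

logits = torch.randn(3, 5)
target = torch.tensor([1, 0, 4])

# One step: the loss applied directly to the logits
one_step = nn.CrossEntropyLoss()(logits, target)

# Two steps: log-softmax first, then negative log-likelihood
log_probs = nn.LogSoftmax(dim=1)(logits)
two_step = nn.NLLLoss()(log_probs, target)

print(one_step, two_step)  # the two values should agree
```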

## Use `nn.LogSoftmax` to define a model and train it

In [1]:
import torch
from torch import nn
from torch import optim

import torch.nn.functional as F
from torchvision import datasets, transforms

### Prepare dataset

In [3]:
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                              ])
# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
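As a small aside on the transform: `ToTensor()` scales pixel values to the range [0, 1], and `Normalize(mean, std)` then computes `(x - mean) / std` per channel. With mean 0.5 and standard deviation 0.5, pixels end up in [-1, 1]. The helper `normalize` below is a hypothetical stand-in written in plain Python to show the arithmetic, not torchvision's code.

```python
def normalize(x, mean=0.5, std=0.5):
    """Same per-pixel arithmetic as Normalize: (x - mean) / std."""
    return (x - mean) / std

for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
# 0.0 -> -1.0
# 0.5 -> 0.0
# 1.0 -> 1.0
```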

In [4]:
print(len(trainloader))
print(len(trainloader.dataset))

938
60000
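The two numbers are related: `len(trainloader)` counts batches, and because `DataLoader` keeps the final, smaller batch by default (`drop_last=False`), the count is the dataset size divided by the batch size, rounded up. A short check of that arithmetic:

```python
import math

dataset_size = 60000
batch_size = 64

num_batches = math.ceil(dataset_size / batch_size)
last_batch_size = dataset_size - (num_batches - 1) * batch_size

print(num_batches)      # 938
print(last_batch_size)  # 32
```

So the last batch holds only 32 images, which matters when averaging per-batch losses.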


### Define the model

In [6]:
model = nn.Sequential(nn.Linear(784, 128),
                      nn.ReLU(),
                      nn.Linear(128, 64),
                      nn.ReLU(),
                      nn.Linear(64, 10),
                      nn.LogSoftmax(dim=1)) #### you can also use "F.log_softmax"

criterion = nn.NLLLoss() # By default, reduction='mean'.
# Hence, loss.item() is the average loss over the samples in the mini-batch, not their sum.

optimizer = optim.SGD(model.parameters(), lr=0.003)

epochs = 5
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten each 1x28x28 MNIST image into a 784-long vector
        images = images.view(images.shape[0], -1)
    
        optimizer.zero_grad()
        
        output = model(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()*images.shape[0]
    # Report the average per-sample loss for this epoch
    print(f"Training loss: {running_loss/len(trainloader.dataset)}")

Training loss: 1.8840012865066529
Training loss: 0.8314543878237406
Training loss: 0.515946289173762
Training loss: 0.4211969321568807
Training loss: 0.3788894084135691
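The loop multiplies each `loss.item()` by the batch size before accumulating, then divides by the dataset size. Because the criterion returns a per-batch mean and the last batch is smaller than the others, this weighting is what makes the reported number an exact per-sample average. The sketch below checks this on a tiny made-up dataset (10 samples, 3 classes, batches of 4, 4, and 2); all sizes and names are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Hypothetical model outputs (log-probabilities) and labels for 10 samples
log_probs = torch.log_softmax(torch.randn(10, 3), dim=1)
labels = torch.randint(0, 3, (10,))

criterion = nn.NLLLoss()  # reduction='mean' by default

running_loss = 0.0
for start in range(0, 10, 4):  # batches of size 4, 4, 2
    batch_lp = log_probs[start:start + 4]
    batch_y = labels[start:start + 4]
    loss = criterion(batch_lp, batch_y)
    # Undo the per-batch mean so that every sample counts equally
    running_loss += loss.item() * batch_lp.shape[0]

epoch_loss = running_loss / 10
full_loss = criterion(log_probs, labels).item()  # mean over all 10 samples at once

print(epoch_loss, full_loss)  # the two values should agree up to rounding
```

Averaging the three per-batch means without the weighting would give the two samples in the last batch more influence than the rest.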
