In [None]:
# ! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#hide
from fastbook import *
from fastai.vision.all import *

In [None]:
# Downloading the Pet Image Dataset
path = untar_data(URLs.PETS)

# Setting the base path to the pet images directory we have just downloaded
Path.BASE_PATH = path

# Using the data block API
pets = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items = get_image_files,
                 splitter = RandomSplitter(seed=42),
                 get_y = using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
                 item_tfms = Resize(460),
                 batch_tfms = aug_transforms(size=224, min_scale=0.75))

# Creating a DataLoaders object
dls = pets.dataloaders(path/'images')

# Creating a simple model
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2)

# Dogs and Cats to Pet Breeds

## Cross Entropy Loss
Remember that *loss* is whatever function we've decided to use to optimise the parameters of our model. But we haven't actually told fastai what loss function we want to use.

fastai will generally try to select an appropriate loss function based on what kind of data and model you are using. In this case, we have image data and a categorical outcome, so fastai will default to using cross-entropy loss.

*Cross-entropy loss* is a loss function that is similar to the one we used in the previous chapter, but has two benefits:
- It works even when our dependent variable has more than two categories
- It results in faster and more reliable training

To understand how cross-entropy loss works for dependent variables with more than two categories, we have to understand what the actual data and activations that are seen by the loss function look like.

### Viewing Activations and Labels
Let's have a look at the activations of our model. To get a batch of real data from our `DataLoaders`, we can use the `one_batch` method.

In [None]:
x,y = dls.one_batch()

This returns the independent and dependent variables (`x` and `y`, respectively) as a mini-batch. Let's see what is actually contained in our dependent variable.

In [None]:
y

Our batch size is 64, so we have 64 rows in this tensor. Each row is a single integer between 0 and 36, representing our 37 possible pet breeds.

We can view the predictions (activations of the final layer of our neural network) using `Learner.get_preds`. This function either takes a dataset index (0 for train and 1 for valid) or an iterator of batches. Thus, we can pass in a simple list with our batch to get our predictions.

By default, it returns predictions and targets, but since we already have the targets, we can effectively ignore them by assigning to the special variable `_`.

In [None]:
preds,_ = learn.get_preds(dl=[(x,y)])
preds[0]

The actual predictions are 37 probabilities between 0 and 1, which add up to 1 in total.

In [None]:
len(preds[0]), preds[0].sum()

To transform the activations of our model into predictions like this, we used something called the *softmax* activation function. 
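
As a minimal sketch of what "between 0 and 1, and summing to 1" means, the snippet below applies `torch.softmax` to random numbers standing in for final-layer activations. The names `fake_acts` and `fake_preds` are made up for illustration, and the values are not real model output.

```python
import torch

torch.manual_seed(0)

# Stand-in for final-layer activations: 4 images, 37 breeds (random values, not a real model)
fake_acts = torch.randn(4, 37)

# Softmax along dim=1 turns each row of activations into a row of probabilities
fake_preds = torch.softmax(fake_acts, dim=1)

print(fake_preds.shape)
print(fake_preds.min().item(), fake_preds.max().item())  # every value lies between 0 and 1
print(fake_preds.sum(dim=1))                             # each row sums to 1, up to floating-point error
```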

### Softmax
In our classification model, we use the softmax activation function in the final layer to ensure that the activations are all between 0 and 1, and that they sum to 1. It is similar to the sigmoid function we saw earlier.

To think about what happens if we want to have more categories in our target (such as our 37 pet breeds), we'll need more activations than just a single column: we need an activation *per category*. We can therefore create, for instance, a neural net for predicting 3s and 7s that returns two activations, one for each class. This will be a good step towards creating the more general approach.

Let's start off using some random numbers with a standard deviation of 2 for our example. We'll assume we have 6 images and 2 possible categories (where the first column represents 3s and the second represents 7s).

In [None]:
torch.random.manual_seed(42);
acts = torch.randn((6,2))*2
acts

We can't just take the sigmoid of this directly, since we don't get rows that add to 1 (i.e., we want the probability of being a 3 plus the probability of being a 7 to add up to 1).

In [None]:
acts.sigmoid()

Previously, our neural net created a single activation per image, which we passed through the `sigmoid` function. That single activation represented the model's confidence that the input was a 3. Binary problems are a special case of classification problems, because the target can be treated as a single boolean value, as we did in `mnist_loss`. But binary problems can also be thought of in the context of the more general group of classifiers with any number of categories: in this case, we happen to have two categories. As we saw in the bear classifier, our neural net will return one activation per category.

In the binary case, what do those activations really indicate? A single pair of activations simply indicates the *relative confidence* of the input being a 3 versus being a 7. The overall values, whether they are both high or both low, don't matter - all that matters is which is higher and by how much.

Since this is just another way of representing the same problem, we would expect to be able to use `sigmoid` directly on the two-activation version of our neural net. Indeed we can!
We just take the *difference* between the two neural net activations, because that reflects how much more sure we are of the input being a 3 than a 7, and then take the sigmoid of that.


In [None]:
(acts[:,0] - acts[:,1]).sigmoid()

The second column (the probability of it being a 7) will then just be that value subtracted from 1.

Now, we need a way to do all this that also works for more than two columns. The softmax function does exactly that!

In [None]:
def softmax(x): return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

Let's check that `softmax` returns the same values as `sigmoid` for the first column, and those values subtracted from 1 for the second column.

In [None]:
sm_acts = torch.softmax(acts, dim=1)
sm_acts

`softmax` is the multi-category equivalent of `sigmoid` - we have to use it any time we have more than two categories and the probabilities of the categories must add to 1. We often use it even for just two categories, just to make things more consistent. We'll also see shortly that the softmax function works hand-in-hand with the loss function we will look at in the next section.
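
The two-column equivalence between `softmax` and `sigmoid` can also be checked numerically rather than by eye. The sketch below rebuilds the same style of random activations and compares the two; the name `sig_diff` and the tolerance of `1e-6` are choices made for this illustration.

```python
import torch

torch.random.manual_seed(42)
acts = torch.randn((6, 2)) * 2   # 6 items, 2 categories ("3" and "7")

sm_acts = torch.softmax(acts, dim=1)

# Sigmoid of the difference between the two activations
sig_diff = (acts[:, 0] - acts[:, 1]).sigmoid()

# First softmax column equals the sigmoid of the difference...
print(torch.allclose(sm_acts[:, 0], sig_diff, atol=1e-6))
# ...and the second column equals that value subtracted from 1
print(torch.allclose(sm_acts[:, 1], 1 - sig_diff, atol=1e-6))
```

This works because exp(a) / (exp(a) + exp(b)) can be rewritten as 1 / (1 + exp(-(a - b))), which is the sigmoid of a - b.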

### Log Likelihood
Softmax is the first part of cross-entropy loss - the second part is log likelihood.

When we calculated the loss for our MNIST example in the last chapter, we used:

In [None]:
def mnist_loss(inputs, targets):
    inputs = inputs.sigmoid()
    return torch.where(targets==1, 1-inputs, inputs).mean()

Just as we moved from sigmoid to softmax, we need to extend our loss function to work with more than just binary classification. It needs to be able to classify any number of categories (in this case, we have 37 categories).

Our activations, after softmax, are between 0 and 1, and sum to 1 for each row in the batch of predictions. Our targets are integers between 0 and 36. Cross-entropy loss generalises our binary classification loss to any number of categories. In the form used here, each example has exactly one correct label; problems with more than one correct label per example (called multi-label classification) are handled differently.

In the binary case, we used `torch.where` to select between `inputs` and `1-inputs`. When we treat a binary classification problem as a general classification problem with two categories, it actually becomes even easier, because we now have two columns containing the equivalent of `inputs` and `1-inputs`. As there is only one correct label per example, all we need to do is select the appropriate column.
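
To make the link to `mnist_loss` concrete, here is a rough sketch comparing the two views on made-up data. It assumes labels where 0 means "3" and 1 means "7", uses the difference between the two activations as the stand-in for the single binary activation, and selects columns with `torch.where` to mirror the binary version. The names `p_three`, `binary_targets`, `binary_loss` and `selected` are illustrative only.

```python
import torch

torch.manual_seed(42)
acts = torch.randn((6, 2)) * 2            # column 0: "3", column 1: "7"
targ = torch.tensor([0, 1, 0, 1, 1, 0])   # assumed labels: 0 means "3", 1 means "7"

# Binary view, as in mnist_loss: one probability per image, target 1 means "is a 3"
p_three = (acts[:, 0] - acts[:, 1]).sigmoid()
binary_targets = (targ == 0).long()
binary_loss = torch.where(binary_targets == 1, 1 - p_three, p_three)

# Two-column view: select the softmax column that belongs to the correct label
sm_acts = torch.softmax(acts, dim=1)
selected = torch.where(targ == 0, sm_acts[:, 0], sm_acts[:, 1])

# Per item, the binary loss is 1 minus the probability given to the correct label
print(torch.allclose(binary_loss, 1 - selected, atol=1e-6))
```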

Let's try to implement this in PyTorch. For our synthetic 3s and 7s example, let's say the following are our labels and our softmax activations.

In [None]:
targ = tensor([0,1,0,1,1,0])

sm_acts

For each item of `targ`, we can use that to select the appropriate column of `sm_acts` using tensor indexing.

In [None]:
idx = range(6)
sm_acts[idx, targ]

To see exactly what's happening here, let's put all the columns together in a table. Here, the first two columns are our activations, then we have the targets and the row index. We explain the last column, `result`, below.

In [None]:
from IPython.display import HTML
df = pd.DataFrame(sm_acts, columns=["3","7"])
df['targ'] = targ
df['idx'] = idx
df['result'] = sm_acts[range(6), targ]
t = df.style.hide(axis='index')  # on pandas older than 1.4, use df.style.hide_index() instead
#To have html code compatible with our script
html = t._repr_html_().split('</style>')[1]
html = re.sub(r'<table id="([^"]+)"\s*>', r'<table >', html)
display(HTML(html))

Looking at the table, we can see that the `result` column is what `sm_acts[idx, targ]` was doing. The really interesting thing is that it works just as well with more than two columns. To see this, consider what would happen if we added an activation column for every digit (0 through 9), and then `targ` contained a number from 0 to 9.
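
Here is a small sketch of that ten-column case, using random numbers in place of real activations. The names `digit_probs`, `digit_targ`, `picked` and `looped` are invented for this example; the explicit loop is only there to show what the indexing is doing.

```python
import torch

torch.manual_seed(0)
n_items, n_classes = 5, 10

# Random stand-in for softmax outputs over the digits 0 through 9
digit_probs = torch.softmax(torch.randn(n_items, n_classes), dim=1)
digit_targ = torch.tensor([3, 7, 0, 9, 3])   # made-up correct digit for each row

# One value per row: the probability in the column named by the target
picked = digit_probs[range(n_items), digit_targ]

# The same selection written as an explicit loop over (row, target) pairs
looped = torch.stack([digit_probs[i, t] for i, t in enumerate(digit_targ)])

print(picked)
print(torch.equal(picked, looped))
```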

PyTorch provides a function called `nll_loss` (*NLL* stands for *negative log likelihood*) that does exactly the same thing as `sm_acts[range(n), targ]`, except that it takes the negative, because once the log has been applied the values will be negative.

In [None]:
-sm_acts[idx, targ]

In [None]:
F.nll_loss(sm_acts, targ, reduction='none')

Despite its name, this PyTorch function does not take the log. It assumes we have *already* taken the log.

PyTorch has a function called `log_softmax` that combines `log` and `softmax` in a fast and accurate way. `nll_loss` is designed to be used after `log_softmax`.
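
The sketch below illustrates both points under some assumptions: it first checks that `log_softmax` followed by `nll_loss` gives the same per-item values as taking the log of softmax by hand, and then uses one deliberately extreme pair of activations (chosen for this example) to show why the combined function is more accurate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)
acts = torch.randn((6, 2)) * 2
targ = torch.tensor([0, 1, 0, 1, 1, 0])

# The intended pairing: log_softmax first, then nll_loss
log_probs = F.log_softmax(acts, dim=1)
per_item = F.nll_loss(log_probs, targ, reduction='none')

# The same thing by hand: softmax, log, select the target column, negate
manual = -torch.log(torch.softmax(acts, dim=1))[range(6), targ]
print(torch.allclose(per_item, manual, atol=1e-5))

# Extreme activations: the tiny probability underflows to 0, so its log is -inf,
# while log_softmax works in log space and stays finite
big = torch.tensor([[0.0, 200.0]])
naive = torch.log(torch.softmax(big, dim=1))
stable = F.log_softmax(big, dim=1)
print(naive)
print(stable)
```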

#### Why do we take the Log?
Cross-entropy loss may involve the multiplication of many numbers. Multiplying lots of small numbers (such as probabilities between 0 and 1) together can cause problems like *numerical underflow* in computers. We therefore want to transform these probabilities to values on a larger scale so we can perform mathematical operations on them.

The *logarithm* function does exactly this (available as `torch.log`).

Additionally, we want to ensure our model is able to detect differences between small numbers. For example, consider the probabilities 0.01 and 0.001. Although these numbers are very close together, a prediction of 0.01 is 10 times more confident than one of 0.001. By taking the log of our probabilities, we prevent these important differences from being ignored.

The key thing to know about logarithms is the following relationship.
```
    log(a*b) = log(a) + log(b)
```

It means that logarithms increase linearly when the underlying signal increases exponentially or multiplicatively. We love using logarithms, because it means that multiplication, which can create really large and really small numbers, can be replaced by addition, which is much less likely to result in scales that are difficult for our computers to handle.
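
A quick illustration of both claims, using made-up probabilities: the identity can be checked directly, and a product of many small numbers underflows to zero in 32-bit floats while the equivalent sum of logs stays easy to represent.

```python
import torch

# log(a*b) equals log(a) + log(b)
a, b = torch.tensor(0.01), torch.tensor(0.001)
print(torch.log(a * b), torch.log(a) + torch.log(b))

# 500 hypothetical probabilities, each equal to 0.01
probs = torch.full((500,), 0.01)

product = probs.prod()        # 0.01**500 is far too small for float32, so this underflows
log_sum = probs.log().sum()   # 500 * log(0.01), a perfectly ordinary number

print(product)
print(log_sum)
```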

Observe that the log of a number approaches negative infinity as the number approaches zero, and that the log of 1 is 0. In our case, since the result reflects the predicted probability of the correct label, we want our loss function to return a small value when the prediction is "good" (closer to 1) and a large value when the prediction is "bad" (closer to 0). Taking the negative of the log gives exactly this behaviour.

In [None]:
plot_function(lambda x: -1*torch.log(x), min=0,max=1, tx='x', ty='- log(x)', title = 'Log Loss when true label = 1')

We will go ahead and update our previous results table with an additional column, `loss`, to reflect this loss function.

In [None]:
from IPython.display import HTML
df['loss'] = -torch.log(tensor(df['result']))
t = df.style.hide(axis='index')  # on pandas older than 1.4, use df.style.hide_index() instead
#To have html code compatible with our script
html = t._repr_html_().split('</style>')[1]
html = re.sub(r'<table id="([^"]+)"\s*>', r'<table >', html)
display(HTML(html))

Notice how the loss is very large in the 3rd and 4th rows, where the predictions are confident and wrong. A benefit of using the log to calculate the loss is that our loss function penalises predictions that are both confident and wrong. This kind of penalty works well in practice to aid in more effective model training.
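
To get a feel for the size of this penalty, the sketch below compares the negative log with a simple `1 - p` loss for a few hypothetical probabilities assigned to the correct label. The values in `p_correct` are invented for illustration and are not taken from the table.

```python
import torch

# Hypothetical probabilities that a model assigned to the correct label
p_correct = torch.tensor([0.9, 0.5, 0.1, 0.01, 0.001])

linear_loss = 1 - p_correct        # can never exceed 1, however wrong the model is
log_loss = -torch.log(p_correct)   # keeps growing as the probability approaches 0

for p, lin, lg in zip(p_correct.tolist(), linear_loss.tolist(), log_loss.tolist()):
    print(f"p={p:.3f}  1-p={lin:.3f}  -log(p)={lg:.3f}")
```

With the linear loss, 0.01 and 0.001 are almost indistinguishable, whereas the log loss separates them clearly.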

We are calculating the loss from the column containing the correct label. Because there is only one "right" answer per example, we don't need to consider the other columns, because by the definition of softmax, they add up to 1 minus the activation corresponding to the correct label. As long as the activation columns sum to 1, then we'll have a loss function that shows how well we're predicting each digit. Therefore, making the activation for the correct label as high as possible must mean we're also decreasing the activations of the remaining columns.

### Negative Log Likelihood

Taking the mean of the negative log of our probabilities (i.e., taking the mean of the `loss` column of our table) gives us the *negative log likelihood loss*, which is just another name for cross-entropy loss. Recall that PyTorch's `nll_loss` assumes that you already took the log of the softmax, so it doesn't actually do the logarithm for you.

In PyTorch, this is available as `nn.CrossEntropyLoss` (which, in practice, actually does `log_softmax` and then `nll_loss`).

In [None]:
loss_func = nn.CrossEntropyLoss()

This is a class. Instantiating it gives you an object which behaves like a function.


In [None]:
loss_func(acts, targ)

All PyTorch loss functions are provided in two forms: the class shown above, and a plain functional form available in the `F` namespace.

In [None]:
F.cross_entropy(acts, targ)

By default, PyTorch loss functions take the mean of the loss of all items. You can use `reduction='none'` to disable that.
The values should match the `loss` column in our table exactly.

In [None]:
nn.CrossEntropyLoss(reduction='none')(acts, targ)

We have now seen all the pieces hidden behind our loss function.
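
As a recap, here is one way the pieces could be assembled by hand and compared against PyTorch's built-in version. The helper `cross_entropy_from_scratch` is a hypothetical name for this sketch, not a fastai or PyTorch function, and it computes the log of softmax through `logsumexp` as one reasonable choice among several.

```python
import torch
import torch.nn.functional as F

def cross_entropy_from_scratch(acts, targ):
    "Hypothetical helper: mean negative log of the softmax probability of the correct class."
    # Log of softmax, computed as activations minus the log of the summed exponentials
    log_probs = acts - acts.logsumexp(dim=1, keepdim=True)
    # Select the log-probability of the correct class in each row
    picked = log_probs[torch.arange(len(targ)), targ]
    # Negate and average over the batch
    return -picked.mean()

torch.manual_seed(42)
acts = torch.randn((6, 2)) * 2
targ = torch.tensor([0, 1, 0, 1, 1, 0])

scratch = cross_entropy_from_scratch(acts, targ)
per_item = F.cross_entropy(acts, targ, reduction='none')

print(torch.allclose(scratch, F.cross_entropy(acts, targ), atol=1e-5))
print(torch.allclose(per_item.mean(), scratch, atol=1e-5))
```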