## Logits, Softmax, Label Smoothing with Fastai



References on MixUp and label smoothing:

- [Lesson 12 (2019) - Advanced training techniques; ULMFiT from scratch](https://youtu.be/vnOpEwmtFJ8)

- [class MixUp](https://docs.fast.ai/callback.mixup.html#MixUp)

- [Revisiting ResNets: Improved Training and Scaling Strategies](https://wandb.ai/wandb_fc/pytorch-image-models/reports/Revisiting-ResNets-Improved-Training-and-Scaling-Strategies--Vmlldzo2NDE3NTM?s=09)

![image|476x351](upload://f3TJwptFmeQEoe6MjUUfhEEhcny.png)
  


## Acknowledgements

**fastai course:**
- [Practical Deep Learning for Coders (a UQ collaboration with fast.ai)](https://itee.uq.edu.au/event/2022/practical-deep-learning-coders-uq-fastai)  

**Jeremy's Notebook Series:**
- [First Steps: Road to the Top, Part 1](https://www.kaggle.com/code/jhoward/first-steps-road-to-the-top-part-1)
- [Small models: Road to the Top, Part 2](https://www.kaggle.com/code/jhoward/small-models-road-to-the-top-part-2)
- [Scaling Up: Road to the Top, Part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3)
- [Multi-target: Road to the Top, Part 4](https://www.kaggle.com/code/jhoward/multi-target-road-to-the-top-part-4)

## Installing the libraries


In [329]:
# fastkaggle allows you to work locally and then submit the results and notebook to Kaggle

try: import fastkaggle

except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *

In [330]:
competition = 'paddy-disease-classification'
path = setup_comp(competition, install='fastai "timm>=0.6.2.dev0"')

from fastai.vision.all import *
from scipy.special import softmax, log_softmax
from functools import partial
set_seed(42)

## Setting data paths

In [331]:
# train images
train_path = path / 'train_images'
train_files = get_image_files(train_path)

# test images
test_path = path/'test_images'
test_files = get_image_files(test_path).sorted()

# sample submission
sample_submission = pd.read_csv(path/'sample_submission.csv')

# train labels
train_df = pd.read_csv(path / 'train.csv')
train_df.head()

| | image_id | label | variety | age |
|---|---|---|---|---|
| 0 | 100330.jpg | bacterial_leaf_blight | ADT45 | 45 |
| 1 | 100365.jpg | bacterial_leaf_blight | ADT45 | 45 |
| 2 | 100382.jpg | bacterial_leaf_blight | ADT45 | 45 |
| 3 | 100632.jpg | bacterial_leaf_blight | ADT45 | 45 |
| 4 | 101918.jpg | bacterial_leaf_blight | ADT45 | 45 |


## Dataloaders for fastai training
Here the dataloaders are created from a `DataBlock`:

In [332]:
dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=224, min_scale=0.75)
)

dls = dblock.dataloaders(train_path)
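
To make the `RandomSplitter(0.2, seed=42)` step concrete, here is a minimal pure-Python sketch of a seeded random 80/20 split. It only illustrates the idea: fastai draws its permutation differently, so the actual indices will not match. `random_split` is a hypothetical helper name, and the item count of 10,407 is an assumption, chosen because it is consistent with the 2,081 validation rows reported later in this notebook.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=42):
    """Hypothetical helper: shuffle indices and hold out the first `valid_pct` share."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)     # size of the validation set
    return idxs[cut:], idxs[:cut]      # (train indices, validation indices)

# 10407 is an assumed item count, not a value taken from this notebook's output
train_idx, valid_idx = random_split(10407)
print(len(train_idx), len(valid_idx))
```

Fixing the seed means the same images land in the validation set on every run, which keeps error rates comparable between experiments.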

## Create a learner and train

### Custom Loss and Activation function

In [333]:
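def custom_loss(inp, disease, label_smoothing=0.1):
    # label_smoothing=0.1 matches the value used in the label smoothing examples below
    return F.cross_entropy(inp, disease, label_smoothing=label_smoothing)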

def custom_softmax(x, base=math.e):
    """If base=2, then the softmax becomes "SoftmaxBase" (?)
    computed as [2**xi / SUM(2**xi)] instead of [e**xi / SUM(e**xi)]
    """
    # Since `base` can be rewritten as `e**(ln(base))`
    # So `base**xi = (e**(ln(base)))**xi = e**(xi*ln(base))`
    return F.softmax(x * math.log(base), dim=-1)

def no_activation(x):
    return x
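
The identity in the comments of `custom_softmax` can be checked numerically without any library. The sketch below compares the direct definition `base**xi / SUM(base**xi)` with the standard softmax applied to logits scaled by `ln(base)`. The function names and the logits are made up for illustration.

```python
import math

def softmax_base(logits, base=math.e):
    """Direct definition: base**x_i / sum_j base**x_j."""
    powers = [base ** v for v in logits]
    total = sum(powers)
    return [p / total for p in powers]

def softmax_scaled(logits, base=math.e):
    """Same result via the standard softmax on logits scaled by ln(base)."""
    scaled = [v * math.log(base) for v in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # made-up values for illustration
direct = softmax_base(logits, base=2)
scaled = softmax_scaled(logits, base=2)
print(direct)
print(scaled)
```

Both lists should agree up to floating-point rounding, which is why a single scaled call to `F.softmax` is enough to change the base.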

In [334]:
learn = vision_learner(dls, resnet34, metrics=error_rate).to_fp16()

And that's it: one frozen epoch followed by one fine-tuning epoch, at a learning rate of 0.005, gives a quick baseline.

In [8]:
learn.fine_tune(1, 0.005)

| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 1.736589 | 1.059424 | 0.336377 | 01:18 |


| epoch | train_loss | valid_loss | error_rate | time |
|---|---|---|---|---|
| 0 | 0.668114 | 0.402552 | 0.121096 | 01:21 |


### Predictions and Test Time Augmentation

Let's compare the error rate on the validation set obtained with the standard prediction function against the one obtained with a technique called Test Time Augmentation (TTA). As you'll see, TTA is easy with fastai.

In [9]:
# Get predictions on validation set
probs, target = learn.get_preds(dl=dls.valid)
error_rate(probs, target)

TensorBase(0.1211)
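
For reference, the error rate is simply the share of rows whose highest-probability class differs from the target (one minus accuracy). A minimal sketch with made-up probabilities, where `error_rate_from_probs` is a hypothetical name and not part of fastai:

```python
def error_rate_from_probs(probs, targets):
    """Share of rows whose highest-probability class differs from the target."""
    preds = [row.index(max(row)) for row in probs]   # argmax per row
    wrong = sum(p != t for p, t in zip(preds, targets))
    return wrong / len(targets)

# Made-up probabilities for four images and three classes
probs_demo = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1]]
targets_demo = [0, 2, 1, 1]
print(error_rate_from_probs(probs_demo, targets_demo))   # one of the four rows is wrong
```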

In [10]:
# Get TTA predictions on validation set
probs, target = learn.tta(dl=dls.valid)
error_rate(probs, target)

TensorBase(0.1124)

TTA lowers the validation error rate from 0.1211 to 0.1124.
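
The idea behind TTA is to predict on several augmented views of the same image and combine the results. The sketch below shows that combination step only, on made-up probabilities for one image. The blend with the un-augmented prediction and its weight `beta=0.25` reflect my understanding of fastai's defaults and should be treated as an assumption. `average_views` and `blend` are hypothetical helper names.

```python
def average_views(view_probs):
    """Average per-class probabilities over several augmented views of one image."""
    n_views = len(view_probs)
    n_classes = len(view_probs[0])
    return [sum(v[c] for v in view_probs) / n_views for c in range(n_classes)]

def blend(aug_mean, plain, beta=0.25):
    """Linear blend with weight `beta` on the un-augmented prediction (assumed default)."""
    return [(1 - beta) * a + beta * p for a, p in zip(aug_mean, plain)]

# Made-up probabilities for one image with three classes
augmented = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.4, 0.5, 0.1], [0.4, 0.5, 0.1]]
plain = [0.5, 0.4, 0.1]

aug_mean = average_views(augmented)
final = blend(aug_mean, plain)
print(aug_mean, final)
```

In this toy case the un-augmented prediction favours class 0, while the blended prediction favours class 1, which is how TTA can change (and often improve) the final answer.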

## Softmax

In [209]:
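def custom_loss(inp, disease, label_smoothing=0.1):
    # label_smoothing=0.1 matches the value used in the label smoothing examples below
    return F.cross_entropy(inp, disease, label_smoothing=label_smoothing)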

def custom_softmix(x, exp=math.e):
    return F.softmax(x * math.log(exp), dim=-1)

def no_activation(x):
    return x

In [16]:
probs.shape

torch.Size([2081, 10])

- As we can see, `probs` contains 2,081 rows (one per validation photo), each with 10 probabilities, one for each of the ten possible diseases.
- In each row, those 10 values add up to 100%.

In [17]:
[sum(row)*100 for row in probs[0:4]]

[TensorBase(100.), TensorBase(100.), TensorBase(100.), TensorBase(100.0000)]

- But the raw output of a neural network (the logits) is not a set of probabilities. It is a series of 10 numbers that are normalized to turn them into probabilities.

#### Getting the logits
- `learn.loss_func = custom_loss`
Before calling `get_preds` or `tta`, assign `custom_loss` to `loss_func`.

- `learn.get_preds()`
Pass the parameter `act=no_activation` so that Softmax is not applied as the "normalizer". This parameter cannot be passed to `tta`.


### Softmax Excel
- Jeremy video
- Excel Repository
- Google Sheet file

In [287]:
# logits
x = torch.tensor(
    [[-4.88522478044709, 2.59747282063147, 0.59166497570264, -2.06894452227226, -4.56867917386799],
     [-2.88522478044709, 1.59747282063147, -0.59166497570264, 2.06894452227226, -4.56867917386799]]
)
x

tensor([[-4.8852,  2.5975,  0.5917, -2.0689, -4.5687],
        [-2.8852,  1.5975, -0.5917,  2.0689, -4.5687]])

In [288]:
target = torch.tensor([1,3])
target

tensor([1, 3])

In [289]:
# e**xi
expxi = x.exp()
expxi

tensor([[7.5574e-03, 1.3430e+01, 1.8070e+00, 1.2632e-01, 1.0372e-02],
        [5.5842e-02, 4.9405e+00, 5.5341e-01, 7.9165e+00, 1.0372e-02]])

In [290]:
# Sum by row [sum([e**xi])]
sum_expxi = expxi.sum(axis=1)
sum_expxi

tensor([15.3810, 13.4766])

In [291]:
softmaxxi = expxi.transpose(0,1)*(1/sum_expxi)
softmaxxi = softmaxxi.transpose(0,1)
softmaxxi

tensor([[4.9135e-04, 8.7314e-01, 1.1748e-01, 8.2127e-03, 6.7432e-04],
        [4.1436e-03, 3.6660e-01, 4.1064e-02, 5.8742e-01, 7.6960e-04]])

In [292]:
logsoftmaxxi = softmaxxi.log()
logsoftmaxxi

tensor([[-7.6184, -0.1357, -2.1415, -4.8021, -7.3018],
        [-5.4862, -1.0035, -3.1926, -0.5320, -7.1696]])

In [293]:
y_ohe = np.zeros((len(target), len(x[0])))
y_ohe[np.arange(len(target)), target] = 1
y_ohe

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

In [294]:
# cross-entropy is the negative log-probability of the target class
lossi = -(logsoftmaxxi * y_ohe).sum(dim=-1)
lossi

tensor([0.1357, 0.5320], dtype=torch.float64)

In [295]:
mean_loss = -(logsoftmaxxi * y_ohe).sum(dim=-1).mean()
mean_loss

tensor(0.3338, dtype=torch.float64)
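
The same steps can be condensed into a few lines without any library, using the logits and targets from the cells above. The loss is the negative log-probability of the target class, so it comes out positive and matches the magnitude of the values just shown. `cross_entropy_row` is a hypothetical helper name.

```python
import math

logits = [
    [-4.88522478044709, 2.59747282063147, 0.59166497570264, -2.06894452227226, -4.56867917386799],
    [-2.88522478044709, 1.59747282063147, -0.59166497570264, 2.06894452227226, -4.56867917386799],
]
targets = [1, 3]

def cross_entropy_row(row, target):
    """Negative log-probability that softmax assigns to the target class."""
    sum_exp = sum(math.exp(v) for v in row)
    prob_target = math.exp(row[target]) / sum_exp
    return -math.log(prob_target)

losses = [cross_entropy_row(r, t) for r, t in zip(logits, targets)]
mean_loss = sum(losses) / len(losses)
print([round(l, 4) for l in losses], round(mean_loss, 4))
```

Indexing the target class directly does the same job as multiplying by the one-hot matrix and summing, since every other entry is multiplied by zero.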

#### scipy

In [273]:
softmax(x, axis=1)

tensor([[4.9135e-04, 8.7314e-01, 1.1748e-01, 8.2127e-03, 6.7432e-04],
        [4.1436e-03, 3.6660e-01, 4.1064e-02, 5.8742e-01, 7.6960e-04]])

In [274]:
log_softmax(x, axis=1)

array([[-7.6183577 , -0.13565996, -2.1414678 , -4.8020773 , -7.301812  ],
       [-5.4861803 , -1.0034829 , -3.1926208 , -0.5320113 , -7.169635  ]],
      dtype=float32)

In [275]:
softmax(x, axis=1).log()

tensor([[-7.6184, -0.1357, -2.1415, -4.8021, -7.3018],
        [-5.4862, -1.0035, -3.1926, -0.5320, -7.1696]])

#### PyTorch

In [296]:
F.softmax(x, dim=-1)

tensor([[4.9135e-04, 8.7314e-01, 1.1748e-01, 8.2127e-03, 6.7432e-04],
        [4.1436e-03, 3.6660e-01, 4.1064e-02, 5.8742e-01, 7.6960e-04]])

In [297]:
F.log_softmax(x, dim=-1)

tensor([[-7.6184, -0.1357, -2.1415, -4.8021, -7.3018],
        [-5.4862, -1.0035, -3.1926, -0.5320, -7.1696]])

In [298]:
F.nll_loss(F.log_softmax(x, dim=-1), target)

tensor(0.3338)

In [299]:
F.cross_entropy(x, target)

tensor(0.3338)

In [301]:
F.one_hot(target, len(x[0]))

tensor([[0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0]])

### Cross Entropy Loss

In [302]:
x = torch.tensor([
  [3.326171875,0.457275391,-2.796875000,0.537597656,0.363037109,0.821777344,1.685546875,-3.531250000,-1.796875000,-1.674804688],
  [-1.417968750,1.099609375,-2.019531250,3.640625000,0.524902344,1.203125000,1.619140625,-0.686035156,-2.474609375,-1.802734375]
])
x

tensor([[ 3.3262,  0.4573, -2.7969,  0.5376,  0.3630,  0.8218,  1.6855, -3.5312,
         -1.7969, -1.6748],
        [-1.4180,  1.0996, -2.0195,  3.6406,  0.5249,  1.2031,  1.6191, -0.6860,
         -2.4746, -1.8027]])

In [303]:
target = torch.tensor([0,3])
target

tensor([0, 3])

In [309]:
F.cross_entropy(x, target, reduction='none')

tensor([0.3794, 0.3167])

In [306]:
mean_loss = F.cross_entropy(x, target)
mean_loss

tensor(0.3480)

In [312]:
## Label Smoothing
F.cross_entropy(x, target, reduction='none', label_smoothing=0.1)

tensor([0.7381, 0.6839])

In [313]:
F.cross_entropy(x, target, label_smoothing=0.1)

tensor(0.7110)
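
To see where these numbers come from, here is a from-scratch sketch of label smoothing. It assumes the smoothed target is `(1 - eps) * one_hot + eps / n_classes`, which reproduces the values above for `eps=0.1`. The helper names are hypothetical.

```python
import math

x = [
    [3.326171875, 0.457275391, -2.796875000, 0.537597656, 0.363037109,
     0.821777344, 1.685546875, -3.531250000, -1.796875000, -1.674804688],
    [-1.417968750, 1.099609375, -2.019531250, 3.640625000, 0.524902344,
     1.203125000, 1.619140625, -0.686035156, -2.474609375, -1.802734375],
]
target = [0, 3]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]

def smoothed_cross_entropy(row, t, eps=0.1):
    """Cross-entropy against a target of (1 - eps) * one_hot + eps / n_classes."""
    log_p = log_softmax(row)
    n_classes = len(row)
    smooth_target = [eps / n_classes + (1 - eps) * (k == t) for k in range(n_classes)]
    return -sum(q * lp for q, lp in zip(smooth_target, log_p))

losses = [smoothed_cross_entropy(r, t) for r, t in zip(x, target)]
print([round(l, 4) for l in losses], round(sum(losses) / len(losses), 4))
```

The loss roughly doubles compared with the unsmoothed case because the model is now also penalized for assigning very low probability to the nine non-target classes.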

In [317]:
F.softmax(x, dim=-1)

tensor([[6.8425e-01, 3.8839e-02, 1.4997e-03, 4.2088e-02, 3.5346e-02, 5.5921e-02,
         1.3265e-01, 7.1958e-04, 4.0767e-03, 4.6060e-03],
        [4.6297e-03, 5.7401e-02, 2.5369e-03, 7.2857e-01, 3.2309e-02, 6.3662e-02,
         9.6505e-02, 9.6256e-03, 1.6094e-03, 3.1510e-03]])

In [315]:
custom_softmix(x, exp=math.e)

tensor([[6.8425e-01, 3.8839e-02, 1.4997e-03, 4.2088e-02, 3.5346e-02, 5.5921e-02,
         1.3265e-01, 7.1958e-04, 4.0767e-03, 4.6060e-03],
        [4.6297e-03, 5.7401e-02, 2.5369e-03, 7.2857e-01, 3.2309e-02, 6.3662e-02,
         9.6505e-02, 9.6256e-03, 1.6094e-03, 3.1510e-03]])

In [323]:
part_sofmax = partial(custom_softmix, exp=math.e)

In [324]:
part_sofmax(x)

tensor([[6.8425e-01, 3.8839e-02, 1.4997e-03, 4.2088e-02, 3.5346e-02, 5.5921e-02,
         1.3265e-01, 7.1958e-04, 4.0767e-03, 4.6060e-03],
        [4.6297e-03, 5.7401e-02, 2.5369e-03, 7.2857e-01, 3.2309e-02, 6.3662e-02,
         9.6505e-02, 9.6256e-03, 1.6094e-03, 3.1510e-03]])

In [325]:
part_sofmax2 = partial(custom_softmix, exp=2)

In [326]:
part_sofmax2(x) # 2**xi / sum(2**xi)

tensor([[0.5026, 0.0688, 0.0072, 0.0727, 0.0644, 0.0886, 0.1612, 0.0043, 0.0144,
         0.0157],
        [0.0162, 0.0926, 0.0107, 0.5390, 0.0622, 0.0995, 0.1328, 0.0269, 0.0078,
         0.0124]])

In [328]:
part_sofmax2(x).sum(axis=-1)

tensor([1.0000, 1.0000])

In [322]:
F.softmax(x * math.log(2), dim=-1)

tensor([[0.5026, 0.0688, 0.0072, 0.0727, 0.0644, 0.0886, 0.1612, 0.0043, 0.0144,
         0.0157],
        [0.0162, 0.0926, 0.0107, 0.5390, 0.0622, 0.0995, 0.1328, 0.0269, 0.0078,
         0.0124]])
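
Another way to read these results: changing the base is the same as dividing the logits by a temperature `T = 1 / ln(base)`. A base below `e` gives `T > 1` and a flatter distribution, while a base above `e` gives a sharper one. The sketch below sweeps a few bases over the first row of `x`. `softmax_with_base` is a hypothetical name, and base 10 is added only for contrast.

```python
import math

def softmax_with_base(row, base=math.e):
    """Softmax with `base` instead of e, i.e. temperature T = 1 / ln(base)."""
    scaled = [v * math.log(base) for v in row]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

row = [3.326171875, 0.457275391, -2.796875000, 0.537597656, 0.363037109,
       0.821777344, 1.685546875, -3.531250000, -1.796875000, -1.674804688]

for base in (2, math.e, 10):
    top = max(softmax_with_base(row, base))
    print(f"base={base:.3f}  T={1 / math.log(base):.3f}  top probability={top:.4f}")
```

The predicted class never changes with the base, only how confident the distribution looks.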

### Predictions on test set

In [11]:
# TTA predictions from test images
probs, _ = learn.tta(dl=dls.test_dl(test_files))

In [12]:
# get the index of the highest probability
preds = probs.argmax(dim=1)

In [13]:
dls.vocab[preds]

(#3469) ['hispa','normal','blast','blast','blast','brown_spot','dead_heart','brown_spot','hispa','normal'...]
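
The last two cells boil down to an argmax followed by a lookup in the vocabulary. A minimal sketch with a made-up, shortened class list (the real `dls.vocab` has ten entries):

```python
vocab = ['blast', 'hispa', 'normal']          # hypothetical, shortened class list

# Made-up probabilities for three test images
probs_demo = [
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
    [0.8, 0.1, 0.1],
]

pred_idx = [row.index(max(row)) for row in probs_demo]   # argmax along the class axis
labels = [vocab[i] for i in pred_idx]
print(labels)
```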

### Submission

In [14]:
sample_submission.label = dls.vocab[preds]
sample_submission.to_csv('submission.csv', index=False)

### Conclusions

* I found this model to be a good baseline, with good accuracy for its speed and cost.
* You can try a different number of epochs, learning rate, or even seed, and see what happens when you submit the results.
* Then you can apply some of the techniques that Jeremy used in his series.
* And keep trying.



In [15]:
# Pushing the notebook from my home PC to Kaggle

if not iskaggle:
    push_notebook(
        'fmussari', 
        'fast-resnet34-with-fastai',
        title='Fast Resnet34 with Fastai',
        file='2022-07. Fast and Agile Resnet34 with Fastai.ipynb',
        competition=competition, 
        private=True, 
        gpu=True
    )

Kernel version 1 successfully pushed.  Please check progress at https://www.kaggle.com/code/fmussari/fast-resnet34-with-fastai
