# Side notes to come back to and refresh 🧠🛠️

In [85]:
import torch

## __Probability vs. Likelihood__


- __Probability__ is the chance of observing a particular outcome given a fixed probability distribution, i.e., a model whose parameters are known. The model is held fixed and the data varies.

- __Likelihood__ measures how well a particular set of parameters of a statistical model explains the observed data. The data is held fixed and the parameters vary. Maximum likelihood estimation adjusts the parameters of the model to maximize the probability of the observed data.

__Probability__ is used to determine the chance of occurrence of a particular event given a model, whereas __likelihood__ is used to estimate the parameters of the model that make the observed data most probable.

When training a model, we optimize the likelihood, meaning we find the parameters that maximize the likelihood function (in practice, we minimize the negative log likelihood, which has the same optimum). Once the model is trained, it can be used to compute the probability of future events.
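
To make the distinction concrete, here is a minimal sketch with a coin-flip (Bernoulli) model in plain Python. The data in `flips`, the helper names, and the grid search over candidate parameters are illustrative assumptions, not something taken from these notes.

```python
import math

def probability(outcome, p):
    """Probability of one coin-flip outcome (1 = heads) when the parameter p is fixed."""
    return p if outcome == 1 else 1.0 - p

def log_likelihood(p, data):
    """Log-likelihood of the parameter p when the observed data is fixed."""
    return sum(math.log(probability(y, p)) for y in data)

# Hypothetical observations: 7 heads out of 10 flips.
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

# Likelihood: the data is fixed, so scan candidate parameters and keep the best one.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: log_likelihood(p, flips))

# Probability: the parameter is fixed (the fitted p_hat), so ask about a future event.
p_two_heads = probability(1, p_hat) ** 2

print(p_hat, p_two_heads)
```

The grid search stands in for the optimizer: training plays the role of the `max(...)` line, and inference plays the role of the `probability(...)` call.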

## __Negative Log Likelihood vs. Cross-Entropy__

__Always prefer the__ `torch.nn.functional.cross_entropy` __implementation, since it is much more efficient and much more numerically stable than computing the softmax and the log by hand.__ See the demonstration below:

In [84]:
B = 12
num_classes = 37
Y = torch.randint(low=0, high=num_classes, size=(B,))
dummy_nnet = lambda x: torch.randn((B, num_classes))
x = None 
logits = dummy_nnet(x)
Y.shape, logits.shape

(torch.Size([12]), torch.Size([12, 37]))

### Negative Log Likelihood

In [80]:
def negative_log_likelihood(logits, Y):
    B, _ = logits.shape
    # Softmax by hand: exponentiate, then normalize each row (naive: exp overflows for large logits)
    counts = logits.exp()
    probs = counts / counts.sum(dim=1, keepdim=True)
    assert probs.sum(dim=1).allclose(torch.ones(B)), 'Probabilities must sum to 1'
    # Negative log likelihood of the correct class for each example
    log_probs = probs[torch.arange(B), Y].log()
    negative_log_probs = -log_probs
    loss_reduced = negative_log_probs.mean()  # loss reduction
    # loss_reduced = -probs[torch.arange(B), Y].log().mean()  # one-liner implementation
    # Sanity check against torch's nll_loss, which expects log-probabilities as input
    torch_nll_loss = torch.nn.functional.nll_loss(probs.log(), Y, reduction='mean')
    assert loss_reduced.allclose(torch_nll_loss), 'Torch and own implementation must be equal'
    return loss_reduced

In [81]:
loss = negative_log_likelihood(logits, Y)
loss

tensor(4.3732)

### Cross Entropy 

In [82]:
loss = torch.nn.functional.cross_entropy(logits, Y, reduction='mean')
loss

tensor(4.3732)
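
The two losses agree on well-behaved logits, so the stability claim only shows up with extreme values. The sketch below uses plain Python and hypothetical logits to compare the naive softmax-then-log computation with the log-sum-exp formulation, a standard trick for a stable log-softmax. It is assumed here to be representative of what a fused implementation does, not a description of PyTorch internals.

```python
import math

def naive_nll(logits, target):
    """Softmax followed by log, as in the hand-written version: exp() can overflow."""
    counts = [math.exp(z) for z in logits]
    return -math.log(counts[target] / sum(counts))

def stable_nll(logits, target):
    """Same quantity via log-sum-exp: subtract the max logit before exponentiating."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, -1.0, 0.5]     # well-behaved logits: both versions agree
large = [1000.0, -1.0, 0.5]  # extreme logits: the naive version breaks

print(naive_nll(small, 0), stable_nll(small, 0))

try:
    naive_nll(large, 0)
    naive_failed = False
except OverflowError:
    naive_failed = True  # math.exp(1000.0) exceeds the float range

print(naive_failed, stable_nll(large, 0))
```

In plain Python the overflow raises an exception. With tensors, the same overflow typically surfaces as `inf` or `nan` in the loss instead.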

## __Dead Neurons__

Dead neurons receive (almost) no gradient because they sit in a flat region of their activation function for every example in the dataset (e.g., the zero region of ReLU, or the saturated tails of Tanh and Sigmoid). See [Video](https://youtu.be/P6sfmUTpUmc?t=779) for more details. This may be caused by chance at initialization, or by a high learning rate that knocks the neuron out of the data distribution.
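
As a rough illustration, the sketch below computes the weight gradient of a single one-dimensional ReLU neuron by hand. The inputs, weights, and biases are hypothetical values chosen so that one neuron is healthy and the other is dead.

```python
def relu_neuron_grads(w, b, inputs):
    """Per-example gradient of relu(w * x + b) with respect to w."""
    grads = []
    for x in inputs:
        z = w * x + b  # pre-activation
        # Chain rule: relu'(z) * x, where relu'(z) is 0 in the flat region.
        grads.append(x if z > 0 else 0.0)
    return grads

# Hypothetical inputs, chosen only for illustration.
inputs = [-1.5, -0.3, 0.2, 0.9, 1.4]

healthy = relu_neuron_grads(w=1.0, b=0.0, inputs=inputs)
dead = relu_neuron_grads(w=1.0, b=-5.0, inputs=inputs)  # bias pushed far negative

print(healthy)
print(dead)
```

The dead neuron gets a zero gradient on every example, so gradient descent can never move its weights back into the active region.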

## __Batch Norm__

As mentioned in the original paper, it accelerates the training of (very) deep NNs by reducing the internal covariate shift of the activations. In practice, it normalizes each layer output over the mini-batch to 0 mean and 1 std (roughly a unit gaussian), and then applies a learnable scale and shift. It should be placed after the layer (e.g., nn.Linear, nn.Conv2d, ...) and before the activation (e.g., nn.ReLU, nn.Tanh). However, some experiments have shown good results when applying it after the activation. Why this works is still a mystery.

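When using BN, the preceding layer does not need a bias term: BN subtracts the batch mean, which cancels any constant bias, and BN's own learnable shift takes over that role.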
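
A minimal sketch of the BN forward pass for a single feature shows why the bias is redundant. The function name, the default `gamma`, `beta`, and `eps`, and the sample values are assumptions made for illustration.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over the batch, then apply scale (gamma) and shift (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Hypothetical pre-activations of one neuron over a batch of 4 examples.
z = [0.5, -1.2, 3.0, 0.7]
bias = 10.0  # a bias the preceding layer might have added

without_bias = batch_norm(z)
with_bias = batch_norm([v + bias for v in z])

# The batch mean absorbs the bias, so both outputs match up to rounding.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(max_diff)
```

Any gradient that would flow to the bias is cancelled by the mean subtraction, so the parameter would only waste memory.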

- __momentum__. The BN running mean and std are updated with an EMA at each mini-batch throughout training. If the batch size is small (e.g., 32), the batch statistics are noisy, so a momentum of 0.1 might be too aggressive and harm training: the running estimates fluctuate instead of converging. Consider using a smaller momentum (e.g., 0.001) in that case.

See [Video](https://youtu.be/P6sfmUTpUmc?t=2440)
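
The momentum note above can be illustrated with a small simulation in plain Python. It assumes the update convention `new = (1 - momentum) * old + momentum * batch_stat`, and it models the noisy batch means with hypothetical gaussian samples instead of real activations.

```python
import random
import statistics

def running_mean_trace(batch_means, momentum):
    """EMA of the batch means: new = (1 - momentum) * old + momentum * batch_mean."""
    running, trace = 0.0, []
    for m in batch_means:
        running = (1.0 - momentum) * running + momentum * m
        trace.append(running)
    return trace

random.seed(0)
true_mean, batch_size, steps = 5.0, 32, 20000

# Hypothetical batch means: the mean of `batch_size` unit-variance samples
# has standard deviation 1 / sqrt(batch_size).
batch_means = [random.gauss(true_mean, 1.0 / batch_size ** 0.5) for _ in range(steps)]

fast = running_mean_trace(batch_means, momentum=0.1)
slow = running_mean_trace(batch_means, momentum=0.001)

# Spread of the running estimate over the last 2000 steps.
jitter_fast = statistics.pstdev(fast[-2000:])
jitter_slow = statistics.pstdev(slow[-2000:])
print(jitter_fast, jitter_slow)
```

The smaller momentum gives a steadier estimate, at the price of needing many more steps before the running mean leaves its initial value of 0.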