In [1]:
import torch

## Recap with Datasets 
$D = \{(x_i, y_i)\}_{i=1}^{N} \quad \text{where} \quad (x_i, y_i) \sim P(X, Y)$

#### Loss function:
$l(\theta | x_i, y_i)$

#### Expected loss:
$L(\theta | D) = \mathbb{E}_{(x,y) \sim D} [l(\theta | x, y)]$
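
In practice the expectation is estimated by averaging the per-sample losses over the dataset. A minimal sketch, assuming squared error as the per-sample loss $l$ and a toy linear model (both choices, and the variable names, are illustrative and not part of the recap above):

```python
import torch

torch.manual_seed(0)

# Toy dataset D: N = 20 samples with 10 features each (illustrative sizes)
x = torch.randn(20, 10)
y = torch.randn(20, 1)

# theta: the parameters of a small linear model (an assumption for this sketch)
model = torch.nn.Linear(10, 1)

# Per-sample loss l(theta | x_i, y_i): squared error, one value per sample
per_sample = torch.nn.functional.mse_loss(model(x), y, reduction="none")

# Empirical expected loss L(theta | D): the mean over all samples in D
expected_loss = per_sample.mean()
print(per_sample.shape, expected_loss)
```

With `reduction="none"` the loss keeps one entry per sample, so the averaging step in $L(\theta | D)$ is explicit rather than hidden inside the loss function.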


Training minimizes this expected loss with respect to the parameters $\theta$.
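
One way to optimize the loss is gradient descent. The sketch below takes a single step with `torch.optim.SGD`; the choice of optimizer, the learning rate of 0.05, and the MSE loss are assumptions made for illustration, not something this recap prescribes.

```python
import torch

torch.manual_seed(0)

x = torch.randn(20, 10)
y = torch.randn(20, 1)
model = torch.nn.Linear(10, 1)

# Plain SGD; the learning rate is an arbitrary small value for this sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

loss_before = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()   # clear any stale gradients
loss_before.backward()  # compute the gradient of the loss w.r.t. theta
optimizer.step()        # theta <- theta - lr * gradient

loss_after = torch.nn.functional.mse_loss(model(x), y)
print(loss_before.item(), loss_after.item())
```

With a small enough learning rate, the loss after the step should be lower than before it.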

Here is a simple linear model with 10 inputs and 1 output.

In [4]:
model = torch.nn.Linear(10, 1)

Generate 20 samples, each with 10 features, along with 20 labels (one per sample).

In [5]:
x = torch.randn(20, 10)
y = torch.randn(20, 1)
print(f'{x=} {y=}')

x=tensor([[-0.9083,  0.6928,  1.9945,  0.1235, -0.9830, -0.1790, -0.6362, -0.6634,
         -0.1359, -0.6323],
        [ 0.9359, -1.2579,  0.8495, -0.1982, -2.0149,  0.4632,  1.3451, -0.0267,
          1.2630,  0.7624],
        [ 2.6083, -0.6014,  0.2768, -0.3980, -1.3419,  1.0293, -1.5615, -0.4441,
         -0.6056,  0.9286],
        [-0.3448,  1.4212, -0.2720, -2.1540, -0.0844, -0.0027, -0.9357,  1.3070,
         -0.9902,  0.0247],
        [ 0.7207,  0.3491, -1.6588, -0.0283,  0.1197,  0.0608,  1.3079,  0.1081,
         -0.2526,  0.6874],
        [-0.3460, -2.6866, -0.4130, -0.3973, -1.3535, -1.7491, -0.1439, -1.7610,
          0.7561, -0.1493],
        [-0.8451,  1.3845, -0.3687, -0.3800, -0.1305, -0.4377,  0.4213, -0.2474,
         -0.9070,  0.8161],
        [-0.0595, -0.1350,  0.5597, -0.7980,  0.2108,  0.8200,  2.0374,  0.6362,
          0.9273,  0.4229],
        [ 0.2330,  0.5349,  0.7114,  0.3931,  0.4020,  0.6067, -0.2303, -0.2359,
          0.2746,  0.5293],
        [-1.6134,

Passing `x` through the model gives one prediction for each label in `y`.

In [6]:
pred_y = model(x)
print(pred_y)

tensor([[ 0.4510],
        [-0.4237],
        [-1.1375],
        [-0.7424],
        [-0.2288],
        [-1.0232],
        [-0.1169],
        [ 0.4734],
        [ 0.3439],
        [-0.3617],
        [-0.9265],
        [-0.1807],
        [-0.0681],
        [-0.0300],
        [ 1.0179],
        [ 0.6545],
        [-0.7629],
        [-0.5349],
        [-0.2827],
        [ 0.2715]], grad_fn=<AddmmBackward0>)
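
The `AddmmBackward0` in the output reflects what the layer computes: a matrix multiply plus a bias, $\hat{y} = x W^\top + b$. A small sketch that checks this by hand (the seed and the name `manual` are illustrative):

```python
import torch

torch.manual_seed(0)

model = torch.nn.Linear(10, 1)
x = torch.randn(20, 10)

# weight has shape (out_features, in_features) = (1, 10); bias has shape (1,)
manual = x @ model.weight.T + model.bias

print(manual.shape)
print(torch.allclose(manual, model(x), atol=1e-6))
```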


**Mean Squared Error (MSE) Formula**

The Mean Squared Error (MSE) measures the average squared difference between actual and predicted values:


$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

A lower MSE indicates a better fit of the model.


In [7]:
loss = torch.nn.functional.mse_loss(pred_y, y)
print(loss)

tensor(0.8921, grad_fn=<MseLossBackward0>)


MSE from scratch

In [8]:
def mse_loss(pred_y, y):
    return torch.mean((pred_y - y) ** 2)
mse_loss(pred_y, y)

tensor(0.8921, grad_fn=<MeanBackward0>)
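
A caveat with the from-scratch version: it relies on `pred_y` and `y` having the same shape. The sketch below, using made-up values, shows what can happen if the labels are stored as a flat vector of shape `(20,)` instead of `(20, 1)`.

```python
import torch

def mse_loss(pred_y, y):
    return torch.mean((pred_y - y) ** 2)

y_col = torch.arange(20.0).reshape(20, 1)   # shape (20, 1)
y_flat = y_col.reshape(20)                  # shape (20,)
pred_y = y_col.clone()                      # perfect predictions, shape (20, 1)

# (20, 1) - (20,) broadcasts to (20, 20): every prediction vs. every label
print((pred_y - y_flat).shape)

# Matching shapes give zero loss; mismatched shapes silently give a nonzero one
print(mse_loss(pred_y, y_col), mse_loss(pred_y, y_flat))
```

Keeping labels in the same `(N, 1)` shape as the model output, as the cells above do, avoids this.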

Let's create 20 binary labels 

In [7]:
y = (torch.randn(20, 1) > 0).float()
print(y)

tensor([[0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.]])


### Binary Cross Entropy with Logits (BCE with Logits)

Binary Cross Entropy with Logits is used for binary classification tasks. Instead of working with probabilities, it takes raw model outputs (logits) and applies the **sigmoid function** internally for numerical stability. In the formula, $\hat{y}_i$ is the logit for sample $i$ and $\sigma$ is the sigmoid.

#### **Formula:**
$L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\sigma(\hat{y}_i)) + (1 - y_i) \log(1 - \sigma(\hat{y}_i)) \right]$


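Working from logits is more numerically stable than applying a sigmoid first and then taking the logarithm. When the sigmoid saturates at exactly 0 or 1 in floating point, the separate logarithm evaluates $\log(0) = -\infty$, whereas the fused implementation uses the log-sum-exp trick and stays finite.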
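
A small sketch of the stability issue, using one hand-picked logit (the value 100 and the variable names are illustrative). The naive route applies the sigmoid and then the formula; the fused route passes the logit straight to `binary_cross_entropy_with_logits`.

```python
import torch
import torch.nn.functional as F

# One very confident, wrong prediction: large positive logit, true label 0
logit = torch.tensor([100.0])
target = torch.tensor([0.0])

# Naive route: sigmoid first, then the formula. In float32, sigmoid(100)
# rounds to exactly 1.0, so log(1 - 1.0) = log(0) = -inf.
prob = torch.sigmoid(logit)
naive = -(target * torch.log(prob) + (1 - target) * torch.log(1 - prob)).mean()

# Fused route: works on the logit directly and stays finite
stable = F.binary_cross_entropy_with_logits(logit, target)

print(naive, stable)
```

The naive loss is infinite, which would also break the gradients, while the fused loss is a finite value close to the logit itself.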


In [8]:
loss = torch.nn.functional.binary_cross_entropy_with_logits(pred_y, y)
print(loss)

tensor(0.6922, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
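
BCE with logits from scratch, in the same spirit as the MSE version. This is a sketch rather than PyTorch's actual implementation: it uses `F.logsigmoid` together with the identity $1 - \sigma(z) = \sigma(-z)$, and the random logits stand in for the model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(20, 1)              # stand-in for model outputs
y = (torch.randn(20, 1) > 0).float()     # binary labels

def bce_with_logits(logits, y):
    # log(sigmoid(z)) and log(1 - sigmoid(z)) = log(sigmoid(-z))
    log_p = F.logsigmoid(logits)
    log_not_p = F.logsigmoid(-logits)
    return -(y * log_p + (1 - y) * log_not_p).mean()

manual = bce_with_logits(logits, y)
builtin = F.binary_cross_entropy_with_logits(logits, y)
print(manual, builtin)
```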


Let's create a new model that takes 10 inputs and outputs one logit for each of 3 classes.

Let's also create some integer class labels in the range 0 to 2.

In [9]:
num_classes = 3
model = torch.nn.Linear(10, num_classes)
y = (torch.randn(20) > 0).long() + (torch.randn(20) > 0).long()
print(y)

tensor([1, 1, 1, 1, 0, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1, 2, 0, 0, 1, 1])


### Cross Entropy Loss

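Cross Entropy is used for classification tasks to compare the predicted probability distribution with the true class labels. In the formulas below, $y$ is the one-hot encoding of the true class and $\hat{y}$ is the predicted probability for each of the $C$ classes, i.e. the softmax of the logits. Note that `torch.nn.functional.cross_entropy` takes the raw logits and integer class labels, and applies the log-softmax internally.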

**Formula for a Single Sample:**

$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$

**Formula for Multiple Samples:**

$L = -\frac{1}{n} \sum_{j=1}^{n} \sum_{i=1}^{C} y_{j,i} \log(\hat{y}_{j,i})$


Cross entropy penalizes incorrect predictions more severely when the confidence in a wrong class is high.
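
To make the penalty concrete, the sketch below evaluates the loss for three hand-picked logit vectors where the true class is always 0. The logit values and the dictionary keys are illustrative.

```python
import torch
import torch.nn.functional as F

# True class is 0 in every case; only the logits change
target = torch.tensor([0])
cases = {
    "confident and right": torch.tensor([[5.0, 0.0, 0.0]]),
    "unsure": torch.tensor([[0.0, 0.0, 0.0]]),
    "confident and wrong": torch.tensor([[0.0, 5.0, 0.0]]),
}

losses = {name: F.cross_entropy(logits, target).item()
          for name, logits in cases.items()}
print(losses)
```

The uniform prediction costs $\log 3 \approx 1.10$, while putting most of the probability on a wrong class costs several times more.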


In [None]:
pred_y = model(x)
loss = torch.nn.functional.cross_entropy(pred_y, y)
print(loss)

tensor(1.0760, grad_fn=<NllLossBackward0>)
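
Cross entropy from scratch, as a sketch that follows the multi-sample formula: log-softmax over the logits, one-hot targets to select the true class, then a mean over samples. The random logits and labels are stand-ins for `model(x)` and `y`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

num_classes = 3
logits = torch.randn(20, num_classes)        # stand-in for model(x)
y = torch.randint(0, num_classes, (20,))     # integer class labels

def cross_entropy(logits, y):
    # Log-probabilities for every class, shape (n, C)
    log_probs = F.log_softmax(logits, dim=1)
    # One-hot targets pick out the log-probability of the true class
    one_hot = F.one_hot(y, num_classes=logits.shape[1]).float()
    return -(one_hot * log_probs).sum(dim=1).mean()

manual = cross_entropy(logits, y)
builtin = F.cross_entropy(logits, y)
print(manual, builtin)
```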
