## 1. Mean Squared Error (MSE)

Mean Squared Error is typically used for **regression** tasks. It measures the average of the squares of the errors, that is, the average squared difference between the estimated values ($\hat{y}$) and the actual values ($y$).

The formula for MSE is:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

### Implementation Task:
Implement a function `mean_squared_error` that:
1. Takes two NumPy arrays: `y_true` and `y_pred`.
2. Returns a single scalar representing the MSE.

In [2]:
import numpy as np

def mean_squared_error(y_true, y_pred):
    """
    Calculates the Mean Squared Error (MSE) between true labels and predictions.

    Args:
        y_true: Ground truth values.
        y_pred: Predicted values from the model.
    """
    # Return the mean of squared errors
    return np.mean((y_true - y_pred) ** 2)
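
As a quick sanity check, here is a minimal sketch that applies the same formula to a few made-up values (the arrays are hypothetical and chosen so the arithmetic is easy to follow by hand). The function is repeated so the snippet runs on its own.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # Same one-line definition as above, repeated so this cell is self-contained
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical values for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Errors:  0.5, -0.5, 0.0, -1.0
# Squares: 0.25, 0.25, 0.0, 1.0  ->  mean = 1.5 / 4 = 0.375
mse = mean_squared_error(y_true, y_pred)
print(mse)
```

Because the errors are squared, the single prediction that is off by 1.0 contributes twice as much as the two 0.5 errors combined.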

## 2. Binary Cross-Entropy (BCE)

Binary Cross-Entropy is the standard loss function for **binary classification**. It penalizes the model based on how far the predicted probability is from the actual label (0 or 1). 

If $y = 1$, the loss is $-\log(\hat{y})$. If $y = 0$, the loss is $-\log(1 - \hat{y})$.

While the loss for a single point is $L$, we minimize the average loss over $N$ samples (the Cost Function $J$):

$$J = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$

Averaging makes the loss independent of batch size: it does not double just because the batch size doubles.

### Implementation Task:
Implement a function `binary_cross_entropy` that:
1. Takes `y_true` (actual labels) and `y_pred` (predicted probabilities).
2. Uses a small epsilon ($\epsilon = 10^{-15}$) to clip `y_pred` to avoid `log(0)` errors.
3. Returns the average loss over all samples.

In [1]:
def binary_cross_entropy(y_true, y_pred):
    """
    Calculates the Binary Cross-Entropy (BCE) between true labels and predictions.

    Args:
        y_true: Ground truth values.
        y_pred: Predicted values from the model.
    """
    epsilon = 1e-15
    # np.clip limits each prediction to [epsilon, 1 - epsilon] so log() never receives 0
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -(y_true * np.log(y_pred) + (1-y_true) * np.log(1 - y_pred))
    # return the average loss over the batch
    return np.mean(loss)
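
To see why the clipping step matters, the following sketch compares the loss with and without it on two maximally wrong predictions. The inputs are made up for illustration, and the `epsilon` keyword argument is an assumption added here for convenience, not part of the task signature.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    # Same logic as above; epsilon exposed as a keyword argument (an assumption)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Hypothetical worst case: the model outputs exactly 0 and 1, both wrong
y_true = np.array([1.0, 0.0])
y_pred = np.array([0.0, 1.0])

# Without clipping, log(0) = -inf makes the loss infinite
with np.errstate(divide="ignore", invalid="ignore"):
    unclipped = np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

clipped = binary_cross_entropy(y_true, y_pred)

print("Without clipping:", unclipped)
print("With clipping:   ", clipped)
```

With clipping, the loss stays finite (roughly $-\log(10^{-15}) \approx 34.5$), so a single saturated prediction cannot turn the whole batch loss into `inf` or `nan`.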

## Why Use Cross-Entropy instead of MSE for Classification?

While Mean Squared Error (MSE) is the go-to for regression, it is rarely used for classification tasks involving probabilities. Here are the two primary reasons:

### 1. The Vanishing Gradient Problem (Learning Speed)
When using a Sigmoid activation function with MSE, the gradient becomes very small when the prediction is "very wrong" (e.g., predicting 0.99 for a label of 0). This leads to extremely slow convergence. Cross-Entropy's derivative cancels out the Sigmoid derivative's "flatness," ensuring a strong gradient when the error is high.

### 2. Non-Convexity
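For logistic regression, MSE combined with a Sigmoid output gives a **non-convex** loss surface in the weights, which can contain flat regions and local minima. Cross-Entropy, however, is **convex** in the weights, so gradient descent with a suitable step size converges toward the global minimum.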
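
The convexity claim can be checked numerically in a toy setting. The sketch below assumes a single sample with $x = 1$, $y = 1$, and no bias (all illustrative choices), evaluates both losses on a grid of weights, and uses second differences as a stand-in for curvature. A convex function never has a negative second difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup (assumed for illustration): one sample, x = 1, y = 1, no bias
w = np.linspace(-6.0, 6.0, 241)
y_hat = sigmoid(w * 1.0)

mse_loss = (1.0 - y_hat) ** 2   # squared error for y = 1
bce_loss = -np.log(y_hat)       # cross-entropy for y = 1

# Second differences approximate the curvature along the weight axis
mse_curvature = np.diff(mse_loss, n=2)
bce_curvature = np.diff(bce_loss, n=2)

print("MSE has negative curvature somewhere:", bool((mse_curvature < 0).any()))
print("BCE has negative curvature somewhere:", bool((bce_curvature < 0).any()))
```

In this setup the MSE curve bends downward for strongly negative weights, which is exactly the region where the prediction is confidently wrong, while the BCE curve bends upward everywhere on the grid.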

In [6]:
# Test Case: A "very wrong" prediction
y_true = np.array([1])
y_pred = np.array([0.01]) # The model is 99% sure it's class 0, but it's actually 1

mse_val = mean_squared_error(y_true, y_pred)
bce_val = binary_cross_entropy(y_true, y_pred)

print(f"MSE Loss: {mse_val:.4f}") 
# MSE is bounded; even a total failure results in a max loss of 1.0

print(f"BCE Loss: {bce_val:.4f}") 
# BCE is logarithmic; as the error approaches 1, the loss approaches infinity

MSE Loss: 0.9801
BCE Loss: 4.6052


### Mathematical Insight: Gradient Saturation

In classification, we usually use the **Sigmoid** activation function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The derivative of MSE with respect to the weights involves $\sigma'(z)$. As $\sigma(z)$ approaches 0 or 1 (the "flat" regions of the curve), $\sigma'(z)$ becomes nearly zero. This causes the weights to stop updating, a phenomenon known as **gradient saturation**.

![pic](../assets/Sigmoid_function_and_its_derivative.png)

Cross-Entropy solves this. When we take the derivative of the BCE loss with respect to the weights, the factor $\hat{y}(1 - \hat{y})$ in the denominator of the BCE derivative cancels the Sigmoid derivative. This leaves a gradient that is proportional to the error $(\hat{y} - y)$, meaning the model learns faster when it is further from the truth.

## Why BCE is Superior: Gradient Analysis

To understand why BCE is preferred over MSE for classification, we must look at the gradients with respect to the weights $w$ when using a **Sigmoid** activation function: $\hat{y} = \sigma(z)$, where $z = wx + b$.

The derivative of the Sigmoid function is:
$$\sigma'(z) = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})$$

### Case 1: Mean Squared Error (MSE)
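The loss for one sample is $L = \frac{1}{2}(y - \hat{y})^2$ (the factor $\frac{1}{2}$ is a convention that cancels the 2 produced by differentiation and does not change the minimizer). Using the chain rule: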
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$
$$\frac{\partial L}{\partial w} = -(y - \hat{y}) \cdot \sigma'(z) \cdot x$$
$$\frac{\partial L}{\partial w} = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot x$$

**The Problem (Gradient Vanishing):** If the prediction $\hat{y}$ is very close to $0$ or $1$ (even if it's the **wrong** prediction), the term $\hat{y}(1 - \hat{y})$ becomes extremely small (near $0$). This "kills" the gradient, and the model stops learning.
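
As a worked example, with numbers assumed purely for illustration, take a confidently wrong prediction with $y = 1$, $\hat{y} = 0.01$, and $x = 1$:

$$\frac{\partial L}{\partial w} = -(1 - 0.01) \cdot 0.01 \cdot (1 - 0.01) \cdot 1 = -0.009801$$

The raw error is $0.99$, yet the gradient magnitude is roughly one hundredth of that, because $\hat{y}(1 - \hat{y}) = 0.0099$.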

### Case 2: Binary Cross-Entropy (BCE)
The loss is $L = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]$. Using the chain rule:
$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$
$$\frac{\partial L}{\partial w} = -\left(\frac{y}{\hat{y}} - \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1 - \hat{y}) \cdot x$$
Combining the terms inside the parentheses over a common denominator:
$$\frac{\partial L}{\partial w} = -\left(\frac{y(1-\hat{y}) - \hat{y}(1-y)}{\hat{y}(1-\hat{y})}\right) \cdot \hat{y}(1 - \hat{y}) \cdot x$$
The denominator $\hat{y}(1-\hat{y})$ cancels the Sigmoid derivative, and the numerator simplifies to $y(1-\hat{y}) - \hat{y}(1-y) = y - \hat{y}$.

**The Result:**
$$\frac{\partial L}{\partial w} = (\hat{y} - y)x$$

**The Advantage:**
The gradient is purely proportional to the error $(\hat{y} - y)$. There is no "vanishing" term. If the error is large, the gradient is large, and the model learns quickly.
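
The two closed-form gradients can be compared directly, and checked against finite differences, with a short sketch. The values $x = 1$, $y = 1$, $w = -5$ and the helper names `half_mse` and `bce` are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_mse(w, x, y):
    # Per-sample loss from Case 1: 0.5 * (y - y_hat)^2
    return 0.5 * (y - sigmoid(w * x)) ** 2

def bce(w, x, y):
    # Per-sample loss from Case 2
    y_hat = sigmoid(w * x)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Assumed scenario: confidently wrong prediction (y_hat is about 0.0067, label is 1)
x, y, w = 1.0, 1.0, -5.0
y_hat = sigmoid(w * x)

# Closed-form gradients derived above
grad_mse = -(y - y_hat) * y_hat * (1 - y_hat) * x
grad_bce = (y_hat - y) * x

# Central finite differences as an independent check
h = 1e-6
num_mse = (half_mse(w + h, x, y) - half_mse(w - h, x, y)) / (2 * h)
num_bce = (bce(w + h, x, y) - bce(w - h, x, y)) / (2 * h)

print(f"MSE gradient: analytic {grad_mse:.6f}, numeric {num_mse:.6f}")
print(f"BCE gradient: analytic {grad_bce:.6f}, numeric {num_bce:.6f}")
```

In this scenario the BCE gradient is more than a hundred times larger in magnitude than the MSE gradient, even though both losses see the same prediction error.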

## Loss Functions as Maximum Likelihood Estimation Under Distributional Assumptions

In machine learning, loss functions are not chosen arbitrarily.

Most commonly used loss functions come from a single principle:

> A loss function is the negative log-likelihood of an assumed probability distribution for the target variable.

In other words:

$$
\text{Loss Function} = - \log p(y \mid X; \theta)
$$

When we choose a loss function, we are implicitly choosing a probability distribution for the data.


### Maximum Likelihood Under a Chosen Distribution

Suppose we assume that:

$$
y \sim p(y \mid X; \theta)
$$

We want to find parameters $\theta$ that make the observed data most likely.

Maximum Likelihood Estimation (MLE) solves:

$$
\theta^* = \arg\max_{\theta} \prod_{i=1}^{n} p(y_i \mid X_i; \theta)
$$

Taking the log:

$$
\theta^* = \arg\max_{\theta} \sum_{i=1}^{n} \log p(y_i \mid X_i; \theta)
$$

Converting to minimization:

$$
\theta^* = \arg\min_{\theta} - \sum_{i=1}^{n} \log p(y_i \mid X_i; \theta)
$$

This negative log-likelihood becomes the loss function.
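
A minimal sketch of this recipe, assuming the simplest possible model: a Bernoulli distribution with a single parameter $\theta$ and no features. The data and the grid search are illustrative choices; in practice the minimization is done with gradient-based optimizers.

```python
import numpy as np

# Hypothetical binary observations (6 ones out of 8)
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def negative_log_likelihood(theta, y):
    # -sum_i log p(y_i; theta) for a Bernoulli(theta) model
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Brute-force minimization over a grid of candidate parameters
thetas = np.linspace(0.01, 0.99, 99)
nll = np.array([negative_log_likelihood(t, y) for t in thetas])
theta_hat = thetas[np.argmin(nll)]

print("argmin of NLL:", theta_hat)
print("sample mean:  ", y.mean())
```

The minimizer of the negative log-likelihood lands on the sample mean, which is the known closed-form MLE for a Bernoulli parameter.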


### Why Do We Use MSE for Regression?

For regression problems, the target $y$ is continuous.

A natural assumption is:

$$
y \mid X \sim \mathcal{N}(\mu, \sigma^2)
$$

where:

$$
\mu = Xw
$$

The Gaussian density is proportional to:

$$
p(y \mid X) \propto 
\exp\left(
-\frac{(y - Xw)^2}{2\sigma^2}
\right)
$$

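Taking the negative log-likelihood of a single sample gives:

$$
\frac{(y - Xw)^2}{2\sigma^2} + \text{const}
$$

For a fixed $\sigma^2$, minimizing the sum of this expression over all samples is equivalent to minimizing Mean Squared Error: the two differ only by a positive scale factor and an additive constant.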

Therefore:

MSE = Maximum Likelihood under a Gaussian assumption

It is not arbitrary — it follows directly from assuming Gaussian noise.
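
The relationship can be verified numerically. The sketch below uses synthetic data (the weights, noise level, and sample size are arbitrary assumptions) and checks that the full Gaussian negative log-likelihood equals a positive multiple of MSE plus a constant that does not depend on the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (all values are illustrative assumptions)
n, sigma = 50, 0.7
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=sigma, size=n)

def gaussian_nll(w):
    # Full negative log-likelihood under y | X ~ N(Xw, sigma^2)
    resid = y - X @ w
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + resid**2 / (2 * sigma**2))

def mse(w):
    return np.mean((y - X @ w) ** 2)

# NLL(w) = scale * MSE(w) + const, where neither scale nor const depends on w
scale = n / (2 * sigma**2)
const = 0.5 * n * np.log(2 * np.pi * sigma**2)

w_try = np.array([0.3, 0.4])
print(gaussian_nll(w_try), scale * mse(w_try) + const)
```

Since `scale` is positive and `const` is fixed, any weights that minimize one quantity also minimize the other.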


### Why Do We Use BCE for Binary Classification?

For binary classification:

$$
y \in \{0,1\}
$$

A natural assumption is:

$$
y \mid X \sim \text{Bernoulli}(p)
$$

where:

$$
p = \sigma(Xw)
$$

The Bernoulli likelihood is:

$$
p(y \mid X) = p^y (1-p)^{1-y}
$$

Taking the negative log-likelihood gives:

$$
- \left[
y \log p + (1-y)\log(1-p)
\right]
$$

This is Binary Cross Entropy (BCE).

Therefore:

BCE = Maximum Likelihood under a Bernoulli assumption

Again, this is derived — not chosen randomly.
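
As a small check of this derivation, the sketch below computes the Bernoulli likelihood of a few made-up labels as a product, takes its negative log, and compares it with the averaged BCE. The labels and probabilities are hypothetical.

```python
import numpy as np

# Hypothetical labels and predicted probabilities
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.7, 0.4])

# Likelihood of the whole dataset: product of p^y * (1 - p)^(1 - y)
likelihood = np.prod(p**y * (1 - p) ** (1 - y))
nll = -np.log(likelihood)

# Binary cross-entropy averaged over samples
bce = np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

print("Negative log-likelihood:", nll)
print("n * BCE:                ", len(y) * bce)
```

The two numbers agree, so minimizing BCE is the same as maximizing the Bernoulli likelihood; the only difference is the $\frac{1}{n}$ averaging.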


### Most Common Loss Functions Come From the Exponential Family

Many standard machine learning loss functions arise from the Exponential Family of distributions.

| Distribution  | Loss Function      |
|--------------|-------------------|
| Gaussian     | MSE               |
| Bernoulli    | BCE               |
| Categorical  | Cross Entropy     |
| Poisson      | Poisson Loss      |
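
The last two rows of the table can be sketched the same way as MSE and BCE. The function names and inputs below are assumptions for illustration. The Poisson version keeps the $\log(y!)$ term, which many library implementations drop because it does not depend on the model.

```python
import numpy as np
from math import lgamma

def categorical_cross_entropy(y_onehot, probs):
    # Negative log-likelihood of a Categorical distribution, averaged over samples
    return -np.mean(np.sum(y_onehot * np.log(probs), axis=1))

def poisson_nll(y, rate):
    # Negative log-likelihood of Poisson(rate): rate - y * log(rate) + log(y!)
    log_factorial = np.array([lgamma(k + 1) for k in y])
    return np.mean(rate - y * np.log(rate) + log_factorial)

# Hypothetical 3-class example: two samples with one-hot labels
y_onehot = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
cce = categorical_cross_entropy(y_onehot, probs)

# Hypothetical count data with predicted rates
counts = np.array([0, 2, 3])
rates = np.array([0.5, 2.0, 2.5])
pnll = poisson_nll(counts, rates)

print("Categorical cross-entropy:", cce)
print("Poisson NLL:              ", pnll)
```

In the categorical case only the probability assigned to the true class contributes, which is why the loss reduces to the average of $-\log$ of that probability.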

These are not disconnected tricks.

They are unified under a single framework.

### This Framework Is Called Generalized Linear Models (GLM)

Generalized Linear Models (GLM) provide a unified way to:

1. Choose a probability distribution for the target
2. Define a linear predictor $Xw$
3. Use a suitable link function
4. Derive the loss via maximum likelihood

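Under a GLM with the canonical link function (identity for Gaussian, logit for Bernoulli, log for Poisson), the gradient of the summed negative log-likelihood is: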

$$
\nabla_w L = X^T(\hat{y} - y)
$$

This elegant structure explains why many seemingly different models share similar gradient forms.
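
The shared gradient form can be checked with finite differences. The sketch below assumes canonical links, unit variance for the Gaussian case, and summed (not averaged) losses. The data, the `models` dictionary, and the helper `numerical_grad` are all illustrative constructions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w = 0.5 * rng.normal(size=3)

# name -> (inverse link, per-sample NLL as a function of z = Xw, dropping terms constant in w)
models = {
    "gaussian":  (lambda z: z,                    lambda y, z: 0.5 * (y - z) ** 2),
    "bernoulli": (lambda z: 1 / (1 + np.exp(-z)), lambda y, z: np.log1p(np.exp(z)) - y * z),
    "poisson":   (lambda z: np.exp(z),            lambda y, z: np.exp(z) - y * z),
}
# Synthetic targets of the right type for each distribution
targets = {
    "gaussian":  rng.normal(size=20),
    "bernoulli": rng.integers(0, 2, size=20).astype(float),
    "poisson":   rng.poisson(2.0, size=20).astype(float),
}

def numerical_grad(f, w, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        g[j] = (f(w + step) - f(w - step)) / (2 * h)
    return g

max_errors = {}
for name, (inv_link, nll) in models.items():
    y = targets[name]
    loss = lambda w_, nll=nll, y=y: np.sum(nll(y, X @ w_))
    analytic = X.T @ (inv_link(X @ w) - y)   # the shared form X^T (y_hat - y)
    max_errors[name] = np.max(np.abs(analytic - numerical_grad(loss, w)))

for name, err in max_errors.items():
    print(f"{name:10s} max |analytic - numeric| = {err:.2e}")
```

The same line of code computes the analytic gradient for all three models; only the inverse link that maps $Xw$ to $\hat{y}$ changes.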

### Key Takeaway

We do not choose loss functions because their gradients are simple.

We choose them because they correspond to a probabilistic assumption about how data is generated.

The simplicity of the gradient is a consequence of mathematical structure — not the goal.