# 📉 Loss Functions: A Theoretical Guide

Loss functions are at the heart of machine learning. They tell us **how wrong a model's predictions are** and provide a direction for improving the model during training. This notebook explains the most commonly used **regression and classification loss functions** in a clear, conceptual way.

---

## 1. What Is a Loss Function?

A **loss function** maps a model's prediction and the true target value to a **single numerical value** representing error.

- **Lower loss** → better predictions
- **Higher loss** → worse predictions

During training, learning algorithms adjust model parameters to **minimize the loss**.

---

## 2. Regression Loss Functions

Regression problems deal with **continuous numerical outputs**, such as temperature, rainfall, or house prices.

---

## 2.1 Mean Squared Error (MSE)

### Definition
Mean Squared Error computes the **average of the squared differences** between predicted and actual values.

### Mathematical Expression
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

### Interpretation
- Squaring ensures all errors are positive.
- Larger errors are **penalized heavily**.
- Encourages predictions to be close to the true values.

### Advantages
- Smooth and differentiable.
- Works well with gradient-based optimization.

### Disadvantages
- Highly sensitive to **outliers**.
- A single large error can dominate the loss.
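
To make the outlier sensitivity concrete, here is a minimal sketch in plain Python. The data values and the helper name `mean_squared_error` are made up for illustration; the point is only how much of the total a single large residual can account for.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences (plain-Python sketch of the MSE formula)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions; every residual has magnitude 1.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 13.0, 15.0, 17.0]
mse_clean = mean_squared_error(y_true, y_pred)

# Same data, but the last prediction is now off by 10 instead of 1.
y_pred_outlier = [11.0, 13.0, 15.0, 26.0]
mse_outlier = mean_squared_error(y_true, y_pred_outlier)

# Fraction of the total squared error that comes from the single outlier.
outlier_share = (16.0 - 26.0) ** 2 / (mse_outlier * len(y_true))

print(mse_clean)      # 1.0
print(mse_outlier)    # 25.75
print(outlier_share)  # roughly 0.97
```

In this toy case one residual out of four contributes about 97% of the loss, so a model trained on MSE would be pulled strongly toward fixing that one point.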

### When to Use
- When large errors must be strongly penalized.
- Common in linear regression and neural networks.

---

## 2.2 Mean Absolute Error (MAE)

### Definition
Mean Absolute Error computes the **average absolute difference** between predictions and true values.

### Mathematical Expression
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

### Interpretation
- Measures error in the **same units** as the target variable.
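- Penalizes errors linearly: each additional unit of error adds the same amount to the loss, whether the error is small or large.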

### Advantages
- More robust to outliers than MSE.
- Easy to interpret.

### Disadvantages
- Not differentiable at zero.
- The gradient has constant magnitude regardless of error size, which can slow convergence near the minimum in gradient-based methods.
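
The per-sample derivatives with respect to the prediction show where both disadvantages come from. This is a rough sketch: the function names are hypothetical, and returning 0 at the kink is just one valid subgradient choice, not the only one.

```python
def squared_error_grad(y, y_hat):
    """Derivative of (y - y_hat)^2 with respect to y_hat."""
    return 2.0 * (y_hat - y)

def absolute_error_grad(y, y_hat):
    """Derivative of |y - y_hat| with respect to y_hat.

    Undefined at y == y_hat; 0 is returned there as one valid subgradient.
    """
    residual = y - y_hat
    if residual > 0:
        return -1.0
    if residual < 0:
        return 1.0
    return 0.0

y = 5.0
# As the prediction approaches the target, the squared-error gradient shrinks,
# while the absolute-error gradient stays at magnitude 1 until it hits the kink.
for y_hat in (0.0, 4.0, 4.9, 5.0):
    print(y_hat, squared_error_grad(y, y_hat), absolute_error_grad(y, y_hat))
```

With a fixed learning rate, the constant-size MAE step can overshoot back and forth around the target, which is why learning-rate decay is often paired with it.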

### When to Use
- When outliers are present.
- When interpretability is important.

---

## 2.3 Huber Loss

### Definition
Huber Loss combines the strengths of **MSE and MAE**.

- Behaves like **MSE** for small errors.
- Behaves like **MAE** for large errors.

### Mathematical Expression
$$
L_\delta(a) =
\begin{cases}
\frac{1}{2}a^2 & \text{if } |a| \le \delta \\
\delta(|a| - \frac{1}{2}\delta) & \text{if } |a| > \delta
\end{cases}
$$

where $ a = y - \hat{y} $.

### Advantages
- Less sensitive to outliers than MSE.
- Continuously differentiable everywhere, including at $ |a| = \delta $.

### Disadvantages
- Requires choosing the parameter $ \delta $.
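
The effect of $ \delta $ can be seen by evaluating the piecewise formula on one fixed residual. The sketch below uses a hypothetical helper `huber` and a made-up residual of 3; it is meant as an illustration of the formula, not a tuning recipe.

```python
def huber(a, delta):
    """Huber loss for a single residual a = y - y_hat (piecewise formula above)."""
    if abs(a) <= delta:
        return 0.5 * a ** 2                  # quadratic (MSE-like) branch
    return delta * (abs(a) - 0.5 * delta)    # linear (MAE-like) branch

residual = 3.0  # hypothetical large error
for delta in (0.5, 1.0, 3.0, 10.0):
    print(delta, huber(residual, delta))
# 0.5 1.375
# 1.0 2.5
# 3.0 4.5
# 10.0 4.5
```

A small $ \delta $ puts this residual on the linear branch and limits its influence; once $ \delta $ is at least as large as the residual, the loss is purely quadratic and behaves like (half of) the squared error.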

### When to Use
- When the data has some outliers but not extreme ones.

---

## 3. Classification Loss Functions

Classification problems involve **discrete class labels**.

---

## 3.1 0/1 Loss

### Definition
0/1 Loss assigns:
- **0** if prediction is correct
- **1** if prediction is incorrect

### Mathematical Expression
$$
L(y, \hat{y}) =
\begin{cases}
0 & \text{if } y = \hat{y} \\
1 & \text{otherwise}
\end{cases}
$$

### Interpretation
- Measures **accuracy-based error**.
- Ignores confidence of predictions.

### Limitations
- Piecewise constant: not differentiable where the decision flips, and zero gradient everywhere else.
- Cannot be used directly for gradient-based training.
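
A small sketch of why 0/1 loss gives no training signal: two hypothetical models make the same decisions with very different confidence and receive the same loss. The probabilities, the 0.5 threshold, and the helper names are assumptions for illustration.

```python
def zero_one(y_true, y_label):
    """Fraction of misclassified samples."""
    return sum(t != p for t, p in zip(y_true, y_label)) / len(y_true)

def to_labels(probs, threshold=0.5):
    """Turn predicted probabilities of class 1 into hard labels (assumed threshold)."""
    return [1 if p >= threshold else 0 for p in probs]

y_true = [1, 0, 1, 1]

# Two hypothetical models: identical decisions, very different confidence.
hesitant = [0.55, 0.45, 0.51, 0.40]
confident = [0.99, 0.01, 0.95, 0.05]

print(zero_one(y_true, to_labels(hesitant)))   # 0.25
print(zero_one(y_true, to_labels(confident)))  # 0.25
```

Nudging any probability without crossing the threshold leaves the loss unchanged, so there is nothing for a gradient step to follow.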

### When to Use
- For **evaluation**, not training.

---

## 3.2 Cross-Entropy (Log Loss)

### Definition
Cross-Entropy measures the difference between the **true probability distribution** and the **predicted probability distribution**.

### Binary Classification Formula
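$$
\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right]
$$

where $ y_i \in \{0, 1\} $ is the true label and $ p_i $ is the predicted probability that $ y_i = 1 $.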

### Multiclass (Softmax) Version
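$$
L = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)
$$

where, for a single sample, $ y_k $ is 1 for the true class and 0 otherwise, and $ \hat{y}_k $ is the predicted (softmax) probability of class $ k $.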
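
The multiclass formula can be sketched for one sample as follows. The logits and the one-hot target are made-up values, the function names are hypothetical, and the max-subtraction and `eps` guard are common numerical safeguards assumed here rather than part of the formula.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow in exp."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_onehot, probs, eps=1e-12):
    """-sum_k y_k * log(p_k) for one sample; eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, probs))

logits = [2.0, 1.0, 0.1]   # hypothetical scores for K = 3 classes
y_onehot = [1, 0, 0]       # true class is index 0

probs = softmax(logits)
loss = cross_entropy(y_onehot, probs)

print(probs)  # three probabilities that sum to 1
print(loss)   # equals -log(probs[0]) because the target is one-hot
```

Because the target is one-hot, only the probability assigned to the true class enters the loss.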

### Interpretation
- Strongly penalizes confident but wrong predictions.
- Encourages probabilistic correctness.
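
How strongly confident mistakes are penalized is easiest to see numerically. The sketch below evaluates the per-sample term $ -\log(p) $ for a positive example at a few illustrative probabilities; the helper name `true_class_penalty` is hypothetical.

```python
import math

def true_class_penalty(p):
    """Per-sample log loss when the true label is 1 and the model predicts probability p."""
    return -math.log(p)

for p in (0.9, 0.5, 0.1, 0.01):
    print(p, round(true_class_penalty(p), 3))
# 0.9 0.105
# 0.5 0.693
# 0.1 2.303
# 0.01 4.605
```

The penalty grows without bound as the probability assigned to the true class approaches zero.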

### Advantages
- Differentiable and convex (for logistic regression).
- Works well with probabilistic models.

### Disadvantages
- Sensitive to incorrect high-confidence predictions.

### When to Use
- Logistic regression
- Neural networks
- Softmax classifiers

---

## 4. Why Different Loss Functions Matter

| Problem Type | Preferred Loss | Reason |
|--------------|---------------|--------|
| Linear Regression | MSE | Penalizes large errors |
| Robust Regression | MAE / Huber | Handles outliers |
| Binary Classification | Log Loss | Probabilistic learning |
| Multiclass Classification | Softmax Cross-Entropy | Probability distribution |

---

## 5. Key Takeaways

- Loss functions define **what "learning" means**.
- The choice of loss affects:
  - Model behavior
  - Sensitivity to outliers
  - Training stability
- There is **no universal best loss**; it depends on the problem.

---

## 6. Conceptual Link to Optimization

Training algorithms (like Gradient Descent) do not directly maximize accuracy.  
Instead, they **minimize a loss function**, and improved accuracy is a result of that minimization.
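
As a rough illustration of this link, the sketch below runs plain gradient descent on the log loss of a one-feature logistic model and records accuracy along the way. The dataset, starting weights, learning rate, step count, and helper names are all assumptions chosen to keep the example small.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D dataset: the label is 1 for positive x.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

def log_loss_and_accuracy(w, b):
    """Mean binary log loss and accuracy (threshold 0.5) of the model sigmoid(w*x + b)."""
    probs = [sigmoid(w * x + b) for x in xs]
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, probs)) / len(xs)
    acc = sum((p >= 0.5) == (y == 1) for y, p in zip(ys, probs)) / len(xs)
    return loss, acc

w, b = -1.0, 0.0   # deliberately bad starting point
lr = 0.5           # assumed learning rate
history = [log_loss_and_accuracy(w, b)]

for step in range(50):
    probs = [sigmoid(w * x + b) for x in xs]
    # Gradient of the mean log loss for logistic regression.
    grad_w = sum((p - y) * x for x, y, p in zip(xs, ys, probs)) / len(xs)
    grad_b = sum(p - y for y, p in zip(ys, probs)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b
    history.append(log_loss_and_accuracy(w, b))

print("start (loss, accuracy):", history[0])
print("end   (loss, accuracy):", history[-1])
```

The update rule never looks at accuracy; it only follows the gradient of the log loss, yet accuracy on this toy data rises from 0 to 1 as the loss falls.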

---

This theoretical foundation will help you understand:
- Why certain models behave the way they do
- How training objectives shape predictions
- What loss function to choose for real-world problems

-----------

In [3]:
import numpy as np

# Regression loss functions (from scratch)
# Sample dataset
y_true = np.array([3, 5, 2.5, 9])
y_pred = np.array([2.5, 6, 4, 7])

# Mean Squared Error (MSE): average of the squared residuals
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

print("MSE:", mse(y_true, y_pred))

# Mean Absolute Error (MAE): average of the absolute residuals
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

print("MAE:", mae(y_true, y_pred))


MSE: 1.875
MAE: 1.25


### Huber Loss
Huber loss behaves like:
- MSE for small errors
- MAE for large errors


In [4]:
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small_error = np.abs(error) <= delta

    # Quadratic branch for |error| <= delta, linear branch otherwise
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * (np.abs(error) - 0.5 * delta)

    return np.mean(np.where(is_small_error, squared_loss, linear_loss))

print("Huber Loss:", huber_loss(y_true, y_pred))

Huber Loss: 0.78125


### Classification Loss Functions

In [5]:
# Sample Classification Data
y_true_class = np.array([1, 0, 1, 1])
y_pred_class = np.array([1, 1, 1, 0])

In [6]:
# 0/1 loss (simple classification loss): fraction of misclassified samples
def zero_one_loss(y_true, y_pred):
    incorrect = y_true != y_pred
    return np.mean(incorrect)

print("0/1 Loss:", zero_one_loss(y_true_class, y_pred_class))

0/1 Loss: 0.5


### Cross-Entropy / Log Loss (Binary Classification)

In [7]:
y_true_prob = np.array([1, 0, 1, 1])
y_pred_prob = np.array([0.9, 0.6, 0.2, 0.8])

In [12]:
def log_loss(y_true, y_pred):
    epsilon = 1e-9  # clip probabilities away from 0 and 1 to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)

    # Per-sample binary cross-entropy, then averaged over the samples
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(loss)

print("Log Loss:", log_loss(y_true_prob, y_pred_prob))

Log Loss: 0.7135581778200728
