# Loss and Cost Functions in Deep Learning

In deep learning, loss and cost functions are used to measure how well a model is performing. They help the model learn by telling it how far its predictions are from the actual correct answers.

## Loss Function

The loss function measures the error for a single training example. It calculates how far the model's prediction is from the actual target value.

Think of it as a way to tell the model, "This is how wrong you were for this one example."

### Example

Let’s say you’re predicting house prices. For one house:

- **Actual price**: $300,000
- **Model’s prediction**: $250,000

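The loss function (here squared error, the per-example term of Mean Squared Error) squares the difference between the actual and predicted values:

$$ \text{Loss} = (300{,}000 - 250{,}000)^2 = 50{,}000^2 = 2{,}500{,}000{,}000 $$

This tells the model, "You were off by $50,000 for this house," and squaring turns that error into a loss of 2,500,000,000.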
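The single-example calculation can be sketched in a few lines of Python. The function name `squared_error` is an illustrative choice, not something defined elsewhere in these notes.

```python
def squared_error(actual, predicted):
    """Squared-error loss for a single training example."""
    return (actual - predicted) ** 2

# One house: actual price 300,000, predicted price 250,000
loss = squared_error(300_000, 250_000)
print(loss)  # 2500000000
```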

## Cost Function

The cost function is the average loss over the entire dataset. It measures how well the model is performing across all training examples.

Think of it as the model’s overall "report card" for its predictions.

### Example

If you have 3 houses with the following predictions:

- **Actual**: 300,000, **Predicted**: 250,000 → **Loss**: 50,000² = 2,500,000,000
- **Actual**: 400,000, **Predicted**: 420,000 → **Loss**: 20,000² = 400,000,000
- **Actual**: 500,000, **Predicted**: 480,000 → **Loss**: 20,000² = 400,000,000

The cost function (average loss) would be:



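$$ \text{Cost} = \frac{2{,}500{,}000{,}000 + 400{,}000{,}000 + 400{,}000{,}000}{3} = \frac{3{,}300{,}000{,}000}{3} = 1{,}100{,}000{,}000 $$

This tells the model, "Averaged over all three houses, your squared error is 1,100,000,000." Taking the square root gives a typical error of about $33,166 per house.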
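As a rough sketch, the same cost can be computed by first collecting the per-example losses and then averaging them. The variable names here are illustrative.

```python
actual = [300_000, 400_000, 500_000]
predicted = [250_000, 420_000, 480_000]

# Per-example squared-error losses
losses = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# Cost = mean of the per-example losses
cost = sum(losses) / len(losses)

print(losses)  # [2500000000, 400000000, 400000000]
print(cost)    # 1100000000.0
```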

## Why Are They Important?

The goal of training a model is to minimize the cost function. This means making the model’s predictions as close as possible to the actual values.

During training, the model adjusts its parameters (weights and biases) to reduce the cost function, improving its predictions.
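To make the idea of "adjusting parameters to reduce the cost" concrete, here is a minimal gradient-descent sketch. Everything in it is an assumption made for illustration: a toy one-weight model `y = w * x`, a tiny hand-made dataset, and a learning rate of 0.05. Real deep learning models have many parameters and rely on a framework to compute gradients.

```python
import numpy as np

# Toy data generated by y = 2 * x (hypothetical)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w = 0.0               # single weight, no bias, starting from 0
learning_rate = 0.05  # illustrative value
costs = []

for step in range(100):
    predictions = w * x
    # MSE cost over the whole dataset
    cost = np.mean((predictions - y) ** 2)
    costs.append(cost)
    # Derivative of the MSE cost with respect to w
    gradient = np.mean(2 * (predictions - y) * x)
    # Move w a small step in the direction that lowers the cost
    w -= learning_rate * gradient

print(w)                    # close to 2.0
print(costs[0], costs[-1])  # the cost shrinks as training proceeds
```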

## Common Loss/Cost Functions

- **Mean Squared Error (MSE)**: Used for regression tasks (e.g., predicting house prices).
- **Cross-Entropy Loss**: Used for classification tasks (e.g., classifying images as cats or dogs).
- **Binary Cross-Entropy Loss**: Used for binary classification (e.g., spam or not spam).

## Summary

- **Loss function**: Measures error for one example.
- **Cost function**: Measures average error for the entire dataset.

Both help the model learn by quantifying how wrong its predictions are.


# What Are Mean Absolute Error, Mean Squared Error, and Log Loss (Binary Cross-Entropy)?

# Mean Absolute Error (MAE)

**What it is**: MAE measures the average absolute difference between the predicted and actual values.

**Formula**:



$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i | $$



- $y_i$ = actual value
- $\hat{y}_i$ = predicted value
- $n$ = number of examples

**Example**:
Suppose you’re predicting house prices (in thousands of dollars):

- Actual prices: [300, 400, 500]
- Predicted prices: [250, 420, 480]

Calculate absolute errors:

- |300 - 250| = 50
- |400 - 420| = 20
- |500 - 480| = 20



$$\text{MAE} = \frac{50 + 20 + 20}{3} = 30$$



**Interpretation**: On average, the model’s predictions are off by $30,000.

# Mean Squared Error (MSE)

**What it is**: MSE measures the average squared difference between the predicted and actual values. It penalizes larger errors more heavily.

**Formula**:



$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} ( y_i - \hat{y}_i )^2$$



**Example**:
Using the same house prices:

- Actual prices: [300, 400, 500]
- Predicted prices: [250, 420, 480]

Calculate squared errors:

- (300 - 250)² = 2500
- (400 - 420)² = 400
- (500 - 480)² = 400



$$ \text{MSE} = \frac{2500 + 400 + 400}{3} = 1100 $$



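**Interpretation**: The MSE is 1100 in squared units (here, thousands of dollars squared), which is hard to interpret directly. Taking the square root gives the root mean squared error (RMSE), √1100 ≈ 33.2, or about $33,200 in the original units. MSE is still useful for optimization because it is smooth and differentiable.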
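The arithmetic above can be checked with a short NumPy sketch. It assumes the prices are expressed in thousands of dollars and also computes the square root of the MSE (RMSE), which brings the error back to the original units.

```python
import numpy as np

actual = np.array([300, 400, 500])     # assumed to be in thousands of dollars
predicted = np.array([250, 420, 480])

mse_value = np.mean((actual - predicted) ** 2)
rmse_value = np.sqrt(mse_value)

print(mse_value)             # 1100.0
print(round(rmse_value, 2))  # 33.17
```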

# Log Loss (Binary Cross-Entropy)

**What it is**: Log Loss measures the performance of a binary classification model that outputs a probability for the positive class (label 1). It rewards confident, correct predictions and penalizes confident, incorrect predictions most heavily.

**Formula**:



$$\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} [ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) ]$$



- $y_i$ = actual label (0 or 1)
- $\hat{y}_i$ = predicted probability (between 0 and 1)

**Example**:
Suppose you’re predicting whether an email is spam (1) or not spam (0):

- Actual labels: [1, 0, 1]
- Predicted probabilities: [0.9, 0.2, 0.7]

Calculate the loss for each example (using the natural logarithm):

- For $y_1 = 1$, $\hat{y}_1 = 0.9$: $-\log(0.9) = 0.105$
- For $y_2 = 0$, $\hat{y}_2 = 0.2$: $-\log(1 - 0.2) = 0.223$
- For $y_3 = 1$, $\hat{y}_3 = 0.7$: $-\log(0.7) = 0.357$



$$ \text{Log Loss} = \frac{0.105 + 0.223 + 0.357}{3} = 0.228$$



**Interpretation**: A lower log loss means better predictions. For reference, a model that always predicts 0.5 scores $-\log(0.5) \approx 0.693$, so 0.228 is clearly better than guessing, while a perfect model would approach 0.
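The worked example can be reproduced with a few lines of NumPy. This is only a sketch: it assumes every predicted probability lies strictly between 0 and 1, so no clipping is needed. The variable names are illustrative.

```python
import numpy as np

labels = np.array([1, 0, 1])               # actual labels
probabilities = np.array([0.9, 0.2, 0.7])  # predicted probability of class 1

# Per-example binary cross-entropy (natural logarithm)
per_example = -(labels * np.log(probabilities)
                + (1 - labels) * np.log(1 - probabilities))
log_loss_value = per_example.mean()

print(np.round(per_example, 3))  # [0.105 0.223 0.357]
print(round(log_loss_value, 3))  # 0.228
```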

# Key Differences

| Metric | Use Case | Behavior | Example Interpretation |
|--------|----------|----------|------------------------|
| MAE    | Regression | Measures average absolute error | Predictions are off by $30,000 on average. |
| MSE    | Regression | Penalizes larger errors more | Predictions are off by 1100 squared units. |
| Log Loss | Binary Classification | Rewards confident, correct predictions | Lower log loss = better model performance. |

# When to Use Which?

- Use MAE if you want a simple, interpretable measure of error.
- Use MSE if you want to penalize larger errors more heavily (useful for optimization).
- Use Log Loss for binary classification problems where the model outputs probabilities and you want to measure how good those probabilities are.


# Why We Use Mean Absolute Error, Mean Squared Error, Log Loss or Binary Cross-Entropy

The choice of Mean Absolute Error (MAE), Mean Squared Error (MSE), or Log Loss (Binary Cross-Entropy) depends on the type of problem you're solving (regression or classification) and the specific behavior you want from your model. Let’s break it down:

## 1. Mean Absolute Error (MAE)

### Why Use It?
- **Interpretability**: MAE is easy to understand because it directly measures the average absolute difference between predictions and actual values.
- **Robustness to Outliers**: MAE is less sensitive to outliers compared to MSE because it doesn’t square the errors. This makes it a good choice when your data has noisy or extreme values.

### When to Use It?
Use MAE for regression problems where you want a simple, interpretable measure of error.

**Example**: Predicting house prices, where outliers (e.g., extremely expensive houses) shouldn’t dominate the error metric.
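The effect of a single outlier on each metric can be illustrated with a small experiment. The data below is made up: every prediction is off by exactly 1 until one target is replaced with an extreme value.

```python
import numpy as np

# Hypothetical data: every prediction is off by exactly 1
y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])

def mae_and_mse(y_true, y_pred):
    errors = y_true - y_pred
    return np.mean(np.abs(errors)), np.mean(errors ** 2)

clean_mae, clean_mse = mae_and_mse(y_true, y_pred)

# Replace the last target with an extreme value (its error becomes 99)
y_outlier = y_true.copy()
y_outlier[-1] = 112.0
outlier_mae, outlier_mse = mae_and_mse(y_outlier, y_pred)

print(clean_mae, clean_mse)      # 1.0 1.0
print(outlier_mae, outlier_mse)  # 20.6 1961.0
```

One extreme point raises MAE by a factor of about 20 but raises MSE by a factor of about 2000, which is why MSE is dominated by outliers far more than MAE.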

## 2. Mean Squared Error (MSE)

### Why Use It?
- **Penalizes Larger Errors**: MSE squares the errors, so larger errors contribute disproportionately to the total error. This encourages the model to focus on reducing large errors.
- **Useful for Optimization**: Many optimization algorithms (like gradient descent) work well with MSE because it’s differentiable and has a smooth curve.
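One way to see why squared error suits gradient-based optimization is to compare the gradient of each loss with respect to the error. The sketch below uses a few hand-picked error values for illustration.

```python
import numpy as np

errors = np.array([-2.0, -0.5, 0.5, 2.0])  # prediction minus target

# d/de (e^2) = 2e: the gradient shrinks smoothly as the error approaches 0
grad_squared = 2 * errors

# d/de |e| = sign(e): constant magnitude, and undefined at exactly e = 0
grad_absolute = np.sign(errors)

print(grad_squared)   # [-4. -1.  1.  4.]
print(grad_absolute)  # [-1. -1.  1.  1.]
```

With squared error, large mistakes produce large corrective steps and small mistakes produce small ones. With absolute error, every mistake pushes with the same strength regardless of its size.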

### When to Use It?
Use MSE for regression problems where you want to penalize large errors heavily.

**Example**: Predicting stock prices, where being wildly wrong is much worse than being slightly wrong.

## 3. Log Loss (Binary Cross-Entropy)

### Why Use It?
- **Probabilistic Interpretation**: Log Loss measures the performance of a model that outputs probabilities (e.g., the probability of an email being spam). It rewards confident and correct predictions while penalizing incorrect or uncertain ones.
- **Useful for Classification**: It’s specifically designed for binary classification problems, where the goal is to predict one of two classes (e.g., spam or not spam).

### When to Use It?
Use Log Loss for binary classification problems where the model outputs probabilities.

**Example**: Predicting whether a customer will churn (yes or no), where you want to measure how well the model’s predicted probabilities match the actual outcomes.
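To see how Log Loss treats confidence, consider a single example whose true label is 1 and vary the predicted probability. The probabilities below are arbitrary illustrative values.

```python
import numpy as np

# Predicted probability of the positive class when the true label is 1
probabilities = np.array([0.99, 0.6, 0.4, 0.01])
losses = -np.log(probabilities)

for p, loss in zip(probabilities, losses):
    print(f"p = {p:.2f} -> loss = {loss:.3f}")
# p = 0.99 -> loss = 0.010
# p = 0.60 -> loss = 0.511
# p = 0.40 -> loss = 0.916
# p = 0.01 -> loss = 4.605
```

A confident, correct prediction costs almost nothing, while a confident, wrong prediction is penalized far more than a hesitant one.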

## Key Differences and Use Cases

| Metric   | Use Case              | Key Behavior                                      | Example Use Case                                |
|----------|-----------------------|---------------------------------------------------|-------------------------------------------------|
| MAE      | Regression            | Measures average absolute error. Robust to outliers. | Predicting house prices with noisy data.        |
| MSE      | Regression            | Penalizes larger errors more heavily.             | Predicting stock prices where large errors are costly. |
| Log Loss | Binary Classification | Measures probabilistic accuracy. Rewards confident, correct predictions. | Predicting spam emails (yes/no).                |

## Why Not Use Just One Metric?
Each metric has its strengths and weaknesses, and the choice depends on:

- **Problem Type**: Regression vs. classification.
- **Data Characteristics**: Presence of outliers, noisy data, etc.
- **Model Behavior**: Whether you want large errors to count disproportionately (MSE) or every error to count in proportion to its size (MAE).
- **Output Type**: Probabilities (Log Loss) vs. direct values (MAE/MSE).

## Summary
- Use MAE for simple, interpretable regression tasks with potential outliers.
- Use MSE for regression tasks where large errors are particularly bad.
- Use Log Loss for binary classification tasks where the model outputs probabilities.

Choosing the right metric helps you optimize your model for the specific problem you’re solving.


In [11]:
import numpy as np

In [12]:
# Predicted values (e.g., from a model)
y_predicted = np.array([1, 1, 0, 0, 1])

# True values (e.g., actual labels)
y_true = np.array([0.30, 0.7, 1, 0, 0.5])

In [13]:
def mae(y_predicted, y_true):
    # Initialize the total error to 0
    total_error = 0
    
    # Loop over the predicted and true values
    for yp, yt in zip(y_predicted, y_true):
        # Calculate the absolute error for each prediction
        total_error += abs(yp - yt)
    
    # Print the total error
    print("Total error is:", total_error)
    
    # Calculate the mean absolute error
    mae = total_error / len(y_predicted)
    
    # Print the mean absolute error
    print("Mean absolute error is:", mae)
    
    # Return the mean absolute error
    return mae

In [14]:
mae(y_predicted, y_true)

Total error is: 2.5
Mean absolute error is: 0.5


0.5

# Implement the Same Thing More Easily with NumPy

In [15]:
np.abs(y_predicted - y_true)

array([0.7, 0.3, 1. , 0. , 0.5])

In [16]:
np.mean(np.abs(y_predicted-y_true))

0.5

In [17]:
def mae_np(y_predicted, y_true):
    """
    Calculate the Mean Absolute Error (MAE) between predicted and true values.

    MAE is a measure of errors between paired observations expressing the same phenomenon.

    Parameters:
    y_predicted (numpy.ndarray): The predicted values.
    y_true (numpy.ndarray): The true values.

    Returns:
    float: The mean absolute error between the predicted and true values.
    """
    # Calculate the mean absolute error
    return np.mean(np.abs(y_predicted-y_true))

In [18]:
mae_np(y_predicted, y_true)

0.5

# Implement Log Loss or Binary Cross Entropy

In [19]:
np.log([0])

array([-inf])

In [20]:
epsilon = 1e-15

In [21]:
np.log([1e-15])

array([-34.53877639])

In [22]:
y_predicted

array([1, 1, 0, 0, 1])

In [23]:
y_predicted_new = [max(i,epsilon) for i in y_predicted]
y_predicted_new

[1, 1, 1e-15, 1e-15, 1]

In [25]:
1-epsilon

0.999999999999999

In [26]:
y_predicted_new = [min(i,1-epsilon) for i in y_predicted_new]
y_predicted_new

[0.999999999999999, 0.999999999999999, 1e-15, 1e-15, 0.999999999999999]

In [28]:
y_predicted_new = np.array(y_predicted_new)
np.log(y_predicted_new)

array([-9.99200722e-16, -9.99200722e-16, -3.45387764e+01, -3.45387764e+01,
       -9.99200722e-16])

$$\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} [ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) ]$$

In [29]:
-np.mean(y_true*np.log(y_predicted_new)+(1-y_true)*np.log(1-y_predicted_new))

17.2696280766844

In [30]:
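def log_loss(y_true, y_predicted, epsilon=1e-15):
    # Clip predictions into [epsilon, 1 - epsilon] so that log(0) never occurs
    y_predicted_new = [max(i,epsilon) for i in y_predicted]
    y_predicted_new = [min(i,1-epsilon) for i in y_predicted_new]
    y_predicted_new = np.array(y_predicted_new)
    # Average the per-example cross-entropy terms and negate the result
    return -np.mean(y_true*np.log(y_predicted_new)+(1-y_true)*np.log(1-y_predicted_new))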

In [31]:
log_loss(y_true, y_predicted)

17.2696280766844
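The two clipping steps can also be done in one call with `np.clip`. The function below is a sketch under that assumption; the name `log_loss_np` and the default `epsilon` are illustrative choices. Unlike the arrays used above, the usage example follows the formula's convention that labels are 0 or 1 and predictions are probabilities.

```python
import numpy as np

def log_loss_np(y_true, y_predicted, epsilon=1e-15):
    """Binary cross-entropy with predictions clipped away from 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Keep predictions inside [epsilon, 1 - epsilon] so log() stays finite
    y_predicted = np.clip(np.asarray(y_predicted, dtype=float),
                          epsilon, 1 - epsilon)
    return -np.mean(
        y_true * np.log(y_predicted) + (1 - y_true) * np.log(1 - y_predicted)
    )

labels = np.array([1, 0, 1])
probabilities = np.array([0.9, 0.2, 0.7])
print(round(log_loss_np(labels, probabilities), 3))  # 0.228
```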

# Implement Mean Squared Error (MSE) in Two Ways

1. Without NumPy operations (i.e., using a plain Python loop)

2. With NumPy

Solution 1: Using a plain Python loop

In [2]:
import numpy as np

# Predicted values (e.g., from a model)
y_predicted = np.array([1, 1, 0, 0, 1])

# True values (e.g., actual labels)
y_true = np.array([0.30, 0.7, 1, 0, 0.5])

In [3]:
def mse(y_predicted, y_true):
    # Initialize the total squared error to 0
    total_error = 0
    
    # Loop over the predicted and true values
    for yp, yt in zip(y_predicted, y_true):
        # Add the squared error for each prediction
        total_error += (yp - yt)**2
    
    # Print the total squared error
    print("Total squared error is:", total_error)
    
    # Calculate the mean squared error
    mean_squared_error = total_error / len(y_predicted)
    
    # Print the mean squared error
    print("Mean squared error is:", mean_squared_error)
    
    # Return the mean squared error
    return mean_squared_error

In [6]:
mse(y_predicted, y_true)

Total squared error is: 1.83
Mean squared error is: 0.366


0.366

Solution 2: By using numpy

In [7]:
np.mean(np.square(y_true-y_predicted))

0.366
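For symmetry with `mae_np`, the one-liner can be wrapped in a small reusable function. This is a sketch; the name `mse_np` is an illustrative choice.

```python
import numpy as np

def mse_np(y_predicted, y_true):
    """Mean squared error between predicted and true values."""
    y_predicted = np.asarray(y_predicted, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.mean(np.square(y_predicted - y_true))

y_predicted = np.array([1, 1, 0, 0, 1])
y_true = np.array([0.30, 0.7, 1, 0, 0.5])
print(round(mse_np(y_predicted, y_true), 3))  # 0.366
```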