# Loss and Cost
In this notebook we discuss the loss and cost functions used when training a model.

Cost and loss are related but slightly different concepts.

1. **Cost Function**:
   - **Definition**: The cost function (also known as the objective function or the error function) quantifies the cost associated with the errors the model makes in its predictions.
   - **Purpose**: It is used during the training phase to guide the optimization algorithm in adjusting the model's parameters to minimize this cost.
   - **Methods**: Different types of cost functions can be used depending on the specific task and the nature of the data. Some common cost functions include Mean Squared Error (MSE), Cross-Entropy Loss, Hinge Loss, etc.
   - **Example**: In linear regression, the cost function is typically the Mean Squared Error (MSE), which calculates the average squared difference between the actual and predicted values.

2. **Loss Function**:
   - **Definition**: The loss function is a component of the cost function. It measures the difference between the predicted values of the model and the actual target values for a single data point.
   - **Purpose**: It provides feedback to the optimization algorithm during training, indicating how well the model is performing on individual data points.
   - **Methods**: Similar to cost functions, there are various types of loss functions used in different contexts. Some common loss functions include Mean Absolute Error (MAE), Binary Cross-Entropy Loss, Categorical Cross-Entropy Loss, etc.
   - **Example**: In binary classification tasks, the Binary Cross-Entropy Loss is often used as the loss function. It penalizes the model based on the divergence between the predicted probabilities and the actual binary labels.

In summary, the cost function measures the model's overall error across the entire dataset, while the loss function measures the error on an individual data point and is the building block of the cost function. In practice, the two terms are often used interchangeably. The choice of cost and loss functions depends on the specific problem being addressed and the characteristics of the data.
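
The distinction can be sketched in a few lines. This is a minimal illustration, not a framework implementation: the helper names `squared_error_loss` and `mse_cost` and the toy arrays are assumptions made up for this example. The loss returns one value per data point, and the cost averages those values into a single number.

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    """Per-sample loss: one value for each data point."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return (y_pred - y_true) ** 2

def mse_cost(y_true, y_pred):
    """Cost: the per-sample losses averaged over the whole dataset."""
    return squared_error_loss(y_true, y_pred).mean()

# Hypothetical toy data, chosen only for illustration
toy_true = np.array([3.0, 5.0, 2.0])
toy_pred = np.array([2.0, 5.0, 4.0])

per_sample = squared_error_loss(toy_true, toy_pred)  # array with one loss per point
cost = mse_cost(toy_true, toy_pred)                  # a single scalar

print("Per-sample losses:", per_sample)
print("Cost:", cost)
```

Here the per-sample losses are 1, 0 and 4, and the cost is their mean, 5/3.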

Although machine learning frameworks already provide implementations of these cost and loss functions, in this notebook we implement them ourselves to understand them better.

In [1]:
import numpy as np

In [2]:
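# Toy data shared by all examples below.
# Note: here y_predicted holds hard 0/1 values and y_true holds fractional values.
y_predicted = np.array([1,1,0,0,1])
y_true = np.array([0.30,0.7,1,0,0.5])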

<h3 style='color:Red'>Mean Absolute Error</h3>
Mean Absolute Error (MAE) is a common metric used in regression analysis to evaluate the performance of a predictive model. It measures the average absolute difference between the predicted values and the actual values in a dataset.

__Steps:__

1. For each data point in the dataset, calculate the absolute difference between the predicted value $(\hat{y}_i)$ and the actual value $(y_i)$.
   
   $$\text{Absolute Difference} = | \hat{y}_i - y_i |$$

2. Sum up all these absolute differences.

   $$\text{Sum of Absolute Differences} = \sum_{i=1}^{n} | \hat{y}_i - y_i |$$

3. Calculate the mean of these absolute differences by dividing the sum by the total number of data points $(n)$.

   $$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | \hat{y}_i - y_i |$$

MAE is useful because it provides a straightforward interpretation of the average prediction error in the same units as the target variable. For example, if you're predicting house prices, an MAE of \$10,000 would mean that, on average, your predictions are off by \$10,000.

__Advantages of using MAE__:

1. It's easy to understand and interpret.
2. It's less sensitive to outliers compared to other metrics like Mean Squared Error (MSE), which squares the differences.
3. It's useful when the distribution of the target variable is skewed or has outliers.

However, MAE does not penalize large errors as heavily as MSE does. Consequently, in some scenarios, such as when large errors are particularly undesirable, MSE might be a more appropriate metric.
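
The difference in outlier sensitivity can be seen on a small example. The sketch below uses hypothetical arrays and helper names (`mae_metric`, `mse_metric`) invented for this illustration. It compares both metrics on predictions whose errors are all 1, and again after one error grows to 10.

```python
import numpy as np

def mae_metric(y_true, y_pred):
    """Mean of the absolute errors."""
    return np.mean(np.abs(y_pred - y_true))

def mse_metric(y_true, y_pred):
    """Mean of the squared errors."""
    return np.mean((y_pred - y_true) ** 2)

# Hypothetical targets and predictions, for illustration only
demo_true = np.array([10.0, 12.0, 11.0, 13.0])
demo_pred = np.array([11.0, 11.0, 12.0, 12.0])          # every error has size 1
demo_pred_outlier = np.array([11.0, 11.0, 12.0, 23.0])  # last error has size 10

print("No outlier:   MAE =", mae_metric(demo_true, demo_pred),
      " MSE =", mse_metric(demo_true, demo_pred))
print("With outlier: MAE =", mae_metric(demo_true, demo_pred_outlier),
      " MSE =", mse_metric(demo_true, demo_pred_outlier))
```

With the single large error, MAE rises from 1.0 to 3.25, while MSE rises from 1.0 to 25.75, because the error of 10 contributes 100 to the sum of squares.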

In [3]:
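def mae(y_predicted, y_true):
    """Mean absolute error computed with an explicit loop."""
    total_error = 0
    for yp, yt in zip(y_predicted, y_true):
        # Steps 1 and 2: accumulate the absolute differences
        total_error += abs(yp - yt)
    print("Total error is:", total_error)

    # Step 3: divide by the number of data points
    # (named mean_error so it does not shadow the function name)
    mean_error = total_error / len(y_predicted)

    return mean_error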

In [4]:
print("Mean absolute error is:", mae(y_predicted, y_true))

Total error is: 2.5
Mean absolute error is: 0.5


**The same calculation with NumPy is much shorter**

In [5]:
def mae_np(y_predicted, y_true):
    return np.mean(np.abs(y_predicted-y_true))

In [6]:
mae_np(y_predicted, y_true)

0.5

<h3 style='color:red'>Log Loss or Binary Cross Entropy</h3>
Binary Cross-Entropy Loss, also known as Log Loss, is a common loss function used in binary classification tasks. It measures the discrepancy between the predicted probabilities output by the model and the actual binary labels of the data.

__Steps:__

1. For each data point in the dataset, calculate the cross-entropy loss using the predicted probability $( \hat{y}_i ) $ for the positive class and the actual binary label $( y_i )$.
   
   $$\text{Cross-Entropy Loss} = -\left( y_i \log(\hat{y}_i + \epsilon) + (1 - y_i) \log(1 - \hat{y}_i + \epsilon) \right)$$

   Where:
   - $ \hat{y}_i $ is the predicted probability for the positive class (typically obtained from a sigmoid function applied to the model's output).
   - $ y_i $ is the actual binary label (0 for the negative class, 1 for the positive class).
   - $\epsilon$ is a small positive constant, such as $10^{-7}$, added inside the logarithms to avoid evaluating $\log(0)$ when a predicted probability is exactly 0 or 1. The implementation below achieves the same effect by clipping $\hat{y}_i$ to the range $[\epsilon, 1 - \epsilon]$ instead.

2. Average the cross-entropy losses over all data points in the dataset to obtain the overall Binary Cross-Entropy Loss.

   $$ \text{Binary Cross-Entropy Loss} = \frac{1}{n} \sum_{i=1}^{n} \left( -\left( y_i \log(\hat{y}_i + \epsilon) + (1 - y_i) \log(1 - \hat{y}_i + \epsilon) \right) \right) $$

Binary Cross-Entropy Loss essentially quantifies how well the predicted probabilities match the true labels. If the predicted probability for the positive class $( \hat{y}_i )$ is close to 1 when the actual label $( y_i )$ is 1 and close to 0 when the actual label is 0, the loss will be low. Conversely, if the predicted probability diverges significantly from the true label, the loss will be high.

**Advantages of using Binary Cross-Entropy Loss:**

1. It penalizes confident wrong predictions heavily, which encourages the model to output probabilities close to 1 for positive examples and close to 0 for negative examples.
2. It provides a smooth and differentiable loss function, making it suitable for gradient-based optimization algorithms like stochastic gradient descent (SGD).

Binary Cross-Entropy Loss is commonly used in binary classification tasks, where the goal is to classify instances into one of two classes. It's worth noting that there are variations of cross-entropy loss for multi-class classification tasks, such as Categorical Cross-Entropy Loss.
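
To make the penalty on confident wrong predictions concrete, the sketch below evaluates the per-sample loss for a positive example ($y_i = 1$) at several predicted probabilities. The helper `bce_per_sample` and the probability values are assumptions made up for this illustration, and the helper clips the probabilities rather than adding $\epsilon$.

```python
import numpy as np

def bce_per_sample(y, p, eps=1e-7):
    """Per-sample binary cross-entropy, with p clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# True label is 1 for every sample; the predictions range from
# confident and right (0.99) to confident and wrong (0.01).
probs = np.array([0.99, 0.9, 0.5, 0.1, 0.01])
losses = bce_per_sample(np.ones_like(probs), probs)

for p, loss in zip(probs, losses):
    print(f"predicted p = {p:.2f}  ->  loss = {loss:.3f}")
```

The loss grows slowly while the prediction is on the correct side of 0.5 and rises steeply as the model becomes confidently wrong, since $-\log(\hat{y}_i)$ is unbounded as $\hat{y}_i$ approaches 0.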

In [7]:
def log_loss(y_true, y_predicted):
    epsilon = 1e-7
    # Clip predictions to [epsilon, 1 - epsilon] so that log(0) is never evaluated
    y_predicted_new = np.clip(y_predicted, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_predicted_new) + (1 - y_true) * np.log(1 - y_predicted_new))

In [8]:
log_loss(y_true, y_predicted)

8.059047875637068
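
The formula in the text adds $\epsilon$ inside the logarithms, whereas the cell above clips the predictions. As a rough sketch of the additive variant, the function below follows the formula literally. The names `log_loss_additive_eps`, `bce_targets` and `bce_outputs` are hypothetical, and the arrays simply repeat the notebook's data under different names.

```python
import numpy as np

def log_loss_additive_eps(y_true, y_predicted, epsilon=1e-7):
    """Binary cross-entropy with epsilon added inside each logarithm."""
    y_true = np.asarray(y_true, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    per_sample = -(y_true * np.log(y_predicted + epsilon)
                   + (1 - y_true) * np.log(1 - y_predicted + epsilon))
    return per_sample.mean()

# Same values as the notebook's data, under hypothetical names
bce_targets = np.array([0.30, 0.7, 1, 0, 0.5])
bce_outputs = np.array([1, 1, 0, 0, 1])

additive_result = log_loss_additive_eps(bce_targets, bce_outputs)
print(additive_result)
```

The two variants agree closely on this data: the difference is on the order of $\epsilon$, so either is a reasonable way to keep the logarithm finite.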

<h3 style='color:red'>Mean Squared Error</h3>
Mean Squared Error (MSE) is a common metric used in regression tasks to evaluate the performance of a predictive model. It measures the average of the squares of the differences between the predicted values and the actual values.


__Steps:__

1. For each data point in the dataset, calculate the squared difference between the predicted value $ ( \hat{y}_i ) $ and the actual value $( y_i )$.
   
   $$ \text{Squared Difference} = (\hat{y}_i - y_i)^2 $$

2. Sum up all these squared differences.

   $$ \text{Sum of Squared Differences} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

3. Calculate the mean of these squared differences by dividing the sum by the total number of data points $( n )$.

   $$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 $$

MSE provides a measure of the average squared deviation between the predicted values and the actual values. It's commonly used because it has several desirable properties, including being continuous, differentiable, and sensitive to large errors due to the squaring operation.

**Advantages of using MSE:**

1. It's simple to compute and understand. Note that its units are the square of the units of the target variable, so the raw value is less directly interpretable than MAE.
2. It penalizes larger errors more heavily than smaller errors, which can be desirable in many regression tasks.
3. It's useful for optimization algorithms that rely on gradients, such as gradient descent, as it provides smooth gradients.

However, MSE is sensitive to outliers since squaring large errors can disproportionately affect the overall error. In scenarios where outliers are present or where a more robust metric is required, alternative metrics such as Mean Absolute Error (MAE) might be preferred.
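
The point about gradients can be illustrated by differentiating each metric with respect to the predictions. For MSE the derivative is $\frac{2}{n}(\hat{y}_i - y_i)$, which scales with the size of the error. For MAE it is $\frac{1}{n}\,\text{sign}(\hat{y}_i - y_i)$, which has constant magnitude and is not defined at zero error. The sketch below uses hypothetical helper names and toy values.

```python
import numpy as np

def mse_grad(y_true, y_pred):
    """Gradient of MSE w.r.t. each prediction: 2 * (y_pred - y_true) / n."""
    return 2.0 * (y_pred - y_true) / y_true.size

def mae_grad(y_true, y_pred):
    """(Sub)gradient of MAE w.r.t. each prediction: sign(y_pred - y_true) / n."""
    return np.sign(y_pred - y_true) / y_true.size

# Hypothetical values: all targets are 0, so each prediction equals its error
grad_true = np.array([0.0, 0.0, 0.0, 0.0])
grad_pred = np.array([0.1, 1.0, -5.0, 10.0])

print("MSE gradient:", mse_grad(grad_true, grad_pred))
print("MAE gradient:", mae_grad(grad_true, grad_pred))
```

The MSE gradient shrinks as a prediction approaches its target and grows with large errors, which is also why a single outlier can dominate an update. The MAE gradient has the same magnitude for every nonzero error.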

In [9]:
def mse(y_true, y_predicted):
    # Steps 1 and 2: accumulate the squared differences
    sum_diff = 0
    for i in range(len(y_true)):
        sum_diff += (y_predicted[i] - y_true[i]) ** 2
    # Step 3: divide by the number of data points
    return sum_diff / len(y_true)

In [10]:
print("Mean Squared Error is:", mse(y_true, y_predicted))

Mean Squared Error is: 0.366


__The same implementation in NumPy:__


In [11]:
np.mean(np.square(y_predicted - y_true))

0.366

## Other Metrics
In addition to MAE, Log Loss, and MSE, several other evaluation metrics are commonly used in machine learning, particularly for regression tasks:

1. **Root Mean Squared Error (RMSE)**:
   - RMSE is the square root of the Mean Squared Error. It's useful because it's in the same unit as the target variable, making it easier to interpret.
   - $\text{RMSE} = \sqrt{\text{MSE}}$

2. **Mean Absolute Percentage Error (MAPE)**:
   - MAPE measures the average percentage difference between the predicted and actual values.
   - MAPE is calculated as the mean of the absolute percentage errors:
     $$ \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\% $$
   - MAPE is useful when you want to evaluate a model's error relative to the scale of the target variable. It is undefined when any actual value $y_i$ is zero.

3. **Coefficient of Determination (R-squared)**:
   - R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
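   - It is at most 1, with higher values indicating a better fit of the model to the data. It typically falls between 0 and 1, but it can be negative when the model fits worse than always predicting the mean of the target.
   - $R^2 = 1 - \frac{\text{MSE}}{\text{Var}(y)}$, where $\text{Var}(y)$ is the variance of the target variable (computed with a divisor of $n$).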

4. **Median Absolute Error (MedAE)**:
   - MedAE is the median of the absolute errors between the predicted and actual values.
   - It's less sensitive to outliers compared to mean-based metrics like MAE or MSE.

5. **Adjusted R-squared**:
   - Adjusted R-squared is a modified version of R-squared that penalizes model complexity by taking into account the number of predictors in the model.
   - It provides a more conservative estimate of the model's goodness of fit, particularly when dealing with multiple predictors.

6. **Mean Squared Logarithmic Error (MSLE)**:
   - MSLE measures the mean of the squared logarithmic differences between the predicted and actual values.
   - It's useful when the target variable spans several orders of magnitude.

These are just a few examples of evaluation metrics commonly used in regression tasks. The choice of metric depends on various factors, including the specific characteristics of the dataset, the problem domain, and the desired properties of the evaluation.
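
In the spirit of the earlier cells, the metrics listed above can be sketched with NumPy. Everything here is illustrative: the function names and toy arrays are made up for this example, Adjusted R-squared uses the common form $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$ with $p$ predictors, and MSLE assumes the $\log(1 + x)$ convention, which the text above does not specify.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent (undefined if any y_true is 0)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, which equals 1 - MSE / Var(y) with a divisor of n."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(y_true, y_pred, n_predictors):
    """R-squared penalized for the number of predictors (needs n > p + 1)."""
    n = y_true.size
    r2 = r_squared(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def medae(y_true, y_pred):
    """Median of the absolute errors."""
    return np.median(np.abs(y_pred - y_true))

def msle(y_true, y_pred):
    """Mean squared difference of log(1 + value); assumes values greater than -1."""
    return np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)

# Hypothetical regression data, for illustration only
reg_true = np.array([2.0, 4.0, 6.0, 8.0])
reg_pred = np.array([3.0, 4.0, 5.0, 8.0])

print("RMSE:       ", rmse(reg_true, reg_pred))
print("MAPE (%):   ", mape(reg_true, reg_pred))
print("R-squared:  ", r_squared(reg_true, reg_pred))
print("Adjusted R2:", adjusted_r_squared(reg_true, reg_pred, n_predictors=1))
print("MedAE:      ", medae(reg_true, reg_pred))
print("MSLE:       ", msle(reg_true, reg_pred))
```

For this toy data the errors are 1, 0, -1 and 0, so R-squared is 0.9 and MedAE is 0.5. The `n_predictors=1` argument is an arbitrary choice, since no actual model is fitted here.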