# 🚦 Loss functions and Gradient Calculations


1. Find the gradient of the following function at the point (2, 1): 
    
    $f(x, y) = x^3 - 2xy + y^2$

2. Find the derivative of the softmax function with respect to x_i: 
   
   $softmax(x_i) = \frac{e^{x_i}}{Σ_j e^{x_j}}$, 
   where $Σ_j$ denotes the sum over all $j$.

3. Find the derivative of the mean squared error loss function with respect to the predicted value ŷ: 
   
   $MSE(y, ŷ) = \frac{Σ_i (y_i - ŷ_i)^2}{n}$,
   where $n$ is the number of samples.

In [31]:
import numpy as np

### 1. Gradient of a Function

Given the function

$f(x, y) = x^3 - 2xy + y^2$

The gradient of the function is given by the partial derivatives of the function with respect to x and y.

$\nabla f(x, y) = [\frac{∂f}{∂x}, \frac{∂f}{∂y}]$


If we calculate the partial derivatives of the function, we get:

$\frac{∂f}{∂x} = 3x^2 - 2y$

$\frac{∂f}{∂y} = -2x + 2y$

Therefore, the gradient of the function is:

$\nabla f(x, y) = [3x^2 - 2y, -2x + 2y]$

Now, we can find the gradient of the function at the point (2, 1):

$\nabla f(2, 1) = [3(2)^2 - 2(1), -2(2) + 2(1)]$

$\nabla f(2, 1) = [12 - 2, -4 + 2]$

$\nabla f(2, 1) = [10, -2]$

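This means that at the point (2, 1), the function increases as x increases and decreases as y increases. The gradient vector [10, -2] points in the direction of steepest ascent; since its x-component is much larger in magnitude, that direction lies mostly along the positive x-axis.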

In [4]:
# Analytical gradient
point = np.array([2, 1])
df_dx = (3 * point[0] ** 2) - (2 * point[1])
df_dy = (-2 * point[0]) + (2 * point[1])
gradient = np.array([df_dx, df_dy])
print(gradient)

[10 -2]


In [29]:
# Numerical gradient


def fun(x, y):
    """Function for which to calculate the gradient."""
    return x**3 - 2 * x * y + y**2


def gradient(fun, x, y, h=1e-6):
    """
    Calculate the numerical gradient of a function at a point using
    the forward difference quotient (error is proportional to h).
    """
    df_dx = (fun(x + h, y) - fun(x, y)) / h
    df_dy = (fun(x, y + h) - fun(x, y)) / h
    return np.array([df_dx, df_dy])


def gradient_symmetric(fun, x, y, h=1e-6):
    """
    Calculate the numerical gradient of a function at a point
    using the symmetric difference quotient.
    """
    df_dx = (fun(x + h, y) - fun(x - h, y)) / (2 * h)
    df_dy = (fun(x, y + h) - fun(x, y - h)) / (2 * h)
    return np.array([df_dx, df_dy])

In [30]:
print(gradient(fun, x=2, y=1, h=1e-6))
print(gradient_symmetric(fun, x=2, y=1, h=1e-6))

[10.000006 -1.999999]
[10. -2.]
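
The two results above differ because the forward difference has an error proportional to $h$, while the symmetric difference has an error proportional to $h^2$. The sketch below illustrates this for $\frac{∂f}{∂x}$ at (2, 1). The helper names `forward_dx` and `central_dx` and the step sizes are assumptions chosen for illustration; they are not part of the cells above.

```python
import numpy as np


def f(x, y):
    """Same function as in the text: f(x, y) = x^3 - 2xy + y^2."""
    return x**3 - 2 * x * y + y**2


def forward_dx(x, y, h):
    """Forward-difference estimate of df/dx (hypothetical helper)."""
    return (f(x + h, y) - f(x, y)) / h


def central_dx(x, y, h):
    """Symmetric (central) difference estimate of df/dx (hypothetical helper)."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)


exact = 3 * 2**2 - 2 * 1  # analytical df/dx at (2, 1)

steps = [1e-1, 1e-2, 1e-3]  # illustrative step sizes
forward_errors = [abs(forward_dx(2.0, 1.0, h) - exact) for h in steps]
central_errors = [abs(central_dx(2.0, 1.0, h) - exact) for h in steps]

for h, fe, ce in zip(steps, forward_errors, central_errors):
    print(f"h={h:g}  forward error={fe:.2e}  central error={ce:.2e}")
```

For this particular cubic, expanding $(x + h)^3$ shows that the forward error at $x = 2$ is $6h + h^2$ and the symmetric error is $h^2$, so shrinking $h$ by a factor of 10 shrinks the first by roughly 10 and the second by 100.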


### 2. Derivative of the Softmax Function

The softmax function is used in machine learning to convert a vector of real numbers into a probability distribution. The softmax function is defined as:

$softmax(x_i) = \frac{e^{x_i}}{Σ_j e^{x_j}}$

To find the derivative of the softmax function with respect to $x_i$, we can use the quotient rule of differentiation.

Let $f(x_i) = e^{x_i}$ and $g(x) = Σ_j e^{x_j}$, where $x$ is the full input vector. The softmax function can then be written as:

$softmax(x_i) = \frac{f(x_i)}{g(x)}$

The derivative of the softmax function with respect to $x_i$ is given by:

$\frac{d}{dx_i} softmax(x_i) = \frac{g(x)f'(x_i) - f(x_i)g'(x)}{(g(x))^2}$

Where $f'(x_i)$ is the derivative of $e^{x_i}$ with respect to $x_i$ and $g'(x)$ is the derivative of $Σ_j e^{x_j}$ with respect to $x_i$.

The derivative of $e^{x_i}$ with respect to $x_i$ is simply $e^{x_i}$.

The derivative of $Σ_j e^{x_j}$ with respect to $x_i$ is $e^{x_i}$, as the sum is over all $j$ and $x_i$ is one of the terms in the sum.

Therefore, the derivative of the softmax function with respect to $x_i$ is:

$\frac{d}{dx_i} softmax(x_i) = \frac{Σ_j e^{x_j}e^{x_i} - e^{x_i}e^{x_i}}{(Σ_j e^{x_j})^2}$

$\frac{d}{dx_i} softmax(x_i) = \frac{e^{x_i}Σ_j e^{x_j} - e^{2x_i}}{(Σ_j e^{x_j})^2}$

$\frac{d}{dx_i} softmax(x_i) = \frac{e^{x_i}}{Σ_j e^{x_j}} - \frac{e^{2x_i}}{(Σ_j e^{x_j})^2}$

$\frac{d}{dx_i} softmax(x_i) = softmax(x_i) - softmax(x_i)^2 = softmax(x_i)(1 - softmax(x_i))$
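
The result can be checked numerically. The sketch below compares the derived formula with a central-difference estimate. It also builds the full Jacobian, whose off-diagonal entries $-softmax(x_i) softmax(x_k)$ for $k ≠ i$ are a standard result that the derivation above does not cover. The function names, the input vector, and the max-subtraction stability trick are assumptions made for this example.

```python
import numpy as np


def softmax(x):
    """Softmax of a 1-D array; subtracting the max is a common stability trick."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def softmax_jacobian(x):
    """Analytical Jacobian: J[i, k] = s_i * (delta_ik - s_k)."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)


def numerical_jacobian(func, x, h=1e-6):
    """Central-difference Jacobian, perturbing one input coordinate at a time."""
    n = x.size
    jac = np.zeros((n, n))
    for k in range(n):
        step = np.zeros(n)
        step[k] = h
        jac[:, k] = (func(x + step) - func(x - step)) / (2 * h)
    return jac


x = np.array([1.0, 2.0, 0.5])  # arbitrary example input
s = softmax(x)
analytic = softmax_jacobian(x)
numeric = numerical_jacobian(softmax, x)

print(np.diag(analytic))  # diagonal of the analytical Jacobian
print(s - s**2)           # formula derived above
print(np.allclose(analytic, numeric, atol=1e-8))
```

The diagonal of the Jacobian should match $softmax(x_i) - softmax(x_i)^2$. Each row sums to zero because the softmax outputs always sum to 1.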

### 3. Derivative of the Mean Squared Error Loss Function

The mean squared error (MSE) loss function is commonly used in regression problems to measure the average of the squares of the errors or residuals. The MSE loss function is defined as:

$MSE(y, ŷ) = \frac{Σ_i (y_i - ŷ_i)^2}{n}$

Where $y$ is the true value, $ŷ$ is the predicted value, and $n$ is the number of samples.

To find the derivative of the MSE loss function with respect to a single predicted value $ŷ_k$, we can use the chain rule of differentiation.

The chain rule states that for a function $f(x) = g(h(x))$, the derivative is $f'(x) = g'(h(x)) * h'(x)$.

In our case:

$g(u) = (1/n) * u$

$h(ŷ) = Σ_i (y_i - ŷ_i)^2$

The derivative of $g(u)$ with respect to $u$ is simply $1/n$.

Now, we need to find the partial derivative of $h(ŷ)$ with respect to $ŷ_k$.

First, we expand $h(ŷ)$:

$h(ŷ) = Σ_i (y_i - ŷ_i)^2 = Σ_i (y_i^2 - 2y_iŷ_i + ŷ_i^2)$

Only the term with $i = k$ depends on $ŷ_k$; every other term is constant with respect to $ŷ_k$, so its derivative is zero. Differentiating the remaining term using the power rule:

$\frac{∂}{∂ŷ_k} h(ŷ) = -2y_k + 2ŷ_k = 2(ŷ_k - y_k)$

Finally, given:

$g'(u) = (1/n)$

$\frac{∂h}{∂ŷ_k} = 2(ŷ_k - y_k)$

We can apply the chain rule:

$\frac{∂}{∂ŷ_k} MSE(y, ŷ) = g'(h(ŷ)) * \frac{∂h}{∂ŷ_k}$

$\frac{∂}{∂ŷ_k} MSE(y, ŷ) = (1/n) * 2(ŷ_k - y_k)$

$\frac{∂}{∂ŷ_k} MSE(y, ŷ) = \frac{2}{n}(ŷ_k - y_k)$

Collecting these partial derivatives for all $k$ gives the gradient with respect to the whole prediction vector:

$\nabla_{ŷ} MSE(y, ŷ) = \frac{2}{n}(ŷ - y)$

This result shows that each component of the gradient is the difference between the predicted value ($ŷ_k$) and the true value ($y_k$), multiplied by $2/n$. In the special case where a single scalar prediction $ŷ$ is shared by all samples, the contributions add up and the derivative becomes $\frac{2}{n} Σ_i (ŷ - y_i)$, which is twice the average difference.
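
As a sanity check, the sketch below compares the per-prediction gradient $\frac{2}{n}(ŷ_k - y_k)$ with a central-difference estimate. It also confirms that the sum of the components, $\frac{2}{n} Σ_i (ŷ_i - y_i)$, is the derivative obtained when every prediction is shifted by the same scalar amount. The function names and the sample data are made up for this example.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    return np.mean((y_true - y_pred) ** 2)


def mse_grad(y_true, y_pred):
    """Analytical gradient per prediction: (2/n) * (y_pred - y_true)."""
    return 2.0 * (y_pred - y_true) / y_true.size


def numerical_grad(y_true, y_pred, h=1e-6):
    """Central-difference gradient, perturbing one prediction at a time."""
    grad = np.zeros_like(y_pred)
    for k in range(y_pred.size):
        step = np.zeros_like(y_pred)
        step[k] = h
        grad[k] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2 * h)
    return grad


# Made-up example data
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

analytic = mse_grad(y_true, y_pred)
numeric = numerical_grad(y_true, y_pred)

# Derivative when every prediction is shifted by the same scalar amount
shared_shift = (mse(y_true, y_pred + 1e-6) - mse(y_true, y_pred - 1e-6)) / 2e-6

print(analytic)
print(np.allclose(analytic, numeric, atol=1e-8))
print(np.isclose(analytic.sum(), shared_shift, atol=1e-8))
```

Because the MSE is quadratic in each prediction, the central difference is exact up to floating-point rounding, so the two gradients should agree very closely.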


