When training a model, the goal is to find parameters, denoted as $\boldsymbol{\phi}$, that minimize a given loss function:

$$\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\mathrm{argmin}} \left[ L\left[\boldsymbol{\phi}\right] \right]
$$

Many optimization algorithms exist, but the standard methods for training neural networks are iterative: they start from parameters initialized heuristically and then repeatedly adjust them so that the loss decreases.

One of the simplest methods in this category is **gradient descent**. It begins with initial parameters $\boldsymbol{\phi} = [\phi_0, \phi_1, \ldots , \phi_N]^T$ and repeats the following two steps:

**Step 1.** Compute the derivatives of the loss with respect to the parameters:
$$\frac{\partial L}{\partial \boldsymbol{\phi}} = \begin{bmatrix}
\frac{\partial L}{\partial \phi_0} \\
\frac{\partial L}{\partial \phi_1} \\
\vdots \\
\frac{\partial L}{\partial \phi_N}
\end{bmatrix}
$$

**Step 2.** Update the parameters using the rule:
$$\boldsymbol{\phi} \leftarrow \boldsymbol{\phi} - \alpha \cdot \frac{\partial L}{\partial \boldsymbol{\phi}}$$

Here, the positive scalar $\alpha$, known as the **learning rate**, determines the size of the updates.
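
The two steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the implementation used in the rest of this notebook: the names `gradient_descent_step`, `toy_loss`, and `toy_grad` are hypothetical, and the quadratic toy loss was chosen only because its minimum at $\boldsymbol{\phi} = [3, -1]^T$ is known in advance.

```python
def gradient_descent_step(phi, grad_fn, alpha):
    """Apply one update: phi <- phi - alpha * dL/dphi."""
    grad = grad_fn(phi)                                  # Step 1: derivatives
    return [p - alpha * g for p, g in zip(phi, grad)]    # Step 2: update


# Hypothetical toy loss, for illustration only:
# L[phi] = (phi_0 - 3)^2 + (phi_1 + 1)^2, minimized at phi = [3, -1].
def toy_loss(phi):
    return (phi[0] - 3.0) ** 2 + (phi[1] + 1.0) ** 2


def toy_grad(phi):
    # Partial derivatives of toy_loss with respect to phi_0 and phi_1
    return [2.0 * (phi[0] - 3.0), 2.0 * (phi[1] + 1.0)]


phi = [0.0, 0.0]   # initial parameters
alpha = 0.1        # learning rate
for _ in range(100):
    phi = gradient_descent_step(phi, toy_grad, alpha)

print(phi, toy_loss(phi))
```

With this toy loss, each update shrinks the distance to the minimum by a factor of $1 - 2\alpha$, so a larger (but still stable) learning rate reaches the minimum in fewer steps.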

In [None]:
import tensorflow as tf

x = tf.constant([0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90], dtype=tf.float64)
y = tf.constant([0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6 ], dtype=tf.float64)

def compute_loss(x, y, phi0, phi1):
    # Compute the predicted values
    y_pred = phi0 + phi1 * x
    # Compute the squared differences
    squared_diffs = tf.square(y_pred - y)
    # Return the sum of the squared differences
    return tf.reduce_sum(squared_diffs)

compute_loss(x, y, 0.0, 1)

For a linear model with a single input, $h\left[x, \boldsymbol{\phi}\right] = \phi_0 + \phi_1 x$, the loss over the $m$ training examples, often called the least-squares loss, is:

$$ \begin{align*}
L\left[\boldsymbol{\phi}\right] &= \sum_{i=1}^{m} (h\left[x_i, \boldsymbol{\phi}\right] - y_i)^2 \\
&= \sum_{i=1}^{m} (\phi_0 + \phi_1 x_i - y_i)^2
\end{align*}
$$

Because the loss is a sum of per-example terms $l_i = (\phi_0 + \phi_1 x_i - y_i)^2$, its derivative with respect to the parameters is the sum of the derivatives of those terms:

$$
\frac{\partial L}{\partial \boldsymbol{\phi}} = \frac{\partial}{\partial \boldsymbol{\phi}} \sum_{i=1}^{m} l_i =  \sum_{i=1}^{m} \frac{\partial l_i}{\partial \boldsymbol{\phi}}
$$

These individual contributions are given by:

$$
\frac{\partial l_i}{\partial \boldsymbol{\phi}} =
\begin{bmatrix}
-2(y_i - (\phi_0 + \phi_1x_i)) \\
-2x_i(y_i - (\phi_0 + \phi_1x_i))
\end{bmatrix}
$$
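
One way to gain confidence in these expressions is to compare them against a central finite-difference approximation of the loss. The sketch below does this in plain Python; the helper names (`loss`, `analytic_gradient`, `numerical_gradient`) and the step size `eps` are assumptions made for this illustration, and the data are copied into new lists so the check runs on its own.

```python
# Same data as above, copied into plain lists so the check is self-contained
x_data = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y_data = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]


def loss(phi0, phi1):
    # Least-squares loss: sum of squared residuals
    return sum((phi0 + phi1 * xi - yi) ** 2 for xi, yi in zip(x_data, y_data))


def analytic_gradient(phi0, phi1):
    # Sum of the per-example derivatives given in the text
    d_phi0 = sum(-2.0 * (yi - (phi0 + phi1 * xi)) for xi, yi in zip(x_data, y_data))
    d_phi1 = sum(-2.0 * xi * (yi - (phi0 + phi1 * xi)) for xi, yi in zip(x_data, y_data))
    return d_phi0, d_phi1


def numerical_gradient(phi0, phi1, eps=1e-6):
    # Central differences: (L[phi + eps] - L[phi - eps]) / (2 * eps)
    d_phi0 = (loss(phi0 + eps, phi1) - loss(phi0 - eps, phi1)) / (2 * eps)
    d_phi1 = (loss(phi0, phi1 + eps) - loss(phi0, phi1 - eps)) / (2 * eps)
    return d_phi0, d_phi1


analytic = analytic_gradient(0.1, 0.1)
numerical = numerical_gradient(0.1, 0.1)
print("analytic: ", analytic)
print("numerical:", numerical)
```

The two results should agree up to floating-point rounding, since the loss is quadratic in the parameters and central differences are exact for quadratics apart from rounding error.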


In [None]:
def compute_gradients(x, y, phi0, phi1):
    # Compute the predicted values
    y_pred = phi0 + phi1 * x
    # Compute the derivatives of the loss with respect to phi0 and phi1
    d_phi0 = -2 * tf.reduce_sum(y - y_pred)
    d_phi1 = -2 * tf.reduce_sum((y - y_pred) * x)
    return d_phi0, d_phi1

def update_parameters(phi0, phi1, d_phi0, d_phi1, learning_rate):
    # Update phi0 and phi1
    phi0 = phi0 - learning_rate * d_phi0
    phi1 = phi1 - learning_rate * d_phi1
    return phi0, phi1

def gradient_descent(x, y, phi0, phi1, learning_rate, num_iterations):
    # Store the values of phi0, phi1, and the loss at each step
    phi0_values = [phi0]
    phi1_values = [phi1]
    loss_values = [compute_loss(x, y, phi0, phi1)]

    # Perform gradient descent
    for i in range(num_iterations):
        # Step 1: Compute the gradients
        d_phi0, d_phi1 = compute_gradients(x, y, phi0, phi1)

        # Step 2: Update phi0 and phi1 using the gradients
        phi0, phi1 = update_parameters(phi0, phi1, d_phi0, d_phi1, learning_rate)

        # Record the new values of phi0, phi1, and the loss (bookkeeping for plotting)
        phi0_values.append(phi0)
        phi1_values.append(phi1)
        loss_values.append(compute_loss(x, y, phi0, phi1))

    return phi0, phi1, phi0_values, phi1_values, loss_values

# Define the learning rate and the number of iterations
learning_rate = 0.003
num_iterations = 100

# Initialize phi0 and phi1
phi0 = 0.1
phi1 = 0.1

# Perform gradient descent
phi0, phi1, phi0_values, phi1_values, loss_values = gradient_descent(x, y, phi0, phi1, learning_rate, num_iterations)

# Print the final values of phi0 and phi1
print(f"Final phi0: {float(phi0):.4f}, Final phi1: {float(phi1):.4f}")
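
As a sanity check on the values printed above, the least-squares loss for a straight line also has a closed-form minimizer, obtained by setting both partial derivatives to zero. The sketch below computes it in plain Python; the variable names (`phi0_closed`, `phi1_closed`, `s_x`, and so on) are assumptions made for this illustration. With a suitable learning rate and enough iterations, gradient descent should approach these values. After only 100 iterations at a small learning rate, the iterates may still be some distance away.

```python
# Same data as above, copied into plain lists so the check is self-contained
x_data = [0.03, 0.19, 0.34, 0.46, 0.78, 0.81, 1.08, 1.18, 1.39, 1.60, 1.65, 1.90]
y_data = [0.67, 0.85, 1.05, 1.0, 1.40, 1.5, 1.3, 1.54, 1.55, 1.68, 1.73, 1.6]

m = len(x_data)
s_x = sum(x_data)
s_y = sum(y_data)
s_xx = sum(xi * xi for xi in x_data)
s_xy = sum(xi * yi for xi, yi in zip(x_data, y_data))

# Solve dL/dphi0 = 0 and dL/dphi1 = 0 for the slope and the intercept
phi1_closed = (m * s_xy - s_x * s_y) / (m * s_xx - s_x ** 2)
phi0_closed = (s_y - phi1_closed * s_x) / m

print(f"Closed-form phi0: {phi0_closed:.4f}, phi1: {phi1_closed:.4f}")
```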

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create a grid of phi0 and phi1 values
phi0_grid = np.linspace(0, 1, 100)
phi1_grid = np.linspace(0, 1, 100)
phi0_grid, phi1_grid = np.meshgrid(phi0_grid, phi1_grid)

# Compute the loss for each pair of phi0 and phi1 values
loss_grid = np.zeros_like(phi0_grid)
for i in range(phi0_grid.shape[0]):
    for j in range(phi0_grid.shape[1]):
        loss_grid[i, j] = compute_loss(x, y, phi0_grid[i, j], phi1_grid[i, j])

# Create a contour plot of the loss
plt.figure(figsize=(10, 8))
plt.contour(phi0_grid, phi1_grid, loss_grid, levels=np.logspace(0, 5, 35), cmap='jet')

# Plot the path of the gradient descent points
plt.plot(phi0_values, phi1_values, 'r', marker='o', markersize=5)

# Label the axes and show the plot
plt.xlabel(r'$\phi_0$')
plt.ylabel(r'$\phi_1$')
plt.show()