# Basic Mathematics and Automatic Differentiation

This section introduces some fundamental concepts of differential calculus as applied to
machine learning and shows how PyTorch computes gradients automatically through its
_autograd_ system. The objective is to connect the traditional mathematical formulation
(symbolic calculus) with its practical implementation in code, and to show how these
gradients appear in typical tasks such as linear regression, logistic regression, and
multiclass classification.

The central idea is as follows: a differentiable function is defined in terms of one or
more tensors created with `requires_grad=True`, a scalar value is computed from them, and
`backward()` is invoked on that scalar. PyTorch then traverses the computational graph it
recorded during the forward computation and calculates the partial derivatives of the
scalar output with respect to each differentiable input, storing them in the `.grad`
attribute of the corresponding tensors.
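
As a minimal sketch of this workflow, the snippet below inspects `.grad` before and after
`backward()` and prints the `grad_fn` node that ties the output to the recorded graph. The
function and values are illustrative choices, not taken from the examples in this section.

```python
import torch

# Leaf tensor tracked by autograd (illustrative values)
x = torch.tensor([1.0, 2.0], requires_grad=True)
print("x.grad before backward:", x.grad)  # nothing has been backpropagated yet

# Scalar output built from x: y = x1^2 + x2^2
y = (x**2).sum()
print("y.grad_fn:", y.grad_fn)  # graph node that produced y

# Backpropagate from the scalar; dy/dx = 2x is written into x.grad
y.backward()
print("x.grad after backward:", x.grad)
```

Before the call, `x.grad` is `None`; afterwards it holds one partial derivative per
component of `x`.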

## Gradient Calculation: PyTorch versus SymPy

To illustrate the parallelism between symbolic calculus and automatic differentiation,
consider the scalar function of two variables:

$$
f(x_1, x_2) = x_1^2 + 3 x_1 x_2 + x_2^2.
$$

In PyTorch, a tensor `x` with two components is defined and gradient tracking is
activated:

In [None]:
# 3pps
import sympy as sp
import torch


# Create input tensor with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Define the differentiable function: f(x1, x2) = x1^2 + 3*x1*x2 + x2^2
y = x[0] ** 2 + 3 * x[0] * x[1] + x[1] ** 2

# Calculate gradients
y.backward()

# Gradients with respect to each input
grad_x1 = x.grad[0]  # ∂f/∂x1
grad_x2 = x.grad[1]  # ∂f/∂x2

print("PyTorch gradients:")
print("Gradient ∂f/∂x1:", grad_x1)
print("Gradient ∂f/∂x2:", grad_x2)

PyTorch automatically constructs the operation graph that leads from `x` to `y` and, when
invoking `y.backward()`, calculates the partial derivatives ∂f/∂x₁ and ∂f/∂x₂ at the
specific point `x = [2, 3]`. These derivatives are stored in `x.grad`.

In parallel, the same function can be represented symbolically with SymPy:

In [None]:
# Define symbolic variables
x1, x2 = sp.symbols("x1 x2")

# Define the same function symbolically
f = x1**2 + 3 * x1 * x2 + x2**2

# Calculate symbolic derivatives
df_dx1 = sp.diff(f, x1)
df_dx2 = sp.diff(f, x2)

print("SymPy derivative formulas:")
print("∂f/∂x1 =", df_dx1)
print("∂f/∂x2 =", df_dx2)

# Evaluate derivatives at point (x1=2, x2=3)
grad_x1_sym = df_dx1.evalf(subs={x1: 2, x2: 3})
grad_x2_sym = df_dx2.evalf(subs={x1: 2, x2: 3})

print("SymPy symbolic gradients evaluated at (x1=2, x2=3):")
print("Gradient x1:", grad_x1_sym)
print("Gradient x2:", grad_x2_sym)

SymPy provides closed-form symbolic expressions for the derivatives, here
∂f/∂x₁ = 2x₁ + 3x₂ and ∂f/∂x₂ = 3x₁ + 2x₂, and allows evaluating them at specific points.
At (x₁, x₂) = (2, 3) they evaluate to 13 and 12, the same values PyTorch stores in
`x.grad`. Comparing both results shows that PyTorch's automatic differentiation matches
the analytical derivatives, which helps validate the implementation and clarifies the
relationship between theory and practice.
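
The comparison can also be made programmatically. The following sketch rebuilds both
computations in one self-contained cell and checks that they agree. Using `sp.lambdify`
to turn the symbolic gradient into a numeric function is one possible approach; the names
`grad_f`, `point`, and `symbolic` are introduced here only for illustration.

```python
import sympy as sp
import torch

# Symbolic side: derive the gradient once and compile it to a numeric function
x1, x2 = sp.symbols("x1 x2")
f = x1**2 + 3 * x1 * x2 + x2**2
grad_f = sp.lambdify((x1, x2), [sp.diff(f, x1), sp.diff(f, x2)])

# Autograd side: evaluate the same function at the same point
point = [2.0, 3.0]
x = torch.tensor(point, requires_grad=True)
y = x[0] ** 2 + 3 * x[0] * x[1] + x[1] ** 2
y.backward()

# Compare the two gradients numerically
symbolic = torch.tensor(grad_f(*point))
print("SymPy:  ", symbolic)
print("PyTorch:", x.grad)
print("Match:  ", torch.allclose(symbolic, x.grad))
```

A tolerance-based comparison such as `torch.allclose` is generally preferable to exact
equality, since for less simple functions the two routes may differ by rounding error.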

## Examples

Below are several simple examples that illustrate how PyTorch calculates derivatives in
different contexts: single-variable functions, multi-variable functions, chain rule
application, and simple linear and logistic models. These examples build intuition for
how the _autograd_ system tracks operations and applies the rules of differential
calculus.

In [None]:
# 3pps
import torch


# Example 1: Quadratic function
# y = x², dy/dx = 2x
x = torch.tensor(3.0, requires_grad=True)
y = x**2
y.backward()
print(f"y = x² | x={x.item()}, dy/dx={x.grad.item()}")

In this first case, the function is one-dimensional and simple. PyTorch automatically
applies the power rule for derivatives and obtains dy/dx = 2x, which at x = 3 gives 6.

With multiple variables, PyTorch calculates one partial derivative per variable:

In [None]:
# Example 2: Multiple variables
# z = 2a + 3b, dz/da = 2, dz/db = 3
a = torch.tensor(4.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
z = 2 * a + 3 * b
z.backward()
print(f"z = 2a + 3b | dz/da={a.grad.item()}, dz/db={b.grad.item()}")

Here, `a.grad` contains ∂z/∂a and `b.grad` contains ∂z/∂b, as expected from a linear
function in two variables.

The chain rule is applied implicitly when the function is composed of several
intermediate operations:

In [None]:
# Example 3: Chain rule
# y = (2x + 1)², dy/dx = 4(2x + 1)
x = torch.tensor(3.0, requires_grad=True)
y = (2 * x + 1) ** 2
y.backward()
print(f"y = (2x+1)² | x={x.item()}, dy/dx={x.grad.item()}")

In this case, PyTorch internally decomposes the function into elementary steps
(multiplication, addition, power) and combines their derivatives following the chain
rule, without the user needing to do it explicitly.
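
To make that decomposition visible, the sketch below names the intermediate value `u`
(a variable introduced here for illustration) and queries each factor of the chain rule
separately with `torch.autograd.grad`, which returns derivatives instead of writing them
to `.grad`.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Split y = (2x + 1)^2 into an inner and an outer step
u = 2 * x + 1  # inner function, du/dx = 2
y = u**2       # outer function, dy/du = 2u

# Query each factor separately; retain_graph keeps the graph for later calls
(dy_du,) = torch.autograd.grad(y, u, retain_graph=True)
(du_dx,) = torch.autograd.grad(u, x, retain_graph=True)
(dy_dx,) = torch.autograd.grad(y, x)

print(f"dy/du = {dy_du.item()}, du/dx = {du_dx.item()}")
print(f"dy/du * du/dx = {(dy_du * du_dx).item()}, dy/dx = {dy_dx.item()}")
```

At x = 3 the intermediate value is u = 7, so dy/du = 14 and du/dx = 2, and their product
matches the directly computed dy/dx = 4(2x + 1) = 28.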

## Linear Regression and Logistic Regression

Derivatives acquire a central role when working with linear and logistic models, as they
allow quantifying how the model output changes with small variations in the inputs or
parameters. The following examples show how PyTorch calculates gradients with respect to
inputs in simple configurations.

In a linear model with two features, with weights `w` and bias `b`, the output is:

$$
y = w_1 x_1 + w_2 x_2 + b,
$$

so the derivatives with respect to the inputs are ∂y/∂x₁ = w₁ and ∂y/∂x₂ = w₂:

In [None]:
# Example 4: Linear regression
# y = w·x + b, dy/dx = w
x = torch.tensor([2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0])
b = 2.0

y = w[0] * x[0] + w[1] * x[1] + b
y.backward()

print(f"Linear | dy/dx1={x.grad[0].item()}, dy/dx2={x.grad[1].item()}")

PyTorch reproduces these derivatives exactly: the gradient of the output with respect to
each component of `x` matches the corresponding weight. The same mechanism applies during
training, where gradients are taken with respect to the model parameters rather than the
inputs.
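
As a sketch of the parameter case, the roles can be swapped: the input is treated as
fixed data and gradient tracking is enabled on the weights and bias instead. Making `b` a
tensor and using `torch.dot` are choices made for this illustration, not part of the
example above.

```python
import torch

# Input is fixed data here; the parameters are the differentiable leaves
x = torch.tensor([2.0, 3.0])
w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor(2.0, requires_grad=True)

# Same linear model: y = w . x + b
y = torch.dot(w, x) + b
y.backward()

print("dy/dw:", w.grad)  # equals the input x
print("dy/db:", b.grad)  # equals 1
```

For a linear model, the gradient with respect to each weight is the corresponding input
feature, and the gradient with respect to the bias is 1.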

In the case of logistic regression, a sigmoid function is applied over the linear
combination:

$$
\begin{aligned}
z &= w_1 x_1 + w_2 x_2 + b,\\
y &= \sigma(z) = \frac{1}{1 + e^{-z}}.
\end{aligned}
$$

The derivative with respect to the inputs is given by the chain rule:
$\frac{\partial y}{\partial x_i} = \sigma'(z)\, w_i$, where
$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$. PyTorch handles this composition
automatically:

In [None]:
# Example 5: Logistic regression
# y = σ(w·x + b), dy/dx = σ'(z)·w
# Note: reuses w and b defined in Example 4
x = torch.tensor([2.0, 3.0], requires_grad=True)
z = w[0] * x[0] + w[1] * x[1] + b
y = torch.sigmoid(z)

y.backward()
print(f"Logistic | dy/dx1={x.grad[0].item():.4f}, dy/dx2={x.grad[1].item():.4f}")

The values contained in `x.grad` reflect the sensitivity of the predicted probability
with respect to each of the input features, and illustrate how the nonlinear activation
function (the sigmoid) affects the gradient.
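
The analytic formula σ'(z)·w can be checked against autograd directly. The following
self-contained sketch redefines `w` and `b` with the values from the linear example and
computes both gradients; the name `analytic` is introduced here for illustration.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0])
b = 2.0

# Forward pass and autograd gradient
z = torch.dot(w, x) + b
y = torch.sigmoid(z)
y.backward()

# Analytic gradient: sigma'(z) * w, with sigma'(z) = sigma(z) * (1 - sigma(z))
with torch.no_grad():
    analytic = y * (1 - y) * w

print("z =", z.item())
print("autograd:", x.grad)
print("analytic:", analytic)
```

With these values z = 0, so σ(z) = 0.5 and σ'(z) = 0.25, the maximum slope of the
sigmoid. The gradient is therefore 0.25·w = [0.125, −0.25].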

## Multiclass Classification

In multiclass classification tasks, it is common to use a linear layer followed by a
softmax function. The linear layer calculates a score or _logit_ for each class, and
softmax transforms these scores into probabilities that sum to 1. Below is a simple
example with three input features and three output classes.

Consider an input vector `x` and a weight matrix `W`, where each column of `W` can be
interpreted as the weight vector associated with a class. From them, the logits are
obtained as a matrix product and softmax is applied:

In [None]:
# 3pps
import torch
import torch.nn.functional as F


# Input features
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Weight matrix for 3 classes
W = torch.tensor(
    [
        [0.2, -0.5, 0.3],
        [0.4, 0.1, -0.2],
        [0.1, 0.3, 0.2],
    ],
    requires_grad=False,
)
b = torch.tensor([0.0, 0.0, 0.0])

# Linear scores for each class: logits = W^T x + b
logits = torch.matmul(x, W) + b  # shape [3]

# Apply Softmax to obtain probabilities
probs = F.softmax(logits, dim=0)

# Select the probability of the predicted class (the highest)
pred_class_idx = probs.argmax()
top_prob = probs[pred_class_idx]

# Calculate gradients with respect to the input
top_prob.backward()

print("Multiclass Classification | Probabilities:", probs.detach().numpy())
print("Predicted class index:", pred_class_idx.item())
print("Gradients inputs:", x.grad.detach().numpy())

In this example, `logits` is a one-dimensional tensor of size 3 containing the linear
score of each class. The `F.softmax` function transforms these logits into a probability
vector. Next, the probability of the class with the highest value (`top_prob`) is
selected and `backward()` is called to calculate the gradient of that probability with
respect to the input vector `x`.

The resulting values in `x.grad` indicate how the probability of the predicted class
would vary if each component of the input were slightly perturbed. This information can
be used, for example, to analyze the model's sensitivity to input features or as the
basis for explanation techniques and adversarial example generation.
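
The same gradient can be derived by hand. For softmax, the derivative of the probability
$p_k$ with respect to logit $j$ is $p_k(\delta_{kj} - p_j)$, and since the logits are
`x @ W`, the chain rule gives $\partial p_k / \partial x = W\,[p_k (e_k - p)]$, where
$e_k$ is the one-hot vector of class $k$. The sketch below checks this against autograd.
The bias is omitted because it is zero in the example above, and the names `one_hot`,
`dp_dlogits`, and `analytic` are introduced for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
W = torch.tensor(
    [
        [0.2, -0.5, 0.3],
        [0.4, 0.1, -0.2],
        [0.1, 0.3, 0.2],
    ]
)

# Autograd gradient of the top probability with respect to the input
probs = F.softmax(torch.matmul(x, W), dim=0)
k = probs.argmax()
probs[k].backward()

# Analytic gradient via the softmax Jacobian row and the chain rule
with torch.no_grad():
    one_hot = F.one_hot(k, num_classes=3).to(probs.dtype)
    dp_dlogits = probs[k] * (one_hot - probs)  # d p_k / d logits
    analytic = torch.matmul(W, dp_dlogits)     # back through logits = x @ W

print("autograd:", x.grad)
print("analytic:", analytic)
```

Both vectors should agree up to floating-point rounding, which confirms that `x.grad`
is the row of the softmax Jacobian for the predicted class, mapped back to the input
through `W`.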