110 changes: 110 additions & 0 deletions docs/machine-learning/mathematics-for-ml/calculus/chain-rule.mdx
@@ -0,0 +1,110 @@
---
title: "Chain Rule - The Engine of Backpropagation"
sidebar_label: Chain Rule
description: "Mastering the Chain Rule, the fundamental calculus tool for differentiating composite functions, and its direct application in the Backpropagation algorithm for training neural networks."
tags:
[
chain-rule,
calculus,
mathematics-for-ml,
backpropagation,
composite-functions,
neural-networks,
gradient,
]
---

The **Chain Rule** is a formula used to compute the derivative of a **composite function**, a function that is composed of one function inside another. If a function is built like a chain, the Chain Rule shows us how to differentiate it link by link.

This is arguably the most important calculus concept for Deep Learning, as the entire structure of a neural network is one massive composite function.

## 1. What is a Composite Function?

A composite function is one where the output of an inner function becomes the input of an outer function.

If $y$ is a function of $u$, and $u$ is a function of $x$, then $y$ is ultimately a function of $x$.

$$
y = f(u) \quad \text{where} \quad u = g(x)
$$

The overall composite function is $y = f(g(x))$.

## 2. The Chain Rule Formula (Single Variable)

The Chain Rule states that the rate of change of $y$ with respect to $x$ is the product of the rate of change of $y$ with respect to $u$, and the rate of change of $u$ with respect to $x$.

$$
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}
$$

### Example

Let $y = (x^2 + 1)^3$. This can be written as $y = u^3$ where $u = x^2 + 1$.

1. **Find $\frac{dy}{du}$ (Outer derivative):**
$$
\frac{dy}{du} = \frac{d}{du}(u^3) = 3u^2
$$
2. **Find $\frac{du}{dx}$ (Inner derivative):**
$$
\frac{du}{dx} = \frac{d}{dx}(x^2 + 1) = 2x
$$
3. **Apply the Chain Rule:**
$$
\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = (3u^2) \cdot (2x)
$$
4. **Substitute $u$ back:**
$$
\frac{dy}{dx} = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2
$$
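As a sanity check on the worked example, the sketch below compares the Chain Rule result $6x(x^2 + 1)^2$ with a central-difference approximation. The function names and the step size `h` are illustrative choices, not part of the original text.

```python
def y(x):
    """Composite function y = (x^2 + 1)^3."""
    return (x**2 + 1) ** 3


def dy_dx_chain_rule(x):
    """Derivative assembled link by link, as in steps 1 to 4."""
    u = x**2 + 1          # inner function u = g(x)
    dy_du = 3 * u**2      # outer derivative
    du_dx = 2 * x         # inner derivative
    return dy_du * du_dx


def dy_dx_numeric(x, h=1e-6):
    """Central-difference approximation of dy/dx (h is an assumed step size)."""
    return (y(x + h) - y(x - h)) / (2 * h)


x0 = 2.0
print(dy_dx_chain_rule(x0))  # 6 * 2 * (4 + 1)^2 = 300.0
print(dy_dx_numeric(x0))     # close to 300, up to floating-point error
```

The two values agree to several decimal places, which is the kind of "gradient check" often used to validate hand-derived derivatives.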

## 3. The Chain Rule with Multiple Variables (Partial Derivatives)

In neural networks, one variable can affect the final output through multiple different paths. This requires a slightly more complex version of the Chain Rule involving partial derivatives and summation.

If $z$ is a function of $x$ and $y$, and both $x$ and $y$ are functions of $t$: $z = f(x, y)$, where $x=g(t)$ and $y=h(t)$.

The total derivative of $z$ with respect to $t$ is:

$$
\frac{dz}{dt} = \frac{\partial z}{\partial x} \frac{dx}{dt} + \frac{\partial z}{\partial y} \frac{dy}{dt}
$$
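To make the summation over paths concrete, here is a small sketch using an assumed example, $z = x^2 y$ with $x = 2t$ and $y = t^3$. Substituting directly gives $z = 4t^5$, so $\frac{dz}{dt} = 20t^4$, and the two-path formula should produce the same number.

```python
def total_derivative(t):
    """dz/dt for z = x^2 * y, x = 2t, y = t^3, summed over both paths."""
    x, y = 2 * t, t**3
    dz_dx = 2 * x * y     # partial of z with respect to x
    dz_dy = x**2          # partial of z with respect to y
    dx_dt = 2             # derivative of x = 2t
    dy_dt = 3 * t**2      # derivative of y = t^3
    return dz_dx * dx_dt + dz_dy * dy_dt


def direct_derivative(t):
    """Derivative of z = 4 t^5 obtained by substituting first."""
    return 20 * t**4


t0 = 1.5
print(total_derivative(t0), direct_derivative(t0))  # both give 101.25
```

Each term in the sum corresponds to one path from $t$ to $z$: one through $x$ and one through $y$.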

## 4. The Chain Rule and Backpropagation

Backpropagation (short for "backward propagation of errors") is the algorithm used to compute the gradients needed to train neural networks. It is the repeated application of the multivariate Chain Rule, organized so that intermediate derivatives are reused from layer to layer.

### The Neural Network Chain

A neural network is a composition of layer functions, each feeding its output to the next:

$$
\text{Loss} \leftarrow \text{Output Layer} \leftarrow \text{Hidden Layer 2} \leftarrow \text{Hidden Layer 1} \leftarrow \text{Input}
$$

The goal is to calculate how a small change in a parameter (weight $w$) in an **early layer** affects the final **Loss** $J$.

$$
\frac{\partial J}{\partial w_{\text{early}}} = \left(\frac{\partial J}{\partial \text{Output}}\right) \cdot \left(\frac{\partial \text{Output}}{\partial \text{Layer } 2}\right) \cdot \left(\frac{\partial \text{Layer } 2}{\partial \text{Layer } 1}\right) \cdot \left(\frac{\partial \text{Layer } 1}{\partial w_{\text{early}}}\right)
$$

:::important Backpropagation Flow
1. **Forward Pass:** Calculate the prediction and the Loss $J$.
2. **Backward Pass (Backprop):** Start at the end of the chain (the Loss $J$) and calculate the partial derivatives (gradients) layer by layer, multiplying them backward toward the input.
3. **Update:** Use the final calculated gradient $\frac{\partial J}{\partial w}$ to update the weight $w$ via Gradient Descent.
:::
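The three steps can be sketched on a deliberately tiny network. Everything below is an illustrative assumption rather than a reference implementation: one scalar input, one hidden unit $h = \tanh(w_1 x)$, a linear output $\hat{y} = w_2 h$, the squared-error loss $J = (\hat{y} - y)^2$, and arbitrary starting weights.

```python
import math


def forward(w1, w2, x, y_true):
    """Forward pass: returns the hidden activation, prediction and loss."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = (y_hat - y_true) ** 2
    return h, y_hat, loss


def backward(w1, w2, x, y_true):
    """Backward pass: multiply local derivatives from the loss toward the input."""
    h, y_hat, _ = forward(w1, w2, x, y_true)
    dJ_dyhat = 2 * (y_hat - y_true)    # loss -> output
    dJ_dw2 = dJ_dyhat * h              # output -> w2
    dJ_dh = dJ_dyhat * w2              # output -> hidden activation
    dJ_dz = dJ_dh * (1 - h**2)         # derivative of tanh is 1 - tanh^2
    dJ_dw1 = dJ_dz * x                 # pre-activation -> w1
    return dJ_dw1, dJ_dw2


# Assumed toy values for the weights, data point and learning rate.
w1, w2, x, y_true, alpha = 0.5, -0.3, 1.0, 1.0, 0.1

loss_before = forward(w1, w2, x, y_true)[2]
g1, g2 = backward(w1, w2, x, y_true)
w1_new, w2_new = w1 - alpha * g1, w2 - alpha * g2   # gradient descent update
loss_after = forward(w1_new, w2_new, x, y_true)[2]

print(loss_before, loss_after)  # the loss decreases after one update
```

Note how `dJ_dyhat` is computed once and reused for both weights; this reuse of upstream derivatives is what makes the backward pass efficient.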

## 5. Summary of Calculus for ML

You have now covered the three foundational concepts of Calculus required for Machine Learning:

| Concept | Mathematical Tool | ML Application |
| :--- | :--- | :--- |
| **Derivatives** | $\frac{df}{dx}$ | Measures the slope of the loss function. |
| **Partial Derivatives** | $\nabla J$ (The Gradient) | Gives the direction of steepest ascent on the high-dimensional loss surface; Gradient Descent moves the opposite way. |
| **Chain Rule** | $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$ | Propagates the gradient backward through all layers of a neural network to calculate parameter updates. |

---

With the mathematical foundations of Linear Algebra and Calculus established, we are now ready to tackle the core optimization algorithm that brings these concepts together: Gradient Descent.
100 changes: 100 additions & 0 deletions docs/machine-learning/mathematics-for-ml/calculus/derivatives.mdx
@@ -0,0 +1,100 @@
---
title: "Derivatives - The Rate of Change"
sidebar_label: Derivatives
description: "An introduction to derivatives, their definition, rules, and their crucial role in calculating the slope of the loss function, essential for optimization algorithms like Gradient Descent."
tags:
[
derivatives,
calculus,
mathematics-for-ml,
rate-of-change,
slope,
gradient-descent,
optimization,
]
---

Calculus is the mathematical foundation for optimization in Machine Learning. Specifically, **Derivatives** are the primary tool used to train almost every ML model, from Linear Regression to complex Neural Networks, via algorithms like Gradient Descent.

## 1. What is a Derivative?

The derivative of a function measures the **instantaneous rate of change** of that function. Geometrically, the derivative at any point on a curve is the **slope of the tangent line** to the curve at that point.

### Formal Definition

The derivative of a function $f(x)$ with respect to $x$ is defined using limits:

$$
f'(x) = \frac{dy}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
$$

* $\frac{dy}{dx}$ is the common (Leibniz) notation for the same quantity, with $y = f(x)$, read as "the derivative of $y$ with respect to $x$."
* The expression $\frac{f(x+h) - f(x)}{h}$ is the slope of the secant line between $x$ and $x+h$.
* Taking the limit as $h$ approaches zero gives the exact slope of the tangent line at $x$.
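The limit can be watched numerically. The sketch below uses an assumed example, $f(x) = x^2$ at $x = 3$, where the secant slope works out to exactly $6 + h$ and therefore approaches the true derivative $6$ as $h$ shrinks.

```python
def f(x):
    """Example function; its exact derivative is 2x."""
    return x**2


def secant_slope(func, x, h):
    """Slope of the secant line between x and x + h (the difference quotient)."""
    return (func(x + h) - func(x)) / h


for h in [1.0, 0.1, 0.01, 0.001]:
    # Each slope is roughly 6 + h, so the values approach 6.
    print(h, secant_slope(f, 3.0, h))
```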

## 2. Derivatives in Machine Learning: Optimization

In Machine Learning, we define a **Loss Function** (or Cost Function) $J(\theta)$ which measures the error of our model, where $\theta$ represents the model's parameters (weights and biases).

The goal of training is to find the parameter values $\theta$ that **minimize** the loss function.

### A. Finding the Minimum

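1. For a differentiable function, a minimum (or maximum) in the interior of the domain can only occur where the slope is zero.
2. The derivative tells us the slope.
3. Therefore, solving $\frac{dJ}{d\theta} = 0$ gives the candidate optimal parameters $\theta$. Each candidate must still be checked, because a zero slope can also mark a maximum or a flat inflection point.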

### B. Gradient Descent

For most ML models, the equation $\frac{dJ}{d\theta} = 0$ has no closed-form solution, or solving it directly is too expensive. Instead, we use an iterative process called **Gradient Descent**.

The derivative $\frac{dJ}{d\theta}$ tells us two things:

* **Magnitude:** How steep the slope is (how quickly the loss is changing).
* **Direction (Sign):** Whether moving parameter $\theta$ in a positive direction will increase or decrease the loss.

In Gradient Descent, we update the parameter $\theta$ in the **opposite direction** of the derivative (down the slope) to find the minimum:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \alpha \frac{dJ}{d\theta}
$$

* $\alpha$ (alpha) is the **learning rate** (a small scalar).
* $\frac{dJ}{d\theta}$ is the derivative (the slope/gradient).

## 3. Basic Differentiation Rules

You must be familiar with the following rules to understand how derivatives are calculated for model training.

| Rule Name | Function $f(x)$ | Derivative $\frac{d}{dx}f(x)$ | Example |
| :--- | :--- | :--- | :--- |
| **Constant Rule** | $c$ | $0$ | $\frac{d}{dx}(5) = 0$ |
| **Power Rule** | $x^n$ | $nx^{n-1}$ | $\frac{d}{dx}(x^3) = 3x^2$ |
| **Constant Multiple** | $c \cdot f(x)$ | $c \cdot f'(x)$ | $\frac{d}{dx}(4x^2) = 8x$ |
| **Sum/Difference** | $f(x) \pm g(x)$ | $f'(x) \pm g'(x)$ | $\frac{d}{dx}(x^2 - 3x) = 2x - 3$ |
| **Exponential** | $e^x$ | $e^x$ | $\frac{d}{dx}(2e^x) = 2e^x$ |

### Example: Quadratic Loss

Linear Regression often uses Mean Squared Error (MSE), which is a quadratic function of the weights $w$.

Let the simplified loss function be $J(w) = w^2 + 4w + 1$.
We apply the Sum and Power Rules:

$$
\frac{dJ}{dw} = \frac{d}{dw}(w^2) + \frac{d}{dw}(4w) + \frac{d}{dw}(1) = 2w + 4 + 0 = 2w + 4
$$

If the current weight is $w=1$, the slope is $2(1) + 4 = 6$. The slope is positive, so increasing $w$ increases the loss, and Gradient Descent would therefore decrease $w$.
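Putting the derivative to work, the following sketch runs Gradient Descent on this same loss, starting from $w = 1$. The learning rate of `0.1` and the step count are assumed values chosen for illustration. Setting $2w + 4 = 0$ gives the exact minimizer $w = -2$, where $J(-2) = -3$, so the iterates should approach that point.

```python
def J(w):
    """Simplified quadratic loss from the example."""
    return w**2 + 4 * w + 1


def dJ_dw(w):
    """Derivative obtained with the Sum and Power Rules."""
    return 2 * w + 4


def gradient_descent(w, alpha=0.1, steps=100):
    """Repeatedly step against the slope; alpha and steps are assumed values."""
    for _ in range(steps):
        w = w - alpha * dJ_dw(w)
    return w


w_final = gradient_descent(1.0)
print(w_final, J(w_final))  # approaches w = -2 and J = -3
```

The first update moves from $w = 1$ to $w = 1 - 0.1 \cdot 6 = 0.4$, and each later step shrinks the distance to $-2$ by a factor of $0.8$.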

## References and Resources

To solidify your understanding of differentiation, here are some excellent resources:

* **[Khan Academy - Differential Calculus](https://www.khanacademy.org/math/differential-calculus):** Comprehensive video tutorials covering limits, derivatives, and rules. Excellent for visual learners.
* **Calculus: Early Transcendentals** by James Stewart (or any similar major textbook): Provides rigorous definitions and practice problems.
* **The Calculus of Computation** by Lars Kristensen: A good resource that connects calculus directly to computational methods.

---

Most functions in ML depend on more than one parameter (e.g., $w_1, w_2, \text{bias}$). To find the slope in these multi-variable spaces, we must use Partial Derivatives.
93 changes: 93 additions & 0 deletions docs/machine-learning/mathematics-for-ml/calculus/gradients.mdx
@@ -0,0 +1,93 @@
---
title: "Gradients - The Direction of Steepest Ascent"
sidebar_label: Gradients
description: "Defining the Gradient vector, its mathematical composition from partial derivatives, its geometric meaning as the direction of maximum increase, and its role as the central mechanism for learning in Machine Learning."
tags:
[
gradients,
calculus,
mathematics-for-ml,
gradient-descent,
vector-calculus,
optimization,
loss-function,
]
---

The **Gradient** is the ultimate expression of calculus in Machine Learning. It is the vector that consolidates all the partial derivatives of a multi-variable function (like our Loss Function) and points in the direction the function is increasing most rapidly.

Understanding the gradient is essential because the primary optimization algorithm, **Gradient Descent**, simply involves moving in the direction *opposite* to the gradient.

## 1. Defining the Gradient Vector

The gradient of a scalar-valued function $f$ of several variables ($\theta_1, \theta_2, \dots, \theta_n$) is a **vector** that contains all the function's partial derivatives.

### Notation

The gradient of a function $J(\mathbf{\theta})$ (our Loss Function, $J$) with respect to the parameter vector $\mathbf{\theta}$ is denoted by the $\nabla$ symbol (nabla or del):

$$
\nabla J(\mathbf{\theta}) \quad \text{or} \quad \nabla_{\mathbf{\theta}} J
$$

### Composition

If the loss function $J$ depends on $n$ parameters, $\mathbf{\theta} = (\theta_1, \theta_2, \dots, \theta_n)$, the gradient is the $n$-dimensional vector:

$$
\nabla J(\mathbf{\theta}) = \begin{bmatrix}
\frac{\partial J}{\partial \theta_1} \\
\frac{\partial J}{\partial \theta_2} \\
\vdots \\
\frac{\partial J}{\partial \theta_n}
\end{bmatrix}
$$
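A gradient can be approximated one component at a time by nudging a single parameter and holding the rest fixed. The sketch below does this with central differences; the example loss $J(\theta) = \theta_1^2 + 3\theta_1\theta_2 + 2\theta_2^2$ and the helper name `numerical_gradient` are assumptions made for illustration.

```python
def J(theta):
    """Assumed example loss with two parameters."""
    t1, t2 = theta
    return t1**2 + 3 * t1 * t2 + 2 * t2**2


def analytic_gradient(theta):
    """Vector of partial derivatives computed by hand."""
    t1, t2 = theta
    return [2 * t1 + 3 * t2, 3 * t1 + 4 * t2]


def numerical_gradient(func, theta, h=1e-6):
    """Approximate each partial derivative with a central difference."""
    grad = []
    for i in range(len(theta)):
        plus, minus = list(theta), list(theta)
        plus[i] += h      # nudge only parameter i upward
        minus[i] -= h     # nudge only parameter i downward
        grad.append((func(plus) - func(minus)) / (2 * h))
    return grad


theta = [1.0, 2.0]
print(analytic_gradient(theta))      # [8.0, 11.0]
print(numerical_gradient(J, theta))  # approximately [8, 11]
```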

## 2. Geometric Meaning

The gradient $\nabla J$ has two crucial geometric properties:

1. **Direction:** It points in the direction of the **steepest increase** (the fastest way uphill) on the function's surface.
2. **Magnitude (Length):** The length of the gradient vector, $\|\nabla J\|$, is the rate of increase in that direction, so it measures how **steep** the surface is at that point.
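Both properties can be checked with a brute-force sweep. The sketch below assumes the example $f(x, y) = x^2 + 3y^2$ at the point $(1, 1)$, where the gradient is $(2, 6)$. It computes the slope along unit directions at every whole degree and reports which direction is steepest.

```python
import math

# Gradient of the assumed example f(x, y) = x^2 + 3y^2 at the point (1, 1).
grad = (2.0, 6.0)


def directional_derivative(gradient, angle_deg):
    """Slope of f along the unit vector at the given angle: gradient . u."""
    a = math.radians(angle_deg)
    return gradient[0] * math.cos(a) + gradient[1] * math.sin(a)


# Try every whole-degree direction and keep the steepest one.
best_angle = max(range(360), key=lambda d: directional_derivative(grad, d))
best_slope = directional_derivative(grad, best_angle)

grad_angle = math.degrees(math.atan2(grad[1], grad[0]))  # direction of the gradient
grad_norm = math.hypot(*grad)                            # length of the gradient

print(best_angle, grad_angle)   # the steepest sampled direction matches the gradient's angle
print(best_slope, grad_norm)    # the steepest slope is close to the gradient's length
```

The sweep lands on the whole degree nearest to the gradient's own direction, and the slope found there is essentially $\|\nabla f\|$.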

## 3. The Central Role in Gradient Descent

Since the goal of training an ML model is to **minimize** the Loss Function $J(\mathbf{\theta})$, we must adjust the parameters $\mathbf{\theta}$ to move *downhill*.

Locally, the fastest way downhill is to move in the exact opposite direction of the gradient.

### A. The Update Rule

The Gradient Descent update rule formalizes this movement:

$$
\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \nabla J(\mathbf{\theta}_{\text{old}})
$$

| Term | Role in Optimization | Calculation |
| :--- | :--- | :--- |
| $\mathbf{\theta}_{\text{old}}$ | Current position (weights/biases). | Vector of current model parameters. |
| $\alpha$ (Alpha) | **Learning Rate** (a small scalar). | Hyperparameter defining the step size. |
| $\nabla J(\mathbf{\theta})$ | **Gradient** of the Loss. | Vector of all partial derivatives. |
| $-\nabla J(\mathbf{\theta})$ | **Negative Gradient**. | The direction of steepest descent (downhill). |

### B. Convergence

As the parameters approach the minimum (the "valley floor"), the slope of the Loss Function flattens.

* At the minimum point, the Loss is flat, so all partial derivatives are zero.
* Therefore, the gradient $\nabla J$ is the zero vector ($\mathbf{0}$).
* The update step becomes $\mathbf{\theta}_{\text{new}} = \mathbf{\theta}_{\text{old}} - \alpha \cdot \mathbf{0}$. The parameters stop changing, and the model has converged.
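The update rule and the convergence argument can be combined into one short loop. This is a minimal sketch under stated assumptions: the loss $J(\theta) = (\theta_1 - 1)^2 + 2(\theta_2 + 3)^2$, the learning rate, and the tolerance `tol` are all illustrative. Because floating-point gradients rarely reach exactly zero, the loop stops once the gradient norm drops below `tol`.

```python
import math


def grad_J(theta):
    """Gradient of the assumed loss J = (t1 - 1)^2 + 2 * (t2 + 3)^2."""
    t1, t2 = theta
    return [2 * (t1 - 1), 4 * (t2 + 3)]


def gradient_descent(theta, alpha=0.1, tol=1e-8, max_steps=1000):
    """Apply theta_new = theta_old - alpha * grad until the gradient is tiny."""
    for step in range(max_steps):
        grad = grad_J(theta)
        if math.hypot(*grad) < tol:      # gradient is numerically zero
            return theta, step
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta, max_steps


theta_final, steps_used = gradient_descent([0.0, 0.0])
print(theta_final, steps_used)  # theta approaches [1, -3], the minimum of J
```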

## 4. Analogy: Descending a Mountain

Imagine being blindfolded on a vast mountain range (the Loss Surface). Your goal is to reach the valley floor (the minimum loss).

* **You can't see the whole mountain:** You only know your local height and slope (your current loss $J(\mathbf{\theta})$).
* **The Gradient ($\nabla J$):** A guide who tells you, "The fastest way to go **up** from here is to take 3 steps North and 1 step East."
* **Gradient Descent:** You listen to the guide and then do the exact **opposite**, taking 3 steps South and 1 step West.
* **Learning Rate ($\alpha$):** Determines if your step size is a cautious hop or a giant leap.

---

The Gradient is the core concept uniting all the calculus concepts we've covered. It moves the model from an initial, poor starting position to an optimal, converged solution.
84 changes: 84 additions & 0 deletions docs/machine-learning/mathematics-for-ml/calculus/hessian.mdx
@@ -0,0 +1,84 @@
---
title: "The Hessian Matrix"
sidebar_label: Hessian
description: "Understanding the Hessian matrix, second-order derivatives, and how the curvature of the loss surface impacts optimization and model stability."
tags:
[
hessian,
calculus,
mathematics-for-ml,
optimization,
second-order-derivatives,
curvature,
]
---

While the **Gradient** tells us the direction of the steepest slope, it doesn't tell us about the "shape" of the ground. Is the slope getting steeper or flatter? Are we in a narrow canyon or a wide, shallow bowl? To answer these questions, we need second-order derivatives, organized into the **Hessian Matrix**.

## 1. What is the Hessian?

The Hessian is a square matrix of **second-order partial derivatives** of a scalar-valued function. It describes the local **curvature** of the function.

If we have a function $f(x_1, x_2, \dots, x_n)$, the Hessian $\mathbf{H}$ is an $n \times n$ matrix:

$$
\mathbf{H} = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \dots \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \dots \\
\vdots & \vdots & \ddots
\end{bmatrix}
$$

:::info Symmetry
If the second derivatives are continuous, the Hessian is a **symmetric matrix** (i.e., $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$). This makes it easier to work with using Linear Algebra tools like Eigen-decomposition.
:::

## 2. Why does the Hessian matter in ML?

The Hessian helps us understand the "topography" of the Loss Function $J(\theta)$.

### A. Determining Maxima and Minima
The gradient only tells us if the slope is zero ($\nabla J = 0$), but that could be a peak, a valley, or a saddle point. The Hessian tells us which one:
* **Positive Definite Hessian:** The surface curves upward in all directions (a **Local Minimum**).
* **Negative Definite Hessian:** The surface curves downward in all directions (a **Local Maximum**).
* **Indefinite Hessian:** The surface curves up in some directions and down in others (a **Saddle Point**).
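For a symmetric $2 \times 2$ Hessian, definiteness can be read off from its two eigenvalues, which have a closed form. The sketch below is one possible way to code the three cases; the function names are hypothetical, and it adds a fourth outcome, "inconclusive", for the case where an eigenvalue is zero and the second-derivative test cannot decide.

```python
import math


def eigenvalues_2x2_symmetric(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b**2)
    return mean - radius, mean + radius


def classify_critical_point(a, b, c, tol=1e-12):
    """Classify a point where the gradient is zero, given its Hessian entries."""
    low, high = eigenvalues_2x2_symmetric(a, b, c)
    if low > tol:
        return "local minimum"      # positive definite
    if high < -tol:
        return "local maximum"      # negative definite
    if low < -tol and high > tol:
        return "saddle point"       # indefinite
    return "inconclusive"           # some eigenvalue is (numerically) zero


print(classify_critical_point(2, 0, 2))    # f = x^2 + y^2   -> local minimum
print(classify_critical_point(-2, 0, -2))  # f = -x^2 - y^2  -> local maximum
print(classify_critical_point(2, 0, -2))   # f = x^2 - y^2   -> saddle point
print(classify_critical_point(0, 0, 0))    # f = x^4 + y^4   -> inconclusive
```

The last case shows why the test has limits: $x^4 + y^4$ does have a minimum at the origin, but its Hessian there is the zero matrix, so curvature alone cannot confirm it.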

### B. Curvature and Learning Rates
The Hessian determines the "width" of the valley:
* **High Curvature:** A narrow, steep valley. If the learning rate is too high, Gradient Descent will bounce back and forth across the valley walls.
* **Low Curvature:** A wide, flat valley. Gradient Descent will move very slowly toward the bottom.
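A one-dimensional sketch makes both failure modes visible. It assumes the loss $J(\theta) = \frac{c}{2}\theta^2$, whose second derivative (its curvature) is the constant $c$. Each Gradient Descent step multiplies $\theta$ by $(1 - \alpha c)$, so the iterates shrink only when $\alpha < 2/c$. The specific curvatures and learning rates below are illustrative.

```python
def run_gd(curvature, alpha, theta=1.0, steps=20):
    """Gradient descent on J = 0.5 * curvature * theta^2; returns all iterates."""
    history = [theta]
    for _ in range(steps):
        grad = curvature * theta          # dJ/dtheta
        theta = theta - alpha * grad
        history.append(theta)
    return history


narrow_big_step = run_gd(curvature=10.0, alpha=0.25)    # factor -1.5: bounces and diverges
narrow_small_step = run_gd(curvature=10.0, alpha=0.05)  # factor 0.5: converges quickly
wide_valley = run_gd(curvature=0.1, alpha=0.25)         # factor 0.975: crawls

print(narrow_big_step[:4])     # sign flips each step while the magnitude grows
print(narrow_small_step[-1])   # very close to 0 after 20 steps
print(wide_valley[-1])         # still far from 0 after 20 steps
```

The same learning rate of `0.25` is too large for the narrow valley and too small for the wide one, which is the tension that curvature-aware methods try to resolve.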

## 3. Second-Order Optimization

Standard Gradient Descent is a **first-order** method; it only uses the gradient. **Second-order** methods, like **Newton's Method**, also use the Hessian, which lets them adapt each step to the local curvature and often converge in far fewer iterations.

Instead of just moving in the negative gradient direction, Newton's method scales the step by the inverse of the Hessian:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \mathbf{H}^{-1} \nabla J(\theta_{\text{old}})
$$
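On a quadratic loss the Newton update lands on the minimum in a single step, which the sketch below demonstrates. The loss $J(x, y) = x^2 + xy + y^2 - 3x$ is an assumed example with gradient $(2x + y - 3,\; x + 2y)$ and constant Hessian $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$. Rather than forming $\mathbf{H}^{-1}$ explicitly, the code solves $\mathbf{H}\,\Delta = \nabla J$ with Cramer's rule, a common design choice because solving a linear system is cheaper and more stable than inverting.

```python
def gradient(theta):
    """Gradient of the assumed loss J = x^2 + x*y + y^2 - 3x."""
    x, y = theta
    return [2 * x + y - 3, x + 2 * y]


HESSIAN = [[2.0, 1.0], [1.0, 2.0]]  # constant because J is quadratic


def solve_2x2(A, b):
    """Solve A @ step = b for a 2x2 system using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    if det == 0:
        raise ValueError("Hessian is singular; Newton step is undefined")
    step0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    step1 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return [step0, step1]


def newton_step(theta):
    """theta_new = theta_old - H^{-1} grad J(theta_old)."""
    step = solve_2x2(HESSIAN, gradient(theta))
    return [t - s for t, s in zip(theta, step)]


theta_new = newton_step([0.0, 0.0])
print(theta_new)            # [2.0, -1.0]
print(gradient(theta_new))  # [0.0, 0.0], so this is the minimum
```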

:::caution The Computational Catch
In modern Deep Learning, the full Hessian is rarely computed. If a model has 10 million parameters, the Hessian matrix would have $10^{14}$ elements (100 trillion!), which is impractical to store in memory, let alone invert. Instead, we use "quasi-Newton" methods (like L-BFGS) that approximate the curvature, or adaptive optimizers (like Adam) that rescale each parameter's step using running statistics of past gradients.
:::
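The storage cost is easy to estimate. The sketch below assumes every entry is stored as a 32-bit float (4 bytes) and ignores the savings available from symmetry, so the figures are rough upper bounds.

```python
n_params = 10_000_000        # 10 million parameters, as in the note above
bytes_per_float32 = 4        # assumed storage format

hessian_entries = n_params ** 2
hessian_terabytes = hessian_entries * bytes_per_float32 / 1e12
gradient_megabytes = n_params * bytes_per_float32 / 1e6

print(hessian_entries)       # 100000000000000, i.e. 10^14
print(hessian_terabytes)     # 400.0 TB for the full Hessian
print(gradient_megabytes)    # 40.0 MB for the gradient
```

Exploiting symmetry would roughly halve the Hessian figure, which still leaves it several orders of magnitude larger than the gradient.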

## 4. Example Calculation

Let $f(x, y) = x^2 + 4xy + y^2$.

1. **First Partial Derivatives (Gradient):**
* $f_x = 2x + 4y$
* $f_y = 4x + 2y$
2. **Second Partial Derivatives (Hessian):**
* $f_{xx} = \frac{\partial}{\partial x}(2x + 4y) = 2$
* $f_{yy} = \frac{\partial}{\partial y}(4x + 2y) = 2$
* $f_{xy} = \frac{\partial}{\partial y}(2x + 4y) = 4$

The Hessian matrix is:
$$
\mathbf{H} = \begin{bmatrix} 2 & 4 \\ 4 & 2 \end{bmatrix}
$$
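This result can be cross-checked numerically. The sketch below approximates every second partial derivative with central differences (the helper name and step size `h` are assumptions) and then computes the eigenvalues of the resulting matrix in closed form. They come out as $6$ and $-2$, one positive and one negative, so the Hessian is indefinite and the critical point at the origin, where $f_x = f_y = 0$, is a saddle point.

```python
import math


def f(point):
    """The example function f(x, y) = x^2 + 4xy + y^2."""
    x, y = point
    return x**2 + 4 * x * y + y**2


def numerical_hessian(func, point, h=1e-3):
    """Approximate all second partial derivatives with central differences."""
    n = len(point)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(sign_i, sign_j):
                p = list(point)
                p[i] += sign_i * h    # shift along variable i
                p[j] += sign_j * h    # shift along variable j
                return func(p)
            hessian[i][j] = (
                shifted(1, 1) - shifted(1, -1) - shifted(-1, 1) + shifted(-1, -1)
            ) / (4 * h**2)
    return hessian


H = numerical_hessian(f, [1.0, 2.0])
print(H)  # approximately [[2, 4], [4, 2]]; the evaluation point does not matter for a quadratic

# Closed-form eigenvalues of the exact symmetric matrix [[2, 4], [4, 2]].
a, b, c = 2.0, 4.0, 2.0
mean = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b**2)
eigenvalues = (mean - radius, mean + radius)
print(eigenvalues)  # (-2.0, 6.0): mixed signs, so the Hessian is indefinite
```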

---

Now that we have covered the mathematics of change (Calculus), we need to look at the mathematics of uncertainty. This allows us to handle noisy data and make predictions with confidence.