:label:sec_single_variable_calculus
In :numref:sec_calculus
, we saw the basic elements of differential calculus. This section takes a deeper dive into the fundamentals of calculus and how we can understand and apply it in the context of machine learning.
Differential calculus is fundamentally the study of how functions behave under small changes. To see why this is so core to deep learning, let us consider an example.
Suppose that we have a deep neural network where the weights are, for convenience, concatenated into a single vector $\mathbf{w} = (w_1, \ldots, w_n)$. Given a training dataset, we consider the loss of our neural network on this dataset, which we will write as $\mathcal{L}(\mathbf{w})$.
This function is extraordinarily complex, encoding the performance of all possible models of the given architecture on this dataset, so it is nearly impossible to tell what set of weights $\mathbf{w}$ will minimize the loss. Thus, in practice, we often start from a random set of weights and repeatedly take small steps in the direction that makes the loss decrease as rapidly as possible.
The question then becomes something that on the surface is no easier: how do we find the direction in which to move the weights so that the loss decreases as quickly as possible? To dig into this, let us first examine the case with only a single weight: $L(\mathbf{w}) = L(x)$ for a single real value $x$.
Let us take $L(x)$ and try to understand what happens when we change $x$ by a small amount to $x + \epsilon$. To visualize this concretely, let us graph an example function, $f(x) = \sin(x^x)$, over the interval $[0, 3]$.
%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
from mxnet import np, npx
npx.set_np()
# Plot a function in a normal range
x_big = np.arange(0.01, 3.01, 0.01)
ys = np.sin(x_big**x_big)
d2l.plot(x_big, ys, 'x', 'f(x)')
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
from IPython import display
import torch
torch.pi = torch.acos(torch.zeros(1)).item() * 2 # Define pi in torch
# Plot a function in a normal range
x_big = torch.arange(0.01, 3.01, 0.01)
ys = torch.sin(x_big**x_big)
d2l.plot(x_big, ys, 'x', 'f(x)')
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
import tensorflow as tf
tf.pi = tf.acos(tf.zeros(1)).numpy() * 2 # Define pi in TensorFlow
# Plot a function in a normal range
x_big = tf.range(0.01, 3.01, 0.01)
ys = tf.sin(x_big**x_big)
d2l.plot(x_big, ys, 'x', 'f(x)')
At this large scale, the function's behavior is not simple. However, if we reduce our range to something smaller, like $[1.75, 2.25]$, the picture becomes much simpler.
# Plot the same function in a smaller range
x_med = np.arange(1.75, 2.25, 0.001)
ys = np.sin(x_med**x_med)
d2l.plot(x_med, ys, 'x', 'f(x)')
#@tab pytorch
# Plot the same function in a smaller range
x_med = torch.arange(1.75, 2.25, 0.001)
ys = torch.sin(x_med**x_med)
d2l.plot(x_med, ys, 'x', 'f(x)')
#@tab tensorflow
# Plot the same function in a smaller range
x_med = tf.range(1.75, 2.25, 0.001)
ys = tf.sin(x_med**x_med)
d2l.plot(x_med, ys, 'x', 'f(x)')
Taking this to an extreme, if we zoom into a tiny segment, the behavior becomes far simpler: it is just a straight line.
# Plot the same function in a tiny range
x_small = np.arange(2.0, 2.01, 0.0001)
ys = np.sin(x_small**x_small)
d2l.plot(x_small, ys, 'x', 'f(x)')
#@tab pytorch
# Plot the same function in a tiny range
x_small = torch.arange(2.0, 2.01, 0.0001)
ys = torch.sin(x_small**x_small)
d2l.plot(x_small, ys, 'x', 'f(x)')
#@tab tensorflow
# Plot the same function in a tiny range
x_small = tf.range(2.0, 2.01, 0.0001)
ys = tf.sin(x_small**x_small)
d2l.plot(x_small, ys, 'x', 'f(x)')
This is the key observation of single variable calculus: the behavior of familiar functions can be modeled by a line in a small enough range. This means that for most functions, it is reasonable to expect that as we shift the input of the function by a little bit, the output will also shift by a little bit.
Thus, we can consider the ratio of the change in the output of a function to a small change in the input of the function. We can write this formally as
$$\frac{L(x+\epsilon) - L(x)}{(x+\epsilon) - x} = \frac{L(x+\epsilon) - L(x)}{\epsilon}.$$
This is already enough to start to play around with in code. For instance, suppose that we know that $L(x) = x^{2} + 1701(x-4)^3$; then we can see how large this value is at the point $x = 4$ as follows.
#@tab all
# Define our function
def L(x):
    return x**2 + 1701*(x-4)**3

# Print the difference divided by epsilon for several epsilon
for epsilon in [0.1, 0.001, 0.0001, 0.00001]:
    print(f'epsilon = {epsilon:.5f} -> {(L(4+epsilon) - L(4)) / epsilon:.5f}')
Now, if we are observant, we will notice that the output of this number is suspiciously close to $8$. Indeed, if we decrease $\epsilon$, we see the value become progressively closer to $8$.
As a bit of a historical digression: in the first few decades of neural network research, scientists used this algorithm (the method of finite differences) to evaluate how a loss function changed under small perturbations: just change the weights and see how the loss changed. This is computationally inefficient, requiring two evaluations of the loss function to see how a single change of one variable influenced the loss. If we tried to do this with even a paltry few thousand parameters, it would require several thousand evaluations of the network over the entire dataset! The problem was not solved until 1986, when the backpropagation algorithm introduced in :cite:Rumelhart.Hinton.Williams.ea.1988
provided a way to calculate how any change of the weights together would change the loss in the same computation time as a single prediction of the network over the dataset.
Back in our example, this value $8$ depends on the point $x$ at which we measured it, so it makes sense to define it as a function of $x$. More formally, this value-dependent rate of change is referred to as the derivative, which we write as
$$\frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon}.$$
:eqlabel:eq_der_def
Different texts will use different notations for the derivative. For instance, all of the below notations indicate the same thing:
$$\frac{df}{dx} = \frac{d}{dx}f = f' = \nabla_x f = D_x f = f_x.$$
Most authors will pick a single notation and stick with it; however, even that is not guaranteed. It is best to be familiar with all of these. We will use the notation $\frac{df}{dx}$ throughout this text, unless we want to take the derivative of a complex expression, in which case we will write $\frac{d}{dx}f$.
Oftentimes, it is intuitively useful to unravel the definition of the derivative :eqref:eq_der_def again to see how a function changes when we make a small change of $x$:
$$\begin{aligned} \frac{df}{dx}(x) = \lim_{\epsilon \rightarrow 0}\frac{f(x+\epsilon) - f(x)}{\epsilon} & \implies \frac{df}{dx}(x) \approx \frac{f(x+\epsilon) - f(x)}{\epsilon} \\ & \implies \epsilon \frac{df}{dx}(x) \approx f(x+\epsilon) - f(x) \\ & \implies f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x). \end{aligned}$$
:eqlabel:eq_small_change
The last equation is worth explicitly calling out. It tells us that if we take any function and change the input by a small amount, the output changes by approximately that small amount scaled by the derivative.
In this way, we can understand the derivative as the scaling factor that tells us how large a change we get in the output from a change in the input.
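To see this scaling in action numerically, here is a minimal sketch in plain Python that compares $L(x+\epsilon)$ with the linear approximation $L(x) + \epsilon \frac{dL}{dx}(x)$ for the function $L$ defined above, using its hand-computed derivative $\frac{dL}{dx}(x) = 2x + 3\cdot 1701(x-4)^2$ (so $\frac{dL}{dx}(4) = 8$, matching the finite-difference estimate).

```python
# A minimal sketch: compare L(x + epsilon) with the linear approximation
# L(x) + epsilon * dL/dx for L(x) = x**2 + 1701*(x - 4)**3, whose
# hand-computed derivative is dL/dx = 2*x + 3*1701*(x - 4)**2.
def L(x):
    return x**2 + 1701*(x - 4)**3

def dL(x):
    return 2*x + 3*1701*(x - 4)**2

x, epsilon = 4.0, 0.001
exact = L(x + epsilon)
approx = L(x) + epsilon * dL(x)
print(f'exact: {exact:.10f}, linear approximation: {approx:.10f}')
```

The two values agree up to terms of size $\epsilon^2$ and smaller, which is exactly the error the linear approximation incurs.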
:label:sec_derivative_table
We now turn to the task of understanding how to compute the derivative of an explicit function. A full formal treatment of calculus would derive everything from first principles. We will not indulge in this temptation here, but rather provide an understanding of the common rules encountered.
As was seen in :numref:sec_calculus
, when computing derivatives one can oftentimes use a series of rules to reduce the computation to a few core functions. We repeat them here for ease of reference.
- Derivative of constants. $\frac{d}{dx}c = 0$.
- Derivative of linear functions. $\frac{d}{dx}(ax) = a$.
- Power rule. $\frac{d}{dx}x^n = nx^{n-1}$.
- Derivative of exponentials. $\frac{d}{dx}e^x = e^x$.
- Derivative of the logarithm. $\frac{d}{dx}\log(x) = \frac{1}{x}$.
If every derivative needed to be separately computed and stored in a table, differential calculus would be near impossible. It is a gift of mathematics that we can generalize the above derivatives to compute the derivatives of more complex expressions built from them. As was mentioned in :numref:sec_calculus, the key to doing so is to codify what happens when we take functions and combine them in various ways, most importantly: sums, products, and compositions.
- Sum rule. $\frac{d}{dx}\left(g(x) + h(x)\right) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$.
- Product rule. $\frac{d}{dx}\left(g(x)\cdot h(x)\right) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$.
- Chain rule. $\frac{d}{dx}g(h(x)) = \frac{dg}{dh}(h(x))\cdot \frac{dh}{dx}(x)$.
Let us see how we may use :eqref:eq_small_change to understand these rules. For the sum rule, take $f(x) = g(x) + h(x)$ and consider the following chain of reasoning:
$$\begin{aligned} f(x+\epsilon) & = g(x+\epsilon) + h(x+\epsilon) \\ & \approx g(x) + \epsilon \frac{dg}{dx}(x) + h(x) + \epsilon \frac{dh}{dx}(x) \\ & = g(x) + h(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right) \\ & = f(x) + \epsilon\left(\frac{dg}{dx}(x) + \frac{dh}{dx}(x)\right). \end{aligned}$$
By comparing this result with the fact that $f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x)$, we see that $\frac{df}{dx}(x) = \frac{dg}{dx}(x) + \frac{dh}{dx}(x)$, as desired.
The product rule is more subtle, and will require a new observation about how to work with these expressions. Take $f(x) = g(x)\cdot h(x)$ and begin as before using :eqref:eq_small_change:
$$\begin{aligned} f(x+\epsilon) & = g(x+\epsilon)\cdot h(x+\epsilon) \\ & \approx \left(g(x) + \epsilon \frac{dg}{dx}(x)\right)\cdot\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\ & = g(x)\cdot h(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x) \\ & = f(x) + \epsilon\left(g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)\right) + \epsilon^2\frac{dg}{dx}(x)\frac{dh}{dx}(x). \end{aligned}$$
This resembles the computation done above, and indeed we see our answer ($\frac{df}{dx}(x) = g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x)$) sitting next to $\epsilon$. But what about the term of size $\epsilon^{2}$? Such higher-order terms shrink much faster than $\epsilon$ itself: if $\epsilon = 0.0000001$, then $\epsilon^{2} = 0.00000000000001$, which is vastly smaller. To make this precise, we can form the difference quotient
$$\frac{f(x+\epsilon) - f(x)}{\epsilon} \approx g(x)\frac{dh}{dx}(x) + \frac{dg}{dx}(x)h(x) + \epsilon \frac{dg}{dx}(x)\frac{dh}{dx}(x),$$
and see that as we send $\epsilon \rightarrow 0$, the rightmost term vanishes, leaving exactly the product rule.
Finally, with the chain rule, take $f(x) = g(h(x))$ and again progress as before using :eqref:eq_small_change to see that
$$\begin{aligned} f(x+\epsilon) & = g(h(x+\epsilon)) \\ & \approx g\left(h(x) + \epsilon \frac{dh}{dx}(x)\right) \\ & \approx g(h(x)) + \epsilon \frac{dh}{dx}(x) \frac{dg}{dh}(h(x)) \\ & = f(x) + \epsilon \frac{dg}{dh}(h(x))\frac{dh}{dx}(x), \end{aligned}$$
where in the second line we view the function $g$ as having its input $h(x)$ shifted by the tiny quantity $\epsilon \frac{dh}{dx}(x)$.
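Since we have only motivated these rules rather than proven them, it can be reassuring to check them numerically. Below is a small sketch in plain Python (the choices $g(x) = x^2$, $h(x) = \sin(x)$, and the evaluation point are arbitrary illustrations) comparing finite-difference estimates with the product and chain rule formulas.

```python
import math

# A sketch: finite-difference check of the product and chain rules for
# the illustrative choices g(x) = x**2 and h(x) = sin(x).
def g(x):
    return x**2

def dg(x):
    return 2*x

def h(x):
    return math.sin(x)

def dh(x):
    return math.cos(x)

x, epsilon = 1.3, 1e-6

# Product rule: d/dx [g(x) * h(x)] = g(x) dh/dx(x) + dg/dx(x) h(x)
finite_difference = (g(x + epsilon)*h(x + epsilon) - g(x)*h(x)) / epsilon
by_rule = g(x)*dh(x) + dg(x)*h(x)
print(f'product rule: {finite_difference:.6f} vs {by_rule:.6f}')

# Chain rule: d/dx g(h(x)) = dg/dh(h(x)) * dh/dx(x)
finite_difference = (g(h(x + epsilon)) - g(h(x))) / epsilon
by_rule = dg(h(x)) * dh(x)
print(f'chain rule:   {finite_difference:.6f} vs {by_rule:.6f}')
```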
These rules provide us with a flexible set of tools to compute essentially any expression desired. For instance,
$$\begin{aligned} \frac{d}{dx}\left[\log\left(1+(x-1)^{10}\right)\right] & = \left(1+(x-1)^{10}\right)^{-1}\frac{d}{dx}\left[1+(x-1)^{10}\right] \\ & = \left(1+(x-1)^{10}\right)^{-1}\left(\frac{d}{dx}[1] + \frac{d}{dx}\left[(x-1)^{10}\right]\right) \\ & = \left(1+(x-1)^{10}\right)^{-1}\left(0 + 10(x-1)^{9}\frac{d}{dx}[x-1]\right) \\ & = \frac{10(x-1)^{9}}{1+(x-1)^{10}}, \end{aligned}$$
where each line has used the following rules:
- The chain rule and derivative of logarithm.
- The sum rule.
- The derivative of constants, chain rule, and power rule.
- The sum rule, derivative of linear functions, derivative of constants.
Two things should be clear after doing this example:
- Any function we can write down using sums, products, constants, powers, exponentials, and logarithms can have its derivative computed mechanically by following these rules.
- Having a human follow these rules can be tedious and error-prone!
Thankfully, these two facts together hint towards a way forward: this is a perfect candidate for mechanization! Indeed, backpropagation, which we will revisit later in this section, is exactly that.
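As a small taste of that mechanization, the sketch below uses PyTorch's autograd (one of the frameworks already used in this section) to apply the rules automatically to the worked example above and compares the result with the hand-derived formula; the evaluation point $x = 2.5$ is an arbitrary choice.

```python
import torch

# A sketch: let autograd apply the derivative rules mechanically to
# f(x) = log(1 + (x - 1)**10) and compare with the hand-derived result.
x = torch.tensor(2.5, requires_grad=True)
f = torch.log(1 + (x - 1)**10)
f.backward()

# Hand-derived derivative: 10*(x - 1)**9 / (1 + (x - 1)**10)
with torch.no_grad():
    by_hand = 10 * (x - 1)**9 / (1 + (x - 1)**10)

print(f'autograd: {x.grad.item():.6f}, by hand: {by_hand.item():.6f}')
```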
When working with derivatives, it is often useful to geometrically interpret the approximation used above. In particular, note that the equation
$$f(x+\epsilon) \approx f(x) + \epsilon \frac{df}{dx}(x)$$
approximates the value of $f$ by a line which passes through the point $(x, f(x))$ and has slope $\frac{df}{dx}(x)$. In this way we say that the derivative gives a linear approximation to the function $f$, as illustrated below.
# Compute sin
xs = np.arange(-np.pi, np.pi, 0.01)
plots = [np.sin(xs)]
# Compute some linear approximations. Use d(sin(x)) / dx = cos(x)
for x0 in [-1.5, 0, 2]:
plots.append(np.sin(x0) + (xs - x0) * np.cos(x0))
d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
#@tab pytorch
# Compute sin
xs = torch.arange(-torch.pi, torch.pi, 0.01)
plots = [torch.sin(xs)]
# Compute some linear approximations. Use d(sin(x))/dx = cos(x)
for x0 in [-1.5, 0.0, 2.0]:
plots.append(torch.sin(torch.tensor(x0)) + (xs - x0) *
torch.cos(torch.tensor(x0)))
d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
#@tab tensorflow
# Compute sin
xs = tf.range(-tf.pi, tf.pi, 0.01)
plots = [tf.sin(xs)]
# Compute some linear approximations. Use d(sin(x))/dx = cos(x)
for x0 in [-1.5, 0.0, 2.0]:
plots.append(tf.sin(tf.constant(x0)) + (xs - x0) *
tf.cos(tf.constant(x0)))
d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
Let us now do something that may on the surface seem strange. Take a function $f$ and compute its derivative $\frac{df}{dx}$. This gives us the rate of change of $f$ at any point.
However, the derivative, $\frac{df}{dx}$, can itself be viewed as a function, so nothing stops us from computing the derivative of $\frac{df}{dx}$ to get $\frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right)$. We will call this the second derivative of $f$. This function is the rate of change of the rate of change of $f$, or in other words, how the rate of change is changing.
Let us try to understand why this is a useful notion by considering a few illustrative cases.
First, consider the case that the second derivative is a positive constant. This means that the first derivative is increasing: it may start out negative, become zero at a point, and then become positive. Hence, the function $f$ itself decreases, flattens out, and then increases, so it curves upwards and has a single minimum, as shown in :numref:fig_positive-second.
Second, if the second derivative is a negative constant, that means that the first derivative is decreasing. This implies the first derivative may start out positive, become zero at a point, and then become negative. Hence, the function $f$ itself increases, flattens out, and then decreases, so it curves downwards and has a single maximum, as shown in :numref:fig_negative-second.
Third, if the second derivative is always zero, then the first derivative will never change---it is constant! This means that $f$ changes at a fixed rate, and $f$ is itself a straight line, as shown in :numref:fig_zero-second.
To summarize, the second derivative can be interpreted as describing the way that the function $f$ curves. A positive second derivative leads to an upwards curve, a negative second derivative means that $f$ curves downwards, and a zero second derivative means that $f$ does not curve at all.
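Numerically, the second derivative can be estimated just like the first, by differencing twice. Below is a small sketch in plain Python (with the illustrative choice $f(x) = \sin(x)$) of the central finite-difference estimate $\frac{d^2f}{dx^2}(x) \approx \frac{f(x+\epsilon) - 2f(x) + f(x-\epsilon)}{\epsilon^2}$, whose sign tells us which way the function curves at each point.

```python
import math

# A sketch: estimate the second derivative of f(x) = sin(x) by a central
# finite difference and compare with the exact value -sin(x).
def f(x):
    return math.sin(x)

epsilon = 1e-4
for x in [-1.5, 0.0, 2.0]:
    estimate = (f(x + epsilon) - 2*f(x) + f(x - epsilon)) / epsilon**2
    print(f'x = {x:+.1f}: estimate {estimate:+.4f}, exact {-math.sin(x):+.4f}')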
Let us take this one step further. Consider the quadratic function $g(x) = ax^{2} + bx + c$. Its derivative is $\frac{dg}{dx}(x) = 2ax + b$ and its second derivative is $\frac{d^2g}{dx^2}(x) = 2a$.
If we have some original function $f(x)$ in mind, we may compute its first two derivatives at a point $x_0$ and find the values of $a$, $b$, and $c$ that make the quadratic match them. Similarly to the previous section, where we saw that the first derivative gives the best approximation by a straight line, this construction provides the best approximation by a quadratic:
$$f(x) \approx \frac{1}{2}\frac{d^2f}{dx^2}(x_0)(x-x_0)^{2} + \frac{df}{dx}(x_0)(x-x_0) + f(x_0).$$
Let us visualize this for $f(x) = \sin(x)$.
# Compute sin
xs = np.arange(-np.pi, np.pi, 0.01)
plots = [np.sin(xs)]
# Compute some quadratic approximations. Use d(sin(x))/dx = cos(x) and
# d^2(sin(x))/dx^2 = -sin(x)
for x0 in [-1.5, 0, 2]:
plots.append(np.sin(x0) + (xs - x0) * np.cos(x0) -
(xs - x0)**2 * np.sin(x0) / 2)
d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
#@tab pytorch
# Compute sin
xs = torch.arange(-torch.pi, torch.pi, 0.01)
plots = [torch.sin(xs)]
# Compute some quadratic approximations. Use d(sin(x))/dx = cos(x) and
# d^2(sin(x))/dx^2 = -sin(x)
for x0 in [-1.5, 0.0, 2.0]:
plots.append(torch.sin(torch.tensor(x0)) + (xs - x0) *
torch.cos(torch.tensor(x0)) - (xs - x0)**2 *
torch.sin(torch.tensor(x0)) / 2)
d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
#@tab tensorflow
# Compute sin
xs = tf.range(-tf.pi, tf.pi, 0.01)
plots = [tf.sin(xs)]
# Compute some quadratic approximations. Use d(sin(x))/dx = cos(x) and
# d^2(sin(x))/dx^2 = -sin(x)
for x0 in [-1.5, 0.0, 2.0]:
plots.append(tf.sin(tf.constant(x0)) + (xs - x0) *
tf.cos(tf.constant(x0)) - (xs - x0)**2 *
tf.sin(tf.constant(x0)) / 2)
d2l.plot(xs, plots, 'x', 'f(x)', ylim=[-1.5, 1.5])
We will extend this idea to the Taylor series in the next section.
The Taylor series provides a method to approximate the function $f(x)$ if we are given values for the first $n$ derivatives at a point $x_0$, i.e., $\left\{ f(x_0), f^{(1)}(x_0), f^{(2)}(x_0), \ldots, f^{(n)}(x_0) \right\}$. The idea is to find a degree-$n$ polynomial that matches all of these derivatives at $x_0$.
We saw the case of $n=2$ in the previous section: the best quadratic approximation is
$$f(x) \approx \frac{1}{2}\frac{d^2f}{dx^2}(x_0)(x-x_0)^{2} + \frac{df}{dx}(x_0)(x-x_0) + f(x_0).$$
As we can see above, the denominator of $2$ is there to cancel out the $2$ we get when we take two derivatives of $(x-x_0)^2$, while taking two derivatives of the other terms gives zero. The same logic applies for the first derivative and the value itself.
If we push the logic further to $n=3$, we will conclude that
$$f(x) \approx \frac{\frac{d^3f}{dx^3}(x_0)}{6}(x-x_0)^{3} + \frac{\frac{d^2f}{dx^2}(x_0)}{2}(x-x_0)^{2} + \frac{df}{dx}(x_0)(x-x_0) + f(x_0),$$
where the $6 = 3 \times 2 = 3!$ comes from the constant we get in front if we take three derivatives of $(x-x_0)^3$.
Furthermore, we can get a degree-$n$ polynomial by
$$P_n(x) = \sum_{i = 0}^{n} \frac{f^{(i)}(x_0)}{i!}(x-x_0)^{i},$$
where the notation $f^{(n)}(x) = \frac{d^{n}f}{dx^{n}} = \left(\frac{d}{dx}\right)^{n} f$ denotes the $n$-th derivative of $f$.
Indeed, $P_n(x)$ can be viewed as the best degree-$n$ polynomial approximation to our function $f(x)$.
While we are not going to dive all the way into the error of the above approximations, it is worth mentioning the infinite limit. In this case, for well-behaved functions (known as real analytic functions) like $\cos(x)$ or $e^{x}$, we can write out infinitely many terms and recover the function exactly:
$$f(x) = \sum_{n = 0}^\infty \frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}.$$
Take $f(x) = e^{x}$ as an example. Since $e^{x}$ is its own derivative, we know that $f^{(n)}(x) = e^{x}$, and so $f^{(n)}(0) = 1$ for every $n$. Therefore, $e^{x}$ can be reconstructed by taking the Taylor series at $x_0 = 0$:
$$e^{x} = \sum_{n = 0}^\infty \frac{x^{n}}{n!} = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots.$$
Let us see how this works in code and observe how increasing the degree of the Taylor approximation brings us closer to the desired function $e^x$.
# Compute the exponential function
xs = np.arange(0, 3, 0.01)
ys = np.exp(xs)
# Compute a few Taylor series approximations
P1 = 1 + xs
P2 = 1 + xs + xs**2 / 2
P5 = 1 + xs + xs**2 / 2 + xs**3 / 6 + xs**4 / 24 + xs**5 / 120
d2l.plot(xs, [ys, P1, P2, P5], 'x', 'f(x)', legend=[
"Exponential", "Degree 1 Taylor Series", "Degree 2 Taylor Series",
"Degree 5 Taylor Series"])
#@tab pytorch
# Compute the exponential function
xs = torch.arange(0, 3, 0.01)
ys = torch.exp(xs)
# Compute a few Taylor series approximations
P1 = 1 + xs
P2 = 1 + xs + xs**2 / 2
P5 = 1 + xs + xs**2 / 2 + xs**3 / 6 + xs**4 / 24 + xs**5 / 120
d2l.plot(xs, [ys, P1, P2, P5], 'x', 'f(x)', legend=[
"Exponential", "Degree 1 Taylor Series", "Degree 2 Taylor Series",
"Degree 5 Taylor Series"])
#@tab tensorflow
# Compute the exponential function
xs = tf.range(0, 3, 0.01)
ys = tf.exp(xs)
# Compute a few Taylor series approximations
P1 = 1 + xs
P2 = 1 + xs + xs**2 / 2
P5 = 1 + xs + xs**2 / 2 + xs**3 / 6 + xs**4 / 24 + xs**5 / 120
d2l.plot(xs, [ys, P1, P2, P5], 'x', 'f(x)', legend=[
"Exponential", "Degree 1 Taylor Series", "Degree 2 Taylor Series",
"Degree 5 Taylor Series"])
Taylor series have two primary applications:
- Theoretical applications: Often when we try to understand a function that is too complex to work with directly, using its Taylor series enables us to turn it into a polynomial that we can manipulate directly.
- Numerical applications: Some functions like $e^{x}$ or $\cos(x)$ are difficult for machines to compute. They can store tables of values at a fixed precision (and this is often done), but it still leaves open questions like "What is the 1000-th digit of $\cos(1)$?" Taylor series are often helpful to answer such questions, as sketched below.
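For instance, here is a minimal sketch in plain Python (the number of terms kept is an arbitrary illustrative choice) that evaluates $\cos(1)$ from the Taylor series of cosine at $x_0 = 0$, $\cos(x) = \sum_{n=0}^\infty \frac{(-1)^n x^{2n}}{(2n)!}$, and compares the result with the library value.

```python
import math

# A sketch: approximate cos(1) by a truncated Taylor series at x0 = 0,
# cos(x) = sum_{n >= 0} (-1)**n * x**(2n) / (2n)!, and compare with math.cos.
x = 1.0
approx = sum((-1)**n * x**(2*n) / math.factorial(2*n) for n in range(10))
print(f'Taylor approximation: {approx:.15f}')
print(f'math.cos(1):          {math.cos(x):.15f}')
```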
- Derivatives can be used to express how functions change when we change the input by a small amount.
- Elementary derivatives can be combined using derivative rules to create arbitrarily complex derivatives.
- Derivatives can be iterated to get second or higher order derivatives. Each increase in order provides more fine-grained information on the behavior of the function.
- Using information in the derivatives at a single point, we can approximate well-behaved functions by polynomials obtained from the Taylor series.
- What is the derivative of $x^3-4x+1$?
- What is the derivative of $\log(\frac{1}{x})$?
- True or False: If $f'(x) = 0$ then $f$ has a maximum or minimum at $x$?
- Where is the minimum of $f(x) = x\log(x)$ for $x\ge0$ (where we assume that $f$ takes the limiting value of $0$ at $x=0$)?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: