# Backpropagation Algorithm

In the previous section, we introduced three models. The overall workflow is always the same: define the model, read in the data, specify the loss function $f$, and update the parameters by gradient descent. PyTorch provides very simple automatic differentiation to compute the derivatives for us. For simple models we can also derive the parameter gradients by hand, but for a very complex model, such as a 100-layer network, how can we compute these gradients efficiently? This is where the backpropagation algorithm comes in. Automatic differentiation is, in essence, the backpropagation algorithm.

The backpropagation algorithm is an efficient algorithm for computing gradients. It is essentially an application of the chain rule of differentiation. Yet this simple and seemingly obvious method was invented and popularized only about 30 years after Rosenblatt proposed the perceptron algorithm. As Bengio put it: "A lot of seemingly obvious ideas become apparent only afterwards."

Let's take a closer look at what the backpropagation algorithm is.


## Chain Rule

First, let's briefly introduce the chain rule. Consider a simple function, such as
$$f(x, y, z) = (x + y)z$$

We could of course differentiate this function directly, but here we want to use the chain rule. Let
$$q=x+y$$

Then

$$f = qz$$

For these two equations, we can find the partial derivatives separately.

$$\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z}=q$$

At the same time, $q$ is the sum of $x$ and $y$, so we get

$$\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$$

The quantities we care about are

$$\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$$

The chain rule tells us how to calculate their values:

$$
\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial x}
$$
$$
\frac{\partial f}{\partial y} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial y}
$$
$$
\frac{\partial f}{\partial z} = q
$$
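The formulas above can be sanity-checked numerically. The following is a minimal sketch, not part of the original material: the function names `grad_chain_rule` and `grad_numeric`, the step size `h`, and the test point are all illustrative assumptions. It compares the chain-rule gradients with central finite differences.

```python
def f(x, y, z):
    """The example function f(x, y, z) = (x + y) * z."""
    return (x + y) * z


def grad_chain_rule(x, y, z):
    """Gradient of f obtained by applying the chain rule through q = x + y."""
    q = x + y
    df_dq = z      # from f = q * z
    df_dz = q      # from f = q * z
    dq_dx = 1.0    # from q = x + y
    dq_dy = 1.0    # from q = x + y
    return df_dq * dq_dx, df_dq * dq_dy, df_dz


def grad_numeric(x, y, z, h=1e-6):
    """Central finite-difference approximation of the same gradient."""
    df_dx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)
    df_dy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)
    df_dz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)
    return df_dx, df_dy, df_dz


# An arbitrary test point, chosen only for illustration.
point = (1.5, -0.5, 2.0)
analytic = grad_chain_rule(*point)
numeric = grad_numeric(*point)
print("chain rule:        ", analytic)
print("finite differences:", numeric)
```

The two results should agree up to floating-point error, which is a common way to check a hand-derived gradient.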

The chain rule tells us that to differentiate with respect to an inner variable, we differentiate one layer at a time and multiply the results together. This is the core of the chain rule and also the core of the backpropagation algorithm. For more on the chain rule, see this [document] (https:


## Backpropagation Algorithm

Now that we understand the chain rule, we can introduce the backpropagation algorithm. In essence, backpropagation is just an application of the chain rule. We use the same example as before, $q=x+y, f=qz$, which can be represented as a computational graph.

![](https://ws1.sinaimg.cn/large/006tNc79ly1fmiozcinyzj30c806vglk.jpg)

The green number above each edge is its value, and the red number below is the gradient obtained for it. Let's walk through the backpropagation algorithm step by step. Starting from the end, the gradient of the output with respect to itself is of course 1. Then we calculate

$$\frac{\partial f}{\partial q} = z = -4,\ \frac{\partial f}{\partial z} = q = 3$$

Then we calculate
$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} = -4 \times 1 = -4,\ \frac{\partial f}{\partial y} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial y} = -4 \times 1 = -4$$

In this way, step by step, we obtain $\nabla f(x, y, z)$.
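The same steps can be written out as straight-line code. This is a sketch under one assumption: the text only states $q = 3$ and $z = -4$, so the values `x = -2` and `y = 5` below are chosen for illustration so that $x + y = 3$.

```python
# Forward pass.
# x and y are assumed values (only q = 3 and z = -4 appear in the text).
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3.0
f = q * z          # f = -12.0

# Backward pass, from the output back to the inputs.
df_df = 1.0                # gradient of the output with respect to itself
df_dq = z * df_df          # f = q * z  ->  df/dq = z
df_dz = q * df_df          # f = q * z  ->  df/dz = q
df_dx = 1.0 * df_dq        # q = x + y  ->  dq/dx = 1
df_dy = 1.0 * df_dq        # q = x + y  ->  dq/dy = 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```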

Intuitively, the backpropagation algorithm is an elegant local process. Each step only needs the derivative of the current operation. The gradient for the parameters of each layer is obtained by taking the gradient already computed for the following layer and multiplying it, via the chain rule, by the local derivative of this layer. This is why it is called a propagation process.
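One way to make this locality concrete is to give each operation its own small object that knows only its own derivative. The sketch below is illustrative: the class names `AddGate` and `MultiplyGate` and their `forward`/`backward` methods are hypothetical, not an API from PyTorch, and the inputs `-2` and `5` are assumed values that give $q = 3$.

```python
class AddGate:
    """Computes a + b; its local derivatives are both 1."""

    def forward(self, a, b):
        return a + b

    def backward(self, upstream):
        # The upstream gradient is passed to both inputs unchanged.
        return upstream, upstream


class MultiplyGate:
    """Computes a * b; the local derivative for each input is the other input."""

    def forward(self, a, b):
        # Cache the inputs, since the backward pass needs them.
        self.a, self.b = a, b
        return a * b

    def backward(self, upstream):
        return upstream * self.b, upstream * self.a


add, mul = AddGate(), MultiplyGate()

# Forward pass: q = x + y, f = q * z (x = -2 and y = 5 are assumed values).
q = add.forward(-2.0, 5.0)
f = mul.forward(q, -4.0)

# Backward pass: each gate only multiplies the incoming gradient
# by its own local derivative.
dq, dz = mul.backward(1.0)
dx, dy = add.backward(dq)
print(dx, dy, dz)
```

Neither gate knows anything about the rest of the graph, which is what allows the same scheme to scale to very deep networks.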

### Sigmoid function example

Below we use the Sigmoid function to demonstrate how backpropagation works on a more complex function.

$$
f(w, x) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}
$$

We need to compute
$$\frac{\partial f}{\partial w_0}, \frac{\partial f}{\partial w_1}, \frac{\partial f}{\partial w_2}$$

First we decompose this function into the elementary operations of a computational graph, i.e.,
$$
\begin{aligned}
   f(x) &= \frac{1}{x} \\
   f_c(x) &= 1 + x \\
   f_e(x) &= e^x \\
   f_w(x) &= -(w_0 x_0 + w_1 x_1 + w_2)
\end{aligned}
$$

So we can draw the following computational graph

![](https://ws1.sinaimg.cn/large/006tNc79ly1fmip1va5qjj30lb08e0t0.jpg)

As before, the green number above represents the value and the red number below represents the gradient, and we compute the gradient of each parameter from back to front. First, the gradient at the output is 1. It then passes through the $\frac{1}{x}$ function, whose derivative is $-\frac{1}{x^2}$, so the gradient propagated backward is $1 \times -\frac{1}{1.37^2} = -0.53$. Next comes the $+1$ operation, which leaves the gradient unchanged. Then comes the $e^x$ operation, where the gradient becomes $-0.53 \times e^{-1} = -0.2$. Continuing to propagate backward in this way gives the gradient of every parameter.
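The full backward pass can be traced in code. This is a sketch under stated assumptions: the text does not list the input values, so `w0, x0, w1, x1, w2` below are illustrative choices that reproduce the intermediate values $1.37$ and $e^{-1}$ quoted above. The variable names are likewise hypothetical.

```python
import math

# Assumed inputs (not given in the text), chosen so that
# w0*x0 + w1*x1 + w2 = 1, which yields the intermediate value 1.37.
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# Forward pass, one elementary operation at a time.
s = w0 * x0 + w1 * x1 + w2   # weighted sum:  1.0
n = -s                       # negation:     -1.0
e = math.exp(n)              # e^x:          about 0.37
c = 1.0 + e                  # +1:           about 1.37
f = 1.0 / c                  # 1/x:          about 0.73

# Backward pass, multiplying by each local derivative in turn.
df = 1.0                     # gradient at the output
dc = df * (-1.0 / c ** 2)    # d(1/c)/dc = -1/c^2     -> about -0.53
de = dc * 1.0                # d(1 + e)/de = 1        -> unchanged
dn = de * e                  # d(exp(n))/dn = exp(n)  -> about -0.2
ds = dn * -1.0               # d(-s)/ds = -1          -> about 0.2

# Gradients of the three parameters.
dw0 = ds * x0                # ds/dw0 = x0
dw1 = ds * x1                # ds/dw1 = x1
dw2 = ds * 1.0               # ds/dw2 = 1

print(round(dc, 2), round(dn, 2))
print(round(dw0, 2), round(dw1, 2), round(dw2, 2))
```

As a cross-check, the gradient with respect to the weighted sum, `ds`, should equal the well-known closed form $f(1 - f)$ for the derivative of the Sigmoid function.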
