## Beyond the math: all the fancy terms

Though knowing the math is enough for us to implement and use neural network models, understanding the jargon helps us communicate with people from different backgrounds. In this section, we bridge the gap between the math and the terminology commonly used in the machine learning and deep learning communities.

Let's use the math from a fully connected neural network as an example:

$$
\begin{aligned}
\mathbf{z}^1 &= \sigma_0\left(\left(W^0\right)^\mathsf{T}\mathbf{x} + \mathbf{b}^0\right) \\
\mathbf{z}^2 &= \sigma_1\left(\left(W^1\right)^\mathsf{T}\mathbf{z}^1 + \mathbf{b}^1\right) \\
&\vdots \\
\hat{\mathbf{y}} &= \sigma_L\left(\left(W^L\right)^\mathsf{T}\mathbf{z}^L + \mathbf{b}^L\right)
\end{aligned}
$$

People often use a diagram like the following to explain a fully connected neural network:

<img src="../images/neural-networks.png" style="width: 400px;"/> 

This graphical illustration may come from the fact that neural network models are inspired by biological neural networks. However, the figure does not really tell us how exactly a model should be implemented.

Nevertheless, the graphical illustration makes it easier to understand why the deep learning community names the math components with the following terms.

##### Neurons

Each value in the intermediate and final results is called a neuron. Though we use vectors $\mathbf{z}^1$, $\dots$, $\hat{\mathbf{y}}$ to denote the calculation results, they are composed of many elements. In other words, elements in vectors are called neurons. For example, if $\mathbf{z}^1$ has $12$ elements, i.e., $n_1=12$, then we say $\mathbf{z}^1$ has 12 neurons.

A neuron in a neural network is a signal processing node. It receives values (signals) from other neurons, does some calculations, and then provides the calculation result (the processed signal) to other neurons. For example, in the above fully connected neural network model, if we expand the vector-matrix form to explicit equations, the element $z_1^2$ from the intermediate vector $\mathbf{z}^2$ is obtained through:

$$
z_1^2 = \sigma_1\left(W_{1,1}^{1}z_1^{1}+W_{2,1}^{1}z_2^1+\cdots+W_{n_1,1}^1 z_{n_1}^1 + b_1^1\right)
$$

So we say the neuron $z_1^2$ receives signals from neurons $z_1^1$, $z_2^1$, $\cdots$, $z_{n_1}^1$. And because the value of neuron $z_1^2$ is needed to calculate the intermediate vector $\mathbf{z}^3$, we also say the neuron $z_1^2$ provides the signal to neurons $z_1^3$, $z_2^3$, $\cdots$, $z_{n_3}^3$ for further signal processing.
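The expansion above can be checked numerically. The following is a minimal sketch, assuming NumPy, made-up sizes ($n_1=3$, $n_2=2$), random values, and `tanh` standing in for the activation function; none of these choices come from the lesson. It computes the single neuron $z_1^2$ with the explicit sum and compares it against the first element of the vector-matrix result.

```python
import numpy as np

rng = np.random.default_rng(0)

n1, n2 = 3, 2                      # hypothetical layer sizes
W1 = rng.normal(size=(n1, n2))     # W^1: one column of weights per neuron in z^2
b1 = rng.normal(size=n2)           # b^1
z1 = rng.normal(size=n1)           # values of the neurons in z^1
activation = np.tanh               # assumed activation function

# explicit sum for the first neuron of z^2 (index 0 in Python)
z2_first = activation(sum(W1[i, 0] * z1[i] for i in range(n1)) + b1[0])

# vector-matrix form for the whole layer
z2 = activation(W1.T @ z1 + b1)

print(np.isclose(z2_first, z2[0]))
```

The first column of `W1` holds the weights of all signals that the first neuron of $\mathbf{z}^2$ receives, which is why the transpose appears in the vector-matrix form.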

##### Layers

Neurons that are independent of each other are put together and become a layer. For example, to calculate $z_1^2$ from the input $\mathbf{x}$, we don't need the value of $z_2^2$, and vice versa. We say $z_1^2$ and $z_2^2$ belong to the same layer. In fact, if we look at the vector-matrix format of the model, we can see that each vector is a layer, e.g., $\mathbf{z}^1$ is a layer because all elements in the vector $\mathbf{z}^1$ are independent of each other. The same applies to $\mathbf{z}^2$, $\dots$, $\hat{\mathbf{y}}$.

We call the vector $\hat{\mathbf{y}}$ the output layer because it's the final output of a neural network. Some people extend the naming system to the input vector $\mathbf{x}$ and call it an input layer. 

##### Hidden layers

Intermediate vectors $\mathbf{z}^1$, $\dots$, $\mathbf{z}^L$ are called hidden layers because we don't see them if we treat a neural network as a black box. The variable $L$ hence denotes the number of hidden layers.
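To connect layers and neurons to array shapes, here is a small sketch assuming NumPy and a hypothetical list `sizes` that holds the number of neurons in the input, each hidden layer, and the output. The names and numbers are illustrative only.

```python
import numpy as np

# hypothetical sizes: input x, hidden layers z^1 and z^2, output y_hat
sizes = [4, 12, 8, 1]

rng = np.random.default_rng(0)
# W^l has shape (n_l, n_{l+1}), so (W^l)^T maps one layer to the next
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

# W^0, ..., W^L are L + 1 matrices, so L is one less than their count
num_hidden_layers = len(weights) - 1
# each bias vector has one entry per neuron in the layer it feeds
neurons_per_hidden_layer = [b.size for b in biases[:-1]]

print(num_hidden_layers)           # 2
print(neurons_per_hidden_layer)    # [12, 8]
```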

##### Forward propagation

Forward propagation means the procedure of calculating the output $\hat{\mathbf{y}}$: starting from the input $\mathbf{x}$, we calculate $\mathbf{z}^1$, then $\mathbf{z}^2$, and so on, until we reach $\hat{\mathbf{y}}$. The word propagation may come from the propagation of signals. Each layer of neurons receives signals from the previous layer, does some processing, and then passes the processed signal to the next layer.
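Forward propagation can be sketched as a simple loop over the layers. The function `forward` below is a hypothetical helper, not code from the lesson, and the layer sizes and activation functions are assumptions chosen for illustration.

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Return y_hat and the list of all layers [x, z^1, ..., z^L, y_hat]."""
    layers = [x]
    for W, b, sigma in zip(weights, biases, activations):
        # each layer processes the signal coming from the previous layer
        layers.append(sigma(W.T @ layers[-1] + b))
    return layers[-1], layers

def identity(a):
    return a

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]               # hypothetical: input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]
activations = [np.tanh, np.tanh, identity]

x = rng.normal(size=sizes[0])
y_hat, layers = forward(x, weights, biases, activations)
print([layer.shape for layer in layers])
```

Keeping every intermediate layer in a list is a design choice: the output alone is enough for prediction, but the intermediate values are what gradient calculations reuse later.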

##### Backward propagation

Backward propagation is a technique for calculating the gradients with respect to the model parameters. It is an application of the chain rule from calculus. We didn't see it in our teaching material because we rely on `autograd` to do the job. Nowadays, it's more common to use third-party libraries for calculating gradients, and backward propagation is usually how these libraries work under the hood.
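To give a feel for what such libraries do, here is a hand-written sketch of backward propagation for a tiny network with one hidden layer. It assumes NumPy, a `tanh` hidden activation, an identity output activation, and a squared-error loss; these are illustrative choices, not the lesson's setup. The result is compared against a central finite-difference approximation for one weight.

```python
import numpy as np

def loss_fn(W0, b0, W1, b1, x, y):
    # forward propagation: x -> z^1 -> y_hat -> scalar loss
    z1 = np.tanh(W0.T @ x + b0)
    y_hat = W1.T @ z1 + b1
    return 0.5 * np.sum((y_hat - y) ** 2)

def backward(W0, b0, W1, b1, x, y):
    # forward pass, keeping the intermediate values the chain rule needs
    z1 = np.tanh(W0.T @ x + b0)
    y_hat = W1.T @ z1 + b1
    # backward pass: apply the chain rule from the loss back toward the input
    d_y_hat = y_hat - y                 # d(loss)/d(y_hat)
    grad_W1 = np.outer(z1, d_y_hat)
    grad_b1 = d_y_hat
    d_z1 = W1 @ d_y_hat                 # signal passed back to the hidden layer
    d_a0 = d_z1 * (1.0 - z1 ** 2)       # tanh'(a) = 1 - tanh(a)^2
    grad_W0 = np.outer(x, d_a0)
    grad_b0 = d_a0
    return grad_W0, grad_b0, grad_W1, grad_b1

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W0, b0 = rng.normal(size=(3, 4)), rng.normal(size=4)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=2)

grad_W0, grad_b0, grad_W1, grad_b1 = backward(W0, b0, W1, b1, x, y)

# central finite-difference check for a single entry of W^0
eps = 1e-6
W0_plus, W0_minus = W0.copy(), W0.copy()
W0_plus[0, 0] += eps
W0_minus[0, 0] -= eps
fd = (loss_fn(W0_plus, b0, W1, b1, x, y)
      - loss_fn(W0_minus, b0, W1, b1, x, y)) / (2 * eps)
print(np.isclose(fd, grad_W0[0, 0], rtol=1e-4, atol=1e-7))
```

Note how the gradients are computed in the reverse order of the forward pass, which is where the name comes from.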

##### Training and learning

You may notice we use the term *optimization* most of the time, though the terms *training* and *learning* sometimes slip into the teaching material. They all mean the same thing in machine learning: finding the set of parameters that makes a model best fit the given data. In other words, if you have taken any courses in numerical methods or numerical analysis, they are synonyms of *model fitting*.

In lesson 5, we used the example of take-home exercises, quizzes, and final exams to explain the concepts of training, validation, and test datasets. We can see the optimization of a model is indeed similar to training a student to some degree.

##### Learning rate

The *step size* in gradient-descent-based optimization methods is called the *learning rate*.
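As a minimal sketch of where the learning rate enters, consider plain gradient descent on the toy loss $(w-3)^2$. The function `gradient_descent` and the two learning-rate values are made up for illustration.

```python
def gradient_descent(grad, w0, learning_rate, steps):
    """Plain gradient descent: repeatedly step against the gradient."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad(w):
    # gradient of the toy loss (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)

w_small = gradient_descent(grad, w0=0.0, learning_rate=0.1, steps=50)
w_large = gradient_descent(grad, w0=0.0, learning_rate=1.1, steps=50)

print(w_small)   # close to 3
print(w_large)   # far from 3: the step is too large and the iteration diverges
```

For this particular loss, each step multiplies the distance to the minimum by $1 - 2\alpha$, where $\alpha$ is the learning rate, so the iteration converges only when $|1 - 2\alpha| < 1$.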

##### Hyperparameters

Matrices and vectors, $W^0$, $\mathbf{b}^0$, $W^1$, $\mathbf{b}^1$, etc., are called *model parameters* or simply *parameters*. However, we also have other parameters such as the coefficient used in gradient descent (i.e., learning rate), the coefficients for regularization, etc. These parameters are not part of a model and are called hyperparameters.
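The distinction can be made concrete with a short sketch. It assumes NumPy, synthetic data, and a linear model with L2 regularization fitted by gradient descent; the dictionary `hyperparams` and its values are hypothetical. The optimizer updates the parameters `w` and `b`, while the hyperparameters stay fixed throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # synthetic data from known coefficients

# hyperparameters: chosen by us, never updated by the optimizer
hyperparams = {"learning_rate": 0.1, "l2_coefficient": 0.01, "steps": 200}

# model parameters: updated by the optimizer
w = np.zeros(3)
b = 0.0

for _ in range(hyperparams["steps"]):
    residual = X @ w + b - y
    # gradient of 0.5 * mean(residual^2) + 0.5 * l2_coefficient * ||w||^2
    grad_w = X.T @ residual / len(y) + hyperparams["l2_coefficient"] * w
    grad_b = residual.mean()
    w = w - hyperparams["learning_rate"] * grad_w
    b = b - hyperparams["learning_rate"] * grad_b

print(w, b)
```

Changing a hyperparameter means running the whole optimization again, which is one reason hyperparameters are usually tuned against a validation dataset rather than inside the optimization loop.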
