# Backpropagation in Practice

## 1. Unrolling Parameters

Suppose we have a neural network with 4 layers, where $s_l$ denotes the number of units in layer $l$ (not counting the bias unit), with $s_1=s_2=s_3=10$ and $s_4=1$.

Then the weight matrices $\Theta^{(l)}$ and the gradient matrices $D^{(l)}$, each of size $s_{l+1} \times (s_l + 1)$, will have the following dimensions:

- $\Theta^{(1)}, D^{(1)}\in\mathbb{R}^{10 \times 11}$
- $\Theta^{(2)}, D^{(2)}\in\mathbb{R}^{10 \times 11}$
- $\Theta^{(3)}, D^{(3)}\in\mathbb{R}^{1 \times 11}$

To work with these in a numerical optimizer, we have to unroll them into vectors:

$$
\Theta_{vec} = \left[
\begin{array}{c}
vec(\Theta^{(1)}) \\
vec(\Theta^{(2)}) \\
vec(\Theta^{(3)})
\end{array}
\right] \in\mathbb{R}^{231} \qquad \text{and} \qquad D_{vec} = \left[
\begin{array}{c}
vec(D^{(1)}) \\
vec(D^{(2)}) \\
vec(D^{(3)})
\end{array}
\right]\in\mathbb{R}^{231}
$$

where the $vec(A)\in\mathbb{R}^{mn}$ operation on a matrix $A\in\mathbb{R}^{m \times n}$ consists of vertically stacking the columns of $A$. Here the total length is $110 + 110 + 11 = 231$.

Finally, once the numerical optimizer converges, we can reshape the resulting vector back into the original matrices.
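
The unrolling and reshaping steps can be sketched in NumPy as follows. This is a minimal illustration, not the course's own code: the helper names `unroll` and `reshape_params` are hypothetical, and the random matrices only stand in for trained weights. Passing `order="F"` makes NumPy stack columns, which matches the $vec$ operation defined above.

```python
import numpy as np

# Layer sizes from the example above: s_1 = s_2 = s_3 = 10, s_4 = 1.
layer_sizes = [10, 10, 10, 1]

# Theta^(l) has shape (s_{l+1}, s_l + 1); the extra column holds the bias weights.
shapes = [(s_out, s_in + 1)
          for s_in, s_out in zip(layer_sizes[:-1], layer_sizes[1:])]


def unroll(matrices):
    """Stack the columns of each matrix (column-major order) into one vector."""
    return np.concatenate([M.ravel(order="F") for M in matrices])


def reshape_params(vec, shapes):
    """Inverse of `unroll`: slice the vector and rebuild each matrix."""
    matrices, start = [], 0
    for rows, cols in shapes:
        size = rows * cols
        block = vec[start:start + size]
        matrices.append(block.reshape((rows, cols), order="F"))
        start += size
    return matrices


# Placeholder weights, used only to demonstrate the round trip.
rng = np.random.default_rng(0)
thetas = [rng.standard_normal(shape) for shape in shapes]

theta_vec = unroll(thetas)                    # what the optimizer works with
restored = reshape_params(theta_vec, shapes)  # what forward/backprop works with
```

The same pair of helpers would apply to the gradient matrices $D^{(l)}$, as long as both vectors use the same ordering.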

## 2. Gradient Checking

We can numerically approximate the gradient in order to check that our backpropagation algorithm is working properly. To do this, consider the unrolled version of the parameters $\Theta_{vec} = [\theta_1, \dots, \theta_n]$.

Then, the $i$-th partial derivative of the cost function can be approximated by the two-sided difference:

$$
\frac{\partial}{\partial \theta_i} J(\Theta_{vec}) \approx \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots, \theta_n) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots, \theta_n)}{2\varepsilon}.
$$

For a sufficiently small $\varepsilon$ (for example, $\varepsilon = 10^{-4}$), this approximation should be very close to the $i$-th component of $D_{vec}$, the gradient computed by backpropagation.
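
A minimal sketch of this check in NumPy is shown below. The function `numerical_gradient` and the toy cost are assumptions made for illustration: in practice `J` would be the network's cost as a function of the unrolled parameters, and the exact gradient would be $D_{vec}$ from backpropagation. The default `eps=1e-4` is a common choice rather than a required value.

```python
import numpy as np


def numerical_gradient(J, theta_vec, eps=1e-4):
    """Approximate the gradient of J at theta_vec with two-sided differences."""
    grad = np.zeros_like(theta_vec, dtype=float)
    for i in range(theta_vec.size):
        # Perturb only the i-th parameter, once in each direction.
        step = np.zeros_like(theta_vec, dtype=float)
        step[i] = eps
        grad[i] = (J(theta_vec + step) - J(theta_vec - step)) / (2 * eps)
    return grad


# Hypothetical stand-in for the network cost, chosen because its gradient
# is known in closed form.
def toy_cost(theta):
    return np.sum(theta ** 2) + np.sum(np.sin(theta))


# Plays the role of D_vec (the gradient that backpropagation would return).
def toy_gradient(theta):
    return 2 * theta + np.cos(theta)


theta_vec = np.linspace(-1.0, 1.0, 5)
approx = numerical_gradient(toy_cost, theta_vec)
exact = toy_gradient(theta_vec)

# Relative difference between the two gradients; a small value suggests
# the analytic gradient is implemented correctly.
rel_error = np.linalg.norm(approx - exact) / np.linalg.norm(approx + exact)
```

Each component requires two evaluations of the cost, so this check is much slower than backpropagation and is usually run only once to validate the implementation, then turned off for training.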

## 3. Random Initialization

How should we select the initial value of $\Theta_{vec}$?

For logistic regression and linear regression we always initialized the parameters to zero. In those algorithms the initialization does not matter in theory, since the cost function is convex and any starting point leads to the global minimum.

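On the other hand, the cost function for neural networks is **not convex** in general. Moreover, initializing all parameters to zero often leads to poor performance: every hidden unit in a layer then computes the same function and receives the same update, so the units never become different from one another (non-identifiability of the parameters).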

One recommendation is to initialize each $\Theta_{ij}^{(l)}$ to a random value drawn uniformly from $[-\epsilon, \epsilon]$. This $\epsilon$ is unrelated to the $\varepsilon$ used in gradient checking.
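
As a rough sketch, this initialization could be written in NumPy as below. The helper `random_init` is hypothetical, and the default `epsilon_init=0.12` is only an illustrative value; the text above does not prescribe a specific $\epsilon$. Note that `Generator.uniform` samples from the half-open interval $[-\epsilon, \epsilon)$, which makes no practical difference here.

```python
import numpy as np


def random_init(shape, epsilon_init=0.12, rng=None):
    """Return a matrix with entries drawn uniformly from [-epsilon_init, epsilon_init)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-epsilon_init, epsilon_init, size=shape)


# Shapes of Theta^(1), Theta^(2), Theta^(3) for the 4-layer example network.
shapes = [(10, 11), (10, 11), (1, 11)]

# A fixed seed is used here only to make the example reproducible.
rng = np.random.default_rng(42)
thetas = [random_init(shape, rng=rng) for shape in shapes]
```

Because the entries differ from one another, the hidden units start from different weights and can learn different features.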

<footer id="attribution" style="float:right; color:#808080; background:#fff;">
Created with Jupyter by Esteban Jiménez Rodríguez. Based on the content of the Machine Learning course offered through coursera by Prof. Andrew Ng.
</footer>