**Exercise:** There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01.

![image.png](attachment:image.png)

**Solution:** 
* ``The weights that start from the 0th neuron in the old output layer: 6[0, 0, 0,-1]``
* ``The weights that start from the 1st neuron in the old output layer: 6[0, 0, 0, 1]``
* ``The weights that start from the 2nd neuron in the old output layer: 6[0, 0, 1, 0]``
* ``The weights that start from the 3rd neuron in the old output layer: 6[0, 0, 1, 1]``
* ``The weights that start from the 4th neuron in the old output layer: 6[0, 1, 0, 0]``
* ``The weights that start from the 5th neuron in the old output layer: 6[0, 1, 0, 1]``
* ``The weights that start from the 6th neuron in the old output layer: 6[0, 1, 1, 0]``
* ``The weights that start from the 7th neuron in the old output layer: 6[0, 1, 1, 1]``
* ``The weights that start from the 8th neuron in the old output layer: 6[1, 0, 0, 0]``
* ``The weights that start from the 9th neuron in the old output layer: 6[1, 0, 0, 1]``

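(*Why 6? See the graph of the sigmoid (aka logistic) function: $\sigma(6 \times 0.99) \approx 0.997$, so a bit that should be on is driven close to 1.*)

The bias of all the neurons in the new output layer is 0. With zero bias, a bit that should be off receives a weighted input of at most about $0.3$, so its activation stays below about $0.58$ rather than approaching 0. The bits should therefore be read with a threshold well above 0.5 (say 0.9).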
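The weights above can be checked numerically. The following is a minimal sketch, not part of the original exercise: it assumes each weight row follows the digit's 4-bit binary pattern (most significant bit first, with the $-1$ entry for digit 0 as listed), a zero bias, and a read-out threshold of 0.9. The helper names and the value 0.009 used for incorrect activations are illustrative choices.

```python
import math

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

SCALE = 6.0  # common factor in front of every weight row

# Row d: weights leaving old-output neuron d, ordered [8s, 4s, 2s, 1s] bit.
WEIGHTS = [
    [0, 0, 0, -1],  # 0
    [0, 0, 0, 1],   # 1
    [0, 0, 1, 0],   # 2
    [0, 0, 1, 1],   # 3
    [0, 1, 0, 0],   # 4
    [0, 1, 0, 1],   # 5
    [0, 1, 1, 0],   # 6
    [0, 1, 1, 1],   # 7
    [1, 0, 0, 0],   # 8
    [1, 0, 0, 1],   # 9
]
BIAS = [0.0, 0.0, 0.0, 0.0]

def binary_layer(activations):
    """Activations of the four new output neurons."""
    outputs = []
    for bit in range(4):
        z = BIAS[bit] + sum(
            SCALE * WEIGHTS[d][bit] * activations[d] for d in range(10)
        )
        outputs.append(sigmoid(z))
    return outputs

def decode(outputs, threshold=0.9):
    """Threshold the four outputs and read them as a binary number."""
    bits = [1 if o > threshold else 0 for o in outputs]
    return 8 * bits[0] + 4 * bits[1] + 2 * bits[2] + bits[3]

def old_output(digit, correct=0.99, incorrect=0.009):
    """Hypothetical old-output activations within the stated bounds."""
    acts = [incorrect] * 10
    acts[digit] = correct
    return acts

decoded = [decode(binary_layer(old_output(d))) for d in range(10)]
print(decoded)
```

With these assumptions every digit is recovered. Lowering the threshold to 0.5 would break the read-out, because the "off" bits sit slightly above 0.5 when the bias is 0.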

---

**Exercise:** Prove the assertion: the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon \big/ \lVert \nabla C \rVert$ is determined by the size constraint $\lVert \Delta v \rVert= \epsilon$. *Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.*

**Solution:**

$$
\begin{align}
\lvert \Delta C \rvert \approx \lvert \nabla C \cdot \Delta v \rvert & \leq \lVert \nabla C \rVert \lVert \Delta v \rVert \;\;\; \text{(Cauchy-Schwarz)}
\\
& = \epsilon \lVert \nabla C \rVert \;\;\; \text{(size constraint } \lVert \Delta v \rVert = \epsilon \text{)}
\\
& = \frac{\epsilon \lVert \nabla C \rVert^2}{\lVert \nabla C \rVert}
\\
& = \frac{\epsilon}{\lVert \nabla C \rVert} \times (\nabla C \cdot \nabla C)
\\
& = \eta \; (\nabla C \cdot \nabla C) = (\eta\nabla C) \cdot \nabla C
\end{align}
$$

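Because we want to decrease $C$, we are interested in the lower end of this bound. For every $\Delta v$ with $\lVert \Delta v \rVert = \epsilon$:
$$
\nabla C \cdot \Delta v \geq (-\eta \nabla C) \cdot \nabla C = -\epsilon \lVert \nabla C \rVert
$$
Equality holds at $\Delta v = - \eta \nabla C$, which satisfies the size constraint because $\lVert -\eta \nabla C \rVert = \eta \lVert \nabla C \rVert = \epsilon$. In other words, $\Delta C \approx \nabla C \cdot \Delta v$ is minimized when $\Delta v = -\eta \nabla C$.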
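As a numerical sanity check of this result, the sketch below compares $\nabla C \cdot \Delta v$ for the step $-\eta \nabla C$ against many random steps of the same length $\epsilon$. The gradient vector, the value of $\epsilon$, and the helper names are hypothetical values chosen for illustration; random sampling only illustrates the bound and does not replace the proof.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def steepest_step(grad, epsilon):
    """Delta v = -eta * grad with eta = epsilon / ||grad||."""
    eta = epsilon / norm(grad)
    return [-eta * g for g in grad]

def random_step(dim, epsilon, rng):
    """A step of length epsilon in a random direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = epsilon / norm(direction)
    return [scale * d for d in direction]

grad = [3.0, -4.0, 12.0]  # hypothetical gradient, ||grad|| = 13
epsilon = 0.1

best = dot(grad, steepest_step(grad, epsilon))  # equals -epsilon * ||grad||

rng = random.Random(0)
samples = [dot(grad, random_step(len(grad), epsilon, rng)) for _ in range(1000)]

print("steepest step:", best)
print("smallest random value:", min(samples))
```

No sampled step produces a value of $\nabla C \cdot \Delta v$ below the one obtained from $\Delta v = -\eta \nabla C$, in agreement with the Cauchy-Schwarz bound.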

**Exercise:** What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

**Solution:**
<div>
    <img src="attachment:image.png" width="300"/>
</div>

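In one dimension the gradient is just the derivative, $\nabla C = \frac{dC}{dv}$, i.e. the slope of the curve, and the constraint $\lVert \Delta v \rVert = \epsilon$ leaves only two candidates: $\Delta v = +\epsilon$ or $\Delta v = -\epsilon$. Gradient descent picks the one whose sign is opposite to the slope, so it always steps downhill along the curve. As an example, take $C = v^3 + 2v^2$, the function plotted above. We have $\nabla C = \frac{\partial C}{\partial v} = \frac{dC}{dv} = 3v^2+4v$. At $v = -1$, $\nabla C = 3\times(-1)^2+4\times(-1)=-1$. Picking $\epsilon = 0.2$ gives $\eta = \frac{\epsilon}{\lVert \nabla C \rVert} = \frac{0.2}{\sqrt{(-1)^2}}=0.2$. According to the gradient descent rule, the optimal step is $\Delta v = -\eta \nabla C = (-0.2) \times (-1) = 0.2$. This means that moving to the right by $0.2$ does the most to immediately decrease $C$, which the graph confirms.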
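The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch of the single step worked out above; the function names are illustrative.

```python
def cost(v):
    """C(v) = v^3 + 2 v^2."""
    return v**3 + 2 * v**2

def cost_derivative(v):
    """dC/dv = 3 v^2 + 4 v."""
    return 3 * v**2 + 4 * v

v = -1.0
epsilon = 0.2

slope = cost_derivative(v)   # the one-dimensional gradient
eta = epsilon / abs(slope)   # ||grad C|| is |dC/dv| in one dimension
delta_v = -eta * slope       # step of length epsilon against the slope

downhill = cost(v + delta_v)  # cost after the gradient descent step
uphill = cost(v - delta_v)    # cost after the opposite step, for comparison

print(delta_v, cost(v), downhill, uphill)
```

The step to the right lowers the cost, while the step of the same length to the left raises it.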

---

**Exercise:** An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w'_k=w_k-\eta \partial C_x/\partial w_k$ and $b_l \rightarrow b'_l=b_l-\eta \partial C_x/ \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

**Solution:**
* Advantage: Each update needs the gradient for only one training input, so the estimate of $\nabla C$ is as cheap as possible to compute and the weights and biases are updated far more often per pass through the training data.
* Disadvantage: $\nabla C_x$ of the one training input is likely to be a poorer estimate of the true $\nabla C$ compared with the estimate based on the mini-batch of 20 training examples.
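The disadvantage can be illustrated on a toy problem. The sketch below assumes a hypothetical one-parameter cost $C_x = \frac{1}{2}(w - x)^2$ and synthetic Gaussian training inputs (neither comes from the exercise), and compares how much the gradient estimate fluctuates for batch sizes 1 and 20.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical training inputs for the toy cost C_x = 0.5 * (w - x)^2.
data = [rng.gauss(2.0, 1.0) for _ in range(10000)]
w = 0.0

def grad_single(w, x):
    """dC_x/dw for the toy cost."""
    return w - x

def grad_estimate(w, batch):
    """Average of the per-input gradients over a batch."""
    return sum(grad_single(w, x) for x in batch) / len(batch)

true_grad = grad_estimate(w, data)  # gradient over the full training set

# Repeatedly draw batches and record the resulting gradient estimates.
online = [grad_estimate(w, rng.sample(data, 1)) for _ in range(2000)]
mini = [grad_estimate(w, rng.sample(data, 20)) for _ in range(2000)]

online_spread = statistics.pstdev(online)
mini_spread = statistics.pstdev(mini)

print("true gradient:", true_grad)
print("spread with batch size 1:", online_spread)
print("spread with batch size 20:", mini_spread)
```

Both estimators are centred on the true gradient, but the single-input estimates scatter much more widely, which is the noise that online learning has to tolerate in exchange for cheaper updates.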

---

**Exercise:** Write out Equation (22): $a' = \sigma(wa + b)$ in component form, and verify that it gives the same result as the rule (4): $\frac{1}{1+\exp(-\sum_j w_jx_j - b)}$ for computing the output of a sigmoid neuron.

**Solution:** Suppose there are $n$ neurons in the former layer and $m$ neurons in the latter layer:
$$
\begin{align}
\sigma(wa + b) & =
\sigma \Biggl(
\begin{bmatrix}
w_{11} & w_{12} & ... & w_{1n}\\
w_{21} & w_{22} & ... & w_{2n}\\
...    & ...    & ... & ...\\
w_{m1} & w_{m2} & ... & w_{mn}
\end{bmatrix}
\times 
\begin{bmatrix}
a_1\\
a_2\\
...\\
a_n
\end{bmatrix}
+
\begin{bmatrix}
b_1\\
b_2\\
...\\
b_m
\end{bmatrix}
\Biggr)
\\
\\
& =
\sigma \Biggl(
\begin{bmatrix}
a_{1}w_{11} + a_{2}w_{12} + ... + a_{n}w_{1n}\\
a_{1}w_{21} + a_{2}w_{22} + ... + a_{n}w_{2n}\\
... \\
a_{1}w_{m1} + a_{2}w_{m2} + ... + a_{n}w_{mn}
\end{bmatrix}
+
\begin{bmatrix}
b_1\\
b_2\\
...\\
b_m
\end{bmatrix}
\Biggr)
\\
\\
& =
\sigma \Biggl(
\begin{bmatrix}
a_{1}w_{11} + a_{2}w_{12} + ... + a_{n}w_{1n} + b_1\\
a_{1}w_{21} + a_{2}w_{22} + ... + a_{n}w_{2n} + b_2\\
... \\
a_{1}w_{m1} + a_{2}w_{m2} + ... + a_{n}w_{mn} + b_m
\end{bmatrix}
\Biggr)
\\
\\
& =
\begin{bmatrix}
\sigma(a_{1}w_{11} + a_{2}w_{12} + ... + a_{n}w_{1n} + b_1)\\
\sigma(a_{1}w_{21} + a_{2}w_{22} + ... + a_{n}w_{2n} + b_2)\\
... \\
\sigma(a_{1}w_{m1} + a_{2}w_{m2} + ... + a_{n}w_{mn} + b_m)
\end{bmatrix}
\\
\\
& =
\begin{bmatrix}
\dfrac{1}{1+\exp\left(-\sum_{j=1}^{n} w_{1j}a_{j} - b_1\right)}\\
\dfrac{1}{1+\exp\left(-\sum_{j=1}^{n} w_{2j}a_{j} - b_2\right)}\\
... \\
\dfrac{1}{1+\exp\left(-\sum_{j=1}^{n} w_{mj}a_{j} - b_m\right)}
\end{bmatrix}
\end{align}
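$$

Row $i$ of the result is exactly rule (4) applied to the $i$-th neuron of the latter layer, with inputs $x_j = a_j$, weights $w_j = w_{ij}$ and bias $b = b_i$.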
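The same equivalence can be checked numerically. The sketch below uses plain Python lists and a small, made-up layer with $n = 3$ inputs and $m = 2$ outputs; the weight, bias and activation values, as well as the function names, are purely illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_matrix_form(w, a, b):
    """a' = sigma(w a + b): matrix-vector product, add bias, apply sigma."""
    z = [
        sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
        for row, b_i in zip(w, b)
    ]
    return [sigmoid(z_i) for z_i in z]

def neuron_rule_4(weights, inputs, bias):
    """Rule (4): 1 / (1 + exp(-sum_j w_j x_j - b)) for a single neuron."""
    weighted_sum = sum(w_j * x_j for w_j, x_j in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-weighted_sum - bias))

# Hypothetical layer: m = 2 neurons fed by n = 3 activations.
w = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]
b = [0.1, -0.3]
a = [0.2, 0.7, 0.1]

vector_result = layer_matrix_form(w, a, b)
component_result = [neuron_rule_4(w[i], a, b[i]) for i in range(len(w))]

print(vector_result)
print(component_result)
```

Each entry of the vectorized output agrees with the value that rule (4) gives for the corresponding neuron.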