# Chapter 10 - Exercise Solutions - Conceptual

## 1

Consider a neural network with two hidden layers: *p* = 4 input units, 2 units in the first hidden layer, 3 units in the second hidden layer, and a single output.

**(a)** Draw a picture of the network, similar to Figures 10.1 or 10.4.

![image](Ch10_NN_4x2x3x1.png)

**(b)** Write out an expression for f(X), assuming ReLU activation functions. Be as explicit as you can!

The first hidden layer:

$$ A_k^{(1)} = h_k^{(1)} (X) $$

$$ A_k^{(1)} = g \left( w_{k0}^{(1)} + \sum_{j=1}^{4} w_{kj}^{(1)} X_j \right) $$

for $ k = 1, 2 $

$ \mathbf{W_1} $ has dimensions: (1 + 4) $ \times $ 2, i.e. **10** elements

[`1 + num_inputs` $\times$ `num_features_l1`]

The second hidden layer:

$$ A_l^{(2)} = h_l^{(2)} (X) $$

$$ A_l^{(2)} = g \left( w_{l0}^{(2)} + \sum_{k=1}^{2} w_{lk}^{(2)} A_k^{(1)} \right) $$

for $ l = 1, 2, 3 $ (differentiated from $ k $ for clarity)

$ \mathbf{W_2} $ has dimensions: (1 + 2) $ \times $ 3, i.e. **9** elements

[`1 + num_features_l1` $\times$ `num_features_l2`]

$g(z)$ is the ReLU (rectified linear unit) function, defined as:

$$
g(z) =
\begin{cases} 
      0 & \text{if } z < 0 \\
      z & \text{if } z \geq 0 
   \end{cases}
$$

The output layer:

$$ f_0(X) = \beta_0 + \sum_{l=1}^{3} \beta_l h_l^{(2)}(X) $$

$ \mathbf{B} $ has dimensions: (1 + 3) $ \times $ 1, i.e. **4** elements

Combining the layers:

$$ f_0(X) = \beta_0 + \sum_{l=1}^{3} \beta_l \; \; g \left[ w_{l0}^{(2)} + \sum_{k=1}^{2} w_{lk}^{(2)} A_k^{(1)} \right] $$

$$ f_0(X) = \beta_0 + \sum_{l=1}^{3} \beta_l \; \; g \left[ w_{l0}^{(2)} + \sum_{k=1}^{2} w_{lk}^{(2)} \; g \left( w_{k0}^{(1)} + \sum_{j=1}^{4} w_{kj}^{(1)} X_j \right)   \right] $$
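
As a cross-check on the nested expression, the same computation can be sketched in NumPy in matrix form. The function name `forward`, the argument layout (biases stored separately from the weight matrices), and the example weights are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def relu(z):
    """ReLU: elementwise max(0, z)."""
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2, beta, beta0):
    """Evaluate f(X) for the 4-2-3-1 network (hypothetical argument layout).

    W1: (2, 4), b1: (2,)       first hidden layer
    W2: (3, 2), b2: (3,)       second hidden layer
    beta: (3,), beta0: scalar  linear output layer
    """
    a1 = relu(b1 + W1 @ x)    # A^(1), length 2
    a2 = relu(b2 + W2 @ a1)   # A^(2), length 3
    return beta0 + beta @ a2  # no activation on the output

# Illustrative weights chosen so the result is easy to verify by hand.
x = np.array([1.0, 2.0, 3.0, 4.0])
W1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, -1.0, 0.0, 0.0]])
b1 = np.zeros(2)
W2 = np.ones((3, 2))
b2 = np.zeros(3)
beta = np.array([1.0, 1.0, 1.0])
beta0 = 0.5

f_x = forward(x, W1, b1, W2, b2, beta, beta0)
print(f_x)  # 3.5
```

By hand: the first layer gives $g(1) = 1$ and $g(-2) = 0$, each second-layer unit gives $g(1 + 0) = 1$, and the output is $0.5 + 3 \times 1 = 3.5$.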



**(c)** Now plug in some values for the coefficients and write out the value of f(X).

```python
import numpy as np

def relu(z):
    """ReLU activation: elementwise max(0, z)."""
    return np.maximum(0, z)

np.random.seed(0)
X = np.random.uniform(-10, 10, 4)  # input vector, p = 4

# First hidden layer: each unit has 4 weights + 1 bias
np.random.seed(1)
A1k1 = np.random.uniform(-1, 0, 4)  # weights for k = 1
A11 = np.random.uniform(-1, 1) + X.dot(A1k1)  # bias + weighted sum

np.random.seed(2)
A1k2 = np.random.uniform(-1, 0, 4)  # weights for k = 2
A12 = np.random.uniform(-1, 1) + X.dot(A1k2)  # bias + weighted sum

A1 = relu(np.array([A11, A12]))  # ReLU layer

# Second hidden layer: each unit has 2 weights + 1 bias
np.random.seed(3)
A2k1 = np.random.uniform(-1, 0, 2)  # weights for l = 1
A21 = np.random.uniform(-1, 1) + A1.dot(A2k1)

np.random.seed(4)
A2k2 = np.random.uniform(-1, 0, 2)  # weights for l = 2
A22 = np.random.uniform(-1, 1) + A1.dot(A2k2)

np.random.seed(5)
A2k3 = np.random.uniform(-1, 0, 2)  # weights for l = 3
A23 = np.random.uniform(-1, 1) + A1.dot(A2k3)

A2 = relu(np.array([A21, A22, A23]))  # ReLU layer

# Output layer: 3 weights + 1 bias, no ReLU
np.random.seed(6)
B1 = np.random.uniform(-1, 0, 3)  # length 3 for size of prev layer
B = np.random.uniform(-1, 1) + A2.dot(B1)
print(B)
```

The printed value is $f(X)$ for these randomly drawn coefficients.


**(d)** How many parameters are there?

There are 23 parameters in total:

* First hidden layer: (1 + 4) $ \times $ 2, i.e. **10** elements

* Second hidden layer: (1 + 2) $ \times $ 3, i.e. **9** elements

* Output layer: (1 + 3) $ \times $ 1, i.e. **4** elements

10 + 9 + 4 = 23
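
The same count can be reproduced programmatically. The sketch below assumes a fully connected network with one bias per unit; the helper name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights plus biases for a fully connected network.

    layer_sizes lists the number of units per layer, inputs first.
    Each layer contributes (1 + n_in) * n_out parameters.
    """
    return sum((1 + n_in) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_params = count_parameters([4, 2, 3, 1])
print(n_params)  # 23
```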

## 2

Consider the *softmax* function in (10.13) (see also (4.13) on page 145) for modeling multinomial probabilities.

**(a)** In (10.13), show that if we add a constant $c$ to each of the $Z_l$, then the probability is unchanged.

Original formula:

$$ f_m(X) = \mathrm{Pr}(Y = m | X) = \frac{e^{Z_m}}{\sum_{l=0}^{9} e^{Z_l} }$$

Adding $c$ to each $Z_l$:

$$ = \frac{e^{Z_m + c}}{\sum_{l=0}^{9} e^{Z_l + c} }$$

$$ = \frac{e^{Z_m} e^{c}}{\sum_{l=0}^{9} e^{Z_l} e^{c} }$$

$$ = \frac{ e^{c} e^{Z_m}}{ e^{c} \sum_{l=0}^{9} e^{Z_l} }$$

$$ f_m(X) = \mathrm{Pr}(Y = m | X) = \frac{ e^{Z_m}}{ \sum_{l=0}^{9} e^{Z_l} }$$
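
The cancellation of $e^{c}$ can also be checked numerically. This is a minimal sketch; the scores `z` and the constant `c` are arbitrary illustrative values, and `softmax` is a plain helper defined here rather than a library function.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D array of scores."""
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=10)  # ten scores Z_0, ..., Z_9
c = 3.7                  # arbitrary constant added to every score

p_original = softmax(z)
p_shifted = softmax(z + c)
print(np.allclose(p_original, p_shifted))  # True
```

This invariance is also why implementations commonly subtract $\max_l Z_l$ from every score before exponentiating: the probabilities are unchanged, but overflow in the exponential is avoided.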




**(b)** In (4.13), show that if we add constants $c_j$, $j = 0, 1, ..., p$, to each of the corresponding coefficients for each of the classes, then the predictions at any new point $x$ are unchanged.

Original formula:

$$
\mathrm{Pr}(Y = k | X = x) = \frac{ e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p} }{\sum_{l=1}^K e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p} }
$$

Adding $c_j$ to each of the coefficients $ \beta_{l0}, ..., \beta_{lp} $ for every class $l$:

$$
= \frac{ e^{\beta_{k0} + c_0 + (\beta_{k1} + c_1) x_1 + \cdots + (\beta_{kp} + c_p) x_p} }{\sum_{l=1}^K e^{\beta_{l0} + c_0 + (\beta_{l1} + c_1) x_1 + \cdots + (\beta_{lp} + c_p) x_p} }
$$

$$
= \frac{ e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p + c_0 + c_1 x_1 + \cdots + c_p x_p} }{\sum_{l=1}^K e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p + c_0 + c_1 x_1 + \cdots + c_p x_p} }
$$

$$
= \frac{ e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p} e^{c_0 + c_1 x_1 + \cdots + c_p x_p} }{\sum_{l=1}^K e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p} e^{c_0 + c_1 x_1 + \cdots + c_p x_p} }
$$

The factor $e^{c_0 + c_1 x_1 + \cdots + c_p x_p}$ does not depend on the class index $l$, so it can be pulled out of the sum:

$$
= \frac{ e^{c_0 + c_1 x_1 + \cdots + c_p x_p} e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p} }{e^{c_0 + c_1 x_1 + \cdots + c_p x_p} \sum_{l=1}^K e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p} }
$$

$$
\mathrm{Pr}(Y = k | X = x) = \frac{ e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p} }{\sum_{l=1}^K e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p} }
$$

This shows that the softmax function is *over-parametrized*. However, regularization and SGD typically constrain the solutions so that this is not a problem.

* In other words, there are redundant parameters: infinitely many sets of parameters lead to the same probability predictions, i.e. there are excess degrees of freedom.
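
A quick numerical check of this redundancy, as a sketch: the coefficient matrix `B`, the shifts `c`, and the point `x` are random illustrative values, and `class_probabilities` is a hypothetical helper implementing (4.13) directly.

```python
import numpy as np

def class_probabilities(B, x):
    """Multinomial logistic probabilities as in (4.13).

    B has shape (K, p + 1); row k holds (beta_k0, ..., beta_kp).
    """
    scores = B[:, 0] + B[:, 1:] @ x
    e = np.exp(scores)
    return e / e.sum()

rng = np.random.default_rng(1)
K, p = 3, 4
B = rng.normal(size=(K, p + 1))
c = rng.normal(size=p + 1)  # c_0, ..., c_p, shared across all classes
x = rng.normal(size=p)

p_original = class_probabilities(B, x)
p_shifted = class_probabilities(B + c, x)  # broadcasting adds c to every row
print(np.allclose(p_original, p_shifted))  # True
```

Two different coefficient matrices, `B` and `B + c`, give identical predictions, which is exactly the excess degree of freedom described above.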


## 3

Show that the negative multinomial log-likelihood (10.14) is equivalent to the negative log of the likelihood expression (4.5) when there are M = 2 classes.


Negative multinomial log-likelihood for $M$ classes, indexed $m = 0, ..., M - 1$:

$$
- \sum_{i=1}^n \sum_{m=0}^{M-1} y_{im} \mathrm{log} \left( f_m(x_i) \right)
$$

For $ M = 2, m = 0, 1$

$$
- \sum_{i=1}^n \sum_{m=0}^1 y_{im} \mathrm{log} \left( f_m(x_i) \right)
$$