---
title: Network Architecture
math: 
    '\abs': '\left\lvert #1 \right\rvert' 
    '\norm': '\left\lVert #1 \right\rVert'
    '\Set': '\left\{ #1 \right\}'
    '\mc': '\mathcal{#1}'
    '\M': '\boldsymbol{#1}'
    '\R': '\mathsf{#1}'
    '\RM': '\boldsymbol{\mathsf{#1}}'
    '\op': '\operatorname{#1}'
    '\E': '\op{E}'
    '\d': '\mathrm{\mathstrut d}'
---

**DIVE into Deep Learning**
___

In [None]:
%reload_ext jupyter_ai

In [None]:
from util import *

%matplotlib widget

## Mathematical Definition

::::{card}
:header: [open in new tab](https://www.cs.cityu.edu.hk/~ccha23/playground)
:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/playground
:width: 100%
:::
::::

As shown above, a neural network is organized into layers of computation units called *neurons*.

For $\ell\in \{0,\dots,L\}$ and integer $L\geq 1$, let 
- $\M{a}^{(\ell)}$ be the output of the $\ell$-th layer of the neural network, and
- $a^{(\ell)}_i$ be the $i$-th element of $\M{a}^{(\ell)}$. For $\ell>0$, the element is computed from the output $\M{a}^{(\ell-1)}$ of the previous layer.

The $0$-th layer is called the *input layer*, i.e.,
$$\M{a}^{(0)}:=\M{x}.$$

The $L$-th layer $\M{a}^{(L)}$ is called the *output layer*. All other layers are called the *hidden layers*.

**What should be the neural network output?**

The goal is to train a *classifier* that predicts a label $\R{y}$ for an input feature $\RM{x}$:

- Instead of a hard-decision classifier, which is a function $f:\mc{X}\to \mc{Y}$ such that
$f(\RM{x})$ predicts $\R{y}$,

- we train a probabilistic classifier $q_{\R{y}|\RM{x}}$ that estimates $p_{\R{y}|\RM{x}}$, i.e.,

$$
\begin{align}
[q_{\R{y}|\RM{x}}(y|\M{x})]_{y\in \mc{Y}} &:= \M{a}^{(L)}.
\end{align}
$$

For the MNIST dataset, a common goal is to classify a handwritten digit by its digit type. When given a handwritten digit,
- a hard-decision classifier returns a digit type, and
- a probabilistic classifier returns a distribution of the digit types.

**Why train a probabilistic classifier?**

- A probabilistic classifier is more general because it can also give a hard decision

  $$f(\RM{x}):=\arg\max_{y\in \mc{Y}} q_{\R{y}|\RM{x}}(y|\RM{x})$$ 
  by returning the estimated most likely digit type.

- A neural network can model the distribution $p_{\R{y}|\RM{x}}(\cdot|\RM{x})$ better than $\R{y}$ because its output is continuous.
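
As a minimal sketch of the first point, the hard decision can be read off a probability vector with `np.argmax`. The vector `q` below is made up for illustration and is not the output of any trained model.

```python
import numpy as np

# Hypothetical output a^(L) of a probabilistic classifier for one image:
# entry y is the estimated probability q(y|x) that the digit type is y.
q = np.array([0.01, 0.02, 0.05, 0.60, 0.02, 0.20, 0.03, 0.03, 0.02, 0.02])
assert np.isclose(q.sum(), 1.0)  # q is a valid probability vector

# Hard decision f(x): the digit type with the largest estimated probability.
y_hat = int(np.argmax(q))
print(y_hat)  # 3
```

The hard decision keeps only the most likely digit type, whereas `q` also records that digit 5 is a plausible alternative.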

**How to ensure $\M{a}^{(L)}$ is a valid probability vector?**

The soft-max activation function is often used for the last layer:

$$ 
\begin{align}
\sigma^{(L)}\left(\left[\begin{smallmatrix}z^{(L)}_1 \\ \vdots \\ z^{(L)}_k\end{smallmatrix}\right]\right) := \frac{1}{\sum_{i=1}^k e^{z^{(L)}_i}} \left[\begin{smallmatrix}e^{z^{(L)}_1} \\ \vdots \\ e^{z^{(L)}_k}\end{smallmatrix}\right]\tag{soft-max} 
\end{align}$$ 

where $k:=\abs{\mc{Y}}=10$ is the number of distinct class labels.

It follows that: 

$$\sum_{i=1}^k a_i^{(L)}  = 1\kern1em \text{and} \kern1em a_i^{(L)}\geq 0\qquad \forall i\in \{1,\dots,k\}.$$
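
The two properties can be checked numerically. The following is a minimal sketch, assuming a 1-D NumPy array as input; the helper name `softmax` and the sample values are illustrative and not part of `util`. Subtracting the maximum before exponentiation is a common numerical safeguard that is not in the formula above but leaves the result unchanged.

```python
import numpy as np


def softmax(z):
    """Soft-max of a 1-D array z (illustrative helper)."""
    # Shifting by max(z) cancels in the ratio but avoids overflow in np.exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()


z_L = np.array([2.0, -1.0, 0.5])  # made-up pre-activations with k = 3
a_L = softmax(z_L)
print(a_L, a_L.sum())
```

The entries of `a_L` are non-negative and sum to 1 up to floating-point error, so `a_L` can be interpreted as a probability vector.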

**How are the different layers related?**

$$
\begin{align}
\M{a}^{(\ell)}&:=\begin{cases}
\M{x} & \ell=0\\
\sigma^{(\ell)}(\overbrace{\M{W}^{(\ell)}\M{a}^{(\ell-1)}+\M{b}^{(\ell)}}^{\M{z}^{(\ell)}:=})& \ell>0;
\end{cases}\tag{net}
\end{align}
$$

- $\M{W}^{(\ell)}$ is a matrix of weights;
- $\M{b}^{(\ell)}$ is a vector called bias; and
- $\sigma^{(\ell)}$ is a real-valued function called the *activation function*.

The activation function $\sigma^{(\ell)}$ for each of the other layers $1\leq \ell<L$ is often the vectorized version of 
-  sigmoid:  

  $$\sigma_{\text{sigmoid}}(z) = \frac{1}{1+e^{-z}}$$
-  rectified linear unit (ReLU): 

  $$ \sigma_{\text{ReLU}}(z) = \max\{0,z\}. $$
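
The recursion (net) can be sketched in NumPy as follows, assuming ReLU for the hidden layers and soft-max for the output layer. The function `forward`, the layer widths, and the random weights are all made up for illustration; this is not how `keras` implements the computation.

```python
import numpy as np


def relu(z):
    return np.maximum(0, z)


def softmax(z):
    e = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return e / e.sum()


def forward(x, weights, biases):
    """Evaluate (net) with ReLU for layers 1..L-1 and soft-max for layer L."""
    a = x  # a^(0) := x
    L = len(weights)
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ a + b  # z^(ell) := W^(ell) a^(ell-1) + b^(ell)
        a = softmax(z) if ell == L else relu(z)
    return a


rng = np.random.default_rng(0)
sizes = [4, 3, 2]  # hypothetical widths of a^(0), a^(1), a^(2)
weights = [
    rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
a_L = forward(rng.normal(size=sizes[0]), weights, biases)
print(a_L)
```

Each $\M{W}^{(\ell)}$ has one row per neuron in layer $\ell$ and one column per element of $\M{a}^{(\ell-1)}$, which is why the shapes are `(n_out, n_in)`.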

::::{card}
:header: [open in new tab](https://www.youtube.com/embed/aircAruvnKk?start=649&end=695)
:::{iframe} https://www.youtube.com/embed/aircAruvnKk?start=649&end=695
:width: 100%
:::
::::

The following plots the ReLU activation function.

In [None]:
def ReLU(z):
    # Elementwise maximum of 0 and z (0 is broadcast to the shape of z).
    return np.maximum(0, z)


z = np.linspace(-5, 5, 100)
plt.figure(num=1)
plt.plot(z, ReLU(z))
plt.xlim(-5, 5)
plt.title(r"ReLU: $\max\{0,z\}$")
plt.xlabel(r"$z$")
plt.show()

:::{exercise}
:label: ex:1
Complete the vectorized function `sigmoid` using the vectorized exponentiation `np.exp`.
:::

In [None]:
def sigmoid(z):
    # YOUR CODE HERE
    raise NotImplementedError


z = np.linspace(-5, 5, 100)
plt.figure(num=2)
plt.plot(z, sigmoid(z))
plt.xlim(-5, 5)
plt.title(r"Sigmoid function: $\frac{1}{1+e^{-z}}$")
plt.xlabel(r"$z$")
plt.show()

In [None]:
# tests

## Implementation

The following uses the [`keras`](https://keras.io/) library to define the basic neural network architecture.

`keras` runs on top of `tensorflow` and offers a higher-level abstraction to simplify the construction and training of a neural network. ([`tflearn`](https://github.com/tflearn/tflearn) is another library that provides a higher-level API for `tensorflow`.)

In [None]:
def create_simple_model():
    tf.keras.backend.clear_session()  # clear keras cache.
    # See https://github.com/keras-team/keras/issues/7294
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Input(shape=(28, 28, 1)),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(16, activation=tf.keras.activations.relu),
            tf.keras.layers.Dense(16, activation=tf.keras.activations.relu),
            tf.keras.layers.Dense(10, activation=tf.keras.activations.softmax),
        ],
        "Simple_sequential",
    )
    return model


model = create_simple_model()
model.summary()

The above defines [a linear stack](https://www.tensorflow.org/api_docs/python/tf/keras/layers) of [fully-connected layers](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) after [flattening the input](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Flatten). The method `summary` is useful for [debugging in Keras](https://keras.io/examples/keras_recipes/debugging_tips/).
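
As a sanity check on the summary, the number of trainable parameters can be worked out by hand. The sketch below assumes each `Dense` layer keeps its default bias vector, so a layer mapping `n_in` inputs to `n_out` units has `n_out * n_in` weights plus `n_out` biases. The variable names are illustrative.

```python
# Widths of the layer outputs for the model above: 28 * 28 * 1 = 784 values
# after flattening, followed by Dense layers with 16, 16 and 10 units.
layer_sizes = [28 * 28 * 1, 16, 16, 10]

# Parameters of each Dense layer: weight matrix entries plus bias entries.
params_per_layer = [
    n_out * n_in + n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
print(params_per_layer)  # [12560, 272, 170]
print(sum(params_per_layer))  # 13002
```

Under that assumption, these counts should agree with the per-layer and total parameter counts reported by `model.summary()`.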

::::{exercise}
:label: ex:2
Assign to `n_hidden_layers` the number of hidden layers for the above simple sequential model. 

:::{hint}
:class: dropdown
The layer `Flatten` does not count as a hidden layer since it simply reshapes the input without using any trainable parameters. The output layer also does not count as a hidden layer since its output is the output of the neural network, not intermediate (hidden) values that require further processing by the neurons.
:::
::::

In [None]:
# YOUR CODE HERE
raise NotImplementedError
n_hidden_layers

In [None]:
# tests

::::{important}
Remember to release the resources when they are no longer needed. You can release the CPU or GPU memory by choosing `Kernel->Shut Down Kernel`.
::::