# Libraries

In [1]:
import pandas as pd
import numpy as np

# Forward Propagation

$$
z = \sigma(\mathbf{W}^\text{T} \mathbf{x} + \mathbf{b})
$$

In [12]:
# Activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Dimensions
D = 4  # input dimension
M = 3  # number of output units

# Synthetic dataset
x = np.random.rand(D, 1)  # column vector of size D
W = np.random.rand(D, M)  # matrix of size DxM
b = np.random.rand(M, 1)  # column vector of size M

# Vectorized computation
z = sigmoid(W.T @ x + b)  # sigmoid is applied element-wise; z has shape (M, 1)

# Now z is your output vector
z

array([[0.79628357],
       [0.76315865],
       [0.74388715]])
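
The same expression also works for several inputs at once. The following is a minimal sketch, not part of the original notebook: it assumes a hypothetical batch size `N`, stacks the inputs as columns of a matrix `X`, and uses a fixed random seed so the run is repeatable.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)  # fixed seed, an assumption made for repeatability

D, M = 4, 3  # same dimensions as above
N = 5        # hypothetical batch size (not in the original)

X = rng.random((D, N))  # N input vectors stacked as columns
W = rng.random((D, M))  # weight matrix of size DxM
b = rng.random((M, 1))  # bias column vector of size M

# b has shape (M, 1), so NumPy broadcasts it across all N columns
Z = sigmoid(W.T @ X + b)

# Column 0 of Z should match a forward pass on the first input alone
z_first = sigmoid(W.T @ X[:, [0]] + b)

print(Z.shape)
print(np.allclose(Z[:, [0]], z_first))
```

Each column of `Z` is the output for the corresponding column of `X`, so no Python loop over samples is needed.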

# The Geometric Form

A linear boundary takes the form:

$$
\mathbf{W}^\text{T} \mathbf{x} + \mathbf{b}
$$

A (2-layer) neural network boundary takes the form:

$$
\mathbf{W}^{(2)\text{T}} \sigma(\mathbf{W}^{(1)\text{T}} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)}
$$

In [13]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [20]:
D = 4   # size of the input vector
M1 = 5  # size of first layer
M2 = 3  # size of second layer

W1 = np.random.rand(D, M1)   # weights for first layer
b1 = np.random.rand(M1, 1)   # biases for first layer

W2 = np.random.rand(M1, M2)  # weights for second layer
b2 = np.random.rand(M2, 1)  # biases for second layer

x = np.random.rand(D, 1)  # input column vector of size D

z1 = sigmoid(W1.T @ x + b1)  # forward pass through the first layer

z2 = W2.T @ z1 + b2  # forward pass through the second layer
z2

array([[2.02125455],
       [2.86643055],
       [1.82947088]])

# Activation Functions

Activation functions make a neural network's decision boundary non-linear.
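
To see why the activation is needed, the sketch below removes the sigmoid from the two-layer network and checks that the result collapses to a single linear layer. The names `W_eq` and `b_eq` are introduced here for illustration only, and the random seed is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption made for repeatability

D, M1, M2 = 4, 5, 3

W1 = rng.random((D, M1))
b1 = rng.random((M1, 1))
W2 = rng.random((M1, M2))
b2 = rng.random((M2, 1))
x = rng.random((D, 1))

# Two layers with NO activation in between
two_layer = W2.T @ (W1.T @ x + b1) + b2

# Equivalent single linear layer: (W1 W2)^T x + (W2^T b1 + b2)
W_eq = W1 @ W2          # shape (D, M2)
b_eq = W2.T @ b1 + b2   # shape (M2, 1)
one_layer = W_eq.T @ x + b_eq

print(np.allclose(two_layer, one_layer))
```

Without a non-linearity between the layers, stacking them adds no expressive power: the boundary stays linear.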

**Standardization**

- we don't want inputs with extremely different ranges.

- we prefer inputs centred around 0 and spanning approximately the same range.

- however, the sigmoid's outputs lie between 0 and 1, centred at 0.5.

- the hyperbolic tangent (tanh) solves this, with outputs between -1 and +1:

$$
\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{\exp(2x) - 1}{\exp(2x) + 1}
$$
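
The centring claim can be checked numerically. This is a small sketch, assuming a symmetric grid of inputs on [-5, 5] (the grid is an arbitrary choice for illustration); it also confirms that the exponential form above agrees with NumPy's `np.tanh`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Symmetric grid of inputs around 0 (an arbitrary choice for illustration)
x = np.linspace(-5, 5, 1001)

sig_out = sigmoid(x)
tanh_out = np.tanh(x)

# The exponential form of tanh from the formula above
tanh_manual = (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

sig_mean = sig_out.mean()    # close to 0.5: sigmoid outputs are not zero-centred
tanh_mean = tanh_out.mean()  # close to 0: tanh outputs are zero-centred

print(sig_mean, tanh_mean)
print(np.allclose(tanh_manual, tanh_out))
```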

However, both the sigmoid and the tanh suffer from the vanishing gradient problem. 

The vanishing gradient problem occurs in deep neural networks when backpropagation repeatedly multiplies by the derivative of a saturating activation function such as the sigmoid or tanh. The sigmoid's derivative is at most 0.25, so the gradients shrink towards zero as they pass back through many layers, leading to slow or stalled learning.
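
A rough illustration of the effect, under a simplifying assumption: the sketch below looks only at the activation-derivative factor and ignores the weight matrices, which also enter the real backpropagated gradient. The helper `sigmoid_grad` and the list of depths are introduced here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x))
    s = sigmoid(x)
    return s * (1 - s)

# The derivative peaks at x = 0
max_grad = sigmoid_grad(0.0)

# Best-case product of n sigmoid derivatives (weights ignored, an assumption)
depths = [1, 5, 10, 20]
upper_bounds = {n: max_grad ** n for n in depths}

for n, bound in upper_bounds.items():
    print(n, bound)
```

Even in this best case, the factor contributed by the activations alone falls below one in a million after ten layers.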

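The Rectified Linear Unit (**ReLU**) activation function maps negative inputs to zero and passes non-negative inputs through unchanged. Because its derivative is 1 for every positive input, it does not shrink gradients the way the sigmoid and tanh do.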

$$
f(x) = \max(0, x)
$$
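
A minimal NumPy sketch of ReLU and its slope. The helper names `relu` and `relu_grad` are introduced here, and treating the slope at exactly 0 as 0 is a convention assumed for this example.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x)
    return np.maximum(0, x)

def relu_grad(x):
    # Slope is 1 for positive inputs and 0 otherwise
    # (the value at exactly 0 is a convention, assumed to be 0 here)
    return (x > 0).astype(float)

a = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

relu_out = relu(a)
relu_slope = relu_grad(a)

print(relu_out)
print(relu_slope)
```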

# How to Represent Images