# Establish CNNs from scratch
Like other layers, activation function layers involve two processes, namely **Forward Propagation** and **Backward Propagation**. They serve multiple purposes: non-linearity, thresholding / decision making, squashing / bounding values, differentiability, mitigating vanishing and exploding gradient problems, computational efficiency, and representation and learning capabilities.

In [2]:
import numpy as np
from layer import Layer
from activation import Activation

### 1.Non-Linearity
The activation layer usually applies a non-linear activation function. Without non-linearity (in other words, with only linear activation functions), a stack of layers collapses into a single equivalent layer. Here is the mathematical proof.  
Suppose that we employ two layers with a linear activation function:
$$
y_{1}=\omega_{1} x_{1}+b_{1}
$$
$$
\phi_{activation} (y_{1})=y_{1}
$$
$$
y_{2}=\omega_{2} x_{2}+b_{2} 
$$
where $x_{2}=\phi_{activation} (y_{1})$  
Hence
$$
y_{2}=\omega_{1} \omega_{2} x_{1}+\omega_{2} b_{1}+b_{2}= \omega_{3}  x_{1}+b_{3}
$$
where $\omega_{3}=\omega_{1} \omega_{2} , b_{3}=\omega_{2} b_{1}+b_{2}$   
With a linear activation function, the two layers are equivalent to a single layer, so the hidden layer is redundant.  
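
The collapse can also be checked numerically. The following is a minimal sketch, assuming matrix-valued weights with arbitrary shapes and random values chosen only for illustration. For matrices the combined weight is the ordered product `w2 @ w1`, which reduces to $\omega_{1} \omega_{2}$ in the scalar case written above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with an identity (linear) activation in between.
# Shapes are illustrative assumptions: 3 inputs -> 4 hidden -> 2 outputs.
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
w2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x1 = rng.normal(size=(3, 1))

y1 = w1 @ x1 + b1
x2 = y1                 # phi(y1) = y1, the linear activation
y2 = w2 @ x2 + b2

# The equivalent single layer.
w3 = w2 @ w1
b3 = w2 @ b1 + b2
y2_single = w3 @ x1 + b3

collapsed = np.allclose(y2, y2_single)
print(collapsed)
```

Both paths give the same output, so the extra layer adds no expressive power.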

### 2.Commonly used activation functions
Some commonly used non-linear activation functions include Tanh, Sigmoid, and ReLU.   
$$
\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}   
$$
$$
\sigma(x) = \frac{1}{1 + e^{-x}} 
$$
$$
\text{ReLU}(x) = \max(0, x)   
$$   
And their corresponding derivatives go as follows:
$$
\tanh'(x) = 1 - \tanh^2(x) 
$$
$$
\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) 
$$
$$
\text{ReLU}'(x) = 
\begin{cases} 
1 & \text{if } x > 0 \\
0 & \text{if } x < 0 \\
\text{user-defined} & \text{if } x = 0
\end{cases} 
$$  
The formula of **Forward Propagation** goes as follows: 
$$
Y=f(X) \tag{1}
$$ 
The formula of **Backward Propagation** goes as follows:
$$
\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} \odot f'(X) \tag{2}
$$  
where $\odot$ denotes the element-wise multiplication of two matrices.
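
The `Activation` base class imported at the top is not shown in this notebook. The following is a minimal sketch of how such a class might implement (1) and (2), assuming it stores the two callables and caches the input during the forward pass. The class name `SketchActivation` and the method names `forward` and `backward` are hypothetical, not taken from the actual `activation` module.

```python
import numpy as np

class SketchActivation:
    """Hypothetical stand-in for the imported Activation base class."""

    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime
        self.input = None

    def forward(self, x):
        # Eq. (1): Y = f(X). Cache X because the backward pass needs it.
        self.input = x
        return self.activation(x)

    def backward(self, output_gradient):
        # Eq. (2): dE/dX = dE/dY (element-wise) f'(X).
        return output_gradient * self.activation_prime(self.input)

# Usage with tanh as the activation function.
layer = SketchActivation(np.tanh, lambda x: 1 - np.tanh(x) ** 2)
x = np.array([[-1.0], [0.0], [2.0]])
y = layer.forward(x)
grad_x = layer.backward(np.ones_like(x))
```

Because an activation layer has no trainable parameters, the backward pass only has to return the gradient with respect to its input.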

### 3.Softmax activation layer
In CNNs used for classification, the softmax activation layer is typically the last layer of the model, following the fully connected layer.

The formula of **Forward Propagation**, which generates the predicted probabilities, goes as follows:
$$
\hat{y_{i}}=\frac{e^{x_{i}}}{\sum_{j=1}^{n}e^{x_{j}}} \tag{3} 
$$  
The formula of **Backward Propagation**, whose input ($\partial_{Y}E$) is the derivative of the cross-entropy loss with respect to the predicted outputs, goes as follows:
$$
\partial_{X} E =(M \odot (I-M^T)) \cdot \partial_{Y}E  \tag{4} 
$$
$$
M=
\begin{bmatrix}
y_{1} & y_{1} & \dots & y_{1}\\
y_{2} & y_{2} & \dots & y_{2}\\
\vdots & \vdots & \ddots & \vdots\\
y_{n} & y_{n} & \dots & y_{n}\\
\end{bmatrix}
\;
\partial_{X} E =
\begin{bmatrix}
\partial_{x_{1}} E\\
\partial_{x_{2}} E\\
\vdots\\
\partial_{x_{n}} E\\
\end{bmatrix}
\;
\partial_{Y} E =
\begin{bmatrix}
\partial_{y_{1}} E\\
\partial_{y_{2}} E\\
\vdots\\
\partial_{y_{n}} E\\
\end{bmatrix}
$$
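where $\odot$ denotes the element-wise multiplication of two matrices, $I$ is the $n \times n$ identity matrix, and $y_{i}$ is the softmax output $\hat{y_{i}}$ from (3).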
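
Equations (3) and (4) can be sketched in NumPy as follows. This is a minimal sketch, assuming column-vector inputs of shape `(n, 1)`. The function names `softmax_forward` and `softmax_backward` are hypothetical, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result of (3).

```python
import numpy as np

def softmax_forward(x):
    # Eq. (3). Shifting by the max avoids overflow in np.exp.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_backward(y, output_gradient):
    # Eq. (4): dE/dX = (M * (I - M^T)) @ dE/dY
    n = y.size
    M = np.tile(y, n)   # shape (n, n), with M[i, j] = y_i
    jacobian = M * (np.identity(n) - M.T)
    return jacobian @ output_gradient

x = np.array([[1.0], [2.0], [3.0]])
y_hat = softmax_forward(x)

# Cross-entropy loss with a one-hot target t: dE/dy_i = -t_i / y_i.
t = np.array([[0.0], [0.0], [1.0]])
grad_y = -t / y_hat
grad_x = softmax_backward(y_hat, grad_y)
```

With a cross-entropy loss and a one-hot target, the result of (4) simplifies to `y_hat - t`, which gives a quick way to sanity-check the implementation.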

In [5]:
class Tanh(Activation):
    def __init__(self):
        def tanh(x):
            return np.tanh(x)

        def tanh_prime(x):
            return 1 - np.tanh(x) ** 2

        super().__init__(tanh, tanh_prime)

class Sigmoid(Activation):
    def __init__(self):
        def sigmoid(x):
            # Clip the input so that np.exp cannot overflow for large negative x.
            return 1 / (1 + np.exp(-np.clip(x, -100, 100)))

        def sigmoid_prime(x):
            s = sigmoid(x)
            return s * (1 - s)

        super().__init__(sigmoid, sigmoid_prime)

class ReLu(Activation):
    def __init__(self):
        def relu(x):
            return np.maximum(0, x)
        def relu_prime(x):
            # The derivative at x = 0 is user-defined; it is taken to be 0 here.
            return np.where(x > 0, 1, 0)
        
        super().__init__(relu,relu_prime)
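
The derivative formulas used in these classes can be checked against a central finite difference. This is a minimal sketch written with standalone functions rather than the classes above, so it does not depend on the `activation` module. The helper `numerical_derivative` and the test points are illustrative assumptions, and $x = 0$ is avoided because ReLU is not differentiable there.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-6):
    # Central difference approximation of f'(x), applied element-wise.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.5, 2.0])

checks = {
    "tanh": np.allclose(numerical_derivative(np.tanh, x),
                        1 - np.tanh(x) ** 2, atol=1e-6),
    "sigmoid": np.allclose(numerical_derivative(sigmoid, x),
                           sigmoid(x) * (1 - sigmoid(x)), atol=1e-6),
    "relu": np.allclose(numerical_derivative(relu, x),
                        np.where(x > 0, 1, 0), atol=1e-6),
}
print(checks)
```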


Supplementary information from ChatGPT  
In neural networks, activation functions, which are often represented by activation layers, play a crucial role. Here's a breakdown of their purposes:

1. **Non-linearity**: The most essential role of an activation function is to introduce non-linearity into the network. Without non-linearity, even a deep neural network would behave just like a single-layer linear model because the composition of linear functions remains linear. Non-linearity allows the network to capture and model more complex relationships in the data.

2. **Thresholding / Decision Making**: Activation functions like the step function (rarely used in practice for hidden layers) or the ReLU (Rectified Linear Unit) can be thought of as making decisions – they determine if a particular neuron should be activated or not based on the weighted sum of its inputs.

3. **Squashing / Bounding Values**: Some activation functions, like the sigmoid or tanh, squash the incoming values into a specific range. For instance, the sigmoid function maps values to the range (0,1), while tanh maps to the range (-1,1). This can be useful in ensuring that the neuron’s output doesn't reach extremely high or low values, and in some contexts, like output layers for binary classification, a bounded output is desired.

4. **Differentiability**: Many optimization methods, such as gradient descent, rely on the ability to compute gradients or derivatives. For this reason, activation functions used in practice (like sigmoid, tanh, ReLU, etc.) are often differentiable (with some minor exceptions, like the exact point at 0 for ReLU, but workarounds exist).

5. **Mitigate Vanishing and Exploding Gradient Problems**: Certain activation functions can help in mitigating the issues of vanishing or exploding gradients, which are common in deep networks. For example, the ReLU function doesn't squash positive values, making it less prone to the vanishing gradient problem than the sigmoid or tanh, especially in deep networks. However, ReLU units can "die" (output zero for every input and stop receiving gradient), which is the motivation for variants like Leaky ReLU and Parametric ReLU.

6. **Computational Efficiency**: Some activation functions are cheaper to compute than others. For instance, ReLU and its variants tend to be faster than sigmoid or tanh since they involve simpler mathematical operations.

7. **Representation and Learning Capabilities**: Different activation functions can lead to different learning dynamics and capabilities. Some may help networks converge faster, while others may help in capturing more nuanced patterns.
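
The vanishing-gradient point in item 5 can be illustrated with a rough calculation. This is a deliberately simplified sketch that assumes the gradient is scaled only by the activation derivative at each layer and ignores the weights entirely. It uses the best case for the sigmoid, whose derivative peaks at 0.25 when the input is 0; the chosen depths are arbitrary.

```python
import numpy as np

def sigmoid_prime(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

depths = [1, 5, 10, 20]

# Best case for sigmoid: every pre-activation is 0, where sigma'(0) = 0.25.
sigmoid_factor = {d: sigmoid_prime(0.0) ** d for d in depths}

# ReLU on active units (x > 0) has derivative exactly 1.
relu_factor = {d: 1.0 ** d for d in depths}

for d in depths:
    print(d, sigmoid_factor[d], relu_factor[d])
```

Even in this best case the sigmoid factor shrinks geometrically with depth, while the factor for active ReLU units stays at 1.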

In summary, activation functions help neural networks model complex, non-linear relationships, ensure computational stability, and offer desired properties for optimization and learning dynamics. The choice of activation function can significantly impact the performance and training dynamics of a neural network, and it's often a subject of empirical study and research in deep learning.