<h1 id="activations" >Activation Functions</h1>

<hr/>

In [1]:
from src.utils.logger import getLogger



Loaded environment variables.
Directory already exists: logs
Directory already exists: datasets
Directory already exists: models
Loaded Environment Variables: 
{
  "LOG_LEVEL": "INFO",
  "PYTHONENV": "development",
  "PYTHONPATH": "."
}


## Softmax

### Problem Domain

Multiclass classification problems are very common in machine learning. For example, classifiers used for object recognition often need to recognize thousands of distinct categories of objects. Natural language models that try to predict the next word in a sentence may have to choose among tens of thousands of possible words. For this kind of prediction, we need the network to output a categorical distribution; that is, if there are $d$ possible answers, we need $d$ output nodes that represent probabilities summing to 1.

### Solution

To achieve this, we use a **softmax** layer, which outputs a vector of $d$ values given a vector of input values $\mathbf{in} = \langle \text{in}_{1}, \ldots, \text{in}_{d} \rangle$. The $k$-th element of that output vector is given by:

$$
\begin{align*}
\text{softmax}(\mathbf{in})_{k} &= \frac{e^{\text{in}_{k}}}{\sum_{k'=1}^{d} e^{\text{in}_{k'}}}
\end{align*}
$$

#### Where:

- $\text{in}_{k}$ is the $k$-th element of the input vector.
- $e$ is the base of the natural logarithm (Euler's number).
- $e^{\text{in}_{k}}$ is the exponential of the $k$-th element of the input vector.
- $\sum_{k'=1}^{d} e^{\text{in}_{k'}}$ is the sum of the exponentials of all elements of the input vector; the summation index $k'$ runs over every class.
- $d$ is the dimensionality of the input vector, that is, the number of classes.
- The output is a probability distribution over the $d$ classes.

#### Key Points

- Softmax is smooth and differentiable, unlike the hard `max` function, which allows us to use it in backpropagation.
- Softmax units propagate multiclass information: every class receives a strictly positive probability, rather than only the winning class being reported.
- It is typically used as the last activation function of a neural network for multiclass classification, where it normalizes the raw outputs of the network into a probability distribution over the predicted classes.
- It is a generalization of the logistic function to multiple dimensions and is used in multinomial logistic regression.
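
The link to the logistic function can be checked numerically: for two classes with inputs $[x, 0]$, the first softmax output equals $\frac{e^{x}}{e^{x} + 1} = \frac{1}{1 + e^{-x}}$. The following is a minimal sketch; `softmax_vec` and `logistic` are illustrative helper names, not part of the `src` package.

```python
import numpy as np


def softmax_vec(values):
    """Softmax over a 1-D vector (illustrative helper, not from `src`)."""
    # Shifting by the maximum does not change the result and avoids overflow
    exps = np.exp(values - np.max(values))
    return exps / np.sum(exps)


def logistic(x):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))


for x in (-3.0, 0.0, 0.5, 4.0):
    two_class = softmax_vec(np.array([x, 0.0]))
    # The first output of a two-class softmax matches the logistic function
    assert np.isclose(two_class[0], logistic(x))
    print(f"x = {x:5.1f}  softmax = {two_class[0]:.6f}  logistic = {logistic(x):.6f}")
```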


In [2]:
# Example Usage
import numpy as np

from src.functions.normalize import normalize_data

# Suppose the input vector is:
__output = np.array([5, 2, 0, -2])

# First, calculate the softmax function from scratch in plain Python
# Euler's number; alternatively use math.e
E = 2.718281828459045

# Calculate the exponential of each element in the input vector
exp_values = [E ** i for i in __output]

# Not normalized exponential values
normalize_base = sum(exp_values)

# Normalized exponential values
normalize_values = [i / normalize_base for i in exp_values]
print(f"""Normalized values: \n{[float(f'{a:5f}') for a in normalize_values]}""")

# Sum of normalized values
sum_norm_values = sum(normalize_values)
print(f"Sum of normalized values: {sum_norm_values}")


# Repeat the calculation with NumPy: exponentiate every element at once
exp_values = np.exp(__output)

# Normalize by the sum along axis 0 (the only axis of this 1-D vector)
normalize_values = exp_values / np.sum(exp_values, axis=0, keepdims=True)
print(f"""Normalized values: \n{[float(f'{a:5f}') for a in normalize_values]}""")

# Sum of normalized values
sum_norm_values = sum(normalize_values)
print(f"Sum of normalized values: {sum_norm_values}")
# Output (normalized values): [0.945683, 0.047083, 0.006372, 0.000862]

Normalized values: 
[0.945683, 0.047083, 0.006372, 0.000862]
Sum of normalized values: 1.0000000000000002
Normalized values: 
[0.945683, 0.047083, 0.006372, 0.000862]
Sum of normalized values: 1.0000000000000002


In [3]:
# Now we use NumPy to calculate the softmax function in a numerically stable way
# Subtract the maximum before exponentiating to avoid overflow; the result is unchanged
exps = np.exp(__output - np.max(__output))
outputs = exps / np.sum(exps, axis=0, keepdims=True)
print(f"""Normalized values: \n{[float(f'{a:5f}') for a in outputs]}""")
sum_norm_values = sum(outputs)
print(f"Sum of normalized values: {sum_norm_values}")
# Output (normalized values): [0.945683, 0.047083, 0.006372, 0.000862]

Normalized values: 
[0.945683, 0.047083, 0.006372, 0.000862]
Sum of normalized values: 1.0000000000000002
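
The benefit of subtracting the maximum only shows up with large inputs. The sketch below compares a naive softmax with the shifted version on inputs around 1000; the helper names `softmax_naive` and `softmax_stable` and the input values are chosen here for illustration only.

```python
import numpy as np


def softmax_naive(values):
    """Softmax without any shift; overflows for large inputs."""
    exps = np.exp(values)
    return exps / np.sum(exps)


def softmax_stable(values):
    """Softmax with the maximum subtracted before exponentiating."""
    exps = np.exp(values - np.max(values))
    return exps / np.sum(exps)


large = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf in float64, and inf / inf evaluates to nan.
# The warnings are silenced so the comparison stays readable.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(large)

stable = softmax_stable(large)

print("Naive: ", naive)
print("Stable:", np.round(stable, 6))
```

The naive version returns only `nan` values, while the shifted version returns a valid distribution that sums to 1.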


In [4]:
from src.functions.activation import Softmax
import numpy as np

__output = np.array([4.8, 1.21, 2.385])
softmax = Softmax()
outputs = softmax(__output)
print(outputs)
print(sum(outputs))
# Output: [0.89528266 0.02470831 0.08000903]


[0.89528266 0.02470831 0.08000903]
0.9999999999999999


In [5]:
# Example Usage
output = softmax(np.array([-2,-1,0]))
print([float(f'{a:5f}') for a in output])
# Output: [0.090031, 0.244728, 0.665241]
output = softmax(np.array([1,2,3]))
print([float(f'{a:5f}') for a in output])
# Output: [0.090031, 0.244728, 0.665241]
output = softmax(np.array([0.5, 1.0, 1.5]))
print([float(f'{a:5f}') for a in output])
# Output: [0.186324, 0.307196, 0.50648]

[0.090031, 0.244728, 0.665241]
[0.090031, 0.244728, 0.665241]
[0.186324, 0.307196, 0.50648]
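
The first two results above coincide because softmax is unchanged when the same constant $c$ is subtracted from every input: the factor $e^{-c}$ cancels between numerator and denominator. A short derivation, writing $k'$ for the summation index over the $d$ classes:

$$
\begin{align*}
\frac{e^{\text{in}_{k} - c}}{\sum_{k'=1}^{d} e^{\text{in}_{k'} - c}}
&= \frac{e^{-c}\, e^{\text{in}_{k}}}{e^{-c} \sum_{k'=1}^{d} e^{\text{in}_{k'}}}
= \text{softmax}(\mathbf{in})_{k}
\end{align*}
$$

Here $[1, 2, 3]$ is $[-2, -1, 0]$ shifted by $3$. Choosing $c = \max_{k'} \text{in}_{k'}$ makes every exponent non-positive, which is why subtracting the maximum avoids overflow.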


In [6]:
# Create a dense layer with 3 neurons, each taking 2 inputs, and apply softmax to its output.
from src.layer.dense import Dense
from src.utils.datasets import create_spiral_dataset
from src.functions.activation import Softmax, ReLU

# Initialize activation function
softmax = Softmax()


# Create a spiral dataset
X, y = create_spiral_dataset(100, 3)
print(f"Inputs: {X.shape}")
print(f"Y is a spiral dataset: {y.shape}")
# Create a dense layer with 3 neurons with 2 inputs each
dense = Dense(2, 3)

# Let's do the forward pass
dense.forward(X)
print(f"Weights Layer 1: {dense.weights.shape}")
print(f"Biases Layer 1: {dense.biases.shape}")
print(f"Output Layer 1: {dense.output.shape}")

# The softmax outputs are also our "confidence scores": the higher the score, the more confident the model is that the input belongs to that class.
# Run the softmax activation function
predictions = softmax(dense.output)
print(f"Predictions: {predictions.shape}")

# Calculate the loss
avg_loss, loss = dense.loss(np.array([y]), predictions)
print(f"Loss: {avg_loss}")

# Run ArgMax to get the predicted class
predicted_class = np.argmax(predictions, axis=1)
print(f"Predicted Class: {predicted_class.shape}")

Inputs: (300, 2)
Y is a spiral dataset: (300,)
Weights Layer 1: (2, 3)
Biases Layer 1: (1, 3)
Output Layer 1: (300, 3)
Predictions: (300, 3)


ValueError: Shapes of true labels and predicted probabilities must be compatible.
                True Labels Shape: 
(3, 300)
                Predicted Probabilities Shape: 
(1, 300)


## ReLU (Rectified Linear Unit) Activation Function

### Problem Domain

In deep learning models, especially in the layers of neural networks, non-linear activation functions are required to capture complex patterns. ReLU is one of the most popular activation functions due to its simplicity and effectiveness in practice.

### Solution
The ReLU activation function is defined as:

$$
\begin{align*}
\text{y} &= \text{ReLU}(x) = \max(0, x)
\end{align*}
$$

#### Where:
- $x$ is the input to the function.
- $\max(a, b)$ returns the maximum of $a$ and $b$. In this case, it returns $0$ if $x$ is negative, and $x$ otherwise.
- $y$ is the output of the activation function.

This means that it outputs the input directly if it is positive; otherwise, it outputs zero.

### Key Points
- **Non-linear**: The ReLU function introduces non-linearity to the model, allowing it to learn complex patterns.
- **Sparse Activation**: For any given input, some neurons will be inactive (outputting zero), which can make the network more efficient.
- **Piecewise Linear**: The function is the identity for positive inputs and zero otherwise, so both the function and its gradient are cheap to compute.
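
The sparse-activation point can be illustrated with random pre-activations. This is a minimal sketch under stated assumptions: the inputs are drawn from a standard normal distribution, and `relu_grad` uses the common convention of a zero gradient at exactly $x = 0$; neither comes from the `src` package.

```python
import numpy as np


def relu(x):
    """ReLU: element-wise maximum of 0 and x."""
    return np.maximum(0, x)


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (convention at 0)."""
    return (x > 0).astype(float)


# Assumed stand-in for a layer's pre-activations
rng = np.random.default_rng(42)
pre_activations = rng.normal(size=10_000)

activations = relu(pre_activations)
sparsity = np.mean(activations == 0)

print(f"Fraction of inactive units: {sparsity:.2f}")
print(f"Gradient at [-1, 0, 2]: {relu_grad(np.array([-1.0, 0.0, 2.0]))}")
```

With zero-centered inputs, roughly half of the units output exactly zero.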

### Manual Implementation


In [None]:

import numpy as np

def relu(x):
    return np.maximum(0, x)

# Example usage
x = np.array([-1, 0, 1, 2])
outputs = relu(x)
print([float(f'{a:5f}') for a in outputs])
# Output: [0.0, 0.0, 1.0, 2.0]


In [None]:

from src.functions.activation import ReLU

relu = ReLU()
outputs = relu(np.array([-1, 0, 1, 2]))
print([float(f'{a:5f}') for a in outputs])
# Output: [0.0, 0.0, 1.0, 2.0]



## Sigmoid Activation Function

### Problem Domain
The Sigmoid function is often used in binary classification problems or as the activation function for the output layer of a neural network when the output needs to be in the range (0, 1), such as in probability predictions.

### Solution
The Sigmoid function is defined as:

$$
\begin{align*}
\text{y} &= \text{Sigmoid}(x) = \frac{1}{1 + e^{-x}}
\end{align*}
$$

#### Where:
- $x$ is the input to the function.
- $e$ is the base of the natural logarithm (Euler's number).
- $e^{-x}$ is the exponential of the negative input.
- $y$ is the output of the activation function.

This function maps any real-valued number into the range (0, 1).

### Key Points
- **Smooth Gradient**: The Sigmoid function has a smooth gradient, which makes it suitable for backpropagation.
- **Output Range**: The output is always between 0 and 1, making it ideal for probability estimation.
- **Vanishing Gradient**: For extreme input values, the gradient becomes very small, which can slow down learning in deep networks.
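
The vanishing gradient can be made concrete with the derivative of the sigmoid, $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which peaks at $0.25$ for $x = 0$ and decays quickly on both sides. A minimal sketch; `sigmoid_grad` is an illustrative helper name and the sample points are arbitrary.

```python
import numpy as np


def sigmoid(x):
    """Sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)


xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(xs)

for x, g in zip(xs, grads):
    print(f"x = {x:6.1f}  gradient = {g:.6f}")
```

At $x = \pm 10$ the gradient is already below $10^{-4}$, so very little error signal passes through a saturated unit.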

### Manual Implementation


In [None]:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Example usage
x = np.array([-1, 0, 1, 2])
outputs = sigmoid(x)
print([float(f'{a:5f}') for a in outputs])
# Output: [0.268941, 0.5, 0.731059, 0.880797]


In [None]:

from src.functions.activation import Sigmoid

sigmoid = Sigmoid()
outputs = sigmoid(np.array([-1, 0, 1, 2]))
print([float(f'{a:5f}') for a in outputs])
# Output: [0.268941, 0.5, 0.731059, 0.880797]



## Tanh Activation Function

### Problem Domain
The Tanh (Hyperbolic Tangent) function is commonly used in neural networks, particularly for hidden layers. Unlike the Sigmoid function, the Tanh function outputs values in the range (-1, 1), which can make learning more efficient in practice.

### Solution
The Tanh function is defined as:

$$
\begin{align*}
\text{y} &= \tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}
\end{align*}
$$

#### Where:
- $x$ is the input to the function.
- $e$ is the base of the natural logarithm (Euler's number).
- $e^{2x}$ is the exponential of twice the input.
- $y$ is the output of the activation function.

**Note**: Tanh maps any real-valued number into the range (-1, 1). Tanh is a scaled and shifted version of the sigmoid, as $\tanh(x) = 2\sigma(2x) - 1$, where $\sigma$ denotes the Sigmoid function.
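
The identity $\tanh(x) = 2\sigma(2x) - 1$ can be checked numerically against NumPy's built-in `np.tanh`. A minimal sketch; the variable names and the sample grid are chosen here for illustration.

```python
import numpy as np


def sigmoid(x):
    """Sigmoid: maps any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


xs = np.linspace(-5.0, 5.0, 11)

# Tanh written as a scaled and shifted sigmoid
via_sigmoid = 2.0 * sigmoid(2.0 * xs) - 1.0

# Tanh written with the exponential formula (e^{2x} - 1) / (e^{2x} + 1)
via_exponentials = (np.exp(2.0 * xs) - 1.0) / (np.exp(2.0 * xs) + 1.0)

print(np.allclose(via_sigmoid, np.tanh(xs)))
print(np.allclose(via_exponentials, np.tanh(xs)))
```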

### Key Points
- **Zero-centered**: Unlike the Sigmoid function, Tanh is zero-centered: negative inputs map to negative outputs, zero maps exactly to zero, and positive inputs map to positive outputs.
- **Smooth Gradient**: The Tanh function has a smooth gradient, which is advantageous for gradient-based optimization methods.

### Manual Implementation


In [None]:

import numpy as np

def tanh(x):
    # Direct implementation of the formula above: (e^{2x} - 1) / (e^{2x} + 1)
    exp_2x = np.exp(2 * x)
    return (exp_2x - 1) / (exp_2x + 1)

# Example usage
x = np.array([-1, 0, 1, 2])
outputs = tanh(x)
print([float(f'{a:5f}') for a in outputs])
# Output: [-0.761594, 0.0, 0.761594, 0.964028]

In [None]:

from src.functions.activation import Tanh

tanh = Tanh()
outputs = tanh(np.array([-1, 0, 1, 2]))
print([float(f'{a:5f}') for a in outputs])
# Output: [-0.761594, 0.0, 0.761594, 0.964028]



## Leaky ReLU Activation Function

### Problem Domain
A potential issue with the ReLU activation function is the "dying ReLU" problem, where neurons can become inactive and only output zero. Leaky ReLU is a variation that attempts to fix this by allowing a small, non-zero gradient when the input is negative.

### Solution
The Leaky ReLU function is defined as:

$$
\begin{align*}
\text{Leaky ReLU}(x) = 
\begin{cases} 
      x & x \geq 0 \\
      \alpha x & x < 0 
\end{cases}
\end{align*}
$$

#### Where:
- $x$ is the input to the function.
- $\alpha$ is a small positive constant (e.g., 0.01).
- The function outputs the input directly if it is non-negative, and $\alpha x$ if it is negative.
- This small slope for negative inputs helps to keep the gradient alive and prevent neurons from dying.

### Key Points
- **Fixes "Dying ReLU"**: Leaky ReLU introduces a small slope for negative inputs, which helps to keep the gradient alive even for negative inputs.
- **Simple and Effective**: This modification is simple to implement and has been shown to be effective in practice.
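
The difference from ReLU is easiest to see in the gradients. The sketch below compares the two for a few sample inputs; `relu_grad` and `leaky_relu_grad` are illustrative helper names, and the default `alpha=0.01` follows the example value given above.

```python
import numpy as np


def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return np.where(x > 0, 1.0, alpha)


x = np.array([-3.0, -0.5, 0.5, 3.0])

# ReLU passes no gradient for negative inputs, so such a unit cannot recover
print("ReLU gradient:      ", relu_grad(x))

# Leaky ReLU keeps a small gradient of alpha for negative inputs
print("Leaky ReLU gradient:", leaky_relu_grad(x))
```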

### Manual Implementation


In [None]:

import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# Example usage
x = np.array([-1, 0, 1, 2])
outputs = leaky_relu(x)
print(outputs)
# Output: [-0.01  0.    1.    2.  ]


In [None]:

from src.functions.activation import LeakyReLU

leaky_relu = LeakyReLU()
outputs = leaky_relu(np.array([-1, 0, 1, 2]))
print(outputs)
# Output: [-0.01  0.    1.    2.  ]
