## Activation Function
An activation function is applied to the output of a neuron (or a layer of neurons) and modifies that output. We use activation functions because a nonlinear activation function allows a neural network, usually one with two or more hidden layers, to map nonlinear functions.

In general, your neural network will have ***two types*** of activation functions. The first will be the activation function used in hidden layers, and the second will be used in the output layer.

### Type 1: hidden layer activation
###### 1 Linear Activation Function
A linear function is simply the equation of a line. It will appear as a straight line when graphed, where y=x and the output value equals the input.
<div>
<img src="images/image4.1.png" width="400"/>
</div>
This activation function is usually applied to the last layer’s output in a regression model, that is, a model that outputs a scalar value instead of a classification.

###### 2 The Step Activation Function (Outdated)
The purpose of this activation function is to mimic a neuron “firing” or “not firing” based on input information. The simplest version of this is a step function. In a single neuron, if weights · inputs + bias results in a value greater than 0, the neuron fires and outputs a 1; otherwise, it outputs a 0.
<div>
<img src="images/image4.2.png" width="400"/>
</div>
This activation function was used historically in hidden layers, ***but nowadays it is rarely chosen.***
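
As a minimal sketch of the firing rule described above, the step function can be written in plain Python. The function name `step_activation` and the sample weights, inputs, and biases are hypothetical values chosen only for illustration.

```python
def step_activation(weights, inputs, bias):
    """Return 1 if weights . inputs + bias is greater than 0, otherwise 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical example values
weights = [0.2, -0.5, 1.0]
inputs = [1.0, 2.0, 0.5]
bias = 0.1

# Weighted sum is about -0.2, so the neuron does not fire
print(step_activation(weights, inputs, bias))

# With a larger bias the weighted sum is about 0.7, so the neuron fires
print(step_activation(weights, inputs, 1.0))
```

Note that the output only says whether the weighted sum crossed 0, not by how much.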

###### 3 The Sigmoid Activation Function
The problem with a step function is that it gives the optimizer very little information about how changes to the weights and biases affect the output: a neuron is either on (1) or off (0). The original, more granular activation function used for neural networks was the Sigmoid activation function, which looks like:
<div>
<img src="images/image4.3.png" width="400"/>
</div>
This function returns a value that approaches 0 as the input approaches negative infinity, equals 0.5 for an input of 0, and approaches 1 as the input approaches positive infinity. Because the Sigmoid function is strictly increasing, the output can in principle be mapped back to the original input; it retains information about the input, unlike the step function, where an input of 3 produces the same output as an input of 300,000. The Sigmoid function, historically used in hidden layers, was eventually replaced by the Rectified Linear Unit (ReLU) activation function.
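
The behavior described above matches the standard sigmoid formula, 1 / (1 + e^(-x)). The following is a minimal NumPy sketch, not taken from the original notebook; the function names `sigmoid` and `step` and the sample inputs are chosen here for illustration. It compares how much information each function keeps about its input.

```python
import numpy as np


def sigmoid(x):
    """Standard sigmoid, 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


def step(x):
    """Step function: 1 where x > 0, otherwise 0."""
    return np.where(x > 0, 1, 0)


x = np.array([-5.0, 0.0, 3.0, 6.0])

# Every positive input collapses to 1 under the step function
print(step(x))

# The sigmoid output still differs for 3.0 and 6.0
print(sigmoid(x))
```

For very large inputs the sigmoid output saturates toward 1 in floating-point arithmetic, so the distinction between inputs becomes small in practice.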

###### 4 The Rectified Linear Activation Function (ReLU)
The ReLU function is y=x, clipped at 0 from the negative side: if x is less than or equal to 0, then y is 0; otherwise, y is equal to x. The ReLU activation function is extremely close to being a linear activation function while remaining nonlinear, due to the bend at 0. This simple property is, however, very effective.
<div>
<img src="images/image4.4.png" width="400"/>
</div>

## Why Use Activation Functions
In most cases, for a neural network to fit a nonlinear function, we need it to contain two or more hidden layers, and we need those hidden layers to use a nonlinear activation function. Why?

With a linear activation function, no matter what we do with a neuron’s weights and biases, the output of that neuron remains a linear function of its input, just like y=x. This linear nature carries through the entire network, so the network as a whole can only fit linear functions.
<div>
<img src="images/image4.5.png" width="400"/>
</div>
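
One way to check this claim numerically is sketched below with NumPy. Two stacked dense layers with a linear activation compute `(x @ W1 + b1) @ W2 + b2`, which can be rewritten as a single linear layer with weights `W1 @ W2` and bias `b1 @ W2 + b2`. The layer sizes and random values are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(5, 1))    # 5 samples, 1 input feature
W1 = rng.normal(size=(1, 8))   # first layer: 1 input -> 8 neurons
b1 = rng.normal(size=(1, 8))
W2 = rng.normal(size=(8, 1))   # second layer: 8 inputs -> 1 output
b2 = rng.normal(size=(1, 1))

# Two layers with a linear activation (the activation leaves values unchanged)
two_layers = (x @ W1 + b1) @ W2 + b2

# A single linear layer that computes the same thing
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True

# With ReLU between the layers, this rewrite is no longer valid in general
relu_layers = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(relu_layers, one_layer))
```

However many linear layers are stacked, the same argument collapses them into one, which is why a nonlinear activation is needed.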

When using the same 2 hidden layers of 8 neurons each with the rectified linear activation function, or any other nonlinear activation function, we see the following result after training (note that ReLU is only barely nonlinear):
<div>
<img src="images/image4.6.png" width="400"/>
</div>
In the image above, the weights and biases can be adjusted so that the final model output fits the nonlinear relationship. If we keep 2 hidden layers but change from 8 neurons to 64 neurons per layer, we see further improvement:
<div>
<img src="images/image4.7.png" width="400"/>
</div>

In [1]:
# ReLU Activation Function Code
# If x > 0, return x; otherwise, return 0
inputs = [0, 2, -1, 3.3, -2.7, 1.1, 2.2, -100]
output = []
for i in inputs:
    output.append(max(0, i))
    
print(output)

[0, 2, 0, 3.3, 0, 1.1, 2.2, 0]


Next, let’s look at the activation function used on the output of the last layer.

### Type 2: output layer activation
First, why bother with another activation function? It depends on what our overall goals are. In this case, the output of the rectified linear unit is unbounded, not normalized with other units, and exclusive. “Not normalized” means the values can be anything, so an output of [12, 99, 318] has no context; “exclusive” means each output is independent of the others.
###### The Softmax Activation Function
<div>
<img src="images/image4.8.png" width="400"/>
</div>
The softmax activation function is meant for classification problems. To address this lack of context, the softmax activation can take non-normalized, or uncalibrated, inputs and produce a normalized distribution of probabilities for our classes. In classification, what we want to see is a prediction of which class the network “thinks” the input represents. The distribution returned by the softmax activation function represents confidence scores for each class, and these scores add up to 1. For example, if our network returns the confidence distribution [0.45, 0.55] for two classes, the prediction is the 2nd class, but the confidence in this prediction isn’t very high. Our program might choose not to act in this case, since it’s not very confident.
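
A minimal sketch of how such a confidence distribution might be used follows. The threshold value `0.7` and the variable names are hypothetical; the text does not specify how confident the network must be before the program acts.

```python
import numpy as np

# Confidence distribution from the example in the text
confidences = np.array([0.45, 0.55])

# Hypothetical cut-off below which the program declines to act
CONFIDENCE_THRESHOLD = 0.7

# Index of the highest confidence score (0-based, so 1 means the 2nd class)
predicted_class = int(np.argmax(confidences))
confidence = float(confidences[predicted_class])

if confidence >= CONFIDENCE_THRESHOLD:
    print(f"Predicted class index {predicted_class} with confidence {confidence:.2f}")
else:
    print(f"Class index {predicted_class} is most likely, "
          f"but confidence {confidence:.2f} is too low to act on")
```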

In [6]:
import numpy as np

layer_outputs = [4.8, 1.21, 2.385]

# For each value in a vector, calculate the exponential value
exp_values = np.exp(layer_outputs) 
print('exponentiated values:')
print(exp_values)

# Now normalize values
norm_values = exp_values / np.sum(exp_values)
print('normalized exponentiated values:')
print(norm_values)
print('sum of normalized values:', np.sum(norm_values))

exponentiated values:
[121.51041752   3.35348465  10.85906266]
normalized exponentiated values:
[0.89528266 0.02470831 0.08000903]
sum of normalized values: 0.9999999999999999


Equation for Softmax:

Step 1. Exponentiate the outputs.
***The exponential function is monotonically increasing***. This means that higher input values give higher outputs, so applying it does not change the predicted class, while guaranteeing that all values are non-negative.

Step 2. Convert these numbers to a probability distribution (normalization).
Take each exponentiated value and divide it by the sum of all of the exponentiated values.
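
Written as a single formula, the two steps combine as follows. The symbols are chosen here for illustration: z is the vector of layer outputs, L is the number of outputs, and S_i is the resulting confidence score for class i.

```latex
S_i = \frac{e^{z_i}}{\sum_{l=1}^{L} e^{z_l}}
```

The numerator is Step 1 applied to one output, and the denominator is the sum used for normalization in Step 2.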

In [4]:
# pip install numpy nnfs
import numpy as np
import nnfs
from nnfs.datasets import spiral_data
nnfs.init()

class Layer_Dense:
    # Initialize weights and biases
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
        
    # Forward pass
    # Calculate output values from inputs, weights and biases
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases
        
# ReLU activation
class Activation_ReLU:
    # Forward pass
    def forward(self, inputs):
        # Calculate output values from inputs
        self.output = np.maximum(0, inputs)
        
# Softmax activation
class Activation_Softmax:
    # Forward pass
    def forward(self, inputs):
        # Get unnormalized probabilities; subtracting the row maximum
        # keeps np.exp from overflowing without changing the result
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # Normalize them for each sample
        probabilities = exp_values / np.sum(exp_values, axis=1, keepdims=True)
        self.output = probabilities
        
        
# Create dataset
X, y = spiral_data(samples=100, classes=3)
# Create Dense layer with 2 input features and 3 output values
dense1 = Layer_Dense(2, 3)
# Create ReLU activation (to be used with Dense layer):
activation1 = Activation_ReLU()
# Create second Dense layer with 3 input features (as we take the output
# of the previous layer here) and 3 output values
dense2 = Layer_Dense(3, 3)
# Create Softmax activation (to be used with Dense layer):
activation2 = Activation_Softmax()
# Make a forward pass of our training data through this layer
dense1.forward(X)
# Make a forward pass through activation function
# it takes the output of first dense layer here
activation1.forward(dense1.output)
# Make a forward pass through second Dense layer
# it takes outputs of activation function of first layer as inputs
dense2.forward(activation1.output)
# Make a forward pass through activation function
# it takes the output of second dense layer here
activation2.forward(dense2.output)

# Let's see output of the first few samples:
print(activation2.output[:5])

[[0.33333334 0.33333334 0.33333334]
 [0.33333316 0.3333332  0.33333364]
 [0.33333287 0.3333329  0.33333418]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
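
The `Activation_Softmax` class above subtracts each row's maximum before exponentiating. The sketch below illustrates why that step helps: it leaves the result unchanged for ordinary inputs but avoids overflow for large ones. The helper names `softmax_naive` and `softmax_stable` and the large input values are assumptions made for this illustration.

```python
import numpy as np


def softmax_naive(inputs):
    """Softmax without subtracting the maximum (can overflow)."""
    exp_values = np.exp(inputs)
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)


def softmax_stable(inputs):
    """Softmax with the row maximum subtracted, as in Activation_Softmax."""
    exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)


small = np.array([[4.8, 1.21, 2.385]])        # values from the earlier cell
large = np.array([[1000.0, 1001.0, 1002.0]])  # made-up large values

# For ordinary inputs both versions agree
print(np.allclose(softmax_naive(small), softmax_stable(small)))  # True

# For large inputs exp() overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))

# The stable version still returns a valid distribution
print(softmax_stable(large))
```

Subtracting the maximum makes the largest exponent 0, so every exponentiated value lies between 0 and 1 and the division is always well defined.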


We used the ***Rectified Linear (ReLU) activation function*** on the hidden layer, which works on a per-neuron basis. We additionally used the ***Softmax activation function*** for the output layer since it accepts non-normalized values as input and outputs a probability distribution, which we’re using as confidence scores for each class.
To begin adjusting ***weights*** and ***biases*** to decrease error over time, our next step is to quantify how wrong the model is through what’s defined as a ***loss function***.
