# Activation Function
- Activation functions are an essential component of neural networks, including deep learning models. 
- They introduce non-linearity to the network, enabling it to learn complex patterns and relationships in the data. 

- Bounded activation functions such as sigmoid and tanh map inputs to a known range, which can help stabilize training. In the last layer, the activation maps values to the desired output form, for example probabilities.


### ReLU (Rectified Linear Unit):

##### f(x)=max(0,x)
- ReLU is one of the most widely used activation functions due to its simplicity and effectiveness. 
- It replaces negative values with zero. Because its gradient is 1 for every positive input, it does not saturate on that side, which tends to speed up convergence and alleviates the vanishing gradient problem.

![image.png](attachment:image.png)

### Leaky ReLU:
 

- Introduces a small slope for negative inputs to prevent the dying ReLU problem, where a neuron outputs zero for every input and stops learning.
- Example: the same applications as ReLU, with the added advantage of mitigating dead neurons.

![image.png](attachment:image.png)
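
To make the dying ReLU point concrete, the following minimal sketch compares the gradients of ReLU and Leaky ReLU for negative inputs. The helper names `relu_grad` and `leaky_relu_grad` are hypothetical and introduced only for illustration, and `alpha=0.01` is an assumed slope.

```python
import numpy as np

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    # Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise.
    # alpha=0.01 is an assumed default, not a value taken from the text.
    return np.where(x > 0, 1.0, alpha)

pre_activations = np.array([-3.0, -0.5, 2.0])
relu_g = relu_grad(pre_activations)
leaky_g = leaky_relu_grad(pre_activations)

print("ReLU gradient:      ", relu_g)
print("Leaky ReLU gradient:", leaky_g)
```

With ReLU, a neuron whose pre-activation is negative for every input receives a zero gradient and cannot recover. The small slope of Leaky ReLU keeps a nonzero gradient flowing.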

### Exponential Linear Unit (ELU):


- Similar to ReLU, but produces negative outputs for negative inputs, with a smooth transition around zero.
- Example: Image segmentation where smoother transitions between object and background pixels are desired.
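
For reference, the usual definition of ELU can be written as follows. It matches the `elu` function in the code cell of this notebook, where $\alpha$ is a positive hyperparameter (1.0 by default there).

$$
\mathrm{ELU}(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left(e^{x} - 1\right), & x \le 0
\end{cases}
$$

For large negative inputs the output approaches $-\alpha$, so the function is bounded below.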


![image-2.png](attachment:image-2.png)

### Sigmoid

- The sigmoid function squashes input values into the range between 0 and 1, making it suitable for binary classification tasks where the output needs to be interpreted as a probability.
- However, its gradient is at most 0.25 and approaches zero for large positive or negative inputs, so it suffers from the vanishing gradient problem and is rarely used in hidden layers of deep neural networks.
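
The vanishing gradient issue can be illustrated with a short sketch. It assumes a simplified picture in which each sigmoid layer contributes one derivative factor to the backpropagated gradient and the weights are ignored. The names `sigmoid_grad`, `depth`, and `best_case_factor` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([-10.0, 0.0, 10.0])
grads = sigmoid_grad(xs)
print("Gradients at -10, 0, 10:", grads)

# Simplifying assumption: ignore the weights and multiply one sigmoid
# derivative per layer. Even in the best case (every input at 0) each
# layer scales the gradient by 0.25.
depth = 10
best_case_factor = sigmoid_grad(0.0) ** depth
print("Best-case gradient factor after", depth, "layers:", best_case_factor)
```

The gradient peaks at 0.25 for an input of 0 and is close to zero in the saturated regions, so the product over many layers shrinks quickly.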
 
![image-2.png](attachment:image-2.png)

![image.png](attachment:image.png)

### Tanh (Hyperbolic Tangent):

- Tanh function squashes the input values between -1 and 1, similar to the sigmoid function but with output centered at zero.
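- It is often used in hidden layers of neural networks. Its gradient peaks at 1 (versus 0.25 for sigmoid), which reduces, but does not eliminate, the vanishing gradient problem, since tanh still saturates for large positive or negative inputs.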
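
As a rough numerical check of the comparison with sigmoid, the sketch below evaluates both derivatives on a grid of inputs. The grid range and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

# Evaluate both activations on a symmetric grid of inputs.
xs = np.linspace(-5, 5, 1001)

sig = 1 / (1 + np.exp(-xs))
sig_grad = sig * (1 - sig)          # derivative of sigmoid
tanh_out = np.tanh(xs)
tanh_grad = 1 - tanh_out ** 2       # derivative of tanh

print("Peak sigmoid gradient:", sig_grad.max())
print("Peak tanh gradient:   ", tanh_grad.max())
print("Mean sigmoid output:  ", sig.mean())
print("Mean tanh output:     ", tanh_out.mean())
```

On a symmetric input range the tanh outputs average to roughly zero while the sigmoid outputs average to roughly 0.5, which is what "centered at zero" refers to.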

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Softmax:

- Softmax function is commonly used in the output layer of neural networks for multi-class classification tasks.
- It computes the probability distribution over multiple classes, ensuring that the output values sum up to 1.0.
- Softmax is useful for mutually exclusive classes.
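
The properties above can be checked with a small sketch. The logits are made-up values; the second row uses very large numbers to show why the maximum is subtracted before exponentiating.

```python
import numpy as np

def softmax(x):
    # Subtracting the row maximum leaves the result unchanged
    # but avoids overflow in np.exp for large logits.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Two hypothetical rows of logits, one per example.
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])

probs = softmax(logits)
print(probs)
print("Row sums:", probs.sum(axis=-1))
print("Predicted classes:", probs.argmax(axis=-1))
```

Each row sums to 1, and the predicted class is the index of the largest probability, which is why softmax suits problems where exactly one class applies.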

![image.png](attachment:image.png)

![image.png](attachment:image.png)

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def softmax(x):
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

# Example usage:
input_data = np.array([1, -2, 3, 0])

print("Sigmoid:", sigmoid(input_data))
print("Tanh:", tanh(input_data))
print("ReLU:", relu(input_data))
print("Leaky ReLU:", leaky_relu(input_data))
print("ELU:", elu(input_data))
print("Softmax:", softmax(input_data))
```

```
Sigmoid: [0.73105858 0.11920292 0.95257413 0.5       ]
Tanh: [ 0.76159416 -0.96402758  0.99505475  0.        ]
ReLU: [1 0 3 0]
Leaky ReLU: [ 1.   -0.02  3.    0.  ]
ELU: [ 1.         -0.86466472  3.          0.        ]
Softmax: [0.11354962 0.0056533  0.83902451 0.04177257]
```


# Optimizers

- In deep learning, optimizers are algorithms that play a crucial role in the training process.
- Their main function is to adjust the internal parameters of a model, like weights and biases, to minimize a loss function. 
- This loss function essentially measures how well the model performs on a given dataset. 
- By minimizing the loss function, the optimizer helps the model learn and improve its accuracy in making predictions.

### Stochastic Gradient Descent (SGD): 
- A fundamental optimizer that updates the parameters using the gradient computed from a single training example at a time. Each update is cheap but noisy, so convergence can take many steps, although it works well for certain problems.
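
A minimal sketch of the per-example update, assuming a one-feature linear model fitted to synthetic data with a squared-error loss. The data, the names `w`, `b`, and `learning_rate`, and the hyperparameter values are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = 3x + 0.5 plus small noise.
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0
learning_rate = 0.05

for epoch in range(5):
    # Visit the examples in a fresh random order each epoch.
    for i in rng.permutation(len(X)):
        error = (w * X[i, 0] + b) - y[i]
        # Gradient of 0.5 * error**2 with respect to w and b,
        # computed from this single example only.
        w -= learning_rate * error * X[i, 0]
        b -= learning_rate * error

print(f"w={w:.2f}, b={b:.2f}")
```

Every example triggers its own parameter update, which is what makes the updates frequent and noisy.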

![image.png](attachment:image.png)


**Advantages of Stochastic Gradient Descent**

- Frequent updates of the model parameters.
- Requires less memory.
- Allows the use of large data sets, as it processes only one example at a time.

**Disadvantages of Stochastic Gradient Descent**

- The frequent updates can also produce noisy gradients, which may cause the error to increase instead of decrease.
- High variance in the parameter updates.
- Frequent updates are computationally expensive.

### Mini-batch Gradient Descent:
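- A variation of SGD that updates the parameters using the average gradient over a small batch of data points, offering a balance between the cheap but noisy updates of single-example SGD and the stable but expensive updates of full-batch gradient descent.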
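
The batch-wise update can be sketched as follows, again on synthetic linear-regression data. The names `true_w`, `batch_size`, and `learning_rate` and their values are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (assumed for illustration).
X = rng.normal(size=(256, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=256)

w = np.zeros(3)
learning_rate = 0.1
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        # Gradient of 0.5 * mean(residual**2) over the current batch.
        grad = X[idx].T @ residual / len(idx)
        w -= learning_rate * grad

print("Estimated weights:", np.round(w, 2))
```

Averaging over 32 examples reduces the variance of each gradient estimate, and the matrix products let the whole batch be processed in one vectorized step.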

![image.png](attachment:image.png)

**Advantages of Mini-batch Gradient Descent**

- It leads to more stable convergence.
- More efficient gradient calculations.
- Requires less memory.

**Disadvantages of Mini-batch Gradient Descent**

- Mini-batch gradient descent does not guarantee good convergence.
- If the learning rate is too small, the convergence rate will be slow.
- If it is too large, the loss function will oscillate around the minimum or even diverge.

### RMSProp (Root Mean Square Propagation):

- It modifies SGD to normalize the gradients by dividing by the square root of the exponential moving average of squared gradients.
- Helps to adaptively scale the learning rate for each parameter.
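
A minimal sketch of this update rule, applied to a toy quadratic whose two coordinates have very different curvature. The function name `rmsprop_step`, the hyperparameter values, and the test problem are assumptions for illustration and are not taken from any particular library.

```python
import numpy as np

def rmsprop_step(param, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    # Exponential moving average of the squared gradients.
    avg_sq = decay * avg_sq + (1 - decay) * grad ** 2
    # Divide each gradient component by the root of its own average.
    param = param - lr * grad / (np.sqrt(avg_sq) + eps)
    return param, avg_sq

# Toy loss (assumed): f(w) = 0.5 * (100 * w[0]**2 + w[1]**2).
curvature = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
avg_sq = np.zeros(2)

for step in range(500):
    grad = curvature * w
    w, avg_sq = rmsprop_step(w, grad, avg_sq)

print("Final parameters:", w)
```

Because each gradient component is divided by its own running magnitude, the steep and the shallow coordinate take steps of similar size, even though their raw gradients differ by a factor of 100.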

![image.png](attachment:image.png)

**Advantages of RMSProp**

- RMSProp adjusts the learning rate automatically and uses a different effective learning rate for each parameter.

**Disadvantages of RMSProp**

- Learning can be slow.

### Adagrad (Adaptive Gradient Algorithm):

- Adapts the learning rate for each parameter by scaling it inversely proportional to the square root of the sum of historical squared gradients.
- Particularly useful for sparse data.
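
The scaling rule can be sketched as below. The example feeds one parameter a gradient on every step and another parameter a gradient only on every tenth step, as a stand-in for a sparse feature. The function name `adagrad_step`, the gradient values, and the learning rate are illustrative assumptions.

```python
import numpy as np

def adagrad_step(param, grad, sum_sq, lr=0.1, eps=1e-8):
    # Accumulate the full history of squared gradients (never decays).
    sum_sq = sum_sq + grad ** 2
    # Scale each step by the inverse root of that accumulated sum.
    param = param - lr * grad / (np.sqrt(sum_sq) + eps)
    return param, sum_sq

lr = 0.1
param = np.zeros(2)
sum_sq = np.zeros(2)

for step in range(100):
    # Parameter 0 gets a gradient every step (dense feature);
    # parameter 1 only every 10th step (sparse feature).
    grad = np.array([1.0, 1.0 if step % 10 == 0 else 0.0])
    param, sum_sq = adagrad_step(param, grad, sum_sq, lr=lr)

effective_lr = lr / (np.sqrt(sum_sq) + 1e-8)
print("Accumulated squared gradients:", sum_sq)
print("Effective learning rates:", effective_lr)
```

The rarely updated parameter keeps a larger effective learning rate, which is why Adagrad suits sparse data. The accumulated sum only grows, so every effective learning rate can only shrink over time.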

![image.png](attachment:image.png)

**Advantages of Adagrad**

- The learning rate changes adaptively with iterations.
- It can train on sparse data as well.

**Disadvantage of Adagrad**

- Because the sum of squared gradients only grows, the effective learning rate keeps shrinking. Over long training runs, for example with deep networks, it can become so small that learning effectively stalls.

### Adam (Adaptive Moment Estimation):

- Combines ideas from RMSProp and Momentum.
- Computes adaptive learning rates for each parameter.
- It maintains exponential moving averages of both gradients and their squares, and then updates the parameters.
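
A minimal sketch of the update, following the commonly published formulation of Adam with bias correction, which is assumed here because the text does not spell it out. The function name `adam_step` and the toy objective are hypothetical. The defaults `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly cited ones, while `lr=0.01` is chosen for this example.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moving average of the gradients (the Momentum part).
    m = beta1 * m + (1 - beta1) * grad
    # Moving average of the squared gradients (the RMSProp part).
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both averages start at zero, so early values are too small.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective (assumed): f(w) = sum((w - target)**2).
target = np.array([2.0, -1.0])
w = np.zeros(2)
m = np.zeros(2)
v = np.zeros(2)

for t in range(1, 2001):  # t starts at 1 so the bias correction is defined
    grad = 2 * (w - target)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)

print("Final parameters:", np.round(w, 3))
```

The state kept per parameter is just the two running averages `m` and `v`, which is why the memory overhead stays small.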

![image.png](attachment:image.png)

**Advantages of Adam**

- Easy to implement.
- Computationally efficient.
- Low memory requirements.