# Why Are Activation Functions Required in Machine Learning?

Activation functions are essential components in neural networks. They introduce non-linearity to the model, enabling it to learn and model complex relationships in data. Without activation functions, a neural network can only compute a linear function of its input, just like a simple linear regression model, no matter how many layers it has.

# Key Reasons for Using Activation Functions
**Introducing Non-Linearity:**

* Neural networks solve problems that are often non-linear in nature (e.g., image recognition, language translation).
* Activation functions allow the model to learn from and represent non-linear patterns, making them suitable for solving complex tasks.

**Deciding Neuron Output:**

* Activation functions decide whether a particular neuron should be activated (output a signal) based on its input.
* This loosely mimics how biological neurons fire only once their input is strong enough.

**Enabling Deep Networks:**

* Without non-linear activation functions, stacking multiple layers in a neural network would not add any extra power to the model.
* Non-linearity ensures that each layer captures different levels of abstraction.

**Bounded Output:**

* Many activation functions limit the range of output values (e.g., sigmoid outputs between 0 and 1). This helps with stability and interpretability.

# Example to Understand the Importance

Without Activation Function:

* Consider a neural network with linear activation (f(x)=x):
* The output of each layer is just a linear transformation of the input.
* Stacking multiple layers results in a single linear transformation: y=Wx+b.
* This makes the network unable to capture complex relationships in data.
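
The collapse described above can be checked numerically. The following is a minimal sketch in plain Python; the weights `W1`, `b1`, `W2`, `b2` and the helper names are made-up illustrative assumptions. It compares a two-layer network with identity activation against a single merged layer with W = W2·W1 and b = W2·b1 + b2.

```python
# Two stacked linear layers (no activation) versus one merged linear layer.
# All weights below are made-up illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]

def two_layer(x):
    h = add(matvec(W1, x), b1)      # layer 1 with f(x) = x
    return add(matvec(W2, h), b2)   # layer 2 with f(x) = x

# Merge both layers into a single linear transformation y = Wx + b.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

x = [3.0, -1.0]
print(two_layer(x))
print(one_layer(x))   # same values, up to floating-point rounding
```

Both calls print the same vector, so the second layer added no expressive power.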

# What is a Step Function?

A Step Function is a simple activation function used in neural networks. It outputs a fixed value (usually 0 or 1) depending on whether the input is below or above a certain threshold. This function is used to decide whether a neuron should be activated or not.

**Mathematical Representation**
The step function can be defined as:

$$
f(x) = 
\begin{cases} 
1 & \text{if } x \geq 0 \\
0 & \text{if } x < 0 
\end{cases}
$$

* If the input x is greater than or equal to 0, the output is 1.
* If the input x is less than 0, the output is 0.

# How It Works

* Input to Neuron: The neuron receives a weighted sum of inputs (x).
* Threshold Check: The step function checks if the input meets or exceeds the threshold (usually 0).
* Output Decision: Based on the input value:
  * If x≥0: The neuron is "activated" (output = 1).
  * If x<0: The neuron is "not activated" (output = 0).

![image.png](attachment:image.png)

# Simple Example

Suppose we have a step function in a binary classification problem:

* If the input x=3, then f(3)=1 (positive class).
* If the input x=−2, then f(−2)=0 (negative class).

This can be thought of as a yes/no decision boundary.
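
As a small sketch of the example above, the step function can be written in a few lines of Python. The `threshold` parameter and its default of 0 are assumptions chosen to match the definition given earlier.

```python
def step(x, threshold=0.0):
    """Return 1 if x is at or above the threshold, otherwise 0."""
    return 1 if x >= threshold else 0

print(step(3))    # 1 (positive class)
print(step(-2))   # 0 (negative class)
print(step(0))    # 1 (the boundary counts as activated)
```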

# Advantages and Disadvantages of Step Function

# Advantages

**Simplicity:**

* The step function is easy to understand and implement.
* Suitable for simple decision-making tasks, such as binary classification.

**Efficient Computation:**

* Requires minimal computational resources as it involves only a comparison operation.

**Binary Decision:**

* Produces a clear and deterministic output (0 or 1), which is useful for threshold-based tasks.

**Historical Significance:**

* Used in perceptrons, which were among the first neural network models, laying the foundation for modern deep learning.

# Disadvantages

**Non-Differentiable:**

* The function has a sharp jump at the threshold, making it non-differentiable. This is a major limitation for training neural networks using gradient descent.

**No Gradient Information:**

* The gradient (rate of change) is either 0 or undefined, which prevents the model from learning and updating weights effectively.

**Limited Application:**

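* A single neuron with a step activation can only draw a straight-line (linear) decision boundary, and networks of step units cannot be trained with gradient descent, so they struggle to model complex relationships in data.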
* Modern activation functions like ReLU or Sigmoid are preferred for their flexibility and smooth transitions.

**Abrupt Transitions:**

* The output changes abruptly from 0 to 1 without considering intermediate values, leading to less smooth learning.

**Not Suitable for Multi-Class Problems:**

* Cannot handle multi-class outputs; functions like Softmax are used instead for such tasks.
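
The "no gradient information" point can be illustrated with a finite-difference check. This is only a sketch: `numerical_slope` is a hypothetical helper, and the step size `h` is an arbitrary choice.

```python
def step(x):
    """Step activation with threshold 0."""
    return 1 if x >= 0 else 0

def numerical_slope(f, x, h=1e-4):
    """Central finite-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Away from the threshold the slope is exactly 0, so gradient descent
# receives no signal about which direction to move the weights.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, numerical_slope(step, x))

# At the threshold the estimate blows up as h shrinks (the derivative is undefined).
print(numerical_slope(step, 0.0))
```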

![image-2.png](attachment:image-2.png)

**In the graph above, some data points are misclassified; they are circled in red.**


![image-4.png](attachment:image-4.png)

**In this neural network, more than one output neuron gives 1, which makes it difficult to predict the final (exact) output.**

To overcome this problem, we use the Sigmoid function.

# What is a Sigmoid Function?

The Sigmoid function is a mathematical function that maps any real-valued number into a value between 0 and 1. It is widely used as an activation function in machine learning, especially in neural networks, to introduce non-linearity and make predictions in probabilistic terms.

**Mathematical Formula**

The sigmoid function is defined as:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Where:

* x is the input to the function.
* e is the base of natural logarithms (~2.718).

# Key Characteristics

**Range:**

* Outputs values between 0 and 1, making it suitable for binary classification problems.

**S-Shaped Curve:**

* It has an "S" shape, also called a logistic curve.

**Smooth Gradient:**

* The output changes gradually, making it differentiable and useful for gradient-based learning.

**How It Works**

* Input (𝑥) Positive: If x>0, the output approaches 1 as x increases.
* Input (𝑥) Negative: If x<0, the output approaches 0 as 𝑥 decreases.
* Input (x=0): The output is exactly 0.5.

![image.png](attachment:image.png)

# Example

Suppose we use the sigmoid function to predict whether an email is spam or not:

If the input x=2, the output is:

$$f(2) = \frac{1}{1 + e^{-2}} \approx 0.88$$

This means there's an 88% chance the email is spam.

If x=−2, the output is:
 
$$f(-2) = \frac{1}{1 + e^{2}} \approx 0.12$$

This means there's only a 12% chance the email is spam.
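
The two values in this example can be reproduced with a short sketch. The scores 2 and −2 are the inputs from the example above; treating them as "spam scores" is just the illustration used in this section.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for score in (2, -2, 0):
    print(score, round(sigmoid(score), 2))
# 2 -> 0.88, -2 -> 0.12, 0 -> 0.5
```

Note that the two probabilities add up to 1, because sigmoid(−x) = 1 − sigmoid(x).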

# Advantages

**Smooth Output:**

* Provides probabilities for classification problems.

**Easy Interpretation:**

* Maps outputs to the range of 0 to 1, making them interpretable as probabilities.

**Non-Linearity:**

* Allows neural networks to learn non-linear relationships.

# Disadvantages

**Vanishing Gradient:**

* For large positive or large negative inputs, the gradient becomes close to 0, slowing down learning in deep networks.

**Not Zero-Centered:**

* Outputs are always positive (between 0 and 1), so the values passed to the next layer are not centered around 0, which can lead to slower convergence during training.

**Computationally Expensive:**

* Involves exponential calculations, which can be slower compared to simpler functions like ReLU.

# Where It’s Used

* Binary Classification: To output probabilities (e.g., logistic regression, neural networks).
* Probabilistic Models: To map outputs into a probability range.
* Output Layer of Neural Networks: Especially for binary classification problems.

 

# What is a Tanh (Hyperbolic Tangent) Function?

The Tanh function (short for hyperbolic tangent) is a mathematical function often used as an activation function in machine learning. Like the Sigmoid function, it introduces non-linearity, but its output is centered around 0, ranging between -1 and 1.

**Mathematical Formula**

The Tanh function is defined as:

$$f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

Where:

* x is the input to the function.
* e is the base of natural logarithms (~2.718).

# Key Characteristics
**Range:**
* Outputs values between -1 and 1, making it symmetric around 0.
**Shape:**
* It has an "S-shaped" curve like the Sigmoid function but centered at 0.
**Zero-Centered Output:**
* This property makes optimization easier in some machine learning models.

# How It Works

* Input (𝑥) Positive: If x>0, the output approaches 1 as 𝑥 increases.
* Input (𝑥) Negative: If x<0, the output approaches -1 as 𝑥 decreases.
* Input (x=0): The output is exactly 0.

![image.png](attachment:image.png)

# Example

If x=2, the output is:

$$\tanh(2) = \frac{e^2 - e^{-2}}{e^2 + e^{-2}} \approx 0.96$$

This means the output is close to 1 for a large positive input.

If x=−2, the output is:

$$\tanh(-2) = \frac{e^{-2} - e^{2}}{e^{-2} + e^{2}} \approx -0.96$$

This means the output is close to -1 for a large negative input.
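
A quick sketch to check these numbers, and to show how Tanh relates to Sigmoid: tanh(x) = 2·sigmoid(2x) − 1, so Tanh is a rescaled, zero-centered Sigmoid. The function names here are illustrative.

```python
import math

def tanh(x):
    """Hyperbolic tangent written out from its definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (2, -2, 0):
    # The last two columns agree up to floating-point rounding.
    print(x, round(tanh(x), 2), round(2 * sigmoid(2 * x) - 1, 2))
# 2 -> 0.96, -2 -> -0.96, 0 -> 0.0
```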

# Advantages

**Zero-Centered:**
* Unlike the Sigmoid function, the output is centered around 0, which helps in faster convergence during training.
**Non-Linearity:**
* Allows the model to learn complex patterns.
**Wide Range:**
* Outputs range from -1 to 1, which can represent more nuanced information than the Sigmoid function.

# Disadvantages

**Vanishing Gradient:**
* For large positive or large negative inputs, the gradient becomes close to 0, which can slow down learning.
**Computational Cost:**
* Similar to Sigmoid, it involves exponential calculations, which can be slower than simpler activation functions like ReLU.

# Where It’s Used

* Hidden Layers: Often used in hidden layers of neural networks when zero-centered activations are preferred.
* Text and Time-Series Models: For tasks where outputs are required in both positive and negative ranges (e.g., NLP tasks like sentiment analysis).

# Use `sigmoid` in the output layer. In all other places, try to use `tanh`

# Issues with Sigmoid and Tanh in Terms of Their Derivatives (in Simple Words)

Both Sigmoid and Tanh activation functions can suffer from a problem called the Vanishing Gradient Problem when their derivatives become very small during backpropagation in deep neural networks. Here's a simple explanation:

1. Sigmoid Function

Derivative of Sigmoid
The derivative of the sigmoid function is: 

$$f'(x) = f(x) \cdot (1 - f(x))$$

where f(x) is the sigmoid function.

**Problem**

* Small Derivative for Large Inputs:
  * When the input x is very large (positive) or very negative, the output of the sigmoid function becomes close to 1 or 0.
  * In these regions, f′(x) becomes very small (close to 0).

**Impact on Learning:**

* During backpropagation, gradients get multiplied layer by layer. If the gradient is very small, it can "vanish", causing earlier layers in the network to stop learning effectively.
* This slows down or even stops the training process.


2. Tanh Function

Derivative of Tanh

The derivative of the Tanh function is:

$$f'(x) = 1 - f(x)^2$$

where f(x) is the Tanh function.

**Problem**

* Small Derivative for Large Inputs:
  * When the input x is very large (positive) or very negative, the output of the Tanh function approaches 1 or -1.
  * In these regions, f′(x) becomes very small (close to 0).

**Impact on Learning:**

* Like sigmoid, the small gradients lead to the vanishing gradient problem, especially in deep networks, making it hard to learn in earlier layers.
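
The two derivative formulas above can be sanity-checked against a finite-difference estimate. This is a sketch only: `finite_difference` is a hypothetical helper, and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """f'(x) = f(x) * (1 - f(x)) for the sigmoid."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """f'(x) = 1 - f(x)^2 for tanh."""
    t = math.tanh(x)
    return 1.0 - t * t

def finite_difference(f, x, h=1e-5):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (0.0, 2.0, 5.0, -5.0):
    print(x,
          sigmoid_grad(x), finite_difference(sigmoid, x),
          tanh_grad(x), finite_difference(math.tanh, x))
```

The derivatives peak at x=0 (0.25 for Sigmoid, 1 for Tanh) and fall toward 0 as the input moves away from 0 in either direction.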

# Why This Matters

When the gradient is small (or vanishes):

**Slow Learning:**

* Weights in earlier layers update very slowly or not at all.

**Difficulty Training Deep Networks:**

* This makes sigmoid and tanh less suitable for deep networks, as the information struggles to propagate backward through many layers.

# How ReLU Solves This

* ReLU does not suffer from vanishing gradients for positive inputs because its derivative is constant (1 for x>0).
* It allows gradients to flow more effectively, enabling faster and more stable training in deep networks.


# Explaining the Vanishing Gradient Problem in Sigmoid and Tanh with a Simple Example

Imagine you're teaching a group of students through a chain of teachers (like layers in a neural network). The first teacher explains to the second, the second to the third, and so on. If the message weakens at each step, by the time it reaches the last teacher, it becomes almost useless.

# How This Relates to Neural Networks

**Backpropagation:**

* During training, the error (or feedback) is sent backward through the network to update weights.
* Each layer multiplies the feedback by the derivative of the activation function (Sigmoid or Tanh in this case).

**Small Derivatives:**

* If the derivative of the activation function is very small, the feedback shrinks more and more as it goes backward.
* For Sigmoid and Tanh, the derivative becomes very small when the input is far from 0 (either very positive or very negative).

# Example: Sigmoid Activation

Suppose a layer in your network uses the Sigmoid function. If the input to the Sigmoid function is x=10:

$$f(x) = \frac{1}{1 + e^{-10}} \approx 0.9999546$$

The derivative is:

$$f'(x) = f(x) \cdot (1 - f(x)) \approx 0.9999546 \cdot (1 - 0.9999546) \approx 0.0000454$$

This derivative is very small! When it's multiplied into the gradients passed back to earlier layers, the updates become tiny. Over multiple layers, the gradients can become nearly zero, stopping the earlier layers from learning.


# Example: Tanh Activation

Now consider the Tanh function. If the input is x=3:

$$f(x) = \tanh(3) \approx 0.99505$$

The derivative is:

$$f'(x) = 1 - f(x)^2 \approx 1 - (0.99505)^2 \approx 0.0099$$

Again, the derivative is small, and the same problem occurs: the feedback shrinks layer by layer.
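
To see how quickly the feedback shrinks, the sketch below multiplies the same local derivative across several layers. It is a deliberate simplification: it assumes every layer sees the same input (x=10 for Sigmoid, x=3 for Tanh) and it ignores the weights, which also scale the gradient in a real network. `shrink` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    t = math.tanh(x)
    return 1.0 - t * t

def shrink(local_grad, num_layers):
    """Factor by which feedback is scaled after passing through num_layers
    layers that all have the same local derivative (weights ignored)."""
    return local_grad ** num_layers

for n in (1, 3, 5):
    print(n, shrink(sigmoid_grad(10), n), shrink(tanh_grad(3), n))
```

Even in the best case for Sigmoid (x=0, derivative 0.25), five layers scale the feedback by 0.25⁵ ≈ 0.001.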

# Real-World Example Analogy

* Think of a classroom with 5 teachers (representing 5 layers).
* The principal (output layer) gives instructions: "Improve teaching methods."
* If each teacher reduces the clarity of the instructions (small derivatives), by the time it reaches the first teacher (input layer), the message is so weak they don't know what to do.

This is why learning slows down or stops for earlier layers.

# How ReLU Helps

* For ReLU, the derivative is constant for positive inputs: f′(x)=1 if x>0.

This ensures that the feedback doesn't shrink, allowing all layers to learn effectively.
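
A minimal sketch of this behaviour, under the same simplifying assumption of ignoring the weights: the feedback is multiplied by 1 at every layer whose input is positive. The value of the derivative at exactly x=0 is a convention (0 is used here), and `gradient_scale` is a hypothetical helper.

```python
def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def gradient_scale(local_grads):
    """Product of the local derivatives along one path through the network."""
    scale = 1.0
    for g in local_grads:
        scale *= g
    return scale

# Five layers whose inputs are all positive: the feedback is not shrunk.
active_path = [relu_grad(z) for z in (0.5, 3.0, 10.0, 42.0, 7.0)]
print(gradient_scale(active_path))    # 1.0

# One negative input anywhere on the path blocks the feedback completely.
blocked_path = [relu_grad(z) for z in (0.5, -3.0, 10.0)]
print(gradient_scale(blocked_path))   # 0.0
```

The second case is the flip side of ReLU: the gradient does not shrink gradually, but it can be cut off entirely for negative inputs.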


# What is the ReLU (Rectified Linear Unit) Function?

The ReLU function is one of the most commonly used activation functions in deep learning. It introduces non-linearity into the neural network and is defined as:

$$f(x) = \max(0, x)$$

This means:

* If the input x>0, the output is x.
* If the input x≤0, the output is 0.

# Key Characteristics of ReLU

**Range:**

* Outputs values between 0 and infinity.

**Simplicity:**

* Computationally efficient because it requires only a comparison and a simple operation.

**Non-Linearity:**

* Despite its simple definition, ReLU introduces non-linearity, enabling the model to learn complex patterns.

![image.png](attachment:image.png)

# How It Works

* Positive Input:

If x=3, f(3)=max(0,3)=3.

* Negative or Zero Input:

If x=−2, f(−2)=max(0,−2)=0.

If x=0, f(0)=max(0,0)=0.

# Example

Imagine a neural network layer with a set of inputs x=[−2,−1,0,1,2]. Applying the ReLU function gives:

f(x)=[0,0,0,1,2]
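
The same example as a short Python sketch, applying ReLU element by element to the list of inputs:

```python
def relu(x):
    """ReLU: pass positive values through, clamp everything else to 0."""
    return max(0, x)

inputs = [-2, -1, 0, 1, 2]
outputs = [relu(v) for v in inputs]
print(outputs)   # [0, 0, 0, 1, 2]
```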

# Advantages of ReLU

**Efficient Computation:**

* It is simple and requires less computation compared to functions like Sigmoid and Tanh.

**Sparsity:**

* Sets all negative values to 0, which can introduce sparsity in the network (some neurons do not activate). This can improve efficiency and may help reduce overfitting.

**Avoids Saturation:**

* Unlike Sigmoid and Tanh, it does not saturate for large positive inputs, enabling faster learning.

# Disadvantages of ReLU

**Dying ReLU Problem:**

* Neurons can "die" if they consistently output 0 for all training data, making them inactive and not contributing to learning.

**Unbounded Output:**

* Outputs are not constrained, which can sometimes lead to instability in optimization.

# Example Analogy

Think of neurons as team members working on a project.

* If a team member (neuron) is consistently excluded from tasks (negative inputs), they stop contributing and disengage completely.

* To keep everyone involved, provide alternative ways to contribute (e.g., Leaky ReLU allows small outputs even for negative inputs).

# Variants of ReLU

To overcome its limitations, variants like Leaky ReLU and Parametric ReLU (PReLU) are used:

**Leaky ReLU:**

* Allows a small, non-zero gradient for negative inputs (f(x)=0.01x when x<0).

**Parametric ReLU (PReLU):**

* Makes the slope for negative values a learnable parameter.

# Where It’s Used

**Deep Learning Models:**

* Primarily in the hidden layers of neural networks.

**Convolutional Neural Networks (CNNs):**

* Often used in image processing tasks.

**Recurrent Neural Networks (RNNs):**

* Can also be used in time-series data.


# What is the Dying ReLU Problem?

* In a neural network, if a neuron consistently receives negative inputs, the ReLU function will output 0 every time (f(x)=0 for x≤0).

* Whenever the neuron's input is negative, its gradient is 0 (since the derivative of ReLU for negative inputs is 0). If this happens for every training example, the weights associated with this neuron stop updating during backpropagation.
* Such neurons effectively become "dead" and contribute nothing to the learning process.
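
Here is a small sketch of a "dead" neuron. The weights, the large negative bias, and the three samples are made-up values chosen so that the weighted sum is negative for every sample.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron whose bias has drifted far into the negative range.
weights = [0.5, -0.3]
bias = -10.0
samples = [[1.0, 2.0], [0.5, 0.1], [3.0, 1.0]]

all_weight_grads = []
for x in samples:
    z = sum(w * xi for w, xi in zip(weights, x)) + bias   # weighted sum
    # Gradient of the neuron's output with respect to each weight:
    # d relu(z) / d w_i = relu_grad(z) * x_i
    grads = [relu_grad(z) * xi for xi in x]
    all_weight_grads.append(grads)
    print(z, relu(z), grads)

# Every gradient is 0, so no training step can change these weights.
```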

# So to overcome this problem we use `Leaky ReLU`

# What is Leaky ReLU?

Leaky ReLU (Leaky Rectified Linear Unit) is a variation of the ReLU activation function designed to address the "dying ReLU" problem, where some neurons stop learning because their outputs become zero for all inputs.

**Mathematical Definition**
For an input 𝑥, the Leaky ReLU function is defined as:

$$
f(x) = \begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0 
\end{cases}
$$

Here, 𝛼 is a small positive constant (e.g., 0.01) that allows small, non-zero outputs for negative inputs.

**How Does It Work?**

* Positive Inputs: For x>0, Leaky ReLU behaves just like ReLU (f(x)=x).

* Negative Inputs: For x≤0, it outputs a small negative value proportional to the input (f(x)=αx).

![image.png](attachment:image.png)

**Advantages of Leaky ReLU**

**Fixes Dying ReLU:**

* Unlike ReLU, Leaky ReLU allows small gradients even for negative inputs, preventing neurons from "dying."

**Efficient Computation:**

* It's simple to compute and retains the computational efficiency of ReLU.


**Disadvantages of Leaky ReLU**

**Fixed Slope for Negative Values:**

* The slope for negative values (𝛼) is constant and predefined. It might not always be optimal for the data or task.

**Not Zero-Centered:**

* Positive outputs are unbounded while negative outputs are scaled down by 𝛼, so the outputs are still not centered around 0 and optimization can sometimes become harder.

# Example

**Suppose we have the following inputs:**

* x=3: Output f(x)=3 (positive, same as input).

* x=−2: Output f(x)=−0.02 (negative, scaled by α=0.01).

* x=0: Output f(x)=0.
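
These three cases can be reproduced with a short sketch. The `alpha` parameter and its default of 0.01 follow the definition above; `leaky_relu_grad` is an added illustrative helper showing that the slope for negative inputs is α rather than 0.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (3, -2, 0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
# 3 -> output 3, slope 1.0
# -2 -> output -0.02, slope 0.01
# 0 -> output 0.0, slope 0.01
```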

**Real-Life Analogy**

Imagine a leaky faucet:

* When the faucet is fully open (positive inputs), water flows freely (like ReLU).
* When the faucet is closed (negative inputs), there’s still a small drip (small negative output, controlled by 𝛼).

**Comparison with Other Functions**

* `ReLU:` Outputs 0 for all negative inputs, which can "kill" neurons.
* `Leaky ReLU:` Outputs small negative values, keeping neurons active even for negative inputs.

# Here is the summary of all the graphs

![image-3.png](attachment:image-3.png)


# Let's jump into the coding part

# Sigmoid

In [6]:
import math

def sigmoid(x):
  return 1 / (1 + math.exp(-x))

In [2]:
sigmoid(100) # sigmoid maps any number into the range 0 to 1; for a large input the result rounds to 1.0

1.0

In [3]:
sigmoid(1) # about 0.73 (the output is 0.5 only at x=0)

0.7310585786300049

In [4]:
sigmoid(-27) # very close to 0

1.8795288165355508e-12

In [5]:
sigmoid(0.5) # about 0.62, a little above 0.5

0.6224593312018546
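
One practical caveat with the `sigmoid` above: for very negative inputs, `math.exp(-x)` grows too large for a float and Python raises `OverflowError`. The sketch below shows one common workaround, which picks an algebraically equivalent form depending on the sign of the input. The names `naive_sigmoid` and `stable_sigmoid` are illustrative.

```python
import math

def naive_sigmoid(x):
    """Same formula as above; overflows for very negative x."""
    return 1 / (1 + math.exp(-x))

def stable_sigmoid(x):
    """Sigmoid that never calls exp() on a large positive number."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)

try:
    naive_sigmoid(-1000)
except OverflowError as err:
    print("naive version failed:", err)

print(stable_sigmoid(-1000))   # 0.0
print(stable_sigmoid(1000))    # 1.0
print(stable_sigmoid(0))       # 0.5
```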

# tanh

In [7]:
def tanh(x):
  return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

In [8]:
tanh(-56)

-1.0

In [9]:
tanh(50)

1.0

In [10]:
tanh(1)

0.7615941559557649

# ReLU

In [11]:
def relu(x):
    return max(0,x)

In [12]:
relu(-1)

0

In [13]:
relu(27)

27

In [14]:
relu(-72)

0

# Leaky ReLU

In [15]:
def leaky_relu(x):
    # A slope of 0.1 is used here so the effect is easy to see; the text above uses 0.01
    return max(0.1*x,x)

In [18]:
leaky_relu(-1)

-0.1

In [19]:
leaky_relu(-2)

-0.2

In [20]:
leaky_relu(0)

0.0

In [21]:
leaky_relu(1)

1

In [22]:
leaky_relu(47)

47