In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.keras import backend as K

### Binary Classification with sigmoid activation function

Suppose we are training a model for a binary classification problem with a sigmoid activation function.

Given a training example with input $x^{(i)}$, the model will output a float between 0 and 1. Based on whether this float is less than or greater than our "threshold" (which by default is set at 0.5), we round the float to get the predicted classification $y_{pred}$ from the model.

The accuracy metric compares the value of $y_{pred}^{(i)}$ on each training example with the true label $y_{true}^{(i)}$ (0 or 1) from our training data.

Let $$\delta(y_{pred}^{(i)},y_{true}^{(i)}) = \begin{cases} 1 & y_{pred}^{(i)}=y_{true}^{(i)}\\
0 & y_{pred}^{(i)}\neq y_{true}^{(i)} \end{cases}$$

The accuracy metric computes the mean of $\delta(y_{pred}^{(i)},y_{true}^{(i)})$ over all training examples.

$$ accuracy = \frac{1}{N} \sum_{i=1}^N \delta(y_{pred}^{(i)},y_{true}^{(i)}) $$

This is implemented in the backend of Keras as follows.


In [7]:
y_true = tf.constant([0.0, 1.0, 1.0])
y_pred = tf.constant([0.4, 0.7, 0.3])

accuracy = K.mean(K.equal(y_true, K.round(y_pred)))
accuracy

<tf.Tensor: shape=(), dtype=float32, numpy=0.6666667>
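
The rounding step above corresponds to a fixed threshold of 0.5. As a minimal sketch in plain NumPy, the same computation with an adjustable threshold could look like the following. The helper name `binary_accuracy` and its `threshold` parameter are illustrative assumptions, not the Keras API.

```python
import numpy as np

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Predictions strictly above the threshold are assigned to class 1.
    y_hat = (y_pred > threshold).astype(float)
    return float(np.mean(y_hat == y_true))

# Same data as the cell above.
acc_default = binary_accuracy([0.0, 1.0, 1.0], [0.4, 0.7, 0.3])
# A lower (hypothetical) threshold changes which examples count as class 1.
acc_low = binary_accuracy([0.0, 1.0, 1.0], [0.4, 0.7, 0.3], threshold=0.35)
```

With the default threshold this reproduces the 2/3 accuracy shown above; lowering the threshold to 0.35 flips the first prediction to class 1 and changes the result.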

### Categorical Classification

Now suppose we are training a model for a classification problem which should sort data into $m>2$ different classes using a softmax activation function in the last layer.

Given a training example with input $x^{(i)}$, the model will output a tensor of probabilities $p_1, p_2, \dots p_m$, giving the likelihood (according to the model) that $x^{(i)}$ falls into each class.

The accuracy metric finds the index of the largest value in the $y_{pred}^{(i)}$ tensor and compares it with the index of the largest value in the one-hot vector $y_{true}^{(i)}$ to determine $\delta(y_{pred}^{(i)},y_{true}^{(i)})$. It then computes the accuracy in the same way as for the binary classification case.

$$ accuracy = \frac{1}{N} \sum_{i=1}^N \delta(y_{pred}^{(i)},y_{true}^{(i)}) $$

In the backend of Keras, the accuracy metric is implemented slightly differently depending on whether we have a binary classification problem ($m=2$) or a categorical classification problem. Note that the accuracy for binary classification problems is the same whether the output comes from a sigmoid or from a two-unit softmax activation function.


In [9]:
# Binary classification with softmax

y_true = tf.constant([[0.0,1.0],[1.0,0.0],[1.0,0.0],[0.0,1.0]])
y_pred = tf.constant([[0.4,0.6], [0.3,0.7], [0.05,0.95],[0.33,0.67]])
accuracy = K.mean(K.equal(y_true, K.round(y_pred)))
accuracy

# Categorical classification with m>2

y_true = tf.constant([[0.0,1.0,0.0,0.0],[1.0,0.0,0.0,0.0],[0.0,0.0,1.0,0.0]])
y_pred = tf.constant([[0.4,0.6,0.0,0.0], [0.3,0.2,0.1,0.4], [0.05,0.35,0.5,0.1]])
accuracy = K.mean(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)))
accuracy

<tf.Tensor: shape=(), dtype=float32, numpy=0.6666667>
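
To make the comparison between the two implementations concrete, here is a small NumPy sketch (the function name `categorical_accuracy` is an illustrative assumption, not a Keras call). It applies the argmax-based accuracy to both examples from the cell above and, for the two-class case, compares it with the element-wise rounding approach.

```python
import numpy as np

def categorical_accuracy(y_true, y_pred):
    """Fraction of rows where predicted and true class indices agree."""
    true_idx = np.argmax(y_true, axis=-1)
    pred_idx = np.argmax(y_pred, axis=-1)
    return float(np.mean(true_idx == pred_idx))

# Two-class example from the cell above.
y_true_2 = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_pred_2 = np.array([[0.4, 0.6], [0.3, 0.7], [0.05, 0.95], [0.33, 0.67]])
acc_argmax = categorical_accuracy(y_true_2, y_pred_2)
# Element-wise rounding, as used for the binary case.
acc_round = float(np.mean(y_true_2 == np.round(y_pred_2)))

# Four-class example from the cell above.
y_true_4 = np.array([[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
y_pred_4 = np.array([[0.4, 0.6, 0.0, 0.0], [0.3, 0.2, 0.1, 0.4], [0.05, 0.35, 0.5, 0.1]])
acc_4 = categorical_accuracy(y_true_4, y_pred_4)
```

For two softmax outputs that sum to 1 (and no exact ties at 0.5), a wrong prediction makes both entries of the row mismatch and a correct one makes both match, so rounding and argmax give the same accuracy.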

### Implementing Activation Functions with Numpy:

#### Sigmoid Activation Function:

By its mathematical definition, the sigmoid function takes any real number and returns an output value strictly between 0 and 1. The sigmoid function produces an “S” shaped curve. Mathematically, sigmoid is represented as:

$$ f(x) = \frac{1}{1 + e^{-x}} $$


<p align="center">
  <img  src=assets/sigmoid.png/>
</p>

##### Properties of the Sigmoid Function:

 - The sigmoid function takes in any real number and returns a real-valued output between 0 and 1.
 - The first derivative of the sigmoid function is always positive, so the function is monotonically increasing.
 - It appears in the output layers of the Deep Learning architectures, and is used for predicting probability based outputs and has been successfully implemented in binary classification problems, logistic regression tasks as well as other neural network applications.
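
The derivative property can be checked numerically. The sketch below assumes the standard identity $f'(x) = f(x)(1 - f(x))$ and compares it with a central finite difference; the function names and the step size `h` are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Analytic derivative: f'(x) = f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
h = 1e-5
# Central finite-difference approximation of the derivative.
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
max_abs_error = np.max(np.abs(numeric - sigmoid_grad(x)))
```

The gradient is positive everywhere and peaks at 0.25 for $x = 0$, which is one reason deep stacks of sigmoid layers tend to shrink gradients.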

In [5]:
class Sigmoid():
    
    def __call__(self, x):
        # Parentheses around the whole denominator: 1 / (1 + e^{-x})
        return 1 / (1 + np.exp(-x))
    
    def gradient(self, x):
        # sigma'(x) = sigma(x) * (1 - sigma(x))
        s = self.__call__(x)
        return s * (1 - s)

In [8]:
x = np.arange(-10, 10, 0.2)

y = Sigmoid()
y(x)  # rises monotonically from about 4.5e-05 at x = -10 toward 1; about 0.5 near x = 0


#### Softmax Activation Function:

The Softmax function is used for prediction in multi-class models, where it returns a probability for each class in a group of different classes. The class with the highest probability is then taken as the predicted class for the given inputs. Mathematically, softmax is represented as:

$$ softmax(x)_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} $$

##### What does the Softmax function look like?

Assume that you have values $x_1, x_2, \ldots, x_k$. The Softmax function for these values (in the sense of a smooth maximum, also known as LogSumExp) would be:

$\ln{\sum_{i=1}^k e^{x_i}}$

##### What is the Softmax function doing?

*It is approximating the max function*. Can you see why? Let us call the largest $x_i$ value $x_{max}.$ Because we take exponentials, $e^{x_{max}}$ will be much larger than any other $e^{x_i}$ whenever $x_{max}$ is clearly the largest value, so it dominates the sum.

$\ln{\sum_{i=1}^k e^{x_i}} \approx \ln e^{x_{max}}$

$\ln{\sum_{i=1}^k e^{x_i}} \approx x_{max}$
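
A quick numerical check of this approximation, as a minimal NumPy sketch (the helper name `log_sum_exp` and the sample vectors are illustrative assumptions):

```python
import numpy as np

def log_sum_exp(x):
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    # Subtracting the maximum first avoids overflow in exp().
    return m + np.log(np.sum(np.exp(x - m)))

close = np.array([1.0, 1.5, 2.0])     # values close together
spread = np.array([1.0, 5.0, 20.0])   # one value clearly dominates

gap_close = log_sum_exp(close) - np.max(close)
gap_spread = log_sum_exp(spread) - np.max(spread)
```

The gap to the true maximum is never negative and is at most $\ln k$; it becomes negligible when one value clearly dominates and is noticeable when the values are close together.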

Look at the graph below for a comparison between max(0, x) (red) and softmax(0, x) (blue).

<p align="center">
  <img  src=assets/softmax.png/  height="600px" width="600px">
</p>

##### Why is it called Softmax?

* It is an approximation of Max.
* It is a *soft/smooth* approximation of max. Notice how it approximates the sharp corner at 0 using a smooth curve.

##### What is the purpose of Softmax?

Softmax gives us the [differentiable](https://en.wikipedia.org/wiki/Differentiable_function) approximation of a [non-differentiable](https://math.stackexchange.com/questions/1329252/is-max0-x-a-differentiable-function) function max. Why is that important? **For optimizing models with gradient-based methods, including machine learning models, the functions describing the model need to be differentiable. So if we want to optimize a model which uses the max function then we can do that by replacing the max with softmax.**


##### But, What about the Softmax Activation Function name?

Here are my guesses on why the Softmax Activation function has the word “Softmax” in it:

* Softmax Activation function looks very similar to the Softmax function. Notice the denominator. $f(x_i)=\frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}$

* Softmax Activation function highlights the largest input and suppresses all the significantly [smaller ones](https://en.wikipedia.org/wiki/Softmax_function#Example) in certain conditions. In this way, it behaves similarly to the softmax function.

##### Properties of the Softmax Function:

- The Softmax function produces outputs that each lie between 0 and 1, with the sum of the probabilities being equal to 1.
- The main difference between the Sigmoid and Softmax functions is that Sigmoid is used in binary classification while the Softmax is used for multi-class tasks.


[REFERENCES](https://medium.com/data-science-bootcamp/softmax-function-beyond-the-basics-51f09ce11154)

In [16]:
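class Softmax():
    
    def __call__(self, x):
        # Normalize along the last axis so that each sample (row) sums to 1;
        # axis=-1 also handles a single 1-D vector.
        return np.exp(x)/np.sum(np.exp(x), axis=-1, keepdims=True)
    
    def gradient(self, x):
        # Diagonal of the softmax Jacobian only (d p_i / d x_i);
        # the off-diagonal terms -p_i * p_j are not returned.
        p = self.__call__(x)
        return p * (1-p)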

# Taking care of numerical stability

def softmax(x):
    # Compute softmax values for each set of scores in x.
    # Subtracting the maximum leaves the result unchanged but keeps exp() from overflowing.
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

In [17]:
x = np.array([0.3, 0.7, 0.9])
x2 = np.array([[1, 2, 3, 6],  # sample 1
               [2, 4, 5, 6],  # sample 2
               [1, 2, 3, 6]]) # sample 1 again(!)
y = Softmax()
y(x2)

array([[0.00626879, 0.01704033, 0.04632042, 0.93037047],
       [0.01203764, 0.08894682, 0.24178252, 0.65723302],
       [0.00626879, 0.01704033, 0.04632042, 0.93037047]])
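
The `gradient` method above returns `p * (1 - p)`, which corresponds to the diagonal of the softmax Jacobian. As a sketch for a single input vector, the full Jacobian $J_{ij} = p_i(\delta_{ij} - p_j)$ could be computed as follows; the function names here are illustrative and not part of the class above.

```python
import numpy as np

def softmax(x):
    # Shift by the maximum for numerical stability.
    e_x = np.exp(x - np.max(x))
    return e_x / np.sum(e_x)

def softmax_jacobian(x):
    """J[i, j] = d softmax_i / d x_j = p_i * (delta_ij - p_j)."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

x = np.array([1.0, 2.0, 3.0, 6.0])   # first sample from the cell above
p = softmax(x)
J = softmax_jacobian(x)
```

The diagonal of `J` equals `p * (1 - p)`, and every column sums to zero because the outputs always sum to 1.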

#### ReLu Activation Function:

The Rectified Linear Unit (ReLU) activation function has been the most widely used activation function for deep learning applications with state-of-the-art results. It usually achieves better performance and generalization in deep learning compared to the sigmoid activation function.

$$ f(x) = \max(x, 0) $$

<p align="center">
  <img  src=assets/relu.png/  height="600px" width="600px">
</p>

**Nature** :- Non-linear, which means we can stack multiple layers of neurons activated by the ReLU function and still backpropagate the errors through them.

**Uses** :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only some of the neurons are activated, which makes the network sparse and therefore efficient to compute.

**It mitigates the vanishing gradient problem**, because its gradient is 1 for every positive input instead of saturating. Most deep learning models use ReLU nowadays.
Its limitation is that it is normally used only within the hidden layers of a neural network model.

Another problem with ReLu is that **ReLU units can be fragile during training and can "die".** For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.
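
The "dead" state can be illustrated with a toy single neuron. In this sketch the inputs, weights and bias are hypothetical values chosen so that the pre-activation is negative for every sample, as might happen after one oversized update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 3))   # hypothetical inputs in [-1, 1]

w = np.array([0.5, -0.3, 0.2])
b = -5.0                                    # bias pushed far negative

z = X @ w + b                               # pre-activation, always <= -4 here
a = np.maximum(z, 0.0)                      # ReLU output
relu_grad = (z > 0).astype(float)           # local gradient of ReLU

# Gradient of sum(a) with respect to w and b.
grad_w = X.T @ relu_grad
grad_b = relu_grad.sum()
```

Because the neuron never activates, both gradients are exactly zero, so gradient descent cannot move `w` or `b` back into a region where the neuron fires.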

To fix the problem of dying neurons, a modification called **Leaky ReLU** was introduced. It gives negative inputs a small slope to keep the updates alive.

The leak extends the range of the ReLU function to negative values. Usually, the value of the slope alpha is 0.01 or so.

**When alpha is instead sampled at random during training, the variant is called Randomized ReLU.**

Therefore the range of the Leaky ReLU is (-infinity to infinity).

Both Leaky and Randomized ReLU functions are monotonic (a monotonic function is either entirely non-increasing or entirely non-decreasing). Their derivatives are also monotonic.

<p align="center">
  <img  src=assets/lrelu.png/  height="600px" width="600px">
</p>

In [21]:
class ReLu:
    def __call__(self, x):
        return np.where(x>=0, x, 0)
    
    def gradient(self, x):
        return np.where(x>=0, 1, 0)

In [22]:
z = np.random.uniform(-1, 1, (3,3))

relu = ReLu()
relu(z)

array([[0.        , 0.99694447, 0.        ],
       [0.        , 0.        , 0.81905501],
       [0.        , 0.18253543, 0.92855557]])

In [25]:
class Leaky_ReLu:
    def __init__(self, alpha = 0.01):
        self.alpha = alpha
        
    def __call__(self, x):
        return np.where(x>=0, x, self.alpha*x)
    
    def gradient(self, x):
        return np.where(x>=0, 1, self.alpha)

In [26]:
z = np.random.uniform(-1, 1, (3,3))

lrelu = Leaky_ReLu()
lrelu(z)

array([[ 6.89244738e-01, -2.48947306e-03,  6.64513815e-01],
       [-1.40195921e-04,  3.37395609e-01, -9.66792375e-03],
       [ 1.58964744e-01, -3.73088069e-03, -5.08719887e-03]])

#### Tanh Activation Function:

tanh is similar to the logistic sigmoid, but its output is zero-centered, which often makes it the better choice for hidden layers. The range of the tanh function is (-1, 1). tanh is also sigmoidal (S-shaped).


<p align="center">
  <img  src=assets/tanh_formula.jpg/>
</p>

##### Tanh also suffers from the Vanishing Gradient Problem
<p align="center">
  <img  src=assets/tanh.jpg/>
</p>

<p align="center">
  <img  src=assets/tanh_and_gradient.jpg/>
</p>


In [27]:
class TanH():
    def __call__(self, x):
        return 2 / (1 + np.exp(-2*x)) - 1

    def gradient(self, x):
        return 1 - np.power(self.__call__(x), 2)

In [28]:
z = np.random.uniform(-1, 1, (3,3))

tanh = TanH()
tanh(z)

array([[-0.48660719, -0.70912964,  0.18195864],
       [-0.36787464, -0.7005745 ,  0.70007243],
       [ 0.1619226 ,  0.69155896, -0.16911645]])
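
The expression used in `TanH.__call__` is the identity $\tanh(x) = 2\sigma(2x) - 1$. The following sketch (function names are illustrative) checks it against `np.tanh` and shows how the gradient $1 - \tanh^2(x)$ shrinks for large $|x|$, which is the saturation behind the vanishing gradient problem.

```python
import numpy as np

def tanh_via_sigmoid(x):
    # tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - tanh_via_sigmoid(x) ** 2

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
matches_numpy = np.allclose(tanh_via_sigmoid(x), np.tanh(x))
grads = tanh_grad(x)
```

The gradient is 1 at $x = 0$ and falls below $10^{-3}$ at $|x| = 5$.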

### Other Metrics

#### Crossentropy Loss:

<p align="center">
  <img  src=assets/CE_loss.png/ height="600px" width="600px">
</p>


**NOTE** :
- Also see [Kullback-Leibler Divergence](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained), which explains how the formula for CE originated. Very often in Probability and Statistics we'll replace observed data or a complex distribution with a simpler, approximating distribution. KL Divergence helps us to measure just how much information we lose when we choose an approximation.
- [Cross-Entropy explained qualitatively](https://stats.stackexchange.com/questions/80967/qualitively-what-is-cross-entropy)
- [Shannon's entropy](https://stats.stackexchange.com/questions/87182/what-is-the-role-of-the-logarithm-in-shannons-entropy/87194#87194)
- [Bayesian Surprise and Maximum Entropy Distribution](https://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198)
- [Intuition on the Kullback-Leibler (KL) Divergence](https://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758)

##### Multi-Class Classification:

One-of-many classification. Each sample can belong to ONE of $C$ classes. The CNN will have $C$ output neurons that can be gathered in a vector $s$ (Scores). The target (ground truth) vector $t$ will be a one-hot vector with a positive class and $C - 1$ negative classes.   
This task is treated as a single classification problem of samples in one of $C$ classes.

##### Multi-Label Classification:

Each sample can belong to more than one class. The CNN will also have $C$ output neurons. The target vector $t$ can have more than one positive class, so it will be a vector of 0s and 1s with $C$ dimensionality.   
This task is treated as $C$ different and independent binary classification problems ($C' = 2$, $t' = 0$ or $t' = 1$), where each output neuron decides if a sample belongs to a class or not.

<p align="center">
  <img  src=assets/multiclass_multilabel.png height="600px" width="600px">
</p>

The Cross-Entropy Loss can be defined as:

$$ CE = -\sum_{i}^{C}t_{i} \log (s_{i}) $$

Where $t_i$ and $s_i$ are the groundtruth and the CNN score for each class _i_ in $C$. As **usually an activation function (Sigmoid / Softmax) is applied to the scores before the CE Loss computation**, we write $f(s_i)$ to refer to the activations.   
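
As a minimal NumPy sketch of this definition, assuming the scores have already been turned into probabilities by an activation function: the function name `cross_entropy`, the clipping constant `eps`, and the example vectors are illustrative assumptions.

```python
import numpy as np

def cross_entropy(t, s, eps=1e-12):
    """CE = -sum_i t_i * log(s_i) for one sample."""
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    # Clip so that log(0) never occurs; eps is an illustrative choice.
    return float(-np.sum(t * np.log(np.clip(s, eps, 1.0))))

t = [0.0, 1.0, 0.0]   # one-hot ground truth
s = [0.2, 0.7, 0.1]   # activated scores (probabilities)
loss = cross_entropy(t, s)
```

With a one-hot target only the positive class contributes, so the loss here is $-\log(0.7)$.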

In a **binary classification problem**, where $C' = 2$, the Cross Entropy Loss can be defined also as:

$$CE = -\sum_{i=1}^{C'=2}t_{i} \log (s_{i}) = -t_{1} \log(s_{1}) - (1 - t_{1}) \log(1 - s_{1})$$

where it’s assumed that there are two classes: $C_1$ and $C_2$. $t_1 \in \{0, 1\}$ and $s_1$ are the groundtruth and the score for $C_1$, and $t_2 =  1 - t_1$ and $s_2 =  1 - s_1$ are the groundtruth and the score for $C_2$. That is the case when we split a Multi-Label classification problem into $C$ binary classification problems. See the Binary Cross-Entropy Loss section below for more details.


**Logistic Loss** and **Multinomial Logistic Loss** are other names for **Cross-Entropy loss**.

### Categorical Cross-Entropy Loss:

<p align="center">
  <img  src=assets/softmax_CE_pipeline.png height="600px" width="600px">
</p>

Also called **Softmax Loss**. It is a **Softmax activation** plus a **Cross-Entropy loss**. If we use this loss, we will train a CNN to output a probability over the $C$ classes for each image. It is used for multi-class classification.

In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class $C_p$ keeps its term in the loss. There is only one element of the Target vector $t$ which is not zero $t_i = t_p$. So discarding the elements of the summation which are zero due to target labels, we can write:

$$ CE = -\log\left ( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}}\right ) $$

where $s_p$ is the CNN score for the positive class.
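
This expression can be evaluated directly from the raw scores. The sketch below is one possible NumPy version that shifts the scores by their maximum for numerical stability; the function name and the `positive_index` parameter are illustrative assumptions.

```python
import numpy as np

def softmax_cross_entropy(scores, positive_index):
    """-log(e^{s_p} / sum_j e^{s_j}) computed from raw scores."""
    scores = np.asarray(scores, dtype=float)
    # Shifting by the maximum does not change the result but avoids overflow.
    shifted = scores - np.max(scores)
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return float(log_sum_exp - shifted[positive_index])

scores = [1.0, 2.0, 3.0, 6.0]
loss_correct = softmax_cross_entropy(scores, positive_index=3)  # largest score is the positive class
loss_wrong = softmax_cross_entropy(scores, positive_index=0)    # smallest score is the positive class
```

The loss is small when the positive class has the largest score and grows as the positive score falls behind the others; here the two losses differ by exactly the score difference $s_4 - s_1 = 5$.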

### Binary Cross-Entropy Loss:

Also called **Sigmoid Cross-Entropy loss**. It is a **Sigmoid activation** plus a **Cross-Entropy loss**. Unlike **Softmax loss** it is independent for each vector component (class), meaning that the loss computed for every CNN output vector component is not affected by other component values. That’s why it is used for **multi-label classification**, where the insight of an element belonging to a certain class should not influence the decision for another class.

<p align="center">
  <img  src=assets/sigmoid_CE_pipeline.png height="600px" width="600px">
</p>


It’s called **Binary Cross-Entropy Loss** because it sets up a binary classification problem between $C' = 2$ classes for every class in $C$, as explained above. So when using this Loss, the formulation of **Cross Entropy Loss** for binary problems is often used:

$$CE = -\sum_{i=1}^{C'=2}t_{i} \log (f(s_{i})) = -t_{1} \log(f(s_{1})) - (1 - t_{1}) \log(1 - f(s_{1}))$$

This would be the pipeline for each one of the $C$ classes. We set up $C$ independent binary classification problems ($C' = 2$). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. $s_1$ and $t_1$ are the score and the groundtruth label for the class $C_1$, which is also the class $C_i$ in $C$. $s_2 = 1 - s_1$ and $t_2 = 1 - t_1$ are the score and the groundtruth label of the class $C_2$, which is not a “class” in our original problem with $C$ classes, but a class we create to set up the binary problem with $C_1 = C_i$. We can understand it as a background class.

The loss can be expressed as:

$$CE = \begin{cases} -\log(f(s_{1})) & \text{if } t_{1} = 1 \\ -\log(1 - f(s_{1})) & \text{if } t_{1} = 0 \end{cases}$$

where $t_1 = 1$ means that the class $C_1 = C_i$ is positive for this sample.  
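
As a rough sketch of the multi-label setup described in this section, the following NumPy code applies a sigmoid to each of the $C$ scores and sums the per-class binary losses. The function names, the clipping constant `eps`, and the example values are illustrative assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def binary_cross_entropy(scores, targets, eps=1e-12):
    """Per-class sigmoid cross-entropy and its sum over the C classes."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Clip probabilities so that log(0) never occurs; eps is illustrative.
    p = np.clip(sigmoid(scores), eps, 1.0 - eps)
    per_class = -(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p))
    return per_class, float(np.sum(per_class))

scores = np.array([2.0, -1.0, 0.0])    # raw scores for C = 3 classes
targets = np.array([1.0, 0.0, 1.0])    # multi-label target: classes 1 and 3 are positive
per_class, total = binary_cross_entropy(scores, targets)

# Changing the score of one class leaves the other per-class losses untouched.
per_class_changed, _ = binary_cross_entropy(np.array([2.0, 3.0, 0.0]), targets)
```

Each entry of `per_class` depends only on its own score and target, which is the independence property that distinguishes this loss from the Softmax loss.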