In [4]:
import tensorflow as tf
import numpy as np
from tensorflow.keras import backend as K

### Binary Classification with sigmoid activation function

Suppose we are training a model for a binary classification problem with a sigmoid activation function.

Given a training example with input $x^{(i)}$, the model will output a float between 0 and 1. Based on whether this float is less than or greater than our "threshold" (which by default is set at 0.5), we round the float to get the predicted classification $y_{pred}$ from the model.

The accuracy metric compares the value of $y_{pred}^{(i)}$ on each training example with the true label $y_{true}^{(i)}$ (either 0 or 1) from our training data.

Let $$\delta(y_{pred}^{(i)},y_{true}^{(i)}) = \begin{cases} 1 & y_{pred}^{(i)}=y_{true}^{(i)}\\
0 & y_{pred}^{(i)}\neq y_{true}^{(i)} \end{cases}$$

The accuracy metric computes the mean of $\delta(y_{pred}^{(i)},y_{true}^{(i)})$ over all $N$ training examples.

$$ accuracy = \frac{1}{N} \sum_{i=1}^N \delta(y_{pred}^{(i)},y_{true}^{(i)}) $$

This is implemented in the backend of Keras as follows.


In [7]:
y_true = tf.constant([0.0, 1.0, 1.0])
y_pred = tf.constant([0.4, 0.7, 0.3])

accuracy = K.mean(K.equal(y_true, K.round(y_pred)))
accuracy

<tf.Tensor: shape=(), dtype=float32, numpy=0.6666667>
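The thresholding step described above can also be sketched in plain NumPy. This is only an illustrative sketch, not the Keras implementation: the helper name `binary_accuracy` and its `threshold` parameter are assumptions introduced here, and a probability counts as class 1 only when it is strictly greater than the threshold.

```python
import numpy as np

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Hypothetical helper: fraction of examples whose thresholded
    probability matches the 0/1 label."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Turn probabilities into hard 0/1 predictions.
    y_hat = (y_prob > threshold).astype(float)
    # Mean of the indicator delta(y_pred, y_true).
    return float(np.mean(y_hat == y_true))

y_true = [0.0, 1.0, 1.0]
y_prob = [0.4, 0.7, 0.3]

acc_default = binary_accuracy(y_true, y_prob)                  # threshold 0.5
acc_shifted = binary_accuracy(y_true, y_prob, threshold=0.35)  # lower threshold
print(acc_default, acc_shifted)
```

With the default threshold the predictions are `[0, 1, 0]`, so two of three labels match. Lowering the threshold to 0.35 flips the first prediction to 1, and only one of three matches, which shows that accuracy depends on the chosen threshold.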

### Categorical Classification

Now suppose we are training a model for a classification problem which should sort data into $m>2$ different classes using a softmax activation function in the last layer.

Given a training example with input $x^{(i)}$, the model will output a tensor of probabilities $p_1, p_2, \dots p_m$, giving the likelihood (according to the model) that $x^{(i)}$ falls into each class.

The accuracy metric finds the index of the largest entry of the $y_{pred}^{(i)}$ tensor and compares it with the index of the largest entry of the one-hot vector $y_{true}^{(i)}$; $\delta(y_{pred}^{(i)},y_{true}^{(i)})$ is 1 if the two indices agree and 0 otherwise. The accuracy is then computed in the same way as in the binary classification case.

$$ accuracy = \frac{1}{N} \sum_{i=1}^N \delta(y_{pred}^{(i)},y_{true}^{(i)}) $$

In the backend of Keras, the accuracy metric is implemented slightly differently depending on whether we have a binary classification problem ($m=2$) or a categorical classification problem. Note that for binary classification the accuracy is the same whether the output comes from a sigmoid or from a two-class softmax activation function.


In [9]:
# Binary classification with softmax

y_true = tf.constant([[0.0,1.0],[1.0,0.0],[1.0,0.0],[0.0,1.0]])
y_pred = tf.constant([[0.4,0.6], [0.3,0.7], [0.05,0.95],[0.33,0.67]])
accuracy = K.mean(K.equal(y_true, K.round(y_pred)))
accuracy

# Categorical classification with m>2

y_true = tf.constant([[0.0,1.0,0.0,0.0],[1.0,0.0,0.0,0.0],[0.0,0.0,1.0,0.0]])
y_pred = tf.constant([[0.4,0.6,0.0,0.0], [0.3,0.2,0.1,0.4], [0.05,0.35,0.5,0.1]])
accuracy = K.mean(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)))
accuracy

<tf.Tensor: shape=(), dtype=float32, numpy=0.6666667>
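As a cross-check, the argmax comparison can be written in NumPy and applied to the two-class data as well. This is a sketch with a hypothetical helper name (`categorical_accuracy`). It assumes that each row of probabilities sums to 1 and that no probability is exactly 0.5, in which case rounding and argmax give the same accuracy for two classes.

```python
import numpy as np

def categorical_accuracy(y_true, y_prob):
    """Hypothetical helper: compare the index of the largest entry per row."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_prob, axis=-1)))

# Two-class softmax output (same numbers as the binary example above).
y_true_bin = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y_prob_bin = [[0.4, 0.6], [0.3, 0.7], [0.05, 0.95], [0.33, 0.67]]
acc_argmax = categorical_accuracy(y_true_bin, y_prob_bin)
acc_round = float(np.mean(np.asarray(y_true_bin) == np.round(y_prob_bin)))

# Four-class example (same numbers as the categorical example above).
y_true_cat = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
y_prob_cat = [[0.4, 0.6, 0.0, 0.0], [0.3, 0.2, 0.1, 0.4], [0.05, 0.35, 0.5, 0.1]]
acc_cat = categorical_accuracy(y_true_cat, y_prob_cat)

print(acc_argmax, acc_round, acc_cat)
```

For the two-class data both formulas give 0.5: a correctly classified row contributes two matching entries to the rounded comparison, and a misclassified row contributes two mismatches.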

### Implementing Activation Functions with Numpy:

#### Sigmoid Activation Function:

By its mathematical definition, the sigmoid function takes any real number as input and returns a value strictly between 0 and 1, approaching 0 for large negative inputs and 1 for large positive inputs. The sigmoid function produces an “S”-shaped curve. Mathematically, sigmoid is represented as:

$$ f(x) = \frac{1}{1 + e^{-x}} $$


<p align="center">
  <img src="assets/sigmoid.png"/>
</p>

##### Properties of the Sigmoid Function:

 - The sigmoid function takes any real number as input and returns a real-valued output strictly between 0 and 1.
 - Its first derivative, $f'(x) = f(x)(1 - f(x))$, is positive everywhere, so the function is monotonically increasing; the derivative reaches its maximum of $0.25$ at $x = 0$.
 - It appears in the output layers of deep learning architectures, where it is used to predict probability-based outputs, and it has been applied successfully to binary classification, logistic regression, and other neural network tasks.
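The derivative identity $f'(x) = f(x)(1 - f(x))$ can be checked numerically against a central finite difference. The sketch below uses standalone helper functions (`sigmoid`, `sigmoid_grad`) whose names, along with the step size `h`, are choices made here for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Closed form of the derivative: f(x) * (1 - f(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6.0, 6.0, 121)
h = 1e-5  # finite-difference step (an arbitrary small value)

# Central difference approximation of f'(x).
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

max_abs_err = float(np.max(np.abs(numeric - sigmoid_grad(x))))
peak = float(np.max(sigmoid_grad(x)))
print(max_abs_err, peak)
```

The closed form and the finite difference agree to within rounding error, the derivative stays positive on the whole grid, and its largest value is 0.25 at $x = 0$.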

In [5]:
class Sigmoid():
    
    def __call__(self, x):
        # Parenthesize the denominator: 1 / (1 + e^{-x})
        return 1 / (1 + np.exp(-x))
    
    def gradient(self, x):
        # sigma'(x) = sigma(x) * (1 - sigma(x))
        s = self.__call__(x)
        return s * (1 - s)

In [8]:
x = np.arange(-10, 10, 0.2)

y = Sigmoid()
y(x)

The result is an array of values that increase monotonically from about 4.54e-05 at x = -10 to about 0.9999 at x = 9.8, passing through 0.5 at x = 0.


#### Softmax Activation Function:

The Softmax function is used for prediction in multi-class models: it turns a vector of raw scores into a probability for each class, with the probabilities summing to 1. The class with the highest probability is taken as the predicted class for the given input. Mathematically, softmax is represented as:

$$ \mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} $$

##### What does the Softmax function look like?

Assume that you have values $x_1, x_2, \ldots, x_k$. The Softmax function for these values (more commonly known today as LogSumExp) would be:

$$\ln{\sum_{i=1}^k e^{x_i}}$$

##### What is the Softmax function doing?

*It is approximating the max function*. Can you see why? Let us call the largest $x_i$ value $x_{max}$. Because we take exponentials, $e^{x_{max}}$ is much larger than every other $e^{x_i}$ whenever $x_{max}$ clearly exceeds the remaining values, so it dominates the sum:

$$\ln{\sum_{i=1}^k e^{x_i}} \approx \ln e^{x_{max}} = x_{max}$$
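A small numerical sketch makes the approximation concrete. The helper `log_sum_exp` is written here for illustration; it subtracts the maximum before exponentiating, which does not change the result but avoids overflow. The result always lies between $x_{max}$ and $x_{max} + \ln k$.

```python
import numpy as np

def log_sum_exp(x):
    """ln(sum_i exp(x_i)), computed in a numerically stable way."""
    x = np.asarray(x, dtype=float)
    m = np.max(x)
    return float(m + np.log(np.sum(np.exp(x - m))))

# Values that are close together: the approximation is rough.
close = log_sum_exp([1.0, 2.0, 3.0])

# Values that are far apart: the largest term dominates.
spread = log_sum_exp([10.0, 20.0, 30.0])

print(close, spread)
```

For `[1, 2, 3]` the result is about 3.41, noticeably above the maximum of 3. For `[10, 20, 30]` it differs from 30 by less than 1e-4, so the approximation improves as the largest value pulls away from the rest.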

Look at the graph below for a comparison between $\max(0, x)$ (red) and $softmax(0, x) = \ln(1 + e^{x})$ (blue).

<p align="center">
  <img src="assets/softmax.png"/>
</p>

##### Why is it called Softmax?

* It is an approximation of Max.
* It is a *soft/smooth* approximation of max. Notice how it approximates the sharp corner at 0 using a smooth curve.
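The smoothing of the corner at 0 can be seen numerically. The sketch below compares $\max(0, x)$ with $\ln(e^0 + e^x)$, computed with NumPy's `np.logaddexp`; the grid of sample points is an arbitrary choice.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 11)   # -5, -4, ..., 5

hard = np.maximum(0.0, x)        # max(0, x): sharp corner at x = 0
soft = np.logaddexp(0.0, x)      # ln(e^0 + e^x): smooth everywhere

gap = soft - hard                # how far the smooth curve sits above max
print(np.round(gap, 4))
```

The smooth curve always lies slightly above the max. The gap is largest at the corner, where it equals $\ln 2 \approx 0.693$, and shrinks below 0.01 by $x = \pm 5$.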

##### What is the purpose of Softmax?

Softmax gives us a [differentiable](https://en.wikipedia.org/wiki/Differentiable_function) approximation of the [non-differentiable](https://math.stackexchange.com/questions/1329252/is-max0-x-a-differentiable-function) max function. Why is that important? **Gradient-based optimization, which is how most machine learning models are trained, needs the functions describing the model to be differentiable. So if we want to optimize a model that uses the max function, we can do that by replacing max with softmax.**


##### But what about the name of the Softmax Activation Function?

Here are my guesses on why the Softmax Activation function has the word “Softmax” in it:

* Softmax Activation function looks very similar to the Softmax function. Notice the denominator: $f(x_i)=\frac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}$

* Softmax Activation function highlights the largest input and suppresses all the significantly [smaller ones](https://en.wikipedia.org/wiki/Softmax_function#Example) in certain conditions. In this way, it behaves similarly to the softmax function.
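There is also a precise link between the two: the gradient of $\ln\sum_j e^{x_j}$ with respect to $x_i$ is exactly $e^{x_i} / \sum_j e^{x_j}$, that is, the Softmax Activation function. The sketch below checks this with central finite differences and also shows the "highlighting" effect when the inputs are scaled up. The helper names, the step size `h`, and the scale factor 10 are illustrative choices.

```python
import numpy as np

def log_sum_exp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([0.3, 0.7, 0.9])
h = 1e-6
basis = np.eye(len(x))  # rows are unit vectors e_i

# Central finite-difference gradient of log_sum_exp at x.
numeric_grad = np.array([
    (log_sum_exp(x + h * basis[i]) - log_sum_exp(x - h * basis[i])) / (2 * h)
    for i in range(len(x))
])
grad_err = float(np.max(np.abs(numeric_grad - softmax(x))))

# Scaling the inputs widens their differences and sharpens the output.
sharp = softmax(10 * x)
print(grad_err, np.round(sharp, 3))
```

The finite-difference gradient matches the softmax output to within rounding error. After scaling by 10, the largest input receives most of the probability mass (roughly 0.88), while the smallest is pushed close to 0.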



[REFERENCES](https://medium.com/data-science-bootcamp/softmax-function-beyond-the-basics-51f09ce11154)

In [9]:
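class Softmax():
    
    def __call__(self, x):
        # Normalize over the last axis (the class axis) so that each row sums to 1.
        # Subtracting the row maximum leaves the result unchanged and avoids overflow.
        e = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return e / np.sum(e, axis=-1, keepdims=True)
    
    def gradient(self, x):
        # Diagonal of the softmax Jacobian only: d p_i / d x_i = p_i * (1 - p_i)
        p = self.__call__(x)
        return p * (1-p)

In [22]:
x = np.array([[0.3, 0.7, 0.9]])

y = Softmax()
y(x)

array([[0.23180647, 0.34581461, 0.42237892]])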
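The `gradient` method above returns `p * (1 - p)`, which covers only the diagonal entries $\partial p_i / \partial x_i$. Because every output depends on every input, the full derivative is a matrix with entries $\partial p_i / \partial x_j = p_i(\delta_{ij} - p_j)$. The following is a minimal sketch for a single 1-D input vector; the function name `softmax_jacobian` is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

def softmax_jacobian(x):
    """Full Jacobian J[i, j] = d p_i / d x_j = p_i * (delta_ij - p_j)."""
    p = softmax(x)
    return np.diag(p) - np.outer(p, p)

x = np.array([0.3, 0.7, 0.9])
p = softmax(x)
J = softmax_jacobian(x)

print(np.round(J, 4))
```

The diagonal of `J` equals `p * (1 - p)`, the off-diagonal entries are `-p_i * p_j`, and each column sums to zero because the outputs always sum to 1.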