# Neural Networks - Activation Functions

© 2018 Daniel Voigt Godoy

## 1. Definition

The role of an ***activation function*** is to ***introduce a non-linearity***. If all activations were ***linear***, the whole network could be replaced by a ***single affine transformation*** (a linear transformation followed by a translation).
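
To see this concretely, here is a minimal NumPy sketch. The weights, biases and input points are arbitrary random values chosen only for illustration (they are not part of the lesson's data). It checks that two stacked layers with linear activations produce the same outputs as one affine transformation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with linear (identity) activations; all values are arbitrary
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(2, 1)), rng.normal(size=1)

X = rng.normal(size=(5, 2))  # five made-up 2D points

# Layer-by-layer forward pass
two_layers = (X @ W1 + b1) @ W2 + b2

# Equivalent single affine transformation: X @ W + b
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = X @ W + b

print(np.allclose(two_layers, one_layer))  # True
```

Replacing the identity with any non-linear function between the two layers breaks this equivalence, which is exactly what an activation function is for.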

### 1.1 Motivation

As you have seen in previous lessons, most algorithms perform ***classification*** of data points by ***separating them linearly*** (with some hyperplane).

What do you do if the points are ***not*** linearly separable, like this?

![](https://cdn-images-1.medium.com/max/800/1*3jGx6YoSYXsrSfgbxO2yzw.png)

Support Vector Machines make use of the ***kernel trick*** to overcome this difficulty. In a way, they ***modify the feature space*** using a kernel.

The way ***neural networks*** modify the feature space is by ***twisting and turning it*** using an ***activation function***, until it looks like this:

![](https://cdn-images-1.medium.com/max/800/1*ZNPrD0PmXz7I6rhnbPuh1A.png)

Success! Now the colored points are ***separated by a line***!

But if you look at the ***decision boundaries in the original feature space***, they look like this:

![](https://cdn-images-1.medium.com/max/800/1*BoWjJPEuXqtUJXFuOnu_gQ.png)

Can you guess ***which*** activation function yielded each one of these boundaries?

### 1.2 Functions

Sigmoid | Tanh | ReLU
:---:|:---:|:---:
![sigmoid](https://cdn-images-1.medium.com/max/800/1*tXzS5GwC3BBqi7ppwcQWdw.png) | ![tanh](https://cdn-images-1.medium.com/max/800/1*SeCBB7lfA7KPJ-T1Mi7GRg.png) | ![](https://cdn-images-1.medium.com/max/800/1*piSRCnAIA2paTMd8kVDfKg.png)

For a thorough explanation of the most common activation functions (sigmoid, tanh and ReLU), check my blog [post](https://towardsdatascience.com/hyper-parameters-in-action-a524bf5bf1c).
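
As a quick reference, the three functions can be sketched in NumPy as follows. The function names and the sample inputs are illustrative choices, not part of the lesson's code:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes any real value into the interval (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Keeps positive values unchanged and sets negative values to zero."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])  # sample inputs
print(sigmoid(z))
print(tanh(z))
print(relu(z))
```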

## 2. Experiment

Time to try it yourself!

You are feeding the two lines as inputs to a very simple neural network, with only ***two hidden units*** ($h_1$ and $h_2 \ $) and a single output unit ($h_3 \ $) to perform the classification. It looks like this:

![](https://cdn-images-1.medium.com/max/800/1*Frni4L9WiHQCNvVFCIW2ZA.png)

The controls below allow you to change the network's parameters and the activation function:
- the first set of weights ($w_{11} \ $, $w_{12} \ $, $w_{21} \ $ and $w_{22} \ $) defines a ***transformation matrix*** to apply to the ***feature space***
- the biases ($b_{11} \ $ and $b_{12} \ $) perform the ***translation*** of the feature space by modifying its origin
- the ***grid*** and the ***colored lines*** are subject to both the transformation and the translation, resulting in z-values which are then fed to the ***activation function*** $\sigma \ $
- the ***activation values*** are then multiplied by the weights ($w_{13} \ $ and $w_{23} \ $) and have the bias ($b_{13} \ $) added - wherever the result of this operation ***equals zero***, it defines a ***linear boundary*** which is used to ***separate*** the classes
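
The steps above can be sketched in NumPy as a single forward pass. This is only an illustration under stated assumptions: the function name `forward`, the sample weights and the two sample points are made up, and the matrix layout (rows index inputs, columns index hidden units, as in a Keras `Dense` kernel) is assumed rather than taken from the widget's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, w_out, b_out, activation=sigmoid):
    z = X @ W1 + b1           # transformation followed by translation
    a = activation(z)         # activation applied element-wise
    return a @ w_out + b_out  # zero marks the linear boundary

# Made-up parameter values, for illustration only
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])     # first set of weights (assumed layout)
b1 = np.array([0.0, 0.0])       # hidden-layer biases
w_out = np.array([1.0, -1.0])   # last unit's weights
b_out = 0.0                     # last unit's bias

X = np.array([[2.0, -2.0],
              [-2.0, 2.0]])     # two made-up points

scores = forward(X, W1, b1, w_out, b_out)
print(scores > 0)  # which side of the boundary each point falls on
```

Points with a positive score fall on one side of the boundary and points with a negative score on the other, so separating the classes amounts to finding parameters that give each class its own sign.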

Use the controls to play with different configurations and answer the ***questions*** below.

In [2]:
from intuitiveml.supervised.classification.Activations import *

In [3]:
X, y = load_data()
myact = plotActivations(X, y)
vb = VBox(build_figure(myact), layout={'align_items': 'center'})

In [4]:
vb

#### Questions

1. Keep ***activation linear***:
    - change w21 to 1.0 - what kind of transformation is that?
    - now slide w11 slowly to -1.0 - what happened to the grid?
    - change b11 to 1.0 and observe the scale on the x axis - what is this operation called?
    - change w12 to 1.0
        - what happened to the grid? Why? 
        - did anything else change? Why?
    - change b23 to -1.0 - what happened?
    - can you separate the colored points with these operations?
    
    
2. Change ***activation to sigmoid***:
    - what is the first thing you noticed? (hint: look at the scales)
    - set the weights to the ***basis vectors*** (recall from the lesson about eigenvectors and eigenvalues) and set the biases to ***zero*** - what does the grid look like?
    - change w11 and w22 to 3.0 (we are ***scaling*** the grid) - what is the ***effect*** of the sigmoid activation?
    - set w21 to 3.0 - what happened to the grid? Does the shape look familiar?
    - from this set of weights, which kind of ***transformation*** does it represent?
        - change ***activation to linear*** and look at the ***shape of the grid*** to confirm
        - then set it back to sigmoid
    - set b12 to 3.0 - what effect did the bias have on the grid?
    
    
3. Change ***activation to tanh***:
    - what is the first thing you noticed?
    - change the value of w12 to both extremes and observe what happened to the grid space
    
    
4. Change ***activation to ReLU***:
    - what is the first thing you noticed?
    - slowly change b11 from 0 to -3 - what happened to the grid?
  
Before the next question, set:
- the activation to ***linear***
- the ***transformation matrix*** and ***biases*** to: $$
\begin{bmatrix}
   -3 & 3 \\
   -3 & -3
 \end{bmatrix}
 \
\begin{bmatrix}
   1 \\
   -1
 \end{bmatrix} 
$$
   

5. Starting from the configuration above:
    - what transformation is that?
    - change ***activation to sigmoid*** - what happened to the grid and, especially, the colored points?
    - change the last unit's weights and bias and try ***separating the points*** - did you manage to do it?
    - change ***activation to tanh*** - what happened to the grid and, especially, the colored points?
    - change the last unit's weights and bias and try ***separating the points*** - did you manage to do it? How does it compare to the ***sigmoid***?

Before the next question, set:
- the activation to ***linear***
- the ***transformation matrix*** and ***biases*** to: $$
\begin{bmatrix}
   -3 & 0.4 \\
   -3 & 3
 \end{bmatrix}
 \
\begin{bmatrix}
   0.1 \\
   1.8
 \end{bmatrix} 
$$


6. Starting from the configuration above:
    - change ***activation to ReLU*** - what happened to the grid and the data points to the left?
    - change the last unit's weights and bias and try ***separating the points*** - did you manage to do it?

## 3. Keras

This is a Keras implementation of the simple network shown in the diagram above.

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import glorot_normal, normal

activation = 'sigmoid'

# Uses Glorot initializer for hidden layer with a typical seed: 42
glorot_initializer = glorot_normal(seed=42)
# Uses Normal initializer for output layer with the same seed
normal_initializer = normal(seed=42)

# Uses Stochastic Gradient Descent with a learning rate of 0.05
sgd = SGD(lr=0.05)

# Uses Keras' Sequential API
model = Sequential()

model.add(Dense(input_dim=2, # Input layer contains 2 units
                units=2,     # Hidden layer contains 2 units
                kernel_initializer=glorot_initializer, 
                activation=activation))

# Output layer with sigmoid activation for binary classification
model.add(Dense(units=1, 
                kernel_initializer=normal_initializer,
                activation='sigmoid'))

# Compiles model using binary crossentropy as loss
model.compile(loss='binary_crossentropy', 
              optimizer=sgd, 
              metrics=['acc'])

# Fits the model using a mini-batch size of 16 during 150 epochs
model.fit(X, y, epochs=150, batch_size=16)

print(model.get_weights())
```

In [5]:
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import glorot_normal, normal

activation = 'tanh'

# Uses Glorot initializer for hidden layer with a typical seed: 42
glorot_initializer = glorot_normal(seed=42)
# Uses Normal initializer for output layer with the same seed
normal_initializer = normal(seed=42)

# Uses Stochastic Gradient Descent with a learning rate of 0.05
sgd = SGD(lr=0.05)

# Uses Keras' Sequential API
model = Sequential()

model.add(Dense(input_dim=2, # Input layer contains 2 units
                units=2,     # Hidden layer contains 2 units
                kernel_initializer=glorot_initializer, 
                activation=activation))

# Output layer with sigmoid activation for binary classification
model.add(Dense(units=1, 
                kernel_initializer=normal_initializer,
                activation='sigmoid'))

# Compiles model using binary crossentropy as loss
model.compile(loss='binary_crossentropy', 
              optimizer=sgd, 
              metrics=['acc'])

# Fits the model using a mini-batch size of 16 during 150 epochs
model.fit(X, y, epochs=150, batch_size=16)


Using TensorFlow backend.


Epoch 1/150
...
Epoch 150/150


<keras.callbacks.History at 0x7f9c6a00bdd8>

In [6]:
print(model.get_weights())

[array([[-4.7761188, -4.3531065],
       [-4.5376463,  4.010479 ]], dtype=float32), array([-1.7944971,  1.6984158], dtype=float32), array([[ 6.3906374],
       [-6.5252028]], dtype=float32), array([5.300107], dtype=float32)]
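
The four arrays are, in order, the hidden layer's kernel and biases and the output layer's kernel and biases. As a rough check, the network's forward pass can be reproduced by hand with NumPy, assuming the standard behavior of a Keras `Dense` layer (`activation(inputs @ kernel + bias)`). The helper name `predict_proba` and the input point are hypothetical and used only for illustration:

```python
import numpy as np

# Weights copied from the printed output of model.get_weights()
W1 = np.array([[-4.7761188, -4.3531065],
               [-4.5376463,  4.010479]])   # hidden layer kernel
b1 = np.array([-1.7944971, 1.6984158])     # hidden layer biases
W2 = np.array([[6.3906374],
               [-6.5252028]])              # output layer kernel
b2 = np.array([5.300107])                  # output layer bias

def predict_proba(X):
    """Forward pass: tanh hidden layer, then sigmoid output unit."""
    hidden = np.tanh(X @ W1 + b1)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

X_new = np.array([[0.0, 0.0]])  # hypothetical input point
print(predict_proba(X_new))     # probability of the positive class
```

For inputs taken from `X`, the result should agree with `model.predict` up to floating-point precision, since the printed weights are rounded `float32` values.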


## 4. More Resources

[Hyper-parameters in Action! Part I — Activation Functions](https://towardsdatascience.com/hyper-parameters-in-action-a524bf5bf1c)

[Hyper-parameters in Action! Introducing DeepReplay](https://towardsdatascience.com/hyper-parameters-in-action-introducing-deepreplay-31132a7b9631)

[Hyper-parameters in Action! Part II — Weight Initializers](https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404)

[Neural Networks Series](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)

[A Visual and Interactive Guide to the Basics of Neural Networks](http://jalammar.github.io/visual-interactive-guide-basics-neural-networks/)

[A Visual And Interactive Look at Basic Neural Network Math](https://jalammar.github.io/feedforward-neural-networks-visual-interactive/)

[A visual proof that neural nets can compute any function](http://neuralnetworksanddeeplearning.com/chap4.html)

#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).
