<a href="https://colab.research.google.com/github/deltorobarba/machinelearning/blob/master/activation_function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activation Functions

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False

TensorFlow 2.x selected.


## Summary

https://blog.exxactcorp.com/activation-functions-and-optimizers-for-deep-learning-models/

**Summary**
* Use ReLU. Be careful with your learning rates
* Try out Leaky ReLU / Maxout / ELU
* Try out tanh but don’t expect much
* Don’t use sigmoid


**Activation Functions - Key Points**
* Reside within neurons
* Transform input values into acceptable and useful range
* Allow pass-through of values which are useful in subsequent layers of neurons
* Default hidden layer activation function is ReLU
* Sigmoid only for binary classification output layer

https://www.kdnuggets.com/2017/09/neural-network-foundations-explained-activation-function.html


![Optimizer](https://raw.githubusercontent.com/deltorobarba/repo/master/optimizer_3.png)

Source: [Stanford.edu](https://stanford.edu/~shervine/teaching/cs-229/cheatsheet-deep-learning)

![Optimizer](https://raw.githubusercontent.com/deltorobarba/repo/master/optimizer_1.png)

# Activation Functions

Comparison of Activation Functions:
[Wikipedia](https://en.wikipedia.org/wiki/Activation_function#Comparison_of_activation_functions)

## Sigmoid

Squashes numbers to range [0,1]. Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron. But 3 problems:
* Saturated neurons “kill” the gradients (look at x = -10, 0 and 10: the slope is nearly zero at both extremes)
* Sigmoid outputs are not zero-centered. Consider what happens when the input to a neuron (x) is always positive. What can we say about the gradients on w? They are always all positive or all negative, so the updates run zig-zag. (This is also why you want zero-mean data!)
* exp() is a bit compute expensive

Further notes:
* Used in logistic regression for binary classification problems, where the output lies between 0 and 1.
* The sigmoid activation can be derived from the mean field solution of a Boltzmann Machine. It can also be derived as the maximum likelihood solution for logistic regression in statistics.
* Old way, slow learner! Sigmoid disturbs the zero mean: because the outputs lie between 0 and 1, the expected value is no longer at zero, so learning takes longer.
* Problem: the logistic sigmoid can cause a neural network to get “stuck” during training. This is due in part to the fact that if a strongly negative input is provided to the logistic sigmoid, it outputs values very near zero. Since neural networks use the feed-forward activations to calculate parameter gradients, this can result in model parameters that are updated less regularly than we would like, and are thus “stuck” in their current state.
* Sigmoid works well for a classifier: approximating a classifier function as a combination of sigmoids may be easier than with ReLU, for example, which can lead to faster training and convergence.
* The main reason why we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as an output. Since a probability only exists in the range of 0 to 1, sigmoid is the right choice.
* The function is differentiable. That means we can find the slope of the sigmoid curve at any point.
* The function is monotonic but the function’s derivative is not.
* The softmax function is a more generalized logistic activation function which is used for multiclass classification.
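
The saturation problem described above can be checked numerically. The following is a minimal NumPy sketch, not part of the original notebook; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, 0.0, 10.0])
print(sigmoid(xs))       # outputs stay strictly between 0 and 1
print(sigmoid_grad(xs))  # largest at x = 0 (0.25), close to 0 at both extremes
```

At x = -10 and x = 10 the gradient is several orders of magnitude smaller than at x = 0, which is what "killing" the gradient means in practice.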



## tanh

LeCun et al., 1991
* Squashes numbers to range [-1,1]
* zero centered (nice)
* The advantage is that the negative inputs will be mapped strongly negative and the zero inputs will be mapped near zero in the tanh graph.
* The function is differentiable.
* The function is monotonic while its derivative is not monotonic.
* The tanh function is mainly used for classification between two classes.
* still kills gradients when saturated
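
As a rough illustration of the zero-centering and saturation points, the sketch below compares tanh with the sigmoid on inputs that are symmetric around zero. It also uses the identity tanh(x) = 2·sigmoid(2x) − 1; the variable names are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 101)  # inputs symmetric around zero

tanh_out = np.tanh(xs)
sigmoid_out = sigmoid(xs)

print(tanh_out.mean())     # approximately 0: zero-centered
print(sigmoid_out.mean())  # approximately 0.5: not zero-centered

# tanh is a rescaled and shifted sigmoid
rescaled = 2.0 * sigmoid(2.0 * xs) - 1.0

# The gradient 1 - tanh(x)^2 is close to zero for large |x| (saturation)
tanh_grad = 1.0 - np.tanh(xs) ** 2
print(tanh_grad[0], tanh_grad[50], tanh_grad[-1])
```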




## ReLU

* Krizhevsky et al., 2012
* Rectified linear units: faster and more efficient, since fewer neurons are activated (less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations).
* No vanishing gradient for positive inputs, as ReLU’s gradient there is constant (= 1). Sparsity: the output is 0 for negative values of x. When W*x < 0, ReLU gives 0, which means sparsity and less calculation load. This may be the least important benefit.
* However, because ReLU is unbounded for positive inputs, it may amplify the signal inside the network more than softmax and sigmoid.
* But: dying ReLU problem for values zero and smaller: the gradient is zero there, so a neuron that only receives such inputs may never be reactivated. Solution: leaky ReLU, noisy ReLU (in RBMs) and ELU (exponential linear units)
* ReLU as the activation function for hidden layers and sigmoid for the output layer (these are standards, I didn’t experiment much with changing these). Also, I used the standard categorical cross-entropy loss.

Plus Side

* Does not saturate (in +region)
* Very computationally efficient
* Converges much faster than sigmoid/tanh in practice (e.g. 6x)
* Actually more biologically plausible than sigmoid

Bad Side

* Not zero-centered output
* An annoyance: what is the gradient when x < 0? What happens when x = -10, 0 or 10?

People like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
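
To make the gradient question and the dying ReLU problem concrete, here is a small NumPy sketch. The convention of a zero gradient at exactly x = 0 and the extreme bias value are assumptions made for the illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Assumed convention: the (sub)gradient at exactly x == 0 is taken as 0
    return (x > 0).astype(float)

xs = np.array([-10.0, 0.0, 10.0])
print(relu(xs))       # values 0, 0, 10
print(relu_grad(xs))  # values 0, 0, 1

# A "dead" neuron: with a strongly negative bias (hypothetical value),
# the pre-activation is negative for every input, so no gradient flows.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 3))
w = np.array([0.5, -0.3, 0.2])
b = -100.0
pre_activation = inputs @ w + b
dead_fraction = np.mean(relu_grad(pre_activation) == 0)
print(dead_fraction)  # fraction of inputs with zero gradient
```

Since the gradient with respect to `w` and `b` is zero for every input, gradient descent cannot move such a neuron back into the active region.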



## Leaky ReLU

* Maas et al., 2013 and He et al., 2015
* Does not saturate
* Computationally efficient
* Converges much faster than sigmoid/tanh in practice (e.g. 6x)
* Will not “die”: the small slope for x < 0 keeps the gradient nonzero
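
A minimal sketch of Leaky ReLU and its gradient follows. The slope `alpha=0.01` is an assumed value for the negative side, not something fixed by these notes.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is the (assumed) small slope used for negative inputs
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # The gradient is never exactly zero, so the unit cannot "die"
    return np.where(x > 0, 1.0, alpha)

xs = np.array([-10.0, 0.0, 10.0])
print(leaky_relu(xs))       # values -0.1, 0, 10
print(leaky_relu_grad(xs))  # values 0.01, 0.01, 1
```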


## ELU

* Exponential Linear Units
* Clevert et al., 2015
* All benefits of ReLU
* Closer to zero mean outputs
* Negative saturation regime compared with Leaky ReLU adds some robustness to noise 
* But computation requires exp()
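
The negative saturation regime can be seen in a short NumPy sketch of ELU. The default `alpha=1.0` is an assumption for this example.

```python
import numpy as np

def elu(x, alpha=1.0):
    # Clip the argument of expm1 to <= 0 so large positive inputs
    # do not overflow in the branch that is discarded anyway.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

xs = np.array([-10.0, -1.0, 0.0, 10.0])
print(elu(xs))  # negative inputs saturate towards -alpha, positive inputs pass through
```

Unlike Leaky ReLU, whose negative side keeps decreasing linearly, ELU levels off at `-alpha` for strongly negative inputs.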


## SELU

* scaled exponential linear units
* Instead of normalizing the output of the activation function, the suggested activation function (SELU, scaled exponential linear units) outputs normalized values. https://towardsdatascience.com/selu-make-fnns-great-again-snn-8d61526802a9
* Background: batch normalization for feedforward networks. Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. (https://arxiv.org/abs/1502.03167)
* Negative values sometimes: Scaling the function is the mechanism by which the authors accomplish the goal (of self-normalizing properties). As a byproduct, the units sometimes output negative values, but there's no hidden meaning in it. It just makes the math work out.
* **SELU vs RELU**: https://www.hardikp.com/2017/07/24/SELU-vs-RELU/
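
The self-normalizing idea can be sketched with NumPy by pushing standardized inputs through a stack of random SELU layers. The layer width, depth, seed and the zero-mean, variance 1/fan_in weight initialization are assumptions of this illustration; the two constants are the ones published with SELU.

```python
import numpy as np

# Constants from the SELU paper (Klambauer et al., 2017)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * np.where(
        x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)
width = 256                          # assumed layer width
x = rng.normal(size=(2000, width))   # standardized inputs

for _ in range(20):                  # assumed depth
    # Weights with zero mean and variance 1 / fan_in
    w = rng.normal(scale=np.sqrt(1.0 / width), size=(width, width))
    x = selu(x @ w)

print(x.mean(), x.std())  # expected to stay roughly near 0 and 1
```

No normalization layer is applied between the layers; the mean and standard deviation are expected to stay close to 0 and 1 because of the activation function itself.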

## Softmax

* is an activation function that is not a function of a single input x from the previous layer or layers: each output depends on all inputs to the layer.
* usually used in the last layer
* Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression)
* is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive). We use the (standard) Logistic Regression model in binary classification tasks. In softmax regression (SMR), we replace the sigmoid logistic function by the so-called softmax function φ.
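
A minimal NumPy sketch of the softmax function is given below. Subtracting the maximum logit is a common numerical-stability trick and an addition of this example; the last lines check that the two-class case reduces to the logistic sigmoid.

```python
import numpy as np

def softmax(logits):
    # Shift by the max so exp() cannot overflow; the result is unchanged
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # class probabilities that sum to 1

# Two classes with logits [z, 0]: the first probability equals sigmoid(z)
z = 1.5
two_class = softmax(np.array([z, 0.0]))
print(two_class[0], sigmoid(z))
```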

## Maxout

* is an activation function that is not a function of a single input x from the previous layer or layers: it takes the maximum over several linear functions of the input.
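
As a rough sketch, a maxout unit computes k affine functions of the input and keeps the largest one. The function name `maxout` and the array shapes are chosen for this illustration. The example also shows that ReLU is the special case with the two pieces x and 0.

```python
import numpy as np

def maxout(x, weights, biases):
    """x: (batch, d_in); weights: (k, d_in, d_out); biases: (k, d_out)."""
    # Evaluate all k affine pieces: result has shape (batch, k, d_out)
    pieces = np.einsum('bi,kio->bko', x, weights) + biases
    # Keep the largest piece for every output unit
    return pieces.max(axis=1)

x = np.array([[-2.0], [0.0], [3.0]])

# Two pieces for a single unit: 1 * x and 0 * x, i.e. max(x, 0) = ReLU
weights = np.array([[[1.0]], [[0.0]]])
biases = np.zeros((2, 1))

print(maxout(x, weights, biases))  # values 0, 0, 3
```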

# LSTM Model

## Import Data

This tutorial uses a <a href="https://www.bgc-jena.mpg.de/wetter/" class="external">weather time series dataset</a> recorded by the <a href="https://www.bgc-jena.mpg.de" class="external">Max Planck Institute for Biogeochemistry</a>.

In [0]:
# "https://www.bgc-jena.mpg.de/wetter/" (weather time series dataset)
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)

# Read file
df = pd.read_csv(csv_path)

# Select Univariate Data
uni_data = df['T (degC)']
uni_data.index = df['Date Time']

# Define Window Size
def univariate_data(dataset, start_index, end_index, history_size, target_size):
  data = []
  labels = []

  start_index = start_index + history_size
  if end_index is None:
    end_index = len(dataset) - target_size

  for i in range(start_index, end_index):
    indices = range(i-history_size, i)
    # Reshape data from (history_size,) to (history_size, 1)
    data.append(np.reshape(dataset[indices], (history_size, 1)))
    labels.append(dataset[i+target_size])
  return np.array(data), np.array(labels)

# Train/validation split: the first 300,000 rows of the data will be the training dataset,
# the remaining rows will be the validation dataset.
# This amounts to ~2100 days worth of training data.
TRAIN_SPLIT = 300000
tf.random.set_seed(13)
uni_data = uni_data.values

# Compute mean and standard deviation of training data
uni_train_mean = uni_data[:TRAIN_SPLIT].mean()
uni_train_std = uni_data[:TRAIN_SPLIT].std()

# Standardize the data
uni_data = (uni_data-uni_train_mean)/uni_train_std

# Create Data Pipeline (the model will be given the last 20 recorded temperature observations, and needs to learn to predict the temperature at the next time step.)
univariate_past_history = 20
univariate_future_target = 0

x_train_uni, y_train_uni = univariate_data(uni_data, 0, TRAIN_SPLIT,
                                           univariate_past_history,
                                           univariate_future_target)
x_val_uni, y_val_uni = univariate_data(uni_data, TRAIN_SPLIT, None,
                                       univariate_past_history,
                                       univariate_future_target)
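
To see what the windowing step produces, the sketch below applies the same logic to a toy series. The helper name `make_windows` and the toy data are illustrative and not part of the notebook.

```python
import numpy as np

def make_windows(series, start_index, end_index, history_size, target_size):
    """Same windowing logic as univariate_data, on a 1-D NumPy array."""
    data, labels = [], []
    start_index = start_index + history_size
    if end_index is None:
        end_index = len(series) - target_size
    for i in range(start_index, end_index):
        # The last history_size observations before position i
        data.append(series[i - history_size:i].reshape(history_size, 1))
        # The value target_size steps after the end of the window
        labels.append(series[i + target_size])
    return np.array(data), np.array(labels)

toy = np.arange(10, dtype=float)  # 0, 1, ..., 9
x, y = make_windows(toy, 0, None, history_size=3, target_size=0)

print(x.shape, y.shape)   # (7, 3, 1) (7,)
print(x[0].ravel(), y[0])  # window 0, 1, 2 with label 3
```

With `target_size=0` the label is the observation immediately following the window, which matches the "predict the next time step" setup used above.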

## Choose Activation Function

Activation functions can be defined as layers:
* https://www.tensorflow.org/api_docs/python/tf/keras/layers/ReLU

Activation functions can also be passed through a layer's `activation` argument:
* https://www.tensorflow.org/api_docs/python/tf/keras/activations/elu
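
The two options can be sketched side by side as follows. The layer sizes and the dummy input are assumptions chosen only to keep the example small.

```python
import numpy as np
import tensorflow as tf

# Option 1: pass the activation by name through the layer's argument
model_arg = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation='elu'),
    tf.keras.layers.Dense(1),
])

# Option 2: keep the Dense layer linear and add the activation as a layer
model_layer = tf.keras.Sequential([
    tf.keras.layers.Dense(4),
    tf.keras.layers.ReLU(),
    tf.keras.layers.Dense(1),
])

batch = np.zeros((2, 3), dtype='float32')  # dummy input: 2 samples, 3 features
print(model_arg(batch).shape)    # (2, 1)
print(model_layer(batch).shape)  # (2, 1)

# Activation functions can also be called directly on tensors
elu_values = tf.keras.activations.elu(tf.constant([-1.0, 0.0, 1.0]))
print(elu_values.numpy())
```

Using a separate activation layer is convenient when something else, such as a normalization layer, should sit between the linear transformation and the nonlinearity.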

In [0]:
activation='tanh' # tanh is default in LSTM

In [0]:
recurrent_activation='sigmoid' # sigmoid is default in LSTM

## Define Model

In [0]:
# tf.data to shuffle, batch, and cache the dataset. 
BATCH_SIZE = 256
BUFFER_SIZE = 10000

train_univariate = tf.data.Dataset.from_tensor_slices((x_train_uni, y_train_uni))
train_univariate = train_univariate.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE).repeat()

val_univariate = tf.data.Dataset.from_tensor_slices((x_val_uni, y_val_uni))
val_univariate = val_univariate.batch(BATCH_SIZE).repeat()

# LSTM requires the input shape of the data it is being given.

simple_lstm_model = tf.keras.models.Sequential([
    tf.keras.layers.LSTM(8, input_shape=x_train_uni.shape[-2:], activation=activation, recurrent_activation=recurrent_activation),
    tf.keras.layers.Dense(1)
])

simple_lstm_model.compile(optimizer='adam', loss='mae')

## Train & Evaluate Loss

In [6]:
EVALUATION_INTERVAL = 200
EPOCHS = 10

simple_lstm_model.fit(train_univariate, epochs=EPOCHS,
                      steps_per_epoch=EVALUATION_INTERVAL,
                      validation_data=val_univariate, validation_steps=50)

Train for 200 steps, validate for 50 steps
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fec45bf7240>

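In this experiment, using 'relu' as the recurrent activation function gave a much lower evaluation loss than 'sigmoid', but it took longer to compute. On a GPU, one likely reason for the slowdown is that Keras only uses the fast cuDNN LSTM kernel with the default 'tanh' activation and 'sigmoid' recurrent activation.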