## TODO:

* possibly add other activation functions (non-differentiable)
* simulated annealing? ~~nelder mead?~~
* ~~right now, only globalbestpso used~~
* ~~add button to run optimization only after clicking~~
* ~~Cost history / number of iterations needed compared to classic MLP~~ comparison between 

# Neural Networks

Neural networks are a way of parametrizing non-linear functions. At a very basic level, they are a composition of non-linear functions arranged in a layered architecture. The mapping from the input layer to the output layer is performed via hidden layers. Each layer $k$ produces an output $z_k$ that is a non-linear function of a weighted combination of the outputs of the previous layer, $z_k = g_k(W_k z_{k-1})$.
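
As a minimal sketch of this composition, assuming NumPy, made-up layer sizes and `tanh` as every $g_k$ (the name `compose_layers` is hypothetical, and biases are omitted as in the formula above):

```python
import numpy as np

def compose_layers(x, weights, g=np.tanh):
    """Apply z_k = g(W_k z_{k-1}) for each weight matrix in turn."""
    z = x
    for W in weights:
        z = g(W @ z)
    return z

rng = np.random.default_rng(0)
# Illustrative sizes only: 2 inputs -> 4 hidden units -> 3 outputs
weights = [rng.normal(size=(4, 2)), rng.normal(size=(3, 4))]
x = np.array([0.5, -1.0])
z_out = compose_layers(x, weights)
print(z_out.shape)
```

The network implemented later in this notebook follows the same pattern, with an added bias per layer and a softmax on the last layer.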

Once the architecture and the activation functions $g_k(\cdot)$ are defined, the weights $W_k$ are trained. If all the functions $g_k$ are (sub)-differentiable then, via the chain rule, gradients exist and can be computed. Usually, the weights are then trained via different variants of gradient descent.

In this notebook we take a different approach: instead of fitting the network with a gradient-based method, we want to apply the particle swarm optimization (PSO), simulated annealing and Nelder-Mead (NM) algorithms. We will use sample datasets for binary classification that are supplied by scikit-learn.

In [1]:
import numpy as np 
import matplotlib as mpl 
import matplotlib.pyplot as plt 

from sklearn import cluster, datasets, mixture
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.model_selection import train_test_split

import ipywidgets
from ipywidgets import interact, interactive, interact_manual, fixed
from IPython.display import clear_output
from timeit import default_timer as timer
from scipy.optimize import minimize as scipy_minimize

from utilities import plot_helpers


%matplotlib inline
%load_ext autoreload
%autoreload 2
from matplotlib import rcParams
rcParams['figure.figsize'] = (10, 5)  # Change this if figures look ugly. 
rcParams['font.size'] = 16

import warnings
warnings.filterwarnings("ignore")

import pyswarms as ps

def get_dimensions(hidden_layer_sizes, n_features, n_classes):
    dimensions = 0
    in_i = n_features
    for l in hidden_layer_sizes:
        dimensions += in_i * l + l
        in_i = l
    dimensions += in_i * n_classes + n_classes
    return dimensions

## Architecture

Since the PySwarms optimizer needs an objective function with a specific signature, we implement a basic neural network that works with different activation functions and a variable number of hidden layers. This implementation is based on NumPy.
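
The swarm wrappers defined later in this notebook receive an array of shape `(n_particles, dimensions)` and return one cost per particle. A minimal sketch of that contract, using a hypothetical stand-in cost (`sphere`) instead of the network loss:

```python
import numpy as np

def sphere(p):
    """Stand-in cost for one parameter vector (not the network loss)."""
    return np.sum(p ** 2)

def swarm_objective(x):
    """Evaluate every particle (one row of x) and return one cost per particle."""
    return np.array([sphere(x[i]) for i in range(x.shape[0])])

positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 2.0, 2.0]])
costs = swarm_objective(positions)
print(costs)  # [0. 9.]
```

In the notebook, the role of `sphere` is played by the loss of the network whose weights and biases are stored in the flat vector `p`.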

### Activation functions

In [2]:
# Sigmoid aka. Logistic function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
# Relu function
def relu(x):
    return np.maximum(0, x)

# tanh
def tanh(x):
    return np.tanh(x)

def inv(x):
    return np.sqrt(np.abs(x))

activations = {
    "logistic": sigmoid,
    "relu": relu,
    "identity": lambda x: x,
    "tanh": tanh,
    "inv": inv
}

In [3]:
N_SAMPLES = 200
N_FEATURES = 2
N_CLASSES = 2

## Forward Prop, Loss and Prediction

In [4]:
def forward_prop(p, activation, n_inputs, n_classes, hidden_layer_sizes, X):
    """ Roll the flat parameter vector back into weights and biases and run a forward pass
    Inputs
    ------
    p: np.ndarray
        unrolled version of the weights and biases
        
    activation: function :: np.ndarray -> np.ndarray

    n_inputs: int
        number of input features

    n_classes: int
        number of output classes
  
    hidden_layer_sizes: tuple (int)
  
    X: np.ndarray

    Returns
    -------
    numpy.ndarray of softmax probabilities for the output layer,
    one row per sample and one column per class

    """
    start = 0
    in_i = X
    for i in range(0, len(hidden_layer_sizes)):
        layer_size = hidden_layer_sizes[i]
        no_weights = n_inputs * layer_size
        # get weights and biases for this layer
        W_i = p[start:(start+no_weights)].reshape((n_inputs, layer_size))
        b_i = p[(start+no_weights):(start+no_weights+layer_size)].reshape((layer_size,))
        
        out_i = activation(in_i.dot(W_i) + b_i)
        
        # update variables
        start = start + no_weights + layer_size
        n_inputs = layer_size
        in_i = out_i
    
    # Compute last layer
    no_weights = n_inputs * n_classes
    W_k = p[start:(start+no_weights)].reshape((n_inputs, n_classes))
    b_k = p[(start+no_weights):(start+no_weights+n_classes)].reshape((n_classes,))
    
    # Pre-activation
    out = in_i.dot(W_k) + b_k
    
    # Probabilities using softmax
    # make every value 0 or below, as exp(0) won't overflow
    out_scaled = out - out.max(axis=-1, keepdims=True)
    exp_scores = np.exp(out_scaled)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
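
The shift by the row maximum in the softmax above does not change the result, but it avoids overflow in `np.exp`. A small sketch of the difference (the function names `softmax_naive` and `softmax_stable` are only for illustration):

```python
import numpy as np

def softmax_naive(scores):
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def softmax_stable(scores):
    # Subtract the row maximum so the largest exponent is exp(0) = 1
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])

# exp(1000) overflows to inf in float64, so the naive version yields inf / inf = nan
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)

# The stable version gives the same probabilities as for the small scores,
# because softmax only depends on differences between scores
stable_large = softmax_stable(large)
```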

In [5]:
# Getting the loss of our current network, using negative log likelihood
def loss(params, activation, n_inputs, n_classes, hidden_layer_sizes, reg, X, Y):
    probs = forward_prop(params, activation, n_inputs, n_classes, hidden_layer_sizes, X)
    correct_logprobs = -np.log(probs[range(X.shape[0]), Y])
    loss = np.sum(correct_logprobs) / X.shape[0]
    return loss + reg * np.linalg.norm(params)
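
To make the indexing in `loss` concrete, here is a tiny worked example with made-up probabilities and labels (the numbers are illustrative only): for every sample we pick the probability assigned to its true class and average the negative logarithms.

```python
import numpy as np

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
labels = np.array([0, 1, 1])

# Probability assigned to the true class of each sample: 0.9, 0.8, 0.5
picked = probs[np.arange(probs.shape[0]), labels]
nll = -np.log(picked).mean()

# The regularization term used above is reg times the (non-squared) L2 norm
params = np.array([3.0, 4.0])
penalty = 1e-3 * np.linalg.norm(params)
```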

In [6]:
# Prediction using the probabilities of our forward propagation
def predict(params, activation, n_inputs, n_classes, hidden_layer_sizes, X):
    probs = forward_prop(params, activation, n_inputs, n_classes, hidden_layer_sizes, X)
    y_pred = np.argmax(probs, axis=1)
    return y_pred

## Classification Demo

Neural network training has a lot of hyperparameters. Architecture, learning rate, batch size, optimization algorithm, random seed are just a few of them. Additionally, we have the hyperparameters for the swarm optimization. These generally are:
* $c_1$, the cognitive parameter (attraction towards individual best)
* $c_2$, the social parameter (attraction towards global/neighborhood best)
* $w$, the inertia or momentum

Also, the topology can be specified when using the local optimization method (using neighborhood best).
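
How these three parameters interact can be sketched with the textbook PSO update. This is an illustration under our own naming (`pso_step` is hypothetical), not the internal code of PySwarms:

```python
import numpy as np

def pso_step(x, v, pbest, gbest, c1, c2, w, rng):
    """One textbook PSO update for all particles at once."""
    r1 = rng.uniform(size=x.shape)
    r2 = rng.uniform(size=x.shape)
    cognitive = c1 * r1 * (pbest - x)   # pull towards each particle's own best
    social = c2 * r2 * (gbest - x)      # pull towards the best of the swarm
    v_new = w * v + cognitive + social  # inertia keeps part of the old velocity
    return x + v_new, v_new

rng = np.random.default_rng(0)
x = rng.uniform(size=(5, 3))   # 5 particles in 3 dimensions
v = np.zeros_like(x)
pbest = x.copy()               # personal bests start at the initial positions
gbest = x[0]                   # pretend particle 0 is currently the best
x_new, v_new = pso_step(x, v, pbest, gbest, c1=0.5, c2=0.3, w=0.9, rng=rng)
```

For the local variant, `gbest` would be replaced by the best position within each particle's neighborhood.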

#### Data creation

In [7]:
N_SAMPLES = 200

In [8]:
def data_creation(dataset, noise):
    if dataset == 'blobs':
        X, Y = datasets.make_blobs(n_samples=N_SAMPLES, centers=2, random_state=3, cluster_std=10*noise)
    elif dataset == 'circles':
        X, Y = datasets.make_circles(n_samples=N_SAMPLES, factor=.5, noise=noise, random_state=42)
    elif dataset == 'moons':
        X, Y = datasets.make_moons(n_samples=N_SAMPLES, noise=noise, random_state=42)
    elif dataset == 'xor':
        np.random.seed(42)
        step = int(N_SAMPLES/4)
        
        X = np.zeros((N_SAMPLES, 2))
        Y = np.zeros(N_SAMPLES)
        
        X[0*step:1*step, :] = noise * np.random.randn(step, 2)
        Y[0*step:1*step] = 1
        X[1*step:2*step, :] = np.array([1, 1]) + noise * np.random.randn(step, 2)
        Y[1*step:2*step] = 1
        
        X[2*step:3*step, :] = np.array([0, 1]) + noise * np.random.randn(step, 2)
        Y[2*step:3*step] = 0
        X[3*step:4*step, :] = np.array([1, 0]) + noise * np.random.randn(step, 2)
        Y[3*step:4*step] = 0
    
    elif dataset == 'periodic':
        
        step = int(N_SAMPLES/4)
        
        X = np.zeros((N_SAMPLES, 2))
        Y = np.zeros(N_SAMPLES)
        
        X[0*step:1*step, :] = noise * np.random.randn(step, 2)
        Y[0*step:1*step] = 1
        X[1*step:2*step, :] = np.array([0, 2]) + noise * np.random.randn(step, 2)
        Y[1*step:2*step] = 1
        
        X[2*step:3*step, :] = np.array([0, 1]) + noise * np.random.randn(step, 2)
        Y[2*step:3*step] = 0
        X[3*step:4*step, :] = np.array([0, 3]) + noise * np.random.randn(step, 2)
        Y[3*step:4*step] = 0
    
    
    X = X[Y <= 1, :]
    Y = Y[Y <=1 ]
   # Y[Y==0] = -1
    return X, Y

#### Defining the interactive function

In [9]:
rcParams['figure.figsize'] = (10, 5)  # Change this if figures look ugly. 
rcParams['font.size'] = 16
def mlp(solver, k, dataset, hidden_layer_sizes, activation, iters, particles, c1, c2, w, reg, noise):
    # constants for this example
    N_FEATURES = 2
    N_CLASSES = 2
    np.random.seed(42)

    # timer to see time of execution
    start = timer()
    
    # to calculate the dimension, we know we have (in * out) no. of weights
    # and (out) no. of biases per layer
    dimensions = get_dimensions(hidden_layer_sizes, N_FEATURES, N_CLASSES)
    print("Dimensions for this problem:", dimensions)
    
    # get the data
    X, Y = data_creation(dataset, noise)
    
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, Y.astype(int), test_size=.2)
    
    # get function to corresponding string
    activation_function = activations[activation]
    
    # Pick optimizer and train
    if solver == 'Nelder-Mead':
        # wrap the function to use just x
        f = lambda x: loss(x, activation_function, N_FEATURES, N_CLASSES,\
                    hidden_layer_sizes, np.power(10., reg), X_train, y_train)
        
        # guess randomly between -1 and 1
        initial_guess = np.random.uniform(low=-1.0, high=1.0, size=(dimensions + 1, dimensions))
        
        result = scipy_minimize(fun=f, x0=initial_guess[0,:],\
                                method='Nelder-Mead', \
                                options={'adaptive': False,
                                         'maxiter': iters,
                                         'initial_simplex': initial_guess},\
                                callback=lambda xk: print('current loss: {}'.format(f(xk)), end="\r", flush = False)
                               ) 
        
        if not result.success:
            print("\nOptimizer exited unsuccessfully. Message:", result.message)
        print('Nelder-Mead done after', result.nit, "iterations\n",\
             "Loss value:", result.fun)
        pos = result.x
        print("Validation accuracy:", \
          (y_test == predict(pos, activation_function, N_FEATURES, N_CLASSES,\
                    hidden_layer_sizes, X_test)).mean())
        
    elif solver == 'sgd':
        classifier = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes, 
                    activation=activation, solver=solver, max_iter=iters, 
                    alpha=np.power(10., reg), random_state=1,
                    learning_rate_init=.1)
        classifier.fit(X_train, y_train)
        print("Validation accuracy: {}".format(classifier.score(X_test, y_test)))
    else:
         # wrap our function for particles
        def f(x):
            n_particles = x.shape[0]
            res = [loss(x[i], activation_function, N_FEATURES, N_CLASSES,\
                  hidden_layer_sizes, 10**reg, X_train, y_train)\
                       for i in range(n_particles)]
            return np.array(res)
        # The options for the optimizer
        options = {'c1': c1, 'c2': c2, 'w': w, 'k': k, 'p': 2}
    
        if solver == 'GlobalBestPSO':
            optimizer = ps.single.GlobalBestPSO(n_particles=particles, dimensions=dimensions,\
                                        options=options)
        
        elif solver == 'LocalBestPSO':        
            optimizer = ps.single.LocalBestPSO(n_particles=particles, dimensions=dimensions,\
                                               options=options)
    
    
        cost, pos = optimizer.optimize(f, iters=iters)
    
        print("Validation accuracy: {}".format(
              (y_test == predict(pos, activation_function, N_FEATURES, N_CLASSES,
                        hidden_layer_sizes, X_test)).mean()))
    
    end = timer()
    print("Elapsed time in seconds: {:3.5}".format(end - start))
    
    # plot the data points and the predicted class probabilities
    plt.figure()
    plt.clf()
    fig = plt.axes()
    opt = {'marker': 'r*', 'label': '+'}
    plot_helpers.plot_data(X[np.where(Y == 1)[0], 0], X[np.where(Y == 1)[0], 1], fig=fig, options=opt)
    opt = {'marker': 'bs', 'label': '-'}
    plot_helpers.plot_data(X[np.where(Y == 0)[0], 0], X[np.where(Y == 0)[0], 1], fig=fig, options=opt)

    mins = np.min(X, 0)
    maxs = np.max(X, 0)
    x_min = mins[0] - 1
    x_max = maxs[0] + 1
    y_min = mins[1] - 1
    y_max = maxs[1] + 1

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]  
    Xplot = np.c_[XX.ravel(), YY.ravel()]   
    
    # get all the predictions
    if solver == 'sgd':
        Z = classifier.predict_proba(Xplot)[:, 1]
    else:
        Z = forward_prop(pos, activation_function, N_FEATURES, N_CLASSES, hidden_layer_sizes, Xplot)[:,1]

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    # plt.figure(fignum, figsize=(4, 3))
    # Put the result into a color plot
    plt.contourf(XX, YY, Z, cmap=plt.cm.jet, alpha=.3)

### Interactive modeling

To explore the different parameters visually, we provide an interactive tool. One can pick different hyperparameters, datasets, layers etc. and see the dataset together with the decision boundaries of the trained network.

In [10]:
solver_widget = ipywidgets.Dropdown(options=['GlobalBestPSO', 'LocalBestPSO', 'Nelder-Mead', 'sgd'],\
                            value='GlobalBestPSO', description='Solver', disabled=False)

particles_widget = ipywidgets.Dropdown(options=[20,50,100,200,1000],\
                               value=20, description='Particles', disabled=False)

c1_widget = ipywidgets.FloatSlider(value=0.5, min=0, max=1, step=0.1, readout_format='.1f',
                            description='c1:', style={'description_width': 'initial'},
                            continuous_update=False)

c2_widget = ipywidgets.FloatSlider(value=0.3,min=0,max=1,step=0.1,readout_format='.1f',
                            description='c2:', style={'description_width': 'initial'},
                            continuous_update=False)

w_widget = ipywidgets.FloatSlider(value=0.9,min=0,max=1,step=0.1,readout_format='.1f',
                           description='w:', style={'description_width': 'initial'},
                           continuous_update=False)

k_widget = ipywidgets.IntSlider(value=5,min=1, max=particles_widget.value, step=1,
                            readout_format='d',description='neighborhood k:',
                            style={'description_width': 'initial'},
                            continuous_update=False, disabled=True)

def disable_pso_args(*args):
    # enable the neighborhood k only for Local PSO
    k_widget.disabled = True if solver_widget.value != 'LocalBestPSO' else False
    k_widget.max = particles_widget.value
    # enable all PSO hyperparameters only for PSO
    c1_widget.disabled = False if solver_widget.value in ['GlobalBestPSO', 'LocalBestPSO'] else True
    c2_widget.disabled = False if solver_widget.value in ['GlobalBestPSO', 'LocalBestPSO'] else True
    w_widget.disabled = False if solver_widget.value in ['GlobalBestPSO', 'LocalBestPSO'] else True
    particles_widget.disabled = False if solver_widget.value in ['GlobalBestPSO', 'LocalBestPSO'] else True
    
solver_widget.observe(disable_pso_args)
particles_widget.observe(disable_pso_args)

interact_manual(mlp, 
        solver=solver_widget,
        k=k_widget,
        dataset=['blobs', 'circles', 'moons', 'xor', 'periodic'],
        activation=['relu', 'logistic', 'identity', 'tanh'],
        hidden_layer_sizes=[(50, ), (100, ), (50, 50), (100, 100), (50, 50, 50), (100, 100, 100)],
        iters=[100,200,500,1000,2000,3000,5000,10000],
        particles=particles_widget,
        c1=c1_widget,
        c2=c2_widget,
        w=w_widget,
        reg=ipywidgets.FloatSlider(value=-3,min=-3,max=3,step=0.1,readout_format='.1f',
                    description='reg 10^:',style={'description_width': 'initial'},
                    continuous_update=False),
        noise=ipywidgets.FloatSlider(value=0.05,min=0.01,max=0.3,step=0.01,
                    readout_format='.2f',description='noise:', style={'description_width': 'initial'},
                    continuous_update=False),  
        );

We can see that PSO works quite well at fitting the classifier: judging by the visualization, the decision boundaries found by the swarm are reasonable in almost all cases. The underlying function we are trying to optimize is an extremely high-dimensional non-convex function with (in general) no single global optimum, but many local minima or possible solutions.

Nelder-Mead does not seem to work as well as PSO and does not reach equally good classification boundaries as quickly. It needs a large number of iterations, often with negligible improvement per iteration despite being nowhere near a 'nice' solution. In other words, when NM does give a good solution, it took a large number of iterations to get there. One way to interpret this is NM's dependence on a good initialization as well as its small 'learning rate', i.e. the improvement made in one iteration. There is also the risk of getting stuck in a local optimum. It does, however, seem to give results similar to SGD; it just needs more time to execute.

Note that the PSO initial positions are chosen uniformly at random between 0 and 1 for every dimension, whereas for NM we saw the best results when sampling between -1 and 1.

![NM vs SGD](nm_sgd.png)
*Circle Dataset and ReLU activation, one hidden layer*
 * Left: NM with 2000 iterations
 * Middle: NM with 5000 iterations
 * Right: SGD

Interestingly, PSO and NM can give somewhat different results compared to popular optimization methods like SGD and Adam. In fact, using the logistic activation function for a dataset that is not linearly separable, like *moons*, classic SGD fails to find a good solution, whereas PSO does, as does NM given a large number of iterations.


![PSO vs. SGD vs. NM for logistic activation](logistic_pso_sgd_nm.png)
*PSO vs. SGD vs. NM (10k iterations) for logistic activation and the moon dataset, one hidden layer*

Between GlobalBestPSO and LocalBestPSO, there are no immediately striking differences in results or performance. One exception, however, is the circle dataset, where a smaller neighborhood seems to give better decision boundaries:

![GlobalBestPSO vs LocalBestPSO](globalpso_local10_local5.png)
*Circle Dataset with ReLU activation, one hidden layer, 20 particles, 200 iterations*
 * Left: GlobalBestPSO
 * Middle: LocalBestPSO, 10 particle neighborhood
 * Right: LocalBestPSO, 5 particle neighborhood
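
What a "neighborhood best" means can be sketched as follows. We assume here that neighbors are the `k` closest particles in Euclidean distance (matching `p = 2` in the options above); `neighborhood_best` is a hypothetical helper and not the PySwarms implementation:

```python
import numpy as np

def neighborhood_best(positions, pbest_cost, k):
    """Index of the best particle among each particle's k nearest neighbours."""
    # Pairwise Euclidean distances, shape (n_particles, n_particles)
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # k nearest neighbours of every particle (a particle is its own nearest)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Within each neighbourhood pick the particle with the lowest cost
    best_in_hood = np.argmin(pbest_cost[neighbours], axis=1)
    return neighbours[np.arange(len(positions)), best_in_hood]

positions = np.array([[0.0], [1.0], [2.5], [10.0], [11.5]])
pbest_cost = np.array([3.0, 2.0, 5.0, 1.0, 4.0])

local_idx = neighborhood_best(positions, pbest_cost, k=2)
global_idx = neighborhood_best(positions, pbest_cost, k=len(positions))
print(local_idx)   # [1 1 1 3 3]
print(global_idx)  # [3 3 3 3 3]
```

With `k` equal to the swarm size every particle follows the same global best; with a small `k` the two clusters in this toy example follow different leaders, which keeps the swarm more diverse.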

### Performance
We already mentioned the slow convergence of NM and the need for many iterations.
<br/> <br/>
The biggest problem with PSO for neural network training is certainly the computational cost. With just one hidden layer of size $50$, we have $252$ weights and biases to adjust, i.e. a $252$-dimensional function, which we then evaluate $50 \cdot 200 = 10^4$ times for e.g. $50$ particles and $200$ iterations.
<br/><br/>
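
The numbers above can be checked with a few lines. The helper `count_parameters` is a hypothetical restatement of `get_dimensions` that takes all layer sizes in one list:

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network, layer by layer."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

n_params = count_parameters([2, 50, 2])  # 2 features, 50 hidden units, 2 classes
n_evals = 50 * 200                       # particles * iterations
print(n_params, n_evals)  # 252 10000
```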

Let's consider this with a small excursion to the theoretical complexity of the forward propagation:

Given one feature vector $x$, we first multiply the vector by the weight matrix $W_1$ and then add the bias $b_1$ before applying the activation function. Assuming the hidden layer has the same number $n$ of units as the input has dimensions, the matrix-vector multiplication costs $O(n^2)$. Repeating this for $k$ layers and $m$ data points, we get a cost on the order of $O(k \cdot m \cdot n^2)$. Repeating this operation for every particle in every iteration is therefore a big computational burden.
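
A rough operation count makes this tangible. The sketch below only counts the multiply-adds of the matrix products under the same simplifying assumption (all layers of width $n$); the concrete values are illustrative, and both function names are hypothetical:

```python
def forward_multiply_adds(n, k, m):
    """Multiply-add count for k layers of width n applied to m samples."""
    per_layer_per_sample = n * n   # one (n x n) matrix-vector product
    return k * m * per_layer_per_sample

def swarm_multiply_adds(n, k, m, particles, iters):
    """Total count when every particle is evaluated in every iteration."""
    return particles * iters * forward_multiply_adds(n, k, m)

# Illustrative values: width 50, 2 layers, 160 training samples
single = forward_multiply_adds(n=50, k=2, m=160)
total = swarm_multiply_adds(n=50, k=2, m=160, particles=50, iters=200)
print(single, total)  # 800000 8000000000
```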
<br/> <br/> <br/>
Let's have a comparison for one specific example:

In [11]:
def compare_nm_pso(dataset):
    np.random.seed(42)
    # hyperparameters and choices of variables
    hidden_layer_sizes = (30,30)
    activation_function = relu
    reg = -3.0

    # PSO specific
    c1 = 0.5
    c2 = 0.3
    w = 0.9
    n_particles = 20

    # iters chosen to roughly have same number of function evaluations
    iters_pso = 200
    iters_nm = int(n_particles * iters_pso / 3)

    # load data
    if dataset == 'breast cancer':    
        X, Y = datasets.load_breast_cancer(return_X_y=True)
        N_CLASSES = 2
    elif dataset == 'iris':
        X, Y = datasets.load_iris(return_X_y=True)
        N_CLASSES = 3
    else:
        X, Y = datasets.load_wine(return_X_y=True)
        N_CLASSES = 3
    X = StandardScaler().fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.2, random_state=0)

    N_FEATURES = len(X[0])

    dimensions = get_dimensions(hidden_layer_sizes, N_FEATURES, N_CLASSES)
    print("Dataset: {}, Features: {}, Samples: {}, Training Size: {}, Layers: {}".format(
        dataset, N_FEATURES, len(X), len(X_train), hidden_layer_sizes))
    print("Dimensions for this problem: {}".format(dimensions))
    print("Method PSO: {} particles, {} iterations".format(n_particles, iters_pso))
    print("Method NM: {} iterations".format(iters_nm))
    
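    evals_hist = {
        "NM": {
            "eval_count": 0,
            "fbest": np.inf,  # no loss seen yet, so the first evaluation becomes the best
            "fbest_hist": []
        },
        "PSO": {
            "eval_count": 0,
            "fbest": np.inf,  # no loss seen yet, so the first evaluation becomes the best
            "fbest_hist": []
        }
    }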

    # define a loss function where we keep track of the number of evaluations and the fbest
    def loss_tracked(method, params, activation, hidden_layer_sizes, reg, X, Y):
        res = loss(params, activation, N_FEATURES, N_CLASSES, hidden_layer_sizes, reg, X, Y)
        evals_hist[method]['eval_count'] += 1
        curr_fbest = evals_hist[method]['fbest']
        evals_hist[method]['fbest'] = np.minimum(curr_fbest, res)
        evals_hist[method]['fbest_hist'].append(evals_hist[method]['fbest'])
        return res

    # functions wrapping the arguments for both NM and PSO
    def f_nm(x):
        return loss_tracked("NM", x, activation_function, hidden_layer_sizes,\
                            np.power(10., reg), X_train, y_train)
    def f_pso(x):
        n_particles = x.shape[0]
        res = [loss_tracked("PSO", x[i], activation_function, hidden_layer_sizes, 10**reg, X_train, y_train)\
               for i in range(n_particles)]
        return np.array(res)

    # first: Nelder-Mead
    print("--------- Nelder-Mead")
    initial_guess = np.random.uniform(low=-1.0, high=1.0, size=(dimensions + 1, dimensions))
    start = timer()
    result = scipy_minimize(fun=f_nm, x0=initial_guess[0,:],\
                            method='Nelder-Mead', \
                            options={'adaptive': False,
                                     'maxiter': iters_nm,
                                     'initial_simplex': initial_guess}) 

    if not result.success:
        print("\nOptimizer exited unsuccessfully. Message:", result.message)
    print('Nelder-Mead done after {} evaluations\nf_best: {}'.format(evals_hist['NM']['eval_count'],
                                                                     evals_hist['NM']['fbest']), flush=True)
    pos_nm = result.x
    print("Validation accuracy:", \
      (y_test == predict(pos_nm, activation_function,\
                         N_FEATURES, N_CLASSES, hidden_layer_sizes, X_test)).mean(), flush=True)
    end = timer()
    print("Elapsed time in seconds: {:3.5}".format(end - start), flush=True)

    # now: PSO
    print("\n--------- PSO", flush=True)
    start = timer()
    options = {"c1": c1, "c2": c2, "w": w}
    optimizer = ps.single.GlobalBestPSO(n_particles=n_particles, dimensions=dimensions,\
                                            options=options)
    cost_pso, pos_pso = optimizer.optimize(f_pso, iters=iters_pso)
    print('PSO done after {} evaluations\nf_best: {}'.format(evals_hist['PSO']['eval_count'],
                                                                     evals_hist['PSO']['fbest']))
    print("Validation accuracy: {}".format(
          (y_test == predict(pos_pso, activation_function, N_FEATURES, N_CLASSES,
                    hidden_layer_sizes, X_test)).mean()))
    end = timer()
    print("Elapsed time in seconds: {:3.5}".format(end - start))

    # ---- plotting
    x_axis_pso = np.arange(1, 1 + evals_hist['PSO']['eval_count'], 1)
    x_axis_nm = np.arange(1, 1 + evals_hist['NM']['eval_count'], 1)

    fig, ax = plt.subplots()
    ax.plot(x_axis_pso, evals_hist['PSO']['fbest_hist'], 'b-')
    ax.plot(x_axis_nm, evals_hist['NM']['fbest_hist'], 'r-')
    ax.legend(['PSO','NM'])
    ax.set(xlabel='Number of function evaluations', ylabel='Best loss found',\
           title='NM vs. PSO for {} dataset'.format(dataset))
    ax.grid()
    plt.show()
    
interact_manual(compare_nm_pso, dataset=['iris', 'breast cancer', 'wine'])


## Non-differentiable activation functions
Some typical properties one looks for in an activation function are:
* Non-linearity (to be able to learn arbitrary functions)
* Monotonicity and "Smoothness"
* Close approximation of the identity function near the origin (helps with learning after random initialization)
* **Continuous differentiability**

The commonly used activation functions for neural networks, like $sigmoid$ and $tanh$ seen above, are differentiable in order to allow gradient-based optimization methods. A very popular activation function that is actually not differentiable (at 0) is $\mathrm{ReLU}(x)=\max(x,0)$. Despite this, it is among the most widely used activation functions because of its simplicity and its robustness to the vanishing gradient problem.

Now that we are using optimization methods like PSO, we do not need to require differentiability and can, in fact, use arbitrarily complex functions as our activations. Of course, it is still important that these functions can be *evaluated* efficiently.

In [12]:
# step function, where return value is 
# 0 for negative, 1 for positive, and constant (chosen to be 1 here) for 0

def heaviside(x):
    return np.heaviside(x, 1)
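
Why a step activation is a problem for gradient-based training, but not for methods that only compare function values, can be seen in a toy example. The single-unit model and the squared-error loss below are assumptions made for illustration only:

```python
import numpy as np

def step(x):
    return np.heaviside(x, 1)

def tiny_loss(w, x, y):
    """Mean squared error of a single step-activated unit with weight w."""
    return np.mean((step(x * w) - y) ** 2)

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Central finite difference around w = -1: the loss is flat there,
# so a gradient-based update would not move.
eps = 1e-3
grad_estimate = (tiny_loss(-1.0 + eps, x, y) - tiny_loss(-1.0 - eps, x, y)) / (2 * eps)

# Comparing function values directly still tells the two candidates apart.
loss_bad = tiny_loss(-1.0, x, y)
loss_good = tiny_loss(1.0, x, y)
print(grad_estimate, loss_bad, loss_good)  # 0.0 1.0 0.0
```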

# Notes / Food for thought

* How to prevent overfitting with PSO? Early stopping can be implemented, but there is no learning rate to lower, and dropout is difficult to apply, ...
* Nelder-Mead is not well suited to NN training, because the algorithm consists largely of conditional branches (reflect, expand, contract, shrink) and is therefore hard to parallelize
* PSO can be easily parallelized
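
One simple form of parallelism is to evaluate all particles in a single batched computation instead of a Python loop. The sketch below assumes one hidden layer with `tanh` and the same parameter layout as `forward_prop` (weights, then biases, per layer); the helper names are hypothetical, and it only computes the pre-softmax scores:

```python
import numpy as np

def scores_loop(swarm, X, n_in, n_hidden, n_out):
    """Reference version: one forward pass per particle."""
    a = n_in * n_hidden
    b = a + n_hidden
    c = b + n_hidden * n_out
    out = []
    for p in swarm:
        W1 = p[:a].reshape(n_in, n_hidden)
        W2 = p[b:c].reshape(n_hidden, n_out)
        out.append(np.tanh(X @ W1 + p[a:b]) @ W2 + p[c:])
    return np.stack(out)

def scores_batched(swarm, X, n_in, n_hidden, n_out):
    """Same computation for the whole swarm at once via einsum."""
    P = swarm.shape[0]
    a = n_in * n_hidden
    b = a + n_hidden
    c = b + n_hidden * n_out
    W1 = swarm[:, :a].reshape(P, n_in, n_hidden)
    b1 = swarm[:, a:b]
    W2 = swarm[:, b:c].reshape(P, n_hidden, n_out)
    b2 = swarm[:, c:]
    # p: particle, m: sample, i: input, h: hidden unit, o: output
    hidden = np.tanh(np.einsum("mi,pih->pmh", X, W1) + b1[:, None, :])
    return np.einsum("pmh,pho->pmo", hidden, W2) + b2[:, None, :]

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 2, 4, 3
dim = n_in * n_hidden + n_hidden + n_hidden * n_out + n_out
swarm = rng.uniform(-1, 1, size=(6, dim))   # 6 particles
X = rng.normal(size=(5, n_in))              # 5 samples

loop_scores = scores_loop(swarm, X, n_in, n_hidden, n_out)
batched_scores = scores_batched(swarm, X, n_in, n_hidden, n_out)
print(np.allclose(loop_scores, batched_scores))
```

Nelder-Mead offers no comparable shortcut, since each step depends on the outcome of the previous one.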