# Implementation of SELUs & Dropout-SELUs in NumPy
This looks pretty neat. 
The authors prove that when you slightly modify the ELU activation,
unit activations are pulled towards zero mean and unit variance (if the network is deep enough). 
If they're right, this might make batch norm obsolete, which would be a huge boon to training speeds! 

The experiments look convincing, so apparently it even beats BN+ReLU in accuracy, though I wish they had shown the resulting distributions of activations after training. 

But assuming their fixed-point proof is correct, the activations should end up close to zero mean and unit variance. 

Still, it would have been nice if they had shown it; maybe they ran out of space in their appendix ;)

Weirdly, the exact ELU modification they proposed isn't stated explicitly in the paper! 

For those wondering, it can be found in the available source code, and looks like this:

In [1]:
# An extra explanation from Reddit (the quoted code is left commented out; selu is defined below)
# Q: Thanks, I will double check the analytical solution. For the numerical one, could you
#    please explain why running the following code results in a value close to 1 rather than 0?
# du = 0.001
# u_old = np.mean(selu(np.random.normal(0,    1, 100000000)))
# u_new = np.mean(selu(np.random.normal(0+du, 1, 100000000)))
# print((u_new - u_old) / du)
# print(u_old, u_new)
# A: Now I see your problem: you do not consider the effect of the weights.
#    From one layer to the next, we have two influences:
#      (1) multiplication with weights, and
#      (2) applying the SELU.
#    (1) has a centering and symmetrising effect (draws the mean towards zero), and
#    (2) has a variance-stabilizing effect (draws the variance towards 1).
#    That is why we use the variables \mu & \omega and \nu & \tau to analyze both effects.
# Q: Oh yes, that's true, zero-mean weights completely kill the mean. Thanks!

# SELU, ported to NumPy from the TensorFlow implementation
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale * np.where(x>=0.0, x, alpha * (np.exp(x)-1))
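
As a quick sanity check of the fixed-point claim and of the Reddit exchange quoted above, here is a minimal sketch (my own, not from the paper's code). It feeds standard-normal samples through SELU and estimates the output mean and variance, plus how much a small shift of the input mean changes the output mean. The names `ALPHA`, `SCALE`, `mean_slope` and the seeded `rng` are assumptions made for this illustration.

```python
import numpy as np

ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(x):
    # np.exp is only evaluated on min(x, 0), so large positive inputs cannot overflow
    return SCALE * np.where(x >= 0.0, x, ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0))

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
z = rng.standard_normal(1_000_000)

# Fixed point: zero-mean, unit-variance input should give roughly the same statistics back
y = selu(z)
fixed_point_mean = y.mean()
fixed_point_var = y.var()

# Sensitivity of the output mean to a small shift of the input mean.
# The same samples z are reused for both terms so that the sampling noise cancels.
du = 1e-3
mean_slope = (selu(z + du).mean() - y.mean()) / du

print(fixed_point_mean, fixed_point_var, mean_slope)
```

The slope should come out close to 1, which matches the question in the thread: SELU alone passes a shift of the mean through almost unchanged, so the centering has to come from the zero-mean weights.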

In [2]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.176949977598 0.221005443215 0.835309367275 1.17453616899
-0.189307534126 0.186617918943 0.788913119832 1.18362789521
-0.196552755095 0.188090127733 0.770470014929 1.1616392021
-0.215528683477 0.186350067193 0.820428935118 1.20518722095
-0.230688823454 0.206173791012 0.799844892938 1.19280368729
-0.163523722388 0.183048120893 0.776775004742 1.24549998477
-0.144053215751 0.211896999119 0.797482683642 1.20425914826
-0.268366141978 0.332837237159 0.780291030843 1.23653341958
-0.225042227486 0.211523443746 0.757500838493 1.18271662009
-0.228838917957 0.252977585647 0.767139081137 1.20645786101
-0.176143296647 0.196899486613 0.809853497338 1.18721244432
-0.207192909889 0.187316043074 0.818630409606 1.2251001154
-0.135618258806 0.129711697865 0.76846749851 1.17385787684
-0.165881752824 0.245504872115 0.74691869659 1.18283066112
-0.260200908429 0.294824287792 0.805082083259 1.1886161453
-0.173702066776 0.205132958897 0.819200743815 1.17986138705
-0.255709430006 0.240450176469 0.835864530401

In [3]:
# My NumPy implementation of standard (inverted) dropout, as used with ReLU
def dropout_forward(X, p_dropout):
    # NOTE: despite its name, p_dropout is the probability of KEEPING a unit
    u = np.random.binomial(1, p_dropout, size=X.shape) / p_dropout
    out = X * u
    cache = u
    return out, cache

def dropout_backward(dout, cache):
    dX = dout * cache
    return dX
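
To see what this kind of dropout does to the activation statistics, here is a minimal sketch (an illustration of mine, with the hypothetical helper `inverted_dropout` and a seeded `rng`). It applies inverted dropout to zero-mean, unit-variance samples: the mean is preserved, but the variance is inflated by a factor of about `1 / keep_prob`, which is the kind of drift a self-normalizing network has to fight against.

```python
import numpy as np

def inverted_dropout(x, keep_prob, rng):
    # Keep each unit with probability keep_prob and rescale the survivors by 1 / keep_prob
    mask = rng.binomial(1, keep_prob, size=x.shape) / keep_prob
    return x * mask

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
x = rng.standard_normal(1_000_000)
keep_prob = 0.8

out = inverted_dropout(x, keep_prob, rng)
mean_after = out.mean()
var_after = out.var()

# For zero-mean, unit-variance input the variance after dropout is about 1 / keep_prob
print(mean_after, var_after, 1.0 / keep_prob)
```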

In [4]:
# Same experiment, now with standard dropout (keep probability 0.8) applied after each SELU:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x, _ = dropout_forward(p_dropout=0.8, X=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.27865085009 0.235032226285 0.915129009868 1.29260130653
-0.258492571053 0.255327206083 1.00847446396 1.47345938655
-0.19713231233 0.266209352556 1.07729960899 1.57705321955
-0.193528130951 0.416940693557 1.08031991368 1.68861633923
-0.208472944343 0.40818777893 1.12004958261 1.90190355558
-0.164595641533 0.460386628564 1.15461609611 1.93369600081
-0.230747215363 0.449316824305 1.20421394382 1.95141843399
-0.117658759369 0.422397120352 1.20351492159 2.03517261375
-0.200750029963 0.562718572385 1.1518401281 2.05301847911
-0.22007430707 0.507823249652 1.26724171273 2.10539419892
-0.195732915719 0.445585916195 1.28040711766 2.06638456713
-0.130107941004 0.442719333733 1.24650609578 2.19915347757
-0.12597937045 0.505710491189 1.29548735155 2.1220740626
-0.136519137039 0.468136268834 1.26438235906 2.11297345854
-0.305066942372 0.542040320604 1.27697375089 2.13053647135
-0.16686018046 0.698204815239 1.231065446 2.40950914819
-0.190440300193 0.490100777547 1.26205624514 2.36696269247
-0.232

In [5]:
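# TensorFlow implementation on GitHub
# Imports the snippet relies on (TensorFlow 1.x internal modules)
import numbers
import tensorflow as tf
from tensorflow.python.framework import ops, tensor_shape, tensor_util
from tensorflow.python.layers import utils
from tensorflow.python.ops import array_ops, math_ops, random_ops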
def dropout_selu(x, rate, alpha= -1.7580993408473766, fixedPointMean=0.0, fixedPointVar=1.0, 
                 noise_shape=None, seed=None, name=None, training=False):
    """Dropout to a value with rescaling."""

    def dropout_selu_impl(x, rate, alpha, noise_shape, seed, name):
        keep_prob = 1.0 - rate
        x = ops.convert_to_tensor(x, name="x")
        if isinstance(keep_prob, numbers.Real) and not 0 < keep_prob <= 1:
            raise ValueError("keep_prob must be a scalar tensor or a float in the "
                                             "range (0, 1], got %g" % keep_prob)
        keep_prob = ops.convert_to_tensor(keep_prob, dtype=x.dtype, name="keep_prob")
        keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())

        alpha = ops.convert_to_tensor(alpha, dtype=x.dtype, name="alpha")
        keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())

        if tensor_util.constant_value(keep_prob) == 1:
            return x

        noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
        random_tensor = keep_prob
        random_tensor += random_ops.random_uniform(noise_shape, seed=seed, dtype=x.dtype)
        binary_tensor = math_ops.floor(random_tensor)
        ret = x * binary_tensor + alpha * (1-binary_tensor)

        a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))

        b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
        ret = a * ret + b
        ret.set_shape(x.get_shape())
        return ret

    with ops.name_scope(name, "dropout", [x]) as name:
        return utils.smart_cond(training,
            lambda: dropout_selu_impl(x, rate, alpha, noise_shape, seed, name),
            lambda: array_ops.identity(x))
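
The constants `a` and `b` in the code above can be read as an affine correction that restores the fixed-point mean and variance after dropping units to `alpha`. The short derivation below is my reading of the code, not a quote from the paper. It assumes the input $x$ has mean $\mu$ (`fixedPointMean`) and variance $\nu$ (`fixedPointVar`), that the keep mask $d \sim \mathrm{Bernoulli}(q)$ with $q$ = `keep_prob` is independent of $x$, and writes $\alpha'$ for the `alpha` argument.

```latex
\begin{aligned}
y &= x\,d + \alpha'\,(1-d) \\
\mathbb{E}[y] &= q\,\mu + (1-q)\,\alpha' \\
\operatorname{Var}[y] &= q\left((1-q)(\alpha'-\mu)^2 + \nu\right) \\
a &= \sqrt{\frac{\nu}{q\left((1-q)(\alpha'-\mu)^2 + \nu\right)}} \\
b &= \mu - a\left(q\,\mu + (1-q)\,\alpha'\right)
\end{aligned}
```

With these choices, $a\,y + b$ again has mean $\mu$ and variance $\nu$, which is what the last lines of `dropout_selu_impl` compute.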

In [6]:
def elu_fwd(X):
    alpha = 1.0
    scale = 1.0
    # Equivalent to: scale * np.where(X >= 0.0, X, alpha * (np.exp(X) - 1))
    X_pos = np.maximum(0.0, X)  # positive part: identity, as in ReLU
    X_neg = np.minimum(X, 0.0)  # negative part, clipped so np.exp cannot overflow
    X_neg_exp = alpha * (np.exp(X_neg) - 1)  # saturates at -alpha for very negative X
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X)  # saved for the backward pass
    return out, cache

def elu_bwd(dout, cache):
    scale, alpha, X = cache
    dout = dout * scale
    # Negative branch: d/dX [alpha * (exp(X) - 1)] = alpha * exp(X)
    dX_neg = dout.copy()
    dX_neg[X >= 0] = 0  # X == 0 belongs to the positive branch only (avoids double counting)
    X_neg = np.minimum(X, 0)  # clipped so np.exp cannot overflow
    dX_neg = dX_neg * alpha * np.exp(X_neg)
    # Positive branch: d/dX [X] = 1
    dX_pos = dout.copy()
    dX_pos[X < 0] = 0
    dX = dX_neg + dX_pos
    return dX
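
A backward pass like `elu_bwd` is easy to get subtly wrong, so a finite-difference check is worth a few lines. The sketch below is a compact stand-in rather than the functions above: `elu` and `elu_grad` are hypothetical helpers written for this check, with `alpha=1.0` and no extra scale. It compares the analytic derivative against a central difference.

```python
import numpy as np

def elu(x, alpha=1.0):
    # np.exp is only evaluated on min(x, 0), so large positive inputs cannot overflow
    return np.where(x >= 0.0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, alpha=1.0):
    # Derivative: 1 on the positive branch, alpha * exp(x) on the negative branch
    return np.where(x >= 0.0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
x = rng.standard_normal((4, 5))

eps = 1e-6
numeric = (elu(x + eps) - elu(x - eps)) / (2.0 * eps)  # central difference
analytic = elu_grad(x)
max_abs_err = np.max(np.abs(numeric - analytic))

print(max_abs_err)
```

The same check applies to the SELU variant by multiplying both functions by the scale constant and using its alpha. In that case the derivative jumps at zero, so inputs very close to zero should be excluded from the comparison.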

In [7]:
# Same experiment, now with plain ELU (alpha=1, scale=1) followed by standard dropout (keep probability 0.95):
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, _ = elu_fwd(X=x)
    x, _ = dropout_forward(p_dropout=0.95, X=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

0.0267840597416 0.34870387555 0.688048836569 1.04717887996
0.0118506955263 0.304876451832 0.56287559608 0.883068359607
-0.00848144962242 0.206902931306 0.461438275839 0.773605957189
-0.0135542733853 0.188593823674 0.382063563973 0.688186843237
-0.0561000469178 0.122238741339 0.349825172164 0.602874984924
-0.0343566384099 0.18028897461 0.31253884688 0.548510582005
-0.0429958073188 0.116054084819 0.258804537482 0.468348512243
-0.0365803448673 0.0872113156711 0.24577948731 0.438054082984
-0.0599072152755 0.107624519345 0.215066658884 0.422357677398
-0.0350411695585 0.0776162947341 0.205136251217 0.425153568286
-0.0283457218838 0.0652535851403 0.185845104457 0.385896334075
-0.0272042989458 0.0766230668132 0.17945437564 0.343158539221
-0.0265363719492 0.0529204998045 0.175905477592 0.337747682314
-0.0195131277272 0.050982043472 0.157488026364 0.320393452692
-0.0279429986826 0.0494597719809 0.148463686617 0.302291903009
-0.0163050932564 0.0395091568666 0.134573867209 0.275776177388
-0.035063

In [8]:
def selu_fwd(X):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    # Equivalent to: scale * np.where(X >= 0.0, X, alpha * (np.exp(X) - 1))
    X_pos = np.maximum(0.0, X)  # positive part: identity, as in ReLU
    X_neg = np.minimum(X, 0.0)  # negative part, clipped so np.exp cannot overflow
    X_neg_exp = alpha * (np.exp(X_neg) - 1)  # saturates at -alpha for very negative X
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X)  # saved for the backward pass
    return out, cache

def selu_bwd(dout, cache):
    scale, alpha, X = cache
    dout = dout * scale
    # Negative branch: d/dX [alpha * (exp(X) - 1)] = alpha * exp(X)
    dX_neg = dout.copy()
    dX_neg[X >= 0] = 0  # X == 0 belongs to the positive branch only (avoids double counting)
    X_neg = np.minimum(X, 0)  # clipped so np.exp cannot overflow
    dX_neg = dX_neg * alpha * np.exp(X_neg)
    # Positive branch: d/dX [X] = 1
    dX_pos = dout.copy()
    dX_pos[X < 0] = 0
    dX = dX_neg + dX_pos
    return dX

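def dropout_selu_forward(X, keep_prob):
    alpha = -1.7580993408473766  # value that dropped units are set to
    fixedPointMean = 0.0
    fixedPointVar = 1.0

    # Binary keep mask. As in the TensorFlow code, it is NOT divided by keep_prob:
    # the affine correction (a, b) below does the rescaling.
    u = np.random.binomial(1, keep_prob, size=X.shape)
    out = X * u + alpha * (1 - u)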

    #     a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))
    a = np.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * (alpha-fixedPointMean)**2 + fixedPointVar)))
    b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
    out = a * out + b
    cache = a, u
    return out, cache

def dropout_selu_backward(dout, cache):
    a, u = cache
    dout = dout * a
    dX = dout * u
    return dX
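
To check that this flavour of dropout really leaves zero mean and unit variance alone, here is a minimal self-contained sketch (my own illustration; `alpha_dropout`, `ALPHA_PRIME` and the seeded `rng` are names assumed for the example). It uses a plain 0/1 keep mask as in the TensorFlow code, with the fixed point hard-coded to mean 0 and variance 1.

```python
import numpy as np

ALPHA_PRIME = -1.7580993408473766  # value that dropped units are set to

def alpha_dropout(x, keep_prob, rng):
    # Binary keep mask, deliberately not divided by keep_prob
    mask = rng.binomial(1, keep_prob, size=x.shape)
    dropped = x * mask + ALPHA_PRIME * (1 - mask)
    # Affine correction for a fixed point with mean 0 and variance 1
    a = np.sqrt(1.0 / (keep_prob * ((1.0 - keep_prob) * ALPHA_PRIME**2 + 1.0)))
    b = -a * (1.0 - keep_prob) * ALPHA_PRIME
    return a * dropped + b

rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
x = rng.standard_normal(1_000_000)  # zero-mean, unit-variance input

out = alpha_dropout(x, keep_prob=0.8, rng=rng)
out_mean = out.mean()
out_var = out.var()

print(out_mean, out_var)
```

Both statistics should stay close to 0 and 1, in contrast to standard inverted dropout, which preserves the mean but inflates the variance.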

In [11]:
# Same experiment again, using selu_fwd (the SELU dropout line is commented out, so no dropout is applied):
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, cache = selu_fwd(x)
#     x, _ = dropout_selu_forward(keep_prob=0.95, X=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.152871155666 0.173813542308 0.820291785309 1.13981820486
-0.18122122386 0.207630199258 0.815210431198 1.18937529405
-0.297846578441 0.266203730506 0.817983597323 1.23752921224
-0.161757570033 0.235202476314 0.831543157724 1.28683767092
-0.142265800851 0.215884424058 0.878437825834 1.26805727295
-0.165170971108 0.192902736105 0.830602021809 1.22491925504
-0.214282420794 0.267225427297 0.829318420082 1.21366357317
-0.185252100148 0.173920268415 0.779902394677 1.15248294176
-0.153067144439 0.20891501543 0.826645437398 1.16220404025
-0.144313463151 0.140307245998 0.816166417019 1.20294289235
-0.162489773485 0.161401275126 0.788057464437 1.19460372201
-0.177443613252 0.144282207372 0.811307580107 1.22868265658
-0.204457829679 0.170138943013 0.804559821665 1.19601136669
-0.245177009897 0.220070455351 0.82173445298 1.17316394068
-0.180660981614 0.21015737802 0.798093122281 1.1647082751
-0.229195097623 0.234943629284 0.806055624367 1.2206629188
-0.142119011139 0.159630286891 0.791431772844 

# Discussion & wrapup
According to this, even after 100 layers, neuron activations stay fairly close to mean 0 / variance 1 
(in the output shown above, even the most extreme per-sample means and standard deviations are only off by roughly 0.2 to 0.3).

Sepp Hochreiter is amazing: LSTM, meta-learning, SNNs. 

I think he has already made a much larger contribution to science than some self-proclaimed pioneers of DL 
who spend more time on social networks than actually doing any good research.