# Introduction of SELUs
This looks pretty neat. 
They can prove that when you slightly modify the ELU activation,
your average unit activation goes towards zero mean/unit variance (if the network is deep enough). 
If they're right, this might make batch norm obsolete, which would be a huge bon to training speeds! 

The experiments look convincing, so apparently it even beats BN+ReLU in accuracy... though 

I wish they would've shown the resulting distributions of activations after training. 

But assuming their fixed point proof is true, it will. 

Still, still would've been nice if they'd shown it -- maybe they ran out of space in their appendix ;)

Weirdly, the exact ELU modification they proposed isn't stated explicitly in the paper! 

For those wondering, it can be found in the available sourcecode, and looks like this:

In [3]:
import numpy as np

def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
#     return scale*np.where(x>=0.0, x, alpha*np.exp(x)-alpha)    
#     return scale * np.maximum(0.0, alpha*np.exp(x)-alpha)
#     return scale * alpha * np.where(x>=0.0, x, np.exp(x)-1)
    return scale * np.where(x>=0.0, x, alpha * (np.exp(x)-1))

In [4]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.196616111925 0.224280364437 0.818826323087 1.20876426248
-0.191604771465 0.188587363549 0.822755874019 1.24344800241
-0.157282508598 0.174637868892 0.791649928783 1.20126673366
-0.167062400419 0.22278489766 0.773504788939 1.16375238425
-0.167852316628 0.200074674407 0.811787042106 1.16186432918
-0.187030267758 0.196644459961 0.800800121371 1.17807256893
-0.227842424975 0.206835105566 0.821447906304 1.18298208072
-0.165911049646 0.223727085267 0.837952640153 1.20291223138
-0.187175510158 0.288024185182 0.8039830216 1.20672812329
-0.163073198289 0.197713920598 0.832939870669 1.14407170008
-0.187042891809 0.133347033808 0.815552199062 1.19333569547
-0.188739606349 0.203325048064 0.755867955484 1.21199247572
-0.233773019439 0.248088338015 0.715002047001 1.23617554245
-0.16348247404 0.18039628769 0.797514923525 1.23010780634
-0.226971844705 0.29050489033 0.818160110464 1.21763362071
-0.277775262046 0.237132822377 0.831482243349 1.21348822998
-0.149755553551 0.194284701388 0.803702086571 

In [5]:
# Thanks, I will double check the analytical solution. For the numerical one, could you please explain why running the following code results in a value close to 1 rather than 0?
import numpy as np
def selu(x):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    return scale*np.where(x>=0.0, x, alpha*np.exp(x)-alpha)

du = 0.001
u_old = np.mean(selu(np.random.normal(0,    1, 100000000)))
u_new = np.mean(selu(np.random.normal(0+du, 1, 100000000)))
# print (u_new-u_old) / du
print(u_old, u_new)
# Now I see your problem: 
#     You do not consider the effect of the weights. 
#     From one layer to the next, we have two influences: 
#         (1) multiplication with weights and 
#         (2) applying the SELU. 
#         (1) has a centering and symmetrising effect (draws mean towards zero) and 
#         (2) has a variance stabilizing effect (draws variance towards 1). 

#         That is why we use the variables \mu&\omega and \nu&\tau to analyze the both effects.
# Oh yes, thats true, zero mean weights completely kill the mean. Thanks!

-5.96593216528e-05 0.000961902494112


In [6]:
# Normal dropout for ReLU
def dropout_forward(X, p_dropout):
    u = np.random.binomial(1, p_dropout, size=X.shape) / p_dropout
    out = X * u
    cache = u
    return out, cache

def dropout_backward(dout, cache):
    dX = dout * cache
    return dX

In [7]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x, _ = dropout_forward(p_dropout=0.8, X=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.256957516928 0.307727747979 0.918308217507 1.2931015458
-0.182192403207 0.318422064717 0.959007439863 1.49359179336
-0.230316499051 0.314326798104 1.0321997765 1.63824815892
-0.23649657688 0.348654017622 1.05635651738 1.74116137036
-0.24933939289 0.392245158414 1.14475114238 1.76191160941
-0.163087489105 0.421406263884 1.17250788374 1.94479958698
-0.267504611261 0.436318338005 1.13018738786 1.86434460042
-0.208328031694 0.469791982325 1.18507012108 1.90226164169
-0.140344345836 0.435378766854 1.19170773151 2.06887862909
-0.168632731811 0.493313071662 1.09673156178 2.11524893245
-0.119108868741 0.52054718538 1.10884678308 2.06723729255
-0.131501857725 0.563248660121 1.16180977707 2.14694152282
-0.153622970919 0.537861007285 1.22662114284 2.26374826768
-0.243633404724 0.640730433516 1.16749581758 2.24247800773
-0.126240620186 0.536816259405 1.25405538517 2.1429614854
-0.206910851628 0.570610760406 1.25953298042 2.27830865496
-0.222444622614 0.493762157422 1.21738094279 2.12734238594
-

In [8]:
def dropout_selu(x, rate, alpha= -1.7580993408473766, fixedPointMean=0.0, fixedPointVar=1.0, 
                 noise_shape=None, seed=None, name=None, training=False):
    """Dropout to a value with rescaling."""

    def dropout_selu_impl(x, rate, alpha, noise_shape, seed, name):
        keep_prob = 1.0 - rate
        x = ops.convert_to_tensor(x, name="x")
        if isinstance(keep_prob, numbers.Real) and not 0 < keep_prob <= 1:
            raise ValueError("keep_prob must be a scalar tensor or a float in the "
                                             "range (0, 1], got %g" % keep_prob)
        keep_prob = ops.convert_to_tensor(keep_prob, dtype=x.dtype, name="keep_prob")
        keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())

        alpha = ops.convert_to_tensor(alpha, dtype=x.dtype, name="alpha")
        keep_prob.get_shape().assert_is_compatible_with(tensor_shape.scalar())

        if tensor_util.constant_value(keep_prob) == 1:
            return x

        noise_shape = noise_shape if noise_shape is not None else array_ops.shape(x)
        random_tensor = keep_prob
        random_tensor += random_ops.random_uniform(noise_shape, seed=seed, dtype=x.dtype)
        binary_tensor = math_ops.floor(random_tensor)
        ret = x * binary_tensor + alpha * (1-binary_tensor)

        a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))

        b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
        ret = a * ret + b
        ret.set_shape(x.get_shape())
        return ret

    with ops.name_scope(name, "dropout", [x]) as name:
        return utils.smart_cond(training,
            lambda: dropout_selu_impl(x, rate, alpha, noise_shape, seed, name),
            lambda: array_ops.identity(x))

In [9]:
# def dropout_selu_forward(X, p_dropout):
def dropout_selu_forward(X, keep_prob):
    alpha= -1.7580993408473766
    fixedPointMean=0.0
    fixedPointVar=1.0

    u = np.random.binomial(1, keep_prob, size=X.shape) / keep_prob
    out = X * u + alpha * (1-u)
    
    #     keep_prob = 1.0 - p_dropout # keep_prob==p_dropout, 1-rate for dropout, 80% is keep_prob
    #     a = tf.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * tf.pow(alpha-fixedPointMean,2) + fixedPointVar)))
    a = np.sqrt(fixedPointVar / (keep_prob *((1-keep_prob) * (alpha-fixedPointMean)**2 + fixedPointVar)))
    b = fixedPointMean - a * (keep_prob * fixedPointMean + (1 - keep_prob) * alpha)
    out = a * out + b
    cache = a, u
    return out, cache

def dropout_selu_backward(dout, cache):
    a, u = cache
    dout = dout * a
    dX = dout * u
    return dX

In [27]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x = selu(x)
    x, _ = dropout_selu_forward(keep_prob=0.95, X=x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.105856907841 0.279570434562 0.86660190467 1.20251243972
-0.155191664383 0.288261515002 0.889275976739 1.25138699752
-0.0705334706002 0.344630660594 0.951567828874 1.31551801341
-0.132482524009 0.353791403522 0.938117529417 1.37189346045
-0.113616815796 0.306603972222 0.956869698278 1.35072670805
-0.0918661701641 0.325162981366 0.950788678996 1.42381428375
-0.0869468239639 0.324479072308 0.961812747501 1.41067656968
-0.0606090730805 0.351917586035 0.972776951234 1.38533958476
-0.10872708601 0.383740546293 1.02475160705 1.40877286324
-0.183687697566 0.388829816808 0.985993555097 1.45597050914
-0.183079470999 0.353045997825 1.04372547392 1.43863367183
-0.101662083454 0.424148949162 0.99622128851 1.40205418122
-0.0976699484971 0.387634760108 1.04357712112 1.40889933699
-0.173219181539 0.327215656716 0.983300254122 1.40398571873
-0.10727509687 0.343160040971 0.976136432118 1.39255829901
-0.143136983458 0.379707702698 0.983961253262 1.42320519678
-0.0508431911137 0.367824686636 1.01587006

In [5]:
def elu_fwd(X):
    X_pos = np.maximum(0.0, X) # ReLU
    m = 1.0 # 1e-3==0.001, a==m, 0.0 <= a <= 1.0, active/passive, on/off
    X_neg = np.minimum(X, 0) # otherwise: if X<=0, Exp Leaky ReLU
    X_neg_exp = m * (np.exp(X_neg)-1) # a: slope, a>=0
    return X_pos + X_neg_exp

def elu_bwd(X, dX):
    m = 1.0 # 1e-3==0.001, a==m, 0.0 <= a <= 1.0, active/passive, on/off
    X_neg = np.minimum(X, 0) # otherwise: if X<=0, Exp Leaky ReLU
    m_neg_exp = m * np.exp(X_neg) # derivative of abs(np.exp(X_neg)-1) # a: slope, a>=0
    return dX * m_neg_exp

In [12]:
def selu_fwd(X):
    alpha = 1.6732632423543772848170429916717
    scale = 1.0507009873554804934193349852946
    #     return scale * np.where(x>=0.0, x, alpha * (np.exp(x)-1))
    X_pos = np.maximum(0.0, X) # ReLU
    X_neg = np.minimum(X, 0.0) # otherwise: if X<=0, Exp Leaky ReLU
    X_neg_exp = alpha * (np.exp(X_neg)-1) # a: slope, a>=0
    out = scale * (X_pos + X_neg_exp)
    cache = (scale, alpha, X) # mean=0, std=1
    return out, cache

def selu_bwd(dX, cache):
    scale, alpha, X = cache # mean=0, std=1
    dX = dX * scale
    dX_neg = dX.copy()
    dX_neg[X>0] = 0
    X_neg = np.minimum(X, 0) # otherwise: if X<=0, Exp Leaky ReLU
    dX_neg = dX_neg * alpha * np.exp(X_neg) # derivative of abs(np.exp(X_neg)-1) # a: slope, a>=0
    dX_pos = dX.copy()
    dX_pos[X<0] = 0
    dX_pos = dX_pos * 1
    dX = dX_neg + dX_pos
    return dX

In [13]:
# EDIT: For the fun of it, I ran a quick experiment to see if activations would really stay close to 0/1:
x = np.random.normal(size=(300, 200))
for _ in range(100):
    w = np.random.normal(size=(200, 200), scale=np.sqrt(1/200))  # their initialization scheme
    x = x @ w
    x, cache = selu_fwd(x)
    mean = x.mean(axis=1)
    scale = x.std(axis=1) # standard deviation=square-root(variance)
    print(mean.min(), mean.max(), scale.min(), scale.max())

-0.192573130677 0.225464851152 0.8517251468 1.17343643636
-0.181990439693 0.19236772977 0.830614924998 1.16460880061
-0.156997657224 0.180271015029 0.828943363818 1.16442619197
-0.199104746591 0.22398155216 0.796812820533 1.2104985597
-0.173998806034 0.170736138442 0.779529801004 1.19198148117
-0.219109562845 0.156020176235 0.770437549522 1.212449289
-0.18967747706 0.221089878525 0.758048489469 1.23475467222
-0.223147408747 0.194725050129 0.820239419803 1.21362171248
-0.17199878824 0.166789833841 0.837366033913 1.21879645082
-0.212695078258 0.297249892012 0.824467642147 1.21928508349
-0.206781718701 0.183024553619 0.850118701275 1.22015661902
-0.193197569119 0.230625786613 0.82600270733 1.18500910834
-0.163987888743 0.171045279072 0.807928639787 1.17680909555
-0.20329546972 0.243542309672 0.843773195816 1.16241449782
-0.222863285566 0.209524992637 0.832198097308 1.14877634514
-0.213679413185 0.196324396734 0.820204317897 1.18841301673
-0.141203904392 0.215977872027 0.823700795761 1.228

# Discussion & wrapup
According to this, even after a 100 layers, mean neuron activations stay fairly close to mean 0 / variance 1 
(even the most extreme means/variances are only off by 0.2).

Sepp Hochreiter is amazing: LSTM, meta-learning, SNNN. 

I think he has already done a much larger contribution to science than some self-proclaimed pioneers of DL 
who spend more time on social networks than actually doing any good research.