In [None]:
import tensorflow as tf
from tensorflow import keras

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [None]:
model.summary()

Each BN layer adds four parameters per input: $\gamma$, $\beta$, $\mu$ and $\sigma$. The last two are moving averages that are updated during training but not by backpropagation, so Keras calls them "non-trainable" parameters.
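As a quick sanity check on the summary, the BN parameter counts can be worked out by hand. This is a minimal sketch in plain Python; the names `bn_inputs`, `bn_params` and `non_trainable` are introduced here for illustration only, and the input sizes are read off the model defined above.

```python
# Number of inputs seen by each of the three BN layers in the model above:
# the flattened 28x28 image, then the outputs of the two hidden Dense layers.
bn_inputs = [28 * 28, 300, 100]

# Four parameters per input: gamma, beta, and the two moving statistics.
bn_params = [4 * n for n in bn_inputs]

# Only the two moving statistics per input are non-trainable.
non_trainable = sum(2 * n for n in bn_inputs)

print(bn_params)       # [3136, 1200, 400]
print(non_trainable)   # 2368
```

Since the Dense layers have no non-trainable parameters, the second number should be the whole non-trainable total reported by `model.summary()`.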

Let's look at the parameters of the first BN layer.

In [None]:
[(var.name, var.trainable) for var in model.layers[1].variables]

The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after (as we have done here). To do so, we must remove the activation functions from the hidden layers and add them as separate layers after the BN layers. Moreover, since a BN layer includes one offset parameter per input, we can remove the bias term from the previous layer:

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])
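The reason a bias is redundant right before a BN layer can be checked numerically. The sketch below uses NumPy rather than Keras and a simplified, hypothetical `batch_norm` helper that only standardizes with the batch statistics (no $\gamma$ or $\beta$); the arrays `X`, `W` and `b` are random placeholders, not values from the model.

```python
import numpy as np

def batch_norm(z, eps=1e-3):
    """Standardize each column of z with the batch mean and variance.

    Simplified training-mode BN: the scale (gamma) and offset (beta)
    parameters are left out, since they do not affect the argument.
    """
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 5))   # batch of 32 instances with 5 features
W = rng.normal(size=(5, 3))    # hypothetical kernel of a Dense layer
b = rng.normal(size=3)         # hypothetical bias of the same layer

with_bias = batch_norm(X @ W + b)
without_bias = batch_norm(X @ W)

# Adding b shifts each column's mean by b and leaves its variance unchanged,
# so subtracting the batch mean cancels the bias.
bias_is_redundant = np.allclose(with_bias, without_bias)
print(bias_is_redundant)
```

The same cancellation does not apply to a layer that is not followed by BN, which is why only the Dense layers directly before a BN layer can safely drop their bias.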