# Explanation

As we observed earlier, ReLU and Dropout have very similar effects in how they promote regularization in neural networks: both work by zeroing out neurons (under different conditions), which enables sparse representations and forces neurons to be independently useful.

GELU is an activation function that combines these properties into one concept and, in the paper's experiments, performs better than ReLU and related activation functions such as ELU.

Today, GELU is used in state-of-the-art models including BERT and the GPT family, so it has remained a strong default choice of activation function in many cases.

### Intuition

As mentioned, GELU attempts to conceptually combine the ideas of Dropout and ReLU.

Specifically, it explores the idea of adaptive dropout: instead of each neuron being equally likely to drop out regardless of its input (based on the fixed quantity $p$), we let neurons adapt to their input by decreasing the probability of dropping out as their input increases.

To understand this properly, let's look at the math.

### Math

We can think about adaptive dropout as multiplying each neuron input $x$ by a mask $m \sim \textrm{Bernoulli}(\Phi(x))$ rather than sampling the mask from $\textrm{Bernoulli}(p)$. Here, $\Phi(x)$ is the CDF of the standard Gaussian distribution, so the probability of the neuron being kept increases with $x$.
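
To make the sampling step concrete, here is a minimal sketch of the stochastic mask using only the Python standard library. The helper names (`standard_normal_cdf`, `sample_mask`, `adaptive_dropout`, `keep_rate`) are illustrative and not taken from the paper.

```python
import math
import random


def standard_normal_cdf(x: float) -> float:
    """Phi(x): CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_mask(x: float, rng: random.Random) -> int:
    """Draw m ~ Bernoulli(Phi(x)): 1 keeps the neuron, 0 drops it."""
    return 1 if rng.random() < standard_normal_cdf(x) else 0


def adaptive_dropout(x: float, rng: random.Random) -> float:
    """Apply the input-dependent mask to x."""
    return sample_mask(x, rng) * x


def keep_rate(x: float, n_samples: int = 20_000, seed: int = 0) -> float:
    """Empirical fraction of draws in which the neuron is kept (illustrative helper)."""
    rng = random.Random(seed)
    return sum(sample_mask(x, rng) for _ in range(n_samples)) / n_samples


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(f"x={x:+.1f}  Phi(x)={standard_normal_cdf(x):.4f}  keep_rate={keep_rate(x):.4f}")
```

The empirical keep rate should track $\Phi(x)$: low for strongly negative inputs, about one half at $x = 0$, and close to 1 for large positive inputs.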

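We can create a deterministic version of this function by taking the expected value of the stochastic output $m x$, which is $x$ times the probability of keeping it: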

$$ \textrm{GELU}(x) = x \Phi(x) $$
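
As a rough check on this formula, the sketch below implements $x \Phi(x)$ with `math.erf` and compares it against a Monte Carlo average of the stochastic version. The function names and the sample count are assumptions chosen for illustration.

```python
import math
import random


def standard_normal_cdf(x: float) -> float:
    """Phi(x): CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Deterministic GELU: x * Phi(x)."""
    return x * standard_normal_cdf(x)


def mean_stochastic_output(x: float, n_samples: int = 100_000, seed: int = 0) -> float:
    """Average of m * x over many draws of m ~ Bernoulli(Phi(x))."""
    rng = random.Random(seed)
    keep_prob = standard_normal_cdf(x)
    total = sum(x for _ in range(n_samples) if rng.random() < keep_prob)
    return total / n_samples


if __name__ == "__main__":
    for x in (-1.5, -0.5, 0.5, 1.5):
        print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  monte_carlo={mean_stochastic_output(x):+.4f}")
```

The two columns should agree up to sampling noise, which is what it means for GELU to be the expectation of the stochastic mask applied to $x$.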

Here, instead of a neuron being dropped at random with a probability given by the Gaussian CDF, its value is scaled smoothly: the output approaches the identity mapping as $x$ increases and approaches 0 as $x$ becomes more negative.

Asymptotically, GELU behaves the same as ReLU. You can think of GELU as a smoothed out version of ReLU which has curvature around $x = 0$ (compared with ReLU being non-differentiable at $x = 0$).
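
The asymptotic agreement can be checked numerically. This is a small sketch (the `gap` helper is a name introduced here for illustration) that measures how far GELU is from ReLU at a few points.

```python
import math


def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def gap(x: float) -> float:
    """Absolute difference between ReLU and GELU at x (illustrative helper)."""
    return abs(relu(x) - gelu(x))


if __name__ == "__main__":
    for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"x={x:+.1f}  relu={relu(x):.6f}  gelu={gelu(x):+.6f}  gap={gap(x):.6f}")
```

The gap is largest near the origin, where GELU curves, and shrinks toward zero as $|x|$ grows in either direction.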

Empirically, GELU performs better than ReLU in many cases. This may be because its curvature lets the model represent more complex non-linearities.

# My Notes

## 📜 [Gaussian Error Linear Units (GELUs)](https://arxiv.org/pdf/1606.08415)

> Despite having less of a statistical motivation, the ReLU remains a competitive engineering solution which often enables faster and better convergence than sigmoids.

> Nonlinearities and dropout thus determine a neuron’s output together, yet the two innovations have remained distinct.

> [GELU] relates to stochastic regularizers in that it is the expectation of a modification to Adaptive Dropout.

### GELU Formulation

> We motivate our activation function by combining properties from dropout, zone-out, and ReLUs.

The paper introduces a stochastic regularizer that multiplies a neuron input $x$ by $m \sim \text{Bernoulli}(\Phi(x))$ where $\Phi(x)$ is the CDF of the Gaussian distribution (standard normal distribution).

> We choose this distribution since neuron inputs tend to follow a normal distribution, especially with Batch Normalization.

> In this setting, inputs have a higher probability of being “dropped” as x decreases, so the transformation applied to x is stochastic yet depends upon the input.

Here, we have something similar to adaptive dropout, where the probability of dropout is not constant but depends on the input value: more “important” inputs, as determined by their relative scale after normalization, are less likely to be dropped out.

To motivate the non-linearity, the paper creates a deterministic version of this function by taking the expected value of the stochastic output $m x$, which gives $\textrm{GELU}(x) = x \Phi(x)$.

With this function, the output gets closer to the identity as the CDF gets higher, and gets closer to being zeroed as the CDF gets lower. Asymptotically, this function behaves the same as ReLU.

> Loosely, this expression states that we scale x by how much greater it is than other inputs.

In other words, $x$ is passed through nearly unchanged when it is much larger than the other inputs to normalization: it is then normalized to a higher value, so $\Phi(x)$ is close to 1.

### Discussion

> For example, as σ → 0 and if µ = 0, the GELU becomes a ReLU.

> More, the ReLU and GELU are equal asymptotically

> In fact, the GELU can be viewed as a way to smooth a ReLU.
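
The first quote above refers to a GELU built on a general Gaussian CDF. A minimal sketch, assuming that generalization has the form $x \, \Phi\big((x - \mu)/\sigma\big)$ (the name `gelu_general` and its parameters are illustrative), shows the output approaching ReLU as $\sigma$ shrinks with $\mu = 0$:

```python
import math


def gelu_general(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """x * CDF of N(mu, sigma^2) evaluated at x; the defaults give the standard GELU."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return 0.5 * x * (1.0 + math.erf(z / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


if __name__ == "__main__":
    for sigma in (1.0, 0.1, 0.001):
        values = [gelu_general(x, mu=0.0, sigma=sigma) for x in (-2.0, -0.5, 0.5, 2.0)]
        print(f"sigma={sigma}: {[round(v, 4) for v in values]}")
    print(f"relu:        {[relu(x) for x in (-2.0, -0.5, 0.5, 2.0)]}")
```

As $\sigma$ decreases, the Gaussian CDF sharpens into a step at $0$, so the smooth weighting turns into ReLU's hard gate on the sign of $x$.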

> This non-convex, non-monotonic function is not linear in the positive domain and exhibits curvature at all points. Meanwhile ReLUs and ELUs, which are convex and monotonic activations, are linear in the positive domain and thereby can lack curvature. As such, increased curvature and non-monotonicity may allow GELUs to more easily approximate complicated functions than can ReLUs or ELUs.
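
The non-monotonicity mentioned in this quote can be seen with a simple grid search over negative inputs. This is a sketch; the grid range and step size are arbitrary choices made here.

```python
import math


def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Evenly spaced grid on [-3, 0] with step 0.01 (arbitrary illustrative choice).
xs = [-3.0 + 0.01 * i for i in range(301)]

# Grid point where GELU attains its smallest value.
x_min = min(xs, key=gelu)

if __name__ == "__main__":
    print(f"grid minimum at x={x_min:.2f}, gelu(x)={gelu(x_min):.4f}")
    print(f"gelu(-3.0)={gelu(-3.0):.4f}, gelu(0.0)={gelu(0.0):.4f}")
```

GELU dips below zero for moderately negative inputs and then rises back toward zero as $x$ becomes more negative, so it decreases and then increases as $x$ goes from $-\infty$ to $0$. ReLU, by contrast, is flat at zero over the whole negative domain.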

> We can see that the ReLU gates the input depending upon its sign, while the GELU weights its input depending upon how much greater it is than other inputs.

> In addition and significantly, the GELU has a probabilistic interpretation given that it is the expectation of a stochastic regularizer.

### Conclusion

> For the numerous datasets evaluated in this paper, the GELU exceeded the accuracy of the ELU and ReLU consistently, making it a viable alternative to previous nonlinearities.
