author: Simo Ryu
tag: research
title: Is that a faster GELU?
abstract: We take a closer look at a recent Twitter post by Daniël de Kok (https://twitter.com/danieldekok/status/1484898130441166853). Is the function it proposes a faster GELU?

A recent Twitter post by @danieldekok (https://twitter.com/danieldekok/status/1484898130441166853) proposes an activation function that is similar to GELU but could be faster to compute. He writes:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Did anyone try this activation function? It is very similar to GELU and Swish, but has the benefit that it can be implemented easily using SIMD on modern CPUs. <a href="https://t.co/4WytEyhueD">pic.twitter.com/4WytEyhueD</a></p>&mdash; Daniël de Kok 💉💉 (@danieldekok) <a href="https://twitter.com/danieldekok/status/1484898130441166853?ref_src=twsrc%5Etfw">January 22, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

The function looks like this: 

$$
y = 0.5 x (1 + \frac{x}{\sqrt{1 + x^2}})
$$

We will compare GELU with this proposed alternative on both performance and efficiency. From now on, we call the alternative DaniGELU.
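To get a feel for how close the two functions are before timing anything, the sketch below evaluates both on a grid using only the Python standard library. It assumes the exact (erf-based) definition of GELU, $0.5 x (1 + \mathrm{erf}(x / \sqrt{2}))$; the names `gelu_exact`, `dani_gelu`, and the grid range are illustrative choices, not something taken from the tweet.

```python
import math


def gelu_exact(x: float) -> float:
    """Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dani_gelu(x: float) -> float:
    """Proposed function: 0.5 * x * (1 + x / sqrt(1 + x^2))."""
    return 0.5 * x * (1.0 + x / math.sqrt(1.0 + x * x))


# Evenly spaced grid over [-6, 6]; the range and step are arbitrary choices.
grid = [i / 100.0 for i in range(-600, 601)]
diffs = [abs(dani_gelu(x) - gelu_exact(x)) for x in grid]
max_diff = max(diffs)
argmax_x = grid[diffs.index(max_diff)]

print(f"max |DaniGELU - GELU| on [-6, 6]: {max_diff:.4f} at x = {argmax_x:+.2f}")
```

The only difference between the two formulas is that $\mathrm{erf}(x / \sqrt{2})$ is replaced by $x / \sqrt{1 + x^2}$. Both terms go to $\pm 1$ for large $|x|$, so both activations approach $x$ on the right and $0$ on the left.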


# 1. Efficiency



Is the DaniGELU function faster than GELU? Let's compare the two in PyTorch.

In [2]:
import torch
import torch.nn as nn

# we would normally define it as follows:

class DaniGELU(nn.Module):
    r'''
    Activation such that:
    $$
    y = 0.5 x (1 + \frac{x}{\sqrt{1 + x^2}})
    $$
    '''
    def __init__(self):
        super(DaniGELU, self).__init__()
        

    def forward(self, x):
        return 0.5 * x * (1 + x / torch.sqrt(1 + x**2))


gelu = nn.GELU()
dani_gelu = DaniGELU()

# test
x = torch.randn(1, 1, 1, 1)
print(gelu(x).item())
print(dani_gelu(x).item())


-0.13763469457626343
-0.12556812167167664


In [3]:
import time

# test dani_gelu time performance
start = time.time()
for i in range(10000):
    dani_gelu(x)
end = time.time()

print('dani_gelu time:', end - start)

# test gelu time performance

start = time.time()
for i in range(10000):
    gelu(x)
end = time.time()

print('gelu time:', end - start)



dani_gelu time: 0.12120413780212402
gelu time: 0.017823219299316406


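However, I doubt that this gap comes from the math of the function itself. More likely, it comes from how each function is implemented.
Internally, `nn.GELU` dispatches to a single native kernel in PyTorch's C++ backend (or its CUDA backend on a GPU). The DaniGELU above, in contrast, is composed in Python from several separate tensor operations (a power, an addition, a square root, a division, and two multiplications), each with its own dispatch overhead and intermediate tensor. On a single-element input, that overhead dominates the measurement.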
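Part of the measurement noise can be removed while staying in Python. The sketch below is one possible setup, not a definitive benchmark: it uses `torch.utils.benchmark.Timer`, which handles warm-up and (on GPU) synchronization, and a larger input so that per-call overhead matters less. The tensor size, the single-thread setting, and the function-style `dani_gelu` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from torch.utils import benchmark


def dani_gelu(x: torch.Tensor) -> torch.Tensor:
    # Same formula as the DaniGELU module, still built from separate eager ops.
    return 0.5 * x * (1 + x / torch.sqrt(1 + x**2))


# Assumed input size: large enough that the arithmetic, not dispatch, dominates.
x = torch.randn(1 << 16)

results = {}
for name, fn in {"gelu": F.gelu, "dani_gelu": dani_gelu}.items():
    timer = benchmark.Timer(
        stmt="fn(x)",
        globals={"fn": fn, "x": x},
        num_threads=1,
    )
    # Median time per call, in seconds.
    results[name] = timer.blocked_autorange(min_run_time=0.2).median

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

This still compares one fused native kernel against a chain of eager operations, so it only reduces timing noise. It does not make the two implementations equivalent.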

A fair comparison would implement both functions in C++ and compare those. Let's do just that.

```cpp
#include <torch/torch.h>
#include <torch/script.h>

#include <cmath>
#include <iostream>

// Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
torch::Tensor GELU(torch::Tensor x) {
  // sqrt(2) is a plain scalar, so use std::sqrt rather than torch::sqrt.
  return 0.5 * x * (1 + torch::erf(x / std::sqrt(2.0)));
}

// DaniGELU: 0.5 * x * (1 + x / sqrt(1 + x^2))
torch::Tensor daniGELU(torch::Tensor x) {
  return 0.5 * x * (1 + x / torch::sqrt(1 + x.pow(2)));
}


int main() {
  // Run on the GPU when one is available, otherwise on the CPU.
  torch::DeviceType device_type;
  if (torch::cuda::is_available()) {
    device_type = torch::kCUDA;
  } else {
    device_type = torch::kCPU;
  }
  torch::Device device(device_type);

  // Evaluate both activations on the same random input.
  torch::Tensor x = torch::randn({1, 1}, device);
  torch::Tensor geLU_out = GELU(x);
  torch::Tensor daniGELU_out = daniGELU(x);

  std::cout << "GELU: " << geLU_out << std::endl;
  std::cout << "daniGELU: " << daniGELU_out << std::endl;
}
```
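The erf-based closed form used for GELU in a hand-written implementation like this can be checked against PyTorch's built-in from Python. This is a minimal sketch, assuming the default exact (non-tanh) mode of `F.gelu`; the sample size and the use of `float64` are arbitrary choices.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1000, dtype=torch.float64)

# Hand-written exact GELU, mirroring the formula used in the C++ function.
manual_gelu = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

# Built-in GELU; called with defaults, which use the exact erf form.
builtin_gelu = F.gelu(x)

max_err = (manual_gelu - builtin_gelu).abs().max().item()
print("max |manual - builtin|:", max_err)
```

If the two agree to within floating-point error, any speed difference between the hand-written GELU and DaniGELU comes from the cost of the operations themselves, not from computing a different function.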
