In this notebook I want to play around with the parameters of a transformer encoder to get a better understanding of them.

In [1]:
import matplotlib.pyplot as plt
import numpy as np

# Token Embedding Dimension

Here I want to see if I can find some reasons why the BERT<sub>BASE</sub> model chooses a token embedding dimension (called $H$ in the BERT paper) of specifically $H=768$. The reason I am interested in this number to begin with is the study of the attention head count hyperparameter $A$. If you want to split the embedding vector equally between all the attention heads but still want a large number of possible attention head counts $A$, then the embedding dimension should be chosen to have a large number of integer divisors.
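To make the divisibility constraint concrete, here is a minimal sketch in plain Python. The helper `head_dim` is a hypothetical name introduced only for this illustration. It assumes each head receives an equal slice of the embedding vector, so $A$ must divide $H$.

```python
def head_dim(H: int, A: int) -> int:
    """Return the per-head dimension when H is split evenly across A heads.

    Hypothetical helper for illustration; not taken from any library.
    """
    if H % A != 0:
        raise ValueError(f"H={H} cannot be split evenly across A={A} heads")
    return H // A


# BERT-BASE uses H=768 with A=12 heads.
print(head_dim(768, 12))  # 64

# A head count that does not divide H is rejected.
try:
    head_dim(768, 5)
except ValueError as err:
    print(err)
```

BERT<sub>LARGE</sub> with $H=1024$ and $A=16$ ends up with the same per-head dimension of 64.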

In [2]:
H = 768

First let's determine the prime factorization of the dimension.

In [3]:
from sympy.ntheory import factorint

factorint(H)

{2: 8, 3: 1}

The factorization is a bit underwhelming: $768 = 3 \cdot 256$, which is three quarters of the $1024$ used in the BERT<sub>LARGE</sub> model. Then again, numbers built mostly from powers of two naturally have many divisors. Let's see which divisors, i.e. which possible values for $A$, we are working with. We limit ourselves to realistic values from $A=1$ (single-head attention) to $A=24$ (1.5 times the $A=16$ used in BERT<sub>LARGE</sub>).

In [4]:
from sympy.ntheory import divisors

A_max = 24
# generator=True yields the divisors lazily and in no particular order
A_possible = [n for n in divisors(H, generator=True) if n <= A_max]

print(f"Found {len(A_possible)} possible values for A: {A_possible}")

Found 9 possible values for A: [1, 2, 4, 8, 16, 3, 6, 12, 24]
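
As a cross-check on this list, the total number of divisors can be read off the prime factorization. With $768 = 2^8 \cdot 3$, a divisor picks an exponent from $0$ to $8$ for the prime $2$ and from $0$ to $1$ for the prime $3$, so there should be $(8+1)(1+1) = 18$ divisors in total. The following minimal sketch uses plain Python instead of sympy. The names `predicted`, `all_divisors` and `A_sorted` are introduced here only for illustration. It confirms the count by brute force and prints the candidates in ascending order.

```python
from math import prod

H = 768
A_max = 24
factors = {2: 8, 3: 1}  # prime factorization of H, as returned by factorint(H)

# Sanity check: the factorization really multiplies back to H.
assert prod(p**e for p, e in factors.items()) == H

# Divisor count predicted by the exponents: product of (exponent + 1).
predicted = prod(e + 1 for e in factors.values())

# Brute-force enumeration for comparison.
all_divisors = [n for n in range(1, H + 1) if H % n == 0]
A_sorted = [n for n in all_divisors if n <= A_max]

print(predicted, len(all_divisors))  # 18 18
print(A_sorted)  # [1, 2, 3, 4, 6, 8, 12, 16, 24]
```

Half of the 18 divisors fall into the range considered here. For comparison, a pure power of two such as $1024 = 2^{10}$ has only $10 + 1 = 11$ divisors.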
