In this notebook we want to implement the zero-layer transformer from "A Mathematical Framework for Transformer Circuits" (https://transformer-circuits.pub/2021/framework/index.html). We would like to do this using an observer-pattern framework similar to PyTorch Ignite. Here are the steps, in order:

 - get data
 - write model
 - write training loop
 - visualization

# Get Data

According to the post "The training dataset is as described in Kaplan et al. (A General Language Assistant as a Laboratory for Alignment)". Upon inspecting this paper, it is not entirely clear which dataset was actually used. The choice of dataset shouldn't matter much for LM pretraining, although we should expect somewhat different results. Let's use Wikipedia since it's available on HuggingFace.

On the "bert-base-uncased" page, it says "The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers)." I'm not sure whether lists, tables, and headers are present in the HuggingFace Wikipedia corpus, so I'll do some simple filtering and reformatting to remove most of them.

In [1]:
from datasets import load_dataset

wiki_dataset = load_dataset("wikipedia", "20220301.en")
wiki_dataset = wiki_dataset['train']

Reusing dataset wikipedia (/data/users/bmak2/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)



In [2]:
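def clean_wikipedia_formatting(ex, title_len_thresh=10):
    # Keep only lines with more than `title_len_thresh` whitespace-separated words.
    # This drops most section titles, list items, and other short fragments.
    lines = ex['text'].split('\n')
    kept = [line for line in lines if len(line.split()) > title_len_thresh]
    return {'text': ' '.join(kept)}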

wiki_dataset = wiki_dataset.map(clean_wikipedia_formatting, num_proc=32)
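
To see what this filter keeps and drops, here is a small standalone sketch that applies the same rule to a made-up article. The helper name `keep_long_lines` and the toy text are invented for illustration and are not part of the dataset.

```python
def keep_long_lines(text, min_words=10):
    # Same rule as clean_wikipedia_formatting: keep lines with more than
    # `min_words` whitespace-separated words, then join them with spaces.
    return ' '.join(
        line for line in text.split('\n') if len(line.split()) > min_words
    )

# Hypothetical article: two headings, one list entry, two real sentences.
toy_article = "\n".join([
    "History",
    "The town was founded in the twelfth century by settlers who arrived from the north.",
    "See also",
    "List of towns",
    "Its economy relied on fishing and trade with neighbouring ports for several hundred years afterwards.",
])

cleaned = keep_long_lines(toy_article)
print(cleaned)
```

The headings and the list entry are dropped, and the two long sentences are joined into one string. The same rule would also drop any genuine sentence of ten words or fewer, so it trades a little real text for cleaner input.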



                                  


Now we need to train a BPE tokenizer on this dataset. HuggingFace has a lot of options when it comes to tokenizers. Let's look at what's available and make some choices.

First we need to decide on our core algorithm. Let's go with BPE since that's what the GPT family of models uses (other transformers use alternatives such as WordPiece or Unigram). Here is a list of the available init args:

 - vocab (Dict[str, int], optional) — A dictionary of string keys and their ids {"am": 0,...}
 - merges (List[Tuple[str, str]], optional) — A list of pairs of tokens (Tuple[str, str]) [("a", "b"),...]
 - cache_capacity (int, optional) — The number of words that the BPE cache can contain. The cache allows to speed-up the process by keeping the result of the merge operations for a number of words.
 - dropout (float, optional) — A float between 0 and 1 that represents the BPE dropout to use.
 - unk_token (str, optional) — The unknown token to be used by the model.
 - continuing_subword_prefix (str, optional) — The prefix to attach to subword units that don’t represent a beginning of word.
 - end_of_word_suffix (str, optional) — The suffix to attach to subword units that represent an end of word.
 - fuse_unk (bool, optional) — Whether to fuse any subsequent unknown tokens into a single one
 
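These all look interesting, but for now the cell below sets only continuing_subword_prefix and end_of_word_suffix and leaves everything else, including unk_token, at its default.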

In [38]:
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(continuing_subword_prefix='##', end_of_word_suffix='##'))
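
To make the `vocab` and `merges` arguments more concrete, here is a toy sketch of how BPE training produces a merge list. It is a simplified pure-Python illustration, not the tokenizers implementation; the function name `learn_bpe_merges` and the four-word corpus are made up for this example.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word starts as a tuple of single characters with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of `best` with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

toy_merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(toy_merges)
```

In this sketch each entry of the returned list plays the role of one item in `merges`, and the vocabulary would be the initial characters plus every merged symbol. The real trainer adds options such as `min_frequency` and `vocab_size` on top of this basic loop.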

There are several other components we can choose to add to our tokenizer: Normalizers, Pre-tokenizers, Post-Processors, and Decoders.

Let's have a look at normalizers first. This is the description huggingface gives for normalizers:

"A Normalizer is in charge of pre-processing the input string in order to normalize it as relevant for a given use case. Some common examples of normalization are the Unicode normalization algorithms (NFD, NFKD, NFC & NFKC), lowercasing etc… The specificity of tokenizers is that we keep track of the alignment while normalizing. This is essential to allow mapping from the generated tokens back to the input text."

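The Unicode normalization forms (NFD, NFKD, NFC & NFKC) look quite complicated. They control how composed characters, such as accented letters, and compatibility variants of characters are represented. Hopefully the English Wikipedia dataset contains few enough of these that we can skip this step. Let's just do strip, strip_accents, and lowercase. One caveat: the HuggingFace docs suggest using StripAccents together with NFD for consistency, so without NFD some precomposed accented characters may pass through unchanged.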

In [39]:
from tokenizers import normalizers

normalizer = normalizers.Sequence([normalizers.Strip(), normalizers.StripAccents(), normalizers.Lowercase()])
tokenizer.normalizer = normalizer

In [40]:
tokenizer.normalizer.normalize_str(wiki_dataset[0]['text'])

'anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. as a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism. humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. with the rise of organised hierarchical bodies, scepticism toward authority also rose. although traces of anarchist thought are found throughout history, modern anarchism emerged from the enlightenment. during the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished in 
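
Accent stripping depends on how a character is encoded. The sketch below uses only Python's standard `unicodedata` module to show the idea: combining marks can only be removed once a precomposed character has been decomposed (NFD). It approximates the behaviour and is not the tokenizers implementation; `strip_accents` is a made-up helper.

```python
import unicodedata

def strip_accents(text, decompose_first=True):
    # Optionally split precomposed characters into base letter + combining mark.
    if decompose_first:
        text = unicodedata.normalize("NFD", text)
    # Drop combining marks (Unicode category "Mn", nonspacing mark).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

precomposed = "caf\u00e9"  # "café" with a single precomposed é code point

without_nfd = strip_accents(precomposed, decompose_first=False)
with_nfd = strip_accents(precomposed, decompose_first=True)

print(without_nfd)  # the accent survives: there is no separate mark to remove
print(with_nfd)     # cafe
```

If accented characters turn out to matter for this corpus, putting `normalizers.NFD()` before `normalizers.StripAccents()` in the sequence is the combination the HuggingFace docs describe.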

Now comes pre-tokenization. This is what huggingface has to say about pre-tokenization:

"The PreTokenizer takes care of splitting the input according to a set of rules. This pre-processing lets you ensure that the underlying Model does not build tokens across multiple “splits”. For example if you don’t want to have whitespaces inside a token, then you can have a PreTokenizer that splits on these whitespaces.

You can easily combine multiple PreTokenizer together using a Sequence (see below). The PreTokenizer is also allowed to modify the string, just like a Normalizer does. This is necessary to allow some complicated algorithms that require to split before normalizing (e.g. the ByteLevel)"

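Let's go for the ByteLevel pre-tokenizer used by GPT-2, since it has some nice properties: it works on bytes rather than characters, so a base alphabet of 256 symbols covers any input and no unknown token is needed. In the cell below, WhitespaceSplit does the splitting, and ByteLevel (with its own regex splitting turned off) only does the byte-to-character remapping.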

In [41]:
from tokenizers import pre_tokenizers

tokenizer.pre_tokenizer = pre_tokenizers.Sequence([pre_tokenizers.WhitespaceSplit(), pre_tokenizers.ByteLevel(use_regex=False)])

In [33]:
tokenizer.pre_tokenizer.pre_tokenize_str(wiki_dataset[0]['text'])

[('ĠAnarchism', (0, 9)),
 ('Ġis', (10, 12)),
 ('Ġa', (13, 14)),
 ('Ġpolitical', (15, 24)),
 ('Ġphilosophy', (25, 35)),
 ('Ġand', (36, 39)),
 ('Ġmovement', (40, 48)),
 ('Ġthat', (49, 53)),
 ('Ġis', (54, 56)),
 ('Ġsceptical', (57, 66)),
 ('Ġof', (67, 69)),
 ('Ġauthority', (70, 79)),
 ('Ġand', (80, 83)),
 ('Ġrejects', (84, 91)),
 ('Ġall', (92, 95)),
 ('Ġinvoluntary,', (96, 108)),
 ('Ġcoercive', (109, 117)),
 ('Ġforms', (118, 123)),
 ('Ġof', (124, 126)),
 ('Ġhierarchy.', (127, 137)),
 ('ĠAnarchism', (138, 147)),
 ('Ġcalls', (148, 153)),
 ('Ġfor', (154, 157)),
 ('Ġthe', (158, 161)),
 ('Ġabolition', (162, 171)),
 ('Ġof', (172, 174)),
 ('Ġthe', (175, 178)),
 ('Ġstate,', (179, 185)),
 ('Ġwhich', (186, 191)),
 ('Ġit', (192, 194)),
 ('Ġholds', (195, 200)),
 ('Ġto', (201, 203)),
 ('Ġbe', (204, 206)),
 ('Ġunnecessary,', (207, 219)),
 ('Ġundesirable,', (220, 232)),
 ('Ġand', (233, 236)),
 ('Ġharmful.', (237, 245)),
 ('ĠAs', (246, 248)),
 ('Ġa', (249, 250)),
 ('Ġhistorically', (251, 263)),
 ('Ġleft-
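
The `Ġ` at the start of each piece is how the byte-level scheme displays a space. As a rough illustration, the sketch below rebuilds the byte-to-character table published with GPT-2's reference encoder, which, as far as I can tell, is the same table the ByteLevel pre-tokenizer uses. It is standalone stdlib code, not the tokenizers library itself, and `to_byte_level` is a made-up helper.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that already correspond to printable characters map to themselves.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (space, control characters, ...) are shifted to code
    # points starting at 256 so that they become visible characters too.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

byte_to_char = bytes_to_unicode()

def to_byte_level(text):
    # Encode to UTF-8 and replace every byte with its printable stand-in.
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

print(to_byte_level(" is"))  # Ġis
print(to_byte_level("é"))    # two characters, one per UTF-8 byte
```

Because every byte has a stand-in, any string can be represented, and a decoder only has to invert the table to recover the original text.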

After the pre_tokenizer, the tokenizer applies the model which we already set.

Next comes the post-processor. Here is huggingface's description:

"After the whole pipeline, we sometimes want to insert some special tokens before feed a tokenized string into a model like ”[CLS] My horse is amazing [SEP]”. The PostProcessor is the component doing just that."

For our purposes we don't need such a thing so we can skip this.

In the future, it might be interesting to add a start-of-sequence (SOS) or end-of-sequence (EOS) token to this pipeline to see whether it improves things.

Finally there is the decoder. The huggingface description is:

"The Decoder knows how to go from the IDs used by the Tokenizer, back to a readable piece of text. Some Normalizer and PreTokenizer use special characters or identifiers that need to be reverted for example."

Since we used the ByteLevel pre-tokenizer, we should use the matching ByteLevel decoder.

In [42]:
from tokenizers import decoders

tokenizer.decoder = decoders.ByteLevel()

Now it's time to train the tokenizer. Let's load a BpeTrainer and decide on its args. Here is its docstring:

'''
    
    vocab_size (:obj:`int`, `optional`):
        The size of the final vocabulary, including all tokens and alphabet.

    min_frequency (:obj:`int`, `optional`):
        The minimum frequency a pair should have in order to be merged.

    show_progress (:obj:`bool`, `optional`):
        Whether to show progress bars while training.

    special_tokens (:obj:`List[Union[str, AddedToken]]`, `optional`):
        A list of special tokens the model should know of.

    limit_alphabet (:obj:`int`, `optional`):
        The maximum different characters to keep in the alphabet.

    initial_alphabet (:obj:`List[str]`, `optional`):
        A list of characters to include in the initial alphabet, even
        if not seen in the training dataset.
        If the strings contain more than one character, only the first one
        is kept.

    continuing_subword_prefix (:obj:`str`, `optional`):
        A prefix to be used for every subword that is not a beginning-of-word.

    end_of_word_suffix (:obj:`str`, `optional`):
        A suffix to be used for every subword that is a end-of-word.
'''


GPT has a vocab size of 40478, so let's pick something in the same range and cap ours at 50000. I'm not sure what to use for min_frequency, so let's leave it at its default for now. We have no special_tokens. limit_alphabet caps the number of distinct characters kept in the initial alphabet; I'm not sure we need it, so I'm leaving it unset. continuing_subword_prefix and end_of_word_suffix are set to the same values we gave BPE.

In [43]:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    vocab_size=50000,
    show_progress=True,
    continuing_subword_prefix='##',
    end_of_word_suffix='##'
)

In [None]:
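def wiki_dataset_iterator(batch_size=1000):
    # Trial run: yield only the first 2,000 articles, in batches of `batch_size`.
    for i in range(0, 2000, batch_size):
        yield wiki_dataset[i : i + batch_size]['text']

# `length` is the total number of examples, used for progress reporting.
tokenizer.train_from_iterator(wiki_dataset_iterator(), trainer, length=2000)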




In [4]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(
    # vocab_size is not set here, so the library default applies.
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

In [5]:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

In [None]:
def wiki_dataset_iterator(batch_size=1000):
    for i in range(0, len(wiki_dataset), batch_size):
        yield wiki_dataset[i : i + batch_size]['text']

tokenizer.train_from_iterator(wiki_dataset_iterator(), trainer, length=len(wiki_dataset))




In [None]:
tokenizer.save("tokenizer-wiki.json")

In [12]:
tokenizer

<tokenizers.Tokenizer at 0x5644855f0360>

In [None]:
print('here')

In [35]:
from transformers import OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")

ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.


In [8]:
tokenizer

PreTrainedTokenizer(name_or_path='openai-gpt', vocab_size=40478, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '<unk>'})