In this notebook we implement the zero-layer transformer from the Transformer Circuits framework post (https://transformer-circuits.pub/2021/framework/index.html). We would like to do this using an observer-pattern framework similar to PyTorch Ignite. Let's go through the steps in order:

 - get data
 - write model
 - write training loop
 - visualization
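
Before starting, here is a minimal sketch of what we mean by an observer pattern for training: an engine runs the loop and fires named events, and handlers registered for those events do the logging, evaluation, or plotting. The `Engine` class, its `on`/`fire`/`run` methods, and the event names are hypothetical and chosen for illustration only; this is not PyTorch Ignite's API.

```python
from collections import defaultdict


class Engine:
    """Toy training engine: runs a step function and notifies observers."""

    def __init__(self, step_fn):
        self.step_fn = step_fn              # called once per batch
        self.handlers = defaultdict(list)   # event name -> list of callbacks
        self.iteration = 0
        self.last_output = None

    def on(self, event, handler):
        # Register an observer for a named event.
        self.handlers[event].append(handler)

    def fire(self, event):
        # Notify every observer registered for this event.
        for handler in self.handlers[event]:
            handler(self)

    def run(self, batches):
        self.fire("started")
        for batch in batches:
            self.iteration += 1
            self.last_output = self.step_fn(batch)
            self.fire("iteration_completed")
        self.fire("completed")


# Usage: the "training step" here just sums the batch.
logged = []
engine = Engine(step_fn=lambda batch: sum(batch))
engine.on("iteration_completed", lambda e: logged.append((e.iteration, e.last_output)))
engine.run([[1, 2], [3, 4]])
```

The point of the design is that the loop itself never changes: logging and visualization are attached from the outside as handlers.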

# Get Data

According to the post, "The training dataset is as described in Kaplan et al. (A General Language Assistant as a Laboratory for Alignment)". Upon inspecting this paper, it is not clear exactly which dataset was used. The choice of dataset shouldn't matter much for LM pretraining, though we should expect our results to differ somewhat from the post's. Let's use Wikipedia, since it's available on Hugging Face.

The "bert-base-uncased" model page says, "The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers)." I'm not sure whether lists, tables, and headers are still present in the Wikipedia corpus we load here, so I'll do some simple filtering and reformatting to remove most of them.

In [1]:
from datasets import load_dataset

wiki_dataset = load_dataset("wikipedia", "20220301.en")
wiki_dataset = wiki_dataset['train']

Reusing dataset wikipedia (/data/users/bmak2/.cache/huggingface/datasets/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)



In [2]:
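def clean_wikipedia_formatting(ex, title_len_thresh=10):
    # Keep only lines with more than `title_len_thresh` words. Short lines are
    # mostly section titles, list items, and table fragments.
    lines = ex['text'].split('\n')
    kept = [line for line in lines if len(line.split()) > title_len_thresh]
    # Join the surviving paragraphs into a single string per article.
    return {'text': ' '.join(kept)}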

wiki_dataset = wiki_dataset.map(clean_wikipedia_formatting, num_proc=32)
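
To see what this filter does, here is a small sketch that applies the same word-count rule to a made-up article. The helper name `keep_long_lines` and the toy text are assumptions for illustration; only the rule (keep lines with more than 10 words) comes from the cell above.

```python
def keep_long_lines(text, min_words=10):
    # Same rule as the cleaning function: drop any line whose word count
    # is not strictly greater than `min_words`.
    lines = text.split('\n')
    return ' '.join(line for line in lines if len(line.split()) > min_words)


# Made-up article: a header, one real paragraph, and trailing section titles.
toy_article = (
    "History\n"
    "The town was founded in the early nineteenth century by settlers who arrived from the coast.\n"
    "See also\n"
    "External links"
)

cleaned = keep_long_lines(toy_article)
print(cleaned)
```

Only the 16-word paragraph survives. Note that the rule also drops genuine but short sentences, so this is a rough heuristic rather than a precise header filter.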



                                  


Now we need to train a BPE tokenizer on this dataset.

In [3]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

In [4]:
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

In [5]:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

In [None]:
from tqdm import tqdm

def wiki_dataset_iterator(batch_size=10000):
    for i in tqdm(range(0, len(wiki_dataset), batch_size)):
        yield wiki_dataset[i : i + batch_size]['text']

tokenizer.train_from_iterator(wiki_dataset_iterator(), trainer)
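
For intuition about what BPE training computes, here is a toy pure-Python sketch of the merge-learning loop: repeatedly find the most frequent adjacent symbol pair and merge it into a new symbol. This is an illustration under simplifying assumptions (whitespace-split words, no special tokens, no vocabulary size limit), not how the `tokenizers` library is implemented; the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(words):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    # Start from characters: each word is a tuple of single-character symbols.
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


toy_merges = learn_bpe("low low low lower lowest", num_merges=2)
print(toy_merges)
```

On this toy corpus the first two merges build up the common prefix `low`. Counting pairs over all of Wikipedia is the same idea at a much larger scale.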

Since training the tokenizer would take more than 7 hours, let's just use the pretrained GPT tokenizer for now.

In [16]:
from transformers import OpenAIGPTTokenizerFast

tokenizer = OpenAIGPTTokenizerFast.from_pretrained("openai-gpt")


In [28]:
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '5'
context_length = 16
    
    
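def tokenize(element):
    # Split each article into chunks of at most `context_length` tokens.
    # `return_overflowing_tokens=True` keeps every chunk, not just the first.
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep only full-length chunks so every training example has the same shape.
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        if length == context_length:
            input_batch.append(input_ids)
    return {"input_ids": input_batch}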


tokenized_wiki_dataset = wiki_dataset.map(
    tokenize, batched=True, remove_columns=wiki_dataset.column_names, num_proc=16
)
tokenized_wiki_dataset

                      


Dataset({
    features: ['input_ids'],
    num_rows: 229417657
})
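
The chunking logic above can be sketched in plain Python as follows. This is a simplified illustration, assuming non-overlapping windows and no special tokens added by the tokenizer; `chunk_fixed_length` is a hypothetical helper, not part of `transformers`.

```python
def chunk_fixed_length(token_ids, context_length):
    # Cut the sequence into consecutive windows of `context_length` tokens.
    chunks = [
        token_ids[i : i + context_length]
        for i in range(0, len(token_ids), context_length)
    ]
    # Drop the final window if it is shorter than `context_length`.
    return [chunk for chunk in chunks if len(chunk) == context_length]


toy_ids = list(range(10))  # stand-in for one article's token IDs
toy_chunks = chunk_fixed_length(toy_ids, context_length=4)
print(toy_chunks)
```

With 10 tokens and a window of 4, the last 2 tokens are discarded. At `context_length = 16` this throws away at most 15 tokens per article, which is negligible at this dataset size.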

In [29]:
tokenized_wiki_dataset.save_to_disk('tokenized_wiki_dataset')

Now we have our tokenized Wikipedia data. We can move on to implementing the model.