# Introduction
In this notebook we pretrain a [RoBERTa](https://huggingface.co/transformers/model_doc/roberta.html#) model from scratch on Danish text using the Hugging Face `transformers` and `tokenizers` libraries. RoBERTa keeps the BERT architecture (the Hugging Face implementation differs only by a tiny embedding tweak) but is pretrained with a more carefully tuned recipe, for example without the next-sentence-prediction objective. Furthermore, RoBERTa uses a byte-level BPE tokenizer instead of BERT's WordPiece tokenizer.

This notebook is based on the following [Hugging Face notebook](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=UIvgZ3S6AO0z).  

The following steps are carried out:  
1. Setting Up Workspace  
2. Preparing Data  
3. Training BPE Tokenizer  
4. Configuring RoBERTa  
5. Training RoBERTa  
6. Testing Trained RoBERTa  
  
Links:  
RoBERTa Paper: https://arxiv.org/abs/1907.11692  
RoBERTa Huggingface: https://huggingface.co/transformers/model_doc/roberta.html#  
RoBERTa Blog Post: https://huggingface.co/blog/how-to-train

# 1. Setting Up Workspace
We install transformers directly from GitHub, because the released pip version can lag behind and may miss important features such as resuming from a checkpoint.

In [None]:
# Install transformers from master
!pip install git+https://github.com/huggingface/transformers

In [None]:
# Getting versions  
!pip list | grep -E 'transformers|tokenizers'

In [None]:
import glob
import numpy
import os
from pathlib import Path
import random
import re
from transformers import DataCollatorForLanguageModeling
from transformers import pipeline
from transformers import RobertaConfig
from transformers import RobertaForMaskedLM
from transformers import RobertaTokenizerFast
from transformers import Trainer, TrainingArguments
from tokenizers import ByteLevelBPETokenizer
import torch
from torch.utils.data import Dataset

In [None]:
# Checking that we have a GPU
!nvidia-smi

In [None]:
# Checking that PyTorch sees GPU
torch.cuda.is_available()

# 2. Preparing data
We will be using the Danish Wikipedia for training our model, and hopefully later the [Danish Gigaword Corpus](https://gigaword.dk/).
  
The training data needs to have one document per line, which is achieved using the following steps:  
1. Download the Danish Wikipedia dump  
2. Clean the XML file  
3. Divide the cleaned file into one file per article  
4. Concatenate the articles into train, valid and test files (the articles are shuffled, and each article is written on its own line)

In [None]:
# Downloading the Danish Wikipedia dump
!curl -O -J -L https://dumps.wikimedia.org/dawiki/20210401/dawiki-20210401-pages-articles.xml.bz2

In [None]:
# Unpacking file
!bzip2 -d dawiki-20210401-pages-articles.xml.bz2

In [None]:
# Installing wikiextractor to clean xml
!git clone https://github.com/attardi/wikiextractor.git

Wikiextractor has some problems with its imports and the following sed command fixes this:  

**Before:**  
from .extract import Extractor, ignoreTag, define_template, acceptedNamespaces  
  
**After:**  
from extract import Extractor, ignoreTag, define_template, acceptedNamespaces

In [None]:
# Replacing first occurrence of .extract with extract (the dot is escaped so that it only matches a literal ".")
!sed -i '0,/\.extract/{s/\.extract/extract/}' ./wikiextractor/wikiextractor/WikiExtractor.py

In [None]:
# Extracting articles to one big file
!cd wikiextractor/wikiextractor && python3 WikiExtractor.py ../../dawiki-20210401-pages-articles.xml --no-templates --processes 4 -b 100G -o ../../ 

In [None]:
# Moving files around and cleaning up
!mv AA/wiki_00 wiki_da.txt 
!rm -r AA dawiki-20210401-pages-articles.xml wikiextractor
!mkdir wikipedia_da

Now we create one file per article, using a regular expression to read each article's title from its header line.

In [None]:
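# Compile the header pattern once instead of on every line
title_re = re.compile(r'<doc id="\d+" url="https://da\.wikipedia\.org/wiki\?curid=\d+" title="([^"]+)">')
dest = "wikipedia_da/"
f = None
with open("wiki_da.txt", encoding="utf-8") as wiki:
    for i, l in enumerate(wiki):
        if i % 10000 == 0:
            print(i)
        if l.startswith('<doc id="'):
            # A new article starts: close the file of the previous article
            if f:
                f.close()
                f = None
            title = title_re.findall(l)[0].replace('/', '_').replace("'", "").replace('"', '')
            # Skip articles whose title would give an overly long file name
            if len(title) > 150:
                continue
            f = open(dest + title.replace(' ', '_') + '.txt', 'w', encoding="utf-8")
        # Drop closing tags and any line that belongs to a skipped article
        if l.startswith('</doc>') or f is None:
            continue
        # The <doc ...> header becomes the first line of the file; it is dropped when the splits are written
        f.write(l)
if f:
    f.close()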
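
To make the header handling concrete, here is a minimal sketch of what the pattern captures, run on a made-up header line. The id and title in `sample` are invented for illustration, and `title_to_filename` is a hypothetical helper that is not part of the notebook.

```python
import re

# The header pattern used above (with the dots in the host name escaped)
title_re = re.compile(r'<doc id="\d+" url="https://da\.wikipedia\.org/wiki\?curid=\d+" title="([^"]+)">')

def title_to_filename(header_line):
    """Return the file name for an article header, or None for any other line."""
    match = title_re.match(header_line)
    if match is None:
        return None
    # Same clean-up as above: no slashes or quotes, spaces become underscores
    title = match.group(1).replace('/', '_').replace("'", "").replace('"', '')
    return title.replace(' ', '_') + '.txt'

# Invented example header in the format that WikiExtractor emits
sample = '<doc id="123" url="https://da.wikipedia.org/wiki?curid=123" title="H.C. Andersen">\n'
print(title_to_filename(sample))
print(title_to_filename("Some ordinary line of text\n"))
```

Returning `None` for non-header lines makes it easy to spot a header format that the pattern does not cover before the whole dump is processed.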

In [None]:
# Listing articles and scrambling
article_paths = glob.glob('wikipedia_da/*.txt')
random.shuffle(article_paths)

In [None]:
# Dividing into train, valid and test (90 %, 5 %, 5 %)
train, valid, test = numpy.split(article_paths, [int(0.90*len(article_paths)), int(0.95*len(article_paths))])
print(len(train), len(valid), len(test))
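
The same 90 % / 5 % / 5 % split can be sketched with plain list slicing. This is only an illustration under two assumptions that are not in the notebook: a fixed `seed` to make the shuffle reproducible, and integer arithmetic for the cut points. `split_paths` and the toy file names are hypothetical.

```python
import random

def split_paths(paths, seed=0):
    """Shuffle paths reproducibly and cut them into 90 % / 5 % / 5 % parts."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    # Integer arithmetic avoids any float rounding in the cut points
    first_cut = len(paths) * 90 // 100
    second_cut = len(paths) * 95 // 100
    return paths[:first_cut], paths[first_cut:second_cut], paths[second_cut:]

# Made-up file names standing in for the real article paths
toy_paths = [f"article_{i}.txt" for i in range(1000)]
train_part, valid_part, test_part = split_paths(toy_paths)
print(len(train_part), len(valid_part), len(test_part))
```

With a fixed seed the three parts stay the same between runs, which matters if the tokenizer and the model are trained in separate sessions.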

In [None]:
# Writing train to file
with open('wiki_train.txt', 'w') as out_file:
    for file_path in train:
        with open(file_path) as in_file:
            # replacing line endings with whitespace
            lines = " ".join([l[:-1] + " " for l in in_file.readlines()[1:]])
            out_file.write(lines + "\n")

In [None]:
# Writing valid to file
with open('wiki_valid.txt', 'w') as out_file:
    for file_path in valid:
        with open(file_path) as in_file:
            # replacing line endings with whitespace
            lines = " ".join([l[:-1] + " " for l in in_file.readlines()[1:]])
            out_file.write(lines + "\n")

In [None]:
# Writing test to file
with open('wiki_test.txt', 'w') as out_file:
    for file_path in test:
        with open(file_path) as in_file:
            # replacing line endings with whitespace
            lines = " ".join([l[:-1] + " " for l in in_file.readlines()[1:]])
            out_file.write(lines + "\n")
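
The three cells above differ only in the list of articles and the output file, so they could be folded into one helper. The following is a sketch, not the notebook's code: `write_split` is a hypothetical name, the demo article is invented, and the whitespace between lines differs slightly from the cells above.

```python
import tempfile
from pathlib import Path

def write_split(article_paths, out_path):
    """Write each article as a single line, skipping its <doc ...> header line."""
    with open(out_path, "w", encoding="utf-8") as out_file:
        for file_path in article_paths:
            with open(file_path, encoding="utf-8") as in_file:
                body_lines = in_file.read().splitlines()[1:]
            # Join the article's lines with spaces so the whole article sits on one line
            out_file.write(" ".join(body_lines) + "\n")

# Small demo on an invented article in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    article = Path(tmp) / "Toy.txt"
    article.write_text(
        '<doc id="1" url="https://da.wikipedia.org/wiki?curid=1" title="Toy">\n'
        'Toy\n\nFirst line.\nSecond line.\n',
        encoding="utf-8",
    )
    out = Path(tmp) / "split.txt"
    write_split([article], out)
    demo_output = out.read_text(encoding="utf-8")
print(repr(demo_output))
```

In the notebook this would be called as `write_split(train, 'wiki_train.txt')`, and likewise for `valid` and `test`.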

# 3. Training BPE Tokenizer
Training a byte-level BPE tokenizer is preferred to the WordPiece tokenizer of BERT because it builds its vocabulary from an alphabet of single bytes. Every word can therefore be decomposed into tokens, which means that we for the most part avoid ```<unk>``` tokens. The special tokens are set in the order required by RoBERTa.

Training a Byte-Pair Encoding (BPE) tokenizer works as follows:  
1. Start with all the characters present in the training corpus as tokens (for the byte-level variant, the single bytes).
2. Identify the most common pair of adjacent tokens and merge it into one token.
3. Repeat until the vocabulary (i.e., the number of tokens) has reached the size we want.

Link: https://huggingface.co/docs/tokenizers/python/latest/quicktour.html#build-a-tokenizer-from-scratch
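
The merge loop described above can be sketched in a few lines of plain Python. This toy version is an illustration only: it works on characters rather than bytes, stops after a fixed number of merges instead of a target vocabulary size, and is not how the `tokenizers` library is implemented. `train_bpe` and the word counts are made up.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges merges from a dict of word -> frequency."""
    # Each word starts as a tuple of single characters
    corpus = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by the word frequency
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged into one symbol
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = new_corpus.get(tuple(merged), 0) + freq
        corpus = new_corpus
    return merges, corpus

toy_words = {"hus": 5, "huse": 3, "mus": 2}  # invented word counts
merges, segmented = train_bpe(toy_words, num_merges=2)
print(merges)     # [('u', 's'), ('h', 'us')]
print(segmented)
```

The pair `('u', 's')` is merged first because it occurs in all three words (10 times in total), after which `('h', 'us')` is the most frequent pair.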

In [None]:
# Listing files
paths = ['wiki_train.txt', 'wiki_valid.txt', 'wiki_test.txt']

In [None]:
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

In [None]:
# Customize training
tokenizer.train(files=paths, vocab_size=25_000, min_frequency=3, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

In [None]:
# Saving model
os.makedirs("DaBERTo", exist_ok=True)
tokenizer.save_model("DaBERTo")

In [None]:
# Testing the tokenizer
tokenizer = RobertaTokenizerFast('DaBERTo/vocab.json', 'DaBERTo/merges.txt')
ids = tokenizer.encode("Mit navn er Dronning Margrethe!")
tokens = tokenizer.batch_decode(ids)
print("Tokens:")
print(tokens)
print("IDs:")
print(ids)

We now have both a __vocab.json__, which is a list of the most frequent tokens ranked by frequency, and a __merges.txt__ list of merges.
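
To illustrate how the two files work together, here is a sketch of segmenting a word with an ordered merge list and then looking up the token ids. The merge list and vocabulary below are invented toy values rather than the contents of the real files, and `apply_merges` is a hypothetical helper working on characters instead of bytes.

```python
def apply_merges(word, merges):
    """Segment a word using an ordered merge list (earlier merges have priority)."""
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair that was learned earliest
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break  # no known merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

toy_merges = [("u", "s"), ("h", "us")]                    # plays the role of merges.txt
toy_vocab = {"h": 0, "m": 1, "e": 2, "us": 3, "hus": 4}   # plays the role of vocab.json
tokens = apply_merges("huse", toy_merges)
print(tokens, [toy_vocab[t] for t in tokens])  # ['hus', 'e'] [4, 2]
```

The merge list decides how a word is split, and the vocabulary only maps the resulting tokens to integer ids.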

# 4. Configuring RoBERTa 
Here we configure the model for the task of masked language modeling, i.e. predicting how to fill in arbitrary tokens that are randomly masked in the dataset. As we are training from scratch, we only initialize from a config, and not from an existing pretrained model or checkpoint. The masking itself is applied on the fly during training by the data collator defined further down. We keep the model very close to the default model: we just set the vocabulary size to match our tokenizer, and we ensure a context window of 512 tokens. In RoBERTa, position numbering starts after the padding index, so two extra position embeddings are needed and max_position_embeddings is set to 512 + 2 = 514.  
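
The random masking used for masked language modeling can be sketched in plain Python. This follows the usual BERT-style scheme (select 15 % of the tokens, then replace 80 % of those with the mask token, 10 % with a random token, and leave 10 % unchanged), which is, as far as we understand it, what `DataCollatorForLanguageModeling` applies. It is not the library's code. `mask_tokens`, the `seed` argument and the toy ids are assumptions made for the example.

```python
import random

def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_probability=0.15, seed=0):
    """Return (masked_ids, labels) for one sequence, BERT-style."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 marks positions that the loss ignores
    for i, token in enumerate(input_ids):
        # Special tokens are never selected; other tokens with probability mlm_probability
        if token in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = token  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_id                    # 80 %: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10 %: replace with a random token
        # remaining 10 %: keep the original token
    return masked, labels

# Made-up sequence: <s> (id 0), 100 ordinary token ids, </s> (id 2)
ids = [0] + list(range(5, 105)) + [2]
masked, labels = mask_tokens(ids, mask_id=4, vocab_size=25_000, special_ids={0, 1, 2, 3, 4})
print(sum(label != -100 for label in labels), "of", len(ids), "positions selected")
```

Because the selection is redone every time a batch is built, the model sees different masked positions for the same text in different epochs.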

Notes on GPU Memory and CUDA errors:  
* If your **chunk_size** is too large compared to the max_position_embeddings, you will get errors such as:
  * cuda error: CUBLAS_STATUS_NOT_INITIALIZED
  * cuda error device-side assert triggered
  * cuda error cublas_status_alloc_failed when calling cublascreate(handle)
* If you have not balanced **vocab_size**, **num_hidden_layers** and **chunk_size** well, you will get errors such as:
  * cuda out of memory. tried to allocate 20.00 mib....

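To solve the first kind of error, lower **chunk_size** so that chunk_size + 2 (the chunk plus its start and end tokens) does not exceed max_position_embeddings - 2, which is 512 here. To resolve the second kind, lower one or all of the mentioned parameters, or get a GPU with more memory.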
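
The link between **chunk_size** and max_position_embeddings becomes clearer when the position ids are written out. The sketch below mimics how RoBERTa-style models number positions (padding keeps the padding index, real tokens count upwards from the padding index plus one). It is a plain-Python illustration under that assumption, with `roberta_position_ids` as a hypothetical helper and made-up token ids.

```python
def roberta_position_ids(input_ids, padding_idx=1):
    """Number positions the RoBERTa way: padding keeps padding_idx,
    real tokens count upwards from padding_idx + 1."""
    position_ids, seen = [], 0
    for token in input_ids:
        if token == padding_idx:
            position_ids.append(padding_idx)
        else:
            seen += 1
            position_ids.append(padding_idx + seen)
    return position_ids

max_position_embeddings = 514

# <s> + 510 tokens + </s> = 512 ids, the largest sequence that fits
sequence = [0] + [10] * 510 + [2]
positions = roberta_position_ids(sequence)
print(max(positions), max(positions) < max_position_embeddings)  # 513 True

# One token too many: the largest position id no longer has an embedding
too_long = [0] + [10] * 511 + [2]
print(max(roberta_position_ids(too_long)) < max_position_embeddings)  # False
```

An out-of-range position id like the second case is a likely source of the device-side assert errors listed above, since the embedding lookup then indexes past the end of the table.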

In [None]:
tokenizer = RobertaTokenizerFast.from_pretrained("./DaBERTo", model_max_length=512)

In [None]:
config = RobertaConfig(
    vocab_size=25_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1
)

In [None]:
model = RobertaForMaskedLM(config=config)
model.num_parameters()

Here we write a custom dataset class that uses our trained tokenizer. The class furthermore cuts the tokenized corpus into chunks of 510 tokens and adds the special tokens ```<s>``` and ```</s>``` to the start and end of each chunk, giving sequences of 512 tokens. We throw away the final chunk instead of padding it to 512 tokens.
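
The chunking idea can be tried out on a toy list of token ids before running it on the full corpus. This is a simplified sketch with a hypothetical function name and a chunk size of 4 instead of 510. Unlike the class below, it only drops the final piece when that piece is actually shorter than `chunk_size`.

```python
def chunk_with_special_tokens(token_ids, chunk_size, bos_id=0, eos_id=2):
    """Cut token_ids into chunk_size pieces, wrap each in <s>/</s>, drop a short tail."""
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        piece = token_ids[start:start + chunk_size]
        if len(piece) < chunk_size:
            break  # discard the incomplete final piece instead of padding it
        chunks.append([bos_id] + piece + [eos_id])
    return chunks

toy_ids = list(range(10, 21))  # 11 made-up token ids
toy_chunks = chunk_with_special_tokens(toy_ids, chunk_size=4)
for chunk in toy_chunks:
    print(chunk)
# [0, 10, 11, 12, 13, 2]
# [0, 14, 15, 16, 17, 2]
```

Every chunk has the same length, so no padding and no attention mask are needed when the chunks are batched.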

In [None]:
class ChunkedTextDataset(Dataset):
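    def __init__(self, tokenizer_path: str, file_paths: list, chunk_size=510):
        # chunk_size is not a tokenizer argument, so it is not passed to from_pretrained
        tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_path)
        all_text = self.group_texts(file_paths)
        # Tokenize 1000 lines at a time
        chunks = self.chunker(all_text, 1000)
        all_tokens = []
        for chunk in chunks:
            # [1:-1] strips the <s> and </s> that encode adds around each batch of lines
            all_tokens.extend(tokenizer.encode('\n'.join(chunk))[1:-1])
        # Cut into chunk_size pieces, wrap each in <s>/</s> and drop the final piece
        self.chunks = [{"input_ids":torch.tensor(chunk, dtype=torch.long)} for chunk in self.chunker_add_seq_tokens(all_tokens, chunk_size)[:-1]]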
    
    @staticmethod
    def read_file(path):
        with open(path) as f:
            lines = f.readlines()
        return lines
      
    def group_texts(self, list_of_paths):
        # Concatenating all texts
        all_text = []
        for path in list_of_paths:
            all_text.extend(self.read_file(path))
        return all_text
    
    @staticmethod
    def chunker_add_seq_tokens(seq, size):
        chunks = [seq[pos:pos + size] for pos in range(0, len(seq), size)]
        for chunk in chunks:
            chunk.insert(0, 0)  # 0 is the id of <s>, the first special token given to the tokenizer trainer
            chunk.append(2)     # 2 is the id of </s>
        return chunks
    
    @staticmethod
    def chunker(seq, size):
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))
    
    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, i):
        return self.chunks[i]

In [None]:
dataset = ChunkedTextDataset(
    tokenizer_path='./DaBERTo/', 
    file_paths=["./wiki_train.txt"],
    chunk_size=510,
)

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

In [None]:
training_args = TrainingArguments(
    output_dir="./DaBERTo",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=20,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)
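
To get a feeling for what these settings mean in practice, the number of optimizer steps and periodic checkpoints can be estimated up front. The sketch below assumes a single GPU and no gradient accumulation. The dataset size of 200,000 chunks is a made-up number, and `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_examples, batch_size, epochs, save_steps):
    """Return (steps per epoch, total steps, number of periodic checkpoints)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return steps_per_epoch, total_steps, total_steps // save_steps

# 200_000 chunks is an invented dataset size; the other values mirror the cell above
schedule = training_schedule(num_examples=200_000, batch_size=20, epochs=3, save_steps=10_000)
print(schedule)  # (10000, 30000, 3)
```

With `save_total_limit=2`, only the two most recent of these checkpoints are kept on disk.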

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

# 5. Training RoBERTa

In [None]:
%%time
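trainer.train()
# Save the final model next to the tokenizer so that the pipeline below can load it from ./DaBERTo
trainer.save_model("./DaBERTo")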

# 6. Testing Trained RoBERTa
We are testing whether RoBERTa can fill in the ```<mask>``` in a sensible way.

In [None]:
fill_mask = pipeline(
    "fill-mask",
    model="./DaBERTo",
    tokenizer="./DaBERTo"
)

In [None]:
# The sun is so <mask>.
fill_mask("Solen er så <mask>.")
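
The pipeline returns one dictionary per candidate, with the keys `sequence`, `score`, `token` and `token_str`. A small helper can turn that into a compact ranking. The results below are invented to show the shape of the output and are not predictions from the trained model. `top_predictions` is a hypothetical helper.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 3)) for r in ranked[:k]]

# Invented output in the shape the pipeline returns: one dict per candidate
fake_results = [
    {"sequence": "Solen er så varm.", "score": 0.12, "token": 1001, "token_str": " varm"},
    {"sequence": "Solen er så stor.", "score": 0.31, "token": 1002, "token_str": " stor"},
    {"sequence": "Solen er så smuk.", "score": 0.07, "token": 1003, "token_str": " smuk"},
]
best_two = top_predictions(fake_results, k=2)
print(best_two)  # [('stor', 0.31), ('varm', 0.12)]
```

In the notebook this would be used as `top_predictions(fill_mask("Solen er så <mask>."))`.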