# NOTE: This notebook was adapted from an existing HuggingFace notebook (original here: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb) by Kelly Peterson for the 2024 DELPHI NLP course at the University of Utah. It has been changed to:
1. Use a smaller amount of data so that a model can be trained during a class session, at the cost of a non-optimal model.
2. Remove one of the tasks in the original notebook.
3. Show masked language model (MLM) predictions.

## Also note that you should change the runtime of your notebook to a GPU or TPU if it is not set already. Otherwise training could take 10x or 20x longer...

1. Above, select "Runtime"
2. Then select "Change Runtime Type"
3. In the dialog that opens, select "T4 GPU" or "TPU"
4. Then click OK

If you're opening this notebook on Colab, you will probably need to install HuggingFace Transformers and HuggingFace Datasets.

In [None]:
! pip install datasets transformers[torch] accelerate

Then you need to install Git-LFS.

In [None]:
!apt install git-lfs

Make sure your version of Transformers is at least 4.11.0 so this notebook will work. The cell below also imports the modules used later and prints the installed versions.

In [None]:
import math

import transformers

from transformers import AutoTokenizer
from transformers import Trainer, TrainingArguments

import torch

print(transformers.__version__)
print(torch.__version__)

# Train a language model

In this notebook, we'll see how to train a [HuggingFace Transformers](https://github.com/huggingface/transformers) model on a language modeling task. We will cover:

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the masked tokens to predict their value.

We will see how to easily load and preprocess the dataset for this task, and how to use the `Trainer` API to train a model on it.

This notebook assumes you will be using a pretrained tokenizer or that you have trained a tokenizer on the corpus you are using; see the [How to train a tokenizer](https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb) notebook ([open in colab](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb)).

## Preparing the dataset

For our model, we will use the Wikitext 2 dataset as an example. You can load it very easily with the 🤗 Datasets library.

In [None]:
from datasets import load_dataset

# This is really big... it would take us 30+ minutes to train, even on GPU...
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

You could theoretically replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. The example code below shows how you would do this with your own files:

In [None]:
# datasets = load_dataset("text", data_files={"train": "path_to_train.txt", "validation": "path_to_validation.txt"})

You can also load datasets from a CSV or a JSON file; see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split first, then give an index:

In [None]:
datasets["train"][10]

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

In [None]:
print(type(datasets['train']))

# Keep only a small slice of each split so that training fits in a class session
trimmed_train_dataset = datasets['train'].select(range(1000))
trimmed_test_dataset = datasets['test'].select(range(100))

print(len(trimmed_train_dataset))
print(len(trimmed_test_dataset))

As we can see, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.

## Masked language modeling

For masked language modeling (MLM) we are going to preprocess our dataset with one additional step: we will randomly mask some tokens (by replacing them by `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the `[MASK]` token is among the special tokens you passed during training!
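
To make the labeling rule concrete, here is a minimal, library-free sketch of masking. The `mask_tokens` helper, the example sentence, and the fixed seed are illustrative assumptions and not part of the Transformers API. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default, so positions carrying it do not contribute to the loss.

```python
import random

IGNORE_INDEX = -100  # label for positions the loss should skip


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for a list of string tokens.

    Hypothetical helper: each token is masked independently with
    probability mask_prob; only masked positions keep a real label.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)    # hide the token from the model
            labels.append(tok)           # ...and ask the model to recover it
        else:
            masked.append(tok)           # token stays visible
            labels.append(IGNORE_INDEX)  # nothing to predict here
    return masked, labels


tokens = "the milky way is a spiral galaxy".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

In the real pipeline the tokens are integer ids rather than strings, and the masking is applied later by the data collator rather than by hand.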

We will use the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model architecture for this example.

In [None]:
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"

Load our tokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)


We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to our trimmed train and test subsets, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [None]:
tokenized_train_dataset = trimmed_train_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])
tokenized_test_dataset = trimmed_test_dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [None]:
tokenized_train_dataset[1]
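
To show the shape of what a batched tokenizer call returns, here is a toy stand-in. The `toy_tokenize` function and `toy_vocab` are hypothetical and far simpler than a real subword tokenizer. The sketch assumes the BERT-style convention of wrapping every text in `[CLS]` and `[SEP]`, which is why even an empty line still produces a short sequence.

```python
def toy_tokenize(texts, vocab):
    """Mimic the output layout of a batched tokenizer call.

    Hypothetical whitespace tokenizer: returns one list per input text
    under each key, like the dict-of-lists a real tokenizer produces.
    """
    unk = vocab["[UNK]"]
    input_ids = [
        [vocab["[CLS]"]] + [vocab.get(w, unk) for w in text.split()] + [vocab["[SEP]"]]
        for text in texts
    ]
    return {
        "input_ids": input_ids,
        # 1 marks a real token; padding (not used here) would be marked 0
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }


toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "the": 3, "galaxy": 4}
batch = toy_tokenize(["the galaxy", ""], toy_vocab)
print(batch["input_ids"])       # [[1, 3, 4, 2], [1, 2]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1]]
```

The second input is an empty string, like the empty lines in Wikitext, and it still yields the two special tokens.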

Next, we group texts together and chunk them into samples of length `block_size`. You can skip this step if your dataset is composed of individual sentences.

Now for the harder part: we need to concatenate all our texts and then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option lets us change the number of examples in the datasets by returning a different number of examples than we received. This way, we can create our new samples from a batch of examples.

First, we could grab the maximum length our model was pretrained with (`tokenizer.model_max_length`). This might be a bit too big to fit in your GPU RAM, so here we use a smaller value of 128.

In [None]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # you can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
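
To see what the grouping does, here is the same logic run on a tiny hand-made batch. The `group_texts_toy` name, the explicit `block_size` parameter, and the toy ids are assumptions made so the sketch runs on its own without the notebook's global state.

```python
def group_texts_toy(examples, block_size):
    """Same steps as group_texts, with block_size passed in explicitly."""
    # Concatenate the per-example lists into one long list per key.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Round down to a multiple of block_size, dropping the remainder.
    total_length = (total_length // block_size) * block_size
    # Slice the long list into consecutive chunks of block_size tokens.
    result = {
        k: [t[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "texts" with 3 + 2 + 5 = 10 tokens in total.
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
grouped = group_texts_toy(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input examples become two output examples, and the trailing tokens `9` and `10` are dropped because they do not fill a whole block. Note that chunks can span the boundary between two original texts.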

In [None]:
lm_train_dataset = tokenized_train_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

lm_test_dataset = tokenized_test_dataset.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

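Next, we use a model suitable for masked LM. Building it from the config (rather than from pretrained weights) gives a randomly initialized model, so we are training from scratch: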

In [None]:
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForMaskedLM.from_config(config)

We define our `TrainingArguments`:

In [None]:
# we will not push our model to the Internet today... Maybe another day...
push_to_hub = False

training_args = TrainingArguments(
    "test-clm",
    evaluation_strategy="epoch",  # evaluate once per epoch; newer Transformers releases rename this argument to eval_strategy
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=push_to_hub,
)
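
Since the arguments above leave the batch size and epoch count unset, it can help to estimate how many optimizer steps training will run. The sketch below assumes the `TrainingArguments` defaults as I understand them (a per-device batch size of 8, 3 epochs, no gradient accumulation) and a hypothetical sample count; the real count is `len(lm_train_dataset)`.

```python
import math


def estimate_optimizer_steps(num_samples, batch_size=8, num_epochs=3, num_devices=1):
    """Rough step count: batches per epoch (last partial batch kept) times epochs.

    Hypothetical helper; the default values mirror what are assumed to be
    the TrainingArguments defaults and should be checked for your version.
    """
    batches_per_epoch = math.ceil(num_samples / (batch_size * num_devices))
    return batches_per_epoch * num_epochs


# 500 is a made-up sample count used only for illustration.
print(estimate_optimizer_steps(num_samples=500))  # 189
```

With 500 samples, each epoch has 63 batches (the last one partial), so three epochs give 189 steps.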

Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them into tensors. Here we also want it to do the random masking. We could do the masking as a preprocessing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure the random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
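
The effect of masking inside the collator can be sketched without the library. The `collate_with_masking` function below is a hypothetical simplification: it works on plain lists, ignores special tokens, and assumes the 80/10/10 replacement split from the BERT recipe (80% `[MASK]`, 10% random token, 10% unchanged), which I understand to be the collator's default behavior. The `mask_id` and `vocab_size` values are made up.

```python
import random

IGNORE_INDEX = -100  # label ignored by the loss


def collate_with_masking(batch, vocab_size, mask_id, rng, mlm_probability=0.15):
    """Apply fresh random masking to a batch of token-id lists."""
    inputs, labels = [], []
    for ids in batch:
        new_ids, new_labels = [], []
        for tok in ids:
            if rng.random() < mlm_probability:
                new_labels.append(tok)  # the model must predict the original id
                roll = rng.random()
                if roll < 0.8:
                    new_ids.append(mask_id)                    # 80%: [MASK]
                elif roll < 0.9:
                    new_ids.append(rng.randrange(vocab_size))  # 10%: random id
                else:
                    new_ids.append(tok)                        # 10%: unchanged
            else:
                new_ids.append(tok)
                new_labels.append(IGNORE_INDEX)
        inputs.append(new_ids)
        labels.append(new_labels)
    return {"input_ids": inputs, "labels": labels}


rng = random.Random(0)
batch = [list(range(10, 30))]  # one sample of 20 made-up token ids
epoch_1 = collate_with_masking(batch, vocab_size=100, mask_id=4, rng=rng)
epoch_2 = collate_with_masking(batch, vocab_size=100, mask_id=4, rng=rng)
print(epoch_1["labels"])
print(epoch_2["labels"])
```

Because the random generator advances between calls, the two passes over the same sample will usually select different positions, which is the point of masking at collation time instead of once during preprocessing.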

Then we just have to pass everything to `Trainer` and begin training:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_train_dataset,
    eval_dataset=lm_test_dataset,
    data_collator=data_collator,
)

# Here we go... time to bake this model in the oven... This will take time, even with a short dataset...

![Julia Child](https://images.e-flux-systems.com/sc1086_2_07_rosler--julia_child.jpg,400)

In [None]:
# This will take 1 to 2 minutes when training on GPU..
# Much faster than if we used the full dataset which would take 30+ minutes

trainer.train()

We can now evaluate our model on the evaluation set (here, the trimmed test split passed to `Trainer` as `eval_dataset`).

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

The perplexity is still quite high since for this demo we trained on a small dataset for a small number of epochs. For a real LM training, you would need a larger dataset and more epochs.
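
For intuition about the number printed above: perplexity is the exponential of the average negative log-likelihood, which is what `math.exp(eval_loss)` computes when the loss is a mean cross-entropy. The sketch below uses made-up probabilities (the probability the model assigned to each correct token) rather than real model output.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model guessing uniformly among 4 options has a perplexity of about 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is usually confident and right scores much lower.
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

A perplexity of roughly N can be read as the model being about as uncertain as a uniform choice among N tokens. For masked language modeling, the loss covers only the masked positions, so the value also varies a little from one evaluation run to the next as the masking changes.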

# Now for some fun experiments... Let's see how well our model fills in the blanks

In [None]:
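# Use the current GPU if one is available; otherwise fall back to CPU (device=-1)
device = torch.cuda.current_device() if torch.cuda.is_available() else -1
pred_model = transformers.pipeline('fill-mask', model=model, tokenizer=tokenizer, device=device)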

text = "The Milky Way is a [MASK] galaxy."

preds = pred_model(text)

#for pred in preds:
#    print(f">>> {pred['sequence']}")

## Now try your own.  See if the model has any different behavior with context...

In [None]:
# TODO make your own sentences and make sure to include this [MASK] token in your test sentence

your_own_text = "[MASK]"

preds = pred_model(your_own_text)

#for pred in preds:
#    print(f">>> {pred['sequence']}")
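
When writing your own sentences, a small check before calling the pipeline can save a confusing error. The `check_single_mask` helper below is hypothetical. It assumes the mask token is the literal string `[MASK]` and insists on exactly one occurrence, since a sentence with no mask is expected to make the pipeline raise an error, and several masks change the shape of the returned predictions.

```python
def check_single_mask(text, mask_token="[MASK]"):
    """Return text unchanged if it contains exactly one mask token."""
    count = text.count(mask_token)
    if count != 1:
        raise ValueError(
            f"Expected exactly one {mask_token} in the sentence, found {count}: {text!r}"
        )
    return text


checked = check_single_mask("Salt Lake City is the capital of [MASK].")
print(checked)
```

In the notebook, this could be used as `preds = pred_model(check_single_mask(your_own_text))`.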