This notebook is derived from the 🤗 notebook here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/language_modeling-tf.ipynb#scrollTo=QRTpmyCc3l_T

If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to install them.

In [None]:
! pip install transformers
! pip install datasets

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

Make sure your version of Transformers is at least 4.16.0 since some of the functionality we use was only introduced in that version.

In [None]:
import transformers

print(transformers.__version__)
import math

4.17.0


## Preparing the dataset

In [None]:
from datasets import load_dataset


In [None]:
from google.colab import drive  
drive.mount('/content/drive', force_remount=True)



Mounted at /content/drive


In [None]:
datasets = load_dataset("csv",
                        data_files={"train": "/content/drive/My Drive/Colab Notebooks/266-NLP-data/podcast_sentences_high_conf_20000_with_context.csv",
                                    "validation": "/content/drive/My Drive/Colab Notebooks/266-NLP-data/podcast_sentences_high_conf_2000_with_context_validation.csv",
                                    "test": "/content/drive/My Drive/Colab Notebooks/266-NLP-data/podcast_sentences_high_conf_2000_with_context_test.csv"})

FileNotFoundError: ignored

This notebook loads its splits from local CSV files. JSON files work the same way; see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

Printing the `DatasetDict` lists the splits and their columns. To access an actual element, you need to select a split first, then give an index (for example `datasets["train"][10]`):

In [None]:
print(datasets)

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(datasets["train"])

Each row holds one podcast sentence in the `string` column, alongside `confidence` and `source` columns and a leftover `Unnamed: 0` index column from the CSV export.

## Masked language modeling

For masked language modeling (MLM) we tokenize and chunk the dataset as usual, with one additional step: we randomly mask some tokens (replacing them with the tokenizer's mask token, which is `<mask>` for RoBERTa and `[MASK]` for BERT), and the labels are adjusted to include only the masked tokens (we don't have to predict the non-masked tokens).
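The masking step described above can be sketched in plain Python. This is a simplified illustration rather than the library's implementation: the helper name `mask_tokens`, the token ids, and `mask_id=0` are made up for the example, and every selected token is replaced by the mask id (the real collator in 🤗 Transformers replaces 80% of the selected tokens with the mask token, swaps 10% for a random token, and leaves 10% unchanged).

```python
import random

IGNORE_INDEX = -100  # label value that the Transformers loss skips


def mask_tokens(token_ids, mask_id, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            inputs.append(mask_id)       # hide the token from the model
            labels.append(tok)           # and ask the model to predict it
        else:
            inputs.append(tok)           # token stays visible
            labels.append(IGNORE_INDEX)  # no loss at this position
    return inputs, labels


token_ids = [11, 12, 13, 14, 15, 16, 17, 18]  # made-up ids
inputs, labels = mask_tokens(
    token_ids, mask_id=0, mlm_probability=0.5, rng=random.Random(0)
)
print(inputs)
print(labels)
```

At every position either the input is the mask id and the label is the original token, or the input is unchanged and the label is `-100`.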

The original notebook uses the [`distilroberta-base`](https://huggingface.co/distilroberta-base) model for this example; here we use `roberta-base` instead. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm):

NOTE: one of the commented-out alternatives below is the GoEmotions model `monologg/bert-base-cased-goemotions-original`. It is a PyTorch checkpoint, so loading it as a TensorFlow model requires the additional `from_pt=True` flag.


In [None]:
#model_checkpoint = "distilroberta-base"
#model_checkpoint = "monologg/bert-base-cased-goemotions-original"
model_checkpoint = "roberta-base"

We can apply the same tokenization function as before, we just need to update our tokenizer to use the checkpoint we just picked. Don't panic about the warnings about inputs being too long for the model - remember that we'll be breaking them into shorter chunks right afterwards!

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["string"])

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
#tokenizer = AutoTokenizer.from_pretrained("monologg/bert-base-cased-goemotions-original", from_pt=True)
tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=['Unnamed: 0', 'confidence', 'source', 'string']
)

NameError: ignored

And now, we group texts together and chunk them into samples of length `block_size`. You can skip this step if your dataset is composed of individual sentences.

We need to concatenate all our texts together, and then split the result into chunks of a fixed size, which we will call `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. When we use `batched=True`, the function we pass to `map()` receives multiple inputs at once and may return more or fewer examples than it was given. This allows us to create our new fixed-length samples.

We can use any `block_size` up to the maximum length our model was pretrained with, which is 512 tokens for `roberta-base`. This might be a bit too big to fit in your GPU RAM, though, so let's use something a bit smaller: 128.


In [None]:

# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:


In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, though you could add padding instead if the model supports it
    # In this, as in all things, we advise you to follow your heart
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
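To see what the grouping does, here is a toy run on a hand-made batch. The function `group_texts_toy` is a copy of the logic above that takes `block_size` as an argument so the example is self-contained; the token ids are invented.

```python
def group_texts_toy(examples, block_size):
    # Concatenate every column's lists into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three inputs of lengths 3, 5 and 2 become two samples of length 4; the last two tokens (9 and 10) do not fill a block and are dropped.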

Note that we duplicate the inputs for our labels here, without masking them. For masked language modeling this copy is only a placeholder: the `DataCollatorForLanguageModeling` used below overwrites `labels` at batch time, so that only the masked positions contribute to the loss.

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly).

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
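The effect of `batch_size` on the dropped remainder is simple arithmetic. The sketch below uses a hypothetical corpus of 5,000 examples with 30 tokens each; the helper name `dropped_tokens` and the numbers are illustrative and are not measurements from this dataset.

```python
def dropped_tokens(example_lengths, block_size, batch_size):
    """Count tokens discarded when each map() batch is cut to a multiple of block_size."""
    dropped = 0
    for start in range(0, len(example_lengths), batch_size):
        batch_total = sum(example_lengths[start : start + batch_size])
        dropped += batch_total % block_size
    return dropped


lengths = [30] * 5000  # hypothetical: 5,000 examples of 30 tokens each
small = dropped_tokens(lengths, block_size=128, batch_size=1000)
large = dropped_tokens(lengths, block_size=128, batch_size=5000)
print(small, large)  # 240 112
```

With batches of 1,000, each batch loses 48 tokens (30,000 mod 128), for 240 in total; a single batch of 5,000 loses only 112. Either way the loss is a tiny fraction of the 150,000 tokens.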

The rest is very similar to what we had, with two exceptions. First we use a model suitable for masked LM:

In [None]:
from transformers import TFAutoModelForMaskedLM

#model = TFAutoModelForMaskedLM.from_pretrained("monologg/bert-base-cased-goemotions-original", from_pt=True)
model = TFAutoModelForMaskedLM.from_pretrained(model_checkpoint)
#model =  TFAutoModelForMaskedLM.from_pretrained("/content/drive/My Drive/Colab Notebooks/266-NLP-data/goEmotion_pretrained_20000_13epoch")
#model =  TFAutoModelForMaskedLM.from_pretrained("/content/drive/My Drive/Colab Notebooks/266-NLP-data/goEmotion_pretrained_500000_2epoch")


All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


We define our `optimizer` and compile the model. We're using the model's internal loss, so no `loss` argument is passed to `compile()`.

In [None]:
from transformers import AdamWeightDecay
import tensorflow as tf

# Use `learning_rate`; the older `lr` alias is deprecated and triggers a Keras warning.
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.


Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them in tensors.  Here we want to randomly mask tokens. We could do it as a pre-processing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Note that the data collators are designed to work with multiple frameworks, so make sure you set the `return_tensors='tf'` argument to get TensorFlow tensors out - you don't want to accidentally get a load of `torch.Tensor` objects in the middle of your nice TF code!

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf"
)
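The difference between masking once during preprocessing and masking inside the collator can be illustrated without any library code. In this sketch the helper `random_mask_positions`, the seed, and the sequence length are assumptions made for the example; it only models which positions get selected, not the collator's full behavior.

```python
import random


def random_mask_positions(seq_len, mlm_probability, rng):
    """Pick which positions to mask in one pass over a sequence."""
    return [i for i in range(seq_len) if rng.random() < mlm_probability]


rng = random.Random(42)
seq_len = 128

# Static masking: positions chosen once during preprocessing, reused every epoch.
static = random_mask_positions(seq_len, 0.15, rng)
static_epochs = [static for _ in range(3)]

# Dynamic masking: positions re-drawn each time the batch is built.
dynamic_epochs = [random_mask_positions(seq_len, 0.15, rng) for _ in range(3)]

print(static_epochs[0] == static_epochs[1])    # True: same mask every epoch
print(dynamic_epochs[0] == dynamic_epochs[1])  # different masks per epoch
```

With dynamic masking the model sees a different set of hidden tokens for the same sample on every pass, which is what the collator gives us here.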

Now we generate our datasets as before. Remember to pass the `data_collator` you just made to the `collate_fn` argument.

In [None]:
train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

test_set = lm_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

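And now we fit our model! First we evaluate the unmodified checkpoint on the test set, so there is a baseline perplexity to compare against after training. (The original notebook also syncs with the Hub through a callback at this point; this version does not.)

In [None]:
# Baseline: test-set perplexity before any further pretraining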
eval_results = model.evaluate(test_set)
print(f"Perplexity: {math.exp(eval_results):.2f}")

Perplexity: 17.60


In [None]:



model.fit(train_set, validation_data=validation_set, epochs=13)

Epoch 1/13
Epoch 2/13
Epoch 3/13
Epoch 4/13
Epoch 5/13
Epoch 6/13
Epoch 7/13
Epoch 8/13
Epoch 9/13
Epoch 10/13
Epoch 11/13
Epoch 12/13
Epoch 13/13


<keras.callbacks.History at 0x7f520f612c90>

Like before, we can evaluate our model and compute perplexity, here on the test set. After 13 epochs the test perplexity drops from 17.60 to 8.36. MLM perplexities are generally much lower than those for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:


eval_results = model.evaluate(test_set)
print(f"Perplexity: {math.exp(eval_results):.2f}")

Perplexity: 8.36
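For reference, the perplexity printed above is just the exponential of the mean cross-entropy loss returned by `model.evaluate`. A minimal sketch of that relationship, using made-up per-token losses (the helper name `perplexity` is ours):

```python
import math


def perplexity(token_losses):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))


# A model that assigns probability 1/8 to every correct token has perplexity 8.
uniform_losses = [-math.log(1 / 8)] * 5
ppl = perplexity(uniform_losses)

# Going the other way: the mean losses behind the perplexities reported above.
loss_before = math.log(17.60)
loss_after = math.log(8.36)
print(f"{ppl:.2f} {loss_before:.2f} {loss_after:.2f}")
```

Read this way, a perplexity of 8.36 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly eight candidates for each masked token. The two reported perplexities correspond to mean losses of about 2.87 and 2.12.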


# Save model

In [None]:


drive.mount('/content/drive',force_remount=True)

model.save_pretrained("/content/drive/My Drive/Colab Notebooks/266-NLP-data/roberta_pretrained_20000_13epoch")

drive.flush_and_unmount()

Mounted at /content/drive


# Run with 500K

Reload cached model 

In [None]:

model =  TFAutoModelForMaskedLM.from_pretrained("/content/drive/My Drive/266-NLP-data/roberta_pretrained_20000_13epoch")
model.compile(optimizer=optimizer)

All model checkpoint layers were used when initializing TFRobertaForMaskedLM.

All the layers of TFRobertaForMaskedLM were initialized from the model checkpoint at /content/drive/My Drive/266-NLP-data/roberta_pretrained_20000_13epoch.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.
No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as keys in the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.


Load large corpus

In [None]:
from google.colab import drive  
drive.mount('/content/drive', force_remount=True)
datasets = load_dataset("csv", 
                         data_files={"train": "/content/drive/My Drive/266-NLP-data/podcast_sentences_high_conf_20000_with_context.csv",
                                     "validation": "/content/drive/My Drive/266-NLP-data/podcast_sentences_high_conf_2000_with_context_validation.csv",
                                     "test": "/content/drive/My Drive/266-NLP-data/podcast_sentences_high_conf_2000_with_context_test.csv"})

tokenized_datasets = datasets.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=['Unnamed: 0', 'confidence', 'source', 'string']
)

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

test_set = lm_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

validation_set = lm_datasets["validation"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Using custom data configuration default-37fd03610a554aa2


Mounted at /content/drive
Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-37fd03610a554aa2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-37fd03610a554aa2/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.



Get pre-training fit

In [None]:
eval_results = model.evaluate(test_set)
print(f"Perplexity: {math.exp(eval_results):.2f}")

Perplexity: 8.68


Run first epoch

In [None]:
model.fit(train_set, validation_data=validation_set, epochs=1)
eval_results = model.evaluate(test_set)
print(f"Perplexity: {math.exp(eval_results):.2f}")
model.save_pretrained("/content/drive/My Drive/266-NLP-data/roberta_pretrained_500000_1epoch")

Perplexity: 8.50


## Train second epoch


In [None]:
model.fit(train_set, validation_data=validation_set, epochs=1)
eval_results = model.evaluate(test_set)
print(f"Perplexity: {math.exp(eval_results):.2f}")
model.save_pretrained("/content/drive/My Drive/266-NLP-data/roberta_pretrained_500000_2epoch")

Perplexity: 8.96


Go to 10 epochs in total (8 more on top of the 2 above), just because we can

In [None]:
model.fit(train_set, validation_data=validation_set, epochs=8)
eval_results = model.evaluate(test_set)
print(f"Perplexity: {math.exp(eval_results):.2f}")
model.save_pretrained("/content/drive/My Drive/266-NLP-data/roberta_pretrained_500000_10epoch")

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
Perplexity: 9.74
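Since lower perplexity is better, the test-set numbers printed in this section can be compared directly. The snippet below only tabulates the values copied from the cell outputs above; the dictionary labels are ours.

```python
# Test-set perplexities copied from the cell outputs above.
perplexities = {
    "start of this section": 8.68,
    "1 epoch": 8.50,
    "2 epochs": 8.96,
    "10 epochs": 9.74,
}
best = min(perplexities, key=perplexities.get)
print(best, perplexities[best])  # 1 epoch 8.5
```

The checkpoint saved after the first epoch has the lowest test perplexity. The rise afterwards is consistent with overfitting, although a single run is not enough to be sure.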
