# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## Fine-tuning a masked language model

> However, there are a few cases where you’ll want to first fine-tune the language models on your data, before training a task-specific head. For example, if your dataset contains legal contracts or scientific articles, a vanilla Transformer model like BERT will typically treat the domain-specific words in your corpus as rare tokens, and the resulting performance may be less than satisfactory. By fine-tuning the language model on in-domain data you can boost the performance of many downstream tasks, which means you usually only have to do this step once!

Please see [Fine-tuning a masked language model](https://huggingface.co/learn/nlp-course/chapter7/3?fw=pt#fine-tuning-a-masked-language-model) in Chapter 7, Main NLP Tasks, of the 🤗 NLP Course.

### Picking a pretrained model for masked language modeling

> Although the BERT and RoBERTa family of models are the most downloaded, we’ll use a model called [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) that can be trained much faster with little to no loss in downstream performance. This model was trained using a special technique called [knowledge distillation](https://en.wikipedia.org/wiki/Knowledge_distillation), where a large “teacher model” like BERT is used to guide the training of a “student model” that has far fewer parameters...

Cf. the [English fill-mask models sorted by trending](https://huggingface.co/models?pipeline_tag=fill-mask&language=en&sort=trending) on 🤗 HF. You should see [`distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased) placed highly in the search results.

In [1]:
from transformers import AutoModelForMaskedLM

model_checkpoint = "distilbert-base-uncased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [2]:
distilbert_num_parameters = model.num_parameters() / 1_000_000

print(f"'>>> DistilBERT number of parameters: {round(distilbert_num_parameters)}M'")
print(f"'>>> BERT number of parameters: 110M'")

'>>> DistilBERT number of parameters: 67M'
'>>> BERT number of parameters: 110M'


In [3]:
model.config

DistilBertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "vocab_size": 30522
}

----

In [4]:
text = "This is a great [MASK]."

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

> With a tokenizer and a model, we can now pass our text example to the model, extract the logits, and print out the top 5 candidates:

##### Let's unpack this...

1. First, we tokenize the example `text` to get the `inputs` for our model
2. Passing the tokenized `inputs` to the model, we obtain logits for every token position
3. Use [`torch.where`](https://pytorch.org/docs/stable/generated/torch.where.html#torch-where) to locate the position of the `[MASK]` token (`tokenizer.mask_token_id`) in `input_ids`
4. Obtain the logits at the position of the `[MASK]` token
5. Finally, use [`torch.topk`](https://pytorch.org/docs/stable/generated/torch.topk.html) to find the indices of the top 5 token candidates that the model suggests could replace the `[MASK]` token

In [6]:
import torch

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")

'>>> This is a great deal.'
'>>> This is a great success.'
'>>> This is a great adventure.'
'>>> This is a great idea.'
'>>> This is a great feat.'
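
To make the five steps concrete without loading a model, here is a minimal pure-Python sketch on a toy example. The vocabulary and logit values are made up for illustration, and `heapq.nlargest` only stands in for `torch.topk`.

```python
import heapq

MASK = "[MASK]"

# Toy stand-ins (hypothetical values, not real model output)
tokens = ["[CLS]", "this", "is", "a", "great", MASK, ".", "[SEP]"]
vocab = ["deal", "success", "idea", "banana", "the"]
logits_at_mask = [4.2, 3.9, 3.1, -1.0, 0.3]  # one score per vocabulary entry

# Step 3: locate the [MASK] position (the role torch.where plays)
mask_positions = [i for i, tok in enumerate(tokens) if tok == MASK]

# Step 5: indices of the k highest logits (the role torch.topk plays)
top_k = heapq.nlargest(3, range(len(vocab)), key=lambda i: logits_at_mask[i])

# Map the winning indices back to words and fill them into the sentence
candidates = [vocab[i] for i in top_k]
for word in candidates:
    print(" ".join(tokens[1:-1]).replace(MASK, word))
```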


----

### The dataset

> To showcase domain adaptation, we’ll use the famous [Large Movie Review Dataset](https://huggingface.co/datasets/stanfordnlp/imdb) (or IMDb for short), which is a corpus of movie reviews that is often used to benchmark sentiment analysis models. By fine-tuning DistilBERT on this corpus, we expect the language model will adapt its vocabulary from the factual data of Wikipedia that it was pretrained on to the more subjective elements of movie reviews....




In [7]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
imdb_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [8]:
sample = imdb_dataset["train"].shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

> ✏️ Try it out! Create a random sample of the unsupervised split and verify that the labels are neither 0 nor 1. While you’re at it, you could also check that the labels in the train and test splits are indeed 0 or 1 — this is a useful sanity check that every NLP practitioner should perform at the start of a new project!

In [9]:
# why not simply check each dataset?
check_df = imdb_dataset["train"].to_pandas()
print(f"labels in 'train': {check_df.label.unique()}")

check_df = imdb_dataset["test"].to_pandas()
print(f"labels in 'test': {check_df.label.unique()}")

check_df = imdb_dataset["unsupervised"].to_pandas()
print(f"labels in 'unsupervised': {check_df.label.unique()}")

labels in 'train': [0 1]
labels in 'test': [0 1]
labels in 'unsupervised': [-1]


_**NOTE:** the Dataset Viewer for the [`stanfordnlp/imdb`](https://huggingface.co/datasets/stanfordnlp/imdb) dataset on 🤗 HF also lets you explore the data easily._

### Preprocessing the data

> <span style="background-color:#33FFFF">For both auto-regressive and masked language modeling, a common preprocessing step is to concatenate all the examples and then split the whole corpus into chunks of equal size.</span> This is quite different from our usual approach, where we simply tokenize individual examples. <p/>Why concatenate everything together?<br/> <span style="background-color:#33FFFF">The reason is that individual examples might get truncated if they’re too long, and that would result in losing information that might be useful for the language modeling task!</span>
> </p>
> ... we’ll first tokenize our corpus as usual, but without setting the <code>truncation=True</code> option in our tokenizer. We’ll also grab the word IDs if they are available (which they will be if we’re using a fast tokenizer, as described in Chapter 6), as we will need them later on to do whole word masking...
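
The trade-off described in the quote can be seen on a toy example. This is only a sketch: the token IDs and lengths are made up, and `truncate_each` and `concat_and_chunk` are hypothetical helper names, not part of 🤗 Transformers or 🤗 Datasets.

```python
def truncate_each(examples, max_len):
    """Usual approach: cut every example to at most max_len tokens."""
    return [ex[:max_len] for ex in examples]


def concat_and_chunk(examples, chunk_size):
    """Language-modeling approach: join everything, then cut equal chunks."""
    flat = [tok for ex in examples for tok in ex]
    usable = (len(flat) // chunk_size) * chunk_size  # drop the short remainder
    return [flat[i : i + chunk_size] for i in range(0, usable, chunk_size)]


# Three toy "tokenized reviews" of lengths 10, 7 and 3 (made-up token IDs)
examples = [list(range(10)), list(range(100, 107)), list(range(200, 203))]

chunks = concat_and_chunk(examples, chunk_size=4)
kept_by_truncation = sum(len(ex) for ex in truncate_each(examples, max_len=4))
kept_by_chunking = sum(len(c) for c in chunks)

print(f"tokens kept by truncation: {kept_by_truncation}")
print(f"tokens kept by chunking:   {kept_by_chunking}")
```

Truncation throws away the tail of every long example, while chunking loses at most one incomplete chunk for the whole batch.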




In [10]:
def tokenize_function(examples):
    result = tokenizer(examples["text"])
    if tokenizer.is_fast:
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

In [11]:
# Use batched=True to activate fast multithreading!
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

> Now that we’ve tokenized our movie reviews, the next step is to group them all together and split the result into chunks. But how big should these chunks be? <br/>
> ... a good starting point is to see what the model’s maximum context size is.

In [12]:
tokenizer.model_max_length

512

> ✏️ Try it out! Some Transformer models, like BigBird and Longformer, have a much longer context length than BERT and other early Transformer models. Instantiate the tokenizer for one of these checkpoints and verify that the `model_max_length` agrees with what’s quoted on its model card.

Why not simply look at `tokenizer_config.json` for each model?
* [`tokenizer_config.json` for `distilbert/distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased/blob/main/tokenizer_config.json): `model_max_length=512`
* [`tokenizer_config.json` for `google/bigbird-roberta-base`](https://huggingface.co/google/bigbird-roberta-base/blob/main/tokenizer_config.json): `model_max_length=4096`
* [`tokenizer_config.json` for `allenai/longformer-base-4096`](https://huggingface.co/google/bigbird-roberta-base/blob/main/tokenizer_config.json): `model_max_length=4096`

##### Checking `google/bigbird-roberta-base`

Please see [`google/bigbird-roberta-base`](https://huggingface.co/google/bigbird-roberta-base)

from huggingface_hub import notebook_login

notebook_login()

!pip install tiktoken protobuf sentencepiece

In [13]:
_check_tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")

print(f"google/bigbird-roberta-base, model_max_length: {_check_tokenizer.model_max_length}")

google/bigbird-roberta-base, model_max_length: 4096


<hr width=40%/>

##### Checking `allenai/longformer-base-4096`

Please see [`allenai/longformer-base-4096`](https://huggingface.co/allenai/longformer-base-4096)

In [14]:
_check_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

print(f"allenai/longformer-base-4096, model_max_length: {_check_tokenizer.model_max_length}")

print("wtf?!")

allenai/longformer-base-4096, model_max_length: 1000000000000000019884624838656
wtf?!


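... hm... That number is `int(1e30)`, the placeholder 🤗 Transformers reports when a tokenizer has no `model_max_length` configured, so it is not a usable context length. For this checkpoint the real limit is the 4,096 tokens in its name.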
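
A value this large is best treated as "not set". Below is a minimal sketch of a guard, assuming you know a sensible fallback from the model card; `resolve_max_length` and `UNSET_SENTINEL` are hypothetical names introduced here for illustration.

```python
# int(1e30) is the placeholder reported when no model_max_length is configured
UNSET_SENTINEL = int(1e30)


def resolve_max_length(reported, fallback):
    """Return the tokenizer's reported limit, or the fallback if it is unset."""
    return fallback if reported >= UNSET_SENTINEL else reported


# Values printed by the two tokenizer checks above
bigbird_reported = 4096
longformer_reported = 1000000000000000019884624838656

print(resolve_max_length(bigbird_reported, fallback=4096))
print(resolve_max_length(longformer_reported, fallback=4096))  # fallback taken from the model card
```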

----

> ...in order to run our experiments on GPUs like those found on Google Colab, we’ll pick something a bit smaller that can fit in memory
> <p/>
> Note that using a small chunk size can be detrimental in real-world scenarios, so you should use a size that corresponds to the use case you will apply your model to.

In [15]:
chunk_size = 128

Now comes the fun part. To show how the concatenation works, let’s take a few reviews from our tokenized training set and print out the number of tokens per review:

In [16]:
# Slicing produces a list of lists for each feature
tokenized_samples = tokenized_datasets["train"][:3]

for idx, sample in enumerate(tokenized_samples["input_ids"]):
    print(f"'>>> Review {idx} length: {len(sample)}'")

'>>> Review 0 length: 363'
'>>> Review 1 length: 304'
'>>> Review 2 length: 133'


##### Interesting use of Python `sum` to concatenate lists...

In [17]:
concatenated_examples = {
    k: sum(tokenized_samples[k], []) for k in tokenized_samples.keys()
}
total_length = len(concatenated_examples["input_ids"])
print(f"'>>> Concatenated reviews length: {total_length}'")

'>>> Concatenated reviews length: 800'
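
A side note on the `sum(..., [])` idiom: it works because `+` on lists concatenates, but it builds a new list at every step, so the cost grows quadratically with the number of lists. For a handful of reviews that does not matter. A sketch of a linear-time alternative with `itertools.chain`, on made-up data:

```python
from itertools import chain

# Toy nested lists standing in for per-review token IDs
nested = [[1, 2, 3], [4, 5], [6]]

# The idiom used above: start from [] and keep adding lists
via_sum = sum(nested, [])

# Equivalent result in a single pass over the elements
via_chain = list(chain.from_iterable(nested))

print(via_sum)
print(via_chain)
```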


##### Need to unpack this logic, too...

* Iterating through the key,value pairs in `concatenated_examples` (keys are: `input_ids`, `attention_mask`, `word_ids`)
* Create a new mapping of key `k`
* ... to a list of the corresponding values `v`
* ... but in slices of size `chunk_size` or less

In [18]:
chunks = {
    k: [v[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
    for k, v in concatenated_examples.items()
}

for chunk in chunks["input_ids"]:
    print(f"'>>> Chunk length: {len(chunk)}'")

'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 128'
'>>> Chunk length: 32'


> As you can see in this example, the last chunk will generally be smaller than the maximum chunk size. There are two main strategies for dealing with this:<br/>
> * Drop the last chunk if it’s smaller than `chunk_size`.
> * Pad the last chunk until its length equals `chunk_size`.
>
> We’ll take the first approach here, so let’s wrap all of the above logic in a single function that we can apply to our tokenized datasets:

In [19]:
def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}

    # Compute length of concatenated texts
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # We drop the last chunk if it's smaller than chunk_size
    total_length = (total_length // chunk_size) * chunk_size

    # Split into chunks of chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]
        for k, t in concatenated_examples.items()
    }

    # Create a new labels column
    result["labels"] = result["input_ids"].copy()

    return result

> Note that in the last step of `group_texts()` we create a new `labels` column which is a copy of the `input_ids` one. As we’ll see shortly, that’s because <span style="background-color:#33FFFF">in masked language modeling the objective is to predict randomly masked tokens in the input batch, and by creating a `labels` column we provide the ground truth for our language model to learn from</span>
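
For comparison, the second strategy (padding the last chunk) could look roughly like the sketch below. It is not what `group_texts()` does. The helper name `chunk_with_padding` is hypothetical, `pad_token_id=0` is taken from the model config shown earlier, and using `-100` as the label for padded positions is an assumption based on the ignore value mentioned later in these notes.

```python
def chunk_with_padding(input_ids, chunk_size, pad_token_id=0, ignore_index=-100):
    """Split input_ids into chunks, padding the last one up to chunk_size."""
    chunks = []
    for start in range(0, len(input_ids), chunk_size):
        ids = input_ids[start : start + chunk_size]
        n_pad = chunk_size - len(ids)  # 0 for every full chunk
        chunks.append(
            {
                "input_ids": ids + [pad_token_id] * n_pad,
                # Padding positions get attention 0 so the model ignores them
                "attention_mask": [1] * len(ids) + [0] * n_pad,
                # Padding positions get ignore_index so they add no loss
                "labels": ids + [ignore_index] * n_pad,
            }
        )
    return chunks


padded = chunk_with_padding(list(range(1, 11)), chunk_size=4)
for chunk in padded:
    print(chunk)
```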

In [20]:
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
lm_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 61291
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 59904
    })
    unsupervised: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 122957
    })
})

In [21]:
sample = lm_datasets["train"][1]

In [22]:
tokenizer.decode(sample["input_ids"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it ' s not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

In [23]:
tokenizer.decode(sample["labels"])

"as the vietnam war and race issues in the united states. in between asking politicians and ordinary denizens of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br / > < br / > what kills me about i am curious - yellow is that 40 years ago, this was considered pornographic. really, the sex and nudity scenes are few and far between, even then it ' s not shot like some cheaply made porno. while my countrymen mind find it shocking, in reality sex and nudity are a major staple in swedish cinema. even ingmar bergman,"

### Fine-tuning DistilBERT with the Trainer API

> Fine-tuning a masked language model is almost identical to fine-tuning a sequence classification model, like we did in Chapter 3. The only difference is that <span style="background-color:#33FFFF">we need a special data collator that can randomly mask some of the tokens in each batch of texts</span>. Fortunately, 🤗 Transformers comes prepared with a dedicated [`DataCollatorForLanguageModeling`](https://huggingface.co/docs/transformers/main/main_classes/data_collator#transformers.DataCollatorForLanguageModeling) for just this task. We just have to pass it the tokenizer and an `mlm_probability` argument that specifies what fraction of the tokens to mask. We’ll pick 15%, which is the amount used for BERT and a common choice in the literature

In [24]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

In [25]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] floors rented i am curious - yellow from my video store because of all the controversy that surrounded it when it was first released in 1967. i also heard that at first it was seized by u. s. customs [MASK] it ever tried to enter this country, therefore [MASK] a fan of films considered " controversial [MASK] i really had to see this for myself. < br / > [MASK] br / consists the [MASK] speculated centered around a young swedish drama student named lena who wants [MASK] learn everything she can about life. in particular she wants to focus her attentionimum to making some sort of documentary on what the average swede thought about certain political [MASK] such'

'>>> as the vietnam war and race issues in the [MASK] states. in between asking politicians and ordinary den [MASK]ns of stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men. < br [MASK] [MASK] < br [MASK] > [MASK] [MASK] me about i am curious - yellow is that 40 y
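
Not every selected token becomes `[MASK]`: the output above also contains odd substitutions such as `floors` and `attentionimum`. That is consistent with the BERT recipe, where a selected token is replaced by `[MASK]` 80% of the time, by a random token 10% of the time, and left unchanged otherwise. Below is a minimal pure-Python sketch of that recipe. `mask_tokens` is a hypothetical helper and not the library's implementation; the IDs 101, 102 and 103 are the `[CLS]`, `[SEP]` and `[MASK]` IDs of the uncased BERT vocabulary.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids=frozenset(),
                mlm_probability=0.15, seed=None):
    """Return (masked_ids, labels) following the 80/10/10 masking recipe."""
    rng = random.Random(seed)
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 means "no loss at this position"
    for i, tok in enumerate(input_ids):
        # Never select special tokens; select others with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return masked, labels


ids = [101] + list(range(1000, 1020)) + [102]  # made-up token IDs
masked, labels = mask_tokens(ids, mask_token_id=103, vocab_size=30522,
                             special_ids={101, 102}, seed=0)
print(masked)
print(labels)
```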

<hr width=40%/>

> When training models for masked language modeling, one technique that can be used is to mask whole words together, not just individual tokens. This approach is called _whole word masking_. <span style="background-color:#33FFFF">If we want to use whole word masking, we will need to build a data collator ourselves.</span> A data collator is just a function that takes a list of samples and converts them into a batch, so let’s do this now! We’ll use the word IDs computed earlier to make a map between word indices and the corresponding tokens, then randomly decide which words to mask and apply that mask on the inputs. Note that the labels are all `-100` except for the ones corresponding to mask words.

##### NOT TRUE!

Please see [`DataCollatorForWholeWordMask`](https://huggingface.co/docs/transformers/main/main_classes/data_collator#transformers.DataCollatorForWholeWordMask)

But for laughs, let's first try doing things the hard way:

In [26]:
import collections
import numpy as np

from transformers import default_data_collator

wwm_probability = 0.2


def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        # Create a map between words and corresponding token indices
        mapping = collections.defaultdict(list)
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # Randomly mask words
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)
        for word_id in np.where(mask)[0]:
            word_id = word_id.item()
            for idx in mapping[word_id]:
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)
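
The word-to-token mapping is the heart of the collator above, so here it is in isolation on a toy input. This is a sketch: `group_tokens_by_word` is a hypothetical name, and the `word_ids` list is an illustrative guess at how a word split into three sub-word tokens would look between `[CLS]` and `[SEP]`.

```python
from collections import defaultdict


def group_tokens_by_word(word_ids):
    """Map a running word index to the token positions that belong to it."""
    mapping = defaultdict(list)
    current_word_index = -1
    current_word = None
    for idx, word_id in enumerate(word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        if word_id != current_word:  # a new word starts at this token
            current_word = word_id
            current_word_index += 1
        mapping[current_word_index].append(idx)
    return dict(mapping)


# [CLS], one single-token word, one word split into three tokens, [SEP]
word_ids = [None, 0, 1, 1, 1, None]
mapping = group_tokens_by_word(word_ids)
print(mapping)
```

Masking word 1 then means masking token positions 2, 3 and 4 together, which is exactly what whole word masking asks for.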

In [27]:
samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented [MASK] am curious - yellow from my video store because of all [MASK] controversy that [MASK] it when it was first released in 1967. [MASK] also heard that at first it was seized by u [MASK] [MASK]. customs if it ever tried to enter this country, therefore being a fan of films considered [MASK] [MASK] " i really had to see this for myself. < br / [MASK] [MASK] [MASK] / > the plot is [MASK] around a young swedish drama student named lena who wants to learn everything she can [MASK] [MASK]. in particular she wants to focus her [MASK] [MASK] to making some sort of documentary on what [MASK] average swede thought about certain [MASK] issues such'

'>>> as the vietnam war and [MASK] issues in the united [MASK]. [MASK] between [MASK] politicians [MASK] ordinary denizens of [MASK] about their [MASK] on politics, she has sex [MASK] her drama teacher, classmates, and married men. < br / > < br / > what kills [MASK] about i am curious - yellow [MASK] that [MASK] years ago, th

<hr width=40%/>

##### And now using `DataCollatorForWholeWordMask`

In [28]:
from transformers import DataCollatorForWholeWordMask

data_collator_wholeword = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

In [29]:
samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")

for chunk in data_collator_wholeword(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")


'>>> [CLS] i rented i am curious - yellow from my [MASK] [MASK] 6000 of all the controversy that surrounded it [MASK] [MASK] was first released in 1967. i also heard karlsruhe at first it was [MASK] canadians u. s. customs if it ever tried to enter this country, [MASK] [MASK] a fan of films considered [MASK] controversial " i really [MASK] to see this for [MASK]. < br / > < br / > the [MASK] is centered around a young swedish drama student named lena who wants to learn everything she can [MASK] life. in particular she wants to focus her attentions to making [MASK] sort of documentary on what the average swede thought about [MASK] political [MASK] such'

'>>> as the vietnam [MASK] [MASK] race issues in the united states. in between asking politicians and ordinary [MASK] [MASK] [MASK] of stockholm about their opinions on politics, she has sex with her drama teacher, classmates [MASK] and married men. < br / > < [MASK] / > what kills me about i am curious - yellow is that 40 years ago, t



> Now that we have <strike>two</strike> _three_ data collators, the rest of the fine-tuning steps are standard. Training can take a while on Google Colab if you’re not lucky enough to score a mythical P100 GPU 😭, so we’ll first downsample the size of the training set to a few thousand examples. Don’t worry, we’ll still get a pretty decent language model! A quick way to downsample a dataset in 🤗 Datasets is via the `Dataset.train_test_split()` function that we saw in Chapter 5:

In [30]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'word_ids', 'labels'],
        num_rows: 1000
    })
})

In [31]:
from huggingface_hub import notebook_login

notebook_login()


#### Prep `TrainingArguments`

> FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead

In [32]:
from transformers import TrainingArguments

batch_size = 64

# Show the training loss with every epoch
logging_steps = len(downsampled_dataset["train"]) // batch_size

model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    eval_strategy="epoch", #evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)
#training_args

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


> Here we tweaked a few of the default options...<p/>
> * `logging_steps` to ensure we track the training loss with each epoch<br/>
> * `fp16=True` to enable mixed-precision training, which gives us another boost in speed<br/>
> * `Trainer` will remove any columns that are not part of the model’s `forward()` method... if you’re using the whole word masking collator, you’ll also need to set `remove_unused_columns=False` to ensure we don’t lose the `word_ids` column during training
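
The arithmetic behind `logging_steps` is worth spelling out. The sketch below assumes a single device, no gradient accumulation, and the default of 3 training epochs; the variable names are chosen here for illustration.

```python
import math

train_examples = 10_000  # size of the downsampled training split
batch_size = 64
num_epochs = 3           # TrainingArguments default for num_train_epochs

# Floor division, as in the cell above: log roughly once per epoch
logging_steps = train_examples // batch_size

# The last, smaller batch still counts as an optimizer step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"logging_steps:   {logging_steps}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```

Because of the floor division, logging fires one step before each epoch actually ends, which is close enough for tracking the loss per epoch.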

#### Prep `Trainer`

> FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

In [33]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator,
    processing_class=tokenizer #tokenizer=tokenizer,
)

In [34]:
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 21.94


In [35]:
trainer.train()

Epoch,Training Loss,Validation Loss,Model Preparation Time
1,2.6838,2.509436,0.0023
2,2.5878,2.450192,0.0023
3,2.5279,2.481931,0.0023


TrainOutput(global_step=471, training_loss=2.599245999775621, metrics={'train_runtime': 153.4678, 'train_samples_per_second': 195.481, 'train_steps_per_second': 3.069, 'total_flos': 994208670720000.0, 'train_loss': 2.599245999775621, 'epoch': 3.0})

In [36]:
eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

>>> Perplexity: 12.05
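
Perplexity here is just the exponential of the mean cross-entropy loss, so it can be recomputed from any loss value. The sketch below uses the epoch 3 validation loss from the training table. It lands slightly below the 12.05 printed above, which is plausibly explained by the collator drawing a fresh random mask at every evaluation.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp(mean cross-entropy loss in nats)."""
    return math.exp(mean_cross_entropy)


# Validation loss logged for epoch 3 in the table above
ppl_from_table = perplexity(2.481931)

# Loss implied by the perplexity printed after training
loss_from_reported_ppl = math.log(12.05)

print(f"perplexity from table loss: {ppl_from_table:.2f}")
print(f"loss implied by 12.05:      {loss_from_reported_ppl:.4f}")
```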


In [37]:
trainer.push_to_hub()

CommitInfo(commit_url='https://huggingface.co/buruzaemon/distilbert-base-uncased-finetuned-imdb/commit/b11d5578a9ebe9f896215fd68f0ccdac5a358955', commit_message='End of training', commit_description='', oid='b11d5578a9ebe9f896215fd68f0ccdac5a358955', pr_url=None, repo_url=RepoUrl('https://huggingface.co/buruzaemon/distilbert-base-uncased-finetuned-imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='buruzaemon/distilbert-base-uncased-finetuned-imdb'), pr_revision=None, pr_num=None)

<hr width=40%/>

> ✏️ Your turn! Run the training above after changing the data collator to the whole word masking collator. Do you get better results?


##### OK, how about trying that with _whole word masking_? 

In [38]:
training_args_wwm = TrainingArguments(
    output_dir=f"{model_name}-whole-word-masking-finetuned-imdb",
    overwrite_output_dir=True,
    eval_strategy="epoch", #evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=True,
    fp16=True,
    logging_steps=logging_steps,
)



In [39]:
trainer_wwm = Trainer(
    model=model,  # NOTE: this is the model already fine-tuned in the run above, not a fresh checkpoint
    args=training_args_wwm,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    data_collator=data_collator_wholeword, # this is that DataCollatorForWholeWordMask API
    processing_class=tokenizer #tokenizer=tokenizer,
)

In [40]:
eval_results = trainer_wwm.evaluate()
print(f">>> Evaluating whole-word masking Perplexity: {math.exp(eval_results['eval_loss']):.2f}")



>>> Evaluating whole-word masking Perplexity: 16.58


In [41]:
trainer_wwm.train()

eval_results = trainer_wwm.evaluate()
print(f">>>Fine-tuned whole-word masking Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

trainer_wwm.push_to_hub()

Epoch,Training Loss,Validation Loss,Model Preparation Time
1,2.823,2.68529,0.002
2,2.7692,2.662086,0.002
3,2.7445,2.639881,0.002




>>>Fine-tuned whole-word masking Perplexity: 14.47


CommitInfo(commit_url='https://huggingface.co/buruzaemon/distilbert-base-uncased-whole-word-masking-finetuned-imdb/commit/318117427a64ca52a2ffac57b2cf6706396d8dcb', commit_message='End of training', commit_description='', oid='318117427a64ca52a2ffac57b2cf6706396d8dcb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/buruzaemon/distilbert-base-uncased-whole-word-masking-finetuned-imdb', endpoint='https://huggingface.co', repo_type='model', repo_id='buruzaemon/distilbert-base-uncased-whole-word-masking-finetuned-imdb'), pr_revision=None, pr_num=None)

### Fine-tuning DistilBERT with 🤗 Accelerate

> ... However, we saw that `DataCollatorForLanguageModeling` also applies random masking with each evaluation, so we’ll see some fluctuations in our perplexity scores with each training run. One way to eliminate this source of randomness is to apply the masking once on the whole test set, and then use the default data collator in 🤗 Transformers to collect the batches during evaluation. To see how this works, let’s implement a simple function that applies the masking on a batch, similar to our first encounter with `DataCollatorForLanguageModeling`

Recall that `data_collator` is that instance of `DataCollatorForLanguageModeling` we set up first...

In [42]:
def insert_random_mask(batch):
    features = [
        dict(zip(batch, t)) 
        for t in zip(*batch.values())
    ]
    
    masked_inputs = data_collator(features)
    
    # Create a new "masked" column for each column in the dataset
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
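
The `dict(zip(batch, t)) for t in zip(*batch.values())` expression is dense. With `batched=True`, `map()` hands the function a dict of columns, while a data collator expects a list of per-example dicts, so the expression transposes one into the other. A toy illustration with made-up values:

```python
# What map(..., batched=True) passes in: one list per column
batch = {
    "input_ids": [[101, 7, 8], [9, 10, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

# zip(*batch.values()) walks the columns in parallel, one example at a time;
# zip(batch, t) pairs each column name with that example's value
features = [dict(zip(batch, t)) for t in zip(*batch.values())]

for feature in features:
    print(feature)
```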

In [43]:
downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])

eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)

eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
)

In [44]:
downsampled_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 1000
    })
})

In [45]:
downsampled_dataset['train']

Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 10000
})

In [46]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

batch_size = 64

train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

In [47]:
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

In [48]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

In [49]:
from accelerate import Accelerator

accelerator = Accelerator()

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [50]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
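
As a rough illustration of what the `"linear"` schedule with `num_warmup_steps=0` does, the sketch below reproduces the decay rule in plain Python: the learning rate starts at the optimizer's base value and falls linearly to zero at `num_training_steps`. The function `linear_lr` is a hypothetical stand-in written for this note, not the library implementation, and the step count assumes a single process with the 10,000 training rows and batch size of 64 used above.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    # Decay phase: fall linearly to 0, never going negative
    return base_lr * max(0.0, remaining / decay_span)


# Assumption: one process, so each epoch has ceil(10_000 / 64) batches
steps_per_epoch = math.ceil(10_000 / 64)
num_training_steps = 3 * steps_per_epoch

for step in (0, steps_per_epoch, 2 * steps_per_epoch, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```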

In [51]:
from huggingface_hub import get_full_repo_name

model_name = "distilbert-base-uncased-finetuned-imdb-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate'

> FutureWarning: 'Repository' (from 'huggingface_hub.repository') is deprecated and will be removed from version '1.0'. Please prefer the http-based alternatives instead. Given its large adoption in legacy code, the complete removal is only planned on next major release.
> For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.

##### Using `huggingface_hub.HfApi`

In [52]:
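from huggingface_hub import HfApi

output_dir = model_name

api = HfApi()

# Make sure the target repo exists before uploading (no-op if it already does)
api.create_repo(repo_id=repo_name, exist_ok=True)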

In [53]:
from tqdm.auto import tqdm
import torch
import math

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        #repo.push_to_hub(
        #    commit_message=f"Training in progress epoch {epoch}", blocking=False
        #)
        future = api.upload_folder( # Upload in the background (non-blocking action)
            repo_id=repo_name,
            folder_path=output_dir,
            run_as_future=True,
            commit_message=f"Training in progress epoch {epoch}"
        )


>>> Epoch 0: Perplexity: 11.263709812307237
>>> Epoch 1: Perplexity: 10.754227971200459
>>> Epoch 2: Perplexity: 10.55502901619238
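
The `loss.repeat(batch_size)` plus truncation step in the evaluation loop can look odd at first. Each batch reports a single mean loss, so repeating it `batch_size` times gives one entry per example, and slicing to `len(eval_dataset)` trims the surplus entries from the final, smaller batch. The sketch below mimics this in plain Python for the single-process case, using the 1,000 evaluation rows and batch size of 64 from above. The per-batch loss values are made-up placeholders and `mean_of_repeated_losses` is a hypothetical helper; with several processes the gathered order differs, so treat this as an approximation of the idea rather than of the exact behavior.

```python
import math


def mean_of_repeated_losses(batch_losses, batch_size, num_examples):
    """Repeat each batch-mean loss batch_size times, truncate, then average."""
    repeated = []
    for loss in batch_losses:
        repeated.extend([loss] * batch_size)
    # Drop the extra entries contributed by the short final batch
    repeated = repeated[:num_examples]
    return sum(repeated) / len(repeated)


num_examples, batch_size = 1000, 64
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size

# Placeholder losses: every full batch scores 2.0, the short last batch 3.0
batch_losses = [2.0] * (num_batches - 1) + [3.0]

mean_loss = mean_of_repeated_losses(batch_losses, batch_size, num_examples)
size_weighted = (
    2.0 * (num_examples - last_batch_size) + 3.0 * last_batch_size
) / num_examples
perplexity = math.exp(mean_loss)

print(num_batches, last_batch_size)
print(mean_loss, size_weighted)
```

In this setting the truncated mean coincides with the mean weighted by true batch sizes, so the short last batch is not over-counted.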


In [54]:
api.upload_folder(
    repo_id=repo_name,
    folder_path=output_dir,
    commit_message="Training completed!"
)

CommitInfo(commit_url='https://huggingface.co/buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate/commit/71cd5d36c8d6fa5e067e47b833fa89b4cc30e8fa', commit_message='Training completed!', commit_description='', oid='71cd5d36c8d6fa5e067e47b833fa89b4cc30e8fa', pr_url=None, repo_url=RepoUrl('https://huggingface.co/buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate', endpoint='https://huggingface.co', repo_type='model', repo_id='buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate'), pr_revision=None, pr_num=None)

### Using our fine-tuned model

In [55]:
from transformers import pipeline

mask_filler = pipeline(
    "fill-mask", 
    model=repo_name
)

Device set to use cuda:0


In [56]:
print(text)

This is a great [MASK].


In [57]:
preds = mask_filler(text)

for pred in preds:
    print(f">>> {pred['sequence']}")

>>> this is a great film.
>>> this is a great movie.
>>> this is a great idea.
>>> this is a great one.
>>> this is a great adventure.
