# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## Fine-tuning a masked language model

### Try it out: extra credit work at end of this section

> ✏️ Try it out! To quantify the benefits of domain adaptation, fine-tune a classifier on the IMDb labels for both the pretrained and fine-tuned DistilBERT checkpoints. If you need a refresher on text classification, check out Chapter 3.
>
> Much of what follows is adapted from the [Text classification section of the Tasks Guide](https://huggingface.co/docs/transformers/v4.41.2/en/tasks/sequence_classification#text-classification) at 🤗 HF.

### The model checkpoints

We will use the pretrained [`distilbert/distilbert-base-uncased`](https://huggingface.co/distilbert/distilbert-base-uncased) and, eating our own dog food, the domain-adapted [`buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate`](https://huggingface.co/buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate).

In [1]:
model_checkpoint_1 = "distilbert-base-uncased"
model_checkpoint_2 = "buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate"

### Preprocessing the data

#### Obtain the raw dataset(s)

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [3]:
raw_dataset_train = raw_datasets["train"]

sample = raw_dataset_train.shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")


'>>> Review: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...'
'>>> Label: 1'

'>>> Review: This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stu

#### Tokenize the raw dataset

Following the example in Chapter 3, *Fine-tuning a pretrained model*, we opt to use both padding and truncation.

However, please note:
* _truncation_ will happen in our `tokenize_function`
* _padding_ will be dynamic, which we will address at the batch-level with a `DataCollatorWithPadding`
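
To make "dynamic padding" concrete, here is a minimal pure-Python sketch of the idea: each batch is padded only up to its own longest sequence. The helper `pad_batch` and the toy token ids are illustrative assumptions, not part of 🤗 Transformers; `DataCollatorWithPadding` does the real work in this notebook.

```python
# Minimal sketch of dynamic padding: each batch is padded only to the
# length of its own longest sequence, not to a global maximum.
PAD_TOKEN_ID = 0  # assumption: the [PAD] token id of the tokenizer in use


def pad_batch(batch_input_ids, pad_token_id=PAD_TOKEN_ID):
    """Pad token-id lists to the longest one and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids (hypothetical, not real tokenizer output).
short_batch = pad_batch([[101, 7592, 102], [101, 102]])
long_batch = pad_batch([[101, 1, 2, 3, 4, 102], [101, 5, 102]])

print(len(short_batch["input_ids"][0]))  # 3
print(len(long_batch["input_ids"][0]))   # 6
```

The two batches end up with different widths (3 and 6), which is why padding per batch tends to waste less computation than padding the whole dataset to one fixed length.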

In [4]:
from transformers import AutoTokenizer

tokenizer_1 = AutoTokenizer.from_pretrained(model_checkpoint_1)
tokenizer_2 = AutoTokenizer.from_pretrained(model_checkpoint_2)

In [5]:
def _create_tokenizer_function(tokenizer):
    def tokenize_function(examples):
        result = tokenizer(
            examples["text"],
            truncation=True
        )
        if tokenizer.is_fast:
            result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
        return result
    return tokenize_function

In [10]:
tokenize_function_1 = _create_tokenizer_function(tokenizer_1)

# Use batched=True to activate fast multithreading!
tokenized_datasets_1 = raw_datasets.map(
    tokenize_function_1, batched=True, remove_columns=["text"]
)
tokenized_datasets_1

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

In [11]:
tokenize_function_2 = _create_tokenizer_function(tokenizer_2)

# Use batched=True to activate fast multithreading!
tokenized_datasets_2 = raw_datasets.map(
    tokenize_function_2, batched=True, remove_columns=["text"]
)
tokenized_datasets_2

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 50000
    })
})

#### Downsample the datasets

In [12]:
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset_1 = tokenized_datasets_1["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset_1

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
})

In [13]:
downsampled_dataset_2 = tokenized_datasets_2["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
downsampled_dataset_2

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
})

### 🤗 `Trainer` API

The general flow of set-up is this:

1. Get a dataset
2. Get a tokenizer corresponding to the sequence classification model you wish to fine-tune
3. Wrap that tokenizer in a preprocessing function (we address truncation of the input texts here)
4. Using `map` and that function, convert the raw dataset into a tokenized one
5. Prepare `TrainingArguments`: the local directory for saved checkpoints, the learning rate, and any other arguments from our example scripts _which relate to the training loop itself_.
6. Prepare `Trainer`: specify the model, training arguments (see above), training & eval datasets, processing class for tokenizer, data collator for creating batches, function for computing metrics, etc.

In [34]:
from transformers import (
    AutoModelForSequenceClassification, 
    TrainingArguments, 
    Trainer,
    DataCollatorWithPadding
)

In [15]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [16]:
model_1 = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_1, num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
model_2 = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_2, num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at buruzaemon/distilbert-base-uncased-finetuned-imdb-accelerate and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
from transformers import DataCollatorWithPadding

data_collator_1 = DataCollatorWithPadding(tokenizer=tokenizer_1)
data_collator_2 = DataCollatorWithPadding(tokenizer=tokenizer_2)

In [19]:
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
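
As a rough illustration of what `compute_metrics` does, the sketch below reproduces the same two steps (argmax over the logits, then the fraction of correct predictions) in plain Python. The helper names and the toy logits are assumptions made up for this example.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Turn per-class logits into predictions, then score them."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy logits for four reviews, columns are (NEGATIVE, POSITIVE).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # {'accuracy': 0.75}
```

The third toy review is predicted POSITIVE but labelled NEGATIVE, so three of four predictions are correct.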

In [20]:
batch_size = 64
num_epochs = 3
learning_rate = 2e-5

In [21]:
training_args = TrainingArguments(
    output_dir="ch7_extra_credit_pretrained_imbd_classifier",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer_1 = Trainer(
    model=model_1,
    args=training_args,
    train_dataset=downsampled_dataset_1["train"],
    eval_dataset=downsampled_dataset_1["test"],
    processing_class=tokenizer_1,
    data_collator=data_collator_1,
    compute_metrics=compute_metrics,
)

trainer_1.train()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.246377 | 0.908 |
| 2 | No log | 0.266727 | 0.897 |
| 3 | No log | 0.259856 | 0.912 |


TrainOutput(global_step=471, training_loss=0.23251174311222797, metrics={'train_runtime': 1448.6633, 'train_samples_per_second': 20.709, 'train_steps_per_second': 0.325, 'total_flos': 3974005401255168.0, 'train_loss': 0.23251174311222797, 'epoch': 3.0})
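
The `global_step=471` above can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept. It also assumes the `TrainingArguments` default of `logging_steps=500`, which would explain why the training loss column reads "No log".

```python
import math

train_examples = 10_000   # size of the downsampled training split
batch_size = 64           # per_device_train_batch_size, one device assumed
num_epochs = 3

# The last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 157 471

# Assumption: logging_steps was left at its default of 500, so the run
# finishes before the first training-loss log would be written.
default_logging_steps = 500
print(total_steps < default_logging_steps)  # True
```

If per-epoch training loss is wanted, setting `logging_steps` to a value below 471 (or `logging_strategy="epoch"`) in `TrainingArguments` should produce it.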

In [22]:
trainer_1.evaluate()

{'eval_loss': 0.24637728929519653,
 'eval_accuracy': 0.908,
 'eval_runtime': 14.0105,
 'eval_samples_per_second': 71.375,
 'eval_steps_per_second': 1.142,
 'epoch': 3.0}

<hr width=40%/>

In [23]:
%%time

training_args = TrainingArguments(
    output_dir="ch7_extra_credit_finetuned_imbd_classifier",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer_2 = Trainer(
    model=model_2,
    args=training_args,
    train_dataset=downsampled_dataset_2["train"],
    eval_dataset=downsampled_dataset_2["test"],
    processing_class=tokenizer_2,
    data_collator=data_collator_2,
    compute_metrics=compute_metrics,
)

trainer_2.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.244692 | 0.9 |
| 2 | No log | 0.262589 | 0.893 |
| 3 | No log | 0.264029 | 0.916 |


CPU times: user 24min 2s, sys: 4.87 s, total: 24min 7s
Wall time: 24min 9s


TrainOutput(global_step=471, training_loss=0.2162104969065154, metrics={'train_runtime': 1448.7462, 'train_samples_per_second': 20.708, 'train_steps_per_second': 0.325, 'total_flos': 3974005401255168.0, 'train_loss': 0.2162104969065154, 'epoch': 3.0})

In [25]:
trainer_2.evaluate()

{'eval_loss': 0.2446923553943634,
 'eval_accuracy': 0.9,
 'eval_runtime': 21.3142,
 'eval_samples_per_second': 46.917,
 'eval_steps_per_second': 0.751,
 'epoch': 3.0}
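
Both `evaluate()` calls report the epoch 1 numbers rather than the epoch 3 ones. A likely reason is that `load_best_model_at_end=True`, with no `metric_for_best_model` set, restores the checkpoint with the lowest validation loss. The sketch below replays that selection on the logged values. The run names `pretrained` and `domain-adapted` and the helper `best_epoch` are labels invented for this example.

```python
# (epoch, validation loss, accuracy) copied from the two training logs above.
runs = {
    "pretrained": [(1, 0.246377, 0.908), (2, 0.266727, 0.897), (3, 0.259856, 0.912)],
    "domain-adapted": [(1, 0.244692, 0.900), (2, 0.262589, 0.893), (3, 0.264029, 0.916)],
}


def best_epoch(rows):
    """Pick the row with the lowest validation loss (assumed selection rule)."""
    return min(rows, key=lambda row: row[1])


for name, rows in runs.items():
    epoch, loss, acc = best_epoch(rows)
    final_acc = rows[-1][2]
    print(f"{name}: best epoch {epoch}, loss {loss}, "
          f"accuracy {acc} (final-epoch accuracy {final_acc})")
```

With only 1,000 evaluation reviews, the gaps between the two checkpoints (0.908 vs 0.9 at the best-loss epoch, 0.912 vs 0.916 at the final epoch) amount to a handful of examples, so this single run is weak evidence either way about the benefit of domain adaptation.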

### Full training via `PyTorch` API

* We will re-use the downsampled datasets and data collators

In [None]:
raw_dataset_train = raw_datasets["train"]

sample = raw_dataset_train.shuffle(seed=42).select(range(3))

for row in sample:
    print(f"\n'>>> Review: {row['text']}'")
    print(f"'>>> Label: {row['label']}'")

#### Tokenization

Following the example in Chapter 3, *Fine-tuning a pretrained model*, we opt to use both padding and truncation.

However, please note:
* _truncation_ will happen in our `tokenize_function`
* _padding_ will be dynamic, which we will address at the batch-level with a `DataCollatorWithPadding`

In [None]:
from transformers import AutoTokenizer

tokenizer_1 = AutoTokenizer.from_pretrained(model_checkpoint_1)
tokenizer_2 = AutoTokenizer.from_pretrained(model_checkpoint_2)

In [None]:
def _create_tokenizer_function(tokenizer):
    def tokenize_function(examples):
        result = tokenizer(
            examples["text"],
            truncation=True
        )
        if tokenizer.is_fast:
            result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
        return result
    return tokenize_function

In [None]:
tokenize_function_1 = _create_tokenizer_function(tokenizer_1)

# Use batched=True to activate fast multithreading!
# Keep the "label" column: the Trainer needs it to compute the loss.
tokenized_datasets_1 = raw_datasets.map(
    tokenize_function_1, batched=True, remove_columns=["text"]
)
tokenized_datasets_1

In [None]:
tokenize_function_2 = _create_tokenizer_function(tokenizer_2)

# Use batched=True to activate fast multithreading!
# Keep the "label" column: the Trainer needs it to compute the loss.
tokenized_datasets_2 = raw_datasets.map(
    tokenize_function_2, batched=True, remove_columns=["text"]
)
tokenized_datasets_2

In [None]:
from transformers import DataCollatorWithPadding

data_collator_1 = DataCollatorWithPadding(tokenizer=tokenizer_1, return_tensors="pt")
data_collator_2 = DataCollatorWithPadding(tokenizer=tokenizer_2, return_tensors="pt")

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets_1["train"],
    eval_dataset=tokenized_datasets_1["test"],
    processing_class=tokenizer_1,
    data_collator=data_collator_1,
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint_1, num_labels=2)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets_1["train"],
    eval_dataset=tokenized_datasets_1["test"],
    data_collator=data_collator_1,
    processing_class=tokenizer_1,
)

In [None]:
trainer.train()

In [38]:
downsampled_dataset_1

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['label', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
})

In [59]:
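# Rename "label" to "labels" (the argument name the model's forward pass expects)
# and have the datasets return PyTorch tensors.
downsampled_dataset_1_1 = downsampled_dataset_1.rename_column("label", "labels")
downsampled_dataset_1_1.set_format("torch")

downsampled_dataset_2_1 = downsampled_dataset_2.rename_column("label", "labels")
downsampled_dataset_2_1.set_format("torch")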

In [60]:
print(downsampled_dataset_1_1)
print()
print(downsampled_dataset_2_1)

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
})

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask', 'word_ids'],
        num_rows: 1000
    })
})


In [61]:
downsampled_dataset_1_1.column_names

{'train': ['labels', 'input_ids', 'attention_mask', 'word_ids'],
 'test': ['labels', 'input_ids', 'attention_mask', 'word_ids']}

In [62]:
from torch.utils.data import DataLoader

train_dataloader_1 = DataLoader(
    downsampled_dataset_1_1["train"], shuffle=True, batch_size=64, collate_fn=data_collator_1
)
eval_dataloader_1 = DataLoader(
    downsampled_dataset_1_1["test"], batch_size=64, collate_fn=data_collator_1
)

In [63]:
train_dataloader_2 = DataLoader(
    downsampled_dataset_2_1["train"], shuffle=True, batch_size=64, collate_fn=data_collator_2
)
eval_dataloader_2 = DataLoader(
    downsampled_dataset_2_1["test"], batch_size=64, collate_fn=data_collator_2
)

In [64]:
for batch in train_dataloader_1:
    break
{k: v.shape for k, v in batch.items()}

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`word_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
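
The error message points at `word_ids`. A plausible reading is that the collator pads `input_ids` and `attention_mask` to a common length but leaves the extra `word_ids` column untouched, so its rows stay ragged and cannot be stacked into one tensor. The pure-Python sketch below mimics that behaviour; `pad_known_keys`, `is_rectangular`, and the toy features are hypothetical stand-ins, not the library's implementation.

```python
def pad_known_keys(features, pad_id=0):
    """Pad only input_ids and attention_mask; copy every other key as-is."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {key: [] for key in features[0]}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        for key, value in f.items():
            if key == "input_ids":
                value = value + [pad_id] * n_pad
            elif key == "attention_mask":
                value = value + [0] * n_pad
            batch[key].append(value)
    return batch


def is_rectangular(column):
    """True if every entry is a scalar or all rows have the same length."""
    lengths = {len(v) if isinstance(v, list) else 0 for v in column}
    return len(lengths) == 1


# Toy features shaped like rows of the downsampled dataset.
features = [
    {"labels": 1, "input_ids": [101, 7, 8, 102],
     "attention_mask": [1, 1, 1, 1], "word_ids": [None, 0, 1, None]},
    {"labels": 0, "input_ids": [101, 9, 102],
     "attention_mask": [1, 1, 1], "word_ids": [None, 0, None]},
]

batch = pad_known_keys(features)
ragged = [key for key, col in batch.items() if not is_rectangular(col)]
print(ragged)  # ['word_ids']

# Dropping the unused column leaves only columns that can be batched.
trimmed = [{k: v for k, v in f.items() if k != "word_ids"} for f in features]
clean_batch = pad_known_keys(trimmed)
print(all(is_rectangular(col) for col in clean_batch.values()))  # True
```

In the notebook itself, the corresponding step would presumably be to call `remove_columns(["word_ids"])` on `downsampled_dataset_1_1` and `downsampled_dataset_2_1` before building the `DataLoader`s, since sequence classification does not use `word_ids`. That fix has not been run here.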