In this homework you will learn how the fine-tuning procedure works and get acquainted with the Hugging Face Datasets library.

In [None]:
# ! pip install datasets
# ! pip install transformers

For this purpose we will use the [Datasets](https://huggingface.co/docs/datasets/) library and the `yahoo_answers_topics` dataset, whose task is to classify documents into 10 topic categories. More detailed information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [1]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('yahoo_answers_topics') # the result is a dataset dictionary of train and test splits in this case

# **Part 1: Fine-tuning the model** (15 points + 5 bonus)

In [9]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel, AutoTokenizer)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric

Fine-tuning on the end task consists of adding extra layers on top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the whole model) or partially.

**Task**: 
- load tokenizer and model
- look at the predictions of the model as-is before any fine-tuning


```
- Why don't you ask [MASK]?
- What is [MASK]
- Let's talk about [MASK] physics
```

- convert `best_answer` to input tokens (a supporting function for the dataset is provided below)

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define an optimizer and, optionally, a scheduler
- fine-tune the model (write the training loop), plot the loss changes and measure results in terms of weighted F1 score
- get the masked word predictions (sample sentences above) from the fine-tuned model; explain why the results are as they are and what should be done to change that (write down your answer)
- If you tune the training hyperparameters (and write down your results), you will get 5 bonus points.
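
The training-loop step in the list above can be sketched as follows. This is only a minimal illustration under stated assumptions: the random tensors and the small `nn.Sequential` network are hypothetical stand-ins for the tokenized dataset and the Electra classifier, and the linear learning-rate decay is one possible scheduler choice, not a requirement of the task.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)  # fixed seed for reproducibility

# Toy stand-in for the tokenized data: 64 examples, 16 features, 10 classes.
features = torch.randn(64, 16)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(features, labels), batch_size=8, shuffle=True)

# Toy stand-in for the classifier (assumption: the real model is Electra-based).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

num_epochs = 5
total_steps = num_epochs * len(loader)
# Linearly decay the learning rate to zero over all optimization steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: 1.0 - step / total_steps
)

epoch_losses = []  # mean training loss per epoch, suitable for plotting
model.train()
for epoch in range(num_epochs):
    running_loss = 0.0
    for batch_features, batch_labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_features), batch_labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(loader))

print(epoch_losses)
```

The `epoch_losses` list can then be passed to any plotting library to visualize the loss changes.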

**Tips**:
- The easiest way to get predictions is to use the transformers `pipeline` function
- Do not forget to set the `num_labels` parameter when initializing the model
- To convert data to batches use `DataLoader`
- Even the `small` version of Electra can take a long time to train, so you can take a data sample (>= 5000 examples; set a seed for reproducibility)
- You may want to try freezing all the layers except the classification ones (that is, not updating the pretrained model weights); in that case use:


```
for param in model.electra.parameters():
    param.requires_grad = False
```
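
To check that freezing had the intended effect, one can count the trainable parameters and hand only those to the optimizer. The sketch below is an illustration, not the homework solution: `ToyClassifier` is a hypothetical stand-in that merely mimics the `electra` attribute name used in the snippet above.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Hypothetical stand-in: an `electra` encoder plus a classification head."""

    def __init__(self, hidden_size=16, num_labels=10):
        super().__init__()
        self.electra = nn.Sequential(nn.Linear(8, hidden_size), nn.ReLU())
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        return self.classifier(self.electra(x))


model = ToyClassifier()

# Freeze the encoder so that only the classification head receives updates.
for param in model.electra.parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {num_trainable} / {num_total}")

# Pass only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-3)
```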


# load tokenizer and model

In [4]:
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-small-generator"

In [None]:
# from transformers import ElectraModel, ElectraConfig

# model = ElectraModel(ElectraConfig())
# configuration = model.config

# tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)

In [23]:
from transformers import ElectraModel, ElectraTokenizer, ElectraConfig

model = ElectraModel.from_pretrained(MODEL_NAME)
configuration = model.config

tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)

Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraModel: ['generator_predictions.dense.weight', 'generator_lm_head.bias', 'generator_predictions.LayerNorm.weight', 'generator_lm_head.weight', 'generator_predictions.LayerNorm.bias', 'generator_predictions.dense.bias']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
num_labels = 10  # yahoo_answers_topics has 10 topic categories

In [11]:
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# look at the predictions of the model as-is before any fine-tuning

In [None]:
fill_mask = pipeline(
    "fill-mask",
    model = MODEL_NAME,
    tokenizer = TOKENIZER_NAME
)

In [8]:
print(
    fill_mask("HuggingFace is creating a [MASK] that the community uses to solve NLP tasks.")
)

[{'score': 0.1500137746334076, 'token': 2291, 'token_str': 'system', 'sequence': 'huggingface is creating a system that the community uses to solve nlp tasks.'}, {'score': 0.12094223499298096, 'token': 6994, 'token_str': 'tool', 'sequence': 'huggingface is creating a tool that the community uses to solve nlp tasks.'}, {'score': 0.06042560562491417, 'token': 5576, 'token_str': 'solution', 'sequence': 'huggingface is creating a solution that the community uses to solve nlp tasks.'}, {'score': 0.05312653258442879, 'token': 7809, 'token_str': 'database', 'sequence': 'huggingface is creating a database that the community uses to solve nlp tasks.'}, {'score': 0.03361190855503082, 'token': 3274, 'token_str': 'computer', 'sequence': 'huggingface is creating a computer that the community uses to solve nlp tasks.'}]


# convert `best_answer` to the input tokens (supporting function for dataset is provided below)

In [12]:
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

100%|██████████| 1400/1400 [07:08<00:00,  3.26ba/s]
100%|██████████| 60/60 [00:18<00:00,  3.21ba/s]


# define optimizer, scheduler (optional)

In [13]:
from transformers.optimization import Adafactor, AdafactorSchedule

In [18]:
# https://huggingface.co/docs/transformers/main_classes/optimizer_schedules

optimizer = Adafactor(model.parameters(), scale_parameter=True,
                      relative_step=True, warmup_init=True, lr=None)
lr_scheduler = AdafactorSchedule(optimizer)


# fine-tune the model (write the training loop), plot the loss changes and measure results in terms of weighted F1 score

In [24]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [26]:
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

Downloading builder script: 3.19kB [00:00, 290kB/s]                    


In [25]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
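
The task asks for a weighted F1 score, whereas the metric loaded above is accuracy. As a minimal sketch of what weighted F1 means, the helper below (a hypothetical `weighted_f1`, written in plain Python for illustration) averages the per-class F1 scores, weighting each class by its number of true examples. In practice a library routine such as `sklearn.metrics.f1_score(..., average="weighted")` would likely be used instead.

```python
from collections import Counter


def weighted_f1(predictions, references):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(references)  # number of true examples per class
    total = len(references)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(predictions, references))
        tp = sum(1 for p, r in pairs if p == label and r == label)
        fp = sum(1 for p, r in pairs if p == label and r != label)
        fn = count - tp
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        score += f1 * count / total
    return score


references = [0, 0, 1, 1, 2]
predictions = [0, 1, 1, 1, 2]
print(round(weighted_f1(predictions, references), 4))  # 0.7867
```

Classes that appear only in the predictions have zero support and therefore do not contribute to the weighted average.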

In [28]:
from transformers import Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

small_train_ds = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_ds = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_ds,
    eval_dataset=small_eval_ds,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, lr_scheduler),
)


PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Loading cached shuffled indices for dataset at C:\Users\bogya\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902\cache-0e86df9392ffea52.arrow
Loading cached shuffled indices for dataset at C:\Users\bogya\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902\cache-235ea0d58a355e20.arrow


In [29]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `ElectraModel.forward` and have been ignored: best_answer, topic, question_title, id, question_content. If best_answer, topic, question_title, id, question_content are not expected by `ElectraModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375
  0%|          | 0/375 [00:00<?, ?it/s]

KeyError: 'loss'
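
Judging from the log, a likely cause of this error is that the bare `ElectraModel` only returns hidden states and never computes a loss, while `Trainer` expects a `loss` entry in the model output. The sketch below illustrates the difference using a tiny, randomly initialized configuration (the sizes are arbitrary assumptions chosen to avoid downloading a checkpoint). For the real run, one would presumably load `ElectraForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=10)` and expose the `topic` column under the name `labels`, for example with `rename_column("topic", "labels")`.

```python
import torch
from transformers import (ElectraConfig, ElectraForSequenceClassification,
                          ElectraModel)

# Tiny random configuration (assumption: sizes chosen only for illustration).
config = ElectraConfig(
    vocab_size=100,
    embedding_size=16,
    hidden_size=16,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=32,
    num_labels=10,
)

input_ids = torch.randint(0, 100, (4, 8))  # batch of 4 sequences, 8 tokens each
labels = torch.randint(0, 10, (4,))        # one topic id per sequence

# The bare encoder returns hidden states only, so there is no "loss" key.
backbone_out = ElectraModel(config)(input_ids=input_ids)
print("loss" in backbone_out.keys())

# The classification model computes a loss when labels are provided.
classifier = ElectraForSequenceClassification(config)
classifier_out = classifier(input_ids=input_ids, labels=labels)
print("loss" in classifier_out.keys(), tuple(classifier_out.logits.shape))
```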

# get the masked word predictions (sample sentences above) from the fine-tuned model; explain why the results are as they are and what should be done to change that (write down your answer)

# If you tune the training hyperparameters (and write down your results), you will get 5 bonus points.