In this homework you will understand the fine-tuning procedure and get acquainted with Huggingface Datasets library

In [1]:
# ! pip install datasets
# ! pip install transformers

For our goals we will use [Datasets](https://huggingface.co/docs/datasets/) library and take `yahoo_answers_topics` dataset - the task of this dataset is to divide documents on 10 topic categories. More detiled information can be found on the dataset [page](https://huggingface.co/datasets/viewer/).


In [2]:
from datasets import load_dataset

In [3]:
dataset = load_dataset('yahoo_answers_topics') # the result is a dataset dictionary of train and test splits in this case

Reusing dataset yahoo_answers_topics (C:\Users\bogya\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902)
100%|██████████| 2/2 [00:00<00:00,  4.08it/s]


# **Part 1: Fine-tuning the model** (15 points + 5 bonus)

In [4]:
from transformers import (ElectraTokenizer, ElectraForSequenceClassification,
                          get_scheduler, pipeline, ElectraForMaskedLM, ElectraModel, AutoTokenizer)

import torch
from torch.utils.data import DataLoader
from datasets import load_metric

Fine-tuning procedure on the end task consists of adding additional layers on the top of the pre-trained model. The resulting model can be tuned fully (passing gradients through the all model) or partially.

**Task**: 
- load tokenizer and model
- look at the predictions of the model as-is before any fine-tuning


```
- Why don't you ask [MASK]?
- What is [MASK]
- Let's talk about [MASK] physics
```

- convert `best_answer` to the input tokens (supporting function for dataset is provided below) 

```
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
```

- define optimizer, sheduler (optional)
- fine-tune the model (write the training loop), plot the loss changes and measure results in terms of weighted F1 score
- get the masked word prediction (sample sentences above) on the fine-tuned model, why the results as they are and what should be done in order to change that (write down your answer)
- In case you will tune the training hyperparameters (and write down your results) you will get 5 bonus points.

**Tips**:
- The easiest way to get predictions is to use transformers `pipeline` function 
- Do not forget to set `num_labels` parameter, when initializing the model
- To convert data to batches use `DataLoader`
- Even the `small` version of Electra can be long to train, so you can take data sample (>= 5000 and set seed for reproducibility)
- You may want to try freezing (do not update the pretrained model weights) all the layers exept the ones for classification, in that case use:


```
for param in model.electra.parameters():
      param.requires_grad = False
```


# load tokenizer and model

In [5]:
MODEL_NAME = "google/electra-small-generator"
TOKENIZER_NAME = "google/electra-small-generator"

In [6]:
# from transformers import ElectraModel, ElectraConfig

# model = ElectraModel(ElectraConfig())
# configuration = model.config

# tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)

In [7]:
from transformers import ElectraModel, ElectraTokenizer, ElectraConfig

num_labels = 10
model = ElectraModel.from_pretrained(MODEL_NAME, num_labels=num_labels)
configuration = model.config

tokenizer = ElectraTokenizer.from_pretrained(TOKENIZER_NAME)

Some weights of the model checkpoint at google/electra-small-generator were not used when initializing ElectraModel: ['generator_predictions.LayerNorm.bias', 'generator_predictions.dense.weight', 'generator_predictions.dense.bias', 'generator_lm_head.bias', 'generator_lm_head.weight', 'generator_predictions.LayerNorm.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# look at the predictions of the model as-is before any fine-tuning

In [9]:
fill_mask = pipeline(
    "fill-mask",
    model = MODEL_NAME,
    tokenizer = TOKENIZER_NAME
)

In [10]:
# print(
#     fill_mask(f"HuggingFace is creating a [MASK] that the community uses to solve NLP tasks.")
# )

In [11]:
print(
    fill_mask(f"Why don't you ask [MASK]?")
)

[{'score': 0.5342980623245239, 'token': 2033, 'token_str': 'me', 'sequence': "why don't you ask me?"}, {'score': 0.0819607824087143, 'token': 3980, 'token_str': 'questions', 'sequence': "why don't you ask questions?"}, {'score': 0.04395361617207527, 'token': 2068, 'token_str': 'them', 'sequence': "why don't you ask them?"}, {'score': 0.04017291218042374, 'token': 2339, 'token_str': 'why', 'sequence': "why don't you ask why?"}, {'score': 0.03002440184354782, 'token': 4426, 'token_str': 'yourself', 'sequence': "why don't you ask yourself?"}]


In [12]:
print(
    fill_mask(f"What is [MASK]")
)

[{'score': 0.9262325763702393, 'token': 1029, 'token_str': '?', 'sequence': 'what is?'}, {'score': 0.051567427814006805, 'token': 1012, 'token_str': '.', 'sequence': 'what is.'}, {'score': 0.021510407328605652, 'token': 999, 'token_str': '!', 'sequence': 'what is!'}, {'score': 0.00011964970326516777, 'token': 1011, 'token_str': '-', 'sequence': 'what is -'}, {'score': 0.00010928422852884978, 'token': 1000, 'token_str': '"', 'sequence': 'what is "'}]


In [13]:
print(
    fill_mask(f"Let's talk about [MASK] physics")
)

[{'score': 0.24027475714683533, 'token': 8559, 'token_str': 'quantum', 'sequence': "let's talk about quantum physics"}, {'score': 0.21258579194545746, 'token': 9373, 'token_str': 'theoretical', 'sequence': "let's talk about theoretical physics"}, {'score': 0.056393858045339584, 'token': 10811, 'token_str': 'particle', 'sequence': "let's talk about particle physics"}, {'score': 0.03320789709687233, 'token': 2613, 'token_str': 'real', 'sequence': "let's talk about real physics"}, {'score': 0.022627944126725197, 'token': 8045, 'token_str': 'mathematical', 'sequence': "let's talk about mathematical physics"}]


# convert `best_answer` to the input tokens (supporting function for dataset is provided below)

In [14]:
def tokenize_function(examples):
    return tokenizer(examples["best_answer"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

100%|██████████| 1400/1400 [33:03<00:00,  1.42s/ba]
100%|██████████| 60/60 [01:26<00:00,  1.43s/ba]


In [41]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'topic', 'question_title', 'question_content', 'best_answer', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1400000
    })
    test: Dataset({
        features: ['id', 'topic', 'question_title', 'question_content', 'best_answer', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 60000
    })
})

# define optimizer, sheduler (optional)

In [16]:
# from transformers.optimization import Adafactor, AdafactorSchedule

In [17]:
# # https://huggingface.co/docs/transformers/main_classes/optimizer_schedules

# optimizer = Adafactor(model.parameters(), scale_parameter=True,
#                       relative_step=True, warmup_init=True, lr=None)
# lr_scheduler = AdafactorSchedule(optimizer)


In [44]:
from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

# fine-tune the model (write the training loop), plot the loss changes and measure results in terms of weighted F1 score

In [18]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [None]:
# from transformers import AutoModelForSequenceClassification

# model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

In [None]:
# import torch

# device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# model.to(device)

In [24]:
import numpy as np
from datasets import load_metric

metric = load_metric('f1', 'accuracy')

Downloading builder script: 5.27kB [00:00, 1.32MB/s]                   


In [20]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [77]:
from transformers import AutoModelForSequenceClassification

# model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
model1 = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=5)
model2 = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=5)

loading configuration file https://huggingface.co/google/electra-small-generator/resolve/main/config.json from cache at C:\Users\bogya/.cache\huggingface\transformers\ddf7554779ef5bd660812cf3b6c92a66e14e307bae0f8582015b43ce8f8de85c.e50e2a54975f5ef36835643600664f71c63e7f570a08222c48829a8d8e327dca
Model config ElectraConfig {
  "_name_or_path": "google/electra-small-generator",
  "architectures": [
    "ElectraForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 128,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 256,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4"
  },
  "initializer_range": 0.02,
  "intermediate_size": 1024,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads":

In [79]:
from tqdm.auto import tqdm
from torch.utils.data import DataLoader

model = model1

tokenized_ds = tokenized_datasets.remove_columns(['id', 'best_answer', 'question_title', 'question_content'])
tokenized_ds = tokenized_ds.rename_column('topic', 'labels')

tokenized_ds.set_format('torch')

small_train_dataset = tokenized_ds['train'].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_ds['test'].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Loading cached shuffled indices for dataset at C:\Users\bogya\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902\cache-72c740bbacfeb5e5.arrow
Loading cached shuffled indices for dataset at C:\Users\bogya\.cache\huggingface\datasets\yahoo_answers_topics\yahoo_answers_topics\1.0.0\b2712a72fde278f1d6e96cc4f485fd89ed2f79ecb231441e13645b53da021902\cache-fda54cca51fd45e1.arrow
  0%|          | 0/525000 [00:10<?, ?it/s]


IndexError: Target 6 is out of bounds.

In [None]:
metric = load_metric("accuracy")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

# get the masked word prediction (sample sentences above) on the fine-tuned model, why the results as they are and what should be done in order to change that (write down your answer)

# In case you will tune the training hyperparameters (and write down your results) you will get 5 bonus points.