# Fine tune a multiple choice NLP model

Based on the [`transformers` multiple-choice fine tuning notebook](https://github.com/huggingface/notebooks/blob/main/examples/multiple_choice.ipynb). We start 
from pretrained `roberta-base`, but the code should work with any other checkpoint 
that can be loaded through `AutoModelForMultipleChoice` from HuggingFace.

In [None]:
%%capture

%cd ..


In [None]:
import transformers

print(transformers.__version__)


4.32.1


## Get the MutualPlus dataset

In [None]:
from datasets import load_dataset

mutual_plus = load_dataset("lighteval/mutual_harness", name="mutual_plus")
mutual_plus


Found cached dataset mutual_harness (/home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a)



DatasetDict({
    train: Dataset({
        features: ['answers', 'options', 'article', 'id'],
        num_rows: 7088
    })
    test: Dataset({
        features: ['answers', 'options', 'article', 'id'],
        num_rows: 886
    })
    validation: Dataset({
        features: ['answers', 'options', 'article', 'id'],
        num_rows: 886
    })
})

In [None]:
mutual_plus["train"].features


{'answers': Value(dtype='string', id=None),
 'options': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'article': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None)}

In [None]:
split = "train"
mutual_plus[split][0]


{'answers': 'A',
 'options': ['F: Although the suit you sew is the same as it, the material of this suit is imported from Italy.',
  "F: No suit has the same style as it. It's the style that makes it special. It is worth the price.",
  'F: I am afraid I did not quite catch what you were saying. Please repeat it.',
  'F: But the color of our suit is very special.'],
 'article': "M: Excuse me. How much is this suit? F: It's on sale today for $750. It's normally $900. M: Wow, that is pretty expensive! I was thinking that it might be 4 or 500. F: This material is imported from Italy. It's the finest in the world, and if you bought a suit made of this material at many department stores, you would pay about $2000. M: Uh-hah. But isn't that the point of coming to a market like this, to get a discount compared to the expensive department stores? Besides I saw a suit just like this one a few stalls down, and they were selling it for $600. I still thought that it was too expensive.",
 'id': 'tra

We rename the `"answers"` column to `"labels"` so that it can be passed to a data collator (batch padding) and to a HF 
Transformer model, which usually accepts the keyword arguments `input_ids`, `attention_mask` and `labels` (the latter for 
training).

Moreover, if the `"labels"` column becomes a `ClassLabel`, then we get [numeric class labels](https://huggingface.co/docs/datasets/v2.14.5/en/package_reference/main_classes#datasets.Features) for free, so let's cast it. We can only do this on the `train` and `validation` splits 
though, as the `test` split only has empty labels, stored as a single space `" "`.
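
To make the "numeric class labels" concrete, the sketch below mimics in plain Python the name-to-index mapping that a `ClassLabel` with names `["A", "B", "C", "D"]` provides. The helper `encode_label` is hypothetical and only illustrative; mapping the blank test label to `None` is an assumption made here, not something `datasets` does.

```python
from typing import Optional

NAMES = ["A", "B", "C", "D"]
# The position of a name in the list is its integer class id
STR2INT = {name: idx for idx, name in enumerate(NAMES)}


def encode_label(answer: str) -> Optional[int]:
    """Hypothetical helper: class id for a letter, None for a blank label."""
    return STR2INT.get(answer.strip())


encoded = [encode_label(a) for a in ["A", "C", "D", " "]]
print(encoded)
```

This is why `show_one_mutual` prints an integer ground truth for the `train` and `validation` splits.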

In [None]:
from datasets import ClassLabel

mutual_plus = mutual_plus.rename_column("answers", "labels")
for split in ["train", "validation"]:
    mutual_plus[split] = mutual_plus[split].cast_column(
        "labels", ClassLabel(num_classes=4, names=["A", "B", "C", "D"])
    )
    print(split, mutual_plus[split].features)


Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-6ef2610c9e680d1e.arrow
Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-c595c1b4173c47da.arrow


train {'labels': ClassLabel(names=['A', 'B', 'C', 'D'], id=None), 'options': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'article': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}
validation {'labels': ClassLabel(names=['A', 'B', 'C', 'D'], id=None), 'options': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'article': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None)}


In [None]:
def show_one_mutual(example: dict):
    print(f"Context: {example['article']}")
    for i, opt in enumerate(example["options"]):
        print(f"  {i} - {opt}")
    print(f"\nGround truth: {gt if (gt := example['labels']) != ' ' else 'UNKNOWN'}")


In [None]:
import random

# randrange excludes the upper bound, so the index is always valid
random_idx = random.randrange(mutual_plus[split].num_rows)
print(random_idx, split)
show_one_mutual(mutual_plus[split][random_idx])


39 validation
Context: F: Good morning. M: Morning. F: Come in, sit down. Now, you're a new patient, aren't you? M: Yes, that's right. F: Ok, so I better ask you some questions first. Now, have you ever had any serious illnesses or accidents? M: A broken leg I got from playing football when I was 17. I was in the school team at that time.
  0 - F: So you broke your leg in a car accident when you were 15, right?
  1 - F: When you were 15, a wild cat broke your leg? That's amazing.
  2 - F: So you broke your leg when you were playing football, right?
  3 - F: Just a minute! I do not quite follow what you are saying, would you mind repeating that?

Ground truth: 2


## Get the model: `roberta-base`

In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice

model_checkpoint = "roberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForMultipleChoice.from_pretrained(model_checkpoint)
tokenizer, model


Some weights of RobertaForMultipleChoice were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'classifier.weight', 'classifier.bias', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(RobertaTokenizerFast(name_or_path='roberta-base', vocab_size=50265, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)}, clean_up_tokenization_spaces=True),
 RobertaForMultipleChoice(
   (roberta): RobertaModel(
     (embeddings): RobertaEmbeddings(
       (word_embeddings): Embedding(50265, 768, padding_idx=1)
       (position_embeddings): Embedding(514, 768, padding_idx=1)
       (token_type_embeddings): Embedding(1, 768)
       (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
       (dropout): Dropout(p=0.1, inplace=False)
     )
     (encoder): RobertaEncoder(
       (layer): ModuleList(
         (0-11): 12 x RobertaLayer(
           (attention): RobertaAttention(
             (self): RobertaSel

## Preprocessing

### Tokenization

**NOTE** Here we could remove the speaker tags (`M:`, `F:`)! This can be done
1. Either before merging with the selected datapoints from MMLU, or
2. After merging, in which case the regex must also behave correctly on the MMLU datapoints.
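
A minimal sketch of what such a regex could look like. It assumes that a speaker tag is a single `M` or `F` followed by a colon, at the start of the text or right after whitespace. The name `strip_speaker_tags` is hypothetical and not part of this notebook's pipeline.

```python
import re

# Assumption: tags look like "M: " or "F: " and are preceded by whitespace
# or sit at the very start of the text.
SPEAKER_TAG = re.compile(r"(?:(?<=\s)|^)[MF]:\s*")


def strip_speaker_tags(text: str) -> str:
    """Remove speaker tags while leaving other 'X:' patterns untouched."""
    return SPEAKER_TAG.sub("", text).strip()


dialogue = "M: Excuse me. How much is this suit? F: It's on sale today."
print(strip_speaker_tags(dialogue))
# Text without speaker tags (e.g. from MMLU) passes through unchanged
print(strip_speaker_tags("Meet at 5 PM: sharp"))
```

Because the tag must follow whitespace, tokens such as `PM:` are not touched, which matters if the function is applied after merging with MMLU.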

In [None]:
from typing import Dict, List

from transformers import PreTrainedTokenizerBase


def preprocess_mutual(
    datapoints: Dict[str, List[str]], tokenizer: PreTrainedTokenizerBase = None
) -> Dict[str, List[List[List[int]]]]:
    assert tokenizer is not None, "Need to pass a tokenizer as argument"
    n_options = len(datapoints["options"][0])
    # Pair each context with each of its candidate continuations
    full_options = [
        " ".join([article, opt])
        for article, options in zip(datapoints["article"], datapoints["options"])
        for opt in options
    ]
    # Tokenize
    tokenized_examples = tokenizer(full_options, truncation=True)
    # Un-flatten
    return {
        k: [v[i : i + n_options] for i in range(0, len(v), n_options)]
        for k, v in tokenized_examples.items()
    }


This function works with one or several examples. In the case of several 
examples, the tokenizer will return a list of lists of lists for each key: a 
list of all examples (here 5), then a list of all choices (4) and a list of 
input IDs (length varying here since we did not apply any padding).
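
The flatten, tokenize, un-flatten steps can be traced on a toy batch without loading a real tokenizer. In this sketch `toy_tokenize` is a hypothetical stand-in that emits one id per whitespace-separated word; only the nesting of the result is meant to mirror `preprocess_mutual`.

```python
from typing import Dict, List


def toy_tokenize(texts: List[str]) -> Dict[str, List[List[int]]]:
    """Hypothetical tokenizer stand-in: one id (the word length) per word."""
    input_ids = [[len(word) for word in text.split()] for text in texts]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {
    "article": ["M: Hi. F: Hello.", "F: How are you?"],
    "options": [
        ["M: a", "M: b c", "M: d", "M: e f g"],
        ["M: h", "M: i", "M: j k", "M: l"],
    ],
}
n_options = len(batch["options"][0])
# Flatten: one string per (context, option) pair
flat = [
    " ".join([article, opt])
    for article, options in zip(batch["article"], batch["options"])
    for opt in options
]
tokenized = toy_tokenize(flat)
# Un-flatten: regroup every n_options consecutive sequences into one example
nested = {
    k: [v[i : i + n_options] for i in range(0, len(v), n_options)]
    for k, v in tokenized.items()
}
print(len(flat), len(nested["input_ids"]), len(nested["input_ids"][0]))
print([len(x) for x in nested["input_ids"][0]])
```

The outer list has one entry per example, the middle list one entry per option, and the inner lists have different lengths because nothing is padded yet.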

**NOTE** To correctly preprocess _only one_ datapoint, pass a slice of length one rather than a single row:

`preprocess_mutual(my_datapoints[:1], tokenizer=tokenizer)`.

Slicing a `datasets.Dataset` returns a dictionary of lists, so the values of 
`my_datapoints[:1]` get one extra list layer and effectively form a batch of size one.

In [None]:
example_features = preprocess_mutual(mutual_plus[split][:5], tokenizer=tokenizer)
print(list(example_features))
print(
    len(example_features["input_ids"]),
    len(example_features["input_ids"][0]),
    [len(x) for x in example_features["input_ids"][0]],
)


['input_ids', 'attention_mask']
5 4 [165, 175, 165, 170]


Check that the tokenization was done correctly.

In [None]:
example_idx = 4
show_one_mutual(mutual_plus[split][example_idx])
[
    tokenizer.decode(example_features["input_ids"][example_idx][i])
    for i in range(len(example_features["input_ids"][example_idx]))
]


Context: F: Hi, Deck, would you like to go swimming this afternoon? M: I wish I could, but I have to spend the rest of the day in the library. I have a 10 page paper due tomorrow.
  0 - F: Oh, are you going to the swiming pool? I have to go to library and study.
  1 - F: I am afraid I did not quite catch what you were saying. Please repeat it.
  2 - F: Great! Let's go to the swimming pool together this afternoon.
  3 - F: Then I'll go to swimming pool without you. Enjoy your time at library.

Ground truth: 3


['<s>F: Hi, Deck, would you like to go swimming this afternoon? M: I wish I could, but I have to spend the rest of the day in the library. I have a 10 page paper due tomorrow. F: Oh, are you going to the swiming pool? I have to go to library and study.</s>',
 '<s>F: Hi, Deck, would you like to go swimming this afternoon? M: I wish I could, but I have to spend the rest of the day in the library. I have a 10 page paper due tomorrow. F: I am afraid I did not quite catch what you were saying. Please repeat it.</s>',
 "<s>F: Hi, Deck, would you like to go swimming this afternoon? M: I wish I could, but I have to spend the rest of the day in the library. I have a 10 page paper due tomorrow. F: Great! Let's go to the swimming pool together this afternoon.</s>",
 "<s>F: Hi, Deck, would you like to go swimming this afternoon? M: I wish I could, but I have to spend the rest of the day in the library. I have a 10 page paper due tomorrow. F: Then I'll go to swimming pool without you. Enjoy your ti

Now tokenize the whole `MutualPlus`! 

**NOTE** We can differentiate tokenized datasets by caching them in different 
folders. There will be some repetition, but that's ok (the train split of MutualPlus is 
~21MB). E.g. `cache_dir/mutual_plus/train_sim_0.6`, `cache_dir/mutual_plus/train_rand_0.6`, 
`cache_dir/mutual_plus/train_nospeakers_sim_0.6` etc.

**NOTE NOTE** No need to worry about overwriting files: caching is handled automatically, even 
across kernel restarts!

In [None]:
# from similarity_augmentations import conf

# mutualplus_tokenized_cache = conf.TOKENIZED_DATASET_DIR / "mutual_plus"
# mutualplus_tokenized_cache.mkdir(parents=True, exist_ok=True)

# NOTE no need to tokenize the test set too, but I'm lazy and it's only 2.6MB
tokenized_mutualplus = mutual_plus.map(
    preprocess_mutual,
    batched=True,
    fn_kwargs={"tokenizer": tokenizer},
    # cache_file_names={k: str(mutualplus_tokenized_cache / k) for k in mutual_plus},
)
tokenized_mutualplus


Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-3e4280decfc9d40d.arrow
Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-d719d476f949c9a2.arrow
Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-5697a6235ea52fe6.arrow


DatasetDict({
    train: Dataset({
        features: ['labels', 'options', 'article', 'id', 'input_ids', 'attention_mask'],
        num_rows: 7088
    })
    test: Dataset({
        features: ['labels', 'options', 'article', 'id', 'input_ids', 'attention_mask'],
        num_rows: 886
    })
    validation: Dataset({
        features: ['labels', 'options', 'article', 'id', 'input_ids', 'attention_mask'],
        num_rows: 886
    })
})

**NOTE** `input_ids` and `attention_mask` are now added to the dataset.

Even better, the results are automatically cached by the `datasets` library to 
avoid spending time on this step the next time you run your notebook. The 
`datasets` library is normally smart enough to detect when the function you pass
to `map` has changed (and thus that the cached data must not be reused). For 
instance, it should detect a change to `preprocess_mutual` or to the tokenizer 
passed through `fn_kwargs`. `datasets` warns you when it uses cached files; you 
can pass `load_from_cache_file=False` in the call to `map` to skip the cached 
files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This lets us 
take full advantage of the fast tokenizer we loaded earlier, which uses 
multi-threading to process the texts in a batch concurrently.

## Fine tuning

In [None]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PaddingStrategy
from typing import Optional, Union

import torch


@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that dynamically pads the inputs for multiple choices received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[str, List[List[int]]]]]):
        labels = [feature.pop("labels") for feature in features]
        batch_size, num_choices = len(features), len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)]
            for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels under a key accepted by model()
        return batch | {"labels": torch.tensor(labels, dtype=torch.int64)}


When called on a list of examples, it flattens all the inputs, attention masks etc. into big lists that it passes to the `tokenizer.pad` method. This returns a dictionary of big tensors (of shape `(batch_size * 4) x seq_length`) that we then un-flatten.
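
The same flatten, pad, un-flatten logic can be sketched on plain Python lists, which makes the shapes easy to inspect. `pad_multiple_choice` is a hypothetical helper written for illustration; the default `pad_id=1` is taken from the `padding_idx=1` shown in the model printout above.

```python
from typing import Dict, List


def pad_multiple_choice(
    features: List[Dict[str, List[List[int]]]], pad_id: int = 1
) -> Dict[str, List[List[List[int]]]]:
    """Pad every (example, choice) sequence to the longest one in the batch."""
    num_choices = len(features[0]["input_ids"])
    # Flatten: one sequence per (example, choice) pair
    flat = [seq for feature in features for seq in feature["input_ids"]]
    max_len = max(len(seq) for seq in flat)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in flat]
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in flat]

    def unflatten(rows: List[List[int]]) -> List[List[List[int]]]:
        # Regroup every num_choices consecutive rows into one example
        return [rows[i : i + num_choices] for i in range(0, len(rows), num_choices)]

    return {"input_ids": unflatten(padded), "attention_mask": unflatten(masks)}


toy_features = [
    {"input_ids": [[0, 5, 2], [0, 5, 6, 7, 2]]},
    {"input_ids": [[0, 2], [0, 8, 9, 2]]},
]
toy_batch = pad_multiple_choice(toy_features)
print(toy_batch["input_ids"][0][0])
print(toy_batch["attention_mask"][1][0])
```

Every sequence ends up with the length of the longest (example, choice) pair in the batch, not of the longest first option.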

We can check that this data collator works on a list of features; we just have to make sure to remove all features that are not inputs accepted by our model (something the `Trainer` will do automatically for us later):

In [None]:
accepted_keys = ["input_ids", "attention_mask", "labels"]
example = [
    {k: v for k, v in tokenized_mutualplus[split][i].items() if k in accepted_keys}
    for i in range(10)
]
example_batch = DataCollatorForMultipleChoice(tokenizer=tokenizer)(example)
example_batch["input_ids"].shape, example_batch["labels"].shape


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


(torch.Size([10, 4, 301]), torch.Size([10]))

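Sanity check: the raw example must come from the same `split` that `example_batch` was built from, otherwise the printed context and the decoded text will not match.

In [None]:
show_one_mutual(tokenized_mutualplus[split][8])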
[
    tokenizer.decode(example_batch["input_ids"][8][i].tolist())
    for i in range(len(example_batch["input_ids"][8]))
]


Context: M: Good morning, I'd like to buy a cake. F: No problem sir, we have many cakes here, what size would you like? M: Well, it's for my coworker's birthday, there are 14 people in the office. F: Well, this cake feeds 12 people and this one behind it feeds 20. M: I'll take the bigger one, it's better to have too much than not enough. F: Sounds good, do you want it delivered? M: Yes. Can you deliver it to my office? The birthday party will be after work at a park near the office.
  0 - F: Sure. Hope you enjoy the party in the park. Have a nice day.
  1 - F: No problem. I'll send the cake to the park you asked me to do.
  2 - F: Sure, hope you have fun during the party at the office.
  3 - F: Just a minute! I do not quite follow what you are saying, would you mind repeating that?

Ground truth: 0


["<s>F: Hey Mike, over here. M: Hi, it's great to see you, been waiting long? F: No, not at all. What do you want to have? M: Just a salad, so how's the new apartment working out? F: Good, I like it. The neighborhood, though, is... Well, some of the buildings down the street are covered with terrible pictures drawn by teenagers. M: I know what you mean. I think we need to report people who are drawing to the police. F: Yes, and I like all the stores. It's convenient for shopping, and it's pretty quiet at night. That's definitely a plus. M: Sounds like you're pretty satisfied. F: Yeah, I guess so, uh the only problem is that it's impossible to find parking. I have to drive around the block 6 or 7 times to find a space, usually I can't find a space usually I can find one, but sometimes I have to park really far away. M: Well, is there anyway, you can rent space in a garage. F: Yeah, that's a good idea. So now are things in your neighborhood. M: There's a bit of noise problem where I live

In [None]:
print([len(el[0]) for el in tokenized_mutualplus[split][:10]["input_ids"]])
# NOTE this only prints the length of the first option of each example, while
# the collator pads to the longest sequence over all options in the batch, so
# the padded length can exceed the largest value printed here
[
    tokenizer.decode(example_batch["input_ids"][2][i].tolist())
    for i in range(len(example_batch["input_ids"][2]))
]


[165, 205, 87, 129, 69, 98, 134, 222, 299, 97]


["<s>M: Hello, is this doctor, Smith's office? F: Yes, it is. May I help you? M: Yes, I'd like to speak to doctor Smith, please? F: Doctor Smith went home this afternoon. May I ask who is calling? M: This is Jim White. F: Please wait a second. I'll get Dr. Smith for you. He is right in his office.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pa

## Define evaluation metrics

For evaluation, `transformers.Trainer` passes an instance of 
[`transformers.trainer_utils.EvalPrediction`](https://github.com/huggingface/transformers/blob/b71f20a7c9f3716d30f6738501559acf863e2c5c/src/transformers/trainer_utils.py#L108) 
to the function passed to the `Trainer` by the parameter `compute_metrics`.

In [None]:
import numpy as np
from transformers.trainer_utils import EvalPrediction


def compute_metrics_fn(eval_predictions: EvalPrediction) -> Dict[str, float]:
    # NOTE loss gets included automatically
    predictions, labels = eval_predictions
    if len(labels.shape) < 2:
        # add batch dimension to labels if necessary
        labels = labels[..., None]
    ranked_predictions = np.argsort(-predictions)
    # since we only have 4 ranks, compute until R@3 (at 4 is always 1.0)
    recalls = (ranked_predictions == labels).astype(np.float32)
    mean_recalls_at = recalls.mean(0)
    recalls_dict = {
        f"R@{k}": mean_recalls_at[:k].sum().item()
        for k in range(1, len(mean_recalls_at))
    }
    inverse_ranks = 1 / (np.argwhere(ranked_predictions == labels)[:, 1] + 1)
    return recalls_dict | {"MRR": inverse_ranks.mean().item()}


Test the implemented metrics: recalls at increasing positions should be 
monotonically non-decreasing, and higher recalls at lower positions should yield 
higher MRR.
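
As an independent cross-check, the same quantities can be computed in plain Python from the 1-based rank of the gold label. The function `recall_and_mrr` is a hypothetical helper, not part of the training code; the rankings and labels are copied from the first run recorded in this section.

```python
from typing import Dict, List


def recall_and_mrr(ranked: List[List[int]], labels: List[int]) -> Dict[str, float]:
    """R@k for k < number of options, plus mean reciprocal rank."""
    # 1-based position of the gold label in each ranking
    ranks = [row.index(label) + 1 for row, label in zip(ranked, labels)]
    n = len(labels)
    metrics = {
        f"R@{k}": sum(rank <= k for rank in ranks) / n
        for k in range(1, len(ranked[0]))
    }
    metrics["MRR"] = sum(1 / rank for rank in ranks) / n
    return metrics


ranked_predictions = [
    [3, 1, 0, 2],
    [3, 0, 1, 2],
    [1, 3, 0, 2],
    [0, 2, 1, 3],
    [3, 2, 1, 0],
]
golden_labels = [2, 3, 1, 2, 1]
check = recall_and_mrr(ranked_predictions, golden_labels)
print(check)
```

The gold labels sit at ranks 4, 1, 1, 2 and 3, which gives R@1 = 0.4, R@2 = 0.6, R@3 = 0.8 and MRR of about 0.6167, in agreement with the values reported by `compute_metrics_fn` up to float32 rounding.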

In [None]:
n = 5
for _ in range(5):
    evalpred = EvalPrediction(np.random.rand(n, 4), np.random.randint(0, 4, (n,)))
    print(f"Golden labels: {evalpred.label_ids}")
    print(f"Ranked predictions:\n{np.argsort(-evalpred.predictions)}")
    print(f"Evaluation: {compute_metrics_fn(evalpred)}")
    print("-------")


Golden labels: [2 3 1 2 1]
Ranked predictions:
[[3 1 0 2]
 [3 0 1 2]
 [1 3 0 2]
 [0 2 1 3]
 [3 2 1 0]]
Evaluation: {'R@1': 0.4000000059604645, 'R@2': 0.6000000238418579, 'R@3': 0.800000011920929, 'MRR': 0.6166666666666667}
-------
Golden labels: [0 1 2 0 2]
Ranked predictions:
[[1 2 0 3]
 [0 3 2 1]
 [1 2 0 3]
 [2 1 3 0]
 [1 2 0 3]]
Evaluation: {'R@1': 0.0, 'R@2': 0.4000000059604645, 'R@3': 0.6000000238418579, 'MRR': 0.36666666666666664}
-------
Golden labels: [2 0 3 1 3]
Ranked predictions:
[[0 3 1 2]
 [2 3 1 0]
 [2 0 1 3]
 [2 1 0 3]
 [2 0 3 1]]
Evaluation: {'R@1': 0.0, 'R@2': 0.20000000298023224, 'R@3': 0.4000000059604645, 'MRR': 0.31666666666666665}
-------
Golden labels: [3 0 0 3 0]
Ranked predictions:
[[1 0 3 2]
 [0 1 2 3]
 [3 2 0 1]
 [3 0 2 1]
 [3 1 2 0]]
Evaluation: {'R@1': 0.4000000059604645, 'R@2': 0.4000000059604645, 'R@3': 0.800000011920929, 'MRR': 0.5833333333333333}
-------
Golden labels: [0 3 1 3 0]
Ranked predictions:
[[1 3 0 2]
 [1 3 2 0]
 [1 0 3 2]
 [1 0 3 2]
 [3 1 0 2]

## Train

In [None]:
from transformers import TrainingArguments

from similarity_augmentations import conf

model_name = model_checkpoint.split("/")[-1]
batch_size = 16  # this is copied from the example notebook, try other values

args = TrainingArguments(
    conf.FINETUNED_MODELS_DIR / f"{model_name}-finetuned-mutualplus",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    # push_to_hub=True,
)
print(args)


TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.EPOCH,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_pu

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_mutualplus["train"],
    eval_dataset=tokenized_mutualplus["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer),
    compute_metrics=compute_metrics_fn,
)
trainer


<transformers.trainer.Trainer>

In [None]:
# trainer.train()


# Now, to the "fun" stuff: mixing the datasets

First of all, let's get the MMLU.

In [None]:
mmlu = load_dataset("cais/mmlu", name="all")
mmlu


Found cached dataset mmlu (/home/xqz-u/.cache/huggingface/datasets/cais___mmlu/all/1.0.0/1f5be36877bf67bdc9a548113a281aec5730e14ead069b31cb63971b9fab210d)



DatasetDict({
    auxiliary_train: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 99842
    })
    test: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 1531
    })
    dev: Dataset({
        features: ['question', 'subject', 'choices', 'answer'],
        num_rows: 285
    })
})

The idea is to insert specific MMLU datapoints into MuTual with the same features, so that the same preprocessing, 
`DataCollator` and fine tuning code used for bare MuTual applies.

In [None]:
from typing import Iterable, Tuple

from datasets import concatenate_datasets, Dataset


def unify_datasets_structure(
    mutual_split: Dataset, mmlu_split: Dataset
) -> Tuple[Dataset, Dataset]:
    # drop some unused columns
    if "subject" in mmlu_split.features:
        mmlu_split = mmlu_split.remove_columns("subject")
    if "id" in mutual_split.features:
        mutual_split = mutual_split.remove_columns("id")
    # 'labels' is used by HF transformers for ground truth
    if "labels" not in mutual_split.features and "answers" in mutual_split.features:
        mutual_split = mutual_split.rename_column("answers", "labels")
    # normalize the feature names to MuTual's ones
    to_mutual_map = {"answer": "labels", "choices": "options", "question": "article"}
    mmlu_split = mmlu_split.rename_columns(to_mutual_map)
    # answer values are the same, but MMLU stores them with a different dtype:
    # cast MuTual's labels to MMLU's feature type to make the two compatible
    mutual_split = mutual_split.cast_column("labels", mmlu_split.features["labels"])
    return mutual_split, mmlu_split


def merge_mmlu_in_mutual(
    mutual_split: Dataset, mmlu_split: Dataset, mmlu_merge_ids: Iterable[int]
) -> Dataset:
    mutual_split, mmlu_split = unify_datasets_structure(mutual_split, mmlu_split)
    return concatenate_datasets([mutual_split, mmlu_split.select(mmlu_merge_ids)])


Test that it works: the desired MMLU datapoints are appended to MuTual.

**NOTE** So it's important that the dataset is appropriately shuffled during 
training! And [it is](https://github.com/huggingface/transformers/blob/03af4c42a624ea44b3325ff78151f499392dd617/src/transformers/trainer.py#L796C23-L796C23).

**NOTE NOTE** This is basically the _random_ selection strategy for data augmentation.
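
If the random selection needs to be reproducible across runs, one option is to draw the indices from a seeded generator. This is a sketch under that assumption: `sample_merge_ids` is a hypothetical helper, and 99842 is the size of the `auxiliary_train` split printed above.

```python
import random
from typing import List


def sample_merge_ids(n_candidates: int, n_select: int, seed: int = 0) -> List[int]:
    """Draw n_select distinct indices in [0, n_candidates) with a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(range(n_candidates), n_select)


ids = sample_merge_ids(n_candidates=99842, n_select=10, seed=0)
# The same seed always yields the same selection
assert ids == sample_merge_ids(n_candidates=99842, n_select=10, seed=0)
print(ids)
```

The resulting list can be passed as `mmlu_merge_ids` to `merge_mmlu_in_mutual`.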

In [None]:
random_ids = np.random.choice(len(mmlu["auxiliary_train"]), 10, replace=False)
print(random_ids)
print(mmlu["auxiliary_train"][random_ids])
print()
merged = merge_mmlu_in_mutual(mutual_plus["train"], mmlu["auxiliary_train"], random_ids)
merged[-10:]


Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-66c4dec026cd72f2.arrow


[56038 60079 37144 91169 71811 58972 30662 14701 63317 88102]
{'question': ['Researchers in London and Bristol have found that men are particularly likely to yield to depression if their partners are also depressed. The finding highlights the importance of paying attention to the partners of depressed mothers, as young children themselves are vulnerable   to social problems if both parents are depressed. Researchers in London and at the University of Bristol launched their study to investigate whether family structure affects the likelihood of depression in men around the time their child is born. They looked at men from traditional families, men with children from a previous relationship, men whose partners had children by a former partner, and men who were not living with their partners. All 7,108 participants filled out a questionnaire on depression, and answered questions about their age, education level and employment status. Details about the quality of their relationships with t

{'labels': [1, 0, 3, 3, 3, 3, 1, 1, 3, 0],
 'options': [['Ten percent of women who were depressed had depressed partners.',
   '2.6 percent of healthy women were depressed.',
   'Special attention should be paid to families in which both the father and the mother were depressed.',
   "Primary school children whose parents were both depressed couldn't get along well withtheir peers."],
  ['Those who love nature.',
   'Those who love city life.',
   'Those who love the comfort in a fine hotel.',
   'Those who love going shopping.'],
  ['preparing a dinner for a poor family',
   'chatting with the elderly mother and her disabled son',
   'making preparations for their own Christmas festival',
   'visiting one of their good friends in other district'],
  ['red', 'blue', 'green', 'purple'],
  ['It told her to swim in the lake.',
   'It told her to play by the lake.',
   'It told her to catch fish for him.',
   'It taught her how to fish.'],
  ['do not believe the drawings are old.',
   'bel

Test that tokenization and `DataCollatorForMultipleChoice` work for this mixed dataset. MMLU texts are longer, so they get 
truncated to the tokenizer's maximum length (512 tokens) and likely introduce a lot of `<pad>` in mixed batches...
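
A rough way to quantify the padding overhead is the fraction of `<pad>` positions in a batch, given the unpadded sequence lengths. The helper `pad_fraction` is hypothetical; the example mixes the four MuTual lengths printed earlier (165, 175, 165, 170) with one sequence assumed to hit the 512-token limit.

```python
from typing import List


def pad_fraction(lengths: List[int], max_length: int = 512) -> float:
    """Share of positions that are padding when padding to the longest sequence."""
    # Sequences longer than max_length are truncated before padding
    clipped = [min(length, max_length) for length in lengths]
    padded_len = max(clipped)
    return 1 - sum(clipped) / (padded_len * len(clipped))


mutual_only = pad_fraction([165, 175, 165, 170])
mixed = pad_fraction([165, 175, 165, 170, 512])
print(f"MuTual only: {mutual_only:.3f}, with one long sequence: {mixed:.3f}")
```

A single maximum-length sequence makes more than half of this toy batch padding, which illustrates why mixed batches are costlier.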

In [None]:
tokenized_merged = merged.map(
    preprocess_mutual,
    batched=True,
    fn_kwargs={"tokenizer": tokenizer},
)
tokenized_merged


Loading cached processed dataset at /home/xqz-u/.cache/huggingface/datasets/lighteval___mutual_harness/mutual_plus/0.0.1/be31c67b35f0d5c4cce450a4e8475f0eb26ae243f5bfd701a524305282b0873a/cache-c112e503a36ca98c.arrow


Dataset({
    features: ['labels', 'options', 'article', 'input_ids', 'attention_mask'],
    num_rows: 7098
})

In [None]:
example = [
    {k: v for k, v in tokenized_merged[-i].items() if k in accepted_keys}
    for i in range(1, 11)
]
example_batch = DataCollatorForMultipleChoice(tokenizer=tokenizer)(example)
print(example_batch["input_ids"].shape, example_batch["labels"].shape)

show_one_mutual(tokenized_merged[-1])
[
    tokenizer.decode(example_batch["input_ids"][0][i].tolist())
    for i in range(len(example_batch["input_ids"][0]))
]


torch.Size([10, 4, 512]) torch.Size([10])
Context: There are some very good inventions which, for one reason or another, don't become popular. These inventions should be better known, even though I think that some of them are crazy. Let's have a look at some of these inventions and see if you agree that they should be more successful. The Australians had a great idea to stop people from drinking and driving. The idea was that if a driver wanted to start the car, she or he would have to blow into a bag first. If there was too much alcohol   in their breath, the car wouldn't start. It sounded like a great idea to me, but people said that they might need to drive the car in an emergency   even if they had drunk too much alcohol. Another idea I liked was an invention by a scientist who thought his children watched too much TV. He connected the TV to an exercise bike so that the electricity to power the TV was produced by the bike. If the children wanted to watch a lot of TV, they had to pe

['<s>There are some very good inventions which, for one reason or another, don\'t become popular. These inventions should be better known, even though I think that some of them are crazy. Let\'s have a look at some of these inventions and see if you agree that they should be more successful. The Australians had a great idea to stop people from drinking and driving. The idea was that if a driver wanted to start the car, she or he would have to blow into a bag first. If there was too much alcohol   in their breath, the car wouldn\'t start. It sounded like a great idea to me, but people said that they might need to drive the car in an emergency   even if they had drunk too much alcohol. Another idea I liked was an invention by a scientist who thought his children watched too much TV. He connected the TV to an exercise bike so that the electricity to power the TV was produced by the bike. If the children wanted to watch a lot of TV, they had to pedal   very hard. I found another invention 