# Drilling down into the Multiple Choice downstream task


When I started studying language models, I came across the following figure from the OpenAI Transformer paper [@Radford2018ImprovingLU]:



![](images/downstream-gpt.png "Example of fine-tuning tasks from GPT paper")

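The figure shows that the same pre-trained Transformer is reused for every downstream task; the only difference is that the **input data** has to be formatted _slightly_ differently for each one. For multiple choice, the paper says: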

> For these tasks, we are given a context document $z$, a question $q$, and a set of possible answers $\{a_k\}$. We concatenate the document context and question with each possible answer, adding a delimiter token in between to get [$z$; $q$; _\$_ ; $a_k$]. Each of these sequences are **processed independently with our model and then normalized via a softmax layer to produce an output distribution over possible answers**.

Therefore, the model can be trained on these inputs with a [Categorical Cross Entropy Loss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html), where the number of classes $C$ is the number of options available for a specific question.
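
To make the objective concrete, here is one way to write it down. The symbols $s_k$ (the scalar score the model assigns to the $k$-th concatenated sequence) and $y$ (the index of the correct answer) are notation introduced here for illustration; they are not taken from the paper:

$$
P(a_k \mid z, q) = \frac{\exp(s_k)}{\sum_{j=1}^{C} \exp(s_j)},
\qquad
\mathcal{L} = -\log P(a_y \mid z, q)
$$

This is the quantity that `torch.nn.CrossEntropyLoss` computes when it receives the $C$ scores as logits and $y$ as the target class.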


# From GPT to BERT

As we will see with [Hugging Face's Transformers library](https://huggingface.co/transformers/), when we consider a fine-tuning task, BERT's approach can be derived directly from the technique presented by [@Radford2018ImprovingLU].
This can be checked in the [documentation](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForMultipleChoice):

> Bert Model with a multiple choice classification head on top (a linear layer on top of the pooled output and a softmax) e.g. for RocStories/SWAG tasks.

In [5]:
#| code-fold: true
import numpy as np
import torch
from transformers import BertTokenizer, BertForMultipleChoice
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMultipleChoice.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSeque

In [6]:
question = "George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?"
option_a = "dry palms"
option_b = "wet palms"
option_c = "palms covered with oil"
option_d = "palms covered with lotion"

In this case, option A is the correct one, so the label is index 0. Furthermore, the batch size here is 1.

In [7]:
labels = torch.tensor(0).unsqueeze(0)  # index of the correct option (A), shape (batch_size,) = (1,)

Notice that the question is repeated once per option, so that each (question, option) pair is encoded as its own sequence:

In [11]:
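# Encode the four (question, option) pairs; padding aligns them to the same length.
encoding = tokenizer(
    [question, question, question, question],
    [option_a, option_b, option_c, option_d],
    return_tensors="pt",
    padding=True,
)

# Add the batch dimension: (num_choices, seq_len) -> (1, num_choices, seq_len).
outputs = model(**{k: v.unsqueeze(0) for k, v in encoding.items()}, labels=labels)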
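
The `unsqueeze(0)` call above is needed because `BertForMultipleChoice` expects inputs shaped `(batch_size, num_choices, sequence_length)`. As far as I understand it, the model then flattens the first two dimensions, encodes every pair independently, scores each one with a linear layer, and regroups the scores per question. The sketch below only mimics that shape flow with NumPy: the pooled output and the linear weights are random stand-ins, and all sizes are illustrative assumptions rather than values read from the real model.

```python
import numpy as np

batch_size, num_choices, seq_len, hidden_size = 1, 4, 12, 768  # illustrative sizes
rng = np.random.default_rng(0)

# Input as passed to the model: one row of token ids per (question, option) pair.
input_ids = rng.integers(0, 30522, size=(batch_size, num_choices, seq_len))

# 1. Flatten batch and choices so that each pair is encoded independently.
flat_input_ids = input_ids.reshape(-1, seq_len)  # (4, 12)

# 2. Stand-in for BERT's pooled output: one vector per pair (random here).
pooled_output = rng.standard_normal((flat_input_ids.shape[0], hidden_size))  # (4, 768)

# 3. Linear head mapping each pooled vector to a single score.
weight = rng.standard_normal((hidden_size, 1)) * 0.02
bias = np.zeros(1)
flat_logits = pooled_output @ weight + bias  # (4, 1)

# 4. Regroup the scores so that each row holds the logits for one question.
logits = flat_logits.reshape(-1, num_choices)  # (1, 4)
print(logits.shape)
```

The final shape, one row per question and one column per option, is what makes a softmax over the last dimension a distribution over the possible answers.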

:::{.callout-important}

Notice that if we have a dataset such as SQuAD, where each question comes with a context, we could concatenate this context with either the question text or the option text. We would then have the ($z$, $q$, $a_k$) tuple described in the [OpenAI Transformer paper](https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf).

:::
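
A minimal sketch of that idea, assuming we simply place the context in front of the question. The helper `build_multiple_choice_pairs` and the context sentence are hypothetical and only serve as an illustration; they do not come from the library or from any dataset.

```python
def build_multiple_choice_pairs(context, question, options):
    """Return (first_segments, second_segments) for a tokenizer that accepts text pairs.

    Hypothetical helper: the context is placed in front of the question,
    and each option is kept as the second segment.
    """
    first_segment = f"{context} {question}".strip()
    first_segments = [first_segment] * len(options)
    second_segments = list(options)
    return first_segments, second_segments


# Made-up context, only for illustration.
context = "Friction between two dry surfaces converts motion into heat."
question = "Which skin surface will produce the most heat?"
options = ["dry palms", "wet palms", "palms covered with oil", "palms covered with lotion"]

first_segments, second_segments = build_multiple_choice_pairs(context, question, options)

# "[SEP]" is printed here only to show the split; the tokenizer adds the real separator.
for first, second in zip(first_segments, second_segments):
    print(f"{first} [SEP] {second}")
```

The two lists could then be passed to the tokenizer in the same way as the question and option lists above. With long contexts, truncation would probably be needed to stay within BERT's maximum sequence length.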

The output comes from a linear layer that produces one logit per option, and it is still trained with a cross-entropy loss. As stated by the documentation, we still need to apply a softmax to the logits to obtain probabilities.

In [16]:
loss = outputs.loss
logits = outputs.logits


Raw logits from the linear layer:

In [25]:
#| echo: false
logits

tensor([[-0.3457, -0.3295, -0.3271, -0.3342]], grad_fn=<ViewBackward>)

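Probabilities after applying the softmax function. Since the multiple choice head is freshly initialized and the model has not been fine-tuned on this task, the near-uniform distribution below (about 0.25 per option) is expected: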

In [28]:
#| echo: false

torch.nn.functional.softmax(logits, dim=-1)

tensor([[0.2471, 0.2511, 0.2518, 0.2500]], grad_fn=<SoftmaxBackward>)
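
As a sanity check, the softmax can be recomputed by hand from the printed logits, together with the cross-entropy loss for the correct option (index 0). This is a plain Python sketch that uses only the standard library. Because the printed logits are rounded to four decimals, the last digit of the probabilities may differ slightly from the output above.

```python
import math

# Logits copied from the output above (rounded to four decimals).
logits = [-0.3457, -0.3295, -0.3271, -0.3342]
label = 0  # option A

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
max_logit = max(logits)
exps = [math.exp(x - max_logit) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Cross-entropy for a single example: negative log-probability of the true option.
loss = -math.log(probs[label])

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The resulting loss is about 1.40, close to $\ln 4 \approx 1.386$, which is what a model that assigns equal probability to four options would get.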

# Conclusion

Congratulations! Together with the [first part](https://abarbosa94.github.io/personal_blog/masters/nlp/knowledge-distill/2020/09/19/Distilling-BERT.html), you have now seen the end-to-end BERT flow :)