# HW1 Starter Snippets

Unless otherwise stated, (nearly) all your questions can be answered in the Hugging Face 🤗 documentation for
[transformers](https://huggingface.co/docs/transformers/index), 
[tokenizers](https://huggingface.co/docs/tokenizers/index), 
[datasets](https://huggingface.co/docs/datasets/index), or 
[evaluate](https://huggingface.co/docs/evaluate/index).
They're absolutely awesome docs.


Hugging Face downloads and caches models, datasets, and other files in a cache directory, so you need a few GB of free space somewhere for it.
Setting the `HF_HOME` environment variable tells the library where you want things to be cached. This is not the same as saving trained models; you specify those paths individually.

The cache saves you network bandwidth and download time by fetching a local copy when possible. It also lives up to its name in some cool ways by avoiding repeated preprocessing/computation.

In [1]:
import os
os.environ['HF_HOME'] = "/cmlscratch/jkirchen/.cache/huggingface"

## Basic Tokenization Operations

In [2]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")



Turning a string into input token ids - the inputs to any transformer language model

In [3]:
encoded_input = tokenizer("this is going to be a fun programming assignment.")
print(encoded_input)

{'input_ids': [101, 1142, 1110, 1280, 1106, 1129, 170, 4106, 4159, 8641, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


Going back to strings

In [4]:
tokenizer.decode(encoded_input["input_ids"])

'[CLS] this is going to be a fun programming assignment. [SEP]'

You can tokenize in batches ... try out the different arguments 

In [5]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences)
# encoded_input = tokenizer(batch_sentences, padding=True)
# encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
# encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
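To make the `padding` and `truncation` arguments concrete, here is a toy re-implementation of what padding to the longest sequence does to `input_ids` and `attention_mask`. It is a pure-Python sketch, not the tokenizer's own code: the `pad_batch` helper and the short id lists are made up for illustration, and the pad id of 0 is an assumption (it matches BERT's `[PAD]` token).

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Right-pad (and optionally truncate) lists of token ids to a common length."""
    if max_length is not None:
        # Naive truncation: simply drop ids beyond max_length.
        # (A real tokenizer also keeps the closing special token in place.)
        batch_ids = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ids standing in for three sentences of different lengths.
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102], [101, 7, 8, 102]]
padded = pad_batch(toy_batch)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```

Without padding, the three lists have different lengths and cannot be stacked into one tensor, which is why `return_tensors="pt"` is normally combined with `padding=True`.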


## Basic Datasets Operations

In [6]:
from datasets import load_dataset

We can load the train split of the MNLI dataset and see an overview of what "shape" it is

In [7]:
mnli_dataset = load_dataset("multi_nli", split="train")
print(mnli_dataset)

Found cached dataset multi_nli (/cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39)


Dataset({
    features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
    num_rows: 392702
})


Inspecting an actual example

In [8]:
mnli_dataset[10]

{'promptID': 55785,
 'pairID': '55785e',
 'premise': 'I burst through a set of cabin doors, and fell to the ground-',
 'premise_binary_parse': '( I ( ( ( ( ( burst ( through ( ( a set ) ( of ( cabin doors ) ) ) ) ) , ) and ) ( fell ( to ( the ground ) ) ) ) - ) )',
 'premise_parse': '(ROOT (S (NP (PRP I)) (VP (VP (VBP burst) (PP (IN through) (NP (NP (DT a) (NN set)) (PP (IN of) (NP (NN cabin) (NNS doors)))))) (, ,) (CC and) (VP (VBD fell) (PP (TO to) (NP (DT the) (NN ground))))) (: -)))',
 'hypothesis': 'I burst through the doors and fell down.',
 'hypothesis_binary_parse': '( I ( ( ( ( burst ( through ( the doors ) ) ) and ) ( fell down ) ) . ) )',
 'hypothesis_parse': '(ROOT (S (NP (PRP I)) (VP (VP (VBP burst) (PP (IN through) (NP (DT the) (NNS doors)))) (CC and) (VP (VBD fell) (PRT (RP down)))) (. .)))',
 'genre': 'fiction',
 'label': 0}

We can also load SQuAD the same way

In [9]:
squad_dataset = load_dataset("squad_v2", split="train")
print(squad_dataset)

Found cached dataset squad_v2 (/cmlscratch/jkirchen/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d)


Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 130319
})


And finally NER

In [10]:
wikineural_dataset = load_dataset("Babelscape/wikineural", split="train_en")
print(wikineural_dataset)

Found cached dataset parquet (/cmlscratch/jkirchen/.cache/huggingface/datasets/Babelscape___parquet/Babelscape--wikineural-579d1dc98d2a6b93/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)


Dataset({
    features: ['tokens', 'ner_tags', 'lang'],
    num_rows: 92720
})


In [11]:
wikineural_dataset.features

{'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'lang': Value(dtype='string', id=None)}

In [12]:
wikineural_dataset[1]

{'tokens': ['"',
  'So',
  'here',
  'is',
  'the',
  'balance',
  'NBC',
  'has',
  'to',
  'consider',
  ':',
  'The',
  'Who',
  ',',
  "'",
  'Animal',
  'Practice',
  "'",
  '.'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 8, 0, 0, 7, 8, 0, 0],
 'lang': 'en'}

### Useful operation: adding a derived column via map

Note: run it again and see how the result is loaded from the cache. Cool!
This works as long as both the input to the map and the function applied in the map are unchanged.
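The caching behavior in the note above can be pictured as a lookup keyed on a fingerprint of the input data plus the function. The sketch below is a toy illustration of that idea only: `toy_fingerprint`, `cached_map`, and the in-memory `_cache` are hypothetical names, and the fingerprinting in the real `datasets` library is more involved and stores results on disk.

```python
import hashlib

_cache = {}  # fingerprint -> previously computed result


def toy_fingerprint(rows, fn):
    """Hash the input rows together with the function's bytecode and constants."""
    h = hashlib.sha256()
    h.update(repr(rows).encode())
    h.update(fn.__code__.co_code)
    h.update(repr(fn.__code__.co_consts).encode())
    return h.hexdigest()


def cached_map(rows, fn):
    """Apply fn to every row, reusing the stored result when nothing changed."""
    key = toy_fingerprint(rows, fn)
    if key in _cache:
        return _cache[key], True  # cache hit
    result = [{**row, **fn(row)} for row in rows]
    _cache[key] = result
    return result, False  # cache miss, computed fresh


def add_length(row):
    # Derived column: number of tokens in the row.
    return {"n_tokens": len(row["tokens"])}


rows = [{"tokens": ["So", "here"]}, {"tokens": ["NBC"]}]
first, hit_first = cached_map(rows, add_length)
second, hit_second = cached_map(rows, add_length)
print(hit_first, hit_second)
```

Changing either the rows or the body of the function changes the fingerprint, so the work is redone; that is the intuition behind the "same input, same function" condition.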

In [13]:
# from the dataset card (https://huggingface.co/datasets/Babelscape/wikineural)
tag_set = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
tag_set = {v: k for k, v in tag_set.items()}

In [14]:
enriched_dataset = wikineural_dataset.map(lambda example: {'labels': [tag_set[tag] for tag in example['ner_tags']]})
print(enriched_dataset)

Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/Babelscape___parquet/Babelscape--wikineural-579d1dc98d2a6b93/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-04d0b942ec517e80.arrow


Dataset({
    features: ['tokens', 'ner_tags', 'lang', 'labels'],
    num_rows: 92720
})


In [15]:
enriched_dataset[1]

{'tokens': ['"',
  'So',
  'here',
  'is',
  'the',
  'balance',
  'NBC',
  'has',
  'to',
  'consider',
  ':',
  'The',
  'Who',
  ',',
  "'",
  'Animal',
  'Practice',
  "'",
  '.'],
 'ner_tags': [0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 8, 0, 0, 7, 8, 0, 0],
 'lang': 'en',
 'labels': ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-ORG',
  'O',
  'O',
  'O',
  'O',
  'B-MISC',
  'I-MISC',
  'O',
  'O',
  'B-MISC',
  'I-MISC',
  'O',
  'O']}

## Basic Transformers Usage

Loading different task-specific variants of a BERT Model 

These blocks are the most basic (and automagic) way to load a BERT model with a classification head for
tasks such as MNLI, extractive QA (MRC) on data like SQuAD, and token-wise classification for tasks like NER.

In [16]:
from transformers import (AutoConfig, 
                          AutoTokenizer, 
                          AutoModelForSequenceClassification, 
                          AutoModelForQuestionAnswering,
                          AutoModelForTokenClassification)


We start with an identifier on huggingface.co/models

For this assignment, generally, this is just the name for the standard base BERT encoder model (with a masked language model head by default)

The fact that it's the pretrained model name for just the base/core model impacts the subsequent load calls below...


In [17]:
hf_model_name_or_path = "bert-base-cased"

# Load the model configuration
config = AutoConfig.from_pretrained(hf_model_name_or_path)
# passing the num_labels argument is optional, as it can/should be inferred from the dataset

# Load the tokenizer for this model
tokenizer = AutoTokenizer.from_pretrained(hf_model_name_or_path)

Take a look at the messages that appear; they are very informative about the relationship between the BERT encoder weights and the task head.

In [18]:
# Load the sequence classification/NLI model
seq_cls_model = AutoModelForSequenceClassification.from_pretrained(hf_model_name_or_path, config=config)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [19]:
# Load the MRC/question answering model
qa_model = AutoModelForQuestionAnswering.from_pretrained(hf_model_name_or_path, config=config)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForQuestionAnswering: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and a

In [20]:
# Load the token classification/NER model
ner_model = AutoModelForTokenClassification.from_pretrained(hf_model_name_or_path, config=config)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

## Some static versions of the outputs of these types of BERT with task head models

To produce these yourself, run one of the `no_trainer` scripts included and stop in the training batches loop to inspect `outputs`


### Example output for seq_cls_model (like for NLI):

```sh
SequenceClassifierOutput(loss=tensor(0.7699, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[ 0.1040, -0.6909,  0.4659],
        [-0.9791,  0.7396, -0.0732],
        [-0.3110, -0.6021,  1.0554],
        [ 1.7014, -0.6825, -1.1260],
        [-1.2129,  0.5246,  0.7338],
        [-1.5830, -0.4266,  2.4691],
        [-0.7301,  1.1808, -0.6139],
        [ 1.8110, -1.2814, -0.7678],
        [ 1.9778, -1.0482, -1.2913],
        [-1.0419,  0.1376,  1.0747],
        [-1.4279,  1.4548,  0.5795],
        [ 1.4805, -0.8703, -0.8966],
        [ 1.8836, -0.9205, -1.1406],
        [ 1.6764, -0.8014, -0.5330],
        [ 1.5336, -0.7486, -1.1418],
        [-1.2788, -1.4566,  2.5621],
        [-1.5375,  1.5254,  0.3369],
        [-0.9993,  0.6337,  0.5967],
        [ 0.1783, -0.3170,  0.0619],
        [ 1.2508, -1.3081, -0.3031],
        [-1.7973,  0.0338,  1.7097],
        [-0.6461, -0.6172,  1.2520],
        [ 0.1375, -0.9910,  1.1393],
        [-0.7546,  0.8241, -0.2152],
        [-1.3073,  0.8289, -0.0564],
        [-1.8356,  0.6909,  1.2432],
        [-1.6042,  0.2484,  2.3209],
        [ 1.3790, -0.7681, -1.0276],
        [-1.5430, -0.7665,  3.1297],
        [-1.4621,  1.5816,  0.4027],
        [-0.5484,  0.0363,  0.4930],
        [ 1.3508, -1.0034, -0.7355]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
```

For a 3-class problem, the shape of the `logits` tensor is `(bsz x num_classes)`, which here is `(32, 3)`.
The `loss` is the multiclass cross-entropy (CE) classification loss, averaged over the batch dimension.
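As a rough illustration of how the logits relate to predictions and to the loss, the sketch below recomputes both by hand in pure Python for the first four rows of the logits printed above. The `labels` list is invented for the example (the real batch labels are not shown), so the resulting loss value is not expected to match the `0.7699` in the output.

```python
import math

# First four rows of the logits printed above.
logits = [
    [0.1040, -0.6909, 0.4659],
    [-0.9791, 0.7396, -0.0732],
    [-0.3110, -0.6021, 1.0554],
    [1.7014, -0.6825, -1.1260],
]
# Hypothetical gold labels, invented for this example.
labels = [2, 1, 0, 0]


def cross_entropy(row, label):
    """Negative log-probability of the gold class under softmax(row)."""
    m = max(row)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return log_z - row[label]


# Predicted class = index of the largest logit in each row.
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
# Mean over the batch dimension, like the default "mean" reduction.
loss = sum(cross_entropy(r, y) for r, y in zip(logits, labels)) / len(logits)
print("predictions:", preds)
print("mean CE loss:", round(loss, 4))
```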


### Example output shape for qa_model (for SQuAD MRC):

```sh
QuestionAnsweringModelOutput(loss=tensor(5.8717, device='cuda:0', grad_fn=<DivBackward0>), start_logits=tensor([[-0.2344, -0.1153, -0.3226,  ..., -0.2264, -0.0654, -0.2667],
        [-0.0394, -0.1783, -0.1586,  ..., -0.2669, -0.1833, -0.5015],
        [-0.0489,  0.3059,  0.0471,  ..., -0.2551, -0.3042, -0.2128],
        ...,
        [-0.1104, -0.3046,  0.2381,  ..., -0.0628, -0.1989, -0.0310],
        [-0.1292, -0.4169, -0.3258,  ..., -0.1559, -0.1705, -0.1658],
        [-0.3442, -0.0563, -0.1119,  ..., -0.2945, -0.3721, -0.2272]],
       device='cuda:0', grad_fn=<CloneBackward0>), end_logits=tensor([[-0.5101, -0.2102,  0.3540,  ..., -0.1394, -0.0127, -0.0557],
        [-0.5998,  0.1171,  0.0030,  ..., -0.1966, -0.1890, -0.1973],
        [-0.4303, -0.2212,  0.7105,  ..., -0.4232, -0.2996, -0.2604],
        ...,
        [-0.2798, -0.3308,  0.2352,  ..., -0.0017, -0.0638,  0.0833],
        [-0.2013,  0.0410,  0.2185,  ..., -0.0690, -0.0288,  0.1315],
        [-0.4846, -0.0881, -0.2841,  ..., -0.3322, -0.1482, -0.3584]],
       device='cuda:0', grad_fn=<CloneBackward0>), hidden_states=None, attentions=None)
```

For the span prediction head, the model produces two output tensors for every input text/question.
These have shape `(bsz x seq_length)` and give the model's prediction for the start position and end position of the answer span in the input sequence (which includes the "passage/context" being read). In this example they are `(12, 384)`

The `loss` in the output object (`total_loss`) is the CE classification loss between the start/end logits and the true start/end_positions (labels), averaged over the batch dim:

```python
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
```
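At inference time, the two logit vectors are commonly turned into an answer span by picking the `(start, end)` pair with the highest combined score, subject to `start <= end`. Here is a minimal pure-Python sketch for a single example; `best_span`, `max_answer_len`, and the toy logits are assumptions for illustration and are not part of the starter code.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logit + end_logit with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length budget.
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy logits over a 6-token sequence (made up). Taking the two argmaxes
# independently would give start=1, end=0, which is not a valid span.
start_logits = [0.1, 2.0, 0.3, -1.0, 0.5, 0.0]
end_logits = [2.5, 0.2, 0.1, 1.8, -0.5, 0.0]
start, end, score = best_span(start_logits, end_logits)
print(start, end, round(score, 2))
```

Searching over pairs, rather than taking the two argmaxes separately, is what rules out spans whose end comes before their start.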

### Example output shape for an NER model

```python
TokenClassifierOutput(loss=tensor(2.1217, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[[ 0.2860, -0.2712,  0.2737,  ..., -0.3259,  0.2023, -0.0978],
         [ 0.7092, -0.0134, -0.1322,  ..., -0.2983,  0.3161, -0.0509],
         [ 0.4163, -0.5686, -0.0563,  ...,  0.0067,  0.3946, -0.2604],
         ...,
         [ 0.2332,  0.1479,  0.2424,  ..., -0.1677,  0.0181,  0.0685],
         [ 0.5098, -0.0690,  0.2054,  ..., -0.4173,  0.1323, -0.1845],
         [ 0.4707, -0.1589,  0.1457,  ..., -0.3499,  0.2403, -0.1300]],

        [[ 0.1392, -0.1065,  0.3485,  ..., -0.5171, -0.0702,  0.2156],
         [-0.1113,  0.2093,  0.2295,  ...,  0.1406,  0.0407, -0.4870],
         [-0.3515, -0.1806,  0.1416,  ...,  0.1510,  0.2850, -0.4419],
         ...,
         [-0.0244, -0.2980,  0.2082,  ..., -0.3929,  0.2233, -0.3086],
         [-0.1493, -0.1583,  0.0858,  ..., -0.3656, -0.1608,  0.0397],
         [-0.0885,  0.1128, -0.2955,  ..., -0.3246, -0.2501, -0.1332]],

        [[-0.4092, -0.6252, -0.0032,  ..., -0.4688,  0.4135, -0.3822],
         [ 0.1071,  0.2398,  0.1925,  ..., -0.6402,  0.1205, -0.6248],
         [ 0.5574,  0.0604, -0.0136,  ..., -0.3985,  0.3627, -0.3857],
         ...,
         [ 0.3345, -0.6312, -0.1251,  ..., -0.1982,  0.3637, -0.8203],
         [ 0.2451, -0.2064,  0.0295,  ..., -0.1030,  0.4884, -0.1153],
         [ 0.1495,  0.0738, -0.1557,  ..., -0.0072,  0.4927,  0.6017]],

        ...,

        [[-0.3878,  0.0331, -0.2140,  ..., -0.3954,  0.0785, -0.1093],
         [-0.2488, -0.3900, -0.3403,  ..., -0.1708, -0.2019, -0.0287],
         [-0.2760,  0.3549,  0.0586,  ..., -0.1434,  0.1379,  0.1815],
         ...,
         [-0.1993, -0.4495,  0.0578,  ..., -0.4326,  0.2391, -0.2390],
         [-0.0191, -0.4702,  0.0100,  ..., -0.3972,  0.1565, -0.3812],
         [-0.1653, -0.0455, -0.3373,  ..., -0.2308, -0.1534, -0.1001]],

        [[-0.1269, -0.4377, -0.0396,  ..., -0.3559,  0.0533, -0.3187],
         [ 0.1010, -0.3163, -0.0770,  ..., -0.3450, -0.3520, -0.0923],
         [ 0.1018, -0.0359,  0.2626,  ..., -0.5112,  0.3169,  0.0298],
         ...,
         [ 0.1833, -0.2909,  0.1263,  ..., -0.3456,  0.1408, -0.1342],
         [ 0.0845, -0.3480,  0.1911,  ..., -0.3855,  0.2223, -0.2479],
         [ 0.3148, -0.0760,  0.2829,  ..., -0.2543,  0.2454, -0.3342]],

        [[-0.0261, -0.2420,  0.2130,  ..., -0.3749,  0.0549,  0.4439],
         [-0.0817, -0.3362, -0.1929,  ..., -0.0361,  0.2145,  0.3001],
         [-0.0170, -0.0779,  0.1517,  ..., -0.1094,  0.2054,  0.3024],
         ...,
         [ 0.1930, -0.2856,  0.0041,  ..., -0.0372,  0.3425,  0.1241],
         [ 0.0289, -0.0881, -0.0745,  ..., -0.2711,  0.2272, -0.0776],
         [-0.0857, -0.2791,  0.0754,  ..., -0.2199,  0.0361,  0.0086]]],
       device='cuda:0', grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
```

For token classification the logits have shape `(bsz x seq_length x num_classes)`, where the classes in the NER task are:
```
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
```
The `loss` is the CE multi-class classification loss comparing the predicted tag to the actual tag for each token position.
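To read tags off a `(seq_length x num_classes)` slice of the logits, take the argmax over the class dimension at each token position and map the id back to its tag name. The sketch below does this in pure Python for one made-up sequence; `decode_tags` and `scores_for` are hypothetical helpers, and skipping positions labelled `-100` is an assumption based on the usual convention for special tokens and padding.

```python
tag_to_id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
             'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id_to_tag = {v: k for k, v in tag_to_id.items()}


def decode_tags(token_logits, label_ids=None, ignore_index=-100):
    """Argmax over classes for each token, skipping positions marked as ignored."""
    tags = []
    for pos, scores in enumerate(token_logits):
        if label_ids is not None and label_ids[pos] == ignore_index:
            continue  # e.g. special tokens or padding carry no tag
        best = max(range(len(scores)), key=scores.__getitem__)
        tags.append(id_to_tag[best])
    return tags


def scores_for(tag):
    """Build a toy logit row whose largest entry is at the given tag."""
    row = [0.0] * len(tag_to_id)
    row[tag_to_id[tag]] = 1.0
    return row


# One sequence of four positions; the first and last stand in for special tokens.
token_logits = [scores_for("O"), scores_for("B-ORG"), scores_for("I-ORG"), scores_for("O")]
label_ids = [-100, 3, 4, -100]
print(decode_tags(token_logits, label_ids))
```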

## A more assignment-specific thing with datasets...

Get all the unique values in the genre column (domain splits)

In [21]:
mnli_dataset = load_dataset("multi_nli")

# drop the "validation_mismatched" split
# (a DatasetDict behaves like a dict of splits, so `del` removes a whole split;
#  `remove_columns` would drop columns, not splits)
del mnli_dataset["validation_mismatched"]

# Get all the unique values in the genre column (plain Python alternative)
# genres = set(mnli_dataset["train"]["genre"])

# Get all the unique values in the genre column (built-in Dataset.unique)
genres = mnli_dataset["train"].unique("genre")
genres

Found cached dataset multi_nli (/cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39)
100%|██████████| 3/3 [00:00<00:00, 209.89it/s]


['government', 'telephone', 'fiction', 'travel', 'slate']

### Steps

1. We create a Python dictionary to store the different datasets
2. Using the genres from the previous cell, we iterate over the unique genres and create a new dataset that contains only the rows of that genre
3. We use the `DatasetDict` from Hugging Face to collect all those subsets

In [22]:
# Create a dictionary to store the different datasets
genre_subsets = {}

# Loop through the unique genres and create a new dataset for each genre
# just 2 for display purposes
genres = genres[:2]
for genre in genres:
    genre_subsets[genre] = mnli_dataset.filter(lambda example: example['genre'] == genre)

# Collect the genre-specific datasets into Hugging Face Datasets
from datasets import DatasetDict

genre_subsets_dict = DatasetDict()
for genre, dataset in genre_subsets.items():
    genre_subsets_dict[genre] = dataset

genre_subsets_dict

Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-6ad602a244b8c0fa.arrow
Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-8cb959e48d09b3c7.arrow
Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-39caa2d64fcfe2fa.arrow
Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-8d54e02aefc38a2f.arrow


DatasetDict({
    government: DatasetDict({
        train: Dataset({
            features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
            num_rows: 77350
        })
        validation_matched: Dataset({
            features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
            num_rows: 1945
        })
    })
    telephone: DatasetDict({
        train: Dataset({
            features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothesis_parse', 'genre', 'label'],
            num_rows: 83348
        })
        validation_matched: Dataset({
            features: ['promptID', 'pairID', 'premise', 'premise_binary_parse', 'premise_parse', 'hypothesis', 'hypothesis_binary_parse', 'hypothes

In [23]:
# Print the number of train examples in each genre-specific dataset
for genre in genres:
    print(f"{genre} subset: {len(genre_subsets_dict[genre]['train'])} examples")

government subset: 77350 examples
telephone subset: 83348 examples


In [24]:
print(genre_subsets_dict["government"]["train"][0])

{'promptID': 31193, 'pairID': '31193n', 'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.', 'premise_binary_parse': '( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )', 'premise_parse': '(ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .)))', 'hypothesis': 'Product and geography are what make cream skimming work. ', 'hypothesis_binary_parse': '( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )', 'hypothesis_parse': '(ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .)))', 'genre': 'government', 'label': 1}


## Finally, check out task_sampler.py for some Part 2 (and really part 1 also) inspiration

The main method uses all the above to demo the "task sampling" concept to accomplish weighted multi-tasking
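One way to picture weighted task sampling: at each training step, draw a task name with probability proportional to its weight, then take the next batch from that task's dataloader. The sketch below illustrates only this general idea and is not the contents of `task_sampler.py`; the weights and the `sample_task_schedule` helper are hypothetical.

```python
import random
from collections import Counter


def sample_task_schedule(task_weights, num_steps, seed=0):
    """Draw one task name per training step, proportional to its weight."""
    rng = random.Random(seed)  # local RNG so the schedule is reproducible
    names = list(task_weights)
    weights = [task_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=num_steps)


# Made-up weights: NLI batches should appear about three times as often as QA batches.
task_weights = {"nli": 3.0, "qa": 1.0}
schedule = sample_task_schedule(task_weights, num_steps=1000)
print(Counter(schedule))
```

In a training loop, each entry of `schedule` would select which model and dataloader to use for that step.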

## MWE of multiple BERTs with a shared backbone

In [25]:
import torch
from datasets import load_dataset
from transformers import (default_data_collator,
                          AutoConfig, 
                          AutoTokenizer,
                          AutoModelForSequenceClassification,
                          AutoModelForQuestionAnswering)

hf_model_name_or_path = "bert-base-cased"

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_name_or_path)
# Load the model configuration
nli_config = AutoConfig.from_pretrained(hf_model_name_or_path, num_labels=3, finetuning_task="mnli")
# load the first model, whose bert encoder we'll use as a backbone
nli_model = AutoModelForSequenceClassification.from_pretrained(hf_model_name_or_path, config=nli_config)

qa_config = AutoConfig.from_pretrained(hf_model_name_or_path)
qa_model = AutoModelForQuestionAnswering.from_pretrained(hf_model_name_or_path, config=qa_config)

model_dict = {
    "nli": nli_model,
    "qa": qa_model
}

# share the backbone encoder of the first model with every model
# (plain assignment: all models point at the same module object, no copy is made)
for model in model_dict.values():
    model.bert = nli_model.bert
    # and move to cuda
    model.to("cuda")
    
# optionally, freeze the backbone encoder (and see the effect this has later)
# for param in nli_model.bert.parameters():
#     param.requires_grad = False

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [26]:
# load the mnli dataset and the squad dataset
mnli_dataset = load_dataset("multi_nli", split="train")
squad_dataset = load_dataset("squad", split="train")

# select the first 4 examples from each dataset
mnli_examples = mnli_dataset.select(range(4))
squad_examples = squad_dataset.select(range(4))

Found cached dataset multi_nli (/cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39)
Found cached dataset squad (/cmlscratch/jkirchen/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)


In [27]:
sentence1_key = "premise"
sentence2_key = "hypothesis"
padding = "max_length"
max_seq_length = 128

def preprocess_function(examples):
    # Tokenize the texts
    texts = (
        (examples[sentence1_key],) if sentence2_key is None else (examples[sentence1_key], examples[sentence2_key])
    )
    result = tokenizer(*texts, padding=padding, max_length=max_seq_length, truncation=True)

    if "label" in examples:
        result["labels"] = examples["label"]
    return result

mnli_examples = mnli_examples.map(
    preprocess_function,
    batched=True,
    remove_columns=mnli_examples.column_names,
    desc="Running tokenizer on dataset",
)

Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/multi_nli/default/0.0.0/591f72eb6263d1ab527561777936b199b714cda156d35716881158a2bd144f39/cache-4f2cb3d6f118dc9d.arrow


In [28]:
column_names = squad_dataset.column_names
pad_on_right = True
max_seq_length = 384
doc_stride = 128
pad_to_max_length  = True

question_column_name = "question" if "question" in column_names else column_names[0]
context_column_name = "context" if "context" in column_names else column_names[1]
answer_column_name = "answers" if "answers" in column_names else column_names[2]

# Training preprocessing
def prepare_train_features(examples):
    # Some of the questions have lots of whitespace on the left, which is not useful and will make the
    # truncation of the context fail (the tokenized question will take a lot of space). So we remove that
    # left whitespace.
    examples[question_column_name] = [q.lstrip() for q in examples[question_column_name]]

    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
    # in one example possibly giving several features when a context is long, each of those features having a
    # context that slightly overlaps the context of the previous feature.
    tokenized_examples = tokenizer(
        examples[question_column_name if pad_on_right else context_column_name],
        examples[context_column_name if pad_on_right else question_column_name],
        truncation="only_second" if pad_on_right else "only_first",
        max_length=max_seq_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length" if pad_to_max_length else False,
    )

    # Since one example might give us several features if it has a long context, we need a map from a feature to
    # its corresponding example. This key gives us just that.
    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    # The offset mappings will give us a map from token to character position in the original context. This will
    # help us compute the start_positions and end_positions.
    offset_mapping = tokenized_examples.pop("offset_mapping")

    # Let's label those examples!
    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        # We will label impossible answers with the index of the CLS token.
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        # Grab the sequence corresponding to that example (to know what is the context and what is the question).
        sequence_ids = tokenized_examples.sequence_ids(i)

        # One example can give several spans, this is the index of the example containing this span of text.
        sample_index = sample_mapping[i]
        answers = examples[answer_column_name][sample_index]
        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != (1 if pad_on_right else 0):
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != (1 if pad_on_right else 0):
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                # Note: we could go after the last offset if the answer is the last word (edge case).
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)
                while offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples

squad_examples = squad_examples.map(prepare_train_features, batched=True, remove_columns=column_names)

Loading cached processed dataset at /cmlscratch/jkirchen/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453/cache-adc3fcf60189f060.arrow
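The character-to-token step inside `prepare_train_features` is easier to follow in isolation. The sketch below uses hand-written offsets for a hypothetical whitespace tokenization of a short context; `char_span_to_token_span` is an illustrative helper that is not part of the script, and it uses a simple overlap test instead of the pointer-walking loops above.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Return (first, last) token indices overlapping [start_char, end_char), or None."""
    covered = [
        i for i, (tok_start, tok_end) in enumerate(offsets)
        if tok_start < end_char and tok_end > start_char
    ]
    if not covered:
        # Answer not inside this window; the script labels such features with the CLS index.
        return None
    return covered[0], covered[-1]


context = "Denver Broncos won Super Bowl 50"
# (start, end) character offsets of each whitespace token in `context`.
offsets = [(0, 6), (7, 14), (15, 18), (19, 24), (25, 29), (30, 32)]

answer = "Super Bowl 50"
start_char = context.index(answer)
end_char = start_char + len(answer)

first, last = char_span_to_token_span(offsets, start_char, end_char)
print(first, last, context[offsets[first][0]:offsets[last][1]])
```

The resulting token indices play the role of `start_positions` and `end_positions`, which is what the QA head is trained to predict.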


In [29]:
# make dataloaders for each dataset
from torch.utils.data import DataLoader

mnli_dataloader = DataLoader(mnli_examples, collate_fn=default_data_collator, batch_size=4)
squad_dataloader = DataLoader(squad_examples, collate_fn=default_data_collator, batch_size=4)

In [30]:
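# construct an optimizer for each model
# (using torch.optim.AdamW; the AdamW class in transformers is deprecated in favor of the PyTorch one)
from torch.optim import AdamW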

optimizers = {
    "nli": AdamW(nli_model.parameters(), lr=1e-3), # large lr to see update in single step
    "qa": AdamW(qa_model.parameters(), lr=1e-3)
}



In [31]:
# Check that the backbone encoder is shared
print(nli_model.bert == qa_model.bert)
# verify by printing the parameter values of the last layer of the backbone encoder
print(nli_model.bert.encoder.layer[11].attention.self.value.weight[0, :5])
print(qa_model.bert.encoder.layer[11].attention.self.value.weight[0, :5])

True
tensor([ 0.0448, -0.0352, -0.0046, -0.0059, -0.0712], device='cuda:0',
       grad_fn=<SliceBackward0>)
tensor([ 0.0448, -0.0352, -0.0046, -0.0059, -0.0712], device='cuda:0',
       grad_fn=<SliceBackward0>)


In [32]:
# Check that the linear classifiers are not shared
print(nli_model.classifier.weight[0, :5])
print(qa_model.qa_outputs.weight[0, :5])

tensor([ 0.0146, -0.0139,  0.0081, -0.0187, -0.0300], device='cuda:0',
       grad_fn=<SliceBackward0>)
tensor([-0.0321, -0.0021, -0.0061,  0.0225,  0.0114], device='cuda:0',
       grad_fn=<SliceBackward0>)


In [33]:
# take a batch from each dataloader and pass it through each model, calculate the loss, and backpropagate

# nli batch 
mnli_batch = next(iter(mnli_dataloader))
mnli_batch = {k: v.to("cuda") for k, v in mnli_batch.items()}
# qa batch
squad_batch = next(iter(squad_dataloader))
squad_batch = {k: v.to("cuda") for k, v in squad_batch.items()}

# NLI
nli_model.train()
nli_model.zero_grad()
nli_outputs = nli_model(**mnli_batch)
nli_loss = nli_outputs.loss
nli_loss.backward()
optimizers["nli"].step()

# QA
qa_model.train()
qa_model.zero_grad()
qa_outputs = qa_model(**squad_batch)
qa_loss = qa_outputs.loss
qa_loss.backward()
optimizers["qa"].step()

In [34]:
# check that the backbone encoder is shared and still equal, but with different params than before
print(nli_model.bert == qa_model.bert)
print(nli_model.bert.encoder.layer[11].attention.self.value.weight[0, :5])
print(qa_model.bert.encoder.layer[11].attention.self.value.weight[0, :5])

True
tensor([ 0.0462, -0.0335, -0.0029, -0.0060, -0.0697], device='cuda:0',
       grad_fn=<SliceBackward0>)
tensor([ 0.0462, -0.0335, -0.0029, -0.0060, -0.0697], device='cuda:0',
       grad_fn=<SliceBackward0>)


In [35]:
# Check that the linear classifiers are not shared, and updated
print(nli_model.classifier.weight[0, :5])
print(qa_model.qa_outputs.weight[0, :5])

tensor([ 0.0136, -0.0129,  0.0091, -0.0197, -0.0290], device='cuda:0',
       grad_fn=<SliceBackward0>)
tensor([-0.0311, -0.0011, -0.0071,  0.0235,  0.0124], device='cuda:0',
       grad_fn=<SliceBackward0>)
