# HW1 Starter Snippets

Unless otherwise stated, (nearly) all your questions can be answered in the Hugging Face 🤗 documentation for
[transformers](https://huggingface.co/docs/transformers/index), 
[tokenizers](https://huggingface.co/docs/tokenizers/index), 
[datasets](https://huggingface.co/docs/datasets/index), or 
[evaluate](https://huggingface.co/docs/evaluate/index).
They're absolutely awesome docs.


Hugging Face downloads and caches models, datasets, and other files in a cache directory, so you need a few gigabytes of free space somewhere for this.
Setting the `HF_HOME` environment variable tells the library where you want things to be cached. This is not the same as saving trained models; you specify those paths individually.

The cache saves you network bandwidth and download time by reusing a local copy when possible. It also lives up to its name in some cool ways by avoiding repeated preprocessing/computation.

In [None]:
import os
os.environ['HF_HOME'] = "/cmlscratch/jkirchen/.cache/huggingface"
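
Since the cache needs a few gigabytes, it can help to check the target location before downloading anything. The snippet below is only a rough sketch: the helper names `resolve_hf_home` and `free_gigabytes` are hypothetical (they are not part of any Hugging Face library), and the fallback path `~/.cache/huggingface` is assumed to be the library default. Also note that `HF_HOME` should be set before `transformers` or `datasets` is imported, since the libraries generally read it at import time.

```python
import os
import shutil

def resolve_hf_home(default="~/.cache/huggingface"):
    """Return the cache root: $HF_HOME if set, else the assumed default."""
    return os.path.expanduser(os.environ.get("HF_HOME", default))

def free_gigabytes(path):
    """Free space (in GB) on the filesystem that would hold `path`."""
    # Walk up to the nearest existing directory so the check also works
    # when the cache directory has not been created yet.
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe)
    return shutil.disk_usage(probe).free / 1e9

hf_home = resolve_hf_home()
print(f"cache root: {hf_home}")
print(f"free space: {free_gigabytes(hf_home):.1f} GB")
```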

## Basic Tokenization Operations

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Turning a string into input token ids - the inputs to any transformer language model

In [None]:
encoded_input = tokenizer("this is going to be a fun programming assignment.")
print(encoded_input)

Going back to strings

In [None]:
tokenizer.decode(encoded_input["input_ids"])

You can tokenize in batches ... try out the different arguments 

In [None]:
batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
# encoded_inputs = tokenizer(batch_sentences, padding=True)
# encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True)
# encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_inputs)
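
To build some intuition for what `padding=True` and `truncation=True` do to a batch, here is a plain-Python sketch that operates on lists of token ids. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption (it matches the `[PAD]` id for `bert-base-cased`, but check `tokenizer.pad_token_id` for your model). The real tokenizer also keeps special tokens such as `[CLS]` and `[SEP]` in place when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch_ids, max_length=None, pad_id=0):
    """Mimic padding=True / truncation=True on lists of token ids."""
    if max_length is not None:
        # truncation: cut every sequence to at most max_length tokens
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids, not real tokenizer output.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102], [101, 7, 102]]
padded = pad_and_truncate(batch, max_length=5)
for row, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(row, mask)
```

Every row ends up with the same length, which is what makes it possible to stack the batch into a single tensor with `return_tensors="pt"`.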

## Basic Datasets Operations

In [None]:
from datasets import load_dataset

We can load the train split of the MNLI dataset and see an overview of what "shape" it is

In [None]:
mnli_dataset = load_dataset("multi_nli", split="train")
print(mnli_dataset)

Inspecting an actual example

In [None]:
mnli_dataset[10]

We can also load SQuAD the same way

In [None]:
squad_dataset = load_dataset("squad_v2", split="train")
print(squad_dataset)

And finally NER

In [None]:
wikineural_dataset = load_dataset("Babelscape/wikineural", split="train_en")
print(wikineural_dataset)

In [None]:
wikineural_dataset.features

In [None]:
wikineural_dataset[1]

### Useful operation : adding a derived column via map

Note: run the cell again and see how the result is loaded from the cache. Cool!
This should work as long as both the input dataset and the function passed to `map` are unchanged.

In [None]:
# from the dataset card (https://huggingface.co/datasets/Babelscape/wikineural)
tag_set = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
# invert the mapping so that integer ids map back to tag strings
tag_set = {v: k for k, v in tag_set.items()}

In [None]:
enriched_dataset = wikineural_dataset.map(lambda example: {'labels': [tag_set[tag] for tag in example['ner_tags']]})
print(enriched_dataset)

In [None]:
enriched_dataset[1]

## Basic Transformers Usage

Loading different task-specific variants of a BERT Model 

These blocks are the most basic (and automagic) way to load a BERT model with a task-specific head: sequence classification for
tasks such as MNLI, extractive QA (MRC) on data like SQuAD, and token-wise classification for tasks like NER.

In [None]:
from transformers import (AutoConfig, 
                          AutoTokenizer, 
                          AutoModelForSequenceClassification, 
                          AutoModelForQuestionAnswering,
                          AutoModelForTokenClassification)


We start with an identifier on huggingface.co/models

For this assignment, generally, this is just the name for the standard base BERT encoder model (with a masked language model head by default)

The fact that it's the pretrained model name for just the base/core model impacts the subsequent load calls below...


In [None]:
hf_model_name_or_path = "bert-base-cased"

# Load the model configuration
config = AutoConfig.from_pretrained(hf_model_name_or_path)
# passing the num_labels argument to from_pretrained is optional; if you set it, it can/should be inferred from the dataset

# Load the tokenizer for this model
tokenizer = AutoTokenizer.from_pretrained(hf_model_name_or_path)

Take a look at the messages that appear; they are very informative about the relationship between the BERT encoder weights and the task head.

In [None]:
# Load the sequence classification/NLI model
seq_cls_model = AutoModelForSequenceClassification.from_pretrained(hf_model_name_or_path, config=config)

In [None]:
# Load the MRC/question answering model
qa_model = AutoModelForQuestionAnswering.from_pretrained(hf_model_name_or_path, config=config)

In [None]:
# Load the token classification/NER model
ner_model = AutoModelForTokenClassification.from_pretrained(hf_model_name_or_path, config=config)

## Some static versions of the outputs of these types of BERT with task head models

To produce these yourself, run one of the included `no_trainer` scripts and stop inside the loop over training batches to inspect `outputs`.


### Example output for seq_cls_model (like for NLI):

```python
SequenceClassifierOutput(loss=tensor(0.7699, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[ 0.1040, -0.6909,  0.4659],
        [-0.9791,  0.7396, -0.0732],
        [-0.3110, -0.6021,  1.0554],
        [ 1.7014, -0.6825, -1.1260],
        [-1.2129,  0.5246,  0.7338],
        [-1.5830, -0.4266,  2.4691],
        [-0.7301,  1.1808, -0.6139],
        [ 1.8110, -1.2814, -0.7678],
        [ 1.9778, -1.0482, -1.2913],
        [-1.0419,  0.1376,  1.0747],
        [-1.4279,  1.4548,  0.5795],
        [ 1.4805, -0.8703, -0.8966],
        [ 1.8836, -0.9205, -1.1406],
        [ 1.6764, -0.8014, -0.5330],
        [ 1.5336, -0.7486, -1.1418],
        [-1.2788, -1.4566,  2.5621],
        [-1.5375,  1.5254,  0.3369],
        [-0.9993,  0.6337,  0.5967],
        [ 0.1783, -0.3170,  0.0619],
        [ 1.2508, -1.3081, -0.3031],
        [-1.7973,  0.0338,  1.7097],
        [-0.6461, -0.6172,  1.2520],
        [ 0.1375, -0.9910,  1.1393],
        [-0.7546,  0.8241, -0.2152],
        [-1.3073,  0.8289, -0.0564],
        [-1.8356,  0.6909,  1.2432],
        [-1.6042,  0.2484,  2.3209],
        [ 1.3790, -0.7681, -1.0276],
        [-1.5430, -0.7665,  3.1297],
        [-1.4621,  1.5816,  0.4027],
        [-0.5484,  0.0363,  0.4930],
        [ 1.3508, -1.0034, -0.7355]], device='cuda:0',
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
```

For a 3-class problem, the shape of the `logits` tensor is `(bsz x num_classes)`, which here is `(32, 3)`.
The `loss` is the multi-class cross-entropy (CE) classification loss, averaged over the batch dim.
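
To make the relationship between `logits`, predictions, and `loss` concrete, here is a small plain-Python sketch. It uses the first three rows of the logits printed above, but the gold `labels` are made up for illustration, so the resulting loss value is not the one shown in the output. `cross_entropy` is a hypothetical helper, not a library function.

```python
import math

def cross_entropy(logits_row, label):
    """Negative log-softmax probability of the true class for one example."""
    m = max(logits_row)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum_exp - logits_row[label]

# First three rows of the logits printed above.
logits = [
    [0.1040, -0.6909, 0.4659],
    [-0.9791, 0.7396, -0.0732],
    [-0.3110, -0.6021, 1.0554],
]
labels = [2, 1, 0]  # hypothetical gold labels, made up for illustration

# Predicted class = argmax over the num_classes dimension.
predictions = [row.index(max(row)) for row in logits]
# Batch loss = mean of the per-example CE losses.
loss = sum(cross_entropy(r, y) for r, y in zip(logits, labels)) / len(labels)

print(predictions)  # [2, 1, 2]
print(round(loss, 4))
```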


### Example output shape for qa_model (for SQuAD MRC):

```python
QuestionAnsweringModelOutput(loss=tensor(5.8717, device='cuda:0', grad_fn=<DivBackward0>), start_logits=tensor([[-0.2344, -0.1153, -0.3226,  ..., -0.2264, -0.0654, -0.2667],
        [-0.0394, -0.1783, -0.1586,  ..., -0.2669, -0.1833, -0.5015],
        [-0.0489,  0.3059,  0.0471,  ..., -0.2551, -0.3042, -0.2128],
        ...,
        [-0.1104, -0.3046,  0.2381,  ..., -0.0628, -0.1989, -0.0310],
        [-0.1292, -0.4169, -0.3258,  ..., -0.1559, -0.1705, -0.1658],
        [-0.3442, -0.0563, -0.1119,  ..., -0.2945, -0.3721, -0.2272]],
       device='cuda:0', grad_fn=<CloneBackward0>), end_logits=tensor([[-0.5101, -0.2102,  0.3540,  ..., -0.1394, -0.0127, -0.0557],
        [-0.5998,  0.1171,  0.0030,  ..., -0.1966, -0.1890, -0.1973],
        [-0.4303, -0.2212,  0.7105,  ..., -0.4232, -0.2996, -0.2604],
        ...,
        [-0.2798, -0.3308,  0.2352,  ..., -0.0017, -0.0638,  0.0833],
        [-0.2013,  0.0410,  0.2185,  ..., -0.0690, -0.0288,  0.1315],
        [-0.4846, -0.0881, -0.2841,  ..., -0.3322, -0.1482, -0.3584]],
       device='cuda:0', grad_fn=<CloneBackward0>), hidden_states=None, attentions=None)
```

For the span prediction head, the model produces two output tensors for every input text/question.
These have shape `(bsz x seq_length)` and give the model's prediction for the start position and end position of the answer span in the input sequence (which includes the "passage/context" being read). In this example they are `(12, 384)`.

The `loss` in the output object (`total_loss`) is the CE classification loss between the start/end logits and the true start/end_positions (labels), averaged over the batch dim:

```python
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
```
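
The excerpt above relies on tensors defined elsewhere, so here is a self-contained plain-Python sketch of the same computation on a tiny made-up batch. It assumes that `ignored_index` equals the sequence length and that out-of-range answer positions are clamped to that value so they are skipped by the loss; treat that as an assumption to verify against the model source. `ce_with_ignore` is a hypothetical stand-in for `CrossEntropyLoss(ignore_index=...)`.

```python
import math

def ce_with_ignore(batch_logits, targets, ignore_index):
    """Mean cross-entropy over a batch, skipping targets equal to ignore_index."""
    losses = []
    for logits, target in zip(batch_logits, targets):
        if target == ignore_index:
            continue  # this example contributes nothing to the loss
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[target])
    return sum(losses) / len(losses) if losses else float("nan")

# Hypothetical batch: 2 sequences with 4 token positions each.
start_logits = [[2.0, 0.1, -1.0, 0.3], [0.0, 0.0, 0.0, 0.0]]
end_logits = [[0.2, 1.5, -0.3, 0.0], [0.0, 0.0, 0.0, 0.0]]
start_positions = [0, 4]  # 4 == seq_length marks an out-of-range position
end_positions = [1, 4]

ignored_index = len(start_logits[0])  # assumption: equals seq_length
start_loss = ce_with_ignore(start_logits, start_positions, ignored_index)
end_loss = ce_with_ignore(end_logits, end_positions, ignored_index)
total_loss = (start_loss + end_loss) / 2
print(round(total_loss, 4))
```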

### Example output shape for an NER model

```python
TokenClassifierOutput(loss=tensor(2.1217, device='cuda:0', grad_fn=<NllLossBackward0>), logits=tensor([[[ 0.2860, -0.2712,  0.2737,  ..., -0.3259,  0.2023, -0.0978],
         [ 0.7092, -0.0134, -0.1322,  ..., -0.2983,  0.3161, -0.0509],
         [ 0.4163, -0.5686, -0.0563,  ...,  0.0067,  0.3946, -0.2604],
         ...,
         [ 0.2332,  0.1479,  0.2424,  ..., -0.1677,  0.0181,  0.0685],
         [ 0.5098, -0.0690,  0.2054,  ..., -0.4173,  0.1323, -0.1845],
         [ 0.4707, -0.1589,  0.1457,  ..., -0.3499,  0.2403, -0.1300]],

        [[ 0.1392, -0.1065,  0.3485,  ..., -0.5171, -0.0702,  0.2156],
         [-0.1113,  0.2093,  0.2295,  ...,  0.1406,  0.0407, -0.4870],
         [-0.3515, -0.1806,  0.1416,  ...,  0.1510,  0.2850, -0.4419],
         ...,
         [-0.0244, -0.2980,  0.2082,  ..., -0.3929,  0.2233, -0.3086],
         [-0.1493, -0.1583,  0.0858,  ..., -0.3656, -0.1608,  0.0397],
         [-0.0885,  0.1128, -0.2955,  ..., -0.3246, -0.2501, -0.1332]],

        [[-0.4092, -0.6252, -0.0032,  ..., -0.4688,  0.4135, -0.3822],
         [ 0.1071,  0.2398,  0.1925,  ..., -0.6402,  0.1205, -0.6248],
         [ 0.5574,  0.0604, -0.0136,  ..., -0.3985,  0.3627, -0.3857],
         ...,
         [ 0.3345, -0.6312, -0.1251,  ..., -0.1982,  0.3637, -0.8203],
         [ 0.2451, -0.2064,  0.0295,  ..., -0.1030,  0.4884, -0.1153],
         [ 0.1495,  0.0738, -0.1557,  ..., -0.0072,  0.4927,  0.6017]],

        ...,

        [[-0.3878,  0.0331, -0.2140,  ..., -0.3954,  0.0785, -0.1093],
         [-0.2488, -0.3900, -0.3403,  ..., -0.1708, -0.2019, -0.0287],
         [-0.2760,  0.3549,  0.0586,  ..., -0.1434,  0.1379,  0.1815],
         ...,
         [-0.1993, -0.4495,  0.0578,  ..., -0.4326,  0.2391, -0.2390],
         [-0.0191, -0.4702,  0.0100,  ..., -0.3972,  0.1565, -0.3812],
         [-0.1653, -0.0455, -0.3373,  ..., -0.2308, -0.1534, -0.1001]],

        [[-0.1269, -0.4377, -0.0396,  ..., -0.3559,  0.0533, -0.3187],
         [ 0.1010, -0.3163, -0.0770,  ..., -0.3450, -0.3520, -0.0923],
         [ 0.1018, -0.0359,  0.2626,  ..., -0.5112,  0.3169,  0.0298],
         ...,
         [ 0.1833, -0.2909,  0.1263,  ..., -0.3456,  0.1408, -0.1342],
         [ 0.0845, -0.3480,  0.1911,  ..., -0.3855,  0.2223, -0.2479],
         [ 0.3148, -0.0760,  0.2829,  ..., -0.2543,  0.2454, -0.3342]],

        [[-0.0261, -0.2420,  0.2130,  ..., -0.3749,  0.0549,  0.4439],
         [-0.0817, -0.3362, -0.1929,  ..., -0.0361,  0.2145,  0.3001],
         [-0.0170, -0.0779,  0.1517,  ..., -0.1094,  0.2054,  0.3024],
         ...,
         [ 0.1930, -0.2856,  0.0041,  ..., -0.0372,  0.3425,  0.1241],
         [ 0.0289, -0.0881, -0.0745,  ..., -0.2711,  0.2272, -0.0776],
         [-0.0857, -0.2791,  0.0754,  ..., -0.2199,  0.0361,  0.0086]]],
       device='cuda:0', grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)
```

For token classification the logits have shape `(bsz x seq_length x num_classes)`, where the classes in the NER task are:
```
{'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
```
The `loss` is the CE multi-class classification loss comparing the predicted tag to the actual tag for each token position.
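
As a small illustration of how to read these logits, the sketch below takes the argmax over the class dimension for each token and maps the resulting ids back to tag strings. The logits are made up for a single 3-token sequence, and `decode_tags` is a hypothetical helper; with real model outputs you would apply the same idea to each sequence in the batch.

```python
tag_to_id = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4,
             'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
id_to_tag = {i: tag for tag, i in tag_to_id.items()}

def decode_tags(token_logits):
    """Argmax over the class dimension for each token, mapped to tag strings."""
    return [id_to_tag[row.index(max(row))] for row in token_logits]

# Hypothetical logits for one sequence: shape (seq_length=3, num_classes=9).
token_logits = [
    [0.1, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # highest score: B-PER
    [0.1, 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # highest score: I-PER
    [3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # highest score: O
]
print(decode_tags(token_logits))  # ['B-PER', 'I-PER', 'O']
```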

## A more assignment-specific thing with datasets...

Get all the unique values in the `genre` column (the domain splits).

In [None]:
mnli_dataset = load_dataset("multi_nli")

# drop the "validation_mismatched" split
# (a DatasetDict behaves like a dict of splits, so `del` removes a split;
# `remove_columns` would drop columns, not splits)
del mnli_dataset["validation_mismatched"]

# Get all the unique values in the genre column (plain Python alternative)
# genres = set(mnli_dataset["train"]["genre"])

# Get all the unique values in the genre column (built-in Dataset method)
genres = mnli_dataset["train"].unique("genre")
genres

### Steps

1. We create a Python dictionary to store the different datasets
2. Using the genres from the previous cell, we iterate over the unique genres and, for each one, create a new dataset that contains only the rows with that genre
3. We use the `DatasetDict` from Hugging Face to collect all those subsets

In [None]:
# Create a dictionary to store the different datasets
genre_subsets = {}

# Loop through the unique genres and create a new dataset for each genre
# (only the first 2 genres are kept here, for display purposes)
genres = genres[:2]
for genre in genres:
    genre_subsets[genre] = mnli_dataset.filter(lambda example: example['genre'] == genre)

# Collect the genre-specific datasets into a Hugging Face DatasetDict
from datasets import DatasetDict

genre_subsets_dict = DatasetDict()
for genre, dataset in genre_subsets.items():
    genre_subsets_dict[genre] = dataset

genre_subsets_dict

In [None]:
# Print the number of train examples in each genre-specific dataset
for genre in genres:
    print(f"{genre} subset: {len(genre_subsets_dict[genre]['train'])} examples")

In [None]:
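# index by the first kept genre rather than a hardcoded name, so this works whichever genres were kept
print(genre_subsets_dict[genres[0]]["train"][0])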

## Finally, check out task_sampler.py for some Part 2 (and really part 1 also) inspiration

The main method uses all of the above to demo the "task sampling" concept, which is one way to accomplish weighted multi-tasking.
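
To give a feel for the idea without opening the file, here is a minimal sketch of weighted task sampling using only the standard library. It is not the contents of `task_sampler.py`: the function `sample_tasks`, the task names, and the weights are all hypothetical. At every training step one task is drawn with probability proportional to its weight, and you would then fetch the next batch from that task's dataloader.

```python
import random
from collections import Counter

def sample_tasks(task_weights, num_steps, seed=0):
    """Draw one task name per training step, proportionally to its weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(task_weights)
    weights = [task_weights[name] for name in names]
    return rng.choices(names, weights=weights, k=num_steps)

# Hypothetical weights; they do not need to sum to 1.
task_weights = {"nli": 0.5, "qa": 0.3, "ner": 0.2}
schedule = sample_tasks(task_weights, num_steps=1000)
print(schedule[:10])
print(Counter(schedule))
```

Over many steps the counts approach the weight proportions, so a larger weight means the model sees that task's batches more often.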