# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## 7. Main NLP Tasks

### Token classification

#### Objective

We will fine-tune the BERT model [`bert-base-cased`](https://huggingface.co/google-bert/bert-base-cased) for the Named Entity Recognition task.

> BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion.
>
> More precisely, it was pretrained with two objectives:<p/>
> * Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input then run the entire masked sentence through the model and has to predict the masked words.<br/>
> * Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining.
>z
> This way, the model learns an inner representation of the English language that can then be used to extract features
> useful for downstream tasks: if you have a dataset of labeled sentences for instance, you can train a standard
> classifier using the features produced by the BERT model as inputs.



Please see [Token Classification](https://huggingface.co/learn/nlp-course/chapter7/2?fw=pt#token-classification), 7. Main NLP Tasks, in the 🤗 NLP Course.

#### Preparing the data

> In this section we will use the CoNLL-2003 dataset (please see [`eriktks/conll2003`](https://huggingface.co/datasets/eriktks/conll2003), which contains news stories from Reuters.

Please also see [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Sang and Meulder, 2003](https://aclanthology.org/W03-0419.pdf):
> The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

In [2]:
print(raw_datasets)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


In [3]:
print(raw_datasets["train"][0]["tokens"])
print(f'{len(raw_datasets["train"][0]["tokens"])} tokens')

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
9 tokens


In [4]:
print(raw_datasets["train"][0]["ner_tags"])
print(f'{len(raw_datasets["train"][0]["ner_tags"])} NER tags')

[3, 0, 7, 0, 0, 0, 7, 0, 0]
9 NER tags


The features in a dataset are referenced by the `features` attribute, which is an instance of [`datasets.Features`](https://huggingface.co/docs/datasets/v3.3.1/en/package_reference/main_classes#datasets.Features)

> A special dictionary that defines the internal structure of a dataset.
>
> Instantiated with a dictionary of type `dict[str, FieldType]`, where keys are the desired column names, and values are the type of that column.

In [5]:
ner_feature = raw_datasets["train"].features["ner_tags"]

print(ner_feature)

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)


In [6]:
label_names = ner_feature.feature.names

print(label_names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


##### Examining the NER tags

In [7]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 


In [8]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["ner_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
B-LOC   O  O              O  O   B-ORG    I-ORG O  O          O         B-PER  I-PER     O    O  O         O         O      O   O         O    O         O     O    B-LOC   O     O   O          O      O   O       O 


✏️ Your turn! Print the same two sentences with their POS or chunking labels.

##### Examining the POS tags

In [9]:
pos_feature = raw_datasets["train"].features["pos_tags"]

print(pos_feature)

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)


In [10]:
label_names = pos_feature.feature.names

print(label_names)

['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


In [11]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["pos_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU  rejects German call to boycott British lamb . 
NNP VBZ     JJ     NN   TO VB      JJ      NN   . 


In [12]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["pos_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Germany 's  representative to the European Union 's  veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
NNP     POS NN             TO DT  NNP      NNP   POS JJ         NN        NNP    NNP       VBD  IN NNP       NNS       MD     VB  NN        IN   NNS       JJ    IN   NNP     IN    DT  JJ         NN     VBD JJR     . 


...

##### Examining the Chunking labels

In [13]:
chunk_feature = raw_datasets["train"].features["chunk_tags"]
print(chunk_feature)

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)


In [14]:
label_names = chunk_feature.feature.names
print(label_names)

['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']


In [15]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["chunk_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU   rejects German call to   boycott British lamb . 
B-NP B-VP    B-NP   I-NP B-VP I-VP    B-NP    I-NP O 


In [16]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["chunk_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Germany 's   representative to   the  European Union 's   veterinary committee Werner Zwingmann said on   Wednesday consumers should buy  sheepmeat from countries other  than Britain until  the  scientific advice was  clearer . 
B-NP    B-NP I-NP           B-PP B-NP I-NP     I-NP  B-NP I-NP       I-NP      I-NP   I-NP      B-VP B-PP B-NP      I-NP      B-VP   I-VP B-NP      B-PP B-NP      B-ADJP B-PP B-NP    B-SBAR B-NP I-NP       I-NP   B-VP B-ADJP  O 


#### Processing the data

In [17]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

print("let's see some info on this tokenizer...")
tokenizer.init_kwargs

let's see some info on this tokenizer...


{'do_lower_case': False,
 'unk_token': '[UNK]',
 'sep_token': '[SEP]',
 'pad_token': '[PAD]',
 'cls_token': '[CLS]',
 'mask_token': '[MASK]',
 'tokenize_chinese_chars': True,
 'strip_accents': None,
 'model_max_length': 512,
 'name_or_path': 'bert-base-cased'}

In [18]:
print(tokenizer.is_fast)

True


In [19]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [20]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

##### Aligning the tokens with their corresponding labels

* The `tokens` are the words that the NER tags correspond with, and also _include_ the special tokens like `[CLS]` or `[SEP]`.
* **However**, `word_ids` only list `None` for those special tokens.

> The first rule we’ll apply is that special tokens get a label of `-100`.
> <span style="background-color:#33FFFF">This is because by default `-100` is an index that is ignored in the loss
> function we will use (cross entropy)</span>

We define a function `align_labels_with_tokens` that will take care of this.

In [21]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [22]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()

print(f"labels: {labels}")
print(f"word_ids: {word_ids}")
print()
print(f"after correction: {align_labels_with_tokens(labels, word_ids)}")

labels: [3, 0, 7, 0, 0, 0, 7, 0, 0]
word_ids: [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

after correction: [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]


✏️ Your turn! Some researchers prefer to attribute only one label per word, and assign -100 to the other subtokens in a given word. This is to avoid long words that split into lots of subtokens contributing heavily to the loss. Change the previous function to align labels with input IDs by following this rule.

...

> To preprocess our whole dataset, we need to tokenize all the inputs and apply `align_labels_with_tokens()` on all the labels. To take advantage of the speed of our fast tokenizer, it’s best to tokenize lots of texts at the same time, so we’ll write a function that processes a list of examples and use the `Dataset.map()` method with the option `batched=True`. The only thing that is different from our previous example is that the `word_ids()` function needs to get the index of the example we want the word IDs of when the inputs to the tokenizer are lists of texts (or in our case, list of lists of words), so we add that too

N.B. this `tokenize_and_align_labels` is designed so that it can work on a short array of examples as well as on a batch invoked via [🤗 Datasets batch mapping](https://huggingface.co/docs/datasets/about_map_batch#batch-mapping).

In [23]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

Let's test this out on a very small slice of the training examples...

In [24]:
print(tokenize_and_align_labels(raw_datasets["train"][:2]))

{'input_ids': [[101, 7270, 22961, 1528, 1840, 1106, 21423, 1418, 2495, 12913, 119, 102], [101, 1943, 14428, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]], 'labels': [[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100], [-100, 1, 2, -100]]}


Now let's transform the raw dataset (train, validation, and test sets) all in one go, putting the data into a form that we can use for our fine-tuning.

In [25]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

The raw dataset has these columns...

In [26]:
raw_datasets["train"].column_names

['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags']

And our transformed data for fine-tuning has these columns...

In [27]:
tokenized_datasets.column_names

{'train': ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
 'validation': ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
 'test': ['input_ids', 'token_type_ids', 'attention_mask', 'labels']}

<span style="background-color:#AFFF33;">**Q?** How to confirm the column names? Where do they come from? The tokenizer? the model?</span>

#### Fine-tuning the model with the Trainer API

 #### Data collation

> We can’t just use a [`DataCollatorWithPadding`](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorWithPadding)
> like in Chapter 3 because that only pads the inputs (input IDs, attention mask, and token type IDs).
> <span style="background-color:#33FFFF">Here our labels should be padded the exact same way
> as the inputs so that they stay the same size, using `-100` as a value so that the corresponding
> predictions are ignored in the loss computation</span>
>
> This is all done by a [`DataCollatorForTokenClassification`](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForTokenClassification). Like the `DataCollatorWithPadding`, <span style="background-color:#33FFFF">it takes the tokenizer used to preprocess the inputs</span>

In [28]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Just to see, let's take another look at the raw example at index 1:

In [29]:
raw_datasets["train"][1]

{'id': '1',
 'tokens': ['Peter', 'Blackburn'],
 'pos_tags': [22, 22],
 'chunk_tags': [11, 12],
 'ner_tags': [1, 2]}

Compare that to the tokenized example at index 1:

In [30]:
tokenized_datasets["train"][1]

{'input_ids': [101, 1943, 14428, 102],
 'token_type_ids': [0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1],
 'labels': [-100, 1, 2, -100]}

And then compare the above raw and tokenized examples with what happens when we have a (mini) batch:

In [31]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])

print(batch["labels"])
print()
print("see how the 2nd row (example at index 1) is nicely padded (on the right) to match the length of the 1st row (example at index 0)?")

tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

see how the 2nd row (example at index 1) is nicely padded (on the right) to match the length of the 1st row (example at index 0)?


#### Metrics

From [Metric: sequeval](https://huggingface.co/spaces/evaluate-metric/seqeval) on 🤗 Spaces:

> Seqeval produces labelling scores along with its sufficient statistics from a source against one or more references.
>
> It takes two mandatory arguments:
> <br/><span style="padding-left:20px;"/><tt>predictions</tt>: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.</span>
> <br/><span style="padding-left:20px;"/><tt>references</tt>: a list of lists of reference labels, i.e. the ground truth/target values.</span>


It works like this:
 
    seqeval = evaluate.load('seqeval')
    predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
    references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
    results = seqeval.compute(predictions=predictions, references=references)


In [32]:
import evaluate

metric = evaluate.load("seqeval")

##### Quick example of `seqeval`

* Take the NER tags very first training example as being truth.
* Copy those labels and change the 3rd one from `B-INTJ` to O``, just for kicks
* Now evaluate with `seqeval`, passing in your `predictions` and `references` (both as lists)

In [33]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]

print(f"labels: {labels}")

labels: ['B-ADVP', 'O', 'B-INTJ', 'O', 'O', 'O', 'B-INTJ', 'O', 'O']


In [34]:
predictions = labels.copy()
predictions[2] = "O"

print(f"predictions: {predictions}")

predictions: ['B-ADVP', 'O', 'O', 'O', 'O', 'O', 'B-INTJ', 'O', 'O']


In [35]:
metric.compute(predictions=[predictions], references=[labels])

{'ADVP': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'INTJ': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}

...

> This `compute_metrics()` function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is `-100`, then pass the results to the `metric.compute()` method

In [36]:
import numpy as np

def compute_metrics(eval_preds):
    # unpack
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [
        [label_names[l] for l in label if l != -100]
        for label in labels
    ]
    
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }

#### Defining the model

In [37]:
# we need to reset label_names since we re-used that var for both the POS and CHUNKING bits
label_names = ner_feature.feature.names

print(label_names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


In [38]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [39]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
model.config.num_labels

9

In [41]:
id2label

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

#### Fine-tuning the model

In [42]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [43]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


    from transformers import Trainer

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer    # <<<<<< `tokenizer` is deprecated... Use `processing_class` instead
    )

In [44]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)
#trainer.train()

In [45]:
%%time

trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.077,0.066341,0.901906,0.931505,0.916467,0.981559
2,0.0341,0.064706,0.930236,0.946988,0.938537,0.985489
3,0.0219,0.060179,0.936184,0.950522,0.943299,0.98674


CPU times: user 9min 26s, sys: 11.2 s, total: 9min 37s
Wall time: 9min 59s


TrainOutput(global_step=5268, training_loss=0.06637766964071704, metrics={'train_runtime': 591.1167, 'train_samples_per_second': 71.26, 'train_steps_per_second': 8.912, 'total_flos': 920771584279074.0, 'train_loss': 0.06637766964071704, 'epoch': 3.0})

In [46]:
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/buruzaemon/bert-finetuned-ner/commit/1a7552c5569373f76d0c628990fcec510d1533b1', commit_message='Training complete', commit_description='', oid='1a7552c5569373f76d0c628990fcec510d1533b1', pr_url=None, repo_url=RepoUrl('https://huggingface.co/buruzaemon/bert-finetuned-ner', endpoint='https://huggingface.co', repo_type='model', repo_id='buruzaemon/bert-finetuned-ner'), pr_revision=None, pr_num=None)

### A custom training loop

... and for your edification, here's how to do it the _hard way_...

#### Preparing everything for training

> First we need to build the `DataLoaders` from our datasets.
> We’ll reuse our `data_collator` as a `collate_fn`
> and shuffle the training set, but _not_ the validation set

In [47]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)

eval_dataloader = DataLoader(
    tokenized_datasets["validation"], 
    collate_fn=data_collator, 
    batch_size=8
)

> Next we reinstantiate our model, to make sure we’re not continuing
> the fine-tuning from before but starting from the BERT pretrained
> model again:

In [48]:
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


> Then we will need an optimizer. We’ll use the classic [`AdamW`](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html),
> which is like `Adam`, but with a fix in the way weight decay is applied

Why?
> In case you have larger models or when training on complex, high-dimensional data, it's better to choose `AdamW`
> [citation/proof needed]

In [49]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

> Once we have all those objects, we can send them to the `accelerator.prepare()` method

In [50]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

> Now that we have sent our `train_dataloader` to accelerator.prepare(), we can use its length to compute the number of training steps. Remember that we should always do this after preparing the dataloader, as that method will change its length. We use a classic linear schedule from the learning rate to 0


* [`transformers.get_scheduler`](https://huggingface.co/docs/transformers/main/en/main_classes/optimizer_schedules#transformers.get_scheduler): ... read up on optimizations..?

In [51]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

In [52]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'buruzaemon/bert-finetuned-ner-accelerate'

> FutureWarning: 'Repository' (from 'huggingface_hub.repository') is deprecated and will be removed from version '1.0'. Please prefer the http-based alternatives instead. Given its large adoption in legacy code, the complete removal is only planned on next major release.<p/>
> For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.

##### Using `huggingface_hub.HfApi`

In [53]:
from huggingface_hub import HfApi

output_dir = model_name

api = HfApi()

#### Training loop

> We are now ready to write the full training loop. To simplify its evaluation part,
> we define this postprocess() function that takes predictions and labels and converts
> them to lists of strings, like our metric object expects (c.f. `seqeval`)

In [54]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions

The training loop covers three sub-tasks:

1. training
2. evaluation
3. save & upload



> The training in itself, which is the classic iteration over the
> `train_dataloader`, forward pass through the model, then backward
> pass and optimizer step.
>
> The evaluation, in which there is a novelty after getting the outputs
> of our model on a batch: since two processes may have padded the inputs
> and labels to different shapes, we need to use [`accelerator.pad_across_processes()`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.pad_across_processes)
> to make the predictions and labels the same shape before calling the
> [`gather()`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.gather)
> method. If we don’t do this, the evaluation will either error out or hang forever.
> Then we send the results to `metric.add_batch()` and call `metric.compute()` once
> the evaluation loop is over.
>
> Saving and uploading, where we first save the model and the tokenizer, then call
> [`repo.push_to_hub()`](https://huggingface.co/docs/diffusers/v0.32.2/en/api/pipelines/overview#diffusers.utils.PushToHubMixin.push_to_hub).
> Notice that we use the argument `blocking=False` to tell the 🤗 Hub library to
> push in an asynchronous process. This way, training continues normally and this
> (long) instruction is executed in the background



In [55]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        #repo.push_to_hub(
        #    commit_message=f"Training in progress epoch {epoch}", blocking=False
        #)
        future = api.upload_folder( # Upload in the background (non-blocking action)
            repo_id=repo_name,
            folder_path=output_dir,
            run_as_future=True,
            commit_message=f"Training in progress epoch {epoch}"
        )

  0%|          | 0/5268 [00:00<?, ?it/s]

epoch 0: {'precision': 0.9347021204981488, 'recall': 0.9038242473555737, 'f1': 0.9190038884752213, 'accuracy': 0.9812798022016836}
epoch 1: {'precision': 0.9404240996297543, 'recall': 0.918625678119349, 'f1': 0.9293970893970893, 'accuracy': 0.9833107670571614}
epoch 2: {'precision': 0.9488387748232918, 'recall': 0.9292895994725564, 'f1': 0.9389624448330418, 'accuracy': 0.9858568316948254}


In [56]:
api.upload_folder(
    repo_id=repo_name,
    folder_path=output_dir,
    commit_message="Training completed!"
)

CommitInfo(commit_url='https://huggingface.co/buruzaemon/bert-finetuned-ner-accelerate/commit/9a13eacccccdc648b14184596e56c15e086530cf', commit_message='Training completed!', commit_description='', oid='9a13eacccccdc648b14184596e56c15e086530cf', pr_url=None, repo_url=RepoUrl('https://huggingface.co/buruzaemon/bert-finetuned-ner-accelerate', endpoint='https://huggingface.co', repo_type='model', repo_id='buruzaemon/bert-finetuned-ner-accelerate'), pr_revision=None, pr_num=None)

#### Using the fine-tuned model

In [57]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "buruzaemon/bert-finetuned-ner-accelerate"

token_classifier = pipeline(
    "token-classification", 
    model=model_checkpoint, 
    aggregation_strategy="simple"
)

config.json:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/431M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.22k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/669k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

Device set to use cuda:0


In [58]:
token_classifier("My name is Buruzaemon and I like fishing in Hokkaido.")

[{'entity_group': 'PER',
  'score': 0.96822476,
  'word': 'Buruzaemon',
  'start': 11,
  'end': 21},
 {'entity_group': 'LOC',
  'score': 0.9975211,
  'word': 'Hokkaido',
  'start': 44,
  'end': 52}]