# NLP Course

Please see the [Hugging Face NLP Course page](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt).

## 7. Main NLP Tasks

### Token classification

> ...In this section, we will fine-tune a model (BERT) on a NER task...

Please see [Token Classification](https://huggingface.co/learn/nlp-course/chapter7/2?fw=pt#token-classification), 7. Main NLP Tasks, in the 🤗 NLP Course.

#### Preparing the data

> In this section we will use the CoNLL-2003 dataset (please see [`eriktks/conll2003`](https://huggingface.co/datasets/eriktks/conll2003)), which contains news stories from Reuters.

Please also see [Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, Tjong Kim Sang and De Meulder, 2003](https://aclanthology.org/W03-0419.pdf):
> The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups.

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("conll2003")

In [2]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [3]:
print(raw_datasets["train"][0]["tokens"])

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']


In [4]:
print(raw_datasets["train"][0]["ner_tags"])

[3, 0, 7, 0, 0, 0, 7, 0, 0]


In [5]:
ner_feature = raw_datasets["train"].features["ner_tags"]
print(ner_feature)

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)


In [6]:
label_names = ner_feature.feature.names
print(label_names)

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


##### Examining the NER tags

In [7]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU    rejects German call to boycott British lamb . 
B-ORG O       B-MISC O    O  O       B-MISC  O    O 


In [8]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["ner_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
B-LOC   O  O              O  O   B-ORG    I-ORG O  O          O         B-PER  I-PER     O    O  O         O         O      O   O         O    O         O     O    B-LOC   O     O   O          O      O   O       O 
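
The `B-`/`I-` prefixes in the printouts above mark where an entity begins and where it continues. As a minimal sketch (the helper name `iob_to_spans` is made up for illustration and is not part of the course or of any library), the tags can be folded back into whole entities. The words and tags below are copied from the start of the second sentence.

```python
def iob_to_spans(words, tags):
    """Collect (entity_type, text) pairs from IOB2 tags such as B-ORG / I-ORG."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A new entity starts here (a stray I- tag is treated like B-).
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_words.append(word)
        else:
            # "O": close any open entity.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


words = ["Germany", "'s", "representative", "to", "the", "European", "Union",
         "'s", "veterinary", "committee", "Werner", "Zwingmann", "said"]
tags = ["B-LOC", "O", "O", "O", "O", "B-ORG", "I-ORG",
        "O", "O", "O", "B-PER", "I-PER", "O"]

spans = iob_to_spans(words, tags)
print(spans)
# [('LOC', 'Germany'), ('ORG', 'European Union'), ('PER', 'Werner Zwingmann')]
```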


✏️ Your turn! Print the same two sentences with their POS or chunking labels.

##### Examining the POS tags

In [9]:
pos_feature = raw_datasets["train"].features["pos_tags"]
print(pos_feature)

Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None)


In [10]:
label_names = pos_feature.feature.names
print(label_names)

['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB']


In [11]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["pos_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU  rejects German call to boycott British lamb . 
NNP VBZ     JJ     NN   TO VB      JJ      NN   . 


In [12]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["pos_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Germany 's  representative to the European Union 's  veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer . 
NNP     POS NN             TO DT  NNP      NNP   POS JJ         NN        NNP    NNP       VBD  IN NNP       NNS       MD     VB  NN        IN   NNS       JJ    IN   NNP     IN    DT  JJ         NN     VBD JJR     . 


...

##### Examining the Chunking labels

In [13]:
chunk_feature = raw_datasets["train"].features["chunk_tags"]
print(chunk_feature)

Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None)


In [14]:
label_names = chunk_feature.feature.names
print(label_names)

['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP']


In [15]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["chunk_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

EU   rejects German call to   boycott British lamb . 
B-NP B-VP    B-NP   I-NP B-VP I-VP    B-NP    I-NP O 


In [16]:
words = raw_datasets["train"][4]["tokens"]
labels = raw_datasets["train"][4]["chunk_tags"]

line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

Germany 's   representative to   the  European Union 's   veterinary committee Werner Zwingmann said on   Wednesday consumers should buy  sheepmeat from countries other  than Britain until  the  scientific advice was  clearer . 
B-NP    B-NP I-NP           B-PP B-NP I-NP     I-NP  B-NP I-NP       I-NP      I-NP   I-NP      B-VP B-PP B-NP      I-NP      B-VP   I-VP B-NP      B-PP B-NP      B-ADJP B-PP B-NP    B-SBAR B-NP I-NP       I-NP   B-VP B-ADJP  O 
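
The NER, POS and chunking cells above all repeat the same padding loop. One possible refactoring, sketched here with plain lists so it runs without downloading the dataset, is a small helper (`format_aligned` is a hypothetical name). The tags are copied from the printouts for the first sentence.

```python
def format_aligned(words, tags):
    """Return two strings with each word and its tag padded to the same width."""
    line1, line2 = "", ""
    for word, tag in zip(words, tags):
        width = max(len(word), len(tag)) + 1  # +1 for the separating space
        line1 += word.ljust(width)
        line2 += tag.ljust(width)
    return line1, line2


words = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
pos = ["NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "."]
chunks = ["B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "I-VP", "B-NP", "I-NP", "O"]

for tags in (pos, chunks):
    line1, line2 = format_aligned(words, tags)
    print(line1)
    print(line2)
```

With the dataset loaded, the same helper could be called with `[names[i] for i in example["pos_tags"]]`, where `names` is the matching `ClassLabel.names` list.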


#### Processing the data

In [17]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [18]:
print(tokenizer.is_fast)

True


In [19]:
inputs = tokenizer(raw_datasets["train"][0]["tokens"], is_split_into_words=True)
inputs.tokens()

['[CLS]',
 'EU',
 'rejects',
 'German',
 'call',
 'to',
 'boycott',
 'British',
 'la',
 '##mb',
 '.',
 '[SEP]']

In [20]:
inputs.word_ids()

[None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]
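
The word IDs show which tokens came from which original word. As a small sketch that uses the two outputs above as plain lists (the helper name `group_tokens_by_word` is made up for illustration), the tokens can be grouped per word:

```python
from collections import defaultdict

# Copied from the outputs of cells 19 and 20 above.
tokens = ["[CLS]", "EU", "rejects", "German", "call", "to", "boycott",
          "British", "la", "##mb", ".", "[SEP]"]
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]


def group_tokens_by_word(tokens, word_ids):
    """Map each word index to the list of tokens produced from that word."""
    groups = defaultdict(list)
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        groups[word_id].append(token)
    return dict(groups)


groups = group_tokens_by_word(tokens, word_ids)
print(groups[7])  # ['la', '##mb']
```

Word 7 (`lamb`) is the only word split into two tokens, which is why the token list is one element longer than the nine words plus the two special tokens.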

In [21]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

In [22]:
labels = raw_datasets["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
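
In the sample sentence only `lamb` is split, and its label is `O`, so the branch that turns `B-XXX` into `I-XXX` is never exercised. The sketch below repeats the function and applies it to a hypothetical split; the labels and word IDs are invented for illustration and are not tokenizer output. The `label % 2 == 1` test relies on the label order printed earlier, where every `B-` tag has an odd ID and the matching `I-` tag is the next integer.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word (or a special token after a word)
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Later subtoken of the same word: B-XXX (odd) becomes I-XXX (next even)
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical input: one B-ORG word split into three subtokens, then an O word.
labels = [3, 0]
word_ids = [None, 0, 0, 0, 1, None]

aligned = align_labels_with_tokens(labels, word_ids)
print(aligned)  # [-100, 3, 4, 4, 0, -100]
print([label_names[i] if i != -100 else "-" for i in aligned])
# ['-', 'B-ORG', 'I-ORG', 'I-ORG', 'O', '-']
```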


✏️ Your turn! Some researchers prefer to attribute only one label per word, and assign -100 to the other subtokens in a given word. This is to avoid long words that split into lots of subtokens contributing heavily to the loss. Change the previous function to align labels with input IDs by following this rule.

In [23]:
def align_labels_with_tokens2(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word: keep its label (special tokens get -100)
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        else:
            # Special token, or a later subtoken of the same word:
            # assign -100 so it is ignored by the loss
            new_labels.append(-100)

    return new_labels

In [24]:
print(labels)
print(align_labels_with_tokens2(labels, word_ids))

[3, 0, 7, 0, 0, 0, 7, 0, 0]
[-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]
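
Both alignment functions rely on `-100` being skipped by the loss; `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`. As a rough pure-Python sketch of what "ignored" means (the function `masked_cross_entropy` is illustrative, not the library code, and the logits are made up), the mean is taken only over positions that carry a real label:

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def masked_cross_entropy(logits, labels):
    """Return (mean cross-entropy, number of positions that were counted)."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # this position contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[label]  # negative log-probability of the label
        count += 1
    return (total / count if count else 0.0), count


# Label sequences printed by cells 22 and 24 above.
all_subtokens = [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
first_subtoken = [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]

# Made-up logits: each of the 9 NER classes gets the same score at every position.
uniform_logits = [[0.0] * 9 for _ in all_subtokens]

loss_a, n_a = masked_cross_entropy(uniform_logits, all_subtokens)
loss_b, n_b = masked_cross_entropy(uniform_logits, first_subtoken)
print(n_a, n_b)  # 10 9
```

With the second strategy the word `lamb` contributes one term instead of two, which is the effect the exercise describes for long words.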


...

> To preprocess our whole dataset, we need to tokenize all the inputs and apply `align_labels_with_tokens()` on all the labels. To take advantage of the speed of our fast tokenizer, it’s best to tokenize lots of texts at the same time, so we’ll write a function that processes a list of examples and use the `Dataset.map()` method with the option `batched=True`. The only thing that is different from our previous example is that the `word_ids()` function needs to get the index of the example we want the word IDs of when the inputs to the tokenizer are lists of texts (or in our case, list of lists of words), so we add that too

In [25]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [26]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

#### Fine-tuning the model with the Trainer API

#### Data collation

> We can’t just use a DataCollatorWithPadding like in Chapter 3 because that only pads the inputs
> (input IDs, attention mask, and token type IDs).
> Here our labels should be padded the exact same way as the inputs so that they stay the same size,
> using -100 as a value so that the corresponding predictions are ignored in the loss computation.
>
> This is all done by a [`DataCollatorForTokenClassification`](https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForTokenClassification). Like the `DataCollatorWithPadding`, <span style="background-color:#33FFFF">it takes the tokenizer used to preprocess the inputs</span>

In [27]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [28]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])

batch["labels"]

tensor([[-100,    3,    0,    7,    0,    0,    0,    7,    0,    0,    0, -100],
        [-100,    1,    2, -100, -100, -100, -100, -100, -100, -100, -100, -100]])

In [29]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
[-100, 1, 2, -100]
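
Comparing the outputs of cells 28 and 29, the second example grows from 4 labels to 12, and every added position is `-100`. A minimal sketch of that padding step follows. It is not the implementation used by `DataCollatorForTokenClassification`; the name `pad_labels` is hypothetical, and right-side padding is an assumption that matches the output shown above.

```python
def pad_labels(batch_labels, pad_value=-100):
    """Right-pad every label list to the length of the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in batch_labels]


# The two unpadded label lists printed by cell 29.
batch_labels = [
    [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100],
    [-100, 1, 2, -100],
]

padded = pad_labels(batch_labels)
for row in padded:
    print(row)
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, 0, 0, -100]
# [-100, 1, 2, -100, -100, -100, -100, -100, -100, -100, -100, -100]
```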


#### Metrics

From [Metric: seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) on 🤗 Spaces:

> Seqeval produces labelling scores along with its sufficient statistics from a source against one or more references.
>
> It takes two mandatory arguments:
> - `predictions`: a list of lists of predicted labels, i.e. estimated targets as returned by a tagger.
> - `references`: a list of lists of reference labels, i.e. the ground truth/target values.


It works like this:
 
    import evaluate

    seqeval = evaluate.load('seqeval')
    predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
    references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
    results = seqeval.compute(predictions=predictions, references=references)


In [30]:
import evaluate

metric = evaluate.load("seqeval")

In [31]:
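# NOTE: `label_names` was reassigned to the chunk tag names in cell 14, so the NER
# ids are rendered with chunk names below (3 -> 'B-ADVP', 7 -> 'B-INTJ').
# Run `label_names = ner_feature.feature.names` first to get 'B-ORG' and 'B-MISC'.
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels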

['B-ADVP', 'O', 'B-INTJ', 'O', 'O', 'O', 'B-INTJ', 'O', 'O']

In [32]:
predictions = labels.copy()
predictions[2] = "O"

metric.compute(predictions=[predictions], references=[labels])

{'ADVP': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'INTJ': {'precision': 1.0,
  'recall': 0.5,
  'f1': 0.6666666666666666,
  'number': 2},
 'overall_precision': 1.0,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.8,
 'overall_accuracy': 0.8888888888888888}
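
The overall numbers above can be checked by hand: precision, recall and F1 are computed over whole entities, while accuracy is computed over tokens. The sketch below is a simplified re-implementation for a single sentence, not seqeval's code; `extract_entities` and `entity_scores` are hypothetical names. It uses the NER tag names printed for the first training sentence under "Examining the NER tags".

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from IOB2 tags (end exclusive)."""
    entities, start, ent_type = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if tag.startswith("I-") and tag[2:] == ent_type:
            continue  # the current entity keeps going
        if ent_type is not None:
            entities.add((ent_type, start, i))
        if tag == "O":
            start, ent_type = None, None
        else:  # a B- tag, or an I- tag that does not continue the open span
            start, ent_type = i, tag[2:]
    return entities


def entity_scores(predictions, references):
    """Entity-level precision/recall/F1 plus token-level accuracy for one sentence."""
    pred, ref = extract_entities(predictions), extract_entities(references)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


references = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
predictions = references.copy()
predictions[2] = "O"  # miss one of the two MISC entities

scores = entity_scores(predictions, references)
print(scores)
```

Two of the three reference entities are found and nothing spurious is predicted, so precision is 2/2, recall is 2/3, F1 is 0.8, and 8 of 9 tokens match.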

...

> This `compute_metrics()` function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is `-100`, then pass the results to the `metric.compute()` method

In [33]:
import numpy as np

def compute_metrics(eval_preds):
    # unpack
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [
        [label_names[l] for l in label if l != -100]
        for label in labels
    ]
    
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
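
Two steps of `compute_metrics()` can be illustrated without a model: the argmax of the logits equals the argmax of the softmax probabilities, and positions labelled `-100` are dropped from both predictions and labels. The sketch below uses plain Python instead of NumPy, and the logits, labels and the three-name label list are made up for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax of one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


names = ["O", "B-PER", "I-PER"]  # toy label list, not the CoNLL-2003 one
logits = [
    [2.0, 0.5, -1.0],   # [CLS]
    [0.1, 3.0, 0.2],
    [-0.3, 0.4, 1.2],
    [5.0, 0.0, 0.0],    # [SEP]
]
labels = [-100, 1, 2, -100]

predictions = [argmax(row) for row in logits]
same_order = predictions == [argmax(softmax(row)) for row in logits]
print(same_order)  # True: softmax does not change which class scores highest

# Keep only positions whose label is not -100, then map ids to names.
true_labels = [names[l] for l in labels if l != -100]
true_predictions = [names[p] for p, l in zip(predictions, labels) if l != -100]
print(true_labels, true_predictions)
# ['B-PER', 'I-PER'] ['B-PER', 'I-PER']
```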

#### Defining the model

In [34]:
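# NOTE: `label_names` still holds the 23 chunk tag names here (see cell 14), which is
# why `num_labels` below is 23. For the 9 NER tags, build the mappings from
# `ner_feature.feature.names` instead.
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}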

In [35]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
model.config.num_labels

23

In [37]:
id2label

{0: 'O',
 1: 'B-ADJP',
 2: 'I-ADJP',
 3: 'B-ADVP',
 4: 'I-ADVP',
 5: 'B-CONJP',
 6: 'I-CONJP',
 7: 'B-INTJ',
 8: 'I-INTJ',
 9: 'B-LST',
 10: 'I-LST',
 11: 'B-NP',
 12: 'I-NP',
 13: 'B-PP',
 14: 'I-PP',
 15: 'B-PRT',
 16: 'I-PRT',
 17: 'B-SBAR',
 18: 'I-SBAR',
 19: 'B-UCP',
 20: 'I-UCP',
 21: 'B-VP',
 22: 'I-VP'}

#### Fine-tuning the model

In [38]:
from huggingface_hub import notebook_login

notebook_login()


In [39]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    eval_strategy="epoch",  # named `evaluation_strategy` before transformers 4.41
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Passing `tokenizer=tokenizer` to `Trainer` triggers the warning below, so the next cell passes `processing_class=tokenizer` instead:

    FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.

In [40]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    processing_class=tokenizer,
)
#trainer.train()

In [41]:
%%time

trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0801,0.068105,0.911199,0.932514,0.921733,0.981692
2,0.0371,0.064741,0.93403,0.948334,0.941127,0.986269
3,0.0207,0.060307,0.940179,0.952205,0.946154,0.987123


CPU times: user 4min 1s, sys: 10.7 s, total: 4min 12s
Wall time: 4min 37s


TrainOutput(global_step=5268, training_loss=0.07206919162492455, metrics={'train_runtime': 267.2737, 'train_samples_per_second': 157.603, 'train_steps_per_second': 19.71, 'total_flos': 920888121858078.0, 'train_loss': 0.07206919162492455, 'epoch': 3.0})
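
The `global_step` and throughput figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch assumes the default `per_device_train_batch_size` of 8, a single device and no gradient accumulation; none of these is set explicitly in the `TrainingArguments` above, so treat them as assumptions.

```python
import math

num_train_examples = 14041  # size of the train split shown in cell 2
num_train_epochs = 3        # from TrainingArguments
batch_size = 8              # ASSUMPTION: default per_device_train_batch_size, one device
train_runtime = 267.2737    # seconds, from TrainOutput

# The last batch of each epoch is smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 1756 5268

steps_per_second = total_steps / train_runtime
print(round(steps_per_second, 2))  # 19.71
```

Both values agree with `global_step=5268` and `train_steps_per_second: 19.71` in the output above, which is consistent with the assumed batch size.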

In [42]:
trainer.push_to_hub(commit_message="Training complete")

CommitInfo(commit_url='https://huggingface.co/buruzaemon/bert-finetuned-ner/commit/1babb502b4f2c6b5d1a0e989eb6b9ca595fd3aa4', commit_message='Training complete', commit_description='', oid='1babb502b4f2c6b5d1a0e989eb6b9ca595fd3aa4', pr_url=None, repo_url=RepoUrl('https://huggingface.co/buruzaemon/bert-finetuned-ner', endpoint='https://huggingface.co', repo_type='model', repo_id='buruzaemon/bert-finetuned-ner'), pr_revision=None, pr_num=None)