<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_Preparing_Dataset_for_Fine_Tuning_NER_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 2: Named Entity Recognition (NER) using Models in Hugging Face
### Lesson 2.2: Loading and preparing a dataset for fine-tuning the pre-trained bert-base-NER model

In this lesson, we will load and prepare the WNUT 2017 dataset for fine-tuning the pre-trained bert-base-NER model on the named entity recognition (NER) task.

# Install Transformers and Datasets from Hugging Face

In [1]:
# Transformers installation
! pip install -q transformers[torch] datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

# NER as Token classification

Token classification assigns a label or tag to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

In the previous lesson, [Lesson 2.1](https://github.com/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_NER_bert_base_NER.ipynb), we applied a pre-trained model, bert-base-NER, to extract 4 pre-defined entity types. Many applications need to extract different types of entities. To do so, we will fine-tune the pre-trained model on an application-specific dataset.

In this lesson, we begin with preparing a dataset for a fine-tuning process.





## The WNUT 2017 dataset
The Workshop on Noisy and User-generated Text (WNUT) focuses on Natural Language Processing applied to noisy user-generated text. [The WNUT 2017 shared task](https://noisy-text.github.io/2017/index.html) provided data for identifying unusual, previously unseen entities in the context of emerging discussions. We will use the WNUT 2017 dataset to fine-tune the bert-base-NER model to recognize more entity types.

Let us begin with loading the WNUT 17 dataset from the Datasets library:

In [2]:
from datasets import load_dataset

wnut = load_dataset("wnut_17")


The dataset has been split into train, test, and validation sets:

In [3]:
wnut

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})

Each split has three columns: 'id', 'tokens', and 'ner_tags'. If we index a split by a column name, for example 'tokens', we get a list of token lists, one per example.

In [40]:
type(wnut['test']['tokens']), len(wnut['test']['tokens']), len(wnut['test']['tokens'][1])

(list, 1287, 34)

Let us take a look at an example from the WNUT 2017 test dataset:

In [41]:
rec = wnut["test"][1]
for key in rec:
    print(key, ":", rec[key])

id : 1
tokens : ['&', 'gt', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Waltengoo', 'Nar', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'avalanches', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.']
ner_tags : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [42]:
len(rec['tokens'])

34

# List the Tag Names in the WNUT 2017 Dataset

Each number in `ner_tags` represents an entity. Convert the numbers to their tag names to find out what the entities are:

In [33]:
wnut['test'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-corporation', 'I-corporation', 'B-creative-work', 'I-creative-work', 'B-group', 'I-group', 'B-location', 'I-location', 'B-person', 'I-person', 'B-product', 'I-product'], id=None), length=-1, id=None)}

In [43]:
tag_names = wnut["test"].features["ner_tags"].feature.names
tag_names

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']
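
To see the conversion on actual tags, the short sketch below builds two lookup dictionaries from the tag names and decodes a few tag ids. The names `id2label` and `label2id` are illustrative choices, and the tag list is copied from the output above so that the snippet runs on its own.

```python
# Tag names copied from the dataset features shown above.
tag_names = ['O', 'B-corporation', 'I-corporation', 'B-creative-work',
             'I-creative-work', 'B-group', 'I-group', 'B-location',
             'I-location', 'B-person', 'I-person', 'B-product', 'I-product']

# Illustrative lookup tables: tag id -> tag name, and tag name -> tag id.
id2label = dict(enumerate(tag_names))
label2id = {name: idx for idx, name in id2label.items()}

# A slice of the test record: 'from', 'Waltengoo', 'Nar', 'where'.
ner_tags = [0, 7, 8, 0]
decoded = [id2label[tag_id] for tag_id in ner_tags]
print(decoded)
```

Mappings of this kind are also handy later, when predicted label ids have to be turned back into readable tag names.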

As introduced before, the tag names follow the B-I-O scheme. The prefix of each tag name indicates the position of the token within an entity:

- `B-` indicates the beginning of an entity.
- `I-` indicates that a token is inside the same entity (for example, the `State` token is part of an entity like
  `Empire State Building`).
- `O` indicates that the token doesn't correspond to any entity.

There are 6 named entity types in total, plus the tag 'O'. The 6 entity types are: Corporation, Creative-Work, Group, Location, Person, and Product.
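
As a rough illustration of how the B-I-O scheme encodes multi-word entities, the sketch below groups tagged tokens back into entity spans. The helper `extract_entities` is hypothetical (it is not part of the Datasets or Transformers libraries), and the tag list is copied from the output above.

```python
tag_names = ['O', 'B-corporation', 'I-corporation', 'B-creative-work',
             'I-creative-work', 'B-group', 'I-group', 'B-location',
             'I-location', 'B-person', 'I-person', 'B-product', 'I-product']

def extract_entities(tokens, tag_ids):
    """Hypothetical helper: group B-/I- tagged tokens into (type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for token, tag_id in zip(tokens, tag_ids):
        tag = tag_names[tag_id]
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close the open one first.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_words.append(token)
        else:
            # 'O' (or a stray I- tag with no matching B-) closes any open entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

# Tokens and tags taken from the first training example of WNUT 2017.
tokens = ['Empire', 'State', 'Building', '=', 'ESB', '.']
tag_ids = [7, 8, 8, 0, 7, 0]
print(extract_entities(tokens, tag_ids))
```

Here `Empire State Building` is recovered as one location because its first token carries `B-location` and the following two carry `I-location`, while `ESB` forms a second, single-token location.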

# Load the Tokenizer of the bert-base-NER Model to Prepare the Dataset

To fine-tune the bert-base-NER model, we first load its tokenizer to preprocess the `tokens` field:

In [18]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")


# Tokenize the Tokens into Subwords by the Tokenizer

As you saw in the earlier example, `rec` has a `tokens` field, so it looks as if the sentence has already been tokenized. However, it has only been split into words; it has not yet been split into the subword tokens that the model expects. We therefore set `is_split_into_words=True` so that the tokenizer treats the input as pre-split words and tokenizes each word into subwords. For example:

In [39]:
rec = wnut['test'][1]
tokenized_result = tokenizer(rec["tokens"], is_split_into_words=True)
for key in tokenized_result:
    print(key, ":", tokenized_result[key])

input_ids : [101, 111, 176, 1204, 132, 115, 3284, 1314, 1989, 13776, 2908, 12453, 1121, 10495, 14429, 5658, 11896, 1197, 1187, 10366, 1127, 1841, 1170, 170, 1326, 1104, 170, 7501, 23742, 1116, 1855, 1103, 1298, 1107, 1478, 1107, 1103, 1588, 1104, 1103, 3441, 119, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [26]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_result["input_ids"])
tokens

['[CLS]',
 '&',
 'g',
 '##t',
 ';',
 '*',
 'Police',
 'last',
 'week',
 'evacuated',
 '80',
 'villagers',
 'from',
 'Walt',
 '##eng',
 '##oo',
 'Na',
 '##r',
 'where',
 'dozens',
 'were',
 'killed',
 'after',
 'a',
 'series',
 'of',
 'a',
 '##val',
 '##anche',
 '##s',
 'hit',
 'the',
 'area',
 'in',
 '2005',
 'in',
 'the',
 'south',
 'of',
 'the',
 'territory',
 '.',
 '[SEP]']

In [27]:
len(tokens), len(tokenized_result['input_ids']), len(tokenized_result['token_type_ids']), len(tokenized_result['attention_mask'])

(43, 43, 43, 43)

# Assign Given Tags to Tokens after the Tokenization

After applying the tokenizer to the input, we need to assign the given NER tags to the resulting tokens. However, tokenization adds two special tokens, `[CLS]` and `[SEP]`, to the result, and it may also split a single word into several subwords. The special and subword tokens cause a length mismatch between the tokenized result and the tags given in the dataset, so we need to realign the tags with the tokens before fine-tuning.

We apply the following steps for the assignment by realignment:

1. First, we map all subword tokens to their corresponding word. The tokenized result has a [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) method that maps each token to the index of its word (or `None` for a special token).
2. Second, we assign the special tag `-100` to the special tokens `[CLS]` and `[SEP]`, so that they are ignored when the loss is computed.
3. Third, for a word that was split into multiple subword tokens, we assign the original tag only to the first subword token and the special tag `-100` to the rest.


Let us illustrate the steps using an example.

In [44]:
rec = wnut['test'][1]
for key in rec:
    print(key, rec[key])

id 1
tokens ['&', 'gt', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Waltengoo', 'Nar', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'avalanches', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.']
ner_tags [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [45]:
# Check the length of the ner_tags
len(rec['ner_tags'])

34

In [46]:
# Tokenize the input
tokenized_result =  tokenizer(rec['tokens'], is_split_into_words=True, truncation=True)
for key in tokenized_result:
    print(key, ":", tokenized_result[key])

input_ids : [101, 111, 176, 1204, 132, 115, 3284, 1314, 1989, 13776, 2908, 12453, 1121, 10495, 14429, 5658, 11896, 1197, 1187, 10366, 1127, 1841, 1170, 170, 1326, 1104, 170, 7501, 23742, 1116, 1855, 1103, 1298, 1107, 1478, 1107, 1103, 1588, 1104, 1103, 3441, 119, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [47]:
# Check there is a mismatch
len(rec['ner_tags']), len(tokenized_result['input_ids'])

(34, 43)

In [49]:
# To re-assign the tags to the new tokens, map the tokens to their corresponding word ids in the input
word_ids = tokenized_result.word_ids()

In [52]:
# Check that the number of unique word ids is the same as the number of original tokens
# We subtract 1 because there is a special word id 'None' corresponding to the special tokens
len(set(word_ids)) - 1

34

In [53]:
# Re-assign tags to the new tokens
input_tags = []
previous_wid = None
for wid in word_ids:
    if wid is None:
        input_tags.append(-100)
    elif wid == previous_wid:
        input_tags.append(-100)
    else:
        input_tags.append(rec['ner_tags'][wid])
    previous_wid = wid

## Let us check the results

In [72]:
# The new tokens
tokens_new = tokenizer.convert_ids_to_tokens(tokenized_result['input_ids'])
print(tokens_new)

['[CLS]', '&', 'g', '##t', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Walt', '##eng', '##oo', 'Na', '##r', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'a', '##val', '##anche', '##s', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.', '[SEP]']


In [71]:
# The assigned tags to the new tokens
print(input_tags)

[-100, 0, 0, -100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, -100, -100, 8, -100, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100, -100, -100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]


In [75]:
# The original tokens
print(rec['tokens'])

['&', 'gt', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Waltengoo', 'Nar', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'avalanches', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.']


In [74]:
# The original tags
print(rec['ner_tags'])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [76]:
# The tag name of index 7
tag_names[7]

'B-location'

In [81]:
# Which original word was labeled as tag_names[7] = 'B-location'?
print("The original word that was labeled as {} with tag index {} was \"{}\" with word id {}.".format('B-location', 7, rec['tokens'][11], 11))

The original word that was labeled as B-location with tag index 7 was "Waltengoo" with word id 11.


In [70]:
# Check whether the corresponding subword tokens were assigned the correct tags
print(input_tags[13:16], tokens_new[13:16])

[7, -100, -100] ['Walt', '##eng', '##oo']
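
Beyond spot-checking one word, a quick whole-record check is possible: if we drop every `-100`, the remaining tags should be exactly the original tags, in order. The sketch below assumes a hypothetical helper `check_alignment` and rebuilds the two tag lists from the outputs above in compact form, so that it runs without the tokenizer.

```python
def check_alignment(original_tags, aligned_tags, ignore_index=-100):
    """Hypothetical check: the non-ignored tags must equal the original tags."""
    kept = [tag for tag in aligned_tags if tag != ignore_index]
    return kept == list(original_tags)

# Original tags of the test record: 34 words, with 'Waltengoo Nar' as a location.
original_tags = [0] * 11 + [7, 8] + [0] * 21

# Aligned tags as printed above, written compactly (43 tokens):
aligned_tags = (
    [-100, 0, 0, -100]        # [CLS], '&', 'g', '##t'
    + [0] * 9                 # ';' ... 'from'
    + [7, -100, -100]         # 'Walt', '##eng', '##oo'
    + [8, -100]               # 'Na', '##r'
    + [0] * 9                 # 'where' ... first subword 'a' of 'avalanches'
    + [-100] * 3              # '##val', '##anche', '##s'
    + [0] * 12                # 'hit' ... '.'
    + [-100]                  # [SEP]
)

print(len(original_tags), len(aligned_tags))
print(check_alignment(original_tags, aligned_tags))
```

A check like this catches off-by-one mistakes in the realignment loop, since any shifted or duplicated tag makes the comparison fail.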


Putting these steps together, we implement the following function to process a dataset in batches:

In [91]:
def tokenize_and_align_tags(records):
    # Tokenize the input words. This will break words into subtokens if necessary.
    # For instance, "ChatGPT" might become ["Chat", "##G", "##PT"].
    tokenized_results = tokenizer(records["tokens"], truncation=True, is_split_into_words=True)

    input_tags_list = []

    # Iterate through each set of tags in the records.
    for i, given_tags in enumerate(records["ner_tags"]):
        # Get the word IDs corresponding to each token. This tells us to which original word each token corresponds.
        word_ids = tokenized_results.word_ids(batch_index=i)

        previous_word_id = None
        input_tags = []

        # For each token, determine which tag it should get.
        for wid in word_ids:
            # If the token does not correspond to any word (e.g., it's a special token), set its tag to -100.
            if wid is None:
                input_tags.append(-100)
            # If the token corresponds to a new word, use the tag for that word.
            elif wid != previous_word_id:
                input_tags.append(given_tags[wid])
            # If the token is a subtoken (i.e., part of a word we've already tagged), set its tag to -100.
            else:
                input_tags.append(-100)
            previous_word_id = wid

        input_tags_list.append(input_tags)

    # Add the assigned tags to the tokenized results.
    # In the Hugging Face Transformers library, a model recognizes the labels parameter
    # for computing losses along with logits (predictions)
    tokenized_results["labels"] = input_tags_list

    return tokenized_results


To apply the preprocessing function over the entire dataset, use the Hugging Face Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. We can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [92]:
tokenized_wnut = wnut.map(tokenize_and_align_tags, batched=True)
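
With `batched=True`, the function passed to `map` receives a dictionary that maps each column name to a list of values, which is why `tokenize_and_align_tags` reads `records["tokens"]` as a list of token lists. The sketch below is a pure-Python imitation of that input shape, not the library's implementation; `to_columnar_batches` is a hypothetical helper and the rows are abbreviated.

```python
def to_columnar_batches(rows, batch_size):
    """Hypothetical helper: yield {column: [values]} dicts, batch_size rows at a time."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}

# Abbreviated, illustrative rows in the same layout as the WNUT 2017 records.
rows = [
    {"tokens": ["Empire", "State", "Building"], "ner_tags": [7, 8, 8]},
    {"tokens": ["ESB", "."], "ner_tags": [7, 0]},
    {"tokens": ["Pretty", "bad", "storm"], "ner_tags": [0, 0, 0]},
]

batches = list(to_columnar_batches(rows, batch_size=2))
for batch in batches:
    # Each column holds one entry per row in the batch.
    print(batch["tokens"], batch["ner_tags"])
```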


In [93]:
for key in wnut['train'][0]:
    print(key, ":", wnut['train'][0][key])

id : 0
tokens : ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
ner_tags : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]


In [94]:
for key in tokenized_wnut['train'][0]:
    print(key, ":", tokenized_wnut['train'][0][key])

id : 0
tokens : ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
ner_tags : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]
input_ids : [101, 137, 185, 18318, 13868, 1135, 112, 188, 1103, 2458, 1121, 1187, 146, 112, 182, 1690, 1111, 1160, 2277, 119, 2813, 1426, 4334, 134, 142, 19117, 119, 12004, 2213, 4162, 1303, 1314, 3440, 119, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
labels : [-100, 0, -100, -100, -100, 0, 0, -100, 0, 0, 0, 0, 0, 0, -100, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, -100, 0, 0, 0, 0, 0, 0, 0, 0, -100]


# Create Data Collator

Now create a data collator using [DataCollatorForTokenClassification](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForTokenClassification), which assembles batches of examples and pads both the inputs and the labels. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [95]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
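
To make dynamic padding concrete, here is a simplified pure-Python sketch of what happens to one batch. It is not the collator's actual implementation: the real collator returns tensors, while `pad_batch` is a hypothetical helper working on plain lists. It assumes a pad token id of 0 (BERT's `[PAD]`) and pads labels with `-100`, the same ignore value used for the special and subword tokens above.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Hypothetical helper: pad every example to the longest one in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded positions get the ignore value so they do not affect the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two short illustrative examples, with ids taken from the tokenized output above:
# [CLS] Empire State Building [SEP]  and  [CLS] E ##SB [SEP]
features = [
    {"input_ids": [101, 2813, 1426, 4334, 102], "labels": [-100, 7, 8, 8, -100]},
    {"input_ids": [101, 142, 19117, 102], "labels": [-100, 7, -100, -100]},
]

batch = pad_batch(features)
for key, value in batch.items():
    print(key, ":", value)
```

Because the padding length is chosen per batch, a batch of short sentences stays short, which is where the efficiency gain over padding everything to a global maximum comes from.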