<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_Preparing_Dataset_for_Fine_Tuning_NER_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 2: Named Entity Recognition (NER) using Models in Hugging Face
### Lesson 2.2: Loading and preparing a dataset for fine-tuning the pre-trained bert-base-NER model

In this lesson, we will load and prepare the WNUT17 dataset for fine-tuning the pre-trained bert-base-NER model for the named entity recognition (NER) task.

# Install Transformers and Datasets from Hugging Face

In [1]:
# Transformers installation
! pip install -q transformers[torch] datasets
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

# NER as Token classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

In the previous lesson, [Lesson 2.1](https://github.com/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_NER_bert_base_NER.ipynb), we applied a pre-trained model, bert-base-NER, to extract 4 pre-defined entity types. Many applications need to extract other types of entities. To do so, we will fine-tune the pre-trained model on an application-specific dataset.

In this lesson, we begin by preparing a dataset for the fine-tuning process.





## The WNUT 2017 dataset
The Workshop on Noisy and User-generated Text (WNUT) focuses on Natural Language Processing applied to noisy user-generated text. [The WNUT 2017 shared task](https://noisy-text.github.io/2017/index.html) provided data for identifying unusual, previously unseen entities in the context of emerging discussions. We will use the WNUT 2017 dataset to fine-tune the bert-base-NER model so that it recognizes more entity types.

Let us begin by loading the WNUT 17 dataset from the Datasets library:

In [3]:
from datasets import load_dataset

wnut = load_dataset("wnut_17")

The dataset has been split into train, test, and validation sets:

In [4]:
wnut

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3394
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1009
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1287
    })
})

Let us take a look at an example from the WNUT 2017 test dataset:

In [7]:
rec = wnut["test"][1]
for key in rec:
    print(key, ":", rec[key])

id : 1
tokens : ['&', 'gt', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Waltengoo', 'Nar', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'avalanches', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.']
ner_tags : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


# List the Tag Names in the WNUT 2017 Dataset

Each number in `ner_tags` is the index of an entity tag. To find out which tag each number stands for, inspect the dataset features and read the tag names from the `ner_tags` feature:

In [10]:
wnut['test'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-corporation', 'I-corporation', 'B-creative-work', 'I-creative-work', 'B-group', 'I-group', 'B-location', 'I-location', 'B-person', 'I-person', 'B-product', 'I-product'], id=None), length=-1, id=None)}

In [11]:
tag_list = wnut["test"].features["ner_tags"].feature.names
tag_list

['O',
 'B-corporation',
 'I-corporation',
 'B-creative-work',
 'I-creative-work',
 'B-group',
 'I-group',
 'B-location',
 'I-location',
 'B-person',
 'I-person',
 'B-product',
 'I-product']

As introduced earlier, the tags follow the B-I-O scheme. The prefix of each tag indicates the position of the token within an entity:

- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (for example, the `State` token is a part of an entity like
  `Empire State Building`).
- `O` indicates the token doesn't correspond to any entity.

There are 6 named entity types in total, plus the tag 'O'. The 6 entity types are: Corporation, Creative-Work, Group, Location, Person, and Product.
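
As a small illustration of how the tag IDs map back to names, the sketch below decodes the `ner_tags` of a fragment of the test record shown earlier. The tag names and the fragment are copied by hand so that the example runs without downloading the dataset. The names `id2label` and `label2id` are assumptions chosen for this example; they are not provided by the dataset.

```python
# Tag names copied from the ClassLabel feature printed above.
tag_list = [
    "O",
    "B-corporation", "I-corporation",
    "B-creative-work", "I-creative-work",
    "B-group", "I-group",
    "B-location", "I-location",
    "B-person", "I-person",
    "B-product", "I-product",
]

# Hypothetical helper mappings between integer IDs and tag names.
id2label = dict(enumerate(tag_list))
label2id = {name: idx for idx, name in id2label.items()}

# A fragment of the record wnut["test"][1], copied by hand.
tokens = ["villagers", "from", "Waltengoo", "Nar", "where"]
ner_tags = [0, 0, 7, 8, 0]

# Pair every token with the name of its tag.
decoded = [(tok, id2label[tag]) for tok, tag in zip(tokens, ner_tags)]
for tok, name in decoded:
    print(f"{tok:10s} {name}")

# Entity types are the tag names with the B-/I- prefix removed.
entity_types = sorted({name[2:] for name in tag_list if name != "O"})
print(entity_types)
```

In this fragment, `Waltengoo` decodes to `B-location` and `Nar` to `I-location`, so the two tokens together form a single location entity.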

# Load the Tokenizer of the bert-base-NER Model to Prepare the Dataset

To fine-tune the bert-base-NER model, we need to load its tokenizer to preprocess the `tokens` field:

In [13]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")


# Tokenize the Tokens into Subwords by the Tokenizer

As we saw in the earlier example, `rec` has a `tokens` field, so it may look as if the sentence has already been tokenized. However, the sentence has only been split into words; the model's tokenizer still has to break those words into subwords. Because the input is already a list of words, we set `is_split_into_words=True`. For example:

In [21]:
rec = wnut['test'][1]
tokenized_input = tokenizer(rec["tokens"], is_split_into_words=True)
for key in tokenized_input:
    print(key, ":", tokenized_input[key])

input_ids : [101, 111, 176, 1204, 132, 115, 3284, 1314, 1989, 13776, 2908, 12453, 1121, 10495, 14429, 5658, 11896, 1197, 1187, 10366, 1127, 1841, 1170, 170, 1326, 1104, 170, 7501, 23742, 1116, 1855, 1103, 1298, 1107, 1478, 1107, 1103, 1588, 1104, 1103, 3441, 119, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [15]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 '&',
 'g',
 '##t',
 ';',
 '*',
 'Police',
 'last',
 'week',
 'evacuated',
 '80',
 'villagers',
 'from',
 'Walt',
 '##eng',
 '##oo',
 'Na',
 '##r',
 'where',
 'dozens',
 'were',
 'killed',
 'after',
 'a',
 'series',
 'of',
 'a',
 '##val',
 '##anche',
 '##s',
 'hit',
 'the',
 'area',
 'in',
 '2005',
 'in',
 'the',
 'south',
 'of',
 'the',
 'territory',
 '.',
 '[SEP]']
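
BERT tokenizers split words with the WordPiece scheme, which can be approximated as a greedy longest-match-first lookup against a vocabulary. The sketch below is a simplified illustration of that idea, not the library's implementation: the vocabulary `toy_vocab` is made up for this example, the helper name `wordpiece_split` is hypothetical, and details such as text normalization and punctuation splitting are ignored.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Pieces that do not start the word carry the '##' continuation prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Tiny made-up vocabulary, just large enough for the words below.
toy_vocab = {"Walt", "##eng", "##oo", "Na", "##r", "where"}

for word in ["Waltengoo", "Nar", "where"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

With this toy vocabulary, the splits of `Waltengoo` and `Nar` match the subwords shown in the output above. The real tokenizer uses a much larger vocabulary.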

# Realign the Subword Tokens with the Tags

However, the tokenizer adds the special tokens `[CLS]` and `[SEP]`, and the subword tokenization creates a mismatch between the input words and their tags: a single word with a single tag may now be split into two or more subwords. We need to realign the subword tokens and tags by:

1. Mapping all subword tokens to their corresponding word with the [`word_ids`](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.BatchEncoding.word_ids) method.
2. Assigning the tag `-100` to the special tokens `[CLS]` and `[SEP]` so they're ignored by the PyTorch loss function.
3. Only labeling the first token of a given word. Assign `-100` to other subword tokens from the same word.

Here is how we can create a function to realign the subword tokens and tags, and truncate sequences to be no longer than bert-base-NER's maximum input length:

In [16]:
def tokenize_and_align_labels(records):
    # Tokenize a batch of pre-split sentences into subwords, truncating overly long ones.
    tokenized_inputs = tokenizer(records["tokens"], truncation=True, is_split_into_words=True)

    tags = []
    for i, tag in enumerate(records["ner_tags"]):
        # Map each subword token to the index of the word it came from.
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        tag_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens ([CLS], [SEP]) belong to no word; set them to -100.
                tag_ids.append(-100)
            elif word_idx != previous_word_idx:
                # Only label the first subword of a given word.
                tag_ids.append(tag[word_idx])
            else:
                # Remaining subwords of the same word are set to -100.
                tag_ids.append(-100)
            previous_word_idx = word_idx
        tags.append(tag_ids)

    tokenized_inputs["tags"] = tags
    return tokenized_inputs
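
To see the alignment rule in isolation, the following sketch applies the same inner loop to a short fragment of the test sentence. The subwords and word indices are written by hand, treating the fragment as if it were a complete sentence, so no tokenizer is needed. The helper name `align_tags` is hypothetical and exists only for this illustration.

```python
def align_tags(word_ids, word_tags, ignore_index=-100):
    """Give each word's tag to its first subword and mask everything else."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special token such as [CLS] or [SEP].
            aligned.append(ignore_index)
        elif word_idx != previous_word_idx:
            # First subword of a new word keeps the word's tag.
            aligned.append(word_tags[word_idx])
        else:
            # Later subwords of the same word are masked.
            aligned.append(ignore_index)
        previous_word_idx = word_idx
    return aligned


# Hand-written fragment: four words with tags O, B-location, I-location, O.
words = ["from", "Waltengoo", "Nar", "where"]
word_tags = [0, 7, 8, 0]

# Subwords as split in the earlier output, with the index of their source word.
subwords = ["[CLS]", "from", "Walt", "##eng", "##oo", "Na", "##r", "where", "[SEP]"]
word_ids = [None, 0, 1, 1, 1, 2, 2, 3, None]

aligned = align_tags(word_ids, word_tags)
for subword, tag in zip(subwords, aligned):
    print(f"{subword:8s} {tag}")
```

Only `Walt` and `Na`, the first subwords of their words, keep the tags 7 and 8; the special tokens and the continuation subwords all receive `-100`.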

To apply the preprocessing function to the entire dataset, use the Hugging Face Datasets [`map`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. We can speed up `map` by setting `batched=True`, which processes multiple elements of the dataset at once:

In [17]:
tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)


In [19]:
for key in wnut['train'][0]:
    print(key, ":", wnut['train'][0][key])

id : 0
tokens : ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
ner_tags : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]


In [18]:
for key in tokenized_wnut['train'][0]:
    print(key, ":", tokenized_wnut['train'][0][key])

id : 0
tokens : ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
ner_tags : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0]
input_ids : [101, 137, 185, 18318, 13868, 1135, 112, 188, 1103, 2458, 1121, 1187, 146, 112, 182, 1690, 1111, 1160, 2277, 119, 2813, 1426, 4334, 134, 142, 19117, 119, 12004, 2213, 4162, 1303, 1314, 3440, 119, 102]
token_type_ids : [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
attention_mask : [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
tags : [-100, 0, -100, -100, -100, 0, 0, -100, 0, 0, 0, 0, 0, 0, -100, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, -100, 0, 0, 0, 0, 0, 0, 0, 0, -100]
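
One quick way to sanity-check the alignment is to drop every `-100` from `tags` and compare what remains with the original `ner_tags`. The sketch below does this for the record printed above, with both lists copied by hand so that it runs on its own; the variable names are chosen for this example only.

```python
# Original word-level tags of tokenized_wnut['train'][0], copied from the output above.
ner_tags = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7,
            0, 0, 0, 0, 0, 0, 0, 0]

# Subword-level tags of the same record, copied from the output above.
tags = [-100, 0, -100, -100, -100, 0, 0, -100, 0, 0, 0, 0, 0, 0, -100,
        0, 0, 0, 0, 0, 7, 8, 8, 0, 7, -100, 0, 0, 0, 0, 0, 0, 0, 0, -100]

# Keep only the positions that carry a real tag.
kept = [t for t in tags if t != -100]
masked = len(tags) - len(kept)

print("subword positions:", len(tags))
print("masked positions: ", masked)
print("matches ner_tags: ", kept == ner_tags)
```

If the realignment worked, the remaining tags equal `ner_tags` exactly, because every word contributes exactly one labeled subword.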


# Create Data Collator

Now create a batch of examples using [DataCollatorForTokenClassification](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorForTokenClassification). It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [22]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
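
The idea of dynamic padding can be sketched in plain Python. This is only an illustration of the concept, not the collator's actual implementation: the helper `pad_batch` is hypothetical, the padding ID 0 is an assumption about the tokenizer's `[PAD]` token, and `-100` mirrors the value used above for positions the loss should ignore. Note also that the collator looks for a label column named `labels` (or `label`), so the `tags` column created earlier would likely need to be renamed before training.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every example to the length of the longest example in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        length = len(f["input_ids"])
        n_pad = max_len - length
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Real tokens get 1, padding gets 0.
        batch["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded label positions are ignored by the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two toy examples of different lengths; the IDs are taken from the outputs above.
features = [
    {"input_ids": [101, 1121, 102], "labels": [-100, 0, -100]},
    {"input_ids": [101, 3284, 1314, 1989, 102], "labels": [-100, 0, 0, 0, -100]},
]

batch = pad_batch(features)
for key, rows in batch.items():
    print(key)
    for row in rows:
        print("  ", row)
```

Both examples are padded to length 5, the longest in this batch, rather than to the model's maximum input length.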