<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/working_huggingface/Working_with_HuggingFace_ch2_NER_bert_base_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Working with Hugging Face Models and Datasets
## Chapter 2: Named Entity Recognition (NER) using Models in Hugging Face
### Lesson 2.1: NER using the bert-base-NER model

In this lesson, we will use the pre-trained BERT model, bert-base-NER, for named entity recognition.

# Install Transformers and Datasets from Hugging Face

In [1]:
# Transformers installation
! pip install transformers[torch] datasets
# To install from source instead of the latest release, comment out the command above and uncomment the one below.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting transformers[torch]
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting datasets
  Downloading datasets-2.14.5-py3-none-any.whl (519 kB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers[torch])
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers[torch])
  Downloading tokenizers-0.14.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers[t

# NER as Token classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is Named Entity Recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.

The pre-trained BERT model, bert-base-NER, has been fine-tuned for Named Entity Recognition. It has been trained to recognize four types of entities: location (LOC), organization (ORG), person (PER), and miscellaneous (MISC). Specifically, this model is a bert-base-cased model that was fine-tuned on the English version of the standard CoNLL-2003 Named Entity Recognition dataset: https://www.aclweb.org/anthology/W03-0419.pdf
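
As a small illustration of the label space described above, the sketch below builds the tag set that a tagger using the B-I-O scheme (introduced later in this lesson) would need for these four entity types. The variable names are illustrative and are not read from the model's configuration.

```python
# Entity types the model was fine-tuned to recognize (from the text above).
entity_types = ["PER", "ORG", "LOC", "MISC"]

# In the B-I-O scheme each type gets a "begin" (B-) and an "inside" (I-) tag,
# plus a single "O" tag for tokens that are outside any entity.
labels = ["O"] + [
    f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")
]

print(labels)
print(len(labels))  # 4 types * 2 prefixes + 1 "O" tag = 9
```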



# Load Model and Tokenizer from bert-base-NER

In [2]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Create a Pipeline from the bert-base-NER Model and Tokenizer

In [3]:
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Prepare a Text

In [4]:
text = "Apple Inc. plans to open a new store in San Francisco by January 2024. Tim Cook, the CEO, announced the news yesterday."

# Label Tokens with the Tags in the B-I-O Scheme

In [24]:
ner_results = nlp(text)
print(ner_results)

[{'entity': 'B-ORG', 'score': 0.9996086, 'index': 1, 'word': 'Apple', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99942136, 'index': 2, 'word': 'Inc', 'start': 6, 'end': 9}, {'entity': 'B-LOC', 'score': 0.99934715, 'index': 11, 'word': 'San', 'start': 40, 'end': 43}, {'entity': 'I-LOC', 'score': 0.99942625, 'index': 12, 'word': 'Francisco', 'start': 44, 'end': 53}, {'entity': 'B-PER', 'score': 0.9997869, 'index': 18, 'word': 'Tim', 'start': 71, 'end': 74}, {'entity': 'I-PER', 'score': 0.99977297, 'index': 19, 'word': 'Cook', 'start': 75, 'end': 79}]


# Extract the Named Entities

In [6]:
# The code below presumes that ner_results is a list of dictionaries, each representing a token,
# arranged in the sequence they appeared in the source sentence.
organized_results = {'LOC': [], 'PER': [], 'ORG': [], 'MISC': []}

current_entity = None
current_words = []

for result in ner_results:
    entity_type = result['entity'].split('-')[1]
    if result['entity'].startswith('B-'):
        if current_entity:
            organized_results[current_entity].append(' '.join(current_words))
        current_entity = entity_type
        current_words = [result['word']]
    elif result['entity'].startswith('I-') and current_entity == entity_type:
        current_words.append(result['word'])

# Handle the last entity
if current_entity:
    organized_results[current_entity].append(' '.join(current_words))

# Rejoin word pieces: drop the space and the '##' marker that precede a continuation piece
for key, value in organized_results.items():
    organized_results[key] = [word.replace(' ##', '') for word in value]

print(organized_results)


{'LOC': ['San Francisco'], 'PER': ['Tim Cook'], 'ORG': ['Apple Inc'], 'MISC': []}
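
An alternative to joining the `word` fields is to use the `start` and `end` character offsets that the pipeline returns and slice the original text, which sidesteps the `##` markers entirely. The following is a minimal sketch under that assumption; `group_entities_by_offsets` is a hypothetical helper name, and the sample data is copied from the output above with the scores omitted.

```python
# Sentence and token-level pipeline output copied from the cells above.
text = ("Apple Inc. plans to open a new store in San Francisco by January 2024. "
        "Tim Cook, the CEO, announced the news yesterday.")
ner_results = [
    {"entity": "B-ORG", "start": 0, "end": 5},
    {"entity": "I-ORG", "start": 6, "end": 9},
    {"entity": "B-LOC", "start": 40, "end": 43},
    {"entity": "I-LOC", "start": 44, "end": 53},
    {"entity": "B-PER", "start": 71, "end": 74},
    {"entity": "I-PER", "start": 75, "end": 79},
]


def group_entities_by_offsets(text, results):
    """Merge B-/I- tagged tokens into spans, then slice the text for each span."""
    spans = []  # each item is [entity_type, start, end]
    for res in results:
        prefix, entity_type = res["entity"].split("-", 1)
        continues = bool(spans) and prefix == "I" and spans[-1][0] == entity_type
        if continues:
            spans[-1][2] = res["end"]  # extend the currently open span
        else:
            spans.append([entity_type, res["start"], res["end"]])

    grouped = {}
    for entity_type, start, end in spans:
        grouped.setdefault(entity_type, []).append(text[start:end])
    return grouped


print(group_entities_by_offsets(text, ner_results))
# {'ORG': ['Apple Inc'], 'LOC': ['San Francisco'], 'PER': ['Tim Cook']}
```

The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs a similar grouping internally; whether its output matches this sketch exactly has not been checked here.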


# Generate a List of Tokens and the Corresponding List of Entity Tags

In [7]:
token_list = []
tag_list = []
for result in ner_results:
    token_list.append(result['word'])
    tag_list.append(result['entity'])

In [8]:
token_list, tag_list

(['Apple', 'Inc', 'San', 'Francisco', 'Tim', 'Cook'],
 ['B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-PER', 'I-PER'])

## Let Us Test the Model on the CoNLL2003 Data


Start by loading the CoNLL2003 dataset from the Datasets library:

In [9]:
from datasets import load_dataset

conll = load_dataset("conll2003")

Then take a look at an example:

In [14]:
conll["test"][0]

{'id': '0',
 'tokens': ['SOCCER',
  '-',
  'JAPAN',
  'GET',
  'LUCKY',
  'WIN',
  ',',
  'CHINA',
  'IN',
  'SURPRISE',
  'DEFEAT',
  '.'],
 'pos_tags': [21, 8, 22, 37, 22, 22, 6, 22, 15, 12, 21, 7],
 'chunk_tags': [11, 0, 11, 21, 11, 12, 0, 11, 13, 11, 12, 0],
 'ner_tags': [0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]}

Each number in `ner_tags` is the id of an entity tag. To find out what the ids mean, first inspect the dataset's splits and features, then read the label names from the `ner_tags` feature:

In [15]:
conll

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [16]:
conll['test'].features

{'id': Value(dtype='string', id=None),
 'tokens': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'pos_tags': Sequence(feature=ClassLabel(names=['"', "''", '#', '$', '(', ')', ',', '.', ':', '``', 'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'NN|SYM', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB'], id=None), length=-1, id=None),
 'chunk_tags': Sequence(feature=ClassLabel(names=['O', 'B-ADJP', 'I-ADJP', 'B-ADVP', 'I-ADVP', 'B-CONJP', 'I-CONJP', 'B-INTJ', 'I-INTJ', 'B-LST', 'I-LST', 'B-NP', 'I-NP', 'B-PP', 'I-PP', 'B-PRT', 'I-PRT', 'B-SBAR', 'I-SBAR', 'B-UCP', 'I-UCP', 'B-VP', 'I-VP'], id=None), length=-1, id=None),
 'ner_tags': Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)}

In [17]:
label_list = conll["test"].features["ner_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

The letter that prefixes each `ner_tag` indicates the token position of the entity:

- `B-` indicates the beginning of an entity.
- `I-` indicates a token is contained inside the same entity (for example, the `State` token is a part of an entity like
  `Empire State Building`).
- `O` indicates the token doesn't correspond to any entity.
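
To make the scheme concrete, the sketch below decodes the first test example shown earlier. The token list, tag ids, and label names are copied from the outputs above; nothing is downloaded.

```python
# Values copied from the outputs above.
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
tokens = ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',',
          'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
ner_tags = [0, 0, 5, 0, 0, 0, 0, 1, 0, 0, 0, 0]

# Look up the label name for each tag id.
tagged = [(tok, label_list[tag]) for tok, tag in zip(tokens, ner_tags)]

# Keep only the tokens that belong to an entity.
entities = [(tok, lab) for tok, lab in tagged if lab != 'O']
print(entities)  # [('JAPAN', 'B-LOC'), ('CHINA', 'B-PER')]
```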

# Test the Model on a Test Example

In [63]:
test_text = " ".join(conll['test'][12]['tokens'])
test_text

"Defender Hassan Abbas rose to intercept a long ball into the area in the 84th minute but only managed to divert it into the top corner of Bitar 's goal ."

In [64]:
tokenized_input = tokenizer(test_text)
tokenized_input

{'input_ids': [101, 3177, 27896, 13583, 19166, 3152, 1106, 22205, 170, 1263, 3240, 1154, 1103, 1298, 1107, 1103, 5731, 1582, 2517, 1133, 1178, 2374, 1106, 23448, 1204, 1122, 1154, 1103, 1499, 2655, 1104, 27400, 1813, 112, 188, 2273, 119, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [65]:
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens

['[CLS]',
 'De',
 '##fender',
 'Hassan',
 'Abbas',
 'rose',
 'to',
 'intercept',
 'a',
 'long',
 'ball',
 'into',
 'the',
 'area',
 'in',
 'the',
 '84',
 '##th',
 'minute',
 'but',
 'only',
 'managed',
 'to',
 'diver',
 '##t',
 'it',
 'into',
 'the',
 'top',
 'corner',
 'of',
 'Bit',
 '##ar',
 "'",
 's',
 'goal',
 '.',
 '[SEP]']

In [67]:
ner_results = nlp(test_text)
ner_results

[{'entity': 'B-PER',
  'score': 0.99953103,
  'index': 3,
  'word': 'Hassan',
  'start': 9,
  'end': 15},
 {'entity': 'I-PER',
  'score': 0.99965537,
  'index': 4,
  'word': 'Abbas',
  'start': 16,
  'end': 21},
 {'entity': 'B-ORG',
  'score': 0.98183286,
  'index': 31,
  'word': 'Bit',
  'start': 138,
  'end': 141},
 {'entity': 'I-ORG',
  'score': 0.9706425,
  'index': 32,
  'word': '##ar',
  'start': 141,
  'end': 143}]

In [70]:
ner_results_dict = {}
for result in ner_results:
    ner_results_dict[result['word']] = result['entity']
ner_results_dict

{'Hassan': 'B-PER', 'Abbas': 'I-PER', 'Bit': 'B-ORG', '##ar': 'I-ORG'}

In [73]:
prediction = []
for tok in tokens[1:-1]:
    if tok in ner_results_dict:
        prediction.append(ner_results_dict[tok])
    else:
        prediction.append('O')
prediction

['O',
 'O',
 'B-PER',
 'I-PER',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-ORG',
 'I-ORG',
 'O',
 'O',
 'O',
 'O']
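
Note that this prediction list has one tag per wordpiece, whereas the CoNLL `ner_tags` have one tag per original word. The short check below, using the sentence and wordpieces copied from the outputs above, shows how the two sequences differ in length and how an entity ends up at a different index.

```python
test_text = ("Defender Hassan Abbas rose to intercept a long ball into the area "
             "in the 84th minute but only managed to divert it into the top "
             "corner of Bitar 's goal .")

# Wordpieces copied from the tokenizer output above, without [CLS] and [SEP].
wordpieces = ['De', '##fender', 'Hassan', 'Abbas', 'rose', 'to', 'intercept',
              'a', 'long', 'ball', 'into', 'the', 'area', 'in', 'the', '84',
              '##th', 'minute', 'but', 'only', 'managed', 'to', 'diver', '##t',
              'it', 'into', 'the', 'top', 'corner', 'of', 'Bit', '##ar', "'",
              's', 'goal', '.']

# The sentence was built with " ".join(tokens), so splitting on a single
# space recovers the original CoNLL tokens.
words = test_text.split(" ")

print(len(words), len(wordpieces))                         # 31 36
print(words.index("Hassan"), wordpieces.index("Hassan"))   # 1 2
```

Because `Defender` is split into two pieces, every tag after it is shifted by one position relative to the word-level reference, and the shift grows with each further split.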

# Create a Map of ids to Their Labels with id2label and label2id

In [38]:
label_list = conll["test"].features["ner_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [39]:
id2label = {}
label2id = {}
for idx, lab in enumerate(label_list):
    id2label[idx] = lab
    label2id[lab] = idx

In [40]:
id2label, label2id

({0: 'O',
  1: 'B-PER',
  2: 'I-PER',
  3: 'B-ORG',
  4: 'I-ORG',
  5: 'B-LOC',
  6: 'I-LOC',
  7: 'B-MISC',
  8: 'I-MISC'},
 {'O': 0,
  'B-PER': 1,
  'I-PER': 2,
  'B-ORG': 3,
  'I-ORG': 4,
  'B-LOC': 5,
  'I-LOC': 6,
  'B-MISC': 7,
  'I-MISC': 8})
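
The same two maps can be written more compactly. This is only a stylistic sketch; the label list is copied from the output above, and the final check confirms that the two maps are inverses of each other.

```python
label_list = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# enumerate() yields (index, label) pairs, which dict() turns into id -> label.
id2label = dict(enumerate(label_list))
# Invert the mapping to get label -> id.
label2id = {label: idx for idx, label in id2label.items()}

# Round trip: every id maps to a label that maps back to the same id.
round_trip_ok = all(label2id[id2label[i]] == i for i in id2label)
print(round_trip_ok)
```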

# Apply the Model to All Test Data

In [82]:
from tqdm import tqdm

In [83]:
test = conll["test"]  # the CoNLL-2003 test split loaded earlier

references = []
predictions = []
for atest in tqdm(test, desc=str(len(test))):
    # add true labels to references
    references.append([id2label[tag_id] for tag_id in atest['ner_tags']])

    # recognize named entity in a test sentence
    test_text = " ".join(atest['tokens'])
    test_ner_results = nlp(test_text)

    # make a map from a token to the recognized tag
    ner_results_dict = {}
    for result in test_ner_results:
        ner_results_dict[result['word']] = result['entity']

    # tokenize the sentence
    tokenized_input = tokenizer(test_text)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

    # for each token find its predicted tag or 'O'
    prediction = []
    for tok in tokens[1:-1]: # first and last tokens are [CLS] and [SEP]
        if tok in ner_results_dict:
            prediction.append(ner_results_dict[tok])
        else:
            prediction.append('O')
    predictions.append(prediction)

3453: 100%|██████████| 3453/3453 [08:53<00:00,  6.47it/s]


In [84]:
len(predictions) == len(references)

True

## Evaluate

We can quickly load an evaluation method with the Hugging Face [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [seqeval](https://huggingface.co/spaces/evaluate-metric/seqeval) metric (see the Hugging Face Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric). Seqeval produces several scores: precision, recall, F1, and accuracy.
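
Precision, recall, and F1 in seqeval are computed over whole entity spans rather than individual tokens. The following is a simplified pure-Python sketch of that idea for a single sentence; it is not seqeval's implementation, and `extract_spans` and `span_prf` are hypothetical helper names. It illustrates why predictions that are shifted by even one position score zero.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a B-I-O sequence."""
    spans = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" closes the last span
        prefix, _, etype = tag.partition("-")
        inside = prefix == "I" and etype == current
        if current is not None and not inside:
            spans.add((current, start, i))  # end index is exclusive
            start, current = None, None
        # A stray I- tag with no open span is treated as the start of a span.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return spans


def span_prf(reference, prediction):
    """Exact-match precision, recall, and F1 over entity spans."""
    ref, pred = extract_spans(reference), extract_spans(prediction)
    correct = len(ref & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = ["O", "B-PER", "I-PER", "O", "B-LOC"]
aligned = ["O", "B-PER", "I-PER", "O", "B-LOC"]
shifted = ["O", "O", "B-PER", "I-PER", "O"]  # same entity, one position late

print(span_prf(reference, aligned))  # (1.0, 1.0, 1.0)
print(span_prf(reference, shifted))  # (0.0, 0.0, 0.0)
```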

In [85]:
! pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.0 responses-0.18.0


In [87]:
! pip install seqeval

Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-1.2.2-py3-none-any.whl size=16162 sha256=f24eea345c41e655791de48bd71abea41dcee9f962bcfdc16831a56c3c11a04e
  Stored in directory: /root/.cache/pip/wheels/1a/67/4a/ad4082dd7dfc30f2abfe4d80a2ed5926a506eb8a972b4767fa
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [88]:
import evaluate

seqeval = evaluate.load("seqeval")

In [94]:
# Make each prediction and its reference the same length by padding the shorter
# one with 'O'
true_predictions = []
true_labels = []
for pred, lab in zip(predictions, references):

    diff = len(pred) - len(lab)

    # Pad the shorter list with 'O' based on the difference
    if diff > 0:  # pred is longer
        lab.extend(['O'] * diff)
    elif diff < 0:  # lab is longer
        pred.extend(['O'] * -diff)

    true_predictions.append(pred)
    true_labels.append(lab)

In [95]:
results = seqeval.compute(predictions=true_predictions, references=true_labels)

print("precision:", results["overall_precision"])
print("recall:", results["overall_recall"])
print("f1:", results["overall_f1"])
print("accuracy:", results["overall_accuracy"])

precision: 0.17113154738104758
recall: 0.2273371104815864
f1: 0.19527032164854383
accuracy: 0.7665968074880636
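
These scores should be read with the alignment in mind. The loop above builds one prediction per wordpiece and looks tags up by token string, while the references hold one tag per CoNLL word, so padding to equal length does not line the two sequences up. This mismatch likely accounts for much of the low precision and recall. One possible way to produce word-level predictions, sketched below, is to map each pipeline result to the word that contains its `start` offset. `word_level_tags` is a hypothetical helper, the sample data is copied from the earlier outputs, and the scores above have not been recomputed with it.

```python
def word_level_tags(words, ner_results):
    """Assign one tag per word, assuming the text was built as " ".join(words)."""
    # Character span (start, end) of each word inside the joined text.
    spans, pos = [], 0
    for word in words:
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1  # +1 for the joining space

    tags = ["O"] * len(words)
    for res in ner_results:
        for i, (start, end) in enumerate(spans):
            if start <= res["start"] < end:
                # Keep the tag of the first wordpiece that falls inside the word.
                if tags[i] == "O":
                    tags[i] = res["entity"]
                break
    return tags


test_text = ("Defender Hassan Abbas rose to intercept a long ball into the area "
             "in the 84th minute but only managed to divert it into the top "
             "corner of Bitar 's goal .")
sample_results = [
    {"entity": "B-PER", "word": "Hassan", "start": 9, "end": 15},
    {"entity": "I-PER", "word": "Abbas", "start": 16, "end": 21},
    {"entity": "B-ORG", "word": "Bit", "start": 138, "end": 141},
    {"entity": "I-ORG", "word": "##ar", "start": 141, "end": 143},
]

words = test_text.split(" ")
tags = word_level_tags(words, sample_results)
print(len(tags) == len(words))                       # True
print([(w, t) for w, t in zip(words, tags) if t != "O"])
# [('Hassan', 'B-PER'), ('Abbas', 'I-PER'), ('Bitar', 'B-ORG')]
```

With one tag per word, the prediction has the same length as the reference by construction, so no padding step is needed.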
