<a href="https://colab.research.google.com/github/anyuanay/medium/blob/main/src/bert_NER_name_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use bert-base-NER for Named Entity Recognition
bert-base-NER is a pre-trained BERT model that has been fine-tuned for named entity recognition. It recognizes four types of entities: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).
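
The model predicts one tag per token. As the pipeline output later in this notebook shows, a tag combines a prefix with an entity type: `B-` marks the first token of an entity and `I-` marks a continuation. The sketch below lists the tag set this implies. The helper `parse_tag` is hypothetical, and the `O` tag for non-entity tokens and the ordering of the list are assumptions made for illustration, not the model's actual label ids.

```python
ENTITY_TYPES = ["LOC", "ORG", "PER", "MISC"]

# "O" marks tokens outside any entity (assumed here; the pipeline hides it by default).
# Every entity type gets a B- (begin) and an I- (inside) tag.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def parse_tag(tag):
    """Split a tag such as 'B-ORG' into (prefix, entity_type)."""
    if tag == "O":
        return ("O", None)
    prefix, etype = tag.split("-", 1)
    return (prefix, etype)


print(LABELS)
print(parse_tag("B-ORG"))
```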

## Install transformers

In [2]:
# Transformers installation
! pip install transformers



## Load bert-base-NER and Its Tokenizer

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Create an Instance of Pipeline

In [4]:
ner_cls = pipeline("ner", model=model, tokenizer=tokenizer)

## Label Entities: LOC, ORG, PER, MISC

In [7]:
example = "Hugging Face was founded in 2016 by French entrepreneurs Clément Delangue, Julien Chaumond, and Thomas Wolf in New York City."

ner_results = ner_cls(example)
print(ner_results)

[{'entity': 'B-ORG', 'score': 0.7047629, 'index': 1, 'word': 'Hu', 'start': 0, 'end': 2}, {'entity': 'I-ORG', 'score': 0.7959128, 'index': 2, 'word': '##gging', 'start': 2, 'end': 7}, {'entity': 'I-ORG', 'score': 0.96256524, 'index': 3, 'word': 'Face', 'start': 8, 'end': 12}, {'entity': 'B-MISC', 'score': 0.99964213, 'index': 9, 'word': 'French', 'start': 36, 'end': 42}, {'entity': 'B-PER', 'score': 0.9996284, 'index': 11, 'word': 'C', 'start': 57, 'end': 58}, {'entity': 'B-PER', 'score': 0.65512276, 'index': 12, 'word': '##lé', 'start': 58, 'end': 60}, {'entity': 'B-PER', 'score': 0.9970042, 'index': 13, 'word': '##ment', 'start': 60, 'end': 64}, {'entity': 'I-PER', 'score': 0.99956423, 'index': 14, 'word': 'Del', 'start': 65, 'end': 68}, {'entity': 'I-PER', 'score': 0.99864, 'index': 15, 'word': '##ang', 'start': 68, 'end': 71}, {'entity': 'I-PER', 'score': 0.97261906, 'index': 16, 'word': '##ue', 'start': 71, 'end': 73}, {'entity': 'B-PER', 'score': 0.9997683, 'index': 18, 'word': '
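
Each result describes one word piece. Pieces that start with `##` continue the previous piece, and `start` and `end` are character offsets into the input string. A minimal sketch of using those offsets to recover the original text is shown below. The helper `span_text` is hypothetical, and the token list copies a few entries from the output above with the `score` and `index` fields left out for brevity.

```python
example = ("Hugging Face was founded in 2016 by French entrepreneurs "
           "Clément Delangue, Julien Chaumond, and Thomas Wolf in New York City.")

# A few entries copied from the pipeline output (scores and indices omitted).
tokens = [
    {"entity": "B-ORG", "word": "Hu", "start": 0, "end": 2},
    {"entity": "I-ORG", "word": "##gging", "start": 2, "end": 7},
    {"entity": "I-ORG", "word": "Face", "start": 8, "end": 12},
    {"entity": "B-PER", "word": "C", "start": 57, "end": 58},
    {"entity": "B-PER", "word": "##lé", "start": 58, "end": 60},
    {"entity": "B-PER", "word": "##ment", "start": 60, "end": 64},
]


def span_text(text, run):
    """Return the slice of `text` covered by a run of consecutive tokens."""
    return text[run[0]["start"]:run[-1]["end"]]


# Slicing with the offsets gives each piece without the '##' marker.
print([example[t["start"]:t["end"]] for t in tokens])

# Slicing from the first start to the last end gives the whole span.
print(span_text(example, tokens[:3]))
print(span_text(example, tokens[3:]))
```

Slicing the input this way keeps the original spacing and accents, so no string cleanup of the `##` markers is needed.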

## Organize the Results into Respective Categories

In [11]:
organized_results = {'LOC': [], 'PER': [], 'ORG': [], 'MISC': []}

current_entity = None
current_words = []

for result in ner_results:
    # Labels look like 'B-ORG' or 'I-PER': a prefix, then the entity type
    entity_type = result['entity'].split('-')[1]
    if result['word'].startswith('##') and current_entity:
        # A word piece always continues the current word, even when the
        # model tags it 'B-' (as it does for '##lé' and '##ment' above)
        current_words.append(result['word'])
    elif result['entity'].startswith('B-'):
        # A new entity begins: store the previous one first
        if current_entity:
            organized_results[current_entity].append(' '.join(current_words))
        current_entity = entity_type
        current_words = [result['word']]
    elif result['entity'].startswith('I-') and current_entity == entity_type:
        current_words.append(result['word'])

# Handle the last entity
if current_entity:
    organized_results[current_entity].append(' '.join(current_words))

# Glue word pieces back together: 'Hu ##gging' -> 'Hugging'
for key, value in organized_results.items():
    organized_results[key] = [word.replace(' ##', '') for word in value]

print(organized_results)


{'LOC': ['New York City'], 'PER': ['Clément Delangue', 'Julien Chaumond', 'Thomas Wolf'], 'ORG': ['Hugging Face'], 'MISC': ['French']}
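
Instead of grouping word pieces by hand, the pipeline can be asked to do the grouping itself through its `aggregation_strategy` argument. Each result then carries an `entity_group` key (the type without the `B-`/`I-` prefix) and a merged `word`. The sketch below is one way to use it. The helpers `build_grouped_pipeline` and `group_by_type` are hypothetical names, and the sample list is hand-written in the aggregated format for illustration, not real model output.

```python
from collections import defaultdict


def build_grouped_pipeline(model_name="dslim/bert-base-NER", strategy="simple"):
    """Create an NER pipeline that merges word pieces into entity groups."""
    # Imported lazily so group_by_type can be used without transformers installed.
    from transformers import pipeline
    return pipeline("ner", model=model_name, aggregation_strategy=strategy)


def group_by_type(entities):
    """Collect aggregated entities into {entity_type: [text, ...]}."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Usage (downloads the model on first run):
# ner_grouped = build_grouped_pipeline()
# print(group_by_type(ner_grouped(example)))

# Hand-written sample in the aggregated format; score, start and end are omitted.
sample = [
    {"entity_group": "ORG", "word": "Hugging Face"},
    {"entity_group": "LOC", "word": "New York City"},
]
print(group_by_type(sample))
```

With `"simple"`, a `B-` tag on a word piece can still split a name. The `"first"` strategy instead gives every word the tag of its first piece, which may suit names like the ones in this example better.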
