# BERT Named Entity Recognition

## Authors:
- Giovanni Nocerino
- Melita Freiberga
- Niccolo' Pagano

## Requirements:
- transformers
- torch

## The notebook will:
1. Define a set of example sentences containing ambiguous entity names (e.g. Tesla, Amazon, Paris)
2. Perform Named Entity Recognition (NER) on them with several pretrained BERT-family models
3. Print the entities, labels, and confidence scores that each model returns

## Returns:
None. The notebook prints the named entities that each model recognizes in the example sentences.


## Example sentences:

"Tesla was a Serbian-American inventor best known for his contributions to alternating current technology." "Meanwhile, Tesla, founded by the famous enterpreneur Elon Musk, is a leading manufacturer of electric vehicles."

"The UEFA Champions League final between Real Madrid and Manchester City took place in Istanbul, Turkey, in 2024." "Madrid is the capital of Spain and Manchester is an industrial city in the UK."

"The Amazon rainforest, spanning Brazil, Peru, and Colombia, is critical for global biodiversity, while the website Amazon, founded by Jeff Bezos is a popular online shopping service."

"Paris Hilton shared her travel vlog about Paris, including her favorite spots near the Eiffel Tower and Champs-Élysées."

## **BERT-base** fine-tuned specifically for NER on the CoNLL-2003 dataset

In [1]:
# Input text: one string per example.
# Adjacent string literals are concatenated, so the first sentence of a pair
# ends with a space to keep the two sentences separated.
texts = [
    "Tesla was a Serbian-American inventor best known for his contributions to alternating current technology. "
    "Meanwhile, Tesla, founded by the famous enterpreneur Elon Musk, is a leading manufacturer of electric vehicles.",
    "The UEFA Champions League final between Real Madrid and Manchester City took place in Istanbul, Turkey, in 2024. "
    "Madrid is the capital of Spain and Manchester is an industrial city in the UK.",
    "The Amazon rainforest, spanning Brazil, Peru, and Colombia, is critical for global biodiversity, while the website Amazon, founded by Jeff Bezos is a popular online shopping service.",
    "Paris Hilton shared her travel vlog about Paris, including her favorite spots near the Eiffel Tower and Champs-Élysées.",
]

In [4]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

# Load pre-trained model and tokenizer
model_name = "dslim/bert-base-NER"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Create a pipeline for NER
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


cuda


In [6]:
for i, text in enumerate(texts):
    # Perform NER
    results = ner_pipeline(text)

    # Display results
    print(f"Named Entities for example {i+1}:")
    for entity in results:
        print(f" - {entity['word']} ({entity['entity_group']}): {entity['score']:.2f}")
    print()

Named Entities for example 1:
 - Tesla (PER): 0.84
 - Serbian (MISC): 1.00
 - American (MISC): 0.71
 - Tesla (ORG): 0.99
 - Elon Musk (ORG): 0.99

Named Entities for example 2:
 - UEFA Champions League (MISC): 1.00
 - Real Madrid (ORG): 1.00
 - Manchester City (ORG): 0.86
 - Istanbul (LOC): 1.00
 - Turkey (LOC): 1.00
 - Madrid (LOC): 1.00
 - Spain (LOC): 1.00
 - Manchester (LOC): 1.00
 - UK (LOC): 1.00

Named Entities for example 3:
 - Amazon (LOC): 1.00
 - Brazil (LOC): 1.00
 - Peru (LOC): 1.00
 - Colombia (LOC): 1.00
 - Amazon (ORG): 0.98
 - Jeff Bezos (PER): 0.86

Named Entities for example 4:
 - Paris Hilton (PER): 0.98
 - Paris (LOC): 1.00
 - E (LOC): 0.99
 - ##iff (LOC): 0.77
 - ##el Tower (LOC): 0.95
 - Champs - Élysées (LOC): 0.91
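
The fragments `E`, `##iff`, and `##el Tower` in example 4 are WordPiece sub-tokens: BERT splits a word that is not in its vocabulary into smaller pieces and marks every non-initial piece with `##`. The sketch below illustrates the greedy longest-match-first idea behind that split. The function `wordpiece` and the tiny `toy_vocab` are hypothetical, chosen only so that `Eiffel` splits the way it does in the output above; the real tokenizer works with the model's full vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece-style pieces by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration only).
toy_vocab = {"E", "##iff", "##el", "Tower", "Paris"}

for word in ["Paris", "Eiffel", "Tower"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

A word that is in the vocabulary stays whole, which is why `Paris` and `Tower` are returned intact while `Eiffel` is broken into three pieces.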



## **BERT-large-cased** fine-tuned specifically for NER on the CoNLL-2003 dataset


* **Larger version of BERT**, with:
    * 24 transformer layers (vs. 12 in BERT-base).
    * 1024 hidden dimensions (vs. 768 in BERT-base).
    * 16 attention heads per layer (vs. 12 in BERT-base).
* **Preserves capitalization** during tokenization and training.
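
To give a feel for what these sizes mean, the following sketch estimates the parameter count of the two encoders from the dimensions listed above. It is a back-of-the-envelope calculation, not a call into `transformers`: the function `bert_param_count` is hypothetical, and it assumes the standard BERT layout (feed-forward width of four times the hidden size, 512 positions, 2 segment types) and a 28,996-entry cased vocabulary. The number of attention heads does not appear in the formula because the heads split the hidden dimension rather than add parameters.

```python
def bert_param_count(layers, hidden, vocab=28_996, positions=512, segments=2):
    """Rough parameter count of a BERT encoder (no pooler, no task head)."""
    # Token, position and segment embeddings plus one LayerNorm (scale + bias).
    embeddings = (vocab + positions + segments) * hidden + 2 * hidden
    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> 4*hidden -> hidden, with biases.
    feed_forward = (hidden * 4 * hidden + 4 * hidden) + (4 * hidden * hidden + hidden)
    # Two LayerNorms per layer (scale + bias each).
    layer_norms = 2 * (2 * hidden)
    return embeddings + layers * (attention + feed_forward + layer_norms)


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base : ~{base / 1e6:.0f}M parameters")
print(f"BERT-large: ~{large / 1e6:.0f}M parameters")
print(f"ratio     : {large / base:.1f}x")
```

Under these assumptions the large encoder comes out at roughly three times the size of the base one, which is worth keeping in mind for download time and GPU memory.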




In [8]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

# Load pre-trained model and tokenizer
model_name = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Create a pipeline for NER
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)


cuda


In [9]:
for i, text in enumerate(texts):
    # Perform NER
    results = ner_pipeline(text)

    # Display results
    print(f"Named Entities for example {i+1}:")
    for entity in results:
        print(f" - {entity['word']} ({entity['entity_group']}): {entity['score']:.2f}")
    print()

Named Entities for example 1:
 - Tesla (PER): 1.00
 - Serbian - American (MISC): 0.88
 - Tesla (ORG): 0.99
 - Elon Musk (PER): 1.00

Named Entities for example 2:
 - UEFA Champions League (MISC): 0.99
 - Real Madrid (ORG): 1.00
 - Manchester City (ORG): 1.00
 - Istanbul (LOC): 1.00
 - Turkey (LOC): 1.00
 - Madrid (LOC): 1.00
 - Spain (LOC): 1.00
 - Manchester (LOC): 1.00
 - UK (LOC): 1.00

Named Entities for example 3:
 - Amazon (LOC): 1.00
 - Brazil (LOC): 1.00
 - Peru (LOC): 1.00
 - Colombia (LOC): 1.00
 - Amazon (ORG): 0.95
 - Jeff Bezos (PER): 0.95

Named Entities for example 4:
 - Paris Hilton (PER): 1.00
 - Paris (LOC): 1.00
 - Eiffel Tower (LOC): 0.92
 - Champs - Ély (LOC): 0.88
 - ##s (MISC): 0.45
 - ##ées (LOC): 0.98
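
In example 4 the pipeline returns `Champs - Ély`, `##s`, and `##ées` as three separate entities because the middle piece was labeled `MISC` while its neighbors were labeled `LOC`. Roughly speaking, `aggregation_strategy="simple"` keeps merging consecutive tokens only while the entity type stays the same and no new `B-` tag appears, and it reports the mean score of each group. The sketch below imitates that behavior on hand-written token predictions. It is a simplified approximation rather than the library's code, and the function `group_simple`, the per-token tags, and the per-token scores are illustrative assumptions.

```python
def group_simple(token_predictions):
    """Group consecutive tokens that share an entity type (simplified)."""
    groups = []
    for token, tag, score in token_predictions:
        if tag == "O":
            continue  # tokens outside any entity are dropped
        prefix, entity_type = tag.split("-", 1)
        same_entity = (
            bool(groups)
            and groups[-1]["entity_group"] == entity_type
            and prefix != "B"  # a B- tag always opens a new entity
        )
        if same_entity:
            groups[-1]["tokens"].append(token)
            groups[-1]["scores"].append(score)
        else:
            groups.append({"entity_group": entity_type, "tokens": [token], "scores": [score]})
    return [
        {
            "entity_group": g["entity_group"],
            "tokens": g["tokens"],
            "score": sum(g["scores"]) / len(g["scores"]),  # mean over the group
        }
        for g in groups
    ]


# Hypothetical per-token predictions for "Champs-Élysées." (tags and scores are made up).
token_predictions = [
    ("Champs", "B-LOC", 0.90),
    ("-", "I-LOC", 0.90),
    ("É", "I-LOC", 0.90),
    ("##ly", "I-LOC", 0.80),
    ("##s", "I-MISC", 0.45),
    ("##ées", "I-LOC", 0.98),
    (".", "O", 0.99),
]

for group in group_simple(token_predictions):
    print(group["tokens"], group["entity_group"], f"{group['score']:.2f}")
```

A single mislabeled sub-token in the middle of a word is enough to break the word into three groups, which is the situation the post-processing step in the next section repairs.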



### With combined subwords:

In [10]:
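def combine_subwords(results):
    """Merge WordPiece continuation pieces ("##...") into the preceding entity.

    A merged entity keeps the label of its first piece and the mean score of
    all of its pieces. A piece is merged only when it directly follows the
    previous entity in the text, so a stray "##" piece is never glued onto an
    unrelated entity (or onto nothing, if it comes first).
    """
    merged = []
    for entity in results:
        word = entity["word"]
        is_piece = word.startswith("##")
        prev = merged[-1] if merged else None

        if is_piece and prev is not None and prev["end"] == entity.get("start"):
            # Continuation of the previous entity: extend word, score list and span
            prev["word"] += word[2:]
            prev["scores"].append(entity["score"])
            prev["end"] = entity.get("end")
        else:
            # Start a new entity (dropping a leading "##" if there is one)
            merged.append({
                "word": word[2:] if is_piece else word,
                "entity_group": entity["entity_group"],
                "scores": [entity["score"]],
                "end": entity.get("end"),
            })

    return [
        {
            "word": m["word"],
            "entity_group": m["entity_group"],
            "score": sum(m["scores"]) / len(m["scores"]),  # Average score
        }
        for m in merged
    ]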

for i, text in enumerate(texts):
    # Perform NER
    results = ner_pipeline(text)

    # Combine subwords
    combined_results = combine_subwords(results)

    # Display results
    print(f"Named Entities for example {i+1}:")
    for entity in combined_results:
        print(f" - {entity['word']} ({entity['entity_group']}): {entity['score']:.2f}")
    print()

Named Entities for example 1:
 - Tesla (PER): 1.00
 - Serbian - American (MISC): 0.88
 - Tesla (ORG): 0.99
 - Elon Musk (PER): 1.00

Named Entities for example 2:
 - UEFA Champions League (MISC): 0.99
 - Real Madrid (ORG): 1.00
 - Manchester City (ORG): 1.00
 - Istanbul (LOC): 1.00
 - Turkey (LOC): 1.00
 - Madrid (LOC): 1.00
 - Spain (LOC): 1.00
 - Manchester (LOC): 1.00
 - UK (LOC): 1.00

Named Entities for example 3:
 - Amazon (LOC): 1.00
 - Brazil (LOC): 1.00
 - Peru (LOC): 1.00
 - Colombia (LOC): 1.00
 - Amazon (ORG): 0.95
 - Jeff Bezos (PER): 0.95

Named Entities for example 4:
 - Paris Hilton (PER): 1.00
 - Paris (LOC): 1.00
 - Eiffel Tower (LOC): 0.92
 - Champs - Élysées (LOC): 0.77
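
Joining the `##` pieces restores the word, but the printed form still has spaces around the hyphen (`Champs - Élysées`), because it is rebuilt from tokens rather than taken from the sentence. When the pipeline results carry character offsets (`start` and `end`; this is an assumption here, since the notebook never prints them), an alternative is to merge entities that touch in the text and slice the surface form out of the original sentence. The sketch below shows the idea on hand-built fragments whose labels and scores are copied from the example 4 output; `locate` and `merge_touching` are hypothetical helpers.

```python
text = ("Paris Hilton shared her travel vlog about Paris, including her "
        "favorite spots near the Eiffel Tower and Champs-Élysées.")


def locate(fragments, text):
    """Attach character offsets to (surface, label, score) fragments, left to right."""
    entities, cursor = [], 0
    for surface, label, score in fragments:
        start = text.index(surface, cursor)
        cursor = start + len(surface)
        entities.append({"entity_group": label, "score": score,
                         "start": start, "end": cursor})
    return entities


def merge_touching(entities, text):
    """Merge entities whose spans touch; keep the first label and the mean score."""
    merged = []
    for ent in entities:
        if merged and merged[-1]["end"] == ent["start"]:
            merged[-1]["end"] = ent["end"]
            merged[-1]["scores"].append(ent["score"])
        else:
            merged.append({"entity_group": ent["entity_group"],
                           "start": ent["start"], "end": ent["end"],
                           "scores": [ent["score"]]})
    return [{"word": text[m["start"]:m["end"]],  # surface form straight from the text
             "entity_group": m["entity_group"],
             "score": sum(m["scores"]) / len(m["scores"])} for m in merged]


# Fragments as printed for example 4 before combining (offsets are reconstructed here).
fragments = [("Eiffel Tower", "LOC", 0.92), ("Champs-Ély", "LOC", 0.88),
             ("s", "MISC", 0.45), ("ées", "LOC", 0.98)]

for ent in merge_touching(locate(fragments, text), text):
    print(f" - {ent['word']} ({ent['entity_group']}): {ent['score']:.2f}")
```

Slicing from the source text keeps the original spelling and punctuation, at the cost of relying on offsets being available.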



## Multilingual model XLM-RoBERTa

In [12]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

# Load pre-trained model and tokenizer
model_name = "Davlan/xlm-roberta-large-ner-hrl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Create a pipeline for NER
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)

cuda


In [13]:
for i, text in enumerate(texts):
    # Perform NER
    results = ner_pipeline(text)

    # Display results
    print(f"Named Entities for example {i+1}:")
    for entity in results:
        print(f" - {entity['word']} ({entity['entity_group']}): {entity['score']:.2f}")
    print()

Named Entities for example 1:
 - Tesla (ORG): 1.00
 - Tesla (ORG): 1.00
 - Elon Musk (PER): 1.00

Named Entities for example 2:
 - Real Madrid (ORG): 1.00
 - Manchester City (ORG): 1.00
 - Istanbul (LOC): 1.00
 - Turkey (LOC): 1.00
 - Madrid (LOC): 0.99
 - Spain (LOC): 1.00
 - Manchester (LOC): 1.00
 - UK (LOC): 1.00

Named Entities for example 3:
 - Amazon (LOC): 1.00
 - Brazil (LOC): 1.00
 - Peru (LOC): 1.00
 - Colombia (LOC): 1.00
 - Amazon (ORG): 1.00
 - Jeff Bezos (PER): 1.00

Named Entities for example 4:
 - Paris Hilton (PER): 1.00
 - Paris (LOC): 1.00
 - Eiffel Tower (LOC): 1.00
 - Champs-Élysées (LOC): 1.00
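
The three runs can also be compared side by side. The sketch below copies the labels printed for example 1 by hand into a dictionary and lists the mentions on which the models disagree. The dictionary layout and the helper `disagreements` are hypothetical; only the labels come from the outputs above. Because `Tesla` occurs twice in the example, each mention is keyed by its surface form and its occurrence index.

```python
# Labels for example 1, copied from the printed outputs of the three models.
predictions = {
    "dslim/bert-base-NER": {
        ("Tesla", 1): "PER", ("Tesla", 2): "ORG", ("Elon Musk", 1): "ORG",
    },
    "dbmdz/bert-large-cased-finetuned-conll03-english": {
        ("Tesla", 1): "PER", ("Tesla", 2): "ORG", ("Elon Musk", 1): "PER",
    },
    "Davlan/xlm-roberta-large-ner-hrl": {
        ("Tesla", 1): "ORG", ("Tesla", 2): "ORG", ("Elon Musk", 1): "PER",
    },
}


def disagreements(predictions):
    """Return {mention: {model: label}} for mentions that do not get one common label."""
    mentions = set()
    for labels in predictions.values():
        mentions.update(labels)
    result = {}
    for mention in sorted(mentions):
        per_model = {model: labels.get(mention) for model, labels in predictions.items()}
        if len(set(per_model.values())) > 1:
            result[mention] = per_model
    return result


for (surface, occurrence), per_model in disagreements(predictions).items():
    print(f"{surface} (occurrence {occurrence}):")
    for model, label in per_model.items():
        print(f"   {model}: {label}")
```

For this example the models agree that the second `Tesla` is an organization and differ on the first `Tesla` (the inventor) and on `Elon Musk`.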

