# Lab4a Named-entity-recognition using fine-tuned transformers

Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Before reading this notebook, make sure you have consulted **Lab3.4 SentimentClassification using transformer models**, which contains some disclaimers and tips and explains the sentence representations obtained from the transformer models.

In this notebook we will use the simpletransformers package, which provides a simple API on top of the transformers package.

In [1]:
#Requires installing transformers, pytorch and simpletransformers
#!conda install pytorch cpuonly -c pytorch
#!pip install transformers
#!pip install simpletransformers

We load the transformer model 'bert-base-NER' from the Hugging Face repository, which is fine-tuned for named-entity recognition:

https://huggingface.co/models

We need to load both the model, which assigns a label to each token, and the tokenizer, which converts the sentences into tokens according to the vocabulary of the model.

Loading the model takes some time and requires sufficient memory.

In [1]:
from simpletransformers.ner import NERModel
#sentences = ["Example sentence 1", "Example sentence 2"]
englishmodel = NERModel(
        model_type="bert",
        model_name="dslim/bert-base-NER",
        use_cuda=False
)

Downloading (…)lve/main/config.json: 100%|██████████| 829/829 [00:00<00:00, 818kB/s]
Downloading (…)"pytorch_model.bin";: 100%|██████████| 433M/433M [01:06<00:00, 6.48MB/s] 
Downloading (…)solve/main/vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 717kB/s]
Downloading (…)in/added_tokens.json: 100%|██████████| 2.00/2.00 [00:00<00:00, 1.09kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 52.3kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 59.0/59.0 [00:00<00:00, 39.0kB/s]


The cell above creates an instance of NERModel, which can be used for training, evaluation, and prediction in named-entity recognition (NER) tasks. The full parameter list for a NERModel object:

* model_type: The type of model (e.g. bert, roberta)
* model_name: Default Transformer model name or path to a directory containing the Transformer model file (pytorch_model.bin).
* labels (optional): A list of all Named Entity labels. If not given, [“O”, “B-MISC”, “I-MISC”, “B-PER”, “I-PER”, “B-ORG”, “I-ORG”, “B-LOC”, “I-LOC”] will be used.
* args (optional): Default args will be used if this parameter is not provided. If provided, it should be a dict containing the args that should be changed in the default args.
* use_cuda (optional): Use GPU if available. Setting to False will force model to use CPU only.
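
The parameters above can be combined as in the following minimal sketch. It spells out the default label list and overrides two settings through `args`. The helper `load_model` and the chosen values (`max_seq_length=128`, `silent=True`) are illustrative assumptions and not part of the original lab; the import is deferred so that the configuration can be inspected without loading the model.

```python
# Label set copied from the default list described above (BIO tags).
labels = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
          "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Only the settings that should differ from the defaults are listed in args.
# The two values below are illustrative assumptions, not recommendations.
model_kwargs = dict(
    model_type="bert",
    model_name="dslim/bert-base-NER",
    labels=labels,
    args={"max_seq_length": 128, "silent": True},
    use_cuda=False,  # force CPU, as in the cells of this notebook
)


def load_model(kwargs):
    """Hypothetical helper: build a NERModel from a dict of keyword arguments."""
    # Deferred import: simpletransformers is only needed when the model is built.
    from simpletransformers.ner import NERModel
    return NERModel(**kwargs)


# Uncomment to build the model (downloads the weights on first use):
# englishmodel = load_model(model_kwargs)
```

Keeping the configuration in a plain dict makes it easy to reuse the same settings for a second model by changing only `model_name`.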

In [3]:
predictions, raw_outputs = englishmodel.predict(["Apple sued Samsung for patents last year."])

100%|██████████| 1/1 [00:00<00:00, 125.50it/s]
Running Prediction: 100%|██████████| 1/1 [00:00<00:00,  9.88it/s]


In [4]:
predictions

[[{'Apple': 'B-ORG'},
  {'sued': 'O'},
  {'Samsung': 'B-ORG'},
  {'for': 'O'},
  {'patents': 'O'},
  {'last': 'O'},
  {'year.': 'O'}]]
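
The prediction is a list of sentences, each a list of single-entry dictionaries that map a token to its BIO tag. To work with whole entities rather than tokens, the tags can be grouped into spans. The function `extract_entities` below is a hypothetical helper written for this illustration, not part of simpletransformers; it uses the output shown above as its input.

```python
def extract_entities(sentence_prediction):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for token_dict in sentence_prediction:
        # Each dictionary holds exactly one token and its tag.
        (word, tag), = token_dict.items()
        if tag.startswith("I-") and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        elif tag.startswith(("B-", "I-")):
            # A new entity starts (an I- tag without a matching open
            # entity is treated as a start as well).
            flush()
            current_words, current_type = [word], tag[2:]
        else:
            # An "O" tag closes any open entity.
            flush()
            current_words, current_type = [], None
    flush()
    return entities


# The prediction shown above for the English example sentence.
example_predictions = [[{'Apple': 'B-ORG'}, {'sued': 'O'}, {'Samsung': 'B-ORG'},
                        {'for': 'O'}, {'patents': 'O'}, {'last': 'O'},
                        {'year.': 'O'}]]

entities = extract_entities(example_predictions[0])
print(entities)  # [('Apple', 'ORG'), ('Samsung', 'ORG')]
```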

In [5]:
dutchmodel = NERModel(
        model_type="bert",
        model_name="Matthijsvanhof/bert-base-dutch-cased-finetuned-NER",
        use_cuda=False
)

Downloading (…)lve/main/config.json: 100%|██████████| 1.07k/1.07k [00:00<00:00, 756kB/s]
Downloading (…)"pytorch_model.bin";: 100%|██████████| 434M/434M [02:07<00:00, 3.41MB/s] 
Downloading (…)solve/main/vocab.txt: 100%|██████████| 241k/241k [00:00<00:00, 260kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 194kB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 546/546 [00:00<00:00, 200kB/s]


In [6]:
predictions, raw_outputs = dutchmodel.predict(["Apple sleept Samsung voor de rechter vanwege schending van patenten."])

100%|██████████| 1/1 [00:00<00:00, 106.80it/s]
Running Prediction: 100%|██████████| 1/1 [00:00<00:00,  9.85it/s]


In [7]:
predictions

[[{'Apple': 'O'},
  {'sleept': 'O'},
  {'Samsung': 'B-MISC'},
  {'voor': 'O'},
  {'de': 'O'},
  {'rechter': 'O'},
  {'vanwege': 'O'},
  {'schending': 'O'},
  {'van': 'O'},
  {'patenten.': 'O'}]]
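
In both outputs above the last token keeps its full stop ('year.', 'patenten.'), because the input string is split on spaces. One way to avoid this is to tokenize the sentence beforehand. The sketch below uses a rough regular-expression tokenizer, `simple_tokenize`, which is a hypothetical helper for illustration only. Passing the token list with `split_on_space=False` assumes the `predict` signature of the installed simpletransformers version, so that call is left commented out.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (rough sketch)."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Apple sleept Samsung voor de rechter vanwege schending van patenten."
tokens = simple_tokenize(sentence)
print(tokens)

# Assumption: predict() accepts a list of token lists when split_on_space=False.
# predictions, raw_outputs = dutchmodel.predict([tokens], split_on_space=False)
```

A dedicated tokenizer would handle abbreviations and hyphenated words better than this regular expression.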

Another option for Dutch NER (https://huggingface.co/flair/ner-dutch-large):

In [8]:
#!pip install flair

In [14]:
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-dutch-large")

InvalidVersion: Invalid version: '2.22.1ubuntu1' (package: devscripts)

In [11]:
sentence = Sentence("Apple sleept Samsung voor de rechter vanwege schending van patenten.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

NameError: name 'tagger' is not defined

# End of this notebook