## Named entity recognition

Named entity recognition (NER) is a subtask of information extraction that locates elements in text and classifies them into predefined categories such as the names of persons, organizations, and locations. This tutorial shows how to use the EstNLTK NER tagger.

Load the tagger

In [1]:
from estnltk.taggers import NerTagger

nertagger = NerTagger()

Create a text object and add prerequisite layers

In [2]:
from estnltk import Text

In [3]:
text = Text('''Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.''')
# Add the layers that the NER tagger depends on
text = text.tag_layer(['sentences', 'morph_analysis'])

Add the NER layer to the text

In [4]:
nertagger.tag(text)

text
Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| sentences | | | words | False | 2 |
| tokens | | | | False | 17 |
| compound_tokens | type, normalized | | tokens | False | 1 |
| words | normalized_form | | | True | 15 |
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 15 |
| ner | nertag | | words | False | 5 |


The nertag attribute gives the category of each named entity: "LOC" (location), "PER" (person) or "ORG" (organization).

In [5]:
text.ner

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| ner | nertag | | words | False | 5 |

| text | nertag |
|---|---|
| ['Eesti', 'Vabariik'] | LOC |
| ['Põhja-Euroopas'] | LOC |
| ['Eesti'] | LOC |
| ['Soome', 'lahe'] | LOC |
| ['Soome', 'Vabariigiga'] | LOC |


For some use cases, it is more convenient to have an output layer with one tag per word, for example when visualizing the annotations or correcting them manually.

In [6]:
from estnltk.taggers import WordLevelNerTagger

word_level_ner = WordLevelNerTagger()

In [7]:
word_level_ner.tag(text)

text
Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| sentences | | | words | False | 2 |
| tokens | | | | False | 17 |
| compound_tokens | type, normalized | | tokens | False | 1 |
| words | normalized_form | | | True | 15 |
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 15 |
| ner | nertag | | words | False | 5 |
| wordner | nertag | | words | False | 15 |


Here, the tags are in IOB format: the B- prefix marks the first token of a named entity, the I- prefix marks a token inside a named entity, and O marks a token that is outside any named entity.
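To make the prefix scheme concrete, the sketch below groups word-level tags back into whole entities using plain Python. It does not call EstNLTK: `iob_to_entities` is a hypothetical helper written for this illustration, and the word and tag lists are the first ten rows of the wordner output in this tutorial, copied by hand. Treating a stray I- tag as the start of a new entity is an assumption of this sketch, not documented EstNLTK behaviour.

```python
def iob_to_entities(words, tags):
    """Group word-level IOB tags into (entity_words, category) pairs.

    Hypothetical helper for illustration only; it is not part of EstNLTK.
    """
    entities = []
    current_words, current_cat = [], None
    for word, tag in zip(words, tags):
        # 'B-LOC' -> ('B', '-', 'LOC'); a bare 'O' -> ('O', '', '')
        prefix, _, category = tag.partition('-')
        # Continue the open entity only for an I- tag of the same category.
        if prefix == 'I' and current_words and current_cat == category:
            current_words.append(word)
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_words:
            entities.append((current_words, current_cat))
        if prefix in ('B', 'I'):
            # B- starts a new entity; a stray I- is leniently treated the same way.
            current_words, current_cat = [word], category
        else:
            current_words, current_cat = [], None
    # Close an entity that runs to the end of the input.
    if current_words:
        entities.append((current_words, current_cat))
    return entities


# First ten words and tags, copied from the wordner layer shown in this tutorial.
words = ['Eesti', 'Vabariik', 'on', 'riik', 'Põhja-Euroopas', '.',
         'Eesti', 'piirneb', 'põhjas', 'üle']
tags = ['B-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'O',
        'B-LOC', 'O', 'O', 'O']

entities = iob_to_entities(words, tags)
for entity_words, category in entities:
    print(entity_words, category)
```

On these ten words the helper recovers three entities, `['Eesti', 'Vabariik']`, `['Põhja-Euroopas']` and `['Eesti']`, all tagged LOC, which matches the first three spans of the ner layer.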

In [8]:
text.wordner

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| wordner | nertag | | words | False | 15 |

| text | nertag |
|---|---|
| Eesti | B-LOC |
| Vabariik | I-LOC |
| on | O |
| riik | O |
| Põhja-Euroopas | B-LOC |
| . | O |
| Eesti | B-LOC |
| piirneb | O |
| põhjas | O |
| üle | O |
