## <span style="color:purple"> Information extraction: named entities </span>

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate elements in text and classify them into pre-defined categories, such as the names of persons, organizations and locations.

The Estonian NER model was created by [Tkachenko et al. (2013)](https://aclanthology.org/W13-2412/) and is available through the `NerTagger` and `WordLevelNerTagger` components in EstNLTK.

You can apply NER directly via the default resolver:

In [1]:
from estnltk import Text

text = Text('Eesti president on Alar Karis. Eesti Energia on Eesti riigile kuuluv energiaettevõte.')

text.tag_layer('ner')

text.ner

layer name,attributes,parent,enveloping,ambiguous,span count
ner,nertag,,words,False,4

text,nertag
['Eesti'],LOC
"['Alar', 'Karis']",PER
"['Eesti', 'Energia']",ORG
"['Eesti', 'riigile']",LOC


You can use `enclosing_text` to obtain exact strings corresponding to named entities:

In [2]:
# Get named entity strings
[named_entity.enclosing_text for named_entity in text.ner]

['Eesti', 'Alar Karis', 'Eesti Energia', 'Eesti riigile']
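
Once entity strings and their tags are available, a common next step is to group them by category. The following is a minimal sketch in plain Python, not part of EstNLTK: `group_by_category` is a hypothetical helper, and the input pairs are copied by hand from the output above. With EstNLTK loaded, the same pairs could presumably be collected as `(named_entity.enclosing_text, named_entity.nertag)` for each span in `text.ner`; attribute access via `.nertag` is an assumption here.

```python
from collections import defaultdict


def group_by_category(entities):
    """Group (entity_string, nertag) pairs into {nertag: [entity_string, ...]}.

    Hypothetical helper for illustration; not an EstNLTK function.
    """
    grouped = defaultdict(list)
    for entity_text, tag in entities:
        grouped[tag].append(entity_text)
    return dict(grouped)


# Pairs copied from the NER output shown above.
entities = [
    ('Eesti', 'LOC'),
    ('Alar Karis', 'PER'),
    ('Eesti Energia', 'ORG'),
    ('Eesti riigile', 'LOC'),
]

for tag, names in group_by_category(entities).items():
    print(tag, names)
```

Keeping the helper independent of the `Text` object means it works equally well on entities gathered from many documents.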

`NerTagger` does not lemmatize names itself. However, you can iterate over the words of each named entity and read their lemmas from the `morph_analysis` layer:

In [3]:
# Get lemmas of the named entities
for named_entity in text.ner:
    for word in named_entity:
        print(word.text, word.lemma)
    print()

Eesti ['Eesti']

Alar ['Alar']
Karis ['Karis']

Eesti ['Eesti']
Energia ['energia']

Eesti ['Eesti']
riigile ['riik']



Note that lemmas are provided as a list due to ambiguities in the morph analysis layer.
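
Because each word carries a list of lemmas, building a single lemma string per entity requires choosing one reading per word. The sketch below simply takes the first lemma, which is a simplifying assumption rather than a disambiguation strategy. `entity_lemma` is a hypothetical helper, and the input mirrors the output printed above as `(word, lemmas)` pairs.

```python
def entity_lemma(words):
    """Build one lemma string for an entity from (surface, lemmas) pairs.

    Takes the first lemma of each word and falls back to the surface
    form when the lemma list is empty. Hypothetical helper, not an
    EstNLTK function.
    """
    return ' '.join(lemmas[0] if lemmas else surface
                    for surface, lemmas in words)


# One inner list per named entity, copied from the output above.
entity_words = [
    [('Eesti', ['Eesti'])],
    [('Alar', ['Alar']), ('Karis', ['Karis'])],
    [('Eesti', ['Eesti']), ('Energia', ['energia'])],
    [('Eesti', ['Eesti']), ('riigile', ['riik'])],
]

for words in entity_words:
    print(entity_lemma(words))
```

Note that this naive joining lowercases parts of some names (`Eesti energia`), because the morphological analyser lemmatizes `Energia` as a common noun.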

## Usage as a tagger 

Next, we'll see how to use `NerTagger` as a standalone tagger.

Loading the tagger:

In [4]:
from estnltk.taggers import NerTagger

nertagger = NerTagger()

Create a `Text` object and add the prerequisite layers:

In [5]:
from estnltk import Text

In [6]:
text = Text('''Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.''')
text = text.tag_layer(['sentences', 'morph_analysis'])

Add the NER layer to the text:

In [7]:
nertagger.tag(text)

text
Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,17
compound_tokens,"type, normalized",,tokens,False,1
words,normalized_form,,,True,15
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,15
ner,nertag,,words,False,5


The `nertag` attribute shows the category of each named entity: `LOC` (location), `PER` (person) or `ORG` (organization).

In [8]:
text.ner

layer name,attributes,parent,enveloping,ambiguous,span count
ner,nertag,,words,False,5

text,nertag
"['Eesti', 'Vabariik']",LOC
['Põhja-Euroopas'],LOC
['Eesti'],LOC
"['Soome', 'lahe']",LOC
"['Soome', 'Vabariigiga']",LOC


For some use cases it might be better to have the output layer with a tag for each word. This can be used for visualizing or making manual changes to the layer.

In [9]:
from estnltk.taggers import WordLevelNerTagger

word_level_ner = WordLevelNerTagger()

In [10]:
word_level_ner.tag(text)

text
Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,17
compound_tokens,"type, normalized",,tokens,False,1
words,normalized_form,,,True,15
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,15
ner,nertag,,words,False,5
wordner,nertag,words,,False,15


Here, the tags are in IOB format: the `B-` prefix indicates that the token begins a named entity, the `I-` prefix indicates that the token is inside a named entity, and `O` indicates that the token is outside any named entity.
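
To make the IOB scheme concrete, the sketch below collapses a sequence of word-level tags back into entity phrases. It is plain Python and not an EstNLTK API: `iob_to_entities` is a hypothetical helper, and the sample words and tags are the first ten entries of the `wordner` layer for this text. A stray `I-` tag with no matching open entity is treated as the start of a new entity, which is a design choice of this sketch.

```python
def iob_to_entities(words, tags):
    """Collapse parallel lists of words and IOB tags into (phrase, label) pairs.

    Hypothetical helper for illustration; not an EstNLTK function.
    """
    entities = []
    current_words, current_label = [], None

    def flush():
        # Store the entity collected so far, if there is one.
        if current_words:
            entities.append((' '.join(current_words), current_label))

    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'I' and current_words and label == current_label:
            # Continue the currently open entity.
            current_words.append(word)
        elif prefix in ('B', 'I'):
            # 'B-' (or a stray 'I-') starts a new entity.
            flush()
            current_words, current_label = [word], label
        else:
            # 'O' closes any open entity.
            flush()
            current_words, current_label = [], None
    flush()
    return entities


words = ['Eesti', 'Vabariik', 'on', 'riik', 'Põhja-Euroopas', '.',
         'Eesti', 'piirneb', 'põhjas', 'üle']
tags = ['B-LOC', 'I-LOC', 'O', 'O', 'B-LOC', 'O',
        'B-LOC', 'O', 'O', 'O']

print(iob_to_entities(words, tags))
```

The result matches the first three spans of the enveloping `ner` layer, which illustrates that the two layers encode the same information in different shapes.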

In [11]:
text.wordner

layer name,attributes,parent,enveloping,ambiguous,span count
wordner,nertag,words,,False,15

text,nertag
Eesti,B-LOC
Vabariik,I-LOC
on,O
riik,O
Põhja-Euroopas,B-LOC
.,O
Eesti,B-LOC
piirneb,O
põhjas,O
üle,O
