## <span style="color:purple"> Information extraction: named entities </span>

Named Entity Recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, and locations.

EstNLTK has multiple NER models for Estonian. 
The traditional machine learning based model was created by [Tkachenko et al. (2013)](https://aclanthology.org/W13-2412/) and is available through the `NerTagger` and `WordLevelNerTagger` components in EstNLTK. This model annotates the categories `PER`, `LOC`, and `ORG`.

Neural BERT-based models were created by [Kairit Sirts (2023)](https://openreview.net/pdf?id=4CTnlIc1rhw) and are available through the `EstBERTNERTagger` component in `estnltk_neural`. There are two neural NER models: the first uses the basic `PER`, `LOC`, `ORG` categories, and the second uses an extended set of categories: `PER`, `LOC`, `ORG`, `GPE`, `MONEY`, `PERCENT`, `PROD`, `TITLE`, `DATE`, `TIME`, `EVENT`.

---

## The basic NER model (traditional ML)

The traditional machine learning based NER is distributed with EstNLTK's main package.

You can apply NER directly via the default resolver:

In [1]:
from estnltk import Text

text = Text('Eesti president on Alar Karis. Eesti Energia on Eesti riigile kuuluv energiaettevõte.')

text.tag_layer('ner')

text.ner

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| ner | nertag | | words | False | 4 |

| text | nertag |
|---|---|
| ['Eesti'] | LOC |
| ['Alar', 'Karis'] | PER |
| ['Eesti', 'Energia'] | ORG |
| ['Eesti', 'riigile'] | LOC |


You can use `enclosing_text` to obtain exact strings corresponding to named entities:

In [2]:
# Get named entity strings
[named_entity.enclosing_text for named_entity in text.ner]

['Eesti', 'Alar Karis', 'Eesti Energia', 'Eesti riigile']
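
Once entity strings and their categories are available, a common next step is to group the strings by category. The sketch below is plain Python and does not call EstNLTK: the pairs are copied by hand from the output above, and the helper `group_by_category` is a hypothetical name introduced only for illustration.

```python
from collections import defaultdict


def group_by_category(entities):
    """Group (entity_text, category) pairs into {category: [entity_text, ...]}."""
    grouped = defaultdict(list)
    for entity_text, category in entities:
        grouped[category].append(entity_text)
    return dict(grouped)


# Pairs copied from the `ner` layer output above. With EstNLTK installed they
# could presumably be built as
#   [(ne.enclosing_text, ne.nertag) for ne in text.ner]
# (assumption: the `nertag` attribute can be read directly from each span).
entities = [
    ('Eesti', 'LOC'),
    ('Alar Karis', 'PER'),
    ('Eesti Energia', 'ORG'),
    ('Eesti riigile', 'LOC'),
]

grouped = group_by_category(entities)
print(grouped)
```

Entities keep their order of appearance within each category, so the first `LOC` entry is the first location mentioned in the text.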

While `NerTagger` does not lemmatize names, you can iterate over the words of each named entity and get the lemmas of these words from the `morph_analysis` layer:

In [3]:
# Get lemmas of the named entities
for named_entity in text.ner:
    for word in named_entity:
        print(word.text, word.lemma)
    print()

Eesti ['Eesti']

Alar ['Alar']
Karis ['Karis']

Eesti ['Eesti']
Energia ['energia']

Eesti ['Eesti']
riigile ['riik']



Note that the lemmas of each word are given as a list, because the `morph_analysis` layer is ambiguous: a word can have more than one analysis.
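
If a single lemmatized string per entity is needed, the lemma lists have to be reduced somehow. The sketch below, in plain Python over data copied from the output above, simply takes the first lemma of each word; that choice is an assumption made for illustration, not a disambiguation strategy recommended by EstNLTK, and `lemmatize_entity` is a hypothetical helper.

```python
def lemmatize_entity(words, sep=' '):
    """Join the first lemma of every word of one entity into a single string.

    `words` is a list of (word_text, lemmas) pairs, where `lemmas` is the list
    of alternative lemmas of the word. Taking the first alternative is a
    simplification; the surface form is used if the list is empty.
    """
    return sep.join(lemmas[0] if lemmas else word_text
                    for word_text, lemmas in words)


# (word, lemmas) pairs copied from the printed output above, one list per entity
entities = [
    [('Eesti', ['Eesti'])],
    [('Alar', ['Alar']), ('Karis', ['Karis'])],
    [('Eesti', ['Eesti']), ('Energia', ['energia'])],
    [('Eesti', ['Eesti']), ('riigile', ['riik'])],
]

lemmatized = [lemmatize_entity(entity) for entity in entities]
print(lemmatized)
```

Note that word-by-word lemmatization does not restore the canonical form of a multiword name: the lemma of `Energia` is lowercased, so the organization name comes out as `Eesti energia`.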

## Usage as a tagger 

Next, we'll see how to use `NerTagger` as a separate tagger.

Loading the tagger:

In [4]:
from estnltk.taggers import NerTagger

nertagger = NerTagger()

Create a text object and add the prerequisite layers:

In [5]:
from estnltk import Text

In [6]:
text = Text('''Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.''')
text = text.tag_layer(['sentences', 'morph_analysis'])

Add the NER layer to the text:

In [7]:
nertagger.tag(text)

text
Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| sentences | | | words | False | 2 |
| tokens | | | | False | 17 |
| compound_tokens | type, normalized | | tokens | False | 1 |
| words | normalized_form | | | True | 15 |
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 15 |
| ner | nertag | | words | False | 5 |


The `nertag` attribute shows the category of each named entity: `LOC` (location), `PER` (person), or `ORG` (organization).

In [8]:
text.ner

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| ner | nertag | | words | False | 5 |

| text | nertag |
|---|---|
| ['Eesti', 'Vabariik'] | LOC |
| ['Põhja-Euroopas'] | LOC |
| ['Eesti'] | LOC |
| ['Soome', 'lahe'] | LOC |
| ['Soome', 'Vabariigiga'] | LOC |


For some use cases, it is more convenient to have an output layer with a tag for each word, for example when visualizing the annotations or correcting them manually.

In [9]:
from estnltk.taggers import WordLevelNerTagger

word_level_ner = WordLevelNerTagger()

In [10]:
word_level_ner.tag(text)

text
Eesti Vabariik on riik Põhja-Euroopas. Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| sentences | | | words | False | 2 |
| tokens | | | | False | 17 |
| compound_tokens | type, normalized | | tokens | False | 1 |
| words | normalized_form | | | True | 15 |
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 15 |
| ner | nertag | | words | False | 5 |
| wordner | nertag | words | | False | 15 |


Here, the tags are in IOB format: the `B-` prefix marks the first token of a named entity, the `I-` prefix marks a token inside a named entity, and `O` marks a token that is outside any named entity.

In [11]:
text.wordner

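| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| wordner | nertag | words | | False | 15 |

| text | nertag |
|---|---|
| Eesti | B-LOC |
| Vabariik | I-LOC |
| on | O |
| riik | O |
| Põhja-Euroopas | B-LOC |
| . | O |
| Eesti | B-LOC |
| piirneb | O |
| põhjas | O |
| üle | O |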
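
To make the IOB scheme concrete, the sketch below folds word-level tags back into entity spans. It is plain Python over the ten rows shown above and is not how EstNLTK itself converts between the two layers; `iob_to_spans` is a hypothetical helper, and treating a stray `I-` tag as the start of a new entity is an assumption of this sketch.

```python
def iob_to_spans(tagged_words):
    """Convert (word, IOB tag) pairs into a list of (words, category) entities."""
    spans = []
    current_words, current_category = [], None
    for word, tag in tagged_words:
        # 'B-LOC' -> ('B', '-', 'LOC'); 'O' -> ('O', '', '')
        prefix, _, category = tag.partition('-')
        if prefix == 'I' and category == current_category:
            # Continuation of the entity that is currently open.
            current_words.append(word)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_words:
            spans.append((current_words, current_category))
        if prefix in ('B', 'I'):
            # 'B-' starts a new entity; a stray 'I-' is treated the same way.
            current_words, current_category = [word], category
        else:
            current_words, current_category = [], None
    if current_words:
        spans.append((current_words, current_category))
    return spans


# The rows of the `wordner` layer shown above
tagged_words = [
    ('Eesti', 'B-LOC'), ('Vabariik', 'I-LOC'), ('on', 'O'), ('riik', 'O'),
    ('Põhja-Euroopas', 'B-LOC'), ('.', 'O'), ('Eesti', 'B-LOC'),
    ('piirneb', 'O'), ('põhjas', 'O'), ('üle', 'O'),
]

spans = iob_to_spans(tagged_words)
for words, category in spans:
    print(words, category)
```

The result matches the first three rows of the `ner` layer for the same text, which illustrates that the two layers carry the same information in different shapes.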


---

## Neural NER models

### Local installation

*In order to use the neural NER models, you need to install [estnltk_neural](https://github.com/estnltk/estnltk/tree/main/estnltk_neural) (v1.7.2+).*

Before applying `EstBERTNERTagger`, you'll need to get the models. Models are not distributed with EstNLTK due to their large size, and need to be downloaded separately:

* If you create a new instance of `EstBERTNERTagger` and the models are missing, you'll be asked for permission to download the "estbertner" model;
* Alternatively, you can pre-download the models manually via the `download` function:

```python
from estnltk import download
download("estbertner")
download("estbertner_v2")
```

### Model estbertner

The model [estbertner_v1](https://huggingface.co/tartuNLP/EstBERT_NER) is the default model of `EstBERTNERTagger`. 
It tags named entities of `PER`, `LOC`, `ORG` categories:

In [12]:
from estnltk_neural.taggers import EstBERTNERTagger
neural_ner = EstBERTNERTagger()
neural_ner

| name | output layers | output mapping | input layers |
|---|---|---|---|
| EstBERTNERTagger | estbertner | {'estbertner': ['nertag']} | ('words',) |

| parameter | value |
|---|---|
| model_location | `C:\Programmid\Miniconda3\envs\py38_estnltk_neural\lib\site-packages\estnltk-1.7. ..., type: <class 'str'>, length: 169` |
| nlp | `<transformers.pipelines.token_classification.TokenClassificationPipeline object ..., type: <class 'transformers.pipelines.token_classification.TokenClassificationPipeline'>` |
| tokenizer | `BertTokenizer(name_or_path='C:\Programmid\Miniconda3\envs\py38_estnltk_neural\li ..., type: <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>, length: 50004` |
| custom_words_layer | words |
| batch_size | 1750 |
| postfix_expand_suffixes | False |
| postfix_concat_same_type_entities | False |
| postfix_remove_infix_matches | False |


In [13]:
from estnltk import Text

text = Text('Eesti president on Alar Karis. Eesti Energia on Eesti riigile kuuluv energiaettevõte.')
# NER requires words layer
text.tag_layer('words')
# Add NER layer
neural_ner.tag(text)

text.estbertner

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| estbertner | nertag | | words | False | 4 |

| text | nertag |
|---|---|
| ['Eesti'] | LOC |
| ['Alar', 'Karis'] | PER |
| ['Eesti', 'Energia'] | ORG |
| ['Eesti', 'riigile'] | LOC |


### Model estbertner_v2

The model [estbertner_v2](https://huggingface.co/tartuNLP/EstBERT_NER_v2) tags named entities of `PER`, `LOC`, `ORG`, `GPE`, `MONEY`, `PERCENT`, `PROD`, `TITLE`, `DATE`, `TIME`, `EVENT` categories. 
To use this model, first download it via EstNLTK's downloader, then pass its location to `EstBERTNERTagger` via the `model_location` parameter:

In [14]:
# Get the model
from estnltk import download, get_resource_paths
download("estbertner_v2")

Resource 'estbertner_v2_from_tartunlp_hf_2022-12-12' has already been downloaded.


True

In [15]:
# Initialize tagger with the model
model_location = get_resource_paths("estbertner_v2", only_latest=True)
neural_ner2 = EstBERTNERTagger(model_location=model_location, output_layer='estbertner2')

In [16]:
from estnltk import Text

text = Text("Kaia Kanepi (WTA 57.) langes USA-s Charlestonis toimuval WTA 500 kategooria tenniseturniiril konkurentsist "
            "kaheksandikfinaalis, kaotades poolatarile Magda Linette'ile (WTA 64.) 3 : 6, 6 : 4, 2 : 6.")
# NER requires words layer
text.tag_layer('words')
# Add NER layer
neural_ner2.tag(text)

text.estbertner2

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| estbertner2 | nertag | | words | False | 8 |

| text | nertag |
|---|---|
| ['Kaia', 'Kanepi'] | PER |
| ['WTA'] | ORG |
| ['USA-s'] | GPE |
| ['Charlestonis'] | GPE |
| ['WTA', '500', 'kategooria'] | EVENT |
| ['poolatarile'] | TITLE |
| ['Magda', "Linette'ile"] | PER |
| ['WTA'] | ORG |
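
With the extended tag set, it is often useful to keep only the categories of interest or to see how many entities of each category were found. The sketch below is plain Python over the output above: the entity strings were written out by hand by joining the words of each span with spaces, and `keep_categories` is a hypothetical helper, not part of EstNLTK.

```python
from collections import Counter


def keep_categories(entities, wanted):
    """Return only the (entity_text, category) pairs whose category is in `wanted`."""
    wanted = set(wanted)
    return [(text, category) for text, category in entities if category in wanted]


# (entity_text, category) pairs transcribed from the `estbertner2` layer above
entities = [
    ('Kaia Kanepi', 'PER'),
    ('WTA', 'ORG'),
    ('USA-s', 'GPE'),
    ('Charlestonis', 'GPE'),
    ('WTA 500 kategooria', 'EVENT'),
    ('poolatarile', 'TITLE'),
    ("Magda Linette'ile", 'PER'),
    ('WTA', 'ORG'),
]

# How many entities of each category were found
category_counts = Counter(category for _, category in entities)
print(category_counts)

# Keep only persons and geopolitical entities
people_and_places = keep_categories(entities, ['PER', 'GPE'])
print(people_and_places)
```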
