Named entity recognition
========================

Named-entity recognition (NER), also known as entity identification,
entity chunking and entity extraction, is a subtask of information
extraction that seeks to locate and classify elements in text into
pre-defined categories such as the names of persons, organizations
and locations.

In this tutorial you will learn how to use estnltk's out-of-the-box NER
utilities and how to build your own NER models from scratch.

Getting started with NER
------------------------

The estnltk package comes with pre-trained NER models for Python 2.7
and Python 3.4. The models distinguish three types of entities: person
names, organizations and locations.

The quick example below demonstrates how to extract named entities from
raw text:

In [1]:
from estnltk import Text
from pprint import pprint

text = Text('''Eesti Vabariik on riik Põhja-Euroopas. 
    Eesti piirneb põhjas üle Soome lahe Soome Vabariigiga.
    Riigikogu on Eesti Vabariigi parlament. Riigikogule kuulub Eestis seadusandlik võim.
    2005. aastal sai peaministriks Andrus Ansip, kes püsis sellel kohal 2014. aastani.
    2006. aastal valiti presidendiks Toomas Hendrik Ilves.
    ''')

# Extract named entities
pprint(text.named_entities)

['Eesti vabariik',
 'Põhja-Euroobas|Põhja-Euroopa',
 'Eesti',
 'Soome laht',
 'Soome Vabariik',
 'Riigikogu',
 'Eesti vabariik',
 'riigikogu',
 'Eesti',
 'Andrus Ansip',
 'Toomas Hendrik Ilves']


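When you access the **named\_entities** property of a **Text**
instance, estnltk runs the whole text processing pipeline in the
background, including tokenization, morphological analysis and named
entity extraction. As the output shows, entities are returned in
lemmatized form (for example, 'Soome laht' for 'Soome lahe' in the
text), and an entry such as 'Põhja-Euroobas|Põhja-Euroopa' appears to
list alternative lemmas separated by a vertical bar.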

The **Text** class additionally provides properties that give more
information about the extracted entities:

In [2]:
pprint(list(zip(text.named_entities, text.named_entity_labels, text.named_entity_spans)))

[('Eesti vabariik', 'LOC', (0, 14)),
 ('Põhja-Euroobas|Põhja-Euroopa', 'LOC', (23, 37)),
 ('Eesti', 'LOC', (44, 49)),
 ('Soome laht', 'LOC', (69, 79)),
 ('Soome Vabariik', 'LOC', (80, 97)),
 ('Riigikogu', 'ORG', (103, 112)),
 ('Eesti vabariik', 'LOC', (116, 131)),
 ('riigikogu', 'ORG', (143, 154)),
 ('Eesti', 'LOC', (162, 168)),
 ('Andrus Ansip', 'PER', (223, 235)),
 ('Toomas Hendrik Ilves', 'PER', (312, 332))]
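
Once the entities, labels and spans are zipped together, ordinary Python is enough to reorganize them. The following is a minimal sketch that groups entities by label; it does not call estnltk, the triples are an abridged copy of the output above, and `group_by_label` is a hypothetical helper name, not part of the library:

```python
from collections import defaultdict

# (entity, label, span) triples copied from the output above (abridged)
entities = [
    ('Eesti vabariik', 'LOC', (0, 14)),
    ('Riigikogu', 'ORG', (103, 112)),
    ('Andrus Ansip', 'PER', (223, 235)),
    ('Toomas Hendrik Ilves', 'PER', (312, 332)),
]


def group_by_label(triples):
    """Map each entity label to the list of entity strings carrying it."""
    grouped = defaultdict(list)
    for entity, label, _span in triples:
        grouped[label].append(entity)
    return dict(grouped)


by_label = group_by_label(entities)
print(by_label['PER'])
```

With a real **Text** instance, the same helper could be fed `zip(text.named_entities, text.named_entity_labels, text.named_entity_spans)`.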


The default models use the tags PER, ORG and LOC to denote person names,
organizations and locations respectively. Entity tags are encoded using
the BIO annotation scheme, where each entity label is prefixed with
either B- or I-. B- marks the beginning of an entity and I- marks a word
inside it. The prefixes make it possible to detect multiword entities,
as shown in the example above. All other words, which do not refer to
entities of interest, are labelled with the O tag.

The raw labels are accessible via the property
**labels** of the **Text** instance:

In [3]:
pprint(list(zip(text.word_texts, text.labels)))

[('Eesti', 'B-LOC'),
 ('Vabariik', 'I-LOC'),
 ('on', 'O'),
 ('riik', 'O'),
 ('Põhja-Euroopas', 'B-LOC'),
 ('.', 'O'),
 ('Eesti', 'B-LOC'),
 ('piirneb', 'O'),
 ('põhjas', 'O'),
 ('üle', 'O'),
 ('Soome', 'B-LOC'),
 ('lahe', 'I-LOC'),
 ('Soome', 'B-LOC'),
 ('Vabariigiga', 'I-LOC'),
 ('.', 'O'),
 ('Riigikogu', 'B-ORG'),
 ('on', 'O'),
 ('Eesti', 'B-LOC'),
 ('Vabariigi', 'I-LOC'),
 ('parlament', 'O'),
 ('.', 'O'),
 ('Riigikogule', 'B-ORG'),
 ('kuulub', 'O'),
 ('Eestis', 'B-LOC'),
 ('seadusandlik', 'O'),
 ('võim', 'O'),
 ('.', 'O'),
 ('2005.', 'O'),
 ('aastal', 'O'),
 ('sai', 'O'),
 ('peaministriks', 'O'),
 ('Andrus', 'B-PER'),
 ('Ansip', 'I-PER'),
 (',', 'O'),
 ('kes', 'O'),
 ('püsis', 'O'),
 ('sellel', 'O'),
 ('kohal', 'O'),
 ('2014.', 'O'),
 ('aastani', 'O'),
 ('.', 'O'),
 ('2006.', 'O'),
 ('aastal', 'O'),
 ('valiti', 'O'),
 ('presidendiks', 'O'),
 ('Toomas', 'B-PER'),
 ('Hendrik', 'I-PER'),
 ('Ilves', 'I-PER'),
 ('.', 'O')]
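
To see how the B- and I- prefixes delimit multiword entities, the word/label pairs above can be merged back into entities by hand. This is only an illustrative sketch in plain Python, not the code estnltk runs internally; `bio_to_entities` is a hypothetical name, and it joins surface forms rather than the lemmas that **named\_entities** returns:

```python
def bio_to_entities(words, labels):
    """Merge BIO-labelled words into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, label in zip(words, labels):
        prefix, _, entity_type = label.partition('-')
        # An I- label continues the open entity only if the types match
        if prefix == 'I' and current_type == entity_type:
            current_words.append(word)
            continue
        # Any other label closes the open entity, if there is one
        if current_words:
            entities.append((' '.join(current_words), current_type))
        if prefix in ('B', 'I'):
            # B- (or a stray I-) opens a new entity
            current_words, current_type = [word], entity_type
        else:
            current_words, current_type = [], None
    # Flush an entity that runs to the end of the sequence
    if current_words:
        entities.append((' '.join(current_words), current_type))
    return entities


words = ['Soome', 'lahe', 'Soome', 'Vabariigiga', '.', 'Riigikogu']
labels = ['B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O', 'B-ORG']
merged = bio_to_entities(words, labels)
print(merged)
```

Note how the second B-LOC separates 'Soome lahe' from 'Soome Vabariigiga' even though the two entities are adjacent and share a type; without the B- prefix they would be indistinguishable from one four-word entity.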


Advanced NER
------------

### Training custom models

The default models that come with estnltk are good enough for basic
tasks. However, some specific tasks may need a custom NER model. To
train your own model, you need to provide a training corpus and custom
configuration settings. The following example demonstrates how to train
an NER model using the default training dataset and settings:

In [4]:
from estnltk import estner
from estnltk.corpus import read_json_corpus
from estnltk.ner import NerTrainer

# Read the default training corpus
corpus = read_json_corpus('../../../estnltk/corpora/estner.json')

# Read the default settings
ner_settings = estner.settings

# Directory to save the model
model_dir = 'output_model_directory'

# Train and save the model
trainer = NerTrainer(ner_settings)
trainer.train(corpus, model_dir)

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 27502
Seconds required: 6.768

Stochastic Gradient Descent (SGD)
c2: 0.001000
max_iterations: 1000
period: 10
delta: 0.000001

Calibrating the learning rate (eta)
calibration.eta: 0.100000
calibration.rate: 2.000000
calibration.samples: 1000
calibration.candidates: 10
calibration.max_trials: 20
Initial loss: 31638.553113
Trial #1 (eta = 0.100000): 35731.461497 (worse)
Trial #2 (eta = 0.050000): 18483.224381
Trial #3 (eta = 0.025000): 8702.151411
Trial #4 (eta = 0.012500): 4843.039483
Trial #5 (eta = 0.006250): 3396.892994
Trial #6 (eta = 0.003125): 3447.987218
Trial #7 (eta = 0.001563): 3931.664338
Trial #8 (eta = 0.000781): 4644.600628
Trial #9 (eta = 0.000391): 5567.995675
Trial #10 (eta = 0.000195): 6670.063105
Trial #11 (eta = 0.000098): 7850.731749
Best learning rate (eta): 0.006250
Seconds requ

The specified output directory will contain the resulting model file
*model.bin* and a copy of the settings module used for training. Now we
can load the model and tag some text using **NerTagger**:

In [5]:
from estnltk.ner import NerTagger

document = Text('Eesti koeraspordiliidu ( EKL ) presidendi Piret Laanetu intervjuu Eesti Päevalehele.')

# Load the model and settings
tagger = NerTagger(model_dir)

# Tag named entities in the document
tagger.tag_document(document)

pprint(list(zip(document.word_texts, document.labels)))

[('Eesti', 'B-ORG'),
 ('koeraspordiliidu', 'I-ORG'),
 ('(', 'O'),
 ('EKL', 'B-ORG'),
 (')', 'O'),
 ('presidendi', 'O'),
 ('Piret', 'B-PER'),
 ('Laanetu', 'I-PER'),
 ('intervjuu', 'O'),
 ('Eesti', 'B-ORG'),
 ('Päevalehele', 'I-ORG'),
 ('.', 'O')]


### Training dataset

To train a model with estnltk, you need to provide your training data in
a specific format (see the default dataset
*estnltk/estnltk/corpora/estner.json* for an example). The training file
contains one document per line, with a named entity label attached to
each word. Let's create a simple document:

In [6]:
text = Text('''Eesti Vabariik on riik Põhja-Euroopas.''')
text.tokenize_words()
pprint(text)

{'paragraphs': [{'end': 38, 'start': 0}],
 'sentences': [{'end': 38, 'start': 0}],
 'text': 'Eesti Vabariik on riik Põhja-Euroopas.',
 'words': [{'end': 5, 'start': 0, 'text': 'Eesti'},
           {'end': 14, 'start': 6, 'text': 'Vabariik'},
           {'end': 17, 'start': 15, 'text': 'on'},
           {'end': 22, 'start': 18, 'text': 'riik'},
           {'end': 37, 'start': 23, 'text': 'Põhja-Euroopas'},
           {'end': 38, 'start': 37, 'text': '.'}]}


Next, let's add named entity tags to each word in the document:

In [7]:
words = text.words

# label each word as "other":
for word in words:
    word['label'] = 'O'

# label words "Eesti Vabariik" as a location
words[0]['label'] = 'B-LOC'
words[1]['label'] = 'I-LOC'

# label word "Põhja-Euroopas" as a location
words[4]['label'] = 'B-LOC'

pprint(text.words)

[{'end': 5, 'label': 'B-LOC', 'start': 0, 'text': 'Eesti'},
 {'end': 14, 'label': 'I-LOC', 'start': 6, 'text': 'Vabariik'},
 {'end': 17, 'label': 'O', 'start': 15, 'text': 'on'},
 {'end': 22, 'label': 'O', 'start': 18, 'text': 'riik'},
 {'end': 37, 'label': 'B-LOC', 'start': 23, 'text': 'Põhja-Euroopas'},
 {'end': 38, 'label': 'O', 'start': 37, 'text': '.'}]
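
Setting labels word by word becomes tedious for longer documents. One possible shortcut, sketched here under the assumption that annotations are available as word-index ranges, is a small helper that expands such ranges into BIO labels. `apply_entity_spans` is a hypothetical helper, not an estnltk function, and it works on plain dictionaries shaped like the word entries above:

```python
def apply_entity_spans(words, spans):
    """Set a BIO 'label' on each word dict.

    `spans` holds (first_word_index, last_word_index, entity_type) triples
    with inclusive indices; words outside every span get the 'O' label.
    """
    for word in words:
        word['label'] = 'O'
    for first, last, entity_type in spans:
        words[first]['label'] = 'B-' + entity_type
        for index in range(first + 1, last + 1):
            words[index]['label'] = 'I-' + entity_type
    return words


tokens = ['Eesti', 'Vabariik', 'on', 'riik', 'Põhja-Euroopas', '.']
demo_words = [{'text': token} for token in tokens]

# "Eesti Vabariik" covers words 0-1, "Põhja-Euroopas" is word 4
apply_entity_spans(demo_words, [(0, 1, 'LOC'), (4, 4, 'LOC')])
demo_labels = [word['label'] for word in demo_words]
print(demo_labels)
```

The resulting labels match the ones assigned manually in the cell above.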


Once we have a collection of labelled documents, we can save it to disk
using the function **write\_json\_corpus()**:

In [8]:
from estnltk.corpus import write_json_corpus

documents = [text]
write_json_corpus(documents, 'output_file_name')

[{'paragraphs': [{'end': 38, 'start': 0}],
  'sentences': [{'end': 38, 'start': 0}],
  'text': 'Eesti Vabariik on riik Põhja-Euroopas.',
  'words': [{'end': 5, 'label': 'B-LOC', 'start': 0, 'text': 'Eesti'},
   {'end': 14, 'label': 'I-LOC', 'start': 6, 'text': 'Vabariik'},
   {'end': 17, 'label': 'O', 'start': 15, 'text': 'on'},
   {'end': 22, 'label': 'O', 'start': 18, 'text': 'riik'},
   {'end': 37, 'label': 'B-LOC', 'start': 23, 'text': 'Põhja-Euroopas'},
   {'end': 38, 'label': 'O', 'start': 37, 'text': '.'}]}]

This serializes each document object into a JSON string and writes the
strings to the specified file, one per line. The resulting training file
can be used with **NerTrainer** as shown above.
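
The one-document-per-line layout can be reproduced with the standard `json` module. The sketch below only illustrates the file format described above and is not estnltk's implementation; `write_json_lines`, `read_json_lines` and the shortened sample document are assumptions made for the example:

```python
import json
import os
import tempfile


def write_json_lines(documents, path):
    """Write one JSON-encoded document per line."""
    with open(path, 'w', encoding='utf-8') as handle:
        for document in documents:
            handle.write(json.dumps(document, ensure_ascii=False) + '\n')


def read_json_lines(path):
    """Read the documents back, skipping blank lines."""
    with open(path, encoding='utf-8') as handle:
        return [json.loads(line) for line in handle if line.strip()]


# A shortened stand-in for a labelled document
document = {
    'text': 'Eesti Vabariik',
    'words': [
        {'start': 0, 'end': 5, 'text': 'Eesti', 'label': 'B-LOC'},
        {'start': 6, 'end': 14, 'text': 'Vabariik', 'label': 'I-LOC'},
    ],
}

path = os.path.join(tempfile.mkdtemp(), 'corpus.json')
write_json_lines([document, document], path)
restored = read_json_lines(path)
print(len(restored))
```

Keeping each document on its own line means a large corpus can be read one document at a time instead of being parsed as a single JSON value.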

### NER settings

By default, estnltk uses the configuration module
**estnltk.estner.settings**. A settings module defines training
algorithm parameters, entity categories, feature extractors and feature
templates. The simplest way to create a custom configuration is to make
a new settings module, e.g. *custom\_settings.py*, import the default
settings and override the necessary parts. For example, a minimal custom
configuration module could look like this:

In [9]:
%%writefile custom_settings.py

from estnltk.estner.settings import *

# Override feature templates
TEMPLATES = [
    (('lem', 0),),
]

# Override feature extractors
FEATURE_EXTRACTORS = (
    "estnltk.estner.featureextraction.MorphFeatureExtractor",
)

Overwriting custom_settings.py


In [10]:
import custom_settings

In [11]:
ner_settings2 = custom_settings

Now the **NerTrainer** instance can be initialized with the
custom\_settings module (make sure *custom\_settings.py* is on your
Python path):

In [12]:
trainer = NerTrainer(ner_settings2)  
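
If the settings file lives outside the current working directory, one way to put it on the Python path is to extend `sys.path` before importing. The sketch below uses only the standard library; `demo_settings.py` is a hypothetical stand-in that does not import estnltk, whereas a real settings module would start with `from estnltk.estner.settings import *` as shown above:

```python
import importlib
import os
import sys
import tempfile

# Write a stand-in settings module into a fresh directory
module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, 'demo_settings.py')
with open(module_path, 'w', encoding='utf-8') as handle:
    handle.write("TEMPLATES = [(('lem', 0),)]\n")

# Make the directory importable, then load the module by name
if module_dir not in sys.path:
    sys.path.insert(0, module_dir)
importlib.invalidate_caches()
demo_settings = importlib.import_module('demo_settings')

print(demo_settings.TEMPLATES)
```

The imported module object could then be passed to **NerTrainer** in the same way as `custom_settings` above.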