### Creating Text objects for training

In [1]:
from estnltk import Text
text = Text("Eesti on Euroopas.")

In [2]:
text.tag_layer()

text
Eesti on Euroopas.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


In [3]:
text2 = Text("Kersti Kaljulaid on president.")

In [4]:
text2.tag_layer()

text
Kersti Kaljulaid on president.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


### Setting up model parameters and training the model

In [6]:
from estnltk.taggers.standard.ner.ner_trainer import NerTrainer
from estnltk.taggers.standard.ner.model_storage_util import ModelStorageUtil
from estnltk.common import DEFAULT_PY3_NER_MODEL_DIR

# Load the settings stored with the default NER model and create a trainer from them
model_dir = DEFAULT_PY3_NER_MODEL_DIR
modelUtil = ModelStorageUtil(model_dir)
nersettings = modelUtil.load_settings()
trainer = NerTrainer(nersettings)

The train method takes either a single Text object or a list of Text objects. NER tags can be supplied in two ways. If the Text objects have no NER layer, pass the tags through the 'labels' parameter as nested lists, as in the example below: one list per text, holding one list per sentence, holding one tag per word. If the Text objects already have a NER layer, pass the name of that layer through the 'layer' parameter instead. The last parameter, 'model_dir', specifies the directory where the model is saved; if it is omitted, the default directory (ner_model) is used.

In [7]:
# NBVAL_IGNORE_OUTPUT
# labels: one list per text, one inner list per sentence, one tag per word
trainer.train([text, text2], labels=[[['B-LOC', 'O', 'B-LOC', 'O']], [['B-PER', 'I-PER', 'O', 'O', 'O']]], model_dir='test')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 181
Seconds required: 0.002

Stochastic Gradient Descent (SGD)
c2: 0.001000
max_iterations: 1000
period: 10
delta: 0.000001

Calibrating the learning rate (eta)
calibration.eta: 0.100000
calibration.rate: 2.000000
calibration.samples: 2
calibration.candidates: 10
calibration.max_trials: 20
Initial loss: 12.476649
Trial #1 (eta = 0.100000): 13.424269 (worse)
Trial #2 (eta = 0.050000): 12.833209 (worse)
Trial #3 (eta = 0.025000): 12.626445 (worse)
Trial #4 (eta = 0.012500): 12.544598 (worse)
Trial #5 (eta = 0.006250): 12.508911 (worse)
Trial #6 (eta = 0.003125): 12.492355 (worse)
Trial #7 (eta = 0.001563): 12.484396 (worse)
Trial #8 (eta = 0.000781): 12.480496 (worse)
Trial #9 (eta = 0.000391): 12.478566 (worse)
Trial #10 (eta = 0.000195): 12.477606 (worse)
Trial #11 (eta = 0.000098): 12.477127 (worse)

***** Epoch #454 *****
Loss: 12.453581
Improvement ratio: 0.000041
Feature L2-norm: 0.002001
Learning rate (eta): 0.000000
Total number of feature updates: 908
Seconds required for this iteration: 0.000

***** Epoch #455 *****
Loss: 12.453530
Improvement ratio: 0.000041
Feature L2-norm: 0.002005
Learning rate (eta): 0.000000
Total number of feature updates: 910
Seconds required for this iteration: 0.001

***** Epoch #456 *****
Loss: 12.453479
Improvement ratio: 0.000041
Feature L2-norm: 0.002010
Learning rate (eta): 0.000000
Total number of feature updates: 912
Seconds required for this iteration: 0.000

***** Epoch #457 *****
Loss: 12.453428
Improvement ratio: 0.000041
Feature L2-norm: 0.002014
Learning rate (eta): 0.000000
Total number of feature updates: 914
Seconds required for this iteration: 0.000

***** Epoch #458 *****
Loss: 12.453377
Improvement ratio: 0.000041
Feature L2-norm: 0.002019
Learning rate (eta): 0.000000
Total number of feature updates: 916
Seconds required for thi
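
The nesting of the 'labels' argument is easy to get wrong, so a quick shape check before calling train can help. The sketch below is plain Python and does not call EstNLTK; check_labels is a hypothetical helper, and the word counts are copied by hand from the layer tables above (one sentence of 4 words, one sentence of 5 words). It assumes, based on the example call, that 'labels' is nested as texts, then sentences, then words.

```python
from typing import Sequence


def check_labels(word_counts: Sequence[Sequence[int]],
                 labels: Sequence[Sequence[Sequence[str]]]) -> None:
    """Raise ValueError unless labels match the texts -> sentences -> words nesting.

    word_counts[t][s] is the number of words in sentence s of text t
    (hypothetical input, filled in by hand here).
    """
    if len(labels) != len(word_counts):
        raise ValueError(
            f"expected labels for {len(word_counts)} texts, got {len(labels)}")
    for t, (sent_counts, text_labels) in enumerate(zip(word_counts, labels)):
        if len(text_labels) != len(sent_counts):
            raise ValueError(
                f"text {t}: expected {len(sent_counts)} sentences, "
                f"got {len(text_labels)}")
        for s, (n_words, sent_labels) in enumerate(zip(sent_counts, text_labels)):
            if len(sent_labels) != n_words:
                raise ValueError(
                    f"text {t}, sentence {s}: expected {n_words} tags, "
                    f"got {len(sent_labels)}")


# Word counts per sentence, read off the 'words' rows of the layer tables.
word_counts = [[4], [5]]
labels = [[['B-LOC', 'O', 'B-LOC', 'O']],
          [['B-PER', 'I-PER', 'O', 'O', 'O']]]

check_labels(word_counts, labels)  # passes silently when the shapes agree
```

Note that punctuation counts as a word here: the final full stop in each sentence has its own 'O' tag.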

### Testing the model

In [8]:
from estnltk.taggers import NerTagger
from estnltk.taggers import WordLevelNerTagger

# Load the model that was just saved to the 'test' directory
nertagger = NerTagger(model_dir='test')

Two checks help confirm that training worked. Tagging a text the model was trained on should reproduce the training tags, while tagging a text it has not seen before is expected to produce nonsense, because the model has only been trained on two sentences.

In [9]:
testtext = Text("Eesti on Euroopas.")

In [10]:
testtext.tag_layer()

text
Eesti on Euroopas.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


In [11]:
nertagger.tag(testtext)

text
Eesti on Euroopas.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4
ner,nertag,,words,False,2


In [12]:
testtext.ner

layer name,attributes,parent,enveloping,ambiguous,span count
ner,nertag,,words,False,2

text,nertag
['Eesti'],LOC
['Euroopas'],LOC
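
The ner layer above holds two spans, while training used one tag per word. The training tags appear to follow the common BIO scheme (B- opens an entity, I- continues it, O marks a word outside any entity), and the sketch below shows how such tags collapse into spans. It is plain Python for illustration only, not EstNLTK's own decoding code; bio_to_spans is a hypothetical helper, and an I- tag with no matching open entity is simply treated as O here.

```python
from typing import List, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[List[str], str]]:
    """Group per-word BIO tags into (words, entity_type) spans."""
    spans = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the open one first.
            if current_words:
                spans.append((current_words, current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continue the entity that is currently open.
            current_words.append(word)
        else:
            # 'O' (or a stray I- tag) ends any open entity.
            if current_words:
                spans.append((current_words, current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((current_words, current_type))
    return spans


print(bio_to_spans(["Eesti", "on", "Euroopas", "."],
                   ["B-LOC", "O", "B-LOC", "O"]))
# [(['Eesti'], 'LOC'), (['Euroopas'], 'LOC')]

print(bio_to_spans(["Kersti", "Kaljulaid", "on", "president", "."],
                   ["B-PER", "I-PER", "O", "O", "O"]))
# [(['Kersti', 'Kaljulaid'], 'PER')]
```

The first result matches the two LOC spans in the table above, which is what the check on training text is meant to confirm.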


In [13]:
testtext2 = Text("Jüri Ratas on peaminister.")

In [14]:
testtext2.tag_layer()

text
Jüri Ratas on peaminister.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


In [15]:
nertagger.tag(testtext2)

text
Jüri Ratas on peaminister.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5
ner,nertag,,words,False,1


In [16]:
testtext2.ner

layer name,attributes,parent,enveloping,ambiguous,span count
ner,nertag,,words,False,1

text,nertag
['Jüri'],LOC
