### Creating Text objects for training

In [1]:
from estnltk import Text
text = Text("Eesti on Euroopas.")

In [2]:
text.tag_layer()

text
Eesti on Euroopas.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


In [3]:
text2 = Text("Kersti Kaljulaid on president.")

In [4]:
text2.tag_layer()

text
Kersti Kaljulaid on president.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


### Setting up model parameters and training the model

In [5]:
from estnltk.taggers.estner.ner_trainer import NerTrainer
from estnltk.taggers.estner.model_storage_util import ModelStorageUtil
from estnltk.core import DEFAULT_PY3_NER_MODEL_DIR

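# Load the NER settings stored in the default model directory
model_dir = DEFAULT_PY3_NER_MODEL_DIR
modelUtil = ModelStorageUtil(model_dir)
nersettings = modelUtil.load_settings()
# Create a trainer that uses those settings
trainer = NerTrainer(nersettings)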

The train method takes either a single Text object or a list of Text objects. NER tags can be supplied in two ways. If the Text objects have no NER layer, pass the tags as nested lists through the 'labels' parameter, as in this example: each text gets one list, which holds one list of tags per sentence, with one tag per word. If the Text objects already have an NER layer, pass the name of that layer through the 'layer' parameter instead. The last parameter, 'model_dir', specifies the directory where the model is saved. If it is omitted, the default directory (ner_model) is used.
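
Because the tag lists are nested per text, per sentence, and per word, a mismatch in length is an easy mistake to make. The following is a minimal sketch of a shape check that can be run before training. The helper name `label_shape_problems` and the hand-written word lists are illustrative assumptions and not part of EstNLTK; the word lists simply follow the span counts of the words layer shown above (4 and 5 words).

```python
from typing import List


def label_shape_problems(sentences: List[List[str]],
                         labels: List[List[str]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the shapes match.

    `sentences` holds the words of one text, grouped by sentence.
    `labels` holds the tags for the same text, in the same grouping.
    """
    problems = []
    if len(sentences) != len(labels):
        problems.append(
            f"expected {len(sentences)} sentence(s), got {len(labels)} tag list(s)"
        )
    # Compare sentence by sentence for as many pairs as exist on both sides
    for i, (words, tags) in enumerate(zip(sentences, labels)):
        if len(words) != len(tags):
            problems.append(
                f"sentence {i}: {len(words)} word(s) but {len(tags)} tag(s)"
            )
    return problems


# Hypothetical word lists, written out by hand for the two training texts
doc1_words = [["Eesti", "on", "Euroopas", "."]]
doc2_words = [["Kersti", "Kaljulaid", "on", "president", "."]]

# The same nested tag lists that are passed to 'labels' in this example
doc1_labels = [['B-LOC', 'O', 'B-LOC', 'O']]
doc2_labels = [['B-PER', 'I-PER', 'O', 'O', 'O']]

print(label_shape_problems(doc1_words, doc1_labels))
print(label_shape_problems(doc2_words, doc2_labels))
```

Both calls print an empty list when every word has exactly one tag. A missing tag for the final full stop is a typical problem this check would report.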

In [6]:
labels = [[['B-LOC', 'O', 'B-LOC', 'O']], [['B-PER', 'I-PER', 'O', 'O', 'O']]]
trainer.train([text, text2], labels=labels, model_dir='test')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 89
Seconds required: 0.005

Stochastic Gradient Descent (SGD)
c2: 0.001000
max_iterations: 1000
period: 10
delta: 0.000001

Calibrating the learning rate (eta)
calibration.eta: 0.100000
calibration.rate: 2.000000
calibration.samples: 1
calibration.candidates: 10
calibration.max_trials: 20
Initial loss: 2.772589
Trial #1 (eta = 0.100000): 2.772890 (worse)
Trial #2 (eta = 0.050000): 2.772664 (worse)
Trial #3 (eta = 0.025000): 2.772608 (worse)
Trial #4 (eta = 0.012500): 2.772593 (worse)
Trial #5 (eta = 0.006250): 2.772590 (worse)
Trial #6 (eta = 0.003125): 2.772589 (worse)
Trial #7 (eta = 0.001563): 2.772589 (worse)
Trial #8 (eta = 0.000781): 2.772589 (worse)
Trial #9 (eta = 0.000391): 2.772589 (worse)
Trial #10 (eta = 0.000195): 2.772589 (worse)
Trial #11 (eta = 0.000098): 2.772589 (worse)
Trial #12 (e

***** Epoch #583 *****
Loss: 2.765911
Improvement ratio: 0.000041
Feature L2-norm: 0.001220
Learning rate (eta): 0.000000
Total number of feature updates: 583
Seconds required for this iteration: 0.000

***** Epoch #584 *****
Loss: 2.765900
Improvement ratio: 0.000041
Feature L2-norm: 0.001222
Learning rate (eta): 0.000000
Total number of feature updates: 584
Seconds required for this iteration: 0.000

***** Epoch #585 *****
Loss: 2.765888
Improvement ratio: 0.000041
Feature L2-norm: 0.001224
Learning rate (eta): 0.000000
Total number of feature updates: 585
Seconds required for this iteration: 0.000

***** Epoch #586 *****
Loss: 2.765877
Improvement ratio: 0.000041
Feature L2-norm: 0.001226
Learning rate (eta): 0.000000
Total number of feature updates: 586
Seconds required for this iteration: 0.001

***** Epoch #587 *****
Loss: 2.765865
Improvement ratio: 0.000041
Feature L2-norm: 0.001228
Learning rate (eta): 0.000000
Total number of feature updates: 587
Seconds required for this ite

***** Epoch #903 *****
Loss: 2.762249
Improvement ratio: 0.000041
Feature L2-norm: 0.001888
Learning rate (eta): 0.000000
Total number of feature updates: 903
Seconds required for this iteration: 0.001

***** Epoch #904 *****
Loss: 2.762237
Improvement ratio: 0.000041
Feature L2-norm: 0.001890
Learning rate (eta): 0.000000
Total number of feature updates: 904
Seconds required for this iteration: 0.000

***** Epoch #905 *****
Loss: 2.762226
Improvement ratio: 0.000041
Feature L2-norm: 0.001892
Learning rate (eta): 0.000000
Total number of feature updates: 905
Seconds required for this iteration: 0.001

***** Epoch #906 *****
Loss: 2.762214
Improvement ratio: 0.000041
Feature L2-norm: 0.001895
Learning rate (eta): 0.000000
Total number of feature updates: 906
Seconds required for this iteration: 0.000

***** Epoch #907 *****
Loss: 2.762203
Improvement ratio: 0.000041
Feature L2-norm: 0.001897
Learning rate (eta): 0.000000
Total number of feature updates: 907
Seconds required for this ite

***** Epoch #346 *****
Loss: 12.459077
Improvement ratio: 0.000041
Feature L2-norm: 0.001525
Learning rate (eta): 0.000000
Total number of feature updates: 692
Seconds required for this iteration: 0.001

***** Epoch #347 *****
Loss: 12.459026
Improvement ratio: 0.000041
Feature L2-norm: 0.001530
Learning rate (eta): 0.000000
Total number of feature updates: 694
Seconds required for this iteration: 0.000

***** Epoch #348 *****
Loss: 12.458975
Improvement ratio: 0.000041
Feature L2-norm: 0.001534
Learning rate (eta): 0.000000
Total number of feature updates: 696
Seconds required for this iteration: 0.000

***** Epoch #349 *****
Loss: 12.458924
Improvement ratio: 0.000041
Feature L2-norm: 0.001538
Learning rate (eta): 0.000000
Total number of feature updates: 698
Seconds required for this iteration: 0.000

***** Epoch #350 *****
Loss: 12.458873
Improvement ratio: 0.000041
Feature L2-norm: 0.001543
Learning rate (eta): 0.000000
Total number of feature updates: 700
Seconds required for thi

***** Epoch #853 *****
Loss: 12.433301
Improvement ratio: 0.000041
Feature L2-norm: 0.003758
Learning rate (eta): 0.000000
Total number of feature updates: 1706
Seconds required for this iteration: 0.000

***** Epoch #854 *****
Loss: 12.433250
Improvement ratio: 0.000041
Feature L2-norm: 0.003762
Learning rate (eta): 0.000000
Total number of feature updates: 1708
Seconds required for this iteration: 0.000

***** Epoch #855 *****
Loss: 12.433199
Improvement ratio: 0.000041
Feature L2-norm: 0.003767
Learning rate (eta): 0.000000
Total number of feature updates: 1710
Seconds required for this iteration: 0.000

***** Epoch #856 *****
Loss: 12.433149
Improvement ratio: 0.000041
Feature L2-norm: 0.003771
Learning rate (eta): 0.000000
Total number of feature updates: 1712
Seconds required for this iteration: 0.000

***** Epoch #857 *****
Loss: 12.433098
Improvement ratio: 0.000041
Feature L2-norm: 0.003776
Learning rate (eta): 0.000000
Total number of feature updates: 1714
Seconds required fo

### Testing the model

In [7]:
from estnltk.taggers import NerTagger
from estnltk.taggers import WordLevelNerTagger

nertagger = NerTagger(model_dir = 'test')

Two checks confirm that the training worked. Text that the model was trained on should receive the same tags as in the training labels. Text that it has not seen before is expected to receive unreliable tags, because the model was trained on only two sentences.

In [8]:
testtext = Text("Eesti on Euroopas.")

In [9]:
testtext.tag_layer()

text
Eesti on Euroopas.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


In [10]:
nertagger.tag(testtext)

text
Eesti on Euroopas.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4
ner,nertag,,words,False,2


In [11]:
testtext.ner

layer name,attributes,parent,enveloping,ambiguous,span count
ner,nertag,,words,False,2

text,nertag
['Eesti'],LOC
['Euroopas'],LOC
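
The training labels use B-/I-/O prefixes, while the ner layer lists whole entities. To compare the two by hand, the word-level tags can be grouped into entity spans. The following is a minimal sketch of such a grouping in plain Python. The function `bio_to_spans` is a hypothetical helper written for this comparison and is not how EstNLTK builds the ner layer internally.

```python
from typing import List, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[List[str], str]]:
    """Group word-level BIO tags into (entity words, entity type) pairs."""
    spans = []
    current_words: List[str] = []
    current_type = None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close the previous one first
            if current_words:
                spans.append((current_words, current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity
            current_words.append(word)
        else:
            # 'O' (or an I- tag with no matching open entity) closes the entity
            if current_words:
                spans.append((current_words, current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((current_words, current_type))
    return spans


# Training labels of the first text, with its words written out by hand
print(bio_to_spans(["Eesti", "on", "Euroopas", "."],
                   ['B-LOC', 'O', 'B-LOC', 'O']))
# → [(['Eesti'], 'LOC'), (['Euroopas'], 'LOC')]
```

The two spans agree with the two LOC entries in the ner layer above, which is the outcome the first check expects.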


In [12]:
testtext2 = Text("Jüri Ratas on peaminister.")

In [13]:
testtext2.tag_layer()

text
Jüri Ratas on peaminister.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


In [14]:
nertagger.tag(testtext2)

text
Jüri Ratas on peaminister.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5
ner,nertag,,words,False,1


In [15]:
testtext2.ner

layer name,attributes,parent,enveloping,ambiguous,span count
ner,nertag,,words,False,1

text,nertag
['Jüri'],LOC
