# Introduction

First, you need to grab your annotations and assemble a training corpus.

In our work scenario, we use `Webanno` to annotate and store the manually corrected texts. We keep the gold standards in `data/IOB`.

The files are 1 per page of the volume. The filenames *must* have a sequential number identifying the sequence in which we have annotated them. Every file in the `data/IOB` folder is structured as in this example:

```csv
Uebrigens	NN	<unknown>	_	O	_
bin	VAFIN	sein	_	O	_
ich	PPER	ich	_	O	_
etwas	ADV	etwas	_	O	_
zu	PTKA	zu	_	O	_
sehr	ADV	sehr	_	O	_
zersplittert	VVPP	zersplittern	_	O	_
```

The columns are:
* form
* pos (TreeTagger German tagset)
* lemma
* textual layer annotation (not used, but must be there)
* NER (the actual column we're interested in)
* NE id to gazetteers (not used, but must be there)

Unfortunately, this output is not natively produced by Webanno, and the full export of the project yields a lot of errors, with no informative log. We'll update the `data/IOB` folder manually with the last volumes.

**Optionally**, you might want to install a couple of dependency projects that I am running locally are available to $PYTHONPATH. For this NB, the custom modules are not necessary; however, they are:
* `pywebanno`: see [here](https://github.com/dainst/PyWebanno); this is optional, in case you want a tool to pool your annotations directly
* my `custom-corpus-readers` (see [here](https://github.com/francescomambrini/custom-corpus-readers)) for NLTK, which include support for flexible TSV files.

In [2]:
#this will work for my machine; install the dependency properly
import sys
sys.path.append("../../PyWebanno/pywebanno")
sys.path.append("../../custom-corpus-readers")

# This line is however needed!
sys.path.append("../")

Now these imports should work

In [3]:
# again, this is optional, in case you want to use these tools
#from corpusreader import WebAnnoTsvCorpusReader
from korr_corpusreader import KorrIOBCorpusReader

## Methodology

"Conditional Random Field" ([CRF](https://en.wikipedia.org/wiki/Conditional_random_field)) is a class of statistical modelling methods used often in machine learning; CRF is very often used in Named Entity Recognition ([NER](https://en.wikipedia.org/wiki/Named-entity_recognition)) or other classification tasks. It is suitable to treat linguistic texts where the context of a token (i.e. the series of features of the preceding and following tokens) is often very important to guess the right tag.

See [here](https://en.wikipedia.org/wiki/Conditional_random_field) for an introduction to CRF.

In what follows we perform a very simple train/test splitting on the manually annotated letters of the [first volume of letters from Braun to Gehrard](http://arachne.uni-koeln.de/item/buch/2892) (1832-35). The original annotation is stored in [Webanno](http://nlp.dainst.org:18080/webanno/).

## Load the data

This is mainly done for the purpose of inspection. The Trainer will reload the data in its own constructor!

In [4]:
newcols = ["words", "pos", "lemma", "textlayer", "chunk", "entityid"]
corpus = KorrIOBCorpusReader("../data/IOB_GOLD/", r".*\.iob", columntypes=newcols)

With the previous code we have load the data that are saved in one of the DAI's servers (nlp.dainst.org). The corpus includes the following total number of annotated pages:

In [5]:
len(corpus.fileids())

2291

Here is one example of a file

In [6]:
corpus.fileids()[-1]

'9_GerhardAnHenzen1843-1850_page231_tsv.iob'

And here are a few annotated words as read by our corpus reader:

In [7]:
corpus.full_tagged_words()[15:30]

[('1832', 'CARD', '@card@', 'HEAD', 'I-DATEletter'),
 ('Mein', 'PPOSAT', 'mein', '_', 'O'),
 ('freundlichster', 'ADJA', '<unknown>', '_', 'O'),
 ('Herr', 'NN', 'Herr', '_', 'O'),
 ('Professor', 'NN', 'Professor', '_', 'O'),
 ('Die', 'ART', 'die', '_', 'O'),
 ('Wohlthat', 'NE', '<unknown>', '_', 'O'),
 ('und', 'KON', 'und', '_', 'O'),
 ('Annehmlichkeit', 'NN', 'Annehmlichkeit', '_', 'O'),
 ('in', 'APPR', 'in', '_', 'O'),
 ('Ihrer', 'PPOSAT', 'ihr', '_', 'O'),
 ('nächsten', 'ADJA', 'nah', '_', 'O'),
 ('Nähe', 'NN', 'Nähe', '_', 'O'),
 ('zu', 'APPR', 'zu', '_', 'O'),
 ('logiren', 'NN', '<unknown>', '_', 'O')]

The length (in token number) of our corpus is:

In [9]:
len(corpus.words())

137197

# Training

In order to train our models, we make use of the `Trainer` class defined in the code of this project. The class should be *reasonably* transportable to other corpora of letters. We can discuss of any problems.

In [8]:
from training import Trainer

In [9]:
t = Trainer("../lib/config/korr_mac.json")

## Train and test split

How do you evaluate the performance of a trained model? Basically, the simplest answer is to split your manually annotated data in a `train` and a `test` part: the model trained on the 

If you want to learn more about this approach to the problem I suggest this very nice [introductory tutorial](https://www.youtube.com/watch?v=0pP4EwWJgIU&index=5&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A).

CoNLL-like file structures have the peculiarity that they group the tokens into sentences, which must be kept together for the sake of linguistic coherence. Thus, methods of train-test splitting based on the randomized selection of tokens don't work. We have to come up with a custom solution.

(Unless, that is, you want to do someting like K-fold cross-validation. In this case, you might assume that any effect of distortion introduced by varying sentence lengths will be level out in the process of K-fold validation)

The Trainer object has a function that returns a train and a stesting split of the corpus using sentences as units. The counter will stop adding sentences to the `test` set after the last sentence break after the token counter has passed the select percentage (default: 20% test).

The following is the docstring of the method:

```python
def split(self, test_perc=0.2):
    """
    Split the corpus into training and testing. Randomize the selection of sentences and add sentences
    until the total number of tokens of the test section is equal or greater than test_perc.
    :param test_perc: percentage of tokens saved for tests
    :type test_perc: float (between 0 and 1)
    :return: tuple (of lists): train corpus, test corpus
    """
```

In [10]:
training,test = t.split()

As it can be seen, the training set is *roughly* equivalent to our given percentage (default = 80%). Given the mecahnism adopted to include sentences in the test set, the number of the training tokens will always be slightly inferior to selected one.

In [7]:
len([w for t in training for w in t]) / 137197

0.7999373164136242

Let us now assign these two sub-corpora to the `training` and `testing` properties of the Trainer

In [8]:
t.training = training
t.test = test

## Feature selections

In order to train a model you need three things:

* a $X$ matrix of features for your training data
* a $y$ vector of responses that correspond to these features
* a `template` that list what features must go on $X$ for each token in the training corpus

Optionally, if you have a test corpus, you want to extract $X$ and $y$ for your testing section as well, so that you will be able to use your trained model to the testing section too

The trainer has a `set_feats_labels` functions that take care to create both $X$ and $y$ for both training and testing corpora.

Training and testing corpus are taken from two properties of the Trainer, so no need to pass arguments. However, you'll need to import your template

In [9]:
from templates import template1

Before we proceed, `t.X_train, t.X_test, t.y_train` and `t.y_test` are all empty; e.g.

In [10]:
print(t.X_train is None)
print(t.X_train is None)
print(t.y_train is None)
print(t.y_test is None)

True
True
True
True


In [11]:
%%time
t.set_feats_labels(template1)

CPU times: user 20.4 s, sys: 523 ms, total: 20.9 s
Wall time: 21 s


What about now?

In [12]:
t.X_train[0][0]

{'BOS': 'True',
 'endsWithDigit[0]': 'False',
 'endsWithDigit[1]': 'False',
 'hasDigit[0]': 'False',
 'hasDigit[1]': 'False',
 'isDigit[0]': 'False',
 'isDigit[1]': 'False',
 'isInPersonDic[0]': 'False',
 'isInPersonDic[1]': 'False',
 'isInPersonDic[2]': 'False',
 'isInPlaceDic[0]': 'False',
 'isInPlaceDic[1]': 'True',
 'isInPlaceDic[2]': 'False',
 'isTitle[0]': 'True',
 'isTitle[1]': 'True',
 'isUpper[0]': 'False',
 'isUpper[1]': 'False',
 'lemma[0]': 'in',
 'lemma[1]': 'Rom',
 'pos[0]': 'APPR',
 'pos[0]|pos[1]': 'APPR|NE',
 'pos[0]|pos[1]|pos[2]': 'APPR|NE|VAFIN',
 'pos[1]': 'NE',
 'pos[1]|pos[2]': 'NE|VAFIN',
 'pos[2]': 'VAFIN',
 'prefix_long[0]': 'In',
 'prefix_long[1]': 'Rom',
 'prefix_long[2]': 'sind',
 'prefix_short[0]': 'In',
 'prefix_short[1]': 'Rom',
 'prefix_short[2]': 'sin',
 'rank[0]': '0',
 'sent_rank[0]': '0',
 'suffix_long[0]': 'In',
 'suffix_long[0]|suffix_long[1]': 'In|Rom',
 'suffix_long[0]|suffix_long[1]|suffix_long[2]': 'In|Rom|sind',
 'suffix_long[1]': 'Rom',
 'su

In [13]:
t.y_train[0][0]

'O'

Sanity check

In [15]:
assert len(t.y_train) == len(t.X_train)

In [17]:
assert len(t.y_test) == len(t.X_test)

## Model fitting

The last step is fitting the model!

In [18]:
%%time
t.fit()

CPU times: user 3min 21s, sys: 5.82 s, total: 3min 26s
Wall time: 3min 44s


# Evaluation

## Built-in `evaluate` function

This is good enough for some basic numbers, like e.g. the F1 score or precision and recall for each label

In [26]:
ev = t.evaluate()

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [29]:
print(ev["F1_class"])

                  precision    recall  f1-score   support

    B-DATEletter      0.945     0.939     0.942       147
    I-DATEletter      0.954     0.961     0.957       406
 B-DATEmentioned      0.889     0.545     0.676        44
 I-DATEmentioned      0.809     0.585     0.679        94
           B-LIT      0.862     0.543     0.667        46
           I-LIT      0.810     0.321     0.459        53
           B-OBJ      0.837     0.488     0.617        84
           I-OBJ      0.763     0.408     0.532        71
  B-ORGmentioned      0.932     0.775     0.846        71
  I-ORGmentioned      0.769     0.385     0.513        26
  B-PERaddressee      1.000     0.960     0.980       100
  I-PERaddressee      1.000     1.000     1.000         4
     B-PERauthor      0.981     0.946     0.963       111
     I-PERauthor      1.000     0.750     0.857         8
  B-PERmentioned      0.902     0.838     0.869       618
  I-PERmentioned      0.769     0.676     0.719       148
     B-PLACEf

The built-in evaluation is also reporting the 20 top-most likely transitions and unlikely transitions learned

In [31]:
for tr in ev["Top_likely_transitions"]: print(tr)

B-DATEletter -> I-DATEletter 6.017387
I-DATEmentioned -> I-DATEmentioned 5.336065
I-PERmentioned -> I-PERmentioned 5.334186
B-LIT  -> I-LIT   5.334048
B-DATEmentioned -> I-DATEmentioned 5.325046
B-ORGmentioned -> I-ORGmentioned 5.230235
I-ORGmentioned -> I-ORGmentioned 5.153614
I-DATEletter -> I-DATEletter 5.143415
B-OBJ  -> I-OBJ   5.114480
I-LIT  -> I-LIT   4.826870
B-PERmentioned -> I-PERmentioned 4.560642
B-PLACEmentioned -> I-PLACEmentioned 4.523414
I-OBJ  -> I-OBJ   4.502887
I-PLACEmentioned -> I-PLACEmentioned 4.214961
B-PLACEfrom -> I-PLACEfrom 3.726875
B-PERaddressee -> I-PERaddressee 3.489402
I-PLACEfrom -> I-PLACEfrom 3.179804
I-PERaddressee -> I-PERaddressee 2.955939
B-PERauthor -> I-PERauthor 2.679517
B-PERmentioned -> B-LIT   2.515660


In [32]:
for tr in ev["Top_unlikely_transitions"]: print(tr)

I-LIT  -> B-PERmentioned -1.127396
I-PERauthor -> O       -1.135229
I-DATEmentioned -> I-DATEletter -1.375594
B-PERmentioned -> I-LIT   -1.507606
B-PERauthor -> B-PERauthor -1.527896
I-DATEletter -> I-DATEmentioned -1.534933
B-PERmentioned -> I-OBJ   -1.538121
B-PERmentioned -> B-PERmentioned -1.592075
B-PERaddressee -> O       -1.744607
B-DATEletter -> O       -1.923546
O      -> I-PLACEfrom -1.974725
O      -> I-PERauthor -2.034647
O      -> I-PERaddressee -2.428154
O      -> I-PLACEmentioned -2.897769
O      -> I-DATEletter -3.279177
O      -> I-DATEmentioned -3.385009
O      -> I-ORGmentioned -3.629834
O      -> I-LIT   -4.995137
O      -> I-OBJ   -5.420429
O      -> I-PERmentioned -5.687708


## Evaluation with `eli5`

The module eli5 provides some nicer graphic and functions to inspect what are the most highly rewarded features in the learning process

In [33]:
import eli5

In [40]:
#eli5.show_weights(crf, top=10, targets=['O', 'B-ORG', 'I-ORG'])
eli5.show_weights(t.crf, targets=['O','B-PLACEmentioned', "I-PLACEmentioned"])

From \ To,O,B-PLACEmentioned,I-PLACEmentioned
O,2.098,1.62,-2.898
B-PLACEmentioned,0.193,-0.354,4.523
I-PLACEmentioned,-0.079,0.215,4.215

Weight?,Feature,Unnamed: 2_level_0
Weight?,Feature,Unnamed: 2_level_1
Weight?,Feature,Unnamed: 2_level_2
+3.574,sent_rank[0]:1739,
+3.075,sent_rank[0]:3798,
+2.750,EOS:True,
+2.182,pos[0]|pos[1]|pos[2]:NE|VAFIN|ADV,
+2.156,BOS:True,
+2.156,rank[0]:0,
+2.054,isTitle[0]:False,
+2.039,suffix_short[0]:ge,
… 8656 more positive …,… 8656 more positive …,
… 7349 more negative …,… 7349 more negative …,

Weight?,Feature
+3.574,sent_rank[0]:1739
+3.075,sent_rank[0]:3798
+2.750,EOS:True
+2.182,pos[0]|pos[1]|pos[2]:NE|VAFIN|ADV
+2.156,BOS:True
+2.156,rank[0]:0
+2.054,isTitle[0]:False
+2.039,suffix_short[0]:ge
… 8656 more positive …,… 8656 more positive …
… 7349 more negative …,… 7349 more negative …

Weight?,Feature
+4.558,isInPlaceDic[0]:True
+2.404,suffix_long[0]:land
+1.783,prefix_short[0]:Nea
+1.603,suffix_short[0]:nd
+1.409,prefix_short[0]:Bon
+1.406,suffix_short[0]:ia
+1.184,prefix_long[0]:Etrur
+1.087,prefix_short[0]:Etr
+1.080,pos[-2]:NE
+0.986,suffix_short[1]:he

Weight?,Feature
+0.889,pos[-1]:NN
+0.872,suffix_short[-1]:on
+0.853,rank[0]:3
+0.733,sent_rank[0]:4799
+0.714,suffix_short[0]:so
+0.676,sent_rank[0]:6401
+0.612,sent_rank[0]:1551
+0.571,rank[0]:6
+0.532,prefix_short[-1]:Vil
+0.532,lemma[-1]:Villa


In [52]:
eli5.show_weights(t.crf, top=10, feature_re='^lemma',
                  horizontal_layout=False, show=['targets'])

Weight?,Feature
+1.202,lemma[1]:drucken
+1.034,lemma[-1]:zu
+0.748,lemma[0]:La
+0.681,lemma[0]:in
+0.633,lemma[1]:einmal
… 278 more positive …,… 278 more positive …
… 329 more negative …,… 329 more negative …
-0.646,lemma[-1]:Herr
-0.686,lemma[-1]:von
-0.843,lemma[0]:Annalen

Weight?,Feature
+0.561,lemma[-1]:@card@
+0.430,lemma[0]:26
+0.340,lemma[0]:9
+0.252,lemma[1]:1
+0.244,lemma[-1]:Gerhard
+0.195,"lemma[-1]:,"
+0.153,lemma[1]:<unknown>
+0.130,lemma[0]:30
+0.114,lemma[0]:25
… 21 more positive …,… 21 more positive …

Weight?,Feature
+1.493,lemma[0]:@card@
+0.480,lemma[-1]:9
+0.453,lemma[0]:R.
+0.404,lemma[1]:@card@
+0.373,lemma[-1]:27
+0.369,lemma[-1]:Juli
+0.289,lemma[-1]:26
… 34 more positive …,… 34 more positive …
… 20 more negative …,… 20 more negative …
-0.769,lemma[-1]:@card@

Weight?,Feature
+0.426,lemma[0]:Ende
+0.421,lemma[1]:sein
+0.347,lemma[0]:21
+0.328,lemma[1]:@card@
+0.259,lemma[1]:durchkommen
+0.238,lemma[-1]:die
+0.232,lemma[0]:Winter
+0.230,lemma[0]:Morgen
+0.195,lemma[1]:Autor
+0.193,lemma[0]:Anfang

Weight?,Feature
+0.547,lemma[-1]:21
+0.331,lemma[0]:April
+0.285,lemma[-1]:1861
+0.219,lemma[1]:\;
+0.216,lemma[1]:sollen
+0.212,lemma[-1]:Ende
+0.210,lemma[-1]:8
+0.209,lemma[0]:Monat
… 49 more positive …,… 49 more positive …
… 14 more negative …,… 14 more negative …

Weight?,Feature
+0.824,lemma[0]:Annalen
+0.781,lemma[-1]:Müller
+0.699,lemma[0]:Kunstblatt
+0.663,lemma[1]:schenken
+0.653,lemma[0]:Jahrbuch
+0.517,lemma[0]:<unknown>
+0.461,lemma[-1]:sein
+0.440,lemma[-1]:Bouillon
+0.416,lemma[1]:liegen
… 85 more positive …,… 85 more positive …

Weight?,Feature
+0.581,lemma[1]:von
+0.509,lemma[0]:Zeitung
+0.378,lemma[0]:Schrift
+0.302,lemma[0]:Museum
+0.285,lemma[1]:13
+0.254,lemma[-1]:die
+0.226,lemma[0]:Werk
+0.214,lemma[-1]:archäologisch
+0.211,lemma[0]:.
… 85 more positive …,… 85 more positive …

Weight?,Feature
+1.405,lemma[0]:Gemälde
+0.932,lemma[0]:<unknown>
+0.706,lemma[0]:Inschrift
+0.688,lemma[0]:Mosaik
+0.653,lemma[0]:Vase
+0.643,lemma[1]:können
+0.585,lemma[1]:erpressen
+0.561,lemma[1]:;
+0.517,lemma[1]:sein
+0.509,lemma[-1]:interessant

Weight?,Feature
+0.617,lemma[0]:Inschrift
+0.497,lemma[1]:beschreiben
+0.490,lemma[0]:Bronze
+0.469,lemma[0]:Spiegel
+0.390,lemma[-1]:mit
+0.296,lemma[0]:ten
+0.245,lemma[-1]:eine
+0.231,lemma[0]:Vase
+0.226,lemma[0]:Papa
… 80 more positive …,… 80 more positive …

Weight?,Feature
+2.344,lemma[0]:Institut
+0.864,lemma[-1]:in
+0.740,lemma[0]:Bibliothek
+0.690,lemma[0]:Akademie
+0.574,lemma[0]:Museum
+0.513,lemma[-1]:die
+0.431,lemma[1]:per
+0.385,lemma[0]:Via
+0.369,lemma[1]:Sammlung
+0.331,lemma[-1]:bei

Weight?,Feature
+0.374,lemma[0]:Sammlung
+0.358,lemma[-1]:<unknown>
+0.308,lemma[-1]:Via
+0.272,lemma[0]:Institut
+0.235,lemma[0]:Universität
+0.218,lemma[0]:die
+0.206,lemma[-1]:General
+0.195,lemma[-1]:die
+0.195,lemma[-1]:archäologisch
… 30 more positive …,… 30 more positive …

Weight?,Feature
+0.799,lemma[0]:Gerhard
+0.472,lemma[-1]:an
+0.440,lemma[-1]:Gerhard
+0.398,lemma[1]:Freund
+0.375,lemma[0]:Prof
+0.215,lemma[1]:darüber
+0.192,lemma[0]:Hochverehrte
+0.166,lemma[1]:Berlin
+0.108,lemma[-1]:@card@
… 15 more positive …,… 15 more positive …

Weight?,Feature
+0.306,lemma[-1]:Prof
+0.263,lemma[0]:Freund
+0.250,lemma[0]:Ancona
+0.218,lemma[-1]:.
+0.216,lemma[-1]:Freund
+0.111,lemma[-1]:Hochverehrte
+0.081,lemma[1]:Rom
+0.064,lemma[-1]:~
… 7 more positive …,… 7 more positive …
… 1 more negative …,… 1 more negative …

Weight?,Feature
+1.430,lemma[0]:Aug.
+0.941,lemma[-1]:@card@
+0.603,lemma[-1]:ihr
+0.539,lemma[0]:Braun
+0.450,lemma[1]:an
+0.428,lemma[1]:Braun
+0.391,lemma[-1]:<unknown>
+0.388,lemma[0]:B.
+0.339,lemma[0]:E
… 21 more positive …,… 21 more positive …

Weight?,Feature
+0.418,lemma[-1]:Emil
+0.380,lemma[1]:<unknown>
+0.343,lemma[0]:Braun
+0.292,lemma[-1]:.
+0.166,lemma[1]:Braun
+0.159,lemma[1]:an
+0.149,lemma[-1]:E
+0.134,lemma[0]:Brunn
+0.101,lemma[0]:.
+0.028,lemma[-1]:H

Weight?,Feature
+1.481,lemma[0]:<unknown>
+1.378,lemma[0]:Müller
+0.957,lemma[0]:Franzen
+0.901,lemma[0]:König
+0.843,lemma[-1]:alt
+0.828,lemma[-1]:bei
+0.798,lemma[0]:Prete
+0.743,lemma[-1]:dass
+0.715,lemma[0]:Prinz
… 214 more positive …,… 214 more positive …

Weight?,Feature
+0.708,lemma[-1]:Graf
+0.541,lemma[-1]:Baron
+0.485,lemma[0]:Schwester
+0.459,lemma[0]:<unknown>
+0.409,lemma[-1]:De
+0.390,lemma[-1]:von
+0.384,lemma[0]:Bruder
+0.327,lemma[1]:\;
+0.315,lemma[-1]:Camp
… 97 more positive …,… 97 more positive …

Weight?,Feature
+0.602,lemma[0]:Rom
+0.596,lemma[-1]:@card@
+0.511,lemma[0]:5
+0.294,lemma[0]:Berlin
+0.274,lemma[0]:Roma
+0.274,lemma[0]:19
+0.228,lemma[0]:28
+0.224,lemma[0]:Juli
… 26 more positive …,… 26 more positive …
… 2 more negative …,… 2 more negative …

Weight?,Feature
+0.601,lemma[-1]:Oktober
+0.386,lemma[0]:@card@
+0.318,lemma[-1]:28
+0.276,lemma[0]:Juli
+0.227,lemma[-1]:.
+0.209,lemma[1]:@card@
+0.191,lemma[-1]:19
+0.149,lemma[0]:Oktober
+0.121,lemma[-1]:5
… 7 more positive …,… 7 more positive …

Weight?,Feature
+0.973,lemma[0]:Neapel
+0.824,lemma[0]:Roma
+0.665,lemma[-1]:in
+0.604,lemma[0]:Italien
+0.463,lemma[0]:Pariser
+0.441,lemma[1]:nichts
+0.434,lemma[0]:Villa
+0.414,lemma[-1]:nach
+0.390,lemma[0]:Monte
… 53 more positive …,… 53 more positive …

Weight?,Feature
+0.532,lemma[-1]:Villa
+0.360,lemma[-1]:di
+0.305,lemma[-1]:Monte
+0.176,lemma[-1]:17
+0.152,lemma[-1]:'
+0.129,lemma[1]:obenstehend
+0.127,lemma[-1]:S
+0.117,lemma[1]:Aug.
+0.115,lemma[0]:S
… 17 more positive …,… 17 more positive …

Weight?,Feature
0.185,lemma[1]:28
0.082,lemma[-1]:an


# Final model

The final model is trained with the totality of the corpus.

In [41]:
tfinal = Trainer("../lib/config/korr_doc.json")

In [45]:
assert len(tfinal._corpus.sents()) == len(tfinal.training)

In [46]:
%%time
tfinal.set_feats_labels(template1)

CPU times: user 26 s, sys: 1.19 s, total: 27.2 s
Wall time: 46.4 s


In [47]:
%%time
tfinal.fit()

CPU times: user 4min 9s, sys: 8.87 s, total: 4min 18s
Wall time: 4min 31s


Now, let's write everything down

In [48]:
import pickle
pickle.dump( tfinal.crf, open( "../lib/models/korrespondez_model_stage3.pickle", "wb" ) )

(let's verify that everything went smoothly)

In [53]:
with open("../lib/models/korrespondez_model_stage3.pickle", "rb") as f:
    crf = pickle.load(f)

In [55]:
crf.predict(t.X_test)[0]

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

# Summary

All the previous operations in one cell

## With training/test splitting

Use the following cell if you want to do train/test splitting. Adjust the test percentage and the path to the config as you see fit

In [6]:
%%time

from templates import template1

t = Trainer("../lib/config/korr_nlp.json")
train,test = t.split()
t.train = train
t.test = test

t.set_feats_labels(template1)
t.fit()

ev = t.evaluate()

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)
  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


CPU times: user 7min 17s, sys: 2.14 s, total: 7min 19s
Wall time: 7min 19s


In [7]:
#inspect ev for the evaluation
print(ev["F1_general"])
print(ev["F1_class"])

0.970718309856
                  precision    recall  f1-score   support

  B-DATEanswered      0.917     0.917     0.917        12
  I-DATEanswered      0.885     0.885     0.885        26
    B-DATEletter      0.982     1.000     0.991       220
    I-DATEletter      0.983     0.997     0.990       580
 B-DATEmentioned      0.988     1.000     0.994        80
 I-DATEmentioned      1.000     0.983     0.991        58
  B-DATErecieved      1.000     0.952     0.976        21
  I-DATErecieved      1.000     0.932     0.965        59
           B-LIT      1.000     0.895     0.944        38
           I-LIT      1.000     0.611     0.759        54
           B-OBJ      1.000     0.940     0.969       182
           I-OBJ      1.000     0.700     0.824       120
  B-ORGaddressee      0.000     0.000     0.000         0
  I-ORGaddressee      0.000     0.000     0.000         0
  B-ORGmentioned      1.000     0.984     0.992       128
  I-ORGmentioned      1.000     0.787     0.881        4

## Train with the full corpus

Use the following cell if you want to use the full corpus for training

In [11]:
from training import Trainer
from templates import template1
import pickle

#t = Trainer("../lib/config/korr_nlp.json")
t = Trainer("../lib/config/korr_mac.json")

In [12]:
%%time
t.set_feats_labels(template1)
t.fit()

CPU times: user 16min 57s, sys: 32.7 s, total: 17min 29s
Wall time: 18min 45s


Use the following cell to write everything to a pickle file.

Don't forget to **update the output file name**!

In [13]:
outpath = '../lib/models/korrespondez_model_stage9.pickle'
with open(outpath, 'wb') as out:
    pickle.dump( t.crf, out)

---

In [38]:
def reorder(string):
    s = string.split("\n")
    lines = [l.split("\t") for l in s]
    previds = lines[0][0].split("-")
    #sentid = previds[0]
    for l in lines:
        currids = l[0].split("-")
        if currids[0] != previds[0]:
            ids = '{}-{}'.format( previds[0], int(previds[1]) + 1)
        else:
            ids = '-'.join(currids)
        previds = ids.split('-')
        print('{}\t{}'.format(ids, "\t".join(l[1:])))

In [77]:
s = '''1-1	431	CARD	@card@	_	O	O	
1-2	Gerhard	NE	Gerhard	_	B-webanno.custom.LetterEntity_	B-PERmentioned	
1-3	an	APPR	an	_	O	O	
1-4	Braun	NE	Braun	_	B-webanno.custom.LetterEntity_	B-PERaddressee	
1-5	Berlin	NE	Berlin	_	B-webanno.custom.LetterEntity_	B-PLACEfrom	
1-6	,	$,	,	_	O	O	
1-7	20	CARD	20	_	B-webanno.custom.LetterEntity_	B-DATEletter	
1-8	.	$.	.	_	I-webanno.custom.LetterEntity_	I-DATEletter	
2-1	Oktober	NN	Oktober	_	I-webanno.custom.LetterEntity_	I-DATEletter	
2-2	1844	CARD	1844	_	I-webanno.custom.LetterEntity_	I-DATEletter	
2-3	A.	APPR	an	_	O	O	
2-4	d.	ART	d.	_	O	O	
2-5	Dir.	NN	Dir.	_	O	O	'''

In [121]:
reorder(s)

1-1	431	CARD	@card@	_	O	O	
1-2	Gerhard	NE	Gerhard	_	B-webanno.custom.LetterEntity_	B-PERmentioned	
1-3	an	APPR	an	_	O	O	
1-4	Braun	NE	Braun	_	B-webanno.custom.LetterEntity_	B-PERaddressee	
1-5	Berlin	NE	Berlin	_	B-webanno.custom.LetterEntity_	B-PLACEfrom	
1-6	,	$,	,	_	O	O	
1-7	20	CARD	20	_	B-webanno.custom.LetterEntity_	B-DATEletter	
1-8	.	$.	.	_	I-webanno.custom.LetterEntity_	I-DATEletter	
1-9	Oktober	NN	Oktober	_	I-webanno.custom.LetterEntity_	I-DATEletter	
1-10	1844	CARD	1844	_	I-webanno.custom.LetterEntity_	I-DATEletter	
1-11	A.	APPR	an	_	O	O	
1-12	d.	ART	d.	_	O	O	
1-13	Dir.	NN	Dir.	_	O	O	


In [138]:
def pos_tag_sents(tokenized_sents):
    from treetagger import TreeTagger

    tagged_sents = []
    for i, s in enumerate(tokenized_sents):
        tt = TreeTagger(language='german')
        tags = [t for t in tt.tag(s) if len(t) > 1]
        tags = [tuple(tag + ["_", ""]) for tag in tags]
        tagged_sents.append(tags)
    return tagged_sents

In [139]:
pos_tag_sents(["Hallo , wie geht <> es dir?"])

[[('Hallo', 'ITJ', 'hallo', '_', ''),
  (',', '$,', ',', '_', ''),
  ('wie', 'KOUS', 'wie', '_', ''),
  ('geht', 'VVFIN', 'gehen', '_', ''),
  ('es', 'PPER', 'es', '_', ''),
  ('dir', 'PPER', 'du', '_', ''),
  ('?', '$.', '?', '_', '')]]

In [126]:
from treetagger import TreeTagger
tt = TreeTagger(language='german')

In [135]:
tags = tt.tag("Hallo , wie geht \<\> es dir?")

In [136]:
tags

[['Hallo', 'ITJ', 'hallo'],
 [',', '$,', ','],
 ['wie', 'KOUS', 'wie'],
 ['geht', 'VVFIN', 'gehen'],
 ['\\', 'XY', '\\'],
 ['<\\>'],
 ['es', 'PPER', 'es'],
 ['dir', 'PPER', 'du'],
 ['?', '$.', '?']]