<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Methodology" data-toc-modified-id="Methodology-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Methodology</a></span></li><li><span><a href="#Load-and-read-the-data" data-toc-modified-id="Load-and-read-the-data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load and read the data</a></span></li><li><span><a href="#Train-test-splitting" data-toc-modified-id="Train-test-splitting-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Train-test splitting</a></span></li></ul></li><li><span><a href="#Training" data-toc-modified-id="Training-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Preliminary-steps" data-toc-modified-id="Preliminary-steps-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Preliminary steps</a></span><ul class="toc-item"><li><span><a href="#Feature-Selection" data-toc-modified-id="Feature-Selection-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Feature Selection</a></span></li><li><span><a href="#Feature-and-label-extraction" data-toc-modified-id="Feature-and-label-extraction-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Feature and label extraction</a></span></li></ul></li><li><span><a href="#Initialize-the-model" data-toc-modified-id="Initialize-the-model-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Initialize the model</a></span></li><li><span><a href="#Model-Fitting" data-toc-modified-id="Model-Fitting-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Model Fitting</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model Evaluation</a></span><ul class="toc-item"><li><span><a href="#General-F1-Score" 
data-toc-modified-id="General-F1-Score-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>General F1 Score</a></span></li><li><span><a href="#F1-per-class" data-toc-modified-id="F1-per-class-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>F1 per class</a></span></li><li><span><a href="#Transitions-learned" data-toc-modified-id="Transitions-learned-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Transitions learned</a></span></li></ul></li><li><span><a href="#Inspect-with-eli_5" data-toc-modified-id="Inspect-with-eli_5-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Inspect with <code>eli_5</code></a></span></li><li><span><a href="#Conclusions-and-future-works" data-toc-modified-id="Conclusions-and-future-works-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusions and future works</a></span></li></ul></div>

# Introduction

This notebook documents the creation and evaluation of a CRF model to tag the entities in a digitized corpus of letters. The training corpus consists of the first and second volumes of the letters from E. Braun to Gerhard, which are part of the DFG-funded project *Gelehrte, Ausgräber und Kunsthändler: Die Korrespondenz des Instituto di Corrispondenza Archeologica als Wissensquelle und Netzwerkindikator*.

The code here assumes that both train and test corpora are already saved in a suitable IOB, CoNLL-like format at the path saved in the `root` variable.
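
To make the expected input more concrete, here is a minimal sketch of how a CoNLL-like IOB file could be parsed into sentences. It assumes one token per line, tab-separated columns and a blank line between sentences; the actual layout handled by `KorrIOBCorpusReader` may differ, and `parse_iob` is a hypothetical helper, not part of the project code.

```python
def parse_iob(text, sep="\t"):
    """Split CoNLL-like text into sentences, each a list of column tuples.

    Assumption: one token per line, columns separated by `sep`,
    sentences separated by blank lines.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(tuple(line.split(sep)))
    if current:  # last sentence when there is no trailing blank line
        sentences.append(current)
    return sentences


# Two short sentences built from tokens of the corpus sample shown in this notebook
sample = (
    "1832\tCARD\t@card@\tHEAD\tI-DATEletter\n"
    "\n"
    "Mein\tPPOSAT\tmein\t_\tO\n"
    "Herr\tNN\tHerr\t_\tO\n"
)
sents = parse_iob(sample)
print(len(sents), sents[0][0])
```

Keeping the tokens grouped by sentence at this stage is what later allows the train/test split to work on whole sentences.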

All the preprocessing work is documented and carried out in [another notebook](http://localhost:8888/notebooks/Korrespondenz/Annotation%20Workflow.ipynb)

In [8]:
import re
import os
from glob import glob
import pywebanno
from importlib import reload
from lxml import etree
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
#from idai_journals.nlp import tagDAI, tagDAINer, recognizeLang
#from idai_journals.utils import reg_tok
from collections import namedtuple, OrderedDict
import requests

In [12]:
import sys
sys.path.append("../")
from korr_corpusreader import KorrIOBCorpusReader

In [None]:
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

## Methodology

"Conditional Random Field" ([CRF](https://en.wikipedia.org/wiki/Conditional_random_field)) is a machine-learning method that is very often used in Named Entity Recognition ([NER](https://en.wikipedia.org/wiki/Named-entity_recognition)) tasks. It is well suited to linguistic data, where the context of a token (i.e. the features of the preceding and following tokens) is often very important for guessing the right tag.

See [here](https://en.wikipedia.org/wiki/Conditional_random_field) for an introduction to CRF.
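
For reference, a linear-chain CRF models the probability of a whole label sequence $y$ given the token sequence $x$ as shown here. This is the standard textbook formulation rather than anything specific to this notebook; the symbols $f_k$, $\lambda_k$ and $Z(x)$ are introduced only for illustration.

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
$$

Each feature function $f_k$ looks at the current label, the previous label and the observed tokens. The learned weights $\lambda_k$ correspond to the feature and transition weights inspected in the evaluation part of this notebook.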

In what follows we perform a very simple train/test splitting on the manually annotated letters of the [first volume of letters from Braun to Gerhard](http://arachne.uni-koeln.de/item/buch/2892) (1832-35). The original annotation is stored in [Webanno](http://nlp.dainst.org:18080/webanno/).

## Load and read the data

In [14]:
newcols = ["words", "pos", "lemma", "textlayer", "chunk", "entityid"]
corpus = KorrIOBCorpusReader("../data/IOB/", r".*\.iob", columntypes=newcols)

With the previous code we have loaded the data that are saved on one of the DAI's servers (nlp.dainst.org). The corpus includes the following total number of annotated pages:

In [15]:
len(corpus.fileids())

653

Here is one example of a file:

In [16]:
corpus.fileids()[-1]

'3_Brunn1858_page241.iob'

And here are a few annotated words as read by our corpus reader:

In [17]:
corpus.full_tagged_words()[15:30]

[('1832', 'CARD', '@card@', 'HEAD', 'I-DATEletter'),
 ('Mein', 'PPOSAT', 'mein', '_', 'O'),
 ('freundlichster', 'ADJA', '<unknown>', '_', 'O'),
 ('Herr', 'NN', 'Herr', '_', 'O'),
 ('Professor', 'NN', 'Professor', '_', 'O'),
 ('Die', 'ART', 'die', '_', 'O'),
 ('Wohlthat', 'NE', '<unknown>', '_', 'O'),
 ('und', 'KON', 'und', '_', 'O'),
 ('Annehmlichkeit', 'NN', 'Annehmlichkeit', '_', 'O'),
 ('in', 'APPR', 'in', '_', 'O'),
 ('Ihrer', 'PPOSAT', 'ihr', '_', 'O'),
 ('nächsten', 'ADJA', 'nah', '_', 'O'),
 ('Nähe', 'NN', 'Nähe', '_', 'O'),
 ('zu', 'APPR', 'zu', '_', 'O'),
 ('logiren', 'NN', '<unknown>', '_', 'O')]

The length (in token number) of our corpus is:

In [18]:
len(corpus.words())

137197

## Train-test splitting

How do you evaluate the performance of a trained model? The simplest answer is to split your manually annotated data into a `train` and a `test` part: the model trained on the `train` part is then evaluated on the `test` part, which it has not seen during training.

If you want to learn more about this approach to the problem I suggest this very nice [introductory tutorial](https://www.youtube.com/watch?v=0pP4EwWJgIU&index=5&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A).

CoNLL-like file structures have the peculiarity that they group the tokens into sentences, which must be kept together for the sake of linguistic coherence. Thus, methods of train-test splitting based on the randomized selection of tokens don't work. We have to come up with a custom solution.

First, then, we define two functions:
* one that splits the sentences into a training and a testing set, reserving a given percentage of the tokens for the test
* one that counts the number of tokens in the set

In [19]:
def splitTrainTest(iob_corpus, perc_test=0.2):
    """Randomly assign whole sentences to a test set (about perc_test of the tokens) and a training set."""
    import random
    import numpy as np

    assert 0 < perc_test < 1, "The test percentage must be a value between 0 and 1"
    c = 0  # number of tokens already assigned to the test set
    max_test_len = int(np.ceil(len(iob_corpus.words()) * perc_test))
    corpus_train, corpus_test = [], []
    sents = list(iob_corpus.full_tagged_sents())
    random.shuffle(sents)
    for sent in sents:
        # Fill the test set first, then send the remaining sentences to training
        if c < max_test_len:
            corpus_test.append(sent)
            c = c + len(sent)
        else:
            corpus_train.append(sent)
    return corpus_train, corpus_test

In [18]:
def countSentenceTokens(sentence_list):
    return len([t for s in sentence_list for t in s])

Now we split the corpus into a collection of randomly selected sentences used for training and another collection, roughly 20% of the tokens, used as the test set.

In [19]:
train, test = splitTrainTest(corpus)
countSentenceTokens(train)

70557

The number above is the total number of tokens in the `train` corpus.

Do note that this function is based on a random generator: if you re-run it, you'll get a totally different sequence of sentences.
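
If a repeatable split is needed, one option is to shuffle with a dedicated, seeded random generator. The sketch below mirrors the logic of `splitTrainTest` on a plain list of sentences; `split_sentences` and its `seed` parameter are illustrative names, not part of the notebook's code.

```python
import math
import random


def split_sentences(sents, perc_test=0.2, seed=42):
    """Shuffle sentences reproducibly and reserve about perc_test of the tokens for testing."""
    assert 0 < perc_test < 1, "The test percentage must be a value between 0 and 1"
    sents = list(sents)
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(sents)
    n_tokens = sum(len(s) for s in sents)
    max_test_len = math.ceil(n_tokens * perc_test)
    train, test, n_test = [], [], 0
    for sent in sents:
        if n_test < max_test_len:
            test.append(sent)
            n_test += len(sent)
        else:
            train.append(sent)
    return train, test


# Toy "sentences" of different lengths; the same seed gives the same split
toy = [["w"] * n for n in (3, 5, 2, 4, 6)]
train_a, test_a = split_sentences(toy, seed=1)
train_b, test_b = split_sentences(toy, seed=1)
print(train_a == train_b and test_a == test_b)
```

Recording the seed together with the scores makes it possible to reproduce a given run later.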

# Training

A model is trained by feeding a series of **features** (conventionally named `X`: capital letters are for multidimensional arrays) and a corresponding list of **labels** (`y`: lower-case letters for lists) to the `fit` method of the model.

Thus, the process requires the following steps:
* we initialize the model
* we extract the features and labels for training and testing
* we fit the model

Then we're ready to move to the evaluation part, where we check the performance of our model.

## Preliminary steps

Before we can initialize the model, we need to:

1. define what features we are going to base our model on
2. extract the features and the labels

This is the most crucial part of our work. The features are a list of characteristics of each token that we want to annotate (e.g. its length, its last three letters, whether it contains any digit or only alphabetic characters...) that we think are important for the classifier.

In a nutshell, we are going to transform our list of tokens into a list of "bags of features". The classifier will study these and, at the same time, keep an eye on the target label that was assigned to each "bag of features". In this way, it will learn which predictors are the most important for classifying a sentence that it has never seen before.

Now the question, obviously, becomes: what are the most relevant features that we can extract from our corpus?

### Feature Selection

In [21]:
"is"[-4:]

'is'

This list of features is preliminary. If I had to make an educated guess, I'd say (based also on previous works on German NER) that the following features will be important.

Do keep in mind that, quite apart from the characteristics of *each* token in itself, typical CRF features also include the characteristics of a window of words before and after the target token. So, for instance, the word **"is"** in the sentence "Today is a nice day" might be described by the following features:

```python
{
    'word.rank': 1,  # we count from 0!
    'word.lower()': "is",
    'word.isupper()': False,  # the word is not uppercase
    'word.istitle()': False,  # the word is not title case
    'word.isdigit()': False,  # the word is not a number
    'word.hasdigit()': False,  # the word has no number in it
    'word.endswithdigit()': False,
    'word.isInPersonDic': False,
    'word.isInPlacesDic': False,
    'postag': "VB",
    'lemma': "be",
    # Preceding word
    '-1:word.lower()': "today",
    '-1:word.postag': "ADV",
    # Following word
    '+1:word.lower()': "a",
    '+1:word.postag': "DET",
}
```

So, what features do we select? In a first round, we work with the following:

* Part of Speech (as tagged by the POS-tagger [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/))
* case of the target, preceding and following tokens (whether they're lowercase, title case or all upper case)
* if a token is, has or ends with a digit
* suffix long (last 4 characters) and short (last 2)
* prefix long (first 4 characters) and short (first 2)
* lemma
* whether a token is in one of the lists of places and persons compiled by Marina Unger
* chain of the current + following/preceding 2 postags


All the above features are extracted for the current token and for those in the window of +1 and -1 (i.e. the following and preceding words).
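
As an illustration of how such a feature dictionary might be built, here is a reduced sketch for sentences made of `(word, postag, lemma)` tuples. It is not the code of the `crf_models` module, which is not reproduced in this notebook: the function name `token_features` and the exact feature keys are assumptions modelled on the example above, and the dictionary lookups and postag chains are left out.

```python
def token_features(sent, i):
    """Return a feature dict for token i of a sentence of (word, postag, lemma) tuples."""
    word, postag, lemma = sent[i]
    features = {
        "word.lower()": word.lower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "word.hasdigit()": any(ch.isdigit() for ch in word),
        "word.endswithdigit()": word[-1:].isdigit(),
        "word[-4:]": word[-4:],
        "word[-2:]": word[-2:],
        "postag": postag,
        "lemma": lemma,
    }
    # Context window: a subset of the same information for the neighbours
    for offset, prefix in ((-1, "-1:"), (1, "+1:")):
        j = i + offset
        if 0 <= j < len(sent):
            n_word, n_postag, _ = sent[j]
            features[prefix + "word.lower()"] = n_word.lower()
            features[prefix + "postag"] = n_postag
        else:
            # Mark sentence boundaries (BOS = beginning, EOS = end of sentence)
            features["BOS" if offset < 0 else "EOS"] = True
    return features


sent = [("Today", "ADV", "today"), ("is", "VB", "be"), ("a", "DET", "a")]
X_sent = [token_features(sent, i) for i in range(len(sent))]
print(X_sent[1]["-1:word.lower()"], X_sent[1]["+1:postag"])
```

One such dictionary per token, grouped by sentence, gives the kind of `X` structure that is passed to `fit`; the matching `y` is a list of label lists of the same shape.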

The code to extract the features and labels is defined in the following module, which is also installed on the NLP machine. The dictionaries are saved under `data/dictionaries`.

(BTW as the process is still experimental, it's not a bad idea to import also `reload`, so that we might reload the module in case we have to do any fix on the fly)

In [26]:
import crf_models
from importlib import reload

In [27]:
dics = {"persons" : "data/dictionaries/persons.txt", "places" : 'data/dictionaries/places.txt'}

### Feature and label extraction

Now we're ready to transform our 88k and something words in the train and test corpora into a bag of features and to extract also the list of labels.

Let us relax: it may take a while...

(see the output below under "Wall time" to know precisely how long it took)

In [30]:
%%time
X_train = [crf_models.sent2features(s, sindex, dics) for sindex,s in enumerate(train)]
y_train = [crf_models.sent2SimplifiedLabel(s) for s in train]

X_test = [crf_models.sent2features(s, sindex, dics) for sindex,s in enumerate(test)]
y_test = [crf_models.sent2SimplifiedLabel(s) for s in test]

CPU times: user 1min 4s, sys: 2.04 s, total: 1min 6s
Wall time: 1min 7s


## Initialize the model

In [31]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)

## Model Fitting

This is the step where we actually put our model to work to study the training set. *This* may actually take a while...

In [32]:
%%time
crf.fit(X_train, y_train)

CPU times: user 56 s, sys: 39.6 ms, total: 56 s
Wall time: 56 s


CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

# Model Evaluation

In this stage, we are going to compare the predictions of our trained model with the manual "gold standard" that we saved in the `y_test` variable before.

## General F1 Score

We follow [this very good tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html#let-s-use-conll-2002-data-to-build-a-ner-system) in the evaluation steps. The first thing we do is compute an F1 score for all the labels except `O`.

In [33]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-PERmentioned',
 'B-OBJ',
 'I-OBJ',
 'B-PERauthor',
 'B-PERaddressee',
 'B-PLACEfrom',
 'B-DATEletter',
 'I-DATEletter',
 'I-PERmentioned',
 'B-PLACEmentioned',
 'B-ORGmentioned',
 'I-ORGmentioned',
 'B-LIT',
 'I-PERauthor',
 'B-DATEmentioned',
 'I-DATEmentioned',
 'I-LIT',
 'I-PERaddressee',
 'I-PLACEmentioned',
 'I-PLACEfrom']

It's quite a long list of labels! I am very much convinced that we can do a lot better if we fuse some categories (e.g. the different types of DATES and OBJECTS) in a single one. The distinctions can be then managed manually
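
A minimal sketch of such a fusion, assuming that the coarse category is the upper-case part of each label (`DATE`, `PER`, `PLACE`, ...) and that the lower-case suffix (`letter`, `mentioned`, `author`, ...) marks the subtype. `merge_label` is a hypothetical helper and not part of `crf_models`.

```python
import re

# B-/I- prefix, upper-case coarse type, optional lower-case subtype
LABEL_RE = re.compile(r"^([BI])-([A-Z]+)[a-z]*$")


def merge_label(label):
    """Collapse e.g. 'B-DATEletter' and 'B-DATEmentioned' into 'B-DATE'."""
    match = LABEL_RE.match(label)
    if match is None:
        return label  # 'O' and anything unexpected is left untouched
    return "{}-{}".format(match.group(1), match.group(2))


y_sent = ["B-PLACEfrom", "B-DATEletter", "I-DATEletter", "O", "B-LIT"]
print([merge_label(label) for label in y_sent])
```

Applied to the label lists before fitting, a mapping of this kind would reduce the 20 labels listed above to 12.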

In [34]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, 
                      average='weighted', labels=labels)

0.83471825300233349

All in all, our model reached the above score, a weighted F1 over all labels except `O`.

An F1 score is the harmonic mean of [precision](https://en.wikipedia.org/wiki/Precision_and_recall#Precision) (very roughly: of all the tokens that the model tagged with a label, how many really carry it?) and [recall](https://en.wikipedia.org/wiki/Precision_and_recall#Recall) (very roughly: of all the tokens that really carry a label, how many did the model find?).
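
These definitions can be made concrete with a few lines of Python. The counts in the example are made up for illustration and do not come from the corpus.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 entities found correctly, 2 spurious, 8 missed
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Because it is a harmonic mean, the F1 score stays closer to the lower of the two values, so a class with high precision but poor recall still gets a modest score.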

Remember that (as the train and test are generated randomly) you might obtain different results. A more formal and robust evaluation would replicate this randomization and the above steps something like 10 times and then average the F1 scores.
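
Such a repeated evaluation could be organised as sketched below. `run_once` stands for a hypothetical function that, given a seed, performs the split, the feature extraction and the fitting and returns the weighted F1. It is not defined in this notebook, so a dummy stand-in with made-up values is used here to keep the sketch runnable.

```python
import statistics


def repeated_evaluation(run_once, n_runs=10):
    """Call run_once(seed) for n_runs different seeds and summarise the scores."""
    scores = [run_once(seed) for seed in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores), scores


def dummy_run(seed):
    # Stand-in for split + fit + flat_f1_score; returns made-up values
    return 0.80 + 0.01 * (seed % 3)


mean_f1, std_f1, all_scores = repeated_evaluation(dummy_run, n_runs=10)
print("mean F1: %.3f (+/- %.3f)" % (mean_f1, std_f1))
```

Reporting the standard deviation next to the mean shows how much of the score depends on the particular random split.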

## F1 per class

On average the model reaches a weighted F1 of about 0.835. But how is this score distributed across the classes? Let us now break it down class by class.

In [35]:
# group B and I results
sorted_labels = sorted(
    labels, 
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

                  precision    recall  f1-score   support

    B-DATEletter      0.972     0.972     0.972       106
    I-DATEletter      0.970     0.983     0.976       294
 B-DATEmentioned      0.929     0.619     0.743        21
 I-DATEmentioned      0.692     0.692     0.692        39
           B-LIT      0.774     0.453     0.571        53
           I-LIT      0.850     0.274     0.415        62
           B-OBJ      0.737     0.459     0.566        61
           I-OBJ      0.647     0.393     0.489        56
  B-ORGmentioned      0.778     0.778     0.778        45
  I-ORGmentioned      0.588     0.345     0.435        29
  B-PERaddressee      0.980     1.000     0.990        99
  I-PERaddressee      1.000     1.000     1.000        14
     B-PERauthor      0.982     0.982     0.982       111
     I-PERauthor      1.000     1.000     1.000         5
  B-PERmentioned      0.901     0.834     0.866       416
  I-PERmentioned      0.723     0.580     0.644        81
     B-PLACEf

## Transitions learned

The following code shows the transitions to which the model assigned the highest and the lowest weights. We see that the most likely transition learned in training is the one from the first token of a letter date (`B-DATEletter`) to its continuation (`I-DATEletter`).

In [36]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
B-DATEletter -> I-DATEletter 3.998286
I-DATEletter -> I-DATEletter 3.217833
B-LIT  -> I-LIT   3.025552
I-DATEmentioned -> I-DATEmentioned 2.850486
O      -> O       2.801975
B-OBJ  -> I-OBJ   2.663603
B-DATEmentioned -> I-DATEmentioned 2.592678
I-OBJ  -> I-OBJ   2.552789
B-PLACEfrom -> I-PLACEfrom 2.521834
I-LIT  -> I-LIT   2.304620
I-PERmentioned -> I-PERmentioned 2.260970
B-ORGmentioned -> I-ORGmentioned 2.227603
B-PERaddressee -> I-PERaddressee 2.140080
I-PERaddressee -> I-PERaddressee 2.105266
I-PLACEmentioned -> I-PLACEmentioned 2.080696
I-PERauthor -> I-PERauthor 1.895762
B-PLACEfrom -> B-DATEletter 1.870470
B-PERauthor -> I-PERauthor 1.632804
B-PERmentioned -> I-PERmentioned 1.631260
I-ORGmentioned -> I-ORGmentioned 1.603498

Top unlikely transitions:
B-PERmentioned -> I-ORGmentioned -2.569072
O      -> I-PLACEfrom -2.573042
I-DATEletter -> I-DATEmentioned -2.601809
I-DATEmentioned -> I-DATEletter -2.611031
I-LIT  -> B-PLACEmentioned -2.669261
I-OBJ  -> B

# Inspect with `eli_5`

Let us now visualize all the previous evaluation with a dedicated package. We will be following [this tutorial](http://eli5.readthedocs.io/en/latest/tutorials/sklearn_crfsuite.html#inspect-model-weights)

In [37]:
import eli5

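The following code visualizes the weights of the transitions between the given tags, as well as the most relevant features for each of them. The output consists of a transition matrix followed by one feature table per tag, in the order given in `targets`.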

In [38]:
eli5.show_weights(crf, top=10, targets=['O', 'B-OBJ', 'I-OBJ', "B-PERmentioned", "I-PERmentioned", "B-PERauthor", "I-PERauthor"])

From \ To,O,B-OBJ,I-OBJ,B-PERmentioned,I-PERmentioned,B-PERauthor,I-PERauthor
O,2.802,0.0,-5.295,0.001,-5.279,0.006,-3.767
B-OBJ,-0.064,-1.495,2.664,-2.067,-1.688,-0.544,-0.516
I-OBJ,-0.017,-1.918,2.553,-2.683,-2.166,0.0,-0.462
B-PERmentioned,0.0,-0.421,-3.668,-3.292,1.631,-1.69,-2.021
I-PERmentioned,-0.024,-1.153,-2.036,-1.47,2.261,-0.925,-0.434
B-PERauthor,-0.0,-0.321,-1.018,-0.402,-1.091,-1.205,1.633
I-PERauthor,-1.109,0.0,-0.76,-0.702,-1.408,-0.004,1.896

Weight?,Feature
+3.759,postag_chain_fwd:$.-0-0
+3.332,postag:PPER
+3.269,BOS
+3.182,EOS
+2.944,postag:VVPP
+2.800,"postag_chain_fwd:$,-CARD-$."
+2.784,postag:VVINF
+2.744,postag_chain_fwd:CARD-NN-APPR
… 9785 more positive …,… 9785 more positive …
… 4019 more negative …,… 4019 more negative …

Weight?,Feature
+3.293,word[-4:]:vase
+2.182,"postag_chain_bck:CARD-$,-NE"
+1.748,postag_chain_fwd:NN-$.-APPR
+1.729,word[:3]:Gem
+1.646,word[:3]:Aet
+1.639,word[-2:]:se
+1.564,-1:word[-4:]:anas
+1.564,-1:word.lower():campanas
+1.484,word[-4:]:egel
… 1264 more positive …,… 1264 more positive …

Weight?,Feature
+1.483,word[-4:]:iste
+1.370,word[:5]:musai
+1.370,word.lower():musaik
+1.370,word[-4:]:saik
+1.338,-1:word.lower():nr
+1.280,+1:word.lower():beschrieben
+1.279,+1:lemma:beschreiben
… 1567 more positive …,… 1567 more positive …
… 158 more negative …,… 158 more negative …
-1.319,BOS

Weight?,Feature
+2.495,word[-4:]:ands
+2.369,word[-4:]:aris
+2.363,word[:5]:Caval
+2.239,word[-2:]:ni
+2.239,word[:5]:Panof
+2.209,postag_chain_bck:VAFIN-APPR-NE
+2.161,word[:3]:Min
+2.081,"postag_chain_bck:NN-$,-NE"
+2.045,word.isInPersonDic
… 3235 more positive …,… 3235 more positive …

Weight?,Feature
+1.898,postag_chain_bck:ART-NE-NE
+1.873,"postag_chain_fwd:$.-NE-$,"
+1.631,+1:word[-2:]:io
+1.448,word.isInPersonDic
+1.400,-1:word.lower():de
+1.362,postag_chain_fwd:$.-NN-APPR
+1.303,word[-2:]:os
… 2098 more positive …,… 2098 more positive …
… 267 more negative …,… 267 more negative …
-1.401,+1:postag:$(

Weight?,Feature
+3.534,postag_chain_fwd:NN-0-0
+1.896,postag_chain_bck:0-PPOSAT-NN
+1.709,postag_chain_bck:0-CARD-NN
+1.187,-1:word[-2:]:Ce
+1.187,word[:5]:Secon
+1.187,-1:word[-4:]:eCCe
+1.187,word.lower():secondis
+1.187,word[-4:]:ndis
… 241 more positive …,… 241 more positive …
… 70 more negative …,… 70 more negative …

Weight?,Feature
+1.313,+1:word[-2:]:un
+1.116,-1:word.isupper()
+0.802,-1:word.lower():emil
+0.802,-1:lemma:Emil
+0.802,-1:word[-4:]:Emil
+0.771,word[-2:]:un
+0.739,+1:word.lower():campanari
… 128 more positive …,… 128 more positive …
… 28 more negative …,… 28 more negative …
-0.801,-1:lemma:<unknown>

Calling `show_weights` without `targets` lists the feature weights for every class, and `show=['targets']` restricts the output to the feature tables, leaving out the transition matrix. The 21 tables in the output start with `O` and then appear to follow the same grouping as the classification report above (`B-DATEletter`, `I-DATEletter`, `B-DATEmentioned`, and so on).


In [40]:
eli5.show_weights(crf, top=10,
                  horizontal_layout=False, show=['targets'])

Weight?,Feature
+3.759,postag_chain_fwd:$.-0-0
+3.332,postag:PPER
+3.269,BOS
+3.182,EOS
+2.944,postag:VVPP
+2.800,"postag_chain_fwd:$,-CARD-$."
+2.784,postag:VVINF
+2.744,postag_chain_fwd:CARD-NN-APPR
… 9785 more positive …,… 9785 more positive …
… 4019 more negative …,… 4019 more negative …

Weight?,Feature
+1.749,"postag_chain_bck:NE-$,-CARD"
+1.437,postag_chain_fwd:VVFIN-CARD-$.
+1.279,"postag_chain_fwd:$,-NN-CARD"
+1.130,postag_chain_bck:APPR-NE-NN
+1.056,postag_chain_fwd:CARD-NN-CARD
+0.933,postag_chain_fwd:ADJA-NN-CARD
+0.809,postag_chain_fwd:NN-ADV-KOUS
… 358 more positive …,… 358 more positive …
… 86 more negative …,… 86 more negative …
-0.864,-1:postag:$.

Weight?,Feature
+2.217,word[:3]:183
+1.836,"postag_chain_bck:$,-CARD-$."
+1.746,postag_chain_fwd:$.-0-0
+1.722,lemma:@card@
+1.572,postag_chain_bck:CARD-NE-CARD
+1.492,postag_chain_bck:$.-NN-CARD
… 761 more positive …,… 761 more positive …
… 232 more negative …,… 232 more negative …
-1.198,-1:lemma:@card@
-1.237,-1:postag:NE

Weight?,Feature
+1.341,+1:word[-2:]:ch
+1.225,+1:word.lower():october
+1.129,-1:postag:APPRART
+1.075,-1:word.lower():den
+1.072,word[-2:]:28
… 452 more positive …,… 452 more positive …
… 92 more negative …,… 92 more negative …
-1.014,word.istitle()
-1.112,-1:postag:NN
-1.191,-1:lemma:<unknown>

Weight?,Feature
+2.125,word[-2:]:34
+1.062,postag_chain_bck:CARD-$.-NN
+1.017,-1:word[-2:]:ag
+1.007,postag_chain_bck:ART-CARD-$.
+0.960,word[:3]:Oct
+0.933,+1:postag:APPR
+0.924,postag_chain_bck:CARD-$(-CARD
… 737 more positive …,… 737 more positive …
… 98 more negative …,… 98 more negative …
-0.954,+1:postag:$(

Weight?,Feature
+2.531,postag_chain_bck:PPER-ART-ADJA
+2.364,word[-4:]:hnis
+2.116,word[-4:]:rien
+2.075,word[-4:]:ogie
+2.025,word[:5]:Durch
+1.946,word.lower():durchzeichnungen
+1.788,word[:3]:Mon
+1.678,word[:3]:Dur
+1.657,word[-4:]:satz
+1.650,word.lower():institutswerke

Weight?,Feature
+1.504,word[:3]:Bau
+1.428,"postag_chain_fwd:$.-$,-ADV"
+1.367,postag_chain_fwd:APPR-NE-VAFIN
+1.273,postag_chain_bck:0-ADJA-NN
+1.224,word[-2:]:tt
+1.189,postag_chain_bck:$(-NN-$.
+1.154,postag_chain_fwd:NE-VAFIN-PTKNEG
+1.005,word[:5]:itali
+1.005,word[:3]:ita
… 2114 more positive …,… 2114 more positive …

Weight?,Feature
+3.293,word[-4:]:vase
+2.182,"postag_chain_bck:CARD-$,-NE"
+1.748,postag_chain_fwd:NN-$.-APPR
+1.729,word[:3]:Gem
+1.646,word[:3]:Aet
+1.639,word[-2:]:se
+1.564,-1:word[-4:]:anas
+1.564,-1:word.lower():campanas
+1.484,word[-4:]:egel
… 1264 more positive …,… 1264 more positive …

Weight?,Feature
+1.483,word[-4:]:iste
+1.370,word[:5]:musai
+1.370,word.lower():musaik
+1.370,word[-4:]:saik
+1.338,-1:word.lower():nr
+1.280,+1:word.lower():beschrieben
+1.279,+1:lemma:beschreiben
… 1567 more positive …,… 1567 more positive …
… 158 more negative …,… 158 more negative …
-1.319,BOS

Weight?,Feature
+2.046,word[:3]:Gov
+1.960,lemma:Institut
+1.915,-1:lemma:bei
+1.832,word[:3]:Mus
+1.481,word[-2:]:ät
+1.358,postag_chain_bck:VVINF-KON-NN
+1.246,postag_chain_fwd:NN-VAFIN-PTKNEG
+1.152,word.lower():gerstäcker
… 710 more positive …,… 710 more positive …
… 135 more negative …,… 135 more negative …

Weight?,Feature
+1.659,postag_chain_bck:NN-$.-NN
+1.622,word[-4:]:lung
+1.449,-1:word[-4:]:gica
+1.449,-1:word.lower():archeologica
+1.439,-1:word.lower():coll
+1.439,-1:word[-4:]:Coll
+1.431,word.lower():deutschen
+1.337,-1:word[-2:]:st
+1.271,word[:3]:Deu
+1.271,word[:5]:Deuts

Weight?,Feature
+1.790,-1:word.isInPersonDic
+1.664,postag_chain_fwd:NN-NE-APPR
+1.283,-1:word[-2:]:40
+0.960,postag_chain_bck:ADJA-CARD-NN
+0.836,lemma:Gerhard
+0.836,word.lower():gerhard
+0.836,word[-4:]:hard
… 176 more positive …,… 176 more positive …
… 43 more negative …,… 43 more negative …
-1.113,+1:postag:CARD

Weight?,Feature
+2.075,postag_chain_fwd:$.-NE-NE
+1.597,postag_chain_bck:NE-$.-NE
+0.912,postag_chain_bck:CARD-NN-NE
+0.872,+1:word[-4:]:hard
+0.796,+1:word[-2:]:rd
+0.746,postag_chain_fwd:NE-APPR-ADJA
+0.673,-1:word.lower():freund
… 165 more positive …,… 165 more positive …
… 26 more negative …,… 26 more negative …
-0.728,-1:word.isInPersonDic

Weight?,Feature
+3.534,postag_chain_fwd:NN-0-0
+1.896,postag_chain_bck:0-PPOSAT-NN
+1.709,postag_chain_bck:0-CARD-NN
+1.187,-1:word[-2:]:Ce
+1.187,word[:5]:Secon
+1.187,-1:word[-4:]:eCCe
+1.187,word.lower():secondis
+1.187,word[-4:]:ndis
… 241 more positive …,… 241 more positive …
… 70 more negative …,… 70 more negative …

Weight?,Feature
+1.313,+1:word[-2:]:un
+1.116,-1:word.isupper()
+0.802,-1:word.lower():emil
+0.802,-1:lemma:Emil
+0.802,-1:word[-4:]:Emil
+0.771,word[-2:]:un
+0.739,+1:word.lower():campanari
… 128 more positive …,… 128 more positive …
… 28 more negative …,… 28 more negative …
-0.801,-1:lemma:<unknown>

Weight?,Feature
+2.495,word[-4:]:ands
+2.369,word[-4:]:aris
+2.363,word[:5]:Caval
+2.239,word[-2:]:ni
+2.239,word[:5]:Panof
+2.209,postag_chain_bck:VAFIN-APPR-NE
+2.161,word[:3]:Min
+2.081,"postag_chain_bck:NN-$,-NE"
+2.045,word.isInPersonDic
… 3235 more positive …,… 3235 more positive …

Weight?,Feature
+1.898,postag_chain_bck:ART-NE-NE
+1.873,"postag_chain_fwd:$.-NE-$,"
+1.631,+1:word[-2:]:io
+1.448,word.isInPersonDic
+1.400,-1:word.lower():de
+1.362,postag_chain_fwd:$.-NN-APPR
+1.303,word[-2:]:os
… 2098 more positive …,… 2098 more positive …
… 267 more negative …,… 267 more negative …
-1.401,+1:postag:$(

Weight?,Feature
+1.960,word.isInPlacesDic
+1.922,"postag_chain_fwd:NE-$,-CARD"
+1.889,postag_chain_fwd:NE-VVFIN-CARD
+1.696,word[:3]:Rom
+1.610,postag_chain_fwd:NE-CARD-NE
+1.433,"postag_chain_fwd:NE-$(-$,"
… 218 more positive …,… 218 more positive …
… 29 more negative …,… 29 more negative …
-1.048,+1:postag:NE
-1.073,-1:postag:CARD

Weight?,Feature
+0.864,+1:postag:ADJA
+0.750,+1:word[-4:]:ruar
+0.750,+1:word.lower():februar
+0.750,+1:lemma:Februar
+0.673,postag_chain_fwd:$.-NN-ADJA
+0.662,postag:ADJA
+0.573,postag_chain_bck:$.-CARD-$.
+0.490,-1:word[-4:]:ruar
… 52 more positive …,… 52 more positive …
… 11 more negative …,… 11 more negative …

Weight?,Feature
+4.590,word.isInPlacesDic
+2.898,word[-4:]:land
+2.369,word[-2:]:ia
+1.876,word[:3]:Bon
+1.678,postag_chain_fwd:NE-ART-NN
+1.601,word[:5]:Capit
… 762 more positive …,… 762 more positive …
… 226 more negative …,… 226 more negative …
-1.365,-1:postag:ADJA
-1.471,-1:word[-2:]:en

Weight?,Feature
+0.991,postag_chain_bck:NN-$.-NE
+0.911,-1:postag:NN
+0.863,-1:lemma:Villa
+0.863,-1:word.lower():villa
+0.839,-1:word[-4:]:illa
+0.827,postag_chain_fwd:$(-NE-KOKOM
+0.827,postag_chain_bck:VVFIN-$(-NE
+0.787,postag_chain_fwd:NE-KOKOM-NN
+0.763,+1:word[-2:]:ie
… 249 more positive …,… 249 more positive …


# Conclusions and future works

As can be seen (especially for the objects and persons), some lexical features tied to frequent words carry a lot of weight: the suffix "-un" for the name Braun, the preceding word "emil", or the suffix "-vase" and the word "musaik" for objects that are mentioned often. This may point to a certain "overfitting" of our model. It remains to be seen how these features will perform with a different document set in our corpus.

The next stage will be to annotate a corpus of letters involving two different persons and from a later date.