<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Annotating-data" data-toc-modified-id="Annotating-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Annotating data</a></span><ul class="toc-item"><li><span><a href="#Encoding-scheme" data-toc-modified-id="Encoding-scheme-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Encoding scheme</a></span><ul class="toc-item"><li><span><a href="#IO-encoding" data-toc-modified-id="IO-encoding-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>IO encoding</a></span></li><li><span><a href="#IOB-and-IOB2-encoding" data-toc-modified-id="IOB-and-IOB2-encoding-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>IOB and IOB2 encoding</a></span></li><li><span><a href="#BILUO-encoding" data-toc-modified-id="BILUO-encoding-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>BILUO encoding</a></span></li><li><span><a href="#Conversion-between-encodings" data-toc-modified-id="Conversion-between-encodings-2.1.4"><span class="toc-item-num">2.1.4&nbsp;&nbsp;</span>Conversion between encodings</a></span></li></ul></li><li><span><a href="#Code-part" data-toc-modified-id="Code-part-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Code part</a></span></li></ul></li><li><span><a href="#Training-a-CRF" data-toc-modified-id="Training-a-CRF-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training a CRF</a></span><ul class="toc-item"><li><span><a href="#With-little-annotations" data-toc-modified-id="With-little-annotations-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>With little annotations</a></span><ul class="toc-item"><li><span><a href="#Loading-and-splitting-data" data-toc-modified-id="Loading-and-splitting-data-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Loading and splitting data</a></span></li><li><span><a 
href="#Generate-features-and-train" data-toc-modified-id="Generate-features-and-train-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Generate features and train</a></span></li></ul></li><li><span><a href="#With-more-annotations" data-toc-modified-id="With-more-annotations-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>With more annotations</a></span></li></ul></li><li><span><a href="#Predicting-for-new-data" data-toc-modified-id="Predicting-for-new-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Predicting for new data</a></span></li></ul></div>

## Introduction
In this tutorial, we will train a [Conditional Random Field (CRF)](https://en.wikipedia.org/wiki/Conditional_random_field) on a dataset of the owner field of the 1808 Napoleonic _Sommarioni_, which was manually transcribed.

![sommarioni](https://images.center/iiif_sommarioni/reg5-0103/1047,5697,1485,775/500,/0/default.jpg) becomes two entries:
1. `Città di Venezia, di provenenienza della sopressa Scola di Santa Maria della Carità`
2. `Raspi Giovanni Francesco quondam Giovanni Maria`

The goal of this tutorial is to start from these entries and extract all the first names, last names, ecclesiastical entities and governmental entities. This will be done using CRFs.

A CRF is a discriminative probabilistic graphical model. It thus relies on modeling an input sequence with a graph and tries to estimate the conditional distribution $p(y|\mathbf{x})$. A good introduction can be found in [this paper](https://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf).
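
For the linear-chain case used in sequence labelling, the conditional distribution can be written as follows, in the notation of the paper linked above. Here $\mathbf{y}$ is the whole label sequence, the $f_k$ are feature functions, the $\theta_k$ are their learned weights, and $Z(\mathbf{x})$ is a normalization term that sums over all possible label sequences:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\Big(\sum_{k=1}^{K} \theta_k f_k(y_t, y_{t-1}, \mathbf{x}_t)\Big), \qquad Z(\mathbf{x}) = \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\Big(\sum_{k=1}^{K} \theta_k f_k(y'_t, y'_{t-1}, \mathbf{x}_t)\Big)$$

The dependence of each factor on both $y_t$ and $y_{t-1}$ is what lets the model take neighbouring labels into account.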

Compared to a discrete classifier such as logistic regression, which predicts each sample independently, a CRF can process a sequence of samples and take the context into account. It predicts a sequence of labels for an input sequence. This makes it particularly useful for labelling or parsing sequential data such as text.

A CRF takes as input a sequence of features which are computed for each input token. These features can be hand-crafted, as will be done in this tutorial. The advantage is that one can use prior knowledge about the data without having to write a rule-based system.

In state-of-the-art named entity recognition, the features come from a deep learning model, but a CRF is still often used to make the actual prediction.

## Annotating data

Since a CRF is a supervised machine learning model, it needs training data.

The text must be tokenized into words. In our example, each line is considered a sequence, but it is possible to use other levels, such as the paragraph, the article or even the page.

The most common tagging scheme is called [Inside-Outside-Beginning (IOB for short)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)).

We first define a set of tags of tokens we want to be able to predict:
- `FNAME` for the first name.
- `LNAME` for the last name.
- `ECC` for an ecclesiastical entity.
- `GOV` for a governmental entity.

The idea is then to prefix the tags with information about the role of the chunk inside its class.

### Encoding scheme
There exist several ways to encode the chunk information. We briefly review the main ones; more information can be found in this [blog post](https://lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/).
#### IO encoding
This is the simplest encoding: each token is either _in_ an entity, shown with the `I` prefix, e.g. `I-FNAME`, or outside of any entity, shown with `O`. The disadvantage is that the encoding cannot represent two entities of the same tag next to each other, since there is no boundary prefix.

An example tagging is:

|**Tagging**|Raspi|Francesco|e|le|Ministerio|della|finanzia|Senato|quondam|
|-----------|-----|---------|-|--|----------|-----|--------|------|-------|
|**IO**|I-LNAME|I-FNAME|O|O|I-GOV|I-GOV|I-GOV|I-GOV|O|


It is impossible to distinguish between the two `GOV` entities.
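
To make the limitation concrete, here is a minimal sketch that decodes IO tags back into entities by grouping consecutive identical tags. The helper `io_to_entities` is a hypothetical name introduced for this illustration; it is not part of the `utils` folder.

```python
from itertools import groupby


def io_to_entities(tokens, tags):
    """Group consecutive tokens that share the same IO tag into entities."""
    entities = []
    position = 0
    for tag, group in groupby(tags):
        length = len(list(group))
        if tag != "O":
            # Strip the "I-" prefix to recover the entity class
            text = " ".join(tokens[position:position + length])
            entities.append((tag[2:], text))
        position += length
    return entities


tokens = ["Raspi", "Francesco", "e", "le", "Ministerio",
          "della", "finanzia", "Senato", "quondam"]
tags = ["I-LNAME", "I-FNAME", "O", "O", "I-GOV",
        "I-GOV", "I-GOV", "I-GOV", "O"]

# The two GOV entities come back as a single one
print(io_to_entities(tokens, tags))
```

The last entity printed is `('GOV', 'Ministerio della finanzia Senato')`: the boundary between the ministry and the senate cannot be recovered from the tags alone.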

#### IOB and IOB2 encoding
This encoding now uses two prefixes for the entities: `B` marks the beginning of an entity and `I` its continuation. The two variants differ in when `B` is used. In `IOB`, `B` is used only when an entity immediately follows another entity of the same tag; every other entity token, including the first token after an `O`, is tagged `I`. In `IOB2`, the first token of every entity is tagged `B`:

|**Tagging**|Raspi|Francesco|e|le|Ministerio|della|finanzia|Senato|quondam|
|-----------|-----|---------|-|--|----------|-----|--------|------|-------|
|**IO**|I-LNAME|I-FNAME|O|O|I-GOV|I-GOV|I-GOV|I-GOV|O|
|**IOB**|I-LNAME|I-FNAME|O|O|I-GOV|I-GOV|I-GOV|B-GOV|O|
|**IOB2**|B-LNAME|B-FNAME|O|O|B-GOV|I-GOV|I-GOV|B-GOV|O|

Both variants are able to separate the two `GOV` entities. The advantage of IOB2 is that every entity starts with `B`, so the tag of a token does not depend on the entity that precedes it.

Note that IOB is also sometimes called BIO.


#### BILUO encoding
Two more prefixes are introduced: `L`, which denotes the last token of an entity, and `U`, which denotes an entity made of a single token.

|**Tagging**|Raspi|Francesco|e|le|Ministerio|della|finanzia|Senato|quondam|
|-----------|-----|---------|-|--|----------|-----|--------|------|-------|
|**IO**|I-LNAME|I-FNAME|O|O|I-GOV|I-GOV|I-GOV|I-GOV|O|
|**IOB**|I-LNAME|I-FNAME|O|O|I-GOV|I-GOV|I-GOV|B-GOV|O|
|**IOB2**|B-LNAME|B-FNAME|O|O|B-GOV|I-GOV|I-GOV|B-GOV|O|
|**BILUO**|U-LNAME|U-FNAME|O|O|B-GOV|I-GOV|L-GOV|U-GOV|O|

Note that BILUO is also sometimes called BILOU, BIOES or BMEWO.

#### Conversion between encodings

It is possible to convert between encodings. The following table summarizes the possible conversions:

|Source/Target|IO|IOB|IOB2|BILUO|
|-------------|--|---|----|-----|
|IO           |✓ |⨯  |⨯   |⨯    |
|IOB          |✓ |✓  |✓   |✓    |
|IOB2         |✓ |✓  |✓   |✓    |
|BILUO        |✓ |✓  |✓   |✓    |

Only the IO encoding loses information, since it cannot cleanly separate adjacent entities of the same tag. The other three encodings carry the same boundary information and can be converted into one another.
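
As an illustration, two of these conversions can be sketched in a few lines. The functions below are hypothetical stand-ins written for this example; the helpers in `utils.tags_format` used later may be implemented differently.

```python
def iob2_to_io_sketch(tags):
    """Replace every B- prefix by I-; entity boundaries are lost."""
    return ["I-" + tag[2:] if tag.startswith("B-") else tag for tag in tags]


def iob2_to_biluo_sketch(tags):
    """Use U- for single-token entities and L- for the last token of longer ones."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is an I- tag of the same class
        continues = i + 1 < len(tags) and tags[i + 1] == "I-" + label
        if prefix == "B":
            converted.append(("B-" if continues else "U-") + label)
        else:
            converted.append(("I-" if continues else "L-") + label)
    return converted


iob2 = ["B-LNAME", "B-FNAME", "O", "O", "B-GOV", "I-GOV", "I-GOV", "B-GOV", "O"]
print(iob2_to_io_sketch(iob2))
print(iob2_to_biluo_sketch(iob2))
```

Applied to the IOB2 row of the example, the two calls reproduce the IO and BILUO rows of the tables above.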

### Code part

We first start with some imports. Some functions are defined inside the `utils` folder and will be referred to as functions coming from utils:

In [None]:
import random
from collections import Counter

import pandas as pd
from IPython.display import display

import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import make_scorer

import sklearn_crfsuite
from sklearn_crfsuite import metrics

import spacy

from utils.export import export_to_excel
from utils import tags_format
from utils.features import generate_features

np.random.seed(42)
random.seed(42)

nlp = spacy.load('it_core_news_sm')

We now define what we want to tag, using the same set that we defined in the *Annotating data* section:
- `FNAME` for the first name.
- `LNAME` for the last name.
- `ECC` for an ecclesiastical entity.
- `GOV` for a governmental entity.

We will use the IOB2 encoding: it marks the start of every entity explicitly, which keeps adjacent entities of the same tag apart, and it needs only two prefixes (versus four in BILUO).

In [None]:
tags = [
    'FNAME',
    'LNAME',
    'ECC',
    'GOV'
]
tags_prefixed = ["-".join([prefix, tag]) for tag in tags for prefix in 'IB']
tags_prefixed.append('O')
print(tags_prefixed)

We now load the owner data of the *Sommarioni*. It is a CSV file with two columns: the index of the entry, an internal identifier used to keep track of it, and a text column containing the text of each entry.

In [None]:
df_to_annotate = pd.read_csv('./propr_to_annotate.csv', index_col=0)
df_to_annotate.head()

We first tokenize the text into words and create a dataframe indexed by the entry index (`idx`) and the token index (`tok_idx`), with the full text of the entry and the token as columns.

In [None]:
# We tokenize the text into words
df_tokens = df_to_annotate['text'].apply(lambda text: [
    str(token.text) for token in nlp(text, disable=['parser', 'tagger', 'ner'])])

# We stack the tokens to get a new token index
df_tokens = df_tokens.apply(pd.Series).rename_axis('tok_idx', axis=1).stack().to_frame('token')

# Finally, we join with the previous dataframe to get the fulltext column
df_tokens = df_tokens.join(df_to_annotate)
df_tokens = df_tokens[['text', 'token']]
display(df_tokens.head())
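
The same long format can also be built with `Series.explode`, which avoids creating a wide intermediate table padded with missing values. The sketch below uses two toy entries and whitespace splitting in place of the spaCy tokenizer; both are assumptions made only to keep the example self-contained.

```python
import pandas as pd

# Toy stand-in for the CSV loaded above
toy = pd.DataFrame(
    {"text": ["Raspi Giovanni", "Città di Venezia"]},
    index=pd.Index([10, 11], name="idx"),
)

# One row per token; the entry index is repeated for each of its tokens
long_tokens = toy["text"].str.split().explode().to_frame("token")

# Number the tokens inside each entry to get the token index
long_tokens["tok_idx"] = long_tokens.groupby(level="idx").cumcount().to_numpy()
long_tokens = long_tokens.set_index("tok_idx", append=True)

# Join with the original frame to get the full text next to each token
long_tokens = long_tokens.join(toy)[["text", "token"]]
print(long_tokens)
```

The result has the same `(idx, tok_idx)` index and the same `text` and `token` columns as `df_tokens`.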

We then use a utility function that takes as input a dataframe with a `text` and a `token` column and a list of tags, and generates an Excel file for annotating the data.

In [None]:
export_to_excel(df_tokens, './propr_to_annotate.xlsx', tags=tags_prefixed)

Once the data is annotated, we can read the Excel file back and train a CRF.

## Training a CRF

We first train on the small set annotated above, then on a larger annotated set.

### With little annotations

#### Loading and splitting data
We first load the data from the Excel file and make sure the tags are uppercase.

In [None]:
df_annotations_small = pd.read_excel('./propr_to_annotate.xlsx', index_col=[0,1])
df_annotations_small['tag'] = df_annotations_small['tag'].str.upper()
df_annotations_small.head()

We find all the entries that are completely annotated.

In [None]:
# An entry is fully annotated when every one of its tokens has a non-empty tag
is_annotated = (df_annotations_small['tag'].fillna('').ne('') # True where a tag is filled
                .groupby(level=0).all()) # Group by entry and check that all tags are filled
df_annotations_small = df_annotations_small.loc[is_annotated[is_annotated].index] # Only keep entries with all tags filled

We then transform the annotations into lists of tokens and tags, one pair of lists per entry.

In [None]:
annotations_small = df_annotations_small.groupby(level=[0])[['token', 'tag']].agg(list).rename(columns={'token': 'tokens', 'tag': 'tags_iob2'})
annotations_small.head()

We can also generate the other tagging schemes using some utility functions:

In [None]:
annotations_small['tags_io'] = annotations_small['tags_iob2'].apply(tags_format.iob2_to_io)
annotations_small['tags_biluo'] = annotations_small['tags_iob2'].apply(tags_format.iob2_to_biluo)
annotations_small.head()

We can now split our data into train and test sets. We will keep them fixed as we add more features. 

In [None]:
train_idx, test_idx = train_test_split(annotations_small.index, test_size=0.5)
print(f"There are {len(train_idx)} training samples and {len(test_idx)} test samples.")

#### Generate features and train

We now need to define some features for each token. Usually, we define features for the word itself and for its surrounding words.

Features are often morphological properties of the token, such as whether it is in upper case or title case.

You can also make use of a list of Venetian family names and a list of common Italian surnames.

In [None]:
# List of Venetian family names
family_names = set(pd.read_csv('venetian_names.csv', sep='\t')['famille'].tolist())

# List of italian surnames
surnames = set(pd.read_csv('italian_surnames.txt', header=None)[0].str.lower().values.tolist())

# Features for the current word
default_features = {
        'bias': None,
        'word.lower': lambda word: word.lower(),
        # TODO add other features
    }
# Features to be computed for surrounding words
default_surrounding_features = {
        # TODO add features
}
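
As a starting point for the TODOs above, here are a few candidate features. This is only a sketch: it assumes, as the `word.lower` entry suggests, that each feature is a callable mapping one token to a value. The feature names and the `toy_family_names` set are hypothetical; in the notebook you would use `family_names` and `surnames` instead, keeping in mind that only `surnames` is lower-cased when loaded.

```python
# Toy stand-in for the family-name list loaded above (lower-cased here)
toy_family_names = {"raspi", "contarini"}

# Hypothetical candidate features: each callable maps one token to a value
candidate_features = {
    "word.isupper": lambda word: word.isupper(),
    "word.istitle": lambda word: word.istitle(),
    "word.isdigit": lambda word: word.isdigit(),
    "word.suffix3": lambda word: word[-3:].lower(),
    "word.is_family_name": lambda word: word.lower() in toy_family_names,
}

token = "Raspi"
print({name: feature(token) for name, feature in candidate_features.items()})
```

Entries like these can be added to `default_features`, and the cheaper ones (case and lexicon lookups) are usually also worth adding to `default_surrounding_features`.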

We can now generate the features using another utils function that takes as input the tokens, the current word features, the number of surrounding words and the surrounding word features.

In [None]:
def generate_features_venetian(tokens):
    return generate_features(tokens,
                      features=default_features,
                      n_surrounding=1,
                      surrounding_features=default_surrounding_features)

annotations_small['features'] = annotations_small['tokens'].apply(generate_features_venetian)

Let's take a closer look at the generated features.

In [None]:
tokens, features = annotations_small[['tokens', 'features']].iloc[0].values
print(tokens[0])
print(features[0])
print()
print(tokens[2])
print(features[2])
print()
print(tokens[-1])
print(features[-1])

The first and last words have the additional BOS and EOS features, but lack the previous and next word features, respectively.
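
To show how such features can be assembled, here is a minimal sketch of a feature generator. It is not the implementation in `utils.features`: the function name, the `-1:`/`+1:` key prefixes and the handling of `None` as a constant feature are all assumptions made for this illustration.

```python
def sketch_generate_features(tokens, features, n_surrounding=1,
                             surrounding_features=None):
    """Return one feature dict per token (illustrative only)."""
    surrounding_features = surrounding_features or {}

    def compute(word, functions, prefix=""):
        # Assumption: a feature mapped to None is a constant, such as the bias
        return {prefix + name: (1.0 if function is None else function(word))
                for name, function in functions.items()}

    sequence = []
    for i, word in enumerate(tokens):
        token_features = compute(word, features)
        for offset in range(1, n_surrounding + 1):
            if i - offset >= 0:  # a previous word exists
                token_features.update(
                    compute(tokens[i - offset], surrounding_features, f"-{offset}:"))
            if i + offset < len(tokens):  # a next word exists
                token_features.update(
                    compute(tokens[i + offset], surrounding_features, f"+{offset}:"))
        if i == 0:
            token_features["BOS"] = True  # beginning of sequence
        if i == len(tokens) - 1:
            token_features["EOS"] = True  # end of sequence
        sequence.append(token_features)
    return sequence


example = sketch_generate_features(
    ["Raspi", "Giovanni", "Francesco"],
    features={"bias": None, "word.lower": lambda word: word.lower()},
    n_surrounding=1,
    surrounding_features={"word.istitle": lambda word: word.istitle()},
)
for token_features in example:
    print(token_features)
```

Each token ends up with a flat dictionary, which is the input format that sklearn-crfsuite expects for every element of a sequence.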

In [None]:
train = annotations_small.loc[train_idx]
test = annotations_small.loc[test_idx]

# TODO you can change the encoding scheme to see if it performs differently
tags_col = 'tags_io'

X_train = train['features'].values
y_train = train[tags_col].values
X_test = test['features'].values
y_test = test[tags_col].values

We can now use [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io/en/latest/), a scikit-learn wrapper of [python-crfsuite](https://github.com/scrapinghub/python-crfsuite), which is itself a wrapper of the C++ [CRFsuite](https://github.com/chokkan/crfsuite). The advantage of using this wrapper is that we can use the scikit-learn API with our CRF.

We create a CRF object with sensible defaults that can be tweaked; `c1` and `c2` are the L1 and L2 regularization coefficients.

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=3e-1,
    c2=1e-4,
    max_iterations=1000,
    all_possible_transitions=True
)
crf.fit(X_train, y_train);

Once the model is trained, we can evaluate it. The most common metric is the weighted average F1 score.

We can also use `sklearn_crfsuite.metrics.flat_classification_report` to get a complete classification report.

In [None]:
labels = list(crf.classes_)
labels.remove('O')

y_pred = crf.predict(X_test)
print(f"Weighted f1_score {metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels):.3f}")
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))
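
The "flat" metrics concatenate all sequences into one long list of tags before scoring, and the weighted F1 averages the per-label F1 scores, weighted by how often each label occurs in the true tags. The following plain-Python sketch illustrates that computation on hand-written toy predictions; the function name is hypothetical and the library remains the reference for real evaluation.

```python
from collections import Counter
from itertools import chain


def flat_weighted_f1_sketch(y_true, y_pred, labels):
    """Support-weighted average of per-label F1 over flattened sequences."""
    true = list(chain.from_iterable(y_true))
    pred = list(chain.from_iterable(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true, pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was expected but missed
    weighted_sum, total_support = 0.0, 0
    for label in labels:
        support = tp[label] + fn[label]  # occurrences in the true tags
        denominator = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denominator if denominator else 0.0
        weighted_sum += support * f1
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


toy_true = [["B-LNAME", "B-FNAME", "O"], ["B-GOV", "I-GOV", "O"]]
toy_pred = [["B-LNAME", "O", "O"], ["B-GOV", "I-GOV", "O"]]
toy_labels = ["B-LNAME", "B-FNAME", "B-GOV", "I-GOV"]  # 'O' is left out
print(f"{flat_weighted_f1_sketch(toy_true, toy_pred, toy_labels):.3f}")
```

Leaving `O` out of the labels matters: most tokens are `O`, so including it would inflate the score without saying much about the entities.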

It is also possible to take a look at which features are the most discriminative for a certain class.

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))
        
print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(10))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-10:][::-1])

Let's now load more annotations and test how well our model generalizes.

In [None]:
df_annotations = pd.read_excel('./propr_annotated.xlsx', index_col=[0,1])
df_annotations['tag'] = df_annotations['tag'].str.upper()
# An entry is fully annotated when every one of its tokens has a non-empty tag
is_annotated = (df_annotations['tag'].fillna('').ne('') # True where a tag is filled
                .groupby(level=0).all()) # Group by entry and check that all tags are filled
df_annotations = df_annotations.loc[is_annotated[is_annotated].index] # Only keep entries with all tags filled
annotations = df_annotations.groupby(level=[0])[['token', 'tag']].agg(list).rename(columns={'token': 'tokens', 'tag': 'tags_iob2'})
annotations['tags_io'] = annotations['tags_iob2'].apply(tags_format.iob2_to_io)
annotations['tags_biluo'] = annotations['tags_iob2'].apply(tags_format.iob2_to_biluo)

annotations['features'] = annotations['tokens'].apply(generate_features_venetian)
print(f"There are {len(annotations)} annotations")

In [None]:
labels = list(crf.classes_)
labels.remove('O')

X_large = annotations['features'].values
y_large = annotations[tags_col].values

y_pred = crf.predict(X_large)
print(f"Weighted f1_score {metrics.flat_f1_score(y_large, y_pred, average='weighted', labels=labels):.3f}")
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_large, y_pred, labels=sorted_labels, digits=3
))

TODO: add more features or change the encoding scheme, and check how it affects the results.

### With more annotations

We copy the code from the previous part, this time loading the larger annotation file, without further explanation.

In [None]:
df_annotations = pd.read_excel('./propr_annotated.xlsx', index_col=[0,1])
df_annotations['tag'] = df_annotations['tag'].str.upper()
# An entry is fully annotated when every one of its tokens has a non-empty tag
is_annotated = (df_annotations['tag'].fillna('').ne('') # True where a tag is filled
                .groupby(level=0).all()) # Group by entry and check that all tags are filled
df_annotations = df_annotations.loc[is_annotated[is_annotated].index] # Only keep entries with all tags filled
annotations = df_annotations.groupby(level=[0])[['token', 'tag']].agg(list).rename(columns={'token': 'tokens', 'tag': 'tags_iob2'})
annotations['tags_io'] = annotations['tags_iob2'].apply(tags_format.iob2_to_io)
annotations['tags_biluo'] = annotations['tags_iob2'].apply(tags_format.iob2_to_biluo)

train_idx, test_idx = train_test_split(annotations.index, test_size=0.2)
print(f"There are {len(annotations)} annotations")
print(f"There are {len(train_idx)} training samples and {len(test_idx)} test samples.")
annotations.head()

The code to generate the features is the same as before; you can copy-paste your previously defined features.

In [None]:
tags_col = 'tags_iob2'

# List of Venetian family names
family_names = set(pd.read_csv('venetian_names.csv', sep='\t')['famille'].tolist())

surnames = set(pd.read_csv('italian_surnames.txt', header=None)[0].str.lower().values.tolist())

# Features for the current word
default_features = {
        'bias': None,
        'word.lower': lambda word: word.lower(),
        # TODO add other features
    }
# Features to be computed for surrounding words
default_surrounding_features = {
        # TODO add features
}

def generate_features_venetian(tokens):
    return generate_features(tokens,
                      features=default_features,
                      n_surrounding=1,
                      surrounding_features=default_surrounding_features)

annotations['features'] = annotations['tokens'].apply(generate_features_venetian)

train = annotations.loc[train_idx]
test = annotations.loc[test_idx]
X_train = train['features'].values
y_train = train[tags_col].values
X_test = test['features'].values
y_test = test[tags_col].values

We train the model again and check the results.

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=5e-1,
    c2=1e-3,
    max_iterations=1000,
    all_possible_transitions=True
)
crf.fit(X_train, y_train);

labels = list(crf.classes_)
labels.remove('O')

y_pred = crf.predict(X_test)
print(f"Weighted f1_score {metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels):.3f}")
# group B and I results
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

We can also check the state features.

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(10))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-10:][::-1])

## Predicting for new data

We can now load new data.

In [None]:
df_to_predict = pd.read_csv('./propr_to_predict.csv', index_col=0)
df_to_predict.head()

We tokenize it and generate the token features we previously defined.

In [None]:
# We tokenize the text into words
df_tokens = df_to_predict['text'].apply(lambda text: [
    str(token.text) for token in nlp(text, disable=['parser', 'tagger', 'ner'])]).to_frame('tokens')

# We generate features for the tokens
df_tokens['features'] = df_tokens['tokens'].apply(generate_features_venetian)

df_tokens.head()

We use the CRF to predict a sequence of tags from the features of each entry.

In [None]:
df_tokens['tags'] = crf.predict(df_tokens['features'].values)
df_tokens.head()

We can now transform the data frame into long form.

In [None]:
# Generate long form data frames for tokens and tags
tokens_long = df_tokens['tokens'].apply(pd.Series).rename_axis('tok_idx', axis=1).stack().to_frame('token')
# Named tags_long so that the `tags` list defined earlier is not overwritten
tags_long = df_tokens['tags'].apply(pd.Series).rename_axis('tok_idx', axis=1).stack().to_frame('tag')

# We join the tokens, the tags and the original text
df_export = tokens_long.join(tags_long).join(df_to_predict)[['text', 'token', 'tag']]
df_export.head()
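
Since the goal is to extract the names and entities themselves, the predicted IOB2 tags can be grouped back into entities. The sketch below uses a hypothetical helper, `iob2_to_entities`, and hand-written tags for illustration rather than real model output; it also assumes that a stray `I-` tag with no preceding `B-` should open a new entity.

```python
def iob2_to_entities(tokens, tags):
    """Collect (label, text) pairs from IOB2-tagged tokens."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == current_label
        if current_label is not None and not continues:
            # Close the entity that was open
            entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if starts_new or (tag.startswith("I-") and not continues):
            # Assumption: an I- tag that continues nothing opens a new entity
            current_label, current_tokens = tag[2:], [token]
        elif continues:
            current_tokens.append(token)
    if current_label is not None:
        entities.append((current_label, " ".join(current_tokens)))
    return entities


example_tokens = ["Raspi", "Giovanni", "Francesco", "quondam", "Giovanni", "Maria"]
example_tags = ["B-LNAME", "B-FNAME", "I-FNAME", "O", "B-FNAME", "I-FNAME"]
print(iob2_to_entities(example_tokens, example_tags))
```

On a data frame with one list of tokens and one list of tags per entry, such a function could be applied row by row, for instance with `df_tokens.apply(lambda row: iob2_to_entities(row['tokens'], row['tags']), axis=1)`.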

We can again export it to an xlsx file and verify the annotations.

In [None]:
export_to_excel(df_export, 'propr_to_verify.xlsx', tags=tags_prefixed)