# Sequence Labeling with Conditional Random Fields (CRF)

- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

*Recommended Reading*
- Lecture Slides
- Lafferty et al. (2001) [Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.803&rep=rep1&type=pdf) (__original paper__)
- Sutton & McCallum's [An Introduction to Conditional Random Fields](https://homepages.inf.ed.ac.uk/csutton/publications/crftutv2.pdf)
- Edwin Chen's [Introduction to Conditional Random Fields](http://blog.echen.me/2012/01/03/introduction-to-conditional-random-fields/)
- Michael Collins's [Log-Linear Models, MEMMs, and CRFs](http://www.cs.columbia.edu/~mcollins/crf.pdf)

__Requirements__

- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset
- [CRFsuite](http://www.chokkan.org/software/crfsuite/)
    - [python-crfsuite](https://python-crfsuite.readthedocs.io) python binding to `CRFsuite`.
    - [sklearn-crfsuite](https://sklearn-crfsuite.readthedocs.io) `python-crfsuite` wrapper providing API similar to `scikit-learn`
- [spaCy](https://spacy.io/)
- `conll.py` (in `src` folder)

__Note:__ you need to install both `python-crfsuite` and `sklearn-crfsuite` to use `sklearn_crfsuite`, since the latter is a wrapper around the former.

## Conditional Random Fields

[CRFs](https://en.wikipedia.org/wiki/Conditional_random_field) are a type of __discriminative undirected probabilistic graphical model__. 
In general, they can be defined over __any__ undirected graph structure.
In Natural Language Processing, the structure is usually a *sequence* of words, and each label is conditioned on the observations and on the *previous label* (i.e. on the transition from it). This is known as a __Linear Chain CRF__.

For general graphs, exact inference in CRFs is intractable. For __Linear Chain CRFs__, however, exact inference is possible, and the algorithms used are "analogous to the forward-backward and Viterbi algorithm for the case of Hidden Markov Models". 
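To make the Viterbi analogy concrete, here is a minimal sketch of exact decoding for a linear chain. It assumes additive (log-space) scores that are already computed; the function name `viterbi`, the score dictionaries, and the toy numbers are all hypothetical and are not taken from CRFsuite.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence and its score.

    emissions:   list of dicts, one per token: {tag: score}
    transitions: dict {(prev_tag, tag): score}; missing pairs score 0.0
    Scores are additive (log-space), as in a linear-chain model.
    """
    tags = list(emissions[0])
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = best previous tag for `tag` at position t

    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, tag), 0.0))
            scores[tag] = best[-1][prev] + transitions.get((prev, tag), 0.0) + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        back.append(pointers)

    # follow the back-pointers from the best final tag
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# toy scores for a 3-token sentence (hypothetical values)
emissions = [
    {"O": 2.0, "B": 0.5, "I": 0.0},
    {"O": 0.5, "B": 1.0, "I": 0.9},
    {"O": 0.2, "B": 0.3, "I": 1.0},
]
transitions = {("O", "I"): -5.0, ("B", "I"): 1.0, ("I", "I"): 0.5}

path, score = viterbi(emissions, transitions)
print(path, score)  # ['O', 'B', 'I'] 5.0
```

The dynamic program visits each (position, tag, previous tag) triple once, so its cost grows linearly with sentence length and quadratically with the number of tags.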


### Why CRF?
- Hidden Markov Models (HMM) have two issues when working with textual data:
    - Unrealistic independence assumptions: natural language requires a richer representation than just words (i.e. more features), and these features are __not independent__ (e.g. a word and its suffix).
    - HMMs are generative: they model the joint probability of observation and state sequences. However, in NLP the task is to predict the state sequence given the observation sequence, i.e. in the generative approach we use a __joint model__ to solve a __conditional problem__ (since at prediction time we have observations, not state sequences).
    
- Maximum Entropy Markov Models (MEMM, McCallum et al. 2000) address these problems. However, they have a __label bias problem__ -- outgoing transitions from a state compete only against each other (not all the transitions in a model). In other words, "transition scores are the conditional probabilities of possible next states given the current state and the observation sequence". Consequently, MEMM favors states with fewer outgoing transitions (to the point of ignoring the observations, if there is only one outgoing transition). 

- Conditional Random Fields solve the __label bias problem__.

> A MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of a label sequence given the observation sequence. Since normalization is done globally rather than for each state individually, the weights of different features at different states can be traded off against each other.
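The contrast drawn in the quotation can be written out explicitly. The notation below is an assumption made for illustration (feature functions $f_k$, weights $\lambda_k$, label sequence $y = y_1 \dots y_T$, observation sequence $x$); it loosely follows the tutorials in the reading list.

A MEMM normalizes locally, once per position and per previous state:

$$
p_{\text{MEMM}}(y \mid x) = \prod_{t=1}^{T} \frac{\exp\left(\sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right)}{\sum_{y'} \exp\left(\sum_k \lambda_k f_k(y_{t-1}, y', x, t)\right)}
$$

A linear-chain CRF normalizes globally, once per observation sequence:

$$
p_{\text{CRF}}(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\right),
\qquad
Z(x) = \sum_{y'_1 \dots y'_T} \exp\left(\sum_{t=1}^{T} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x, t)\right)
$$

In the MEMM, every factor sums to 1 over the next states, so a state with a single outgoing transition passes on all of its probability mass whatever the observation is. In the CRF, the scores of all transitions along a sequence compete within the single normalizer $Z(x)$.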

### State-of-the-Art in the Age of Deep Neural Networks

Combining CRFs with LSTMs (stacking a CRF layer on top of an LSTM layer, e.g. Gobbi et al.) is now widely considered a *no-brainer* for improving model accuracy.

## Python CRF Tutorials

The authors of the Python packages already provide tutorials in the form of notebooks.

__Follow the tutorials to learn the tools.__

- [sklearn-crfsuite notebook](https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb)
- [python-crfsuite notebook](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb)


The `bias` feature used in the tutorials is explained [here](https://github.com/scrapinghub/python-crfsuite/issues/73).

## Language Understanding (Tagging) with CRFs

### Baseline

Let's prepare a CRF baseline for our dataset that considers only the word itself and the previous tag (similar to an HMM).

We will make use of functions defined in previous labs.

In [2]:
def read_corpus_conll(corpus_file, fs="\t"):
    """
    read corpus in CoNLL format
    :param corpus_file: corpus in conll format
    :param fs: field separator
    :return: corpus
    """
    featn = None  # number of features for consistency check
    sents = []  # list to hold words list sequences
    words = []  # list to hold feature tuples

    with open(corpus_file) as fh:  # the file is closed automatically
        for line in fh:
            line = line.strip()
            if len(line) > 0:
                feats = tuple(line.split(fs))
                if not featn:
                    featn = len(feats)
                elif featn != len(feats):
                    raise ValueError("Unexpected number of columns {} ({})".format(len(feats), featn))
                words.append(feats)
            elif len(words) > 0:  # a blank line ends the current sentence
                sents.append(words)
                words = []
    if len(words) > 0:  # keep the last sentence if the file has no trailing blank line
        sents.append(words)
    return sents

In [3]:
trn_data_file = 'NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.conll.txt' 
tst_data_file = 'NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.conll.txt'

In [4]:
trn = read_corpus_conll(trn_data_file)
tst = read_corpus_conll(tst_data_file)
print(trn[0])

[('who', 'O'), ('plays', 'O'), ('luke', 'B-character.name'), ('on', 'O'), ('star', 'B-movie.name'), ('wars', 'I-movie.name'), ('new', 'I-movie.name'), ('hope', 'I-movie.name')]


Let's copy & re-define feature extraction functions from the tutorials.

In [5]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

In [6]:
def word2features(sent, i):
    word = sent[i][0]
    return {'bias': 1.0, 'word.lower()': word.lower()}

Let's inspect our baseline features.

In [7]:
sent2features(trn[0])[0]

{'bias': 1.0, 'word.lower()': 'who'}

##### Feature Extraction

In [8]:
%%time
trn_feats = [sent2features(s) for s in trn]
trn_label = [sent2labels(s) for s in trn]

CPU times: user 18.4 ms, sys: 3.06 ms, total: 21.5 ms
Wall time: 21.5 ms


##### Training

In [9]:
%%time
from sklearn_crfsuite import CRF

crf = CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
crf.fit(trn_feats, trn_label)

CPU times: user 6.4 s, sys: 17.2 ms, total: 6.42 s
Wall time: 6.44 s


<sklearn_crfsuite.estimator.CRF at 0x11993b550>
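The `c1` and `c2` arguments are the coefficients for L1 and L2 regularization, respectively. As a rough sketch (the exact scaling of each term inside CRFsuite is an assumption here, and the symbols $w$, $x^{(i)}$, $y^{(i)}$ are introduced only for illustration), L-BFGS training minimizes a penalized negative log-likelihood of the form:

$$
\min_{w} \; -\sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

Larger `c1` values push more feature weights to exactly zero (a sparser model), while larger `c2` values shrink all weights towards zero.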

##### Prediction

In [22]:
tst_feats = [sent2features(s) for s in tst]
pred = crf.predict(tst_feats)

##### Evaluation
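We are going to use our `conll` evaluation script, which reports chunk-level (per-concept) metrics. (Note that the CRF tools' own reports are token-level.)

For that we need to reshape the predictions a bit: each predicted sentence has to be a list of tuples with the tag in the last position, like the sentences in `tst`.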

In [29]:
print(pred[0])

['O', 'O', 'B-movie.name']


In [30]:
hyp = [[(tst_feats[i][j], t) for j, t in enumerate(tokens)] for i, tokens in enumerate(pred)]
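The comprehension above stores each token's feature dictionary in the first position of the tuple. If more readable output is preferred, the original tokens can be paired with the predicted tags instead. A minimal sketch, assuming sentences shaped like those in `tst`; the helper name `pair_tokens_with_tags` and the toy data are hypothetical:

```python
def pair_tokens_with_tags(sents, pred):
    """Pair the first field (the token) of each row with its predicted tag."""
    hyp = []
    for sent, tags in zip(sents, pred):
        if len(sent) != len(tags):
            raise ValueError("sentence and prediction differ in length")
        # each row is (token, ..., label); keep only the token
        hyp.append([(token, tag) for (token, *_), tag in zip(sent, tags)])
    return hyp


# toy data in the same shape as `tst` and `pred` (hypothetical values)
tst_toy = [[("show", "O"), ("me", "O"), ("movies", "O")]]
pred_toy = [["O", "O", "B-movie.name"]]

hyp_toy = pair_tokens_with_tags(tst_toy, pred_toy)
print(hyp_toy)
```

The explicit length check makes a misalignment between sentences and predictions fail loudly rather than being silently truncated by `zip`.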

In [33]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate
# for nice tables
import pandas as pd

results = evaluate(tst, hyp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

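| label | precision (p) | recall (r) | F1 (f) | support (s) |
|---|---|---|---|---|
| movie.type | 1.0 | 0.0 | 0.0 | 4 |
| movie.location | 1.0 | 0.286 | 0.444 | 7 |
| country.name | 0.56 | 0.677 | 0.613 | 62 |
| award.category | 1.0 | 0.0 | 0.0 | 2 |
| movie.release_region | 1.0 | 0.0 | 0.0 | 4 |
| movie.genre | 0.96 | 0.667 | 0.787 | 36 |
| movie.subject | 0.8 | 0.636 | 0.709 | 44 |
| producer.name | 0.746 | 0.603 | 0.667 | 73 |
| movie.language | 0.787 | 0.536 | 0.638 | 69 |
| character.name | 0.5 | 0.067 | 0.118 | 15 |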
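To see why chunk-level scores differ from token-level ones, the sketch below extracts chunks from IOB tags and scores a single sentence. It is an illustration only, not the implementation in `conll.py`: the function names are hypothetical, and the handling of empty denominators (returning 0.0) is an assumption that may differ from the script used above.

```python
def iob_chunks(tags):
    """Return the set of (label, start, end) chunks in an IOB tag sequence."""
    chunks, label, start = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes the last chunk
        prefix, _, name = tag.partition("-")
        # an open chunk ends on O, on B-, or on I- with a different label
        if label is not None and (prefix != "I" or name != label):
            chunks.add((label, start, i))
            label = None
        # a chunk starts on B-, or on an I- that does not continue an open chunk
        if prefix in ("B", "I") and label is None:
            label, start = name, i
    return chunks


def chunk_prf(ref_tags, hyp_tags):
    """Chunk-level precision, recall and F1; a chunk counts only on exact match."""
    ref, hyp = iob_chunks(ref_tags), iob_chunks(hyp_tags)
    tp = len(ref & hyp)
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# reference tags of the first training sentence; the hypothesis is made up
ref = ["O", "O", "B-character.name", "O",
       "B-movie.name", "I-movie.name", "I-movie.name", "I-movie.name"]
hyp = ["O", "O", "B-character.name", "O",
       "B-movie.name", "I-movie.name", "O", "O"]

token_acc = sum(r == h for r, h in zip(ref, hyp)) / len(ref)
print(token_acc)            # 0.75
print(chunk_prf(ref, hyp))  # (0.5, 0.5, 0.5)
```

Six of the eight tokens are tagged correctly, but the truncated `movie.name` span does not match the reference chunk, so only one of the two chunks counts as correct.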


## Feature Engineering

One of the strengths of CRFs lies in their ability to make use of rich feature representations. The process of extracting features from raw data is known as [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering).

Common features used in sequence labeling with CRFs are:
- part-of-speech tags
- lemmas
- token character prefixes and suffixes (e.g. first and last 1, 2, 3 characters of a word; `word[-3:]` in tutorial is suffix of length 3).
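As an illustration of the last item, character prefixes and suffixes can be obtained with plain string slicing. This is a minimal sketch; the function name `affix_features` and the feature keys are hypothetical and do not come from the tutorials.

```python
def affix_features(word, max_len=3):
    """Prefix and suffix features of length 1..max_len for a single token."""
    lower = word.lower()
    feats = {}
    for n in range(1, max_len + 1):
        if len(lower) >= n:  # skip affixes longer than the word itself
            feats['prefix{}'.format(n)] = lower[:n]
            feats['suffix{}'.format(n)] = lower[-n:]
    return feats


print(affix_features('plays'))
print(affix_features('on'))
```

The returned dictionary can be merged into a token's feature dictionary with `dict.update`.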

#### SpaCy

[spaCy](https://spacy.io/) provides a convenient way to augment our feature set with features commonly used in Natural Language Processing. To make use of spaCy, please install it and download the model as follows:

```
pip install spacy
python -m spacy download en_core_web_sm
```

The list of provided token-level features is available [here](https://spacy.io/api/token#attributes).

Let's define a variant of the `sent2features` function that makes use of spaCy features.

In [56]:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = Tokenizer(nlp.vocab)  # to use white space tokenization (generally a bad idea for unknown data)

def sent2spacy_features(sent):
    spacy_sent = nlp(" ".join(sent2tokens(sent)))
    feats = []
    for token in spacy_sent:
        token_feats = {
            'bias': 1.0,
            'word.lower()': token.lower_,
            'pos': token.pos_,
            'lemma': token.lemma_
        }
        feats.append(token_feats)
    
    return feats

In [57]:
trn_feats = [sent2spacy_features(s) for s in trn]
trn_label = [sent2labels(s) for s in trn]
tst_feats = [sent2spacy_features(s) for s in tst]

In [60]:
print(trn_feats[0])

[{'bias': 1.0, 'word.lower()': 'who', 'pos': 'PRON', 'lemma': 'who'}, {'bias': 1.0, 'word.lower()': 'plays', 'pos': 'VERB', 'lemma': 'play'}, {'bias': 1.0, 'word.lower()': 'luke', 'pos': 'PROPN', 'lemma': 'luke'}, {'bias': 1.0, 'word.lower()': 'on', 'pos': 'ADP', 'lemma': 'on'}, {'bias': 1.0, 'word.lower()': 'star', 'pos': 'PROPN', 'lemma': 'star'}, {'bias': 1.0, 'word.lower()': 'wars', 'pos': 'NOUN', 'lemma': 'war'}, {'bias': 1.0, 'word.lower()': 'new', 'pos': 'PROPN', 'lemma': 'new'}, {'bias': 1.0, 'word.lower()': 'hope', 'pos': 'PROPN', 'lemma': 'hope'}]


In [61]:
crf = CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)
crf.fit(trn_feats, trn_label)

<sklearn_crfsuite.estimator.CRF at 0x11d7ae790>

In [62]:
pred = crf.predict(tst_feats)

hyp = [[(tst_feats[i][j], t) for j, t in enumerate(tokens)] for i, tokens in enumerate(pred)]

In [63]:
results = evaluate(tst, hyp)

pd_tbl = pd.DataFrame().from_dict(results, orient='index')
pd_tbl.round(decimals=3)

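| label | precision (p) | recall (r) | F1 (f) | support (s) |
|---|---|---|---|---|
| movie.type | 1.0 | 0.0 | 0.0 | 4 |
| movie.location | 1.0 | 0.286 | 0.444 | 7 |
| country.name | 0.595 | 0.758 | 0.667 | 62 |
| award.category | 1.0 | 0.0 | 0.0 | 2 |
| movie.release_region | 0.0 | 0.0 | 0.0 | 4 |
| movie.genre | 0.933 | 0.778 | 0.848 | 36 |
| movie.subject | 0.811 | 0.682 | 0.741 | 44 |
| producer.name | 0.783 | 0.644 | 0.707 | 73 |
| movie.language | 0.809 | 0.551 | 0.655 | 69 |
| character.name | 0.5 | 0.133 | 0.211 | 15 |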


## Exercises

- add suffix features to the model and report the performance
- try the feature template from the tutorial on NL2SparQL4NLU
- increase the feature window (number of previous and next tokens) to:
    - `[-1, +1]`
    - `[-2, +2]`
- learn & experiment with [model parameters](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html)