# CRF Tutorial using python-crfsuite

In this tutorial, we will use a conditional random field (CRF) for part-of-speech (POS) tagging. The tutorial has 6 main parts:
1. Setup and preprocessing
2. Designing feature functions
3. Training
4. Making predictions
5. Evaluation
6. Try: Design a more complex model

## 1. Setup and preprocessing

In this demo, we will use [python-crfsuite](https://github.com/scrapinghub/python-crfsuite), a Python binding to CRFsuite.



In [0]:
!wget https://www.dropbox.com/s/tuvrbsby4a5axe0/resources.zip
!unzip resources.zip

In [0]:
!pip install python-crfsuite

In [0]:
import pycrfsuite
import numpy

We use POS data from the [ORCHID corpus](https://www.nectec.or.th/corpus/index.php?league=pm), a POS-tagged corpus for the Thai language.
A function that reads the corpus into a list of sentences, each a list of (word, POS) pairs, is already implemented. Example usage is shown below.

In [0]:
from data.orchid_corpus import get_sentences
train_data = get_sentences('train')
test_data = get_sentences('test')
train_data[0]

## 2. Designing feature functions

- __word2features()__: Returns the features for time step _i_ of an input sequence, so this is where all features are defined. We only need to define features of the input sequence (the words, in this example); the library manages the transition functions ($y_{t-1}$ -> $y_t$) and the state functions ($y_t$ -> $X$, for every feature $X$ defined in this method) for you.
- __sent2features()__: Calls word2features() for every position in the input sequence.
- __sent2labels()__: Gets the output labels from a train/test sequence.
- __sent2tokens()__: Gets the words from a train/test sequence (used in the prediction part just to show the full result).
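
For reference, the transition and state functions mentioned above combine in the standard linear-chain CRF as shown below. The symbols $f_k$, $g_l$, $\lambda_k$, $\mu_l$, and $Z(X)$ are notation introduced here for illustration; they are not names used by the library.

$$
p(y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=2}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t) + \sum_{t=1}^{T} \sum_{l} \mu_l g_l(y_t, X, t) \right)
$$

Here $f_k$ are the transition functions, $g_l$ are the state functions built from the features returned by word2features(), and $Z(X)$ normalizes over all possible tag sequences. Training estimates the weights $\lambda_k$ and $\mu_l$.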

In [0]:
def word2features(sent, i):
    word = sent[i][0]
    
    features = {
        'word': word,
        'word.isdigit': word.isdigit(),
        'word.length': len(word),
    }
    
    features['BOS'] = (i == 0)  # beginning of sentence
    features['EOS'] = (i == len(sent)-1)  # end of sentence
    
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for (word, label) in sent]

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [0]:
sent2features(train_data[0])[0]
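
The features above only look at the current word. One common extension is to add the neighboring words as features. The sketch below assumes sentences are lists of (word, POS) pairs as in this tutorial; the function name `word2features_with_context`, the feature keys `-1:word` and `+1:word`, and the toy sentence with its placeholder tags are all made up for illustration.

```python
def word2features_with_context(sent, i):
    """Hypothetical variant of word2features() that also looks at neighbors."""
    word = sent[i][0]
    features = {
        'word': word,
        'word.isdigit': word.isdigit(),
        'word.length': len(word),
        'BOS': i == 0,               # beginning of sentence
        'EOS': i == len(sent) - 1,   # end of sentence
    }
    # Previous word, only when there is one.
    if i > 0:
        features['-1:word'] = sent[i - 1][0]
    # Next word, only when there is one.
    if i < len(sent) - 1:
        features['+1:word'] = sent[i + 1][0]
    return features


# Toy sentence in the same (word, POS) format; the tags are placeholders.
toy_sent = [('I', 'TAG_A'), ('have', 'TAG_B'), ('2', 'TAG_C'), ('cats', 'TAG_D')]
context_features = [word2features_with_context(toy_sent, i)
                    for i in range(len(toy_sent))]
print(context_features[1])
```

Whether such context features help on this corpus is something to check with the evaluation in section 5.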

In [0]:
%%time
x_train = [sent2features(sent) for sent in train_data]
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

## 3. Training

To train a CRF model in python-crfsuite, we first create a trainer and load the training data (pairs of __generated features__ and __labels__) into it.

In [0]:
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(x_train, y_train):
    trainer.append(xseq, yseq)

There are several parameters you can set for the training process. You can list all of them using this method.

In [0]:
trainer.params()

In this tutorial, we will use 3 parameters:

- __max_iterations__: The maximum number of iterations the training algorithm runs over the training data
- __feature.possible_transitions__: Also generate transition feature functions (as discussed in section 2) for tag pairs that never occur in the training data
- __feature.possible_states__: Also generate state feature functions for feature-tag combinations that never occur in the training data
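
Beyond these three, the default L-BFGS training algorithm in CRFsuite also accepts `c1` and `c2`, the coefficients for L1 and L2 regularization. The dictionary below is a sketch of how they could be set; the name `crf_params` and the values are illustrative assumptions, not settings tuned for this corpus.

```python
# Illustrative values only; tune them on held-out data.
crf_params = {
    'c1': 0.1,    # L1 regularization coefficient (assumed value)
    'c2': 0.01,   # L2 regularization coefficient (assumed value)
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
}
```

A dictionary like this would be passed to trainer.set_params() in the same way as the three parameters used in this tutorial.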

In [0]:
trainer.set_params({
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
})

Finally, call the trainer to train with the specified model path.
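
The model file is written to that path, so the parent directory should exist before training. Whether the downloaded resources already include a `model/` directory is not confirmed here, and the library is not expected to create missing directories, so a defensive sketch is:

```python
import os

model_path = 'model/crf_basic.model'
# Create the parent directory if it is missing; no error if it already exists.
os.makedirs(os.path.dirname(model_path), exist_ok=True)
```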

In [0]:
%%time
model_path = 'model/crf_basic.model'
trainer.train(model_path)

## 4. Making predictions

Once training has finished, we can use the model to tag any sequence of words.
To do this, create a tagger and open the saved model. Then generate features for the sequence we want to tag and pass them to the _tag_ method.

In [0]:
tagger = pycrfsuite.Tagger()
tagger.open(model_path)

In [0]:
example_sent = test_data[20]
print(' '.join(sent2tokens(example_sent)))

print('Predicted: ', ' '.join(tagger.tag(sent2features(example_sent))))
print('Correct: ', ' '.join(sent2labels(example_sent)))
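
Conceptually, tagging a sequence means finding the tag sequence with the highest total score under the learned state and transition weights, which is typically done with the Viterbi algorithm. The sketch below is a toy pure-Python version with made-up scores. The names `viterbi`, `toy_state_scores`, and `toy_transition_scores` are illustrative, and this is not the library's implementation.

```python
def viterbi(state_scores, transition_scores, tags):
    """Return the highest-scoring tag sequence.

    state_scores: list with one dict per position, mapping tag -> score
    transition_scores: dict mapping (prev_tag, tag) -> score
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: state_scores[0].get(tag, 0.0) for tag in tags}]
    backpointers = []
    for t in range(1, len(state_scores)):
        scores, pointers = {}, {}
        for tag in tags:
            # Pick the previous tag that gives the best score for `tag`.
            prev_tag = max(
                tags,
                key=lambda p: best[-1][p] + transition_scores.get((p, tag), 0.0),
            )
            scores[tag] = (
                best[-1][prev_tag]
                + transition_scores.get((prev_tag, tag), 0.0)
                + state_scores[t].get(tag, 0.0)
            )
            pointers[tag] = prev_tag
        best.append(scores)
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    last_tag = max(tags, key=lambda tag: best[-1][tag])
    path = [last_tag]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Toy scores, made up for illustration.
toy_tags = ['N', 'V']
toy_state_scores = [{'N': 2.0, 'V': 0.5}, {'N': 1.0, 'V': 1.2}, {'N': 1.5, 'V': 0.2}]
toy_transition_scores = {
    ('N', 'V'): 1.0, ('V', 'N'): 1.0,
    ('N', 'N'): -1.0, ('V', 'V'): -1.0,
}
decoded_tags = viterbi(toy_state_scores, toy_transition_scores, toy_tags)
print(decoded_tags)  # ['N', 'V', 'N']
```

In the toy example, the transition scores reward alternating tags, so the second word is tagged `V` mainly because of its neighbors rather than its own state scores.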

## 5. Evaluation

To measure how well the model performs, we evaluate it on _test data_. For sequence labeling tasks, we often use __accuracy__ (the fraction of tokens tagged correctly) as the overall measure. We can analyze further by considering, for each tag _x_:
- __precision__: Of all tokens the model tagged as _x_, the fraction that really are _x_ in the test data
- __recall__: Of all tokens that really are _x_ in the test data, the fraction the model tagged as _x_
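
As a small worked example of precision and recall for a single tag, together with the F-score that combines them, consider the sketch below. The function `tag_metrics` and the toy tag lists are made up for illustration and work on flat lists of tags rather than lists of sentences.

```python
def tag_metrics(y_true, y_pred, tag):
    """Precision, recall, and F-score (as fractions) for one tag."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == tag)
    predicted = sum(1 for p in y_pred if p == tag)   # times the model said `tag`
    actual = sum(1 for t in y_true if t == tag)      # times `tag` is the truth
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    f_score = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f_score


toy_true = ['N', 'V', 'N', 'N', 'V']
toy_pred = ['N', 'N', 'N', 'V', 'N']
precision, recall, f_score = tag_metrics(toy_true, toy_pred, 'N')
print(precision, recall, f_score)
```

For tag `N`, the model predicts it 4 times and is right twice (precision 2/4), while 2 of the 3 true `N` tokens are found (recall 2/3), which gives an F-score of 4/7.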

The function below, evaluation_report(), computes all of the metrics described above and displays them in a DataFrame. It is fine to use it without going through the implementation.

In [0]:
import pandas as pd
from IPython.display import display

def evaluation_report(y_true, y_pred):
    # retrieve all tags in y_true
    tag_set = set()
    for sent in y_true:
        for tag in sent:
            tag_set.add(tag)
    tag_list = sorted(list(tag_set))
    
    # count correct points
    tag_info = dict()
    for tag in tag_list:
        tag_info[tag] = {'correct_tagged': 0, 'y_true': 0, 'y_pred': 0}

    all_correct = 0
    all_count = sum([len(sent) for sent in y_true])
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true == tag_pred:
                tag_info[tag_true]['correct_tagged'] += 1
                all_correct += 1
            tag_info[tag_true]['y_true'] += 1
            # a predicted tag may never occur in y_true; skip it to avoid a KeyError
            if tag_pred in tag_info:
                tag_info[tag_pred]['y_pred'] += 1
    accuracy = (all_correct / all_count) * 100
            
    # summarize and make evaluation result
    eval_list = list()
    for tag in tag_list:
        eval_result = dict()
        eval_result['tag'] = tag
        eval_result['correct_count'] = tag_info[tag]['correct_tagged']
        precision = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_pred'])*100 if tag_info[tag]['y_pred'] else '-'
        recall = (tag_info[tag]['correct_tagged']/tag_info[tag]['y_true'])*100
        eval_result['precision'] = precision
        eval_result['recall'] = recall
        eval_result['f_score'] = (2*precision*recall)/(precision+recall) if (type(precision) is float and recall > 0) else '-'
        
        eval_list.append(eval_result)

    eval_list.append({'tag': 'accuracy=%.2f' % accuracy, 'correct_count': '', 'precision': '', 'recall': '', 'f_score': ''})
    
    df = pd.DataFrame.from_dict(eval_list)
    df = df[['tag', 'precision', 'recall', 'f_score', 'correct_count']]
    display(df)

Make predictions on the test set (y_pred) and evaluate them against the true labels (y_test).

In [0]:
y_pred = [tagger.tag(x_sent) for x_sent in x_test]

In [0]:
evaluation_report(y_test, y_pred)
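
To see which tags are confused with each other, the errors can also be counted as (true tag, predicted tag) pairs. The helper `most_confused` and the toy data below are a sketch for illustration only.

```python
from collections import Counter


def most_confused(y_true, y_pred, top_n=3):
    """Return the most frequent (true_tag, predicted_tag) pairs among errors."""
    errors = Counter()
    for sent_true, sent_pred in zip(y_true, y_pred):
        for tag_true, tag_pred in zip(sent_true, sent_pred):
            if tag_true != tag_pred:
                errors[(tag_true, tag_pred)] += 1
    return errors.most_common(top_n)


# Toy data: 'V' is mis-tagged as 'N' twice, everything else is correct.
toy_y_true = [['N', 'V', 'N'], ['V', 'ADJ']]
toy_y_pred = [['N', 'N', 'N'], ['N', 'ADJ']]
confusions = most_confused(toy_y_true, toy_y_pred)
print(confusions)  # [(('V', 'N'), 2)]
```

On the real data, the call would be `most_confused(y_test, y_pred)`.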

## 6. Use pretrained word embedding

In this exercise, we will use the pretrained word embeddings from the previous homework as word features in pycrfsuite. We load the pretrained embeddings using pickle. They are stored as a dictionary that maps a word to its embedding vector.

In [0]:
import pickle
# the context manager closes the file even if loading fails
with open('basic_ff_embedding.pt', 'rb') as fp:
    embeddings = pickle.load(fp)

In [0]:
def word2features(sent, i, emb):
    def add_embedding_features(feat, prefix, query_word):
        if query_word in emb:
            vec = emb[query_word]
        else:
            vec = numpy.zeros(32)
        
        # use j so the loop index is not confused with the word position i
        for j in range(vec.shape[0]):
            feat[prefix + str(j)] = vec[j]
    
    features = dict()
    word = sent[i][0]
    add_embedding_features(features, 'word.embd', word)
    features.update({
        'word.word' : word,
        'word.isdigit': word.isdigit(),
        'word.length': len(word),
    })
    
    features['BOS'] = (i == 0)  # beginning of sentence
    features['EOS'] = (i == len(sent)-1)  # end of sentence
    
    return features

def sent2features(sent, emb_dict):
    return [word2features(sent, i, emb_dict) for i in range(len(sent))]

def sent2labels(sent):
    return [label for (word, label) in sent]

def sent2tokens(sent):
    return [word for (word, label) in sent]

In [0]:
%%time
x_train = [sent2features(sent, embeddings) for sent in train_data]
y_train = [sent2labels(sent) for sent in train_data]
x_test = [sent2features(sent, embeddings) for sent in test_data]
y_test = [sent2labels(sent) for sent in test_data]

In [0]:
sent2features(train_data[0], embeddings)[0]

In [0]:
%%time
trainer = pycrfsuite.Trainer(verbose=True)
trainer.set_params({
    'max_iterations': 100,
    'feature.possible_transitions': True,
    'feature.possible_states': True,
})

for xseq, yseq in zip(x_train, y_train):
    trainer.append(xseq, yseq)

In [0]:
%%time
model_path = 'model/crf_neural.model'
trainer.train(model_path)

In [0]:
%%time
model_path = 'model/crf_neural.model'
tagger = pycrfsuite.Tagger()
tagger.open(model_path)
y_pred = [tagger.tag(x_sent) for x_sent in x_test]

In [0]:
evaluation_report(y_test, y_pred)