<img src="data/images/lecture-notebook-header.png" />

# `sklearn-crfsuite` Tutorial

This notebook is adapted from the [official docs](https://sklearn-crfsuite.readthedocs.io/en/latest/) of the `sklearn-crfsuite` library. It has been simplified because some steps raise errors caused by [version issues](https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles); avoiding them would require downgrading `sklearn-crfsuite` (which you can of course do).

`sklearn-crfsuite` is a Python library that provides an interface to the CRFsuite implementation of Conditional Random Fields (CRF) for sequence labeling tasks such as named entity recognition (NER). CRF is a probabilistic model that allows us to make predictions for sequential data, and it is widely used in natural language processing for tasks such as NER and part-of-speech tagging.

The `sklearn-crfsuite` library provides a simple API for training and evaluating CRF models for sequence labeling tasks. It wraps CRFsuite in a scikit-learn-compatible estimator, so CRF models can be trained and applied through the familiar `fit`/`predict` interface.

This tutorial walks step by step through building a NER model with a CRF. It covers the following topics:

* Importing the required packages and loading the data.
* Preparing the data for training and evaluation.
* Defining the features to be used in the model, including word and context features.
* Training the CRF model using the `sklearn-crfsuite` API.
* Evaluating the model on a test set and inspecting the learned transition and state features.

Each step comes with code and a short explanation. The resulting model is a simple baseline, and the notebook points out where the features can be changed to improve it. This makes it a useful starting point for anyone interested in building NER models with CRFs, especially those already familiar with scikit-learn.

## Setting up the Notebook

### Import Required Packages

In [None]:
from itertools import chain

import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

## Let's use CoNLL 2002 data to build a NER system

The CoNLL 2002 corpus is available in NLTK (if it is not installed yet, download it once with `nltk.download('conll2002')`). We use the Spanish data in this notebook.

In [None]:
nltk.corpus.conll2002.fileids()

In [None]:
%%time
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

In [None]:
train_sents[0]
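
Each sentence is a list of `(token, POS tag, IOB label)` triples: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The following minimal sketch shows how such labels group into entity spans. The helper `iob_to_spans` and the example sentence (including its POS tags) are hand-written for illustration and are not taken from the corpus or the library.

```python
def iob_to_spans(sent):
    """Collect (entity_type, text) pairs from one IOB-tagged sentence."""
    spans = []
    current_type, current_tokens = None, []
    for token, _postag, label in sent:
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            # Close any open entity, then start a new one.
            # An I- tag without a matching open entity is treated like B-.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical sentence in the same (token, postag, label) format.
example_sent = [
    ("El", "DA", "O"),
    ("Banco", "NC", "B-ORG"),
    ("Central", "AQ", "I-ORG"),
    ("abre", "VMI", "O"),
    ("en", "SP", "O"),
    ("Madrid", "NP", "B-LOC"),
    (".", "Fp", "O"),
]

print(iob_to_spans(example_sent))
```

With the real data, the same helper could be applied to `train_sents[0]` to list the entities in the first training sentence.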

## Features

Next, define some features. In this example we use word identity, word suffix, word shape and word POS tag; also, some information from nearby words is used. 

This makes a simple baseline, but you certainly can add and remove some features to get (much?) better results - experiment with it.

`sklearn-crfsuite` (and `python-crfsuite`) support several feature formats; here we use feature dicts.

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],        
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

This is what word2features extracts:

In [None]:
sent2features(train_sents[0])[0]
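
Because features are plain dicts, extending the baseline only means adding keys. As one possible experiment, the sketch below adds a collapsed word-shape feature and a three-character prefix. The helpers `word_shape` and `add_extra_features` and the feature names `word.shape` and `word[:3]` are hypothetical, not part of `sklearn-crfsuite`, and whether they improve the score would have to be measured.

```python
def word_shape(word):
    """Map letters to X/x and digits to d, collapsing repeated codes.

    Other characters (for example hyphens) are kept as they are.
    """
    codes = []
    for ch in word:
        if ch.isupper():
            code = "X"
        elif ch.islower():
            code = "x"
        elif ch.isdigit():
            code = "d"
        else:
            code = ch
        # Collapse runs so that 'Madrid' and 'Barcelona' share one shape.
        if not codes or codes[-1] != code:
            codes.append(code)
    return "".join(codes)


def add_extra_features(features, word):
    """Return a copy of a feature dict with two additional features."""
    extended = dict(features)
    extended["word.shape"] = word_shape(word)
    extended["word[:3]"] = word[:3]
    return extended


print(word_shape("Madrid"))
print(word_shape("1996-08-22"))
print(add_extra_features({"bias": 1.0}, "Madrid"))
```

In this notebook the helper could be applied per token, for instance as `add_extra_features(word2features(sent, i), sent[i][0])`.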

Now we can extract the features from all our sentences:

In [None]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

## Training

To see all possible CRF parameters, check the docstring of `sklearn_crfsuite.CRF`. Here we use the L-BFGS training algorithm (the default) with Elastic Net regularization, where `c1` and `c2` are the coefficients of the L1 and L2 penalties.


In [None]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)

##
## try-except block to handle an error; avoids downgrading version of sklearn-crfsuite
## (more details: https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles)
##
try:
    crf.fit(X_train, y_train)
except AttributeError:
    pass
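
As a rough sketch of what the `c1` and `c2` arguments control: training with Elastic Net regularization minimizes the negative conditional log-likelihood of the training sequences plus an L1 and an L2 penalty on the feature weights. The notation ($w$ for the weights, $N$ training sequences $x^{(n)}$ with label sequences $y^{(n)}$) is introduced here for illustration, and the exact scaling of the penalty terms is an assumption that should be checked against the CRFsuite documentation.

```latex
\min_{w} \; -\sum_{n=1}^{N} \log p_w\!\left(y^{(n)} \mid x^{(n)}\right)
  \;+\; c_1 \lVert w \rVert_1
  \;+\; c_2 \lVert w \rVert_2^2
```

Larger `c1` values push more weights to exactly zero (a sparser model), while larger `c2` values shrink all weights toward zero.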

## Evaluation

Most tokens in the data set carry the O label (outside any entity), but we are more interested in the entity labels. To account for this, we use an averaged F1 score computed over all labels except O. The `sklearn_crfsuite.metrics` module provides some useful metrics for sequence classification tasks, including this one.


In [None]:
labels = list(crf.classes_)
labels.remove('O')
labels

In [None]:
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)
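
To make the metric concrete, here is a minimal pure-Python sketch of a support-weighted F1 score over a restricted label set, computed on flattened sequences. It is meant to mirror the idea behind `average='weighted'` with `labels=labels`, not to reproduce the library's implementation; the function `weighted_f1` and the toy label sequences are made up for illustration.

```python
from collections import Counter
from itertools import chain


def weighted_f1(y_true, y_pred, labels):
    """Support-weighted mean of per-label F1, restricted to `labels`."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed

    total_support = sum(tp[label] + fn[label] for label in labels)
    score = 0.0
    for label in labels:
        support = tp[label] + fn[label]  # number of true tokens with this label
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += f1 * support
    return score / total_support if total_support else 0.0


# Toy data: the second sentence misses one I-ORG token.
toy_true = [["B-LOC", "O", "O"], ["B-ORG", "I-ORG", "O"]]
toy_pred = [["B-LOC", "O", "O"], ["B-ORG", "O", "O"]]

without_o = weighted_f1(toy_true, toy_pred, ["B-LOC", "B-ORG", "I-ORG"])
with_o = weighted_f1(toy_true, toy_pred, ["B-LOC", "B-ORG", "I-ORG", "O"])
print(without_o, with_o)
```

On the toy data the score is higher when O is included, which illustrates why the frequent O label is removed from `labels` before scoring.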

## Hyperparameter Optimization


<font color="red">
    
* This section has been removed because it does not run due to the version issue described here:
* https://stackoverflow.com/questions/66059532/attributeerror-crf-object-has-no-attribute-keep-tempfiles
    
</font>
    

## Let's check what the classifier learned

The function `print_transitions()` below prints the weights for a given list of transitions between pairs of labels.

In [None]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

We can see that, for example, it is very likely that the beginning of an organization name (B-ORG) will be followed by a token inside organization name (I-ORG), but transitions to I-ORG from tokens with other labels are penalized.
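
The ranking relies on `Counter.most_common()`, which sorts entries by value in descending order, so the first items are the highest-weighted transitions and the last items the most strongly penalized ones. A small sketch with made-up weights (the numbers are hypothetical, not output of the trained model) shows the slicing:

```python
from collections import Counter

# Hypothetical weights; real values come from crf.transition_features_,
# which maps (label_from, label_to) pairs to learned weights.
toy_transitions = {
    ("B-ORG", "I-ORG"): 7.0,
    ("O", "O"): 3.5,
    ("O", "I-ORG"): -6.0,
    ("B-LOC", "I-ORG"): -4.0,
}

ranked = Counter(toy_transitions).most_common()  # sorted by weight, descending
top_likely = ranked[:2]
top_unlikely = ranked[-2:]

for (label_from, label_to), weight in top_likely + top_unlikely:
    print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))
```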

Check the state features:

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])



Some observations (the exact values might vary due to randomness):

   * **9.385823 B-ORG word.lower():psoe-progresistas** - the model remembered names of some entities - maybe it is overfit, or maybe our features are not adequate, or maybe remembering is indeed helpful;
   * **4.636151 I-LOC -1:word.lower():calle** - "calle" means "street" in Spanish; the model learns that if the previous word was "calle", the token is likely part of a location;
   * **-5.632036 O word.isupper()**, **-8.215073 O word.istitle()** - UPPERCASED or TitleCased words are likely entities of some kind;
   * **-2.097561 O postag:NP** - proper nouns (NP is a proper noun in the Spanish tagset) are often entities.
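
To look at one label at a time instead of the global top list, the state features can be filtered by label. The sketch below assumes a dict keyed by `(attribute, label)` pairs, as in the loop of `print_state_features` above. The helper `strongest_features` is hypothetical, and the toy weights are rounded illustrative values rather than model output.

```python
def strongest_features(state_features, label, n=3):
    """Return up to n (attribute, weight) pairs for `label`, largest |weight| first."""
    selected = [
        (attr, weight)
        for (attr, lab), weight in state_features.items()
        if lab == label
    ]
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)[:n]


# Toy weights for illustration only.
toy_state_features = {
    ("-1:word.lower():calle", "I-LOC"): 4.6,
    ("word.istitle()", "O"): -8.2,
    ("word.isupper()", "O"): -5.6,
    ("postag:NP", "O"): -2.1,
    ("bias", "O"): 0.5,
}

for attr, weight in strongest_features(toy_state_features, "O"):
    print("%0.6f %s" % (weight, attr))
```

With the trained model, the same call could be made as `strongest_features(crf.state_features_, 'O')`.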

The model in this notebook is just a starting point; you certainly can do better!

## Summary

There are several benefits to using `sklearn-crfsuite` for named entity recognition (NER):

* Simple and Intuitive API: `sklearn-crfsuite` provides a simple and intuitive API that is easy to use and understand, even for those who are new to CRF and NER.

* Scikit-Learn Compatibility: `sklearn-crfsuite` exposes CRFsuite through a scikit-learn-style estimator (`fit`, `predict`), so it fits into familiar scikit-learn workflows. As seen in this notebook, some scikit-learn tools (such as the removed hyperparameter search) can fail with certain version combinations.

* Customizable Features: `sklearn-crfsuite` allows users to define and customize their own features for the CRF model, which can improve the accuracy and performance of the model.

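* Efficient Performance: training and prediction are delegated to the CRFsuite backend, which is designed for fast training and tagging; the `%%time` cells in this notebook show how long feature extraction and training take on the CoNLL 2002 Spanish data.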

* Reusable Models: `sklearn-crfsuite` does not ship pre-trained NER models; a model has to be trained on labeled data, as in this notebook, and can then be reused to label new sentences.

Overall, `sklearn-crfsuite` is a flexible library for NER that offers a simple API and follows scikit-learn conventions. It is a solid choice for building CRF-based NER baselines that can then be improved through feature engineering.