The package we are using requires the data to be presented in a very specific format.

In [None]:
import pandas as pd
import string

features_labels = pd.read_csv("data/features-labels.csv")
features_labels = features_labels[~features_labels['label'].isna()]
features_labels.head()

In [None]:
offer_0 = features_labels[features_labels['offer_id'] == 0]
print(offer_0)

## Features for python-crfsuite

The inputs to the algorithm must follow a particular format: each token's features are represented as key-value pairs. Tokens may also have different features depending on factors such as their position in the sentence. The following function takes in a dataframe and returns the corresponding features that can be consumed by the training method of our algorithm:

In [None]:
punctuation = set(string.punctuation)

def is_punctuation(token):
    return token in punctuation

def is_numeric(token):
    try:
        float(token.replace(",", ""))
        return True
    except ValueError:  # raised by float() when the text is not a number
        return False
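
A quick way to see what these two helpers consider "numeric" or "punctuation" is to run them over a handful of tokens. The tokens below are made up for illustration and do not come from the dataset. The helpers are repeated so the snippet runs on its own.

```python
import string

punctuation = set(string.punctuation)

def is_punctuation(token):
    # True only for a single character found in string.punctuation
    return token in punctuation

def is_numeric(token):
    try:
        float(token.replace(",", ""))  # drop thousands separators first
        return True
    except ValueError:
        return False

# Hypothetical tokens, chosen only to exercise the helpers
for token in ["12,500", "3.5", "!", "Madrid", "...", "nan"]:
    print(token, is_numeric(token), is_punctuation(token))
```

Two edge cases are worth keeping in mind: a multi-character token such as `...` is not flagged as punctuation, and `float` accepts strings like `nan` and `inf`, so those tokens count as numeric.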

In [None]:
def featurise(sentence_frame, current_idx):
    current_token = sentence_frame.iloc[current_idx]
    token = current_token['token']
    position = current_token['position']
    token_count = current_token['token_count']
    pos = current_token['pos_tag']
    
    # Shared features across tokens
    features = {
        'bias': True,
        'word.lower': token.lower(),
        'word.istitle': token.istitle(),
        'word.isdigit': is_numeric(token),
        'word.ispunct': is_punctuation(token),
        'word.position': position,
        'word.token_count': token_count,
        'postag': pos,
    }
    
    if current_idx > 0: # The word is not the first one...
        prev_token = sentence_frame.iloc[current_idx-1]['token']
        prev_pos = sentence_frame.iloc[current_idx-1]['pos_tag']
        features.update({
            '-1:word.lower': prev_token.lower(),
            '-1:word.istitle': prev_token.istitle(),
            '-1:word.isdigit': is_numeric(prev_token),
            '-1:word.ispunct': is_punctuation(prev_token),
            '-1:postag': prev_pos
        })
    else:
        features['BOS'] = True
    
    if current_idx < len(sentence_frame) - 1: # The word is not the last one...
        next_token = sentence_frame.iloc[current_idx+1]['token']
        next_tag = sentence_frame.iloc[current_idx+1]['pos_tag']
        features.update({
            '+1:word.lower': next_token.lower(),
            '+1:word.istitle': next_token.istitle(),
            '+1:word.isdigit': is_numeric(next_token),
            '+1:word.ispunct': is_punctuation(next_token),
            '+1:postag': next_tag 
        })
    else:
        features['EOS'] = True
    
    return features

featurise(offer_0, 1)
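
To make the target format more concrete, here is a stripped-down, pandas-free sketch of the same idea: one dictionary per token, neighbour features prefixed with `-1:` and `+1:`, and `BOS`/`EOS` marking the edges of the sequence. The name `window_features` and the sample sentence are assumptions made for illustration, and only a subset of the features above is reproduced.

```python
def window_features(tokens, i):
    """Build a feature dict for tokens[i], looking one token to each side."""
    features = {
        'bias': True,
        'word.lower': tokens[i].lower(),
        'word.istitle': tokens[i].istitle(),
    }
    if i > 0:
        features['-1:word.lower'] = tokens[i - 1].lower()
    else:
        features['BOS'] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features['+1:word.lower'] = tokens[i + 1].lower()
    else:
        features['EOS'] = True  # end of sequence
    return features

tokens = ["Flights", "to", "Madrid"]  # hypothetical sentence
sequence = [window_features(tokens, i) for i in range(len(tokens))]
for feats in sequence:
    print(feats)
```

The first token carries `BOS` and no `-1:` keys, the last one carries `EOS` and no `+1:` keys, and the token in the middle sees both neighbours.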

Since `featurise` only works on a single token, we need another function that returns the features and labels for a whole sentence:

In [None]:
def featurize_sentence(sentence_frame):
    labels = list(sentence_frame['label'].values)
    # Call the single-token function defined above (named `featurise`)
    features = [featurise(sentence_frame, i) for i in range(len(sentence_frame))]
    
    return features, labels


features, labels = featurize_sentence(offer_0)
print(features[1])
print(labels[1])

As you can see, the dataset is split into tokens. However, since we are working on sequence labelling, we need to provide the algorithm with sequences. The following function takes care of rolling up the tokens into two lists, one of sentences and one of their labels:

In [None]:
def rollup(dataset):
    sequences = []
    labels = []
    offers = dataset.groupby('offer_id')
    for name, group in offers:
        sqs, lbls = featurize_sentence(group)
        sequences.append(sqs)
        labels.append(lbls)

    return sequences, labels

all_sequences, all_labels = rollup(features_labels)
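
The roll-up can also be pictured without pandas. The sketch below groups a few hand-written rows by offer id and checks that every sequence ends up with exactly one label per token, which is what we want before handing the data to the trainer. The rows, the label names and the function `rollup_rows` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical token rows: (offer_id, token, label)
rows = [
    (0, "Flights", "n"),
    (0, "to", "s"),
    (0, "Madrid", "d"),
    (1, "From", "s"),
    (1, "Paris", "o"),
]

def rollup_rows(rows):
    """Group token rows into one token list and one label list per offer."""
    tokens_by_offer = defaultdict(list)
    labels_by_offer = defaultdict(list)
    for offer_id, token, label in rows:
        tokens_by_offer[offer_id].append(token)
        labels_by_offer[offer_id].append(label)
    # Sort the keys so offers come out in a stable order
    offer_ids = sorted(tokens_by_offer)
    sequences = [tokens_by_offer[o] for o in offer_ids]
    labels = [labels_by_offer[o] for o in offer_ids]
    return sequences, labels

sequences, labels = rollup_rows(rows)

# Every sequence needs exactly one label per token
assert all(len(s) == len(l) for s, l in zip(sequences, labels))
print(sequences)
print(labels)
```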

## Training

As in any other supervised problem, we need to split our dataset into two (preferably three) sets of data. We can use `train_test_split` for this:

In [None]:
from sklearn.model_selection import train_test_split

train_docs, test_docs, train_labels, test_labels = train_test_split(all_sequences, all_labels)

len(train_docs), len(test_docs)
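
If you want the third (validation) set mentioned above, one option is to shuffle the sequence indices once and slice them into three parts. The following is a dependency-free sketch of that idea, not what this notebook actually runs. The function name `three_way_split`, the fractions and the seed are all assumptions.

```python
import random

def three_way_split(sequences, labels, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle whole sequences and split them into train/validation/test."""
    indices = list(range(len(sequences)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    n_test = int(len(indices) * test_frac)
    n_val = int(len(indices) * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(idx):
        return [sequences[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)

# Hypothetical data: 20 one-token sequences
docs = [[{"bias": True, "id": i}] for i in range(20)]
tags = [["o"] for _ in range(20)]

(train_x, train_y), (val_x, val_y), (test_x, test_y) = three_way_split(docs, tags)
print(len(train_x), len(val_x), len(test_x))
```

The split is done over whole sequences rather than individual tokens, so a sentence never ends up partly in training and partly in testing. Calling `train_test_split` a second time on the training portion should give a comparable result.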

## Creating a CRF  

Though one can use a *sklearn-like* interface to create, train and infer with python-crfsuite, I've decided to just use the original package and do everything "by hand".

The first step is to create an object of the class `Trainer` and append our training sequences to it. We can then set some parameters for the training phase. Feel free to play with these, as they may improve the quality of the tagger.

In [None]:
import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(train_docs, train_labels):
    trainer.append(xseq, yseq)
    
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 200, 

    'feature.possible_transitions': True
})

Finally, we call the `train` method, which also saves the model to a file that we can later use to run inference on new sentences.

In [None]:
trainer.train('model/vuelax.crfsuite')
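
The model is written to the path given to `train`. Assuming the `model/` directory is not guaranteed to exist on a fresh checkout, the call may fail when it tries to write the file, so it can help to create the directory first. A minimal sketch using only the standard library:

```python
from pathlib import Path

model_path = Path("model/vuelax.crfsuite")

# Create the parent directory if it is missing; no error if it already exists
model_path.parent.mkdir(parents=True, exist_ok=True)

print(model_path.parent.is_dir())
```

With the directory in place, `trainer.train(str(model_path))` can be used instead of the hard-coded string.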

## Labelling "unseen" sequences

To label sequences that our algorithm did not see during training, we need an object of the `Tagger` class, into which we load our saved model using the `open` method.

In [None]:
crf_tagger = pycrfsuite.Tagger()
crf_tagger.open('model/vuelax.crfsuite')

Remember that each sentence needs to be processed into the format the tagger expects, that is, with the same features we used for training. We already have this in our `test_docs`, so we can use them directly:

In [None]:
test_docs[5][0]

In [None]:
predicted_tags = crf_tagger.tag(test_docs[2])
print("Predicted: ",predicted_tags)
print("Correct  : ",test_labels[2])
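
Comparing two printed lists by eye gets tedious for long sentences. A small helper can list only the positions where the prediction and the reference disagree. The function name `mismatches` and the tag sequences below are made up for illustration. In the notebook you would pass `predicted_tags` and `test_labels[2]` instead.

```python
def mismatches(predicted, correct):
    """Return (position, predicted_tag, correct_tag) for every disagreement."""
    return [
        (i, p, c)
        for i, (p, c) in enumerate(zip(predicted, correct))
        if p != c
    ]

# Hypothetical tag sequences of equal length
predicted = ["o", "s", "d", "p"]
correct = ["o", "s", "f", "p"]

print(mismatches(predicted, correct))  # [(2, 'd', 'f')]
```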

## Evaluating the tagger

While there may be better ways to evaluate the performance of the tagger, we'll use the traditional tools of a classification problem:

In [None]:
from sklearn.metrics import classification_report

all_true, all_pred = [], []

for i in range(len(test_docs)):
    all_true.extend(test_labels[i])
    all_pred.extend(crf_tagger.tag(test_docs[i]))
    
len(all_true), len(all_pred)

In [None]:
print(classification_report(all_true, all_pred))

In general terms, our tagger seems to be performing well. It seems to be struggling to find all the `f` tokens, but the rest, the ones we care the most about, are being correctly labelled.
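
As noted earlier, token-level metrics are not the only way to look at a tagger. One complementary number is the fraction of sequences that are labelled correctly from start to finish, since a single wrong tag can make an otherwise good sentence unusable. The sketch below is one possible way to compute it. The function name `sequence_accuracy` and the example tags are assumptions.

```python
def sequence_accuracy(true_sequences, predicted_sequences):
    """Fraction of sequences whose predicted tags all match the reference."""
    if not true_sequences:
        return 0.0
    exact = sum(
        1
        for true, pred in zip(true_sequences, predicted_sequences)
        if list(true) == list(pred)
    )
    return exact / len(true_sequences)

# Hypothetical reference and predicted tags for two sentences
true_seqs = [["o", "d"], ["s", "f", "p"]]
pred_seqs = [["o", "d"], ["s", "d", "p"]]

print(sequence_accuracy(true_seqs, pred_seqs))  # 0.5
```

In this toy example four out of five tokens are correct, yet only one of the two sentences is fully right. In the notebook, the inputs would be `test_labels` and the list of `crf_tagger.tag(doc)` results for each `doc` in `test_docs`.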