# Intro

This notebook explains how to load a fine-tuned MacBERTh model and use it for inference. It also evaluates the results of the model on a test dataset.

In [68]:
import tqdm
import torch
import pandas as pd
import numpy as np
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

from finetune import read_data, encode_data

# Load Model

Here, we load the model from the Hugging Face Hub. This model is a version of MacBERTh fine-tuned on a dataset of ing-forms to perform five-way classification.

In [130]:
m = AutoModelForSequenceClassification.from_pretrained('emanjavacas/MacBERTh-ing')
tokenizer = AutoTokenizer.from_pretrained('emanjavacas/MacBERTh-ing')

Given an input sentence such as the one below, the model can classify it as follows.

In [131]:
sentence = 'Church about half a mile from her house, being about twenty weeks gone ' + \
'with Child, and to her thinking very well and healthy, upon a sudden she was taken ' + \
'with great pains and miscarried before she came'

sentence

'Church about half a mile from her house, being about twenty weeks gone with Child, and to her thinking very well and healthy, upon a sudden she was taken with great pains and miscarried before she came'

In [132]:
data = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    output = m(**data)
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.9836,  7.5749, -1.4609, -2.8214, -2.4386]]), hidden_states=None, attentions=None)

In [136]:
prediction = np.argmax(output.logits.numpy())
m.config.id2label[prediction]

'NOMINAL-ING'
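
The raw logits can be turned into class probabilities with a softmax, which makes the model's confidence easier to read. The sketch below applies a plain-Python softmax to the logits printed above; the helper name `softmax_1d` is an assumption made for illustration and is not part of the notebook or of `transformers`.

```python
import math

def softmax_1d(logits):
    """Numerically stable softmax over a flat list of logits (illustrative helper)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the model output shown above
logits = [-1.9836, 7.5749, -1.4609, -2.8214, -2.4386]
probs = softmax_1d(logits)

# Index of the most probable class; this is the key looked up in m.config.id2label
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 4))
```

The index with the highest probability is the same one `np.argmax` returns on the logits, since softmax preserves their ordering.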

## Adding the markers

However, more than one ing-form may occur in a sentence. We therefore fine-tuned the model with markers that direct its attention to the token we want a prediction for.

The same markers must also be added at inference time, since omitting them has a large impact on the model's performance. The following example illustrates the divergent behaviour.

The markers are the `'[TGT]'` symbols placed before and after the target token.

In [128]:
sentence_raw = "Every Gay thing to be a Cavalier; Every Parish - Clerk to be a Doctor; " + \
"and Every writing Clerk in the Office must be call'd Mr Secretary. So that the whole " + \
"world, take it where you will"

data = tokenizer(sentence_raw, return_tensors='pt')
with torch.no_grad():
    output = m(**data)
prediction = np.argmax(output.logits.numpy())
m.config.id2label[prediction]

'NOUN'

In [129]:
sentence_marked = "Every Gay thing to be a Cavalier; Every Parish - Clerk to be a Doctor; " + \
"and Every [TGT] writing [TGT] Clerk in the Office must be call'd Mr Secretary. So that the whole " + \
"world, take it where you will"

data = tokenizer(sentence_marked, return_tensors='pt')
with torch.no_grad():
    output = m(**data)
prediction = np.argmax(output.logits.numpy())
m.config.id2label[prediction]

'PARTICIPLE'
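
To build marked inputs programmatically, the target can be wrapped in `[TGT]` symbols before the left and right context are joined back together. The following is a minimal sketch; `mark_target` is a hypothetical helper written for illustration and is not the `read_data` function imported from `finetune`, whose implementation is not shown here.

```python
def mark_target(lhs, target, rhs, sym='[TGT]'):
    """Join left context, target and right context, wrapping the target in `sym`.

    Hypothetical helper: whitespace is collapsed so the pieces are separated by single spaces.
    """
    pieces = [lhs, sym, target, sym, rhs]
    # split/join drops empty pieces and normalises whitespace in one pass
    return ' '.join(' '.join(pieces).split())

marked = mark_target("and Every", "writing", "Clerk in the Office must be call'd Mr Secretary.")
print(marked)
```

The resulting string can be passed to the tokenizer in the same way as `sentence_marked` above.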

# Load Dataset

Here we load a dataset whose sentences are in the same format that we used during fine-tuning. We use these sentences and their associated labels to evaluate the fine-tuned model.

In [5]:
test = pd.read_csv('./test_set_ing.EMMA.merged.csv')

In [50]:
test.head(n=1)

Unnamed: 0,index,author,year,docid,genre,lhs,match,rhs,level-1,level-2,level-3,comment,title,lhs-clean,rhs-clean,ing
0,617966,"Burnet, Gilbert",1681,10624521,prose,"that which remained of the Episcopal Dignity ,...",recovering,that which was lost . This was signified to Ca...,NOMINAL-ING,NO-OBJ-GERUND,G,,The history of the rights of princes in the di...,"that which remained of the Episcopal Dignity ,...",that which was lost . This was signified to Ca...,recovering


In [51]:
def normalise(example):
    return ' '.join(example.split())

lhs, target, rhs = 'lhs', 'match', 'rhs'
for heading in [lhs, target, rhs]:
    # replace NaNs with empty strings
    test[heading] = test[heading].fillna('')
    # collapse runs of whitespace into single spaces
    test[heading] = test[heading].transform(normalise)
    
# pull sentences together
sents, starts, ends = read_data(test[lhs], test[target], test[rhs], sym='[TGT]')
# transform into model input format
sents, spans = encode_data(tokenizer, sents, starts, ends)

# Gather predictions

In [71]:
batch_size = 40
preds, probs = [], []
# round up so that the final, partial batch is counted in the progress bar
n_batches = (len(sents) + batch_size - 1) // batch_size
for start in tqdm.tqdm(range(0, len(sents), batch_size), total=n_batches):
    with torch.no_grad():
        # input batch data
        input_data = tokenizer(sents[start:start+batch_size], return_tensors='pt', padding=True)
        # output from the model
        output = m(**input_data)
        # probabilities and predictions
        probs_ = softmax(output.logits.numpy(), axis=1)
        preds_ = np.argmax(probs_, axis=1)
        # map class indices to label names
        preds_ = [m.config.id2label[int(label_id)] for label_id in preds_]
        preds.extend(preds_)
        probs.extend(probs_)
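
Because 1579 is not a multiple of 40, the last batch is smaller than the others, and the number of batches has to be rounded up rather than down. The sketch below isolates the batching logic with a hypothetical `iter_batches` helper and dummy data, so it runs without the model.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` of at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Dummy stand-in for the encoded sentences; 1579 is the test set size reported in the evaluation
items = list(range(1579))
batches = list(iter_batches(items, 40))

# Ceiling division counts the final, partial batch as well
n_batches = math.ceil(len(items) / 40)
print(n_batches, len(batches), len(batches[-1]))
```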


In [73]:
from sklearn import metrics

In [74]:
print(metrics.classification_report(test['level-1'].values, preds))

              precision    recall  f1-score   support

        NAME       1.00      0.67      0.80         6
 NOMINAL-ING       0.94      0.69      0.80       657
        NOUN       0.72      0.84      0.78       160
  PARTICIPLE       0.76      0.91      0.83       729
        VERB       0.60      0.89      0.72        27

    accuracy                           0.81      1579
   macro avg       0.81      0.80      0.78      1579
weighted avg       0.83      0.81      0.81      1579
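
The two averaged rows differ in how they treat class size: the macro average gives every class equal weight, while the weighted average weights each class by its support. As a rough check, the sketch below recomputes both averages for recall from the per-class values in the report. Since the inputs are already rounded to two decimals, the results can only be expected to match the report up to rounding.

```python
# Per-class recall and support, copied from the report above
recall = {'NAME': 0.67, 'NOMINAL-ING': 0.69, 'NOUN': 0.84, 'PARTICIPLE': 0.91, 'VERB': 0.89}
support = {'NAME': 6, 'NOMINAL-ING': 657, 'NOUN': 160, 'PARTICIPLE': 729, 'VERB': 27}

total = sum(support.values())

# Macro average: unweighted mean over classes
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support (for recall this equals accuracy)
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(total, round(macro_recall, 2), round(weighted_recall, 2))
```

Small classes such as NAME (6 examples) move the macro average noticeably but barely affect the weighted one.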

