# (Semi-)Automatic Data Annotation

This notebook explains how to load a model with the Hugging Face Transformers library and use it to label examples from a corpus sample. It also contains code to evaluate the model's predictions on a test dataset.

To demonstrate the procedure, we use a case study of Modern English _ing_-forms (1500-1920) and load a fine-tuned model based on the historical English MacBERTh model. The case study is described in the following article:

[Insert reference here]

If you are new to Jupyter notebooks (and programming in Python), we recommend Chapters 1 and 2 of [this introduction to cultural analytics and Python by Melanie Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/01-Command-Line/01-The-Command-Line.html).


In [1]:
# We start by installing the required libraries
!python -m pip install datasets numpy pandas scikit-learn scipy torch tqdm transformers



In [2]:
# These are the packages and functions you need to import to run this notebook.
# Some packages are imported under a shorter alias.

import tqdm
import torch
import pandas as pd
import numpy as np
from scipy.special import softmax
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# read_data and encode_data are defined in the 'finetune.py' script in this repository
from finetune import read_data, encode_data

## Load Model

First, we load a model (and its accompanying tokenizer) from the Hugging Face Hub. The code below loads a version of the MacBERTh model that is fine-tuned on a dataset of _ing_-forms. To fine-tune your own model, please check the 'finetune.py' and 'finetune-cv.py' scripts in this repository.

In our case study, _ing_-forms are classified according to a custom classification scheme. The data set and classification procedure are described in [this article](www.article.com).

In [3]:
#The name of the model we load is 'emanjavacas/MacBERTh-ing'. 
#To load a different model, change the name between the single quotation marks.
m = AutoModelForSequenceClassification.from_pretrained('emanjavacas/MacBERTh-ing')
tokenizer = AutoTokenizer.from_pretrained('emanjavacas/MacBERTh-ing')

In the data set we use, English _ing_-forms are classified as one of five types. These include names and nouns that end in _ing_ (e.g. _Reading_, _something_), deverbal nouns (e.g. _the building_), participles (e.g. _I am working_) and verbs (e.g. _bring_, _sing_). The labels of the five-way classification scheme are given in the following list:

In [4]:
id2label = ['NAME', 'NOMINAL-ING', 'NOUN', 'PARTICIPLE', 'VERB']

To illustrate, we can use the fine-tuned model loaded above to classify an input sentence:

In [5]:
#First, we define a particular sentence (e.g. a randomly chosen example from a corpus)
sentence = 'Church about half a mile from her house, being about twenty weeks gone ' + \
'with Child, and to her thinking very well and healthy, upon a sudden she was taken ' + \
'with great pains and miscarried before she came'

sentence

'Church about half a mile from her house, being about twenty weeks gone with Child, and to her thinking very well and healthy, upon a sudden she was taken with great pains and miscarried before she came'

In [6]:
#Then we process the sentence
data = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
    output = m(**data)
output

SequenceClassifierOutput(loss=None, logits=tensor([[-1.9836,  7.5749, -1.4609, -2.8214, -2.4386]]), hidden_states=None, attentions=None)

In [7]:
#And finally we obtain a predicted label
prediction = np.argmax(output.logits.numpy())
id2label[prediction]

'NOMINAL-ING'
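
The raw logits only give a ranking; a softmax turns them into probabilities, which shows how confident the model is in its choice. The following is a minimal sketch in plain Python, using the logits printed above rather than the model itself, so that it runs without downloading anything. The `softmax` function here is a hand-written stand-in for illustration; the evaluation loop later in this notebook uses `scipy.special.softmax` for the same purpose.

```python
import math

id2label = ['NAME', 'NOMINAL-ING', 'NOUN', 'PARTICIPLE', 'VERB']
# Logits copied from the model output shown above
logits = [-1.9836, 7.5749, -1.4609, -2.8214, -2.4386]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Sort the labels from most to least probable
ranked = sorted(zip(id2label, probs), key=lambda pair: pair[1], reverse=True)
for label, prob in ranked:
    print(f'{label:<12} {prob:.4f}')
```

For this sentence almost all of the probability mass falls on `NOMINAL-ING`, because its logit is far larger than the other four.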

## Adding target markers

However, since more than one _ing_-form may occur in a sentence, the model was fine-tuned with markers that direct its attention to the token we want a prediction for.

The markers must be used consistently: the model saw them during fine-tuning, so they must also be present at prediction time, since omitting them has a big impact on the model's performance. The following example illustrates the divergent behaviour.

The markers are the `'[TGT]'` symbols before and after the target token.

In [8]:
sentence_raw = "Every Gay thing to be a Cavalier; Every Parish - Clerk to be a Doctor; " + \
"and Every writing Clerk in the Office must be call'd Mr Secretary. So that the whole " + \
"world, take it where you will"

data = tokenizer(sentence_raw, return_tensors='pt')
with torch.no_grad():
    output = m(**data)
prediction = np.argmax(output.logits.numpy())
id2label[prediction]

'NOUN'

In [9]:
sentence_marked = "Every Gay thing to be a Cavalier; Every Parish - Clerk to be a Doctor; " + \
"and Every [TGT] writing [TGT] Clerk in the Office must be call'd Mr Secretary. So that the whole " + \
"world, take it where you will"

data = tokenizer(sentence_marked, return_tensors='pt')
with torch.no_grad():
    output = m(**data)
prediction = np.argmax(output.logits.numpy())
id2label[prediction]

'PARTICIPLE'
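
When the left context, target and right context are stored separately (as in the test set used below), the marked sentence can be assembled programmatically instead of by hand. The following is a minimal sketch; `mark_target` is a hypothetical helper written for this illustration and is not part of the repository, whose own `read_data` and `encode_data` functions may handle the markers differently.

```python
def mark_target(lhs, match, rhs, marker='[TGT]'):
    """Wrap the target token in markers and join it with its context.

    Hypothetical helper for illustration only.
    """
    parts = [lhs.strip(), marker, match.strip(), marker, rhs.strip()]
    # Skip empty context strings so that no double spaces are produced
    return ' '.join(part for part in parts if part)

marked = mark_target('and Every', 'writing', 'Clerk in the Office')
print(marked)
```

The resulting string has the same shape as `sentence_marked` above and can be passed to the tokenizer in the same way.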

# Evaluate predictions



## Load Dataset

Here we load a dataset whose sentences are in the same format as those used during fine-tuning. We will use these sentences and the associated labels to evaluate the fine-tuned model.

In [10]:
test = pd.read_csv('./test_set_ing.EMMA.merged.csv')

In [11]:
test.head(n=1)

| Unnamed: 0 | index | author | year | docid | genre | lhs | match | rhs | level-1 | level-2 | level-3 | comment | title | lhs-clean | rhs-clean | ing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 617966 | Burnet, Gilbert | 1681 | 10624521 | prose | that which remained of the Episcopal Dignity ... | recovering | that which was lost . This was signified to C... | NOMINAL-ING | NO-OBJ-GERUND | G | | The history of the rights of princes in the di... | that which remained of the Episcopal Dignity ,... | that which was lost . This was signified to Ca... | recovering |


In [17]:
def normalise(example):
    return ' '.join(example.split())

lhs, target, rhs = 'lhs', 'match', 'rhs'
for heading in [lhs, target, rhs]:
    # replace NaNs with empty strings
    test[heading] = test[heading].fillna('')
    # collapse runs of whitespace into single spaces
    test[heading] = test[heading].transform(normalise)
    
# pull sentences together
sents, starts, ends = read_data(test[lhs], test[target], test[rhs])
# transform into model input format
sents, spans = encode_data(tokenizer, sents, starts, ends)
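
The `read_data` and `encode_data` helpers are defined in `finetune.py` and are not shown in this notebook. To make the idea concrete, here is a minimal stand-in under the assumption that the three columns are joined into one sentence and that the target is tracked by its character offsets. The function name `join_context` and its exact behaviour are illustrative assumptions, not the repository's implementation.

```python
def normalise(example):
    # Same normalisation as in the cell above: collapse all whitespace runs
    return ' '.join(example.split())

def join_context(lhs, match, rhs):
    """Hypothetical stand-in: build one sentence and locate the target in it."""
    lhs, match, rhs = normalise(lhs), normalise(match), normalise(rhs)
    prefix = lhs + ' ' if lhs else ''
    start = len(prefix)          # character offset where the target begins
    end = start + len(match)     # character offset just past the target
    sent = prefix + match + (' ' + rhs if rhs else '')
    return sent, start, end

sent, start, end = join_context('that  which remained ', 'recovering', ' that which was lost .')
print(sent)
print(sent[start:end])
```

Keeping the offsets alongside the sentence means the target can still be found when the same word form occurs more than once in the context.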

## Gather predictions

In [18]:
batch_size = 40
preds, probs = [], []
# number of batches, rounded up so that the final partial batch is counted
n_batches = (len(sents) + batch_size - 1) // batch_size
for start in tqdm.tqdm(range(0, len(sents), batch_size), total=n_batches):
    with torch.no_grad():
        # input batch data
        input_data = tokenizer(sents[start:start+batch_size], return_tensors='pt', padding=True)
        # output from the model
        output = m(**input_data)
        # probabilities and predictions
        logits = output.logits.numpy()
        probs_ = softmax(logits, axis=1)
        preds_ = np.argmax(probs_, axis=1)
        preds_ = [id2label[i] for i in preds_]
        preds.extend(preds_)
        probs.extend(probs_)

40it [02:53,  4.35s/it]                        
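
The progress line reports 40 iterations: the 1579 test examples (the support total in the classification report) split into 39 full batches of 40 plus one final batch of 19. Counting batches therefore requires rounding up rather than down. The arithmetic can be checked with a small standalone sketch; the `batched` helper and the placeholder items are illustrative and not part of the notebook.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

items = list(range(1579))   # placeholder for the 1579 test sentences
batch_size = 40

n_floor = len(items) // batch_size            # misses the final partial batch
n_ceil = math.ceil(len(items) / batch_size)   # counts every batch
sizes = [len(batch) for batch in batched(items, batch_size)]

print(n_floor, n_ceil, len(sizes), sizes[-1])
```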


In [19]:
from sklearn import metrics

In [20]:
print(metrics.classification_report(test['level-1'].values, preds))

              precision    recall  f1-score   support

        NAME       1.00      0.67      0.80         6
 NOMINAL-ING       0.94      0.69      0.80       657
        NOUN       0.72      0.84      0.78       160
  PARTICIPLE       0.76      0.91      0.83       729
        VERB       0.60      0.89      0.72        27

    accuracy                           0.81      1579
   macro avg       0.81      0.80      0.78      1579
weighted avg       0.83      0.81      0.81      1579
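
In this report, precision is the share of predictions for a label that are correct, and recall is the share of gold instances of that label that the model finds. `NOMINAL-ING`, for example, has high precision (0.94) but lower recall (0.69): the model is rarely wrong when it predicts this label, but it misses roughly a third of the true cases. `PARTICIPLE` shows the opposite pattern (0.76 precision, 0.91 recall). The sketch below recomputes the three per-label scores by hand on a small invented example; the `gold` and `pred` lists are toy data, not results from the model.

```python
def precision_recall_f1(gold, pred, label):
    """Compute precision, recall and F1 for one label from paired lists."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correct hits
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false alarms
    fn = sum(1 for g, p in pairs if g == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data for illustration only
gold = ['NOUN', 'NOUN', 'PARTICIPLE', 'PARTICIPLE', 'PARTICIPLE', 'VERB']
pred = ['NOUN', 'PARTICIPLE', 'PARTICIPLE', 'PARTICIPLE', 'NOUN', 'VERB']

for label in ['NOUN', 'PARTICIPLE', 'VERB']:
    p, r, f = precision_recall_f1(gold, pred, label)
    print(f'{label:<12} precision={p:.2f} recall={r:.2f} f1={f:.2f}')
```

The `macro avg` row of the report is the unweighted mean of the per-label scores, while `weighted avg` weights each label by its support, so frequent labels such as `PARTICIPLE` and `NOMINAL-ING` dominate it.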

