# Barts BIU - BLS Natural Language Processing Tutorial for Fine-tuning a Pretrained Model

The following tutorial will tackle three types of NLP problem:

1. Text Classification: Classifying an entire document into different categories.
2. Named Entity Recognition: Labelling specific parts of text with relevant tags.
3. Relation Extraction: Identifying if two or more tags are related to each other.

The tutorial aims to demonstrate how we can use existing models and methods, and fine-tune them to work on our own data. The methods we will use are primarily from Hugging Face (https://huggingface.co/).

## Install Dependencies for the Tutorials

In [None]:
!pip install datasets
!pip install transformers==4.28.0
!pip install evaluate
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.0/7.0 MB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.29.2
    Uninstalling transformers-4.29.2:
      Successfully uninstalled transformers-4.29.2
Successfully installed transformers-4.28.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from IPython.display import clear_output

## Text Classification

Text classification, also known as text categorization, is a natural language processing (NLP) technique that involves categorizing text into predefined categories or classes based on its content. The goal of text classification is to automatically classify a piece of text into one or more of the predefined categories, such as topics, genres, sentiment, or intent.

In this tutorial we will apply the pretrained *bert-base-cased* model to classify Yelp reviews into positive and negative classes.

**Loading in our data**

Hugging Face hosts many annotated datasets that can be accessed via the *datasets* library using the function *load_dataset*. This function returns the dataset in the form of a dictionary, already split into train and test sets. When applying pretrained models to your own data, it is important to put that data into the same format.

The code to achieve this for the Yelp dataset, starting from a labelled .csv file, is provided below.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import DatasetDict, Dataset

1. Load your dataset

In [2]:
yelp_df = pd.read_csv('yelp_review_sample.csv')
yelp_df.head()

Unnamed: 0,label,text
0,0,I got 'new' tires from them and within two wee...
1,0,Don't waste your time. We had two different p...
2,0,All I can say is the worst! We were the only 2...
3,0,I have been to this restaurant twice and was d...
4,0,Food was NOT GOOD at all! My husband & I ate h...


2. Then split your data into a training and testing set.

In [24]:
train, test = train_test_split(yelp_df, stratify=yelp_df['label'], test_size=0.20, random_state=42)
train.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

3. Then create your Hugging Face Dataset Dictionary

In [25]:
dataset = DatasetDict()
dataset['train'] = Dataset.from_pandas(df=train)
dataset['test'] = Dataset.from_pandas(df=test)

In [26]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1306
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 327
    })
})


**Pre-processing the data**

The models we will be using are Transformers. Transformers are designed to process sequential data, such as free-text, and are particularly effective at capturing long-range dependencies and relationships between words.

Before we can fine-tune a model on our dataset, it needs to be preprocessed into the expected model input format, then converted and assembled into batches of tensors.

As we are dealing with free-text data, we will use a Tokenizer to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.

In [27]:
from transformers import AutoTokenizer

1. Load the relevant tokenizer

In [28]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

2. Tokenize both the train and test set

The documents in our dataset will not all be the same length, which is a problem because our model requires the inputs to have a uniform shape. Therefore, we will pad our tensors by setting *padding* to *max_length*.

On the flip-side, some documents may be too long, so we will need to shorten them to the maximum length that the model accepts. We will therefore set *truncation* to *True*.
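To make the effect of these two options concrete, here is a toy sketch in plain Python. It is not the Hugging Face implementation: the helper name `pad_or_truncate`, the maximum length of 6 and the token ids are assumptions chosen for illustration, and the real tokenizer also keeps the closing special token when it truncates.

```python
# Toy illustration of padding and truncation (not the Hugging Face implementation).
PAD_ID = 0        # assumed padding id for this sketch
MAX_LENGTH = 6    # assumed maximum length, kept tiny so the output is readable


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop anything past max_length
    n_pad = max_length - len(kept)           # padding: how many slots are still empty
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102])
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102])

print(short_ids, short_mask)  # [101, 7592, 102, 0, 0, 0] [1, 1, 1, 0, 0, 0]
print(long_ids, long_mask)    # [101, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```

The attention mask is what lets the model ignore the padded positions.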

In [29]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [30]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/1306 [00:00<?, ? examples/s]

Map:   0%|          | 0/327 [00:00<?, ? examples/s]

In [31]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1306
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 327
    })
})

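For the purposes of this tutorial, we shuffle our dataset. To speed up training further, a smaller subset can be taken from the shuffled data with the *select* method (for example *.select(range(500))*).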

In [32]:
reduced_train_dataset = tokenized_datasets["train"].shuffle(seed=42)
reduced_test_dataset = tokenized_datasets["test"].shuffle(seed=42)

**Fine-tune the model**

We are now ready to train our model. We will use Hugging Face's *Trainer* class so we do not have to write our own training loops.

N.B. When running this tutorial, it is advised to run on a GPU to speed up training. This can be done using services like Google Colab. However, **do NOT use Colab when using real patient data!**

In [33]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

1. Load the appropriate AutoModel and set the number of labels

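In this example we are doing sequence classification (text classification), so we use the *AutoModelForSequenceClassification* class. We set *num_labels* to 5. Note that *num_labels* must be greater than the largest label id in the data, and that the check below reports only 4 distinct labels in our sample, so verify the label values in your own dataset before choosing this number.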

In [34]:
print('N Classes: ', yelp_df['label'].nunique())

N Classes:  4
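The count above is the number of distinct labels, which is not always the same as the number of output units the model needs. A small sketch with a hypothetical label list (not the tutorial's data) shows how the two can differ:

```python
# Hypothetical label ids for illustration only: class 3 never occurs in this sample.
example_labels = [0, 1, 2, 4, 4, 0, 1]

n_distinct = len(set(example_labels))      # what nunique() would report
min_num_labels = max(example_labels) + 1   # label ids index the output units, starting at 0

print("Distinct labels:", n_distinct)          # Distinct labels: 4
print("Minimum num_labels:", min_num_labels)   # Minimum num_labels: 5
```

If the label ids in your own data are contiguous and start at 0, the two numbers agree.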


In [35]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
clear_output()

2. Load the relevant evaluation functions

Because the *Trainer* class does not automatically evaluate model performance, we must pass a function to compute and report the classification metrics.

In [36]:
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
precision_metric = evaluate.load("precision")
recall_metric = evaluate.load("recall")

In [37]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)

    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='weighted')
    precision = precision_metric.compute(predictions=predictions, references=labels, average='weighted')
    recall = recall_metric.compute(predictions=predictions, references=labels, average='weighted')

    return {"accuracy": accuracy['accuracy'], "f1": f1['f1'], "precision": precision['precision'], "recall": recall['recall']}
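To see what the argmax step and the *weighted* averaging do, the calculation can be reproduced by hand on a tiny example. This is a plain-Python sketch with made-up logits and labels; the helper names `argmax` and `weighted_f1` are ours, and the *evaluate* metrics above should be preferred in practice.

```python
from collections import Counter


def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def weighted_f1(predictions, references):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(references)
    total = 0.0
    for cls, n in support.items():
        tp = sum(p == cls and r == cls for p, r in zip(predictions, references))
        fp = sum(p == cls and r != cls for p, r in zip(predictions, references))
        fn = sum(p != cls and r == cls for p, r in zip(predictions, references))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(references)


# Made-up scores for 4 documents and 3 classes.
logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.8, 0.1], [0.1, 0.2, 3.0]]
labels = [0, 1, 1, 2]

predictions = [argmax(row) for row in logits]
accuracy = sum(p == r for p, r in zip(predictions, labels)) / len(labels)
f1 = weighted_f1(predictions, labels)

print(predictions)  # [0, 1, 0, 2]
print(accuracy)     # 0.75
print(f1)
```

Weighting by support means that frequent classes contribute more to the reported score than rare ones.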

3. Create the Trainer object with the relevant functions

Because we want to monitor our evaluation metrics during the fine-tuning stage, we will use the *TrainingArguments* class and set *evaluation_strategy* to *epoch*.

In [38]:
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch", num_train_epochs=5)

In [39]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=reduced_train_dataset,
    eval_dataset=reduced_test_dataset,
    compute_metrics=compute_metrics)

In [40]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.511331,0.831804,0.829597,0.839706,0.831804
2,No log,0.551838,0.810398,0.814154,0.830181,0.810398
3,No log,0.675887,0.834862,0.836891,0.844825,0.834862
4,0.451400,0.850641,0.83792,0.838859,0.841205,0.83792
5,0.451400,0.909993,0.840979,0.842231,0.845687,0.840979


TrainOutput(global_step=820, training_loss=0.3042308574769555, metrics={'train_runtime': 638.8797, 'train_samples_per_second': 10.221, 'train_steps_per_second': 1.283, 'total_flos': 1718161470289920.0, 'train_loss': 0.3042308574769555, 'epoch': 5.0})
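The step count reported in the training output can be reproduced with a little arithmetic. The sketch below assumes a single device, no gradient accumulation and a per-device batch size of 8 (the value reported in the NER training log later in this tutorial); the helper name is ours.

```python
import math


def total_optimization_steps(num_examples, num_epochs, batch_size=8):
    """Number of optimizer updates when the last, smaller batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


text_classification_steps = total_optimization_steps(num_examples=1306, num_epochs=5)
print(text_classification_steps)  # 820, matching global_step above
```

With 164 steps per epoch, step 500 is first reached during epoch 4. If the training loss is only logged every 500 steps (which we believe is the default *logging_steps*), that would explain the "No log" entries for epochs 1 to 3.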


## Named Entity Recognition

Named Entity Recognition (NER) is a subfield of Natural Language Processing (NLP) that involves identifying and extracting entities from text that are named and refer to specific objects, places, people, dates, organizations, etc.

In this tutorial we will apply *bert-base-cased* to the *CoNLL-2003* dataset.

**Loading in our data**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import DatasetDict, Dataset

1. Load in your data

In [None]:
raw_dataset = pd.read_csv('conll2003.csv')
# Strip the square brackets and quotes from the stored lists.
# '[' and ']' are escaped because they are special characters in a regular expression.
raw_dataset['tokens'] = raw_dataset['tokens'].str.replace(r"[\[\]']", "", regex=True)
raw_dataset['ner_tags'] = raw_dataset['ner_tags'].str.replace(r"[\[\]']", "", regex=True)
raw_dataset.head()

Unnamed: 0,tokens,ner_tags
0,"SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRI...",0 0 5 0 0 0 0 1 0 0 0 0
1,Nadim Ladki,1 2
2,"AL-AIN , United Arab Emirates 1996-12-06",5 0 5 6 6 0
3,Japan began the defence of their Asian Cup tit...,5 0 0 0 0 0 7 8 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0
4,But China saw their luck desert them in the se...,0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0


In [None]:
tokens_list = [tokens.split() for tokens in list(raw_dataset['tokens'].values)]
tags_list = [list(map(int, tags.split())) for tags in raw_dataset['ner_tags'].values]
data_list = [[token,tag] for token,tag in zip(tokens_list,tags_list)]
dataset = pd.DataFrame(data=data_list, columns=['tokens','ner_tags'])

Each token has an associated numerical value, which corresponds to one of the 9 labels below. 'B-' at the beginning of a label indicates that the token starts an entity; 'I-' indicates that it is an intermediate or end token of the entity. 'O' marks tokens that are outside any entity.

In [None]:
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
words = dataset['tokens'][0]
labels = dataset['ner_tags'][0]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

SOCCER - JAPAN GET LUCKY WIN , CHINA IN SURPRISE DEFEAT . 
O      O B-LOC O   O     O   O B-PER O  O        O      O 
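The B-/I- scheme also lets us group tokens back into whole entities. The following is a minimal sketch of that grouping, applied to the third row of the table above; the helper name `extract_entities` is ours and is not part of any library.

```python
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


def extract_entities(tokens, tag_ids):
    """Group tokens into (text, type) entities using their B-/I-/O tags."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag_id in zip(tokens, tag_ids):
        tag = label_names[tag_id]
        continues = tag.startswith("I-") and tag[2:] == current_type
        if continues:
            current_tokens.append(token)
            continue
        # Anything else closes the entity that was open, if there was one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if tag != "O":
            # A B- tag (or a stray I- tag) opens a new entity.
            current_tokens, current_type = [token], tag[2:]
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["AL-AIN", ",", "United", "Arab", "Emirates", "1996-12-06"]
tag_ids = [5, 0, 5, 6, 6, 0]
entities = extract_entities(tokens, tag_ids)
print(entities)  # [('AL-AIN', 'LOC'), ('United Arab Emirates', 'LOC')]
```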


2. Then split your data into a training and testing set.

In [None]:
train, test = train_test_split(dataset, test_size=0.10, random_state=42)
train.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

3. Create your Dataset Dictionary

In [None]:
dataset = DatasetDict()
dataset['train'] = Dataset.from_pandas(df=train)
dataset['test'] = Dataset.from_pandas(df=test)

**Pre-Processing the Data**

In [None]:
from transformers import AutoTokenizer

As before, we will be using the *AutoTokenizer* class. Because our NER data is already split into words, with one tag per word, we must flag that to the tokenizer by setting the input *is_split_into_words* to *True*.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.26.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-cased/snapshots/5532cc56f74641d4bb33641f5c76a55d11f846e0/vocab.txt
loading file tokenize

Example:

In [None]:
inputs = tokenizer(dataset["test"][1]["tokens"], is_split_into_words=True)
print('Original Tokens: ', dataset["test"][1]["tokens"])
print('Tokenized: ', inputs.tokens())

Original Tokens:  ['freestyle', 'skiing', 'moguls', 'competition', 'on', 'Friday', ':']
Tokenized:  ['[CLS]', 'freestyle', 'skiing', 'm', '##og', '##ul', '##s', 'competition', 'on', 'Friday', ':', '[SEP]']


The tokenizer has added the special tokens *'[CLS]'* and *'[SEP]'* at the beginning and end, and left the majority of words untouched. However, the word *'moguls'* has been split into four sub-word tokens: *'m'*, *'##og'*, *'##ul'*, *'##s'*. This leaves us with fewer tags than tokens, so we need to correct that using the following function:
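For intuition about where such a split comes from, here is a toy sketch of greedy longest-match-first sub-word splitting, the idea behind BERT's WordPiece tokenizer. The vocabulary below is a tiny hypothetical one chosen to reproduce the split above; the real *bert-base-cased* vocabulary has 28,996 entries, and the real tokenizer does more (for example, punctuation handling).

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # '##' marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return [unk]                       # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary for illustration only.
toy_vocab = {"competition", "on", "m", "##og", "##ul", "##s"}

moguls_pieces = wordpiece("moguls", toy_vocab)
print(moguls_pieces)                          # ['m', '##og', '##ul', '##s']
print(wordpiece("competition", toy_vocab))    # ['competition']
```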

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

Example:

In [None]:
labels = dataset["test"][1]["ner_tags"]
word_ids = inputs.word_ids()
print('Original Tokens: ', dataset["test"][1]["tokens"])
print('Original Tags: ', labels)
print('Tokenized: ', inputs.tokens())
print('Aligned Tags: ', align_labels_with_tokens(labels, word_ids))

Original Tokens:  ['freestyle', 'skiing', 'moguls', 'competition', 'on', 'Friday', ':']
Original Tags:  [0, 0, 0, 0, 0, 0, 0]
Tokenized:  ['[CLS]', 'freestyle', 'skiing', 'm', '##og', '##ul', '##s', 'competition', 'on', 'Friday', ':', '[SEP]']
Aligned Tags:  [-100, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]
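The example above contains only 'O' tags, so it does not show the B- to I- conversion. The sketch below repeats the function so that it runs on its own and applies it to the tags of 'Nadim Ladki' (1 = B-PER, 2 = I-PER). The sub-word split and the resulting `word_ids` are hypothetical, written by hand rather than produced by the tokenizer.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word (or the closing special token)
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token: B-XXX (odd id) becomes I-XXX (next even id)
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical split: [CLS] Na ##dim La ##d ##ki [SEP]
example_word_ids = [None, 0, 0, 1, 1, 1, None]
example_labels = [1, 2]  # B-PER, I-PER

aligned = align_labels_with_tokens(example_labels, example_word_ids)
print(aligned)  # [-100, 1, 2, 2, 2, 2, -100]
```

Only the first piece of 'Nadim' keeps the B-PER tag; every other piece of the name is tagged I-PER, and the special tokens receive -100 so that they are ignored by the loss.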


We can now apply this process to the whole dataset.

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names)

Map:   0%|          | 0/3107 [00:00<?, ? examples/s]

Map:   0%|          | 0/346 [00:00<?, ? examples/s]

**Fine-tune the model**

In [None]:
from transformers import DataCollatorForTokenClassification, AutoModelForTokenClassification, TrainingArguments, Trainer
import numpy as np
import evaluate

To collate our examples into batches, ensuring that the tokens and the labels are padded in the same way, we use the *DataCollatorForTokenClassification* class.

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
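Conceptually, the collator pads every example in a batch to the length of the longest one, filling the inputs with the padding token and the labels with -100. The following plain-Python sketch illustrates that idea under those assumptions; the helper name `pad_batch` and the token ids are made up, and the real collator returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids and labels to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Labels are padded with -100 so the padded positions are ignored by the loss.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Two made-up examples of different lengths.
features = [
    {"input_ids": [101, 11, 12, 13, 102], "labels": [-100, 1, 2, 0, -100]},
    {"input_ids": [101, 21, 102], "labels": [-100, 5, -100]},
]

batch = pad_batch(features)
print(batch["input_ids"][1])       # [101, 21, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0]
print(batch["labels"][1])          # [-100, 5, -100, -100, -100]
```

Padding per batch, rather than to a fixed maximum length, keeps batches of short sentences small.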

To evaluate the performance of our NER tool, we will use the traditional framework, *seqeval*, loaded via the *evaluate* library.

In [None]:
metric = evaluate.load("seqeval")

In [None]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
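The precision, recall and F1 reported by *seqeval* are computed over whole entities, whereas the accuracy is computed over individual tokens. The sketch below is a simplified, single-sentence illustration of that difference, not the *seqeval* implementation; the helper names and the example tags are ours.

```python
def entity_spans(tags):
    """Return the set of (type, start, end) spans in one BIO tag sequence."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):      # the extra 'O' closes a trailing entity
        if tag.startswith("I-") and tag[2:] == ent_type:
            continue                            # still inside the current entity
        if ent_type is not None:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag != "O":
            start, ent_type = i, tag[2:]        # a B- tag (or stray I- tag) opens an entity
    return spans


def entity_prf(pred_tags, true_tags):
    """Entity-level precision, recall and F1: a span counts only if it matches exactly."""
    pred, true = entity_spans(pred_tags), entity_spans(true_tags)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(true) if true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred_tags = ["B-PER", "O", "O", "B-LOC", "O"]   # the second token of the person is missed

precision, recall, f1 = entity_prf(pred_tags, true_tags)
token_accuracy = sum(p == t for p, t in zip(pred_tags, true_tags)) / len(true_tags)

print(precision, recall, f1)  # 0.5 0.5 0.5
print(token_accuracy)         # 0.8
```

A single wrong token makes the whole person entity count as an error, which is why entity-level scores are usually lower than token-level accuracy.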

We are now ready to start defining our model.

In [None]:
#define our maps from label-id to label-text eg: 1 = B-PER
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    id2label=id2label,
    label2id=label2id,
)
clear_output()

As before, we will use the training arguments to manage the optimisation.

In [None]:
args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Again, we will use Hugging Face's *Trainer* class to manage the training and evaluation of our model.

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 3107
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1167
  Number of trainable parameters = 107726601
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.13444,0.851648,0.877358,0.864312,0.969024
2,0.243700,0.108022,0.865699,0.9,0.882516,0.970483
3,0.071000,0.108783,0.878843,0.916981,0.897507,0.973727


***** Running Evaluation *****
  Num examples = 346
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-389
Configuration saved in bert-finetuned-ner/checkpoint-389/config.json
Model weights saved in bert-finetuned-ner/checkpoint-389/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-389/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-389/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 346
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-778
Configuration saved in bert-finetuned-ner/checkpoint-778/config.json
Model weights saved in bert-finetuned-ner/checkpoint-778/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-778/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-778/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 346
  Batch size = 8
Saving model checkpoint to bert-fin

TrainOutput(global_step=1167, training_loss=0.14243843663729658, metrics={'train_runtime': 149.3065, 'train_samples_per_second': 62.429, 'train_steps_per_second': 7.816, 'total_flos': 204725185116750.0, 'train_loss': 0.14243843663729658, 'epoch': 3.0})