# Finetuning GLiNER

## Introduction

This notebook shows the process of fine-tuning a model from the GLiNER family on a custom dataset for improved domain-specific performance. It was drawn up as part of the [*Congruence Engine*](https://www.sciencemuseumgroup.org.uk/projects/the-congruence-engine) project (2021-4) at the Science Museum Group.

In this notebook, we will:

*   **Load** a pre-prepared dataset of synthetic (LLM-generated) texts containing named entities drawn from a sample of 19th- and 20th-century textile industry glossaries.
*   **Merge** this dataset with a sample drawn from the Pile-NER dataset, in order to reduce the risk of overfitting.
*   **Fine-tune** a GLiNER model using the resulting merged dataset.
*   **Evaluate** the fine-tuned model.
*   **Share** the model on the Hugging Face Hub.

This notebook was prepared using Google Colab. Parts of the notebook were adapted with the help of ChatGPT.

You can find out more about *Congruence Engine*'s experiments with NER for cultural heritage by visiting the [GitHub repository](https://github.com/congruence-engine/universal-ner-with-gliner/tree/main) for this investigation. For further information on the GLiNER family of models, please refer to the [documentation](https://github.com/urchade/GLiNER).




In [None]:
! pip install transformers==4.41
! pip install gliner
! pip install accelerate -U
! pip install datasets

## **1. Merging with Pile-NER**

To avoid 'overfitting' the fine-tuned model to the domain-specific data, it is a good idea to merge the training data with a sample of the data used to train the original model. In this case we will be using the [Pile-NER](https://huggingface.co/datasets/Universal-NER/Pile-NER-type) dataset that was used to train [GLiNER](https://github.com/urchade/GLiNER).

**Import modules**

In [None]:
from datasets import load_dataset
import json
import re
import ast
from tqdm import tqdm

**Load a subset from the Pile-NER dataset**

In [None]:
# Load the first 4,000 examples from the dataset, from a total of 45,900
dataset = load_dataset("Universal-NER/Pile-NER-type", split="train[:4000]")

**Process the dataset into the correct format for GLiNER finetuning.**

This requires converting the data into the following format:

{"tokenized_text":["This", "is", "a", "tokenized", "text", "example"], "ner": [[0,0, "pronoun"], [3,3, "adjective"]]}
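
Each `ner` entry is `[start, end, label]`, where `start` and `end` are token positions and the end position is inclusive, so `[3, 3, "adjective"]` covers only the token "tokenized". A minimal sketch for reading an example back into text, using a hypothetical helper `decode_spans` that is not part of GLiNER:

```python
def decode_spans(example):
    """Return (entity text, label) pairs for one GLiNER-format example."""
    tokens = example["tokenized_text"]
    # The end index is inclusive, hence the +1 when slicing.
    return [(" ".join(tokens[start:end + 1]), label)
            for start, end, label in example["ner"]]

example = {
    "tokenized_text": ["This", "is", "a", "tokenized", "text", "example"],
    "ner": [[0, 0, "pronoun"], [3, 3, "adjective"]],
}
print(decode_spans(example))
```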

In [None]:
def tokenize_text(text):
    """Tokenizes the input text into a list of tokens."""
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)

def extract_entity_spans(entry):
    """Extracts entity spans from an entry."""
    len_start = len("What describes ")
    len_end = len(" in the text?")
    entity_types, entity_texts, negative = [], [], []

    for c in entry['conversations']:
        if c['from'] == 'human' and c['value'].startswith('Text: '):
            text = c['value'][len('Text: '):]
            tokenized_text = tokenize_text(text)
        elif c['from'] == 'human' and c['value'].startswith('What describes '):
            entity_type = c['value'][len_start:-len_end]
            entity_types.append(entity_type)
        elif c['from'] == 'gpt' and c['value'].startswith('['):
            if c['value'] == '[]':
                negative.append(entity_types.pop())
                continue
            texts_ents = ast.literal_eval(c['value'])
            entity_texts.extend(texts_ents)
            num_repeat = len(texts_ents) - 1
            entity_types.extend([entity_types[-1]] * num_repeat)

    entity_spans = []
    for j, entity_text in enumerate(entity_texts):
        entity_tokens = tokenize_text(entity_text)
        matches = []
        for i in range(len(tokenized_text) - len(entity_tokens) + 1):
            if " ".join(tokenized_text[i:i + len(entity_tokens)]).lower() == " ".join(entity_tokens).lower():
                matches.append((i, i + len(entity_tokens) - 1, entity_types[j]))
        if matches:
            entity_spans.extend(matches)

    return {"tokenized_text": tokenized_text, "ner": entity_spans, "negative": negative}

def process_data(data):
    """Processes a list of data entries to extract entity spans."""
    all_data = [extract_entity_spans(entry) for entry in tqdm(data)]
    return all_data

def save_data_to_file(data, filepath):
    """Saves the processed data to a JSON file."""
    with open(filepath, 'w') as f:
        json.dump(data, f)

if __name__ == "__main__":
    # Convert the Pile-NER subset loaded above into GLiNER's format
    processed_data = process_data(dataset)
    save_data_to_file(processed_data, 'pilener_train.json')

    print("dataset size:", len(processed_data))
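
To see what the conversion does, the sketch below applies the same tokenizer pattern and the same idea of case-insensitive matching to a single sentence. The helper `find_spans` and the sample sentence are illustrative only and are not part of the notebook's pipeline.

```python
import re

def tokenize_text(text):
    # Same pattern as above: words (optionally joined by - or _) or single symbols.
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)

def find_spans(tokens, entity_text, label):
    """Find every case-insensitive occurrence of entity_text as a token span."""
    entity_tokens = [t.lower() for t in tokenize_text(entity_text)]
    lowered = [t.lower() for t in tokens]
    n = len(entity_tokens)
    # Spans are (start, end, label) with an inclusive end index.
    return [(i, i + n - 1, label)
            for i in range(len(tokens) - n + 1)
            if lowered[i:i + n] == entity_tokens]

tokens = tokenize_text("Scotch feed improved the mill; the scotch feed was cheap.")
spans = find_spans(tokens, "scotch feed", "textile machinery")
print(tokens)
print(spans)
```

Note that every occurrence of the entity text is labelled, including the capitalised one, because the comparison ignores case.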

**Save the subset as a JSON file**

In [None]:
output_path = '/content/pile-ner-gliner.json'

with open(output_path, 'w', encoding='utf-8') as f:
    json.dump(processed_data, f, ensure_ascii=False, indent=2)

print(f"Dataset successfully saved to '{output_path}'.")

**Next, load the dataset into a pandas DataFrame**

In [None]:
import pandas as pd

df_one = pd.read_json('/content/pile-ner-gliner.json')
print(f"Dataset ONE loaded with {len(df_one)} entries.")


**Now, load your training dataset and merge it with the Pile-NER subset. Ensure that your dataset has already been prepared in the correct format for fine-tuning GLiNER.**

In this case, we will use a dataset prepared by the *Congruence Engine* project, loaded from a URL.
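
Before merging, it can help to check that every record follows the expected format. A minimal sketch, assuming a hypothetical helper `validate_example` (not part of GLiNER) and records with `tokenized_text` and `ner` keys:

```python
def validate_example(example):
    """Return a list of problems found in one GLiNER-format record."""
    problems = []
    tokens = example.get("tokenized_text")
    if not isinstance(tokens, list) or not tokens:
        return ["'tokenized_text' must be a non-empty list of tokens"]
    for span in example.get("ner", []):
        if len(span) != 3:
            problems.append(f"span {span} should be [start, end, label]")
            continue
        start, end, label = span
        # Both indices are inclusive token positions.
        if not (0 <= start <= end < len(tokens)):
            problems.append(f"span {span} is out of range for {len(tokens)} tokens")
        if not isinstance(label, str) or not label:
            problems.append(f"span {span} has an empty or non-string label")
    return problems

good = {"tokenized_text": ["Barwood", "is", "a", "dye"], "ner": [[0, 0, "textile dye"]]}
bad = {"tokenized_text": ["Barwood", "is", "a", "dye"], "ner": [[2, 5, "textile dye"]]}
print(validate_example(good))
print(validate_example(bad))
```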

In [None]:
import pandas as pd

In [None]:
df_two = pd.read_json('https://raw.githubusercontent.com/congruence-engine/universal-ner-with-gliner/refs/heads/main/datasets/full_synthetic_data_gpt_4o_6_dec.json')

print(f"Dataset TWO loaded with {len(df_two)} entries.")


**You are now ready to merge the two datasets**

In [None]:
merged_dataset_path = '/content/merged_dataset.json'  # Output path

In [None]:
merged_df = pd.concat([df_one, df_two], ignore_index=True)
print(f"Merged dataset contains {len(merged_df)} entries.")

In [None]:
# Convert DataFrame to list of dictionaries
merged_data = merged_df.to_dict(orient='records')

# Save to JSON file
with open(merged_dataset_path, 'w', encoding='utf-8') as f:
    json.dump(merged_data, f, ensure_ascii=False, indent=2)
print(f"Merged dataset saved to '{merged_dataset_path}'.")

## **2. Fine-tuning**


**Import modules**

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"
from datasets import load_dataset
import torch
from gliner import GLiNERConfig, GLiNER
from gliner.training import Trainer, TrainingArguments
from gliner.data_processing.collator import DataCollatorWithPadding, DataCollator
from gliner.utils import load_config_as_namespace
from gliner.data_processing import WordsSplitter, GLiNERDataset

**Load the merged dataset, and split it into training and test sets**

In [None]:
import json
import random

# Path of the merged dataset saved in the previous section
train_path = "/content/merged_dataset.json"

with open(train_path, "r", encoding="utf-8") as f:
    data = json.load(f)

print('Dataset size:', len(data))

In [None]:
random.shuffle(data)
print('Dataset is shuffled...')

train_dataset = data[:int(len(data)*0.8)]
test_dataset = data[int(len(data)*0.8):]

print('Dataset is split...')
print('Train size:', len(train_dataset))
print('Test size:', len(test_dataset))
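
`random.shuffle` produces a different split on every run, which makes evaluation scores hard to compare between runs. One option, sketched here with a hypothetical helper `split_dataset` and an arbitrary seed, is to shuffle a copy of the data with a seeded generator:

```python
import random

def split_dataset(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of records reproducibly and split it into train/test."""
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))  # stand-in for the loaded examples
train_part, test_part = split_dataset(records)
print(len(train_part), len(test_part))
```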

**Load the GLiNER model that you want to fine-tune. In this case we will be using gliner-community/gliner_medium-v2.5.**


In [None]:
from gliner import GLiNER
model = GLiNER.from_pretrained("gliner-community/gliner_medium-v2.5")

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

model = model.to(device)

In [None]:
data_collator = DataCollator(model.config, data_processor=model.data_processor, prepare_labels=True)
model.to(device)
print("done")

**Start fine-tuning**

The following code contains several parameters which will affect the resulting model in various ways. As this notebook is intended for experimentation, the target number of training steps has been set at 500 (the number of epochs is calculated from this), but more may be needed to improve accuracy.

As the model proceeds with fine-tuning based on your data, it will log training and evaluation loss every 50 steps. All being well, these numbers should decrease as the model's accuracy improves with training.
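
The training cell converts a target number of steps into whole epochs. Because the result is rounded down (with a minimum of one epoch), a run may stop before or after exactly 500 steps. A rough sketch of the arithmetic, assuming a single device, no gradient accumulation, and a final partial batch that is kept; the dataset sizes are made up for illustration:

```python
import math

def planned_steps(data_size, batch_size=8, num_steps=500):
    """Return (epochs, approximate total optimiser steps) for the given settings."""
    num_batches = data_size // batch_size          # as computed in the notebook
    num_epochs = max(1, num_steps // num_batches)
    # Assumption: the last partial batch of each epoch is kept, not dropped.
    steps_per_epoch = math.ceil(data_size / batch_size)
    return num_epochs, num_epochs * steps_per_epoch

for size in (800, 3200, 6000):
    print(size, planned_steps(size))
```

With 3,200 training examples, for instance, this works out at one epoch of about 400 steps, so a `checkpoint-500` folder would likely never be written.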

In [None]:
# calculate number of epochs
num_steps = 500
batch_size = 8
data_size = len(train_dataset)
num_batches = data_size // batch_size
num_epochs = max(1, num_steps // num_batches)

training_args = TrainingArguments(
    output_dir="/content/models",
    learning_rate=5e-6,
    weight_decay=0.01,
    others_lr=1e-5,
    others_weight_decay=0.01,
    lr_scheduler_type="linear", #cosine
    warmup_ratio=0.1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    eval_strategy="steps",
    logging_steps=50,
    save_steps = 100,
    save_total_limit=10,
    dataloader_num_workers = 0,
    use_cpu = False,
    report_to="none",
    )

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=model.data_processor.transformer_tokenizer,
    data_collator=data_collator,
)

trainer.train()

**Save the model**

A few different 'checkpoints' (or versions) of the model will be saved in a folder named 'models' in your directory. Based on the metrics you have observed above, you can load the checkpoint with the best metrics (likely to be one of the final checkpoints).

In [None]:
trained_model = GLiNER.from_pretrained("/content/models/checkpoint-500", load_tokenizer=True)
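
The checkpoint number depends on how many steps the run actually completed, so `checkpoint-500` may not exist. A small sketch for finding the most recent one, assuming the `checkpoint-<step>` folder naming and a hypothetical helper `latest_checkpoint`:

```python
import os
import re

def latest_checkpoint(output_dir):
    """Return the path of the highest-numbered checkpoint-<step> folder, or None."""
    best_step, best_path = -1, None
    if not os.path.isdir(output_dir):
        return None
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

print(latest_checkpoint("/content/models"))
```

The returned path can be passed to `GLiNER.from_pretrained` in place of a hard-coded one.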

You can now test your trained model on a sample of text. In the code below, we test the model with some of the labels that were used in training:

In [None]:
text = """
The 19th century textile industry was a vibrant period of innovation and expansion, fueled by advancements in materials and techniques. Barwood, a natural dye source imported from Africa, played a crucial role in achieving rich red hues. Skilled colourists experimented with this and other natural dyes to create striking fabrics that met the era’s demand for color diversity.

Processes like degumming were essential in preparing silk for dyeing and weaving, removing sericin to achieve a smooth finish. Similarly, scouring, the thorough cleaning of wool and other fibers, ensured that impurities did not interfere with dyeing or spinning processes. Innovations like the scotch feed mechanism improved efficiency in spinning mills, streamlining the delivery of fibers to machinery.

Domett, a plain but durable cloth, was widely used for practical garments and household items, exemplifying the industry’s focus on both utility and style. These combined efforts shaped the thriving textile trade of the era.
"""

# Labels for entity prediction
labels = ["textile machinery", "textile fabric", "textile industry occupation", "textile dye", "textile manufacturing process"]

# Perform entity prediction
entities = trained_model.predict_entities(text, labels, threshold=0.5)

# Display predicted entities and their labels
for entity in entities:
    print(entity["text"], "=>", entity["label"])
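
Besides `text` and `label`, each predicted entity also carries a `score`. A sketch for grouping results by label and dropping low-confidence hits, using made-up predictions in place of real model output and a hypothetical helper `group_by_label`:

```python
from collections import defaultdict

def group_by_label(entities, min_score=0.0):
    """Map each label to the distinct entity texts predicted for it."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity.get("score", 1.0) < min_score:
            continue  # skip predictions below the chosen confidence
        if entity["text"] not in grouped[entity["label"]]:
            grouped[entity["label"]].append(entity["text"])
    return dict(grouped)

# Made-up predictions for illustration; real ones come from predict_entities.
sample = [
    {"text": "Barwood", "label": "textile dye", "score": 0.91},
    {"text": "degumming", "label": "textile manufacturing process", "score": 0.84},
    {"text": "scouring", "label": "textile manufacturing process", "score": 0.62},
    {"text": "Barwood", "label": "textile dye", "score": 0.77},
]
print(group_by_label(sample, min_score=0.7))
```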

## **3. Evaluate your model**

You can evaluate your model to see how it performs against any dataset of your choice. In this example we will evaluate it against the original training dataset.

This will return the following metrics. The F1 score is the most commonly used single measure of an NER model's accuracy:


**P**: Precision (out of all entities identified by the model, x% were correct)

**R**: Recall (out of all entities actually present in the data, x% were identified by the model)

**F1**: The harmonic mean of Precision and Recall

**Final value**: F1 score as a decimal
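
As a worked example of how these scores relate, suppose (hypothetically) that a model predicts 40 entities, 30 of which are correct, against a dataset containing 50 annotated entities:

```python
def precision_recall_f1(correct, predicted, actual):
    """Compute precision, recall and F1 from raw entity counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(correct=30, predicted=40, actual=50)
print(f"P: {p:.2%}  R: {r:.2%}  F1: {f1:.2%}")
```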


In [None]:
import json
import random

train_path = "/content/your-dataset"

with open(train_path, "r", encoding="utf-8") as f:
    data = json.load(f)

print('Dataset size:', len(data))

In [None]:
evaluation_results = model.evaluate(
    data, flat_ner=True, entity_types=["textile manufacturing chemical", "textile dye", "textile machinery", "textile fibre", "textile fabric", "textile fabric component", "textile fabric imperfection", "textile waste material", "textile weave", "textile manufacturing process", "textile industry unit of measurement", "textile industry occupation"]
)

# Display the results; without this, the assignment above shows nothing in the notebook
print(evaluation_results)

## **4. Share your model to Hugging Face**


Because Google Colab deletes runtimes automatically, you stand to lose your model unless you save it!

It is a good idea to push the model to the Hugging Face Hub, so that you can safely use it whenever you need it. This will require a (free) [Hugging Face account](https://huggingface.co/).

In [None]:
!pip install huggingface_hub
from huggingface_hub import notebook_login

In [None]:
notebook_login()

In [None]:
trained_model.push_to_hub("your_model_name")