# **Using LLMs to Generate Synthetic Data for Fine-Tuning GLiNER**

In this notebook, we'll explore a simple way to generate synthetic data for fine-tuning GLiNER. I have used a similar approach to generate training data for [**PII extraction**](https://huggingface.co/urchade/gliner_multi_pii-v1). The notebook was written around `Mistral-7B-Instruct-v0.2`, but the run shown below loads `NousResearch/Hermes-2-Pro-Llama-3-8B`, and I think there are better LLMs available online (such as Llama 3).

Additionally, the prompt used in this example is far from optimal, so you should adapt it to your specific use case or domain. This notebook serves only as an example for practitioners, as some people have requested one.

In this notebook, we generate **fully synthetic data**, including both text and entity annotations, but if you have quality data from your target domain, *you can alternatively have the LLM annotate your existing data*. 📊📝
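If you take the annotation route instead, the prompt mainly needs to hand the text to the LLM rather than ask for it. Below is a minimal sketch, assuming the same `entity`/`types` JSON shape used in the rest of this notebook. The function name `create_annotation_prompt` and the prompt wording are hypothetical and have not been tuned against any particular model.

```python
import json

def create_annotation_prompt(text, entity_types=None):
    """Build a prompt asking an LLM to annotate an existing passage.

    Hypothetical helper: the wording is illustrative, not a tuned prompt.
    """
    prompt = (
        "**Objective:**\n"
        "Annotate the named entities in the passage below. Do not rewrite the passage.\n\n"
        "**Format Requirements:**\n"
        '- Return JSON with a single key "entities".\n'
        '- Each item looks like {"entity": "entity name", "types": ["type 1", ...]}.\n'
        "- All entity types must be in lowercase.\n"
        "- Finish the answer with <end>.\n"
    )
    if entity_types:
        # Optionally restrict the label set to the types you care about
        prompt += "- Only use these entity types: " + ", ".join(entity_types) + ".\n"
    # json.dumps escapes quotes and newlines so the passage cannot break the prompt layout
    prompt += "\n**Passage:**\n" + json.dumps(text, ensure_ascii=False) + "\n\n<start>\n"
    return prompt

example = create_annotation_prompt(
    "Dataiku is hiring a data scientist in Paris.",
    entity_types=["company", "job title", "city"],
)
print(example)
```

Since the passage is already known, only the `entities` list has to be parsed from the completion and paired with the original text.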

Feel free to experiment and tailor the approach to better suit your needs! *Happy fine-tuning!* 🌟

In [10]:

import os
# Never hard-code a real access token in a shared notebook; substitute your own here
os.environ["HF_TOKEN"] = "<your_hf_token>"

In [4]:
!pip install vllm


In [5]:
from vllm import LLM, SamplingParams

## Load large language model

In [6]:
# LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
# LLM_MODEL = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
LLM_MODEL = "NousResearch/Hermes-2-Pro-Llama-3-8B"  # model loaded in this run; you can use a better one
NUM_GPUs = 1

In [7]:
llm = LLM(model=LLM_MODEL, tensor_parallel_size=NUM_GPUs, dtype="half")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


INFO 06-26 17:32:55 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='NousResearch/Hermes-2-Pro-Llama-3-8B', speculative_config=None, tokenizer='NousResearch/Hermes-2-Pro-Llama-3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=NousResearch/Hermes-2-Pro-Llama-3-8B)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 06-26 17:33:03 weight_utils.py:218] Using model weights format ['*.safetensors']


INFO 06-26 17:35:08 model_runner.py:160] Loading model weights took 14.9605 GB
INFO 06-26 17:35:10 gpu_executor.py:83] # GPU blocks: 9718, # CPU blocks: 2048
INFO 06-26 17:35:12 model_runner.py:889] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 06-26 17:35:12 model_runner.py:893] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 06-26 17:35:29 model_runner.py:965] Graph capturing finished in 17 secs.


In [8]:
# sampling parameters
sampling_params = SamplingParams(top_k=100, max_tokens=1000, top_p=0.8, stop="<end>")

## Prompting function

In [9]:
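def create_json_prompt_for_synthetic_data(**kwargs):
    """Build the generation prompt; keyword arguments become attributes of the final <start ...> tag."""
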
    # Use dictionary comprehension to filter out 'n/a' values and to keep the code flexible
    attributes = {key: value for key, value in kwargs.items() if value != "n/a"}

    # Building the initial part of the prompt
    prompt = """
**Objective:**
Produce realistic text passages that include clearly identified named entities. Each entity should be meticulously labeled according to its type for straightforward extraction.

**Format Requirements:**
- The output should be formatted in JSON, containing the text and the corresponding entities list.
- Each entity in the text should be accurately marked and annotated in the 'entities' list.
- Meticulously follow all the listed attributes.

**Entity Annotation Details:**
- All entity types must be in lowercase. For example, use "type" not "TYPE".
- Entity types can be multiwords separate by space. For instance, use "entity type" rather than "entity_type".
- Entities spans can be nested within other entities.
- A single entity may be associated with multiple types. list them in the key "types".

**Output Schema:**

<start attribute_1="value1" attribute_2="value2" ...>
{
  "text": "{text content}",
  "entities": [
    {"entity": "entity name", "types": ["type 1", "type 2", ...]},
    ...
  ]
}
<end>

**Here are some real world examples**:"""

    # Create a string of attributes for the <start> tag, excluding any 'n/a' values
    attributes_string = " ".join([f'{key}="{value}"' for key, value in attributes.items()])

    # Adding the dynamically created attributes string to the prompt
    prompt += f"""
<start {attributes_string}>
"""

    return prompt
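
To make the attribute handling concrete, here is a small standalone sketch of just the `<start ...>` tag construction, using the same `"n/a"` filtering rule as the function above. The helper name `build_start_tag` is hypothetical and exists only for this illustration.

```python
def build_start_tag(**kwargs):
    # Same filtering rule as in the prompt function: drop attributes set to "n/a"
    attributes = {key: value for key, value in kwargs.items() if value != "n/a"}
    # Keyword arguments keep their call order, so the tag attributes do too
    attributes_string = " ".join(f'{key}="{value}"' for key, value in attributes.items())
    return f"<start {attributes_string}>"

tag = build_start_tag(language="english", sector="n/a", country="france")
print(tag)  # <start language="english" country="france">
```

Because the prompt ends on this open tag and the sampling parameters stop at `<end>`, the completion is expected to contain only the JSON body.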

## Example of generation

In [10]:
import json

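def generate(**kwargs):
    """Generate one example and parse it; raises json.JSONDecodeError if the completion is not valid JSON."""
    outputs = llm.generate([create_json_prompt_for_synthetic_data(**kwargs)], sampling_params)
    return json.loads(outputs[0].outputs[0].text.strip())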

In [11]:
generate(language="french", types_of_text="detailed job ads", sector="machine learning", country="france")

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.11s/it, est. speed input: 85.78 toks/s, output: 64.57 toks/s]


{'text': "Les startups françaises de l'intelligence artificielle et du machine learning sont actuellement à la recherche de talents pour renforcer leurs équipes. Par exemple, Dataiku, une entreprise française de machine learning, a besoin d'un data scientist expérimenté pour aider à développer ses produits. Onetame, une autre startup française de l'IA, cherche un data engineer pour soutenir ses projets d'analyse de données.",
 'entities': [{'entity': 'Dataiku', 'types': ['ORG']},
  {'entity': 'machine learning', 'types': ['FIELD']},
  {'entity': 'Dataiku', 'types': ['ORG']},
  {'entity': 'data scientist', 'types': ['ROLE']},
  {'entity': 'Onetame', 'types': ['ORG']},
  {'entity': 'data engineer', 'types': ['ROLE']}]}
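
Note that the completion above returns uppercase types such as `ORG` even though the prompt asks for lowercase, and a completion may not be valid JSON at all. The following is a minimal sketch of a defensive parsing step, assuming the `text`/`entities` schema from the prompt. The name `parse_record` and the specific checks are illustrative choices, not part of the original pipeline.

```python
import json

def parse_record(raw_text):
    """Return a cleaned record, or None if the completion is unusable."""
    try:
        record = json.loads(raw_text.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not isinstance(record.get("text"), str):
        return None
    raw_entities = record.get("entities")
    if not isinstance(raw_entities, list):
        return None

    entities = []
    for item in raw_entities:
        if not isinstance(item, dict):
            continue
        name, types = item.get("entity"), item.get("types")
        if isinstance(types, str):
            types = [types]  # tolerate a single type given as a bare string
        if not isinstance(name, str) or not isinstance(types, list):
            continue
        # Keep only mentions that literally occur in the text (case-insensitive)
        if name.lower() not in record["text"].lower():
            continue
        entities.append({"entity": name, "types": [str(t).lower() for t in types]})
    return {"text": record["text"], "entities": entities}

raw = '{"text": "Dataiku hires.", "entities": [{"entity": "Dataiku", "types": ["ORG"]}, {"entity": "Paris", "types": ["city"]}]}'
print(parse_record(raw))
print(parse_record("not json"))  # None
```

In this example the `Paris` entry is dropped because the word never appears in the text, which is one common failure mode of LLM-produced annotations.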

## Functions

In [12]:
# post processing functions

import re

def tokenize_text(text):
    """Tokenize the input text into a list of tokens."""
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)

def extract_entities(data):
    all_examples = []

    for dt in data:

        # Attempt to extract entities; skip current record on failure
        try:
            tokens = tokenize_text(dt['text'])
            ents = [(k["entity"], k["types"]) for k in dt['entities']]
        except (KeyError, TypeError):
            # Malformed record (missing keys or wrong value types): skip it
            continue

        spans = []
        for entity in ents:
            entity_tokens = tokenize_text(str(entity[0]))
            if not entity_tokens:
                # An empty entity string would otherwise match at every position
                continue

            # Find the start and end indices of each entity in the tokenized text
            for i in range(len(tokens) - len(entity_tokens) + 1):
                if " ".join(tokens[i:i + len(entity_tokens)]).lower() == " ".join(entity_tokens).lower():
                    for el in entity[1]:
                        spans.append((i, i + len(entity_tokens) - 1, el.lower().replace('_', ' ')))

        # Append the tokenized text and its corresponding named entity recognition data
        all_examples.append({"tokenized_text": tokens, "ner": spans})

    return all_examples

# generation functions
def generate_from_prompts(prompts, llm, sampling_params):
    outputs = llm.generate(prompts, sampling_params)

    all_outs = []

    for output in outputs:
        try:
            js = json.loads(output.outputs[0].text.strip())
        except json.JSONDecodeError:
            # Skip completions that are not valid JSON
            continue

        all_outs.append(js)

    return all_outs, extract_entities(all_outs)
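
The span matching inside `extract_entities` can be tried in isolation. The sketch below copies the tokenizer regex from above and wraps the matching loop in a hypothetical helper, `find_spans`, which compares lowercased token windows and returns inclusive `(start, end, label)` indices.

```python
import re

def tokenize_text(text):
    """Same regex as above: words (optionally joined by - or _) or single non-space characters."""
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)

def find_spans(tokens, entity, label):
    """Return every (start, end, label) where the entity's tokens occur; end is inclusive."""
    entity_tokens = [t.lower() for t in tokenize_text(entity)]
    n = len(entity_tokens)
    if n == 0:
        return []
    lowered = [t.lower() for t in tokens]
    return [
        (i, i + n - 1, label)
        for i in range(len(lowered) - n + 1)
        if lowered[i:i + n] == entity_tokens
    ]

tokens = tokenize_text("A full-time role in Algeria. Algeria is hiring.")
print(tokens)
# ['A', 'full-time', 'role', 'in', 'Algeria', '.', 'Algeria', 'is', 'hiring', '.']
print(find_spans(tokens, "algeria", "country"))
# [(4, 4, 'country'), (6, 6, 'country')]
```

Every occurrence of the entity string is labeled, not only the one the LLM had in mind, so ambiguous strings can produce spurious spans.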

## Use case: synthetic data for job ads

In [13]:
# I have used GPT-4 to generate these

# List of countries
countries = [
    "Madagascar", "Taiwan", "USA", "Germany", "France", "Spain", "Russia", "China",
    "Japan", "Brazil", "India", "Egypt", "South Africa", "Australia", "Canada",
    "Mexico", "Indonesia", "Nigeria", "Turkey", "United Kingdom", "Italy", "Poland",
    "Argentina", "Netherlands", "Belgium", "Switzerland", "Sweden", "Norway", "Finland",
    "Denmark", "Portugal", "Greece", "Iran", "Thailand", "Philippines", "Vietnam",
    "South Korea", "Saudi Arabia", "Israel", "UAE", "New Zealand", "Ireland", "Malaysia",
    "Singapore", "Hong Kong", "Czech Republic", "Hungary", "Romania", "Colombia",
    "Peru", "Venezuela", "Chile", "Morocco", "Algeria", "Tunisia", "Nepal", "Pakistan", "Bangladesh",
    "Kazakhstan", "Ukraine", "Austria", "Croatia", "Serbia", "Kenya", "Ghana", "Zimbabwe",
    "Cuba", "Panama", "Fiji", "Mongolia", "North Korea", "Myanmar", "Ethiopia", "Tanzania",
    "Libya", "Jordan", "Qatar", "Oman", "Kuwait", "Lebanon", "Bulgaria", "Slovakia",
    "Lithuania", "Latvia", "Estonia", "Cyprus", "Luxembourg", "Macao", "Bhutan", "Maldives",
    "Angola", "Cameroon", "Senegal", "Mali", "Zambia", "Uganda", "Namibia", "Botswana",
    "Mozambique", "Ivory Coast", "Burkina Faso", "Malawi", "Gabon", "Lesotho", "Gambia",
    "Guinea", "Cape Verde", "Rwanda", "Benin", "Burundi", "Somalia", "Eritrea", "Djibouti",
    "Togo", "Seychelles", "Chad", "Central African Republic", "Liberia", "Mauritania", "Sri Lanka",
    "Sierra Leone", "Equatorial Guinea", "Swaziland", "Congo (Kinshasa)", "Congo (Brazzaville)"
]

# job sectors
job_sectors = [
    # Finance Sector Specializations
    "Investment Banking",
    "Corporate Finance",
    "Asset Management",
    "Risk Management",
    "Quantitative Analysis",
    "Financial Planning",

    # Machine Learning and AI Specializations
    "Natural Language Processing",
    "Computer Vision",
    "Deep Learning",
    "Reinforcement Learning",
    "Predictive Analytics",
    "Algorithm Development",

    # Healthcare Sector Specializations
    "Medical Research",
    "Clinical Trials",
    "Health Informatics",
    "Biomedical Engineering",
    "Public Health Administration",
    "Pharmaceuticals",

    # Education Sector Specializations
    "Curriculum Development",
    "Educational Technology",
    "Special Education",
    "Higher Education Administration",
    "Educational Policy",
    "Language Instruction",

    # Manufacturing Sector Specializations
    "Process Engineering",
    "Quality Control",
    "Industrial Design",
    "Supply Chain Optimization",
    "Robotics Manufacturing",
    "Lean Manufacturing",

    # Energy Sector Specializations
    "Renewable Energy Systems",
    "Oil and Gas Exploration",
    "Energy Efficiency Consulting",
    "Nuclear Engineering",
    "Smart Grid Technology",
    "Energy Policy",

    # Environmental Sector Specializations
    "Wildlife Conservation",
    "Environmental Science",
    "Water Resource Management",
    "Sustainability Strategy",
    "Climate Change Analysis",
    "Environmental Law",

    # Media and Communications Specializations
    "Digital Marketing",
    "Journalism",
    "Public Relations",
    "Film Production",
    "Broadcasting",
    "Content Strategy",

    # Legal Sector Specializations
    "Corporate Law",
    "International Law",
    "Intellectual Property",
    "Environmental Law",
    "Civil Litigation",
    "Criminal Defense",

    # Retail Sector Specializations
    "E-commerce Strategy",
    "Store Management",
    "Merchandise Planning",
    "Customer Experience Management",
    "Retail Analytics",
    "Supply Chain Logistics"
]

### Generate prompts

In [14]:
# create prompts
NUM_SAMPLES = 100

import random

all_prompts = []

for i in range(NUM_SAMPLES):
    # sample
    job_sector = random.choice(job_sectors)
    country = random.choice(countries)

    prompt = create_json_prompt_for_synthetic_data(language="english",
                                                   types_of_text="detailed job ads",
                                                   sector=job_sector,
                                                   country=country)
    all_prompts.append(prompt)
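
The loop above draws from the global random state, so each run produces a different set of prompts. If you want to regenerate the same prompt set later, one option is a seeded local generator. This is a sketch only; the helper name `sample_attribute_pairs` and the seed value are assumptions for illustration.

```python
import random

def sample_attribute_pairs(sectors, countries, num_samples, seed=0):
    """Draw (sector, country) pairs reproducibly from a seeded local generator."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return [(rng.choice(sectors), rng.choice(countries)) for _ in range(num_samples)]

sectors = ["Journalism", "Corporate Law", "Computer Vision"]
places = ["France", "Japan", "Kenya"]

pairs_a = sample_attribute_pairs(sectors, places, 5, seed=42)
pairs_b = sample_attribute_pairs(sectors, places, 5, seed=42)
print(pairs_a == pairs_b)  # True
```

Each pair can then be passed to the prompt function as `sector=` and `country=`.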

In [25]:
all_prompts[0:3]

['\n**Objective:**\nProduce realistic text passages that include clearly identified named entities. Each entity should be meticulously labeled according to its type for straightforward extraction.\n\n**Format Requirements:**\n- The output should be formatted in JSON, containing the text and the corresponding entities list.\n- Each entity in the text should be accurately marked and annotated in the \'entities\' list.\n- Meticulously follow all the listed attributes.\n\n**Entity Annotation Details:**\n- All entity types must be in lowercase. For example, use "type" not "TYPE".\n- Entity types can be multiwords separate by space. For instance, use "entity type" rather than "entity_type".\n- Entities spans can be nested within other entities.\n- A single entity may be associated with multiple types. list them in the key "types".\n\n**Output Schema:**\n\n<start attribute_1="value1" attribute_2="value2" ...>\n{\n  "text": "{text content}",\n  "entities": [\n    {"entity": "entity name", "typ

### Generate outputs

In [15]:
output, processed_output = generate_from_prompts(all_prompts, llm, sampling_params)

Processed prompts: 100%|██████████| 100/100 [00:10<00:00,  9.95it/s, est. speed input: 2655.16 toks/s, output: 1481.86 toks/s]


In [16]:
output[0]

{'text': 'We are looking for a highly motivated and skilled International Law professional to join our team in Algeria. The role is a full-time position and the successful candidate will be responsible for providing legal advice and support to our clients. The ideal candidate should have a minimum of 5 years of experience in the field of International Law and be fluent in both English and Arabic.',
 'entities': [{'entity': 'International Law', 'types': ['field of law']},
  {'entity': 'Algeria', 'types': ['country']},
  {'entity': 'English', 'types': ['language']},
  {'entity': 'Arabic', 'types': ['language']}]}

In [17]:
processed_output[0]

{'tokenized_text': ['We',
  'are',
  'looking',
  'for',
  'a',
  'highly',
  'motivated',
  'and',
  'skilled',
  'International',
  'Law',
  'professional',
  'to',
  'join',
  'our',
  'team',
  'in',
  'Algeria',
  '.',
  'The',
  'role',
  'is',
  'a',
  'full-time',
  'position',
  'and',
  'the',
  'successful',
  'candidate',
  'will',
  'be',
  'responsible',
  'for',
  'providing',
  'legal',
  'advice',
  'and',
  'support',
  'to',
  'our',
  'clients',
  '.',
  'The',
  'ideal',
  'candidate',
  'should',
  'have',
  'a',
  'minimum',
  'of',
  '5',
  'years',
  'of',
  'experience',
  'in',
  'the',
  'field',
  'of',
  'International',
  'Law',
  'and',
  'be',
  'fluent',
  'in',
  'both',
  'English',
  'and',
  'Arabic',
  '.'],
 'ner': [(9, 10, 'field of law'),
  (58, 59, 'field of law'),
  (17, 17, 'country'),
  (65, 65, 'language'),
  (67, 67, 'language')]}
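
A quick way to sanity-check processed records is to map each span back to its tokens. The sketch below uses a hypothetical helper, `spans_to_text`, on a small hand-made record in the same format (token indices are inclusive on both ends).

```python
def spans_to_text(record):
    """Rebuild the surface string of each span from the token list."""
    tokens = record["tokenized_text"]
    return [(" ".join(tokens[start:end + 1]), label) for start, end, label in record["ner"]]

record = {
    "tokenized_text": ["Join", "our", "team", "in", "Algeria", "as", "an",
                       "International", "Law", "expert", "."],
    "ner": [(7, 8, "field of law"), (4, 4, "country")],
}
print(spans_to_text(record))
# [('International Law', 'field of law'), ('Algeria', 'country')]
```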

### Some statistics

In [18]:
lengths = []

for d in processed_output:
    lengths.append(len(d["tokenized_text"]))

print("Avg num tokens:", sum(lengths) / len(lengths))

Avg num tokens: 56.96


In [19]:
len_ner = []

for d in processed_output:
    len_ner.append(len(d["ner"]))

print("Avg num of entities:", sum(len_ner) / len(len_ner))

Avg num of entities: 5.66


In [29]:
unique_entities = set()

for d in processed_output:
    for n in d["ner"]:
        unique_entities.add((str(n[2]).lower()))

print("Unique entity types:", len(unique_entities))

print(unique_entities)

Unique entity types: 89
{'gpe', 'years of experience', 'job title', 'industry', 'tool', 'job ad type', 'concept', 'abbreviation', 'academic discipline', 'acronym', 'city', 'gender neutral pronoun', 'objective', 'design technology', 'product type', 'market', 'date', 'research topic', 'geographical area', 'discipline', 'loc', 'software', 'topic', 'fieldofstudy', 'group', 'audience', 'benefit', 'object', 'experience length', 'organization', 'time', 'achievement', 'person', 'team', 'employment type', 'subdomain', 'company', 'trend', 'analysis', 'job position', 'design material', 'responsibility', 'salary', 'job field', 'field', 'sector', 'project', 'organization name', 'domain', 'title', 'programming language', 'occupation', 'study type', 'jobtitle', 'education', 'language', 'role', 'degree', 'job description', 'position', 'product category', 'qualification', 'department', 'education level', 'technique', 'product', 'task', 'media', 'compensation', 'field of work', 'job ads', 'field of law'

In [21]:
# Top 10 entity types

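from collections import Counter

# Count every annotated span by its type; counting the `unique_entities` set would give 1 for each type
Counter(n[2] for d in processed_output for n in d["ner"]).most_common(10)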

[('job title', 86),
 ('country', 75),
 ('field of study', 40),
 ('city', 28),
 ('field', 21),
 ('location', 21),
 ('position', 20),
 ('sector', 16),
 ('industry', 16),
 ('organization', 14)]
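
The type statistics show near-duplicate labels (for example `job title` and `jobtitle`, or `field of study` and `fieldofstudy`). If that fragmentation matters for your use case, one option is to merge variants before training. The sketch below assumes a hand-written alias map; the entries in `LABEL_ALIASES` and the helper `merge_labels` are hypothetical and should be adapted to your own label inventory.

```python
from collections import Counter

# Hypothetical alias map: which labels count as "the same" is a judgment call
LABEL_ALIASES = {
    "jobtitle": "job title",
    "job position": "job title",
    "position": "job title",
    "fieldofstudy": "field of study",
    "organization name": "organization",
    "company": "organization",
}

def merge_labels(examples, aliases=LABEL_ALIASES):
    """Rewrite span labels through the alias map, dropping duplicates created by the merge."""
    merged = []
    for ex in examples:
        ner = [(start, end, aliases.get(label, label)) for start, end, label in ex["ner"]]
        # dict.fromkeys removes exact duplicate spans while keeping the original order
        merged.append({"tokenized_text": ex["tokenized_text"], "ner": list(dict.fromkeys(ner))})
    return merged

toy = [
    {"tokenized_text": ["Data", "Engineer", "at", "Acme"],
     "ner": [(0, 1, "jobtitle"), (0, 1, "position"), (3, 3, "company")]},
    {"tokenized_text": ["Senior", "Analyst"], "ner": [(0, 1, "position")]},
]
counts = Counter(label for ex in merge_labels(toy) for _, _, label in ex["ner"])
print(counts.most_common())
# [('job title', 2), ('organization', 1)]
```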

### Save for training

In [22]:
# Save to JSON
def save_data_to_file(data, filepath):
    """Saves the processed data to a JSON file."""
    with open(filepath, 'w') as f:
        json.dump(data, f)

In [23]:
output_file = "job_ads_data_gliner.json"

save_data_to_file(processed_output, output_file)
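
One detail to keep in mind when reloading the file: JSON has no tuple type, so the `(start, end, label)` tuples come back as lists. The sketch below checks this round trip on a toy example written to a temporary directory (the file name `sample.json` is arbitrary).

```python
import json
import os
import tempfile

examples = [{"tokenized_text": ["Hiring", "in", "Algeria"], "ner": [(2, 2, "country")]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(examples, f)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded[0]["ner"])  # [[2, 2, 'country']]
```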