# **Using LLMs to Generate Synthetic Data for Fine-Tuning GLiNER**

In this notebook, we explore a simple way to generate synthetic data for fine-tuning GLiNER. I used a similar approach to generate training data for [**PII extraction**](https://huggingface.co/urchade/gliner_multi_pii-v1). We use `Mistral-7B-Instruct-v0.2`, though stronger LLMs are likely available (for example, Llama 3).

Additionally, the prompt used in this example is far from optimal, so you should adapt it to your specific use case or domain. This notebook serves only as an example for practitioners, as some people have requested one.

In this notebook, we generate **fully synthetic data**, including both text and entity annotations, but if you have quality data from your target domain, *you can alternatively have the LLM annotate your existing data*. 📊📝

Feel free to experiment and tailor the approach to better suit your needs! *Happy fine-tuning!* 🌟
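For the alternative mentioned above (having the LLM annotate text you already have), a prompt builder might look like the sketch below. The helper name `create_annotation_prompt` and the prompt wording are assumptions made for illustration; they are not part of this notebook's pipeline, and the wording would need tuning for a real domain.

```python
import json

def create_annotation_prompt(text, entity_types):
    """Build a prompt asking an LLM to annotate an existing passage (hypothetical helper)."""
    types_list = ", ".join(f"'{t}'" for t in entity_types)
    return (
        "**Objective:**\n"
        "Annotate the named entities in the passage below without changing its wording.\n\n"
        f"**Entity types:** {types_list}\n\n"
        "**Output Schema:**\n"
        '{"text": "<the passage, unchanged>", '
        '"entities": [{"entity": "entity name", "types": ["type name"]}]}\n'
        "<end>\n\n"
        "**Passage:**\n"
        # json.dumps quotes and escapes the passage so it cannot break the prompt layout
        f"{json.dumps(text)}\n\n"
        "**Annotation:**\n"
    )

example_prompt = create_annotation_prompt(
    "Pt works at Bison Transport as a dispatcher.",
    ["patient occupation", "patient employed by"],
)
print(example_prompt)
```

The resulting string can be passed to the same generation and post-processing steps used for fully synthetic data, since the requested output has the same `text` / `entities` shape.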

In [1]:
!pip install vllm

In [1]:
from vllm import LLM, SamplingParams

## Load large language model

In [5]:
LLM_MODEL = "mistralai/Mistral-7B-Instruct-v0.2" # you can use a better model
NUM_GPUs = 1

In [3]:
from huggingface_hub import login
login()

In [6]:
llm = LLM(model=LLM_MODEL, tensor_parallel_size=NUM_GPUs, dtype="half")

INFO 01-20 18:20:25 config.py:510] This model supports multiple tasks: {'classify', 'embed', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 01-20 18:20:25 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='mistralai/Mistral-7B-Instruct-v0.2', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, s

  self.tokenizer = get_tokenizer(self.tokenizer_id, **tokenizer_config)


tokenizer_config.json:   0%|          | 0.00/2.10k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]

INFO 01-20 18:20:28 selector.py:120] Using Flash Attention backend.
INFO 01-20 18:20:28 model_runner.py:1094] Starting to load model mistralai/Mistral-7B-Instruct-v0.2...
INFO 01-20 18:20:29 weight_utils.py:251] Using model weights format ['*.safetensors']


model-00003-of-00003.safetensors:   0%|          | 0.00/4.54G [00:00<?, ?B/s]

model-00002-of-00003.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]


INFO 01-20 18:22:37 model_runner.py:1099] Loading model weights took 13.4966 GB
INFO 01-20 18:22:40 worker.py:241] Memory profiling takes 3.17 seconds
INFO 01-20 18:22:40 worker.py:241] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.90) = 35.61GiB
INFO 01-20 18:22:40 worker.py:241] model weights take 13.50GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 3.38GiB; the rest of the memory reserved for KV Cache is 18.63GiB.
INFO 01-20 18:22:41 gpu_executor.py:76] # GPU blocks: 9538, # CPU blocks: 2048
INFO 01-20 18:22:41 gpu_executor.py:80] Maximum concurrency for 32768 tokens per request: 4.66x
INFO 01-20 18:22:43 model_runner.py:1415] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utiliz

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:27<00:00,  1.27it/s]

INFO 01-20 18:23:11 model_runner.py:1535] Graph capturing finished in 28 secs, took 0.83 GiB
INFO 01-20 18:23:11 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 33.41 seconds





In [7]:
# sampling parameters
sampling_params = SamplingParams(top_k=100, max_tokens=1000, top_p=0.8, stop="<end>")

## Prompting function

In [117]:
def create_json_prompt_for_synthetic_data(**kwargs):

    # Use dictionary comprehension to filter out 'n/a' values and to keep the code flexible
    attributes = {key: value for key, value in kwargs.items() if value != "n/a"}

    # Building the initial part of the prompt
    prompt = """
**Objective:**
Produce realistic text passages that include clearly identified named entities. Each entity should be meticulously labeled according to its type for straightforward extraction.

**Format Requirements:**
- The output should be formatted in JSON, containing the text and the corresponding entities list.
- Each entity in the text should be accurately marked and annotated in the 'entities' list.
- Meticulously follow all the listed attributes.

**Entity Annotation Details:**
- Must always include only the following entity types and no more: 'patient occupation', 'patient proper name' (firstname and/or lastname), 'patient employed by' (not necessarily employed in the health sector).
- Don't label organizations where the patient does not work.
**Output Schema:**

<start attribute_1="value1" attribute_2="value2" ...>
{
  "text": "{text content}",
  "entities": [
    {"entity": "entity name", "types": "type name"},
    ...
  ]
}
<end>

**Here are some real world examples**:"""

    # Create a string of attributes for the <start> tag, excluding any 'n/a' values
    attributes_string = " ".join([f'{key}="{value}"' for key, value in attributes.items()])

    # Adding the dynamically created attributes string to the prompt
    prompt += f"""
<start {attributes_string}>
"""
    print(prompt)
    return prompt
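To make the attribute handling concrete, here is a small standalone sketch of the same filtering and joining logic. The function name `build_start_tag` is introduced only for this illustration.

```python
def build_start_tag(**kwargs):
    """Reproduce how keyword arguments become the <start ...> tag (illustrative helper)."""
    # Drop attributes explicitly marked as not applicable
    attributes = {key: value for key, value in kwargs.items() if value != "n/a"}
    # Render the remaining attributes as key="value" pairs
    attributes_string = " ".join(f'{key}="{value}"' for key, value in attributes.items())
    return f"<start {attributes_string}>"

tag = build_start_tag(language="english", country="n/a", patient_occupation="Vegetable Packer")
print(tag)  # <start language="english" patient_occupation="Vegetable Packer">
```

Note that values are inserted as-is, so a value containing double quotes ends up unescaped inside the tag. The LLM usually tolerates this, but it is worth keeping in mind when writing long attribute descriptions.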

## Example of generation

In [118]:
import json

def generate(**kwargs):
    outputs = llm.generate([create_json_prompt_for_synthetic_data(**kwargs)], sampling_params)
    return json.loads(outputs[0].outputs[0].text)
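`json.loads` fails if the completion contains anything besides the JSON object (leading whitespace is fine, trailing commentary is not). One possible way to be more tolerant, sketched here with a hypothetical helper `parse_first_json`, is to decode only the first JSON value that starts at the first `{` using the standard library's `JSONDecoder.raw_decode`.

```python
import json

def parse_first_json(text):
    """Return the first JSON object found in `text`, or None if none can be decoded."""
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `start` and ignores what follows
        obj, _ = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return None
    return obj

sample_completion = '\n{"text": "pt c/o chest pain", "entities": []}\nSome trailing commentary'
print(parse_first_json(sample_completion))
```

Returning `None` instead of raising lets the caller skip bad generations, which matches how the batch functions later in this notebook treat unparsable outputs.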

In [119]:
generate(language="english", types_of_text="clinical encounter notes with abbreviations (f/u for follow up, c/o for cough, ekg, but don't be limited to cough related ailments), sentences should look like they were written in a hurry with articles and pronouns missing (for instance, instead of \"he presents with chest pain\", it should be \"chest pain, yellow discharge\", avoid using pronouns in every sentence unless the subject changes), formatted in SOAP format, but not necessarily with the SOAP headings. Use believable surrogates", patient_employment_company="Peak of the Market", patient_occupation="Vegetable Packer", health_org_patient_visited="St. Boniface Hospital", country="winnipeg, canada")


Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.32s/it, est. speed input: 130.46 toks/s, output: 71.71 toks/s]


{'text': 'Mr. Thompson, 53 yrs, f/u chest pain, y/d, shortness of breath, wt 82kg. c/o swollen left leg, y/d. pt denies fevers or chills. pt works at Peak of the Market as a vegetable packer. EKG: sinus rhythm, rate 98bpm, normal axis, no ST segment deviations. Pt was seen at St. Boniface Hospital in Winnipeg, Canada.',
 'entities': [{'entity': 'Mr. Thompson', 'types': ['patient proper name']},
  {'entity': 'Peak of the Market', 'types': ['patient employed by']},
  {'entity': 'vegetable packer', 'types': ['patient occupation']},
  {'entity': 'St. Boniface Hospital', 'types': ['health_org_patient_visited']},
  {'entity': 'Winnipeg, Canada', 'types': ['location']}]}
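The sample above contains two types (`health_org_patient_visited` and `location`) that the prompt did not ask for. One possible remedy, sketched below, is to drop off-schema types before converting to training data. The names `ALLOWED_TYPES` and `keep_allowed_entities` are assumptions introduced for this example.

```python
ALLOWED_TYPES = {"patient occupation", "patient proper name", "patient employed by"}

def keep_allowed_entities(record, allowed_types=ALLOWED_TYPES):
    """Return a copy of `record` containing only entities whose types are in the schema."""
    kept = []
    for ent in record.get("entities", []):
        types = ent.get("types", [])
        if isinstance(types, str):  # tolerate a single type given as a string
            types = [types]
        # Normalise the same way as the post-processing step (lowercase, '_' -> ' ')
        types = [t for t in types if t.lower().replace("_", " ") in allowed_types]
        if types and "entity" in ent:
            kept.append({"entity": ent["entity"], "types": types})
    return {"text": record["text"], "entities": kept}

record = {
    "text": "Mr. Thompson works at Peak of the Market. Seen at St. Boniface Hospital.",
    "entities": [
        {"entity": "Mr. Thompson", "types": ["patient proper name"]},
        {"entity": "Peak of the Market", "types": ["patient employed by"]},
        {"entity": "St. Boniface Hospital", "types": ["health_org_patient_visited"]},
    ],
}
filtered = keep_allowed_entities(record)
print([e["entity"] for e in filtered["entities"]])  # ['Mr. Thompson', 'Peak of the Market']
```

Whether to filter or to keep the extra labels is a design choice: extra labels add noise for a narrowly scoped extractor, but they may be useful if you later widen the label set.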

## Functions

In [120]:
# post processing functions

import re

def tokenize_text(text):
    """Tokenize the input text into a list of tokens."""
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)

def extract_entities(data):
    """Convert generated records into examples with inclusive token-level spans."""
    all_examples = []

    for dt in data:

        # Attempt to extract entities; skip current record if it is malformed
        try:
            tokens = tokenize_text(dt['text'])
            ents = [(k["entity"], k["types"]) for k in dt['entities']]
        except (KeyError, TypeError):
            continue

        spans = []
        for entity in ents:
            entity_tokens = tokenize_text(str(entity[0]))

            # An empty entity string would otherwise match at every position
            if not entity_tokens:
                continue

            # The prompt schema shows "types" as a string, so wrap a lone string in a list
            types = [entity[1]] if isinstance(entity[1], str) else entity[1]

            # Find the start and end indices of each entity in the tokenized text
            for i in range(len(tokens) - len(entity_tokens) + 1):
                if " ".join(tokens[i:i + len(entity_tokens)]).lower() == " ".join(entity_tokens).lower():
                    for el in types:
                        spans.append((i, i + len(entity_tokens) - 1, str(el).lower().replace('_', ' ')))

        # Append the tokenized text and its corresponding named entity recognition data
        all_examples.append({"tokenized_text": tokens, "ner": spans})

    return all_examples

# generation functions
def generate_from_prompts(prompts, llm, sampling_params):
    outputs = llm.generate(prompts, sampling_params)

    all_outs = []

    for output in outputs:
        # Skip completions that are not valid JSON
        try:
            js = json.loads(output.outputs[0].text.strip())
        except json.JSONDecodeError:
            continue

        all_outs.append(js)

    return all_outs, extract_entities(all_outs)
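As a quick illustration of the span format produced by the post-processing step, the sketch below tokenizes a short note with the same regular expression and locates one entity. `find_spans` is a simplified stand-in written for this example, not the notebook's `extract_entities`.

```python
import re

def tokenize_text(text):
    """Same tokenizer as above: words (optionally joined by - or _) or single symbols."""
    return re.findall(r'\w+(?:[-_]\w+)*|\S', text)

def find_spans(tokens, entity, label):
    """Return (start, end, label) for every case-insensitive match; end is inclusive."""
    ent_tokens = [t.lower() for t in tokenize_text(entity)]
    n = len(ent_tokens)
    lowered = [t.lower() for t in tokens]
    return [
        (i, i + n - 1, label)
        for i in range(len(tokens) - n + 1)
        if n and lowered[i:i + n] == ent_tokens
    ]

tokens = tokenize_text("Pt works at Peak of the Market, f/u 2wks.")
spans = find_spans(tokens, "Peak of the Market", "patient employed by")
print(tokens)  # ['Pt', 'works', 'at', 'Peak', 'of', 'the', 'Market', ',', 'f', '/', 'u', '2wks', '.']
print(spans)   # [(3, 6, 'patient employed by')]
```

Because matching is done on token sequences, every occurrence of an entity string is labelled, and abbreviations such as `f/u` are split into three tokens.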

## Use case: synthetic data for clinical encounter notes

In [128]:
# I have used GPT-4 to generate these

# List of locations (passed to the prompt as the `country` attribute)
countries = [
    "Winnipeg, Canada", "Winnipeg", "Manitoba", "St. Vital, Winnipeg", "Osborne, Winnipeg", "Industrial Area, Winnipeg", "West Broadway, Winnipeg", "Winkler, Winnipeg", "The Forks, Winnipeg", "Windsor, Winnipeg",
    "Wolseley, Winnipeg", "Assiniboine, Winnipeg"
]

# health organizations the patient may have visited

health_orgs = [
    "St. Boniface Hospital",
    "Concordia Hospital",
    "Grace Hospital",
    "Seven Oaks General Hospital",
    "Victoria General Hospital",
    "Misericordia Health Centre",
    "Health Sciences Centre Winnipeg",
    "Pan Am Clinic",
    "Klinic Community Health",
    "WRHA Community Health Clinics",
    "Mount Carmel Clinic",
    "Access Winnipeg West",
    "Access Fort Garry",
    "Access NorWest",
    "Access River East",
    "Access Downtown",
    "CancerCare Manitoba",
    "Children's Hospital of Winnipeg",
    "Women's Health Clinic",
    "Centre de santé Saint-Boniface",
    "Crisis Response Centre",
    "Nine Circles Community Health Centre"
]

employers_in_winnipeg = [
    # Agriculture and Farming
    "Paterson GlobalFoods",
    "Richardson International",
    "Cargill Canada",
    "Parrish & Heimbecker",
    "Maple Leaf Agri-Farms",
    "HyLife",
    "Viterra",
    "Bayer CropScience",
    "Monsanto Canada",
    "Peak of the Market",

    # Education
    "University of Manitoba",
    "Red River College Polytechnic",
    "Winnipeg School Division",
    "Seven Oaks School Division",
    "Louis Riel School Division",
    "Pembina Trails School Division",
    "St. John's-Ravenscourt School",
    "Manitoba Institute of Trades and Technology",

    # Healthcare
    "Health Sciences Centre Winnipeg",
    "St. Boniface Hospital",
    "Grace Hospital",
    "CancerCare Manitoba",
    "Misericordia Health Centre",

    # Transportation and Logistics
    "Bison Transport",
    "TransX",
    "CN Rail",
    "Manitoba Trucking Association",
    "Winnipeg Airports Authority",

    # Manufacturing
    "New Flyer Industries",
    "MacDon Industries Ltd.",
    "StandardAero",
    "Palliser Furniture",
    "Winpak",

    # Trades and Construction
    "Maple Leaf Construction",
    "Randall Plumbing and Heating Ltd.",
    "Pinnacle Roofing Ltd.",
    "Qualico Homes",
    "Westland Construction",

    # Retail and Hospitality
    "Princess Auto",
    "Stella’s Café and Bakery",
    "The Forks Market",
    "King’s Head Pub",
    "Peasant Cookery",

    # Financial Services
    "Great-West Lifeco",
    "IGM Financial",
    "Assiniboine Credit Union",
    "Manitoba Public Insurance",

    # Others
    "Manitoba Hydro",
    "Royal Canadian Mint",
    "SkipTheDishes",
    "City of Winnipeg",
    "Red River Mutual"
]

employers_with_jobs = {
    # Agriculture and Farming
    "Paterson GlobalFoods": ["Grain Handler", "Warehouse Worker", "Quality Control Technician", "Forklift Operator", "Farm Equipment Mechanic"],
    "Richardson International": ["Grain Elevator Operator", "Agronomist", "Logistics Coordinator", "Maintenance Technician", "Truck Driver"],
    "Cargill Canada": ["Food Processing Worker", "Plant Operator", "Quality Assurance Technician", "Sanitation Worker", "Production Line Worker"],
    "HyLife": ["Pork Production Worker", "Herdsperson", "Livestock Handler", "Feed Mill Operator", "Maintenance Worker"],
    "Peak of the Market": ["Vegetable Packer", "Farm Worker", "Tractor Operator", "Warehouse Associate", "Driver"],
    "Viterra": ["Grain Elevator Assistant", "Agricultural Sales Representative", "Truck Driver", "Plant Operator", "Safety Coordinator"],
    "Bayer CropScience": ["Field Technician", "Lab Assistant", "Seed Technician", "Equipment Operator", "Data Entry Clerk"],
    "Monsanto Canada": ["Research Technician", "Field Worker", "Warehouse Worker", "Lab Assistant", "Quality Control Technician"],
    "Parrish & Heimbecker": ["Grain Elevator Operator", "Maintenance Worker", "Truck Driver", "Plant Operator", "Safety Coordinator"],
    "Maple Leaf Agri-Farms": ["Grain Elevator Operator", "Maintenance Worker", "Truck Driver", "Plant Operator", "Safety Coordinator"],

    # Education
    "University of Manitoba": ["Teaching Assistant", "Lab Technician", "Custodian", "Groundskeeper", "Security Guard"],
    "Red River College Polytechnic": ["Instructor", "Lab Assistant", "Facilities Technician", "Cleaner", "Food Service Worker"],
    "Winnipeg School Division": ["Teacher", "Educational Assistant", "School Bus Driver", "Janitor", "Lunchroom Supervisor"],
    "Seven Oaks School Division": ["Teacher", "Library Technician", "Clerical Assistant", "Custodian", "Lunchroom Supervisor"],
    "Louis Riel School Division": ["Teacher", "Educational Assistant", "Library Assistant", "Bus Driver", "Office Administrator"],
    "Pembina Trails School Division": ["Teacher", "Administrative Assistant", "Library Technician", "Maintenance Worker", "Clerical Staff"],
    "St. John's-Ravenscourt School": ["Teacher", "Coach", "Custodian", "Kitchen Assistant", "Groundskeeper"],
    "Manitoba Institute of Trades and Technology": ["Instructor", "Lab Supervisor", "Facilities Manager", "Food Service Assistant", "Janitor"],

    # Healthcare
    "Health Sciences Centre Winnipeg": ["Nursing Assistant", "Phlebotomist", "Housekeeping Attendant", "Medical Laboratory Technician", "Food Service Worker"],
    "St. Boniface Hospital": ["Ward Clerk", "Porter", "Environmental Services Worker", "Cook", "Laundry Worker"],
    "Grace Hospital": ["Patient Transporter", "Kitchen Helper", "Ward Clerk", "Laundry Worker", "Pharmacy Assistant"],
    "CancerCare Manitoba": ["Medical Assistant", "Lab Technician", "Clerical Support", "Housekeeper", "Pharmacy Assistant"],
    "Misericordia Health Centre": ["Nursing Assistant", "Environmental Services Worker", "Clerk", "Kitchen Worker", "Patient Support Assistant"],

    # Transportation and Logistics
    "Bison Transport": ["Truck Driver", "Freight Loader", "Dispatcher", "Diesel Mechanic", "Safety Officer"],
    "TransX": ["Logistics Coordinator", "Freight Handler", "Truck Driver", "Maintenance Technician", "Forklift Operator"],
    "CN Rail": ["Train Conductor", "Track Maintenance Worker", "Signal Technician", "Freight Car Mechanic", "Operator"],
    "Winnipeg Airports Authority": ["Baggage Handler", "Security Screener", "Aircraft Refueler", "Maintenance Worker", "Customer Service Agent"],
    "Manitoba Trucking Association": ["Logistics Coordinator", "Freight Handler", "Truck Driver", "Maintenance Technician", "Forklift Operator"],

    # Manufacturing
    "New Flyer Industries": ["Assembler", "Welder", "Painter", "Machine Operator", "Inspector"],
    "MacDon Industries Ltd.": ["CNC Operator", "Material Handler", "Fabricator", "Production Worker", "Tool Crib Attendant"],
    "StandardAero": ["Aircraft Mechanic", "Machinist", "Parts Inspector", "Tool Crib Attendant", "Maintenance Worker"],
    "Palliser Furniture": ["Furniture Assembler", "Warehouse Worker", "Forklift Operator", "Upholsterer", "Shipping Clerk"],
    "Winpak": ["Machine Operator", "Packager", "Quality Control Technician", "Forklift Operator", "Maintenance Mechanic"],

    # Trades and Construction
    "Maple Leaf Construction": ["Construction Laborer", "Heavy Equipment Operator", "Survey Assistant", "Concrete Finisher", "Traffic Control Worker"],
    "Randall Plumbing and Heating Ltd.": ["Plumber", "HVAC Technician", "Pipefitter", "Helper", "Dispatcher"],
    "Pinnacle Roofing Ltd.": ["Roofer", "Sheet Metal Worker", "Estimator", "Laborer", "Safety Officer"],
    "Qualico Homes": ["Drywaller", "Painter", "Carpenter", "Plumber", "Electrician"],
    "Westland Construction": ["Concrete Finisher", "Heavy Equipment Operator", "Carpenter", "Laborer", "Estimator"],

    # Retail and Hospitality
    "Princess Auto": ["Retail Associate", "Warehouse Worker", "Stockroom Clerk", "Cashier", "Maintenance Worker"],
    "Stella’s Café and Bakery": ["Server", "Dishwasher", "Line Cook", "Baker", "Barista"],
    "The Forks Market": ["Retail Assistant", "Vendor Staff", "Janitor", "Food Prep Worker", "Cashier"],
    "King’s Head Pub": ["Bartender", "Server", "Kitchen Porter", "Cook", "Maintenance Staff"],
    "Peasant Cookery": ["Chef", "Dishwasher", "Host/Hostess", "Kitchen Assistant", "Delivery Driver"],

    # Financial Services
    "Great-West Lifeco": ["Customer Service Representative", "Data Entry Clerk", "Mailroom Worker", "Building Maintenance Worker", "Clerical Assistant"],
    "IGM Financial": ["Administrative Assistant", "Receptionist", "Data Processor", "Building Maintenance Worker", "Courier"],
    "Assiniboine Credit Union": ["Bank Teller", "Customer Service Representative", "Loan Processor", "Security Officer", "Data Entry Clerk"],
    "Manitoba Public Insurance": ["Bank Teller", "Customer Service Representative", "Loan Processor", "Security Officer", "Data Entry Clerk"],

    # Utilities
    "Manitoba Hydro": ["Power Line Technician", "Meter Reader", "Utility Worker", "Electrician", "Groundskeeper"],

    # Others
    "Royal Canadian Mint": ["Machine Operator", "Packaging Worker", "Quality Control Technician", "Maintenance Technician", "Janitor"],
    "SkipTheDishes": ["Food Delivery Driver", "Customer Support Representative", "Warehouse Staff", "Order Picker", "Technical Support Agent"],
    "City of Winnipeg": ["Sanitation Worker", "Parks Maintenance Worker", "Snow Plow Operator", "Transit Operator", "Clerk"],
    "Red River Mutual": ["Insurance Claims Processor", "Customer Support Representative", "Risk Assessment Specialist", "Data Entry Clerk", "Maintenance Technician"]
}





### Generate prompts

In [129]:
# create prompts
import random

NUM_SAMPLES = 100

all_prompts = []

for _ in range(NUM_SAMPLES):
    # sample
    patient_employment_company = random.choice(employers_in_winnipeg)
    country = random.choice(countries)
    health_org = random.choice(health_orgs)
    patient_occupation = random.choice(employers_with_jobs[patient_employment_company])

    prompt = create_json_prompt_for_synthetic_data(language="english",
                                                   types_of_text="clinical encounter notes with abbreviations (f/u for follow up, c/o for cough, ekg, but don't be limited to cough related ailments), sentences should look like they were written in a hurry with articles and pronouns missing (for instance, instead of \"he presents with chest pain\", it should be \"chest pain, yellow discharge\", avoid using pronouns in every sentence unless the subject changes), formatted in SOAP format, but not necessarily with the SOAP headings.",
                                                   patient_employment_company=patient_employment_company,
                                                   health_org_patient_visited=health_org,
                                                   patient_occupation=patient_occupation,
                                                   country=country)
    all_prompts.append(prompt)
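If you want the sampled attribute combinations to be reproducible, one option is to draw them from a seeded `random.Random` instance and to pick the employer directly from the keys of the employer-to-jobs mapping, so that the employer list and the mapping cannot drift apart. The helper `sample_attributes` and the toy data below are assumptions for illustration only.

```python
import random

def sample_attributes(employers_with_jobs, locations, health_orgs, rng):
    """Draw one consistent set of prompt attributes using the given random generator."""
    # Sampling from the mapping's keys guarantees the occupation lookup succeeds
    employer = rng.choice(sorted(employers_with_jobs))
    return {
        "patient_employment_company": employer,
        "patient_occupation": rng.choice(employers_with_jobs[employer]),
        "health_org_patient_visited": rng.choice(health_orgs),
        "country": rng.choice(locations),
    }

# Toy data standing in for the full lists defined earlier
toy_jobs = {"Bison Transport": ["Truck Driver", "Dispatcher"], "Winpak": ["Packager"]}
first = sample_attributes(toy_jobs, ["Winnipeg"], ["Grace Hospital"], random.Random(0))
second = sample_attributes(toy_jobs, ["Winnipeg"], ["Grace Hospital"], random.Random(0))
print(first == second)  # True: the same seed yields the same attributes
```

The returned dictionary can be unpacked into the prompt function with `**`, alongside the fixed `language` and `types_of_text` arguments.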


KeyError: 'Manitoba Trucking Association'

### Generate outputs

In [108]:
output, processed_output = generate_from_prompts(all_prompts, llm, sampling_params)

Processed prompts: 100%|██████████| 100/100 [00:22<00:00,  4.52it/s, est. speed input: 2155.11 toks/s, output: 1512.86 toks/s]


In [109]:
output[0]

{'text': 'pt has chest pain, yd, wtlh shortness of breath. pt denies any history of cvd. last ekg normal. f/u 2wks.',
 'entities': [{'entity': 'pt', 'types': ['patient proper name']},
  {'entity': 'chest pain', 'types': ['symptom']},
  {'entity': 'yd', 'types': ['patient symptom', 'unit of measurement']},
  {'entity': 'wtlh', 'types': ['conjunction']},
  {'entity': 'shortness of breath', 'types': ['symptom']},
  {'entity': 'pt denies', 'types': ['negation']},
  {'entity': 'cvd', 'types': ['disease']},
  {'entity': 'last ekg', 'types': ['procedure']},
  {'entity': 'normal', 'types': ['result']},
  {'entity': 'f/u', 'types': ['procedure']},
  {'entity': '2wks', 'types': ['time', 'unit of measurement']}]}

### Some statistics

In [None]:
lengths = []

for d in processed_output:
    lengths.append(len(d["tokenized_text"]))

print("Avg num tokens:", sum(lengths) / len(lengths))

Avg num tokens: 76.82291666666667


In [None]:
len_ner = []

for d in processed_output:
    len_ner.append(len(d["ner"]))

print("Avg num of entities:", sum(len_ner) / len(len_ner))

Avg num of entities: 11.875


In [None]:
# Collect the label of every annotated span (labels repeat across examples)
entity_types = []

for d in processed_output:
    for n in d["ner"]:
        entity_types.append(str(n[2]).lower())

print("Total entity annotations:", len(entity_types))

Total entity annotations: 1140


In [None]:
# Top 10 entity types

from collections import Counter
Counter(entity_types).most_common(10)

[('organization', 106),
 ('location', 86),
 ('job title', 83),
 ('person', 71),
 ('country', 41),
 ('technology', 40),
 ('field of study', 38),
 ('education', 29),
 ('degree', 24),
 ('quantity', 23)]
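A complementary statistic is the share of annotations whose label belongs to the schema requested in the prompt. The sketch below computes it on toy data; `ALLOWED_TYPES` and `label_report` are names introduced for this example.

```python
from collections import Counter

ALLOWED_TYPES = {"patient occupation", "patient proper name", "patient employed by"}

def label_report(processed, allowed_types=ALLOWED_TYPES):
    """Return (share of in-schema annotations, off-schema labels with their counts)."""
    counts = Counter(label for ex in processed for _, _, label in ex["ner"])
    total = sum(counts.values())
    in_schema = sum(c for label, c in counts.items() if label in allowed_types)
    share = in_schema / total if total else 0.0
    off_schema = [(label, c) for label, c in counts.most_common() if label not in allowed_types]
    return share, off_schema

# Toy examples in the same format as `processed_output`
toy = [
    {"tokenized_text": ["a"], "ner": [(0, 0, "patient occupation"), (0, 0, "symptom")]},
    {"tokenized_text": ["b"], "ner": [(0, 0, "symptom"), (0, 0, "patient employed by")]},
]
share, off_schema = label_report(toy)
print(share, off_schema)  # 0.5 [('symptom', 2)]
```

A low share suggests the prompt's type restriction is being ignored, which may call for a stricter prompt, a stronger model, or filtering before training.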

### Save for training

In [None]:
# Save to JSON
def save_data_to_file(data, filepath):
    """Saves the processed data to a JSON file."""
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(data, f)

In [None]:
output_file = "job_ads_data_gliner.json"

save_data_to_file(processed_output, output_file)
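Before training, it can be worth reloading the saved file and checking that every span is well formed. The following is a minimal sketch assuming the `tokenized_text` / `ner` format produced above; `validate_gliner_file` is a hypothetical helper, and the temporary file only exists to keep the example self-contained.

```python
import json
import os
import tempfile

def validate_gliner_file(filepath):
    """Return (number of examples, indices of examples with an out-of-range span)."""
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    bad = []
    for idx, ex in enumerate(data):
        n = len(ex["tokenized_text"])
        for start, end, _label in ex["ner"]:
            # Spans use inclusive token indices, so end must be a valid position
            if not (0 <= start <= end < n):
                bad.append(idx)
                break
    return len(data), bad

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    sample = [
        {"tokenized_text": ["Pt", "is", "a", "welder"], "ner": [[3, 3, "patient occupation"]]},
        {"tokenized_text": ["ok"], "ner": [[0, 2, "patient proper name"]]},  # deliberately invalid
    ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    total, bad = validate_gliner_file(path)

print(total, bad)  # 2 [1]
```

Note that tuples are written to JSON as lists, so spans come back as `[start, end, label]` after reloading.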