## Generating Synthetic NER data for fine-tuning GLiNER
This notebook shows how to create a synthetic NER dataset using OpenAI's API. The example was developed by the *Congruence Engine* team to produce a synthetic dataset for fine-tuning the NER model [GLiNER](https://github.com/urchade/GLiNER).

In this example, we use a dataset of 2,517 textile terms extracted from four glossaries published in the United States and the United Kingdom in the late 19th and early 20th centuries. Before generating any data, we classified each term under one of the following labels:
* textile fabric
* textile fabric component
* textile fabric imperfection
* textile fibre
* textile manufacturing process
* textile machinery
* textile weave
* textile manufacturing chemical
* textile dye
* textile industry expression
* textile industry unit of measurement
* textile waste material

To run this code, you will need an OpenAI API Key. You can read the documentation for the API [here](https://platform.openai.com/docs/overview). It is also possible to run this code on other models and APIs, including by setting up a smaller model to run locally on your device via a programme like [LM Studio](https://lmstudio.ai/).

To see how we used this data to fine-tune a GLiNER model, see [this notebook](https://github.com/congruence-engine/universal-ner-with-gliner/blob/main/code/GLiNER_finetune_notebook.ipynb). You can also find a [Colab version here](https://colab.research.google.com/drive/1j1tE2bi5qrWVBKyTEgwbMISXUOvasMbs?usp=sharing).

**Import the required libraries**

In [None]:
import os
import json
import concurrent.futures
import time
import pandas as pd
from openai import OpenAI

**Authenticate with the OpenAI API**

You will need an [OpenAI API](https://platform.openai.com/docs/overview) key for this. The cell below reads it from the `OPENAI_API_KEY` environment variable.

In [None]:
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
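
If the environment variable is not set, `os.getenv` returns `None` and the problem only surfaces later as an authentication error. A minimal sketch of an explicit check is shown here; the helper name `get_api_key` is hypothetical and not part of the original notebook.

```python
import os


def get_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in `env_var`, or raise a clear error if it is missing."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key
```

The result could then be passed as `api_key=get_api_key()` when the client is created.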


**The code below does the following:**
* Establishes a standard prompt to send to the model, which includes:
      1. An explanation of the task
      2. A single term extracted from the dataset, along with its definition, and its entity label
      3. Format requirements
      4. An output schema in JSON format
      5. A full example
* Determines how many examples to generate from each term (here set to 3)
* Iterates over each term in the dataframe, and passes the relevant data to the model
* Checks that each response is valid JSON, and collects all valid responses into a single JSON output.
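
The validation step in the last bullet can be made stricter than a plain JSON parse. The sketch below is an illustration under stated assumptions, not the notebook's own method: the names `parse_model_output` and `is_valid_example` are hypothetical, and the checks follow the output schema described in the prompt (a `text` string plus a list of `entities`, each with an `entity` string and a non-empty list of `types`). It also tolerates a Markdown code fence around the answer, which models sometimes add despite instructions.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_output(raw):
    """Return the parsed dict, or None if `raw` is not a JSON object."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (it may carry a tag such as "json")
        # and the closing fence line, if present.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def is_valid_example(example):
    """Check that `example` matches the schema requested in the prompt."""
    if not isinstance(example, dict):
        return False
    text = example.get("text")
    entities = example.get("entities")
    if not isinstance(text, str) or not isinstance(entities, list):
        return False
    for item in entities:
        if not isinstance(item, dict):
            return False
        name = item.get("entity")
        types = item.get("types")
        if not isinstance(name, str) or not isinstance(types, list):
            return False
        if not types or not all(isinstance(t, str) for t in types):
            return False
        # An entity that never appears in the text cannot be located later.
        if name.lower() not in text.lower():
            return False
    return True
```

A response would be kept only if `parse_model_output` returns a dict and `is_valid_example` accepts it.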




In [None]:
def create_json_prompt_for_synthetic_data(term, definition, predicted_label, **kwargs):
    attributes = {key: value for key, value in kwargs.items() if value != "n/a"}

    prompt = f"""
**Objective:**
For the Term stated below, produce a realistic text passage that includes clearly identified named entities. Each entity should be meticulously labeled according to its type for straightforward extraction.

**Term:**
{term}

**Definition of the Term:**
{definition}

**Entity Label:**
{predicted_label}

**Format Requirements:**
- The output should be formatted in JSON, containing the text and the corresponding entities list.
- Ensure that none of the texts exceed 380 tokens.
- Each entity in the text should be accurately marked and annotated in the "entities" list.
- Meticulously follow all the listed attributes.
- Do not include any additional entities other than the Term stated above.
- Do not print "json" before the output, and do not include a code block in your response.

**Entity Annotation Details:**
- Entity types can be multiwords separated by space. For instance, use "Entity Type" rather than "entity_type".

**Output Schema:**

{{
  "text": "{{text content}}",
  "entities": [
    {{"entity": "entity name", "types": ["Type One", "Type Two", ...]}},
    ...
  ]
}}

**Here is a real-world example:**
{{
"text": "The burring machine in general use consists of the following parts: feed-sheet and rollers, revolving fan, lattice-sheet, revolving brush for passing the wool on to the surft, or cylinder; main cylinder, burr rollers, grid, and a large roller for beating the burrs on to the same; and, lastly, revolving brush for removing the wool off the cylinder. The tappet loom is one of the finest pieces of equipment available to the modern weaver",
 "entities": [{{"entity": "burring machine",
   "types": ["Textile Machinery"]}},
  {{"entity": "tappet loom", "types": ["Textile Machinery"]}},
  {{"entity": "rollers", "types": ["Textile Machinery"]}},
  {{"entity": "revolving fan", "types": ["Textile Machinery"]}},
  {{"entity": "revolving brush", "types": ["Textile Machinery"]}},
  {{"entity": "surft", "types": ["Textile Machinery"]}},
  {{"entity": "cylinder", "types": ["Textile Machinery"]}},
  {{"entity": "burr rollers", "types": ["Textile Machinery"]}},
  {{"entity": "large roller", "types": ["Textile Machinery"]}}]
}}

"""

    attributes_string = " ".join([f'{key}="{value}"' for key, value in attributes.items()])
    prompt += f"\n<start {attributes_string}>\n"

    return prompt

all_prompts = []

data = pd.read_csv('path/to/your/data.csv') # Replace with your actual data source

NUM_EXAMPLES_PER_TERM = 3  # Define the number of examples per term

# Iterate over each row in the DataFrame
for index, row in data.iterrows():
    term = row['Term']
    definition = row['Definition']
    predicted_label = row['categories_1']  # Ensure the column name matches your CSV

    # Generate multiple prompts per term
    for _ in range(NUM_EXAMPLES_PER_TERM):
        prompt = create_json_prompt_for_synthetic_data(
            term=term,
            definition=definition,
            predicted_label=predicted_label,
            language="english",
            types_of_text="descriptions of historic textile machinery and their use"
        )
        all_prompts.append(prompt)

NUM_SAMPLES = len(all_prompts)

def generate_from_prompts(prompts, model="gpt-4o-mini", temperature=0.7, max_workers=5):
    all_outs = []

    def process_prompt(prompt):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                temperature=temperature
            )

            output_text = response.choices[0].message.content.strip()

            if output_text.startswith("{") and output_text.endswith("}"):
                js = json.loads(output_text)
                return js
            else:
                print("Output does not look like JSON.")
                return None

        except json.JSONDecodeError as e:
            print(f"JSON decoding error: {e}")
            print("Response was:", output_text)
            return None
        except Exception as e:
            print(f"An error occurred: {e}")
            return None

    # Use ThreadPoolExecutor for concurrent API calls
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_prompt, prompt) for prompt in prompts]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            if result is not None:
                all_outs.append(result)

    return all_outs

all_outs = generate_from_prompts(all_prompts)

# Structure the results into a single JSON object
results = {
    "results": all_outs
}

json_results = json.dumps(results, indent=2)
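
Because `process_prompt` returns `None` on any exception, a transient API error (for example a rate limit or timeout) silently drops that example from the dataset. One option is to retry failed calls with exponential backoff. The helper below is a minimal sketch, not part of the original notebook: `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument exists only so the waiting behaviour can be swapped out in tests.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `func()` and retry with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the error.
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside `process_prompt`, the request could be wrapped in a zero-argument function or lambda and passed to this helper, so that the existing `except` clauses only run once all attempts have failed.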



**Save the results to a JSON file in your directory**



In [None]:
with open("path/to/your/output/results.json", "w") as json_file: # Define your output directory
    json_file.write(json_results)
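
The saved examples store entities as strings, whereas NER training data is usually expressed as token-level spans. The sketch below shows one way to convert an example, assuming the token-span layout used in GLiNER's published sample data (a `tokenized_text` list and a `ner` list of `[start, end, label]` with inclusive token indices). That target format, the simple regex tokenizer, the lowercasing of labels to match the label list at the top of this notebook, and the function names are all assumptions; check the fine-tuning notebook linked above for the exact format it expects.

```python
import re


def tokenize(text):
    """Split text into word and punctuation tokens (an assumed, simple tokenizer)."""
    return re.findall(r"\w+(?:[-_]\w+)*|\S", text)


def to_token_spans(example):
    """Convert one generated example into token-level spans.

    Returns {"tokenized_text": [...], "ner": [[start, end, label], ...]},
    where start and end are inclusive token indices.
    """
    tokens = tokenize(example["text"])
    lowered = [tok.lower() for tok in tokens]
    spans = []
    for item in example["entities"]:
        entity_tokens = [tok.lower() for tok in tokenize(item["entity"])]
        width = len(entity_tokens)
        if width == 0:
            continue
        # Record every occurrence of the entity's token sequence in the text.
        for start in range(len(tokens) - width + 1):
            if lowered[start:start + width] == entity_tokens:
                for label in item["types"]:
                    spans.append([start, start + width - 1, label.lower()])
    return {"tokenized_text": tokens, "ner": spans}
```

Applying `to_token_spans` to each item in `results["results"]` would give a list of records in this span format. Entities that the model listed but that do not occur in the text simply produce no span.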