# Specialized Synthetic Data Generation with OpenRouter and Hugging Face Models on Google Colab

With this specialized synthetic data generator, you can generate synthetic data in several formats (JSON, XML, and SQL) for your particular industry of interest.

In the Gradio interface, you provide the following inputs:

- Industry/niche
- Description of what the data will be used for
- Example of the data
- Model to use (default OpenAI)
- Output format (default JSON)
- Sample size (default 10)

openai

For the Hugging Face version: https://colab.research.google.com/drive/1RznL2L8ZC-vndzBcsF1Osj8alXAijYNg?usp=sharing

I recommend a low-cost or free T4 box. I have made sure that every model in the list can run on at least a free T4 box.

In [70]:
# import statements for all libraries used
import os
from dotenv import load_dotenv
from openai import OpenAI
import gradio as gr

In [71]:
# load API keys from the .env file
load_dotenv(override=True)
openai_api_key = os.getenv('OPENAI_API_KEY')
openrouter_api_key = os.getenv('OPENROUTER_API_KEY')
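
`os.getenv` returns `None` when a variable is missing, so a forgotten key only shows up later as an authentication error from the API. The following is a minimal sketch of failing early instead; `require_env` is a hypothetical helper, not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust it to wherever you keep your keys.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

With this helper, `openrouter_api_key = require_env('OPENROUTER_API_KEY')` would stop the notebook at the cell where the key is loaded.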


In [72]:
# function for building prompt

def generate_prompt(example, sample_size, industry, description):
    system_prompt = """
    You are a specialized synthetic data generator.
    You generate new realistic but fake records.

    You can output generated data in JSON, XML, and SQL only, given:
    - Sample size.
    - Industry/Niche.
    - Description - what the data will be used for.
    - Example -  You will infer the data type from the example.
    
    """

    user_prompt = f"""
    Here is an example of data to generate: {example}
    Task
    - Generate {sample_size} new sample(s) of the example.
    - Here is the purpose: {description} in {industry} Industry/niche

    Rules:
    - Output only valid format based on the example given
    - Output must be a list, in the format of the example
    - No Explanation.
    - Do not wrap response in code block
    """

    return system_prompt, user_prompt
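
Because the prompts are triple-quoted strings written inside an indented function body, every line keeps its leading spaces, as the printed prompt at the end of this notebook shows. This is usually harmless, but if you prefer a clean prompt, here is a minimal sketch using `textwrap.dedent`. The names `USER_PROMPT_TEMPLATE` and `build_user_prompt` are hypothetical, and the template is shortened for illustration.

```python
from textwrap import dedent

# The template is dedented before the values are filled in, so a multi-line
# example (JSON, XML, SQL) cannot disturb the common indentation.
USER_PROMPT_TEMPLATE = dedent("""\
    Here is an example of data to generate:
    {example}

    Task
    - Generate {sample_size} new sample(s) of the example.
    - Purpose: {description} in the {industry} industry/niche
""")


def build_user_prompt(example, sample_size, industry, description):
    """Fill the dedented template and trim surrounding whitespace."""
    return USER_PROMPT_TEMPLATE.format(
        example=example,
        sample_size=sample_size,
        industry=industry,
        description=description,
    ).strip()
```

Using `str.format` on a pre-dedented template, rather than an f-string, also means braces inside a JSON example are passed through untouched.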


In [73]:
# using models available via openrouter

OPENROUTER_URL = "https://openrouter.ai/api/v1"

openrouter = OpenAI(
    base_url=OPENROUTER_URL,
    api_key=openrouter_api_key,  # without this, the client falls back to OPENAI_API_KEY
)

def openrouter_model(model, system_prompt, user_prompt):
    
    messages = [
        { "role": "system", "content": system_prompt },
        { "role": "user", "content": user_prompt },
    ]

    response = openrouter.chat.completions.create(
        model=model,
        messages=messages,
    )

    return response.choices[0].message.content
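
The prompt asks the model not to wrap its response in a code block, but a model may still do so. A minimal sketch of a defensive clean-up step follows; `strip_code_fence` is a hypothetical helper and is not called anywhere in this notebook.

```python
import re

# Matches text that is entirely wrapped in one Markdown code fence,
# with an optional language tag after the opening fence.
_FENCE_RE = re.compile(r"^```[\w+-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return `text` without a surrounding Markdown code fence, if one is present."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    return match.group(1).strip() if match else stripped
```

It could be applied to `response.choices[0].message.content` before returning it; text without a fence is returned unchanged apart from trimmed whitespace.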

In [74]:
#map model to provider
model_to_provider = {
    'openai/gpt-4': 'openrouter',
    'google/gemini-3.1-pro-preview': 'openrouter',
    'mistralai/Mistral-7B-Instruct-v0.1': 'openrouter'
}
MODEL_GPT = 'openai/gpt-4'

def get_model_provider(model=MODEL_GPT):
    return model_to_provider.get(model, 'openrouter')


In [79]:
#let's route request to provider based on model

def route_to_provider(model, system_prompt, user_prompt):
    provider = get_model_provider(model)
    if provider == 'openrouter':
        return openrouter_model(model, system_prompt, user_prompt)
    else:
        # handle unknown provider
        response = f"No provider is implemented for the {model} yet"
        print(response)
        return response  
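
Note that `get_model_provider` falls back to `'openrouter'` for any model that is not in `model_to_provider`, so the `else` branch above is not reached as written. If you would rather reject unmapped models, a minimal sketch could look like this. `MODEL_TO_PROVIDER`, `resolve_provider`, and `route` are hypothetical names, and the provider call is replaced by a placeholder string.

```python
from typing import Optional

# Shortened copy of the mapping, for illustration only.
MODEL_TO_PROVIDER = {
    "openai/gpt-4": "openrouter",
}


def resolve_provider(model: str) -> Optional[str]:
    """Return the provider for `model`, or None when the model is not mapped."""
    return MODEL_TO_PROVIDER.get(model)


def route(model: str) -> str:
    """Dispatch on the provider; unmapped models get an explicit message."""
    provider = resolve_provider(model)
    if provider is None:
        return f"No provider is implemented for {model} yet"
    # Placeholder for the real call, e.g. openrouter_model(...).
    return f"would call {provider} with {model}"
```

Keeping the fallback, as the notebook does, has the advantage that any model ID OpenRouter accepts can be used without editing the mapping.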

In [80]:
# define generator callback function for Gradio

def generate_fake_data(model, sample_size, description, example, industry):
    
    system_prompt, user_prompt = generate_prompt(example, sample_size, industry, description)

    return route_to_provider(model, system_prompt, user_prompt) 
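
The callback returns the model's text as is, so nothing checks that the output is valid or contains the requested number of records. Below is a minimal sketch of such a check for JSON arrays and XML, using only the standard library. `count_records` is a hypothetical helper; it assumes XML records arrive as sibling top-level elements (like the `<User>` example used later in this notebook), and it does not handle SQL.

```python
import json
import re
import xml.etree.ElementTree as ET


def count_records(text: str) -> int:
    """Count records in a JSON array or in sibling XML elements.

    Raises ValueError when the text is neither valid JSON nor well-formed XML.
    """
    text = text.strip()
    if text.startswith("["):
        # json.JSONDecodeError is a subclass of ValueError.
        return len(json.loads(text))
    if text.startswith("<"):
        # Several sibling records have no single root, so drop any XML
        # declarations and wrap the records in a synthetic root element.
        body = re.sub(r"<\?xml[^>]*\?>", "", text)
        try:
            root = ET.fromstring(f"<records>{body}</records>")
        except ET.ParseError as exc:
            raise ValueError(f"not well-formed XML: {exc}") from exc
        return len(root)
    raise ValueError("only JSON arrays and XML are handled in this sketch")
```

Comparing the returned count with `sample_size` gives a quick signal that the model followed the instructions.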

In [78]:
# Gradio interface for this implementation

availableModels = list(model_to_provider.keys())
industry = ['Fintech', 'PropTech', 'Edutech', 'HealthTech', 'devTools', 'Generic']
interface = gr.Interface(
    fn=generate_fake_data,
    inputs=[
        gr.Dropdown(availableModels, label="Select Model"),
        gr.Slider(1, 100, value=10, label="Number of samples", step=1),
        gr.TextArea(label="Describe the purpose of the data"),
        gr.TextArea(label="Share the example of the data you want"),
        gr.Dropdown(industry, label="Industry/Niche"),
    ],
    outputs=gr.TextArea(label="Specialized Synthetic Data Generator"),
    title="Multi-Model Synthetic Data Generator",
    description="Generate structured synthetic datasets using your preferred model."
)

interface.launch()

* Running on local URL:  http://127.0.0.1:7874
* To create a public link, set `share=True` in `launch()`.





    You are a specialized synthentic data generator.
    You generate new realistic but fake records

    You can output generated data in Json, xml, and SQL only given:
    - Sample size.
    - Industry/Niche.
    - Description - what the data will be used for.
    - Example -  You will infer the data type from the example.

     
    Here is an example of data to generate: <?xml version="1.0" encoding="UTF-8"?>
<User>
    <email>user@mail.com</email>
    <status>suspended</status>
    <loginSession>sess_yfjhjdhjsdjhjhdf</loginSession>
    <tier>5</tier>
    <approved>true</approved>
</User>
    Task
    - Generate 10 new sample(s) of the example.
    - Here is the purspose: User Api data in Fintech Industry/niche

    Rules:
    - Output only valid format based on the example given
    - Output must be a list, in the format of the example
    - No Explanation.
    - Do not wrap response in code block
    
openrouter here
