# Synthetic Dataset Generator
- 🚀 Live Demo: https://huggingface.co/spaces/lisekarimi/datagen
- 🧑‍💻 Repo: https://github.com/lisekarimi/datagen

---

- 🌍 **Task**: Generate realistic synthetic datasets
- 🎯 **Supported Data Types**: Tabular, Text, Time-series
- 🧠 **Models**: GPT (OpenAI), Claude (Anthropic), CodeQwen1.5-7B-Chat (via Hugging Face Inference Endpoints), Llama (run in Google Colab on a T4 GPU)
- 🚀 **Tools**: Python, Gradio UI, OpenAI / Anthropic / Hugging Face APIs
- 📤 **Output Formats**: JSON and CSV files
- 🧑‍💻 **Skill Level**: Intermediate

🎯 **How It Works**

1️⃣ Define your business problem or dataset topic.

2️⃣ Choose the dataset type, output format, model, and number of samples.

3️⃣ The LLM generates Python code, which you can edit as needed.

4️⃣ Execute the code to generate your output file.

🛠️ **Requirements**
- ⚙️ **Hardware**: ✅ GPU required to download and run Llama locally; Google Colab with a T4 is recommended
- 🔑 OpenAI API Key (for GPT)
- 🔑 Anthropic API Key (for Claude)
- 🔑 Hugging Face Token (for Llama and Code Qwen)

**Deploy CodeQwen Endpoint:**
- Visit https://huggingface.co/Qwen/CodeQwen1.5-7B-Chat
- Click **Deploy** → **Inference Endpoints** → **Create Endpoint** (requires credit card)
- Copy your endpoint URL (`https://[id].us-east-1.aws.endpoints.huggingface.cloud`) and set it as `CODE_QWEN_URL` in the Initialization section

⚙️ **Customizable by user**
- 🤖 Selected model: GPT / Claude / Llama / Code Qwen
- 📜 `system_message`: Controls model behavior (concise, accurate, structured)
- 💬 `user_prompt`: Built dynamically from the form fields (dataset type, output format, business problem, number of samples)

---
📢 Find more LLM notebooks on my [GitHub repository](https://github.com/lisekarimi/lexo)

## Imports

In [None]:
# Install required packages in Google Colab
%pip install -q python-dotenv gradio anthropic openai requests torch bitsandbytes transformers sentencepiece accelerate

In [None]:
import re
import sys
import subprocess
import threading
import anthropic
import torch
import gradio as gr
from openai import OpenAI
from huggingface_hub import InferenceClient, login
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, TextIteratorStreamer, BitsAndBytesConfig

## Initialization

In [None]:
# Read the API keys from Google Colab secrets.
# Make sure OPENAI_API_KEY, ANTHROPIC_API_KEY, and HF_TOKEN are set in your Colab environment.
openai_api_key = userdata.get("OPENAI_API_KEY")
anthropic_api_key = userdata.get("ANTHROPIC_API_KEY")
hf_token = userdata.get("HF_TOKEN")

In [None]:
OPENAI_MODEL = "gpt-4o-mini"
CLAUDE_MODEL = "claude-3-5-sonnet-20240620"
LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"

code_qwen = "Qwen/CodeQwen1.5-7B-Chat"
# Replace with the URL of your own Inference Endpoint
CODE_QWEN_URL = "https://zfkokxzs1xrqv13v.us-east-1.aws.endpoints.huggingface.cloud"

login(hf_token, add_to_git_credential=True)
openai = OpenAI(api_key=openai_api_key)
claude = anthropic.Anthropic(api_key=anthropic_api_key)

## Prompts definition

In [None]:
system_message = """
You are a helpful assistant whose main purpose is to generate datasets for business problems.

Be less verbose.
Be accurate and concise.

The user will describe a business problem. Based on this, you must generate a synthetic dataset that fits the context.

The dataset should be saved in the format the user specifies, such as CSV or JSON.

The Python code should depend only on numpy, pandas, and the Python standard library.

When saving a DataFrame to JSON using `to_json()`, do not use the `encoding` parameter. Instead, manually open the file with `open()` and specify the encoding. Then pass the file object to `to_json()`.

Ensure Python code blocks are correctly indented, especially inside `with`, `for`, `if`, `try`, and `def` blocks.

Return only the Python code that generates and saves the dataset.
After saving the file, print the code that was executed and a message confirming the dataset was generated successfully.
"""

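To make the target concrete, here is a minimal sketch of the kind of script the system message asks the model to return. It is a hand-written illustration, not actual model output: the column names, the `generate_rows` and `save_csv` helpers, and the output file name are all hypothetical, and it relies only on built-in modules.

```python
import csv
import os
import random
import tempfile


def generate_rows(num_samples, seed=0):
    """Build hypothetical customer-churn rows (illustrative columns only)."""
    rng = random.Random(seed)
    rows = []
    for customer_id in range(1, num_samples + 1):
        rows.append({
            "customer_id": customer_id,
            "monthly_spend": round(rng.uniform(10, 200), 2),
            "churned": rng.choice([0, 1]),
        })
    return rows


def save_csv(rows, path):
    """Write the rows to a CSV file with an explicit encoding."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


rows = generate_rows(num_samples=10)
output_path = os.path.join(tempfile.gettempdir(), "synthetic_churn.csv")
save_csv(rows, output_path)
print(f"Dataset generated successfully: {output_path}")
```
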

In [None]:
def user_prompt(**input_data):
  user_prompt = f"""
      Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
      Business problem: {input_data["business_problem"]}
      Samples: {input_data["num_samples"]}
      """
  return user_prompt
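
As a quick check of what the model actually receives, the sketch below reproduces the same template in a standalone function and prints the result for sample inputs. The function is named `build_user_prompt` here only to keep it separate from the notebook's `user_prompt`, and the business problem is an invented example.

```python
from textwrap import dedent


def build_user_prompt(**input_data):
    """Standalone copy of the notebook's template, for illustration only."""
    return dedent(f"""
        Generate a synthetic {input_data["dataset_type"].lower()} dataset in {input_data["output_format"].upper()} format.
        Business problem: {input_data["business_problem"]}
        Samples: {input_data["num_samples"]}
        """)


prompt = build_user_prompt(
    dataset_type="Tabular",
    output_format="csv",
    business_problem="Predict customer churn for a telecom company",  # invented example
    num_samples=10,
)
print(prompt)
```

The dataset type is lowercased and the output format uppercased, so the casing chosen in the UI does not affect the prompt.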


## Call API for Closed Models

In [None]:
def stream_gpt(user_prompt):
  stream = openai.chat.completions.create(
      model=OPENAI_MODEL,
      messages=[
          {"role": "system", "content": system_message},
          {"role": "user","content": user_prompt},
      ],
      stream=True,
  )

  response = ""
  for chunk in stream:
      response += chunk.choices[0].delta.content or ""
      yield response

  return response


def stream_claude(user_prompt):
  result = claude.messages.stream(
      model=CLAUDE_MODEL,
      max_tokens=2000,
      system=system_message,
      messages=[
          {"role": "user","content": user_prompt}
      ]
  )
  reply = ""
  with result as stream:
      for text in stream.text_stream:
          reply += text
          yield reply
          print(text, end="", flush=True)
  return reply
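
Both functions follow the same pattern: they yield the accumulated text after every chunk rather than the chunk alone, so each yielded value can replace the contents of the output box and the response appears to grow. A minimal sketch of that pattern, with a plain list standing in for the API stream (`accumulate` and `fake_stream` are hypothetical names):

```python
def accumulate(chunks):
    """Yield the text received so far after each incoming chunk."""
    response = ""
    for chunk in chunks:
        response += chunk or ""  # some chunks may carry no text
        yield response


fake_stream = ["import ", "pandas ", None, "as pd"]
snapshots = list(accumulate(fake_stream))
print(snapshots[-1])
```
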


## Call Open Source Models
- Llama is downloaded and run on a T4 GPU (Google Colab).
- Code Qwen runs through a Hugging Face Inference Endpoint.

In [None]:
def stream_llama(user_prompt):
  try:
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user","content": user_prompt},
    ]

    tokenizer = AutoTokenizer.from_pretrained(LLAMA)
    tokenizer.pad_token = tokenizer.eos_token

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )

    model = AutoModelForCausalLM.from_pretrained(
        LLAMA,
        device_map="auto",
        quantization_config=quant_config
    )

    inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)

    thread = threading.Thread(target=model.generate, kwargs={
        "input_ids": inputs,
        "max_new_tokens": 1000,
        "pad_token_id": tokenizer.eos_token_id,
        "streamer": streamer
    })
    thread.start()

    started = False
    reply = ""

    for new_text in streamer:
        if not started:
            if "<|start_header_id|>assistant<|end_header_id|>" in new_text:
                started = True
                new_text = new_text.split("<|start_header_id|>assistant<|end_header_id|>")[-1].strip()
            else:
                continue

        if "<|eot_id|>" in new_text:
            new_text = new_text.replace("<|eot_id|>", "")
            if new_text.strip():
                reply += new_text
                yield reply
            break

        # Keep whitespace-only chunks as well: they can carry the indentation of generated code
        reply += new_text
        yield reply

    return reply

  except Exception as e:
    print(f"LLaMA error: {e}")
    raise
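
The token filtering can be tried without a GPU. The sketch below applies the same idea to a list of hand-made chunks: skip everything up to the assistant header, then stop at the end-of-turn marker. The chunk boundaries are invented, and `filter_assistant_text` is a hypothetical helper. It keeps whitespace-only chunks so that code indentation survives.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def filter_assistant_text(chunks):
    """Yield the accumulated assistant reply, without the control tokens."""
    started = False
    reply = ""
    for text in chunks:
        if not started:
            if ASSISTANT_HEADER not in text:
                continue  # still before the assistant turn
            started = True
            text = text.split(ASSISTANT_HEADER)[-1].lstrip("\n")
        if END_OF_TURN in text:
            reply += text.replace(END_OF_TURN, "")
            yield reply
            break  # anything after the end-of-turn marker is ignored
        reply += text
        yield reply


chunks = [
    ASSISTANT_HEADER + "\n\n",
    "def f():\n",
    "    ",
    "return 1" + END_OF_TURN,
    "ignored",
]
final = list(filter_assistant_text(chunks))[-1]
print(final)
```
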


In [None]:
def stream_code_qwen(user_prompt):
    tokenizer = AutoTokenizer.from_pretrained(code_qwen)
    messages=[
            {"role": "system", "content": system_message},
            {"role": "user","content": user_prompt},
        ]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    client = InferenceClient(CODE_QWEN_URL, token=hf_token)
    stream = client.text_generation(text, stream=True, details=True, max_new_tokens=3000)
    result = ""
    for r in stream:
        result += r.token.text
        yield result

## Select the model and generate the output

In [None]:
def generate_from_inputs(model, **input_data):
  # print("🔍 input_data received:", input_data)
  user_prompt_str = user_prompt(**input_data)

  if model == "GPT":
    result = stream_gpt(user_prompt_str)
  elif model == "Claude":
    result = stream_claude(user_prompt_str)
  elif model == "Llama":
    result = stream_llama(user_prompt_str)
  elif model == "Code Qwen":
    result = stream_code_qwen(user_prompt_str)
  else:
    raise ValueError("Unknown model")

  # Re-yield each partial response so the UI updates as the text streams in
  yield from result


In [None]:
def handle_generate(business_problem, dataset_type, dataset_format, num_samples, model):
  input_data = {
      "business_problem": business_problem,
      "dataset_type": dataset_type,
      "output_format": dataset_format,
      "num_samples": num_samples,
  }

  response = generate_from_inputs(model, **input_data)
  for chunk in response:
      yield chunk
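
The `if`/`elif` chain in `generate_from_inputs` could also be written as a lookup table, which keeps the model names in one place. This is a sketch under the assumption that every streaming function takes the prompt string and returns a generator. The two stub functions and the `STREAMERS` table are hypothetical stand-ins for the real `stream_*` functions.

```python
def stream_stub_a(prompt):
    """Stand-in for a real streaming function such as stream_gpt."""
    yield f"A: {prompt}"


def stream_stub_b(prompt):
    """Stand-in for a second streaming function."""
    yield f"B: {prompt}"


# Map the dropdown label to the function that serves it
STREAMERS = {
    "Model A": stream_stub_a,
    "Model B": stream_stub_b,
}


def generate(model, prompt):
    """Look up the streaming function and re-yield its partial responses."""
    try:
        streamer = STREAMERS[model]
    except KeyError:
        raise ValueError(f"Unknown model: {model}") from None
    yield from streamer(prompt)


print(list(generate("Model B", "hello")))
```

Because `generate` is itself a generator, the `ValueError` for an unknown model is raised when the caller starts iterating, not when `generate` is called.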


## Extract Python code from the LLM output and execute it locally

In [None]:
def extract_code(text):
  match = re.search(r"```python(.*?)```", text, re.DOTALL)

  if match:
      # group(1) is the code between the fences, without the fence markers
      code = match.group(1).strip()
  else:
      code = ""
      print("No matching substring found.")

  return code


def execute_code_in_virtualenv(text, python_interpreter=sys.executable):
  if not python_interpreter:
      raise EnvironmentError("Python interpreter not found in the specified virtual environment.")

  code_str = extract_code(text)
  command = [python_interpreter, '-c', code_str]

  try:
      result = subprocess.run(command, check=True, capture_output=True, text=True)
      stdout = result.stdout
      return stdout

  except subprocess.CalledProcessError as e:
      # Return stderr so the traceback of the generated script is visible in the UI
      return f"Execution error:\n{e.stderr}"

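The two steps can be exercised end to end on a hand-written response. The sketch below is a standalone variant, not the notebook's exact functions: `run_generated_code` is a hypothetical name, the sample response is invented, and the `timeout` argument is an added safeguard against scripts that never finish.

```python
import re
import subprocess
import sys

FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block


def extract_code(text):
    """Return the body of the first fenced python block, or "" if none."""
    match = re.search(FENCE + r"python(.*?)" + FENCE, text, re.DOTALL)
    return match.group(1).strip() if match else ""


def run_generated_code(text, timeout=30):
    """Run the extracted code in a separate interpreter and return its output."""
    code = extract_code(text)
    if not code:
        return "No Python code block found."
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            check=True, capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        return f"Execution error:\n{e.stderr}"
    except subprocess.TimeoutExpired:
        return f"Execution timed out after {timeout} seconds."


llm_response = f"Here is the script:\n{FENCE}python\nprint(1 + 1)\n{FENCE}\nDone."
print(run_generated_code(llm_response))
```

Running the code in a child process isolates crashes from the notebook kernel, but it is not a sandbox: the script still has the same file and network access as the user.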

## Gradio interface

In [None]:
def update_output_format(dataset_type):
    if dataset_type in ["Tabular", "Time-series"]:
        return gr.update(choices=["JSON", "CSV"], value="JSON")
    elif dataset_type == "Text":
        return gr.update(choices=["JSON"], value="JSON")

with gr.Blocks() as ui:
    gr.Markdown("## Create a dataset for a business problem")

    with gr.Column():
        business_problem = gr.Textbox(label="Business problem", lines=2)
        dataset_type = gr.Dropdown(
            ["Tabular", "Time-series", "Text"], label="Dataset type"
        )

        output_format = gr.Dropdown(choices=["JSON", "CSV"], value="JSON", label="Output Format")

        num_samples = gr.Number(label="Number of samples", value=10, precision=0)

        model = gr.Dropdown(["GPT", "Claude", "Llama", "Code Qwen"], label="Select model", value="GPT")

        dataset_type.change(update_output_format, inputs=[dataset_type], outputs=[output_format])

    with gr.Row():
        with gr.Column():
            dataset_run = gr.Button("Create a dataset")
            gr.Markdown("""⚠️ For Llama and Code Qwen: the generated code may contain mistakes. Review it before execution.""")

        with gr.Column():
            code_run = gr.Button("Execute code for a dataset")
            gr.Markdown("""⚠️ Be cautious about sharing this app publicly: it executes generated code, which is a security risk. Use this tool responsibly.""")

    with gr.Row():
        dataset_out = gr.Textbox(label="Generated code")
        code_out = gr.Textbox(label="Execution output")

    dataset_run.click(
        handle_generate,
        inputs=[business_problem, dataset_type, output_format, num_samples, model],
        outputs=[dataset_out]
    )

    code_run.click(
        execute_code_in_virtualenv,
        inputs=[dataset_out],
        outputs=[code_out]
    )

In [None]:
ui.launch(inbrowser=True)
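
The format rule in `update_output_format` is easier to test without launching the UI if the mapping is kept as plain data. A sketch of that idea follows. `ALLOWED_FORMATS` and `formats_for` are hypothetical names, and the Gradio callback would wrap the result in `gr.update(...)`.

```python
# Which output formats each dataset type supports (mirrors the UI rule)
ALLOWED_FORMATS = {
    "Tabular": ["JSON", "CSV"],
    "Time-series": ["JSON", "CSV"],
    "Text": ["JSON"],
}


def formats_for(dataset_type):
    """Return the allowed formats, defaulting to JSON for unknown types."""
    return ALLOWED_FORMATS.get(dataset_type, ["JSON"])


for name in ALLOWED_FORMATS:
    print(name, "->", formats_for(name))
```
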