# SynthGen: an LLM-Powered Dataset Synthesizer
> ⚡ Powered by Llama 3.1 from Hugging Face and an intuitive Gradio UI.

This notebook demonstrates how to generate realistic, structured tabular data from natural-language prompts using the Meta-Llama-3.1-8B-Instruct model. By specifying column names, value types, and logical constraints in plain English, users can produce tailored synthetic datasets interactively through a Gradio interface.




### 1. Installing Required Packages

In [None]:
!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0

### 2. Importing Libraries and HF Setup

In [None]:
import os
from IPython.display import Markdown, display, update_display

import gradio as gr
import pandas as pd

from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
import gc
import json
import re



In [None]:
hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
# We are using the Meta-Llama-3.1-8B-Instruct model. Feel free to change this to any other model you prefer, 
# by adjusting the model variant accordingly, and ensuring the template is compatible.
model_variant = "meta-llama/Meta-Llama-3.1-8B-Instruct"

### 3. Loading the LLaMA 3.1 Model (Quantized)
Depending on the model, this may take a few minutes.

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    model_variant,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
    token=True  # replaces the deprecated use_auth_token argument
)


tokenizer = AutoTokenizer.from_pretrained(model_variant)
tokenizer.pad_token = tokenizer.eos_token
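
To give a feel for why 4-bit loading matters, the rough arithmetic below compares weight memory at 16-bit and 4-bit precision. It is only a back-of-the-envelope sketch: the parameter count of 8 billion is a rounded assumption, `estimate_weight_memory_gb` is a hypothetical helper, and the estimate ignores quantization constants, layers that stay unquantized, activations, and the KV cache.

```python
def estimate_weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: rounded parameter count for an "8B" model

bf16_gb = estimate_weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
nf4_gb = estimate_weight_memory_gb(N_PARAMS, 4)    # 4-bit weights

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f" 4-bit weights: ~{nf4_gb:.0f} GB")
```

Real usage will be higher than the 4-bit figure, so treat it as a lower bound when picking a GPU.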

### 4. Prompt Construction and Inference

In this section, we've crafted a flexible prompt template designed to guide the LLM toward generating clean, structured output that Gradio can easily process. The goal is to balance natural language flexibility with enough constraints to encourage valid, consistent, and readable tabular data.

That said, LLMs can still be unpredictable or inconsistent, especially with loosely defined instructions. Feel free to adjust the prompt logic or system message to better suit your use case or to improve output reliability.


In [None]:

def build_system_query(user_instructions: str = '', include_system_prompt: bool = True):
    messages = [
        {
            "role": "system",
            "content": (
                "You are a dataset generation assistant. The user will provide prompts containing column names, value types, number of rows, and other parameters and constraints.\n"
                "Your task is to generate synthetic datasets based on the user's request.\n\n"
                "Always respond with a valid JSON object (dictionary), where:\n"
                "- Each key is a column name (formatted for pandas).\n"
                "- Each value is a list of values for that column.\n\n"
                "Example format:\n"
                "{\n"
                "  \"first_name\": [\"Alex\", \"Anthony\", \"Ava\", \"Amber\", \"Annabelle\"],\n"
                "  \"age\": [23, 27, 35, 29, 31]\n"
                "}\n\n"
                "If a value type or constraint is missing or unclear, make a reasonable assumption based on the column name and context.\n"
                "Do not include any explanations, comments, or extra text — only the raw JSON.\n"
                "Ensure the response is compact, well-formatted, and syntactically valid for parsing by JSON tools or conversion into a pandas DataFrame.\n"
                "Convert column names to pandas-compatible formats when needed (e.g., replace spaces with underscores, remove special characters, lowercase).\n"
            )

        },
        {"role": "user", "content": user_instructions}
    ]

    return messages if include_system_prompt else messages[1:]
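
The system prompt asks the model to return pandas-friendly column names, but the model may not always comply. A defensive post-processing step could look like the sketch below. `normalize_column_name` is a hypothetical helper, not part of the notebook, and its exact rules are an assumption that mirrors the examples given in the prompt (lowercase, underscores for spaces, special characters removed).

```python
import re


def normalize_column_name(name: str) -> str:
    """Lowercase a column name and reduce it to letters, digits and underscores."""
    name = name.strip().lower()
    name = re.sub(r"\s+", "_", name)        # whitespace runs become one underscore
    name = re.sub(r"[^a-z0-9_]", "", name)  # drop anything else that is not a-z, 0-9 or _
    name = re.sub(r"_+", "_", name)         # collapse repeated underscores
    return name.strip("_")


for raw in ["firstNAme", "last nam", "Age (years)!"]:
    print(repr(raw), "->", repr(normalize_column_name(raw)))
```

Applied to a parsed response, this would be something like `{normalize_column_name(k): v for k, v in parsed.items()}`, bearing in mind that two different raw names could collapse to the same key.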

### 4.5 Let's Test It!  
*(The following query intentionally includes some typos)*

In [None]:
user_query = "Generate 3 columns, firstNAme, last nam, aGe, 10 rows. Male names should correspond to an age range between 30 and 50."
user_query += "Female names should be between 34 and 40. Make sure they match row-wise. \n"



In [None]:
messages = build_system_query(user_instructions=user_query, include_system_prompt=True)
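# add_generation_prompt=True appends the assistant header so the model starts its reply directly
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)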

outputs = model.generate(
    inputs,
    max_new_tokens=800,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id  # Optional, to reinforce stopping
)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)


In [None]:
# Let's first print the full output to see what we got
print(decoded)

In [None]:
# The following function attempts to extract a valid JSON response from the model's output.
# It searches for the last valid JSON object in the text, which is useful if the model
# generates additional text or explanations that are not part of the JSON response.
def extract_response(text: str):
    decoder = json.JSONDecoder()
    starts = [m.start() for m in re.finditer(r'{', text)]
    for pos in reversed(starts):  # Start from last candidate
        try:
            parsed, _ = decoder.raw_decode(text[pos:])
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            continue
    return None
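
To see how the extraction behaves without loading the model, the sketch below runs the same logic on a hand-written string that imitates a decoded output: the prompt's example JSON first, then the answer, then some trailing chatter. The function is repeated so the block runs on its own, and `sample_output` is made-up illustration data, not real model output.

```python
import json
import re


def extract_response(text: str):
    """Return the last parseable JSON object in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    starts = [m.start() for m in re.finditer(r"\{", text)]
    for pos in reversed(starts):  # try the last "{" first
        try:
            parsed, _ = decoder.raw_decode(text[pos:])
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            continue
    return None


sample_output = (
    'Example format: {"first_name": ["Alex"], "age": [23]}\n'
    "assistant\n\n"
    '{"first_name": ["Ava", "Liam"], "age": [36, 42]}\n'
    "Let me know if you need more rows!"
)

result = extract_response(sample_output)                        # the later object wins
no_json = extract_response("Sorry, I cannot help with that.")  # no braces at all
nested = extract_response('{"outer": {"inner": 1}}')           # innermost object is returned

print(result)
print(no_json)
print(nested)
```

Because the search starts from the last `{`, a nested object yields the innermost dictionary rather than the outer one. The flat column-to-list format requested in the system prompt avoids that case.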


In [None]:
# Convert the extracted JSON response into a pandas DataFrame
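# display() is required here: a bare expression inside an `if` block is not rendered
parsed = extract_response(decoded)
if parsed is not None:
    display(pd.DataFrame(parsed))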
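
`pd.DataFrame` raises a `ValueError` when the column lists have different lengths, which can happen if the model miscounts rows. A small check before building the DataFrame gives a clearer message. The sketch below uses only the standard library; `validate_columns` is a hypothetical helper and the rules it enforces are an assumption based on the format described in the system prompt.

```python
def validate_columns(parsed) -> list:
    """Return a list of problems; an empty list means the payload looks usable."""
    if not isinstance(parsed, dict) or not parsed:
        return ["expected a non-empty JSON object"]

    problems = []
    lengths = {}
    for column, values in parsed.items():
        if not isinstance(values, list):
            problems.append(f"column {column!r} is not a list")
        else:
            lengths[column] = len(values)

    # All list-valued columns must have the same number of rows
    if len(set(lengths.values())) > 1:
        problems.append(f"columns have different lengths: {lengths}")
    return problems


print(validate_columns({"first_name": ["Ava", "Liam"], "age": [36, 42]}))
print(validate_columns({"first_name": ["Ava", "Liam"], "age": [36]}))
```

An empty result means the dictionary can be passed to `pd.DataFrame`; otherwise the messages can be shown to the user or used to trigger a retry.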

### 5. Interactive UI with Gradio

In [None]:
# The following functions are used to handle the Gradio interface and dataset management.


def remove_items(selected_features, existing_dataframe: pd.DataFrame):
    """Remove selected features from the existing DataFrame."""
    print("Before removal:", existing_dataframe)

    if isinstance(selected_features, str):
        selected_features = [selected_features]

    # drop() returns a new DataFrame, so the stored state is never mutated in place
    edited_df = existing_dataframe.drop(columns=selected_features or [])

    updated_features_list = list(edited_df.columns)

    print("After removal:", edited_df)
    print('shape', edited_df.shape[0])
    return (
        gr.update(choices=updated_features_list, value=[]),
        edited_df,
        gr.update(value=edited_df, visible=not edited_df.empty),
        gr.update(interactive=edited_df.empty)
    )


def generate_features(n_rows: str, instructions: str, existing_dataframe: pd.DataFrame):
    """Generate new features based on user instructions and existing data."""
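    # isdecimal() (unlike isnumeric()) rejects characters such as "²" that int() cannot parse
    has_valid_rows = bool(n_rows) and n_rows.strip().isdecimal() and int(n_rows) >= 1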
    if has_valid_rows and instructions:
        try:
            # Prepare prompt for the model
            user_query = instructions.strip() + f"\nGenerate {n_rows} rows."
            messages = build_system_query(user_instructions=user_query, include_system_prompt=True)

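            # Tokenize; add_generation_prompt=True opens the assistant turn
            inputs = tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt"
            ).to(model.device)
            outputs = model.generate(
                inputs,
                max_new_tokens=800,
                do_sample=True,
                temperature=0.7,
                top_p=0.95,
                pad_token_id=tokenizer.eos_token_id,
                eos_token_id=tokenizer.eos_token_id,
            )

            # Decode only the newly generated tokens, so the prompt's example JSON
            # and any literal "assistant" inside the data cannot confuse parsing
            decoded = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
            parsed = extract_response(decoded)
            if parsed is None:
                raise ValueError("No valid JSON object found in the model output.")
            new_df = pd.DataFrame(parsed)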

            # Combine with existing data
            if len(existing_dataframe) == 0:
                updated_df = new_df
            elif len(existing_dataframe) == len(new_df):
                updated_df = pd.concat([existing_dataframe.reset_index(drop=True),
                                        new_df.reset_index(drop=True)], axis=1)
            else:
                raise ValueError("Row count mismatch between existing and new feature data.")

            # UI component updates
            rows_component_update = gr.update(interactive=False)
            table_view_update = gr.update(visible=True, value=updated_df, headers=list(updated_df.columns))
            feature_display_update = gr.update(choices=list(updated_df.columns))

            return rows_component_update, table_view_update, feature_display_update, updated_df, ''

        except Exception as e:
            print("Error generating feature:", e)

    # Return empty updates if input is invalid
    return (gr.update(),) * 5

def on_selected_feture(selected_items):
    """Update the UI based on selected features."""
    # Enable the Remove button only when at least one feature is selected
    return gr.update(interactive=bool(selected_items))

def export_dataset(dataframe: pd.DataFrame):
    """Export the DataFrame to a CSV file."""
    try:
        n = 0
        while True:
            if n == 0:
                filename = "dataset.csv"
            else:
                filename = f"dataset{n}.csv"
            if not os.path.exists(filename):
                break
            n += 1

        dataframe.to_csv(filename, index=False)
    except Exception as e:
        print("Error exporting dataset:", e)



def on_table_change(changed_data):
    """Handle changes in the table and return a DataFrame."""
    print('on table change')
    df = pd.DataFrame(changed_data)
    state_df = df if not df.empty else pd.DataFrame()
    row_count = df.shape[0] if df.shape[0] > 0 else None
    return state_df, row_count
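
A note on validating the "Number of Rows" text field: `str.isnumeric()` returns `True` for characters such as `"²"`, which `int()` then refuses to parse, and a cleared textbox may arrive as an empty string or `None`. A stricter parse can be sketched as follows. `parse_row_count` is a hypothetical helper, and treating anything other than a positive whole number as invalid is an assumption.

```python
def parse_row_count(raw):
    """Return a positive int parsed from `raw`, or None if it is not a valid row count."""
    if raw is None:
        return None
    text = str(raw).strip()
    # isdecimal() rejects "", "-3", "1.5" and characters such as "²"
    if not text.isdecimal():
        return None
    value = int(text)
    return value if value >= 1 else None


for candidate in ["10", " 5 ", "0", "²", "1.5", "", None]:
    print(repr(candidate), "->", parse_row_count(candidate))
```

Returning `None` for every invalid case lets the caller use a single `if rows is None` branch instead of catching exceptions.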


In [None]:

# UI layout
with gr.Blocks() as demo:
    feature_state  = gr.State(pd.DataFrame())

    with gr.Row():
        # Left Column: Inputs and Controls
        with gr.Column(scale=0):
            with gr.Group():
                gr.Markdown("### 🧾 Current Feature List")
                feature_display = gr.CheckboxGroup(
                    #choices=feature_dictionary,
                    label="Features",
                    info="Select features to remove"
                )
                with gr.Row(elem_classes="centered-button-row"):
                    remove_btn = gr.Button(value="Remove Selected", elem_classes="small-button", interactive=False)

            # Add feature section
            with gr.Group():
                gr.Markdown("### ➕ Add a New Feature")
                with gr.Row(equal_height=True):
                    n_rows = gr.Text(
                        label="Number of Rows",
                        placeholder="e.g., 100"
                    )

                    instructions = gr.Text(
                        label="Instructions",
                        placeholder="e.g., first names; starting with A or B. \nAge, numeric; range 21–55, males should be between 30-40. Correlate rows",
                        scale=1,
                        lines=4
                    )

                with gr.Row(elem_classes="centered-button-row"):
                    add_btn = gr.Button(value="Add", elem_classes="small-button")

            # Dataset generation section
            with gr.Group():
                with gr.Row(elem_classes="centered-button-row"):
                    export_btn = gr.Button(value="💾 Export...", elem_classes="small-button")

        # Right Column: Dataset display
        with gr.Column():
            gr.Markdown("### 📊 Generated Dataset")
            table_view = gr.Dataframe(
                interactive=True,
                visible=False,
                label="Dataset"
            )

    feature_display.change(
        fn=on_selected_feture,
        inputs=[feature_display],
        outputs=[remove_btn]
    )

    export_btn.click(
        fn=export_dataset,
        inputs=[feature_state]
    )

    remove_btn.click(
        fn=remove_items,
        inputs=[feature_display, feature_state],
        outputs=[feature_display, feature_state, table_view, n_rows]
    )

    add_btn.click(
        fn=generate_features,
        inputs=[n_rows, instructions, feature_state],
        outputs=[n_rows, table_view, feature_display, feature_state, instructions]
    )

    table_view.change(
        fn=on_table_change,
        inputs=[table_view],
        outputs=[feature_state, n_rows]
    )




### 6. Launching the App

In [None]:
demo.launch(debug=True) # Set debug=True to see detailed error messages in the console
# demo.launch(share=True) # Uncomment this line to share the app publicly