# HR Synthetic Data Generator

Generate synthetic **resumes** and **job postings** for HR and recruiting (NER, resume parsing, job–resume matching, ATS testing). All data and UI are in English.

**Requirements:** Set `OPENAI_API_KEY` (for GPT-5 mini) and `OPENROUTER_API_KEY` (for Gemini 2.0 Flash) in a `.env` file, e.g. in the project root. Each key is only needed for its own model.  
**Run:** From the repo root use `uv run jupyter notebook` or open this notebook in your IDE with the UV Python kernel, then run all cells and launch the app.

## Step 1: Dependencies

Dependencies are provided by the project root `pyproject.toml`: `openai`, `gradio`, `pandas`, `python-dotenv`. Ensure you run the notebook with the UV environment (e.g. `uv run jupyter notebook` from repo root).
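
If you are unsure whether the notebook is running in the right environment, a quick check like the one below can help. It is only a sketch: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list assumes the usual import names for the packages above (`python-dotenv` is imported as `dotenv`).

```python
from importlib.util import find_spec

# Import names for the packages listed above (python-dotenv is imported as "dotenv").
REQUIRED_MODULES = ["openai", "gradio", "pandas", "dotenv"]


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing packages: {missing}. Start Jupyter with `uv run jupyter notebook` from the repo root.")
else:
    print("All dependencies are importable.")
```

The check only looks the modules up; it does not import them, so it stays fast even for heavy packages.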

In [None]:
import os
import json
import sys
import traceback
import pandas as pd
import gradio as gr
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")

if not OPENAI_API_KEY:
    print("OPENAI_API_KEY is not set. Add it to .env for GPT-5 mini.")
if not OPENROUTER_API_KEY:
    print("OPENROUTER_API_KEY is not set. Add it to .env for Gemini 2.0 Flash.")
if OPENAI_API_KEY and OPENROUTER_API_KEY:
    print("Both API keys are set.")

## Step 2: API clients and model config

Two models are available: **GPT-5 mini** (OpenAI) and **Gemini 2.0 Flash** (OpenRouter). `get_client_and_model(model_key)` returns `(client, model_id)` and raises `ValueError` if the model key is unknown or the required API key is missing.
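
The lookup contract can be illustrated without creating any API client. The sketch below is a simplified stand-in, not the notebook's function: `MODEL_TABLE` and `resolve_model` are hypothetical names, and the environment is passed in as a plain dict so the behavior is easy to try with fake keys.

```python
# Hypothetical stand-in for the notebook's model table; no client is created here.
MODEL_TABLE = {
    "GPT-5 mini": {
        "env_key": "OPENAI_API_KEY",
        "base_url": None,  # default OpenAI endpoint
        "model_id": "gpt-5-mini",
    },
    "Gemini 2.0 Flash": {
        "env_key": "OPENROUTER_API_KEY",
        "base_url": "https://openrouter.ai/api/v1",
        "model_id": "google/gemini-2.0-flash-001",
    },
}


def resolve_model(model_key, env):
    """Return (base_url, model_id), or raise ValueError for an unknown model or missing key."""
    if model_key not in MODEL_TABLE:
        raise ValueError(f"Unknown model: {model_key}. Choose from {list(MODEL_TABLE)}")
    info = MODEL_TABLE[model_key]
    if not env.get(info["env_key"]):
        raise ValueError(f"{info['env_key']} is not set. Add it to .env for {model_key}.")
    return info["base_url"], info["model_id"]


# Usage with a fake environment that only has the OpenRouter key.
fake_env = {"OPENROUTER_API_KEY": "placeholder-key"}
print(resolve_model("Gemini 2.0 Flash", fake_env))
try:
    resolve_model("GPT-5 mini", fake_env)
except ValueError as err:
    print(f"Handled: {err}")
```

Raising `ValueError` early lets the caller turn a missing key into a status message instead of a failed API request.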

In [None]:
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"

MODELS = {
    "GPT-5 mini": {
        "provider": "openai",
        "model_id": "gpt-5-mini",
    },
    "Gemini 2.0 Flash": {
        "provider": "openrouter",
        "model_id": "google/gemini-2.0-flash-001",
    },
}


def get_client_and_model(model_key):
    """
    Return (client, model_id) for the given model key.
    Uses OPENAI_API_KEY for OpenAI and OPENROUTER_API_KEY for OpenRouter.
    Raises ValueError if the required key is missing.
    """
    if model_key not in MODELS:
        raise ValueError(f"Unknown model: {model_key}. Choose from {list(MODELS.keys())}")
    info = MODELS[model_key]
    model_id = info["model_id"]
    if info["provider"] == "openai":
        if not OPENAI_API_KEY:
            raise ValueError("OPENAI_API_KEY is not set. Add it to .env for GPT-5 mini.")
        client = OpenAI(api_key=OPENAI_API_KEY)
        return client, model_id
    if info["provider"] == "openrouter":
        if not OPENROUTER_API_KEY:
            raise ValueError("OPENROUTER_API_KEY is not set. Add it to .env for Gemini 2.0 Flash.")
        client = OpenAI(base_url=OPENROUTER_BASE_URL, api_key=OPENROUTER_API_KEY)
        return client, model_id
    raise ValueError(f"Unknown provider: {info.get('provider')}")

## Step 3: Schemas and prompt building

Explicit schemas for **resumes** and **job postings**. Field values are in English, except that personal names follow the selected region when one is set. Dates use `YYYY-MM` or `YYYY`; seniority is Junior, Middle, or Senior, with Senior allowed only when `years_experience >= 5`. Prompts instruct the model to output JSONL only.
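
The prompt can only ask the model to follow these rules, so it may be worth checking generated records afterwards. The following is a minimal sketch of such a check and is not part of the notebook: `check_resume`, `DATE_PATTERN`, and `SENIORITY_LEVELS` are hypothetical names, and treating a missing `end_date` as a current job is an assumption.

```python
import re

DATE_PATTERN = re.compile(r"\d{4}(-(0[1-9]|1[0-2]))?")  # YYYY or YYYY-MM
SENIORITY_LEVELS = {"Junior", "Middle", "Senior"}


def check_resume(record):
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for job in record.get("experience") or []:
        if not isinstance(job, dict):
            continue
        for field in ("start_date", "end_date"):
            value = job.get(field)
            # Assumption: a missing end_date means the job is current, so it is tolerated.
            if value is None and field == "end_date":
                continue
            if value is None or not DATE_PATTERN.fullmatch(str(value)):
                problems.append(f"{field} {value!r} is not YYYY-MM or YYYY")
    seniority = record.get("seniority")
    if seniority not in SENIORITY_LEVELS:
        problems.append(f"seniority {seniority!r} is not one of {sorted(SENIORITY_LEVELS)}")
    years = record.get("years_experience")
    if seniority == "Senior" and not (isinstance(years, (int, float)) and years >= 5):
        problems.append("seniority is Senior but years_experience is below 5 or missing")
    return problems


example = {
    "experience": [{"company": "Acme", "position": "Analyst", "start_date": "03/2019", "end_date": "2021"}],
    "years_experience": 2,
    "seniority": "Senior",
}
for problem in check_resume(example):
    print(problem)
```

Returning a list of problems rather than raising makes it easy to count or filter bad rows in a batch.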

In [None]:
RESUME_SCHEMA = """
- first_name (string)
- last_name (string)
- email (string, realistic format)
- phone (string, realistic format)
- experience (array of objects: company, position, start_date, end_date; dates YYYY-MM or YYYY)
- skills (array of strings)
- education (array of objects: institution, degree, year)
- desired_salary_min (number)
- desired_salary_max (number)
- years_experience (number, total)
- seniority: "Junior" or "Middle" or "Senior"; use "Senior" only when years_experience >= 5
"""

JOB_SCHEMA = """
- title (string)
- company (string)
- requirements (string or array of strings)
- responsibilities (string or array of strings)
- salary_min (number)
- salary_max (number)
- region (string, e.g. Remote, NYC, London)
- industry (string, e.g. IT, Finance, Retail)
- employment_type (string: full-time, part-time, remote, contract, etc.)
"""


def build_system_prompt(record_type: str, region: str = "") -> str:
    """Build system prompt for resume or job generation. Enforces English and JSONL."""
    if record_type == "resume":
        schema = RESUME_SCHEMA
        entity = "resume"
    else:
        schema = JOB_SCHEMA
        entity = "job posting"
    region_note = ""
    if region and region.strip():
        region_note = (
            " When a region or country is specified, use typical first and last names for that region "
            "(e.g. Finnish names for Finland, Japanese for Japan) and salary amounts in local currency "
            "appropriate for that market (e.g. EUR for Finland/Europe, local levels). "
        )
    return (
        f"You are an expert at generating synthetic HR data. Generate realistic {entity} records. "
        "Output language for field values: English for companies, skills, cities; use local names when region is set. "
        "Output format: JSONL only — one JSON object per line. No explanations or extra text. "
        "Use consistent formats: dates as YYYY-MM or YYYY; set seniority to Senior only when years_experience >= 5. "
        f"{region_note}"
        f"Schema (include these fields): {schema}"
    ).strip()


def build_user_prompt(record_type: str, num_rows: int, industry: str = "", region: str = "") -> str:
    """Build user prompt with row count and optional industry/region (for both resumes and jobs)."""
    num_rows = max(10, min(200, int(num_rows)))
    base = f"Generate exactly {num_rows} {record_type} records. Output only JSONL: one JSON object per line, no other text."
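    extra = []
    # Ignore whitespace-only values so the prompt never contains an empty focus clause.
    if industry and industry.strip():
        extra.append(f"Focus industry: {industry.strip()}.")
    if region and region.strip():
        extra.append(f"Focus region: {region.strip()}. Use typical local first and last names and salary in local currency for this region.")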
    if extra:
        base += " " + " ".join(extra)
    return base

## Step 4: Generation and parsing

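`parse_jsonl(text)` strips markdown code fences, parses the text line by line, and skips lines that are not valid JSON objects; if no line parses, it falls back to reading the whole text as a single JSON array. `generate_hr_records(...)` builds the system and user messages from the schemas, calls the selected API, and returns `(status_message, list_of_dicts)`. `num_rows` is clamped to 10–200.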
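
To make the line-by-line behavior concrete, here is a small self-contained sketch of the same idea. `parse_jsonl_lines` and `clamp_rows` are hypothetical, simplified versions for illustration only (no fence stripping, no array fallback), and the sample text is invented.

```python
import json


def parse_jsonl_lines(text):
    """Keep every line that parses to a JSON object; skip everything else."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # prose, blank lines, stray markup
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or malformed object
        if isinstance(obj, dict):
            records.append(obj)
    return records


def clamp_rows(num_rows, low=10, high=200):
    """Force the requested row count into the supported range."""
    return max(low, min(high, int(num_rows)))


sample = "\n".join([
    "Here are the records:",
    '{"title": "Data Engineer", "company": "Acme"}',
    '{"title": "Recruiter", "company": ',  # cut off mid-object
    "",
    '{"title": "HR Analyst", "company": "Globex"}',
])
parsed = parse_jsonl_lines(sample)
print([row["title"] for row in parsed])
print(clamp_rows(3), clamp_rows(500))
```

Because bad lines are dropped rather than raised, a response that is cut off by the token limit still yields every complete row before the cut.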

In [None]:
def _extract_jsonl_text(text: str) -> str:
    """Strip markdown code fences (```json ... ``` or ``` ... ```) so we can parse JSONL."""
    if not text or not text.strip():
        return text
    s = text.strip()
    for start in ("```json", "```JSON", "```"):
        if s.startswith(start):
            s = s[len(start):].lstrip("\n")
            break
    if s.endswith("```"):
        s = s[:-3].rstrip()
    return s


def parse_jsonl(text: str) -> list[dict]:
    """Parse lines that look like JSON objects into a list of dicts. Skips non-JSON lines. Handles markdown fences and single JSON array."""
    if not text or not text.strip():
        return []
    text = _extract_jsonl_text(text)
    records = []
    for line in text.split("\n"):
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
            if isinstance(obj, dict):
                records.append(obj)
        except json.JSONDecodeError:
            continue
    if not records and text.strip().startswith("["):
        try:
            arr = json.loads(text)
            if isinstance(arr, list):
                records = [x for x in arr if isinstance(x, dict)]
        except json.JSONDecodeError:
            pass
    return records


def _clean_date(val) -> str:
    """Return readable string for date; None or 'None' -> 'Present'."""
    if val is None or val == "" or str(val).strip().lower() == "none":
        return "Present"
    return str(val).strip()


def _format_experience(exp) -> str:
    """Convert experience array of objects to a single readable string."""
    if isinstance(exp, str):
        return exp
    if not isinstance(exp, list):
        return str(exp) if exp is not None else ""
    parts = []
    for item in exp:
        if isinstance(item, dict):
            c = item.get("company") or ""
            p = item.get("position") or ""
            s = _clean_date(item.get("start_date"))
            e = _clean_date(item.get("end_date"))
            parts.append(f"{c}, {p} ({s}–{e})".strip())
        else:
            parts.append(str(item))
    return "; ".join(parts) if parts else ""


def _format_education(edu) -> str:
    """Convert education array of objects to a single readable string."""
    if isinstance(edu, str):
        return edu
    if not isinstance(edu, list):
        return str(edu) if edu is not None else ""
    parts = []
    for item in edu:
        if isinstance(item, dict):
            inst = item.get("institution") or ""
            deg = item.get("degree") or ""
            year = item.get("year")
            year_str = "" if year is None or str(year).strip().lower() == "none" else str(year).strip()
            if year_str:
                parts.append(f"{inst}, {deg} ({year_str})".strip())
            else:
                parts.append(f"{inst}, {deg}".strip())
        else:
            parts.append(str(item))
    return "; ".join(parts) if parts else ""


def _format_string_list(val) -> str:
    """Convert array of strings to a single string (for requirements/responsibilities in jobs)."""
    if isinstance(val, str):
        return val
    if isinstance(val, list):
        return ", ".join(str(x).strip() for x in val if x is not None and str(x).strip().lower() != "none") if val else ""
    return str(val) if val is not None else ""


def flatten_record_for_display(record: dict, record_type: str) -> dict:
    """Convert experience, education, skills, and list fields to readable strings for display and CSV."""
    out = dict(record)
    if "experience" in out:
        out["experience"] = _format_experience(out["experience"])
    if "education" in out:
        out["education"] = _format_education(out["education"])
    if "skills" in out:
        out["skills"] = _format_string_list(out["skills"])
    if record_type == "job":
        if "requirements" in out:
            out["requirements"] = _format_string_list(out["requirements"])
        if "responsibilities" in out:
            out["responsibilities"] = _format_string_list(out["responsibilities"])
    # Replace any remaining None values with empty string for CSV
    for k, v in list(out.items()):
        if v is None or (isinstance(v, str) and v.strip().lower() == "none"):
            out[k] = ""
    return out


def generate_hr_records(
    record_type: str,
    num_rows: int,
    temperature: float,
    model_key: str,
    industry: str = "",
    region: str = "",
) -> tuple[str, list[dict]]:
    """
    Generate synthetic resume or job records via the selected model API.
    Returns (status_message, list_of_dicts). Handles API and parse errors.
    """
    num_rows = max(10, min(200, int(num_rows)))
    try:
        client, model_id = get_client_and_model(model_key)
    except ValueError as e:
        return f"Error: {e}", []

    system = build_system_prompt(record_type, region=region)
    user = build_user_prompt(record_type, num_rows, industry=industry, region=region)
    messages = [{"role": "system", "content": system}, {"role": "user", "content": user}]

    # GPT-5 mini only supports temperature=1; use 1.0 for that model
    req_temperature = 1.0 if model_id == "gpt-5-mini" else float(temperature)
    try:
        response = client.chat.completions.create(
            model=model_id,
            messages=messages,
            temperature=req_temperature,
            max_completion_tokens=8192,
        )
        content = (response.choices[0].message.content or "").strip()
    except Exception as e:
        return f"API error: {e}", []

    records = parse_jsonl(content)
    if not records:
        snippet = (content[:800] + "...") if len(content) > 800 else content
        print(f"[parse] No JSONL parsed. Model output snippet:\n{snippet}", file=sys.stderr, flush=True)
        return "No valid JSONL rows parsed. Try again or check the model output. See console for raw snippet.", []
    records = [flatten_record_for_display(r, record_type) for r in records]
    if len(records) < num_rows:
        return f"Generated {len(records)} of {num_rows} requested rows.", records
    return f"Generated {len(records)} rows successfully.", records

## Step 5: CSV export

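Save generated records to CSV. Files go to the system temp directory (`tempfile.gettempdir()`) by default; pass `dir_path` to `save_to_csv` to write elsewhere. The filename is `hr_resumes.csv` or `hr_jobs.csv` depending on record type, so each run overwrites the previous file of that type.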
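
The export step can be sketched without pandas. The version below is an illustration only: it swaps in the standard library `csv` module so it runs with no third-party packages, and `write_records_csv` and `CSV_FILENAMES` are hypothetical names.

```python
import csv
import os
import tempfile

CSV_FILENAMES = {"resume": "hr_resumes.csv", "job": "hr_jobs.csv"}


def write_records_csv(records, record_type, dir_path=None):
    """Write records to a CSV file and return its path, or None when there is nothing to write."""
    if not records:
        return None
    base = dir_path or tempfile.gettempdir()
    path = os.path.join(base, CSV_FILENAMES.get(record_type, "hr_data.csv"))
    # Union of keys in first-seen order, so rows with missing fields still fit the header.
    fieldnames = list(dict.fromkeys(key for row in records for key in row))
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    return path


# Usage: write two job rows into a scratch directory and print the file.
with tempfile.TemporaryDirectory() as scratch:
    demo_rows = [
        {"title": "Data Engineer", "company": "Acme", "salary_min": 70000},
        {"title": "Recruiter", "company": "Globex"},  # salary_min missing on purpose
    ]
    out_path = write_records_csv(demo_rows, "job", dir_path=scratch)
    with open(out_path, newline="", encoding="utf-8") as handle:
        print(handle.read())
```

Models do not always return every field, so building the header from the union of keys avoids losing columns that only appear in later rows.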

In [None]:
import tempfile

OUTPUT_DIR = tempfile.gettempdir()
CSV_NAMES = {"resume": "hr_resumes.csv", "job": "hr_jobs.csv"}


def save_to_csv(records: list[dict], record_type: str, dir_path: str | None = None) -> str | None:
    """
    Save list of dicts to CSV. Returns path for Gradio File download.
    record_type is 'resume' or 'job'; filename is hr_resumes.csv or hr_jobs.csv.
    """
    if not records:
        return None
    base = dir_path or OUTPUT_DIR
    filename = CSV_NAMES.get(record_type, "hr_data.csv")
    path = os.path.join(base, filename)
    df = pd.DataFrame(records)
    df.to_csv(path, index=False)
    return path

## Step 6: Gradio UI

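A radio button selects **Resumes** or **Job postings**. A dropdown selects the model (GPT-5 mini or Gemini 2.0 Flash), and sliders set the temperature and the number of rows; the temperature slider is locked at 1.0 for GPT-5 mini. The optional Industry and Region fields apply to both record types. Below them are the Generate button, Status, Data preview, and Download CSV. All labels are in English.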
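
The two small rules behind the UI (which temperature is actually used, and how the radio label maps to a record type) can be tried without Gradio. This is a sketch under assumptions: `effective_temperature`, `record_type_from_label`, and `FIXED_TEMPERATURE_MODELS` are hypothetical names, and clamping to the slider range of 0.0 to 1.5 is added here for illustration.

```python
# Models treated as accepting only one temperature value.
FIXED_TEMPERATURE_MODELS = {"GPT-5 mini": 1.0}
SLIDER_MIN, SLIDER_MAX = 0.0, 1.5


def effective_temperature(model_key, slider_value):
    """Return the temperature that would be sent for this model."""
    if model_key in FIXED_TEMPERATURE_MODELS:
        return FIXED_TEMPERATURE_MODELS[model_key]
    # Assumption: values outside the slider range are clamped rather than rejected.
    return max(SLIDER_MIN, min(SLIDER_MAX, float(slider_value)))


def record_type_from_label(label):
    """Map the radio label to the internal record type."""
    return "resume" if label == "Resumes" else "job"


print(effective_temperature("GPT-5 mini", 0.3))
print(effective_temperature("Gemini 2.0 Flash", 0.3))
print(record_type_from_label("Job postings"))
```

Keeping these rules in plain functions makes them testable without starting the app.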

In [None]:
def run_generate(record_type, model_name, temperature, num_rows, industry, region):
    """Single handler: generate records, build dataframe, save CSV; return (status, df, path)."""
    try:
        if not record_type:
            return "Select record type: Resumes or Job postings.", None, None
        record_type = "resume" if record_type == "Resumes" else "job"
        print(f"[UI] Generating {record_type}s: model={model_name}, rows={num_rows}", flush=True)
        status, records = generate_hr_records(
            record_type=record_type,
            num_rows=num_rows,
            temperature=temperature,
            model_key=model_name,
            industry=industry or "",
            region=region or "",
        )
        if not records:
            print(f"[UI] No records: {status}", flush=True)
            return status, None, None
        df = pd.DataFrame(records)
        path = save_to_csv(records, record_type)
        print(f"[UI] OK: {len(records)} rows, CSV -> {path}", flush=True)
        return status, df, path
    except Exception as e:
        msg = f"Error: {e}"
        print(f"[UI] ERROR: {msg}", file=sys.stderr, flush=True)
        traceback.print_exc(file=sys.stderr)
        return msg, None, None


model_choices = list(MODELS.keys())
TEMPERATURE_FIXED_MODEL = "GPT-5 mini"  # only temperature=1 supported


def update_temperature_ui(model_key):
    """Disable temperature slider and set to 1.0 for GPT-5 mini; enable for others."""
    if model_key == TEMPERATURE_FIXED_MODEL:
        return gr.update(interactive=False, value=1.0)
    return gr.update(interactive=True, value=0.7)


with gr.Blocks(title="HR Synthetic Data Generator") as demo:
    gr.Markdown("## Generate synthetic resumes or job postings")
    record_type_radio = gr.Radio(
        choices=["Resumes", "Job postings"],
        value="Resumes",
        label="Record type",
    )
    with gr.Row():
        model_dropdown = gr.Dropdown(
            choices=model_choices,
            value=model_choices[0],
            label="Model",
        )
        _temp_fixed = model_choices[0] == TEMPERATURE_FIXED_MODEL
        temperature_slider = gr.Slider(
            0.0, 1.5,
            value=1.0 if _temp_fixed else 0.7,
            step=0.1,
            label="Temperature",
            interactive=not _temp_fixed,
        )
        num_rows_slider = gr.Slider(10, 200, value=30, step=1, label="Number of rows")
    with gr.Row():
        industry_text = gr.Textbox(label="Industry (optional)", placeholder="e.g. IT, Finance, Health Care")
        region_text = gr.Textbox(label="Region (names & salaries)", placeholder="e.g. Finland, Japan, Remote, NYC")
    gr.Markdown("*Low temperature = more consistent; high = more varied. GPT-5 mini uses fixed temperature (1). Region sets local names and salary currency (e.g. Finland → Finnish names, EUR).*")
    gen_btn = gr.Button("Generate")
    status_out = gr.Textbox(label="Status", interactive=False)
    # Wider columns for phone, experience, skills, education; wrap so text isn't cut off
    PREVIEW_COLUMN_WIDTHS = ["90px", "90px", "150px", "130px", "320px", "240px", "280px", "85px", "85px", "85px", "80px", "90px"]
    preview_df = gr.Dataframe(
        label="Data preview",
        interactive=False,
        wrap=True,
        column_widths=PREVIEW_COLUMN_WIDTHS,
    )
    download_file = gr.File(label="Download CSV", interactive=False)

    model_dropdown.change(
        fn=update_temperature_ui,
        inputs=[model_dropdown],
        outputs=[temperature_slider],
    )
    gen_btn.click(
        fn=run_generate,
        inputs=[record_type_radio, model_dropdown, temperature_slider, num_rows_slider, industry_text, region_text],
        outputs=[status_out, preview_df, download_file],
    )

## Step 7: Launch

Run the cell below to start the Gradio app, then open the local URL in your browser (`inbrowser=True` also tries to open it automatically). In an IDE, use the UV Python kernel; otherwise start Jupyter from the repo root with `uv run jupyter notebook`.

In [None]:
demo.launch(inbrowser=True, debug=True)

## Additional
I also implemented a general-purpose generator using open-source models from Hugging Face: Qwen2.5-7B-Instruct and Mistral-7B-Instruct-v0.3. The Colab notebook is here: https://colab.research.google.com/drive/1cVSoNRkb714qJOMJmQdWyhEPlOH04UR7
