A modular Python library for synthesizing tool-calling datasets from JSON tool definitions using an LLM-as-a-judge pipeline. Designed for OpenAI-compatible APIs.
⚠️ Development Status: This project is under active development. The API is not yet stable, and breaking changes may occur between versions.
ToolsGen automates the creation of tool-calling datasets for training and evaluating language models. It generates realistic user requests, produces corresponding tool calls, and evaluates their quality using a multi-dimensional rubric system.
- Multi-role LLM Pipeline: Separate models for problem generation, tool calling, and quality evaluation
- Flexible Sampling Strategies: Random, parameter-aware, and semantic clustering approaches
- LLM-as-a-Judge Scoring: Rubric-based evaluation with structured outputs
- OpenAI-Compatible: Works with OpenAI API and compatible providers (Azure OpenAI, local models via vLLM, etc.)
- Hugging Face Ready: JSONL output format compatible with Hugging Face datasets
- Configurable Quality Control: Adjustable scoring thresholds and retry mechanisms
- Train/Val Splitting: Built-in dataset splitting for model training workflows
- Parallel Generation: Multiprocessing pipeline to accelerate dataset creation on multi-core hosts
- Python 3.9+
- OpenAI API key (or compatible API endpoint)
git clone https://github.com/atasoglu/toolsgen.git
cd toolsgen
pip install .

# Check version
toolsgen version
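
The commands below read tool definitions from tools.json. The exact schema accepted by ToolsGen is defined by its ToolSpec model, but for OpenAI-compatible pipelines a reasonable starting point is a list of function definitions in the OpenAI function-calling format. The get_weather tool below is a hypothetical example, written out with a short Python sketch:

# Sketch: write a minimal tools.json containing one hypothetical tool
# in the OpenAI function-calling schema (names and fields are illustrative).
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. 'San Francisco, CA'",
                    }
                },
                "required": ["location"],
            },
        },
    }
]

with open("tools.json", "w") as f:
    json.dump(tools, f, indent=2)
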
# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"
# Generate dataset with default settings
toolsgen generate \
--tools tools.json \
--out output_dir \
--num 100
# Advanced: Use different models and temperatures for each role
toolsgen generate \
--tools tools.json \
--out output_dir \
--num 1000 \
--strategy param_aware \
--seed 42 \
--train-split 0.9 \
--workers 4 \
--worker-batch-size 8 \
--problem-model gpt-4o-mini --problem-temp 0.9 \
--caller-model gpt-4o --caller-temp 0.3 \
--judge-model gpt-4o --judge-temp 0.0
# Parallel generation with 6 workers, each processing 4 samples per task
toolsgen generate \
--tools tools.json \
--out output_dir \
--num 500 \
--workers 6 \
--worker-batch-size 4

import os
from pathlib import Path
from toolsgen.core import GenerationConfig, ModelConfig, generate_dataset
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
# Configuration
tools_path = Path("tools.json")
output_dir = Path("output")
gen_config = GenerationConfig(
num_samples=100,
strategy="random",
seed=42,
train_split=0.9, # 90% train, 10% validation
batch_size=10, # optional: iterate tools in batches
shuffle_tools=True, # optional: reshuffle tools between batches
num_workers=4, # enable multiprocessing
worker_batch_size=2, # samples per worker task
)
model_config = ModelConfig(
model="gpt-4o-mini",
temperature=0.7,
)
# Generate dataset from file
manifest = generate_dataset(output_dir, gen_config, model_config, tools_path=tools_path)
# Or use tools list directly (alternative to tools_path)
# from toolsgen.schema import ToolSpec
# tools = [ToolSpec(...), ToolSpec(...)]
# manifest = generate_dataset(output_dir, gen_config, model_config, tools=tools)
print(f"Generated {manifest['num_generated']}/{manifest['num_requested']} records")
print(f"Failed: {manifest['num_failed']} attempts")See examples/ directory for complete working examples.
Note: The examples in examples/ use python-dotenv for convenience (loading API keys from a .env file). Install it with pip install python-dotenv if you want to use this approach.
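
A minimal sketch of that approach. OPENAI_API_KEY and OPENAI_BASE_URL are the standard variables read by the OpenAI Python client (v1+); the base URL is only needed when targeting a compatible endpoint such as a local vLLM server, and whether ToolsGen exposes its own setting for this is not covered here:

# Sketch: load environment variables from a .env file before generating.
# Assumes a .env file containing, for example:
#   OPENAI_API_KEY=your-api-key-here
#   OPENAI_BASE_URL=http://localhost:8000/v1   (optional, for compatible endpoints)
from dotenv import load_dotenv

load_dotenv()  # populate os.environ from .env

from toolsgen.core import GenerationConfig, ModelConfig, generate_dataset
# ...continue exactly as in the Python API example above.
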
Each line in train.jsonl (or val.jsonl) is a JSON record:
{
"id": "record_000001",
"language": "english",
"tools": [...],
"messages": [
{"role": "user", "content": "What's the weather in San Francisco?"}
],
"assistant_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
],
"problem_metadata": {"generated": true, "user_request": "..."},
"judge": {
"tool_relevance": 0.4,
"argument_quality": 0.38,
"clarity": 0.2,
"score": 0.98,
"verdict": "accept",
"rationale": "Excellent tool selection and argument quality",
"rubric_version": "0.1.0",
"model": "gpt-4o",
"temperature": 0.0
},
"quality_tags": [],
"tools_metadata": {"num_tools": 5}
}
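
Because the splits are plain JSONL, they load directly with the Hugging Face datasets library. A sketch, assuming the output directory used in the examples above; the filter on the judge verdict is optional:

# Sketch: load the generated splits with Hugging Face `datasets`
# and keep only records the judge accepted.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "output_dir/train.jsonl",
        "validation": "output_dir/val.jsonl",
    },
)
accepted = ds["train"].filter(lambda rec: rec["judge"]["verdict"] == "accept")
print(len(accepted), "accepted training records")
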
manifest.json contains generation metadata:

{
"version": "0.1.0",
"num_requested": 1000,
"num_generated": 987,
"num_failed": 13,
"strategy": "param_aware",
"seed": 42,
"train_split": 0.9,
"tools_count": 15,
"models": {
"problem_generator": "gpt-4o-mini",
"tool_caller": "gpt-4o",
"judge": "gpt-4o"
},
"splits": {
"train": 888,
"val": 99
}
}

# Run all tests with coverage
pytest --cov=src
# Run specific test file
pytest tests/test_generator.py
# Run with verbose output
pytest -v

# Install development dependencies
pip install -r requirements-dev.txt
# Run tests with coverage
pytest --cov=src
# Run code quality checks
ruff check src tests --fix
ruff format src tests

For detailed information about the system architecture, pipeline, and core components, see ARCHITECTURE.md.
- Multi-turn conversation support
- Custom prompt template system
- Additional sampling strategies (coverage-based, difficulty-based)
- Integration with Hugging Face Hub for direct dataset uploads
- Support for more LLM providers (Anthropic, Cohere, etc.)
- Web UI for dataset inspection and curation
- Advanced filtering and deduplication (see the post-processing sketch below)
- Single-turn conversations only
- English-focused prompts (multilingual support is experimental)
- No built-in tool execution or validation
- Limited to OpenAI-compatible APIs
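
As noted in the roadmap, advanced filtering and deduplication are not built in yet. In the meantime, a simple post-processing pass over the generated JSONL can drop exact repeats, for example by hashing each record's user messages together with the called function names and arguments. A sketch using only the record fields shown above (file paths are assumptions):

# Sketch: drop records whose user request and tool calls repeat exactly.
import hashlib
import json

seen, unique = set(), []
with open("output_dir/train.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        key_src = json.dumps(
            {
                "user": [m["content"] for m in rec["messages"] if m["role"] == "user"],
                "calls": [
                    [c["function"]["name"], c["function"]["arguments"]]
                    for c in rec["assistant_calls"]
                ],
            },
            sort_keys=True,
        )
        digest = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)

with open("output_dir/train.dedup.jsonl", "w") as f:
    for rec in unique:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
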
Contributions are welcome! Please note that the API is still evolving. Before starting major work, please open an issue to discuss your proposed changes.
MIT License - see LICENSE for details.
If you use ToolsGen in your research, please cite:
@software{toolsgen2025,
title = {ToolsGen: Synthetic Tool-Calling Dataset Generator},
author = {Ataşoğlu, Ahmet},
year = {2025},
url = {https://github.com/atasoglu/toolsgen}
}