A modular Python library for synthesizing tool-calling datasets from JSON tool definitions using an LLM-as-a-judge pipeline. Designed for OpenAI-compatible APIs.
⚠️ Development Status: This project is under active development. The API is not yet stable, and breaking changes may occur between versions.
ToolsGen automates the creation of tool-calling datasets for training and evaluating language models. It generates realistic user requests, produces corresponding tool calls, and evaluates their quality using a multi-dimensional rubric system.
- Multi-role LLM Pipeline: Separate models for problem generation, tool calling, and quality evaluation
- Flexible Sampling Strategies: Random, parameter-aware, and semantic clustering approaches
- LLM-as-a-Judge Scoring: Rubric-based evaluation with structured outputs
- OpenAI-Compatible: Works with OpenAI API and compatible providers (Azure OpenAI, local models via vLLM, etc.)
- Hugging Face Ready: JSONL output format compatible with Hugging Face datasets
- Configurable Quality Control: Adjustable scoring thresholds and retry mechanisms
- Train/Val Splitting: Built-in dataset splitting for model training workflows
- Parallel Generation: Multiprocessing pipeline to accelerate dataset creation on multi-core hosts
- Python 3.9+
- OpenAI API key (or compatible API endpoint)
git clone https://github.com/atasoglu/toolsgen.git
cd toolsgen
pip install .

# Check version
toolsgen version
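
The commands below read tool definitions from tools.json. The exact schema accepted by ToolsGen is defined by its ToolSpec model, but for OpenAI-compatible pipelines a reasonable starting point is a list of function definitions in the OpenAI function-calling format. The get_weather tool below is a hypothetical example, written out with a short Python sketch:

# Sketch: write a minimal tools.json containing one hypothetical tool
# in the OpenAI function-calling schema (names and fields are illustrative).
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. 'San Francisco, CA'",
                    }
                },
                "required": ["location"],
            },
        },
    }
]

with open("tools.json", "w") as f:
    json.dump(tools, f, indent=2)
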
# Set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"
# Generate dataset with default settings
toolsgen generate \
--tools tools.json \
--out output_dir \
--num 100
# Advanced: Use different models and temperatures for each role
toolsgen generate \
--tools tools.json \
--out output_dir \
--num 1000 \
--strategy param_aware \
--seed 42 \
--train-split 0.9 \
--workers 4 \
--worker-batch-size 8 \
--problem-model gpt-4o-mini --problem-temp 0.9 \
--caller-model gpt-4o --caller-temp 0.3 \
--judge-model gpt-4o --judge-temp 0.0
# Parallel generation with 6 workers, each processing 4 samples per task
toolsgen generate \
--tools tools.json \
--out output_dir \
--num 500 \
--workers 6 \
--worker-batch-size 4

import os
from pathlib import Path
from toolsgen.core import GenerationConfig, ModelConfig, generate_dataset
os.environ["OPENAI_API_KEY"] = "your-api-key-here"
# Configuration
tools_path = Path("tools.json")
output_dir = Path("output")
gen_config = GenerationConfig(
num_samples=100,
strategy="random",
seed=42,
train_split=0.9, # 90% train, 10% validation
batch_size=10, # optional: iterate tools in batches
shuffle_tools=True, # optional: reshuffle tools between batches
num_workers=4, # enable multiprocessing
worker_batch_size=2, # samples per worker task
)
model_config = ModelConfig(
model="gpt-4o-mini",
temperature=0.7,
)
# Generate dataset from file
manifest = generate_dataset(output_dir, gen_config, model_config, tools_path=tools_path)
# Or use tools list directly (alternative to tools_path)
# from toolsgen.schema import ToolSpec
# tools = [ToolSpec(...), ToolSpec(...)]
# manifest = generate_dataset(output_dir, gen_config, model_config, tools=tools)
print(f"Generated {manifest['num_generated']}/{manifest['num_requested']} records")
print(f"Failed: {manifest['num_failed']} attempts")See examples/ directory for complete working examples.
Note: The examples in examples/ use python-dotenv for convenience (loading API keys from a .env file). Install it with pip install python-dotenv if you want to use this approach.
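
A minimal sketch of that approach. OPENAI_API_KEY and OPENAI_BASE_URL are the standard variables read by the OpenAI Python client (v1+); the base URL is only needed when targeting a compatible endpoint such as a local vLLM server, and whether ToolsGen exposes its own setting for this is not covered here:

# Sketch: load environment variables from a .env file before generating.
# Assumes a .env file containing, for example:
#   OPENAI_API_KEY=your-api-key-here
#   OPENAI_BASE_URL=http://localhost:8000/v1   (optional, for compatible endpoints)
from dotenv import load_dotenv

load_dotenv()  # populate os.environ from .env

from toolsgen.core import GenerationConfig, ModelConfig, generate_dataset
# ...continue exactly as in the Python API example above.
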
Each line in train.jsonl (or val.jsonl) is a JSON record:
{
"id": "record_000001",
"language": "english",
"tools": [...],
"messages": [
{"role": "user", "content": "What's the weather in San Francisco?"}
],
"assistant_calls": [
{
"id": "call_abc123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\": \"San Francisco, CA\"}"
}
}
],
"problem_metadata": {"generated": true, "user_request": "..."},
"judge": {
"tool_relevance": 0.4,
"argument_quality": 0.38,
"clarity": 0.2,
"score": 0.98,
"verdict": "accept",
"rationale": "Excellent tool selection and argument quality",
"rubric_version": "0.1.0",
"model": "gpt-4o",
"temperature": 0.0
},
"quality_tags": [],
"tools_metadata": {"num_tools": 5}
}
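
Because the splits are plain JSONL, they load directly with the Hugging Face datasets library. A sketch, assuming the output directory used in the examples above; the filter on the judge verdict is optional:

# Sketch: load the generated splits with Hugging Face `datasets`
# and keep only records the judge accepted.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files={
        "train": "output_dir/train.jsonl",
        "validation": "output_dir/val.jsonl",
    },
)
accepted = ds["train"].filter(lambda rec: rec["judge"]["verdict"] == "accept")
print(len(accepted), "accepted training records")
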
manifest.json contains generation metadata:

{
"version": "0.1.0",
"num_requested": 1000,
"num_generated": 987,
"num_failed": 13,
"strategy": "param_aware",
"seed": 42,
"train_split": 0.9,
"tools_count": 15,
"models": {
"problem_generator": "gpt-4o-mini",
"tool_caller": "gpt-4o",
"judge": "gpt-4o"
},
"splits": {
"train": 888,
"val": 99
}
}

# Run all tests with coverage
pytest --cov=src
# Run specific test file
pytest tests/test_generator.py
# Run with verbose output
pytest -v

# Install development dependencies
pip install -r requirements-dev.txt
# Run tests with coverage
pytest --cov=src
# Run code quality checks
ruff check src tests --fix
ruff format src tests

For detailed information about the system architecture, pipeline, and core components, see ARCHITECTURE.md.
- Multi-turn conversation support
- Custom prompt template system
- Additional sampling strategies (coverage-based, difficulty-based)
- Integration with Hugging Face Hub for direct dataset uploads
- Support for more LLM providers (Anthropic, Cohere, etc.)
- Web UI for dataset inspection and curation
- Advanced filtering and deduplication (see the post-processing sketch below)
- Single-turn conversations only
- English-focused prompts (multilingual support is experimental)
- No built-in tool execution or validation
- Limited to OpenAI-compatible APIs
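
As noted in the roadmap, advanced filtering and deduplication are not built in yet. In the meantime, a simple post-processing pass over the generated JSONL can drop exact repeats, for example by hashing each record's user messages together with the called function names and arguments. A sketch using only the record fields shown above (file paths are assumptions):

# Sketch: drop records whose user request and tool calls repeat exactly.
import hashlib
import json

seen, unique = set(), []
with open("output_dir/train.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        key_src = json.dumps(
            {
                "user": [m["content"] for m in rec["messages"] if m["role"] == "user"],
                "calls": [
                    [c["function"]["name"], c["function"]["arguments"]]
                    for c in rec["assistant_calls"]
                ],
            },
            sort_keys=True,
        )
        digest = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)

with open("output_dir/train.dedup.jsonl", "w") as f:
    for rec in unique:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
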
Contributions are welcome! Please note that the API is still evolving. Before starting major work, please open an issue to discuss your proposed changes.
MIT License - see LICENSE for details.
If you use ToolsGen in your research, please cite:
@software{toolsgen2025,
title = {ToolsGen: Synthetic Tool-Calling Dataset Generator},
author = {Ataşoğlu, Ahmet},
year = {2025},
url = {https://github.com/atasoglu/toolsgen}
}