DatasetForge

DatasetForge is an application for generating, managing, and exporting AI training datasets.

Key Features

Template-Based Generation: Create templates with placeholders and use them with seed data to generate examples.
Seed Bank: Manage collections of seed data for use in generation.
Workflows: Connect generation, transformation, filtering, and other steps into reusable workflows.
Paraphrasing: Create variations of examples for data augmentation.
Export Options: Export data as JSONL or CSV, or through custom export templates.
Tool Calling Support: Generate examples with tool calls for function calling training.
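
As a rough illustration of the template-plus-seed idea and of JSONL export, here is a minimal Python sketch. It is not DatasetForge's implementation: the `{name}` placeholder syntax, the helper names `fill_template` and `to_jsonl`, and the `prompt` field are all assumptions made for this example.

```python
import json


def fill_template(template: str, seed: dict) -> str:
    """Substitute each {placeholder} in the template with the matching seed value.

    The {name} syntax is an assumption for this sketch, not necessarily
    the placeholder syntax DatasetForge uses.
    """
    return template.format(**seed)


def to_jsonl(examples: list) -> str:
    """Serialize examples as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


# Hypothetical template and seed bank entries.
template = "Write a short {tone} product description for {product}."
seeds = [
    {"tone": "playful", "product": "a reusable water bottle"},
    {"tone": "formal", "product": "a standing desk"},
]

# One generated example per seed entry.
examples = [{"prompt": fill_template(template, seed)} for seed in seeds]
jsonl = to_jsonl(examples)
print(jsonl)
```

Each seed entry yields one example, so a larger seed bank scales the dataset without changing the template. Note that `str.format` treats every brace as a placeholder, so a template containing literal braces would need them doubled in this sketch.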

Setup

Using Docker (Recommended)

1. Clone this repository.
2. Copy .env.example to .env and configure it:
   - Set OLLAMA_HOST to host.docker.internal to reach Ollama running on your host machine
   - Set the default model names (DEFAULT_GEN_MODEL, DEFAULT_PARA_MODEL)
   - Set the context sizes (GEN_MODEL_CONTEXT_SIZE, PARA_MODEL_CONTEXT_SIZE, DEFAULT_CONTEXT_SIZE)
3. Run docker-compose up to start the application.
4. Open http://localhost:3000 in your browser.

Running Locally (Development)

Backend (Python API)

1. Create a virtual environment: python -m venv dataforge_env
2. Activate it:
   - Windows: dataforge_env\Scripts\activate
   - Mac/Linux: source dataforge_env/bin/activate
3. Install backend dependencies (from the repository root): cd backend && pip install -r requirements.txt
4. Download the spaCy model: python -m spacy download en_core_web_sm
5. Copy .env.example to .env and configure it:
   - Set OLLAMA_HOST to localhost for a local Ollama instance
   - Set the default models and context sizes
6. Start the backend (from the repository root): cd backend && python -m app.main

Frontend (React UI)

1. Install Node.js dependencies (from the repository root): cd frontend && npm install
2. Start the frontend development server (also from the repository root): cd frontend && npm run dev
3. Open http://localhost:3000 in your browser.

Configuration

The application uses environment variables for configuration:

DB_PATH: Location of the SQLite database file
OLLAMA_HOST, OLLAMA_PORT, OLLAMA_TIMEOUT: Host, port, and timeout for connecting to Ollama
DEFAULT_GEN_MODEL: Default model for generation (e.g., "mistral:latest")
DEFAULT_PARA_MODEL: Default model for paraphrasing (e.g., "mistral:latest")
GEN_MODEL_CONTEXT_SIZE: Context size for generation model (in tokens)
PARA_MODEL_CONTEXT_SIZE: Context size for paraphrase model (in tokens)
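
To make the list above concrete, here is a minimal sketch of reading these variables in Python. It is an illustration only, not the backend's actual settings code: the `Settings` class and `load_settings` function are hypothetical, OLLAMA_HOST/PORT/TIMEOUT is assumed to expand to OLLAMA_HOST, OLLAMA_PORT, and OLLAMA_TIMEOUT, and every fallback value is a placeholder rather than the application's real default (11434 is Ollama's usual port).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    db_path: str
    ollama_host: str
    ollama_port: int
    ollama_timeout: float
    default_gen_model: str
    default_para_model: str
    gen_model_context_size: int
    para_model_context_size: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to the process environment.

    All fallback values below are placeholders chosen for this sketch.
    """
    env = os.environ if env is None else env
    return Settings(
        db_path=env.get("DB_PATH", "datasetforge.db"),
        ollama_host=env.get("OLLAMA_HOST", "localhost"),
        ollama_port=int(env.get("OLLAMA_PORT", "11434")),
        ollama_timeout=float(env.get("OLLAMA_TIMEOUT", "60")),
        default_gen_model=env.get("DEFAULT_GEN_MODEL", "mistral:latest"),
        default_para_model=env.get("DEFAULT_PARA_MODEL", "mistral:latest"),
        # Context sizes are token counts, so they are parsed as integers.
        gen_model_context_size=int(env.get("GEN_MODEL_CONTEXT_SIZE", "4096")),
        para_model_context_size=int(env.get("PARA_MODEL_CONTEXT_SIZE", "4096")),
    )


# Example: values as they might appear in a Docker-oriented .env file.
settings = load_settings(
    {"OLLAMA_HOST": "host.docker.internal", "GEN_MODEL_CONTEXT_SIZE": "8192"}
)
print(settings)
```

Passing a plain dictionary instead of reading os.environ directly keeps the sketch easy to try without touching your shell environment.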

CLI Commands

DatasetForge includes a command-line interface with these utilities:

database_stats: Display database statistics
show_examples: View examples from a dataset
reset_database: Reset the database (warning: deletes all data)
restore_database: Restore from a backup
database_status: Show database file information
run_migration: Update database schema when upgrading
export_database: Export the database to another location
import_database: Import a database from another location

To run these commands:

# In development (run from the backend directory):
cd backend
python -m app.cli command_name [options]
# In Docker:
docker-compose exec backend python -m app.cli command_name [options]
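
The two invocation forms above differ only in the docker-compose exec backend prefix. If you script these utilities, a small helper along the following lines can build either form. The function name `cli_command` is hypothetical, and no command options are assumed beyond what this README lists.

```python
import shlex


def cli_command(name, args=(), in_docker=False):
    """Return the argument list for a DatasetForge CLI utility.

    name is one of the utilities listed above (for example database_stats);
    args are passed through unchanged.
    """
    base = ["python", "-m", "app.cli", name, *args]
    if in_docker:
        # Run the same module inside the backend container.
        return ["docker-compose", "exec", "backend", *base]
    return base


dev_cmd = cli_command("database_stats")
docker_cmd = cli_command("database_stats", in_docker=True)
print(shlex.join(dev_cmd))
print(shlex.join(docker_cmd))
```

The returned list can be handed to subprocess.run; in development, pass cwd="backend" so the app package is importable.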

Acknowledgments

This application builds on several open source projects:

FastAPI for the backend API
React and Vite for the frontend
SQLModel for database models
ReactFlow for workflow visualization
