DatasetForge is an application for generating, managing, and exporting AI training datasets.
- Template-Based Generation: Create templates with placeholders and use them with seed data to generate examples.
- Seed Bank: Manage collections of seed data for use in generation.
- Workflows: Connect generation, transformation, filtering, and other steps into reusable workflows.
- Paraphrasing: Create variations of examples for data augmentation.
- Export Options: Export data in a variety of formats including JSONL, CSV, and custom templates.
- Tool Calling Support: Generate examples with tool calls for function calling training.
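
As a rough illustration of the template-plus-seed idea and the JSONL export listed above, here is a minimal sketch. It does not use DatasetForge's own code: the placeholder syntax (`$name`, from Python's `string.Template`), the example template, the seed values, and the `prompt` field name are all assumptions made for the example.

```python
import json
from string import Template

# Hypothetical template; DatasetForge's real placeholder syntax may differ.
template = Template("Write a $tone product description for $product.")

# Hypothetical seed data: one dict of placeholder values per example.
seeds = [
    {"tone": "playful", "product": "a reusable water bottle"},
    {"tone": "formal", "product": "an ergonomic office chair"},
]


def render_prompts(template, seeds):
    """Fill the template once per seed and return the rendered strings."""
    return [template.substitute(seed) for seed in seeds]


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


prompts = render_prompts(template, seeds)
jsonl = to_jsonl({"prompt": prompt} for prompt in prompts)
print(jsonl)
```

In the application itself the rendered prompts would go to a model for generation; the sketch stops at rendering and serialization.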
- Clone this repository
- Copy `.env.example` to `.env` and configure:
  - Set `OLLAMA_HOST` to `host.docker.internal` to access Ollama running on your host machine
  - Set up default model names (`DEFAULT_GEN_MODEL`, `DEFAULT_PARA_MODEL`)
  - Set context sizes (`GEN_MODEL_CONTEXT_SIZE`, `PARA_MODEL_CONTEXT_SIZE`, `DEFAULT_CONTEXT_SIZE`)
- Run `docker-compose up` to start the application
- Open http://localhost:3000 in your browser
- Create a virtual environment: `python -m venv dataforge_env`
- Activate it:
  - Windows: `dataforge_env\Scripts\activate`
  - Mac/Linux: `source dataforge_env/bin/activate`
- Install backend dependencies: `cd backend && pip install -r requirements.txt`
- Download spaCy model: `python -m spacy download en_core_web_sm`
- Copy `.env.example` to `.env` and configure:
  - Set `OLLAMA_HOST` to `localhost` for local Ollama
  - Set default models and context sizes
- Start the backend: `cd backend && python -m app.main`
- Install Node.js dependencies: `cd frontend && npm install`
- Start the frontend development server: `cd frontend && npm run dev`
- Open http://localhost:3000 in your browser
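
Both setups depend on the backend being able to reach Ollama at the configured `OLLAMA_HOST`. The sketch below is one way to check that from Python before starting the app. It is not part of DatasetForge: the helper names are hypothetical, and it assumes Ollama's default port (11434) and its `/api/tags` model-listing endpoint.

```python
import json
import urllib.request


def ollama_tags_url(host: str, port: int = 11434) -> str:
    """Build the URL of Ollama's model-listing endpoint (11434 is Ollama's default port)."""
    return f"http://{host}:{port}/api/tags"


def list_ollama_models(host: str, port: int = 11434, timeout: float = 5.0) -> list[str]:
    """Return the names of locally available models; raises OSError if Ollama is unreachable."""
    with urllib.request.urlopen(ollama_tags_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]


# Example (requires a running Ollama instance):
#   list_ollama_models("localhost")
```

If the call raises a connection error, check that Ollama is running and that `OLLAMA_HOST` matches your setup (`host.docker.internal` under Docker, `localhost` otherwise).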
The application uses environment variables for configuration:
- DB_PATH: Location of the SQLite database file
- OLLAMA_HOST/PORT/TIMEOUT: Configuration for connecting to Ollama
- DEFAULT_GEN_MODEL: Default model for generation (e.g., "mistral:latest")
- DEFAULT_PARA_MODEL: Default model for paraphrasing (e.g., "mistral:latest")
- GEN_MODEL_CONTEXT_SIZE: Context size for generation model (in tokens)
- PARA_MODEL_CONTEXT_SIZE: Context size for paraphrase model (in tokens)
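
For illustration, the variables above could be read into a typed settings object roughly as follows. This is a sketch, not the application's actual configuration code: the `Settings` class, the `load_settings` helper, and every fallback value are assumptions chosen for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_path: str
    ollama_host: str
    default_gen_model: str
    default_para_model: str
    gen_model_context_size: int
    para_model_context_size: int


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default).

    All fallback values below are placeholders for this sketch,
    not DatasetForge's real defaults.
    """
    env = os.environ if env is None else env
    return Settings(
        db_path=env.get("DB_PATH", "datasetforge.db"),
        ollama_host=env.get("OLLAMA_HOST", "localhost"),
        default_gen_model=env.get("DEFAULT_GEN_MODEL", "mistral:latest"),
        default_para_model=env.get("DEFAULT_PARA_MODEL", "mistral:latest"),
        # Context sizes are token counts, so convert them to integers.
        gen_model_context_size=int(env.get("GEN_MODEL_CONTEXT_SIZE", "4096")),
        para_model_context_size=int(env.get("PARA_MODEL_CONTEXT_SIZE", "4096")),
    )


settings = load_settings({"OLLAMA_HOST": "host.docker.internal", "GEN_MODEL_CONTEXT_SIZE": "8192"})
print(settings)
```

Passing a plain dict, as in the last lines, makes it easy to try different configurations without touching the real environment.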
DatasetForge includes a command-line interface with these utilities:
- database_stats: Display database statistics
- show_examples: View examples from a dataset
- reset_database: Reset the database (warning: deletes all data)
- restore_database: Restore from a backup
- database_status: Show database file information
- run_migration: Update database schema when upgrading
- export_database: Export the database to another location
- import_database: Import a database from another location
To run these commands:
cd backend
# In development:
python -m app.cli command_name [options]
# In Docker:
docker-compose exec backend python -m app.cli command_name [options]

This application builds on several open source projects:
- FastAPI for the backend API
- React and Vite for the frontend
- SQLModel for database models
- ReactFlow for workflow visualization