# Forge Train

Automated training infrastructure for Ash Forge.

Forge Train is a complete training pipeline for fine-tuning, quantizing, and publishing specialized AI models for the Ash Forge ecosystem.
## Quick Start

One-command setup on Ubuntu 26.04 LTS:

```bash
curl -sSL https://raw.githubusercontent.com/ash-forge/forge-train/main/setup.sh | sudo bash
```

This installs:

- Python 3.12 + PyTorch (CPU)
- Redis (job queue)
- PostgreSQL (job history)
- Ollama (model serving)
- Forge Train CLI
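A quick way to sanity-check the installation before starting workers. The service names below are assumed to be the stock Ubuntu/Ollama defaults, and `forge-train --version` is a hypothetical flag; adjust to whatever `forge-train --help` reports:

```bash
# Confirm the supporting services came up
systemctl is-active redis-server postgresql ollama
redis-cli ping                 # expect: PONG

# Confirm the CLI is on PATH (flag name is an assumption)
forge-train --version
```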
Start the workers and the queue manager:

```bash
# Start 2 training workers
sudo systemctl start forge-worker@{1..2}
sudo systemctl enable forge-worker@{1..2}

# Start queue manager
sudo systemctl start forge-queue
sudo systemctl enable forge-queue
```

Submit your first training job:

```bash
forge-train submit \
  --model ash-code:python \
  --dataset datasets/python-expert \
  --epochs 3 \
  --notify discord
```

Monitor progress:

```bash
# Check queue
forge-train queue list

# Check job status
forge-train status job_12345

# System resources
forge-train system resources
```
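Since the workers run as systemd units, their raw logs are also available through journald alongside the CLI views above:

```bash
# Follow worker 1's live output (unit names from the systemd commands above)
journalctl -u forge-worker@1 -f

# Re-poll the queue every 30 seconds
watch -n 30 forge-train queue list
```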
## Features

- One-Command Setup - complete installation in 15 minutes
- Parallel Training - run 2+ models simultaneously
- Full Automation - dataset → training → quantization → publishing
- CPU Optimized - efficient LoRA fine-tuning on CPU
- Queue System - priority-based job scheduling
- Monitoring - real-time metrics and progress tracking
- Discord Notifications - get notified when training completes
- Checkpoint Recovery - auto-resume from failures
## How It Works

1. Submit a job:

```bash
forge-train submit --model ash-code:python --dataset datasets/python-expert
```

The job is added to the Redis queue with your specified priority.

2. A worker pulls the job from the queue and:

- Downloads the base model (`gemma4:turbo`)
- Validates and prepares the dataset
- Configures LoRA training parameters
- Starts training with Axolotl (a config sketch follows this list)
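Forge Train drives Axolotl under the hood. As a rough illustration of what a worker might generate, here is a minimal Axolotl-style config using the LoRA defaults from `forge-train.yaml` further below; the file path and the Hugging Face base-model id are placeholders (this README's actual base is `gemma4:turbo`):

```bash
# Hypothetical example of the per-job Axolotl config a worker could write
cat > /tmp/job_12345_axolotl.yaml <<'EOF'
base_model: google/gemma-2b        # placeholder id, not the real base model
adapter: lora
lora_r: 16                         # matches lora.rank in forge-train.yaml
lora_alpha: 32
lora_target_modules: [q_proj, v_proj, k_proj, o_proj]
datasets:
  - path: datasets/python-expert
    type: alpaca
micro_batch_size: 4
num_epochs: 3
learning_rate: 0.0002
EOF
```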
3. During training, the worker:

- Checkpoints every 500 steps (on-disk location shown below)
- Logs loss curves and metrics
- Monitors CPU/memory usage
- Can auto-resume if interrupted
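Checkpoints land under `/var/lib/forge-train/checkpoints/` (see the directory layout below), so an interrupted job's state is visible on disk; the per-job subdirectory name here is an assumption:

```bash
# Most recent checkpoints first (job subdirectory name is hypothetical)
ls -lt /var/lib/forge-train/checkpoints/job_12345/ | head
```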
4. After training completes, the worker runs post-processing (sketched below):

- Merges the LoRA adapters with the base model
- Quantizes to GGUF Q4_K_M (~3 GB)
- Packages the result as an Ollama model
- Pushes it to the registry
- Sends a Discord notification
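This stage is fully automated, but the equivalent manual steps are handy for debugging. A rough sketch using llama.cpp's converter/quantizer and the Ollama CLI; the `./merged-model` directory and output file names are illustrative:

```bash
# Convert the merged HF-format model to GGUF (script ships with llama.cpp)
python convert_hf_to_gguf.py ./merged-model --outfile ash-code-python.f16.gguf

# Quantize to Q4_K_M
./llama-quantize ash-code-python.f16.gguf ash-code-python.Q4_K_M.gguf Q4_K_M

# Package as an Ollama model and publish
cat > Modelfile <<'EOF'
FROM ./ash-code-python.Q4_K_M.gguf
EOF
ollama create ashforge/ash-code:python -f Modelfile
ollama push ashforge/ash-code:python
```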
5. Use your trained model:

```bash
ollama pull ashforge/ash-code:python
```

## Architecture

```
┌─────────────┐
│  CLI User   │
└──────┬──────┘
       │ submit job
       ▼
┌─────────────────┐
│   Redis Queue   │ ◄── Priority-based job queue
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────┐
│Worker 1│ │Worker 2│ ◄── Parallel execution
└───┬────┘ └───┬────┘
    │          │
    ├── 4 CPU cores, 32GB RAM
    ├── Dataset validation
    ├── LoRA training (Axolotl)
    ├── Checkpoint management
    ├── Quantization (GGUF)
    └── Ollama packaging
         │
         ▼
┌──────────────┐
│  PostgreSQL  │ ◄── Job history & metrics
└──────────────┘
         │
         ▼
┌──────────────┐
│    Ollama    │ ◄── Model serving
└──────────────┘
```
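The queue itself is ordinary Redis state, so it can be inspected directly when debugging. The key names below are hypothetical (check `queue.log` for the real ones):

```bash
# List keys the queue manager created (pattern is an assumption)
redis-cli --scan --pattern 'forge*'

# If the queue is a plain list, this counts pending jobs
redis-cli LLEN forge:queue
```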
## CLI Reference

Jobs:

```bash
# Submit job
forge-train submit --model <name> --dataset <path>

# Queue operations
forge-train queue list
forge-train queue list --status running

# Job status
forge-train status <job-id>

# Cancel job
forge-train cancel <job-id>
```

Workers:

```bash
# Start workers
forge-train worker start --workers 2

# Stop workers
forge-train worker stop

# Worker status
forge-train worker status
```

Models:

```bash
# List trained models
forge-train models list
forge-train models list --category code

# Test model
forge-train test ash-code:python --prompt "Write a function to sort a list"
```

Monitoring:

```bash
# System status
forge-train system status

# Resource monitoring
forge-train system resources

# Start monitoring dashboard
forge-train monitor dashboard --port 8080
```

Datasets:

```bash
# Validate dataset
forge-train dataset validate datasets/python-expert

# Dataset statistics
forge-train dataset stats datasets/python-expert
```
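This README doesn't pin down the on-disk dataset format. Axolotl commonly consumes Alpaca-style JSONL, so a plausible round-trip might look like the following; the file name and record schema are assumptions:

```bash
# Create a one-example dataset in Alpaca-style JSONL (format is an assumption)
mkdir -p datasets/python-expert
cat > datasets/python-expert/train.jsonl <<'EOF'
{"instruction": "Write a function to sort a list", "input": "", "output": "def sort_list(items):\n    return sorted(items)"}
EOF

# Then run it through the validator
forge-train dataset validate datasets/python-expert
```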
## Performance

CPU Training (Xeon E3-1270 v6, 4c/8t, 64GB RAM):

| Model Size | Examples | Time |
|---|---|---|
| Small | 2k | 24-48h |
| Medium | 5k | 48-72h |
| Large | 10k | 96-168h |
Throughput:

- 2 workers × 24/7 ≈ 12-20 models/month (each worker has ~720 hours/month; at the 48-168 h per model above, that is roughly 6-10 models per worker for a typical mix)
- Full 50-model ecosystem ≈ 3-4 months
## Directory Layout

```
/opt/forge-train/          # Installation directory
├── venv/                  # Python virtual environment
├── forge-train/           # Repository
└── config/                # Configuration files

/var/lib/forge-train/      # Data directory
├── jobs/                  # Job metadata
├── models/                # Trained models
├── datasets/              # Training datasets
└── checkpoints/           # Training checkpoints

/var/log/forge-train/      # Logs
├── worker-1.log
├── worker-2.log
├── queue.log
└── monitor.log
```
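Models and checkpoints accumulate quickly on multi-day runs, so it's worth keeping an eye on the data directory:

```bash
# Per-subdirectory usage and remaining space on the data volume
du -sh /var/lib/forge-train/*
df -h /var/lib/forge-train
```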
## Configuration

Edit /opt/forge-train/config/forge-train.yaml:

```yaml
# Workers
workers:
  count: 2
  cpu_per_worker: 4
  memory_per_worker: "32G"

# Training defaults
training:
  batch_size: 4
  learning_rate: 0.0002
  num_epochs: 3

# LoRA defaults
lora:
  rank: 16
  alpha: 32
  target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]

# Notifications
notifications:
  discord:
    enabled: true
    webhook_url: "https://discord.com/api/webhooks/..."
```
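Before committing to a multi-day run, it's worth confirming the webhook fires. Discord webhooks accept a plain JSON POST, so a test from the shell looks like this:

```bash
# Post a test message to the webhook configured above
WEBHOOK_URL="https://discord.com/api/webhooks/..."   # your real URL
curl -sS -X POST "$WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"content": "forge-train webhook test"}'
```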
## Security

Default credentials:

- Database password: `forge_train_password`
- Change it in /opt/forge-train/config/forge-train.yaml
Update the PostgreSQL password:

```bash
sudo -u postgres psql -c "ALTER USER forge PASSWORD 'new_password';"
```

Firewall:

- SSH (port 22) - open
- Ollama (port 11434) - closed by default
- Dashboard (port 8080) - closed by default
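If you need to reach the dashboard from another machine, open its port explicitly. A minimal sketch assuming ufw is the active firewall (the Ubuntu default); adjust the CIDR to your network:

```bash
# Allow dashboard access from the local subnet only
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp
sudo ufw status verbose
```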
## Ecosystem

Forge Train is part of the Ash Forge ecosystem:

- forge-train (this repo) - Training infrastructure
- ash-bot - Discord bot with learning system
- ash-engine - C++ inference engine
- forge-creator - Model creation tools
- forge-models - Pre-trained model catalog
## Roadmap

v1.0 (Current):

- ✅ CPU-based LoRA training
- ✅ Parallel worker execution
- ✅ Redis job queue
- ✅ GGUF quantization
- ✅ Ollama packaging

v1.1 (Planned):

- GPU support (optional)
- Distributed training across multiple servers
- Advanced monitoring dashboard (Grafana)
- Model versioning and rollback
- A/B testing framework

v2.0 (Future):

- Web UI for job management
- Automatic hyperparameter tuning
- Community model marketplace
- Federated learning support
## Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Areas where we need help:

- Dataset creation and curation
- Training optimization
- Documentation
- Testing on different hardware
- Community model submissions
## License

Apache 2.0 - see LICENSE for details.

Forge your AI. Your way.

- Website: ash-forge.com
- GitHub: github.com/ash-forge
- Discord: Join our community

Questions? Issues? Ideas? Open an issue or join our Discord community!