Forge Train 🦞🔥

Automated training infrastructure for Ash Forge

Forge Train is a complete training pipeline for fine-tuning, quantizing, and publishing specialized AI models for the Ash Forge ecosystem.


🚀 Quick Start

1. Installation

One-command setup on Ubuntu 26.04 LTS:

curl -sSL https://raw.githubusercontent.com/ash-forge/forge-train/main/setup.sh | sudo bash

This installs:

  • Python 3.12 + PyTorch (CPU)
  • Redis (job queue)
  • PostgreSQL (job history)
  • Ollama (model serving)
  • Forge Train CLI

2. Start Workers

# Start 2 training workers
sudo systemctl start forge-worker@{1..2}
sudo systemctl enable forge-worker@{1..2}

# Start queue manager
sudo systemctl start forge-queue
sudo systemctl enable forge-queue
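
forge-worker@ is a systemd templated unit, so the `{1..2}` brace expansion starts one service instance per worker. For readers unfamiliar with the pattern, here is a hypothetical sketch of what such a unit file could look like; the paths, user, and `worker run --id` subcommand are illustrative guesses, not taken from this repo:

```ini
# /etc/systemd/system/forge-worker@.service  (hypothetical sketch)
[Unit]
Description=Forge Train worker %i
After=network.target redis.service postgresql.service

[Service]
User=forge
# %i is the instance name: forge-worker@1 -> %i == "1"
ExecStart=/opt/forge-train/venv/bin/forge-train worker run --id %i
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After editing a unit file, run `sudo systemctl daemon-reload` before starting the instances.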

3. Submit Training Job

forge-train submit \
  --model ash-code:python \
  --dataset datasets/python-expert \
  --epochs 3 \
  --notify discord

4. Monitor Progress

# Check queue
forge-train queue list

# Check job status
forge-train status job_12345

# System resources
forge-train system resources

📋 Features

✅ One-Command Setup - Complete installation in 15 minutes
✅ Parallel Training - Run 2+ models simultaneously
✅ Full Automation - Dataset → Training → Quantization → Publishing
✅ CPU Optimized - Efficient LoRA fine-tuning on CPU
✅ Queue System - Priority-based job scheduling
✅ Monitoring - Real-time metrics and progress tracking
✅ Discord Notifications - Get notified when training completes
✅ Checkpoint Recovery - Auto-resume from failures


🎯 How It Works

1. Submit Job

forge-train submit --model ash-code:python --dataset datasets/python-expert

The job is added to the Redis queue with your specified priority.
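Priority ordering like this is commonly implemented with a sorted set or similar structure. A minimal self-contained sketch of the idea, using an in-memory heapq stand-in rather than a live Redis connection (the actual key names and schema are not shown in this README):

```python
import heapq
import itertools

class PriorityJobQueue:
    """In-memory stand-in for a priority-based job queue.

    Lower priority numbers are served first; the counter keeps
    FIFO order among jobs submitted with the same priority.
    """
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, job_id, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def next_job(self):
        priority, _, job_id = heapq.heappop(self._heap)
        return job_id

queue = PriorityJobQueue()
queue.submit("job_12345", priority=10)
queue.submit("job_12346", priority=1)   # urgent job jumps the line
queue.submit("job_12347", priority=10)

print(queue.next_job())  # job_12346
print(queue.next_job())  # job_12345
```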

2. Worker Picks Up Job

A worker pulls the job from the queue and:

  1. Downloads the base model (gemma4:turbo)
  2. Validates and prepares the dataset
  3. Configures LoRA training parameters
  4. Starts training with Axolotl
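
The four steps above form a small linear pipeline. A hedged sketch of how a worker might drive them; the stage names and lambdas here are illustrative placeholders, not the repo's actual API:

```python
def run_job(job, stages):
    """Run a job dict through an ordered list of (name, fn) pipeline stages.

    Each stage may annotate the job; any exception marks the job failed
    so the queue can reschedule or report it.
    """
    for name, stage in stages:
        try:
            stage(job)
        except Exception as exc:
            job["status"] = f"failed at {name}: {exc}"
            return job
    job["status"] = "completed"
    return job

# Placeholder stage implementations, for illustration only.
stages = [
    ("download_base_model", lambda job: job.setdefault("base", "gemma4:turbo")),
    ("prepare_dataset",     lambda job: job.setdefault("examples", 2000)),
    ("configure_lora",      lambda job: job.setdefault("lora_rank", 16)),
    ("train",               lambda job: job.setdefault("loss", 0.42)),
]

job = run_job({"id": "job_12345"}, stages)
print(job["status"])  # completed
```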

3. Training Runs (24-72 hours for most models)

The worker:

  • Checkpoints every 500 steps
  • Logs loss curves and metrics
  • Monitors CPU/memory usage
  • Can auto-resume if interrupted
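
A minimal sketch of the checkpoint-every-N-steps / resume-from-latest pattern described above. The file layout is an assumption; a real checkpoint would of course contain model and optimizer weights, not a tiny JSON dict:

```python
import json
from pathlib import Path

CHECKPOINT_INTERVAL = 500

def save_checkpoint(ckpt_dir, step, state):
    # Zero-padded step numbers keep lexicographic sort == numeric sort.
    (Path(ckpt_dir) / f"step-{step:08d}.json").write_text(json.dumps(state))

def latest_checkpoint(ckpt_dir):
    """Return (step, state) of the newest checkpoint, or (0, {}) if none exist."""
    files = sorted(Path(ckpt_dir).glob("step-*.json"))
    if not files:
        return 0, {}
    newest = files[-1]
    return int(newest.stem.split("-")[1]), json.loads(newest.read_text())

def train(ckpt_dir, total_steps):
    step, state = latest_checkpoint(ckpt_dir)  # auto-resume if interrupted
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # placeholder for a real training step
        if step % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step
```

Restarting `train()` with the same checkpoint directory picks up from the last saved step instead of step 0.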

4. Post-Processing

After training completes:

  1. Merge LoRA adapters with base model
  2. Quantize to GGUF Q4_K_M (roughly 3 GB)
  3. Package as Ollama model
  4. Push to registry
  5. Send Discord notification

5. Model Ready!

ollama pull ashforge/ash-code:python

📦 Architecture

┌─────────────┐
│   CLI User  │
└──────┬──────┘
       │ submit job
       ▼
┌─────────────────┐
│  Redis Queue    │ ◄─── Priority-based job queue
└────────┬────────┘
         │
    ┌────┴────┐
    ▼         ▼
┌────────┐ ┌────────┐
│Worker 1│ │Worker 2│ ◄─── Parallel execution
└───┬────┘ └───┬────┘
    │          │
    ├── 4 CPU cores, 32GB RAM
    ├── Dataset validation
    ├── LoRA training (Axolotl)
    ├── Checkpoint management
    ├── Quantization (GGUF)
    └── Ollama packaging
         │
         ▼
   ┌──────────────┐
   │ PostgreSQL   │ ◄─── Job history & metrics
   └──────────────┘
         │
         ▼
   ┌──────────────┐
   │   Ollama     │ ◄─── Model serving
   └──────────────┘

πŸ› οΈ CLI Commands

Job Management

# Submit job
forge-train submit --model <name> --dataset <path>

# Queue operations
forge-train queue list
forge-train queue list --status running

# Job status
forge-train status <job-id>

# Cancel job
forge-train cancel <job-id>

Worker Management

# Start workers
forge-train worker start --workers 2

# Stop workers
forge-train worker stop

# Worker status
forge-train worker status

Model Management

# List trained models
forge-train models list
forge-train models list --category code

# Test model
forge-train test ash-code:python --prompt "Write a function to sort a list"
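
Once a model is published, you can also prompt it directly through Ollama's local HTTP API (POST /api/generate on port 11434), without going through the forge-train CLI. A sketch that builds the request; actually sending it requires a running Ollama daemon:

```python
import json
import urllib.request

def build_generate_request(model, prompt, host="http://localhost:11434"):
    """Build an urllib Request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("ashforge/ash-code:python",
                             "Write a function to sort a list")
# urllib.request.urlopen(req) would return the completion as JSON.
print(req.full_url)  # http://localhost:11434/api/generate
```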

System Management

# System status
forge-train system status

# Resource monitoring
forge-train system resources

# Start monitoring dashboard
forge-train monitor dashboard --port 8080

Dataset Management

# Validate dataset
forge-train dataset validate datasets/python-expert

# Dataset statistics
forge-train dataset stats datasets/python-expert
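
This README doesn't document the dataset format, but instruction-tuning datasets are commonly JSONL with prompt/completion-style fields. A hedged sketch of the kind of checks a validator could apply; the field names are assumptions, not the repo's actual schema:

```python
import json

REQUIRED_FIELDS = {"prompt", "completion"}  # assumed schema, not from the repo

def validate_jsonl(lines):
    """Return (valid_count, errors) for an iterable of JSONL lines."""
    valid, errors = 0, []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            errors.append(f"line {i}: missing {sorted(missing)}")
        else:
            valid += 1
    return valid, errors

sample = [
    '{"prompt": "Sort a list", "completion": "sorted(xs)"}',
    '{"prompt": "No completion here"}',
    'not json at all',
]
valid, errors = validate_jsonl(sample)  # valid == 1, two errors reported
```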

📊 Training Times

CPU Training (Xeon E3-1270 v6, 4c/8t, 64GB RAM):

| Model Size | Examples | Time    |
|------------|----------|---------|
| Small      | 2k       | 24-48h  |
| Medium     | 5k       | 48-72h  |
| Large      | 10k      | 96-168h |

Throughput:

  • 2 workers × 24/7 = 12-20 models/month
  • Full 50-model ecosystem = 3-4 months
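
The throughput estimate follows from simple arithmetic: two always-on workers provide about 1,440 worker-hours per 30-day month. Assuming jobs average roughly 72-120 hours each (mid-size models, per the table above), that yields the quoted range:

```python
HOURS_PER_MONTH = 30 * 24              # 720 wall-clock hours in a 30-day month
WORKERS = 2
worker_hours = WORKERS * HOURS_PER_MONTH   # 1440 worker-hours per month

# Assumed average job length: 72h (fast) to 120h (slow)
best_case  = worker_hours // 72
worst_case = worker_hours // 120
print(f"{worst_case}-{best_case} models/month")  # 12-20 models/month
```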

📂 Directory Structure

/opt/forge-train/          # Installation directory
├── venv/                  # Python virtual environment
├── forge-train/           # Repository
└── config/                # Configuration files

/var/lib/forge-train/      # Data directory
├── jobs/                  # Job metadata
├── models/                # Trained models
├── datasets/              # Training datasets
└── checkpoints/           # Training checkpoints

/var/log/forge-train/      # Logs
├── worker-1.log
├── worker-2.log
├── queue.log
└── monitor.log

βš™οΈ Configuration

Edit /opt/forge-train/config/forge-train.yaml:

# Workers
workers:
  count: 2
  cpu_per_worker: 4
  memory_per_worker: "32G"

# Training defaults
training:
  batch_size: 4
  learning_rate: 0.0002
  num_epochs: 3

# LoRA defaults
lora:
  rank: 16
  alpha: 32
  target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]

# Notifications
notifications:
  discord:
    enabled: true
    webhook_url: "https://discord.com/api/webhooks/..."
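
Typos in this file are easy to make, so it is worth sanity-checking values after editing. A small sketch of the checks a loader might apply; the keys mirror the file above, but the bounds are illustrative, not the repo's actual validation rules:

```python
def validate_config(cfg):
    """Raise ValueError on obviously bad values in a forge-train config dict."""
    problems = []
    if cfg.get("workers", {}).get("count", 0) < 1:
        problems.append("workers.count must be >= 1")
    lr = cfg.get("training", {}).get("learning_rate", 0)
    if not (0 < lr < 1):
        problems.append("training.learning_rate should be in (0, 1)")
    lora = cfg.get("lora", {})
    if lora.get("alpha", 0) < lora.get("rank", 0):
        problems.append("lora.alpha is usually >= lora.rank")
    if problems:
        raise ValueError("; ".join(problems))
    return cfg

# The defaults shown in the YAML above pass these checks.
cfg = {
    "workers": {"count": 2},
    "training": {"batch_size": 4, "learning_rate": 0.0002, "num_epochs": 3},
    "lora": {"rank": 16, "alpha": 32},
}
validate_config(cfg)
```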

πŸ” Security

Default credentials:

  • Database password: forge_train_password
  • Change in: /opt/forge-train/config/forge-train.yaml

Update PostgreSQL password:

sudo -u postgres psql -c "ALTER USER forge PASSWORD 'new_password';"
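
When rotating the default password, a cryptographically random value is safer than a hand-picked one. Python's standard library can generate one:

```python
import secrets

# 24 random bytes encode to 32 URL-safe characters,
# suitable as a PostgreSQL password
new_password = secrets.token_urlsafe(24)
print(new_password)
```

Use the generated value in the ALTER USER command above, and remember to update forge-train.yaml to match.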

Firewall:

  • SSH (port 22) - open
  • Ollama (port 11434) - closed by default
  • Dashboard (port 8080) - closed by default

📚 Documentation


🧩 Ecosystem

Forge Train is part of the Ash Forge ecosystem:

  • forge-train (this repo) - Training infrastructure
  • ash-bot - Discord bot with learning system
  • ash-engine - C++ inference engine
  • forge-creator - Model creation tools
  • forge-models - Pre-trained model catalog

📈 Roadmap

v1.0 (Current):

  • ✅ CPU-based LoRA training
  • ✅ Parallel worker execution
  • ✅ Redis job queue
  • ✅ GGUF quantization
  • ✅ Ollama packaging

v1.1 (Planned):

  • GPU support (optional)
  • Distributed training across multiple servers
  • Advanced monitoring dashboard (Grafana)
  • Model versioning and rollback
  • A/B testing framework

v2.0 (Future):

  • Web UI for job management
  • Automatic hyperparameter tuning
  • Community model marketplace
  • Federated learning support

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Areas we need help:

  • Dataset creation and curation
  • Training optimization
  • Documentation
  • Testing on different hardware
  • Community model submissions

📜 License

Apache 2.0 - See LICENSE for details


🦞 Built by Ash Forge

Forge your AI. Your way.


🔥 Happy Forging!

Questions? Issues? Ideas?

Open an issue or join our Discord community!
