diegonmarcos/ml-MachineLearning

n8n Model Orchestration Platform

"Where I teach computers to be smart so I don't have to be."

A comprehensive platform for orchestrating fine-tuning of open-source language models using private data across multiple VPS providers, with built-in cost management and monitoring capabilities.

Overview

This platform provides an end-to-end solution for:

  • Fine-tuning LLMs: Support for Llama, GPT-Neo, Mistral, Falcon, and other open-source models
  • Multi-Provider Management: Seamless integration with RunPod, Vast.ai, Lambda Labs, Modal, and more
  • Cost Optimization: Real-time cost tracking, forecasting, and optimization recommendations
  • n8n Orchestration: Powerful workflow automation for complex ML pipelines
  • Private Data Handling: Secure management of proprietary training datasets
  • Pay-Per-Use Model: Optimize costs with flexible VPS provider selection

Key Features

🚀 Training Orchestration

  • Automated provider selection based on cost and availability
  • Support for full fine-tuning, LoRA, and QLoRA
  • Real-time progress monitoring and metrics
  • Automatic checkpointing and recovery
  • Distributed training support

💰 Cost Management

  • Real-time cost tracking across all providers
  • Budget management with alerts and hard limits
  • Cost forecasting using ML models
  • Optimization recommendations
  • Detailed cost breakdowns and analytics

🔧 Provider Management

  • Unified interface for 6+ VPS providers
  • Automatic failover and retry logic
  • Health monitoring and alerting
  • Dynamic pricing and availability tracking
  • Spot instance support

📊 Comprehensive UI

  • Modern React dashboard with real-time updates
  • Visual workflow designer (n8n integration)
  • Cost analytics and visualization
  • Dataset and model management
  • Team collaboration features

🔐 Security & Privacy

  • End-to-end encryption for private data
  • RBAC and multi-tenant isolation
  • Secure credential management with Vault
  • Audit logging and compliance

📈 Monitoring & Observability

  • Prometheus metrics collection
  • Grafana dashboards
  • Real-time WebSocket updates
  • Comprehensive logging
  • Alert management

Documentation

Core Documentation

Quick Links

Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                     User Interface (React)                   │
│  Dashboard • Jobs • Models • Datasets • Cost • Providers    │
└─────────────────────────────────────────────────────────────┘
                           │
┌─────────────────────────────────────────────────────────────┐
│                      API Gateway (FastAPI)                   │
│  Authentication • Rate Limiting • Request Validation        │
└─────────────────────────────────────────────────────────────┘
                           │
┌─────────────────────────────────────────────────────────────┐
│                   n8n Orchestration Layer                    │
│  Job Management • Cost Tracking • Provider Selection        │
└─────────────────────────────────────────────────────────────┘
                           │
┌─────────────────────────────────────────────────────────────┐
│                    VPS Provider Layer                        │
│  RunPod • Vast.ai • Lambda • Modal • TensorDock • CoreWeave│
└─────────────────────────────────────────────────────────────┘

Technology Stack

Backend

  • Python 3.11+: Core API and services
  • FastAPI: REST API framework
  • PostgreSQL: Primary database
  • TimescaleDB: Time-series metrics
  • Redis: Caching and job queues
  • n8n: Workflow orchestration
  • MinIO: Object storage

Frontend

  • React 18: UI framework
  • TypeScript: Type safety
  • TailwindCSS: Styling
  • Recharts: Data visualization
  • React Query: State management
  • Socket.io: Real-time updates

Infrastructure

  • Docker: Containerization
  • Kubernetes: Orchestration
  • Prometheus: Metrics
  • Grafana: Visualization
  • Terraform: Infrastructure as code

Getting Started

Prerequisites

  • Docker 24.0+
  • Docker Compose 2.20+
  • Node.js 18+ (for local frontend development)
  • Python 3.11+ (for local backend development)

Quick Start

  1. Clone the repository
git clone https://github.com/your-org/ml-orchestration-platform.git
cd ml-orchestration-platform
  2. Configure environment
cp .env.example .env
# Edit .env with your API keys and configuration
  3. Start services
docker-compose up -d
  4. Access the platform at http://localhost:3000
  5. Create your first training job
  • Upload a dataset
  • Select a model
  • Configure training parameters
  • Submit the job and monitor progress

For detailed setup instructions, see DEPLOYMENT.md.

Supported Models

Language Models

| Model | Sizes | Use Case | Recommended GPU |
|---|---|---|---|
| Llama 2 | 7B, 13B, 70B | General purpose | A100, H100 |
| Mistral | 7B | Fast inference | RTX 4090, A100 |
| GPT-Neo / GPT-NeoX | 1.3B, 2.7B, 20B | Cost-effective | RTX 3090, A100 |
| Falcon | 7B, 40B, 180B | Long context | A100, H100 |
| Qwen | 1.8B, 7B, 14B, 72B | Multilingual | A100 |
| DeepSeek | 1.3B, 7B, 33B | Code generation | RTX 4090, A100 |
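The GPU recommendations above follow from a rule of thumb: the memory needed just to hold the weights scales linearly with parameter count. A minimal sketch (the helper name is ours, not part of the platform):

```python
# Rough VRAM needed to hold model weights only (excludes activations,
# optimizer state, and the KV cache, which add substantially more).
def weight_memory_gb(params_billions: float, bytes_per_param: float = 2) -> float:
    """fp16/bf16 = 2 bytes per parameter; int4 quantization ~= 0.5 bytes."""
    return params_billions * bytes_per_param

print(f"Llama 2 7B  fp16: {weight_memory_gb(7):.0f} GB")   # fits a single A100 40GB
print(f"Llama 2 70B fp16: {weight_memory_gb(70):.0f} GB")  # needs multiple GPUs
print(f"Llama 2 70B int4: {weight_memory_gb(70, 0.5):.0f} GB")
```

This is why 70B-class models in the table are paired with A100/H100 clusters while 7B models fit on a single consumer card.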

Training Methods

  • Full Fine-tuning: Complete model parameter updates
  • LoRA: Low-rank adaptation (efficient, recommended)
  • QLoRA: Quantized LoRA (most cost-effective)
  • Prefix Tuning: Lightweight adaptation
  • Adapter Layers: Modular fine-tuning
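Why LoRA is the recommended default: it freezes the base weights and trains only two small low-rank matrices per adapted layer. A back-of-the-envelope comparison, assuming a Llama-2-7B-style 4096x4096 attention projection (dimensions are illustrative):

```python
# LoRA replaces a full d_in x d_out weight update with two factors
# A (rank x d_in) and B (d_out x rank), so the trainable parameter
# count per adapted matrix is rank * (d_in + d_out).
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

full = 4096 * 4096                               # 16,777,216 params
lora = lora_trainable_params(4096, 4096, rank=8) # 65,536 params
print(f"LoRA trains {lora / full:.2%} of the full matrix")  # 0.39%
```

QLoRA pushes cost down further by holding the frozen base weights in 4-bit precision while training the same small adapters.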

VPS Providers

Supported Providers

| Provider | GPU Types | Billing | Price Range | Availability |
|---|---|---|---|---|
| RunPod | RTX 3090, 4090, A100, H100 | Per-second | $0.39-$4.99/hr | High |
| Vast.ai | Various (marketplace) | Per-hour | $0.10-$5.00/hr | High |
| Lambda Labs | RTX 6000, A100, H100 | Per-minute | $0.60-$13.20/hr | Medium |
| Modal | A100, H100 | Per-second | $0.0003-$0.0006/sec | High |
| TensorDock | RTX 3060-4090, A100 | Per-hour | $0.19-$2.39/hr | Medium |
| CoreWeave | A100, H100 | Per-hour | Enterprise pricing | Medium |

For detailed provider specifications, see VPS_PROVIDERS.md.

Cost Comparison

Example: Fine-tuning Llama 2 7B (LoRA)

| Provider | GPU | Duration | Cost | Notes |
|---|---|---|---|---|
| RunPod | 4x A100 | 4 hours | $21.60 | Spot available |
| Vast.ai | 4x A100 | 4 hours | $16.20 | Community hosts |
| Lambda Labs | 4x A100 | 4 hours | $17.60 | Reliable |
| Modal | 4x A100 | 4 hours | $19.44 | Serverless |

Estimated savings with optimization: 15-30% through automatic provider selection
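The totals in the table are simple GPU-count x hours x per-GPU-rate products. A sketch using the hourly rates implied by the rows above (the rates themselves are derived from the table, not quoted from any provider):

```python
# Job cost = number of GPUs * wall-clock hours * hourly rate per GPU.
def job_cost(gpus: int, hours: float, hourly_rate_per_gpu: float) -> float:
    return gpus * hours * hourly_rate_per_gpu

# Implied per-A100 rates: RunPod $1.35/hr, Vast.ai $1.0125/hr,
# Lambda $1.10/hr, Modal $1.215/hr (= $0.0003375/sec).
print(f"RunPod:  ${job_cost(4, 4, 1.35):.2f}")    # $21.60
print(f"Vast.ai: ${job_cost(4, 4, 1.0125):.2f}")  # $16.20
print(f"Lambda:  ${job_cost(4, 4, 1.10):.2f}")    # $17.60
print(f"Modal:   ${job_cost(4, 4, 1.215):.2f}")   # $19.44
```

Automatic provider selection is just picking the row that minimizes this product, subject to availability.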

For detailed cost analysis, see COST_MANAGEMENT.md.

Core Workflows

1. Training Job Lifecycle

Submit Job → Validate → Check Budget → Select Provider →
Provision Resources → Upload Data → Train → Monitor →
Save Checkpoints → Complete → Cleanup → Store Model
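The pipeline above can be modeled as an allowed-transition table, which is essentially what the n8n orchestration enforces. A minimal sketch (state names are ours, not the platform's internal identifiers):

```python
# Each state maps to the set of states a job may legally move into.
TRANSITIONS: dict[str, set[str]] = {
    "submitted":     {"validating"},
    "validating":    {"budget_check", "failed"},
    "budget_check":  {"provisioning", "rejected"},
    "provisioning":  {"uploading", "failed"},
    "uploading":     {"training", "failed"},
    "training":      {"checkpointing", "failed"},
    "checkpointing": {"training", "completed"},  # resume or finish
    "completed":     {"cleanup"},
    "cleanup":       {"stored"},
}

def can_transition(state: str, nxt: str) -> bool:
    """Reject any jump not listed for the current state."""
    return nxt in TRANSITIONS.get(state, set())
```

Encoding the lifecycle this way makes checkpoint recovery explicit: a job may bounce between training and checkpointing any number of times, but can never skip validation or the budget check.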

2. Cost Tracking

Job Starts → Track Cost (every 30s) → Check Budget →
Alert if Threshold → Stop if Hard Limit → Finalize Cost
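The decision each 30-second tick makes can be sketched as follows (the threshold values and function name are illustrative, not the platform's actual config):

```python
def budget_action(current_cost: float, soft_limit: float, hard_limit: float) -> str:
    """What the periodic cost-tracking tick should do for one job."""
    if current_cost >= hard_limit:
        return "stop_job"   # hard limit: terminate and finalize cost
    if current_cost >= soft_limit:
        return "alert"      # soft threshold: notify, keep running
    return "continue"

print(budget_action(12.50, soft_limit=80, hard_limit=100))  # continue
print(budget_action(85.00, soft_limit=80, hard_limit=100))  # alert
```

The important ordering is that the hard limit is checked first, so a job that blows past both thresholds between ticks is stopped rather than merely alerted on.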

3. Provider Health Monitoring

Schedule (5 min) → Check API Health → Check Availability →
Calculate Score → Update Status → Alert if Down
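The provider score might combine the health signals like this. The weights and inputs are assumptions for illustration, not the platform's actual formula:

```python
def health_score(api_ok: bool, availability: float, error_rate: float,
                 w_avail: float = 0.6, w_err: float = 0.4) -> float:
    """Score in [0, 1]; 0 means the provider is down.

    availability: fraction of requested GPU types currently obtainable.
    error_rate:   fraction of recent API calls that failed.
    """
    if not api_ok:
        return 0.0  # API unreachable: hard zero, triggers the down alert
    return w_avail * availability + w_err * (1.0 - error_rate)

print(health_score(True, availability=0.9, error_rate=0.05))
print(health_score(False, availability=1.0, error_rate=0.0))  # 0.0
```

Provider selection can then rank healthy providers by score and fall back down the list, which is what the automatic failover bullet under Provider Management describes.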

For complete workflow specifications, see N8N_WORKFLOWS.md.

API Examples

Create Training Job

import requests

response = requests.post(
    'http://localhost:8000/v1/jobs',
    headers={'Authorization': 'Bearer YOUR_TOKEN'},
    json={
        'name': 'llama-2-7b-custom',
        'model_name': 'meta-llama/Llama-2-7b-hf',
        'dataset_id': 'ds_123',
        'training_config': {
            'training_type': 'lora',
            'epochs': 10,
            'batch_size': 4,
            'learning_rate': 2e-5
        }
    }
)

job = response.json()
print(f"Job created: {job['id']}")

Monitor Progress

import { io } from 'socket.io-client';

const socket = io('ws://localhost:8000', {
  auth: { token: 'YOUR_TOKEN' }
});

socket.on('job:status', (data) => {
  console.log(`Progress: ${data.progress}%`);
  console.log(`Cost: $${data.current_cost}`);
});

For complete API documentation, see API_SPECIFICATIONS.md.

Use Cases

1. Customer Support Chatbot

Fine-tune Llama 2 7B on your support conversations to create a custom chatbot that understands your products and policies.

  • Dataset: 50K customer conversations
  • Model: Llama 2 7B with LoRA
  • Cost: ~$20 per training run
  • Time: 4-6 hours

2. Code Generation

Adapt DeepSeek Coder to your company's coding standards and internal APIs.

  • Dataset: 100K code samples
  • Model: DeepSeek Coder 7B
  • Cost: ~$25 per training run
  • Time: 5-7 hours

3. Domain-Specific Q&A

Create a specialized model for medical, legal, or financial Q&A.

  • Dataset: Domain-specific documents
  • Model: Mistral 7B with LoRA
  • Cost: ~$18 per training run
  • Time: 3-5 hours

4. Multilingual Translation

Fine-tune Qwen for high-quality translation in specific language pairs.

  • Dataset: Parallel text corpus
  • Model: Qwen 7B
  • Cost: ~$22 per training run
  • Time: 4-6 hours

Roadmap

Phase 1: Core Platform (Q1 2025) ✓

  • Multi-provider integration
  • Cost tracking and management
  • n8n workflow orchestration
  • Basic UI
  • API development

Phase 2: Advanced Features (Q2 2025)

  • Distributed training support
  • Model marketplace
  • Advanced optimization algorithms
  • Team collaboration features
  • Enhanced security features

Phase 3: Enterprise Features (Q3 2025)

  • Multi-region deployment
  • Advanced analytics
  • Custom model serving
  • SLA guarantees
  • White-label options

Phase 4: ML Operations (Q4 2025)

  • AutoML integration
  • Hyperparameter optimization
  • A/B testing framework
  • Model versioning and rollback
  • Production deployment automation

Performance Metrics

Platform Performance

  • API Response Time: <200ms (p95)
  • Job Submission: <5 seconds
  • Real-time Updates: <1 second latency
  • System Uptime: 99.9% target
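The p95 figure means 95% of requests complete within the quoted time; only the slowest 5% exceed it. A nearest-rank computation for reference (one common convention; monitoring stacks like Prometheus interpolate differently):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of
    observations at or below it."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

latencies_ms = [120, 90, 150, 180, 210, 95, 130, 160, 175, 199]
print(f"p95: {percentile(latencies_ms, 95)} ms")
```

Percentiles are preferred over averages for latency targets because a handful of very slow requests can leave the mean looking healthy while many users see poor response times.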

Training Performance

  • Llama 2 7B LoRA: 4-6 hours (4x A100)
  • Mistral 7B: 3-5 hours (4x A100)
  • Cost Optimization: 15-30% savings vs single provider

Security

  • Encryption: AES-256 at rest, TLS 1.3 in transit
  • Authentication: JWT with refresh tokens
  • Authorization: Role-based access control (RBAC)
  • Compliance: GDPR, SOC 2 ready
  • Audit: Comprehensive logging
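The JWT scheme above, with the HS256 algorithm, reduces to an HMAC-SHA256 over the base64url-encoded header and payload. A stdlib-only sketch of the mechanics; a real deployment should use a vetted library such as PyJWT rather than hand-rolled token code:

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    # JWT uses unpadded base64url encoding
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: bytes) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    sig = _b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token: str, secret: bytes) -> bool:
    header, body, sig = token.split(".")
    expected = _b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    # constant-time comparison prevents timing attacks on the signature
    return hmac.compare_digest(sig, expected)
```

Refresh tokens exist because access tokens signed this way cannot be revoked before they expire; keeping access-token lifetimes short and rotating refresh tokens server-side limits the blast radius of a leaked token.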

Contributing

We welcome contributions! Please see our contributing guidelines:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Acknowledgments

This project integrates with and is inspired by many amazing tools:

  • n8n: Workflow automation
  • Hugging Face: Model hub and transformers
  • Ray: Distributed computing
  • Weights & Biases: Experiment tracking
  • All the VPS providers who make GPU compute accessible

Team

  • Architecture: [Your Name]
  • Backend: [Developer Names]
  • Frontend: [Developer Names]
  • DevOps: [Developer Names]

Built with ❤️ for the ML community

Making fine-tuning accessible, affordable, and automated.
