# 🚀 TrainForge External GPU Worker

## Connect Google Colab GPU to Your Local TrainForge Instance

This notebook allows you to use Google Colab's free GPU for training models managed by your local TrainForge instance.

---

### 📋 Prerequisites:
1. ✅ **Local Machine**: TrainForge API running on `localhost:3000`
2. ✅ **ngrok**: Tunnel exposing your API to the internet
3. ✅ **Google Colab**: GPU runtime enabled (Runtime → Change runtime type → GPU)

### 🎯 What this notebook does:
- Connects to your local TrainForge via ngrok
- Registers as an external GPU worker
- Polls for training jobs
- Executes jobs using Colab's GPU
- Streams logs and results back to your API

---

## Step 1: Enable GPU Runtime

**⚠️ IMPORTANT: Make sure you've enabled GPU!**

1. Go to: **Runtime → Change runtime type**
2. Select: **GPU** (T4, V100, or A100)
3. Click: **Save**

Then run the cell below to verify GPU is available.

In [None]:
import torch
import subprocess

print("="*60)
print("🔍 Checking GPU Availability")
print("="*60)

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory // 1024**3
    print(f"\n✅ GPU Available: {gpu_name}")
    print(f"💾 GPU Memory: {gpu_memory}GB")
    print(f"🔧 CUDA Version: {torch.version.cuda}")
    print(f"🐍 PyTorch Version: {torch.__version__}")
    print("\n📊 nvidia-smi Output:")
    print("="*60)
    # Capture the output and print it so it reliably appears in the cell output
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    print(result.stdout or result.stderr)
    print("\n✅ GPU check passed! You can proceed.")
else:
    print("\n❌ No GPU available!")
    print("⚠️ Please enable GPU: Runtime → Change runtime type → GPU")
    print("⚠️ Then restart the runtime")

## Step 2: Setup ngrok Tunnel

### On Your Local Machine:

#### Option A: Using ngrok (Recommended)

```bash
# 1. Start TrainForge API
cd d:/capstone/trainforge/api
npm start

# 2. In another terminal, start ngrok
ngrok http 3000

# 3. Copy the HTTPS URL shown
# Example: https://abc123-def-456.ngrok.io
```

#### Option B: Using localtunnel

```bash
# 1. Install localtunnel
npm install -g localtunnel

# 2. Start tunnel
lt --port 3000

# 3. Copy the URL provided
```

### Test Your Tunnel:

Run this cell to test if your ngrok tunnel is working:

In [None]:
import requests

# Enter your ngrok URL here
API_URL = input("Enter your TrainForge API URL (from ngrok): ").strip()

if not API_URL:
    print("❌ Please enter a valid API URL")
    print("💡 Example: https://abc123.ngrok.io")
else:
    # Clean up URL: add a scheme if missing and drop any trailing slash
    if not API_URL.startswith('http'):
        API_URL = f'https://{API_URL}'
    API_URL = API_URL.rstrip('/')

    print(f"\n🔍 Testing connection to: {API_URL}")
    
    try:
        headers = {
            'ngrok-skip-browser-warning': 'true',
            'User-Agent': 'TrainForge-Worker/1.0'
        }
        response = requests.get(f"{API_URL}/health", headers=headers, timeout=10)
        
        if response.status_code == 200:
            data = response.json()
            print("\n✅ Connection successful!")
            print(f"   API Status: {data.get('status', 'unknown')}")
            print(f"   Database: {data.get('database', 'unknown')}")
            print(f"   Version: {data.get('version', 'unknown')}")
            print("\n🎉 You're ready to start the worker!")
        else:
            print(f"\n❌ Connection failed: HTTP {response.status_code}")
            print(f"Response: {response.text[:200]}")
    except requests.exceptions.Timeout:
        print("\n❌ Connection timeout - ngrok tunnel might be slow or down")
    except requests.exceptions.ConnectionError:
        print("\n❌ Connection error - ngrok tunnel might be down")
        print("\n📋 Troubleshooting:")
        print("   1. Make sure ngrok is running on your local machine")
        print("   2. Check the ngrok URL is correct")
        print("   3. Visit http://127.0.0.1:4040 to see ngrok status")
    except Exception as e:
        print(f"\n❌ Error: {e}")
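
The URL clean-up in the cell above can be factored into a small helper so that Steps 2, 5, and 6 all treat the URL the same way. The sketch below is illustrative only: `normalize_api_url` is a hypothetical name, not part of TrainForge, and it assumes the API should be reached over HTTPS when no scheme is given.

```python
def normalize_api_url(raw: str) -> str:
    """Return a cleaned base URL: trimmed, with a scheme, without a trailing slash.

    Hypothetical helper for this notebook; not part of TrainForge.
    """
    url = raw.strip()
    if not url:
        raise ValueError("API URL is empty")
    # Only accept an explicit http:// or https:// prefix as a scheme
    if not url.lower().startswith(("http://", "https://")):
        url = f"https://{url}"
    # A trailing slash would otherwise produce paths such as "//health"
    return url.rstrip("/")


print(normalize_api_url(" abc123.ngrok.io/ "))
```

With a helper like this, each cell could call `normalize_api_url(input(...))` instead of repeating the `startswith('http')` check.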

## Step 3: Install Dependencies

Install required packages for the TrainForge worker.

In [None]:
# Install required packages
!pip install -q requests torch

print("✅ Dependencies installed successfully!")

## Step 4: Download TrainForge Worker

Download the worker script from your repository or paste it directly.

In [None]:
# Option 1: Download from GitHub (replace with your actual URL)
# !wget https://raw.githubusercontent.com/YOUR-REPO/trainforge/main/external-gpu/colab_worker_complete.py -O colab_worker.py

# Option 2: Upload from your computer
from google.colab import files
print("📤 Please upload the colab_worker_complete.py file")
uploaded = files.upload()

# Rename to colab_worker.py
import shutil
if 'colab_worker_complete.py' in uploaded:
    shutil.move('colab_worker_complete.py', 'colab_worker.py')
    print("✅ Worker script uploaded successfully!")
else:
    print("⚠️ Please upload colab_worker_complete.py")
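
Before starting the worker, it can help to confirm that the uploaded file actually defines the `ColabGPUWorker` class that Step 5 imports. The following is a minimal sketch using the standard `ast` module; `defines_class` is a hypothetical helper name, and the check only looks at top-level class definitions without executing the script.

```python
import ast


def defines_class(source: str, class_name: str) -> bool:
    """Return True if `source` contains a top-level class named `class_name`."""
    tree = ast.parse(source)  # raises SyntaxError if the file is not valid Python
    return any(
        isinstance(node, ast.ClassDef) and node.name == class_name
        for node in tree.body
    )


# Demo on an inline string; in the notebook you would read colab_worker.py instead
sample = "class ColabGPUWorker:\n    pass\n"
print(defines_class(sample, "ColabGPUWorker"))
```

In the notebook this could be called as `defines_class(open('colab_worker.py').read(), 'ColabGPUWorker')` right after the upload.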

## Step 5: Start TrainForge Worker

**⚠️ IMPORTANT:**
- This cell will run continuously
- Keep it running to maintain worker connection
- The worker will poll for jobs every 5 seconds
- Stop the cell (⏹️) when you want to disconnect

### What happens:
1. Worker connects to your TrainForge API
2. Registers as an external GPU worker
3. Polls for pending jobs
4. When a job arrives:
   - Downloads project files
   - Installs dependencies
   - Executes training
   - Streams logs back
   - Uploads results
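
The polling behaviour described above can be sketched as a simple loop. This is not the actual `ColabGPUWorker` implementation, which this notebook does not show. `poll_for_jobs`, `fetch_job`, and `run_job` are hypothetical names, and the job source is stubbed so the example runs without a network connection.

```python
import time


def poll_for_jobs(fetch_job, run_job, interval=5.0, max_polls=None, sleep=time.sleep):
    """Call fetch_job() repeatedly; run each job it returns, otherwise wait.

    Returns the number of jobs that were run. With max_polls=None the loop
    runs until interrupted, which mirrors a cell that is kept running.
    """
    completed = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        job = fetch_job()
        if job is None:
            sleep(interval)  # nothing pending: wait before asking again
            continue
        run_job(job)
        completed += 1
    return completed


# Stubbed demo: two jobs with an empty poll in between, and no real sleeping
queue = [{"job_id": "demo-1"}, None, {"job_id": "demo-2"}]
ran = []
count = poll_for_jobs(
    fetch_job=lambda: queue.pop(0) if queue else None,
    run_job=lambda job: ran.append(job["job_id"]),
    max_polls=4,
    sleep=lambda seconds: None,
)
print(count, ran)
```

In a real worker, `fetch_job` would presumably issue an HTTP request to the TrainForge API and `run_job` would cover the download, install, train, and upload steps listed above.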

### Run this cell and keep it running!

In [None]:
# Start the TrainForge worker
import sys

# Make sure we have the API URL
if 'API_URL' not in locals() or not API_URL:
    API_URL = input("Enter your TrainForge API URL: ").strip()
    if not API_URL.startswith('http'):
        API_URL = f'https://{API_URL}'

print("="*60)
print("🚀 Starting TrainForge Worker")
print("="*60)
print(f"📡 API URL: {API_URL}")
print(f"💻 Worker ID: colab-{int(__import__('time').time())}")
print("\n⚠️ Keep this cell running!")
print("⚠️ Worker will poll for jobs every 5 seconds")
print("⚠️ Press ⏹️ to stop the worker\n")
print("="*60)

# Load and run the worker
try:
    # Import the worker class
    sys.path.append('/content')
    from colab_worker import ColabGPUWorker
    
    # Create worker instance
    worker = ColabGPUWorker(API_URL)
    
    # Start the worker
    worker.start()
    
except KeyboardInterrupt:
    print("\n⚠️ Worker stopped by user")
except ImportError:
    # A missing colab_worker.py surfaces as ImportError (ModuleNotFoundError), not FileNotFoundError
    print("❌ Worker script not found!")
    print("Please run Step 4 to upload the worker script")
except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()

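## 📊 Step 6: Monitor Worker Status (Optional)

Use this cell to check worker status. A Colab runtime executes one cell at a time, so it will not run while the Step 5 cell is busy in the same notebook; copy it into a separate notebook (it prompts for the API URL) or run it before starting the worker.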

In [None]:
import requests
import json
from datetime import datetime

if 'API_URL' not in locals():
    API_URL = input("Enter your TrainForge API URL: ").strip()
    if not API_URL.startswith('http'):
        API_URL = f'https://{API_URL}'

headers = {
    'ngrok-skip-browser-warning': 'true',
    'User-Agent': 'TrainForge-Monitor/1.0'
}

print("="*60)
print(f"📊 TrainForge Status - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("="*60)

try:
    # Check API health
    health = requests.get(f"{API_URL}/health", headers=headers, timeout=10)
    if health.status_code == 200:
        data = health.json()
        print(f"\n✅ API Health: {data.get('status', 'unknown')}")
        print(f"   Database: {data.get('database', 'unknown')}")
        print(f"   Uptime: {data.get('uptime', 0)} seconds")
    
    # Check workers
    print("\n" + "="*60)
    print("👷 Active Workers:")
    print("="*60)
    workers = requests.get(f"{API_URL}/api/workers", headers=headers, timeout=10)
    if workers.status_code == 200:
        worker_list = workers.json()
        if worker_list:
            for w in worker_list:
                worker_id = w.get('worker_id', 'Unknown')
                status = w.get('status', 'Unknown')
                worker_type = w.get('worker_type', 'Unknown')
                location = w.get('location', 'Unknown')
                
                caps = w.get('capabilities', {})
                gpu_count = caps.get('gpu_count', 0)
                
                print(f"\n  🔧 {worker_id}")
                print(f"     Status: {status}")
                print(f"     Type: {worker_type}")
                print(f"     Location: {location}")
                print(f"     GPUs: {gpu_count}")
                
                gpu_info = caps.get('gpu_info', {})
                if gpu_info:
                    print(f"     GPU: {gpu_info.get('name', 'Unknown')} ({gpu_info.get('memory_gb', 0)}GB)")
        else:
            print("\n  ⚠️ No active workers")

    # Check pending jobs
    print("\n" + "="*60)
    print("📋 Pending Jobs:")
    print("="*60)
    jobs = requests.get(f"{API_URL}/api/jobs/pending", headers=headers, timeout=10)
    if jobs.status_code == 200:
        job_list = jobs.json()
        if job_list:
            for job in job_list:
                job_id = job.get('job_id', 'Unknown')
                project = job.get('project_name', 'Unknown')
                status = job.get('status', 'Unknown')
                
                print(f"\n  📦 {job_id}")
                print(f"     Project: {project}")
                print(f"     Status: {status}")
        else:
            print("\n  ✅ No pending jobs")

    print("\n" + "="*60)

except Exception as e:
    print(f"\n❌ Error checking status: {e}")

## 🎯 Usage Tips

### Submitting Jobs from Your Local Machine:

```bash
# Using TrainForge CLI
trainforge submit --project my-model --config trainforge.yaml

# Or using API directly
curl -X POST http://localhost:3000/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"project_name": "my-model", "training_script": "train.py"}'
```
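
For scripted submissions, the same request can be built in Python. The sketch below uses only the standard library and mirrors the `curl` call above (same endpoint and JSON fields); `build_job_request` is a hypothetical helper name, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_job_request(api_url: str, project_name: str, training_script: str) -> urllib.request.Request:
    """Build a POST request for /api/jobs with the same JSON body as the curl example."""
    payload = json.dumps(
        {"project_name": project_name, "training_script": training_script}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/api/jobs",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_job_request("http://localhost:3000", "my-model", "train.py")
print(req.get_method(), req.full_url)

# To actually submit (requires the TrainForge API to be running):
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status, resp.read().decode())
```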

### Project Structure:
```
my-project/
├── train.py          # Main training script
├── requirements.txt  # Dependencies
├── data/            # Data files
└── config.yaml      # Configuration
```

### Important Notes:
- ⏰ Colab sessions time out after roughly 12-24 hours, and idle sessions can be disconnected sooner
- 💾 Save checkpoints frequently
- 🔄 Worker auto-installs requirements.txt
- 📊 All logs are streamed to your API
- 🎯 Results are uploaded when training completes
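
Because a session can end mid-training, checkpoints are only useful if a partially written file never replaces a good one. The following is a framework-agnostic sketch of an atomic save and resume pattern; `save_checkpoint` and `load_checkpoint` are hypothetical helper names, and `pickle` stands in for whatever serializer your training script uses (with PyTorch, `torch.save` and `torch.load` would take its place).

```python
import os
import pickle
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write `state` to a temp file in the target directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or `default` when no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


# Demo: resume from the last saved epoch and checkpoint after every epoch
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
state = load_checkpoint(ckpt_path, {"epoch": 0})
for epoch in range(state["epoch"], 3):
    # ... one epoch of training would run here ...
    save_checkpoint({"epoch": epoch + 1}, ckpt_path)
print(load_checkpoint(ckpt_path, {"epoch": 0}))
```

Files on the Colab runtime's local disk are lost when the session ends, so the checkpoint path should point at storage that outlives the session.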

### Running Multiple Workers:
- Open multiple Colab notebooks
- Run this notebook in each
- Get multiple GPUs working in parallel!
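
When several notebooks start at about the same time, a timestamp-only ID such as the `colab-<seconds>` value printed in Step 5 could collide. A minimal sketch of a more collision-resistant ID follows; `make_worker_id` is a hypothetical helper, and whether `ColabGPUWorker` accepts a custom ID is not shown in this notebook, so treat this as an illustration only.

```python
import time
import uuid


def make_worker_id(prefix: str = "colab") -> str:
    """Combine a prefix, the start time in seconds, and 8 random hex characters."""
    return f"{prefix}-{int(time.time())}-{uuid.uuid4().hex[:8]}"


first, second = make_worker_id(), make_worker_id()
print(first)
print(second)
```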

---

## 🆘 Troubleshooting

### Worker can't connect:
- Check ngrok is still running
- Visit http://127.0.0.1:4040 for ngrok status
- Test: `curl http://localhost:3000/health`
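
Tunnels can drop briefly, so a connectivity check may succeed on a second or third attempt. The sketch below shows one way to retry with exponential backoff; `retry_with_backoff` is a hypothetical helper, not something the TrainForge worker is known to provide, and the flaky function is a stand-in for a real health request.

```python
import time


def retry_with_backoff(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds or attempts run out, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stubbed demo: fails twice, then succeeds; delays are recorded instead of slept
delays = []
calls = {"count": 0}


def flaky_health_check():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("tunnel not ready")
    return "ok"


result = retry_with_backoff(flaky_health_check, sleep=delays.append)
print(result, delays)
```

In practice it would be safer to catch only network errors (for example `requests.exceptions.RequestException`) rather than every `Exception`.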

### No GPU:
- Runtime → Change runtime type → GPU
- Restart runtime

### Jobs not appearing:
- Check job is submitted: `curl http://localhost:3000/api/jobs`
- Verify worker is registered: `curl http://localhost:3000/api/workers`

---

**Happy Training! üöÄ**