# SparkTrainer Quick Start Tutorial

This notebook provides a hands-on introduction to SparkTrainer, walking you through the basics of:

1. Connecting to the SparkTrainer API
2. Listing available models, recipes, and datasets
3. Submitting a simple training job
4. Monitoring job progress
5. Retrieving trained models

## Prerequisites

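- A running SparkTrainer instance (for example, started with `docker-compose up -d`)
- Python 3.8+
- `requests` library installed (the first code cell also installs `pandas`, `matplotlib`, and `tqdm`, which later cells use)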

Let's get started!

In [None]:
# Install required dependencies
!pip install requests pandas matplotlib tqdm

In [None]:
import requests
import json
import time
import pandas as pd
import matplotlib.pyplot as plt
from typing import Dict, Any
from tqdm.notebook import tqdm

# SparkTrainer API configuration
SPARKTRAINER_URL = "http://localhost:5000"
API_BASE = f"{SPARKTRAINER_URL}/api"

print(f"SparkTrainer API: {API_BASE}")
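
If your instance is not at `http://localhost:5000`, one option is to read the base URL from an environment variable instead of editing the cell. The sketch below assumes a variable named `SPARKTRAINER_URL`; that name is a convention chosen for this example, not something SparkTrainer itself is known to define. `build_api_base` is likewise a helper introduced here.

```python
import os

DEFAULT_URL = "http://localhost:5000"  # Same default as the cell above


def build_api_base(base_url: str) -> str:
    """Return the API root for a server URL, tolerating a trailing slash."""
    return f"{base_url.rstrip('/')}/api"


# SPARKTRAINER_URL is a hypothetical environment variable name used for illustration
SPARKTRAINER_URL = os.environ.get("SPARKTRAINER_URL", DEFAULT_URL)
API_BASE = build_api_base(SPARKTRAINER_URL)
print(f"SparkTrainer API: {API_BASE}")
```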

## 1. Health Check

Let's verify that SparkTrainer is running and healthy.

In [None]:
def check_health():
    """Check SparkTrainer health status"""
    try:
        # A timeout keeps the cell from hanging if the server never responds
        response = requests.get(f"{API_BASE}/health", timeout=5)
        if response.status_code == 200:
            print("✅ SparkTrainer is healthy and running!")
            return True
        else:
            print(f"⚠️ Health check returned status: {response.status_code}")
            return False
    except requests.exceptions.ConnectionError:
        print("❌ Cannot connect to SparkTrainer. Is it running?")
        print("Run: docker-compose up -d")
        return False
    except requests.exceptions.Timeout:
        print("❌ SparkTrainer did not respond within 5 seconds.")
        return False

check_health()
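
Right after `docker-compose up -d`, the API may need a few seconds before it accepts connections, so a single check can fail spuriously. A minimal sketch of a retry loop follows. `wait_until_healthy` and its parameters are names introduced here for illustration; it accepts any zero-argument function that returns `True` when the service is up, such as `check_health` above.

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 5,
                       delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:  # No need to wait after the final failed attempt
            sleep(delay)
    return False
```

In this notebook it could be called as `wait_until_healthy(check_health)`. The `sleep` parameter exists only so the wait can be replaced in tests.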

## 2. System Information

Let's check available GPU resources and system capabilities.

In [None]:
def get_system_info():
    """Get system information including GPU details"""
    response = requests.get(f"{API_BASE}/system/info", timeout=10)
    response.raise_for_status()  # Fail on 4xx/5xx instead of parsing an error body
    return response.json()

system_info = get_system_info()

print("System Resources:")
print("=" * 50)
print(f"GPUs: {system_info.get('gpu', {}).get('count', 'N/A')}")
print(f"GPU Models: {system_info.get('gpu', {}).get('models', [])}")
print(f"Total Memory: {system_info.get('memory', {}).get('total_gb', 'N/A')} GB")
print(f"Available Memory: {system_info.get('memory', {}).get('available_gb', 'N/A')} GB")

## 3. Browse Available Models

Let's see what base models are available for training.

In [None]:
def list_base_models():
    """List all available base models"""
    response = requests.get(f"{API_BASE}/base-models")
    return response.json()

models = list_base_models()

# Convert to DataFrame for better visualization
models_df = pd.DataFrame(models)
print(f"\nFound {len(models_df)} available models\n")
print(models_df[['name', 'family', 'modality', 'parameters', 'trainable']].head(10))
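
Before picking a `base_model` for a job, it can help to narrow the list to models that can actually be fine-tuned. The sketch below filters the parsed JSON directly. It assumes the response is a list of dictionaries carrying the `trainable` and `modality` fields selected in the cell above; `filter_models` is a helper name introduced here, and the sample records are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def filter_models(models: List[Dict[str, Any]],
                  modality: Optional[str] = None) -> List[Dict[str, Any]]:
    """Keep models flagged as trainable, optionally restricted to one modality."""
    selected = []
    for model in models:
        if not model.get("trainable", False):
            continue  # Skip models that lack the flag or have it set to False
        if modality is not None and model.get("modality") != modality:
            continue
        selected.append(model)
    return selected


# Made-up records in the assumed shape, standing in for list_base_models()
sample_models = [
    {"name": "model-a", "modality": "text", "trainable": True},
    {"name": "model-b", "modality": "vision", "trainable": True},
    {"name": "model-c", "modality": "text", "trainable": False},
]
text_models = filter_models(sample_models, modality="text")
print([m["name"] for m in text_models])
```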

## 4. List Available Recipes

Training recipes define how to train models (LoRA, QLoRA, full fine-tuning, etc.).

In [None]:
def list_recipes():
    """List all available training recipes"""
    response = requests.get(f"{API_BASE}/recipes")
    return response.json()

recipes = list_recipes()

print("\nAvailable Training Recipes:")
print("=" * 50)
for recipe in recipes:
    print(f"📝 {recipe['name']}: {recipe.get('description', 'No description')}")

## 5. List Datasets

Check what datasets are already loaded and available for training.

In [None]:
def list_datasets():
    """List all available datasets"""
    response = requests.get(f"{API_BASE}/datasets")
    return response.json()

datasets = list_datasets()

if datasets:
    datasets_df = pd.DataFrame(datasets)
    print(f"\nFound {len(datasets_df)} datasets\n")
    print(datasets_df[['name', 'type', 'size', 'created_at']].head())
else:
    print("No datasets found. Upload a dataset using the UI or API.")

## 6. Create a Training Job

Now let's submit a simple LoRA training job!

In [None]:
def create_job(config: Dict[str, Any]):
    """Create a new training job"""
    response = requests.post(f"{API_BASE}/jobs", json=config, timeout=30)
    response.raise_for_status()  # Surface rejected configs as an HTTP error
    return response.json()

# Example job configuration
job_config = {
    "name": "my-first-lora-training",
    "recipe": "lora_qlora",
    "base_model": "meta-llama/Llama-2-7b-hf",  # Adjust to available model
    "dataset": "my-dataset",  # Replace with your dataset name
    "hyperparameters": {
        "learning_rate": 2e-4,
        "num_epochs": 3,
        "batch_size": 4,
        "lora_r": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.05
    },
    "resources": {
        "gpu_count": 1
    }
}

# Submit the job
try:
    job = create_job(job_config)
    print("✅ Job created successfully!")
    print(f"Job ID: {job['id']}")
    print(f"Status: {job['status']}")
    
    # Store job ID for monitoring
    JOB_ID = job['id']
except Exception as e:
    print(f"❌ Error creating job: {e}")
    JOB_ID = None
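
The `recipe`, `base_model`, and `dataset` values have to match what the server knows about, and a typo is only reported after the request is sent. One way to catch it earlier is to compare the configuration against the lists fetched in sections 3 to 5. This is a sketch under assumptions: `find_config_problems` is a hypothetical helper, each listing is taken to be a list of dictionaries with a `name` field, and the API is assumed to match these three values against `name` rather than an ID. Verify that against your instance before relying on it.

```python
from typing import Any, Dict, List


def find_config_problems(config: Dict[str, Any],
                         recipes: List[Dict[str, Any]],
                         models: List[Dict[str, Any]],
                         datasets: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    checks = [
        ("recipe", recipes),
        ("base_model", models),
        ("dataset", datasets),
    ]
    problems = []
    for key, available in checks:
        # Assumption: entries are matched by their 'name' field
        known_names = {item.get("name") for item in available}
        value = config.get(key)
        if value is None:
            problems.append(f"missing field '{key}'")
        elif value not in known_names:
            problems.append(f"{key} '{value}' not found on the server")
    return problems
```

With the variables from the earlier cells, the call would be `find_config_problems(job_config, recipes, models, datasets)`, run before `create_job`.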

## 7. Monitor Job Progress

Let's poll the job's status every few seconds until it completes, fails, is cancelled, or the check limit is reached.

In [None]:
def get_job_status(job_id: str):
    """Get current job status"""
    response = requests.get(f"{API_BASE}/jobs/{job_id}")
    return response.json()

def monitor_job(job_id: str, check_interval: int = 5, max_checks: int = 100):
    """Poll the job until it reaches a terminal state or max_checks is exhausted"""
    status = {}  # Returned as-is if max_checks is 0
    pbar = tqdm(total=100, desc="Training Progress")

    try:
        for _ in range(max_checks):
            status = get_job_status(job_id)

            current_status = status.get('status', 'unknown')
            progress = status.get('progress') or 0  # Treat missing or null progress as 0

            pbar.n = progress
            pbar.set_description(f"Status: {current_status}")
            pbar.refresh()

            if current_status in ['completed', 'failed', 'cancelled']:
                break

            time.sleep(check_interval)
    finally:
        pbar.close()  # Always release the progress bar, even on errors or interrupts

    return status

if JOB_ID:
    print(f"Monitoring job {JOB_ID}...\n")
    final_status = monitor_job(JOB_ID)
    print(f"\n{'='*50}")
    print(f"Final Status: {final_status['status']}")
    print(f"{'='*50}")
else:
    print("No job to monitor. Create a job first.")
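
Polling every 5 seconds for at most 100 checks gives up after roughly eight minutes, which is short for many fine-tuning runs. A variation is to lengthen the interval gradually and bound the total wait instead. The following is a sketch, not part of SparkTrainer: `poll_with_backoff`, its parameters, and the `timed_out` key it adds are all introduced here. The status function is passed in so that `get_job_status` from the cell above can be reused.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATES = {"completed", "failed", "cancelled"}  # Same states as monitor_job


def poll_with_backoff(fetch_status: Callable[[], Dict[str, Any]],
                      initial_delay: float = 5.0,
                      max_delay: float = 60.0,
                      factor: float = 1.5,
                      max_wait: float = 3600.0,
                      sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll until a terminal state is seen or about max_wait seconds have been slept."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status.get("status") in TERMINAL_STATES:
            return status
        if waited >= max_wait:
            # Mark the timeout explicitly rather than returning a stale status silently
            return {**status, "timed_out": True}
        sleep(delay)
        waited += delay
        delay = min(delay * factor, max_delay)  # Grow the interval up to the cap
```

A call in this notebook would look like `poll_with_backoff(lambda: get_job_status(JOB_ID))`. Only time spent sleeping is counted toward `max_wait`, so the real elapsed time is slightly longer.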

## 8. View Training Metrics

Let's plot the training loss and, when it is reported, the learning-rate schedule.

In [None]:
def get_job_metrics(job_id: str):
    """Get job training metrics"""
    response = requests.get(f"{API_BASE}/jobs/{job_id}/metrics")
    return response.json()

if JOB_ID:
    metrics = get_job_metrics(JOB_ID)
    
    # Plot training loss
    if 'train_loss' in metrics:
        fig, axes = plt.subplots(1, 2, figsize=(15, 5))
        
        # Plot 1: Training Loss
        axes[0].plot(metrics['train_loss'], label='Training Loss', color='blue')
        axes[0].set_xlabel('Step')
        axes[0].set_ylabel('Loss')
        axes[0].set_title('Training Loss Over Time')
        axes[0].legend()
        axes[0].grid(True, alpha=0.3)
        
        # Plot 2: Learning Rate
        if 'learning_rate' in metrics:
            axes[1].plot(metrics['learning_rate'], label='Learning Rate', color='green')
            axes[1].set_xlabel('Step')
            axes[1].set_ylabel('Learning Rate')
            axes[1].set_title('Learning Rate Schedule')
            axes[1].legend()
            axes[1].grid(True, alpha=0.3)
        else:
            axes[1].set_visible(False)  # Hide the unused panel instead of showing empty axes
        
        plt.tight_layout()
        plt.show()
    else:
        print("No training metrics available yet.")
else:
    print("No job to retrieve metrics from.")
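
Per-step training loss is usually noisy, which can hide the overall trend in the plot. A common remedy is to overlay an exponential moving average. The sketch below assumes `metrics['train_loss']` is a flat list of numbers, as the plotting code above implies; `smooth_ema` and the default `alpha` are choices made for this example.

```python
from typing import List, Sequence


def smooth_ema(values: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average; smaller alpha gives a smoother curve."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed: List[float] = []
    for value in values:
        if not smoothed:
            smoothed.append(float(value))  # Seed the average with the first point
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed
```

To overlay it on the first panel, a line such as `axes[0].plot(smooth_ema(metrics['train_loss']), label='Smoothed Loss')` could be added before `axes[0].legend()`.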

## 9. List All Jobs

View all jobs in the system.

In [None]:
from typing import Optional

def list_jobs(status_filter: Optional[str] = None):
    """List all jobs with optional status filter"""
    params = {}
    if status_filter:
        params['status'] = status_filter
    
    response = requests.get(f"{API_BASE}/jobs", params=params)
    return response.json()

all_jobs = list_jobs()

if all_jobs:
    jobs_df = pd.DataFrame(all_jobs)
    print(f"\nTotal Jobs: {len(jobs_df)}\n")
    print(jobs_df[['id', 'name', 'status', 'created_at']].head(10))
    
    # Status distribution
    status_counts = jobs_df['status'].value_counts()
    print("\nJob Status Distribution:")
    print(status_counts)
else:
    print("No jobs found.")

## 10. Access Trained Models

Once training is complete, you can access the trained model artifacts.

In [None]:
def get_model_adapters(model_id: str):
    """Get all adapters for a model"""
    response = requests.get(f"{API_BASE}/models/{model_id}/adapters")
    return response.json()

# List available models with adapters
models = list_base_models()

print("\nModels with Trained Adapters:")
print("=" * 50)

for model in models[:5]:  # Show first 5
    adapters = get_model_adapters(model['id'])
    if adapters:
        print(f"\n📦 {model['name']}")
        print(f"   Adapters: {len(adapters)}")
        for adapter in adapters[:3]:  # Show first 3 adapters
            print(f"   - {adapter['name']} (created: {adapter['created_at']})")
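
To find the result of the job you just ran, you will usually want the most recent adapter rather than the first few in list order. A sketch for picking it is below. It assumes `created_at` is an ISO 8601 timestamp string, which the cells above do not confirm, and that the timestamps are either all timezone-aware or all naive; `parse_timestamp` and `latest_adapter` are helper names introduced here.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def parse_timestamp(text: str) -> datetime:
    """Parse an ISO 8601 string; a trailing 'Z' is rewritten for Python < 3.11."""
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    return datetime.fromisoformat(text)


def latest_adapter(adapters: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the adapter with the newest created_at, or None for an empty list."""
    if not adapters:
        return None
    return max(adapters, key=lambda adapter: parse_timestamp(adapter["created_at"]))
```

For example, `latest_adapter(get_model_adapters(model['id']))` would return the newest adapter for one model, or `None` if it has none.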

## Next Steps

Congratulations! You've completed the SparkTrainer quick start tutorial. Here's what you can explore next:

1. **Custom LoRA Recipe Tutorial** (`02_custom_lora_recipe.ipynb`) - Create custom training recipes
2. **Multimodal Training** (`03_multimodal_training.ipynb`) - Train vision-language models
3. **Advanced Optimization** (`04_advanced_optimization.ipynb`) - Hyperparameter tuning with Optuna
4. **Production Deployment** (`05_model_deployment.ipynb`) - Deploy models with vLLM/TGI

### Resources

- 📚 [Full Documentation](https://github.com/def1ant1/SparkTrainer)
- 🔧 [API Reference](http://localhost:5000/api/docs)
- 💬 [GitHub Discussions](https://github.com/def1ant1/SparkTrainer/discussions)
- 🐛 [Report Issues](https://github.com/def1ant1/SparkTrainer/issues)

Happy training! 🚀