This repository demonstrates how to orchestrate Large Language Model (LLM) fine-tuning using SkyPilot for resource management and MLflow for experiment tracking. The example fine-tunes the LLaMA 3.1 8B model on the Orca Math dataset.
- Python 3.10+
- Access to a Kubernetes cluster or cloud provider supported by SkyPilot
- MLflow tracking server (self-hosted or a managed service such as Nebius AI MLflow)
- Hugging Face account and access token for LLaMA 3.1 models
- Install SkyPilot with Kubernetes support:
    pip install "skypilot[kubernetes]"
- Configure environment variables by creating a .env file:
MLFLOW_TRACKING_URI=https://your-mlflow-server
MLFLOW_TRACKING_SERVER_CERT_PATH=/path/to/cert.pem # needs to be downloaded or synced to the SkyPilot cluster
MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING=true
MLFLOW_EXPERIMENT_NAME=LLM_Fine_Tuning
MLFLOW_TRACKING_USERNAME=your-username
MLFLOW_TRACKING_PASSWORD=your-password
HF_TOKEN=your-huggingface-token
# TEST_MODE=true # Uncomment for development. This uses a smaller subset of the dataset.
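
A missing variable usually shows up only after the cluster has been provisioned, so it can help to check the .env file locally first. The following is a minimal sketch, not part of the repository: `parse_env_text`, `missing_vars`, and the choice of `REQUIRED_VARS` are hypothetical, and the parser only handles simple KEY=VALUE lines with `#` comments.

```python
from pathlib import Path

# Assumed set of variables the job cannot run without (illustrative choice).
REQUIRED_VARS = ("MLFLOW_TRACKING_URI", "MLFLOW_EXPERIMENT_NAME", "HF_TOKEN")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; skip blanks and full-line # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment introduced by " #".
        value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_vars(values, required=REQUIRED_VARS):
    """Return the required names that are absent or empty, in declaration order."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    missing = missing_vars(values)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Running the script from the repository root before `sky launch` prints the names of any required variables that are unset or empty.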
├── generate_train_dataset.py # Script to prepare training data
├── requirements.txt # Project dependencies
├── sky.yaml # SkyPilot configuration
├── train.py # Main training script
└── recipes/ # Training configurations
    ├── llama-3-1-8b-qlora.yaml # QLoRA fine-tuning config
    ├── llama-3-1-8b-spectrum.yaml # Spectrum fine-tuning config
    └── spectrum_config/ # Spectrum-specific configurations
- Generate the training dataset:
    python generate_train_dataset.py
- Launch the initial training job:
    sky launch -c dev sky.yaml --env-file .env
- For subsequent training runs on the same cluster:
    sky exec dev sky.yaml --env-file .env
- Monitor cluster status:
    sky status
- Tear down (terminate) the cluster:
    sky down dev
- Connect to the cluster via SSH:
    ssh dev
- View logs:
    sky logs dev
The repository includes two main training configurations:
- QLoRA (Quantized Low-Rank Adaptation):
  - Uses 4-bit quantization
  - Applies LoRA to all linear layers
  - Configured in recipes/llama-3-1-8b-qlora.yaml
- Spectrum:
  - Selectively unfreezes specific model layers
  - Uses a cosine learning rate schedule
  - Configured in recipes/llama-3-1-8b-spectrum.yaml
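
The two ideas behind the Spectrum configuration, selective unfreezing and a cosine learning rate schedule, can be sketched in plain Python. This is an illustration under assumptions, not the code in train.py: the function names, the regex patterns, the parameter names, and the optional linear warmup are all hypothetical.

```python
import math
import re


def select_trainable(param_names, unfrozen_patterns):
    """Return the parameter names that stay trainable; all others would be frozen."""
    compiled = [re.compile(p) for p in unfrozen_patterns]
    return [n for n in param_names if any(c.search(n) for c in compiled)]


def cosine_lr(step, total_steps, base_lr, warmup_steps=0):
    """Optional linear warmup, then cosine decay from base_lr down to 0."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    # Hypothetical parameter names in the style of a decoder-only transformer.
    names = [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.mlp.down_proj.weight",
        "model.layers.1.self_attn.q_proj.weight",
        "model.layers.1.mlp.down_proj.weight",
    ]
    # Hypothetical patterns naming the layers to unfreeze.
    patterns = [
        r"^model\.layers\.1\.mlp\.down_proj",
        r"^model\.layers\.0\.self_attn\.q_proj",
    ]
    print(select_trainable(names, patterns))
    print([cosine_lr(s, 100, 2e-4) for s in (0, 50, 100)])
```

With a PyTorch model, the same selection would typically be applied by setting `requires_grad` to `False` on every parameter and back to `True` on those whose names match.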
The training script integrates with MLflow to track:
- Training metrics (loss, learning rate, batch size, training speed)
- System metrics (GPU utilization, memory usage, CPU usage)
- Model parameters and configurations
- Training artifacts
Access the MLflow UI through your configured tracking server to monitor experiments.
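
As a rough illustration of what such tracking involves, the sketch below logs a nested configuration and a loss curve with the MLflow Python API. It is not taken from train.py, which may log through a trainer integration instead; `flatten_params`, `log_run`, and the example config keys are hypothetical. MLflow reads `MLFLOW_TRACKING_URI`, `MLFLOW_EXPERIMENT_NAME`, and the credentials from the environment, so none of them appear in the code.

```python
def flatten_params(config, prefix=""):
    """Flatten a nested config dict into dot-separated keys (MLflow params are flat)."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat


def log_run(config, losses):
    """Log params and a per-step loss curve to the configured tracking server."""
    import mlflow  # imported lazily so flatten_params works without MLflow installed

    with mlflow.start_run():
        mlflow.log_params(flatten_params(config))
        for step, loss in enumerate(losses):
            mlflow.log_metric("loss", loss, step=step)


if __name__ == "__main__":
    # Hypothetical config; real values would come from the recipe YAML files.
    example = {"lora": {"r": 16, "alpha": 32}, "learning_rate": 2e-4}
    print(flatten_params(example))
```

Calling `log_run(example, [1.2, 0.9, 0.7])` with the .env variables exported would create one run containing three parameters and a three-point loss curve.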
This project is based on Philipp Schmid's "How to fine-tune open LLMs in 2025" blog post and extends it with SkyPilot and MLflow integration.




