TinyTrainer is a pipeline for systematically fine-tuning small language models. It implements a multi-stage training approach that enhances TinyLlama's reasoning and instruction-following capabilities through:
- Dataset Curation & Formatting - Custom preprocessing pipelines for diverse datasets
- Supervised Fine-tuning (SFT) - Multi-domain instruction following training
- Chain-of-Thought (CoT) Training - Enhanced reasoning capability development
- Preference Data Generation - Collecting multi-response data for alignment (planned)
- DPO Training - Direct Preference Optimization for model alignment (planned)
- Modular Pipeline: Each training stage is independent and configurable
- LoRA Integration: Memory-efficient training using Low-Rank Adaptation and Quantization
- Multi-GPU Support: Distributed training with Accelerate
- Stacked Adapters: Progressive capability building through adapter stacking
- Comprehensive Datasets: 78.5k samples across math, science, and general reasoning
- Python 3.8+
- 2× NVIDIA P100 GPUs (e.g., on Kaggle)
```bash
git clone <repository-url>
cd TinyTrainer

# Install dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers datasets accelerate peft bitsandbytes
```

Configure multi-GPU training:
```bash
accelerate config
```

Or use the provided configuration:
```yaml
# accelerate_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_processes: 2
machine_rank: 0
num_machines: 1
gpu_ids: 0,1
use_cpu: false
mixed_precision: fp16
```
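With the configuration saved, each stage's training script can be launched across both GPUs; the script name below is a placeholder for whichever stage you are running:

```bash
accelerate launch --config_file accelerate_config.yaml <training_script>.py
```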
This stage builds custom preprocessing pipelines for the various source datasets. Please read the Dataset Curation & Formatting README for more details.
To download and prepare datasets, run the following:
```bash
chmod +x run_preprocess.sh
./run_preprocess.sh
```

Alternatively, you can use the same preprocessed datasets I have published on the Hugging Face Hub.
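As a rough illustration of what such a preprocessing step does (the real pipelines live in the project's preprocessing scripts; the dataset name and field names below are made up):

```python
from datasets import load_dataset

# Hypothetical source dataset; the real pipeline pulls several math, science,
# and general-reasoning sets
raw = load_dataset("some_org/some_instruction_dataset", split="train")

def to_instruction_format(example):
    # Map raw fields into a single training string; field names are illustrative
    return {"text": f"### Instruction:\n{example['question']}\n### Response:\n{example['answer']}"}

formatted = raw.map(to_instruction_format, remove_columns=raw.column_names)
formatted.save_to_disk("data/formatted")
```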
Train the base model on instruction-following tasks:
```bash
chmod +x run_train_sft.sh
./run_train_sft.sh
```
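run_train_sft.sh drives the project's actual training script; the following is only a rough, self-contained sketch of what an SFT step of this kind looks like, with an assumed base checkpoint, toy data, and illustrative LoRA settings and hyperparameters:

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with a LoRA adapter; rank and target modules are illustrative
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# Toy stand-in for the curated 56k-sample instruction mix
data = Dataset.from_dict({"text": ["### Instruction:\nAdd 2 and 3.\n### Response:\n5"]})
tokenized = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
                     remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/step_1_sft",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           fp16=True),  # fp16 assumes a CUDA GPU
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("checkpoints/step_1_sft")
```

In the actual pipeline this runs under accelerate launch across both GPUs.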
Build reasoning capabilities on top of the SFT model:

```bash
chmod +x run_train_cot.sh
./run_train_cot.sh
```

Key Features:
- Adapter Stacking: CoT adapter stacked on frozen SFT adapter (see the sketch after this list)
- Progressive Learning: Builds reasoning on top of instruction-following
- Memory Efficient: Only CoT parameters are trainable
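One way to approximate this step with PEFT is to fold the frozen SFT adapter into the base weights and then attach a fresh, trainable CoT adapter; the project's cot/stack_base_sft_cot.py implements its own version of this, so treat the following as a sketch with an assumed model id and illustrative LoRA settings:

```python
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # assumed base model

# Fold the finished SFT adapter into the base weights (it stays frozen from here on)
sft_model = PeftModel.from_pretrained(base, "checkpoints/step_1_sft")
merged = sft_model.merge_and_unload()

# Attach a fresh LoRA adapter for CoT training; only these parameters will train
cot_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                        task_type="CAUSAL_LM")
cot_model = get_peft_model(merged, cot_config)
cot_model.print_trainable_parameters()  # confirms only the CoT adapter is trainable
```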
Utility scripts for merging, testing, and stacking:

```bash
python sft/merge_base_lora.py     # merge the SFT adapter into the base model
python sft/test_sft.py            # test the SFT model
python cot/stack_base_sft_cot.py  # stack the SFT and CoT adapters
```

The project includes simple evaluation scripts:
```python
prompt = "what is 10 times 21"
response = generate_response(prompt)  # generate_response is defined in the project's evaluation scripts
print(response)
```

Create your own evaluation scripts:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and stack both trained adapters
model = AutoModelForCausalLM.from_pretrained("base_model", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, "checkpoints/step_1_sft")
model.load_adapter("checkpoints/step_2_cot", adapter_name="cot")
tokenizer = AutoTokenizer.from_pretrained("base_model")

def evaluate_model(prompts):
    # Your evaluation logic here; this minimal version just prints generations
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=256)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The pipeline uses several techniques to reduce memory usage:
- 4/8-bit Quantization: BitsAndBytesConfig with NF4 (see the sketch after this list)
- LoRA: Low-rank adaptation instead of full fine-tuning
- Gradient Checkpointing: Trades compute for memory
- Mixed Precision: FP16 training
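A sketch of how these pieces fit together when loading the model; the exact settings and model id are assumptions, not the project's verified values:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with fp16 compute for memory-efficient training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,   # assumed; a common companion to NF4
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # assumed base model
    quantization_config=bnb_config,
)
model.gradient_checkpointing_enable()  # trade compute for memory
```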
- SFT Training: ~3 epochs on 56k samples
- CoT Training: ~3 epochs on 22.5k samples
- Total Training Time: ~6-8 hours on 2x RTX 4090
After training, the model demonstrates improved:
- Instruction Following: Better adherence to user requests
- Mathematical Reasoning: Step-by-step problem solving
- Chain-of-Thought: Explicit reasoning processes
- Multi-domain Knowledge: Performance across math, science, and general tasks
```
# Reduce batch size in config files
"train_batch_size": 2            # instead of 4
"gradient_accumulation_steps": 8 # increase to maintain the effective batch size
```

(The effective batch size is per-device batch size × gradient accumulation steps × number of GPUs, so with 2 GPUs this gives 2 × 8 × 2 = 32.)

```bash
# Enable optimizations
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
export TOKENIZERS_PARALLELISM=false
```

```
# Ensure correct paths in config files
"sft_lora_path": "checkpoints/step_1_sft"  # must exist before CoT training
```

This project is licensed under the MIT License - see the LICENSE file for details.
- TinyLlama Team: For the excellent base model
- Hugging Face: For transformers and datasets libraries
- PEFT Team: For the parameter-efficient fine-tuning library
- Dataset Creators: All the dataset authors whose work made this possible