EVA V1 is a reasoning model built on Meta Llama 3.1 8B Instruct using a two-phase training pipeline. The model is first fine-tuned with conversational data (SFT), then learns structured reasoning through reward functions (GRPO).
Llama 3.1 8B Instruct
|
LoRA (rank=32)
|
Phase 1: SFT ---- Conversational fine-tuning (FineTome-100k)
|
Phase 2: GRPO --- Reward-based reasoning training (GSM8K)
|
EVA V1
- Two-phase training: SFT + GRPO pipeline in a single notebook
- Structured output: Reasoning with `<reasoning>` and `<answer>` XML tags
- Fast training: 2x faster with 70% less VRAM using Unsloth
- vLLM integration: Fast inference with `fast_inference=True`
- Memory management: Automatic memory cleanup between phases
- Flexible export: LoRA, merged 16bit/4bit, GGUF (q8_0, q4_k_m, q5_k_m, f16)
- Google Colab (free Tesla T4) or CUDA-capable GPU
- Python 3.10+
- Minimum 15GB VRAM (with 4-bit quantization)
pip install unsloth vllm
pip install transformers==4.56.2
pip install --no-deps trl==0.22.2

The model is first fine-tuned on conversational data:
- Dataset: mlabonne/FineTome-100k (ShareGPT format)
- Custom Chat Template: SYSTEM_MESSAGE is automatically embedded into every conversation
- System role filter: Prevents duplicate system messages
- Training: SFTTrainer, 1 epoch, lr=2e-5, adamw_8bit
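The effect of the system role filter and the embedded system message can be illustrated in plain Python. This is a minimal sketch, not the notebook's code: the notebook does this through a custom chat template, whereas the hypothetical helper `to_chat_messages` below works on already-loaded rows. The `ROLE_MAP` mapping and the placeholder `SYSTEM_MESSAGE` text are assumptions for illustration.

```python
# Placeholder text; the notebook defines its own SYSTEM_MESSAGE in Cell 10.
SYSTEM_MESSAGE = "You are a helpful reasoning assistant."

# ShareGPT speaker names mapped to chat roles (assumed mapping).
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def to_chat_messages(conversations, system_message=SYSTEM_MESSAGE):
    """Convert one ShareGPT conversation into role/content messages.

    Existing system turns are dropped so that exactly one system
    message (ours) appears at the start of the conversation.
    """
    messages = [{"role": "system", "content": system_message}]
    for turn in conversations:
        if turn["from"] == "system":
            continue  # system role filter: avoid a duplicate system message
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            continue  # skip speakers this sketch does not know about
        messages.append({"role": role, "content": turn["value"]})
    return messages


example = [
    {"from": "system", "value": "Some dataset-provided system prompt."},
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "4"},
]
messages = to_chat_messages(example)
```

Here `messages` starts with the single injected system message, followed by the user and assistant turns.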
| Parameter | Value |
|---|---|
| Batch size | 2 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Warmup steps | 20 |
| Learning rate | 2e-5 |
| Optimizer | adamw_8bit |
| Scheduler | linear |
| Max seq length | 2048 |
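To make the table concrete, the sketch below computes the effective batch size and the learning rate under a linear schedule with warmup. It assumes the usual shape of such a schedule (linear ramp to the peak, then linear decay to zero). The helper `linear_lr` and the value of `TOTAL_STEPS` are hypothetical, because the real step count depends on how much data is used.

```python
def linear_lr(step, total_steps, peak_lr=2e-5, warmup_steps=20):
    """Learning rate at an optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Batch size x gradient accumulation = samples per optimizer step.
EFFECTIVE_BATCH = 2 * 4

# Hypothetical number of optimizer steps, for illustration only.
TOTAL_STEPS = 500

for step in (0, 10, 20, 260, 500):
    print(step, linear_lr(step, TOTAL_STEPS))
```

With these settings each optimizer step sees 8 samples, and the learning rate reaches its peak of 2e-5 at step 20.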
After SFT, the model learns structured reasoning through reward functions:
- Dataset: openai/gsm8k (math problems)
- Output format: `<reasoning>...</reasoning><answer>...</answer>`
- Training: GRPOTrainer, 1 epoch, lr=2e-5, paged_adamw_8bit
| Parameter | Value |
|---|---|
| Batch size | 1 |
| Num generations | 6 |
| Epochs | 1 |
| Warmup steps | 20 |
| Learning rate | 2e-5 |
| Optimizer | paged_adamw_8bit |
| Scheduler | cosine |
| Max prompt length | 256 |
| Max completion length | 1792 |
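Two things in this table are worth spelling out. The prompt and completion limits add up to the 2048-token sequence length, and `Num generations = 6` means six completions are sampled per prompt and scored relative to each other. The sketch below shows the group-relative normalization at the heart of GRPO in simplified form. The helper `group_advantages`, the population standard deviation, the `eps` value, and the example rewards are assumptions; the trainer's exact normalization may differ.

```python
from statistics import mean, pstdev

MAX_PROMPT_LENGTH = 256
MAX_COMPLETION_LENGTH = 1792
MAX_SEQ_LENGTH = 2048

# The two limits together fill the full sequence budget.
assert MAX_PROMPT_LENGTH + MAX_COMPLETION_LENGTH == MAX_SEQ_LENGTH


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up total rewards for the six completions of one prompt.
rewards = [4.0, 0.5, 0.0, 2.5, 0.5, 0.0]
advantages = group_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced; if all six completions score the same, every advantage is zero and that prompt contributes no learning signal.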
The model is trained with 5 different reward signals:
| Function | Metric | Max Reward |
|---|---|---|
| `correctness_reward_func` | Is the answer correct? | 2.0 |
| `int_reward_func` | Is the answer numeric? | 0.5 |
| `strict_format_reward_func` | Exact XML format match? | 0.5 |
| `soft_format_reward_func` | Loose XML format match? | 0.5 |
| `xmlcount_reward_func` | Partial XML tag scoring | 0.5 |
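The sketch below shows how reward signals of this kind can be computed. It is a simplified illustration, not the notebook's implementation: the functions take a single completion string rather than the batched arguments a trainer passes, the regular expressions are assumptions about what "strict" and "loose" mean, and the function names are shortened on purpose to mark them as hypothetical.

```python
import re

# Assumed patterns: strict requires the exact line layout, soft only the tag order.
STRICT = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?$", re.DOTALL
)
SOFT = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
TAGS = ("<reasoning>", "</reasoning>", "<answer>", "</answer>")


def extract_xml_answer(text):
    """Return the text between the last <answer> tag and the next </answer>."""
    return text.split("<answer>")[-1].split("</answer>")[0].strip()


def correctness_reward(completion, reference):
    return 2.0 if extract_xml_answer(completion) == reference.strip() else 0.0


def int_reward(completion):
    return 0.5 if extract_xml_answer(completion).isdigit() else 0.0


def strict_format_reward(completion):
    return 0.5 if STRICT.match(completion) else 0.0


def soft_format_reward(completion):
    return 0.5 if SOFT.search(completion) else 0.0


def xmlcount_reward(completion):
    # Partial credit: 0.125 for each tag that appears exactly once.
    return sum(0.125 for tag in TAGS if completion.count(tag) == 1)


def total_reward(completion, reference):
    return (
        correctness_reward(completion, reference)
        + int_reward(completion)
        + strict_format_reward(completion)
        + soft_format_reward(completion)
        + xmlcount_reward(completion)
    )


good = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"
bad = "The answer is 4"
print(total_reward(good, "4"), total_reward(bad, "4"))
```

A completion that is correct and perfectly formatted collects every maximum in the table, for a total of 4.0, while an unformatted reply earns nothing even when the number in it is right.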
- Model: `unsloth/meta-Llama-3.1-8B-Instruct`
- Quantization: 4-bit (bitsandbytes)
- Max sequence length: 2048
- Rank: 32
- Alpha: 32
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Gradient checkpointing: unsloth (long context support)
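For a sense of scale, the sketch below estimates how many trainable parameters this LoRA configuration adds. It assumes the published Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 32 layers, and 8 key/value heads of dimension 128, giving a 1024-wide k/v projection); these values are not stated in this README. Each adapted linear layer of shape `(d_in, d_out)` adds `rank * (d_in + d_out)` parameters.

```python
RANK = 32
NUM_LAYERS = 32          # assumed from the Llama 3.1 8B config
HIDDEN = 4096            # assumed hidden size
INTERMEDIATE = 14336     # assumed MLP size
KV_DIM = 8 * 128         # assumed: 8 key/value heads x head dimension 128

# (input features, output features) of each targeted projection.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in, d_out, rank=RANK):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters hold about 84 million parameters, roughly 1% of the base model's weights.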
You are a helpful reasoning assistant. You think step by step
and provide clear, structured answers. Use <reasoning> tags for
your thought process and <answer> tags for your final answer.
<reasoning>
Step-by-step thought process...
</reasoning>
<answer>
Final answer
</answer>

- Open the notebook in Google Colab
- Select Runtime > Run all
- Phase 1 (SFT) takes ~15-20 minutes
- Phase 2 (GRPO) takes ~30-45 minutes
from vllm import SamplingParams

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Your question here"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)

output = model.fast_generate(
    [text],
    sampling_params=sampling_params,
    lora_request=model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

<reasoning>
To calculate pi we can use the Leibniz formula:
pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
Let's compute the first 1000 terms:
Sum = 1 - 0.3333 + 0.2 - 0.1429 + ...
Sum = 0.7854
pi = 4 * 0.7854 = 3.1416
</reasoning>
<answer>
3.1416
</answer>

model.save_lora("grpo_saved_lora")
model.save_pretrained_merged("model_16bit", tokenizer, save_method="merged_16bit")
model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

| Format | Size | Use Case |
|---|---|---|
| LoRA | ~100MB | Lightweight, loaded as adapter |
| Merged 16bit | ~16GB | vLLM deployment |
| Merged 4bit | ~4GB | Low memory deployment |
| GGUF q8_0 | ~8GB | llama.cpp, Ollama |
| GGUF q4_k_m | ~4GB | llama.cpp, Ollama (recommended) |
| GGUF q5_k_m | ~5GB | llama.cpp, Ollama |
EVA_V1.ipynb
|
|-- Setup (Cell 4-5)
|-- Model Setup (Cell 8)
|-- Phase 1: SFT
| |-- System Message (Cell 10)
| |-- Chat Template (Cell 11)
| |-- Dataset Loading (Cell 12)
| |-- SFT Trainer (Cell 13)
| |-- SFT Training (Cell 14)
| |-- Template Reset (Cell 15)
|
|-- Phase 2: GRPO
| |-- Data Prep + Rewards (Cell 18)
| |-- GRPO Config (Cell 20)
| |-- GRPO Training (Cell 22)
|
|-- Inference (Cell 24, 28)
|-- Save/Export (Cell 31, 33)
Change the SYSTEM_MESSAGE content in Cell 10.
Replace load_dataset("mlabonne/FineTome-100k") in Cell 12 with your own dataset. ShareGPT format is required.
- SFT: Increase `num_train_epochs=1` in Cell 13 (2-3 epochs recommended)
- GRPO: Increase `num_train_epochs=1` in Cell 20 (2-3 epochs recommended)
Change the model_name parameter in Cell 8. Unsloth-supported models:
- `unsloth/meta-Llama-3.1-8B-Instruct`
- `unsloth/Meta-Llama-3.1-70B-bnb-4bit`
- `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`
- `unsloth/mistral-7b-instruct-v0.3-bnb-4bit`
- `unsloth/gemma-2-9b-bnb-4bit`
| Component | Version |
|---|---|
| Base Model | Llama 3.1 8B Instruct |
| Framework | Unsloth |
| Training | trl 0.22.2 (SFTTrainer + GRPOTrainer) |
| Inference | vLLM |
| Transformers | 4.56.2 |
| Quantization | bitsandbytes (4-bit) |
| LoRA | PEFT (rank 32) |
This notebook is licensed under LGPL-3.0.