QLoRA fine-tuning for ReAct-style agent reasoning in small language models (1.5B–7B)
In our previous work, we showed that QLoRA fine-tuning enables small models to achieve 86–89% exact match on single-turn function calling. This project takes the next step: teaching small models to think and act like agents — planning multi-step solutions, calling tools, processing results, and adapting their approach.
| Model | Metric | Zero-Shot | Fine-Tuned | Δ |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | Task Success Rate | 93.3% | 100% | +6.7% |
| Qwen2.5-3B-Instruct | Tool Selection Acc | 30.0% | 100% | +70.0% |
| Qwen2.5-3B-Instruct | Exact Tool Match | 30.0% | 100% | +70.0% |
| Qwen2.5-7B-Instruct | Task Success Rate | 83.3% | 100% | +16.7% |
| Qwen2.5-7B-Instruct | Tool Selection Acc | 53.3% | 100% | +46.7% |
| Qwen2.5-7B-Instruct | Exact Tool Match | 53.3% | 100% | +46.7% |
| Training Samples | Tool Selection Acc | Exact Tool Match | Training Loss | Time |
|---|---|---|---|---|
| 50 | 73.3% | 60.0% | 1.880 | 65s |
| 100 | 100% | 93.3% | 1.452 | 127s |
| 250 | 96.7% | 93.3% | 0.785 | 317s |
| 500 | 96.7% | 96.7% | 0.419 | 634s |
| 1000 | 96.7% | 93.3% | 0.229 | 1266s |
**Key finding:** Just 100 training samples are enough to reach perfect tool selection accuracy on our evaluation set. After fine-tuning, the 3B model matches the 7B model, demonstrating that QLoRA can close the capability gap between model sizes for agent reasoning tasks.
User: "What's the weather in Tokyo? If it's cold, recommend a ramen place."

Agent (fine-tuned Qwen2.5-3B):

```
Thought: I need to check the weather in Tokyo first.
Action: {"name": "get_weather", "arguments": {"city": "Tokyo"}}
Observation: {"temperature": 8, "condition": "cloudy"}
Thought: 8°C is cold. I should recommend a ramen restaurant.
Action: {"name": "search_restaurant", "arguments": {"location": "Tokyo", "cuisine": "ramen"}}
Observation: {"name": "Ichiran Ramen", "rating": 4.6}
Thought: I have all the information.
Answer: Tokyo is 8°C and cloudy. I recommend Ichiran Ramen (rated 4.6)!
```
```
┌─────────────┐
│ User Query  │
└──────┬──────┘
       │
┌──────▼────────────────┐
│    Fine-tuned LLM     │
│  (Qwen2.5-3B + LoRA)  │
└──────┬────────────────┘
       │
┌──────▼────────────────┐
│      ReAct Loop       │
│  Thought → Action →   │
│  Observation → ...    │◄──── Tool Registry
│  → Answer             │      (10 tools)
└───────────────────────┘
```
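The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual `src/agent/runtime.py`: the function names (`run_react`, `generate`) and the step budget are assumptions, but the control flow (parse `Action:` lines, execute the tool, feed back an `Observation:` line, stop at `Answer:`) matches the trajectory format shown above.

```python
import json
import re

# Hypothetical sketch of the ReAct execution engine. The model's generation is
# scanned for "Action:" lines, which trigger a tool call whose result is fed
# back as an "Observation:" line; an "Answer:" line terminates the loop.
ACTION_RE = re.compile(r"^Action:\s*(\{.*\})\s*$", re.M)
ANSWER_RE = re.compile(r"^Answer:\s*(.*)$", re.M)

def run_react(generate, tools, query, max_steps=6):
    """Drive Thought -> Action -> Observation turns until an Answer appears."""
    transcript = f"User: {query}\n"
    for _ in range(max_steps):
        step = generate(transcript)      # model emits Thought/Action or Answer
        transcript += step + "\n"
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1), transcript
        action = ACTION_RE.search(step)
        if action:
            call = json.loads(action.group(1))
            result = tools[call["name"]](**call["arguments"])
            transcript += f"Observation: {json.dumps(result)}\n"
    return None, transcript  # step budget exhausted without a final Answer
```

The `max_steps` cap is what makes the Step Efficiency metric measurable: a trajectory that never reaches `Answer:` counts as a failure.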
Open notebooks/01_Agent_FineTune.ipynb in Google Colab with an L4 GPU. The notebook is self-contained — all code is inline.
```bash
pip install -r requirements.txt

# Train
python -m src.training.train \
    --model_id Qwen/Qwen2.5-3B-Instruct \
    --train_samples 500 \
    --output_dir ./output/agent_qwen25_3b

# Evaluate
python -m src.evaluation.evaluate \
    --model_id Qwen/Qwen2.5-3B-Instruct \
    --adapter_path ./output/agent_qwen25_3b

# Demo
python demo/app.py \
    --model_id Qwen/Qwen2.5-3B-Instruct \
    --adapter_path ./output/agent_qwen25_3b
```

```
agenttune/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── src/
│   ├── agent/
│   │   ├── tools.py              # 10 tool definitions + simulated executor
│   │   └── runtime.py            # ReAct execution engine
│   ├── data/
│   │   ├── react_formatter.py    # Format trajectories → training text
│   │   └── build_dataset.py      # Seed trajectories + data generation
│   ├── training/
│   │   └── train.py              # QLoRA fine-tuning script
│   └── evaluation/
│       └── evaluate.py           # Agent task evaluation
├── notebooks/
│   ├── 01_Agent_FineTune.ipynb       # Main Colab notebook (training + eval)
│   └── 02_Scaling_Experiments.ipynb  # Model comparison + data scaling
├── demo/
│   └── app.py                    # Gradio web demo
└── results/
```
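The `tools.py` + simulated-executor split can be illustrated with a minimal sketch. This is an assumption about the pattern, not the repository's actual code: the registry shape, the `call_tool` helper, and the canned weather values are all hypothetical.

```python
# Illustrative sketch: each tool pairs a JSON-schema-style spec (shown to the
# model in the prompt) with a simulated executor. Deterministic, canned
# responses keep generated trajectories reproducible across runs.
TOOLS = {
    "get_weather": {
        "spec": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {"city": {"type": "string", "required": True}},
        },
        # Simulated executor (hypothetical values): no real API call is made.
        "execute": lambda city: {"temperature": 8, "condition": "cloudy"},
    },
}

def call_tool(name, arguments):
    """Dispatch a parsed Action to its simulated executor."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name]["execute"](**arguments)
```

Returning an `{"error": ...}` observation for unknown tools, rather than raising, lets the model see its mistake and recover within the ReAct loop.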
Each training sample is a complete agent trajectory in ReAct format:
Thought → Action → Observation → Thought → ... → Answer
Data sources:
- Seed trajectories: 10 hand-crafted multi-step examples (1–3 tool calls)
- Augmented data: Seed tasks re-executed with varied simulated tool responses
- Synthetic generation: (Planned) GPT-4/Claude-generated diverse trajectories
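A minimal sketch of how a trajectory might be serialized into training text (the field names and function signature here are assumptions; the actual logic lives in `src/data/react_formatter.py`):

```python
import json

def format_trajectory(query, steps, answer):
    """Render one trajectory as ReAct-format training text:
    Thought -> Action -> Observation, repeated, then a final Answer."""
    lines = [f"User: {query}"]
    for step in steps:
        lines.append(f"Thought: {step['thought']}")
        if "action" in step:  # final Thought may carry no tool call
            lines.append(f"Action: {json.dumps(step['action'])}")
            lines.append(f"Observation: {json.dumps(step['observation'])}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)
```

Because observations are serialized into the text, re-executing a seed task with varied simulated tool responses yields distinct training samples from one hand-crafted trajectory.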
| Component | Setting |
|---|---|
| Quantization | QLoRA 4-bit (NF4, double quantization) |
| LoRA rank / alpha | 16 / 32 |
| LoRA targets | All attention + MLP projections |
| Learning rate | 2e-4 (cosine schedule) |
| Max sequence length | 2048 tokens |
| Training epochs | 3 |
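The table above maps onto `transformers`/`peft` config objects roughly as follows. This is a sketch, not the exact contents of `src/training/train.py`; in particular the dropout value is an assumption not stated in the table.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA 4-bit quantization: NF4 with double quantization, per the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rank 16 / alpha 32 adapters on all attention + MLP projections
# (Qwen2.5 module names).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption: not specified in the table
    task_type="CAUSAL_LM",
)
```

The 2048-token max sequence length matters here: a full multi-step trajectory (several Thought/Action/Observation turns plus tool schemas) no longer fits in the 512 tokens used for single-turn function calling.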
| Metric | Description |
|---|---|
| Task Success Rate | Agent reaches a final answer |
| Tool Selection Accuracy | Correct tools called |
| Exact Tool Match | Only the expected tools called |
| Step Efficiency | Completed within expected step range |
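The tool-level metrics in the table can be computed from the set of tools a trajectory actually called. A sketch under assumed names (`evaluate.py` may differ in details):

```python
def score_trajectory(called_tools, expected_tools, reached_answer):
    """Score one evaluation task.

    Tool Selection Accuracy: every expected tool was called (extras allowed).
    Exact Tool Match: the called set equals the expected set exactly.
    """
    called, expected = set(called_tools), set(expected_tools)
    return {
        "task_success": reached_answer,               # agent emitted a final Answer
        "tool_selection": expected.issubset(called),  # all required tools used
        "exact_tool_match": called == expected,       # and nothing extraneous
    }
```

This distinction explains why Tool Selection Accuracy is always at least Exact Tool Match in the scaling table: an exact match implies correct selection, but not vice versa.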
| | Phase 1: Tool Use | Phase 2: Agent Reasoning |
|---|---|---|
| Task | Single-turn function calling | Multi-step planning & execution |
| Format | User → JSON tool call | ReAct: Thought → Action → Observation → Answer |
| Sequence length | 512 tokens | 2048 tokens |
| Key question | Can small models call tools? | Can small models think like agents? |
```bibtex
@misc{agenttune-2026,
  title={AgentTune: Teaching Small LLMs Multi-Step Agent Reasoning via QLoRA},
  author={ChengXie},
  year={2026},
  url={https://github.com/XIECHENG6/agenttune}
}
```

MIT
