QLoRA fine-tuning for ReAct-style agent reasoning in small language models (1.5B–7B)
In our previous work, we showed that QLoRA fine-tuning enables small models to achieve 86–89% exact match on single-turn function calling. This project takes the next step: teaching small models to think and act like agents — planning multi-step solutions, calling tools, processing results, and adapting their approach.
| Model | Metric | Zero-Shot | Fine-Tuned | Δ |
|---|---|---|---|---|
| Qwen2.5-3B-Instruct | Task Success Rate | 93.3% | 100% | +6.7% |
| Qwen2.5-3B-Instruct | Tool Selection Acc | 30.0% | 100% | +70.0% |
| Qwen2.5-3B-Instruct | Exact Tool Match | 30.0% | 100% | +70.0% |
| Qwen2.5-7B-Instruct | Task Success Rate | 83.3% | 100% | +16.7% |
| Qwen2.5-7B-Instruct | Tool Selection Acc | 53.3% | 100% | +46.7% |
| Qwen2.5-7B-Instruct | Exact Tool Match | 53.3% | 100% | +46.7% |
| Training Samples | Tool Selection Acc | Exact Tool Match | Training Loss | Time |
|---|---|---|---|---|
| 50 | 73.3% | 60.0% | 1.880 | 65s |
| 100 | 100% | 93.3% | 1.452 | 127s |
| 250 | 96.7% | 93.3% | 0.785 | 317s |
| 500 | 96.7% | 96.7% | 0.419 | 634s |
| 1000 | 96.7% | 93.3% | 0.229 | 1266s |
**Key finding:** Just 100 training samples are enough to reach perfect tool selection accuracy on our evaluation set. After fine-tuning, the 3B model matches the 7B model, demonstrating that QLoRA can close the capability gap between model sizes for agent reasoning tasks.
User: "What's the weather in Tokyo? If it's cold, recommend a ramen place."

Agent (fine-tuned Qwen2.5-3B):

```
Thought: I need to check the weather in Tokyo first.
Action: {"name": "get_weather", "arguments": {"city": "Tokyo"}}
Observation: {"temperature": 8, "condition": "cloudy"}
Thought: 8°C is cold. I should recommend a ramen restaurant.
Action: {"name": "search_restaurant", "arguments": {"location": "Tokyo", "cuisine": "ramen"}}
Observation: {"name": "Ichiran Ramen", "rating": 4.6}
Thought: I have all the information.
Answer: Tokyo is 8°C and cloudy. I recommend Ichiran Ramen (rated 4.6)!
```
```
┌─────────────┐
│ User Query  │
└──────┬──────┘
       │
┌──────▼────────────────┐
│    Fine-tuned LLM     │
│  (Qwen2.5-3B + LoRA)  │
└──────┬────────────────┘
       │
┌──────▼────────────────┐
│      ReAct Loop       │
│  Thought → Action →   │
│  Observation → ...    │◄──── Tool Registry
│  → Answer             │      (10 tools)
└───────────────────────┘
```
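The loop above can be sketched in a few lines of Python. This is an illustrative sketch, not the actual `src/agent/runtime.py`: the function names (`run_react`, `generate`) and the step budget are assumptions, but the control flow (parse `Action:` lines, execute the tool, feed back an `Observation:` line, stop at `Answer:`) matches the trajectory format shown above.

```python
import json
import re

# Hypothetical sketch of the ReAct execution engine. The model's generation is
# scanned for "Action:" lines, which trigger a tool call whose result is fed
# back as an "Observation:" line; an "Answer:" line terminates the loop.
ACTION_RE = re.compile(r"^Action:\s*(\{.*\})\s*$", re.M)
ANSWER_RE = re.compile(r"^Answer:\s*(.*)$", re.M)

def run_react(generate, tools, query, max_steps=6):
    """Drive Thought -> Action -> Observation turns until an Answer appears."""
    transcript = f"User: {query}\n"
    for _ in range(max_steps):
        step = generate(transcript)      # model emits Thought/Action or Answer
        transcript += step + "\n"
        answer = ANSWER_RE.search(step)
        if answer:
            return answer.group(1), transcript
        action = ACTION_RE.search(step)
        if action:
            call = json.loads(action.group(1))
            result = tools[call["name"]](**call["arguments"])
            transcript += f"Observation: {json.dumps(result)}\n"
    return None, transcript  # step budget exhausted without a final Answer
```

The `max_steps` cap is what makes the Step Efficiency metric measurable: a trajectory that never reaches `Answer:` counts as a failure.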
Open notebooks/01_Agent_FineTune.ipynb in Google Colab with an L4 GPU. The notebook is self-contained — all code is inline.
```bash
pip install -r requirements.txt

# Train
python -m src.training.train \
    --model_id Qwen/Qwen2.5-3B-Instruct \
    --train_samples 500 \
    --output_dir ./output/agent_qwen25_3b

# Evaluate
python -m src.evaluation.evaluate \
    --model_id Qwen/Qwen2.5-3B-Instruct \
    --adapter_path ./output/agent_qwen25_3b

# Demo
python demo/app.py \
    --model_id Qwen/Qwen2.5-3B-Instruct \
    --adapter_path ./output/agent_qwen25_3b
```

```
agenttune/
├── README.md
├── LICENSE
├── requirements.txt
├── .gitignore
├── src/
│   ├── agent/
│   │   ├── tools.py              # 10 tool definitions + simulated executor
│   │   └── runtime.py            # ReAct execution engine
│   ├── data/
│   │   ├── react_formatter.py    # Format trajectories → training text
│   │   └── build_dataset.py      # Seed trajectories + data generation
│   ├── training/
│   │   └── train.py              # QLoRA fine-tuning script
│   └── evaluation/
│       └── evaluate.py           # Agent task evaluation
├── notebooks/
│   ├── 01_Agent_FineTune.ipynb       # Main Colab notebook (training + eval)
│   └── 02_Scaling_Experiments.ipynb  # Model comparison + data scaling
├── demo/
│   └── app.py                    # Gradio web demo
└── results/
```
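The `tools.py` + simulated-executor split can be illustrated with a minimal sketch. This is an assumption about the pattern, not the repository's actual code: the registry shape, the `call_tool` helper, and the canned weather values are all hypothetical.

```python
# Illustrative sketch: each tool pairs a JSON-schema-style spec (shown to the
# model in the prompt) with a simulated executor. Deterministic, canned
# responses keep generated trajectories reproducible across runs.
TOOLS = {
    "get_weather": {
        "spec": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {"city": {"type": "string", "required": True}},
        },
        # Simulated executor (hypothetical values): no real API call is made.
        "execute": lambda city: {"temperature": 8, "condition": "cloudy"},
    },
}

def call_tool(name, arguments):
    """Dispatch a parsed Action to its simulated executor."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return TOOLS[name]["execute"](**arguments)
```

Returning an `{"error": ...}` observation for unknown tools, rather than raising, lets the model see its mistake and recover within the ReAct loop.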
Each training sample is a complete agent trajectory in ReAct format:
Thought → Action → Observation → Thought → ... → Answer
Data sources:
- Seed trajectories: 10 hand-crafted multi-step examples (1–3 tool calls)
- Augmented data: Seed tasks re-executed with varied simulated tool responses
- Synthetic generation: (Planned) GPT-4/Claude-generated diverse trajectories
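A minimal sketch of how a trajectory might be serialized into training text (the field names and function signature here are assumptions; the actual logic lives in `src/data/react_formatter.py`):

```python
import json

def format_trajectory(query, steps, answer):
    """Render one trajectory as ReAct-format training text:
    Thought -> Action -> Observation, repeated, then a final Answer."""
    lines = [f"User: {query}"]
    for step in steps:
        lines.append(f"Thought: {step['thought']}")
        if "action" in step:  # final Thought may carry no tool call
            lines.append(f"Action: {json.dumps(step['action'])}")
            lines.append(f"Observation: {json.dumps(step['observation'])}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)
```

Because observations are serialized into the text, re-executing a seed task with varied simulated tool responses yields distinct training samples from one hand-crafted trajectory.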
| Component | Setting |
|---|---|
| Quantization | QLoRA 4-bit (NF4, double quantization) |
| LoRA rank / alpha | 16 / 32 |
| LoRA targets | All attention + MLP projections |
| Learning rate | 2e-4 (cosine schedule) |
| Max sequence length | 2048 tokens |
| Training epochs | 3 |
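The table above maps onto `transformers`/`peft` config objects roughly as follows. This is a sketch, not the exact contents of `src/training/train.py`; in particular the dropout value is an assumption not stated in the table.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA 4-bit quantization: NF4 with double quantization, per the table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Rank 16 / alpha 32 adapters on all attention + MLP projections
# (Qwen2.5 module names).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption: not specified in the table
    task_type="CAUSAL_LM",
)
```

The 2048-token max sequence length matters here: a full multi-step trajectory (several Thought/Action/Observation turns plus tool schemas) no longer fits in the 512 tokens used for single-turn function calling.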
| Metric | Description |
|---|---|
| Task Success Rate | Agent reaches a final answer |
| Tool Selection Accuracy | Correct tools called |
| Exact Tool Match | Only the expected tools called |
| Step Efficiency | Completed within expected step range |
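The tool-level metrics in the table can be computed from the set of tools a trajectory actually called. A sketch under assumed names (`evaluate.py` may differ in details):

```python
def score_trajectory(called_tools, expected_tools, reached_answer):
    """Score one evaluation task.

    Tool Selection Accuracy: every expected tool was called (extras allowed).
    Exact Tool Match: the called set equals the expected set exactly.
    """
    called, expected = set(called_tools), set(expected_tools)
    return {
        "task_success": reached_answer,               # agent emitted a final Answer
        "tool_selection": expected.issubset(called),  # all required tools used
        "exact_tool_match": called == expected,       # and nothing extraneous
    }
```

This distinction explains why Tool Selection Accuracy is always at least Exact Tool Match in the scaling table: an exact match implies correct selection, but not vice versa.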
| | Phase 1: Tool Use | Phase 2: Agent Reasoning |
|---|---|---|
| Task | Single-turn function calling | Multi-step planning & execution |
| Format | User → JSON tool call | ReAct: Thought → Action → Observation → Answer |
| Sequence length | 512 tokens | 2048 tokens |
| Key question | Can small models call tools? | Can small models think like agents? |
```bibtex
@misc{agenttune-2026,
  title={AgentTune: Teaching Small LLMs Multi-Step Agent Reasoning via QLoRA},
  author={ChengXie},
  year={2026},
  url={https://github.com/XIECHENG6/agenttune}
}
```

MIT
