VALM

A JAX-based research framework for online reinforcement learning (RL) with large language models (LLMs).

Overview

This project explores efficient multi-step RL for LLMs and investigates value approximation strategies for language tasks. The project features:

Custom Qwen3 Implementation: A from-scratch JAX/Flax implementation optimized for RL training with integrated value networks
Parameter-Efficient Fine-Tuning: LoRA support for attention and MLP layers
Fast Rust Environments: Wordle and Arithmetic environments implemented in Rust via PyO3
Flexible Training: Support for both online RL and offline value network pre-training
Visualization Tools: Episode viewer for debugging and analysis

Installation

Prerequisites

Python 3.13+
Rust (for building environments)
CUDA-compatible GPU (recommended)

Setup

# Clone the repository
git clone <repo-url>
cd vaml

# Install with uv (recommended)
uv sync

Running uv sync compiles the Rust environments and installs all Python dependencies.

Download Base Model

huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 \
    --local-dir ./base-models/Qwen/Qwen3-4B-Instruct-2507 \
    --exclude "*.bin"

Quick Start

Train with Online RL

uv run vaml train configs/test.json

Train Value Network (Offline)

# First, build offline data
uv run vaml build-offline configs/offline.json ./offline_data 100 100

# Then train value network
uv run vaml train-value configs/value-net.json ./offline_data
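
This README does not say how regression targets for the value network are derived from the offline data. As an illustration only, one common choice is to regress the value estimate at each step onto the discounted return-to-go of the episode. The sketch below assumes that setup; the function name discounted_returns and the gamma parameter are hypothetical and are not taken from the vaml code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for every step of one episode.

    Assumption: value targets are plain discounted returns. The project may
    use a different target (for example bootstrapped or lambda-returns).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step reuses the return of the next one.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A sparse-reward episode: only the final step is rewarded.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

With a sparse terminal reward, as in a game that is only scored when it ends, earlier steps receive geometrically smaller targets.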

Evaluate Models

# Evaluate an OpenRouter model
uv run vaml eval openrouter openrouter/meta-llama/llama-3.3-8b-instruct:free --env wordle

# Evaluate a trained checkpoint
uv run vaml eval checkpoint <experiment-name> --episodes 100

Environments

Wordle

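Guess a 5-letter word in 6 tries. Each guess receives per-letter feedback: G = green (correct letter in the correct position), Y = yellow (letter is in the word but in a different position), - = grey (letter is not in the word).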
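
The feedback rule above can be sketched in a few lines of Python. This is not the project's Rust implementation: the function name wordle_feedback is hypothetical, and the handling of repeated letters (each target letter can justify at most one G or Y) follows the usual Wordle convention, which this README does not confirm.

```python
from collections import Counter


def wordle_feedback(guess: str, target: str) -> str:
    """Return one marker per guessed letter: G, Y, or -."""
    if len(guess) != 5 or len(target) != 5:
        raise ValueError("guess and target must both be 5 letters")
    guess, target = guess.lower(), target.lower()

    marks = ["-"] * 5
    # Target letters that were not matched exactly stay available for yellows.
    remaining = Counter(t for g, t in zip(guess, target) if g != t)

    # First pass: exact matches become green.
    for i, (g, t) in enumerate(zip(guess, target)):
        if g == t:
            marks[i] = "G"

    # Second pass: non-green letters become yellow while the target still has
    # an unmatched copy of that letter (assumed duplicate-letter convention).
    for i, g in enumerate(guess):
        if marks[i] != "G" and remaining[g] > 0:
            marks[i] = "Y"
            remaining[g] -= 1

    return "".join(marks)


if __name__ == "__main__":
    print(wordle_feedback("crane", "react"))
```

Doing the green pass before the yellow pass keeps a repeated guess letter from claiming a target letter that is already matched in place.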

Arithmetic

Solve arithmetic expressions (+, -, *, /) with numbers up to 10,000.
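
As a rough illustration of this kind of task, the sketch below generates a two-operand problem and checks an answer exactly. Everything here is an assumption rather than the behaviour of the Rust environment: the names make_problem, solve and check are hypothetical, "up to 10,000" is read as a bound on each operand, and division is treated as exact rational division.

```python
import random
from fractions import Fraction


def make_problem(rng: random.Random, max_operand: int = 10_000) -> str:
    """Build an expression such as '412 * 97' (assumed problem format)."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    op = rng.choice("+-*/")
    if op == "/" and b == 0:
        b = 1  # avoid division by zero
    return f"{a} {op} {b}"


def solve(expr: str) -> Fraction:
    """Evaluate 'a op b' exactly using rational arithmetic."""
    left, op, right = expr.split()
    a, b = Fraction(left), Fraction(right)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        return a / b
    raise ValueError(f"unsupported operator: {op}")


def check(expr: str, answer: str) -> bool:
    """Accept integer, decimal, or 'p/q' answers that equal the exact result."""
    try:
        return Fraction(answer) == solve(expr)
    except (ValueError, ZeroDivisionError):
        return False


if __name__ == "__main__":
    problem = make_problem(random.Random(0))
    print(problem, "=", solve(problem))
```

Using Fraction avoids floating-point rounding when comparing answers to division problems.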

Note: This project is in early stages of development.
