DriCo: Planner / Actor / Coordinator

A three-LoRA-adapter system for cooperative cooking. A single base model (Qwen2.5-7B-Instruct) is fine-tuned with three separate LoRA adapters that each play a distinct role at inference time:

| Adapter | Role |
|---|---|
| planner | High-level subgoal generation in natural language (e.g. "Pick up carrot from counter and put it into pot0.") |
| actor | Low-level action chunking: turns a subgoal into up to 3 ml_actions (e.g. pickup(carrot, ingredient_dispenser)) |
| coordinator | Shared-context generation and critique of planner subgoals (PASS / REJECT + feedback) |

At eval time, the coordinator's REJECT events trigger the planner to refine its subgoal (up to 2 retries) before handing off to the actor.
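
The refine-on-REJECT behaviour can be pictured as a small loop. The sketch below is illustrative only and is not the repo's implementation: `plan`, `critique`, `plan_with_critique`, and `MAX_RETRIES` are hypothetical names, and the two callables stand in for generations from the planner and coordinator adapters. What happens once the retries are exhausted is not stated here, so the sketch assumes the last subgoal is passed on unchanged.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 2  # "up to 2 retries"

# Hypothetical signatures:
#   plan(state, feedback) -> subgoal text (feedback is None on the first call)
#   critique(state, subgoal) -> (verdict, feedback), verdict is "PASS" or "REJECT"
PlanFn = Callable[[str, Optional[str]], str]
CritiqueFn = Callable[[str, str], Tuple[str, str]]


def plan_with_critique(state: str, plan: PlanFn, critique: CritiqueFn) -> Tuple[str, int]:
    """Return the subgoal handed to the actor and the number of retries used."""
    subgoal = plan(state, None)
    for retries in range(MAX_RETRIES + 1):
        verdict, feedback = critique(state, subgoal)
        if verdict == "PASS" or retries == MAX_RETRIES:
            # Assumption: once retries run out, the last subgoal is used as-is.
            return subgoal, retries
        subgoal = plan(state, feedback)  # refine using the coordinator's feedback
    return subgoal, MAX_RETRIES  # not reached; kept for type checkers
```

With this shape, one subgoal costs at most three planner calls and three coordinator critiques.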


Quick start

End-to-end: clone the repo, pull the pre-trained adapters and data from HuggingFace into the right folders, and run evaluation.

# 1) Clone the project
git clone https://github.com/anonymous-projectpage/DriCo.git
cd DriCo

# 2) Install dependencies
pip install -U huggingface_hub
pip install -r requirements.txt        # if a requirements file exists
# (or install torch / transformers / peft / overcooked_ai / wandb manually)

# 3) Pull the LoRA adapters into drico/out/
hf download anonymous-24421/DriCo-adapters \
    --local-dir drico/out

# 4) (Optional) pull the training / DPO data into drico/dataset/
hf download anonymous-24421/DriCo \
    --repo-type dataset \
    --local-dir drico/dataset

# 5) Run evaluation
cd drico
CUDA_VISIBLE_DEVICES=0 python eval_proposed.py \
    --base_model           Qwen/Qwen2.5-7B-Instruct \
    --adapter_root         out/ \
    --layout               new_env \
    --recipe_split         test \
    --statistics_save_dir  data/eval_test

After step 3, the directory tree looks like:

DriCo/
└── drico/
    └── out/
        ├── planner/             ← LoRA adapter weights
        ├── actor/
        ├── coordinator/
        └── coordinator_dpo/

That's exactly the layout eval_proposed.py --adapter_root out/ expects, so step 5 just works.
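
Before launching a long evaluation, it can help to confirm the download landed where expected. The check below is a minimal sketch, assuming the adapters are stored in the standard PEFT format (each adapter folder contains an `adapter_config.json`); `check_adapter_root` and `ROLES` are hypothetical names, not part of the repo.

```python
from pathlib import Path
from typing import Dict, Iterable

ROLES = ("planner", "actor", "coordinator")


def check_adapter_root(adapter_root: str, roles: Iterable[str] = ROLES) -> Dict[str, bool]:
    """Map each role to whether <adapter_root>/<role>/adapter_config.json exists."""
    root = Path(adapter_root)
    return {role: (root / role / "adapter_config.json").is_file() for role in roles}


if __name__ == "__main__":
    # Run from inside drico/ so that "out" resolves to drico/out.
    for role, ok in check_adapter_root("out").items():
        print(f"{role:12s} {'ok' if ok else 'MISSING'}")
```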


1. Repo layout

The training/eval code lives in drico/. This README sits at the repo root.

.
├── README.md                       # ← this file
├── drico/
│   ├── agent/
│   │   ├── hf_agent.py             # Planner+Actor agent (no collab.* deps)
│   │   ├── human_agent.py          # Human CLI agent (drop-in replacement)
│   │   ├── parsers.py              # Parses Planner/Actor outputs
│   │   ├── hf_model.py             # HuggingFace model wrapper
│   │   └── leader.py               # Coordinator agent
│   │
│   ├── prompts/
│   │   ├── recipe/                 # train split (seen recipes)
│   │   └── test/recipe/            # test split (unseen recipes)
│   │
│   ├── dataset/
│   │   ├── planner/                # planner SFT data
│   │   ├── actor/                  # actor SFT data
│   │   ├── coordinator/            # coordinator SFT data
│   │   └── coordinator_dpo/
│   │       └── dpo.json            # coordinator DPO data
│   │
│   ├── train_planner_coordinator.py
│   ├── train_actor.py
│   ├── train_coordinator_dop.py
│   ├── eval_proposed.py
│   └── out/                        # default LoRA adapter output
│       ├── planner/
│       ├── actor/
│       └── coordinator/            # (and coordinator_dpo/ after DPO)
│
└── … (the rest of the repo: upgraded benchmark code, layouts, etc.)

All command-line examples below assume you are running from inside drico/:

cd drico

2. Training

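All adapters share the same base model (Qwen/Qwen2.5-7B-Instruct); only the LoRA weights differ. The planner and coordinator SFT adapters come from one script (2.1), the actor has its own script (2.2), and the coordinator can then be refined with DPO (2.3).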

2.1 Planner + Coordinator (SFT)

Trains the planner and coordinator adapters together from one script:

python train_planner_coordinator.py \
    --data_dir   dataset/planner/ \
    --base_model Qwen/Qwen2.5-7B-Instruct \
    --output_dir out/ \
    --epochs     3 \
    --bf16

Produces out/planner/ and out/coordinator/.

2.2 Actor (SFT)

python train_actor.py \
    --data_dir   dataset/actor/ \
    --base_model Qwen/Qwen2.5-7B-Instruct \
    --output_dir out/ \
    --bf16

Produces out/actor/.

2.3 Coordinator (DPO refinement)

After the coordinator SFT adapter is ready, refine it with DPO on preference pairs:

python train_coordinator_dop.py \
    --data_path            dataset/coordinator_dpo/dpo.json \
    --coordinator_adapter  out/coordinator/ \
    --out_coordinator      out/coordinator_dpo \
    --coordinator_data_dir dataset/coordinator/

Produces out/coordinator_dpo/. Swap this in for out/coordinator/ at eval time if you want the DPO-refined coordinator.


3. Evaluation

Single CLI for the full Planner → Coordinator-Critique → Actor pipeline.

3.1 Basic — all LLMs, test split

CUDA_VISIBLE_DEVICES=1 python eval_proposed.py \
    --base_model            Qwen/Qwen2.5-7B-Instruct \
    --adapter_root          out/ \
    --layout                circuit_env \
    --recipe_split          test \
    --statistics_save_dir   data/eval_test

The script:

  1. Loads out/{planner,actor,coordinator} as three named adapters on one shared base model.
  2. Evaluates every recipe in prompts/test/recipe.
  3. Computes DriCo-style PC (completed_milestones / total_milestones, with success ⇒ PC=1.0).
  4. Writes per-recipe JSONs and a summary at data/eval_test/eval_summary_test.json.
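
The PC rule in step 3 amounts to a ratio with a success override. A minimal sketch follows; the function name and the handling of a recipe with zero milestones are assumptions, not taken from the eval script.

```python
def progress_completeness(completed_milestones: int, total_milestones: int, success: bool) -> float:
    """PC = completed / total milestones, with a successful episode pinned to 1.0."""
    if success:
        return 1.0
    if total_milestones <= 0:
        # Assumption: a recipe with no milestones scores 0 unless it succeeded.
        return 0.0
    return completed_milestones / total_milestones
```

For example, an unsuccessful episode that reached 2 of 4 milestones scores 0.5, while any successful episode scores 1.0 regardless of the milestone count recorded.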

3.2 Run a human against the LLM

Either player can be controlled from the CLI:

# Chef = human, Assistant = LLM
CUDA_VISIBLE_DEVICES=1 python eval_proposed.py \
    --base_model          Qwen/Qwen2.5-7B-Instruct \
    --adapter_root        out/ \
    --layout              circuit_env \
    --recipe_split        test \
    --statistics_save_dir data/eval_test \
    --human_player        p1

| Flag value | Effect |
|---|---|
| none (default) | Both players LLM-controlled |
| p1 / chef | First player (Chef) typed by human |
| p2 / assistant | Second player (Assistant) typed by human |

When a human is in the game, --episode is forced to 1.
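
The alias handling and the episode override can be summarised in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the 0/1 player indices are illustrative rather than the values the script uses internally.

```python
from typing import Optional

# Assumption: index 0 = Chef, index 1 = Assistant (illustrative only).
_HUMAN_ALIASES = {"none": None, "p1": 0, "chef": 0, "p2": 1, "assistant": 1}


def human_player_index(value: str) -> Optional[int]:
    """Resolve a --human_player value to a player index, or None for an all-LLM team."""
    key = value.strip().lower()
    if key not in _HUMAN_ALIASES:
        raise ValueError(f"unknown --human_player value: {value!r}")
    return _HUMAN_ALIASES[key]


def effective_episodes(episode: int, human_player: str) -> int:
    """With a human in the game the episode count is forced to 1."""
    return 1 if human_player_index(human_player) is not None else episode
```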

The human types an ml_action string whenever a new one is needed; each action auto-repeats across timesteps until it completes. A cheat-sheet of the action grammar is shown on the first prompt, and every prompt also lists the actions that look valid in the current state. Examples of what to type:

pickup(carrot, ingredient_dispenser)
put_obj_in_utensil(pot0)
cook(pot0)
wait(3)
deliver_soup()

Commands:

  • Enter / r → repeat last action
  • h → reprint the grammar
  • q → end the episode
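
The example actions all share a `name(arg, ...)` shape. The parser below is a minimal sketch inferred from those five examples only; it is not the repo's agent/parsers.py, and `parse_ml_action` is a hypothetical name. The real grammar may accept or reject more than this does.

```python
import re
from typing import List, Tuple

# name(arg1, arg2, ...) with optional surrounding whitespace
_ACTION_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*\((.*)\)\s*$")


def parse_ml_action(text: str) -> Tuple[str, List[str]]:
    """Split an action string into its name and a list of stripped arguments."""
    match = _ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"not an ml_action: {text!r}")
    name, raw_args = match.groups()
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return name, args
```

Arguments stay as strings, so `wait(3)` yields the argument `"3"`; converting it to an integer is left to the caller.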

The human's last action is also converted to a natural-language subgoal that the partner LLM sees as the teammate's intent (so the partner can plan around it without any special-casing in LLM-side code).

3.3 Other useful flags

# Different recipe splits
--recipe_split test      # prompts/test/recipe       (default)
--recipe_split train     # prompts/recipe
--recipe_split both      # both folders (test wins on name conflicts)

# Restrict to specific recipes (comma-separated)
--orders pea_soup,mashed_zucchini_and_pea_patty

# Restrict to specific difficulty levels
--levels 1               # only level 1
--levels 1,3             # levels 1 and 3
--levels 2-4             # levels 2,3,4

# Turn the coordinator critique on/off
--use_critique True      # default — every subgoal reviewed, refined on REJECT
--use_critique False     # planner output goes straight to actor

# Episode budget
--episode 3              # 3 episodes per recipe (ignored if --human_player set)
--horizon  80            # max timesteps per episode

# Generation / dtype
--planner_max_tokens 128
--actor_max_tokens    64
--dtype               bf16   # bf16 | fp16 | fp32
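
The three --levels forms shown above (single value, comma list, inclusive range) can be expanded with a short helper. This is a sketch of one plausible reading of the syntax, not the script's own parser; `parse_levels` is a hypothetical name, and mixing forms such as `1,3-4` is an assumption.

```python
from typing import List


def parse_levels(spec: str) -> List[int]:
    """Expand a spec such as "1", "1,3" or "2-4" into a sorted list of levels."""
    levels = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            levels.update(range(lo, hi + 1))  # ranges are inclusive
        else:
            levels.add(int(part))
    return sorted(levels)
```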

3.4 Combined examples

# Train-split recipes, level 1 only
python eval_proposed.py ... --recipe_split train --levels 1

# Both splits, just one recipe, human as Chef
python eval_proposed.py ... \
    --recipe_split both \
    --orders pea_soup \
    --human_player p1

# Pure LLM eval, no critique, levels 2 through 4
python eval_proposed.py ... \
    --recipe_split test \
    --levels 2-4 \
    --use_critique False

4. Output files

For each evaluation run, the script writes:

data/eval_test/
├── diff1_pea_soup/
│   ├── pea_soup_ep1_2026-05-12_…json     # per-episode trace
│   └── pea_soup_ep2_2026-05-12_…json
├── diff2_carrot_soup/
│   └── …
└── eval_summary_test.json                 # aggregated metrics

eval_summary_<suffix>.json contains:

  • Per-recipe and per-difficulty success rate (SR) and progress completeness (PC, DriCo-style)
  • Average first_reward_t over successful episodes (completion speed)
  • Drift analysis: how often the coordinator REJECTed, what drift types it flagged, how often refinement resolved them within 3 attempts
  • Coordinator call counts (shared-context vs. critique-initial vs. critique-refine)

The filename suffix encodes the active filters (e.g. eval_summary_both_lv1-3_orders-pea_soup_human-p1.json) so re-runs don't clobber prior results.
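
One naming scheme that reproduces both filenames mentioned in this section is sketched below. It is a guess at the pattern from those two examples, not the script's code: `summary_filename` is a hypothetical name, and how the script renders comma-separated levels or multiple orders is not shown here, so the filter values are passed through verbatim.

```python
from typing import Optional


def summary_filename(
    recipe_split: str,
    levels: Optional[str] = None,
    orders: Optional[str] = None,
    human_player: Optional[str] = None,
) -> str:
    """Build eval_summary_<suffix>.json from the active filters."""
    parts = [recipe_split]
    if levels:
        parts.append(f"lv{levels}")
    if orders:
        parts.append(f"orders-{orders}")
    if human_player:
        parts.append(f"human-{human_player}")
    return "eval_summary_" + "_".join(parts) + ".json"
```

Because every active filter adds its own segment, two runs with different filters write to different summary files.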


5. End-to-end example

# 1. Train all three adapters
python train_planner_coordinator.py \
    --data_dir dataset/planner/ \
    --base_model Qwen/Qwen2.5-7B-Instruct \
    --output_dir out/ --epochs 3 --bf16

python train_actor.py \
    --data_dir dataset/actor/ \
    --base_model Qwen/Qwen2.5-7B-Instruct \
    --output_dir out/ --bf16

python train_coordinator_dop.py \
    --data_path dataset/coordinator_dpo/dpo.json \
    --coordinator_adapter out/coordinator/ \
    --out_coordinator out/coordinator_dpo \
    --coordinator_data_dir dataset/coordinator/

# 2. Evaluate on test split — full LLM team
CUDA_VISIBLE_DEVICES=1 python eval_proposed.py \
    --base_model Qwen/Qwen2.5-7B-Instruct \
    --adapter_root out/ \
    --layout circuit_env \
    --recipe_split test \
    --statistics_save_dir data/eval_test

# 3. Same eval, but you play the Chef
CUDA_VISIBLE_DEVICES=1 python eval_proposed.py \
    --base_model Qwen/Qwen2.5-7B-Instruct \
    --adapter_root out/ \
    --layout circuit_env \
    --recipe_split test \
    --statistics_save_dir data/eval_test \
    --human_player p1

6. Pre-trained adapters & data on HuggingFace

The LoRA adapters and training data are hosted on HuggingFace and mirror the local folder layout, so a one-shot download (see Quick start above) drops everything in the right place.

Asset Repo
LoRA adapters (4) https://huggingface.co/anonymous-24421/DriCo-adapters
Training / DPO data https://huggingface.co/datasets/anonymous-24421/DriCo

Selective download

If you only need a subset:

# Just the actor adapter
hf download anonymous-24421/DriCo-adapters \
    --include "actor/*" \
    --local-dir drico/out

# Just the DPO-refined coordinator
hf download anonymous-24421/DriCo-adapters \
    --include "coordinator_dpo/*" \
    --local-dir drico/out

# Just the DPO data
hf download anonymous-24421/DriCo \
    --repo-type dataset \
    --include "coordinator_dpo/*" \
    --local-dir drico/dataset

Manual loading in Python

If you want to use the adapters in your own code instead of going through the eval script:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "Qwen/Qwen2.5-7B-Instruct"
tok  = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype="bfloat16", device_map="auto",
)

# Register all three roles as named adapters on the same base model
model = PeftModel.from_pretrained(
    model, "anonymous-24421/DriCo-adapters",
    subfolder="planner", adapter_name="planner",
)
model.load_adapter("anonymous-24421/DriCo-adapters",
                   subfolder="actor", adapter_name="actor")
model.load_adapter("anonymous-24421/DriCo-adapters",
                   subfolder="coordinator_dpo", adapter_name="coordinator")

# Switch role per generation
model.set_adapter("planner")
# ...generate planner subgoal...

model.set_adapter("actor")
# ...generate action chunk...

About

2026 NeurIPS code
