DAgger (Dataset Aggregation) is an imitation learning algorithm designed to overcome a key limitation of Behavior Cloning (BC): distribution shift.

Start with expert demonstrations (like BC).

Train a student policy.

Run the student policy in the environment to collect new states.

Ask the expert what actions it would take in those new states.

Aggregate these new (state, expert-action) pairs into the dataset.

Retrain the student on the growing dataset.

Repeat steps 3–6 for multiple iterations.

In [5]:
import tempfile
import numpy as np
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.ppo import MlpPolicy

from imitation.algorithms import bc
from imitation.algorithms.dagger import SimpleDAggerTrainer

In [10]:

rng = np.random.default_rng(0)
env = gym.make("CartPole-v1",render_mode="human")
# expert = PPO(policy=MlpPolicy, env=env)
expert = PPO.load("models/ppo_cartpole_expert", env=env)
# expert.learn(1000)
venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])


Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [11]:

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    rng=rng,
)


In [12]:

with tempfile.TemporaryDirectory(prefix="dagger_example_") as tmpdir:
    print(tmpdir)
    dagger_trainer = SimpleDAggerTrainer(
        venv=venv,
        scratch_dir=tmpdir,
        expert_policy=expert,
        bc_trainer=bc_trainer,
        rng=rng,
    )
    dagger_trainer.train(2000)


/tmp/dagger_example_tc7xoc1i


Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

0batch [00:00, ?batch/s]

---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000693 |
|    entropy        | 0.693     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 72.5      |
|    loss           | 0.692     |
|    neglogp        | 0.693     |
|    prob_true_act  | 0.5       |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 23        |
|    return_mean    | 16.4      |
|    return_min     | 10        |
|    return_std     | 5.61      |
---------------------------------


184batch [00:00, 314.15batch/s]


Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1 [00:00<?, ? examples/s]

0batch [00:00, ?batch/s]

---------------------------------
| batch_size        | 32        |
| bc/               |           |
|    batch          | 0         |
|    ent_loss       | -0.000314 |
|    entropy        | 0.314     |
|    epoch          | 0         |
|    l2_loss        | 0         |
|    l2_norm        | 88.2      |
|    loss           | 0.295     |
|    neglogp        | 0.295     |
|    prob_true_act  | 0.789     |
|    samples_so_far | 32        |
| rollout/          |           |
|    return_max     | 133       |
|    return_mean    | 82.4      |
|    return_min     | 46        |
|    return_std     | 34.5      |
---------------------------------


372batch [00:01, 324.70batch/s]


In [16]:

reward, _ = evaluate_policy(dagger_trainer.policy, env, 10)
print("Reward:", reward)

error: display Surface quit

In [14]:
env.close()