# Lab 9: Inverse RL & Reward Inference
### Adversarial Inverse Reinforcement Learning (AIRL) on **LunarLander-v2**

In this lab you will:

1. Train or load an **expert** policy with PPO (forward RL).
2. Collect **demonstrations** from the expert.
3. Train **AIRL** to infer a reward and learn a policy from demos.

Run on Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/duke-trust-lab/intro_modern_rl/blob/main/lab9/lab9.ipynb)

---

## Learning objectives

By the end of this notebook, you should be able to:

- Explain the difference between **RL**, **IRL**, and **imitation learning**.
- Describe (at a high level) how **AIRL** uses an adversarial game to learn a reward.
- Run an AIRL pipeline end-to-end and evaluate whether the learned policy matches expert behavior.


### Concept refresher

**Forward RL:** reward is given, learn a policy.

- Policy:  π(a|s)
- Objective: maximize expected return  
  **J(π) = E[∑ γᵗ r(sₜ, aₜ)]**
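
As a quick illustration of the objective above, the discounted return of a single trajectory can be computed directly from its rewards. This is a minimal sketch; the function name `discounted_return` and the toy reward list are made up for illustration and are not part of the lab code.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_t gamma^t * r_t for one trajectory (a list of rewards)."""
    total = 0.0
    # Iterate backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy rewards, not taken from LunarLander.
toy_rewards = [1.0, 0.0, 2.0]
G = discounted_return(toy_rewards, gamma=0.5)
print(G)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

J(π) is the expectation of this quantity over trajectories sampled from the policy.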

**Inverse RL:** demonstrations are given, infer a reward (and usually a policy).

- Given expert trajectories τᴱ = (s₀,a₀,s₁,a₁,…)
- Find r̂(s,a) such that the optimal (or near-optimal) policy under r̂ explains the demonstrations.

**AIRL (Adversarial IRL):** trains a discriminator to distinguish expert transitions from policy transitions.
The discriminator structure corresponds to a learned reward + shaping term, encouraging a *reward that generalizes* beyond the expert’s exact state distribution.
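
To make the discriminator structure concrete, here is a small sketch of the form used in the AIRL paper, D = exp(f) / (exp(f) + π(a|s)), where f is the learned reward plus the shaping term. The function names and the toy numbers are illustrative assumptions; a library implementation may compute the policy-side reward differently.

```python
import math


def airl_discriminator(f_value, log_pi):
    """D = exp(f) / (exp(f) + pi(a|s)), computed through its logit f - log pi."""
    logit = f_value - log_pi
    return 1.0 / (1.0 + math.exp(-logit))


def policy_reward(f_value, log_pi):
    """log D - log(1 - D), which simplifies to f - log pi."""
    d = airl_discriminator(f_value, log_pi)
    return math.log(d) - math.log(1.0 - d)


# Toy numbers: f(s, a, s') = 0.3 and the policy picks this action with probability 0.25.
f_value, log_pi = 0.3, math.log(0.25)
d = airl_discriminator(f_value, log_pi)
r_hat = policy_reward(f_value, log_pi)
print(f"D = {d:.3f}, reward = {r_hat:.3f}")
```

Because the discriminator's logit is f − log π, training the discriminator with a standard binary cross-entropy loss is what shapes f into a reward estimate.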



In [None]:
!pip -q uninstall -y gymnasium shimmy stable-baselines3 imitation box2d-py Box2D pygame

!pip -q install -U pip setuptools wheel
!apt-get -qq update
!apt-get -qq install -y swig ffmpeg

!pip -q install "gymnasium==0.29.1" "shimmy==1.3.0" "stable-baselines3==2.3.2" "imitation==1.0.0" "imageio[ffmpeg]"

!pip -q install "Box2D" "pygame==2.5.2" --only-binary=:all:

In [None]:
import os, numpy as np, torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import imageio.v2 as imageio
import gymnasium as gym

from stable_baselines3 import PPO
from imitation.util.util import make_vec_env
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from imitation.algorithms.adversarial.airl import AIRL
from imitation.rewards.reward_nets import BasicShapedRewardNet

OUT_DIR = "outputs_lab11"
os.makedirs(OUT_DIR, exist_ok=True)




We’ll use:
- `evaluate_policy`: compute mean/std episodic return for a policy
- `record_video`: record an MP4 for qualitative inspection

> Qualitative inspection matters in IRL because a high return is not the only thing we care about; we also care about **matching the expert’s behavior**.
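
One simple way to quantify behavior matching, alongside watching videos, is to measure how often two policies pick the same action on a shared set of states. The sketch below uses hypothetical stand-in policies over a one-dimensional state; with real models, each callable would wrap `model.predict`. The helper name `action_agreement` is an assumption, not part of the lab code.

```python
def action_agreement(policy_a, policy_b, states):
    """Fraction of states on which two deterministic policies choose the same action."""
    if not states:
        raise ValueError("states must be non-empty")
    matches = sum(1 for s in states if policy_a(s) == policy_b(s))
    return matches / len(states)


# Hypothetical stand-in policies (thresholds chosen arbitrarily).
def expert_like(s):
    return 0 if s < 0.0 else 1


def learner_like(s):
    return 0 if s < 0.5 else 1


states = [-1.0, -0.2, 0.1, 0.4, 0.9]
rate = action_agreement(expert_like, learner_like, states)
print(rate)  # the policies disagree on 0.1 and 0.4, so 3 of 5 match
```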


In [None]:
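def evaluate_policy(model, env_id="LunarLander-v2", n_episodes=10, deterministic=True, max_steps=1000, seed=0):
    """Run `n_episodes` episodes (seeded seed, seed+1, ...) and return (mean, std) of the episodic return."""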
    rewards = []
    for ep in range(n_episodes):
        env = gym.make(env_id)
        obs, _ = env.reset(seed=seed + ep)
        total = 0.0
        for _ in range(max_steps):
            action, _ = model.predict(obs, deterministic=deterministic)
            obs, r, terminated, truncated, _ = env.step(action)
            total += float(r)
            if terminated or truncated:
                break
        rewards.append(total)
        env.close()
    return float(np.mean(rewards)), float(np.std(rewards))


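def record_video(model, env_id="LunarLander-v2", out_dir=OUT_DIR, name="policy_demo", fps=30, max_steps=1000):
    """Roll out one deterministic episode, save it as `<out_dir>/<name>.mp4`, and return (path, episodic return)."""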
    env = gym.make(env_id, render_mode="rgb_array")
    frames = []
    obs, _ = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action, _ = model.predict(obs, deterministic=True)
        obs, r, terminated, truncated, _ = env.step(action)
        total += float(r)
        frames.append(env.render())
        if terminated or truncated:
            break
    env.close()
    path = os.path.join(out_dir, f"{name}.mp4")
    imageio.mimsave(path, frames, fps=fps, codec="libx264")
    print(f"Saved: {path} | Return: {total:.1f} | Steps: {len(frames)}")
    return path, total


### Train or load an Expert (Forward RL)

We first train an expert PPO policy on the environment’s **true reward**.  
AIRL will then try to reproduce the expert’s behavior **without access to that reward**, using only the demonstrations.


In [None]:
ENV_ID = "LunarLander-v2"
EXPERT_PATH = os.path.join(OUT_DIR, "expert_ppo.zip")

# PPO params

expert_cfg = dict(
    n_envs=8, #<--adjust as needed (16)
    total_timesteps=500_000, #<--adjust as needed (200_000)
    learning_rate=3e-4,
    n_steps=1024, #<--adjust as needed (2048)
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
)
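
To see what these settings imply for the training budget, the arithmetic below works out the rollout size and the number of gradient steps per PPO iteration. It is a back-of-the-envelope sketch: the variable names are illustrative, and the iteration count assumes training stops after the first rollout that reaches `total_timesteps`.

```python
import math

# Values copied from expert_cfg above.
n_envs, n_steps = 8, 1024
batch_size, n_epochs = 64, 10
total_timesteps = 500_000

# Transitions collected per PPO iteration (one rollout across all envs).
rollout_size = n_envs * n_steps
# Assumption: training runs whole rollouts until the budget is reached.
n_iterations = math.ceil(total_timesteps / rollout_size)
# Each epoch splits the rollout into minibatches of batch_size.
minibatches_per_epoch = rollout_size // batch_size
grad_steps_per_iteration = minibatches_per_epoch * n_epochs

print(rollout_size, n_iterations, minibatches_per_epoch, grad_steps_per_iteration)
# 8192 62 128 1280
```

Doubling `n_envs` or `n_steps` doubles the rollout size, which halves the number of policy updates for the same `total_timesteps`.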


In [None]:
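def load_or_train_expert(env_id=ENV_ID, expert_path=EXPERT_PATH, seed=42, verbose=1, **cfg):
    """Load the PPO expert from `expert_path` if it exists, otherwise train and save it.

    Returns (expert, vec_env, mean_return, std_return).
    """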
    rng = np.random.default_rng(seed=seed)
    env = make_vec_env(
        env_id,
        n_envs=cfg["n_envs"],
        rng=rng,
        post_wrappers=[lambda env, idx: RolloutInfoWrapper(env)]
    )

    if os.path.exists(expert_path):
        print("Loading existing expert:", expert_path)
        expert = PPO.load(expert_path, env=env)
    else:
        print("Training expert PPO")
        expert = PPO(
            "MlpPolicy",
            env,
            verbose=verbose,
            learning_rate=cfg["learning_rate"],
            n_steps=cfg["n_steps"],
            batch_size=cfg["batch_size"],
            n_epochs=cfg["n_epochs"],
            gamma=cfg["gamma"],
            gae_lambda=cfg["gae_lambda"],
            clip_range=cfg["clip_range"],
            ent_coef=cfg["ent_coef"],
            seed=seed,
        )
        expert.learn(total_timesteps=cfg["total_timesteps"])
        expert.save(expert_path)
        print("Saved expert to:", expert_path)

    mean_r, std_r = evaluate_policy(expert, env_id=env_id, n_episodes=10)
    print(f"Expert eval: {mean_r:.1f} ± {std_r:.1f} (10 eps)")

    return expert, env, mean_r, std_r

expert, vec_env, expert_mean, expert_std = load_or_train_expert(**expert_cfg)

Loading existing expert: outputs_lab11/expert_ppo.zip
Expert eval: 247.2 ± 19.5 (10 eps)


**Quick check:** record one expert episode as a qualitative sanity check.

If the expert is good, you’ll see controlled hovering and a safe landing (most of the time).


In [None]:
_ = record_video(expert, ENV_ID, OUT_DIR, name="expert_demo")




Saved: outputs_lab11/expert_demo.mp4 | Return: 254.8 | Steps: 438


---

### Collect expert demonstrations

AIRL trains on a dataset of transitions sampled from expert rollouts.  
We’ll collect at least `min_timesteps` transitions, gathered as complete trajectories.

> For IRL, the **coverage** of demonstrations matters: if the expert never visits some region of the state space, the inferred reward may be unreliable there.
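
Because demonstrations are gathered as whole trajectories, the final count usually overshoots `min_timesteps`. The sketch below is a simplified model of that stopping rule, not the library's implementation: `collect_until` is a hypothetical helper, and the episode lengths are invented.

```python
def collect_until(episode_lengths, min_timesteps):
    """Keep whole episodes until the running total reaches min_timesteps.

    `episode_lengths` stands in for rollouts; returns the episode lengths kept.
    """
    kept, total = [], 0
    for length in episode_lengths:
        if total >= min_timesteps:
            break
        kept.append(length)
        total += length
    return kept


# Hypothetical episode lengths; real LunarLander episodes vary in length.
kept = collect_until([300, 450, 280, 390, 310], min_timesteps=1000)
print(len(kept), sum(kept))  # 3 episodes, 1030 transitions
```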


In [None]:
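def collect_demonstrations(expert, env, min_timesteps=20_000, seed=42):
    """Roll out `expert` in `env` until at least `min_timesteps` transitions are collected.

    Returns (flattened transitions, list of trajectories).
    """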
    rng = np.random.default_rng(seed=seed)
    rollouts_ = rollout.rollout(
        expert,
        env,
        rollout.make_sample_until(min_timesteps=min_timesteps),
        rng=rng
    )
    transitions_ = rollout.flatten_trajectories(rollouts_)
    print(f"Collected transitions: {len(transitions_)}")
    print(f"Number of trajectories: {len(rollouts_)}")
    return transitions_, rollouts_

MIN_DEMO_TIMESTEPS = 50_000

transitions, rollouts_raw = collect_demonstrations(expert, vec_env, min_timesteps=MIN_DEMO_TIMESTEPS)


Collected transitions: 52187
Number of trajectories: 164


### Train AIRL

AIRL components:

- **Generator / policy** (PPO): produces rollouts and is updated to maximize the learned reward.
- **Discriminator**: tries to tell expert transitions apart from generator transitions.
- **Reward network**: parameterizes the discriminator and supplies the inferred reward that the generator is trained on.

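In this notebook we use a shaped reward network, `BasicShapedRewardNet`, which computes
`f(s,a,s') = r_θ(s,a) + γΦ(s') − Φ(s)`

Potential-based shaping leaves the optimal policy unchanged, so the shaping term lets the network represent *equivalent rewards* that produce the same optimal behavior.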
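
The reason the shaping term γΦ(s') − Φ(s) does not change which behavior is optimal is that its discounted sum along a trajectory telescopes to a quantity that depends only on the first and last states. The sketch below checks this numerically with made-up potential values; the helper name `discounted_sum` is illustrative.

```python
def discounted_sum(values, gamma):
    """Return sum_t gamma^t * values[t]."""
    return sum((gamma ** t) * v for t, v in enumerate(values))


gamma = 0.9
# Toy potentials Phi(s_0), ..., Phi(s_3) along one trajectory (made-up numbers).
phi = [2.0, -1.0, 0.5, 3.0]
# Shaping term at each step: gamma * Phi(s_{t+1}) - Phi(s_t).
shaping = [gamma * phi[t + 1] - phi[t] for t in range(len(phi) - 1)]

total_shaping = discounted_sum(shaping, gamma)
# The sum telescopes to gamma^T * Phi(s_T) - Phi(s_0).
expected = gamma ** (len(phi) - 1) * phi[-1] - phi[0]
print(total_shaping, expected)
```

Every trajectory from the same start state therefore receives the same offset −Φ(s₀), plus a final term that vanishes as the horizon grows or when the terminal potential is zero.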


In [None]:
def train_airl(
    transitions,
    env_id=ENV_ID,
    out_dir=OUT_DIR,
    load_existing=True,
    verbose=1,
    # env
    n_envs=8,
    env_seed=999,
    # reward net
    use_next_state=False,
    use_done=True,
    reward_hid_sizes=(256, 256),
    potential_hid_sizes=(256, 256),
    discount_factor=0.99,
    # generator PPO
    gen_learning_rate=3e-4,
    gen_n_steps=1024,
    gen_batch_size=64,
    gen_n_epochs=10,
    gen_gamma=0.99,
    gen_gae_lambda=0.95,
    gen_clip_range=0.2,
    gen_ent_coef=0.01,
    # discriminator
    disc_learning_rate=3e-4,
    n_disc_updates_per_round=4,
    demo_batch_size=512,
    # training
    gen_train_timesteps_multiplier=1.0,
    total_timesteps=300_000,
):
    # AIRL training environment
    env_airl = make_vec_env(
        env_id,
        n_envs=n_envs,
        rng=np.random.default_rng(seed=env_seed),
        post_wrappers=[lambda env, idx: RolloutInfoWrapper(env)]
    )

    reward_net = BasicShapedRewardNet(
        env_airl.observation_space,
        env_airl.action_space,
        use_state=True,
        use_action=True,
        use_next_state=use_next_state,
        use_done=use_done,
        reward_hid_sizes=list(reward_hid_sizes),
        potential_hid_sizes=list(potential_hid_sizes),
        discount_factor=discount_factor,
    )

    # Requested env steps for PPO per AIRL round; PPO collects whole rollouts of
    # n_envs * gen_n_steps, so the actual number of steps per round can be higher.
    gen_train_timesteps = int(n_envs * gen_n_steps * gen_train_timesteps_multiplier)

    learner = PPO(
        "MlpPolicy",
        env_airl,
        verbose=verbose,
        learning_rate=gen_learning_rate,
        n_steps=gen_n_steps,
        batch_size=gen_batch_size,
        n_epochs=gen_n_epochs,
        gamma=gen_gamma,
        gae_lambda=gen_gae_lambda,
        clip_range=gen_clip_range,
        ent_coef=gen_ent_coef,
    )

    policy_path = os.path.join(out_dir, "airl_policy.zip")
    reward_path = os.path.join(out_dir, "airl_reward_net.pt")

    if load_existing and os.path.exists(policy_path) and os.path.exists(reward_path):
        print("Loading existing AIRL artifacts")
        learner = PPO.load(policy_path, env=env_airl)
        reward_net.load_state_dict(torch.load(reward_path, map_location="cpu"))

    airl = AIRL(
        demonstrations=transitions,
        demo_batch_size=demo_batch_size,
        venv=env_airl,
        gen_algo=learner,
        reward_net=reward_net,
        allow_variable_horizon=True,
        disc_opt_kwargs={"lr": disc_learning_rate},
        n_disc_updates_per_round=n_disc_updates_per_round,
        gen_train_timesteps=gen_train_timesteps,
    )

    if not (load_existing and os.path.exists(policy_path) and os.path.exists(reward_path)):
        print(f"Training AIRL for {total_timesteps:,} timesteps")
        airl.train(total_timesteps=total_timesteps)
        print("Saving AIRL artifacts")
        airl.gen_algo.save(policy_path)
        torch.save(airl._reward_net.state_dict(), reward_path)

    mean_r, std_r = evaluate_policy(airl.gen_algo, env_id=env_id, n_episodes=20)
    print(f"AIRL eval: {mean_r:.1f} ± {std_r:.1f} (20 eps)")

    return airl, mean_r, std_r
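
To connect `gen_train_timesteps` to the training log, the arithmetic below plugs in the `airl_cfg` values that this notebook passes to `train_airl`. It is a sketch under one assumption: that the number of AIRL rounds is the integer quotient of `total_timesteps` and `gen_train_timesteps`, which is consistent with the 60-round progress bar in the log.

```python
# Values copied from the `airl_cfg` dictionary used in this notebook.
n_envs, gen_n_steps = 8, 2048
gen_train_timesteps_multiplier = 0.813
total_timesteps = 800_000

# Same formula as in train_airl.
gen_train_timesteps = int(n_envs * gen_n_steps * gen_train_timesteps_multiplier)
# Assumption: rounds = integer quotient of the two budgets.
n_rounds = total_timesteps // gen_train_timesteps
# One PPO rollout always fills n_envs * gen_n_steps transitions.
ppo_rollout_size = n_envs * gen_n_steps

print(gen_train_timesteps, n_rounds, ppo_rollout_size)
# 13320 60 16384
```

The rollout size of 16,384 matches the step by which `gen/time/total_timesteps` grows each round in the log, even though only 13,320 steps are requested per round.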


In [None]:
airl_cfg_fast = dict(
    n_envs=4,
    reward_hid_sizes=(256, 256),
    potential_hid_sizes=(256, 256),
    demo_batch_size=512,
    disc_learning_rate=3e-4,
    n_disc_updates_per_round=4,
    gen_learning_rate=3e-4,
    gen_n_steps=1024,
    gen_gae_lambda=0.95,
    gen_ent_coef=0.01,
    gen_train_timesteps_multiplier=1.0,
    total_timesteps=300_000,
    load_existing=True,
    verbose=1
)

airl_cfg = dict(
    n_envs=8,
    reward_hid_sizes=(512, 512),
    potential_hid_sizes=(512, 512),
    demo_batch_size=1024,
    disc_learning_rate=4.9e-4,
    n_disc_updates_per_round=4,
    gen_learning_rate=6.35e-4,
    gen_n_steps=2048,
    gen_gae_lambda=0.949,
    gen_ent_coef=0.0488,
    gen_train_timesteps_multiplier=0.813,
    total_timesteps=800_000, #<- May want to reduce, takes ~20 mins to run on an A100 with 800k timesteps
    load_existing=True,
    verbose=1
)

airl, airl_mean, airl_std = train_airl(transitions, **airl_cfg)


Using cuda device
Running with `allow_variable_horizon` set to True. Some algorithms are biased towards shorter or longer episodes, which may significantly confound results. Additionally, even unbiased algorithms can exploit the information leak from the termination condition, producing spuriously high performance. See https://imitation.readthedocs.io/en/latest/getting-started/variable-horizon.html for more information.
Training AIRL for 800,000 timesteps


round:   0%|          | 0/60 [00:00<?, ?it/s]

------------------------------------------
| raw/                        |          |
|    gen/rollout/ep_len_mean  | 91.3     |
|    gen/rollout/ep_rew_mean  | -174     |
|    gen/time/fps             | 2388     |
|    gen/time/iterations      | 1        |
|    gen/time/time_elapsed    | 6        |
|    gen/time/total_timesteps | 16384    |
------------------------------------------
--------------------------------------------------
| raw/                                |          |
|    disc/disc_acc                    | 0.5      |
|    disc/disc_acc_expert             | 1        |
|    disc/disc_acc_gen                | 0        |
|    disc/disc_entropy                | 0.485    |
|    disc/disc_loss                   | 0.924    |
|    disc/disc_proportion_expert_pred | 1        |
|    disc/disc_proportion_expert_true | 0.5      |
|    disc/global_step                 | 1        |
|    disc/n_expert                    | 1.02e+03 |
|    disc/n_generated                 | 1.02e+03 |
-

round:   2%|▏         | 1/60 [00:19<19:10, 19.50s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 83.1        |
|    gen/rollout/ep_rew_mean         | -204        |
|    gen/rollout/ep_rew_wrapped_mean | 1.7         |
|    gen/time/fps                    | 2375        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 32768       |
|    gen/train/approx_kl             | 0.019998265 |
|    gen/train/clip_fraction         | 0.296       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.37       |
|    gen/train/explained_variance    | 0.14343667  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | -0.0726     |
|    gen/train/n_updates             | 10          |
|    gen/train/policy_gradient_loss  | -0.0214     |
|    gen/train/value_loss            | 0.021  

round:   3%|▎         | 2/60 [00:38<18:35, 19.24s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 81.6        |
|    gen/rollout/ep_rew_mean         | -153        |
|    gen/rollout/ep_rew_wrapped_mean | -74.5       |
|    gen/time/fps                    | 2375        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 49152       |
|    gen/train/approx_kl             | 0.007217309 |
|    gen/train/clip_fraction         | 0.0473      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | 0.025496364 |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 0.525       |
|    gen/train/n_updates             | 20          |
|    gen/train/policy_gradient_loss  | -0.00394    |
|    gen/train/value_loss            | 2.75   

round:   5%|▌         | 3/60 [00:57<18:13, 19.18s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 82.4        |
|    gen/rollout/ep_rew_mean         | -164        |
|    gen/rollout/ep_rew_wrapped_mean | -146        |
|    gen/time/fps                    | 2390        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 65536       |
|    gen/train/approx_kl             | 0.009178352 |
|    gen/train/clip_fraction         | 0.0587      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | 0.5140095   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 1.17        |
|    gen/train/n_updates             | 30          |
|    gen/train/policy_gradient_loss  | -0.00185    |
|    gen/train/value_loss            | 15     

round:   7%|▋         | 4/60 [01:16<17:48, 19.08s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 83.7       |
|    gen/rollout/ep_rew_mean         | -171       |
|    gen/rollout/ep_rew_wrapped_mean | -205       |
|    gen/time/fps                    | 2367       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 6          |
|    gen/time/total_timesteps        | 81920      |
|    gen/train/approx_kl             | 0.00693228 |
|    gen/train/clip_fraction         | 0.0426     |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.35      |
|    gen/train/explained_variance    | 0.7518946  |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 3.62       |
|    gen/train/n_updates             | 40         |
|    gen/train/policy_gradient_loss  | -0.00211   |
|    gen/train/value_loss            | 23         |
------------

round:   8%|▊         | 5/60 [01:35<17:27, 19.05s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 83.9         |
|    gen/rollout/ep_rew_mean         | -141         |
|    gen/rollout/ep_rew_wrapped_mean | -220         |
|    gen/time/fps                    | 2371         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 98304        |
|    gen/train/approx_kl             | 0.0069801575 |
|    gen/train/clip_fraction         | 0.0605       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.35        |
|    gen/train/explained_variance    | 0.88821703   |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 1.73         |
|    gen/train/n_updates             | 50           |
|    gen/train/policy_gradient_loss  | -0.00266     |
|    gen/train/value_loss   

round:  10%|█         | 6/60 [01:54<17:08, 19.04s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 79.3       |
|    gen/rollout/ep_rew_mean         | -131       |
|    gen/rollout/ep_rew_wrapped_mean | -206       |
|    gen/time/fps                    | 2394       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 6          |
|    gen/time/total_timesteps        | 114688     |
|    gen/train/approx_kl             | 0.00796659 |
|    gen/train/clip_fraction         | 0.0984     |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.34      |
|    gen/train/explained_variance    | 0.96152395 |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 3.61       |
|    gen/train/n_updates             | 60         |
|    gen/train/policy_gradient_loss  | -0.0024    |
|    gen/train/value_loss            | 9.15       |
------------

round:  12%|█▏        | 7/60 [02:13<16:48, 19.03s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 79.7        |
|    gen/rollout/ep_rew_mean         | -113        |
|    gen/rollout/ep_rew_wrapped_mean | -195        |
|    gen/time/fps                    | 2371        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 131072      |
|    gen/train/approx_kl             | 0.013390884 |
|    gen/train/clip_fraction         | 0.118       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.33       |
|    gen/train/explained_variance    | 0.986164    |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 1.81        |
|    gen/train/n_updates             | 70          |
|    gen/train/policy_gradient_loss  | -0.00474    |
|    gen/train/value_loss            | 4.25   

round:  13%|█▎        | 8/60 [02:32<16:31, 19.07s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 77.5        |
|    gen/rollout/ep_rew_mean         | -105        |
|    gen/rollout/ep_rew_wrapped_mean | -217        |
|    gen/time/fps                    | 2319        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 147456      |
|    gen/train/approx_kl             | 0.009310478 |
|    gen/train/clip_fraction         | 0.0896      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.33       |
|    gen/train/explained_variance    | 0.99342215  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 0.951       |
|    gen/train/n_updates             | 80          |
|    gen/train/policy_gradient_loss  | -0.00647    |
|    gen/train/value_loss            | 4.59   

round:  15%|█▌        | 9/60 [02:52<16:20, 19.23s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 90.5        |
|    gen/rollout/ep_rew_mean         | -94.6       |
|    gen/rollout/ep_rew_wrapped_mean | -232        |
|    gen/time/fps                    | 2360        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 163840      |
|    gen/train/approx_kl             | 0.030023996 |
|    gen/train/clip_fraction         | 0.159       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.35       |
|    gen/train/explained_variance    | 0.99185264  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 4.16        |
|    gen/train/n_updates             | 90          |
|    gen/train/policy_gradient_loss  | -0.0112     |
|    gen/train/value_loss            | 5.92   

round:  17%|█▋        | 10/60 [03:11<15:59, 19.18s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 102         |
|    gen/rollout/ep_rew_mean         | -89.4       |
|    gen/rollout/ep_rew_wrapped_mean | -247        |
|    gen/time/fps                    | 2398        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 180224      |
|    gen/train/approx_kl             | 0.014600068 |
|    gen/train/clip_fraction         | 0.265       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.35       |
|    gen/train/explained_variance    | 0.9949081   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 2.62        |
|    gen/train/n_updates             | 100         |
|    gen/train/policy_gradient_loss  | -0.0235     |
|    gen/train/value_loss            | 6.66   

round:  18%|█▊        | 11/60 [03:30<15:35, 19.09s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 128         |
|    gen/rollout/ep_rew_mean         | -92.3       |
|    gen/rollout/ep_rew_wrapped_mean | -262        |
|    gen/time/fps                    | 2385        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 196608      |
|    gen/train/approx_kl             | 0.013583495 |
|    gen/train/clip_fraction         | 0.276       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.31       |
|    gen/train/explained_variance    | 0.99236506  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 4.96        |
|    gen/train/n_updates             | 110         |
|    gen/train/policy_gradient_loss  | -0.028      |
|    gen/train/value_loss            | 11.9   

round:  20%|██        | 12/60 [03:49<15:15, 19.07s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 150         |
|    gen/rollout/ep_rew_mean         | -92.6       |
|    gen/rollout/ep_rew_wrapped_mean | -303        |
|    gen/time/fps                    | 2389        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 212992      |
|    gen/train/approx_kl             | 0.014173923 |
|    gen/train/clip_fraction         | 0.259       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.27       |
|    gen/train/explained_variance    | 0.95643     |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 6.26        |
|    gen/train/n_updates             | 120         |
|    gen/train/policy_gradient_loss  | -0.0135     |
|    gen/train/value_loss            | 24     

round:  22%|██▏       | 13/60 [04:08<14:55, 19.06s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 171         |
|    gen/rollout/ep_rew_mean         | -99.7       |
|    gen/rollout/ep_rew_wrapped_mean | -402        |
|    gen/time/fps                    | 2389        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 229376      |
|    gen/train/approx_kl             | 0.013146279 |
|    gen/train/clip_fraction         | 0.191       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.19       |
|    gen/train/explained_variance    | 0.9664936   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 11.8        |
|    gen/train/n_updates             | 130         |
|    gen/train/policy_gradient_loss  | -0.0155     |
|    gen/train/value_loss            | 60.5   

round:  23%|██▎       | 14/60 [04:27<14:34, 19.02s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 213         |
|    gen/rollout/ep_rew_mean         | -89.8       |
|    gen/rollout/ep_rew_wrapped_mean | -406        |
|    gen/time/fps                    | 2405        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 245760      |
|    gen/train/approx_kl             | 0.012654444 |
|    gen/train/clip_fraction         | 0.185       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.14       |
|    gen/train/explained_variance    | 0.9771383   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 8.1         |
|    gen/train/n_updates             | 140         |
|    gen/train/policy_gradient_loss  | -0.0155     |
|    gen/train/value_loss            | 36.4   

round:  25%|██▌       | 15/60 [04:46<14:15, 19.01s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 263         |
|    gen/rollout/ep_rew_mean         | -76.6       |
|    gen/rollout/ep_rew_wrapped_mean | -381        |
|    gen/time/fps                    | 2363        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 262144      |
|    gen/train/approx_kl             | 0.013114579 |
|    gen/train/clip_fraction         | 0.139       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.14       |
|    gen/train/explained_variance    | 0.96705943  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 11.9        |
|    gen/train/n_updates             | 150         |
|    gen/train/policy_gradient_loss  | -0.0081     |
|    gen/train/value_loss            | 32.9   

round:  27%|██▋       | 16/60 [05:05<13:56, 19.01s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 291        |
|    gen/rollout/ep_rew_mean         | -80        |
|    gen/rollout/ep_rew_wrapped_mean | -332       |
|    gen/time/fps                    | 2355       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 6          |
|    gen/time/total_timesteps        | 278528     |
|    gen/train/approx_kl             | 0.01022627 |
|    gen/train/clip_fraction         | 0.116      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.15      |
|    gen/train/explained_variance    | 0.97546    |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 18         |
|    gen/train/n_updates             | 160        |
|    gen/train/policy_gradient_loss  | -0.00756   |
|    gen/train/value_loss            | 42.9       |
------------

round:  28%|██▊       | 17/60 [05:24<13:38, 19.03s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 359          |
|    gen/rollout/ep_rew_mean         | -87.7        |
|    gen/rollout/ep_rew_wrapped_mean | -376         |
|    gen/time/fps                    | 2327         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 7            |
|    gen/time/total_timesteps        | 294912       |
|    gen/train/approx_kl             | 0.0089989565 |
|    gen/train/clip_fraction         | 0.0941       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.15        |
|    gen/train/explained_variance    | 0.98840785   |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 8.16         |
|    gen/train/n_updates             | 170          |
|    gen/train/policy_gradient_loss  | -0.00685     |
|    gen/train/value_loss   

round:  30%|███       | 18/60 [05:43<13:21, 19.08s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 425         |
|    gen/rollout/ep_rew_mean         | -77.3       |
|    gen/rollout/ep_rew_wrapped_mean | -488        |
|    gen/time/fps                    | 2356        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 311296      |
|    gen/train/approx_kl             | 0.008378351 |
|    gen/train/clip_fraction         | 0.0941      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.16       |
|    gen/train/explained_variance    | 0.98018605  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 19.7        |
|    gen/train/n_updates             | 180         |
|    gen/train/policy_gradient_loss  | -0.00515    |
|    gen/train/value_loss            | 34          |
----------------------------------------------------

round:  32%|███▏      | 19/60 [06:02<13:03, 19.10s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 507          |
|    gen/rollout/ep_rew_mean         | -72.6        |
|    gen/rollout/ep_rew_wrapped_mean | -630         |
|    gen/time/fps                    | 2357         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 327680       |
|    gen/train/approx_kl             | 0.0146102905 |
|    gen/train/clip_fraction         | 0.178        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.15        |
|    gen/train/explained_variance    | 0.9506227    |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 12.6         |
|    gen/train/n_updates             | 190          |
|    gen/train/policy_gradient_loss  | -0.00606     |
|    gen/train/value_loss   

round:  33%|███▎      | 20/60 [06:21<12:45, 19.13s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 574         |
|    gen/rollout/ep_rew_mean         | -66.2       |
|    gen/rollout/ep_rew_wrapped_mean | -752        |
|    gen/time/fps                    | 2343        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 344064      |
|    gen/train/approx_kl             | 0.009841142 |
|    gen/train/clip_fraction         | 0.145       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.14       |
|    gen/train/explained_variance    | 0.87547255  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 18.8        |
|    gen/train/n_updates             | 200         |
|    gen/train/policy_gradient_loss  | -0.00751    |
|    gen/train/value_loss            | 36.2        |
----------------------------------------------------

round:  35%|███▌      | 21/60 [06:41<12:28, 19.18s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 680         |
|    gen/rollout/ep_rew_mean         | -71.1       |
|    gen/rollout/ep_rew_wrapped_mean | -798        |
|    gen/time/fps                    | 2403        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 360448      |
|    gen/train/approx_kl             | 0.008594587 |
|    gen/train/clip_fraction         | 0.113       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.1        |
|    gen/train/explained_variance    | 0.8140403   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 16.1        |
|    gen/train/n_updates             | 210         |
|    gen/train/policy_gradient_loss  | -0.00839    |
|    gen/train/value_loss            | 37.8        |
----------------------------------------------------

round:  37%|███▋      | 22/60 [07:00<12:07, 19.15s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 745         |
|    gen/rollout/ep_rew_mean         | -70.1       |
|    gen/rollout/ep_rew_wrapped_mean | -835        |
|    gen/time/fps                    | 2420        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 376832      |
|    gen/train/approx_kl             | 0.009158922 |
|    gen/train/clip_fraction         | 0.0983      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.04       |
|    gen/train/explained_variance    | 0.8241888   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 8.43        |
|    gen/train/n_updates             | 220         |
|    gen/train/policy_gradient_loss  | -0.01       |
|    gen/train/value_loss            | 36.9        |
----------------------------------------------------

round:  38%|███▊      | 23/60 [07:19<11:46, 19.10s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 830         |
|    gen/rollout/ep_rew_mean         | -68.1       |
|    gen/rollout/ep_rew_wrapped_mean | -838        |
|    gen/time/fps                    | 2424        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 393216      |
|    gen/train/approx_kl             | 0.008964132 |
|    gen/train/clip_fraction         | 0.105       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.01       |
|    gen/train/explained_variance    | 0.8205959   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 7.27        |
|    gen/train/n_updates             | 230         |
|    gen/train/policy_gradient_loss  | -0.0116     |
|    gen/train/value_loss            | 18.2        |
----------------------------------------------------

round:  40%|████      | 24/60 [07:38<11:26, 19.07s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 882         |
|    gen/rollout/ep_rew_mean         | -63.2       |
|    gen/rollout/ep_rew_wrapped_mean | -893        |
|    gen/time/fps                    | 2453        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 409600      |
|    gen/train/approx_kl             | 0.008498039 |
|    gen/train/clip_fraction         | 0.0805      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1          |
|    gen/train/explained_variance    | 0.8966123   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 1.93        |
|    gen/train/n_updates             | 240         |
|    gen/train/policy_gradient_loss  | -0.00725    |
|    gen/train/value_loss            | 7.44        |
----------------------------------------------------

round:  42%|████▏     | 25/60 [07:57<11:04, 18.99s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 950          |
|    gen/rollout/ep_rew_mean         | -60.1        |
|    gen/rollout/ep_rew_wrapped_mean | -925         |
|    gen/time/fps                    | 2338         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 7            |
|    gen/time/total_timesteps        | 425984       |
|    gen/train/approx_kl             | 0.0073961327 |
|    gen/train/clip_fraction         | 0.0977       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.04        |
|    gen/train/explained_variance    | 0.9293506    |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 0.844        |
|    gen/train/n_updates             | 250          |
|    gen/train/policy_gradient_loss  | -0.00426     |
|    gen/train/value_loss   

round:  43%|████▎     | 26/60 [08:16<10:47, 19.04s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 971          |
|    gen/rollout/ep_rew_mean         | -57          |
|    gen/rollout/ep_rew_wrapped_mean | -992         |
|    gen/time/fps                    | 2428         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 442368       |
|    gen/train/approx_kl             | 0.0077876355 |
|    gen/train/clip_fraction         | 0.0777       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.04        |
|    gen/train/explained_variance    | 0.8515475    |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 1.38         |
|    gen/train/n_updates             | 260          |
|    gen/train/policy_gradient_loss  | -0.00279     |
|    gen/train/value_loss   

round:  45%|████▌     | 27/60 [08:35<10:26, 18.99s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 992         |
|    gen/rollout/ep_rew_mean         | -48.3       |
|    gen/rollout/ep_rew_wrapped_mean | -1.06e+03   |
|    gen/time/fps                    | 2441        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 458752      |
|    gen/train/approx_kl             | 0.011778139 |
|    gen/train/clip_fraction         | 0.121       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.08       |
|    gen/train/explained_variance    | 0.7383288   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 1.33        |
|    gen/train/n_updates             | 270         |
|    gen/train/policy_gradient_loss  | 0.00035     |
|    gen/train/value_loss            | 2.57        |
----------------------------------------------------

round:  47%|████▋     | 28/60 [08:53<10:06, 18.94s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 996          |
|    gen/rollout/ep_rew_mean         | -46.1        |
|    gen/rollout/ep_rew_wrapped_mean | -1.14e+03    |
|    gen/time/fps                    | 2448         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 475136       |
|    gen/train/approx_kl             | 0.0065560783 |
|    gen/train/clip_fraction         | 0.0815       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.07        |
|    gen/train/explained_variance    | 0.76850665   |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 0.531        |
|    gen/train/n_updates             | 280          |
|    gen/train/policy_gradient_loss  | 0.000455     |
|    gen/train/value_loss   

round:  48%|████▊     | 29/60 [09:12<09:45, 18.89s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 997         |
|    gen/rollout/ep_rew_mean         | -42.6       |
|    gen/rollout/ep_rew_wrapped_mean | -1.26e+03   |
|    gen/time/fps                    | 2447        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 491520      |
|    gen/train/approx_kl             | 0.006533757 |
|    gen/train/clip_fraction         | 0.105       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.06       |
|    gen/train/explained_variance    | 0.80711824  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 1.54        |
|    gen/train/n_updates             | 290         |
|    gen/train/policy_gradient_loss  | 0.000214    |
|    gen/train/value_loss            | 4           |
----------------------------------------------------

round:  50%|█████     | 30/60 [09:31<09:26, 18.88s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 997         |
|    gen/rollout/ep_rew_mean         | -37         |
|    gen/rollout/ep_rew_wrapped_mean | -1.37e+03   |
|    gen/time/fps                    | 2446        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 507904      |
|    gen/train/approx_kl             | 0.016337097 |
|    gen/train/clip_fraction         | 0.128       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.08       |
|    gen/train/explained_variance    | 0.7910244   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 1.04        |
|    gen/train/n_updates             | 300         |
|    gen/train/policy_gradient_loss  | -0.000133   |
|    gen/train/value_loss            | 4.81        |
----------------------------------------------------

round:  52%|█████▏    | 31/60 [09:50<09:08, 18.90s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 996          |
|    gen/rollout/ep_rew_mean         | -27.9        |
|    gen/rollout/ep_rew_wrapped_mean | -1.44e+03    |
|    gen/time/fps                    | 2376         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 524288       |
|    gen/train/approx_kl             | 0.0151295755 |
|    gen/train/clip_fraction         | 0.189        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.12        |
|    gen/train/explained_variance    | 0.79348767   |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 4.5          |
|    gen/train/n_updates             | 310          |
|    gen/train/policy_gradient_loss  | -0.00752     |
|    gen/train/value_loss   

round:  53%|█████▎    | 32/60 [10:09<08:50, 18.95s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 991         |
|    gen/rollout/ep_rew_mean         | -19.8       |
|    gen/rollout/ep_rew_wrapped_mean | -1.5e+03    |
|    gen/time/fps                    | 2350        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 540672      |
|    gen/train/approx_kl             | 0.019468253 |
|    gen/train/clip_fraction         | 0.287       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.15       |
|    gen/train/explained_variance    | 0.81535655  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 16.4        |
|    gen/train/n_updates             | 320         |
|    gen/train/policy_gradient_loss  | -0.0229     |
|    gen/train/value_loss            | 37          |
----------------------------------------------------

round:  55%|█████▌    | 33/60 [10:28<08:34, 19.05s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 983        |
|    gen/rollout/ep_rew_mean         | 2.55       |
|    gen/rollout/ep_rew_wrapped_mean | -1.51e+03  |
|    gen/time/fps                    | 2321       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 7          |
|    gen/time/total_timesteps        | 557056     |
|    gen/train/approx_kl             | 0.02304464 |
|    gen/train/clip_fraction         | 0.332      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.19      |
|    gen/train/explained_variance    | 0.78125817 |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 19.8       |
|    gen/train/n_updates             | 330        |
|    gen/train/policy_gradient_loss  | -0.0251    |
|    gen/train/value_loss            | 39.7       |
---------------------------------------------------

round:  57%|█████▋    | 34/60 [10:48<08:17, 19.12s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 740        |
|    gen/rollout/ep_rew_mean         | 23.6       |
|    gen/rollout/ep_rew_wrapped_mean | -1.45e+03  |
|    gen/time/fps                    | 2335       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 7          |
|    gen/time/total_timesteps        | 573440     |
|    gen/train/approx_kl             | 0.02383396 |
|    gen/train/clip_fraction         | 0.353      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.23      |
|    gen/train/explained_variance    | 0.7263148  |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 16.7       |
|    gen/train/n_updates             | 340        |
|    gen/train/policy_gradient_loss  | -0.0311    |
|    gen/train/value_loss            | 47.9       |
---------------------------------------------------

round:  58%|█████▊    | 35/60 [11:07<07:59, 19.16s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 212         |
|    gen/rollout/ep_rew_mean         | -6.74       |
|    gen/rollout/ep_rew_wrapped_mean | -950        |
|    gen/time/fps                    | 2334        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 589824      |
|    gen/train/approx_kl             | 0.014448245 |
|    gen/train/clip_fraction         | 0.238       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.17       |
|    gen/train/explained_variance    | 0.7542894   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 32.5        |
|    gen/train/n_updates             | 350         |
|    gen/train/policy_gradient_loss  | -0.0127     |
|    gen/train/value_loss            | 67.5        |
----------------------------------------------------

round:  60%|██████    | 36/60 [11:26<07:40, 19.20s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 194         |
|    gen/rollout/ep_rew_mean         | -16.6       |
|    gen/rollout/ep_rew_wrapped_mean | -208        |
|    gen/time/fps                    | 2323        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 606208      |
|    gen/train/approx_kl             | 0.014469795 |
|    gen/train/clip_fraction         | 0.158       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.1        |
|    gen/train/explained_variance    | 0.85768807  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 20.4        |
|    gen/train/n_updates             | 360         |
|    gen/train/policy_gradient_loss  | -0.00845    |
|    gen/train/value_loss            | 74.7        |
----------------------------------------------------

round:  62%|██████▏   | 37/60 [11:45<07:22, 19.23s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 126         |
|    gen/rollout/ep_rew_mean         | -22.9       |
|    gen/rollout/ep_rew_wrapped_mean | -235        |
|    gen/time/fps                    | 2345        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 622592      |
|    gen/train/approx_kl             | 0.010459445 |
|    gen/train/clip_fraction         | 0.16        |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.11       |
|    gen/train/explained_variance    | 0.90367043  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 44.3        |
|    gen/train/n_updates             | 370         |
|    gen/train/policy_gradient_loss  | -0.00892    |
|    gen/train/value_loss            | 88.3        |
----------------------------------------------------

round:  63%|██████▎   | 38/60 [12:05<07:02, 19.23s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 110         |
|    gen/rollout/ep_rew_mean         | -45.6       |
|    gen/rollout/ep_rew_wrapped_mean | -147        |
|    gen/time/fps                    | 2355        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 638976      |
|    gen/train/approx_kl             | 0.011393763 |
|    gen/train/clip_fraction         | 0.156       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.04       |
|    gen/train/explained_variance    | 0.9313028   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 23.8        |
|    gen/train/n_updates             | 380         |
|    gen/train/policy_gradient_loss  | -0.00906    |
|    gen/train/value_loss            | 47.1        |
----------------------------------------------------

round:  65%|██████▌   | 39/60 [12:24<06:42, 19.19s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 96.9        |
|    gen/rollout/ep_rew_mean         | -43.7       |
|    gen/rollout/ep_rew_wrapped_mean | -148        |
|    gen/time/fps                    | 2393        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 655360      |
|    gen/train/approx_kl             | 0.010084178 |
|    gen/train/clip_fraction         | 0.121       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.01       |
|    gen/train/explained_variance    | 0.92222905  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 13.7        |
|    gen/train/n_updates             | 390         |
|    gen/train/policy_gradient_loss  | -0.00898    |
|    gen/train/value_loss            | 47.6        |
----------------------------------------------------

round:  67%|██████▋   | 40/60 [12:43<06:22, 19.12s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 95.7       |
|    gen/rollout/ep_rew_mean         | -38.5      |
|    gen/rollout/ep_rew_wrapped_mean | -158       |
|    gen/time/fps                    | 2360       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 6          |
|    gen/time/total_timesteps        | 671744     |
|    gen/train/approx_kl             | 0.01108063 |
|    gen/train/clip_fraction         | 0.149      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.02      |
|    gen/train/explained_variance    | 0.9092701  |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 4.83       |
|    gen/train/n_updates             | 400        |
|    gen/train/policy_gradient_loss  | -0.00858   |
|    gen/train/value_loss            | 20.8       |
---------------------------------------------------

round:  68%|██████▊   | 41/60 [13:02<06:02, 19.10s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 96.7         |
|    gen/rollout/ep_rew_mean         | -28.8        |
|    gen/rollout/ep_rew_wrapped_mean | -177         |
|    gen/time/fps                    | 2351         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 688128       |
|    gen/train/approx_kl             | 0.0093886005 |
|    gen/train/clip_fraction         | 0.125        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -0.984       |
|    gen/train/explained_variance    | 0.94580287   |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 7.01         |
|    gen/train/n_updates             | 410          |
|    gen/train/policy_gradient_loss  | -0.00541     |
|    gen/train/value_loss   

round:  70%|███████   | 42/60 [13:21<05:43, 19.10s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 96.2        |
|    gen/rollout/ep_rew_mean         | -22.4       |
|    gen/rollout/ep_rew_wrapped_mean | -198        |
|    gen/time/fps                    | 2374        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 704512      |
|    gen/train/approx_kl             | 0.009047637 |
|    gen/train/clip_fraction         | 0.125       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.948      |
|    gen/train/explained_variance    | 0.9646984   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 23.8        |
|    gen/train/n_updates             | 420         |
|    gen/train/policy_gradient_loss  | -0.00586    |
|    gen/train/value_loss            | 10.7        |
----------------------------------------------------

round:  72%|███████▏  | 43/60 [13:40<05:24, 19.06s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 104         |
|    gen/rollout/ep_rew_mean         | 0.429       |
|    gen/rollout/ep_rew_wrapped_mean | -209        |
|    gen/time/fps                    | 2357        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 720896      |
|    gen/train/approx_kl             | 0.011936614 |
|    gen/train/clip_fraction         | 0.145       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.944      |
|    gen/train/explained_variance    | 0.96847117  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 4.46        |
|    gen/train/n_updates             | 430         |
|    gen/train/policy_gradient_loss  | -0.00949    |
|    gen/train/value_loss            | 13.5        |
----------------------------------------------------

round:  73%|███████▎  | 44/60 [13:59<05:04, 19.04s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 121         |
|    gen/rollout/ep_rew_mean         | 14.2        |
|    gen/rollout/ep_rew_wrapped_mean | -205        |
|    gen/time/fps                    | 2370        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 737280      |
|    gen/train/approx_kl             | 0.012541825 |
|    gen/train/clip_fraction         | 0.158       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.912      |
|    gen/train/explained_variance    | 0.9839312   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 2.89        |
|    gen/train/n_updates             | 440         |
|    gen/train/policy_gradient_loss  | -0.011      |
|    gen/train/value_loss            | 7.17        |
----------------------------------------------------

round:  75%|███████▌  | 45/60 [14:18<04:45, 19.02s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 140         |
|    gen/rollout/ep_rew_mean         | 36.9        |
|    gen/rollout/ep_rew_wrapped_mean | -196        |
|    gen/time/fps                    | 2355        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 753664      |
|    gen/train/approx_kl             | 0.015490683 |
|    gen/train/clip_fraction         | 0.146       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.903      |
|    gen/train/explained_variance    | 0.9747607   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 3.69        |
|    gen/train/n_updates             | 450         |
|    gen/train/policy_gradient_loss  | -0.0115     |
|    gen/train/value_loss            | 9.51        |
----------------------------------------------------

round:  77%|███████▋  | 46/60 [14:37<04:26, 19.01s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 145         |
|    gen/rollout/ep_rew_mean         | 40.2        |
|    gen/rollout/ep_rew_wrapped_mean | -193        |
|    gen/time/fps                    | 2280        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 770048      |
|    gen/train/approx_kl             | 0.014834968 |
|    gen/train/clip_fraction         | 0.164       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.914      |
|    gen/train/explained_variance    | 0.9666593   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 6.48        |
|    gen/train/n_updates             | 460         |
|    gen/train/policy_gradient_loss  | -0.00678    |
|    gen/train/value_loss            | 15.5        |
----------------------------------------------------

round:  78%|███████▊  | 47/60 [14:56<04:08, 19.10s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 177         |
|    gen/rollout/ep_rew_mean         | 63.8        |
|    gen/rollout/ep_rew_wrapped_mean | -213        |
|    gen/time/fps                    | 2380        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 786432      |
|    gen/train/approx_kl             | 0.010815643 |
|    gen/train/clip_fraction         | 0.131       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.897      |
|    gen/train/explained_variance    | 0.9741566   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 5.62        |
|    gen/train/n_updates             | 470         |
|    gen/train/policy_gradient_loss  | -0.00773    |
|    gen/train/value_loss            | 26.5        |
----------------------------------------------------

round:  80%|████████  | 48/60 [15:15<03:49, 19.13s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 237         |
|    gen/rollout/ep_rew_mean         | 96.7        |
|    gen/rollout/ep_rew_wrapped_mean | -205        |
|    gen/time/fps                    | 2333        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 802816      |
|    gen/train/approx_kl             | 0.008177994 |
|    gen/train/clip_fraction         | 0.0971      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.863      |
|    gen/train/explained_variance    | 0.9697691   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 57.7        |
|    gen/train/n_updates             | 480         |
|    gen/train/policy_gradient_loss  | -0.00688    |
|    gen/train/value_loss            | 64.1        |
----------------------------------------------------

round:  82%|████████▏ | 49/60 [15:35<03:30, 19.16s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 294        |
|    gen/rollout/ep_rew_mean         | 106        |
|    gen/rollout/ep_rew_wrapped_mean | -167       |
|    gen/time/fps                    | 2349       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 6          |
|    gen/time/total_timesteps        | 819200     |
|    gen/train/approx_kl             | 0.00608939 |
|    gen/train/clip_fraction         | 0.0706     |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -0.816     |
|    gen/train/explained_variance    | 0.9758827  |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 33.6       |
|    gen/train/n_updates             | 490        |
|    gen/train/policy_gradient_loss  | -0.0062    |
|    gen/train/value_loss            | 76.7       |
---------------------------------------------------

round:  83%|████████▎ | 50/60 [15:54<03:11, 19.15s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 380        |
|    gen/rollout/ep_rew_mean         | 97.6       |
|    gen/rollout/ep_rew_wrapped_mean | -155       |
|    gen/time/fps                    | 2348       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 6          |
|    gen/time/total_timesteps        | 835584     |
|    gen/train/approx_kl             | 0.03348951 |
|    gen/train/clip_fraction         | 0.158      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -0.892     |
|    gen/train/explained_variance    | 0.97936904 |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 29.9       |
|    gen/train/n_updates             | 500        |
|    gen/train/policy_gradient_loss  | -0.0077    |
|    gen/train/value_loss            | 67.7       |
---------------------------------------------------

round:  85%|████████▌ | 51/60 [16:13<02:52, 19.15s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 492         |
|    gen/rollout/ep_rew_mean         | 104         |
|    gen/rollout/ep_rew_wrapped_mean | -185        |
|    gen/time/fps                    | 2323        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 851968      |
|    gen/train/approx_kl             | 0.010828761 |
|    gen/train/clip_fraction         | 0.0886      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.854      |
|    gen/train/explained_variance    | 0.97295433  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 11.6        |
|    gen/train/n_updates             | 510         |
|    gen/train/policy_gradient_loss  | -0.00276    |
|    gen/train/value_loss            | 24.8        |
----------------------------------------------------

round:  87%|████████▋ | 52/60 [16:32<02:33, 19.19s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_len_mean         | 598        |
|    gen/rollout/ep_rew_mean         | 105        |
|    gen/rollout/ep_rew_wrapped_mean | -263       |
|    gen/time/fps                    | 2302       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 7          |
|    gen/time/total_timesteps        | 868352     |
|    gen/train/approx_kl             | 0.04548409 |
|    gen/train/clip_fraction         | 0.111      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -0.988     |
|    gen/train/explained_variance    | 0.93684787 |
|    gen/train/learning_rate         | 0.000635   |
|    gen/train/loss                  | 6.57       |
|    gen/train/n_updates             | 520        |
|    gen/train/policy_gradient_loss  | -5.28e-05  |
|    gen/train/value_loss            | 13.9       |
---------------------------------------------------

round:  88%|████████▊ | 53/60 [16:52<02:14, 19.27s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 696         |
|    gen/rollout/ep_rew_mean         | 94.6        |
|    gen/rollout/ep_rew_wrapped_mean | -400        |
|    gen/time/fps                    | 2288        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 884736      |
|    gen/train/approx_kl             | 0.021131575 |
|    gen/train/clip_fraction         | 0.172       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.07       |
|    gen/train/explained_variance    | 0.7685041   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 2.34        |
|    gen/train/n_updates             | 530         |
|    gen/train/policy_gradient_loss  | -0.00125    |
|    gen/train/value_loss            | 12.9        |
----------------------------------------------------

round:  90%|█████████ | 54/60 [17:11<01:55, 19.33s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 779         |
|    gen/rollout/ep_rew_mean         | 77.6        |
|    gen/rollout/ep_rew_wrapped_mean | -589        |
|    gen/time/fps                    | 2333        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 7           |
|    gen/time/total_timesteps        | 901120      |
|    gen/train/approx_kl             | 0.034495763 |
|    gen/train/clip_fraction         | 0.0834      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.12       |
|    gen/train/explained_variance    | 0.3556801   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 7.43        |
|    gen/train/n_updates             | 540         |
|    gen/train/policy_gradient_loss  | -0.00119    |
|    gen/train/value_loss            | 17.1        |
----------------------------------------------------

round:  92%|█████████▏| 55/60 [17:30<01:36, 19.33s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 825         |
|    gen/rollout/ep_rew_mean         | 63.2        |
|    gen/rollout/ep_rew_wrapped_mean | -765        |
|    gen/time/fps                    | 2351        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 917504      |
|    gen/train/approx_kl             | 0.023453891 |
|    gen/train/clip_fraction         | 0.191       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.12       |
|    gen/train/explained_variance    | 0.40049106  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 10.8        |
|    gen/train/n_updates             | 550         |
|    gen/train/policy_gradient_loss  | -0.0102     |
|    gen/train/value_loss            | 27.5        |
----------------------------------------------------

round:  93%|█████████▎| 56/60 [17:50<01:18, 19.54s/it]

--------------------------------------------------
| raw/                               |           |
|    gen/rollout/ep_len_mean         | 881       |
|    gen/rollout/ep_rew_mean         | 52.5      |
|    gen/rollout/ep_rew_wrapped_mean | -908      |
|    gen/time/fps                    | 2344      |
|    gen/time/iterations             | 1         |
|    gen/time/time_elapsed           | 6         |
|    gen/time/total_timesteps        | 933888    |
|    gen/train/approx_kl             | 0.0169469 |
|    gen/train/clip_fraction         | 0.232     |
|    gen/train/clip_range            | 0.2       |
|    gen/train/entropy_loss          | -1.07     |
|    gen/train/explained_variance    | 0.7119242 |
|    gen/train/learning_rate         | 0.000635  |
|    gen/train/loss                  | 12.8      |
|    gen/train/n_updates             | 560       |
|    gen/train/policy_gradient_loss  | -0.0169   |
|    gen/train/value_loss            | 33.8      |
--------------------------------------------------

round:  95%|█████████▌| 57/60 [18:10<00:58, 19.53s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 901         |
|    gen/rollout/ep_rew_mean         | 28.9        |
|    gen/rollout/ep_rew_wrapped_mean | -1.01e+03   |
|    gen/time/fps                    | 2418        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 950272      |
|    gen/train/approx_kl             | 0.011748522 |
|    gen/train/clip_fraction         | 0.194       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.01       |
|    gen/train/explained_variance    | 0.8402858   |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 15.5        |
|    gen/train/n_updates             | 570         |
|    gen/train/policy_gradient_loss  | -0.0178     |
|    gen/train/value_loss            | 50.5        |
----------------------------------------------------

round:  97%|█████████▋| 58/60 [18:29<00:38, 19.43s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_len_mean         | 942          |
|    gen/rollout/ep_rew_mean         | 14.6         |
|    gen/rollout/ep_rew_wrapped_mean | -1.05e+03    |
|    gen/time/fps                    | 2406         |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 6            |
|    gen/time/total_timesteps        | 966656       |
|    gen/train/approx_kl             | 0.0109842345 |
|    gen/train/clip_fraction         | 0.154        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -0.955       |
|    gen/train/explained_variance    | 0.8967231    |
|    gen/train/learning_rate         | 0.000635     |
|    gen/train/loss                  | 6.6          |
|    gen/train/n_updates             | 580          |
|    gen/train/policy_gradient_loss  | -0.0211      |
|    gen/train/value_loss   

round:  98%|█████████▊| 59/60 [18:48<00:19, 19.29s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_len_mean         | 962         |
|    gen/rollout/ep_rew_mean         | -0.916      |
|    gen/rollout/ep_rew_wrapped_mean | -1.1e+03    |
|    gen/time/fps                    | 2435        |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 6           |
|    gen/time/total_timesteps        | 983040      |
|    gen/train/approx_kl             | 0.009657002 |
|    gen/train/clip_fraction         | 0.135       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -0.951      |
|    gen/train/explained_variance    | 0.89644456  |
|    gen/train/learning_rate         | 0.000635    |
|    gen/train/loss                  | 4.73        |
|    gen/train/n_updates             | 590         |
|    gen/train/policy_gradient_loss  | -0.0152     |
|    gen/train/value_loss            | 10.3        |
----------------------------------------------------

round: 100%|██████████| 60/60 [19:07<00:00, 19.13s/it]

Saving AIRL artifacts





AIRL eval: -30.1 ± 23.4 (20 eps)


### Compare performance

We’ll compare:
- Expert mean return
- AIRL mean return

### Checkpoint
If AIRL gets *high reward* but looks behaviorally different, is that “success”? Why or why not?


In [None]:
print(f"Expert: {expert_mean:.1f} ± {expert_std:.1f}")
print(f"AIRL:   {airl_mean:.1f} ± {airl_std:.1f}")


Expert: 247.2 ± 19.5
AIRL:   -30.1 ± 23.4


Record an AIRL episode.  
Compare the video with the expert: smoothness, hovering, leg contact timing, crash vs safe landings, etc.


In [None]:
_ = record_video(airl.gen_algo, ENV_ID, OUT_DIR, name="airl_demo")




Saved: outputs_lab11/airl_demo.mp4 | Return: -3.2 | Steps: 1000
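
The qualitative comparison above (smoothness, hovering, leg contact timing) can also be put in numbers. An episode that runs the full 1000 steps with a return near zero may indicate hovering rather than landing. The sketch below assumes you have stacked one episode's observations into an array of shape `(T, 8)` in the standard LunarLander layout `[x, y, vx, vy, angle, angular_vel, left_contact, right_contact]`. The function name `landing_metrics` and the `hover_speed` threshold are illustrative choices, not part of the lab code.

```python
import numpy as np

def landing_metrics(obs, hover_speed=0.05):
    """Summarize one LunarLander episode from its stacked observations.

    obs: array of shape (T, 8) laid out as
         [x, y, vx, vy, angle, angular_vel, left_contact, right_contact].
    hover_speed: illustrative threshold (an assumption, not a lab value).
    """
    obs = np.asarray(obs, dtype=float)
    # A step counts as "in contact" if either leg touches the ground.
    contact = (obs[:, 6] > 0.5) | (obs[:, 7] > 0.5)
    speed = np.hypot(obs[:, 2], obs[:, 3])
    # Hovering proxy: airborne and nearly stationary.
    airborne_slow = (~contact) & (speed < hover_speed)
    first_contact = int(np.argmax(contact)) if contact.any() else None
    return {
        "steps": int(len(obs)),
        "first_contact_step": first_contact,
        "hover_fraction": float(airborne_slow.mean()),
        "mean_abs_angular_vel": float(np.abs(obs[:, 5]).mean()),
    }

# Synthetic check: 10 steps of slow airborne drift, then both legs touch down.
demo = np.zeros((12, 8))
demo[:10, 3] = -0.01
demo[10:, 6:] = 1.0
metrics = landing_metrics(demo)
print(metrics)
```

Computing the same dictionary for an expert episode and an AIRL episode gives a side-by-side view that episodic return alone does not provide.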


### Checkpoint

*Reflect on the following statements about IRL. How do your results highlight some of these challenges with IRL?*

- IRL ≠ imitation → AIRL struggles because it must infer *why* actions were taken.
- High-quality demos are necessary but not sufficient.
- Reward learning is fragile.
- Small reward errors → catastrophic policy failure, especially in long-horizon control.

In [None]:
# TODO
# How can you improve the performance of AIRL?

### Checkpoint

1. Why can multiple different rewards explain the same demonstrations?
2. If AIRL learned a reward that works near the expert’s states, why might it fail off-distribution?
3. What evaluation would you add to judge *behavioral similarity* beyond episodic return?
4. Can AIRL learn a reward that “looks right” on demos but induces unintended behavior? Give a concrete example in LunarLander.
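
One angle on question 1 is potential-based shaping: adding γΦ(s′) − Φ(s) to a reward changes its values everywhere but leaves the optimal policy unchanged, so demonstrations alone cannot tell the two rewards apart. The sketch below checks this on a small random tabular MDP (not LunarLander). All names (`greedy_policy`, `phi`, `R_shaped`) are illustrative.

```python
import numpy as np

def greedy_policy(P, R, gamma=0.9, iters=500):
    """Run value iteration on a tabular MDP and return the greedy actions.

    P: (S, A, S) transition probabilities, R: (S, A, S) rewards.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        # Q(s, a) = sum_s' P(s'|s, a) * (R(s, a, s') + gamma * V(s'))
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into probabilities
R = rng.normal(size=(S, A, S))

# Arbitrary potential function; the shaped reward looks very different from R.
phi = rng.normal(size=S) * 10.0
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

pi_original = greedy_policy(P, R, gamma)
pi_shaped = greedy_policy(P, R_shaped, gamma)
print("original reward policy:", pi_original)
print("shaped reward policy:  ", pi_shaped)
```

Shaping is only one source of ambiguity; rescaling a reward by a positive constant is another.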


## TODO: Hyperparameter search (Optuna)


In [None]:
# Uncomment to run a tiny Optuna search (can still take a while)
# !pip -q install optuna

# import optuna
# from optuna.pruners import MedianPruner
# from optuna.samplers import TPESampler

# def objective(trial):
#     reward_width = trial.suggest_categorical("reward_width", [128, 256, 512])
#     disc_lr = trial.suggest_float("disc_lr", 1e-5, 1e-3, log=True)
#     gen_lr = trial.suggest_float("gen_lr", 1e-5, 1e-3, log=True)
#     ent = trial.suggest_float("ent_coef", 0.001, 0.05, log=True)
#     n_envs = trial.suggest_categorical("n_envs", [2, 4, 8])

#     airl_tmp, mean_r, std_r = train_airl(
#         transitions,
#         n_envs=n_envs,
#         reward_hid_sizes=(reward_width, reward_width),
#         potential_hid_sizes=(reward_width, reward_width),
#         disc_learning_rate=disc_lr,
#         gen_learning_rate=gen_lr,
#         gen_ent_coef=ent,
#         total_timesteps=120_000,
#         load_existing=False,
#         verbose=0
#     )
#     return mean_r

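# Note: MedianPruner only acts when objective() calls trial.report(...) and
# checks trial.should_prune(); as written, no trial is pruned early.
# study = optuna.create_study(direction="maximize", sampler=TPESampler(seed=0), pruner=MedianPruner())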
# study.optimize(objective, n_trials=10)

# print("Best trial:", study.best_trial.number)
# print("Best value:", study.best_value)
# print("Best params:", study.best_params)


### Checkpoint

- What are we optimizing in AIRL (at a high level)?
- What does it mean for IRL to “succeed” beyond having a high episodic return?
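
As a reference point for the first question, the AIRL discriminator has the form D = exp(f) / (exp(f) + π(a|s)), with f(s, a, s′) = g(s, a) + γh(s′) − h(s). Here g is the learned reward and h the shaping term. The sketch below writes the discriminator's objective in NumPy. The function names and the label convention (expert = 1) are assumptions for illustration, and a library implementation may use the opposite sign.

```python
import numpy as np

def airl_logits(g, h_s, h_next, log_pi, gamma=0.99):
    """Discriminator logit f(s, a, s') - log pi(a|s).

    f = g(s, a) + gamma * h(s') - h(s). Applying a sigmoid to the logit
    gives exp(f) / (exp(f) + pi(a|s)), the AIRL discriminator.
    """
    f = g + gamma * h_next - h_s
    return f - log_pi

def discriminator_loss(logits, is_expert):
    """Binary cross-entropy on logits, with expert transitions labeled 1."""
    # softplus(x) - y * x, with softplus computed stably via logaddexp.
    return float(np.mean(np.logaddexp(0.0, logits) - is_expert * logits))

# Toy batch: f = 0 and pi(a|s) = 0.25, so D = 1 / (1 + 0.25) = 0.8.
logits = airl_logits(
    g=np.zeros(2), h_s=np.zeros(2), h_next=np.zeros(2),
    log_pi=np.log(np.full(2, 0.25)),
)
d_expert_prob = 1.0 / (1.0 + np.exp(-logits))
loss = discriminator_loss(logits, is_expert=np.array([1.0, 0.0]))
print(d_expert_prob, loss)
```

The discriminator minimizes this loss, while the policy is trained on the reward the discriminator implies. That game runs in alternating rounds.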

