# Learning a Reward Function using Preference Comparisons on Atari

In this notebook, we use a convolutional neural network for both the policy and the reward model. We also shape the learned reward with the policy's learned value function: the shaped reward is more informative for training because it incentivizes the agent to move toward high-value states. To keep execution time short, we do only a small amount of training, much less than in the previous preference comparison notebook. To run this notebook, first install the `atari` extras, for example by running `pip install imitation[atari]`.

First, we set up the environment, the reward network, and the other components of the training pipeline.

In [1]:
import torch as th
import gym
from gym.wrappers import TimeLimit

from seals.util import AutoResetWrapper

from stable_baselines3 import PPO
from stable_baselines3.common.atari_wrappers import AtariWrapper
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.ppo import CnnPolicy

from imitation.algorithms import preference_comparisons
from imitation.policies.base import NormalizeFeaturesExtractor
from imitation.rewards.reward_nets import CnnRewardNet

device = th.device("cuda" if th.cuda.is_available() else "cpu")

# Here we ensure that our environment has constant-length episodes by resetting
# it when done, and running until 100 timesteps have elapsed.
# For real training, you will want a much longer time limit.
def constant_length_asteroids(num_steps):
    atari_env = gym.make("AsteroidsNoFrameskip-v4")
    preprocessed_env = AtariWrapper(atari_env)
    endless_env = AutoResetWrapper(preprocessed_env)
    return TimeLimit(endless_env, max_episode_steps=num_steps)


# For real training, you will want a vectorized environment with 8 environments in parallel.
# This can be done by passing in n_envs=8 as an argument to make_vec_env.
venv = make_vec_env(constant_length_asteroids, env_kwargs={"num_steps": 100})
venv = VecFrameStack(venv, n_stack=4)

reward_net = CnnRewardNet(
    venv.observation_space,
    venv.action_space,
).to(device)

fragmenter = preference_comparisons.RandomFragmenter(warning_threshold=0, seed=0)
gatherer = preference_comparisons.SyntheticGatherer(seed=0)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    model=reward_net,
    loss=preference_comparisons.CrossEntropyRewardLoss(preference_model),
    epochs=3,
)

agent = PPO(
    policy=CnnPolicy,
    env=venv,
    seed=0,
    n_steps=16,  # To train well on Atari, set this to 128
    batch_size=16,  # To train well on Atari, set this to 256
    ent_coef=0.01,
    learning_rate=0.00025,
    n_epochs=4,
)

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    exploration_frac=0.0,
    seed=0,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=2,
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    fragment_length=10,
    transition_oversampling=1,
    initial_comparison_frac=0.1,
    allow_variable_horizon=False,
    seed=0,
    initial_epoch_multiplier=1,
)
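
The reward trainer above fits the reward network with a cross-entropy loss on pairwise preferences. As a rough illustration of that idea, here is a minimal sketch in plain Python. It assumes a Bradley-Terry model on the summed, undiscounted predicted rewards of each fragment. The function names `preference_probability` and `cross_entropy_loss` are hypothetical and are not part of the `imitation` API.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that fragment A is preferred over fragment B.

    Assumption: a Bradley-Terry model on the summed (undiscounted)
    predicted rewards of each fragment.
    """
    return_a = sum(rewards_a)
    return_b = sum(rewards_b)
    # Same as exp(return_a) / (exp(return_a) + exp(return_b)), written as a sigmoid.
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def cross_entropy_loss(rewards_a, rewards_b, label):
    """Cross-entropy between a label (1.0 means A is preferred) and the model."""
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# Two toy fragments of length 10, matching fragment_length=10 above.
frag_a = [0.1] * 10
frag_b = [0.0] * 10
p_a = preference_probability(frag_a, frag_b)
loss_when_a_preferred = cross_entropy_loss(frag_a, frag_b, label=1.0)
loss_when_tied = cross_entropy_loss(frag_b, frag_b, label=1.0)
```

When the two fragments have equal predicted return, the model assigns probability 0.5 to each, and the loss equals ln 2, about 0.693. Training lowers the loss by raising the predicted return of the preferred fragment relative to the other one.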

We are now ready to train the reward model.

In [2]:
pref_comparisons.train(
    total_timesteps=16,
    total_comparisons=15,
)

Query schedule: [1, 9, 5]
Collecting 2 fragments (20 transitions)
Requested 20 transitions but only 0 in buffer. Sampling 20 additional transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 1 comparisons


Training reward model: 100%|██████████| 3/3 [00:02<00:00,  1.42it/s]

Training agent for 8 timesteps

--------------------------------------
| raw/                    |          |
|    agent/rollout/ep_... | -2.89    |
|    agent/time/fps       | 37       |
|    agent/time/iterat... | 1        |
|    agent/time/time_e... | 0        |
|    agent/time/total_... | 16       |
--------------------------------------
--------------------------------------
| mean/                   |          |
|    agent/rollout/ep_... | -2.89    |
|    agent/time/fps       | 37       |
|    agent/time/iterat... | 1        |
|    agent/time/time_e... | 0        |
|    agent/time/total_... | 16       |
|    agent/train/appro... | 0.000103 |
|    agent/train/clip_... | 0.2      |
|    agent/train/entro... | -2.64    |
|    agent/train/expla... | 0.0135   |
|    agent/train/learn... | 0.00025  |
|    agent/train/loss     | -0.0377  |
|    agent/train/n_upd... | 4        |
|    agent/train/polic... | -0.00596 |
|    agent/train/value... | 0.0107   |
|    preferences/entropy  | 0.693    |
|    reward/accuracy     

Training reward model: 100%|██████████| 3/3 [00:34<00:00, 11.49s/it]

Training agent for 8 timesteps

-------------------------------------------
| raw/                    |               |
|    agent/rollout/ep_... | -2.72         |
|    agent/time/fps       | 16            |
|    agent/time/iterat... | 1             |
|    agent/time/time_e... | 0             |
|    agent/time/total_... | 32            |
|    agent/train/appro... | 0.00010267645 |
|    agent/train/clip_... | 0.2           |
|    agent/train/entro... | -2.64         |
|    agent/train/expla... | 0.0135        |
|    agent/train/learn... | 0.00025       |
|    agent/train/loss     | -0.0377       |
|    agent/train/n_upd... | 4             |
|    agent/train/polic... | -0.00596      |
|    agent/train/value... | 0.0107        |
-------------------------------------------
--------------------------------------
| mean/                   |          |
|    agent/rollout/ep_... | -2.72    |
|    agent/time/fps       | 16       |
|    agent/time/iterat... | 1        |
|    agent/time/time_e... | 0        |
|    agent/time/to

Training reward model: 100%|██████████| 3/3 [00:45<00:00, 15.22s/it]

Training agent for 8 timesteps

------------------------------------------
| raw/                    |              |
|    agent/rollout/ep_... | -2.42        |
|    agent/time/fps       | 35           |
|    agent/time/iterat... | 1            |
|    agent/time/time_e... | 0            |
|    agent/time/total_... | 48           |
|    agent/train/appro... | 0.0001000762 |
|    agent/train/clip_... | 0.2          |
|    agent/train/entro... | -2.64        |
|    agent/train/expla... | 0.485        |
|    agent/train/learn... | 0.00025      |
|    agent/train/loss     | -0.0252      |
|    agent/train/n_upd... | 8            |
|    agent/train/polic... | -0.0049      |
|    agent/train/value... | 0.0233       |
------------------------------------------
--------------------------------------
| mean/                   |          |
|    agent/rollout/ep_... | -2.42    |
|    agent/time/fps       | 35       |
|    agent/time/iterat... | 1        |
|    agent/time/time_e... | 0        |
|    agent/time/total_... | 48     

{'reward_loss': 0.6221150159835815, 'reward_accuracy': 0.644444465637207}

We can now wrap the environment with the learned reward model, shaped by the policy's learned value function. For real training, we would also want to normalize both the output of the reward net and the value function, so that the two are on the same scale. To do this, wrap `reward_net` in the `NormalizedRewardNet` class from `src/imitation/rewards/reward_nets.py`, and modify the potential to add a `RunningNorm` module from `src/imitation/util/networks.py`.
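
To illustrate why normalization puts the reward and the value on the same scale, here is a minimal sketch of running normalization in plain Python. It uses Welford's algorithm to track a running mean and variance. The `RunningStats` class is a hypothetical helper written for this illustration; it is not the library's `RunningNorm` and may differ from it in detail.

```python
import math


class RunningStats:
    """Hypothetical helper: running mean and variance via Welford's algorithm."""

    def __init__(self, eps=1e-5):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps  # avoids division by zero for constant inputs

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self):
        # Population variance; defaults to 1.0 before any data is seen.
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, value):
        return (value - self.mean) / math.sqrt(self.variance + self.eps)


# Toy data: the "values" are 100 times larger than the "rewards".
reward_stats = RunningStats()
value_stats = RunningStats()
for raw_reward, raw_value in [(0.1, 10.0), (0.3, 30.0), (0.2, 20.0), (0.4, 40.0)]:
    reward_stats.update(raw_reward)
    value_stats.update(raw_value)

normalized_reward = reward_stats.normalize(0.4)
normalized_value = value_stats.normalize(40.0)
```

Although the raw quantities differ by two orders of magnitude, the normalized reward and normalized value end up on a comparable scale, so neither term dominates once they are combined in the shaped reward.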

In [3]:
from imitation.rewards.reward_nets import ShapedRewardNet, cnn_transpose
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper


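def value_potential(state):
    # The potential is the policy's value estimate for the given observations.
    # cnn_transpose converts (batch, height, width, channels) observations
    # to the (batch, channels, height, width) layout the CNN policy expects.
    state_ = cnn_transpose(state)
    return agent.policy.predict_values(state_)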


shaped_reward_net = ShapedRewardNet(
    base=reward_net,
    potential=value_potential,
    discount_factor=0.99,
)
learned_reward_venv = RewardVecEnvWrapper(venv, shaped_reward_net.predict)
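
As a rough sketch of what shaping with a potential does, the plain-Python example below applies potential-based shaping, `r + discount * phi(s') - phi(s)`, to a toy trajectory. It assumes that the potential of the next state is treated as zero once the episode is done. The function `shaped_reward` and the toy numbers are hypothetical and only illustrate the formula; they are not the `ShapedRewardNet` implementation.

```python
def shaped_reward(base_reward, potential_s, potential_next, discount, done):
    """Potential-based shaping: r + discount * phi(s') - phi(s).

    Assumption: phi(s') is treated as zero when the episode is done.
    """
    next_term = 0.0 if done else discount * potential_next
    return base_reward + next_term - potential_s


def discounted_return(rewards, discount):
    return sum(discount**t * r for t, r in enumerate(rewards))


discount = 0.99
base_rewards = [0.0, 1.0, 0.0]  # toy rewards for three transitions
potentials = [0.5, 2.0, 1.0, 0.0]  # toy phi(s_0) ... phi(s_3)

shaped_rewards = [
    shaped_reward(
        base_reward,
        potentials[t],
        potentials[t + 1],
        discount,
        done=(t == len(base_rewards) - 1),
    )
    for t, base_reward in enumerate(base_rewards)
]

# The shaping terms telescope, so the two returns differ only by -phi(s_0).
return_gap = discounted_return(shaped_rewards, discount) - discounted_return(
    base_rewards, discount
)
```

In this toy case the shaped and unshaped discounted returns differ by exactly `-phi(s_0)`, a quantity that does not depend on the actions taken. The shaping therefore changes how informative each individual reward is without changing which behavior is best.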

Next, we train an agent that sees only the shaped, learned reward.

In [4]:
learner = PPO(
    policy=CnnPolicy,
    env=learned_reward_venv,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)
learner.learn(1000)

<stable_baselines3.ppo.ppo.PPO at 0x7fadb817f640>

We now evaluate the learner using the original reward.

In [5]:
from stable_baselines3.common.evaluation import evaluate_policy

reward, _ = evaluate_policy(learner.policy, venv, 10)
print(reward)

1.3
