FIFO On-Policy replay buffers lead to error with PPO #40

Closed
GreatArcStudios opened this issue Jan 5, 2024 · 2 comments


GreatArcStudios commented Jan 5, 2024

I tried the CartPole example from the starter code, using the provided FIFO on-policy replay buffer with PPO, and hit an error regarding the cumulative reward.

With FIFOOnPolicyReplayBuffer and PPO we get:

Traceback (most recent call last):
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading\Modelling\Tests\Pearl\Basic.py", line 25, in <module>
    trainer.train(num_iterations=1000)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading\Modelling\Training\Pearl\Trainer.py", line 46, in train
    info = online_learning(
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\utils\functional_utils\train_and_eval\online_learning.py", line 107, in online_learning
    episode_info, episode_total_steps = run_episode(
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\utils\functional_utils\train_and_eval\online_learning.py", line 275, in run_episode
    agent.learn()
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\pearl_agent.py", line 206, in learn
    report = self.policy_learner.learn(self.replay_buffer)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\ppo.py", line 154, in learn
    result = super().learn(replay_buffer)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\policy_learner.py", line 171, in learn
    single_report = self.learn_batch(batch)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\actor_critic_base.py", line 231, in learn_batch
    self._critic_learn_batch(batch)  # update critic
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\ppo.py", line 145, in _critic_learn_batch
    assert batch.cum_reward is not None
AssertionError

Here is the full example code:

# Imports for this snippet; the Pearl module paths are inferred from the repo
# layout shown in the traceback above and may differ slightly between versions.
import gym  # or: import gymnasium as gym

from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.sequential_decision_making.ppo import ProximalPolicyOptimization
from pearl.replay_buffers.sequential_decision_making.fifo_on_policy_replay_buffer import (
    FIFOOnPolicyReplayBuffer,
)
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment

env = GymEnvironment(gym.make("CartPole-v1"))

num_actions = env.action_space.n
agent = PearlAgent(
    policy_learner=ProximalPolicyOptimization(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        actor_hidden_dims=[64, 64],
        critic_hidden_dims=[64, 64],
        training_rounds=100,
        batch_size=8,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions,
        ),
    ),
    replay_buffer=FIFOOnPolicyReplayBuffer(10_000),
)

observation, action_space = env.reset()
agent.reset(observation, action_space)
done = False
while not done:
    action = agent.act(exploit=False)
    action_result = env.step(action)
    agent.observe(action_result)
    agent.learn()
    done = action_result.done
@yiwan-rl (Contributor) commented

Hi, thanks for raising this point.

PPO (and REINFORCE) is compatible with the replay buffer here:
https://github.com/facebookresearch/Pearl/blob/main/pearl/replay_buffers/sequential_decision_making/on_policy_episodic_replay_buffer.py
FIFOOnPolicyReplayBuffer is currently used for SARSA.

They differ in two ways:

  1. Data in FIFOOnPolicyReplayBuffer can be used before the end of an episode is reached; this is not the case with OnPolicyEpisodicReplayBuffer.
  2. FIFOOnPolicyReplayBuffer stores two actions in each transition tuple (s, a, r, s', a'), while OnPolicyEpisodicReplayBuffer stores one action in each transition tuple (s, a, r, s').

FIFOOnPolicyReplayBuffer and OnPolicyEpisodicReplayBuffer can be merged, which would reduce confusion and repetition; this is already in our plan.
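
For reference, here is a minimal sketch of the working combination, adapted from the reproduction above. It assumes OnPolicyEpisodicReplayBuffer accepts a capacity argument like FIFOOnPolicyReplayBuffer, and that its import path matches the file linked above:

from pearl.replay_buffers.sequential_decision_making.on_policy_episodic_replay_buffer import (
    OnPolicyEpisodicReplayBuffer,
)

# Same PPO configuration as in the reproduction; only the replay buffer changes.
agent = PearlAgent(
    policy_learner=ProximalPolicyOptimization(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        actor_hidden_dims=[64, 64],
        critic_hidden_dims=[64, 64],
        training_rounds=100,
        batch_size=8,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions,
        ),
    ),
    # This buffer yields training data only at episode boundaries, at which point
    # cumulative rewards (batch.cum_reward) are available to PPO's critic update.
    replay_buffer=OnPolicyEpisodicReplayBuffer(10_000),
)

The environment interaction loop from the reproduction stays the same; only the replay_buffer argument differs.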

@GreatArcStudios (Author) commented

Ok, this is helpful to know. I'll close the ticket then.
