FIFO On-Policy replay buffers lead to error with PPO #40

Closed
GreatArcStudios opened this issue Jan 5, 2024 · 2 comments


GreatArcStudios commented Jan 5, 2024

I tried the CartPole example from the starter code, using the provided FIFO on-policy replay buffer with PPO, and hit an error regarding the cumulative reward.

With FIFOOnPolicyReplayBuffer and PPO we get:

Traceback (most recent call last):
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading\Modelling\Tests\Pearl\Basic.py", line 25, in <module>
    trainer.train(num_iterations=1000)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading\Modelling\Training\Pearl\Trainer.py", line 46, in train
    info = online_learning(
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\utils\functional_utils\train_and_eval\online_learning.py", line 107, in online_learning
    episode_info, episode_total_steps = run_episode(
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\utils\functional_utils\train_and_eval\online_learning.py", line 275, in run_episode
    agent.learn()
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\pearl_agent.py", line 206, in learn
    report = self.policy_learner.learn(self.replay_buffer)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\ppo.py", line 154, in learn
    result = super().learn(replay_buffer)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\policy_learner.py", line 171, in learn
    single_report = self.learn_batch(batch)
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\actor_critic_base.py", line 231, in learn_batch
    self._critic_learn_batch(batch)  # update critic
  File "C:\Users\ericz\Documents\Github\RLTrading Model\RLTrading-Data\RLTrading\Pearl\pearl\policy_learners\sequential_decision_making\ppo.py", line 145, in _critic_learn_batch
    assert batch.cum_reward is not None
AssertionError

Here is the full example code:

# Imports for this snippet; the Pearl module paths are inferred from the repo
# layout shown in the traceback above and may differ slightly between versions.
import gym  # or: import gymnasium as gym

from pearl.action_representation_modules.one_hot_action_representation_module import (
    OneHotActionTensorRepresentationModule,
)
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.sequential_decision_making.ppo import ProximalPolicyOptimization
from pearl.replay_buffers.sequential_decision_making.fifo_on_policy_replay_buffer import (
    FIFOOnPolicyReplayBuffer,
)
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment

env = GymEnvironment(gym.make("CartPole-v1"))

num_actions = env.action_space.n
agent = PearlAgent(
    policy_learner=ProximalPolicyOptimization(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        actor_hidden_dims=[64, 64],
        critic_hidden_dims=[64, 64],
        training_rounds=100,
        batch_size=8,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions,
        ),
    ),
    replay_buffer=FIFOOnPolicyReplayBuffer(10_000),
)

observation, action_space = env.reset()
agent.reset(observation, action_space)
done = False
while not done:
    action = agent.act(exploit=False)
    action_result = env.step(action)
    agent.observe(action_result)
    agent.learn()
    done = action_result.done
@yiwan-rl (Contributor) commented

Hi, thanks for raising this point.

PPO (and REINFORCE) is compatible with the replay buffer here:
https://github.com/facebookresearch/Pearl/blob/main/pearl/replay_buffers/sequential_decision_making/on_policy_episodic_replay_buffer.py
FIFOOnPolicyReplayBuffer is currently used for SARSA.

They differ in two ways:

  1. Data in FIFOOnPolicyReplayBuffer can be used before the end of an episode is reached; this is not the case with OnPolicyEpisodicReplayBuffer.
  2. FIFOOnPolicyReplayBuffer stores two actions in each transition tuple (s, a, r, s', a'), while OnPolicyEpisodicReplayBuffer stores one action in each transition tuple (s, a, r, s').

FIFOOnPolicyReplayBuffer and OnPolicyEpisodicReplayBuffer can be merged, which would reduce confusion and repetition; this is already in our plan.
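
For reference, here is a minimal sketch of the working combination, adapted from the reproduction above. It assumes OnPolicyEpisodicReplayBuffer accepts a capacity argument like FIFOOnPolicyReplayBuffer, and that its import path matches the file linked above:

from pearl.replay_buffers.sequential_decision_making.on_policy_episodic_replay_buffer import (
    OnPolicyEpisodicReplayBuffer,
)

# Same PPO configuration as in the reproduction; only the replay buffer changes.
agent = PearlAgent(
    policy_learner=ProximalPolicyOptimization(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        actor_hidden_dims=[64, 64],
        critic_hidden_dims=[64, 64],
        training_rounds=100,
        batch_size=8,
        action_representation_module=OneHotActionTensorRepresentationModule(
            max_number_actions=num_actions,
        ),
    ),
    # This buffer yields training data only at episode boundaries, at which point
    # cumulative rewards (batch.cum_reward) are available to PPO's critic update.
    replay_buffer=OnPolicyEpisodicReplayBuffer(10_000),
)

The environment interaction loop from the reproduction stays the same; only the replay_buffer argument differs.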

@GreatArcStudios (Author) commented

Ok, this is helpful to know. I'll close the ticket then.
