Fix/ppo playback, visualization, and remove time penalty #11
Merged
Conversation
Without this wrapper, `play_ppo.py` ran indefinitely on a converged model. The trained policy produces near-zero reward, so the `reward_limit` threshold was never crossed and no termination signal fired. `TimeLimit` caps episodes at `max_episode_steps=3000`, matching the training configuration used in `__init__` and `resume()`.
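For concreteness, a minimal sketch of what the fixed `load()` setup might look like, assuming a stable-baselines3-style `make_vec_env`; the `CraneEnv` import and model path are hypothetical:

```python
# Hedged sketch, not the repo's exact code: wrap the evaluation env in
# TimeLimit so episodes terminate after 3000 steps, as in training.
from gymnasium.wrappers import TimeLimit
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from crane_env import CraneEnv  # hypothetical import

env = make_vec_env(
    CraneEnv,
    wrapper_class=TimeLimit,
    wrapper_kwargs={"max_episode_steps": 3000},  # matches training config
)
model = PPO.load("models/ppo_crane", env=env)  # hypothetical path
```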
Exposes the PPO discount factor as `gamma` (default 0.99) on `ProximalPolicyOptimizationAgent` and as `--gamma` in `train_ppo.py`. Tested with `gamma=0.999`: converges slower for this task. The pendulum control problem is reactive and local; a longer planning horizon adds value function complexity without improving policy quality. `gamma=0.99` remains the default.
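A sketch of the plumbing, under the assumption that the agent wraps stable-baselines3's `PPO` (which accepts `gamma` directly); only the parameter name, default, and `--gamma` flag are from this PR:

```python
import argparse

from stable_baselines3 import PPO


class ProximalPolicyOptimizationAgent:
    def __init__(self, env, gamma: float = 0.99):
        # Forward the discount factor to the underlying PPO model.
        self.model = PPO("MlpPolicy", env, gamma=gamma)


parser = argparse.ArgumentParser()
parser.add_argument("--gamma", type=float, default=0.99,
                    help="PPO discount factor")
```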
The reward_fac[2] term adds -self.time * 0.001 to the reward each step.
self.time grows continuously but is not part of the observation, violating
the Markov property: two identical crane/pendulum states at different
episode times produce different rewards. PPO's value function V(s) cannot
condition on hidden state, so it must average across the time distribution
at each state — producing noisy gradient estimates and erratic convergence.
Time preference is already encoded implicitly through the discount factor γ:
V(s) = r_t + γ*r_{t+1} + γ²*r_{t+2} + … so the policy naturally prefers
faster solutions without any extra signal. An explicit time penalty is
both redundant and harmful for function-approximation methods like PPO.
Q-learning tolerates it because dt=0.1 makes the penalty 10× smaller, and
the tabular value function averages over many revisits to the same state
bucket, smoothing out the non-stationarity.
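To make the violation concrete, a toy illustration (hypothetical numbers; `base_reward` stands in for the state-dependent reward terms):

```python
def base_reward(obs):                # stand-in for the state-dependent terms
    return -abs(obs)

obs = 0.05                           # identical pendulum state both times
r_early = base_reward(obs) - 1.0 * 0.001    # self.time = 1.0, early episode
r_late = base_reward(obs) - 200.0 * 0.001   # self.time = 200.0, late episode
print(r_early, r_late)               # -0.051 vs -0.25: same obs, different r
```

No function of the observation alone can predict both outcomes, which is exactly the variance the value network is forced to absorb.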
Set reward_fac[2] = 0.0 in both train_ppo.py and play_ppo.py. The
reward_fac parameter remains on the environment so Q-learning can still
use it; only the PPO scripts override it explicitly.
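In the scripts this amounts to something like the following; the environment class name and the first two weights are assumptions, only `reward_fac[2] = 0.0` is the actual change:

```python
from crane_env import CraneEnv  # hypothetical import

# Zero only the time-penalty weight; the state-dependent weights stay active.
env = CraneEnv(reward_fac=(1.0, 1.0, 0.0))
```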
`render()` had no handler for `render_mode='plot'`. `show_plot()` was only called from `reset()` at the start of the next episode, so with a single-episode run the plot never appeared.
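The added branch presumably looks something like this; the `human` handler shown is an assumption, only the `plot` dispatch to `show_plot()` is from the fix:

```python
def render(self):
    if self.render_mode == "human":
        self._render_frame()      # hypothetical existing handler
    elif self.render_mode == "plot":
        self.show_plot()          # new: show the episode plot on demand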
- Switch from 2×2 grid to 4×1 vertical layout (figsize 16×12): gives each subplot full width and a shared time axis, making crane-pendulum interactions easier to align visually.
- Combine primary and twin-axis legend handles so load speed, damping, and crane speed (on `ax1y2`/`ax2y2`) appear in the legend; `ax.legend()` alone only captured primary-axis lines (see the sketch after this list).
- Use `plt.suptitle()` instead of `plt.title()` for the figure-level title. `plt.title()` attached to the last active axes (`ax4`), placing the text between subplots; `suptitle()` places it above all subplots.
- Add `loc="upper left"` to the `ax2` legend to avoid overlap with the twin y-axis spine on the right.
- Add a `suptitle` stub to `stubs/matplotlib-stubs/pyplot.pyi` (pyright did not recognise `plt.suptitle` without it).

Also updates CHANGELOG.md with all changes from this branch.
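A self-contained sketch of the legend and title fixes; axis names follow the PR text, the plotted data is placeholder:

```python
import matplotlib.pyplot as plt

# 4x1 layout with a shared time axis, one full-width subplot per signal.
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(16, 12), sharex=True)
ax1y2 = ax1.twinx()  # twin y-axis, e.g. for load speed

ax1.plot([0, 1, 2], [0, 1, 0], label="load position")
ax1y2.plot([0, 1, 2], [1, 0, 1], "r", label="load speed")

# ax1.legend() alone only sees primary-axis lines; merge handles from both.
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax1y2.get_legend_handles_labels()
ax1.legend(h1 + h2, l1 + l2, loc="upper left")

plt.suptitle("Episode plot")  # figure-level title, above all subplots
plt.show()
```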

Context
PR #10 introduced `reward_fac` — a configurable tuple of reward weights, the third of which adds `-self.time * 0.001` to the reward each step. This breaks PPO in a fundamental way (see below). This PR also fixes two bugs that made `play_ppo.py` non-functional, adds a `gamma` parameter, and improves the episode plot.

Changes

Bug: infinite loop in `play_ppo.py` (commit 1)

`ProximalPolicyOptimizationAgent.load()` was missing the `TimeLimit` wrapper. The converged model produces near-zero reward, so the `reward_limit` threshold was never crossed and the episode ran forever. Added `wrapper_class=TimeLimit, max_episode_steps=3000` to match the training configuration.

Feature: configurable `gamma` parameter (commit 2)

`ProximalPolicyOptimizationAgent` now accepts a `gamma` parameter (default 0.99), exposed as `--gamma` in `train_ppo.py`. Tested with γ=0.999 — converges slower for this task, since the pendulum control problem is reactive and local; a longer planning horizon adds value function complexity without benefit.

Fix: disable explicit time penalty for PPO — Markov violation (commit 3)

`reward_fac[2]` adds `-self.time * 0.001` each step. `self.time` grows continuously but is not in the observation, violating the Markov property: two identical crane/pendulum states at different points in an episode produce different rewards. PPO's value function V(s) cannot condition on hidden state, so it is forced to average across the time distribution at each state — producing noisy gradient estimates and poor convergence. Importantly, time preference is already encoded implicitly through the discount factor γ: V(s) = r_t + γ·r_{t+1} + γ²·r_{t+2} + …, so the policy naturally prefers faster solutions without any extra signal. The explicit penalty is both redundant and harmful for function-approximation methods like PPO.

Q-learning tolerates it because `dt=0.1` makes the penalty 10× smaller, and the tabular value function averages over many revisits to the same bucket, smoothing out the non-stationarity. `reward_fac` remains on the environment unchanged; only the PPO scripts override `reward_fac[2]` to `0.0`.

Bug: episode plot never appeared (commit 4a)

`render()` had no handler for `render_mode='plot'`. `show_plot()` was only called from `reset()` at the start of the next episode, so with `--episodes 1` the plot was never shown. Added the missing `elif` branch.

Improvement: `show_plot` visualization (commit 4b)

`ax.legend()` only captured primary-axis lines. Load speed, damping, and crane speed (on twin axes) were absent from the legend. Fixed by combining handles from both axes with `get_legend_handles_labels()`. Used `plt.suptitle()` instead of `plt.title()`: `plt.title()` attached to the last active axes, placing the text between subplots, while `suptitle()` places it at the figure level. Added a `suptitle` stub to `stubs/matplotlib-stubs/pyplot.pyi` (pyright did not recognise it).