# List of Stumbling Blocks

1. **Not adding type annotations to my dataclass**. Without annotations, the names are plain class-level attributes with fixed default values rather than dataclass fields, so they cannot be set via the constructor.
2. **Mixing up standard deviations between actor & critic network**. The critic's output layer needs a larger std (e.g. 1) so that it can estimate returns over a wide range. The actor's output layer needs a smaller std (e.g. 0.01) so that the initial policy is close to uniform, which encourages exploration instead of early commitment to one action. A small std for the actor network is *one of the most important initialisation details*.
3. **Understanding why we use `np.empty((0, self.num_envs, ...))`**. Although the shape doesn't make intuitive sense at first, it creates a buffer with zero time steps that each step's experiences can be appended to via `np.concatenate`. This avoids the awkward case of having the initialization step be different.
4. **Understanding why we use .cpu().numpy()**. This is just for the Gym package, which expects data to be on the CPU and in the NumPy format.
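
The first stumbling block is easy to reproduce in isolation. The sketch below uses two toy classes (`WithAnnotations` and `WithoutAnnotations` are made-up names, not part of this notebook) to show that only annotated names become dataclass fields.

```python
from dataclasses import dataclass, fields


@dataclass
class WithAnnotations:
    lr: float = 2.5e-4  # annotated: becomes a field and a constructor argument


@dataclass
class WithoutAnnotations:
    lr = 2.5e-4  # not annotated: stays a plain class attribute


annotated_fields = [f.name for f in fields(WithAnnotations)]
unannotated_fields = [f.name for f in fields(WithoutAnnotations)]
print(annotated_fields, unannotated_fields)

# The annotated version accepts the value through the constructor
ok = WithAnnotations(lr=1e-3)

# The unannotated version has no such constructor argument
try:
    WithoutAnnotations(lr=1e-3)
    constructor_accepts_lr = True
except TypeError:
    constructor_accepts_lr = False
print(constructor_accepts_lr)
```

The unannotated class can still be instantiated with no arguments, and `WithoutAnnotations().lr` still reads the class-level value, which is why the mistake is easy to miss.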

# Setup

In [156]:
import gymnasium as gym
import numpy as np
import torch as t
import torch.nn as nn
import torch.optim as optim
import wandb
import warnings
import time
import einops
import itertools
import random
import os

from dataclasses import dataclass
from gymnasium.spaces import Box, Discrete
from gymnasium.envs.classic_control import CartPoleEnv
from jaxtyping import Bool, Float, Int
from torch import nn, Tensor
from torch.distributions.categorical import Categorical
from torch.optim.optimizer import Optimizer
from tqdm import tqdm, trange
from typing import Literal
from helper_functions import set_global_seeds, make_env, get_episode_data_from_infos

warnings.filterwarnings("ignore")
Arr = np.ndarray

device = t.device("mps" if t.backends.mps.is_available() else "cuda" if t.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: mps


In [163]:
# @dataclass is effectively always paired with the Args class since it initializes all of the arguments
# and gets rid of the annoying initializations in def __init__(self, )
@dataclass 
class PPOArgs:
    seed: int = 1
    env_id: str = "CartPole-v1"
    mode: Literal["classic-control", "atari", "mujoco"] = "classic-control"

    total_timesteps: int = 500000
    num_envs: int = 4
    num_steps_per_rollout: int = 128
    num_minibatches: int = 4
    batches_per_learning_phase: int = 4

    lr: float = 2.5e-4
    max_grad_norm: float = 0.5

    gamma: float = 0.99
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    ent_coef: float = 0.01
    vf_coef: float = 0.25

    video_log_freq: int | None = None
    project_name: str = "Cartpole_PPO"
    entity: str | None = None

    # comments in reference to num_minibatches = 2
    def __post_init__(self):
        self.batch_size = self.num_steps_per_rollout * self.num_envs # 512
        self.minibatch_size = self.batch_size // self.num_minibatches # 256 
        self.total_phases = self.total_timesteps // self.batch_size # 976
        self.total_training_steps = self.total_phases * self.batches_per_learning_phase * self.num_minibatches # 7808

        self.video_save_path = "videos/cartpole_ppo"
        os.makedirs(self.video_save_path, exist_ok=True)

args = PPOArgs(num_minibatches = 2)

We define our actor and critic networks, each of which is just three linear layers with `nn.Tanh()` in between. We use orthogonal initialization, which preserves the magnitude of the signals as they propagate through the network; the default gain of $\sqrt{2}$ is the same gain that He initialization uses for ReLU layers. As mentioned before, the `std` of the last layer of each network is very important.

For now, we only define `get_actor_and_critic` in our CartPole case. The other modes will come in handy later.
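
To see why a small `std` on the actor's last layer gives a near-uniform initial policy, here is a minimal NumPy sketch. It imitates orthogonal initialization with a QR decomposition rather than calling `t.nn.init.orthogonal_`, and the helpers `orthogonal_rows` and `softmax` are illustrative names, not part of this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)


def orthogonal_rows(num_rows, num_cols, gain, rng):
    # QR of a random (num_cols, num_rows) matrix gives orthonormal columns;
    # transposing and scaling yields rows that are orthogonal with norm `gain`
    q, _ = np.linalg.qr(rng.standard_normal((num_cols, num_rows)))
    return gain * q.T


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Stand-in for the 64 tanh activations feeding the last layer (values in [-1, 1])
hidden = np.tanh(rng.standard_normal((5, 64)))

w_small = orthogonal_rows(2, 64, gain=0.01, rng=rng)
w_large = orthogonal_rows(2, 64, gain=np.sqrt(2), rng=rng)

probs_small = softmax(hidden @ w_small.T)
probs_large = softmax(hidden @ w_large.T)

print(np.round(probs_small, 3))  # every row is close to [0.5, 0.5]
print(np.round(probs_large, 3))
```

With gain 0.01 each logit is bounded by 0.01 times the norm of the hidden vector (at most 8 for 64 tanh units), so the two action probabilities cannot move far from 0.5.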

In [147]:
# orthogonal initialization with gain `std` (the default sqrt(2) is the gain He initialization uses for ReLU)
# orthogonality preserves the magnitude and structure of signals as they propagate through the nn
def layer_init(layer, std = np.sqrt(2), bias_const = 0.0):
    t.nn.init.orthogonal_(layer.weight, std)
    t.nn.init.constant_(layer.bias, bias_const)
    return layer

def get_actor_and_critic_classic(num_obs, num_actions):
    actor = nn.Sequential(
        layer_init(nn.Linear(num_obs, 64)),
        nn.Tanh(),
        layer_init(nn.Linear(64, 64)),
        nn.Tanh(),
        layer_init(nn.Linear(64, num_actions), std = 0.01),
    )

    critic = nn.Sequential(
        layer_init(nn.Linear(num_obs, 64)),
        nn.Tanh(),
        layer_init(nn.Linear(64, 64)),
        nn.Tanh(),
        layer_init(nn.Linear(64, 1), std = 1)
    )

    return actor, critic

# returns the networks used for PPO
# shape comments are for classic-control, atari, mujoco respectively
def get_actor_and_critic(envs, mode):
    # (4, ), (84, 84, 4), (24, )
    obs_shape = envs.single_observation_space.shape
    # 4, 28224, 24
    num_obs = np.array(obs_shape).prod()
    # 2, 4, and the action dimension for mujoco (not determined yet)
    num_actions = (envs.single_action_space.n if isinstance(envs.single_action_space, gym.spaces.Discrete)
                   else np.array(envs.single_action_space.shape).prod())

    if mode == "classic-control":
        return get_actor_and_critic_classic(num_obs, num_actions)
    if mode in ("atari", "mujoco"):
        raise NotImplementedError(f"mode {mode!r} is not implemented yet")
    # any other mode would otherwise fall through with actor/critic undefined
    raise ValueError(f"unknown mode: {mode!r}")

**What is GAE (Generalized Advantage Estimation)?**

We care about the **advantage function** $A_\theta(s_t, a_t)$, defined by $Q_\theta(s_t, a_t) - V_\theta(s_t)$, which represents the difference between the expected future reward when taking action $a_t$ in state $s_t$ and the expected future reward when following the policy $\pi_\theta$ from $s_t$ onwards.

We start with the 1-step residual $\hat{A}_\theta(s_t, a_t) = \delta_t = r_t + \gamma \cdot (1 - d_{t+1}) \cdot V_\theta(s_{t+1}) - V_\theta(s_t)$, but this is too short-sighted. By summing future residuals and weighting them with a geometrically decaying factor, we get $$\hat{A}_t^{GAE(\lambda)} = \delta_t + (\gamma \lambda) \cdot \delta_{t+1} + \cdots + (\gamma \lambda)^{T-t-1} \delta_{T-1}$$

This gives the recursive step $\hat{A}_t^{GAE(\lambda)} = \delta_t + (1-d_{t+1}) \cdot (\gamma \lambda) \cdot \hat{A}_{t+1}^{GAE(\lambda)}$, which allows us to start from the final step and work backwards.
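
As a sanity check, the discounted sum and the backward recursion can be compared on a tiny rollout. This is a minimal NumPy sketch for a single environment with no terminations (so every $d_{t+1} = 0$); the function names `gae_direct` and `gae_recursive` and all numbers are made up for illustration.

```python
import numpy as np


def gae_direct(deltas, gamma, lam):
    # A_t = sum_k (gamma * lam)^k * delta_{t+k}, summed up to the last step
    num_steps = len(deltas)
    adv = np.zeros(num_steps)
    for step in range(num_steps):
        for k in range(num_steps - step):
            adv[step] += (gamma * lam) ** k * deltas[step + k]
    return adv


def gae_recursive(deltas, gamma, lam):
    # A_t = delta_t + gamma * lam * A_{t+1}, computed from the last step backwards
    adv = np.zeros(len(deltas))
    running = 0.0
    for step in reversed(range(len(deltas))):
        running = deltas[step] + gamma * lam * running
        adv[step] = running
    return adv


gamma, lam = 0.99, 0.95
rewards = np.array([1.0, 1.0, 1.0, 1.0])
values = np.array([0.5, 0.4, 0.3, 0.2])
next_value = 0.1  # critic estimate for the observation after the last step

next_values = np.append(values[1:], next_value)
deltas = rewards + gamma * next_values - values

adv_direct = gae_direct(deltas, gamma, lam)
adv_recursive = gae_recursive(deltas, gamma, lam)
print(adv_direct)
print(adv_recursive)
```

Both functions return the same array, and the last entry equals the last residual because there is nothing left to sum after it.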

In [148]:
@t.inference_mode()
def compute_advantages(next_value, next_terminated, rewards, values, terminated, gamma, gae_lambda):
    # standardize as floats, also needed for the 1.0 - next_terminated
    terminated = terminated.float()
    next_terminated = next_terminated.float()

    # Concatenate tensors along dim=0 (time dimension)
    next_values = t.concat((values[1:], next_value[None, :]), dim=0)
    next_terminated = t.concat((terminated[1:], next_terminated[None, :]), dim=0)

    # calculated from above
    deltas = rewards + gamma * (1.0 - next_terminated) * next_values - values
    advantages = t.zeros_like(deltas)
    
    # update from the last element
    advantages[-1] = deltas[-1]

    # recursively iterate from the back
    for s in reversed(range(values.shape[0] - 1)):
        advantages[s] = deltas[s] + (1 - next_terminated[s]) * gamma * gae_lambda * advantages[s + 1]

    return advantages

If we have `num_envs` vectorized environments and `num_steps_per_rollout` steps per rollout, then each rollout yields `batch_size = num_envs * num_steps_per_rollout` transitions in total. Our goal is to randomly split those `batch_size` transitions into minibatches of size `minibatch_size`.

In [149]:
def get_minibatch_indices(rng, batch_size, minibatch_size):
    num_minibatches = batch_size // minibatch_size

    # numbers 0 -> batch_size-1, shaped by the num_minibatches
    indices = rng.permutation(batch_size).reshape(num_minibatches, minibatch_size)
    return list(indices)
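
A quick usage check of the function above, repeated here so the snippet runs on its own. The sizes (8 and 4) are arbitrary small numbers chosen for illustration.

```python
import numpy as np


def get_minibatch_indices(rng, batch_size, minibatch_size):
    num_minibatches = batch_size // minibatch_size
    indices = rng.permutation(batch_size).reshape(num_minibatches, minibatch_size)
    return list(indices)


rng = np.random.default_rng(0)
minibatches = get_minibatch_indices(rng, batch_size=8, minibatch_size=4)

for mb in minibatches:
    print(mb)

# Every index 0..7 appears exactly once across the minibatches
all_indices = np.sort(np.concatenate(minibatches))
print(all_indices)
```

Note that the `reshape` only works when `batch_size` is an exact multiple of `minibatch_size`; otherwise NumPy raises a `ValueError`.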

We define the `ReplayMinibatch` and the `ReplayMemory` which has all the necessary functions (`reset`, `add`, and `get_minibatches`). 

`reset` resets the memory; since PPO is on-policy, we use resets instead of slicing.

`add` adds the most recent arrays related to the experience.

`get_minibatches` returns all the minibatches, which are the batches used for backpropagation on the two networks.
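
The empty-buffer trick used by `reset` and `add` can be seen in a few lines. This is a minimal sketch with made-up sizes (4 environments, observations of shape `(4,)`), not the notebook's actual memory class.

```python
import numpy as np

num_envs, obs_shape = 4, (4,)

# Zero time steps so far, but the trailing dimensions already match one step of data
buffer = np.empty((0, num_envs, *obs_shape), dtype=np.float32)

for step in range(3):
    # One step of fake observations, shape (num_envs, *obs_shape)
    obs = np.full((num_envs, *obs_shape), step, dtype=np.float32)
    # obs[None, :] has shape (1, num_envs, *obs_shape), so it stacks along the time axis
    buffer = np.concatenate((buffer, obs[None, :]))

print(buffer.shape)  # (3, 4, 4)
```

Because the buffer starts with a leading dimension of 0, the first `add` goes through the same `np.concatenate` call as every later one.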

In [150]:
# separate np for data storage and tensors for when torch is actually needed for grads, etc.
# keeps all per-timestep rollout data neatly packaged, easier function interfaces
@dataclass
class ReplayMinibatch:
    obs: Float[Tensor, "minibatch_size *obs_shape"]
    actions: Int[Tensor, "minibatch_size *action_shape"]
    logprobs: Float[Tensor, "minibatch_size"]
    advantages: Float[Tensor, "minibatch_size"]
    returns: Float[Tensor, "minibatch_size"]
    terminated: Bool[Tensor, "minibatch_size"]

class ReplayMemory:
    def __init__(self, num_envs, obs_shape, action_shape, batch_size, minibatch_size, batches_per_learning_phase, seed: int = 42):
        self.num_envs = num_envs
        self.obs_shape = obs_shape
        self.action_shape = action_shape
        self.batch_size = batch_size
        self.minibatch_size = minibatch_size
        self.batches_per_learning_phase = batches_per_learning_phase
        self.rng = np.random.default_rng(seed)
        self.reset()

    # *self.obs_shape and *self.action_shape allows for shape flexibility
    def reset(self):
        self.obs = np.empty((0, self.num_envs, *self.obs_shape), dtype=np.float32)
        self.actions = np.empty((0, self.num_envs, *self.action_shape), dtype=np.int32)
        self.logprobs = np.empty((0, self.num_envs), dtype=np.float32)
        self.values = np.empty((0, self.num_envs), dtype=np.float32)
        self.rewards = np.empty((0, self.num_envs), dtype=np.float32)
        self.terminated = np.empty((0, self.num_envs), dtype=bool)

    # [None, :] adds a leading axis of length 1 (the time-step axis), so each call appends one time step
    def add(self, obs, actions, logprobs, values, rewards, terminated):
        self.obs = np.concatenate((self.obs, obs[None, :]))
        self.actions = np.concatenate((self.actions, actions[None, :]))
        self.logprobs = np.concatenate((self.logprobs, logprobs[None, :]))
        self.values = np.concatenate((self.values, values[None, :]))
        self.rewards = np.concatenate((self.rewards, rewards[None, :]))
        self.terminated = np.concatenate((self.terminated, terminated[None, :]))

    def get_minibatches(self, next_value, next_terminated, gamma, gae_lambda):
        obs, actions, logprobs, values, rewards, terminated = (
            t.tensor(x, device = device, dtype = t.float32)
            for x in [self.obs, self.actions, self.logprobs, self.values, self.rewards, self.terminated]
        )

        # these two will be useful in the critic learning value function, which depends on V_target, aka returns
        advantages = compute_advantages(next_value, next_terminated, rewards, values, terminated, gamma, gae_lambda)
        returns = advantages + values

        minibatches = []
        # we update on each experience more than once, particularly self.batches_per_learning_phase times
        for _ in range(self.batches_per_learning_phase):
            for indices in get_minibatch_indices(self.rng, self.batch_size, self.minibatch_size):
                # flatten(0, 1) merges dims 0 and 1: [T, num_envs, *feature_dims] -> [T * num_envs, *feature_dims]
                minibatches.append(
                    ReplayMinibatch(*[data.flatten(0, 1)[indices] 
                                    for data in [obs, actions, logprobs, advantages, returns, terminated]])
                )

        # clears the rollout buffer for the next training cycle
        self.reset()

        return minibatches

Defining the `PPOAgent` class, which includes two key functions.

`play_step`: uses the actor network to get the policy, from which actions are sampled. Those actions are used in the environment which generates experiences stored in the replay memory. Then, `self.next_obs` and `self.next_terminated` are updated, as well as `self.step`.

`get_minibatches`: uses the critic network to estimate the value of the next observation (needed to compute the advantages), then calls `get_minibatches` from the `ReplayMemory` class. It also resets the memory.

In [151]:
class PPOAgent:
    def __init__(self, envs, actor, critic, memory):
        self.envs = envs
        self.actor = actor
        self.critic = critic
        self.memory = memory

        self.step = 0
        self.next_obs = t.tensor(envs.reset()[0], device = device, dtype = t.float)
        self.next_terminated = t.zeros(envs.num_envs, device = device, dtype = t.bool)

    def play_step(self):
        obs = self.next_obs
        terminated = self.next_terminated

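        # f-prop to get the policy
        # logits stay 2-D, shape (num_envs, num_actions), since Categorical needs one row of logits per env;
        # only the critic output, shape (num_envs, 1), is flattened to (num_envs,)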
        with t.inference_mode():
            logits = self.actor(obs)
            values = self.critic(obs).flatten().cpu().numpy()

        # sample from the policy to determine which action to take
        dist = Categorical(logits = logits)
        actions = dist.sample()

        # take the action -> step
        next_obs, rewards, next_terminated, next_truncated, infos = self.envs.step(actions.cpu().numpy())

        # add everything into the memory
        logprobs = dist.log_prob(actions).cpu().numpy()
        self.memory.add(obs.cpu().numpy(), actions.cpu().numpy(), logprobs, values, rewards, terminated.cpu().numpy())

        # update info for next step
        self.next_obs = t.from_numpy(next_obs).to(device, dtype = t.float)
        self.next_terminated = t.from_numpy(next_terminated).to(device, dtype = t.float)
        self.step += self.envs.num_envs

        return infos
    
    # .flatten() turns the critic output from (num_envs, 1) into (num_envs,), matching the per-env arrays in memory
    def get_minibatches(self, gamma, gae_lambda):
        # get the minibatches from the memory
        with t.inference_mode():
            next_value = self.critic(self.next_obs).flatten()
        minibatches = self.memory.get_minibatches(next_value, self.next_terminated, gamma, gae_lambda)
        self.memory.reset()

        return minibatches

Here, we calculate the three components that will comprise the final training objective, given by $$L_t^{PPO}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$$
where the first term represents the PPO clipped objective function $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)\right]$.

The second term represents the critic's value function loss, defined by $$L^{VF}(\theta) = (V_\theta(s_t)-V_t^{target})^2$$ where $V_t^{target} = V_{\theta_{target}}(s_t)+\hat{A}_{\theta_{target}}(s_t, a_t)$ is the return.

The third term represents the entropy bonus, which encourages exploration. The standard formula for a discrete probability distribution $p$ is given by $$H(p) = \sum_x p(x) \ln\left(\frac1{p(x)}\right)$$ with the convention that $0 \cdot \ln\left(\frac1{0}\right)=0$. The larger the entropy, the more uncertain the action choice is, which encourages exploration. If the entropy is 0, there is no uncertainty (the policy is deterministic) and hence no exploration.
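
To make the entropy formula concrete, here is a small NumPy sketch for a two-action policy. The helper `entropy` is a made-up name for illustration; in the notebook itself the value comes from `dist.entropy()`.

```python
import numpy as np


def entropy(p):
    # H(p) = -sum_x p(x) ln p(x), skipping zero-probability outcomes (0 * ln(1/0) = 0)
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(-(p[nonzero] * np.log(p[nonzero])).sum())


uniform = entropy([0.5, 0.5])        # maximum for two actions: ln 2
skewed = entropy([0.9, 0.1])         # the policy mostly commits to one action
deterministic = entropy([1.0, 0.0])  # no uncertainty at all

print(uniform, skewed, deterministic)
```

The uniform policy has the largest entropy and the deterministic one has zero, so adding the entropy term to the objective pushes against premature commitment to a single action.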

In [152]:
def calc_clipped_obj(dist, mb_action, mb_advantages, mb_logprobs, clip_coef, eps: float = 1e-8):
    # we define prob_ratio as the ratio of the probabilities of new policy to the old policy
    #  e^{logx - logy} = x/y, which is what we want
    logits_diff = dist.log_prob(mb_action) - mb_logprobs
    prob_ratio = t.exp(logits_diff)

    # we standardize the mb_advantage
    # standardization helps with training stability by keeping the advantages in a reasonable range
    mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + eps)

    # standard application of the formula
    non_clipped = prob_ratio * mb_advantages
    clipped = t.clip(prob_ratio, 1 - clip_coef, 1 + clip_coef) * mb_advantages

    return t.minimum(non_clipped, clipped).mean()

# for MSE loss between predicted values and actual returns
def calc_value_fn_loss(values, mb_returns, vf_coef):

    return vf_coef * ((values - mb_returns)**2).mean()

# to encourage exploration
def calc_entropy(dist, ent_coef):
    return ent_coef * dist.entropy().mean()
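
A small worked example of the clipping, using NumPy with made-up probability ratios and advantages (the numbers are purely illustrative and skip the advantage standardization step).

```python
import numpy as np

clip_coef = 0.2
prob_ratio = np.array([1.5, 0.5, 1.5, 0.5])   # new policy prob / old policy prob
advantages = np.array([1.0, -1.0, -1.0, 1.0])

non_clipped = prob_ratio * advantages
clipped = np.clip(prob_ratio, 1 - clip_coef, 1 + clip_coef) * advantages

# Elementwise minimum: the pessimistic choice between the two estimates
per_sample_obj = np.minimum(non_clipped, clipped)
print(per_sample_obj)  # [ 1.2 -0.8 -1.5  0.5]
```

In the first two entries the ratio has moved in the direction the advantage favours, so the clipped value is used and there is no incentive to move further. In the last two entries the ratio has moved the wrong way, so the unclipped value is kept.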

Here, we use a linear scheduler which decreases the learning rate from `initial_lr` to `end_lr`. If we wanted it to decay for some period and then stay constant afterwards, we would do something like in `cartpole_dqn.ipynb` and clamp the decay fraction with the `min()` function, sorta like a ReLU function.
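
The interpolation itself is one line of arithmetic. Below is a minimal sketch of both variants as plain functions; `linear_lr` and `linear_then_constant_lr` are hypothetical names, and the step counts are example values.

```python
def linear_lr(step, total_steps, initial_lr, end_lr):
    # Fraction of the schedule completed so far, then a straight-line interpolation
    frac = step / total_steps
    return initial_lr + frac * (end_lr - initial_lr)


def linear_then_constant_lr(step, decay_steps, initial_lr, end_lr):
    # Same interpolation, but the fraction is clamped at 1 so the lr stays at end_lr afterwards
    frac = min(step / decay_steps, 1.0)
    return initial_lr + frac * (end_lr - initial_lr)


start = linear_lr(0, 976, 2.5e-4, 0.0)
halfway = linear_lr(488, 976, 2.5e-4, 0.0)
end = linear_lr(976, 976, 2.5e-4, 0.0)
after_decay = linear_then_constant_lr(2000, 976, 2.5e-4, 0.0)

print(start, halfway, end, after_decay)
```

The schedule only reaches `end_lr` if the number of `step` calls matches the `total_steps` it was built with; calling it fewer times leaves the learning rate somewhere along the line.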

In [153]:
class PPOScheduler:
    def __init__(self, optimizer, initial_lr, end_lr, total_phases):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.end_lr = end_lr
        self.total_phases = total_phases
        self.n_step_calls = 0

    # the scheduler then linearly decays from initial_lr to end_lr
    def step(self):
        self.n_step_calls += 1
        frac = self.n_step_calls / self.total_phases
        # we update for both actor and critic
        for param_group in self.optimizer.param_groups:
            param_group["lr"] = self.initial_lr + frac * (self.end_lr - self.initial_lr)

# not sure: is it better to have end_lr = 0.0 at the end or still some small non-0 value?
def make_optimizer(actor, critic, total_phases, initial_lr, end_lr = 0.0):
    # passes all parameters (actor and critic) into a single optimizer
    optimizer = optim.AdamW(itertools.chain(
        actor.parameters(), critic.parameters()), lr = initial_lr, eps = 1e-5, maximize = True)
    scheduler = PPOScheduler(optimizer, initial_lr, end_lr, total_phases)
    return optimizer, scheduler

In [154]:
class PPOTrainer:
    def __init__(self, args):
        set_global_seeds(args.seed)
        self.args = args
        self.run_name = f"{args.env_id}_{args.project_name}_seed{args.seed}__{time.strftime('%Y%m%d-%H%M%S')}"
        self.envs = gym.vector.SyncVectorEnv([
            make_env(
                env_id=args.env_id,
                seed=args.seed + idx,
                idx=idx,
                run_name=self.run_name,
                mode=args.mode,
                video_log_freq=args.video_log_freq,
                video_save_path=args.video_save_path
            ) for idx in range(args.num_envs)
        ])

        self.num_envs = self.envs.num_envs
        self.action_shape = self.envs.single_action_space.shape
        self.obs_shape = self.envs.single_observation_space.shape

        self.memory = ReplayMemory(self.num_envs, self.obs_shape, self.action_shape, args.batch_size, args.minibatch_size,
                                   args.batches_per_learning_phase, args.seed)
        
        self.actor, self.critic = get_actor_and_critic(self.envs, mode = args.mode)
        # Move networks to the correct device
        self.actor = self.actor.to(device)
        self.critic = self.critic.to(device)
        
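        # NOTE: scheduler.step() is called once per learning phase (see learning_phase), but the decay horizon
        # passed here is args.total_training_steps, so the lr only decays slightly over a run
        # (2.5e-4 -> ~2.3e-4 in the logs below); pass args.total_phases instead to decay all the way to end_lr
        self.optimizer, self.scheduler = make_optimizer(self.actor, self.critic, args.total_training_steps, args.lr)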

        self.agent = PPOAgent(self.envs, self.actor, self.critic, self.memory)

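    # data is whatever get_episode_data_from_infos returns when an episode has finished (None otherwise);
    # it is passed to wandb.log and later unpacked with ** into the progress bar, so it is dict-like
    # if several episodes finish during one rollout, each is logged to wandb but only the last is returned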
    def rollout_phase(self):
        data = None
        t0 = time.time()
        
        for i in range(self.args.num_steps_per_rollout):
            infos = self.agent.play_step()
            new_data = get_episode_data_from_infos(infos)

            # if the env terminated
            if new_data is not None:
                data = new_data
                wandb.log(new_data, step = self.agent.step)

        # steps per second (roughly 1000 in the runs logged below)
        wandb.log({"SPS": (self.args.num_steps_per_rollout * self.num_envs) / (time.time() - t0)}, step = self.agent.step)

        return data

    # forward direction, using the agent (actor/critic networks) to calculate the three functions and compute the total objective
    def compute_ppo_objective(self, minibatch):
        # actor policy, used in clipped_surrogate_obj
        logits = self.actor(minibatch.obs)
        dist = Categorical(logits = logits)

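        # values, used in the value fn loss
        # the critic outputs shape (minibatch_size, 1); squeeze to (minibatch_size,) so it matches minibatch.returns
        # (otherwise the subtraction in the loss would broadcast to (minibatch_size, minibatch_size))
        values = self.critic(minibatch.obs).squeeze(-1)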

        clipped_obj = calc_clipped_obj(dist, minibatch.actions, minibatch.advantages, minibatch.logprobs, self.args.clip_coef)
        value_fn_loss = calc_value_fn_loss(values, minibatch.returns, self.args.vf_coef)
        entropy = calc_entropy(dist, self.args.ent_coef)
        total_obj = clipped_obj - value_fn_loss + entropy

        # purely for logging debug variables
        with t.inference_mode():
            new_log_prob = dist.log_prob(minibatch.actions)
            log_ratio = new_log_prob - minibatch.logprobs
            ratio = log_ratio.exp()
            clip_frac = [((ratio - 1.0).abs() > self.args.clip_coef).float().mean().item()]
            approx_kl = ((ratio - 1) - log_ratio).mean().item()

        # .mean() reduces a tensor to a 0-dim tensor; .item() converts a 0-dim tensor to a Python float for logging
        # clip_frac is a one-element Python list, so np.mean(clip_frac) just returns that element (lists have no .mean())
        wandb.log(dict(
            total_steps = self.agent.step,
            values = values.mean().item(),
            lr = self.scheduler.optimizer.param_groups[0]["lr"],
            clipped_obj = clipped_obj.item(),
            value_fn_loss = value_fn_loss.item(),
            entropy = entropy.item(),
            approx_kl = approx_kl,
            clip_frac = np.mean(clip_frac)
        ), step = self.agent.step)

        return total_obj
    
    # after calculating the total_obj, perform backprop to update weights
    def learning_phase(self):
        minibatches = self.agent.get_minibatches(self.args.gamma, self.args.gae_lambda)
        for minibatch in minibatches:
            total_obj = self.compute_ppo_objective(minibatch)
            total_obj.backward()
            # step recommended in the iclr blog
            # global gradient clipping offers a small performance boost -> global l2 norm doesn't exceed 0.5
            nn.utils.clip_grad_norm_(list(self.actor.parameters()) + list(self.critic.parameters()), self.args.max_grad_norm)
            self.optimizer.step()
            self.optimizer.zero_grad()
        self.scheduler.step()

    def train(self):
        # entity is the wandb username or team that owns the run; monitor_gym asks wandb to log the recorded env videos
        wandb.init(project = self.args.project_name,
                   entity = self.args.entity,
                   name = self.run_name,
                   monitor_gym = self.args.video_log_freq is not None)
        # log parameter histograms, gradient norms, and updates
        wandb.watch([self.actor, self.critic], log = "all", log_freq = 50)

        pbar = tqdm(range(self.args.total_phases))
        last_logged_time = time.time()

        for phase in pbar:
            data = self.rollout_phase()

            if data is not None and time.time() - last_logged_time > 0.5:
                last_logged_time = time.time()
                # show the phase and the latest episode statistics next to the progress bar
                pbar.set_postfix(phase = phase, **data)

            self.learning_phase()

        self.envs.close()
        wandb.finish()
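
One detail in `compute_ppo_objective` worth a sketch is the squeeze on the critic output. NumPy is used here for brevity, on the assumption that the shapes behave the same way in PyTorch, which follows NumPy-style broadcasting; the arrays are made-up stand-ins for the critic output and the returns.

```python
import numpy as np

minibatch_size = 4
values = np.ones((minibatch_size, 1))   # critic output: one column per observation
returns = np.zeros(minibatch_size)      # returns: a flat vector

# Without squeezing, (4, 1) - (4,) broadcasts to (4, 4): every value is compared with every return
wrong = (values - returns) ** 2

# Squeezing the last axis gives (4,) - (4,), i.e. one squared error per sample
right = (values.squeeze(-1) - returns) ** 2

print(wrong.shape, right.shape)  # (4, 4) (4,)
```

The unsqueezed version still produces a scalar after `.mean()`, so it raises no error, which is what makes the missing squeeze hard to notice.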

In [155]:
args = PPOArgs(video_log_freq = 50)
trainer = PPOTrainer(args)
trainer.train()

100%|██████████| 976/976 [12:48<00:00,  1.27it/s, episode_duration=2.7, episode_length=500, episode_reward=500, phase=973]  


metric,history
SPS,█████▅▄██▇▇████▄▄▇▆█▅▅▁▅▁▅█▇█▇█▆▃█▅▅▇▇▆▇
approx_kl,▁▁▂▄▃▂▃▄▁█▁▁▁▁▄▁▁▃▂▁▁▁▁▁▁▁▁▄▃▂▇▁▂▂▃▂▁▄▂▁
clip_frac,▁▁▂▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▂▃▃▁▁▁▁▂▅▄▁▁▁▅█▆
clipped_obj,▇▆▆▅▆▇▇▆▆▅▅▆▆▂▅▆▇▆▁▆▆▅▆▁▅▆▆▆▆▆▇▆▆▇▅▆▆█▆▅
entropy,█▇▇▆▆▅▅▅▆▆▆▆▅▆▆▅▆▆▆▆▅▆▆▆▆▆▆▆▅▄▂▄▃▂▃▃▂▂▁▁
episode_duration,▁▁▁▁▂▁▂▂▄▁▃▅▃▃▅▄▃▅▅▄▄▃▄▃▃▄▃▃▃▃▄▆▆█▆▅▅▅▅▅
episode_length,▁▁▁▁▁▁▁▃▄▄█▆▆▄▆█▃▃█▅▆▇▄▄▄▄▃▄▅▅▆▅▄▇██████
episode_reward,▁▁▁▁▁▁▁▂▄▄▅▄▅▃▇▄▄▅▄▆▄▅▄▄▆███████████████
lr,██▇▇▇▇▇▇▆▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▃▂▂▁▁▁▁
total_steps,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▄▄▅▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇███

metric,final_value
SPS,1042.401
approx_kl,4e-05
clip_frac,0.0
clipped_obj,9e-05
entropy,0.00301
episode_duration,2.70312
episode_length,500.0
episode_reward,500.0
lr,0.00023
total_steps,499712.0


In [166]:
class EasyCart(CartPoleEnv):
    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)

        x, velo, theta, omega = obs
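        theta_max = 0.2095  # pole-angle termination threshold (12 degrees, in radians)
        # NOTE: CartPole only terminates at |x| > 2.4, so with x_max = 0.2095 reward_x
        # becomes negative once |x| > 0.2095; the value is left as it was in the logged run
        x_max = 0.2095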
        

        reward_theta = 1 - abs(theta / theta_max)
        reward_x = 1 - abs(x / x_max)

        reward_new = (reward_theta + reward_x) / 2

        return obs, reward_new, terminated, truncated, info

gym.envs.registration.register(id="EasyCart-v0", entry_point=EasyCart, max_episode_steps=500)
args = PPOArgs(env_id="EasyCart-v0", video_log_freq=50)
trainer = PPOTrainer(args)
trainer.train()

100%|██████████| 976/976 [11:56<00:00,  1.36it/s, episode_duration=3.44, episode_length=500, episode_reward=488, phase=974]   


metric,history
SPS,▇▇▆▇█▄████▃▅███▁▅█▅██████▂█████▅███▅█▇▄▅
approx_kl,▁▁▁▁▄▁▂▂▁▃▂▃▂▄▂▁▂▁▁▁██▂▄▂▃▄▂▅▃▂▃▂▁▃▁▂▃▃▃
clip_frac,▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁█▃▂▄▄▁▁▁▁▂▅▄▂▁▁
clipped_obj,▅▃▃▃▃▃▂▂▄▆▁▃▇▄▃▂▃█▆▄▃▃▃▂▃▄█▃▂▄▅▂▃▄█▃▂▄▃▃
entropy,▇██▇▇▆███▇█▇▇▇▇▆▆▇▆▇▅▄▅▄▄▄▃▃▂▂▂▂▂▁▁▁▁▁▁▁
episode_duration,▁▁▁▁▁▂▁▁▂▂▃▂▃▂▂▂▂▂▂▂▂▃▂▃▆█▆▇▇▆▆▆▇▆▆▆▆▆▆▇
episode_length,▁▂▁▁▁▁▂▄▁▁▂▂▂▂▃▂▁▂▂▃▃▂▃▂▄▂▃▃▃▃▃▃▃▅▃█████
episode_reward,▁▁▂▂▂▂▂▃▂▂▂▁▂▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▃▇▂▇▇▇█████
lr,██▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁
total_steps,▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▇▇▇▇▇██

metric,final_value
SPS,1017.15206
approx_kl,0.00051
clip_frac,0.00781
clipped_obj,0.00483
entropy,0.00165
episode_duration,3.4375
episode_length,500.0
episode_reward,488.39368
lr,0.00023
total_steps,499712.0


In [167]:
class EasyCart(CartPoleEnv):
    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)

        x, velo, theta, omega = obs
        theta_max = 0.2095
        x_max = 0.2095
        

        reward_theta = 1 - abs(theta / theta_max)
        reward_x = 1 - abs(x / x_max)

        reward_new = reward_theta * reward_x

        return obs, reward_new, terminated, truncated, info

gym.envs.registration.register(id="EasyCart-v0", entry_point=EasyCart, max_episode_steps=500)
args = PPOArgs(env_id="EasyCart-v0", video_log_freq=50)
trainer = PPOTrainer(args)
trainer.train()

100%|██████████| 976/976 [11:40<00:00,  1.39it/s, episode_duration=2.67, episode_length=500, episode_reward=390, phase=975]    


metric,history
SPS,▇▇▇▇▇▁▄▇▇▇▇▇▇▅▇▇█▇▃▄▇▃▇▇▇▄▆▆▇▇▇▇▇▇▇▇▇▇▅▇
approx_kl,▁▁▁▁▁▁▂▃▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂█▁▂▂▃▁▃▁▂▂▁▁▁▂▁▁▂
clip_frac,▁▁▁▁▁▁▁▁▁▁▁▁▃▁▂▁▁▁▁▁▁▁█▁▁▁▁▁▁▂▁▅▂▁▁▂▅▁▃▂
clipped_obj,▂▁▁▂▁▅▃▂▄█▅▂▁▂▂▂▂▃▂▃▂▅▃▁▂▂▂▃▅▂▃▃▇▂▄▂▄▂▁▄
entropy,██▇▆▆▅▄▃▄▃▃▃▂▃▂▂▂▂▂▂▂▁▂▂▃▂▃▄▃▄▃▃▃▂▃▂▃▂▂▃
episode_duration,▃▁▁▂▁▂▁▂▂▃▂▁▂▂▂▃▂▂▁▃▃▃▂▃▅▂▄▅▄▅▃▄▃▅▃▇███▃
episode_length,▁▁▁▂▁▂▁▂▂▂▁▂▂▃▂▂▂▂▂▅▄▃▂▆▂▅▂▅▄▃▂▄▃▂▃████▄
episode_reward,▄▄▄▄▄▄▄▄▅▅▅▅▄▄▄▄▅▄▅▄▅▄▅▅▆▄▅▃▅▆▄▅▅█▃▄▄▃▁▇
lr,████▇▇▇▇▇▇▆▆▆▆▆▅▅▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁
total_steps,▁▁▁▁▁▂▂▂▂▂▂▃▃▄▄▄▄▄▄▄▄▄▅▅▅▅▅▅▅▆▆▇▇▇▇▇▇███

metric,final_value
SPS,1019.86782
approx_kl,0.00361
clip_frac,0.07031
clipped_obj,-0.00575
entropy,0.00194
episode_duration,2.67188
episode_length,500.0
episode_reward,389.9183
lr,0.00023
total_steps,499712.0


In [170]:
class EasyCart(CartPoleEnv):
    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)

        x, velo, theta, omega = obs
        theta_max = 0.2095
        x_max = 0.2095
        
        # Clip the ratios to ensure they don't exceed 1
        theta_ratio = np.clip(abs(theta / theta_max), 0, 1)
        x_ratio = np.clip(abs(x / x_max), 0, 1)
        
        # This ensures rewards are always in [0, 1]
        reward_theta = 1 - theta_ratio
        reward_x = 1 - x_ratio
        
        # Now the product will always be non-negative and we can safely take the square root
        reward_new = np.sqrt(max(0, reward_theta * reward_x))

        return obs, reward_new, terminated, truncated, info

# Remove any earlier registration of this id so that the updated class is the one being used
if "EasyCart-v0" in gym.envs.registry:
    del gym.envs.registry["EasyCart-v0"]
gym.envs.registration.register(id="EasyCart-v0", entry_point=EasyCart, max_episode_steps=500)
args = PPOArgs(env_id="EasyCart-v0", video_log_freq=50)
trainer = PPOTrainer(args)
trainer.train()

100%|██████████| 976/976 [12:01<00:00,  1.35it/s, episode_duration=2.78, episode_length=500, episode_reward=478, phase=975] 


metric,history
SPS,█▇▄▅▆▄▁▇███████▅▅▅▆▁█▆▆█▇▇▇███▇▆▇▇▇▇▇█▇█
approx_kl,▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▂▂▁▂▄█▂▁█▃▂▂▂▂▄▁▁▄▂▁▁▁
clip_frac,▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▁▁▂█▄▁▁▂▂▁▂▄▃▅▅▄▁▄▃▁▃▆▄▃▃
clipped_obj,▅▄▄▅▄▄▆▆▃▅▄█▃▂▂▅▆▄▅▄▄▅▄▅▆▆▄█▅▃▂▆▃▃▂▁▄▄▄▅
entropy,██▇▆▆▆▆▆▆▆▆▆▅▅▆▅▄▄▄▃▂▂▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁
episode_duration,▁▁▁▁▂▃▂▂▂▃▅▃█▇▇███▆▆▇█▆▇▇▇▆▇▇▆▇▇▆▇▇▆▆▆▇▇
episode_length,▁▁▁▃▂▃▂▁▂▁▁▃▅▆▃▇▄▃▆▅▆███████████████████
episode_reward,▁▁▁▁▁▂▂▁▁▁▁▂▁▁▂▂▂▃▂▇▇██▇▇███████████████
lr,██▇▇▇▇▇▇▆▆▆▆▆▆▆▅▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▁▁▁▁
total_steps,▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇██

metric,final_value
SPS,937.327
approx_kl,0.00136
clip_frac,0.00781
clipped_obj,0.00096
entropy,0.00148
episode_duration,2.78125
episode_length,500.0
episode_reward,477.97803
lr,0.00023
total_steps,499712.0


In [169]:
class EasyCart(CartPoleEnv):
    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action)

        x, velo, theta, omega = obs
        theta_max = 0.2095
        x_max = 0.2095
        

        kinetic_energy = 0.5 * (velo ** 2 + omega ** 2)
        reward_new = 1 - 0.1 * kinetic_energy

        return obs, reward_new, terminated, truncated, info

gym.envs.registration.register(id="EasyCart-v0", entry_point=EasyCart, max_episode_steps=500)
args = PPOArgs(env_id="EasyCart-v0", video_log_freq=50)
trainer = PPOTrainer(args)
trainer.train()

100%|██████████| 976/976 [11:42<00:00,  1.39it/s, episode_duration=2.77, episode_length=500, episode_reward=499, phase=973]  




metric,history
SPS,▁▆▅▆▇▇▇▇▇▇▆▇▃▆▇▇▇▇▆▇▇▇▇▇▇▇█▇▇▇▆▆▇▆▇▃▇▇▇▇
approx_kl,▂▅▆▃▃▁▁▂▁▃▁▁▁▁▁▁▁▂▂▁▆▂▆▁▁▁▂▂▁▃▃▁▁▁▁▁▁▅▂█
clip_frac,▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅█▁▂▁▁▁▁▁▁▁▁▁▄
clipped_obj,▆▄▃▄▁▃▂▂▂▄▂▂▃▇▃▃▃▂▇▁▄▃▄▂▄▁▆▃▃▃▃▃▃▄▅█▃▃▄▃
entropy,██▇▇▇▇▆▇▇▇▇▇▇▇▇▇▇█▇▆▄▅▄▄▄▄▅▄▄▅▂▃▂▂▂▁▂▂▁▁
episode_duration,▁▁▁▁▁▂▁▂▂▂▆▇▅▆▅▅▆▇▅▆▇▇▇▄▅▄▅▅▃▅▃▃▃▅█▅▅▅█▇
episode_length,▁▂▁▁▂▆▂█▇██▄▂▅███▃██▅▃▃▄█▄▃▄████▇▅█▇████
episode_reward,▂▁▁▂▁▂▃▅▆▆█▃▄▃█▅▃████▅▃▃▃▄▃▃███▇▅▇▄██▇██
lr,█████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁
total_steps,▁▁▁▁▁▁▁▂▂▃▃▃▃▄▄▄▄▄▅▅▅▆▆▆▆▆▆▆▆▇▇▇▇▇██████

metric,final_value
SPS,1028.71546
approx_kl,0.00449
clip_frac,0.07031
clipped_obj,0.01667
entropy,0.00181
episode_duration,2.76562
episode_length,500.0
episode_reward,498.58557
lr,0.00023
total_steps,499712.0
