# What are the differences between rpo_continuous_action and ppo_continuous_action
CleanRL is a library of single-file reinforcement learning implementations, and it ships several variants of Proximal Policy Optimization (PPO) for different action spaces. The two scripts you've mentioned, `rpo_continuous_action` and `ppo_continuous_action`, both target continuous action spaces: the first implements Robust Policy Optimization (RPO), a small modification of PPO, and the second implements standard PPO.

### Differences Between `rpo_continuous_action` and `ppo_continuous_action`

1. **Algorithm Variants**:
   - **`ppo_continuous_action`**: This implementation follows the standard PPO methodology introduced by Schulman et al. in the original PPO paper. PPO uses a clipped objective to stabilize training by removing the incentive for large policy updates.
   - **`rpo_continuous_action`**: RPO stands for **Robust Policy Optimization**. It keeps PPO's loss and training loop and changes one thing: during the policy update, the mean of the Gaussian action distribution is perturbed with uniform random noise before log-probabilities are computed. The aim is to keep the policy from becoming too deterministic too early.

2. **Clipping and Objectives**:
   - In **PPO**, the typical objective function includes a clipping mechanism that limits the policy's update step, preventing dramatic shifts in the policy distribution:
     $$
     L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]
     $$
     where $r_t(\theta)$ is the probability ratio of the new and old policies, and $\hat{A}_t$ is the estimated advantage.

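   - **RPO** uses the same clipped objective. The difference is in how $r_t(\theta)$ is obtained: when the new policy's log-probability is recomputed during the update, the action mean is first shifted by noise drawn from a uniform distribution, so the numerator $\pi_\theta(a_t | s_t)$ comes from a perturbed Gaussian.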

3. **Exploration Techniques**:
   - **PPO** explores through the stochasticity of its own policy: actions are sampled from a Gaussian with a learned standard deviation, optionally encouraged by an entropy bonus (`ent_coef`).
   - **RPO** samples actions during rollouts in the same way. Its perturbation of the mean during updates is intended to discourage the action distribution from collapsing, which may help in environments where the PPO policy loses entropy too quickly.

4. **Implementation Specifics**:
   - In CleanRL the two scripts share the same overall structure: environment wrappers, network architecture, advantage estimation, and loss. The RPO script adds the `rpo_alpha` hyperparameter and an extra branch in `get_action_and_value`; other details, such as default hyperparameters, may differ between the scripts and between versions.

5. **Performance and Robustness**:
   - Users may see differing performance results in different environments, with RPO potentially providing better robustness against instability during training, especially in environments with more complex dynamics or high variance in rewards.
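
The clipped objective from item 2 is shared by both scripts, and it can be sketched as a small standalone function. This is a minimal illustration rather than the CleanRL code itself; the function name `clipped_surrogate_loss` and the toy tensors are assumptions made for the example.

```python
import torch


def clipped_surrogate_loss(new_logprob, old_logprob, advantages, clip_coef=0.2):
    """Negative PPO clipped objective, averaged over the batch (hypothetical helper)."""
    ratio = (new_logprob - old_logprob).exp()  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantages
    # The objective takes the elementwise minimum; the loss is its negation.
    return -torch.min(unclipped, clipped).mean()


# Toy batch: one unchanged sample, one ratio above 1 + eps, one below 1 - eps.
old_logprob = torch.tensor([-1.0, -1.0, -1.0])
new_logprob = torch.tensor([-1.0, -0.5, -1.5])
advantages = torch.tensor([1.0, 1.0, -1.0])

loss = clipped_surrogate_loss(new_logprob, old_logprob, advantages)
```

In RPO the only input that changes is `new_logprob`, which is computed from a perturbed mean; the loss function stays the same.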

### Summary

Both `rpo_continuous_action` and `ppo_continuous_action` are built on the same Proximal Policy Optimization framework. The key difference is that RPO perturbs the action mean with uniform noise while computing log-probabilities during the update, which is meant to preserve policy stochasticity and make training more robust.

When choosing between them, the best approach might depend on the specific characteristics of the environment you're training the agent in, along with empirical results from performance testing. If you have access to both implementations in CleanRL, experimenting with both could provide insights into which algorithm performs better for your specific task.

# The different part
The code snippet you've provided highlights the key difference between the `rpo_continuous_action` and `ppo_continuous_action` implementations: how the action distribution is constructed in `get_action_and_value`, particularly in the **RPO** (Robust Policy Optimization) branch. Let's break down the rationale and functionality behind the change.

### Key Differences in Action Sampling

1. **Standard PPO Behavior**: 
   - In the standard **PPO** implementation:
     ```python
     if action is None:
         action = probs.sample()
     ```
   - Here, if no action is provided, a new action is sampled directly from the probability distribution defined by `action_mean` and `action_std`. This is typical in reinforcement learning, especially during training, where you want to sample actions that reflect the policy.

2. **RPO Behavior**:
   - In the **RPO** implementation:
     ```python
     else:  # new to RPO
         # sample again to add stochasticity to the policy
         z = torch.FloatTensor(action_mean.shape).uniform_(-self.rpo_alpha, self.rpo_alpha).to(device)
         action_mean = action_mean + z
         probs = Normal(action_mean, action_std)
     ```
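   - This branch runs when an action is provided, which in this script happens during the update phase, when log-probabilities of stored rollout actions are recomputed. The method adds a uniform random perturbation to `action_mean` and rebuilds the distribution, so the returned log-probability and entropy come from the perturbed Gaussian. The action itself is returned unchanged.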
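
The two behaviors above can be put side by side in one function. This is a sketch under stated assumptions: `evaluate_log_prob` is a hypothetical helper, not part of CleanRL, and it uses `torch.empty_like` instead of the script's `torch.FloatTensor(...).to(device)` so that it does not depend on a global `device`.

```python
import torch
from torch.distributions.normal import Normal


def evaluate_log_prob(action_mean, action_std, action, rpo_alpha=0.0):
    """Log-probability of stored actions (hypothetical helper).

    rpo_alpha == 0 reproduces the PPO evaluation; rpo_alpha > 0 adds the
    RPO-style uniform perturbation to the mean first.
    """
    if rpo_alpha > 0:
        z = torch.empty_like(action_mean).uniform_(-rpo_alpha, rpo_alpha)
        action_mean = action_mean + z
    return Normal(action_mean, action_std).log_prob(action).sum(1)


torch.manual_seed(0)
mean = torch.zeros(4, 2)  # batch of 4 states, 2 action dimensions
std = torch.ones(4, 2)
action = Normal(mean, std).sample()  # rollout sampling is identical in both scripts

ppo_logprob = evaluate_log_prob(mean, std, action)
rpo_logprob = evaluate_log_prob(mean, std, action, rpo_alpha=0.5)
```

The same stored `action` receives a different log-probability under the two calls, which is the whole effect of the extra branch.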

### Explanation of the RPO Regulation

**Stochastic Perturbation**:
- The line `z = torch.FloatTensor(action_mean.shape).uniform_(-self.rpo_alpha, self.rpo_alpha).to(device)` generates a tensor `z` with the same shape as `action_mean`, filled with uniform noise drawn from the range `[-self.rpo_alpha, self.rpo_alpha]`. This noise is then added to the `action_mean`:
  ```python
  action_mean = action_mean + z
  ```
  
- This perturbation introduces randomness into the action mean used to score the stored action. It does not change which actions were executed in the environment; it changes the log-probability that enters the loss, so the gradient is computed against a randomly shifted version of the policy.
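
Written in symbols, the branch does the following, where $\alpha$ stands for `rpo_alpha`, $\mu_\theta(s)$ for `action_mean`, and $\sigma$ for `action_std` (this notation is introduced here for illustration and does not appear in the script):

$$
z \sim \mathcal{U}(-\alpha, \alpha), \qquad \mu' = \mu_\theta(s) + z, \qquad \pi'_\theta(a \mid s) = \mathcal{N}\left(a;\ \mu', \sigma^2\right)
$$

The noise $z$ is drawn independently for every element of the mean, and redrawn on every forward pass through this branch.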

### Rationale Behind This Stochasticity
- **Sustained Stochasticity**: Averaged over the noise, the perturbed Gaussian behaves like a broader distribution than the unperturbed one. Training against it is intended to keep the policy from becoming too deterministic over time.
- **Regularization Effect**: Injecting noise into the action mean can be viewed as a form of **regularization**. No new actions are sampled in this branch; the noise changes how likely the stored actions appear under the policy, which discourages the policy from fitting them too sharply.
- **Tunable Strength**: The parameter `self.rpo_alpha` sets the width of the noise. A higher value yields a stronger perturbation, and `rpo_alpha = 0` recovers standard PPO. The default in this script is `0.5`.
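
To get a feel for how much `rpo_alpha` broadens the distribution, one can look at the variance of a Gaussian whose mean is shifted by uniform noise. The sketch below is only an illustration: it draws samples from the perturbed distribution, which the CleanRL script does not do during rollouts, and the values of `sigma`, `alpha`, and `n` are assumptions. It relies on the standard fact that a uniform variable on $[-\alpha, \alpha]$ has variance $\alpha^2 / 3$.

```python
import torch

torch.manual_seed(0)
sigma, alpha, n = 1.0, 0.5, 200_000  # illustrative values

# Shift a zero mean by uniform noise, then add Gaussian noise of scale sigma.
z = torch.empty(n).uniform_(-alpha, alpha)
samples = z + sigma * torch.randn(n)

empirical_var = samples.var().item()
# Independent noise sources add their variances: sigma^2 + alpha^2 / 3.
expected_var = sigma**2 + alpha**2 / 3
```

Doubling `alpha` quadruples the extra variance term, so the perturbation grows quickly relative to a policy whose own standard deviation has become small.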

### Summary
The regularization in `rpo_continuous_action` comes from adding controlled uniform noise to the action mean when log-probabilities are recomputed during the update. In standard PPO the same log-probabilities are computed from the unperturbed mean. In both scripts, actions during rollouts are sampled directly from the policy without additional noise. The perturbation is intended to keep exploration alive and may improve training stability and performance in challenging environments.

In the context of Reinforcement Learning, particularly in stochastic environments or those with high-dimensional action spaces, such techniques can contribute significantly towards achieving a more robust learning process.


# Why does this stochastic perturbation not get clipped by the original PPO algorithm?
In Proximal Policy Optimization (PPO), the clipping mechanism is designed to prevent overly large updates to the policy during training by bounding the probability ratios. The core idea is to maintain a balance between exploration (updating the policy to discover new strategies) and stability (not moving too far away from the existing policy in a single update).

### Understanding the Clipping Mechanism

The formula you've provided reflects the PPO objective, which uses the following components:

- $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$
  - This is the probability ratio of the new policy to the old policy for the action taken at time $t$.

- $\hat{A}_t$
  - This represents the estimated advantage at time $t$, indicating how much better the action was than the policy's expected behavior in that state.

- The clipping term:
  $$
  \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)
  $$
  - This restricts the probability ratio to a range around 1 (essentially $[1 - \epsilon, 1 + \epsilon]$), which stabilizes training by limiting how much the policy can change in a single training step.
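
What "limiting the change" means in practice is that a clipped sample stops contributing gradient. A minimal autograd sketch, assuming two toy samples with positive advantage (the tensors and variable names are made up for the example):

```python
import torch

clip_coef = 0.2
old_logprob = torch.tensor([-1.0, -1.0])
# First sample: ratio inside the clip range. Second: ratio above 1 + eps.
new_logprob = torch.tensor([-0.9, -0.5], requires_grad=True)
advantage = torch.tensor([1.0, 1.0])

ratio = (new_logprob - old_logprob).exp()
objective = torch.min(
    ratio * advantage,
    torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef) * advantage,
)
objective.sum().backward()

grad = new_logprob.grad  # gradient of the objective w.r.t. each log-probability
```

The first entry of `grad` is nonzero, while the second is zero because the clipped term was selected and `torch.clamp` has zero gradient outside its bounds.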

### Interaction with Stochastic Perturbation

1. **Where the Perturbation Is Applied**:
   - During rollouts, `get_action_and_value` is called without an action, so the action and its stored log-probability (the denominator of $r_t(\theta)$) come from the unperturbed Gaussian.
   - During the update, the same method is called with the stored action, so the RPO branch shifts `action_mean` by `z` before computing the new log-probability (the numerator of $r_t(\theta)$).

2. **Behavior during the Update**:
   - The clipping in PPO acts on the ratio inside the loss (`torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)`), not on `action_mean` or on `z`. Nothing in the algorithm clips the noise itself.
   - Because the numerator uses a perturbed mean and the denominator does not, $r_t(\theta)$ is generally not equal to 1 even before any parameter update, and it can fall outside $[1 - \epsilon, 1 + \epsilon]$.

3. **Clipping Effectiveness**:
   - The perturbed ratio passes through the usual clipped objective. When a sample's ratio is above $1 + \epsilon$ with a positive advantage, or below $1 - \epsilon$ with a negative advantage, the clipped term is selected and that sample contributes no gradient in that pass.
   - Since `z` is redrawn on every forward pass, a sample clipped in one pass may not be clipped in the next. The `losses/clipfrac` metric logged by the script reports the fraction of samples whose ratio falls outside the clip range.
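
The interaction can be checked numerically by mimicking one rollout step and one update step with identical parameters. This is a sketch, not the CleanRL training loop: the batch of 1000 one-dimensional actions, the zero mean, and the unit standard deviation are assumptions chosen for illustration.

```python
import torch
from torch.distributions.normal import Normal

torch.manual_seed(0)
rpo_alpha, clip_coef = 0.5, 0.2
mean = torch.zeros(1000, 1)
std = torch.ones(1000, 1)

# Rollout: no action is passed, so the action and the old log-prob
# both come from the unperturbed distribution.
dist = Normal(mean, std)
action = dist.sample()
old_logprob = dist.log_prob(action).sum(1)

# Update: same parameters, but the mean is perturbed before the
# log-prob of the stored action is recomputed.
z = torch.empty_like(mean).uniform_(-rpo_alpha, rpo_alpha)
new_logprob = Normal(mean + z, std).log_prob(action).sum(1)

ratio = (new_logprob - old_logprob).exp()
# Same definition as the script's clipfrac metric.
clipfrac = ((ratio - 1.0).abs() > clip_coef).float().mean().item()
```

Even though no parameter has changed, `ratio` is not identically 1 and `clipfrac` is nonzero. For a Gaussian the log-ratio has the closed form $\left[(a - \mu)^2 - (a - \mu - z)^2\right] / (2\sigma^2)$ per dimension, which depends only on the noise and on how far the action was from the mean.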

### Summary

The stochastic perturbation in the **RPO** variant is not removed by PPO's clipping because the two act on different quantities:
- **Clipping Acts on the Ratio, Not the Mean**: The noise is added to `action_mean` inside the forward pass; clipping is applied afterwards to $r_t(\theta)$ in the loss.
- **The Perturbed Ratio Is Still Clipped**: The noise can push $r_t(\theta)$ outside $[1 - \epsilon, 1 + \epsilon]$. When that happens, the usual PPO rule applies and the affected samples may contribute no gradient.
- **Training Stability**: The PPO safeguards (ratio clipping, gradient-norm clipping, and the optional `target_kl` early stop) remain in place, so the noise changes what the policy is trained against without removing those bounds.

Thus, RPO's perturbation works within the PPO clipping mechanism rather than bypassing it: the noise shapes the log-probabilities, and clipping then limits how far any single update can exploit them.


In [10]:
# docs and experiment results can be found at https://docs.cleanrl.dev/rl-algorithms/rpo/#rpo_continuous_actionpy
import os
import random
import time
from dataclasses import dataclass

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import tyro
from torch.distributions.normal import Normal
from torch.utils.tensorboard import SummaryWriter


@dataclass
class Args:
    exp_name: str = os.path.basename(__file__)[: -len(".py")]
    """the name of this experiment"""
    seed: int = 1
    """seed of the experiment"""
    torch_deterministic: bool = True
    """if toggled, `torch.backends.cudnn.deterministic=False`"""
    cuda: bool = True
    """if toggled, cuda will be enabled by default"""
    track: bool = False
    """if toggled, this experiment will be tracked with Weights and Biases"""
    wandb_project_name: str = "cleanRL"
    """the wandb's project name"""
    wandb_entity: str = None
    """the entity (team) of wandb's project"""
    capture_video: bool = False
    """whether to capture videos of the agent performances (check out `videos` folder)"""

    # Algorithm specific arguments
    env_id: str = "HalfCheetah-v4"
    """the id of the environment"""
    total_timesteps: int = 8000000
    """total timesteps of the experiments"""
    learning_rate: float = 3e-4
    """the learning rate of the optimizer"""
    num_envs: int = 1
    """the number of parallel game environments"""
    num_steps: int = 2048
    """the number of steps to run in each environment per policy rollout"""
    anneal_lr: bool = True
    """Toggle learning rate annealing for policy and value networks"""
    gamma: float = 0.99
    """the discount factor gamma"""
    gae_lambda: float = 0.95
    """the lambda for the generalized advantage estimation"""
    num_minibatches: int = 32
    """the number of mini-batches"""
    update_epochs: int = 10
    """the K epochs to update the policy"""
    norm_adv: bool = True
    """Toggles advantages normalization"""
    clip_coef: float = 0.2
    """the surrogate clipping coefficient"""
    clip_vloss: bool = True
    """Toggles whether or not to use a clipped loss for the value function, as per the paper."""
    ent_coef: float = 0.0
    """coefficient of the entropy"""
    vf_coef: float = 0.5
    """coefficient of the value function"""
    max_grad_norm: float = 0.5
    """the maximum norm for the gradient clipping"""
    target_kl: float = None
    """the target KL divergence threshold"""
    rpo_alpha: float = 0.5
    """the alpha parameter for RPO"""

    # to be filled in runtime
    batch_size: int = 0
    """the batch size (computed in runtime)"""
    minibatch_size: int = 0
    """the mini-batch size (computed in runtime)"""
    num_iterations: int = 0
    """the number of iterations (computed in runtime)"""


def make_env(env_id, idx, capture_video, run_name, gamma):
    def thunk():
        if capture_video and idx == 0:
            env = gym.make(env_id, render_mode="rgb_array")
            env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        else:
            env = gym.make(env_id)
        env = gym.wrappers.FlattenObservation(env)  # deal with dm_control's Dict observation space
        env = gym.wrappers.RecordEpisodeStatistics(env)
        env = gym.wrappers.ClipAction(env)
        env = gym.wrappers.NormalizeObservation(env)
        env = gym.wrappers.TransformObservation(env, lambda obs: np.clip(obs, -10, 10))
        env = gym.wrappers.NormalizeReward(env, gamma=gamma)
        env = gym.wrappers.TransformReward(env, lambda reward: np.clip(reward, -10, 10))
        return env

    return thunk


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    def __init__(self, envs, rpo_alpha):
        super().__init__()
        self.rpo_alpha = rpo_alpha
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, np.prod(envs.single_action_space.shape)), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, np.prod(envs.single_action_space.shape)))

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        else:  # new to RPO
            # sample again to add stochasticity to the policy
            z = torch.FloatTensor(action_mean.shape).uniform_(-self.rpo_alpha, self.rpo_alpha).to(device)
            action_mean = action_mean + z
            probs = Normal(action_mean, action_std)

        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)


if __name__ == "__main__":
    args = tyro.cli(Args)
    args.batch_size = int(args.num_envs * args.num_steps)
    args.minibatch_size = int(args.batch_size // args.num_minibatches)
    args.num_iterations = args.total_timesteps // args.batch_size
    run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{int(time.time())}"
    if args.track:
        import wandb

        wandb.init(
            project=args.wandb_project_name,
            entity=args.wandb_entity,
            sync_tensorboard=True,
            config=vars(args),
            name=run_name,
            monitor_gym=True,
            save_code=True,
        )
    writer = SummaryWriter(f"runs/{run_name}")
    writer.add_text(
        "hyperparameters",
        "|param|value|\n|-|-|\n%s" % ("\n".join([f"|{key}|{value}|" for key, value in vars(args).items()])),
    )

    # TRY NOT TO MODIFY: seeding
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.backends.cudnn.deterministic = args.torch_deterministic

    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")

    # env setup
    envs = gym.vector.SyncVectorEnv(
        [make_env(args.env_id, i, args.capture_video, run_name, args.gamma) for i in range(args.num_envs)]
    )
    assert isinstance(envs.single_action_space, gym.spaces.Box), "only continuous action space is supported"

    agent = Agent(envs, args.rpo_alpha).to(device)
    optimizer = optim.Adam(agent.parameters(), lr=args.learning_rate, eps=1e-5)

    # ALGO Logic: Storage setup
    obs = torch.zeros((args.num_steps, args.num_envs) + envs.single_observation_space.shape).to(device)
    actions = torch.zeros((args.num_steps, args.num_envs) + envs.single_action_space.shape).to(device)
    logprobs = torch.zeros((args.num_steps, args.num_envs)).to(device)
    rewards = torch.zeros((args.num_steps, args.num_envs)).to(device)
    dones = torch.zeros((args.num_steps, args.num_envs)).to(device)
    values = torch.zeros((args.num_steps, args.num_envs)).to(device)

    # TRY NOT TO MODIFY: start the game
    global_step = 0
    start_time = time.time()
    next_obs, _ = envs.reset(seed=args.seed)
    next_obs = torch.Tensor(next_obs).to(device)
    next_done = torch.zeros(args.num_envs).to(device)
    num_updates = args.total_timesteps // args.batch_size

    for update in range(1, num_updates + 1):
        # Annealing the rate if instructed to do so.
        if args.anneal_lr:
            frac = 1.0 - (update - 1.0) / num_updates
            lrnow = frac * args.learning_rate
            optimizer.param_groups[0]["lr"] = lrnow

        for step in range(0, args.num_steps):
            global_step += 1 * args.num_envs
            obs[step] = next_obs
            dones[step] = next_done

            # ALGO LOGIC: action logic
            with torch.no_grad():
                action, logprob, _, value = agent.get_action_and_value(next_obs)
                values[step] = value.flatten()
            actions[step] = action
            logprobs[step] = logprob

            # TRY NOT TO MODIFY: execute the game and log data.
            next_obs, reward, terminations, truncations, infos = envs.step(action.cpu().numpy())
            done = np.logical_or(terminations, truncations)
            rewards[step] = torch.tensor(reward).to(device).view(-1)
            next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)

            if "final_info" in infos:
                for info in infos["final_info"]:
                    if info and "episode" in info:
                        print(f"global_step={global_step}, episodic_return={info['episode']['r']}")
                        writer.add_scalar("charts/episodic_return", info["episode"]["r"], global_step)
                        writer.add_scalar("charts/episodic_length", info["episode"]["l"], global_step)

        # bootstrap value if not done
        with torch.no_grad():
            next_value = agent.get_value(next_obs).reshape(1, -1)
            advantages = torch.zeros_like(rewards).to(device)
            lastgaelam = 0
            for t in reversed(range(args.num_steps)):
                if t == args.num_steps - 1:
                    nextnonterminal = 1.0 - next_done
                    nextvalues = next_value
                else:
                    nextnonterminal = 1.0 - dones[t + 1]
                    nextvalues = values[t + 1]
                delta = rewards[t] + args.gamma * nextvalues * nextnonterminal - values[t]
                advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
            returns = advantages + values

        # flatten the batch
        b_obs = obs.reshape((-1,) + envs.single_observation_space.shape)
        b_logprobs = logprobs.reshape(-1)
        b_actions = actions.reshape((-1,) + envs.single_action_space.shape)
        b_advantages = advantages.reshape(-1)
        b_returns = returns.reshape(-1)
        b_values = values.reshape(-1)

        # Optimizing the policy and value network
        b_inds = np.arange(args.batch_size)
        clipfracs = []
        for epoch in range(args.update_epochs):
            np.random.shuffle(b_inds)
            for start in range(0, args.batch_size, args.minibatch_size):
                end = start + args.minibatch_size
                mb_inds = b_inds[start:end]

                _, newlogprob, entropy, newvalue = agent.get_action_and_value(b_obs[mb_inds], b_actions[mb_inds])
                logratio = newlogprob - b_logprobs[mb_inds]
                ratio = logratio.exp()

                with torch.no_grad():
                    # calculate approx_kl http://joschu.net/blog/kl-approx.html
                    old_approx_kl = (-logratio).mean()
                    approx_kl = ((ratio - 1) - logratio).mean()
                    clipfracs += [((ratio - 1.0).abs() > args.clip_coef).float().mean().item()]

                mb_advantages = b_advantages[mb_inds]
                if args.norm_adv:
                    mb_advantages = (mb_advantages - mb_advantages.mean()) / (mb_advantages.std() + 1e-8)

                # Policy loss
                pg_loss1 = -mb_advantages * ratio
                pg_loss2 = -mb_advantages * torch.clamp(ratio, 1 - args.clip_coef, 1 + args.clip_coef)
                pg_loss = torch.max(pg_loss1, pg_loss2).mean()

                # Value loss
                newvalue = newvalue.view(-1)
                if args.clip_vloss:
                    v_loss_unclipped = (newvalue - b_returns[mb_inds]) ** 2
                    v_clipped = b_values[mb_inds] + torch.clamp(
                        newvalue - b_values[mb_inds],
                        -args.clip_coef,
                        args.clip_coef,
                    )
                    v_loss_clipped = (v_clipped - b_returns[mb_inds]) ** 2
                    v_loss_max = torch.max(v_loss_unclipped, v_loss_clipped)
                    v_loss = 0.5 * v_loss_max.mean()
                else:
                    v_loss = 0.5 * ((newvalue - b_returns[mb_inds]) ** 2).mean()

                entropy_loss = entropy.mean()
                loss = pg_loss - args.ent_coef * entropy_loss + v_loss * args.vf_coef

                optimizer.zero_grad()
                loss.backward()
                nn.utils.clip_grad_norm_(agent.parameters(), args.max_grad_norm)
                optimizer.step()

            if args.target_kl is not None:
                if approx_kl > args.target_kl:
                    break

        y_pred, y_true = b_values.cpu().numpy(), b_returns.cpu().numpy()
        var_y = np.var(y_true)
        explained_var = np.nan if var_y == 0 else 1 - np.var(y_true - y_pred) / var_y

        # TRY NOT TO MODIFY: record rewards for plotting purposes
        writer.add_scalar("charts/learning_rate", optimizer.param_groups[0]["lr"], global_step)
        writer.add_scalar("losses/value_loss", v_loss.item(), global_step)
        writer.add_scalar("losses/policy_loss", pg_loss.item(), global_step)
        writer.add_scalar("losses/entropy", entropy_loss.item(), global_step)
        writer.add_scalar("losses/old_approx_kl", old_approx_kl.item(), global_step)
        writer.add_scalar("losses/approx_kl", approx_kl.item(), global_step)
        writer.add_scalar("losses/clipfrac", np.mean(clipfracs), global_step)
        writer.add_scalar("losses/explained_variance", explained_var, global_step)
        print("SPS:", int(global_step / (time.time() - start_time)))
        writer.add_scalar("charts/SPS", int(global_step / (time.time() - start_time)), global_step)

    envs.close()
    writer.close()
