# The Proximal Policy Optimization (PPO) Algorithm, from Scratch

### Objective of this Notebook:
This notebook aims to build an understanding of the Proximal Policy Optimization (PPO) algorithm. Its objectives are as follows:

1. **Understanding of Proximal Policy Optimization**: Introducing PPO and its position in the RL landscape.
2. **PPO algorithm explanation**: Break down the PPO algorithm, explaining its various components, including the clipping mechanism, advantage estimation, and policy updating.
3. **Code Implementation in Modified Environment**: A step-by-step guide to implementing PPO in the modified Gym Pendulum environment. This includes adaptations like converting the continuous action space into a discrete one and scaling variables.

## A Brief Overview of Proximal Policy Optimization
In reinforcement learning, PPO has emerged as a powerful algorithm that balances sample efficiency against ease of implementation. Unlike vanilla policy gradient methods, where an overly large update can destabilize training, PPO aims to take the largest policy update step it can without incurring such stability issues.

The core idea of PPO is to optimize the policy in a way that doesn't change it too drastically from one iteration to the next. This "proximal" constraint helps to stabilize training, and it is typically achieved through a specialized clipping mechanism.

## PPO Pseudocode

1. Input: initial policy parameters $ \theta_0 $, initial value function parameters $ \phi_0 $
2. **for** $ k = 0, 1, 2, ... $ **do**
3. $\quad$ Collect set of trajectories $ \text{D}_k = \{\tau_i\} $ by running policy $ \pi_k = \pi (\theta_k) $ in the environment.
4. $\quad$ Compute rewards-to-go $ \hat{R}_t $.
5. $\quad$ Compute advantage estimates, $\hat{A}$ (using any method of advantage estimation) based on the current value function $ V_{\phi_k} $.
6. $\quad$ Update the policy by maximizing the PPO-Clip objective:

$$
\theta_{k+1} = \arg \max_\theta \dfrac{1}{|\text{D}_k|T} \sum_{\tau \in \text{D}_k} \sum_{t=0}^{T} \min \left(\dfrac{\pi_\theta (a_t | s_t)}{\pi_{\theta_k} (a_t | s_t)} A ^ {\pi_{\theta_k}} (s_t, a_t), \: \text{clip}\left(\dfrac{\pi_\theta (a_t | s_t)}{\pi_{\theta_k} (a_t | s_t)}, 1 - \epsilon, 1 + \epsilon \right) A ^ {\pi_{\theta_k}}(s_t, a_t) \right)
$$

$\quad$ typically via stochastic gradient ascent with Adam.

7. Fit value function by regression on MSE:

$$
\phi_{k + 1} = \arg \min_\phi \dfrac{1}{|\text{D}_k|T} \sum_{\tau \in \text{D}_k} \sum_{t=0}^{T} \left( V_\phi (s_t) - \hat{R}_t\right)^2 ,
$$

$\quad$ typically via some gradient descent algorithm.

8. **end for**

## Our Environment
We're going to use the Gym Pendulum environment, with a few simple modifications:
1. We're going to convert the continuous action space to a discrete one.
2. We will scale the angular velocity variable so it's of comparable magnitude to the angle components (cosine and sine) of the observation.
3. We scale the reward by a factor of 1/10, as this will likely be easier on our neural networks.

In [1]:
import gym
import numpy as np

class DiscreteActionWrapper(gym.ActionWrapper):
    "Bin continuous actions into discrete intervals."
    def __init__(self, env, n_actions=5):
        super().__init__(env)
        self.n_actions = n_actions
        self.action_space = gym.spaces.Discrete(n_actions)

    def action(self, action):
        if action == 0:
            return np.array([-2])
        elif action == 1:
            return np.array([-1])
        elif action == 2:
            return np.array([0])
        elif action == 3:
            return np.array([1])
        elif action == 4:
            return np.array([2])
    
class ObservationWrapper(gym.ObservationWrapper):
    "Scale the third value of the observation by 1/8."
    def __init__(self, env):
        super().__init__(env)

    def observation(self, observation):
        observation[2] *= 1/8.0
        return observation
    
class RewardScalingWrapper(gym.RewardWrapper):
    "Scale the reward by the given factor."
    def __init__(self, env, scaling_factor=1/10.0):
        super().__init__(env)
        self.scaling_factor = scaling_factor

    def reward(self, reward):
        return reward * self.scaling_factor
    
env = gym.make('Pendulum-v1')
env = RewardScalingWrapper(DiscreteActionWrapper(ObservationWrapper(env)))
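
The `if`/`elif` chain in `DiscreteActionWrapper` hard-codes five torque values. As a minimal sketch, assuming evenly spaced torques are wanted for any number of actions, the same mapping could be generated as a lookup table. The helper name `torque_bins` and its `max_torque` parameter are hypothetical and not part of the notebook; `max_torque=2.0` matches the range used above.

```python
def torque_bins(n_actions, max_torque=2.0):
    """Return n_actions evenly spaced torques in [-max_torque, max_torque].

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if n_actions < 2:
        raise ValueError("need at least two actions to span the torque range")
    # Distance between neighbouring torque values
    step = 2 * max_torque / (n_actions - 1)
    return [-max_torque + i * step for i in range(n_actions)]


# With five actions this reproduces the hard-coded mapping in the wrapper
bins = torque_bins(5)
print(bins)
```

With such a table, the wrapper's `action()` method could look up `bins[action]` instead of branching, which would also avoid silently returning `None` for an out-of-range action.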

### Step 1: Initialize Policy and Value Function Parameters

In our approach, we deploy a conventional Actor-Critic architecture for both the policy (actor) and value function (critic).

- **Policy** (Actor): The actor accepts a state tensor as input and produces logits. These logits correspond to the relative likelihoods of the actions to be chosen. Essentially, the policy dictates how our agent decides to act given the current state.

- **Value Function** (Critic): The critic's role is to estimate the value of the given state. By accepting a state tensor, the critic provides a quantifiable measure of how "good" the state is, guiding the agent's decisions.

- **Old Policy** for Probability Ratio Calculation: We also maintain an "old" policy that represents the policy from before the most recent update. This old policy plays a vital role in calculating the probability ratio used in the PPO algorithm. Initially, the current policy and the old policy are identical, since there is no prior policy to reference.

In [2]:
import torch
from torch import nn
import torch.nn.functional as F

hidden_size = 64

class Policy(nn.Module):
    def __init__(self, input_size, n_actions):
        super().__init__()
        self.dense_1 = nn.Linear(input_size, hidden_size)
        self.dense_2 = nn.Linear(hidden_size, hidden_size//2)
        self.dense_3 = nn.Linear(hidden_size//2, n_actions)
        
    def forward(self, x):
        x = F.relu(self.dense_1(x))
        x = F.relu(self.dense_2(x))
        x = self.dense_3(x)
        return x
    
class ValueFunction(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.dense_1 = nn.Linear(input_size, hidden_size)
        self.dense_2 = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        x = F.relu(self.dense_1(x))
        x = self.dense_2(x)
        return x
    
n_actions = env.action_space.n
input_size = env.observation_space.shape[0]

policy = Policy(input_size, n_actions)
value_function = ValueFunction(input_size)

old_policy = Policy(input_size, n_actions)
old_policy.load_state_dict(policy.state_dict())

<All keys matched successfully>

### Step 2: Main Training Loop
In this step, we set up the core of our training process:

- **Initialize and Reset Environment**: Start by resetting the environment and storing the initial observations.
- **Define Training Parameters**: Set up essential training parameters like learning rate, epochs, batch size, etc.
- **Initialize Optimizers**: Get the optimizers ready for updating the policy and value function during training.

This phase ensures we have everything in place to run the training loop for PPO.

In [3]:
device = 'cpu'
training_steps = 10000
batch_size = 64
gamma = 0.99
epsilon = 0.2

learning_rate_policy = 0.0003
learning_rate_value = 0.001

optimizer_policy = torch.optim.Adam(policy.parameters(), lr=learning_rate_policy)
optimizer_value_function = torch.optim.Adam(value_function.parameters(), lr=learning_rate_value)

transitions = []

state, info = env.reset()

### Step 3: Collect Trajectories

In this step, we gather the trajectories that will guide the training of our PPO model:

- **Environment Handling**: Though often done with multiple environments in parallel, we'll collect trajectories using only a single environment. The process remains consistent in both cases.
- **Collecting Transition Tuples**: We'll gather tuples of `(state_0, action_0, reward, state_1, done)` until we reach a set number of transitions, typically controlled by a `batch_size` hyperparameter. For this code, we'll use a batch size of 64.
- **State to Tensor Conversion**: Convert the current state to a Torch tensor and feed it through the policy network to obtain logits.
- **Action Sampling**: Convert the logits to a probability distribution and sample an action, which is then sent to the environment to obtain the next state, reward, and done flag.
- **Transition Storage**: Append the transition to a list. If the list reaches the batch size, move to the next step for learning. Otherwise, repeat this step to collect more transitions.

By following this process, we build the foundational data needed for the PPO algorithm to learn.
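
To make the logits-to-action step concrete, here is a small standalone sketch in plain Python. It mirrors what a categorical distribution over logits does conceptually: a softmax turns the logits into probabilities, and an action index is drawn according to them. The function names and the example logits are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the max logit avoids overflow and leaves the result unchanged
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Draw an action index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Illustrative logits for the five discrete torques (made-up numbers)
example_logits = [0.0, 0.0, 1.0, 0.0, 0.0]
example_probs = softmax(example_logits)
example_action = sample_action(example_logits, random.Random(0))
print(example_probs, example_action)
```

Sampling, rather than always taking the most likely action, is what gives the agent its exploration during data collection.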

In [4]:
for i in range(training_steps):
    # Convert state to a tensor
    state_tensor = torch.tensor(state, dtype=torch.float, device=device)

    # Pass state tensor through policy network
    logits = policy(state_tensor)

    # Convert logits to probability distribution
    dist = torch.distributions.Categorical(logits=logits)

    # Sample action from probability distribution
    action = dist.sample().item()

    # Execute action in environment
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    transition = (state, action, reward, next_state, done)
    transitions.append(transition)
    
    # Reset environment if done
    if done:
        state, info = env.reset()
    else:
        state = next_state

    if len(transitions) < batch_size:
        continue
        
    # In the actual code, there's no break here.
    # We just need it because we can't write a for loop over several cells.
    else:
        break

### Step 4: Compute Rewards-to-Go
Computing the empirical return (rewards-to-go) is achieved through the following equation:

$$
\hat{R}_t = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + ... + \gamma^n \cdot V_\phi(s_{t + n})
$$

Rewards-to-go provide an empirical estimate for $Q(s_t, a_t)$, which we use to compute the advantage later on.

- **Utilizing Future Rewards**: We use the rewards actually observed at future time steps, $ r_{t+i} $, and employ the value function, $ V_\phi(s_{t + n}) $, to approximate the remaining discounted reward until the end of the episode when no more reward information is available.
- **Elegance of Approach**: This method requires computing the value of only the final state in the collected batch, making it an elegant solution.
- **Iterating Backwards**: The calculation is performed by iterating backwards over the rewards, which simplifies the programming because each reward-to-go reuses the one computed for the following step.
- **Bootstrapping with Value Function**: Bootstrapping from the value function allows training on incomplete trajectories. It introduces some bias, since the value estimate may be wrong, in exchange for lower variance than a full Monte Carlo return.

This step combines the observed rewards with value function approximations to build more reliable targets for our PPO training. It represents a balance between empirical information and learned estimates.
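
As a quick sanity check of the backward recursion, here is a standalone sketch in plain Python with made-up numbers. It follows the same logic as the notebook's implementation, except that it appends and reverses once at the end instead of inserting at the front of the list; the function name and arguments are chosen for illustration.

```python
def rewards_to_go(rewards, dones, bootstrap_value, gamma):
    """Discounted returns for a batch of consecutive transitions.

    bootstrap_value stands in for V(s_{t+n}), the value of the state that
    follows the last transition. It is discarded if that transition ended
    an episode.
    """
    returns = []
    running = bootstrap_value
    for r, done in zip(reversed(rewards), reversed(dones)):
        # An episode boundary cuts off all reward from later steps
        if done:
            running = 0.0
        running = r + gamma * running
        returns.append(running)
    # We walked the batch backwards, so flip the result into time order
    returns.reverse()
    return returns


# No episode boundary: the bootstrap value flows back through every step
print(rewards_to_go([1.0, 1.0, 1.0], [False, False, False], 4.0, 0.5))  # [2.25, 2.5, 3.0]

# Episode ends at the second step: the first two returns ignore what follows
print(rewards_to_go([1.0, 1.0, 1.0], [False, True, False], 4.0, 0.5))  # [1.5, 1.0, 3.0]
```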

In [5]:
states, actions, rewards, next_states, dones = zip(*transitions)

# If the final step didn't cause the environment to terminate or timeout
if not dones[-1]:
    # Calculate the value of the final state
    final_state_tensor = torch.tensor(next_states[-1], dtype=torch.float, device=device)
    final_state_value = value_function(final_state_tensor).item()
else:
    # If it did finish, there is no future discounted reward, so it's just 0
    final_state_value = 0

# Initialize rewards to go list
rewards_to_go = []

# R represents all the future rewards
# In the above equation, anything being multiplied by gamma is a part of R
R = final_state_value

# Iterate backwards
for r, done in zip(reversed(rewards), reversed(dones)):
    # If the environment is done, there is no future reward
    if done:
        R = 0
    # Add the immediate reward and discount the future rewards
    R = r + gamma * R

    # Add the calculated reward-to-go to the list at the start as we're iterating backwards
    rewards_to_go.insert(0, R)

### Step 5: Compute Advantage Estimates
In RL, the concept of advantage serves as a measure of how much better a particular action is compared to the average action in a given state. 

The mathematical formulation of advantage, $ A(s,a) $, is expressed as the difference between the Q-value of a state-action pair, $ Q(s,a) $, and the value of a state, $ V(s) $.

$$
A(s, a) = Q(s, a) - V(s)
$$

To calculate the advantage, we need to break down the components involved:
- **Q-value**: The Q-value represents the expected return of taking action $a$ in state $s$ and following the current policy thereafter. Using the Bellman equation, it can be expressed as:

$$
Q(s,a) = r + \gamma V(s')
$$

Here, $r$ is the immediate reward, and $\gamma \cdot V(s')$ is the expected future discounted reward from the next state $s'$.

- **Rewards-to-go**: To get a more accurate estimate of $Q(s,a)$, we can unravel the Bellman equation several steps into the future:

$$
Q(s_0,a_0) = r_0 + \gamma r_1 + \gamma^2 r_2 + ... + \gamma^n V(s_n)
$$

This series sums the rewards at each step, discounted by $\gamma$, plus the value of the final state in the trajectory, discounted appropriately. This sum is the rewards-to-go.

- **Advantage Calculation**: Finally, by combining the Q-value with the state value, we arrive at our expression for the advantage:

$$
A(s,a) = \text{reward-to-go} - V_\phi(s)
$$

Here, $V_\phi(s)$ is the value function's approximation of the state value, parameterized by $\phi$. The value function is trained to predict this quantity, and it is reused here to approximate $V(s)$ for all states.
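
As a sketch of where the rewards-to-go expression comes from, the one-step relation can be substituted into itself repeatedly. Each substitution replaces an expectation with the single transition that was actually observed, so every line after the first is an approximation:

$$
\begin{aligned}
Q(s_0, a_0) &= r_0 + \gamma V(s_1) \\
&\approx r_0 + \gamma \left( r_1 + \gamma V(s_2) \right) \\
&\approx r_0 + \gamma r_1 + \gamma^2 \left( r_2 + \gamma V(s_3) \right) \\
&\;\;\vdots \\
&\approx r_0 + \gamma r_1 + \gamma^2 r_2 + ... + \gamma^n V(s_n)
\end{aligned}
$$

Subtracting $V_\phi(s_0)$ from the last line gives the advantage estimate used in this step.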

In [6]:
# Convert states to a tensor
states_tensor = torch.tensor(np.array(states), dtype=torch.float, device=device)

# Convert rewards-to-go to a tensor
rewards_tensor = torch.tensor(np.array(rewards_to_go), dtype=torch.float, device=device)

# Get state values with value function
values = value_function(states_tensor).squeeze(-1)

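# Calculate advantages (detached so the policy loss treats them as constants
# and no gradient flows back into the value function)
advantages = rewards_tensor - values.detach()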

### Step 6: Update the Policy by Maximizing the PPO-Clip Objective

$$
\theta_{k+1} = \arg \max_\theta \dfrac{1}{|\text{D}_k|T} \sum_{\tau \in \text{D}_k} \sum_{t=0}^{T} \min \left(\dfrac{\pi_\theta (a_t | s_t)}{\pi_{\theta_k} (a_t | s_t)} A ^ {\pi_{\theta_k}} (s_t, a_t), \; \text{clip}\left(\dfrac{\pi_\theta (a_t | s_t)}{\pi_{\theta_k} (a_t | s_t)}, 1 - \epsilon, 1 + \epsilon \right) A ^ {\pi_{\theta_k}}(s_t, a_t) \right)
$$

$\quad$ typically via stochastic gradient ascent with Adam.

Here's a breakdown of each component of the formula:

- **The policy ratio**, $ r_t(\theta) $: This is the ratio of the probability of taking action $a_t$ in state $s_t$ under the new policy parameters $\theta$ to its probability under the old policy parameters $\theta_k$.

    We keep track of the old policy so we can calculate this value. A value greater than 1 indicates the new policy selects the action more often than the old one; a value less than 1 indicates it selects the action less often.

    It is defined as:

$$
\dfrac{\pi_\theta (a_t | s_t)}{\pi_{\theta_k} (a_t | s_t)}
$$


- **Clipped policy ratio**: To ensure stability, the policy ratio is clipped within the range $[1 - \epsilon, \; 1+\epsilon]$. This prevents the policy from changing too dramatically in a single step.

$$
\text{clip}\left(r_t(\theta), \; 1 - \epsilon, 1 + \epsilon \right)
$$

- **Advantage function** $ A ^ {\pi_{\theta_k}} (s_t, a_t) $: This represents how much better taking a specific action is compared to the average over all possible actions in the given state, according to the old policy.

- **Objective function**: The whole expression encapsulates the PPO-clip objective that we want to maximize. It is designed to favor actions that have a positive advantage, while constraining the policy from changing too drastically.

$$ 
\min \left(r_t(\theta) A ^ {\pi_{\theta_k}} (s_t, a_t), \; \text{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) A ^ {\pi_{\theta_k}}(s_t, a_t) \right)
$$

Let's break down this objective into two possible scenarios: when the advantage is positive and when it is negative.

- Positive Advantage: When the advantage is positive, it means taking the action is better than the average, and we want to increase the probability of selecting it. 

Suppose our policy ratio $ r_t(\theta) = 1.2 $ (indicating the action is favored more by the new policy) and the advantage $ A ^ {\pi_{\theta_k}} (s_t, a_t) = 0.5 $, with $ \epsilon = 0.1 $:

- LHS (Left-Hand Side): $ 1.2 \times 0.5 = 0.6 $
- RHS (Right-Hand Side): $ \text{clip}(1.2, 1 - 0.1, 1 + 0.1) \times 0.5 = 1.1 \times 0.5 = 0.55 $

In this case, the minimum value is taken, so the final result would be $ 0.55 $.

- Negative Advantage: When the advantage is negative, it means the action is worse than the average, and we want to decrease the probability of selecting it.

Suppose our policy ratio $ r_t(\theta) = 0.8 $ (indicating the action is less favored by the new policy) and the advantage $ A ^ {\pi_{\theta_k}} (s_t, a_t) = -0.5 $, with $ \epsilon = 0.1 $:

- LHS: $ 0.8 \times (-0.5) = -0.4 $
- RHS: $ \text{clip}(0.8, 1 - 0.1, 1 + 0.1) \times (-0.5) = 0.9 \times (-0.5) = -0.45 $

Again, the minimum value is taken, so the final result would be $ -0.45 $.

These two examples demonstrate how the objective function in PPO simultaneously attempts to increase the probability of actions that are better than the average and decrease the probability of actions that are worse than the average. The clipping mechanism ensures that these updates are controlled, promoting stability in the training process.
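
The two worked examples can be checked with a few lines of plain Python. This is a minimal per-sample sketch of the clipped objective, with a function name chosen for illustration. It also evaluates the two complementary cases, where the ratio has moved against the sign of the advantage.

```python
def ppo_clip_objective(ratio, advantage, epsilon):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# (ratio, advantage) pairs: the two examples above, then the two mirrored cases
cases = [(1.2, 0.5), (0.8, -0.5), (0.8, 0.5), (1.2, -0.5)]
results = [ppo_clip_objective(r, a, epsilon=0.1) for r, a in cases]

for (ratio, advantage), value in zip(cases, results):
    print(f"ratio={ratio}, advantage={advantage:+}: objective={value:.2f}")
```

The first two cases reproduce $0.55$ and $-0.45$. In the last two cases the unclipped term is the smaller one ($0.40$ and $-0.60$), so the clip has no effect: the objective only limits how much the policy can gain by moving further in the direction the advantage favors.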

- **Averaging Over Trajectories and Time Steps**: The two summations and the division by $|\text{D}_k|T$ average the objective over all the trajectories in the dataset $\text{D}_k$ and all the time steps up to $T$. This makes the objective representative of the policy's performance over the entire dataset of trajectories. In practice, this is simply the mean of the objective over all transitions in the batch.

- **Policy update**: The final step is to actually update the policy parameters from $\theta_k$ to $\theta_{k+1}$ by maximizing the objective. Because optimizers minimize a loss, we run stochastic gradient descent (typically Adam) on the negative of the objective.

In [7]:
# First, calculate policy ratio
# Convert actions to tensor and unsqueeze for upcoming gather
actions_tensor = torch.tensor(actions, dtype=torch.long, device=device).unsqueeze(-1)

# For current policy
logits_current = policy(states_tensor)
probs_current = F.softmax(logits_current, dim=-1)
action_probs_current = probs_current.gather(dim=1, index=actions_tensor)

# For old policy (use no_grad() as we never update the old policy with backprop)
with torch.no_grad():
    logits_old = old_policy(states_tensor)
    probs_old = F.softmax(logits_old, dim=-1)
    action_probs_old = probs_old.gather(dim=1, index=actions_tensor)

# Calculate the policy ratio
policy_ratio = (action_probs_current/action_probs_old).squeeze(-1)

# Next, calculate clipped policy ratio
# This is a very simple calculation as we've already calculated the policy ratio
clipped_policy_ratio = torch.clip(policy_ratio, min=1-epsilon, max=1+epsilon)

# Next, we calculate the objective function
objective = torch.min(policy_ratio * advantages, clipped_policy_ratio * advantages)

# We then average this across all transitions and episodes using mean to get the policy loss
# We use negative as we want to maximize it and loss metrics are to be minimized
loss_policy = -torch.mean(objective)

# First, set our old policy to our current policy before we update it
old_policy.load_state_dict(policy.state_dict())

# Backward pass for policy
optimizer_policy.zero_grad()
loss_policy.backward()
optimizer_policy.step()
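
The cell above divides action probabilities directly. A common alternative, shown here only as a sketch and not as the notebook's method, is to compute the ratio from log-probabilities and exponentiate their difference, which avoids dividing by very small numbers. The example is plain Python with made-up logits; in PyTorch the log-probabilities could instead come from `torch.distributions.Categorical(logits=...).log_prob(...)`.

```python
import math


def log_softmax(logits):
    """Log-probabilities for a list of logits, computed stably."""
    largest = max(logits)
    log_total = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return [x - log_total for x in logits]


def policy_ratio(logits_new, logits_old, action):
    """pi_new(action) / pi_old(action), computed as exp(log pi_new - log pi_old)."""
    log_new = log_softmax(logits_new)[action]
    log_old = log_softmax(logits_old)[action]
    return math.exp(log_new - log_old)


# Made-up logits: the new policy has raised the logit of action 2
logits_old = [0.0, 0.0, 0.0, 0.0, 0.0]
logits_new = [0.0, 0.0, 0.5, 0.0, 0.0]

ratio_favored = policy_ratio(logits_new, logits_old, action=2)
ratio_other = policy_ratio(logits_new, logits_old, action=0)
print(ratio_favored, ratio_other)
```

The ratio for the favored action comes out above 1 and the ratio for any other action below 1, matching the interpretation of $r_t(\theta)$ given earlier.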

### Step 7: Fit Value Function by Regression on MSE

$$
\phi_{k + 1} = \arg \min_\phi \dfrac{1}{|\text{D}_k|T} \sum_{\tau \in \text{D}_k} \sum_{t=0}^{T} \left( V_\phi (s_t) - \hat{R}_t\right)^2 ,
$$

$\quad$ typically via some gradient descent algorithm.

The goal of this process is to make the value function approximate the expected returns under the current policy. In other words, we want to find a function $V_\phi(s)$ that best predicts what the actual return will be if we follow our current policy from state $s$.

The equation above says that we're looking to minimize the mean squared error (MSE) between the predicted state value using our value function $V_\phi(s_t)$ and the empirically observed rewards (previously calculated rewards-to-go).

We previously used our rewards-to-go to estimate the Q-value $Q(s,a)$ in our advantage calculation $A(s,a) = Q(s,a) - V(s)$, so why are we now using them as a target for the value function?

The rewards-to-go are an empirical estimate of the expected return if action $a$ is taken in state $s$, and the current policy is followed thereafter. They essentially serve as an estimate for $Q(s,a)$ for the current policy.

Because the actions in our batch were themselves sampled from the policy, averaging these estimates over the actions taken in a state gives the expected return from that state, which is exactly $V(s)$. The rewards-to-go hence guide our value function towards more accurately approximating the value of a state under our current policy.

By linking the value function to empirical returns, this step ensures that the value function stays in line with the policy's actual performance in the environment.
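
To make the regression target concrete, here is a small plain-Python sketch of the mean squared error between predicted state values and rewards-to-go. The numbers are made up and the helper name `mse` is chosen for illustration; the notebook itself relies on `F.mse_loss` for this.

```python
def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Made-up value predictions V(s_t) and rewards-to-go targets for three states
predicted_values = [2.0, 2.0, 2.0]
rewards_to_go_targets = [2.25, 2.5, 3.0]

loss = mse(predicted_values, rewards_to_go_targets)
print(loss)  # 0.4375
```

Gradient descent on this loss pulls each prediction towards its reward-to-go, with the largest pull on the state whose prediction is furthest off.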


In [8]:
# Predict values for current state
values = value_function(states_tensor).squeeze(-1)

loss_value_function = F.mse_loss(values, rewards_tensor)

# Backward pass for value function
optimizer_value_function.zero_grad()
loss_value_function.backward()
optimizer_value_function.step()

### Done
And that's it! We've gone through one full pass of the PPO algorithm, detailing each step and understanding the mechanics that make this algorithm effective.

## Full Code
Now, we will bring all these pieces together to create a complete code implementation for PPO in our modified Gym Pendulum environment. Below is the full code that combines all the steps, including environment setup, hyperparameter initialization, and the training loop.

In [9]:
import numpy as np
import gym
import torch
from torch import nn
import torch.nn.functional as F


# Environment
class DiscreteActionWrapper(gym.ActionWrapper):
    "Bin continuous actions into discrete intervals."
    def __init__(self, env, n_actions=5):
        super().__init__(env)
        self.n_actions = n_actions
        self.action_space = gym.spaces.Discrete(n_actions)

    def action(self, action):
        if action == 0:
            return np.array([-2])
        elif action == 1:
            return np.array([-1])
        elif action == 2:
            return np.array([0])
        elif action == 3:
            return np.array([1])
        elif action == 4:
            return np.array([2])
    
class ObservationWrapper(gym.ObservationWrapper):
    "Scale the third value of the observation by 1/8."
    def __init__(self, env):
        super().__init__(env)

    def observation(self, observation):
        observation[2] *= 1/8.0
        return observation
    
class RewardScalingWrapper(gym.RewardWrapper):
    "Scale the reward by the given factor."
    def __init__(self, env, scaling_factor=1/10.0):
        super().__init__(env)
        self.scaling_factor = scaling_factor

    def reward(self, reward):
        return reward * self.scaling_factor
    
# Networks
class Policy(nn.Module):
    def __init__(self, input_size, n_actions):
        super().__init__()
        self.dense_1 = nn.Linear(input_size, hidden_size)
        self.dense_2 = nn.Linear(hidden_size, hidden_size//2)
        self.dense_3 = nn.Linear(hidden_size//2, n_actions)
        
    def forward(self, x):
        x = F.relu(self.dense_1(x))
        x = F.relu(self.dense_2(x))
        x = self.dense_3(x)
        return x
    
class ValueFunction(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.dense_1 = nn.Linear(input_size, hidden_size)
        self.dense_2 = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        x = F.relu(self.dense_1(x))
        x = self.dense_2(x)
        return x
    
    
# Make an environment
env = gym.make('Pendulum-v1')
env = RewardScalingWrapper(DiscreteActionWrapper(ObservationWrapper(env)))

# 1. Initialize networks and old policy
hidden_size = 64
n_actions = env.action_space.n
input_size = env.observation_space.shape[0]

policy = Policy(input_size, n_actions)
value_function = ValueFunction(input_size)

old_policy = Policy(input_size, n_actions)
old_policy.load_state_dict(policy.state_dict())

# Define hyperparameters
device = 'cpu'
training_steps = 2000000
batch_size = 64
gamma = 0.99
epsilon = 0.2

learning_rate_policy = 0.0003
learning_rate_value = 0.001

optimizer_policy = torch.optim.Adam(policy.parameters(), lr=learning_rate_policy)
optimizer_value_function = torch.optim.Adam(value_function.parameters(), lr=learning_rate_value)

transitions = []

# Reset environment
state, info = env.reset()

# 2. Main training loop
for i in range(training_steps):
    # 3. Gather trajectories
    # Convert state to a tensor
    state_tensor = torch.tensor(state, dtype=torch.float, device=device)

    # Pass state tensor through policy network
    logits = policy(state_tensor)

    # Convert logits to probability distribution
    dist = torch.distributions.Categorical(logits=logits)

    # Sample action from probability distribution
    action = dist.sample().item()

    # Execute action in environment
    next_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

    transition = (state, action, reward, next_state, done)
    transitions.append(transition)
    
    # Reset environment if done
    if done:
        state, info = env.reset()
    else:
        state = next_state

    if len(transitions) < batch_size:
        continue
        
    # 4. Compute rewards to go
    # Expand transitions
    states, actions, rewards, next_states, dones = zip(*transitions)
    
    # Clear transitions
    transitions = []

    # If the final step didn't cause the environment to terminate or timeout
    if not dones[-1]:
        # Calculate the value of the final state
        final_state_tensor = torch.tensor(next_states[-1], dtype=torch.float, device=device)
        final_state_value = value_function(final_state_tensor).item()
    else:
        # If it did finish, there is no future discounted reward, so it's just 0
        final_state_value = 0

    # Initialize rewards to go list
    rewards_to_go = []

    # R represents all the future rewards
    # In the above equation, anything being multiplied by gamma is a part of R
    R = final_state_value

    # Iterate backwards
    for r, done in zip(reversed(rewards), reversed(dones)):
        # If the environment is done, there is no future reward
        if done:
            R = 0
        # Add the immediate reward and discount the future rewards
        R = r + gamma * R

        # Add the calculated reward-to-go to the list at the start as we're iterating backwards
        rewards_to_go.insert(0, R)
        
    # 5. Advantage estimate
    # Convert states to a tensor
    states_tensor = torch.tensor(np.array(states), dtype=torch.float, device=device)

    # Convert rewards-to-go to a tensor
    rewards_tensor = torch.tensor(np.array(rewards_to_go), dtype=torch.float, device=device)

    # Get state values with value function
    values = value_function(states_tensor).squeeze(-1)

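    # Calculate advantages (detached so the policy loss treats them as constants
    # and no gradient flows back into the value function)
    advantages = rewards_tensor - values.detach()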
    
    # 6. Calculate policy ratio loss and backward pass
    # Convert actions to tensor and unsqueeze for upcoming gather
    actions_tensor = torch.tensor(actions, dtype=torch.long, device=device).unsqueeze(-1)

    # For current policy
    logits_current = policy(states_tensor)
    probs_current = F.softmax(logits_current, dim=-1)
    action_probs_current = probs_current.gather(dim=1, index=actions_tensor)

    # For old policy (use no_grad() as we never update the old policy with backprop)
    with torch.no_grad():
        logits_old = old_policy(states_tensor)
        probs_old = F.softmax(logits_old, dim=-1)
        action_probs_old = probs_old.gather(dim=1, index=actions_tensor)

    # Calculate the policy ratio
    policy_ratio = (action_probs_current/action_probs_old).squeeze(-1)

    # Next, calculate clipped policy ratio
    # This is a very simple calculation as we've already calculated the policy ratio
    clipped_policy_ratio = torch.clip(policy_ratio, min=1-epsilon, max=1+epsilon)

    # Next, we calculate the objective function
    objective = torch.min(policy_ratio * advantages, clipped_policy_ratio * advantages)

    # We then average this across all transitions and episodes using mean to get the policy loss
    # We use negative as we want to maximize it and loss metrics are to be minimized
    loss_policy = -torch.mean(objective)

    # First, set our old policy to our current policy before we update it
    old_policy.load_state_dict(policy.state_dict())

    # Backward pass for policy
    optimizer_policy.zero_grad()
    loss_policy.backward()
    optimizer_policy.step()
    
    # 7. Calculate value function loss and backward pass
    # Predict values for current state
    values = value_function(states_tensor).squeeze(-1)

    loss_value_function = F.mse_loss(values, rewards_tensor)

    # Backward pass for value function
    optimizer_value_function.zero_grad()
    loss_value_function.backward()
    optimizer_value_function.step()

### Visualization
Here's how we can use our learned policy to execute actions in the environment, with an accompanying recording:

In [10]:
# Create env
env = gym.make('Pendulum-v1')
env = RewardScalingWrapper(DiscreteActionWrapper(ObservationWrapper(env)))

# Reset env
state, info = env.reset()

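total_reward = 0

# Run a single episode; no gradients are needed when only acting
with torch.no_grad():
    while True:
        state_tensor = torch.tensor(state, dtype=torch.float, device=device)
        logits = policy(state_tensor)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample().item()

        state, reward, terminated, truncated, _ = env.step(action)

        # Accumulate the (scaled) reward for the episode
        total_reward += reward

        if terminated or truncated:
            break

print(f"Total scaled reward: {total_reward:.2f}")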

![](https://i.imgur.com/ykouf68.gif)

### Wrapping Up
I hope you found this notebook useful. I intend to make notebooks in the future that incorporate modern tweaks into PPO to improve performance.

Do check out my other work if you enjoyed this.