# **Homework 5: Proximal Policy Optimization (PPO)**

## **Objective**
In this assignment, we will implement **Proximal Policy Optimization (PPO)** for the [CartPole-v1](https://gymnasium.farama.org/environments/classic_control/cart_pole/) environment from Gymnasium (the maintained successor to OpenAI Gym). Along the way, we will incorporate several important policy gradient concepts:

- **REINFORCE** (Monte Carlo Policy Gradient)
- **Importance Weighting (IW)** (For Off-Policy Learning)
- **Generalized Advantage Estimation (GAE)** (To Balance Bias and Variance)

*You should run this assignment on Google Colab.*

## **Understanding PPO with Mathematical Formulation**

### **Vanilla Policy Gradient (REINFORCE)**
The **policy gradient** method maximizes the expected cumulative reward by following a sampled gradient estimate. For a single trajectory, the estimate is:

$$
\hat{g} = \sum_{h\geq 0} \nabla_\theta\log \pi_\theta(A_h|S_h) \cdot \sum_{t\geq h}\gamma^{t-h} r(S_t, A_t)
$$

Here, each log-probability gradient is weighted by a **Monte Carlo estimate** of the return from that step onward. These estimates have high variance, which slows convergence.
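
As an illustration of the estimator above, here is a minimal sketch that turns it into a loss for automatic differentiation. The function name `reinforce_loss` and the toy episode are assumptions made for this example and are not part of the assignment code.

```python
import torch
from torch.distributions import Categorical

def reinforce_loss(logits, actions, rewards, gamma=0.99):
    """Loss whose gradient is minus the REINFORCE estimate for one episode."""
    # Discounted reward-to-go G_h = sum_{t >= h} gamma^(t-h) * r_t, computed backward.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(returns[::-1], dtype=torch.float32)
    # log pi_theta(A_h | S_h) for every step of the episode
    log_probs = Categorical(logits=logits).log_prob(actions)
    return -(log_probs * returns).sum()

# Toy episode: 3 steps, 2 actions, uniform policy (all logits are zero).
logits = torch.zeros(3, 2, requires_grad=True)
actions = torch.tensor([0, 1, 0])
loss = reinforce_loss(logits, actions, rewards=[1.0, 1.0, 1.0], gamma=0.5)
loss.backward()  # logits.grad now holds the gradient of the loss
```

Minimizing this loss with gradient descent raises the logit of each action taken, in proportion to the return that followed it.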

---

### **Proximal Policy Optimization (PPO)**
#### **GAE**
To reduce variance, we replace returns with **Generalized Advantage Estimation (GAE)**:

$$
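A_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}
$$

where $T$ is the length of the collected trajectory and $\delta_t$ is the temporal-difference (TD) error: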

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

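The parameter $\lambda \in [0, 1]$ controls the bias-variance trade-off: $\lambda = 0$ gives the one-step TD error (low variance, more bias), while $\lambda = 1$ gives the Monte Carlo return minus the value baseline (less bias, high variance).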
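
To make the formula concrete, the sketch below evaluates the sum literally on toy numbers. It assumes a single trajectory with no episode termination inside it, truncates the sum at the end of the trajectory, and uses a hypothetical helper name `gae_direct`. It is a quadratic-time illustration of the definition, not the backward recursion requested later in the assignment.

```python
def gae_direct(rewards, values, next_value, gamma=0.99, lam=0.95):
    """Evaluate A_t = sum_l (gamma * lam)^l * delta_{t+l} term by term."""
    T = len(rewards)
    v = list(values) + [next_value]  # append the bootstrap value V(s_T)
    # TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [rewards[t] + gamma * v[t + 1] - v[t] for t in range(T)]
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(T - t))
        for t in range(T)
    ]

rewards, values = [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]
# lam = 1: Monte Carlo return minus the value baseline
adv_mc = gae_direct(rewards, values, next_value=0.0, gamma=1.0, lam=1.0)  # [2.5, 1.5, 0.5]
# lam = 0: one-step TD errors only
adv_td = gae_direct(rewards, values, next_value=0.0, gamma=1.0, lam=0.0)  # [1.0, 1.0, 0.5]
```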

#### **Main Update**
PPO refines the policy update by using a **clipped surrogate objective** that restricts how far the new policy can deviate from the old one:

$$
L^{CLIP}(\theta) = \mathbb{E} \left[ \min\left(r_t(\theta) A_t, \ \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right) \right]
$$

with

$$
r_t(\theta) = \frac{\pi_{\theta}(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}
$$

This clipping removes the incentive to move the ratio outside $[1-\epsilon, 1+\epsilon]$, which approximates a **trust region** constraint and improves stability.
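
The effect of the clipping is easiest to see on single numbers. The following sketch evaluates the per-sample objective for a few hand-picked ratios and advantages; the function name `clipped_surrogate` and the inputs are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the objective stops growing once the ratio exceeds 1 + eps.
gain_capped = clipped_surrogate(1.5, 2.0)       # uses 1.2 * 2.0 instead of 1.5 * 2.0
# Negative advantage: no extra credit for pushing the ratio below 1 - eps.
loss_capped = clipped_surrogate(0.5, -2.0)      # uses 0.8 * -2.0 instead of 0.5 * -2.0
# Negative advantage with a large ratio: the min keeps the full penalty.
penalty_kept = clipped_surrogate(1.5, -2.0)     # uses 1.5 * -2.0
```

Because of the outer `min`, the clipped term is used only when it lowers the objective, so the objective is a pessimistic bound on the unclipped one.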


## 1. Setup

First, let's install and import the necessary packages. Run the following in a code cell:

In [None]:
!pip install gymnasium==1.0.0 torch numpy matplotlib seaborn

In [None]:
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="darkgrid")

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

The following helper function that evaluates a given policy is provided.

In [None]:
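def evaluate_policy(policy, env_name, seed=42):
    env_test = gym.make(env_name)
    # To fix the initial state, pass seed=seed to reset(); env.seed() was removed in Gymnasium.
    state, _ = env_test.reset()
    done = False
    total_reward = 0
    while not done:
        state = torch.FloatTensor(state).unsqueeze(0)
        with torch.no_grad():  # no gradients are needed for evaluation
            dist = policy(state)
        # Gymnasium returns separate `terminated` and `truncated` flags
        next_state, reward, terminated, truncated, _ = env_test.step(dist.sample().item())
        done = terminated or truncated  # also stop when the environment's time limit is reached
        state = next_state
        total_reward += reward
    env_test.close()
    return total_reward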

## 2. Defining the Policy and Value Networks

We start with the policy network, which decides which actions to take, and the value network, which estimates the returns. You need to implement both neural networks.
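
The policy network is expected to return a `Categorical` distribution. As a quick reference, the sketch below shows how such a distribution behaves when built from hand-written logits. Passing `logits=` is one option (it applies the softmax internally); the numbers are made up for illustration.

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([[2.0, 0.0]])   # batch of 1 state, 2 actions (unnormalized scores)
dist = Categorical(logits=logits)     # normalizes the logits into probabilities
action = dist.sample()                # tensor of shape [1] holding an action index
log_prob = dist.log_prob(action)      # log-probability of that action, shape [1]
probs = dist.probs                    # shape [1, 2], each row sums to 1
```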

In [None]:
# TODO: Implement the PolicyNetwork class

class PolicyNetwork(nn.Module):

  """
    Implement the Policy Network.

    Your task is to complete the initialization of the Policy Network, which maps states to action probabilities.

    The network should consist of three fully connected layers:

    1. An input layer that takes in state_dim and outputs 128 neurons.
    2. A hidden layer with 128 neurons.
    3. A final output layer that maps to action_dim, producing logits for each possible action.

    Additional requirements:
    - The activation function for the hidden layers should be ReLU (torch.relu).
    - The forward pass should return a Categorical distribution given state inputs.

    Hint: The constructor takes 'state_dim' and 'action_dim' as arguments, representing the dimensions
    of the state space and action space, respectively.
  """
  def __init__(self, state_dim, action_dim):
      super(PolicyNetwork, self).__init__()
      ##### Code implementation here #####
      pass

  def forward(self, x):
      ##### Code implementation here #####
      pass

# TODO: Implement the ValueNetwork class
class ValueNetwork(nn.Module):
  """
    Implement the Value Network.

    Your task is to complete the initialization of the **Value Network**, which maps states to their estimated values.

    Network Architecture:
    - The network consists of **three fully connected layers**:
    1. An **input layer** that takes `state_dim` and outputs **128 neurons**.
    2. A **hidden layer** with **128 neurons**.
    3. A **final output layer** that produces a **single scalar value** representing the state's estimated value.
    - The activation function for the hidden layers should be ReLU (torch.relu).


    Hint: The constructor takes 'state_dim' as an argument, representing the dimension of the state space.
  """
  def __init__(self, state_dim):
      super(ValueNetwork, self).__init__()
      ##### Code implementation here #####
      pass

  def forward(self, x):
      ##### Code implementation here #####
      pass


Below you need to implement GAE. The function should return the advantage estimates for each state-action pair.

Hint: refer to the TD error formula above, and note that the advantage can be computed recursively, starting from the last step and moving backward.

In [None]:
# TODO: Implement the advantage function calculation

"""
Introduction: The function `compute_advantages` calculates **returns** using the **Generalized Advantage Estimation (GAE)** method.
GAE reduces the variance of the advantage estimates at the cost of a small, controllable bias, which helps algorithms like PPO learn more stably.

This function follows a **backward recursion** process:
1. The function receives `rewards`, `masks`, and `values`, which are **lists** representing a **trajectory** (a sequence of states and rewards).
2. The **masks** are used to handle terminal states (they are 0 if the episode ends and 1 otherwise).
3. It calculates **delta**, which is the temporal difference (TD) error.
4. The function accumulates **advantages** using the **recursive formula** for GAE.
5. Finally, it stores the **returns** (advantage + value function) for each step.

# Important Notes:
- Because the recursive computation can be tricky, you may use a simpler **for-loop** calculation if you prefer.
- The function is named `compute_advantages`, but it actually calculates **returns** (advantage + value).
- The use of **masks** ensures that advantage propagation stops at the end of an episode.

As a reference for your understanding, here is how **REINFORCE** computes returns recursively:

  ```python
  def compute_returns(rewards, gamma=0.99):
      returns = []
      G = 0
      for reward in reversed(rewards):
          G = reward + gamma * G
          returns.insert(0, G)
      return returns
  ```
"""

def compute_advantages(next_value, rewards, masks, values, gamma=0.99, lambda_gae=0.95):
    values = values + [next_value]  # Append bootstrap value for last state
    advantages = 0
    returns = []

    for step in reversed(range(len(rewards))):  # Iterate in reverse (backward pass)
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]

        ##### Code implementation here #####
        advantages = None

        returns.insert(0, advantages + values[step])

    return returns

## 3. Implementing the Training Loop

Training in PPO involves collecting data from the environment, computing advantages, and updating the policy and value networks.

1. `ppo_iter()`: This function generates mini-batches from the collected data, which are then used for gradient updates. It is provided to you.

2. `ppo_update()`: This function optimizes the policy and value networks with the PPO algorithm, using the experiences collected from the environment to perform multiple epochs of updates. You need to implement the core parts.
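
When writing the losses in `ppo_update()`, tensor shapes deserve attention. The sketch below shows a common broadcasting pitfall, under the assumption that one tensor has shape `[B, 1]` (for example returns or advantages) and the other has shape `[B]` (for example log-probabilities). The actual shapes in your implementation depend on how your networks are written, so the tensors here are stand-ins.

```python
import torch

B = 4
column = torch.ones(B, 1)   # shape [4, 1], e.g. returns stacked as a column
flat = torch.zeros(B)       # shape [4], e.g. per-sample log-probabilities

# Mismatched shapes broadcast silently to a [4, 4] matrix instead of raising an error.
bad = column - flat
# Aligning the shapes first gives the intended element-wise result of shape [4].
good = column.squeeze(-1) - flat
```

A mean over `bad` still returns a scalar, so the mistake does not crash training; printing `.shape` on each operand before combining them is a cheap safeguard.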


In [None]:
def ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantage):
    batch_size = states.size(0)
    mini_batches = []

    for _ in range(batch_size // mini_batch_size):
        rand_ids = torch.randint(0, batch_size, (mini_batch_size,))
        mini_batch = states[rand_ids], actions[rand_ids], log_probs[rand_ids], returns[rand_ids], advantage[rand_ids]
        mini_batches.append(mini_batch)

    return mini_batches
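
Note that `ppo_iter()` draws indices with `torch.randint`, so sampling is with replacement: within one epoch some transitions may appear several times and others not at all. For comparison only, here is a sketch of a permutation-based alternative that visits each selected transition at most once per epoch. The helper `permuted_index_batches` is hypothetical and is not required by the assignment.

```python
import torch

def permuted_index_batches(batch_size, mini_batch_size):
    """Split one random permutation into disjoint, full-size index batches."""
    perm = torch.randperm(batch_size)
    last_start = batch_size - mini_batch_size  # last offset that still yields a full batch
    return [perm[i:i + mini_batch_size] for i in range(0, last_start + 1, mini_batch_size)]

index_batches = permuted_index_batches(batch_size=10, mini_batch_size=4)
# Each entry can index the collected tensors, e.g. states[idx], actions[idx].
```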

In [None]:
# TODO: Implement the main PPO update function

def ppo_update(policy_net, value_net, optimizer, ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantages, clip_param=0.2):
    r"""
    Implement the PPO update algorithm.

    This function should perform the optimization of the policy and value networks using the Proximal Policy Optimization (PPO) algorithm.
    You'll need to compute the ratio of new and old policy probabilities, apply the clipping technique, and calculate the losses for both the actor (policy network) and critic (value network).

    Instructions:
    1. Iterate over the number of PPO epochs, i.e., the number of passes over the currently collected data (each pass performs one optimizer.step() per mini-batch).
    2. In each epoch, iterate over the mini-batches of experiences.
    3. Calculate the new log probabilities of the actions taken, using the policy network.
    4. Compute the ratio of new to old probabilities.
    5. Apply the PPO clipping technique to the computed ratios.
    6. Calculate the actor (policy) and critic (value) losses. Check that the tensor shapes are consistent before computing the losses.
      6.1 Compute the actor loss:
        - The per-sample loss is the negative of the **clipped surrogate objective**:
          \[
          L^{\text{actor}} = -\min(\text{ratio} \cdot \text{advantage}, \text{clipped\_ratio} \cdot \text{advantage})
          \]
        - The final **actor loss** is the mean of this quantity over the mini-batch.

      6.2 Compute the critic loss:
        - The value network should minimize the difference between predicted and actual returns:
          \[
          L^{\text{critic}} = (\text{return} - \text{value})^2
          \]
        - The final **critic loss** is the mean of this objective.

    7. Combine the losses and perform a backpropagation step.

    Hints:
    - Use `policy_net(state)` to get the distribution over actions for the given states.
    - The `dist.log_prob(action)` method calculates the log probabilities of the taken actions according to the current policy.
    - The ratio is computed as the exponential of the difference between new and old log probabilities (`(new_log_probs - old_log_probs).exp()`).
    - Use `torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)` to clip the ratio between `[1-clip_param, 1+clip_param]`.
    - The actor loss is the negative minimum of the clipped and unclipped objective, averaged over all experiences in the mini-batch.
    - The critic loss is the mean squared error between the returns and the value estimates from the value network.
    - Remember to zero the gradients of the optimizer before the backpropagation step with `optimizer.zero_grad()`.
    - After computing the loss and performing backpropagation with `loss.backward()`, take an optimization step with `optimizer.step()`.
    """
    for _ in range(ppo_epochs):
        for state, action, old_log_probs, return_, advantage in ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantages):
            dist = policy_net(state)
            new_log_probs = dist.log_prob(action)

            ##### Code implementation here #####
            pass
            actor_loss = None
            critic_loss = None
            ##### Code implementation End #####

            loss = 0.5 * critic_loss + actor_loss  # You can freely adjust the weight of the critic loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

The main training loop below collects data from the environment, computes advantages, and updates the policy and value networks. All of these parts are provided to you.

In [None]:
def train(env_name='CartPole-v1', num_steps=1000, mini_batch_size=8, ppo_epochs=4, threshold=400):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    policy_net = PolicyNetwork(state_dim, action_dim)
    value_net = ValueNetwork(state_dim)
    optimizer = optim.Adam(list(policy_net.parameters()) + list(value_net.parameters()), lr=3e-3)

    state, _ = env.reset()
    early_stop = False
    reward_list = []

    for step in range(num_steps):
        log_probs = []
        values = []
        states = []
        actions = []
        rewards = []
        masks = []

        # Collect samples under the current policy
        for _ in range(2048):
            state = torch.tensor(np.array(state), dtype=torch.float32)  # Ensure correct tensor conversion
            dist, value = policy_net(state), value_net(state)

            action = dist.sample()
            next_state, reward, terminated, truncated, _ = env.step(int(action.item()))  # Ensure action is an int
            done = terminated or truncated  # Reset on the time limit as well as on failure
            log_prob = dist.log_prob(action)

            log_probs.append(log_prob.unsqueeze(0))
            values.append(value.unsqueeze(0))
            rewards.append(torch.tensor([reward], dtype=torch.float32))
            masks.append(torch.tensor([1 - done], dtype=torch.float32))
            states.append(state.unsqueeze(0))
            actions.append(action.unsqueeze(0))  # Fix for actions


            state = next_state
            if done:
                state, _ = env.reset()  # Ensure proper Gym reset handling

        next_state = torch.tensor(np.array(next_state), dtype=torch.float32).unsqueeze(0)  # Ensure proper conversion
        next_value = value_net(next_state)
        returns = compute_advantages(next_value, rewards, masks, values)

        returns = torch.cat(returns).detach()
        log_probs = torch.cat(log_probs).detach()
        values = torch.cat(values).detach()
        states = torch.cat(states)
        actions = torch.cat(actions)
        advantage = returns - values

        # Run PPO update for policy and value networks
        ppo_update(policy_net, value_net, optimizer, ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantage)

        if step % 1 == 0:
            test_reward = np.mean([evaluate_policy(policy_net, env_name) for _ in range(10)])
            print(f'Step: {step}\tReward: {test_reward}')
            reward_list.append(test_reward)
            if test_reward > threshold:
                print("Solved!")
                early_stop = True
                break
    return early_stop, reward_list

## 4. Training and evaluation

You can freely adjust all hyperparameters for better performance. With a correct implementation, the provided hyperparameters should reach evaluation rewards close to or above 400.

Note: If you believe your code is correct, try running it several times, since the learning curves can vary across runs. Present the best run you obtain.
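
If you want individual runs to be repeatable while you compare them, one option is to fix the random seeds before training. The helper `set_seed` below is a hypothetical sketch and is not part of the provided code. It covers Python, NumPy, and PyTorch only; the environment would additionally need a seed passed to `env.reset(seed=...)`, which the provided `train()` does not do.

```python
import random

import numpy as np
import torch

def set_seed(seed):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

set_seed(0)
first = torch.rand(3)
set_seed(0)
second = torch.rand(3)  # identical to `first` because the seed was reset
```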

In [None]:
threshold = 400

early_stop, reward_list = train(env_name='CartPole-v1', num_steps=100, mini_batch_size=16, ppo_epochs=4, threshold=threshold)

### Plot the performance curves

In [None]:
if not early_stop:
    print("Not solved in %d steps"%len(reward_list))

# Plot using Seaborn
sns.lineplot(x=np.arange(len(reward_list)), y=reward_list, color='salmon', marker='o', linestyle='-', label='PPO training: reach reward %.1f with %d steps'%(threshold, len(reward_list)))

# Optional: Adding titles and labels
plt.title('Performance Curves')
plt.xlabel('Step')
plt.ylabel('Reward')
plt.legend()

# Show the plot
plt.show()