# Homework: Implementing PPO on Gym

In this homework, you will implement the Proximal Policy Optimization (PPO) algorithm and test it on an environment from OpenAI Gym. This will give you hands-on experience with one of the most influential policy gradient methods in reinforcement learning.


## 1. Setup

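First, install and import the necessary packages. The code in this notebook uses the classic Gym API, in which `env.reset()` returns only the observation and `env.step()` returns four values. Gym 0.26 changed both, so install an earlier release. Run the following in a code cell:

In [None]:
!pip install "gym<0.26" torch matplotlib seaborn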



In [None]:
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(style="darkgrid")

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

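Two helper functions are provided below. `compute_advantages` computes the bootstrapped discounted return for every collected step (the training loop later subtracts the value estimates from these returns to obtain the advantages). `evaluate_policy` runs one episode with a given policy and reports the total reward.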

In [None]:
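def compute_advantages(next_value, rewards, masks, values, gamma=0.99):
    """Return the discounted return (advantage + value) for each step.

    masks[step] is 0 where an episode ended, which stops bootstrapping
    across episode boundaries.
    """
    values = values + [next_value]  # append the bootstrap value of the final state
    advantages = 0  # running advantage, accumulated backwards in time
    returns = []
    for step in reversed(range(len(rewards))):
        # One-step TD error for this transition
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        advantages = delta + gamma * masks[step] * advantages
        returns.insert(0, advantages + values[step])
    return returns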

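def evaluate_policy(policy, env_name, seed=42):
    """Run one episode with `policy` and return the total (undiscounted) reward."""
    env_test = gym.make(env_name)
    # env_test.seed(seed)  # seeding is disabled, so `seed` is currently unused
    state, done, total_reward = env_test.reset(), False, 0
    while not done:
        state = torch.FloatTensor(state).unsqueeze(0)  # add a batch dimension
        dist = policy(state)
        # Sample an action from the policy's distribution and step the environment
        next_state, reward, done, _ = env_test.step(dist.sample().numpy()[0])
        state = next_state
        total_reward += reward
    return total_reward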
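
To see what `compute_advantages` returns, the following plain-Python sketch mirrors its backward recursion on scalars instead of tensors. The function name `bootstrapped_returns` and the toy numbers are illustrative assumptions, not part of the homework code. Because the recursion has no λ parameter, the value terms cancel step by step, so each output is the reward plus the discounted next return, and `masks` cuts the chain wherever an episode ended.

```python
def bootstrapped_returns(next_value, rewards, masks, values, gamma=0.99):
    """Scalar version of the recursion in compute_advantages (illustrative)."""
    values = values + [next_value]
    advantage = 0.0
    returns = []
    for step in reversed(range(len(rewards))):
        delta = rewards[step] + gamma * values[step + 1] * masks[step] - values[step]
        advantage = delta + gamma * masks[step] * advantage
        returns.insert(0, advantage + values[step])
    return returns


# Three steps: the episode ends at the second step (mask 0), and the third
# step belongs to a new episode that is bootstrapped with next_value.
rewards = [1.0, 1.0, 1.0]
masks = [1.0, 0.0, 1.0]

a = bootstrapped_returns(10.0, rewards, masks, values=[0.0, 0.0, 0.0], gamma=0.5)
b = bootstrapped_returns(10.0, rewards, masks, values=[3.0, -2.0, 7.0], gamma=0.5)

print(a)
print(b)  # same as a: the intermediate value estimates cancel out
```

Only `next_value` influences the result, and only for steps that come after the last episode boundary.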



## 2. Defining the Policy and Value Networks

We start with the policy network, which chooses the actions to take, and the value network, which estimates the expected return from a given state. You need to implement both networks.
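The policy network's final layer produces one logit per action, and a categorical distribution turns those logits into probabilities through a softmax. The sketch below spells out that computation in plain Python, assuming two actions as in CartPole. The helpers `softmax` and `log_prob` are hypothetical stand-ins for what `Categorical(logits=...)` computes, not PyTorch code, and the logits are made up.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    """Log-probability of `action` under the softmax distribution."""
    return math.log(softmax(logits)[action])

logits = [2.0, 0.0]  # hypothetical network output for one state
probs = softmax(logits)

random.seed(0)
action = random.choices(range(len(probs)), weights=probs)[0]  # sample an action

print(probs, action, log_prob(logits, action))
```

The log-probability of the sampled action is the quantity PPO stores during data collection and later compares against the updated policy.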

In [None]:
# TODO: Implement the PolicyNetwork class

class PolicyNetwork(nn.Module):
    """
    Implement the Policy Network.

    Your task is to complete the initialization of the policy network that maps states to action probabilities.
    This network should consist of several fully connected layers with ReLU activation, followed by a final layer
    that outputs logits for each action. The forward pass should return a Categorical distribution over actions.

    Instructions:
    1. Initialize the fully connected layers in the __init__ method.
    2. Implement the forward pass to return a Categorical distribution given state inputs.

    Hint: The constructor takes 'state_dim' and 'action_dim' as arguments, representing the dimensions
    of the state space and action space, respectively.
    """
    def __init__(self, state_dim, action_dim):
        super(PolicyNetwork, self).__init__()
        ##### Code implementation here #####
        pass

    def forward(self, x):
        ##### Code implementation here #####
        pass

# TODO: Implement the ValueNetwork class
class ValueNetwork(nn.Module):
    """
    Implement the Value Network.

    Your task is to complete the initialization of the value network that maps states to value estimates.
    Similar to the policy network, this network should consist of several fully connected layers with ReLU activation
    followed by a final layer that outputs a single value estimate for the input state.

    Instructions:
    1. Initialize the fully connected layers in the __init__ method.
    2. Implement the forward pass to return the value estimate given state inputs.

    Hint: The constructor takes 'state_dim' as an argument, representing the dimension of the state space.
    """
    def __init__(self, state_dim):
        super(ValueNetwork, self).__init__()
        ##### Code implementation here #####
        pass

    def forward(self, x):
        ##### Code implementation here #####
        pass


## 3. Implementing the Training Loop

Training in PPO involves collecting data from the environment, computing advantages, and updating the policy and value networks.

1. `ppo_iter()` generates mini-batches from the collected data for the gradient updates. This function is provided to you.

2. `ppo_update()` applies the core PPO algorithm: it uses the experience collected from the environment to perform multiple epochs of updates on the policy and value networks. You need to implement its core parts.
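
The clipping step at the core of `ppo_update()` can be illustrated on single numbers. This is a minimal plain-Python sketch, not the tensor code you are asked to write; `clipped_objective` is a hypothetical helper and the sample probabilities are made up. It shows that once the probability ratio has moved past the clip range in the direction the advantage favours, the objective stops growing, so the update no longer pushes the policy further.

```python
import math

def clipped_objective(new_log_prob, old_log_prob, advantage, clip_param=0.2):
    """PPO clipped surrogate for one sample (a quantity to maximize)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = min(max(ratio, 1.0 - clip_param), 1.0 + clip_param)
    return min(ratio * advantage, clipped_ratio * advantage)

old_lp = math.log(0.5)

# Positive advantage: raising the action probability helps only up to ratio 1.2.
inside = clipped_objective(math.log(0.55), old_lp, advantage=1.0)   # ratio 1.1, unclipped
capped = clipped_objective(math.log(0.75), old_lp, advantage=1.0)   # ratio 1.5, capped

# Negative advantage: lowering the probability helps only down to ratio 0.8.
floored = clipped_objective(math.log(0.25), old_lp, advantage=-1.0)  # ratio 0.5, capped

# Moving the wrong way is never clipped, so the full penalty is kept.
wrong_way = clipped_objective(math.log(0.75), old_lp, advantage=-1.0)  # ratio 1.5

print(inside, capped, floored, wrong_way)
```

The actor loss is the negative of this quantity, averaged over the mini-batch.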


In [None]:
def ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantage):
    batch_size = states.size(0)
    mini_batches = []

    for _ in range(batch_size // mini_batch_size):
        rand_ids = np.random.randint(0, batch_size, mini_batch_size)
        mini_batch = states[rand_ids, :], actions[rand_ids], log_probs[rand_ids], returns[rand_ids], advantage[rand_ids]
        mini_batches.append(mini_batch)

    return mini_batches

def ppo_update(policy_net, value_net, optimizer, ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantages, clip_param=0.2):
    """
    Implement the PPO update algorithm.

    This function should perform the optimization of the policy and value networks using the Proximal Policy Optimization (PPO) algorithm.
    You'll need to compute the ratio of new and old policy probabilities, apply the clipping technique, and calculate the losses for both the actor (policy network) and critic (value network).

    Instructions:
    1. Iterate over the PPO epochs. Each epoch draws a new set of mini-batches from the currently collected data and takes one `optimizer.step()` per mini-batch.
    2. In each epoch, iterate over the mini-batches of experiences.
    3. Calculate the new log probabilities of the actions taken, using the policy network.
    4. Compute the ratio of new to old probabilities.
    5. Apply the PPO clipping technique to the computed ratios.
    6. Calculate the actor (policy) and critic (value) losses.
    7. Combine the losses and perform a backpropagation step.

    Hints:
    - Use `policy_net(state)` to get the distribution over actions for the given states.
    - The `dist.log_prob(action)` method calculates the log probabilities of the taken actions according to the current policy.
    - The ratio is computed as the exponential of the difference between new and old log probabilities (`(new_log_probs - old_log_probs).exp()`).
    - Use `torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param)` to clip the ratio between `[1-clip_param, 1+clip_param]`.
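    - The actor loss is the negative of the element-wise minimum of the unclipped objective (`ratio * advantage`) and the clipped objective (`clipped_ratio * advantage`), averaged over all experiences in the mini-batch.
    - The critic loss is the mean squared error between the returns and the value estimates from the value network.
    - Check tensor shapes before combining them: depending on how your value network shapes its output, `advantage` and `return_` may have shape `[batch, 1]` while the log probabilities have shape `[batch]`, and PyTorch will broadcast mismatched shapes silently.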
    - Remember to zero the gradients of the optimizer before the backpropagation step with `optimizer.zero_grad()`.
    - After computing the loss and performing backpropagation with `loss.backward()`, take an optimization step with `optimizer.step()`.
    """
    for _ in range(ppo_epochs):
        for state, action, old_log_probs, return_, advantage in ppo_iter(mini_batch_size, states, actions, log_probs, returns, advantages):
            dist = policy_net(state)
            new_log_probs = dist.log_prob(action)

            ##### Code implementation here #####
            pass
            actor_loss = None
            critic_loss = None
            ##### Code implementation End #####

            loss = 0.5 * critic_loss + actor_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
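
Because `ppo_iter` draws indices with `np.random.randint`, each mini-batch is sampled with replacement, so an epoch is not an exact pass over the data. The standard-library sketch below counts the optimizer steps taken in one `ppo_update` call and how many distinct samples a single epoch touches. It assumes 2048 collected transitions, `mini_batch_size=16`, and `ppo_epochs=4`, the values used elsewhere in this notebook; the variable names are illustrative.

```python
import random

batch_size, mini_batch_size, ppo_epochs = 2048, 16, 4

batches_per_epoch = batch_size // mini_batch_size
total_optimizer_steps = ppo_epochs * batches_per_epoch
print(total_optimizer_steps)  # 4 * 128 = 512

# Simulate one epoch of index sampling with replacement.
random.seed(0)
seen = set()
for _ in range(batches_per_epoch):
    ids = [random.randrange(batch_size) for _ in range(mini_batch_size)]
    seen.update(ids)

coverage = len(seen) / batch_size
print(f"distinct samples visited in one epoch: {coverage:.0%}")
```

In expectation, drawing as many indices as there are samples covers a fraction of about 1 - 1/e (roughly 63%) of them, which is one reason running several epochs over the same data is useful here.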

### Main training loop

In [None]:
def train(env_name='CartPole-v1', num_steps=1000, mini_batch_size=8, ppo_epochs=4, threshold=400):
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n

    policy_net = PolicyNetwork(state_dim, action_dim)
    value_net = ValueNetwork(state_dim)
    optimizer = optim.Adam(list(policy_net.parameters()) + list(value_net.parameters()), lr=3e-3)

    state = env.reset()
    early_stop = False
    reward_list = []

    for step in range(num_steps):
        log_probs = []
        values = []
        states = []
        actions = []
        rewards = []
        masks = []
        entropy = 0

        # Collect samples under the current policy
        for _ in range(2048):
            state = torch.FloatTensor(state).unsqueeze(0)
            dist, value = policy_net(state), value_net(state)

            action = dist.sample()
            next_state, reward, done, _ = env.step(action.numpy()[0])
            log_prob = dist.log_prob(action)

            log_probs.append(log_prob)
            values.append(value)
            rewards.append(torch.tensor([reward], dtype=torch.float32))
            masks.append(torch.tensor([1-done], dtype=torch.float32))
            states.append(state)
            actions.append(action)

            state = next_state
            if done:
                state = env.reset()

        next_state = torch.FloatTensor(next_state).unsqueeze(0)
        next_value = value_net(next_state)
        returns = compute_advantages(next_value, rewards, masks, values)

        returns = torch.cat(returns).detach()
        log_probs = torch.cat(log_probs).detach()
        values = torch.cat(values).detach()
        states = torch.cat(states)
        actions = torch.cat(actions)
        advantage = returns - values

        # run PPO update for policy and value networks
        ppo_update(policy_net, value_net, optimizer, ppo_epochs, mini_batch_size, states, actions, log_probs, returns, advantage)

        # Evaluate after every update (raise the modulus to evaluate less often)
        if step % 1 == 0:
            test_reward = np.mean([evaluate_policy(policy_net, env_name) for _ in range(10)])
            print(f'Step: {step}\tReward: {test_reward}')
            reward_list.append(test_reward)
            if test_reward > threshold:
                print("Solved!")
                early_stop = True
                break
    return early_stop, reward_list

## 4. Training and Evaluation

In [None]:
# Run the training function
threshold = 400

early_stop, reward_list = train(env_name='CartPole-v1', num_steps=100, mini_batch_size=16, ppo_epochs=4, threshold=threshold)

### Plot the performance curves

In [None]:
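# Build a legend label that reflects whether the threshold was actually reached
if early_stop:
    label = 'PPO training: reached reward %.1f in %d steps' % (threshold, len(reward_list))
else:
    print("Not solved in %d steps" % len(reward_list))
    label = 'PPO training: did not reach reward %.1f in %d steps' % (threshold, len(reward_list))

# Plot using Seaborn
sns.lineplot(x=np.arange(len(reward_list)), y=reward_list, color='salmon', marker='o', linestyle='-', label=label)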

# Optional: Adding titles and labels
plt.title('Performance Curves')
plt.xlabel('Step')
plt.ylabel('Reward')
plt.legend()

# Show the plot
plt.show()