
### PPO

> About

- PPO consists of an Actor and a Critic: the Actor outputs the action probabilities for a given state, and the Critic outputs the expected return for a given state
- The Critic network estimates the expected return from each state sampled from the training data, and the advantage is calculated as the difference between the actual discounted return and that estimate
- For each state and action taken during training, the probability that the updated Actor network assigns to that action is divided by the probability that the original Actor network assigned to it. This ratio scales the advantage in the update, and clipping it prevents the updated network from deviating too much from the original network
- The difference between the actual discounted return and the Critic's predicted return is used to update the Critic network
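
The clipping described above can be sketched for a single sample in plain Python. This is only a minimal illustration, not the notebook's implementation: the function name `clipped_surrogate` is hypothetical, and the default `clip_param=0.2` is borrowed from the `PPO` class defined later in this notebook.

```python
def clipped_surrogate(ratio, advantage, clip_param=0.2):
    """Per-sample PPO objective (to be maximized); hypothetical helper for illustration."""
    # Confine the ratio to the range [1 - clip_param, 1 + clip_param]
    clipped_ratio = max(1.0 - clip_param, min(ratio, 1.0 + clip_param))
    # Keep the more pessimistic of the unclipped and clipped estimates
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain is capped at (1 + clip_param) * advantage
print(clipped_surrogate(ratio=1.5, advantage=2.0))
# Negative advantage with a high ratio: the unclipped (larger) penalty is kept
print(clipped_surrogate(ratio=1.5, advantage=-2.0))
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + clip_param, so there is no incentive to push the policy further. With a negative advantage and a high ratio the unclipped term is kept, so the update can still correct the policy.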

> Pro

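- Stability and robustness: clipping the ratio limits how far a single update can move the policy away from the one that collected the data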

> Con

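- Sample inefficiency: the collected transitions are discarded after each update, so every update needs freshly collected data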

```
class Actor(nn.Module):
  def __init__(self):
    - Initialize neural network that maps state -> hidden layer -> action probability

  def forward(self, x):
    - Feed the input state through the neural network and apply softmax to the output action probability
    - Return the action probability
```

```
class Critic(nn.Module):
  def __init__(self):
    - Initialize neural network that maps state -> hidden layer -> expected reward
    
  def forward(self, x):
    - Feed the input state through the neural network
    - Return the expected reward
```

```
class PPO():
  def __init__(self):
    - Initialize Actor and Critic networks, Actor and Critic optimizers, and buffer to store training data

  def select_action(self, state):
    - Feed the state through the Actor network and get the action probabilities
    - Build a categorical distribution from the action probabilities and sample an action from it
    - Return the selected action and its probability

  def get_value(self, state):
    - Feed the state through the Critic network
    - Return the expected reward

  def save_param(self):
    - Save the Actor and Critic networks' weights

  def store_transition(self, transition):
    - Store the transition into the storage

  def update(self, i_ep):
    - Get the state tensor, action tensor, reward list, and old action probability tensor from the transitions in the storage
    - Calculate the discounted return for each time step using Gt = r_t + gamma * r_t+1 + gamma^2 * r_t+2 + ..., stored as a tensor [r0 + gamma * r1 + gamma^2 * r2 + ..., r1 + gamma * r2 + gamma^2 * r3 + ..., ...]; Gt represents how much reward the Actor model managed to achieve from the current time step to the end of the game
    - Run the following update loop a fixed number of times
    # Update Loop
    - 1. Sample a random batch of indices from the storage
    - 2. Fetch the Gt values at those indices; these represent the actual return achieved
    - 3. Feed the states at those indices to the Critic model and get the expected return
    - 4. Take the difference between the Gt value and the expected return. This is the advantage: how much better the Actor model performed than the Critic expected from that state
    - 5. Feed the states into the Actor network and get the probability of the action that was actually taken; call this the new action probability
    - 6. Calculate the ratio as new action probability / old action probability (the one stored when the data was collected). The ratio represents how much the Actor model has changed since the data was collected (in the first loop, the ratio will be close to 1 since the model hasn't been updated yet)
    - 7. Calculate the first surrogate value by multiplying the ratio and the advantage. This weights the advantage by how much more (or less) likely the updated Actor is to choose the action; the advantage can be positive or negative
    - 8. Calculate the second surrogate value by clamping the ratio to the range 1 +- clip parameter and then multiplying by the advantage. This serves the same purpose as the first surrogate value, except that the ratio is confined to a hard range so that a very large ratio cannot dominate the update
    - 9. Take the element-wise minimum of the first and second surrogate values and negate the mean (all of the values above are tensors over the batch). Minimizing this loss makes actions with a positive advantage more likely and actions with a negative advantage less likely, while the clipping removes the incentive to push the ratio outside the allowed range
    - 10. Apply backpropagation to the Actor network
    - 11. Calculate the Critic loss as the mean squared error between the Gt value (actual return) and the expected return; a larger difference results in a higher loss and a larger adjustment to the Critic network
    - 12. Apply backpropagation to the Critic network
    - Clear the storage once the update loop has finished
```
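
The discounted-return step above can be checked by hand on a tiny example. The sketch below uses plain Python lists rather than tensors, and the helper name `discounted_returns` is an assumption made for illustration; it mirrors the backward accumulation described in the outline.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] where G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    running = 0.0
    returns = []
    # Walk the episode backwards so each return reuses the one that follows it
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


rewards = [1.0, 1.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

The last step has nothing after it, so its return is just its own reward; each earlier step adds its reward to the discounted return of the step that follows.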

```
def main():
  - Run the following training loop for some number of epochs
  # Training loop
  - Collect data from the environment until the game ends
  - When the game is over, update the agent using the collected data
```
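
How the end of a game is signalled depends on the installed gym version: by default, releases before 0.26 return four values from `env.step()` (`obs, reward, done, info`), while 0.26 and later return five (`obs, reward, terminated, truncated, info`). A small helper could normalize both shapes into a single `done` flag. The name `unpack_step` is hypothetical; it is not part of gym or of this notebook.

```python
def unpack_step(result):
    """Normalize a gym step result to (obs, reward, done, info); hypothetical helper."""
    if len(result) == 5:
        # Newer API: the episode is over if it terminated or was truncated
        obs, reward, terminated, truncated, info = result
        return obs, reward, bool(terminated or truncated), info
    # Older API: a single done flag
    obs, reward, done, info = result
    return obs, reward, bool(done), info


# Stand-in tuples instead of a real environment, to show both shapes
print(unpack_step(("s1", 1.0, True, False, {})))  # ('s1', 1.0, True, {})
print(unpack_step(("s1", 1.0, False, {})))        # ('s1', 1.0, False, {})
```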

In [None]:
# Importing required libraries and modules
import argparse
import pickle
from collections import namedtuple
from itertools import count
import os, time
import numpy as np
import matplotlib.pyplot as plt
import gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Normal, Categorical
from torch.utils.data.sampler import BatchSampler, SubsetRandomSampler
from tensorboardX import SummaryWriter

# Setting hyperparameters and constants
gamma = 0.99
render = False
seed = 1
log_interval = 10

# Initialize the CartPole environment and state and action space dimensions
env = gym.make('CartPole-v0').unwrapped
num_state = env.observation_space.shape[0]
num_action = env.action_space.n

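# Set the random seed for reproducibility
torch.manual_seed(seed)
# The five-value env.step() unpacking in main() implies gym >= 0.26, where
# env.seed() was removed; the environment is seeded through reset() instead
env.reset(seed=seed)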

# Define a named tuple to store transitions
# (despite its name, 'a_log_prob' holds the action probability, not its logarithm)
Transition = namedtuple('Transition', ['state', 'action',  'a_log_prob', 'reward', 'next_state'])

# Define the Actor neural network
class Actor(nn.Module):
    def __init__(self):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(num_state, 100)
        self.action_head = nn.Linear(100, num_action)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        action_prob = F.softmax(self.action_head(x), dim=1)
        return action_prob

# Define the Critic neural network
class Critic(nn.Module):
    def __init__(self):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(num_state, 100)
        self.state_value = nn.Linear(100, 1)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        value = self.state_value(x)
        return value

# Define the PPO (Proximal Policy Optimization) class
class PPO():
    clip_param = 0.2
    max_grad_norm = 0.5
    ppo_update_time = 10
    buffer_capacity = 1000
    batch_size = 32

    def __init__(self):
        super(PPO, self).__init__()
        self.actor_net = Actor()
        self.critic_net = Critic()
        self.buffer = []
        self.counter = 0
        self.training_step = 0
        self.writer = SummaryWriter('../exp')
        self.actor_optimizer = optim.Adam(self.actor_net.parameters(), 1e-3)
        self.critic_net_optimizer = optim.Adam(self.critic_net.parameters(), 3e-3)

        # Create directories if they don't exist
        os.makedirs('../param/net_param', exist_ok=True)
        os.makedirs('../param/img', exist_ok=True)

    def select_action(self, state):
        state = torch.from_numpy(state).float().unsqueeze(0)
        with torch.no_grad():
            action_prob = self.actor_net(state)
        c = Categorical(action_prob)
        action = c.sample()
        return action.item(), action_prob[:, action.item()].item()

    def get_value(self, state):
        state = torch.from_numpy(state).float()  # match the float32 network weights
        with torch.no_grad():
            value = self.critic_net(state)
        return value.item()

    def save_param(self):
        torch.save(self.actor_net.state_dict(), '../param/net_param/actor_net' + str(time.time())[:10] + '.pkl')
        torch.save(self.critic_net.state_dict(), '../param/net_param/critic_net' + str(time.time())[:10] + '.pkl')

    def store_transition(self, transition):
        self.buffer.append(transition)
        self.counter += 1

    def update(self, i_ep):
        # Stack the states into one array first; building a tensor from a list of arrays is slow
        state = torch.tensor(np.array([t.state for t in self.buffer]), dtype=torch.float)
        action = torch.tensor([t.action for t in self.buffer], dtype=torch.long).view(-1, 1)
        reward = [t.reward for t in self.buffer]
        old_action_log_prob = torch.tensor([t.a_log_prob for t in self.buffer], dtype=torch.float).view(-1, 1)

        R = 0
        Gt = []
        for r in reward[::-1]:
            R = r + gamma * R
            Gt.insert(0, R)
        Gt = torch.tensor(Gt, dtype=torch.float)

        # Training loop for PPO
        for i in range(self.ppo_update_time):
            for index in BatchSampler(SubsetRandomSampler(range(len(self.buffer))), self.batch_size, False):
                if self.training_step % 1000 == 0:
                    print('I_ep {}, train {} times'.format(i_ep, self.training_step))

                Gt_index = Gt[index].view(-1, 1)
                V = self.critic_net(state[index])
                delta = Gt_index - V
                advantage = delta.detach()

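                # Core PPO logic for policy optimization
                action_prob = self.actor_net(state[index]).gather(1, action[index])
                # old_action_log_prob holds plain probabilities (see select_action),
                # so this division is pi_new(a|s) / pi_old(a|s)
                ratio = (action_prob / old_action_log_prob[index])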
                surr1 = ratio * advantage
                surr2 = torch.clamp(ratio, 1 - self.clip_param, 1 + self.clip_param) * advantage

                # Update actor network
                action_loss = -torch.min(surr1, surr2).mean()
                self.writer.add_scalar('loss/action_loss', action_loss, global_step=self.training_step)
                self.actor_optimizer.zero_grad()
                action_loss.backward()
                nn.utils.clip_grad_norm_(self.actor_net.parameters(), self.max_grad_norm)
                self.actor_optimizer.step()

                # Update critic network
                value_loss = F.mse_loss(Gt_index, V)
                self.writer.add_scalar('loss/value_loss', value_loss, global_step=self.training_step)
                self.critic_net_optimizer.zero_grad()
                value_loss.backward()
                nn.utils.clip_grad_norm_(self.critic_net.parameters(), self.max_grad_norm)
                self.critic_net_optimizer.step()
                self.training_step += 1

        del self.buffer[:]

# Main function to run the agent in the environment
def main():
    agent = PPO()
    for i_epoch in range(1000):
        state, _ = env.reset()  # gym >= 0.26 returns (observation, info)
        if render: env.render()

        for t in count():
            action, action_prob = agent.select_action(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated  # the episode ends on either signal
            trans = Transition(state, action, action_prob, reward, next_state)
            if render: env.render()
            agent.store_transition(trans)
            state = next_state

            if done:
                if len(agent.buffer) >= agent.batch_size:
                    agent.update(i_epoch)
                agent.writer.add_scalar('liveTime/livestep', t, global_step=i_epoch)
                break

# Entry point of the script
if __name__ == '__main__':
    main()
    print("end")