# Implementing Policy Gradient with PyTorch on CartPole


## Monte-Carlo Policy Gradient (REINFORCE)

The agent learns a policy $\pi_\theta(a|s)$, where $\theta$ is the parameter vector, $s$ is a state, and $a$ is an action.

In plain Monte-Carlo Policy Gradient, the agent finishes an entire episode before updating the policy parameters based on the cumulative rewards obtained along the trajectory.

In [None]:
import gym
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Categorical

In [None]:
#
# Use CartPole Example
#
env = gym.make("CartPole-v1", new_step_api=True, render_mode="human")
print(env.observation_space) #Box(Low, High, vector dim, in float32 values)
print("Observation space shape:",env.observation_space.shape)
print(env.action_space)
print("No of actions:", env.action_space.n)

Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
Observation space shape: (4,)
Discrete(2)
No of actions: 2


## Creating NN with PyTorch
We create a simple NN:<br>
Input - vector of 4 features:<br>

**Cart Position (x)**:<br>
The horizontal position of the cart on the track.
The observation space allows values up to ±4.8, but the episode terminates once the cart leaves ±2.4.
Positive values mean the cart is to the right, negative to the left.<br>
**Cart Velocity (ẋ)**:<br>
The linear velocity of the cart. Measured in m/s. Positive values indicate motion to the right, negative to the left.<br>
**Pole Angle (θ):**<br>
The angle of the pole from vertical (0 rad). Typically in the range ±0.209 rad (~12 deg) before the episode ends. Positive values mean the pole is leaning to the right, negative to the left.<br>
**Pole Angular Velocity (θ̇):**<br>
The rate of change of the pole's angle in rad/s. Positive values mean the angle is increasing (the pole is rotating toward the right), negative values mean it is decreasing (rotating toward the left).<br>
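
To make the four features concrete, here is a minimal sketch that unpacks an observation vector and checks it against the termination thresholds. The limits (±2.4 for the cart position, ±12 degrees for the pole angle) are assumptions taken from the Gym CartPole documentation, and `describe_state` is a hypothetical helper, not part of Gym.

```python
import math
import numpy as np

# Assumed termination thresholds, taken from the Gym CartPole documentation
X_LIMIT = 2.4                          # cart position limit
THETA_LIMIT = 12 * 2 * math.pi / 360   # 12 degrees in radians (~0.209)

def describe_state(obs):
    """Unpack a 4-element observation and flag whether it is past a limit (hypothetical helper)."""
    x, x_dot, theta, theta_dot = (float(v) for v in obs)
    out_of_bounds = abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
    return {"x": x, "x_dot": x_dot, "theta": theta, "theta_dot": theta_dot,
            "out_of_bounds": out_of_bounds}

# Made-up observations: a perfectly upright pole and a pole tilted past 12 degrees
upright = describe_state(np.array([0.0, 0.0, 0.0, 0.0], dtype=np.float32))
tilted = describe_state(np.array([0.0, 0.0, 0.25, 0.0], dtype=np.float32))
print(upright["out_of_bounds"], tilted["out_of_bounds"])
```

The environment performs this check itself and reports it through the `terminated` flag; the sketch only illustrates what the numbers in the observation mean.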

Neural Network:<br>
Hidden Layer 1 - 20 nodes<br>
Hidden Layer 2 - 30 nodes<br>
Output Layer - 2 nodes (move left or right) <br>
NN training to be done with Adam optimizer.


In [None]:
#
#Hyperparameters
#
LEARNING_RATE = 0.01
GAMMA = 0.99

In [None]:
#
# A model can be defined in PyTorch by subclassing the torch.nn.Module class.
# This is the PyTorch base class meant to encapsulate behaviors specific to PyTorch Models and their components.
#
# The model is defined in two steps. We first specify the layer definition of the model,
# and then outline how they are applied to the inputs. Here’s a simple model with
# two linear layers and an activation function:
#
# class TinyModel(nn.Module):
#    def __init__(self):
#        super(TinyModel, self).__init__()
#        self.linear1 = nn.Linear(D_in, H1) #Dim of input = D_in
#        self.activation = nn.ReLU() # Dim of hidden layer = H1 (num of nodes)
#        self.linear2 = nn.Linear(H1, D_out) #Dim of output = D_out
#        self.softmax = nn.Softmax(dim=1)
#
#    def forward(self, x):
#        x = self.linear1(x)
#        x = self.activation(x)
#        x = self.linear2(x)
#        x = self.softmax(x)
#        return x
#
# tinymodel = TinyModel()
#
# You may have noticed that we define the SoftMax activation for the final layer in this model.
# This is because the CrossEntropyLoss function is not used here (remember we said that it has already
# combined both a SoftMax activation and the cross entropy loss function inside).
#
# For PG, the objective function J(theta) is the sum of (log_prob*reward) over every step.
# This is the quantity we are attempting to maximize here.
#
class PolicyModel(nn.Module):
    def __init__(self):
        #super(PolicyModel, self).__init__()
        super().__init__()
        self.linear1 = nn.Linear(env.observation_space.shape[0], 20) # input dim=4, 20
        self.linear2 = nn.Linear(20, 30) #hidden layer dimensions 20 & 30 (nodes) - 2 hidden layers
        self.linear3 = nn.Linear(30, env.action_space.n) # output dim=2
        self.activation = nn.ReLU()
        self.softmax = nn.Softmax(dim=1) #softmax over dim (col)

        # Storages used during a trajectory
        self.saved_log_probs = [] #stores ln(prob) of corresponding action chosen randomly during sampling in a trajectory
        self.rewards = [] #stores rewards obtained during trajectory

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.activation(x)
        x = self.linear3(x)
        x = self.softmax(x)
        return x

In [None]:
model = PolicyModel() #instantiate the NN
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE) #set the NN optimizer used in training,
                                                             #the method parameters() comes from nn.Module class
                                                             #i.e., we are training the model.parameters()
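
A quick sanity check of the architecture can be useful before training. The sketch below is an assumption-laden stand-in: it rebuilds the same 4-20-30-2 layer sizes with `nn.Sequential` instead of using `PolicyModel`, feeds in a dummy state, and confirms that the output is one row of two probabilities.

```python
import torch
import torch.nn as nn

# Stand-in for PolicyModel: same layer sizes, built with nn.Sequential (an assumption for brevity)
policy = nn.Sequential(
    nn.Linear(4, 20), nn.ReLU(),
    nn.Linear(20, 30), nn.ReLU(),
    nn.Linear(30, 2), nn.Softmax(dim=1),
)

state = torch.zeros(1, 4)   # a batch containing one dummy state
with torch.no_grad():       # no gradients needed for a shape check
    probs = policy(state)

print(probs.shape)          # one row, two action probabilities
print(probs.sum(dim=1))     # each row should sum to 1
```

The input needs the leading batch dimension because `Softmax(dim=1)` normalizes across the second axis.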

### Select Action
`select_action()` chooses an action from our policy's probability distribution using the PyTorch distributions package. The policy network returns a probability for each possible action in the action space (move left or move right) as an array (e.g. [0.7 0.3]). We then sample an action according to these probabilities, record its log-probability in our history, and return the action.<br>

The PyTorch distribution package contains parameterizable probability distributions and sampling functions. This allows the construction of stochastic computation graphs and stochastic gradient estimators for optimization.<br>

PyTorch supports REINFORCE by providing methods to create surrogate functions that can be backpropagated through. REINFORCE is commonly seen as the basis for policy gradient methods in reinforcement learning. When the probability density function is differentiable with respect to its parameters, we only need **sample()** and **log_prob()** to implement REINFORCE.<br>

See [PyTorch documentation](https://pytorch.org/docs/stable/distributions.html):<br>


```
probs = policy_network(state) # e.g. output probs = [[0.3,0.7]]
m = Categorical(probs)        # create a multinomial distribution based on probs
action = m.sample()           # generate sample action corresponding to distribution
next_state, reward = env.step(action)

# m.log_prob(action) generate ln of prob linked to the generated action
# e.g. move left -> 0.3, ln(0.3)=-1.204
loss = -m.log_prob(action) * reward
loss.backward()
```
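
The snippet above is schematic. A runnable sketch of just the `sample()` and `log_prob()` calls, using made-up probabilities rather than a real policy network, might look like this:

```python
import math
import torch
from torch.distributions import Categorical

torch.manual_seed(0)                     # fixed seed so the sketch is repeatable
probs = torch.tensor([[0.3, 0.7]])       # made-up policy output for one state
m = Categorical(probs)

action = m.sample()                      # tensor of shape (1,), holding 0 or 1
log_prob = m.log_prob(action)            # ln of the probability of the sampled action

# log_prob can also be queried for a specific action, e.g. action 0 (move left): ln(0.3)
log_prob_left = m.log_prob(torch.tensor([0]))
print(action.item(), log_prob.item(), log_prob_left.item())
```

Because `probs` carries a batch dimension, both `action` and `log_prob` come back as one-element tensors, which is why `.item()` is used to read out the plain Python value.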




In [None]:
# Function outputs an appropriate action based on prob distribution
# At the same time, calc the log_prob and saved into model's saved_log_probs[]
#
# state is an input vector [position, velocity, angle, angular velocity] in numpy format.
# Need to change this to torch tensor format.
#
def select_action(state):
    # get output of prob of action from NN using input state s
    state = torch.from_numpy(state).float().unsqueeze(0) #convert to torch tensor, need to be array of arrays - unsqueeze(0)
    probs = model(state) #input is array of arrays, requirement of PyTorch, probs is e.g. [[0.3, 0.7]]; model(state) invokes forward()
                         #so, output is an array of array of probs.
    # generate sample action corresponding to m distribution
    m = Categorical(probs) # create a categorical distribution based on probs
    action = m.sample() #action (0 or 1) generated in torch tensor format, randomly according to m distribution

    # calc log prob of corresponding sample action
    model.saved_log_probs.append(m.log_prob(action)) #e.g. 0.3 chance of moving left, ln(0.3)=-1.204

    return action.item() #return int value taken out from torch tensor, which represents the action


### Reward $R_t$
We update our policy by taking a sample of the action-value function $Q^{\pi_\theta}(s_t,a_t)$ by playing through episodes of the game. $Q^{\pi_\theta}(s_t,a_t)$ is defined as the expected return from taking action $a_t$ in state $s_t$ and following policy $\pi_\theta$ thereafter.

We know that for every step the simulation continues we receive a reward of 1. We can use this to calculate the policy gradient at each time step, where $r$ is the reward for a particular state-action pair. Rather than using the instantaneous reward $r$, we use a long-term reward $R_t$, the discounted sum of all future rewards until the end of the episode. In this way, the **longer** the episode runs into the future, the **greater** the reward for a particular state-action pair in the present. $R_t$ is then,

$$ R_{t} = \sum_{k=0}^{N} \gamma^{k}r_{t+k} $$

where $\gamma$ is the discount factor (e.g., 0.99) and $N$ is the number of steps remaining in the episode after step $t$. For example, an episode with 5 steps will have its rewards calculated as: [4.90, 3.94, 2.97, 1.99, 1]. This assumes the agent collects a reward of 1 at each step.

```
r: (1)+0.99+0.99^2+0.99^3+0.99^4   (1)+0.99+0.99^2+0.99^3   (1)+0.99+0.99^2   (1)+0.99   (1)
```

Next we normalize our reward vector: we subtract the mean from each element and divide by the standard deviation, which scales it to unit variance. Centering the returns acts as a simple baseline, which tends to reduce the variance of the gradient estimate and helps the NN converge.
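
As a minimal sketch of the two steps together, the block below computes the discounted returns for a made-up 5-step episode (reward of 1 per step, $\gamma = 0.99$) and then normalizes them. The variable names are illustrative only.

```python
import torch

GAMMA = 0.99
episode_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]   # made-up 5-step episode

# Discounted return for each step, computed backwards from the end of the episode
returns = []
R = 0.0
for r in reversed(episode_rewards):
    R = r + GAMMA * R
    returns.insert(0, R)

returns = torch.tensor(returns)
# Subtract the mean and divide by the standard deviation (small epsilon avoids division by zero)
normalized = (returns - returns.mean()) / (returns.std() + 1e-10)

print(returns)      # roughly [4.90, 3.94, 2.97, 1.99, 1.00]
print(normalized)   # zero mean, unit standard deviation
```

After normalization the early steps have positive values and the late steps negative ones, so actions taken early in a long episode are reinforced more strongly than those taken just before it ended.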

## Update Policy
After each episode we apply Monte-Carlo Policy Gradient to improve our policy. Following the Policy Gradient Theorem, the gradient is estimated from the sampled trajectory as:

$$\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta (a_t|s_t) \, R_t $$

We then feed our policy history multiplied by our rewards to our optimizer and update the weights of our neural network using stochastic gradient *ascent*. This should increase the likelihood of actions that earned our agent a larger reward. The reward values are normalized first.<br>
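
To see why minimizing the negated objective amounts to gradient ascent, consider the toy sketch below. It is not the CartPole policy: the "policy" is just two learnable logits, the sampled action and its return are made up, and plain SGD is used instead of Adam to keep the arithmetic easy to follow.

```python
import torch

logits = torch.zeros(2, requires_grad=True)     # toy "policy": two action preferences
optimizer = torch.optim.SGD([logits], lr=0.1)   # plain SGD (assumption, for simplicity)

action, R = 1, 1.0                              # made-up sampled action and its positive return
p_before = torch.softmax(logits, dim=0)[action].item()

optimizer.zero_grad()
log_prob = torch.log_softmax(logits, dim=0)[action]
loss = -log_prob * R                            # minimizing this maximizes log_prob * R
loss.backward()
optimizer.step()

p_after = torch.softmax(logits, dim=0)[action].item()
print(p_before, p_after)                        # the probability of the rewarded action goes up
```

With a negative return the same step would push the probability of that action down instead.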

This is a typical PyTorch NN parameter update training loop:
```
# Training loop
for epoch in range(num_epochs):
    for inputs, targets in data_loader:
        optimizer.zero_grad() # Zero the gradients
        outputs = net(inputs) # forward pass
        loss = criterion(outputs, targets) #Compute loss
        loss.backward() #Backward pass
        optimizer.step() #Update parameters
```





Example of how rewards are accumulated and calculated in a trajectory:

```
R = 0
rewards = []
for r in [1,1,1,1,1]: #rewards
    R = r + 0.99*R
    rewards.insert(0, R)
```
In the loop, rewards will change accordingly:
```
[1]
[1.99, 1]
[2.97, 1.99, 1]
[3.94, 2.97, 1.99, 1]
[4.90, 3.94, 2.97, 1.99, 1]

r: (1)+0.99+0.99^2+0.99^3+0.99^4   (1)+0.99+0.99^2+0.99^3   (1)+0.99+0.99^2   (1)+0.99   (1)
```

In [None]:
# Example to show torch.cat in action (tutorial refresh)
test_list=[]
for i in range(5): test_list.append(torch.tensor([i]))
print(test_list) #list of torch tensor
t=torch.cat(test_list) #t becomes a single torch tensor; concat joins all the numbers into one 1-D tensor
print(t, type(t))
print(t.sum()) # result is still a torch tensor

[tensor([0]), tensor([1]), tensor([2]), tensor([3]), tensor([4])]
tensor([0, 1, 2, 3, 4]) <class 'torch.Tensor'>
tensor(10)


In [None]:
#
# Called per episode (trajectory)
#
def update_policy():
    R = 0
    policy_loss = [] # this is a bit of a misnomer
    rewards = []
    for r in model.rewards[::-1]: # model.rewards[] mostly contain 1s for CartPole, it's inside model object
        R = r + GAMMA * R
        rewards.insert(0, R) #contains all rewards (including discounted ones)

    # Normalize reward
    rewards = torch.tensor(rewards)
    rewards = (rewards - rewards.mean())/(rewards.std() + 1e-10) # normalize for better convergence

    # get loss
    for reward, log_prob in zip(rewards, model.saved_log_probs):
        #
        # Note: Loss function is not cross-entropy like in regular NN
        #       For PG, it is the sum of (log_prob*reward) over every step that we are maximizing
        #       Actually, it's a misnomer to call it a "policy loss"
        #
        policy_loss.append(-log_prob * reward) #to be concatenated and summed below cumulatively
                                               #we add a negative here because Torch optimizers do grad descent
                                               #adding negative turns it into grad ascent
                                               #policy_loss is a list of *torch tensors*
                                               #Note: model.saved_log_probs is a list of
                                               #      torch.tensor([x]) with []. It came from
                                               #      m.log_prob(action)
    #update NN parameters
    optimizer.zero_grad()
    policy_loss_t = torch.cat(policy_loss).sum() #no need forward pass here as it has been called in select_action during trajectory
                                                 #result torch.cat(policy_loss).sum() is a torch tensor
    policy_loss_t.backward() #NN backpropagation
    optimizer.step() #update NN parameters per episode iteration

    # Empty out model.rewards[] & model.saved_log_probs[]
    # to reaccumulate for next trajectory
    del model.rewards[:] #delete all elements in list, leaving list empty
    del model.saved_log_probs[:] #delete all elements in list, leaving list empty
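
The per-step loop that builds `policy_loss` can also be written as a single vectorized expression. The sketch below uses made-up log-probabilities and returns to suggest that the two forms give the same value; it is an alternative formulation, not a change to the function above.

```python
import torch

# Made-up per-step log-probabilities, each of shape (1,) like m.log_prob(action)
saved_log_probs = [torch.tensor([-0.7]), torch.tensor([-1.2]), torch.tensor([-0.4])]
returns = torch.tensor([1.1, 0.0, -1.1])   # made-up normalized returns

# Loop form: one term per step, then concatenate and sum
terms = [-lp * R for lp, R in zip(saved_log_probs, returns)]
loss_loop = torch.cat(terms).sum()

# Vectorized form: concatenate first, multiply element-wise, then sum
loss_vec = -(torch.cat(saved_log_probs) * returns).sum()

print(loss_loop.item(), loss_vec.item())
```

The loop form mirrors the math step by step, while the vectorized form avoids building an intermediate Python list.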


### Training
This is our main policy training loop.  For each step in a training episode, we choose an action, take a step through the environment, and record the resulting new reward.  We call update_policy() at the end of each episode to feed the episode history to our neural network and improve our policy.

## Run Model

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
total_rewards=0
episode=0

while (True):
    episode+=1
    rewards_per_episode=0
    s = env.reset()

    # Run 1 episode
    while (True):
        # Inside select_action, we:
        # (1) Have the model predict the softmax probs given a state
        # (2) Sample an action based on the softmax probs output
        # (3) Take the opportunity to calc log prob of the corresponding sample action and
        #     append it inside model.saved_log_probs[], e.g. 0.3 chance of moving left, ln(0.3)=-1.204
        a = select_action(s)

        # terminated - whether the episode has ended due to a terminal state (e.g., completion or failure)
        # truncated - whether the episode has ended due to a time limit or other non-terminal conditions
        #             (e.g., max no of steps has been reached). In CartPole, the episode might be truncated
        #             if the pole stays upright for too long (e.g., 500 steps).
        s, r, terminated, truncated, _ = env.step(a)

        model.rewards.append(r)
        rewards_per_episode += r

        if terminated or truncated: break

    total_rewards += rewards_per_episode

    if rewards_per_episode==500.0:
        print(f"Episode {episode}: Rewards in this eps = {rewards_per_episode}, Avg rewards per eps={total_rewards/episode:0.2f}")
        torch.save(model, '/content/drive/MyDrive/policyNet.pt')
        break #ends when 500 is reached

    if episode % 20 == 0:
        print(f"Episode {episode}: Rewards in this eps = {rewards_per_episode}, Avg rewards per eps={total_rewards/episode:0.2f}")
        torch.save(model, '/content/drive/MyDrive/policyNet.pt')

    #
    # UPDATE POLICY
    #
    update_policy()

Episode 20: Rewards in this eps = 18.0, Avg rewards per eps=23.15
Episode 40: Rewards in this eps = 35.0, Avg rewards per eps=24.57
Episode 60: Rewards in this eps = 54.0, Avg rewards per eps=45.57
Episode 80: Rewards in this eps = 22.0, Avg rewards per eps=76.06
Episode 100: Rewards in this eps = 94.0, Avg rewards per eps=74.96
Episode 120: Rewards in this eps = 64.0, Avg rewards per eps=72.74
Episode 140: Rewards in this eps = 142.0, Avg rewards per eps=77.40
Episode 160: Rewards in this eps = 160.0, Avg rewards per eps=84.23
Episode 180: Rewards in this eps = 194.0, Avg rewards per eps=89.67
Episode 189: Rewards in this eps = 500.0, Avg rewards per eps=100.65
