# CPSC 533V: Policy Gradients and Proximal Policy Optimization (PPO)

---

## Submission Information

- Complete the assignment by editing and executing the associated Python files.
- Task 1 should be completed in the notebook, i.e. include your answers under each question.
- Tasks 2-4 are coding and experiment questions. Copy and paste your results (screenshots and logs) into the notebook.  You should also copy your completed code into this notebook and paste it under the corresponding questions; each snippet should be only a few lines at most.
- When done, upload the completed Jupyter notebook (ipynb file) on canvas.
- **We recommend working in groups of two**. List your names and student numbers below (if you use a different name on Canvas).

<ul style="list-style-type: none; font-size: 1.2em;">
<li><span style="color:blue"><b><u>Juntai Cao</u></b></span> (50171404): <i>jtcao7@cs.ubc.ca</i></li>
<li><span style="color:blue"><b><u>Yuwei Yin</u></b></span> (36211928): <i>yuweiyin@cs.ubc.ca</i></li>
</ul>

GitHub: <a>https://github.com/YuweiYin/UBC_CPSC_533V/tree/master/Assignment_3/</a>

*As always, you are encouraged to discuss your ideas and approaches with other students, even if you are not working as a group.*

## Assignment Background

This assignment is on vanilla policy gradients (VPG) methods and Proximal Policy Optimization (PPO).
You will be implementing the loss functions for vanilla policy gradients (VPG), running some experiments on it, and then implementing clipped-PPO policy gradients and its loss function.  The change for PPO is simple and it yields efficient-to-compute stable policy updates, making PPO one of the most widely used DeepRL algorithms today.


Goals:
- To understand policy gradient RL and to implement the relevant training losses, for both discrete and continuous action spaces
- To observe the sensitivity issues that plague vanilla policy gradients
- To understand and implement the PPO clipping objective and to observe how it addresses these issues

External resources:
- [Sutton's Book Chapter 13: Policy Gradients](http://incompleteideas.net/book/the-book-2nd.html)
- [Andrej Karpathy's post on RL in general and policy gradients specifically](http://karpathy.github.io/2016/05/31/rl/)
- [OpenAI's Spinning Up for coverage of policy gradients and PPO](https://spinningup.openai.com/en/latest/)
- [PPO paper](https://arxiv.org/pdf/1707.06347.pdf)
- [Matthew's StackOverflow Post on PPO](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl/50663200#50663200)

## Task 0: Preliminaries

### Dependencies

In addition to dependencies from past assignments, we will learn to use TensorBoard to view our experiment results. 
```bash
pip install tensorboard
```

If you want to experiment with LunarLander instead of Cartpole, you'll also need to install the box2d environment.
```bash
pip install 'gymnasium[box2d]'
```

### Debugging

You can include:  `import ipdb; ipdb.set_trace()` in your code and it will drop you to that point in the code, where you can interact with variables and test out expressions.  We recommend this as an effective method to debug the algorithms.

---

### Quick recap of policy gradients

The idea is that we create a **differentiable policy** $\pi$ to be optimized so as to yield actions that yield high return.  To optimize the policy, we generate samples in the environment and we use those to compute a "modulated gradient" usable for gradient ascent on the policy parameters.  The modulated gradient consists of two terms: (1) the bare policy gradient term $\text{log}(\pi_\theta(a_t | s_t))$,  and the (2) reward/advantage modulator term $A_t$.  Note that $a_t$ is the action that was actually chosen and sent to the environment.  In PyTorch, we implement this modulated gradient by multiplying the two terms together in the following loss function and then calling backward on it:
$$L^{PG}(\theta) = \text{log}(\pi_\theta(a_t | s_t)) * A_t$$

The policy gradient term by itself indicates the direction required to move the policy parameters to *make the action that we chose more probable*.  By itself, this does nothing useful if applied equally to all samples.  However, by multiplying this gradient by the advantage $A_t$, the full modulated gradient tells us how to move in the direction that makes good actions more probable and bad actions less probable.  When $A_t$ is large in absolute value, we should change the probability a lot. When $A_t$ is negative, we should make that action less likely.  This lets us use a non-differentiable reward signal to modulate the policy's gradient.
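
As a minimal sketch of the loss above (not the template code from `main.py`), the snippet below builds a toy batch with made-up logits and advantages, forms $\text{log}(\pi_\theta(a_t | s_t)) A_t$, and negates it so that a minimizing optimizer performs gradient ascent. The batch size, number of actions, and advantage values are all assumptions chosen for illustration.

```python
import torch
from torch.distributions.categorical import Categorical

torch.manual_seed(0)

# Hypothetical batch: 5 states, 3 discrete actions (stand-ins for network outputs).
logits = torch.randn(5, 3, requires_grad=True)
adv = torch.tensor([1.0, -0.5, 2.0, 0.0, -1.0])  # made-up advantages A_t

dist = Categorical(logits=logits)
actions = dist.sample()        # a_t: the actions actually taken
logp = dist.log_prob(actions)  # log pi_theta(a_t | s_t), shape (5,)

# Optimizers minimize, so negate the objective to perform gradient ascent.
loss_pi = -(logp * adv).mean()
loss_pi.backward()

print(logits.grad)  # the row whose advantage is 0 receives zero gradient
```

Samples with zero advantage contribute nothing to the update, while the sign of $A_t$ decides whether the chosen action's logit is pushed up or down.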

Here is a reference of a full vanilla policy gradient algorithm from OpenAI's Spinning Up resources.  This uses a critic value function $V$ trained to predict return.

![alt text](https://spinningup.openai.com/en/latest/_images/math/262538f3077a7be8ce89066abbab523575132996.svg)


## Task 1: Getting up to speed [14pts]
We have provided template code to get started.
For an overview, the files are: `models.py`, `pg_buffer.py`, `main.py`, `utils.py`.  You need only modify `main.py`, but you may modify the others if you so choose.
- `models.py` has the implementation of the networks and the action distributions we will use
- `pg_buffer.py` has the implementation of the policy gradient buffer (similar to a replay buffer, but only for the most recent on-policy data)
- `main.py` has the (incomplete) implementation of the policy gradient losses and training loop
- `utils.py` utility (helper) functions

### 1.1 `models.py`

#### 1.a.  Read `models.py` [1pts]
Read through `models.py` and describe what the contained classes do and how they are used.  Include notes that also help support your own understanding and any questions you have.  If you find an answer to these questions later, you can write it here. Pay attention to the distributions used and how they are parameterized, and also what is different between data collection and optimization time.

<span style="color:blue"><b><u>Answer to 1.a.</u></b></span>:

<span style="color:blue">

`Network(torch.nn.Module)` class implements a neural net with two fully connected layers and a ReLU activation in between. This network serves as a <u>building block of both actor and critic networks</u>. Additionally, the network is initialized with $0.0$ weights and biases in the output layer to promote initial random exploration by the actor and a baseline of zero for the critic's value function.

`DiscreteActor(torch.nn.Module)` class <u>processes the observation</u> through a neural net to produce log probability of each action being taken, followed by <u>returning a Categorical distribution for sampling actions</u>. If a specific action is provided, it also returns the log probability of taking that action according to the current policy.

`GaussianActor(torch.nn.Module)` class <u>chooses N continuous actions by sampling from N parameterized independent Normal distributions</u>. It <u>maps observations to the means of the Normal distributions</u>. The standard deviation of the distribution is learnable but independent of the current observation, initialized to encourage exploration at the start. <u>Given an observation, the class outputs a Normal distribution for sampling actions</u>. If a specific action vector is provided, the <u>sum of the log probabilities</u> of its N components under the current policy is returned, which is the log of the joint probability density of that action vector (the components are independent).

`ActorCritic(torch.nn.Module)` class <u>constructs an independent value network as the critic network (value function)</u>, and <u>chooses an appropriate actor network</u> depending on whether the action space is discrete ($\to$ `DiscreteActor`) or continuous ($\to$ `GaussianActor`).
The `step` method performs a forward pass for both actor and critic networks without gradient computation (`torch.no_grad()`) to efficiently generate actions, their log probabilities, and state values for given observations (no backpropagation during rollouts phase).
The `act` method provides a simpler way to only return the action for a given observation.
</span>

---

#### 1.b.  Categorical distribution [1pts]
Imagine we have 4 possible actions {0, 1, 2, 3}, and our network outputs logits of `[-2.5, 0.9, 2.4, 3.7]`.  How does `Categorical` convert this into a valid probability distribution, i.e., having probabilities that sum to 1?  What mathematical function is used and what would be the probabilities returned in this case?

<span style="color:blue"><b><u>Answer to 1.b.</u></b></span>:

<span style="color:blue">

Apply a SoftMax function to the outputs. The probability returned is `tensor([0.0015, 0.0455, 0.2041, 0.7489])`.
</span>

---

#### 1.c. Gradient of Categorical distribution [3pts]

Continuing from the previous question, assume that we sample from that distribution such that we choose the action corresponding to index 2 (i.e., $a_t = 2$).  Now we want to compute the log prob gradient of this action.  What would be the value of this gradient with respect to all of the logit inputs? In other words, what is $\nabla_{\text{logits}} \text{log}(\pi(a_t))$ if $\pi$ is our Categorical?

You can solve this either by deriving the gradient on paper using your answer from 1.b. or by empirically computing it with code. In the latter case, you may use the pseudocode below, but you must write a mathematical expression for how the logit gradients are related to the probabilities of the Categorical (`c.probs`).

```python
logits = torch.nn.Parameter(torch.tensor([-2.5, 0.9, 2.4, 3.7]))  # imagine these came from the output of the network
c = Categorical(logits=logits)
a_t = torch.tensor(2)  # imagine this came from c.sample()
logp = c.log_prob(a_t)
logp.backward()
print(logits.grad)
```

<span style="color:blue"><b><u>Answer to 1.c.</u></b></span>:

<span style="color:blue">

The log probability of choosing action $a_t$ is:

$$\log(p(a_t)) = \log\left(\frac{\exp(l_{a_t})}{\sum_a \exp(l_a)}\right) = l_{a_t} - \log\left(\sum_a \exp(l_a)\right)$$

where $\sum_a \exp(l_a)$ is the sum of the exponentials of the logits of all possible actions $a$.

When taking the partial derivative of $\log(p(a_t))$ with respect to $l_{a_t}$, the first term gives $1$ and the second term gives $-\exp(l_{a_t})/\sum_a \exp(l_a)$, so $\partial\log(p(a_t))/\partial l_{a_t} = 1 - p(a_t)$: the gradient with respect to the logit of the chosen action is $1$ minus the probability of choosing $a_t$.

For any other action $a \neq a_t$, the first term gives $0$ and the second term gives $-\exp(l_a)/\sum_{a'} \exp(l_{a'})$, so $\partial\log(p(a_t))/\partial l_a = -p(a)$: the gradient with respect to the logit of action $a$ is the negative of the probability of choosing $a$. In vector form, $\nabla_{l} \log(p(a_t)) = \text{onehot}(a_t) - p$.

Therefore, the calculated gradients would be `tensor([-0.0015, -0.0455,  0.7959, -0.7489])`
</span>
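
The closed form $\text{onehot}(a_t) - p$ can be checked against autograd. The sketch below reuses the logits and the action index from the question; the only addition is the comparison against the one-hot expression.

```python
import torch
from torch.distributions.categorical import Categorical

logits = torch.nn.Parameter(torch.tensor([-2.5, 0.9, 2.4, 3.7]))
c = Categorical(logits=logits)
a_t = torch.tensor(2)  # the chosen action index

# Autograd gradient of log pi(a_t) with respect to the logits.
c.log_prob(a_t).backward()

# Closed form: one_hot(a_t) - softmax(logits).
one_hot = torch.nn.functional.one_hot(a_t, num_classes=4).float()
expected = one_hot - c.probs.detach()

print(logits.grad)
print(expected)
```

Both printed tensors should agree up to floating-point error.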

---

#### 1.d. Gaussian actor [2pts]

Now imagine we have a continuous action space with 2 actions, and our network outputs a mu of `[0.0, 1.2]`.  Then assume we sampled from that distribution to get $a_t = [0.1, 1.0]$.  What is $\nabla_\mu \text{log}(\pi_\mu(a_t))$ if $\pi$ is our Normal?  Give the value for this case, and write a mathematical expression for the gradient value in general, as a function of $\mu$ and $a_t$.

<span style="color:blue"><b><u>Answer to 1.d.</u></b></span>:

<span style="color:blue">

In a continuous action space, when actions are modeled as coming from a Gaussian distribution, the policy $\pi_{\mu,\sigma}$, parametrized by $\mu$ and $\sigma$ for action $a_t$, can be described by the PDF of the Gaussian distribution as follows ($\pi_{\mu,\sigma}$ denotes the policy and $\pi$ denotes the real transcendental constant):

$$\log(\pi_{\mu,\sigma}(a_t))=\log(\frac{1}{\sqrt{2\pi\sigma^2}}\cdot\exp(-\frac{1}{2}(\frac{a_t-\mu}{\sigma})^2))=-\frac{1}{2}(\log(2\pi\sigma^2)+(\frac{a_t-\mu}{\sigma})^2)$$

$$\nabla_\mu \text{log}(\pi_{\mu,\sigma}(a_t))=\frac{\partial}{\partial\mu}\log(\pi_{\mu,\sigma}(a_t))=\frac{a_t-\mu}{\sigma^2}$$

Specifically, when $a_t=[0.1, 1.0]$ and $\mu=[0, 1.2]$, we have $\nabla_\mu \text{log}(\pi_{\mu,\sigma}(a_t))=\left[\frac{0.1}{\sigma_1^2}, \frac{-0.2}{\sigma_2^2}\right]$.
</span>
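
The expression $(a_t-\mu)/\sigma^2$ can also be confirmed numerically. The question does not specify $\sigma$, so the sketch below assumes $\sigma = 0.5$ for both action dimensions purely for illustration.

```python
import torch
from torch.distributions.normal import Normal

mu = torch.nn.Parameter(torch.tensor([0.0, 1.2]))
sigma = torch.tensor([0.5, 0.5])  # assumed value; the question leaves sigma unspecified
a_t = torch.tensor([0.1, 1.0])

# Summing over action dimensions gives the log-density of the full action vector.
logp = Normal(mu, sigma).log_prob(a_t).sum()
logp.backward()

expected = (a_t - mu.detach()) / sigma**2
print(mu.grad)   # approximately [0.4, -0.8] for sigma = 0.5
print(expected)
```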

---

#### 1.e. Meaning of these gradients [1pts]
For both continuous and discrete actions, what are these gradients telling us to do, in terms of the logits and the mus and the actions chosen?

<span style="color:blue"><b><u>Answer to 1.e.</u></b></span>:

<span style="color:blue">
The gradient indicates the direction and magnitude by which we should adjust the logits and $\mu$ to increase the log probability of the action that was sampled (increase a parameter if its gradient is positive, decrease it otherwise).
</span>

---

### 1.2 `pg_buffer.py`

This code implements a buffer used to store the data we accumulate so we can process it in a batch.
Notably, it also computes GAE-Lambda Advantages. To answer the questions below, you should first skim the GAE paper, including at least the abstract and Equation 1 with the different options for $\Psi$ (`psi`): https://arxiv.org/pdf/1506.02438.pdf.  

#### 1.f  Why use GAE-lambda? [1pts]
What is the main argument from the GAE paper about why one should use the Advantage function, (rather than sum of discounted future reward for example) for our choice of $A_t$?

<span style="color:blue"><b><u>Answer to 1.f.</u></b></span>:

<span style="color:blue">

The choice of advantage function yields almost the lowest possible variance. By definition, the advantage function $A^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)-V^{\pi}(s_t)$ measures <u>whether the action is better or worse than the policy's default behavior</u>, as a step in the policy gradient direction should <u>increase the probability of better-than-average actions</u> and decrease the probability of worse-than-average actions.
</span>

---

#### 1.g  Paper to code correspondence [1pts]
See the `finish_path` function.  In which line of the GAE algorithm (pg 8) would you call it? And which equation in the GAE paper does the `adv_buf` line (`pg_buffer.py:61`) correspond to?

<span style="color:blue"><b><u>Answer to 1.g.</u></b></span>:

<span style="color:blue">
The function covers lines 4 and 5 of the algorithm. The `adv_buf` line corresponds to $\hat{A}_t=\sum^{\infty}_{l=0}(\gamma\lambda)^l\delta^V_{t+l}$.
</span>
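
To make the GAE sum concrete, here is a small NumPy sketch of the computation for one trajectory segment. It is not the code from `pg_buffer.py`: the function names, the default `gamma` and `lam`, and the toy rewards and values are assumptions for illustration.

```python
import numpy as np

def discount_cumsum(x, discount):
    """Return y with y[t] = sum over l of discount**l * x[t + l]."""
    out = np.zeros(len(x))
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def gae_advantages(rews, vals, last_val, gamma=0.99, lam=0.97):
    """GAE-lambda advantages for one trajectory segment.

    last_val is the bootstrap value of the state after the final step
    (0 if the episode truly terminated).
    """
    rews = np.asarray(rews, dtype=float)
    vals = np.append(np.asarray(vals, dtype=float), last_val)
    deltas = rews + gamma * vals[1:] - vals[:-1]  # TD residuals delta_t^V
    return discount_cumsum(deltas, gamma * lam)

adv = gae_advantages(rews=[1.0, 1.0, 1.0], vals=[0.5, 0.5, 0.5],
                     last_val=0.0, gamma=1.0, lam=1.0)
print(adv)  # [2.5 1.5 0.5]
```

With $\gamma=\lambda=1$ the result reduces to the return-to-go minus the value estimate, and with $\lambda=0$ it reduces to the one-step TD residual.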

---

### 1.3 `main.py`

#### 1.h. Read `main.py` [2pts]

Read through the code and write down any notes that help your understanding of what is going on, as well as any questions you have.

<span style="color:blue"><b><u>Answer to 1.h.</u></b></span>:

<span style="color:blue">

`main.py` starts by creating an `ActorCritic` model and an experience buffer for data collection. It then defines the policy loss, which combines the advantages with the log probabilities (VPG or PPO). Note that the loss needs a leading '-' sign: the optimizer minimizes, whereas we want to maximize the advantage-weighted log probability so that better-than-average actions become more likely. Next, a function for the value loss is set up, using the mean squared error (MSE) between predicted values and actual returns to measure the accuracy of the value network. Adam optimizers are used for both the policy and value networks. `update` updates the policy and value networks sequentially from the collected experience and the computed losses.
Training runs for `epochs` epochs; in each epoch the agent executes `steps_per_epoch` rollout steps to collect data, and then `update` is invoked to update the policy and value networks in turn.
</span>

---

#### 1.i. Order of data collection and updating [1pts]
Note the order that we collect data and run optimization in.  How many steps do we collect data before running an update (w/ default args)?  Then how many optimization steps do we run in `update` by default?

<span style="color:blue"><b><u>Answer to 1.i.</u></b></span>:

<span style="color:blue">
We collect $1000$ environment steps before each update. In `update`, there are $4$ iterations to update the policy network $\pi$ and $40$ iterations to update the value network $V$.
</span>

---

#### 1.j. End of episode handling [1pts]
Describe how the episode terminals / timeouts are handled

<span style="color:blue"><b><u>Answer to 1.j.</u></b></span>:

<span style="color:blue">
If the episode ends due to a timeout or the end of the epoch, it might not have reached a natural terminal state. Therefore, the value of the current observation is estimated by running a single forward step of the ActorCritic networks on the latest observation. This estimate is used to bootstrap the value target for the final state, approximating the expected future rewards from this point onward.

If the episode reached a natural terminal state (not due to timeout or epoch ending), the value is set to 0, since there are no future rewards once a terminal state is reached. Then the episode length and return are saved and the env is reset.
</span>
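
The effect of the bootstrap value on the targets can be illustrated with a short pure-Python sketch. The helper name `rewards_to_go`, the rewards, the discount, and the critic estimate of `4.0` are all made up for this example and are not taken from the template code.

```python
def rewards_to_go(rews, last_val, gamma=0.99):
    """Discounted returns for one segment, bootstrapped with last_val."""
    ret = last_val
    out = []
    for r in reversed(rews):
        ret = r + gamma * ret
        out.append(ret)
    return out[::-1]

rews = [1.0, 1.0, 1.0]

# Natural terminal state: nothing follows, so bootstrap with 0.
print(rewards_to_go(rews, last_val=0.0, gamma=0.5))  # [1.75, 1.5, 1.0]

# Timeout or epoch cut-off: bootstrap with the critic's estimate (made-up value).
print(rewards_to_go(rews, last_val=4.0, gamma=0.5))  # [2.25, 2.5, 3.0]
```

Without the bootstrap, a cut-off trajectory would look as if all reward stopped at the cut, which would bias the value targets downward.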

---

In [None]:
import torch
from torch.distributions.categorical import Categorical

logits = torch.nn.Parameter(torch.tensor([-2.5, 0.9, 2.4, 3.7]))   # imagine these came from the output of the network
c = Categorical(logits=logits)
print(f"The probability distribution of the actions is {c.probs}.")

---

## Task 2: Implementing Policy Gradient Losses [10pts]

Now you will implement the vanilla policy gradient losses.  This includes the policy gradient loss $L^{PG}$ as well as a critic loss $L^{V}$, where the critic will be used to compute better advantages. You can reference any external sources you would like, but we suggest first trying to implement the losses without these.

$$L^{PG}(\theta) = \text{log}(\pi_\theta(a_t | s_t)) A_t$$

$$L^{V}(\phi) = (V_{\phi}(s_t) - R_t)^2$$

In this homework, choose between CartPole and LunarLander, although experiment with other environments if you are feeling adventurous.  We recommend LunarLander because it is fun and more challenging than CartPole, and good policies are generally quick to learn.  It takes around 10 minutes to reach interesting behavior on a decent computer, and should be fine for this homework.  However, if you find that it is taking too long to train, you can switch to CartPole.  LunarLander also has both discrete and continuous versions so you can try both modes.

- Fill in the TODOs in the `compute_loss_pi` and `compute_loss_v` functions.
- Run your code and make sure it is correct.

The figure below gives examples of the learning curves that you can expect to see with a correct implementation.  This is LunarLander-v2 performance run with the default arg settings.  Note that watching the losses is not as helpful as it is in supervised learning. Losses in RL can be deceiving.  They can be increasing while your policy is still improving a lot.  The reverse can also happen.  They are mostly good to watch as a sanity check and diagnostic. Also note that entropy is a less obvious, but very helpful metric to watch in RL, especially for discrete action spaces.  It should not stay at its maximum, nor should it drop very quickly; it should somewhat gradually decrease as shown in the figure.

![example curves](https://i.imgur.com/Ut7R1C9.png)
You might see something slightly different due to small differences in your implementation.  Command to run: `tensorboard --logdir=logs/`

<span style="color:blue"><b><u>Answer for Task 2</u></b></span>:

In [1]:
# ANSWERS for Task 2

# Copy your completed functions (or relevant sections) from main.py and paste them here
'''
if args.loss_mode == "vpg":
    # TODO (Task 2): implement vanilla policy gradient loss
    loss_pi = -(logp * psi).mean()
'''

'''
def compute_loss_v(batch):
    obs, ret = batch["obs"], batch["ret"]
    v = ac.v(obs)
    # TODO: (Task 2): compute value function loss
    loss_v = ((v - ret.unsqueeze(1)) ** 2).mean()
    return loss_v
'''

---

## Task 3: Experimenting with the code [11pts]
 
Once you verify your losses are correct by seeing that your policy starts learning, you will run some experiments.  For this, we have created several command line options that can be used to vary parameters, as well as a logging system that prints to stdout and logs scalars (and optionally gifs) to TensorBoard.  

#### 3.a.  REINFORCE vs. GAE-Lambda [3pts]

As the GAE paper discusses, there are many possible choices for the advantage term to use in the policy gradient.  One of the first ones imagined is the discounted future return (`future_return` in the code).  This choice leads to the REINFORCE algorithm, excerpted from the [Sutton book Chapter 13](http://incompleteideas.net/book/the-book-2nd.html) for your reference (where $G$ is the discounted future return):

![REINFORCE](https://i.imgur.com/WzyIzgg.png)
You will compare REINFORCE advantage (discounted return) to GAE lambda advantage.  Before you run the experiment, write down what you think will happen.  Why might REINFORCE do better or why might GAE-Lambda do better for this environment? Then run the two following experiments and measure the difference.  You should run them for at least 100 epochs, and maybe more if you want.  Then write down what happened and include a TensorBoard screenshot with both the results.

```
python3 main.py --psi_mode=future_return --prefix=logs/3a/ --epochs=100  # you can make these longer if you want
```

```
python3 main.py --psi_mode=gae --prefix=logs/3a/ --epochs=100
```

In [2]:
# ANSWERS for Task 3.a

# Describe your predictions. Why might REINFORCE do better or why might GAE-Lambda do better for this environment? 
# Write down what actually happened
# Include any screenshots, logs, etc

<span style="color:blue"><b><u>Answer to 3.a.</u></b></span>:

<span style="color:blue">

<b>Prediction</b>: GAE-$\lambda$ will do better for this environment compared to REINFORCE in $100$ epochs. GAE-$\lambda$ computes the advantage function, which measures how much better an action is compared to the policy's average action. This can lead to more precise policy updates compared to simply using future returns. Also, by effectively reducing variance in the policy gradient estimate while not overly increasing bias, GAE-$\lambda$ can lead to <u>more accurate policy updates and faster convergence to a good policy</u>. REINFORCE, on the other hand, can <u>have high variance and lead to noisy updates</u>.

<b>Tensorboard results</b> (GAE in <span style="color:purple">purple</span> and REINFORCE in <span style="color:green">green</span>):
</span>

![tensorboard_out3a.png](https://www.cs.ubc.ca/~yuweiyin/joey/ubc_course/533V-2023W2/assets/A3_images/tensorboard_out3a.png)

<span style="color:blue">
From the TensorBoard plots, by the end of 100 epochs, the episode return of GAE-$\lambda$ is higher than that of REINFORCE, which indicates that GAE converges faster and performs better. Also, the KL divergence of GAE-$\lambda$ is stable after 20 epochs, whereas REINFORCE shows several noticeable spikes. Sudden spikes in KL divergence could indicate that the model is forgetting previously learned behaviors (catastrophic forgetting). A stable KL divergence helps ensure that the agent retains its learned behavior over time; it also suggests a balance between exploration and exploitation and leads to more predictable and smoother learning curves.
</span>

#### 3.b.  Running with different numbers of policy training steps / vanilla policy gradient failure [3pts]

One issue of vanilla policy gradient methods is that they are fairly unstable.
In general one cannot run too many update steps for the most recent data because the policy will then overfit to that local data.
It will update too much based on that local gradient estimate and it will eventually cause more harm than good.
Once this happens, it is very difficult to recover.

This is a well-known issue that motivated the development of TRPO and PPO, and you are going to test this issue for yourself. By default, our code only runs 4 policy iterations during each update phase. What happens if you try to run more?  Try the following experiments, include a screenshot and write some thoughts you had about this.  Anything expected or unexpected?  (Note you will rerun these experiments with PPO in a minute)

```
python3 main.py --prefix=logs/3b/ --train_pi_iters=4  --epochs=150 # you can just also keep your results from part a 
```

```
python3 main.py --prefix=logs/3b/ --train_pi_iters=10 --epochs=150  
```

```
python3 main.py --prefix=logs/3b/ --train_pi_iters=20 --epochs=150
```

In [3]:
# ANSWERS for Task 3.b

# Describe anything expected or unexpected in the experiment
# Include any screenshots, logs, etc

<span style="color:blue"><b><u>Answer to 3.b.</u></b></span>:

<span style="color:blue">

<b>Tensorboard results</b> (4 policy iterations in <span style="color:gray">gray</span>, 10 policy iterations in <span style="color:cyan">cyan</span>, and 20 policy iterations in <span style="color:magenta">magenta</span>):
</span>

![tensorboard_out3b.png](https://www.cs.ubc.ca/~yuweiyin/joey/ubc_course/533V-2023W2/assets/A3_images/tensorboard_out3b.png)

<span style="color:blue">

We can see from TensorBoard that <u>as the number of VPG training iterations goes up, the losses and KL divergence start to go wild</u>. This is <u>a well-known problem of VPG: empirically it often leads to destructively large policy updates</u>. Without a mechanism to control the size of updates, VPG can make excessively large updates to the policy. Large updates can destabilize the learning process, leading to divergence or catastrophic drops in performance, and can cause the policy to overshoot good solutions or enter regions of the policy space that lead to poor performance.
</span>
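
A common way to monitor how far the policy has moved during an update is a sample-based KL estimate computed from stored log probabilities, `(logp_old - logp).mean()`. Whether `main.py` logs KL exactly this way is an assumption; the sketch below only illustrates the estimator on made-up logits and compares it with the exact KL, which is available here because both distributions are known.

```python
import torch
from torch.distributions import kl_divergence
from torch.distributions.categorical import Categorical

torch.manual_seed(0)

# Hypothetical logits before and after several policy updates, for 3 states.
old_logits = torch.tensor([[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]])
new_logits = old_logits + torch.tensor([[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]])
old_pi = Categorical(logits=old_logits)
new_pi = Categorical(logits=new_logits)

# Actions must come from the OLD policy, as they do in the rollout buffer.
actions = old_pi.sample((5000,))  # shape (5000, 3)
logp_old = old_pi.log_prob(actions)
logp = new_pi.log_prob(actions)

approx_kl = (logp_old - logp).mean()         # sample-based estimate
exact_kl = kl_divergence(old_pi, new_pi).mean()  # closed form, for comparison
print(approx_kl.item(), exact_kl.item())
```

When this estimate grows quickly across policy iterations, the new policy is drifting far from the one that collected the data, which matches the instability seen in the plots.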

#### 3.c.  Design your  own experiment(s) [5pts]

NOTE: you can defer this until after implementing the PPO loss if you wish

Now you get to design your own experiments.  Perhaps you are curious about how the learning rate affects things or how a different network would work.  This is your chance to experiment with whatever you think would be interesting to try.
You are free to make any modifications to the code that you would like to run your experiments.

Here is a further list of ideas:
- network arch, activations
- learning rates
- implementing other $\psi$ versions (e.g., #6)
- continuous environment.  comparing how LunarLander does against LunarLanderContinuous
- effect of gamma parameter
- effect of lambda parameter
- how much better performance is if we act with the mean instead of sampling (deterministic forward pass, for evaluation)
- how different random seeds compare
- anything else that you think of

Describe what you are testing, and your predictions about what is going to happen.
Then run the experiment and report the results, including screenshots.

In [4]:
# ANSWERS for Task 3.c

# Describe what you are testing and your predictions
# Include any screenshots, logs, etc

<span style="color:blue"><b><u>Answer to 3.c.</u></b></span>:

<span style="color:blue">

In the GAE-$\lambda$ paper, $\lambda$ is a weight parameter that controls the bias-variance trade-off in the advantage estimation. $\lambda$ close to $1$ gives high variance, while $\lambda$ close to $0$ introduces bias but gives low variance.  Empirically, the authors find that the best value of $\lambda$ is much lower than the best value of $\gamma$, likely because $\lambda$ introduces far less bias than $\gamma$ for a reasonably accurate value function. Therefore, we experiment with different values of $\lambda$ in the first part.

In our first experiment, we tried $\lambda=0.97, 0.95, 0.9, 0.8$ with a maximum of $200$ epochs for each run, and the results are shown in <span style="color:orange">orange</span>, <span style="color:gray">gray</span>, <span style="color:cyan">cyan</span>, <span style="color:lime">lime</span> respectively.
</span>

![tensorboard_out3c_lam.png](https://www.cs.ubc.ca/~yuweiyin/joey/ubc_course/533V-2023W2/assets/A3_images/tensorboard_out3c_lam.png)

<span style="color:blue">

From the results displayed above, it's clear that <u>a lower $\lambda$ diminished the performance</u>. This could be because a lower $\lambda$ makes the advantage estimates prioritize immediate rewards, resulting in <u>lower variance but higher bias</u>. Therefore, even though $\lambda$ itself does not control exploration mechanisms, the way it shapes the advantage function can influence policy updates. If the advantage function underestimates the long-term benefits of exploratory actions (because it is more biased towards immediate rewards), it might indirectly discourage exploration by making the policy update favor short-term gains over potentially better long-term strategies. As a result, using a low $\lambda$ hinders the performance of the model.

Then, we experimented with PPO in a continuous environment.
</span>

![tensorboard_out3c_cont.png](https://www.cs.ubc.ca/~yuweiyin/joey/ubc_course/533V-2023W2/assets/A3_images/tensorboard_out3c_cont.png)

<span style="color:blue">

It can be seen that PPO performs differently in continuous versus discrete action spaces, with <u>noted fluctuations in episode returns in continuous environments</u>. However, the convergence of the value loss and KL-divergence (note that the scale is $0.01$) suggests that <u>the critic component of the PPO algorithm</u>, which estimates the value function, <u>is stabilizing and becoming more accurate</u> in predicting the expected returns from states. This suggests <u>PPO is still effective in continuous settings</u>.
</span>

---

## Task 4: Trying out the PPO clipping objective [10pts]

The following are useful resources for understanding PPO:
- [OpenAI's Spinning Up for coverage of policy gradients and PPO](https://spinningup.openai.com/en/latest/)
- [PPO paper](https://arxiv.org/pdf/1707.06347.pdf)
- [Matthew's StackOverflow Post on PPO](https://stackoverflow.com/questions/46422845/what-is-the-way-to-understand-proximal-policy-optimization-algorithm-in-rl/50663200#50663200)


Now implement the PPO clipped loss objective in the `compute_loss_pi` function. It is a small fix (only a few lines) to our policy gradients implementation.  After you see that it is learning, by running the command below, you will then compare it to VPG.
```
python3 main.py --loss_mode=ppo
```

This would have been problematic before, but now the algorithm should stay fairly stable:
```
python3 main.py --loss_mode=ppo --prefix=logs/4/ --train_pi_iters=20 --epochs=150
```
vs.

```
python3 main.py --loss_mode=vpg --prefix=logs/4/ --train_pi_iters=20 --epochs=150
```


Record the results of what happened and consider including some screenshots.  You are free to run and include any other tests that you found interesting.  You can also try to further tune PPO and find hyperparameters that make it work better.

In [5]:
# ANSWERS for Task 4

# Copy your completed function (or relevant sections) here
# Include any screenshots, logs, etc
# Describe anything else you have tried

'''
elif args.loss_mode == "ppo":
    # TODO (Task 4): implement clipped PPO loss
    ratio = torch.exp(logp - logp_old)
    surr1 = ratio * psi
    surr2 = torch.clamp(ratio, min=1 - args.clip_ratio, max=1 + args.clip_ratio) * psi
    loss_pi = -(torch.minimum(surr1, surr2)).mean()
'''

<span style="color:blue"><b><u>Answer for Task 4</u></b></span>:

<span style="color:blue">

<b>Tensorboard results</b> (PPO in <span style="color:gray">gray</span> and VPG in <span style="color:cyan">cyan</span>):
</span>

![tensorboard_out4.png](https://www.cs.ubc.ca/~yuweiyin/joey/ubc_course/533V-2023W2/assets/A3_images/tensorboard_out4.png)

<span style="color:blue">

The TensorBoard plots reveal that <u>PPO is much more stable in policy/value losses and KL divergence</u> compared to VPG, and also has a <u>more consistent and steady increase in returns</u>. This <b>stability</b> shows that PPO's moderation of policy updates through clipping leads to a more controlled and predictable learning process. In contrast, VPG's lack of such a moderating mechanism results in greater fluctuations in losses and potentially erratic progress in returns.
</span>
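
To see why clipping limits the update, the toy sketch below evaluates the clipped objective on three hand-picked samples and inspects the gradients. The helper name `ppo_clip_loss`, the probabilities, the advantages, and `clip_ratio = 0.2` are assumptions for illustration; the loss itself mirrors the few lines pasted above.

```python
import torch

clip_ratio = 0.2  # assumed value of args.clip_ratio

def ppo_clip_loss(logp, logp_old, adv, clip_ratio):
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    return -torch.minimum(ratio * adv, clipped * adv).mean()

logp_old = torch.log(torch.tensor([0.5, 0.5, 0.5]))
# New probabilities 0.55, 0.75, 0.25 give ratios 1.1, 1.5, 0.5.
logp = torch.log(torch.tensor([0.55, 0.75, 0.25])).requires_grad_()
adv = torch.tensor([1.0, 1.0, -1.0])

ppo_clip_loss(logp, logp_old, adv, clip_ratio).backward()
print(logp.grad)  # only the first sample (ratio inside the clip range) gets a gradient
```

The second sample already raised a good action's probability beyond $1+\epsilon$, and the third already lowered a bad action's probability below $1-\epsilon$, so neither pushes the policy any further. This is what keeps 20 policy iterations on the same batch from running away.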

---

## Task 5: Optional

#### 5.1 Fully solving LunarLander

During initial testing, you likely did not fully solve LunarLander.  An optimal reward is about 300.  Your first bonus task is to adapt your implementation, as needed, to achieve this high reward.  This likely involves parameter tuning, implementing learning rate annealing, and maybe some observation normalization.
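
As one possible starting point for learning rate annealing, the sketch below linearly decays the learning rate with PyTorch's `LambdaLR` scheduler. The network, the initial learning rate of `3e-4`, and `total_epochs = 100` are placeholders and not values taken from the template code.

```python
import torch

total_epochs = 100              # hypothetical training length
model = torch.nn.Linear(8, 4)   # stand-in for the policy network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # lr is an assumption

# Linearly decay the learning rate from its initial value toward zero.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / total_epochs
)

lrs = []
for epoch in range(total_epochs):
    lrs.append(optimizer.param_groups[0]["lr"])
    # ... collect rollouts and run the policy/value updates here ...
    optimizer.step()
    scheduler.step()

print(lrs[0], lrs[50], lrs[-1])
```

Stepping the scheduler once per epoch, after the optimizer, keeps the decay tied to the number of data-collection rounds rather than to the number of gradient steps.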

#### 5.2 Implementing parallelized environments

A major bottleneck right now is that we are only collecting data with 1 agent at a time.  Most modern algorithm implementations use many parallel episodic evaluations to collect data.  This greatly speeds up training and is a practical necessity if you want to use these algorithms to solve new problems.  Your second bonus task is to implement parallelized environment data collection.  One fairly easy way to do this is to use Gymnasium's `AsyncVectorEnv`.  This runs N environments each in their own process.  To use it, you will have to make a few slight modifications to your data collection code to handle stacks of observations and actions.

Documentation (see tests for usage): https://gymnasium.farama.org/api/vector/#gymnasium.vector.AsyncVectorEnv

#### 5.3 New environments

Your third bonus task is to try solving the PyBullet environments (or Mujoco if you want to get a free license).  `HalfCheetah` is a good place to start as one of the easier control tasks.  See the [Bullet code here](https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/examples/enjoy_TF_HalfCheetahBulletEnv_v0_2017may.py) for how to make the bullet envs.


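```python
# Example environment usage (Gymnasium-style API)
import gymnasium as gym
import pybullet_envs  # registers the Bullet envs; it targets classic `gym`, so a compatibility shim may be needed

# Rendering is requested at construction time rather than via env.render(mode=...)
env = gym.make("HalfCheetahBulletEnv-v0", render_mode="human")

obs, info = env.reset()  # reset returns (observation, info)
while True:
    act = env.action_space.sample()  # random action
    # step returns five values: separate terminated / truncated flags
    obs, rew, terminated, truncated, info = env.step(act)
    if terminated or truncated:
        obs, info = env.reset()
```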

#### 5.4 Setting up MuJoCo

MuJoCo is free as of October 18, 2021 ([News](https://deepmind.com/blog/announcements/mujoco)). There are now official and open-source [Python bindings for MuJoCo](https://github.com/google-deepmind/mujoco/tree/main).  This bonus task is about installing and running MuJoCo on your machine.  MuJoCo is slightly faster than PyBullet, so you might consider using it for your projects.

# DONE