# Exercise 6: Deep Reinforcement Learning (DRL)

In DRL, we do not require a symbolic model of our system, only a (highly parallelizable) simulation. The policy is learned directly from data collected while the simulated robot interacts with its environment. Because training needs a lot of data (typically millions of samples), the efficient simulation implemented by [crazyflow](https://github.com/utiasDSL/crazyflow) will come in handy.

In the lecture, you were introduced to the **on-policy DRL algorithm PPO**, which has become the most popular choice for robotics applications. In this exercise, you will use **PPO** to teach a drone to fly a figure-eight trajectory.

<div class="alert alert-info">
    <h3>Task 1: WandB</h3>
    <p>
    Go to <a href="https://wandb.ai/site" target="_blank">https://wandb.ai/site</a> and create a free WandB account. WandB is a widely used tool for tracking deep learning experiments, and we will use it to track our training in this exercise. WandB includes 100GB of free online storage, which is more than enough for our use case. Once you have created your account, execute the following code cell. Follow the instructions to obtain your API key and paste it as requested. If everything was successful, you should see a message similar to <code>wandb: Currently logged in as: ...</code>
    </p>
</div>

In [1]:
%load_ext autoreload
%autoreload 2

from pathlib import Path

import torch
import wandb
from ml_collections import ConfigDict
from ppo import PPOTrainer, set_seeds

wandb.login()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /home/vscode/.netrc
wandb: Currently logged in as: oliefr to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

<div class="alert alert-info">
    <h3>Task 2 (Optional): Setup GPU</h3>
    <p>
    This exercise benefits from running the training on a powerful <a href="https://developer.nvidia.com/cuda-gpus" target="_blank">CUDA-enabled GPU</a>. If you haven't done so yet, you can easily set up your container to run on the GPU. Consult the <code>README.md</code> for instructions.
    </p>
</div>

In [3]:
train_device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {train_device}")

Using device: cpu


**Note on devices (CPU / GPU)**: Two bottlenecks determine the speed of any DRL training: (a) the simulation (data collection) and (b) the learning of the agent (optimization). Each can run on its own device, i.e., CPU or GPU. Unfortunately, choosing a device for the simulation is not always straightforward, as the simulation can be faster on either device, depending on the specific simulation and its degree of parallelization. For reference, see [the example in MJX here](https://mujoco.readthedocs.io/en/stable/mjx.html#mjx-the-sharp-bits). If the simulation runs on the CPU, it can make sense to run the optimization on the CPU as well, as moving tensors from CPU to GPU takes time.

For crazyflow, we observe that simulation on the GPU is faster if you run more than ~32 environments in parallel (depending on the specific CPU and GPU). A typical number of parallel environments for DRL is, for instance, 4096. At this scale, the GPU provides significant speedups, as crazyflow is implemented so that the tensors stay on the GPU throughout the entire training process.



<span style="color:green">As we will execute the entirety of the training code using a central config object defined below, all implementations are contained in external `.py` files. Make use of the local testing feature to test your implementations for the below tasks to make sure the training runs smoothly in the end!</span>

<div class="alert alert-info">
    <h3>Task 3: Deterministic Seeding for Reproducible PPO Training</h3>
    <p>
      In the provided <code>ppo.py</code>, implement the <code>set_seeds</code> function so that it ensures complete reproducibility across all sources of randomness.
    </p>
</div>

Specifically, your function must:

  - Set the Python built-in random module’s seed.

  - Seed NumPy’s random number generator.

  - Seed PyTorch’s CPU and all GPU (CUDA) random generators.

  - Configure `torch.backends.cudnn` for fully deterministic behavior by setting `deterministic = True` and `benchmark = False`.
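
The checklist above could be sketched roughly as follows. This is an illustration only, not the reference solution: the helper name `seed_everything` is hypothetical, and NumPy and PyTorch are imported lazily (an assumption made here) so that the snippet also runs where they are not installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every random number generator that is available (hypothetical helper)."""
    random.seed(seed)  # Python's built-in RNG

    try:
        import numpy as np

        np.random.seed(seed)  # NumPy's global (legacy) RNG
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch

        torch.manual_seed(seed)  # CPU generator
        torch.cuda.manual_seed_all(seed)  # all CUDA devices; ignored without a GPU
        torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False  # no non-deterministic autotuning
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(0)
first = [random.random() for _ in range(3)]
seed_everything(0)
second = [random.random() for _ in range(3)]
print(first == second)  # re-seeding with the same value reproduces the sequence
```

Calling the function once before constructing the environments and the agent is usually enough; the repeated call here only demonstrates the effect.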

<div class="alert alert-info">
    <h3>Task 4: Implement make_envs</h3>
    <p>
      Implement the <code>make_envs</code> function in <code>ppo.py</code> to construct two parallel Gymnasium vector environments.
    </p>
</div>

<div class="alert alert-success">
    <h3>Task 5: Exam Preparation: Normalization</h3>
    <p>
        Why do we need to normalize observations? (The reason is the same as for normalizing the inputs of any neural network.)
    </p>
</div>

The `save_model` function packages everything needed to resume training or run inference—namely the agent’s network weights, optimizer state, and the environment’s normalization statistics—into a single checkpoint file. This ensures you can stop and later restart training seamlessly, and that input normalization remains consistent at test time.

<div class="alert alert-info">
    <h3>Task 6: Implement save_model</h3>
    <p>
      Implement the <code>save_model</code> function in <code>ppo.py</code>.
    </p>
</div>

<div class="alert alert-info">
    <h3>Task 7: Synchronize Observation Normalization</h3>
    <p>
      Implement the <code>sync_envs(train_envs, eval_envs)</code> function to ensure that the evaluation environment uses the exact same observation normalization statistics as the training environment.
    </p>
</div>
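
To see why the statistics have to be shared, consider the toy sketch below. It assumes a scalar observation and a hypothetical `RunningStats` class; the wrappers used in `ppo.py` track per-dimension statistics and may expose them under different attribute names.

```python
import copy
import math


class RunningStats:
    """Running mean and variance of a scalar observation (Welford's algorithm)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self) -> float:
        # Fall back to unit variance before any data has been seen
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x: float, eps: float = 1e-8) -> float:
        return (x - self.mean) / math.sqrt(self.var + eps)


train_stats = RunningStats()
for obs in [10.0, 12.0, 14.0]:  # observations seen during training
    train_stats.update(obs)

eval_stats = RunningStats()  # fresh statistics: mean 0, variance 1
before_sync = eval_stats.normalize(12.0)

eval_stats = copy.deepcopy(train_stats)  # "sync": reuse the training statistics
after_sync = eval_stats.normalize(12.0)

print(before_sync, after_sync)  # the same raw observation maps to different inputs
```

Without the copy, the policy would receive inputs at evaluation time that are scaled differently from those it was trained on.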

<div class="alert alert-success">
    <h3>Task 8: Check Code and Exam Preparation</h3>
   <p>Have a look at the implementation of the <code>Agent</code> class in <code>agent.py</code> and answer the following questions:</p>
   <ul>
     <li>Network initialization: Why do you have to be careful with the way you initialize the policy network? Have a look at page 5 of <a href="https://arxiv.org/pdf/2006.05990">https://arxiv.org/pdf/2006.05990</a> to answer the question.</li>
     <li>In <code>ppo.py</code>, in <code>PPOTrainer.learn</code>, you can see an <code>entropy_loss</code> term in the computation of the overall loss (<code>loss = ...</code>). This entropy is computed in the <code>action_and_value</code> function in <code>agent.py</code> from the action probabilities. What would you expect to happen during training if you increase the weight of the entropy term in the loss function (<code>self.config.ent_coef</code>)? To answer this, you need to know what the entropy of a distribution represents. You can look it up <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">on Wikipedia</a>. Point 10 of <a href="https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/">this article</a> also helps!</li>
   </ul>
</div>
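
For some intuition about what entropy measures, the sketch below evaluates the closed-form differential entropy of a one-dimensional Gaussian, $H = \tfrac{1}{2}\ln(2\pi e\,\sigma^2)$. It assumes a Gaussian action distribution, which is common for continuous-control PPO but is not confirmed here for `agent.py`; the function name is hypothetical.

```python
import math


def gaussian_entropy(sigma: float) -> float:
    """Differential entropy (in nats) of a 1-D Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)


# A wider action distribution (larger sigma) has a higher entropy
for sigma in (0.1, 0.5, 1.0):
    print(f"sigma={sigma:.1f}  entropy={gaussian_entropy(sigma):+.3f}")
```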

<div class="alert alert-info">
    <h3>Task 9: Generalized Advantage Estimate</h3>
    <p>
      You already learned about the PPO loss in the lecture. Computing it requires so-called advantages. For an explanation, see Eqs. 8.39, 8.40, and 8.42 in the script. How the advantages are calculated is a critical design decision in DRL algorithms, as it heavily influences learning. A standard method is the Generalized Advantage Estimate (GAE).
    </p>
    <p>
      GAE is a bit more involved, and we have already implemented part of it. Your task is to complete the function. Head over to <code>ppo.py</code> and implement the <code>calculate_advantages</code> method of the <code>PPOTrainer</code> class. You will need to refer to one equation in the original GAE paper: <a href="https://arxiv.org/pdf/1506.02438">https://arxiv.org/pdf/1506.02438</a>.
    </p>
</div>
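
The backward recursion behind GAE can be tried out on a toy trajectory before touching the tensor code. The sketch below works on plain Python lists for a single environment. The function name, its arguments, and the convention that `dones[t]` marks an episode ending after step `t` are assumptions made for this example and may differ from `calculate_advantages` in `ppo.py`.

```python
def gae(rewards, values, dones, last_value, gamma=0.9, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (plain-Python sketch).

    dones[t] is True if the episode ended after step t, which cuts the bootstrap.
    """
    advantages = [0.0] * len(rewards)
    next_value, next_advantage = last_value, 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        advantages[t] = delta + gamma * lam * not_done * next_advantage
        next_value, next_advantage = values[t], advantages[t]
    # Value targets are the advantages plus the value baseline
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


advantages, returns = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, False, False],
    last_value=0.0,
    gamma=0.5,
    lam=1.0,
)
print(advantages)  # [1.75, 1.5, 1.0]
```

With all value estimates at zero and $\lambda = 1$, the advantages reduce to the discounted sums of future rewards, which makes the result easy to check by hand.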

<div class="alert alert-info">
    <h3>Task 10: PPO-Clip Policy Gradient Loss</h3>
    <p>
      In <code>ppo.py</code>, implement the <code>calculate_pg_loss</code> method of the <code>PPOTrainer</code> class to compute the policy gradient loss.
    </p>
</div>
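
As a reminder of how the clipping behaves, here is a minimal sketch of the PPO-Clip surrogate on plain Python lists. The function and argument names are hypothetical, and the actual `calculate_pg_loss` works on tensors and may be organized differently.

```python
import math


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.25):
    """Clipped surrogate loss, averaged over samples (plain-Python sketch)."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = min(max(ratio, 1.0 - clip_coef), 1.0 + clip_coef)
        # Maximizing min(r * A, clip(r) * A) equals minimizing max(-r * A, -clip(r) * A)
        losses.append(max(-adv * ratio, -adv * clipped_ratio))
    return sum(losses) / len(losses)


# Sample 1: ratio 1.5 with a positive advantage, so the gain is capped at 1.25 * A
# Sample 2: ratio 0.5 with a negative advantage, so the ratio is clipped at 0.75
loss = ppo_clip_loss(
    new_logprobs=[math.log(1.5), math.log(0.5)],
    old_logprobs=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
print(loss)  # -0.25
```

In both samples the policy has already moved further than the clip range allows in the direction the advantage rewards, so the clipped term is selected and the sample stops contributing gradient.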

### PPO Value Function Loss

#### Case 1: **Unclipped Value Loss**

When `if_clip = False`, the value loss is computed using the standard Mean Squared Error (MSE) between predicted values and target returns:

$$
\mathcal{L}_v = \frac{1}{2} \cdot \mathbb{E}_t \left[ \left( V_{\theta}(s_t) - \hat{R}_t \right)^2 \right]
$$

- $ V_{\theta}(s_t) $: Current value function prediction (`newvalue`)
- $ \hat{R}_t $: Target return at time step $ t $ (from GAE, i.e., `b_returns`)
- The factor $ \frac{1}{2} $ is conventional in MSE to simplify derivative expressions.

---

#### Case 2: **Clipped Value Loss**

When `if_clip = True`, PPO applies clipping to the value function update to prevent large deviations from the old value estimate.


##### **Step 1: Clipped Value Prediction**

$$
v_t^{\text{clip}} = V_{\theta_{\text{old}}}(s_t) + \text{clip}\left(V_{\theta}(s_t) - V_{\theta_{\text{old}}}(s_t),\; -\epsilon,\; +\epsilon \right)
$$

##### **Step 2: Loss Calculation**

$$
\begin{aligned}
\ell_{\text{unclip}} &= \left(V_{\theta}(s_t) - \hat{R}_t\right)^2 \\
\ell_{\text{clip}} &= \left(v_t^{\text{clip}} - \hat{R}_t\right)^2 \\
\mathcal{L}_v &= \frac{1}{2} \cdot \mathbb{E}_t \left[ \max\left( \ell_{\text{unclip}},\; \ell_{\text{clip}} \right) \right]
\end{aligned}
$$
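
The two cases can be checked numerically with a small sketch on plain Python lists. The function name and signature are hypothetical and only mirror the formulas above; `calculate_v_loss` in `ppo.py` operates on tensors.

```python
def value_loss(new_values, old_values, returns, clip_coef=0.25, clip=True):
    """Value loss from the two cases above, for plain Python lists (sketch)."""
    per_sample = []
    for v_new, v_old, ret in zip(new_values, old_values, returns):
        loss_unclipped = (v_new - ret) ** 2
        if clip:
            # Step 1: keep the new prediction within +/- clip_coef of the old one
            delta = min(max(v_new - v_old, -clip_coef), clip_coef)
            v_clipped = v_old + delta
            # Step 2: take the more pessimistic (larger) of the two squared errors
            loss_clipped = (v_clipped - ret) ** 2
            per_sample.append(max(loss_unclipped, loss_clipped))
        else:
            per_sample.append(loss_unclipped)
    return 0.5 * sum(per_sample) / len(per_sample)


print(value_loss([2.0], [0.0], [1.0]))  # 0.5
print(value_loss([1.0], [0.0], [1.0]))  # 0.28125
print(value_loss([1.0], [0.0], [1.0], clip=False))  # 0.0
```

The second call shows the effect of clipping: the new prediction matches the target exactly, yet the loss is non-zero because the prediction moved more than $\epsilon$ away from the old estimate.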

<div class="alert alert-info">
    <h3>Task 11: Value Loss</h3>
   <p> In <code>ppo.py</code>, complete the implementation of the <code>calculate_v_loss</code> method of the <code>PPOTrainer</code> class.
</p>
</div>

<div class="alert alert-success">
    <h3>Task 12: Check code</h3>
   <p> The <code>learn</code> method is the core of the <code>PPO</code> algorithm: it is responsible for training both the policy and value networks. The method performs multiple optimization steps using previously collected experience samples (such as observations, actions, log probabilities, and returns).</p>

   <p>Have a look at the <code>loss = ...</code> line in the <code>learn()</code> function, where all loss components are accumulated.</p>

   <p>After completing the previous tasks, you are encouraged to take a closer look at the overall workflow inside the <code>learn</code> method.
</p>
</div>

<div class="alert alert-success">
    <h3>Task 13: Train your Agent</h3>
    <p>
        Once you’ve finished the above tasks, run the following cell to train your PPO agent. Don't forget to commit and push your policy (<code>ppo_checkpoint_ex06.pt</code>) to ARTEMIS at the end!
    </p>
</div>

In [4]:
# ConfigDict allows us to use convenient dot-based property access: https://github.com/google/ml_collections
train_config = ConfigDict(
    {
        "n_envs": 1024,
        "device": train_device,
        "total_timesteps": 1_000_000,
        "learning_rate": 1.5e-3,
        "n_steps": 16,  # Number of steps per environment per policy rollout
        "gamma": 0.90,  # Discount factor
        "gae_lambda": 0.95,  # Lambda for generalized advantage estimation
        "n_minibatches": 16,  # Number of mini-batches
        "n_epochs": 15,
        "norm_adv": True,
        "clip_coef": 0.25,
        "clip_vloss": True,
        "ent_coef": 0.01,
        "vf_coef": 0.5,
        "max_grad_norm": 5.0,
        "target_kl": None,
        "seed": 0,
        "n_eval_envs": 64,
        "n_eval_steps": 1_000,
        "save_model": True,
        "eval_interval": 999_000,
        "lr_decay": True,
    }
)

set_seeds(train_config.seed)
trainer = PPOTrainer(train_config, wandb_log=True)
trainer.train()
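
To get a feeling for what this configuration implies, the derived quantities can be computed by hand. The sketch below assumes the common convention (used, for example, in CleanRL-style PPO implementations) that one rollout collects `n_envs * n_steps` samples; whether `PPOTrainer` derives its batch sizes in exactly this way is not confirmed here.

```python
n_envs = 1024
n_steps = 16
n_minibatches = 16
n_epochs = 15
total_timesteps = 1_000_000

batch_size = n_envs * n_steps  # samples collected per rollout
minibatch_size = batch_size // n_minibatches  # samples per gradient step
n_iterations = total_timesteps // batch_size  # rollout/update cycles
updates_per_iteration = n_epochs * n_minibatches  # gradient steps per rollout

print(batch_size, minibatch_size, n_iterations, updates_per_iteration)
# 16384 1024 61 240
```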


<div class="alert alert-success">
    <h3>Task 14: Test your Agent</h3>
    <p>
        Once training has finished, run inference using the cell below.
    </p>
</div>

In [None]:
from ppo import PPOTester

path = Path.cwd() / "ppo_checkpoint_ex06.pt"

# Set render to `True` to watch how the agent performs.
_ = PPOTester(seed=0, ckpt_path=path, n_episodes=10, render=True)