In [None]:
%%capture

%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl

In [None]:
%presentation_style

In [None]:
%%capture

%set_random_seed 12

In [None]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title">Training RL Agents </div>

# Training RL Agents

In this notebook we dive a bit deeper into RL by training agents on the pendulum environment
that we encountered before and analyzing the results.
We will look at reward shaping, stability of results, generalization to situations
not seen during training, and other issues related to real-world applications of RL.

We only consider model-free RL here, since model-based RL often requires domain-specific
algorithms and engineering. The openly available tools for model-based RL are also far less mature
than those for model-free RL.

In [None]:
%load_ext tensorboard

import os
from collections.abc import Sequence

import gymnasium as gym
from gymnasium.envs.classic_control import PendulumEnv
from gymnasium.wrappers import TimeLimit

from tianshou.env import ShmemVectorEnv
from tianshou.highlevel.config import SamplingConfig
from tianshou.highlevel.env import ContinuousEnvironments, EnvFactory
from tianshou.highlevel.experiment import (
    ExperimentConfig,
)
from training_rl.env_utils import demo_model


In [None]:
%tensorboard --logdir log --host localhost

## The vanilla pendulum

Let us start by simply using gym's pendulum as is and training a PPO agent (proximal policy optimization, an on-policy algorithm) on it.

In [None]:
from tianshou.highlevel.params.lr_scheduler import LRSchedulerFactoryLinear
from typing import Literal
import torch
from tianshou.highlevel.params.dist_fn import DistributionFunctionFactoryIndependentGaussians
from tianshou.highlevel.params.policy_params import PPOParams
from tianshou.highlevel.experiment import PPOExperimentBuilder


def train_ppo_agent(
    env_factory: EnvFactory,
    experiment_config: ExperimentConfig | None = None,
    buffer_size: int = 4096,
    hidden_sizes: Sequence[int] = (64, 64),
    lr: float = 3e-4,
    gamma: float = 0.99,
    epoch: int = 100,
    step_per_epoch: int = 30000,
    step_per_collect: int = 2048,
    repeat_per_collect: int = 10,
    batch_size: int = 64,
    training_num: int = 64,
    test_num: int = 10,
    rew_norm: bool = True,
    vf_coef: float = 0.25,
    ent_coef: float = 0.0,
    gae_lambda: float = 0.95,
    bound_action_method: Literal["clip", "tanh"] | None = "clip",
    lr_decay: bool = True,
    max_grad_norm: float = 0.5,
    eps_clip: float = 0.2,
    dual_clip: float | None = None,
    value_clip: bool = False,
    norm_adv: bool = False,
    recompute_adv: bool = True,
):
    experiment_config = experiment_config or ExperimentConfig()
    log_name = os.path.join("ppo", str(experiment_config.seed))

    sampling_config = SamplingConfig(
        num_epochs=epoch,
        step_per_epoch=step_per_epoch,
        batch_size=batch_size,
        num_train_envs=training_num,
        num_test_envs=test_num,
        buffer_size=buffer_size,
        step_per_collect=step_per_collect,
        repeat_per_collect=repeat_per_collect,
    )

    experiment = (
        PPOExperimentBuilder(env_factory, experiment_config, sampling_config)
        .with_ppo_params(
            PPOParams(
                discount_factor=gamma,
                gae_lambda=gae_lambda,
                action_bound_method=bound_action_method,
                reward_normalization=rew_norm,
                ent_coef=ent_coef,
                vf_coef=vf_coef,
                max_grad_norm=max_grad_norm,
                value_clip=value_clip,
                advantage_normalization=norm_adv,
                eps_clip=eps_clip,
                dual_clip=dual_clip,
                recompute_advantage=recompute_adv,
                lr=lr,
                lr_scheduler_factory=LRSchedulerFactoryLinear(sampling_config)
                if lr_decay
                else None,
                dist_fn=DistributionFunctionFactoryIndependentGaussians(),
            ),
        )
        .with_actor_factory_default(hidden_sizes, torch.nn.Tanh, continuous_unbounded=True)
        .with_critic_factory_default(hidden_sizes, torch.nn.Tanh)
        .build()
    )
    experiment_result = experiment.run(log_name)
    return experiment_result

In [None]:
def get_pendulum_env(render_mode: Literal["rgb_array"] | None = None):
    return TimeLimit(PendulumEnv(render_mode=render_mode), max_episode_steps=200)


class PendulumEnvFactory(EnvFactory):
    def create_envs(
        self, num_training_envs: int, num_test_envs: int
    ) -> ContinuousEnvironments:
        env = get_pendulum_env()
        train_envs = ShmemVectorEnv([get_pendulum_env] * num_training_envs)
        test_envs = ShmemVectorEnv([get_pendulum_env] * num_test_envs)
        return ContinuousEnvironments(
            env=env,
            train_envs=train_envs,
            test_envs=test_envs,
        )

In [None]:
exp_result = train_ppo_agent(
    PendulumEnvFactory(), epoch=1, step_per_epoch=20000, training_num=10, test_num=1
)
policy = exp_result.world.policy


In [None]:
pend_env = gym.make("Pendulum-v1", render_mode="rgb_array")

In [None]:
obs, info = pend_env.reset()

In [None]:
policy.compute_action(obs, info)

In [None]:
demo_model(pend_env, policy.compute_action, 200)

## Exercise 1

As you can see, the reward is not easily interpretable. It is composed of several quantities (have a look at the source
code of the environment, which is also provided below). How would you evaluate an agent's performance? Think about an
evaluation strategy and put it into code. It might include the average time needed until the pendulum is stabilized, the average
torque per unit time, the average angular distance travelled by the pendulum, or other metrics you find interesting.
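
As a possible starting point (not a reference solution), the metrics mentioned above can be computed from a recorded episode. The sketch below assumes you have logged the pendulum angle and the applied torque at every step; the function name `evaluate_trajectory`, the tolerance `upright_tol` and the definition of "stabilized" (the pendulum stays within the tolerance until the end of the episode) are illustrative assumptions.

```python
import numpy as np


def angle_normalize(x):
    """Map an angle in radians to the interval [-pi, pi)."""
    return ((x + np.pi) % (2 * np.pi)) - np.pi


def evaluate_trajectory(thetas, torques, dt=0.05, upright_tol=0.1):
    """Compute simple, interpretable metrics for one recorded episode.

    thetas: pendulum angles in radians (0 = upright), one per time step.
    torques: applied torques, one per time step.
    dt matches the environment's time step; upright_tol is an assumed tolerance.
    """
    thetas = np.asarray(thetas, dtype=float)
    torques = np.asarray(torques, dtype=float)
    upright = np.abs(angle_normalize(thetas)) < upright_tol

    # First step from which the pendulum stays upright until the episode ends.
    not_upright_steps = np.flatnonzero(~upright)
    first_stable_step = 0 if len(not_upright_steps) == 0 else int(not_upright_steps[-1]) + 1
    stabilized = first_stable_step < len(thetas)

    return {
        "time_to_stabilize": first_stable_step * dt if stabilized else None,
        "mean_abs_torque": float(np.mean(np.abs(torques))),
        # Sum of absolute angle changes between consecutive steps.
        "angular_distance": float(np.sum(np.abs(angle_normalize(np.diff(thetas))))),
    }


example = evaluate_trajectory(
    thetas=[np.pi, 2.0, 1.0, 0.05, 0.0, 0.0],
    torques=[2.0, 2.0, -1.0, 0.0, 0.0, 0.0],
)
print(example)
```

Averaging these dictionaries over many episodes with different seeds gives numbers that are easier to interpret than the raw episode return.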

## Sparse Rewards

With the vanilla pendulum, the agent receives rewards continuously based on the angle. The reward also
contains information about the torque. What if we tried training with sparse rewards, where we motivate
the agent to move the pendulum into a certain angle range as fast as possible and to keep it there?

Since gym environments are not modular, we cannot easily modify the reward. We could follow the strategies
outlined before to change the reward (and in a real project this would be the way to go). However, for
educational purposes we will instead copy-paste gym's source code and simply modify it according to our needs
prior to each experiment.
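
To make the difference concrete, here is a minimal sketch that compares the two reward signals for a given state. `dense_reward` follows the formula used by the vanilla pendulum; `sparse_reward` is one possible sparse variant that uses the same default target range as the custom environment. The function names and the choice of a penalty of -1 per step outside the range are assumptions made for illustration.

```python
import math


def angle_normalize(x):
    """Map an angle in radians to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def dense_reward(theta, theta_dot, torque):
    """Reward of the vanilla pendulum: negative quadratic cost."""
    return -(angle_normalize(theta) ** 2 + 0.1 * theta_dot**2 + 0.001 * torque**2)


def sparse_reward(theta, target_angle_range=(-math.pi / 5, math.pi / 5)):
    """Sparse variant (assumption): 0 inside the target range, -1 everywhere else."""
    min_th, max_th = target_angle_range
    return 0.0 if min_th < angle_normalize(theta) < max_th else -1.0


# The dense reward changes smoothly with the angle, the sparse one only at the range boundary.
for theta in (math.pi, math.pi / 2, math.pi / 4, 0.1):
    print(f"theta={theta:.2f}  dense={dense_reward(theta, 0.0, 0.0):.3f}  sparse={sparse_reward(theta)}")
```

With the sparse signal the agent gets no hint about which direction improves the situation until it happens to reach the target range, which is what makes exploration harder.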

In [None]:
from os import path
from typing import Optional

import numpy as np

import gymnasium as gym
from gymnasium import spaces
from gymnasium.envs.classic_control import utils
from gymnasium.error import DependencyNotInstalled


DEFAULT_X = np.pi
DEFAULT_Y = 1.0


class CustomPendulumEnv(gym.Env):
    """
    ## Description

    The inverted pendulum swingup problem is based on the classic problem in control theory.
    The system consists of a pendulum attached at one end to a fixed point, and the other end being free.
    The pendulum starts in a random position and the goal is to apply torque on the free end to swing it
    into an upright position, with its center of gravity right above the fixed point.

    The diagram below specifies the coordinate system used for the implementation of the pendulum's
    dynamic equations.

    ![Pendulum Coordinate System](/_static/diagrams/pendulum.png)

    -  `x-y`: cartesian coordinates of the pendulum's end in meters.
    - `theta` : angle in radians.
    - `tau`: torque in `N m`. Defined as positive _counter-clockwise_.

    ## Action Space

    The action is a `ndarray` with shape `(1,)` representing the torque applied to free end of the pendulum.

    | Num | Action | Min  | Max |
    |-----|--------|------|-----|
    | 0   | Torque | -2.0 | 2.0 |


    ## Observation Space

    The observation is a `ndarray` with shape `(3,)` representing the x-y coordinates of the pendulum's free
    end and its angular velocity.

    | Num | Observation      | Min  | Max |
    |-----|------------------|------|-----|
    | 0   | x = cos(theta)   | -1.0 | 1.0 |
    | 1   | y = sin(theta)   | -1.0 | 1.0 |
    | 2   | Angular Velocity | -8.0 | 8.0 |

    ## Rewards

    The reward function is defined as:

    *r = -(theta<sup>2</sup> + 0.1 * theta_dt<sup>2</sup> + 0.001 * torque<sup>2</sup>)*

    where `$\theta$` is the pendulum's angle normalized between *[-pi, pi]* (with 0 being in the upright position).
    Based on the above equation, the minimum reward that can be obtained is
    *-(pi<sup>2</sup> + 0.1 * 8<sup>2</sup> + 0.001 * 2<sup>2</sup>) = -16.2736044*,
    while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).

    ## Starting State

    The starting state is a random angle in *[-pi, pi]* and a random angular velocity in *[-1,1]*.

    ## Episode Truncation

    The episode truncates at 200 time steps.

    ## Arguments

    - `g`: acceleration of gravity measured in *(m s<sup>-2</sup>)* used to calculate the pendulum dynamics.
      The default value is g = 10.0 .

    ```python
    import gymnasium as gym
    gym.make('Pendulum-v1', g=9.81)
    ```

    On reset, the `options` parameter allows the user to change the bounds used to determine
    the new random state.

    ## Version History

    * v1: Simplify the math equations, no difference in behavior.
    * v0: Initial versions release (1.0.0)

    """

    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }

    def __init__(self, render_mode: Optional[str] = None, g=10.0, target_angle_range=(-np.pi / 5, np.pi / 5)):
        self.target_angle_range = target_angle_range
        self.max_speed = 8
        self.max_torque = 2.0
        self.dt = 0.05
        self.g = g
        self.m = 1.0
        self.l = 1.0

        self.render_mode = render_mode

        self.screen_dim = 500
        self.screen = None
        self.clock = None
        self.isopen = True

        high = np.array([1.0, 1.0, self.max_speed], dtype=np.float32)
        # This will throw a warning in tests/envs/test_envs in utils/env_checker.py as the space is not symmetric
        #   or normalised as max_torque == 2 by default. Ignoring the issue here as the default settings are too old
        #   to update to follow the gymnasium api
        self.action_space = spaces.Box(
            low=-self.max_torque, high=self.max_torque, shape=(1,), dtype=np.float32
        )
        self.observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)

    def step(self, u):
        th, thdot = self.state  # th := theta

        g = self.g
        m = self.m
        l = self.l
        dt = self.dt

        u = np.clip(u, -self.max_torque, self.max_torque)[0]
        self.last_u = u  # for rendering
        costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

        newthdot = thdot + (3 * g / (2 * l) * np.sin(th) + 3.0 / (m * l**2) * u) * dt
        newthdot = np.clip(newthdot, -self.max_speed, self.max_speed)
        newth = th + newthdot * dt

        self.state = np.array([newth, newthdot])

        if self.render_mode == "human":
            self.render()
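        min_th, max_th = self.target_angle_range
        # Sparse reward: no cost inside the target angle range, a cost of 1 per step outside it.
        angle_cost = 0 if min_th < angle_normalize(th) < max_th else 1
        return self._get_obs(), -float(angle_cost), False, False, {}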

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        if options is None:
            high = np.array([DEFAULT_X, DEFAULT_Y])
        else:
            # Note that if you use custom reset bounds, it may lead to out-of-bound
            # state/observations.
            x = options.get("x_init") if "x_init" in options else DEFAULT_X
            y = options.get("y_init") if "y_init" in options else DEFAULT_Y
            x = utils.verify_number_and_cast(x)
            y = utils.verify_number_and_cast(y)
            high = np.array([x, y])
        low = -high  # We enforce symmetric limits.
        self.state = self.np_random.uniform(low=low, high=high)
        self.last_u = None

        if self.render_mode == "human":
            self.render()
        return self._get_obs(), {}

    def _get_obs(self):
        theta, thetadot = self.state
        return np.array([np.cos(theta), np.sin(theta), thetadot], dtype=np.float32)

    def render(self):
        if self.render_mode is None:
            assert self.spec is not None
            gym.logger.warn(
                "You are calling render method without specifying any render mode. "
                "You can specify the render_mode at initialization, "
                f'e.g. gym.make("{self.spec.id}", render_mode="rgb_array")'
            )
            return

        try:
            import pygame
            from pygame import gfxdraw
        except ImportError as e:
            raise DependencyNotInstalled(
                "pygame is not installed, run `pip install gymnasium[classic-control]`"
            ) from e

        if self.screen is None:
            pygame.init()
            if self.render_mode == "human":
                pygame.display.init()
                self.screen = pygame.display.set_mode(
                    (self.screen_dim, self.screen_dim)
                )
            else:  # mode in "rgb_array"
                self.screen = pygame.Surface((self.screen_dim, self.screen_dim))
        if self.clock is None:
            self.clock = pygame.time.Clock()

        self.surf = pygame.Surface((self.screen_dim, self.screen_dim))
        self.surf.fill((255, 255, 255))

        bound = 2.2
        scale = self.screen_dim / (bound * 2)
        offset = self.screen_dim // 2

        rod_length = 1 * scale
        rod_width = 0.2 * scale
        l, r, t, b = 0, rod_length, rod_width / 2, -rod_width / 2
        coords = [(l, b), (l, t), (r, t), (r, b)]
        transformed_coords = []
        for c in coords:
            c = pygame.math.Vector2(c).rotate_rad(self.state[0] + np.pi / 2)
            c = (c[0] + offset, c[1] + offset)
            transformed_coords.append(c)
        gfxdraw.aapolygon(self.surf, transformed_coords, (204, 77, 77))
        gfxdraw.filled_polygon(self.surf, transformed_coords, (204, 77, 77))

        gfxdraw.aacircle(self.surf, offset, offset, int(rod_width / 2), (204, 77, 77))
        gfxdraw.filled_circle(
            self.surf, offset, offset, int(rod_width / 2), (204, 77, 77)
        )

        rod_end = (rod_length, 0)
        rod_end = pygame.math.Vector2(rod_end).rotate_rad(self.state[0] + np.pi / 2)
        rod_end = (int(rod_end[0] + offset), int(rod_end[1] + offset))
        gfxdraw.aacircle(
            self.surf, rod_end[0], rod_end[1], int(rod_width / 2), (204, 77, 77)
        )
        gfxdraw.filled_circle(
            self.surf, rod_end[0], rod_end[1], int(rod_width / 2), (204, 77, 77)
        )

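        # `__file__` is not defined inside a notebook, so locate the asset next to
        # gymnasium's classic_control modules instead.
        fname = path.join(path.dirname(utils.__file__), "assets/clockwise.png")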
        img = pygame.image.load(fname)
        if self.last_u is not None:
            scale_img = pygame.transform.smoothscale(
                img,
                (scale * np.abs(self.last_u) / 2, scale * np.abs(self.last_u) / 2),
            )
            is_flip = bool(self.last_u > 0)
            scale_img = pygame.transform.flip(scale_img, is_flip, True)
            self.surf.blit(
                scale_img,
                (
                    offset - scale_img.get_rect().centerx,
                    offset - scale_img.get_rect().centery,
                ),
            )

        # drawing axle
        gfxdraw.aacircle(self.surf, offset, offset, int(0.05 * scale), (0, 0, 0))
        gfxdraw.filled_circle(self.surf, offset, offset, int(0.05 * scale), (0, 0, 0))

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        if self.render_mode == "human":
            pygame.event.pump()
            self.clock.tick(self.metadata["render_fps"])
            pygame.display.flip()

        else:  # mode == "rgb_array":
            return np.transpose(
                np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
            )

    def close(self):
        if self.screen is not None:
            import pygame

            pygame.display.quit()
            pygame.quit()
            self.isopen = False


def angle_normalize(x):
    return ((x + np.pi) % (2 * np.pi)) - np.pi


In [None]:
sparse_env = TimeLimit(CustomPendulumEnv(), max_episode_steps=100)

In [None]:
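# Note: PendulumEnvFactory builds the vanilla pendulum. To actually train on the sparse
# reward, use a factory that creates CustomPendulumEnv instead.
exp_result = train_ppo_agent(
    PendulumEnvFactory(), epoch=1, step_per_epoch=20000, training_num=10, test_num=1
)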

## Transferring to perturbed environments

The environment assumes a fixed mass. What if we were to apply the same agent to an environment with a different mass?
Note that planning algorithms à la MPC would have no problem with this at all: their performance would not
degrade as long as the mass is included in the dynamics model.

Not so for the "real RL" agent:

In [None]:
env = get_pendulum_env(render_mode="rgb_array")


demo_model(env, policy.compute_action, num_steps=400)

The pendulum is balanced upright but an excessive amount of torque is being applied constantly.
How could we improve this situation?
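
To see why the mass matters for a fixed policy, recall the velocity update in the environment's `step` method: the torque enters as `3 / (m * l**2) * u`, while the gravity term does not depend on the mass. The sketch below evaluates that update for two masses; the helper name `next_angular_velocity` is an assumption, the formula and default constants are taken from the environment source.

```python
import math


def next_angular_velocity(th, thdot, u, m=1.0, l=1.0, g=10.0, dt=0.05, max_speed=8.0):
    """One angular-velocity update, following the environment's step method."""
    new_thdot = thdot + (3 * g / (2 * l) * math.sin(th) + 3.0 / (m * l**2) * u) * dt
    # The environment clips the velocity to [-max_speed, max_speed].
    return max(-max_speed, min(max_speed, new_thdot))


torque = 2.0
velocity_by_mass = {
    mass: next_angular_velocity(th=0.0, thdot=0.0, u=torque, m=mass) for mass in (1.0, 2.0)
}
for mass, velocity in velocity_by_mass.items():
    print(f"m={mass}: angular velocity after one step = {velocity:.3f}")
```

Doubling the mass halves the effect of the same torque, so a policy that learned torques calibrated for one mass will systematically over- or under-actuate on another. A policy that never observes the mass has no way to correct for this.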

## Exercise 2

Try the following: randomize the pendulum's mass whenever an episode is reset and add the mass to the observations.
For that you will again follow the *bad practice* of modifying the environment by overriding methods directly.
Don't do this in a real project! Part of the reason for doing it here is to highlight how cumbersome and fragile such
a software design becomes.

## Exercise 3

There are many possibilities to extend the experiments done above. You could try:

1. Also changing `l` and `g` and adding them to the observation.
2. Normalizing all observations to lie between 0 and 1 (or at least between -1 and 1).
3. What if we could not observe the angular velocity? Remove the velocity from the observation.
   This makes the decision process partially observed and non-Markovian with respect to the observations.
   However, adding a single past observation is sufficient to restore the Markov property.
   Add a history of previous observations and actions to the environment. You can use
   gymnasium's `FrameStack` and `FlattenObservation` wrappers for that.
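
The claim in item 3 that one past observation is enough can be checked directly: the environment updates the angle as `newth = th + newthdot * dt`, so the angular velocity is the (wrapped) angle difference divided by `dt`. The following is a minimal sketch of that reconstruction; the helper names are assumptions and the code does not use any gymnasium wrapper.

```python
import math


def angle_normalize(x):
    """Map an angle in radians to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def observe_without_velocity(theta):
    """Partial observation: only the position of the pendulum's free end."""
    return (math.cos(theta), math.sin(theta))


def estimate_angular_velocity(prev_obs, obs, dt=0.05):
    """Recover the angular velocity from two consecutive partial observations."""
    prev_theta = math.atan2(prev_obs[1], prev_obs[0])
    theta = math.atan2(obs[1], obs[0])
    # Normalizing the difference handles the wrap-around at +/- pi.
    return angle_normalize(theta - prev_theta) / dt


# Simulate one angle update with a known velocity, crossing the wrap-around at pi.
dt = 0.05
theta, new_thetadot = 3.1, 2.0
new_theta = theta + new_thetadot * dt
estimate = estimate_angular_velocity(
    observe_without_velocity(theta), observe_without_velocity(new_theta), dt
)
print(f"true velocity: {new_thetadot}, estimate: {estimate:.6f}")
```

A policy network that receives the stacked observations can in principle learn this difference itself, which is why frame stacking is a common remedy for missing velocity information.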

<img src="_static/images/aai-institute-cover.svg" alt="Snow" style="width:100%;">
<div class="md-slide title">Thank you for the attention!</div>