# Extended pendulum

In this notebook we will dive a bit deeper into RL by training agents on the pendulum environment
that we encountered before and analyzing the results.
We will be concerned with questions like reward shaping, stability of results,
generalization to situations not seen during training, and other issues related to real-world applications of RL.

Here we will only look at model-free RL, since model-based RL often requires domain-specific
algorithms and engineering. Also, the openly available tools for model-based RL are far less mature
than those for model-free RL.

In [None]:
# Only needed on colab or on a fresh setup
# Won't work on Windows!
!pip install -r requirements.txt
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [None]:
import sys

# rendering envs directly in notebook is recommended only on colab
in_notebook = "google.colab" in sys.modules

In [None]:
%load_ext tensorboard

import torch

from gym.envs.classic_control import PendulumEnv
from gym.wrappers import TimeLimit
import numpy as np

from stable_baselines3 import SAC
from visualization import demo_model

# You might want to switch to a CUDA-enabled runtime for executing this notebook
print(f"CUDA available: {torch.cuda.is_available()}")

In [None]:
%tensorboard --logdir logs --host localhost

## The vanilla pendulum

Let us start by simply using gym's pendulum as is and training a Soft Actor-Critic (SAC) agent, an off-policy algorithm, on it.

In [None]:
def main(
    experiment_config: ExperimentConfig,
    task: str = "Ant-v3",
    buffer_size: int = 1000000,
    hidden_sizes: Sequence[int] = (256, 256),
    actor_lr: float = 1e-3,
    critic_lr: float = 1e-3,
    gamma: float = 0.99,
    tau: float = 0.005,
    alpha: float = 0.2,
    auto_alpha: bool = False,
    alpha_lr: float = 3e-4,
    start_timesteps: int = 10000,
    epoch: int = 200,
    step_per_epoch: int = 5000,
    step_per_collect: int = 1,
    update_per_step: int = 1,
    n_step: int = 1,
    batch_size: int = 256,
    training_num: int = 1,
    test_num: int = 10,
):
    now = datetime.datetime.now().strftime("%y%m%d-%H%M%S")
    log_name = os.path.join(task, "sac", str(experiment_config.seed), now)

    sampling_config = SamplingConfig(
        num_epochs=epoch,
        step_per_epoch=step_per_epoch,
        num_train_envs=training_num,
        num_test_envs=test_num,
        buffer_size=buffer_size,
        batch_size=batch_size,
        step_per_collect=step_per_collect,
        update_per_step=update_per_step,
        start_timesteps=start_timesteps,
        start_timesteps_random=True,
    )

    env_factory = MujocoEnvFactory(task, experiment_config.seed, sampling_config, obs_norm=False)

    experiment = (
        SACExperimentBuilder(env_factory, experiment_config, sampling_config)
        .with_sac_params(
            SACParams(
                tau=tau,
                gamma=gamma,
                alpha=AutoAlphaFactoryDefault(lr=alpha_lr) if auto_alpha else alpha,
                estimation_step=n_step,
                actor_lr=actor_lr,
                critic1_lr=critic_lr,
                critic2_lr=critic_lr,
            ),
        )
        .with_actor_factory_default(
            hidden_sizes,
            continuous_unbounded=True,
            continuous_conditioned_sigma=True,
        )
        .with_common_critic_factory_default(hidden_sizes)
        .build()
    )
    experiment.run(log_name)

In [None]:
env = TimeLimit(PendulumEnv(), max_episode_steps=100)

In [None]:
model = SAC("MlpPolicy", env, verbose=1, tensorboard_log="logs")
model_name = model.__class__.__name__.lower() + "_vanilla_pendulum"

In [None]:
# Let us have a quick glance at the parameters of this off-policy model
from inspect import getfullargspec

getfullargspec(SAC).annotations

In [None]:
# We will now train the model. You can view the training progress in tensorboard.
# This takes about 10 minutes on a laptop with a GPU
model.learn(total_timesteps=12000, log_interval=4, tb_log_name=model_name)
model.save(model_name)

In [None]:
# we load the saved model and look at how it performs on the environment
model = SAC.load(model_name)
demo_model(env, model, num_steps=800, in_notebook=in_notebook)

## Exercise 1

As you see, the reward is not easily interpretable. In fact, it is composed of different quantities (do have a look at the source
code of the environment, it is also provided below). How would you go about evaluating an agent's performance? Think about an
evaluation strategy and put it into code. It might contain the average time needed until the pendulum is stabilized, the average
torque per unit time, the average angular distance travelled by the pendulum or other metrics you find interesting.
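
As a starting point for such an evaluation, here is a minimal, dependency-free sketch that computes the three metrics mentioned above from one recorded episode. The function name `evaluate_trajectory`, the threshold `tol` and the definition of "stabilized" (the pendulum stays within `tol` of the upright position until the end of the episode) are assumptions made for illustration. It expects the unwrapped angles (e.g. the first entry of the environment's state) and the applied torques as non-empty lists.

```python
import math


def angle_normalize(x):
    """Map an angle to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def evaluate_trajectory(angles, torques, dt=0.05, tol=0.2):
    """Compute simple metrics for one recorded episode.

    angles[i] is the (unwrapped) angle at step i, torques[i] the applied torque.
    tol is a hypothetical threshold in radians for "close to upright".
    """
    deviation = [abs(angle_normalize(a)) for a in angles]

    # Walk backwards to find the first step from which the pendulum
    # stays within tol of the upright position until the episode ends
    stabilized_at = None
    for i in range(len(deviation) - 1, -1, -1):
        if deviation[i] > tol:
            break
        stabilized_at = i

    travelled = sum(abs(b - a) for a, b in zip(angles, angles[1:]))
    return {
        "time_to_stabilize": None if stabilized_at is None else stabilized_at * dt,
        "mean_abs_torque": sum(abs(u) for u in torques) / len(torques),
        "angular_distance": travelled,
    }
```

Averaging these dictionaries over many episodes with different initial states would give a more interpretable picture than the raw reward. Which threshold counts as "stabilized" is a design choice that you should adapt to your application.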

## Sparse Rewards

With the vanilla pendulum, the agent receives rewards continuously based on the angle and the angular velocity.
The reward also contains a small penalty on the torque. What if we tried training with sparse rewards, where we motivate
the agent to move the pendulum to a certain angle range as fast as possible and to leave it there?
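
To make the difference concrete, the following sketch compares the two cost signals without using gym. The dense cost is the expression from gym's pendulum source. The sparse cost and its default target range of (-pi/5, pi/5) mirror the modified environment used in this section. The function names `dense_cost` and `sparse_cost` are chosen here for illustration only.

```python
import math


def angle_normalize(x):
    """Map an angle to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def dense_cost(th, thdot, u):
    """Cost of the vanilla pendulum: angle, angular velocity and torque terms."""
    return angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * u ** 2


def sparse_cost(th, target_angle_range=(-math.pi / 5, math.pi / 5)):
    """Cost 0 inside the target angle range, 1 everywhere else."""
    min_th, max_th = target_angle_range
    return 0 if min_th < angle_normalize(th) < max_th else 1


# The dense cost shrinks gradually as the pendulum approaches the upright
# position (th = 0), while the sparse cost only switches between 1 and 0
for th in (math.pi, 1.0, 0.7, 0.5, 0.0):
    print(f"th={th:.2f}  dense={dense_cost(th, 0.0, 0.0):.3f}  sparse={sparse_cost(th)}")
```

With the sparse cost the agent gets no gradient-like hint about whether it is getting closer to the target range, so exploration has to stumble upon it first.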

Since gym environments are not modular, we cannot easily modify the reward. We could follow the strategies
outlined before to change the reward (and in a real project this would be the way to go). However, for
educational purposes we will instead copy-paste gym's source code and simply modify it according to our needs
prior to each experiment.

In [None]:
# Copy of gym's pendulum that we will modify

import gym
from gym import spaces
from gym.utils import seeding
import numpy as np


class CustomPendulumEnv(gym.Env):
    metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 30}

    def __init__(self, g=10.0, target_angle_range=(-np.pi / 5, np.pi / 5)):
        self.target_angle_range = target_angle_range
        self.max_speed = 8
        self.max_torque = 2.0
        self.dt = 0.05
        self.g = g
        self.m = 1.0
        self.l = 1.0
        self.viewer = None

        high = np.array([1.0, 1.0, self.max_speed], dtype=np.float32)
        self.action_space = spaces.Box(
            low=-self.max_torque, high=self.max_torque, shape=(1,), dtype=np.float32
        )
        self.observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)

        self.seed()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, u):
        th, thdot = self.state  # th := theta

        g = self.g
        m = self.m
        l = self.l
        dt = self.dt

        u = np.clip(u, -self.max_torque, self.max_torque)[0]
        self.last_u = u  # for rendering
        # costs = angle_normalize(th) ** 2 + 0.1 * thdot ** 2 + 0.001 * (u ** 2)

        min_th, max_th = self.target_angle_range
        angle_cost = 0 if min_th < angle_normalize(th) < max_th else 1
        costs = angle_cost

        newthdot = (
            thdot
            + (-3 * g / (2 * l) * np.sin(th + np.pi) + 3.0 / (m * l ** 2) * u) * dt
        )
        newth = th + newthdot * dt
        newthdot = np.clip(newthdot, -self.max_speed, self.max_speed)

        self.state = np.array([newth, newthdot])
        return self._get_obs(), -costs, False, {}

    def reset(self):
        high = np.array([np.pi, 1])
        self.state = self.np_random.uniform(low=-high, high=high)
        self.last_u = None
        return self._get_obs()

    def _get_obs(self):
        theta, thetadot = self.state
        return np.array([np.cos(theta), np.sin(theta), thetadot])

    def render(self, mode="human"):
        if self.viewer is None:
            from gym.envs.classic_control import rendering

            self.viewer = rendering.Viewer(500, 500)
            self.viewer.set_bounds(-2.2, 2.2, -2.2, 2.2)
            rod = rendering.make_capsule(1, 0.2)
            rod.set_color(0.8, 0.3, 0.3)
            self.pole_transform = rendering.Transform()
            rod.add_attr(self.pole_transform)
            self.viewer.add_geom(rod)
            axle = rendering.make_circle(0.05)
            axle.set_color(0, 0, 0)
            self.viewer.add_geom(axle)
            fname = "clockwise.png"
            self.img = rendering.Image(fname, 1.0, 1.0)
            self.imgtrans = rendering.Transform()
            self.img.add_attr(self.imgtrans)

        self.viewer.add_onetime(self.img)
        self.pole_transform.set_rotation(self.state[0] + np.pi / 2)
        if self.last_u:
            self.imgtrans.scale = (-self.last_u / 2, np.abs(self.last_u) / 2)

        return self.viewer.render(return_rgb_array=mode == "rgb_array")

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None


def angle_normalize(x):
    return ((x + np.pi) % (2 * np.pi)) - np.pi

In [None]:
sparse_env = TimeLimit(CustomPendulumEnv(), max_episode_steps=100)

In [None]:
model = SAC("MlpPolicy", sparse_env, verbose=1, tensorboard_log="logs")
model_name = model.__class__.__name__.lower() + "_sparse_pendulum"

In [None]:
# This takes about 10 minutes on a laptop with a GPU
model.learn(total_timesteps=12000, log_interval=4, tb_log_name=model_name)
model.save(model_name)

In [None]:
model = SAC.load(model_name)
demo_model(sparse_env, model, num_steps=800, in_notebook=in_notebook)

## Transferring to perturbed environments

The environment assumes a fixed mass. What if we were to apply the same agent to an environment with a different mass?
Note that planning algorithms à la MPC would have no problem with this at all: their performance would not
degrade as long as the mass is included in the dynamics model.

Not so for the "real RL" agent:

In [None]:
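# TimeLimit wraps the raw PendulumEnv, which is reachable via env.env.
# Note: if the cells are run in order, `model` is the agent loaded in the previous section.
env.env.m = 0.2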

demo_model(env, model, num_steps=400, in_notebook=in_notebook)

The pendulum is balanced upright but an excessive amount of torque is being applied constantly.
How could we improve this situation?
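
One way to see why the mass matters is to look at the dynamics used in the environment's `step` method: the torque enters the angular acceleration through the factor 3 / (m * l^2). The sketch below isolates that expression. The function name `angular_acceleration` is ours; the formula itself is the one from the pendulum source shown earlier.

```python
import math


def angular_acceleration(th, u, m=1.0, l=1.0, g=10.0):
    """Angular acceleration of the pendulum, as used in the env's step method."""
    return -3 * g / (2 * l) * math.sin(th + math.pi) + 3.0 / (m * l ** 2) * u


# Same torque on an upright pendulum (th = 0), where gravity contributes
# (numerically almost) nothing: only the torque term remains
for m in (1.0, 0.2):
    print(f"m={m}: acceleration={angular_acceleration(0.0, 1.0, m=m):.2f}")
```

For m = 0.2 the same torque produces five times the acceleration it produces for m = 1.0. An agent that has only ever seen m = 1.0 has no way of knowing this, which is one plausible reason why its behavior does not transfer well.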

We will try the following: we randomize the pendulum's mass at each episode reset and also add the mass to the observations.
For that we will again follow the *bad practice* of modifying the environment by overriding methods directly.
Don't do this in a real project! Part of the reason for doing it here is to highlight how cumbersome and fragile such
a software design becomes.

In [None]:
class VariableMassPendulum(PendulumEnv):
    def __init__(self, mass_range=(0.1, 1.5)):
        super().__init__()
        m_min, m_max = mass_range
        high = np.array([1.0, 1.0, self.max_speed, m_max], dtype=np.float32)
        low = -high
        # The mass is strictly positive, so its lower bound is m_min rather than -m_max
        low[-1] = m_min
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

        self.mass_range = mass_range
        self.m = np.random.uniform(*mass_range)

    def _get_obs(self):
        # Append the current mass to the standard (cos, sin, velocity) observation
        obs = list(super()._get_obs())
        obs.append(self.m)
        return np.array(obs, dtype=np.float32)

    def reset(self):
        # Sample a new mass at the start of every episode
        self.m = np.random.uniform(*self.mass_range)
        return super().reset()

In [None]:
var_mass_env = TimeLimit(VariableMassPendulum(), max_episode_steps=100)
model = SAC("MlpPolicy", var_mass_env, verbose=1, tensorboard_log="logs")
model_name = model.__class__.__name__.lower() + "_var_mass_pendulum"

In [None]:
# This takes about 12 minutes on a laptop with a GPU
model.learn(total_timesteps=15000, log_interval=4, tb_log_name=model_name)
model.save(model_name)

In [None]:
model = SAC.load(model_name)
demo_model(var_mass_env, model, num_steps=400, in_notebook=in_notebook)

Let us see how this new agent behaves with low masses. We can look into that by configuring a degenerate range so
that the same mass is always "sampled" at reset.

In [None]:
var_mass_env.env.mass_range = (0.3, 0.3)
demo_model(var_mass_env, model, num_steps=400, in_notebook=in_notebook)

## Exercise 2

There are many possibilities to extend the experiments done above. You could try:

    1. Also varying l and g and adding them to the observation.
    2. Normalizing all observations to lie between 0 and 1 (or at least between -1 and 1).
    3. What if we could not observe the angular velocity? Remove the velocity from the observation.
       This renders the decision process non-Markovian and partially observed.
       However, adding a single past observation is sufficient to restore the Markov property.
       Add a history of previous observations and actions to the environment. You can use
       gym's `FrameStack` and `FlattenObservation` wrappers for that.
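
To illustrate why a single past observation is enough in the third item, here is a small sketch that recovers the angular velocity from two consecutive (cos, sin) observations by a finite difference. The function name `estimate_velocity` is hypothetical, and the sketch assumes the time step `dt = 0.05` of the pendulum environment. Since the environment caps the speed at 8, the angle changes by at most 0.4 rad per step, so the wrap-around can be resolved unambiguously.

```python
import math


def angle_normalize(x):
    """Map an angle to the interval [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


def estimate_velocity(prev_obs, obs, dt=0.05):
    """Estimate the angular velocity from two consecutive (cos, sin) observations."""
    prev_th = math.atan2(prev_obs[1], prev_obs[0])
    th = math.atan2(obs[1], obs[0])
    # Normalizing the difference handles the wrap-around at +/- pi
    return angle_normalize(th - prev_th) / dt


# Example: the angle moves from 3.1 to 3.2 rad, crossing the wrap-around at pi
prev_obs = (math.cos(3.1), math.sin(3.1))
obs = (math.cos(3.2), math.sin(3.2))
print(estimate_velocity(prev_obs, obs))
```

An agent that receives stacked observations can in principle learn this relation on its own, which is why frame stacking is a common remedy for missing velocity information.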