# Build a Gym Environment

This notebook is inspired by the Stable Baselines3 tutorial available at [https://github.com/araffin/rl-tutorial-jnrr19](https://github.com/araffin/rl-tutorial-jnrr19).


## Introduction

In this notebook, we will learn how to build a customized environment with **Gymnasium**.

### Links

Gymnasium GitHub: [https://github.com/Farama-Foundation/Gymnasium](https://github.com/Farama-Foundation/Gymnasium)

Gymnasium Documentation: [https://gymnasium.farama.org/index.html](https://gymnasium.farama.org/index.html#)

Stable Baselines3 GitHub: [https://github.com/DLR-RM/stable-baselines3](https://github.com/DLR-RM/stable-baselines3)

Stable Baselines3 Documentation: [https://stable-baselines3.readthedocs.io/en/master/](https://stable-baselines3.readthedocs.io/en/master/)

## Install Gymnasium and Stable Baselines3 Using Pip

In [1]:
!pip install gymnasium
!pip install renderlab  # for rendering
!pip install stable-baselines3[extra]



In [2]:
import gymnasium as gym
import renderlab
import stable_baselines3

print(gym.__version__)
print(stable_baselines3.__version__)




1.2.3
2.7.1


In [3]:
import numpy as np


def evaluate(env, policy, gamma=1., num_episodes=100):
    """
    Evaluate an RL agent
    :param env: (Env object) the Gymnasium environment
    :param policy: (object) any object exposing predict(obs) -> (action, state),
        e.g. a Stable Baselines3 model
    :param gamma: (float) the discount factor
    :param num_episodes: (int) number of episodes to evaluate the policy on
    :return: (float, float) mean discounted return over num_episodes and its standard error
    """
    all_episode_rewards = []
    for i in range(num_episodes): # iterate over the episodes
        episode_rewards = []
        done = False
        discounter = 1.
        obs, _ = env.reset()
        while not done: # iterate over the steps until termination
            action, _ = policy.predict(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_rewards.append(reward * discounter) # compute discounted reward
            discounter *= gamma

        all_episode_rewards.append(sum(episode_rewards))

    mean_episode_reward = np.mean(all_episode_rewards)
    std_episode_reward = np.std(all_episode_rewards) / np.sqrt(num_episodes - 1)
    print("Mean reward:", mean_episode_reward,
          "Std reward:", std_episode_reward,
          "Num episodes:", num_episodes)

    return mean_episode_reward, std_episode_reward
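
As a rough illustration of what `evaluate` computes, the sketch below reproduces its two ingredients in plain Python: the discounted return of a single episode and the standard error of the mean over episodes. The helper names `discounted_return` and `mean_and_standard_error` are hypothetical and are not part of the notebook.

```python
import math


def discounted_return(rewards, gamma=1.0):
    """Sum of rewards, each weighted by gamma**t (t = step index)."""
    total, discounter = 0.0, 1.0
    for reward in rewards:
        total += reward * discounter
        discounter *= gamma
    return total


def mean_and_standard_error(values):
    """Mean and standard error, matching np.std(values) / sqrt(n - 1)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n), as np.std does by default
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std / math.sqrt(n - 1)


# Two steps at -1 followed by a final step at 0, with gamma = 0.5
print(discounted_return([-1, -1, 0], gamma=0.5))
# Statistics over two hypothetical episode returns
print(mean_and_standard_error([0.0, -2.0]))
```

Dividing the population standard deviation by $\sqrt{n-1}$ gives the same number as dividing the sample standard deviation by $\sqrt{n}$, so the value printed as "Std reward" is the standard error of the mean rather than the spread of the individual returns.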

## The Minigolf Environment

The `Minigolf` environment models a simple problem in which the agent has to hit a ball on a green with a putter in order to reach the hole in the minimum number of strokes.

* The green is characterized by a **friction** coefficient $f$ that is sampled uniformly at random from the interval `[0.065, 0.196]` at the beginning of each episode and does not change during the episode.
* The **position** of the ball is represented by a one-dimensional variable $x_t$, the distance from the hole, that is initialized uniformly at random in the interval `[1, 20]`. The observation is the pair $s_t = (x_t,f)$.
* The **action** $a_t$ is the force applied to the putter and is bounded in the interval `[1e-5, 5]`. Before being applied, the action is perturbed by Gaussian noise, so that the actual action $u_t$ is given by:

$$
u_t = a_t + \epsilon \qquad \text{where} \qquad \epsilon \sim \mathcal{N}(0,\sigma^2),
$$
where $\sigma =0.1$. The movement of the ball is governed by the kinematic law:

$$
x_{t+1} = x_{t} - v_t \tau_t + \frac{1}{2} d \tau_t^2
$$

where:
* $v_t$ is the velocity computed as $v_t = u_t l$,
* $d$ is the deceleration computed as $d = \frac{5}{7} fg$,
* $\tau_t$ is the time interval computed as $\tau_t = \frac{v_t}{d}$.
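
Substituting $\tau_t = \frac{v_t}{d}$ into the kinematic law gives a closed form for the position after one stroke. The derivation below uses only the definitions above:

$$
\begin{aligned}
x_{t+1} &= x_t - v_t \frac{v_t}{d} + \frac{1}{2} d \left(\frac{v_t}{d}\right)^2 \\
        &= x_t - \frac{v_t^2}{d} + \frac{v_t^2}{2d} \\
        &= x_t - \frac{v_t^2}{2d}.
\end{aligned}
$$

Assuming the hole sits at position $0$ (an interpretation consistent with $x_t$ decreasing after each stroke), the ball covers the whole distance when $x_{t+1} \le 0$, that is, when $v_t^2 \ge 2 d x_t = \frac{10}{7} f g x_t$.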

The remaining constants are the putter length $l = 1$ and the gravitational acceleration $g=9.81$. The **episode** terminates when the ball either enters the hole or surpasses it without entering. The **reward** is `-1` for every stroke that leaves the ball short of the hole, `-100` if the ball surpasses the hole, and `0` for the stroke that puts the ball in the hole (the value used by the implementation in Exercise 1). To check whether the ball falls short of, enters, or surpasses the hole, use the following conditions:

\begin{align*}
&v_t < v_{\min} \implies \text{ball does not reach the hole} \\
&v_t > v_{\max} \implies \text{ball surpasses the hole} \\
&\text{otherwise} \implies \text{ball enters the hole}
\end{align*}

where

\begin{align*}
& v_{\min} = \sqrt{\frac{10}{7} f g x_t}, \\
& v_{\max} = \sqrt{\frac{g(2 h - \rho)^2}{2\rho} + v_{\min}^2},
\end{align*}
where $h = 0.1$ is the hole size and $\rho = 0.02135$ is the ball radius.
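
To make the three cases concrete, here is a minimal sketch that evaluates the speed limits for one state and classifies a few noise-free strokes. The function names `speed_limits` and `classify_putt` are hypothetical, and the action noise $\epsilon$ is deliberately left out so that the outcome is deterministic.

```python
import math

G = 9.81               # gravitational acceleration
HOLE_SIZE = 0.10       # h
BALL_RADIUS = 0.02135  # rho
PUTTER_LENGTH = 1.0    # l


def speed_limits(x, friction):
    """Return (v_min, v_max) for a ball at distance x on a green with the given friction."""
    v_min = math.sqrt(10 / 7 * friction * G * x)
    v_max = math.sqrt(G * (2 * HOLE_SIZE - BALL_RADIUS) ** 2 / (2 * BALL_RADIUS)
                      + v_min ** 2)
    return v_min, v_max


def classify_putt(x, friction, u):
    """Classify a stroke with applied action u (noise ignored in this sketch)."""
    v = PUTTER_LENGTH * u
    v_min, v_max = speed_limits(x, friction)
    if v < v_min:
        return "short"
    if v > v_max:
        return "overshoot"
    return "in hole"


v_min, v_max = speed_limits(x=10.0, friction=0.1)
print(f"v_min = {v_min:.2f}, v_max = {v_max:.2f}")
for u in (3.0, 4.0, 5.0):
    print(u, classify_putt(10.0, 0.1, u))
```

For $x_t = 10$ and $f = 0.1$ the admissible window is roughly $[3.74, 4.62]$, so the three actions above fall short, enter the hole, and overshoot, respectively.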


**References**

Penner, A. R. "The physics of putting." Canadian Journal of Physics 80.2 (2002): 83-96.

## Exercise 1

Complete the constructor `__init__` and the methods `reset` and `step` based on the environment description provided above.

In [4]:
import numpy as np
from gymnasium.spaces import Box

class Minigolf(gym.Env):
    """
    The Minigolf problem.

    """

    def __init__(self):
        super(Minigolf, self).__init__()

        # Constants
        self.min_pos, self.max_pos = 1.0, 20.0
        self.min_action, self.max_action = 1e-5, 5.0
        self.min_friction, self.max_friction = 0.065, 0.196
        self.putter_length = 1.0
        self.hole_size = 0.10
        self.sigma_noise = 0.1
        self.ball_radius = 0.02135
        self.g = 9.81


        # Instantiate the spaces (bounds are float32 to match the Box dtype)
        low = np.array([self.min_pos, self.min_friction], dtype=np.float32)
        high = np.array([self.max_pos, self.max_friction], dtype=np.float32)

        self.action_space = Box(low=self.min_action,
                                high=self.max_action,
                                shape=(1,),
                                dtype=np.float32)

        self.observation_space = Box(low=low,
                                     high=high,
                                     shape=(2,),
                                     dtype=np.float32)


    def step(self, action):

        # Retrieve the state components
        x, friction = self.state

        # Clip the action within the allowed range and extract a scalar
        # (the action may arrive as a Python number or as a 1-element array)
        action = np.clip(action, self.min_action, self.max_action)
        action = float(np.asarray(action).reshape(-1)[0])

        # Add Gaussian noise to the action
        uaction = action + self.np_random.normal(0, self.sigma_noise)

        # Compute the speed
        vt = self.putter_length * uaction

        # Compute the speed limits
        v_min = np.sqrt(10 / 7 * friction * self.g * x)
        v_max = np.sqrt((2 * self.hole_size - self.ball_radius) ** 2
                        * (self.g / (2 * self.ball_radius)) + v_min ** 2)

        # Compute the deceleration
        d = 5 / 7 * self.g * friction

        # Compute the time interval
        T = vt / d

        # Update the position
        xt_next = x - vt * T + 1 / 2 * d * T ** 2

        # Clip the position
        x = np.clip(xt_next, self.min_pos, self.max_pos)

        # Compute the reward and episode termination
        if vt > v_max:
            # The ball surpasses the hole
            terminated = True
            reward = -100
        elif vt < v_min:
            # The ball does not reach the hole
            terminated = False
            reward = -1
        else:
            # The ball enters the hole
            terminated = True
            reward = 0

        self.state = np.array([x, friction], dtype=np.float32)

        return self.state, reward, terminated, False, {}


    def reset(self, seed=None, options=None):

        # Seed self.np_random, the random number generator of the environment
        super().reset(seed=seed)

        # Random generation of initial position and friction
        x = self.np_random.uniform(self.min_pos, self.max_pos)
        friction = self.np_random.uniform(self.min_friction, self.max_friction)

        self.state = np.array([x, friction], dtype=np.float32)

        return self.state, {}

To instantiate the environment with `gym.make`, we first need to register it.

In [5]:
from gymnasium.envs.registration import register

register(
    id="Minigolf-v1",
    entry_point="__main__:Minigolf",
    max_episode_steps=20,
    reward_threshold=0,
)
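
Registering with `max_episode_steps=20` means that an environment created through `gym.make` is cut off after 20 steps even if the ball never reaches the hole: the episode then ends with `truncated=True` rather than `terminated=True`. The sketch below mimics that logic in plain Python under this assumption. The function `run_with_time_limit` is hypothetical and only illustrates the bookkeeping, it is not Gymnasium code.

```python
from itertools import repeat

MAX_EPISODE_STEPS = 20  # same value passed to register() above


def run_with_time_limit(transitions, max_steps=MAX_EPISODE_STEPS):
    """Consume (reward, terminated) pairs until termination or the step limit."""
    total_reward, steps = 0, 0
    terminated = truncated = False
    for reward, terminated in transitions:
        steps += 1
        total_reward += reward
        # The time limit fires once the step budget is exhausted
        truncated = steps >= max_steps
        if terminated or truncated:
            break
    return total_reward, steps, terminated, truncated


# A ball that never reaches the hole: -1 per step until truncation
print(run_with_time_limit(repeat((-1, False))))
# A ball that is holed on the second stroke
print(run_with_time_limit([(-1, False), (0, True)]))
```

This is why the `evaluate` loop combines both flags with `done = terminated or truncated`: without the truncation flag, a policy that never reaches the hole would loop forever.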

### Validate the environment

Stable Baselines3 provides a [helper](https://stable-baselines3.readthedocs.io/en/master/common/env_checker.html) to check that our environment complies with the Gym interface.

In [6]:
from stable_baselines3.common.env_checker import check_env

env = Minigolf()

# If the environment does not follow the interface, an error will be thrown
check_env(env, warn=True)




## Evaluate some simple Policies

* **Do-nothing policy**: a policy that always plays the zero action (which the environment clips to the minimum action `1e-5`).

$$
\pi(s) = 0
$$


* **Max-action policy**: a policy that always plays the maximum available action (the $+\infty$ below is clipped by the environment to the maximum action `5`).

$$
\pi(s) = +\infty
$$


* **Zero-mean Gaussian policy**: a policy that samples the action from a Gaussian distribution with zero mean and variance $\sigma^2=1$.

$$
\pi(a|s) = \mathcal{N}(0,\sigma^2)
$$
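
Because the environment clips every action to `[1e-5, 5]`, the first two policies effectively play the two ends of the action range. The following sketch shows the clipped values and the distance the ball would travel, $\frac{v_t^2}{2d}$, on a green with an arbitrarily chosen friction of `0.1`. The helpers `clip_action` and `displacement` are hypothetical, and the action noise is ignored.

```python
import math

MIN_ACTION, MAX_ACTION = 1e-5, 5.0
G = 9.81


def clip_action(a):
    """Clip a scalar action to the allowed range, as the environment does."""
    return min(max(a, MIN_ACTION), MAX_ACTION)


def displacement(u, friction):
    """Distance travelled for applied action u: v^2 / (2 d), with l = 1."""
    d = 5 / 7 * friction * G
    return u ** 2 / (2 * d)


for name, raw_action in [("do-nothing", 0.0), ("max-action", math.inf)]:
    u = clip_action(raw_action)
    print(name, u, displacement(u, friction=0.1))
```

Without noise the do-nothing policy barely moves the ball, while the max-action policy sends it almost 18 units forward on this green, which is far too much when the ball starts close to the hole.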

In [7]:
class DoNothingPolicy():

    def predict(self, obs):
        return 0, obs


class MaxActionPolicy():

    def predict(self, obs):
        return np.inf, obs


class ZeroMeanGaussianPolicy():

    def predict(self, obs):
        return np.random.randn(), obs

In [8]:
env = gym.make("Minigolf-v1")

do_nothing_policy = DoNothingPolicy()

max_action_policy = MaxActionPolicy()

gauss_policy = ZeroMeanGaussianPolicy()


do_nothing_mean, do_nothing_std = evaluate(env, do_nothing_policy)
max_action_mean, max_action_std = evaluate(env, max_action_policy)
gauss_policy_mean, gauss_policy_std = evaluate(env, gauss_policy)

Mean reward: -20.0 Std reward: 0.0 Num episodes: 100
Mean reward: -78.29 Std reward: 4.1717979313238525 Num episodes: 100
Mean reward: -18.42 Std reward: 0.4015575735165319 Num episodes: 100


## Train PPO, DDPG, and SAC

We now train three algorithms suitable for environments with continuous actions: [Proximal Policy Optimization](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html), [Deep Deterministic Policy Gradient](https://stable-baselines3.readthedocs.io/en/master/modules/ddpg.html), and [Soft Actor Critic](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html).

In [9]:
from stable_baselines3 import PPO, DDPG, SAC


# Separate evaluation env
eval_env = gym.make('Minigolf-v1')

ppo = PPO("MlpPolicy", env, verbose=1, policy_kwargs=dict(net_arch=[32]))
ddpg = DDPG("MlpPolicy", env, verbose=1, policy_kwargs=dict(net_arch=[32]))
sac = SAC("MlpPolicy", env, verbose=1, policy_kwargs=dict(net_arch=[32]))

print('PPO')
ppo.learn(total_timesteps=50000, log_interval=4, progress_bar=True)

print('DDPG')
ddpg.learn(total_timesteps=50000, log_interval=1024, progress_bar=True)

print('SAC')
sac.learn(total_timesteps=50000, log_interval=2048, progress_bar=True)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.





Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
PPO





------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 15.5         |
|    ep_rew_mean          | -14.9        |
| time/                   |              |
|    fps                  | 536          |
|    iterations           | 4            |
|    time_elapsed         | 15           |
|    total_timesteps      | 8192         |
| train/                  |              |
|    approx_kl            | 0.0068355724 |
|    clip_fraction        | 0.0841       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.44        |
|    explained_variance   | 0.162        |
|    learning_rate        | 0.0003       |
|    loss                 | 72.3         |
|    n_updates            | 30           |
|    policy_gradient_loss | -0.00802     |
|    std                  | 1.01         |
|    value_loss           | 55.1         |
------------------------------------------
[output truncated]

DDPG

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1.57     |
|    ep_rew_mean     | -12.6    |
| time/              |          |
|    episodes        | 1024     |
|    fps             | 159      |
|    time_elapsed    | 62       |
|    total_timesteps | 9880     |
| train/             |          |
|    actor_loss      | 14.9     |
|    critic_loss     | 12.6     |
|    learning_rate   | 0.001    |
|    n_updates       | 9779     |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 6.67     |
|    ep_rew_mean     | -5.67    |
| time/              |          |
|    episodes        | 2048     |
|    fps             | 158      |
|    time_elapsed    | 96       |
|    total_timesteps | 15324    |
| train/             |          |
|    actor_loss      | 4.52     |
|    critic_loss     | 5.51     |
|    learning_rate   | 0.001    |
|    n_updates       | 15223    |
---------------------------------
[output truncated]

SAC
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 6.82     |
|    ep_rew_mean     | -5.83    |
| time/              |          |
|    episodes        | 2048     |
|    fps             | 89       |
|    time_elapsed    | 275      |
|    total_timesteps | 24534    |
| train/             |          |
|    actor_loss      | 23       |
|    critic_loss     | 39.1     |
|    ent_coef        | 0.182    |
|    ent_coef_loss   | 0.1      |
|    learning_rate   | 0.0003   |
|    n_updates       | 24433    |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 3.93     |
|    ep_rew_mean     | -2.93    |
| time/              |          |
|    episodes        | 4096     |
|    fps             | 89       |
|    time_elapsed    | 386      |
|    total_timesteps | 34443    |
| train/             |          |
|    actor_loss      | 9.94     |
|    critic_loss     | 21.7     |
[output truncated]

<stable_baselines3.sac.sac.SAC at 0x7eb781f21580>

Let us now evaluate the results of the training.

In [10]:
ppo_mean, ppo_std = evaluate(eval_env, ppo)
ddpg_mean, ddpg_std = evaluate(eval_env, ddpg)
sac_mean, sac_std = evaluate(eval_env, sac)




Mean reward: -6.19 Std reward: 1.7375063403706712 Num episodes: 100
Mean reward: -0.81 Std reward: 0.06307763374436204 Num episodes: 100
Mean reward: -2.13 Std reward: 0.15805318117901987 Num episodes: 100
