<a href="https://colab.research.google.com/github/barrtender/euchre/blob/main/5_custom_gym_env.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Following Stable Baselines Tutorial - Creating a custom Gym environment

Originally at: https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html

But then I got errors and realized that stable-baselines3 was the thing to use, so I Frankenstein'd it together with https://stable-baselines3.readthedocs.io/en/master/guide/custom_env.html (note the 3 in the URL).

The goal was to get a colab up that I can put a custom environment into and train easily.

I wanted to validate with the simplest custom environment I could find, so "Go Left" seemed appropriate.

The game is "Go Left": a 1-dimensional game that starts the agent on the right side of a line and puts the goal at the left end. The agent has two actions, "go left" and "go right", and 50 moves to reach the goal. If it gets there in time it receives a positive reward that grows with the number of moves left; if it runs out of moves it receives a negative reward.

I found that without the negative reward or higher reward for finishing early the models never figured out how to win.
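
The reward scheme described above can be sketched in plain Python, without Gym. This is only an illustration under assumptions: the function name `play_episode` and the policy callables are hypothetical, and the numbers (a start at position 9 on a size-10 grid, 50 moves, a reward equal to the moves remaining at the goal, -50 on timeout) mirror the environment defined later in this notebook.

```python
LEFT, RIGHT = 0, 1

def play_episode(policy, grid_size=10, max_time=50):
    """Simulate one "Go Left" episode; return (final_reward, moves_used).

    `policy` is any callable mapping the current position to LEFT or RIGHT.
    """
    pos = grid_size - 1          # start near the right edge
    time_left = max_time
    while True:
        action = policy(pos)
        time_left -= 1
        pos += -1 if action == LEFT else 1
        pos = max(0, min(pos, grid_size))   # stay on the line
        if time_left <= 0:
            # Out of moves: negative reward
            return -max_time, max_time - time_left
        if pos == 0:
            # Goal reached: the earlier, the bigger the reward
            return time_left, max_time - time_left

print(play_episode(lambda pos: LEFT))    # (41, 9)
print(play_episode(lambda pos: RIGHT))   # (-50, 50)
```

The always-left policy scores 41, which matches the hard-coded run in the testing cell of this notebook.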

In [1]:
## Install Dependencies and Stable Baselines Using Pip

%tensorflow_version 2.x
!pip install 'stable-baselines3>=2.0.0'
!pip install sb3-contrib
!pip install 'shimmy>=0.2.1'

Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.
Collecting sb3-contrib
  Downloading sb3_contrib-2.1.0-py3-none-any.whl (80 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m80.3/80.3 kB[0m [31m2.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: sb3-contrib
Successfully installed sb3-contrib-2.1.0
Collecting shimmy>=0.2.1
  Downloading Shimmy-1.2.1-py3-none-any.whl (37 kB)
Installing collected packages: shimmy
Successfully installed shimmy-1.2.1


## First steps with the gym interface (copied from the tutorial, lightly corrected)

As you have noticed in the previous notebooks, an environment that follows the gym interface is quite simple to use.
It provides the user with three main methods:
- `reset()`, called at the beginning of an episode; it returns an observation (in Gymnasium, an `(observation, info)` tuple)
- `step(action)`, called to take an action in the environment; it returns the next observation, the immediate reward, whether the episode is over (in Gymnasium, split into `terminated` and `truncated` flags) and additional information
- (Optional) `render()`, which allows you to visualize the agent in action. Note that the graphical interface does not work on Google Colab, so we cannot use the `'human'` mode directly (we have to rely on the `'rgb_array'` mode to retrieve an image of the scene)

Under the hood, it also contains two useful properties:
- `observation_space`, which is one of the gym spaces (`Discrete`, `Box`, ...) and describes the type and shape of the observation
- `action_space`, which is also a gym space object; it describes the action space, i.e. the type of action that can be taken

The best way to learn about gym spaces is to look at the [source code](https://github.com/openai/gym/tree/master/gym/spaces), but you need to know at least the main ones:
- `gym.spaces.Box`: A (possibly unbounded) box in $R^n$. Specifically, a Box represents the Cartesian product of n closed intervals. Each interval has the form of one of [a, b], (-oo, b], [a, oo), or (-oo, oo). Example: A 1D-Vector or an image observation can be described with the Box space.
```python
# Example for using image as input:
observation_space = spaces.Box(low=0, high=255, shape=(HEIGHT, WIDTH, N_CHANNELS), dtype=np.uint8)
```                                       

- `gym.spaces.Discrete`: A discrete space in $\{ 0, 1, \dots, n-1 \}$
  Example: if you have two actions ("left" and "right") you can represent your action space using `Discrete(2)`, the first action will be 0 and the second 1.



[Documentation on custom env](https://stable-baselines.readthedocs.io/en/master/guide/custom_env.html)

##  Gym env skeleton

> In practice this is what a gym environment looks like.
Here, we have implemented a simple grid world where the agent must learn to always go left.

In [98]:
import numpy as np
import gymnasium
from gymnasium import spaces


class GoLeftEnv(gymnasium.Env):
  """
  Custom Environment that follows gym interface.
  This is a simple env where the agent must learn to go always left.
  """
  # Because of google colab, we cannot implement the GUI ('human' render mode)
  metadata = {'render_modes': ['console']}
  # Define constants for clearer code
  LEFT = 0
  RIGHT = 1
  MAX_TIME = 50
  time_left = MAX_TIME

  def __init__(self, grid_size=10):
    super(GoLeftEnv, self).__init__()

    # Size of the 1D-grid
    self.grid_size = grid_size
    # Initialize the agent at the right of the grid
    self.agent_pos = grid_size - 1

    # Define action and observation space
    # They must be gym.spaces objects
    # Example when using discrete actions, we have two: left and right
    n_actions = 2
    self.action_space = spaces.Discrete(n_actions)
    # The observation will be the coordinate of the agent
    # this can be described both by Discrete and Box space
    self.observation_space = spaces.Box(low=0, high=self.grid_size,
                                        shape=(1,), dtype=np.float32)

  def reset(self, seed=None, options=None):
    """
    Important: the observation must be a numpy array
    :return: (np.array, dict) the initial observation and an info dict
    """
    # Let gymnasium seed self.np_random, as its reset() API expects
    super().reset(seed=seed)
    self.time_left = self.MAX_TIME
    # Initialize the agent at the right of the grid
    self.agent_pos = self.grid_size - 1
    # here we convert to float32 to make it more general (in case we want to use continuous actions)
    return np.array([self.agent_pos]).astype(np.float32), {}

  def step(self, action):
    self.time_left -= 1
    if action == self.LEFT:
      self.agent_pos -= 1
    elif action == self.RIGHT:
      self.agent_pos += 1
    else:
      raise ValueError("Received invalid action={} which is not part of the action space".format(action))

    # Account for the boundaries of the grid
    self.agent_pos = np.clip(self.agent_pos, 0, self.grid_size)

    # Are we at the left of the grid?
    done = bool(self.agent_pos == 0)

    # Reward the moves left on the clock when reaching the goal (left of the grid), zero otherwise;
    # running out of time (below) overrides this with a penalty
    reward = self.time_left if self.agent_pos == 0 else 0

    if self.time_left <= 0:
      done = True
      reward = -self.MAX_TIME

    # Optionally we can pass additional info, we are not using that for now
    info = {}

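    # Gymnasium expects (obs, reward, terminated, truncated, info). The timeout is
    # penalized as part of the task, so it is reported as a termination, not a truncation.
    return np.array([self.agent_pos]).astype(np.float32), reward, done, False, info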

  def render(self, mode='console'):
    if mode != 'console':
      raise NotImplementedError()
    # agent is represented as a cross, rest as a dot
    print("." * self.agent_pos, end="")
    print("x", end="")
    print("." * (self.grid_size - self.agent_pos))

  def close(self):
    pass


### Validate the environment

> Stable Baselines provides a [helper](https://stable-baselines.readthedocs.io/en/master/common/env_checker.html) to check that your environment follows the Gym interface. It also optionally checks that the environment is compatible with Stable-Baselines (and emits warnings if necessary).

In [130]:
from stable_baselines3.common.env_checker import check_env

In [131]:
env = GoLeftEnv()
# If the environment doesn't follow the interface, an error will be thrown
check_env(env, warn=True)
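
To make "follows the interface" concrete, here is a hand-rolled sanity check of the Gymnasium-style return shapes. It is a sketch only: `quick_interface_check` is a hypothetical helper, it covers far less than `check_env`, and it is not a description of what `check_env` does internally.

```python
def quick_interface_check(env):
    """Hypothetical helper: check reset()/step() return shapes by hand."""
    result = env.reset(seed=0)
    assert isinstance(result, tuple) and len(result) == 2, \
        "reset() must return (obs, info)"
    obs, info = result
    assert isinstance(info, dict), "reset() info must be a dict"
    assert env.observation_space.contains(obs), \
        "reset() observation is outside observation_space"

    # Take one arbitrary valid action and inspect what step() gives back
    result = env.step(env.action_space.sample())
    assert isinstance(result, tuple) and len(result) == 5, \
        "step() must return (obs, reward, terminated, truncated, info)"
    obs, reward, terminated, truncated, info = result
    assert env.observation_space.contains(obs), \
        "step() observation is outside observation_space"
    float(reward)  # raises if the reward is not a number
    assert isinstance(terminated, bool) and isinstance(truncated, bool), \
        "terminated and truncated must be plain bools"
    assert isinstance(info, dict), "step() info must be a dict"
    return True
```

In this notebook it would be called as `quick_interface_check(GoLeftEnv())`.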

### Testing the environment

In [101]:
env = GoLeftEnv(grid_size=10)

obs, info = env.reset(seed=1)
env.render()

print(env.observation_space)
print(env.action_space)
print(env.action_space.sample())
print(env.action_space.sample())
print(env.action_space.sample())
print(env.action_space.sample())
print(env.action_space.sample())

GO_LEFT = 0
# Hardcoded best agent: always go left!
n_steps = 20
for step in range(n_steps):
  print("Step {}".format(step + 1))
  obs, reward, done, trun, info = env.step(GO_LEFT)
  print('obs=', obs, 'reward=', reward, 'done=', done)
  env.render()
  if done or trun:
    print("Goal reached!", "reward=", reward)
    break

.........x.
Box(0.0, 10.0, (1,), float32)
Discrete(2)
1
0
1
0
1
Step 1
obs= [8.] reward= 0 done= False
........x..
Step 2
obs= [7.] reward= 0 done= False
.......x...
Step 3
obs= [6.] reward= 0 done= False
......x....
Step 4
obs= [5.] reward= 0 done= False
.....x.....
Step 5
obs= [4.] reward= 0 done= False
....x......
Step 6
obs= [3.] reward= 0 done= False
...x.......
Step 7
obs= [2.] reward= 0 done= False
..x........
Step 8
obs= [1.] reward= 0 done= False
.x.........
Step 9
obs= [0.] reward= 41 done= True
x..........
Goal reached! reward= 41


### Try it with Stable-Baselines

> Once your environment follows the gym interface, it is quite easy to plug in any algorithm from stable-baselines

I trained two models here to sort of race them and to make sure the training was actually working.

I don't know exactly what `make_vec_env` does, but it's important: it's mentioned in the [Getting Started](https://stable-baselines3.readthedocs.io/en/master/guide/quickstart.html) guide and is used in the colab they shared, and I couldn't get things to work without it.

In [102]:
from stable_baselines3 import PPO, A2C, DQN
from stable_baselines3.common.env_util import make_vec_env

In [121]:
# Instantiate the env
vec_env = make_vec_env(GoLeftEnv, n_envs=1, env_kwargs=dict(grid_size=10))

env = GoLeftEnv(grid_size=10)
env.reset()

# Train the agent
model = DQN("MlpPolicy", env, verbose=1).learn(50000, log_interval=100)

# Train the second agent. It's a little dumber but it can win
model_two = DQN("MlpPolicy", env, verbose=1)
model_two.learn(total_timesteps=10000, log_interval=100)

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 44.9     |
|    ep_rew_mean      | -30.9    |
|    exploration_rate | 0.147    |
| time/               |          |
|    episodes         | 100      |
|    fps              | 11192    |
|    time_elapsed     | 0        |
|    total_timesteps  | 4487     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 43.5     |
|    ep_rew_mean      | -28      |
|    exploration_rate | 0.05     |
| time/               |          |
|    episodes         | 200      |
|    fps              | 11222    |
|    time_elapsed     | 0        |
|    total_timesteps  | 8839     |
----------------------------------
----------------------------------
| rollout/            |          |
|    ep_len_mean      | 45       |
|    ep_rew_mean      | -30   

<stable_baselines3.dqn.dqn.DQN at 0x7b8cf1e46f50>

In [129]:
# Test the trained agent
# using the vecenv
obs = vec_env.reset()
n_steps = 50
path = []
for step in range(n_steps):
    action, _ = model.predict(obs, deterministic=False)
    obs, reward, done, info = vec_env.step(action)
    path.append(obs[0][0])
    if done:
        break
print(path)
print(f"Reward={reward}")

print("Number two GO")

# Test the trained agent
# using the vecenv
obs = vec_env.reset()
n_steps = 50
path = []
for step in range(n_steps):
    action, _ = model_two.predict(obs, deterministic=False)
    obs, reward, done, info = vec_env.step(action)
    path.append(obs[0][0])
    if done:
        break
print(path)
print(f"Reward={reward}")

[8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 9.0]
Reward=[41.]
Number two GO
[10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 10.0, 9.0, 9.0]
Reward=[-50.]
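
One detail in the paths above: the trailing `9.0` is not a move the agent made. SB3 vectorized environments reset automatically as soon as an episode ends, so the observation returned alongside `done=True` is already the first observation of the next episode; the true final observation is stored in `info[0]["terminal_observation"]`. Below is a minimal sketch of a rollout loop that accounts for this, assuming a single-env VecEnv. The helper name `rollout_path` and its `predict` argument are hypothetical.

```python
def rollout_path(vec_env, predict, max_steps=50):
    """Collect the positions visited during one episode of a 1-env VecEnv.

    `predict` is any callable mapping the batched observation to a batched action.
    """
    obs = vec_env.reset()
    path = []
    reward = None
    for _ in range(max_steps):
        action = predict(obs)
        obs, reward, done, info = vec_env.step(action)
        if done[0]:
            # obs already belongs to the next episode, so record the
            # terminal observation that the VecEnv saved in info instead
            path.append(float(info[0]["terminal_observation"][0]))
            break
        path.append(float(obs[0][0]))
    return path, reward
```

With the objects from the cells above, this could be called as `path, reward = rollout_path(vec_env, lambda obs: model.predict(obs, deterministic=True)[0])`.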


## Future

I'll finish my Euchre custom environment and put it in the place of this game and then train on that.
