# Training the Baby Robot environment

This notebook uses the Stable Baselines3 PPO reinforcement learning algorithm to train Baby Robot to navigate a maze.

Training on Colab takes about 20 minutes with a GPU and considerably longer without one, so it's worth checking that the GPU is enabled
(in Colab, select the 'Runtime' menu, choose "Change runtime type" and set the hardware accelerator to GPU). The model is saved after training, so it may be worth copying it to Google Drive or somewhere similar to avoid retraining.
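
Once training has finished and the model has been saved, the copy could be made with a small helper like the one below. This is a minimal sketch using only the standard library: `backup_model` is a hypothetical name, and the Google Drive path in the comment assumes Drive has already been mounted at `/content/drive` in Colab. Stable Baselines3 appends `.zip` to the name passed to `model.save` when no extension is given, so the saved file is assumed to end in `_ppo.zip`.

```python
import shutil
from pathlib import Path


def backup_model(model_file, backup_dir):
    """Copy a saved model file into backup_dir and return the path of the copy."""
    src = Path(model_file)
    if not src.is_file():
        raise FileNotFoundError(f"No saved model at {src}")
    dest_dir = Path(backup_dir)
    # create the destination folder (and any parents) if it does not exist yet
    dest_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest_dir / src.name))


# Example call (both paths are assumptions, adjust to your own setup):
# backup_model("Models/BabyRobot-v0_ppo.zip", "/content/drive/MyDrive/BabyRobot")
```

Copying the file back into `Models/` before running the load cell reverses the process.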

In [1]:
# set this true to train the model
# - otherwise it will try to load a pre-trained model
TRAIN_MODEL = True

# the name of the environment to create
Environment_Name = "BabyRobot-v0"

# define where the model should be written
model_dir = 'Models/'

In [13]:
%pip install --upgrade babyrobot -q
%pip install --upgrade stable-baselines3 -q

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
babyrobot 1.0.16 requires gym==0.25.2, but you have gym 0.21.0 which is incompatible.
Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import numpy as np
import gym
import babyrobot

In [4]:
# test if running in Google Colab
if 'COLAB_GPU' in os.environ:
  print("Setting Up For Google Colab")
  from google.colab import output
  output.enable_custom_widget_manager()

In [5]:
# create the specified environment with a discrete action space
setup = {'action_space':'discrete'}
env = babyrobot.make(Environment_Name,**setup)

In [6]:
setup = { 'width': 8,
          'height': 5,
          'add_maze': True,
          'maze_seed': 42,
          'end': [5,4],
          'add_compass':True 
        }       

puddles = [((2,2),2),           
           ((2,0),1),
           ((7,4),2),          
           ((5,1),2)]
setup['puddles'] = puddles

setup['grid'] = {'theme': 'black_orange'}
setup['side_panel'] = {'width':200}      

walls = [((2, 0),'E'), # remove the east wall at (2,0)
         ((1, 2),'S'),((2, 2),'S'),((2, 2),'E'),((3, 2),'S'),((4, 2),'S'),((3, 2),'E'),
         ((5, 2),'E')] # add an east wall at (5,2)   
setup['walls'] = walls

In [7]:
env = babyrobot.make(Environment_Name,**setup)
env.render()

MultiCanvas(height=326, sync_image_data=True, width=718)

In [8]:
# add coordinates to see where we're working
info = {'coords': True}
env.show_info(info)

In [9]:
# remove the coordinates
env.clear_info(all_info=True)

## Observation Space

The environment's default Observation Space is a MultiDiscrete space, which returns an [x,y] co-ordinate in the grid.

In [10]:
env.reset()
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation

_____OBSERVATION SPACE_____ 

Observation Space Shape MultiDiscrete([8 5])
Sample observation [5 0]


## Action Space

The default Action Space for the environment returns a Dynamic Space, where the action space for each cell contains only the valid actions for that cell. So, for example, the actions won't contain those that would make Baby Robot walk into a wall.

In [11]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action


 _____ACTION SPACE_____ 

Action Space Shape 1
Action Space Sample 2


# Stable Baselines Training

Train the model on the environment using the _[PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html)_ algorithm.

In [14]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_vec_env

In [15]:
# Create environment
env = babyrobot.make(Environment_Name,**setup)

## Convert from MultiDiscrete to Discrete Observation Space

Unfortunately the Stable Baselines PPO algorithm supports neither MultiDiscrete nor Dynamic spaces. We therefore need to convert both the Observation Space and the Action Space to make them into Discrete Spaces.

For the Observation Space this means converting the grid co-ordinates into a single ID that identifies each cell in the grid. To do this we can use a **[wrapper](https://alexandervandekleut.github.io/gym-wrappers/)** around the Observation Space:



In [16]:
class DiscreteWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
        assert isinstance(env.observation_space, gym.spaces.MultiDiscrete), \
            "Should only be used to wrap MultiDiscrete envs."        
        self.observation_space = gym.spaces.Discrete(env.observation_space[0].n * env.observation_space[1].n)
    
    def observation(self, obs):
        # row-major cell ID: x + (y * grid width), where observation_space[0] is the x dimension
        new_obs = (obs[0] + (obs[1] * self.env.observation_space[0].n))
        return new_obs
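
The co-ordinate to ID conversion can be checked independently of the environment. The sketch below assumes a row-major layout, where the ID is `x + y * width`. The helper names `flatten_xy` and `unflatten_id` are hypothetical and not part of babyrobot or Gym. Multiplying `y` by the height instead of the width would produce collisions: on the 8x5 grid used here, (5,0) and (0,1) would both map to 5.

```python
def flatten_xy(x, y, width):
    """Row-major cell ID for the grid co-ordinate (x, y)."""
    return x + y * width


def unflatten_id(cell_id, width):
    """Inverse of flatten_xy: recover (x, y) from a cell ID."""
    return cell_id % width, cell_id // width


width, height = 8, 5  # grid size used in this notebook

# every cell should receive its own ID in the range 0..(width * height - 1)
ids = [flatten_xy(x, y, width) for y in range(height) for x in range(width)]
assert sorted(ids) == list(range(width * height))

# the mapping should round-trip, e.g. for the exit cell at (5, 4)
assert unflatten_id(flatten_xy(5, 4, width), width) == (5, 4)
```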

For the Action Space we can simply specify that we want to use a Discrete space in the setup of the environment. Every state in the grid will then have the same 5 possible actions (Stay, North, South, East and West), rather than only the actions that are valid in that state.

In [21]:
# create an environment with Discrete action and observation spaces
setup['action_space'] = 'discrete'
# use the old Gym step function that returns a single 'done' value
setup['new_step_api'] = False
env = DiscreteWrapper(babyrobot.make(Environment_Name,**setup))
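
For context on the `new_step_api` setting: Gym 0.25 introduced an optional step signature that returns five values (`obs, reward, terminated, truncated, info`), whereas the older signature returns four, with a single `done` flag. The sketch below shows how the two forms relate. `to_old_step` is a hypothetical helper for illustration only and is not needed when the environment is already created with `new_step_api` set to `False`.

```python
def to_old_step(result):
    """Collapse a 5-value step result into the 4-value form; pass 4-value results through."""
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
        # an episode is 'done' if it ended naturally or was cut short
        return obs, reward, terminated or truncated, info
    return result


# a truncated (time-limited) episode counts as done in the old API
print(to_old_step((12, -1.0, False, True, {})))
```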

In [22]:
env.reset()

print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action

print("\n_____OBSERVATION SPACE_____ \n")
print("Observation Space Shape", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation


 _____ACTION SPACE_____ 

Action Space Shape 5
Action Space Sample 4

_____OBSERVATION SPACE_____ 

Observation Space Shape Discrete(40)
Sample observation 32


In [23]:
# create the model
model = PPO('MlpPolicy', env, verbose=1)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [24]:
%%time

# test if the model should be trained
if TRAIN_MODEL:

  # Train the agent
  model.learn(total_timesteps=500_000)

  # Save the trained model
  model.save(f"{model_dir}/{Environment_Name}_ppo")

-----------------------------
| time/              |      |
|    fps             | 991  |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 3.78e+03     |
|    ep_rew_mean          | -4.14e+03    |
| time/                   |              |
|    fps                  | 625          |
|    iterations           | 2            |
|    time_elapsed         | 6            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0026101149 |
|    clip_fraction        | 0.0043       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.61        |
|    explained_variance   | -0.0195      |
|    learning_rate        | 0.0003       |
|    loss                 | 31.3         |
|    n_updates            | 10           |
|    policy_grad

## Load and Evaluate the Trained Model

In [58]:
# load the pre-trained model
model = PPO.load(f"{model_dir}/{Environment_Name}_ppo", print_system_info=True)

== CURRENT SYSTEM INFO ==
OS: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic #1 SMP Sun Apr 24 10:03:06 PDT 2022
Python: 3.7.13
Stable-Baselines3: 1.6.0
PyTorch: 1.12.0+cu113
GPU Enabled: True
Numpy: 1.21.6
Gym: 0.21.0

== SAVED MODEL SYSTEM INFO ==
OS: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic #1 SMP Sun Apr 24 10:03:06 PDT 2022
Python: 3.7.13
Stable-Baselines3: 1.6.0
PyTorch: 1.12.0+cu113
GPU Enabled: True
Numpy: 1.21.6
Gym: 0.21.0



In [59]:
eval_env = DiscreteWrapper(babyrobot.make(Environment_Name,**setup))
eval_env.render()

MultiCanvas(height=326, sync_image_data=True, width=718)

In [60]:
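obs = eval_env.reset()
# run a single episode, capped at 1000 steps
for i in range(1000):
    # predict returns the chosen action and the model's (unused) internal state
    action, _states = model.predict(obs)
    obs, rewards, done, info = eval_env.step(action)
    eval_env.render()
    if done:
        break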
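
To put a number on how well the agent does, the same loop can be wrapped in a function that returns the total reward and the number of steps taken. This is a minimal sketch that assumes the old 4-value step API used above. `run_episode` and `CorridorEnv` are hypothetical names, and `CorridorEnv` is only a tiny stand-in so that the example runs without babyrobot installed.

```python
def run_episode(env, choose_action, max_steps=1000):
    """Run one episode with the 4-value step API and return (total_reward, steps)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(choose_action(obs))
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


class CorridorEnv:
    """Hypothetical stand-in environment: action 1 moves right, the exit is at cell `length`."""

    def __init__(self, length=3):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        if action == 1:
            self.pos += 1
        done = self.pos >= self.length
        # each step costs -1, so shorter routes score higher
        return self.pos, -1.0, done, {}


total, steps = run_episode(CorridorEnv(length=3), lambda obs: 1)
print(total, steps)
```

With the trained model, the equivalent call would be `run_episode(eval_env, lambda obs: model.predict(obs)[0])`, since `model.predict` returns the action as the first element of a tuple.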