In [None]:
from Environments.random_maze import maze_game
import numpy as np
import matplotlib.pyplot as plt
from gymnasium import spaces
from gymnasium.spaces import Dict
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.models import MODEL_DEFAULTS
from ray.rllib.models import ModelCatalog
import copy
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.utils.framework import try_import_tf, try_import_torch
from ray.rllib.utils.torch_utils import FLOAT_MIN
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork as TorchFC
from ray import air, tune

**Advanced RL Algorithms and RLLib!** 

The goal of this second notebook is to show you how I go about solving RL problems, using RLLib and more advanced algorithms. Hopefully it will give you an insight into types of problems that are encountered when working in RL and how they are solved. 

As a bit of a showcase, we are going to make the maze game from earlier harder, and discuss how to solve it. 

As in the previous notebook, we will start by looking at the problem. You will notice that once again it is a maze game; however, this one will be a fair bit harder to solve. Run the cell below a few times. What do you notice?

In [None]:
example = maze_game()
example.reset()

This version of the maze game is random! We will need a high level of generalisation to solve this problem. (The action space is the same as before.)

You will also notice that the state is split up into a dictionary. The base form (before wrapping) should present as much data as possible; we can always use wrappers later to reduce the data. 

Like all good ML, we will start with a baseline. We will produce a wrapper that stacks the maze into 2 binary planes (walls and agent position) that will make up the state:

In [3]:
class stacked_maze(maze_game):
    #overwrite the init to change the obs space
    def __init__(self):
        super().__init__()
        #This wrapper updates the observation space. It is now best described as a MultiBinary.
        self.observation_space = spaces.MultiBinary((12,12,2))
    
    #overwrite create_state to change how the env handles the state
    def create_state(self): #In this version, the goal never changes, so ignore it
        state = np.stack([np.array(self.maze==self.wall, dtype=np.int8),
                          np.array(self.maze==self.agent, dtype=np.int8)], axis=-1)
        return state
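
To make the stacking concrete, here is a minimal sketch on a toy 4x4 maze. The cell codes `EMPTY`, `WALL` and `AGENT` are hypothetical stand-ins; the real values live on the `maze_game` instance as `self.wall` and `self.agent`.

```python
import numpy as np

# Hypothetical cell codes, chosen only for this illustration.
EMPTY, WALL, AGENT = 0, 1, 2

maze = np.array([
    [WALL, WALL,  WALL,  WALL],
    [WALL, AGENT, EMPTY, WALL],
    [WALL, EMPTY, WALL,  WALL],
    [WALL, WALL,  WALL,  WALL],
])

# One binary plane per feature, stacked along the last (channel) axis.
wall_plane = (maze == WALL).astype(np.int8)
agent_plane = (maze == AGENT).astype(np.int8)
state = np.stack([wall_plane, agent_plane], axis=-1)

print(state.shape)          # height x width x planes
print(state[..., 1].sum())  # number of cells flagged as the agent
```

Each plane answers one yes/no question per cell, which is why `MultiBinary` is a natural fit for the observation space.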

Now we can introduce RLLib. We start by registering the environment:

In [4]:
#We need to register the env we want to use with RLLib before we can use it:
def env_creator(env_config):
    return stacked_maze()
register_env("stacked_maze", env_creator)

Now we will define the model that we want to use. More on this config can be found here: https://github.com/ray-project/ray/blob/469f4d296a112f2ade556ea586a0f05811b34d32/rllib/models/catalog.py#L52

We are using some convolution layers to reduce the dimensionality of the matrix inputs. These will get flattened and passed to the dense layers given in 'fcnet_hiddens'.

In [5]:
model = copy.copy(MODEL_DEFAULTS)
model.update({'fcnet_hiddens': [256, 128], 'fcnet_activation': 'relu', 'conv_filters': [[4, [3, 3], 1], [8, [3, 3], 1], [12, [3, 3], 1], [16, [3, 3], 1]]})
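
Each `conv_filters` entry is read as `[out_channels, kernel, stride]`. As a rough sanity check on how small this convolutional stack is, the sketch below counts its trainable parameters, assuming standard 2D convolutions with one bias per output channel and the 2 input planes of the stacked observation. The helper `conv_param_counts` is hypothetical and not part of RLLib.

```python
# Same filter spec as the model config above: [out_channels, kernel, stride].
conv_filters = [[4, [3, 3], 1], [8, [3, 3], 1], [12, [3, 3], 1], [16, [3, 3], 1]]

def conv_param_counts(in_channels, filters):
    """Return weights + biases for each conv layer in the stack."""
    counts = []
    for out_channels, (kernel_h, kernel_w), _stride in filters:
        weights = in_channels * out_channels * kernel_h * kernel_w
        counts.append(weights + out_channels)  # one bias per output channel
        in_channels = out_channels             # next layer consumes this output
    return counts

counts = conv_param_counts(in_channels=2, filters=conv_filters)
print(counts, sum(counts))
```

Under these assumptions the whole stack has only a few thousand parameters, so most of the model's capacity sits in the dense layers that follow.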

We now create a config that will produce a trainer for learning on the "stacked_maze" environment. We won't worry too much about the learning parameters right now. At this stage I am mainly checking that my MDP is working.

In [None]:
config = (
    PPOConfig()
    .rl_module(_enable_rl_module_api=False)
    .environment(env="stacked_maze")\
    .rollouts(num_rollout_workers=4, num_envs_per_worker=1)\
    .training(_enable_learner_api=False, train_batch_size=2000, gamma=0.995, model=model, lr=0.001,  )\
    .environment(disable_env_checking=True)\
    .framework('torch')\
)
trainer = config.build()

Now the trainer is built, we can start training! We are simply powering this with a for loop.

In [None]:
def print_results(results_dict):
    train_iter = results_dict["training_iteration"]
    r_mean = results_dict["episode_reward_mean"]
    r_max = results_dict["episode_reward_max"]
    r_min = results_dict["episode_reward_min"]
    print(f"{train_iter:4d} \tr_mean: {r_mean:.1f} \tr_max: {r_max:.1f} \tr_min: {r_min: .1f}")

for i in range(50):
    print_results(trainer.train())

So this didn't get a very good result, but it seems to have learnt something! So, it is now time to consider what hyperparameters we are using. In RL, we typically have 3-8 different hyperparameters to balance, which makes running hyperparameter searches very time-consuming. Luckily, we can generally reduce the range of our hyperparameter searches by using a bit of common sense. Thinking about our target problem, each episode the maze changes, so the agent needs to learn the general representation of the maze for the best result. 

As in supervised learning, for generalisation to form, we need to ensure that the DNN cannot overfit its data (overfitting is typically caused by parameters that give aggressive learning, like a high learning rate, or by a lack of data points to generalise across). To reduce overfitting in PPO, we change the following parameters:

- train_batch_size: Increasing this makes PPO use more data for each policy update - letting the policy apply its update with more data can help with generalisation.
- clip_param: PPO clips its policy loss to prevent overly large updates. Reducing this value (default is .2) reduces the magnitude of the policy updates.
- num_sgd_iter: PPO uses importance sampling to let it apply multiple updates with one training batch. The fewer passes we make over the sample batch, the less we will fit that data. This can be set far higher (20-30) in very static problems. 
- sgd_minibatch_size: The size of the mini-batches that make up each update. Has similar effects to train_batch_size. By making it large, we take the average across multiple episodes for each update.
- entropy_coeff: Adding in a small amount of entropy loss to aid in exploration.
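
To see what `clip_param` does, here is a minimal NumPy sketch of the per-sample clipped surrogate objective from the PPO paper. It is an illustration only, not RLLib's implementation, and the function name `clipped_surrogate` is made up for this example.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, clip_param=0.15):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
    return np.minimum(ratio * advantage, clipped_ratio * advantage)

# ratio = new_policy_prob / old_policy_prob for the action that was taken.
ratios = np.array([0.5, 1.0, 1.1, 1.5])
advantage = np.ones_like(ratios)  # pretend every action was better than expected

obj_default = clipped_surrogate(ratios, advantage, clip_param=0.2)
obj_tight = clipped_surrogate(ratios, advantage, clip_param=0.15)
print(obj_default)
print(obj_tight)
```

With a positive advantage the objective stops growing once the ratio passes `1 + clip_param`, so a smaller `clip_param` caps how far a single batch can push the policy.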

In [None]:
config = (
    PPOConfig()
    .rl_module(_enable_rl_module_api=False)
    .environment(env="stacked_maze")\
    .rollouts(num_rollout_workers=4, num_envs_per_worker=1)\
    .training(_enable_learner_api=False, train_batch_size=15000, gamma=0.995, model=model, lr=0.0003,\
              clip_param=0.15, num_sgd_iter=4, sgd_minibatch_size=3000, entropy_coeff=0.0001,)\
    .environment(disable_env_checking=True)\
    .framework('torch')\
)
trainer = config.build()
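
As a rough guide to how hard this configuration fits each batch, the sketch below counts gradient steps per training iteration. It assumes every pass splits the batch into equal minibatches and ignores any remainder handling; the helper `sgd_steps_per_iteration` is hypothetical.

```python
def sgd_steps_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Approximate number of minibatch gradient steps taken on one train batch."""
    minibatches_per_pass = train_batch_size // sgd_minibatch_size
    return minibatches_per_pass * num_sgd_iter

steps = sgd_steps_per_iteration(
    train_batch_size=15000, sgd_minibatch_size=3000, num_sgd_iter=4
)
print(steps)  # gradient steps taken before fresh data is collected
```

Fewer, larger steps per batch is the conservative end of the trade-off described above.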

In [None]:
for i in range(50):
    print_results(trainer.train())

Hopefully, you saw better results with these more optimised hyperparameters. But, it seems that we still haven't solved the random maze game. Longer training would probably improve things, but we can also aid learning through the same action masking technique we looked at in the last notebook!

To implement action masking with RLLib we again start by wrapping the target environment:

In [3]:
class random_maze_action_mask(maze_game):
    #overwrite the init to change the obs space
    def __init__(self):
        super().__init__()
        #This wrapper updates the observation space. It is now a Dict holding the observations and the action mask.
        self.observation_space = spaces.Dict({'observations': spaces.MultiBinary((12,12,2)), 
                                              'action_mask': spaces.MultiBinary(4)})
    
    #overwrite create_state to change how the env handles the state
    def create_state(self):
        state = {}
        state['observations'] = np.stack([np.array(self.maze==self.wall, dtype=np.int8),
                                  np.array(self.maze==self.agent, dtype=np.int8)], axis=-1)
        state['action_mask'] = self.get_action_mask()
        return state
    
    def get_action_mask(self):
        action_mask = np.zeros(4,dtype=np.int8)
        if self.y != 0:
            if self.maze[self.x, self.y-1] != self.wall:
                action_mask[0] = 1
        if self.x != self.maze.shape[0]-1:
            if self.maze[self.x+1, self.y] != self.wall:
                action_mask[1] = 1
        if self.y != self.maze.shape[1]-1:
            if self.maze[self.x, self.y+1] != self.wall:
                action_mask[2] = 1
        if self.x != 0:
            if self.maze[self.x-1, self.y] != self.wall:
                action_mask[3] = 1
        return action_mask

def env_creator(env_config):
    return random_maze_action_mask()
register_env("random_maze_action_mask", env_creator)
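
The mask logic can be checked in isolation on a toy grid. The sketch below mirrors `get_action_mask` (same action ordering, same bounds and wall checks) as a standalone function. The `WALL` code and the 3x3 maze are hypothetical, and the grid is indexed as `maze[x, y]` to match the wrapper.

```python
import numpy as np

WALL = 1  # hypothetical wall code for this illustration

def action_mask_for(maze, x, y, wall=WALL):
    """Return a 0/1 mask over the 4 actions, 1 meaning the move is allowed."""
    # (dx, dy) for actions 0..3, in the same order as get_action_mask.
    moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]
    mask = np.zeros(4, dtype=np.int8)
    for action, (dx, dy) in enumerate(moves):
        nx, ny = x + dx, y + dy
        in_bounds = 0 <= nx < maze.shape[0] and 0 <= ny < maze.shape[1]
        if in_bounds and maze[nx, ny] != wall:
            mask[action] = 1
    return mask

maze = np.array([
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
])

print(action_mask_for(maze, 0, 0))  # corner: two moves leave the grid
print(action_mask_for(maze, 1, 1))  # centre: every neighbour is free
print(action_mask_for(maze, 1, 2))  # edge next to a wall
```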

Next, we will update the model that our policy will use. This one looks at the action_mask present in the state and uses it to push the logits of invalid actions down to 'FLOAT_MIN'. When the policy's action selector (a separate class that turns DNN logits into actions) gets these logits, the softmax assigns the masked-out actions effectively zero probability, so they are not selected. 

In [7]:
torch, nn = try_import_torch()

class ActionMaskModel(TorchModelV2, nn.Module):
    """PyTorch model that masks invalid actions using the observation's action_mask."""

    def __init__(
        self,
        obs_space,
        action_space,
        num_outputs,
        model_config,
        name,
        **kwargs,
    ):
        orig_space = getattr(obs_space, "original_space", obs_space)
        assert (
            isinstance(orig_space, Dict)
            and "action_mask" in orig_space.spaces
            and "observations" in orig_space.spaces
        )

        TorchModelV2.__init__(
            self, obs_space, action_space, num_outputs, model_config, name, **kwargs
        )
        nn.Module.__init__(self)

        self.internal_model = TorchFC(
            orig_space["observations"],
            action_space,
            num_outputs,
            model_config,
            name + "_internal",
        )

    def forward(self, input_dict, state, seq_lens):
        # Extract the available actions tensor from the observation.
        action_mask = input_dict["obs"]["action_mask"]
        # Compute the unmasked logits.
        logits, _ = self.internal_model({"obs": input_dict["obs"]["observations"]})
        # Convert action_mask into a [0.0 || -inf]-type mask.
        inf_mask = torch.clamp(torch.log(action_mask), min=FLOAT_MIN)
        masked_logits = logits + inf_mask
        # Return masked logits.
        return masked_logits, state

    def value_function(self):
        return self.internal_model.value_function()

#Like an environment, we need to register our custom model.
ModelCatalog.register_custom_model("RandomMazeModel", ActionMaskModel)
model = copy.copy(MODEL_DEFAULTS)
model.update({'custom_model': "RandomMazeModel", 'custom_model_config': {},})
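
The masking arithmetic in `forward` can be reproduced in plain NumPy to see why masked actions are never sampled. In this sketch `FLOAT_MIN` is a stand-in for RLLib's constant (assumed here to be a very large negative float), and the logits are made-up numbers.

```python
import numpy as np

FLOAT_MIN = -3.4e38  # stand-in for RLLib's FLOAT_MIN (assumed value)

logits = np.array([2.0, 0.5, -1.0, 1.0])       # made-up network outputs
action_mask = np.array([1.0, 0.0, 1.0, 0.0])   # actions 1 and 3 are illegal

# log(1) = 0 leaves legal logits untouched; log(0) = -inf marks illegal ones.
with np.errstate(divide="ignore"):
    log_mask = np.log(action_mask)
inf_mask = np.maximum(log_mask, FLOAT_MIN)     # clamp -inf to a finite floor
masked_logits = logits + inf_mask

# Numerically stable softmax over the masked logits.
shifted = masked_logits - masked_logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs)
```

Clamping to a finite floor rather than keeping `-inf` avoids NaNs in later arithmetic while still driving the masked probabilities to zero.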


Now we can train again using action masking.

In [None]:
config = (
    PPOConfig()
    .rl_module(_enable_rl_module_api=False)
    .environment(env="random_maze_action_mask")\
    .rollouts(num_rollout_workers=4, num_envs_per_worker=1)\
    .training(_enable_learner_api=False, train_batch_size=15000, gamma=0.995, model=model, lr=0.0003,\
              clip_param=0.15, num_sgd_iter=4, sgd_minibatch_size=3000, entropy_coeff=0.0001,)\
    .environment(disable_env_checking=True)\
    .framework('torch')\
)
trainer = config.build()

In [None]:
for i in range(50):
    print_results(trainer.train())

Again, we can see that, unsurprisingly, learning has gone a lot better with action masking. PPO still isn't solving the problem, though. Now, this is most likely because we haven't let training go on for long enough, but this is also a good opportunity to look at parameter tuning. 

Running a hyperparameter search can be time-consuming, so I have already run one and kept the results:

In [15]:
config = (
    PPOConfig()
    .rl_module(_enable_rl_module_api=False)
    .environment(env="random_maze_action_mask")\
    .rollouts(num_rollout_workers=12, num_envs_per_worker=1)\
    .training(_enable_learner_api=False, train_batch_size=tune.grid_search([12000, 18000]), gamma=0.995, model=model, lr=0.0003,\
              clip_param=0.15, num_sgd_iter=tune.grid_search([3,5]), sgd_minibatch_size=tune.grid_search([2000, 4000]), entropy_coeff=0.0001,)\
    .environment(disable_env_checking=True)\
    .framework('torch')\
    .resources(num_gpus=1)
    )

tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(num_samples=1),
    param_space=config.to_dict(),
    run_config=air.RunConfig(stop={"timesteps_total": 1.5e6}, storage_path="./results", name="ppo_hyperparam_search"))

results = tuner.fit()

Current time:,2023-10-25 17:32:23
Running for:,02:04:56.28
Memory:,30.3/79.9 GiB

Trial name,status,loc,num_sgd_iter,sgd_minibatch_size,train_batch_size,iter,total time (s),ts,reward,episode_reward_max,episode_reward_min,episode_len_mean
PPO_random_maze_action_mask_9f288_00000,TERMINATED,127.0.0.1:11688,3,2000,12000,125,279.398,1500000,2.247,48.1,-50,263.1
PPO_random_maze_action_mask_9f288_00001,TERMINATED,127.0.0.1:6416,5,2000,12000,125,295.171,1500000,4.096,47.5,-50,269.66
PPO_random_maze_action_mask_9f288_00002,TERMINATED,127.0.0.1:1908,3,4000,12000,125,271.755,1500000,5.333,46.7,-50,272.32
PPO_random_maze_action_mask_9f288_00003,TERMINATED,127.0.0.1:20528,5,4000,12000,125,296.525,1500000,2.704,47.5,-50,273.56
PPO_random_maze_action_mask_9f288_00004,TERMINATED,127.0.0.1:26372,3,2000,18000,84,309.386,1512000,8.224,47.9,-50,238.4
PPO_random_maze_action_mask_9f288_00005,TERMINATED,127.0.0.1:20272,5,2000,18000,84,279.82,1512000,11.331,47.1,-50,232.38
PPO_random_maze_action_mask_9f288_00006,TERMINATED,127.0.0.1:14960,3,4000,18000,84,319.457,1512000,5.894,47.9,-50,241.66
PPO_random_maze_action_mask_9f288_00007,TERMINATED,127.0.0.1:26696,5,4000,18000,84,336.087,1512000,-1.614,48.1,-50,286.68
PPO_random_maze_action_mask_9f288_00008,TERMINATED,127.0.0.1:27040,3,2000,12000,125,296.075,1500000,4.711,47.3,-50,268.52
PPO_random_maze_action_mask_9f288_00009,TERMINATED,127.0.0.1:632,5,2000,12000,125,283.684,1500000,9.698,47.5,-50,233.68


[2m[36m(RolloutWorker pid=23808)[0m   File "python\ray\_raylet.pyx", line 1424, in ray._raylet.execute_task
[2m[36m(RolloutWorker pid=23808)[0m   File "python\ray\_raylet.pyx", line 1364, in ray._raylet.execute_task.function_executor
[2m[36m(RolloutWorker pid=23808)[0m   File "c:\Users\Adam\anaconda3\lib\site-packages\ray\_private\function_manager.py", line 726, in actor_method_executor
[2m[36m(RolloutWorker pid=23808)[0m     return method(__ray_actor, *args, **kwargs)
[2m[36m(RolloutWorker pid=23808)[0m   File "c:\Users\Adam\anaconda3\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
[2m[36m(RolloutWorker pid=23808)[0m     return method(self, *_args, **_kwargs)
[2m[36m(RolloutWorker pid=23808)[0m   File "c:\Users\Adam\anaconda3\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 470, in __init__
[2m[36m(RolloutWorker pid=23808)[0m     self.policy_dict, self.is_policy_to_train = self.config.get_multi_agent_setup(
[2m[

To take a look at these results, we will use TensorBoard. Running the command below will launch TensorBoard, but you will want to run it from a separate console if you want to use the notebook and TensorBoard at the same time.

In [16]:
!tensorboard --logdir '/Users/<username>/RL Master Class/RLMasterClass/results/ppo_hyperparam_search'

'tensorboard' is not recognized as an internal or external command,
operable program or batch file.
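
The final table printed by the search can also be summarised directly. The sketch below copies the mean-reward column from that table by hand and picks out the best trial and the average reward per `train_batch_size`. It is a quick look at one set of runs, not a substitute for inspecting the learning curves.

```python
from collections import defaultdict

# (num_sgd_iter, sgd_minibatch_size, train_batch_size, mean reward),
# copied from the trial table printed by the hyperparameter search.
trials = [
    (3, 2000, 12000, 2.247),
    (5, 2000, 12000, 4.096),
    (3, 4000, 12000, 5.333),
    (5, 4000, 12000, 2.704),
    (3, 2000, 18000, 8.224),
    (5, 2000, 18000, 11.331),
    (3, 4000, 18000, 5.894),
    (5, 4000, 18000, -1.614),
    (3, 2000, 12000, 4.711),
    (5, 2000, 12000, 9.698),
]

best = max(trials, key=lambda trial: trial[3])

rewards_by_batch = defaultdict(list)
for _sgd_iter, _minibatch, batch_size, reward in trials:
    rewards_by_batch[batch_size].append(reward)
mean_by_batch = {b: sum(r) / len(r) for b, r in rewards_by_batch.items()}

print("best trial:", best)
print("mean reward by train_batch_size:", mean_by_batch)
```

Note that the two settings that appear twice in the table finish with quite different rewards, so the gaps between configurations here are of a similar size to the run-to-run noise.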


At this stage, it might be wise to investigate the performance of a different algorithm, for example a DQN. I'm leaving this as an open task, so feel free to explore DQNs or make a start on anything else. From here, I'd like to encourage you to work on anything that you might find interesting! 

Some ideas: 

- Apply DQN to this maze problem, and try to find an optimal learning set up. 

- Modify the Maze game in some way, and see how that affects learning. 

- Create your own MDP to try and further your understanding of how we make and solve MDPs. A good challenge is a two-player game like tic-tac-toe or Connect 4.

- Find some other benchmarks to try and solve. Gymnasium has many available: https://gymnasium.farama.org/index.html 