# <a href="https://colab.research.google.com/github/enlite-ai/maze/blob/main/tutorials/notebooks/getting_started/getting_started_3_customization_intro.ipynb" target="_top"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" /></a> Maze: Getting Started Part III - Customization

Part 3 of 4 in the *Getting started* series. We recommend reading [part 1](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_1.ipynb) and [part 2](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_2.ipynb) prior to this notebook.

## On Maze

MazeRL is an application-oriented Deep Reinforcement Learning (RL) framework, addressing real-world decision problems. If you'd like to know more, check out
* the [GitHub repository](https://github.com/enlite-ai/maze),
* the [documentation](https://maze-rl.readthedocs.io/en/latest/index.html#documentation-overview) or
* the official [website](https://www.enlite.ai/).


## Introduction

The first two notebooks addressed the basic workflow and how to configure Maze's high-level Python API. This one aims to convey how to write your own components: we will implement our own wrapper and policy.


### Install Maze and Dependencies

Maze is available as a pip package. The other dependencies required for this notebook are PyTorch and OpenAI's gym. We recommend installing PyTorch via Conda. If you are executing this notebook on Google Colab, both libraries are already available.

In [None]:
!pip install torch
!pip install gym
!pip install maze-rl

## Writing a Customized Wrapper

[Part 2](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_2.ipynb) introduced the concept of environment wrappers - here we take a look at how to write one ourselves. Wrappers can be immensely useful: they allow you to modify an environment's behaviour without having to rewrite the actual environment code, and they can be nested arbitrarily. They thus enable a high degree of flexibility in configuring your RL setup.

One example of this is _reward clipping_, which is a particular kind of reward shaping. In reward clipping, an environment's rewards are clipped to a specific range. This may be done e.g. to standardize the reward range across several environments, or if rewards don't carry much information outside a certain range. Standardized rewards may also be helpful in learning an appropriate policy. Note that there are other approaches to address these problems, e.g. reward normalization.
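To make the difference between clipping and normalization concrete, here is a minimal, framework-free sketch. The function names `clip_reward` and `normalize_rewards` and the sample rewards are illustrative assumptions, not part of Maze; normalization is shown in one common variant (zero mean, unit variance over a batch).

```python
from statistics import mean, pstdev
from typing import List


def clip_reward(reward: float, min_val: float, max_val: float) -> float:
    """Clip a single reward to the closed range [min_val, max_val]."""
    return min(max(min_val, reward), max_val)


def normalize_rewards(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Shift and scale a batch of rewards to zero mean and roughly unit variance."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards with two outliers.
raw_rewards = [-10.0, -0.5, 0.0, 0.5, 10.0]

# Clipping caps the outliers but leaves in-range rewards untouched.
clipped = [clip_reward(r, min_val=-1.0, max_val=1.0) for r in raw_rewards]

# Normalization rescales all rewards relative to the batch statistics.
normalized = normalize_rewards(raw_rewards)

print(clipped)
print(normalized)
```

Clipping discards magnitude information outside the range, while normalization preserves relative magnitudes but depends on the statistics of the rewards seen so far.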

Any wrapper in Maze inherits from `maze.core.wrappers.wrapper.Wrapper`. This ensures that wrapped environments provide all the functionality needed to fit into the framework. For our minimal reward clipping wrapper, only changing the reward in `.step()` is of immediate interest - we don't need to change any other behaviour. We can therefore let our reward clipping wrapper inherit from `maze.core.wrappers.wrapper.RewardWrapper`, which already takes care of most of the functionality a wrapper has to implement.

Our minimal reward clipping wrapper could look like this:

In [2]:
from typing import Any, Dict, Tuple, Optional, Union
import dataclasses
from maze.core.env.maze_env import MazeEnv
from maze.core.wrappers.wrapper import RewardWrapper, EnvType
from maze.core.env.action_conversion import ActionType

@dataclasses.dataclass
class MinimalRewardClippingWrapper(RewardWrapper[MazeEnv]):
    """
    Clips original step reward to range [min, max].
    """

    env: MazeEnv
    """The underlying environment."""
    min_val: float
    """Minimum allowed reward value."""
    max_val: float
    """Maximum allowed reward value."""

    def reward(self, reward: float) -> float:
        """
        Clips the original reward.
        :param reward: The original reward.
        :return: The clipped reward.
        """
        return min(max(self.min_val, reward), self.max_val)

INFO: Setting MKL_THREADING_LAYER=GNU to avoid PyTorch issues with conda!
INFO: Setting OMP_NUM_THREADS=1 to avoid performance drop when using distributed environments!


Note that we don't implement all methods required of instantiable `Wrapper` classes. This is for convenience's sake, as we don't need these methods (specifically `clone_from()`) for our example.
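Why is overriding `reward()` alone sufficient? The following framework-free sketch illustrates the general pattern: a base class forwards `step()` to the wrapped environment and routes the reward through a `reward()` hook. The classes `ToyEnv`, `ToyRewardWrapper` and `ToyClippingWrapper` are hypothetical stand-ins for illustration and are not Maze's actual implementation.

```python
from typing import Any, Dict, Tuple


class ToyEnv:
    """Hypothetical stand-in environment that always returns a reward of 1.0."""

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        return 0, 1.0, False, {}


class ToyRewardWrapper:
    """Hypothetical base class: forwards step() and routes the reward through reward()."""

    def __init__(self, env: Any):
        self.env = env

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        observation, reward, done, info = self.env.step(action)
        return observation, self.reward(reward), done, info

    def reward(self, reward: float) -> float:
        raise NotImplementedError


class ToyClippingWrapper(ToyRewardWrapper):
    """Only reward() is overridden; step() is inherited from the base class."""

    def __init__(self, env: Any, min_val: float, max_val: float):
        super().__init__(env)
        self.min_val, self.max_val = min_val, max_val

    def reward(self, reward: float) -> float:
        return min(max(self.min_val, reward), self.max_val)


wrapped = ToyClippingWrapper(ToyEnv(), min_val=0.0, max_val=0.5)
_, clipped_reward, _, _ = wrapped.step(action=0)
print(clipped_reward)
```

The subclass never touches `step()`, which is the convenience a dedicated reward wrapper base class is meant to provide.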

What if we are interested in changing other behaviour as well? In that case we can inherit directly from `Wrapper`. Such a more general wrapper might look like this:

In [3]:
from maze.core.wrappers.wrapper import Wrapper

@dataclasses.dataclass
class RewardClippingWrapper(Wrapper[MazeEnv]):
    """
    Clips original step reward to range [min, max].
    """

    env: MazeEnv
    """The underlying environment."""
    min_val: float
    """Minimum allowed reward value."""
    max_val: float
    """Maximum allowed reward value."""

    def step(self, action: ActionType) -> Tuple[Any, Any, bool, Dict[Any, Any]]:
        """
        We intercept the environment's reward and clip it to the specified range.
        :param action: Action to execute.
        :return: Observation, clipped reward, done, info dict.
        """

        observation, reward, done, info = self.env.step(action)
        return observation, self.reward(reward), done, info

    def reward(self, reward: float) -> float:
        """
        Clips the original reward.
        :param reward: The original reward.
        :return: The clipped reward.
        """
        return min(max(self.min_val, reward), self.max_val)

Do our `MinimalRewardClippingWrapper` and `RewardClippingWrapper` work? Let's verify this, once more on the `CartPole` environment. We take a single step and look at the reward - after a single step our pole will still be upright. Without a wrapper the reward should amount to 1.0; with either wrapper (using `max_val=0.5`) it should be clipped to 0.5.

In [4]:
import gym
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

env = GymMazeEnv(env=gym.make("CartPole-v0"))
env.reset()
print("Without reward clipping: Reward is {reward}".format(reward=env.step({"action": 0})[1]))

env = MinimalRewardClippingWrapper.wrap(GymMazeEnv(env=gym.make("CartPole-v0")), min_val=0, max_val=0.5)
env.reset()
print("With reward clipping (minimal wrapper): Reward is {reward}".format(reward=env.step({"action": 0})[1]))

env = RewardClippingWrapper.wrap(GymMazeEnv(env=gym.make("CartPole-v0")), min_val=0, max_val=0.5)
env.reset()
print("With reward clipping (non-minimal wrapper): Reward is {reward}".format(reward=env.step({"action": 0})[1]))

Without reward clipping: Reward is 1.0
With reward clipping (minimal wrapper): Reward is 0.5
With reward clipping (non-minimal wrapper): Reward is 0.5


As expected, our reward clipping wrappers intercept the reward returned by the original environment and modify it. Note that this is most likely not useful for the `CartPole` environment, but can come in handy for (possibly your own) environments with a more complex reward distribution.
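Since wrappers can be nested arbitrarily, the order of nesting matters. The sketch below uses hypothetical plain-Python classes (`ConstantRewardEnv`, `ScaleReward`, `ClipReward`), not Maze classes, to show that scaling before clipping yields a different reward than clipping before scaling.

```python
from typing import Any, Dict, Tuple

StepResult = Tuple[int, float, bool, Dict[str, Any]]


class ConstantRewardEnv:
    """Hypothetical environment that returns a fixed reward on every step."""

    def __init__(self, reward: float):
        self._reward = reward

    def step(self, action: int) -> StepResult:
        return 0, self._reward, False, {}


class ScaleReward:
    """Hypothetical wrapper that multiplies the inner environment's reward by a factor."""

    def __init__(self, env: Any, factor: float):
        self.env, self.factor = env, factor

    def step(self, action: int) -> StepResult:
        obs, reward, done, info = self.env.step(action)
        return obs, reward * self.factor, done, info


class ClipReward:
    """Hypothetical wrapper that clips the inner environment's reward to [min_val, max_val]."""

    def __init__(self, env: Any, min_val: float, max_val: float):
        self.env, self.min_val, self.max_val = env, min_val, max_val

    def step(self, action: int) -> StepResult:
        obs, reward, done, info = self.env.step(action)
        return obs, min(max(self.min_val, reward), self.max_val), done, info


base = ConstantRewardEnv(reward=4.0)
# Scale first (4.0 -> 2.0), then clip (2.0 -> 1.0).
scale_then_clip = ClipReward(ScaleReward(base, factor=0.5), min_val=0.0, max_val=1.0)
# Clip first (4.0 -> 1.0), then scale (1.0 -> 0.5).
clip_then_scale = ScaleReward(ClipReward(base, min_val=0.0, max_val=1.0), factor=0.5)

print(scale_then_clip.step(0)[1], clip_then_scale.step(0)[1])
```

The innermost wrapper is applied first, so it is worth deciding deliberately where in a wrapper stack a reward transformation should sit.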

## Writing a Customized Policy

Now that we have a grip on writing a wrapper - how can we write a policy customized for an environment? This is a common use case and can be achieved with little overhead in Maze. We need:
* A class implementing the policy network.
* A `DistributionMapper` mapping our variables to probability distributions.
* The environment's action and observation space dictionaries.
* An instance of `PolicyComposer` that incorporates the different components.
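To give an intuition for what "mapping to probability distributions" means, here is a framework-free sketch. It assumes that the logits for a discrete action space define a categorical distribution via a softmax. The helpers `softmax` and `sample_action` are illustrative and not Maze's API.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert unnormalized logits into probabilities that sum to 1."""
    max_logit = max(logits)
    # Subtract the maximum for numerical stability; this does not change the result.
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits: List[float], rng: random.Random) -> int:
    """Sample an action index from the categorical distribution defined by the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Two actions with equal logits, as a policy network might output for CartPole.
logits = [0.0, 0.0]
probs = softmax(logits)
action = sample_action(logits, random.Random(0))
print(probs, action)
```

The policy network therefore only has to produce one logit per discrete action; turning the logits into a distribution and sampling from it is handled separately.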

### Implementing a Custom Policy Network

We choose a simple linear policy to demonstrate the workflow of training with your own policy.

In [5]:
from typing import Sequence, Dict, Tuple, Any, Optional, Union
import torch
from torch import nn

class CartPolePolicyNet(nn.Module):
    """ Simple linear policy net for demonstration purposes. """

    def __init__(self, obs_shapes: Dict[str, Sequence[int]], action_logit_shapes: Dict[str, Sequence[int]]):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(
                in_features=obs_shapes['observation'][0],
                out_features=action_logit_shapes['action'][0]
            )
        )

    def forward(self, x_dict: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        # Since x_dict has to be a dictionary in Maze, we extract the input for the network.
        x = x_dict['observation']

        # Do the forward pass.
        logits = self.net(x)

        # Since the return value has to be a dict again, put the
        # forward pass result into a dict with the correct key.
        logits_dict = {'action': logits}

        return logits_dict

### Assembling the `PolicyComposer`

The remaining components can be derived from the environment itself. We can instantiate our `PolicyComposer` instance:

In [6]:
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv
from maze.perception.models.policies.probabilistic_policy_composer import ProbabilisticPolicyComposer
from maze.distributions.distribution_mapper import DistributionMapper

# We instantiate an environment for easier access to its action and observation space properties.
env = GymMazeEnv("CartPole-v0")

# DistributionMapper can be derived from the action space's properties.
distribution_mapper = DistributionMapper(action_space=env.action_space, distribution_mapper_config={})

# We instantiate the policy.
policy_net = CartPolePolicyNet(
    obs_shapes={'observation': env.observation_space.spaces['observation'].shape},
    action_logit_shapes={'action': (env.action_space.spaces['action'].n,)}
)

# Assemble the policy composer.
policy_composer = ProbabilisticPolicyComposer(
    action_spaces_dict=env.action_spaces_dict,
    observation_spaces_dict=env.observation_spaces_dict,
    distribution_mapper=distribution_mapper,
    # Our environment only has one sub-step, so we can specify only one policy.
    networks=[policy_net],
    # We have only one agent and network, thus this is an empty list.
    substeps_with_separate_agent_nets=[],
    # We have only one sub-step and one agent.
    agent_counts_dict={0: 1}
)

### Training and Rollout

All that's left to do now is to inject the policy composer into our training setup and to start training.

In [7]:
from maze.api.run_context import RunContext

rc = RunContext(
    algorithm="ppo",
    silent=True,
    env=lambda: GymMazeEnv('CartPole-v0'),
    policy=policy_composer,
    runner="dev"
)
rc.train(n_epochs=1)



100%|██████████| 25/25 [00:02<00:00, 12.05it/s]


Let's check whether our custom policy has learned how to act in its environment.

In [8]:
from maze.utils.notebooks import rollout

n_episodes = 15
rewards = [rollout(rc.env_factory(), rc, 200) for _ in range(n_episodes)]
print("Mean return with #{ne} episodes: {rew}".format(ne=n_episodes, rew=sum(rewards) / len(rewards)))

Mean return with #15 episodes: 16.0


This performs considerably worse than the default policy, but still significantly better than a random one. This is not particularly surprising as our policy is quite simple.
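For reference, the mean return reported above is the sum of rewards per episode, averaged over episodes. Since `CartPole` yields a reward of 1 per step, the return equals the episode length. The sketch below illustrates this computation with a hypothetical toy environment (`CountdownEnv`) and helper (`episode_return`); it is not the implementation of Maze's `rollout` utility.

```python
from typing import Any, Callable, Dict, Tuple


class CountdownEnv:
    """Hypothetical environment: reward 1.0 per step, episode ends after `horizon` steps."""

    def __init__(self, horizon: int):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.t += 1
        return self.t, 1.0, self.t >= self.horizon, {}


def episode_return(env: CountdownEnv, policy: Callable[[int], int], max_steps: int) -> float:
    """Run one episode and sum the rewards until `done` or `max_steps` is reached."""
    observation, total = env.reset(), 0.0
    for _ in range(max_steps):
        observation, reward, done, _ = env.step(policy(observation))
        total += reward
        if done:
            break
    return total


# Three episodes with different lengths; the last one is cut off at max_steps.
returns = [
    episode_return(CountdownEnv(horizon=h), lambda obs: 0, max_steps=200)
    for h in (10, 20, 300)
]
mean_return = sum(returns) / len(returns)
print(returns, mean_return)
```

With a step limit of 200, as used in the rollout above, the return of a single episode cannot exceed 200 in such a setting.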

## Summary

This notebook shows how to...
* ...implement your own wrapper and apply it on an environment.
* ...implement, train and evaluate your own policy.

### What's next?

* This [step-by-step tutorial](https://maze-rl.readthedocs.io/en/latest/getting_started/step_by_step_tutorial.html) covers more advanced features such as action masking, KPIs, configuration with Hydra, metric visualization with Tensorboard etc. It also actually implements the environment logic instead of just wrapping an existing environment.
* [Part 4](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_4.ipynb) discusses how to customize environments.
* If you would like to see more notebooks covering other areas of Maze, feel free to [kick off a discussion on GitHub](https://github.com/enlite-ai/maze/discussions).