# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/enlite-ai/maze/blob/main/tutorials/notebooks/getting_started/getting_started_4_customization_envs) Maze: Getting Started Part IV - Customizing Environments

Part 4 of 4 in the *Getting started* series. We recommend reading [part 1](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_1.ipynb), [part 2](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_2.ipynb) and [part 3](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_3.ipynb) prior to this notebook.

## On Maze

MazeRL is an application-oriented Deep Reinforcement Learning (RL) framework, addressing real-world decision problems. If this caught your interest, check out
* the [Github repository](https://github.com/enlite-ai/maze),
* the [documentation](https://maze-rl.readthedocs.io/en/latest/index.html#documentation-overview) or
* the official [website](https://www.enlite.ai/).

## Introduction

The [previous notebook](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_4.ipynb) laid out how to write your own components, like policies and wrappers. This one will demonstrate how to create your own environment from scratch.

## Writing a Customized Environment in Maze

Maze aims to offer all the building blocks and mechanisms needed for the development of successful real-world reinforcement learning applications. One of the ways this philosophy is reflected in the design is that writing an environment in Maze involves several classes instead of a single one. This is because Maze splits up an environment's responsibilities between several components and couples them loosely.

This entails some overhead, but allows greater flexibility and out-of-the-box support of paradigms like multi-agent learning, hierarchical RL or multi-stepping (see [here](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/struct_envs/overview.html) for a more thorough introduction to Maze environments and the supported paradigms). This can be essential for solving complex real-world problems.
The documentation discusses this concept of decoupled environment components in more detail [here](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/env_hierarchy.html).

To showcase Maze environments, we will take a shortcut and simply wrap and re-implement our trusty `CartPole-v0` environment. This way we can explore the different components involved in building a Maze environment without the distraction of having to implement the actual environment logic.

The central elements for an environment in Maze are:
* A `CoreEnv` inheriting from `maze.core.env.core_env.CoreEnv` implementing the environment logic.
* A `MazeState` class encapsulating the environment state.
* A `MazeAction` class representing an action in the environment.
* An `ObservationConversion` translating instances of the internally used `CartPoleMazeState` into instances of `gym.spaces.Space` that can be processed by the policy-learning algorithm.
* An `ActionConversion` translating between instances of the internally used `CartPoleMazeAction` and the `gym.spaces.Space` representation produced by the policy-learning algorithm.
* A `MazeEnv` that associates the `CoreEnv` implementing the environment logic with an `ActionConversion` and an `ObservationConversion` defining how to represent actions and observations to the policy-learning algorithm.

![Central environment components.](https://maze-rl.readthedocs.io/en/latest/_images/observation_action_interfaces.png)
_<div align="center" style="line-height: 0px; margin-bottom: 25px">Central environment components.</div>_

Some components are not strictly necessary, but highly recommended, as they allow for a more modular and structured workflow. For our `CartPole-v0` we will include the following optional components:
* A `CartPoleEvents` class describing which events the environment can fire. This is necessary for proper logging.
* A `CartPoleRewardAggregator` that processes the events defined in `CartPoleEvents` to compute the reward.
* A `CartPoleRenderer` implementing rendering functionality.

Note that implementing these components is _not_ necessary for an existing Gym environment: Maze offers the `GymMazeEnv` we've already used in the previous notebooks. We are implementing these components as an example for writing environments in Maze from scratch.

### State

`CartPoleMazeState` is supposed to contain all information necessary to reproduce the environment's state. In the case of CartPole this boils down to the cart's position and velocity, the pole's angle and its tip's velocity.

In [1]:
import dataclasses

@dataclasses.dataclass
class CartPoleMazeState:
    """
    MazeState object for CartPole environment.
    """

    cart_position: float
    """Cart's position on x-axis."""
    cart_velocity: float
    """Cart's velocity."""
    pole_angle: float
    """Pole's angle."""
    pole_velocity: float
    """Pole's velocity at the tip."""

    def __repr__(self):
        return "CartPoleMazeState:\n\t- {cp}\n\t- {cv}\n\t- {pa}\n\t- {pv}".format(
            cp=self.cart_position,
            cv=self.cart_velocity,
            pa=self.pole_angle,
            pv=self.pole_velocity
        )
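As a quick usage sketch, the state can be built from the four values of a Gym observation and turned back into a plain dictionary with `dataclasses.asdict`. The class is repeated here (without the custom `__repr__`) so that the snippet runs on its own; the observation values are made up for illustration.

```python
import dataclasses


@dataclasses.dataclass
class CartPoleMazeState:
    """Same fields as above, repeated so that this snippet is self-contained."""

    cart_position: float
    cart_velocity: float
    pole_angle: float
    pole_velocity: float


# Made-up observation in Gym's CartPole order: position, velocity, angle, tip velocity.
observation = [0.02, -0.01, 0.03, 0.04]

# Unpack the observation into the named fields of the state.
state = CartPoleMazeState(*observation)

# asdict() returns the fields in declaration order, which is handy for logging.
state_dict = dataclasses.asdict(state)

print(state.pole_angle)
print(state_dict)
```

Named fields make the environment logic easier to read than indexing into a raw array, which is the main reason for keeping a dedicated state class.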

### Action

`CartPoleMazeAction` encapsulates an action in our environment. We know from the [official description](http://gym.openai.com/envs/CartPole-v0/) that we can move either leftwards or rightwards.

In [None]:
@dataclasses.dataclass
class CartPoleMazeAction:
    """
    MazeAction object for CartPole environment.
    """

    force: int
    """0 for move towards left or 1 for move towards right."""

    def __repr__(self):
        return "CartPoleMazeAction:\n\t- {f}".format(f=self.force)

Why do we represent an action like this instead of using the more common Gym space format directly? While it would be easier to do the latter in this case, in more complex use cases it is helpful to have complete freedom in how actions are represented internally: it increases maintainability and shortens development cycles.
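To illustrate this freedom, here is a hypothetical variant of the action class that uses named directions internally while still exposing the integer index Gym expects. `Direction` and `DirectedAction` are names invented for this sketch; they are not part of Maze or of the rest of this notebook.

```python
import dataclasses
import enum


class Direction(enum.IntEnum):
    """Hypothetical named directions; values match CartPole's discrete action indices."""

    LEFT = 0
    RIGHT = 1


@dataclasses.dataclass(frozen=True)
class DirectedAction:
    """Hypothetical internal action: readable in environment code, trivially convertible to an index."""

    direction: Direction

    @property
    def force(self) -> int:
        """Index expected by the Gym CartPole environment (0 = left, 1 = right)."""
        return int(self.direction)


action = DirectedAction(Direction.RIGHT)
print(action.direction.name, action.force)
```

The environment logic can then branch on `Direction.LEFT` or `Direction.RIGHT`, and only the conversion layer needs to know about the integer encoding.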

### Observation/State Conversion

An `ObservationConversion` is a class that transforms the unstructured `CartPoleMazeState` used inside `CartPoleCoreEnv` into a structured format understood by the policy-learning algorithms, namely OpenAI's Gym spaces. This process is described further [here](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/env_hierarchy.html#env-hierarchy-interfaces).

In [None]:
import sys
from typing import Dict
import numpy as np
import gym
from maze.core.env.observation_conversion import ObservationConversionInterface

class CartPoleObservationConversion(ObservationConversionInterface):
    """
    ObservationConversion for CartPole.
    """

    def maze_to_space(self, maze_state: CartPoleMazeState) -> Dict[str, np.ndarray]:
        """
        Converts MazeState objects to gym space objects.
        :param maze_state: Observation in MazeState format.
        :return: Observation in gym space format.
        """

        return {
            "observation": np.asarray(
                [
                    maze_state.cart_position, maze_state.cart_velocity, maze_state.pole_angle,
                    maze_state.pole_velocity
                ],
                dtype=np.float32
            )
        }

    def space_to_maze(self, observation: Dict[str, np.ndarray]) -> CartPoleMazeState:
        """
        Converts gym space objects to MazeState objects.
        Note: It is not required - but occasionally practical -  to implement this method.
        :param observation: Observation in gym space format.
        :return: Observation in MazeState format.
        """

        return CartPoleMazeState(
            cart_position=observation["observation"][0],
            cart_velocity=observation["observation"][1],
            pole_angle=observation["observation"][2],
            pole_velocity=observation["observation"][3]
        )

    def space(self) -> gym.spaces.Dict:
        """
        Defines gym observation space.
        An easy and straightforward way to represent the four possible variables - cart position and velocity, pole
        angle and velocity - is to encode them in a single gym.spaces.Box.
        Maze requires all gym space representation to be a dictionary, so we wrap this Box in one.
        :return: Observation space in gym format.
        """

        return gym.spaces.Dict({"observation": gym.spaces.Box(low=-sys.maxsize, high=sys.maxsize, shape=(4,), dtype=np.float32)})

### Action Conversion

Analogously, we implement a class converting `CartPoleMazeAction` to `gym.spaces` representations and vice versa.

In [None]:
from typing import Dict

import gym
from maze.core.env.action_conversion import ActionConversionInterface

class CartPoleActionConversion(ActionConversionInterface):
    """
    Specifies conversion between space actions and MazeActions.
    """

    def space_to_maze(self, action: Dict[str, int], maze_state: CartPoleMazeState) -> CartPoleMazeAction:
        """
        Converts gym space objects to MazeAction objects.
        :param action: Action in gym space format.
        :param maze_state: Current MazeState.
        :return: Action in MazeAction format.
        """

        return CartPoleMazeAction(action["action"])

    def maze_to_space(self, maze_action: CartPoleMazeAction) -> Dict[str, int]:
        """
        Converts MazeAction objects to gym space objects.
        Note: It is not required - but occasionally practical -  to implement this method.
        :param maze_action: Action in MazeAction format.
        :return: Action in gym space format.
        """

        return {"action": maze_action.force}

    def space(self) -> gym.spaces.Dict:
        """
        Defines gym action space.
        :return: Action space in gym format. 0 -> apply force towards the left, 1 -> apply force towards the right.
        """

        # return gym.spaces.Dict({"action": gym.spaces.Box(shape=(1,), low=0, high=1, dtype=np.int)})
        return gym.spaces.Dict({"action": gym.spaces.Discrete(2)})
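The round trip between the two representations can be checked without Maze or Gym installed by dropping the base class. The following is only a sketch: `PlainActionConversion` is a stand-in invented for this purpose, and it omits the `maze_state` argument and the `space()` method of the real interface.

```python
import dataclasses
from typing import Dict


@dataclasses.dataclass
class CartPoleMazeAction:
    """Repeated here so that the snippet is self-contained."""

    force: int


class PlainActionConversion:
    """Stand-in for CartPoleActionConversion without the Maze base class (sketch only)."""

    def space_to_maze(self, action: Dict[str, int]) -> CartPoleMazeAction:
        # The policy emits a dictionary with a single discrete entry.
        return CartPoleMazeAction(force=int(action["action"]))

    def maze_to_space(self, maze_action: CartPoleMazeAction) -> Dict[str, int]:
        return {"action": maze_action.force}


conversion = PlainActionConversion()
for index in (0, 1):
    policy_output = {"action": index}
    maze_action = conversion.space_to_maze(policy_output)
    # Converting back must reproduce the policy output exactly.
    assert conversion.maze_to_space(maze_action) == policy_output
```

A check like this is cheap to write for any conversion pair and catches mismatched keys or encodings early.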

### Event Interface

Maze includes a customizable event-logging system that monitors and keeps track of all events taking place during a run. This makes debugging RL applications less painful and facilitates a better understanding of agents' trajectories and behaviour.
_Event_ in this context refers to a specific environment state that is of interest for reward & statistics computation as well as debugging purposes. Events are emitted by the environment. See [here](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/event_system.html#event-system) for a more thorough introduction to the event system.

![Event logging in Maze.](https://maze-rl.readthedocs.io/en/latest/_images/logging_overview.png)
_<div align="center" style="line-height: 0px; margin-bottom: 25px">Information flow for event logging.</div>_

In the case of `CartPole-v0` specifically we could define three informative events:
* The cart moves. Relevant properties: Cart's position, pole's angle.
* The cart exceeds its allowed positional bounds. No relevant properties.
* The pole exceeds its allowed angular bounds. No relevant properties.

Our environment will fire these events whenever the corresponding state is detected. How can we define events? We implement an interface in which each method reflects one of the events we want to log. This allows our `CartPoleCoreEnv` and `CartPoleRewardAggregator` to hook into Maze's event logging system using these events.

In [None]:
import abc
from maze.core.log_stats.event_decorators import define_episode_stats, define_epoch_stats, define_step_stats

class CartPoleEvents(abc.ABC):
    """
    Events for CartPole-v0 environment.
    """

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def moved(self, cart_position: float, pole_angle: float):
        """
        Indicates cart pole movement.
        :param cart_position: Cart's position
        :param pole_angle: Pole's angle.
        """

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def cart_out_of_bounds(self):
        """
        Indicates cart being out of bounds.
        """

    @define_epoch_stats(np.mean, output_name="mean_episode_total")
    @define_episode_stats(sum)
    @define_step_stats(len)
    def pole_out_of_bounds(self):
        """
        Indicates pole being out of bounds.
        """

The decorators `@define_epoch_stats`, `@define_episode_stats` and `@define_step_stats` determine which statistics Maze will compute and log for these events per epoch, episode and step respectively. Our setting here sets up Maze to capture the number of event occurrences per step, the total number of event occurrences per episode, and the mean total number of event occurrences over all episodes per epoch.
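The three aggregation levels can be mimicked in plain Python. The snippet below is only an illustration of the `len` / `sum` / mean cascade configured above, not Maze's internal implementation; the nested list of event names is a made-up record of one epoch.

```python
from statistics import mean
from typing import List

# Hypothetical record of one epoch: episodes -> steps -> events fired in that step.
# Each inner list holds the `moved` events of a single step.
epoch: List[List[List[str]]] = [
    [["moved"], ["moved"], ["moved"]],                        # episode 1: 3 steps
    [["moved"], ["moved"], ["moved"], ["moved"], ["moved"]],  # episode 2: 5 steps
]

# Step level, as with define_step_stats(len): number of events per step.
step_stats = [[len(step_events) for step_events in episode] for episode in epoch]

# Episode level, as with define_episode_stats(sum): total number of events per episode.
episode_stats = [sum(per_step) for per_step in step_stats]

# Epoch level, as with define_epoch_stats(np.mean): mean of the episode totals.
mean_episode_total = mean(episode_stats)

print(step_stats)
print(episode_stats)
print(mean_episode_total)
```

Since `moved` is meant to fire once per step, its `mean_episode_total` corresponds to the mean episode length in this toy record.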

This system is highly customizable. Discussing it in detail is out of scope for this tutorial. We recommend reading the [documentation for further information on event logging and statistics computation](https://maze-rl.readthedocs.io/en/latest/logging/event_kpi_logging.html#event-kpi-log).

### Reward Aggregator

Maze utilizes the event logging system introduced above to compute rewards. For this purpose we implement a _reward aggregator_: An entity that listens to fired events and computes a step reward from them. The documentation gives an [overview of reward computation and customization](https://maze-rl.readthedocs.io/en/latest/environment_customization/reward_aggregation.html#reward-aggregation).

![Reward aggregation in Maze.](https://maze-rl.readthedocs.io/en/latest/_images/reward_aggregation.png)
_<div align="center" style="line-height: 0px; margin-bottom: 25px">Information flow for reward computation and aggregation.</div>_

As in the original environment, we assign a positive reward of +1 if both cart and pole are within their bounds. If an out-of-bounds event for either is triggered, we assign a reward of 0.

In [None]:
from typing import List, Type, Optional
from maze.core.env.reward import RewardAggregatorInterface
from maze.core.env.maze_state import MazeStateType


class CartPoleRewardAggregator(RewardAggregatorInterface):
    """
    Event aggregation object dealing with CartPole rewards.
    """

    def summarize_reward(self, maze_state: Optional[MazeStateType] = None) -> float:
        """
        Computes reward for current step. We always provide a reward of +1 except when pole or cart exceed their bounds,
        in which case we return 0.
        :return: Reward for current step.
        """

        # Fetch all emitted events.
        cart_oob_events = list(self.query_events([CartPoleEvents.cart_out_of_bounds]))
        pole_oob_events = list(self.query_events([CartPoleEvents.pole_out_of_bounds]))

        return int(len(cart_oob_events) == 0 and len(pole_oob_events) == 0)

    def get_interfaces(self) -> List[Type[abc.ABC]]:
        """
        This returns all interfaces whose events this reward aggregator subscribes to.
        :return: List of the event classes that contain events relevant for this reward aggregator.
        """

        return [CartPoleEvents]
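Stripped of the Maze machinery, the reward rule itself is small. The following dependency-free sketch represents the events of one step as a list of strings, which is an assumption made for illustration; in Maze the events are queried from the event system as shown above.

```python
from typing import List


def summarize_reward(fired_events: List[str]) -> float:
    """Return 1.0 unless an out-of-bounds event was fired in this step (plain-Python sketch)."""
    out_of_bounds = {"cart_out_of_bounds", "pole_out_of_bounds"}
    return 0.0 if out_of_bounds.intersection(fired_events) else 1.0


# A regular step only reports movement and is rewarded.
print(summarize_reward(["moved"]))

# A step in which the pole tips too far yields no reward.
print(summarize_reward(["moved", "pole_out_of_bounds"]))
```

Because the reward depends only on events, changing the reward scheme does not require touching the environment logic.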

### Renderer

The last component to consider is the renderer. A `maze.core.rendering.renderer.Renderer` is designed to `render()`
based on the current state, action and triggered events. In our case none of this is necessary, as we only wrap the
Gym environment's rendering functionality. If you write your own environment though, access to all three of these data
points may come in handy to properly visualize what's going on in your environment.

In [None]:
from typing import Optional, List
import gym.envs.classic_control.cartpole
from maze.core.log_events.step_event_log import StepEventLog
from maze.core.rendering.renderer import Renderer
from maze.core.rendering.renderer_args import OptionsArrayArg

class CartPoleRenderer(Renderer):
    """
    Renderer for CartPole environment.
    """

    def render(
        self,
        maze_state: CartPoleMazeState,
        maze_action: Optional[CartPoleMazeAction],
        events: StepEventLog,
        gym_env: gym.envs.classic_control.cartpole.CartPoleEnv
    ) -> None:
        """
        Wraps gym.envs.classic_control.cartpole.CartPoleEnv.render().
        Implementation of :py:meth:`~maze.core.rendering.renderer.Renderer.render`.
        :param maze_state: Current MazeState.
        :param maze_action: Maze action applied in the current step.
        :param events: Collection of all events triggered in this step.
        :param gym_env: Gym environment.
        """

        gym_env.render()

    def arguments(self) -> List[OptionsArrayArg]:
        """
        Exposing available argument options like this makes it possible to create appropriate user controls when
        controlling the renderer in interactive settings (e.g., using widgets in a Jupyter Notebook).
        In the case of CartPole there is no need for such controls/arguments.
        :return: List of specifiable arguments.
        """

        return []

### Core Environment

Having implemented all the components described above, we merge them with the environment logic in a `CoreEnv`. Since we simply wrap the existing Gym `CartPole-v0` environment here, we bypass the implementation of the internal logic. If you however write your own environment, the `CoreEnv` is the place to do so.

In [None]:
import sys
from typing import Tuple, Union, Dict, Any, Optional
import numpy as np
import copy

from maze.core.env.core_env import CoreEnv
from maze.core.env.maze_state import MazeStateType
from maze.core.events.pubsub import Pubsub
from maze.core.rendering.renderer import Renderer
from maze.core.env.structured_env import StepKeyType, ActorID


class CartPoleCoreEnv(CoreEnv):
    """
    Core environment for CartPole.
    """

    def __init__(self):
        super().__init__()

        # Set up Gym environment and state variables.
        self.gym_env = gym.make('CartPole-v0')
        self.cart_position, self.cart_velocity, self.pole_angle, self.pole_velocity = list(self.gym_env.reset())
        self.done = False

        # Set up event system. Necessary e.g. for reward aggregation.
        self.pubsub = Pubsub(self.context.event_service)

        # Set up reward aggregator.
        self.reward_aggregator = CartPoleRewardAggregator()
        self.pubsub.register_subscriber(self.reward_aggregator)

        # Set up renderer.
        self.renderer = CartPoleRenderer()

    def step(self, maze_action: CartPoleMazeAction) -> Tuple[CartPoleMazeState, int, bool, Dict[Any, Any]]:
        """
        Steps through environment.
        :param maze_action: Action to take.
        :return: State, reward, done flag, info dictionary.
        """

        # Step through Gym environment.
        observation, reward, done, info = self.gym_env.step(maze_action.force)
        self.cart_position, self.cart_velocity, self.pole_angle, self.pole_velocity = list(observation)

        return self.get_maze_state(), reward, done, info

    def reset(self) -> MazeStateType:
        """
        Resets environment.
        """

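        # Store the initial observation so that get_maze_state() does not return the stale pre-reset state.
        self.cart_position, self.cart_velocity, self.pole_angle, self.pole_velocity = list(self.gym_env.reset())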
        return self.get_maze_state()

    def seed(self, seed: int) -> None:
        """
        Implementation of :py:meth:`~maze.core.env.core_env.CoreEnv.seed`.
        """

        self.gym_env.seed(seed)

    def close(self) -> None:
        """
        Closes environment.
        """

        self.gym_env.close()

    def get_maze_state(self) -> CartPoleMazeState:
        """
        Return current MazeState.
        """

        return CartPoleMazeState(
            cart_position=self.cart_position,
            cart_velocity=self.cart_velocity,
            pole_angle=self.pole_angle,
            pole_velocity=self.pole_velocity
        )

    def get_serializable_components(self) -> Dict[str, Any]:
        """
        List components that should be serialized as part of trajectory data. Not necessary for CartPole.
        :return: Serializable components.
        """

        return {}

    def get_renderer(self) -> Renderer:
        """
        Returns renderer.
        :return: Renderer.
        """

        return self.renderer

    def actor_id(self) -> Tuple[Union[str, int], int]:
        """
        Currently active actor, i.e. sub-step key and agent ID. Trivial for CartPole, since there is only one sub-step
        key and one actor, which are both represented by index 0.
        """

        return ActorID(0, 0)

    def is_actor_done(self) -> bool:
        """
        We check whether current agent is done, since this is a single-policy environment.
        Implementation of :py:meth:`~maze.core.env.core_env.CoreEnv.is_actor_done`.
        """

        return self.done

    @property
    def agent_counts_dict(self) -> Dict[StepKeyType, int]:
        """
        Returns agent counts per substep. Trivial for CartPole, as we only have one agent and substep.
        :return: Dictionary with agent count per substep.
        """

        return {0: 1}

    def clone_from(self, env: 'CartPoleCoreEnv') -> None:
        """
        Clone from other core environment.
        """

        self.gym_env = copy.deepcopy(env.gym_env)

Note that we don't fully implement some of Maze's features here (for example, `step()` does not fire any of the events defined above), as doing so would add non-trivial code to the example. For a more thorough example see the [step-by-step tutorial in the documentation](https://maze-rl.readthedocs.io/en/latest/getting_started/step_by_step_tutorial.html).
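For orientation, this is roughly what the detection logic for the three events from the *Event Interface* section could look like. It is a minimal sketch under two assumptions: the bounds are those used by Gym's CartPole implementation (cart position beyond ±2.4, pole angle beyond ±12 degrees), and `detect_events` is a hypothetical helper that returns event names as strings instead of firing them through Maze's event system.

```python
import math
from typing import List

# Bounds as used by Gym's CartPole implementation.
X_THRESHOLD = 2.4
THETA_THRESHOLD_RADIANS = 12 * 2 * math.pi / 360


def detect_events(cart_position: float, pole_angle: float) -> List[str]:
    """Hypothetical helper: names of the events a step with this resulting state would fire."""
    # Every step moves the cart, so `moved` is always reported.
    events = ["moved"]
    if abs(cart_position) > X_THRESHOLD:
        events.append("cart_out_of_bounds")
    if abs(pole_angle) > THETA_THRESHOLD_RADIANS:
        events.append("pole_out_of_bounds")
    return events


print(detect_events(0.0, 0.0))
print(detect_events(2.5, 0.0))
print(detect_events(0.0, 0.25))
```

In a full implementation, `step()` would run a check like this after updating the state and fire the corresponding `CartPoleEvents` methods, which the reward aggregator then picks up.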

### Maze Environment

Finally, we define `CartPoleMazeEnv`, which unites the core environment with action and observation conversion classes.

In [None]:
from maze.core.env.maze_env import MazeEnv


class CartPoleMazeEnv(MazeEnv[CartPoleCoreEnv]):
    """
    Environment for CartPole.
    """

    def __init__(
        self,
        core_env: CartPoleCoreEnv,
        action_conversion: CartPoleActionConversion,
        observation_conversion: CartPoleObservationConversion
    ):
        super().__init__(
            core_env=core_env,
            # Maze allows binding conversion classes to specific sub-steps. Since we only have one sub-step in our
            # CartPole example, we reference it with 0.
            action_conversion_dict={0: action_conversion},
            observation_conversion_dict={0: observation_conversion}
        )

And that's a wrap! We are ready to train.

### Training and Rollout

Now that everything is in place, we train a PPO agent on our environment.

In [None]:
from maze.utils.notebooks import rollout
from maze.api.run_context import RunContext

rc = RunContext(
    algorithm="ppo",
    silent=True,
    env=lambda: CartPoleMazeEnv(CartPoleCoreEnv(), CartPoleActionConversion(), CartPoleObservationConversion())
)
rc.train(n_epochs=1)


Did our policy learn? We established in [part I](https://colab.research.google.com/github/enlite-ai/maze/blob/master/tutorials/notebooks/getting_started/getting_started_1.ipynb) that a random policy achieves a mean return of around 20. Let's see how our policy does.

In [None]:
n_episodes = 5
rewards = [rollout(rc.env_factory(), rc, 200) for _ in range(n_episodes)]
print("Mean return with #{ne} episodes: {rew}".format(ne=n_episodes, rew=sum(rewards) / len(rewards)))

The reward is significantly higher than for the random policy, so the agent does indeed learn with our exemplary `CartPoleMazeEnv`.

At this point you know everything you need to get started with your own environment and are able to use the basic environment features Maze provides. If you want to take a deeper dive, check out the various links to the documentation mentioned previously in this notebook (and also in the _What's next_ section at the end of this notebook).

## Summary

This notebook...
* ...introduces some fundamental concepts in Maze such as event logging and the environment hierarchy and composition.
* ...shows how to implement your own environment with all necessary prerequisites and how to train and evaluate on it.

### What's next?

* The documentation explains [environments in Maze](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/struct_envs/overview.html) and [their components](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/env_hierarchy.html) in greater detail than we do here. If you're looking to utilize Maze's more advanced features, we definitely recommend reading these articles first to get a better feel for the concepts involved.
* This [step-by-step tutorial](https://maze-rl.readthedocs.io/en/latest/getting_started/step_by_step_tutorial.html) covers more advanced features such as action masking, KPIs, configuration with Hydra, metric visualization with Tensorboard etc. It also actually implements the environment logic instead of just wrapping an existing environment.
* If you would like to see more notebooks covering other areas of Maze, feel free to [kick off a discussion on Github](https://github.com/enlite-ai/maze/discussions).