# Maze tutorial

In this tutorial, we tackle the maze problem.
We use this classic game to demonstrate how
- to easily create a new scikit-decide domain
- to find solvers from the scikit-decide hub matching its characteristics
- to apply a scikit-decide solver to a domain
- to create your own rollout function to play a trained solver on a domain


Notes:
- In order to focus on scikit-decide use, we put some code not directly related to the library in a [separate module](./maze_utils.py) (like maze generation and display).
- A similar maze domain is already defined in [scikit-decide hub](https://github.com/airbus/scikit-decide/blob/master/skdecide/hub/domain/maze/maze.py) but we do not use it for the sake of this tutorial.
- **Special notice for binder + sb3:**
It seems that [stable-baselines3](https://stable-baselines3.readthedocs.io/en/master/) algorithms are extremely slow on [binder](https://mybinder.org/). We could not find a proper explanation for it. We strongly advise you either to launch the notebook locally or on colab, or to skip the cells that use sb3 algorithms (here the PPO solver).


Concerning the python kernel to use for this notebook:
- If running locally, be sure to use an environment with scikit-decide[all] and minizinc.
- If running on colab, the next cell does it for you and downloads the utility module used in this notebook.
- If running on binder, the environment should be ready.

In [None]:
# On Colab: install the library
on_colab = "google.colab" in str(get_ipython())
if on_colab:
    import glob
    import importlib
    import json
    import os
    import sys

    using_nightly_version = True

    if using_nightly_version:
        # look for nightly build download url
        release_curl_res = !curl -L   -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/repos/airbus/scikit-decide/releases/tags/nightly
        release_dict = json.loads(release_curl_res.s)
        release_download_url = sorted(
            release_dict["assets"], key=lambda d: d["updated_at"]
        )[-1]["browser_download_url"]
        print(release_download_url)

        # download and unzip
        !wget --output-document=release.zip {release_download_url}
        !unzip -o release.zip

        # get proper wheel name according to python version used
        wheel_pythonversion_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
        wheel_path = glob.glob(
            f"dist/scikit_decide*{wheel_pythonversion_tag}*manylinux*.whl"
        )[0]

        skdecide_pip_spec = f"{wheel_path}[all]"
    else:
        skdecide_pip_spec = "scikit-decide[all]"

    # uninstall google protobuf conflicting with ray and sb3
    ! pip uninstall -y protobuf

    # install scikit-decide with all extras
    !pip install {skdecide_pip_spec}

    # be sure to load the proper cffi (downgraded compared to the one initially on colab)
    import cffi

    importlib.reload(cffi)

    # install and configure minizinc
    !curl -o minizinc.AppImage -L https://github.com/MiniZinc/MiniZincIDE/releases/download/2.6.3/MiniZincIDE-2.6.3-x86_64.AppImage
    !chmod +x minizinc.AppImage
    !./minizinc.AppImage --appimage-extract
    os.environ["PATH"] = f"{os.getcwd()}/squashfs-root/usr/bin/:{os.environ['PATH']}"
    os.environ[
        "LD_LIBRARY_PATH"
    ] = f"{os.getcwd()}/squashfs-root/usr/lib/:{os.environ['LD_LIBRARY_PATH']}"

    # download notebook utils
    !wget https://raw.githubusercontent.com/airbus/scikit-decide/master/notebooks/maze_utils.py

In [None]:
from enum import Enum
from math import sqrt
from time import sleep
from typing import Any, NamedTuple, Optional, Union

from IPython.display import clear_output, display

# import Maze class from utility file for maze generation and display
from maze_utils import Maze
from stable_baselines3 import PPO

from skdecide import DeterministicPlanningDomain, Solver, Space, Value
from skdecide.builders.domain import Renderable, UnrestrictedActions
from skdecide.hub.solver.astar import Astar
from skdecide.hub.solver.stable_baselines import StableBaseline
from skdecide.hub.space.gym import EnumSpace, ListSpace, MultiDiscreteSpace
from skdecide.utils import match_solvers

# choose standard matplotlib inline backend to render plots
%matplotlib inline

## About the maze problem
The maze problem consists in making an agent find the goal in a maze by moving up, down, left, or right without going through walls.

We show you such a maze by using the Maze class defined in the [maze module](./maze_utils.py). Here the agent starts at the top-left corner and the goal is at the bottom-right corner of the maze. The following colour convention is used:
- dark purple: walls
- yellow: empty cells
- light green: goal
- blue: current position

In [None]:
# size of maze
width = 25
height = 19
# generate the maze
maze = Maze.generate_random_maze(width=width, height=height)
# starting position
entrance = 1, 1
# goal position
goal = height - 2, width - 2
# render the maze
ax, image = maze.render(current_position=entrance, goal=goal)
display(image.figure)

## MazeDomain definition

In this section, we will wrap the Maze utility class so that it will be recognized as a scikit-decide domain. Several steps are needed.

### States and actions
We begin by defining the state space (agent positions) and action space (agent movements).

In [None]:
class State(NamedTuple):
    x: int
    y: int


class Action(Enum):
    up = 0
    down = 1
    left = 2
    right = 3
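
As a quick check of why these types are convenient, the sketch below redefines the same two classes on their own and shows that states are compared by value and are hashable, which lets a search algorithm keep visited positions in sets or dictionaries. The names `s`, `visited`, and `all_actions` are only used for this illustration.

```python
from enum import Enum
from typing import NamedTuple


class State(NamedTuple):
    x: int
    y: int


class Action(Enum):
    up = 0
    down = 1
    left = 2
    right = 3


s = State(1, 1)
# NamedTuple instances compare by value and are hashable
visited = {s}
print(State(1, 1) in visited)
# fields can be read by name or unpacked like a plain tuple
x, y = s
# an Enum can be iterated over to list all possible actions
all_actions = list(Action)
print(x, y, [a.name for a in all_actions])
```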

### Domain type
Then we define the domain type from a base template (`DeterministicPlanningDomain`) with optional refinements (`UnrestrictedActions` and `Renderable`). This corresponds to the following characteristics:
- `DeterministicPlanningDomain`:
    - only one agent
    - deterministic starting state
    - handle only actions
    - actions are sequential
    - deterministic transitions
    - white box transition model
    - goal states are defined
    - positive costs (i.e. negative rewards)
    - fully observable
- `UnrestrictedActions`: all actions are available at each step
- `Renderable`: can be displayed

We also specify the types of states, observations, events, transition values, etc.

This is needed so that solvers know how to work properly with this domain, and it also helps IDEs or Jupyter offer smart code completion.

In [None]:
class D(DeterministicPlanningDomain, UnrestrictedActions, Renderable):
    T_state = State  # Type of states
    T_observation = State  # Type of observations
    T_event = Action  # Type of events
    T_value = float  # Type of transition values (rewards or costs)
    T_predicate = bool  # Type of logical checks
    T_info = None  # Type of additional information in environment outcome
    T_agent = Union  # Inherited from SingleAgent

### Actual domain class
We can now implement the maze domain by 
- deriving from the above domain type
- filling all non-implemented methods 
- adding a constructor to define the maze & start/end positions.

We also define (to help solvers that can make use of it)
- a heuristic for search algorithms


*NB: To find out which methods are not yet implemented, one can use either an IDE that detects them automatically or the [code generators](https://airbus.github.io/scikit-decide/guide/codegen.html) page of the online documentation, which generates the corresponding boilerplate code.*

In [None]:
class MazeDomain(D):
    """Maze scikit-decide domain

    Attributes:
        start: the starting position
        end: the goal to reach
        maze: underlying Maze object

    """

    def __init__(self, start: State, end: State, maze: Maze):
        self.start = start
        self.end = end
        self.maze = maze
        # display
        self._image = None  # image to update when rendering the maze
        self._ax = None  # subplot in which the maze is rendered

    def _get_next_state(self, memory: D.T_state, action: D.T_event) -> D.T_state:
        """Get the next state given a memory and action.

        Move agent according to action (except if bumping into a wall).

        """

        next_x, next_y = memory.x, memory.y
        if action == Action.up:
            next_x -= 1
        if action == Action.down:
            next_x += 1
        if action == Action.left:
            next_y -= 1
        if action == Action.right:
            next_y += 1
        return (
            State(next_x, next_y)
            if self.maze.is_an_empty_cell(next_x, next_y)
            else memory
        )

    def _get_transition_value(
        self,
        memory: D.T_state,
        action: D.T_event,
        next_state: Optional[D.T_state] = None,
    ) -> Value[D.T_value]:
        """Get the value (reward or cost) of a transition.

        Set cost to 1 when moving (energy cost)
        and to 2 when bumping into a wall (damage cost).

        """
        # moving costs 1; staying in place means the agent bumped into a wall and costs 2
        return Value(cost=1 if next_state != memory else 2)

    def _get_initial_state_(self) -> D.T_state:
        """Get the initial state.

        Set the start position as initial state.

        """
        return self.start

    def _get_goals_(self) -> Space[D.T_observation]:
        """Get the domain goals space (finite or infinite set).

        Set the end position as goal.

        """
        return ListSpace([self.end])

    def _is_terminal(self, state: State) -> D.T_predicate:
        """Indicate whether a state is terminal.

        Stop an episode only when goal reached.

        """
        return self._is_goal(state)

    def _get_action_space_(self) -> Space[D.T_event]:
        """Define action space."""
        return EnumSpace(Action)

    def _get_observation_space_(self) -> Space[D.T_observation]:
        """Define observation space."""
        return MultiDiscreteSpace(
            nvec=[self.maze.height, self.maze.width], element_class=State
        )

    def _render_from(self, memory: State, **kwargs: Any) -> Any:
        """Render visually the maze.

        Returns:
            matplotlib figure

        """
        # store used matplotlib subplot and image to only update them afterwards
        self._ax, self._image = self.maze.render(
            current_position=memory,
            goal=self.end,
            ax=self._ax,
            image=self._image,
        )
        return self._image.figure

    def heuristic(self, s: D.T_state) -> Value[D.T_value]:
        """Heuristic to be used by search algorithms.

        Here Euclidean distance to goal.

        """
        return Value(cost=sqrt((self.end.x - s.x) ** 2 + (self.end.y - s.y) ** 2))
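
To see the transition and cost rules in isolation, here is a minimal sketch that depends neither on scikit-decide nor on `maze_utils`. The grid `GRID` and the helpers `is_empty`, `next_position`, and `cost` are hypothetical stand-ins for `Maze.is_an_empty_cell`, `_get_next_state`, and `_get_transition_value`; they only mimic the logic described in the docstrings above.

```python
# Hypothetical 3x4 grid: 1 = wall, 0 = empty cell (x = row index, y = column index)
GRID = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]

# Row/column offsets for each move, with the same convention as the domain above
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_empty(x, y):
    """Return True if (x, y) is inside the grid and is not a wall."""
    return 0 <= x < len(GRID) and 0 <= y < len(GRID[0]) and GRID[x][y] == 0


def next_position(pos, move):
    """Move if the target cell is empty, otherwise stay in place."""
    dx, dy = MOVES[move]
    target = (pos[0] + dx, pos[1] + dy)
    return target if is_empty(*target) else pos


def cost(pos, new_pos):
    """Cost 1 for an actual move, 2 when the agent bumped into a wall."""
    return 1 if new_pos != pos else 2


start = (1, 1)
moved = next_position(start, "right")  # free cell: the agent moves
bumped = next_position(start, "up")  # wall: the agent stays put
print(moved, cost(start, moved))
print(bumped, cost(start, bumped))
```

Keeping the agent in place on a wall hit, rather than forbidding the action, is what allows the domain to declare `UnrestrictedActions`.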

### Domain factory

To use scikit-decide solvers on the maze problem, we will need a domain factory recreating the domain at will. 

Indeed the method `solve_with()` used [later](#Training-solver-on-the-domain) needs such a domain factory so that parallel solvers can create identical domains on separate processes. 
(Even though we do not use parallel solvers in this particular notebook.)

Here is such a domain factory reusing the maze created in the [first section](#About-the-maze-problem). We render the maze again using the `render` method of the wrapping domain.

In [None]:
# define start and end state from tuples defined above
start = State(*entrance)
end = State(*goal)
# domain factory
domain_factory = lambda: MazeDomain(maze=maze, start=start, end=end)
# instantiate the domain
domain = domain_factory()
# init the start position
domain.reset()
# display the corresponding maze
display(domain.render())
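
The idea behind a factory can be sketched without scikit-decide: it is simply a zero-argument callable that returns a new object on each call, so that every consumer (for instance each worker process) gets its own instance. `ToyDomain` and `toy_factory` below are hypothetical names used only for this illustration.

```python
class ToyDomain:
    """Hypothetical stand-in for a domain holding some mutable episode state."""

    def __init__(self, start):
        self.start = start
        self.position = start


# A factory is just a zero-argument callable returning a new domain
toy_factory = lambda: ToyDomain(start=(1, 1))

first = toy_factory()
second = toy_factory()

# Moving in one instance does not affect the other one
first.position = (1, 2)
print(first is second, second.position)
```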

## Solvers

### Finding suitable solvers
The library hub includes a lot of solvers. We can use the `match_solvers` function to show the available solvers that fit the characteristics of the defined domain, according to the mixin classes used to define the [domain type](#Domain-type).

In [None]:
match_solvers(domain=domain)
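
The general idea of matching on mixin classes can be illustrated with a toy sketch. This is not how `match_solvers` is implemented; the characteristic classes, the solver names, and the `matching_solvers` helper below are all hypothetical and only show how subclass checks can express "this domain has every characteristic the solver requires".

```python
# Hypothetical characteristic mixins (toy re-declarations, not the library ones)
class SingleAgent:
    pass


class Goals:
    pass


class PositiveCosts:
    pass


class Constrained:
    pass


class ToyMazeDomain(SingleAgent, Goals, PositiveCosts):
    """Toy domain type assembled from characteristic mixins."""


# Each toy solver lists the characteristics it requires from a domain
SOLVER_REQUIREMENTS = {
    "toy_generic_solver": (SingleAgent,),
    "toy_search_solver": (SingleAgent, Goals, PositiveCosts),
    "toy_constraint_solver": (SingleAgent, Constrained),
}


def matching_solvers(domain_type):
    """Return the names of the toy solvers whose requirements are all satisfied."""
    return [
        name
        for name, required in SOLVER_REQUIREMENTS.items()
        if all(issubclass(domain_type, mixin) for mixin in required)
    ]


print(matching_solvers(ToyMazeDomain))
```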

In the following, we will restrict ourselves to two solvers:

- `StableBaseline`, quite generic, allowing us to use reinforcement learning (RL) algorithms by wrapping a stable OpenAI Baselines solver ([stable_baselines3](https://github.com/DLR-RM/stable-baselines3))
- `Astar` (A*), more specific, coming from path planning.

### PPO solver

We first try a solver coming from the Reinforcement Learning community that makes use of [stable_baselines3](https://github.com/DLR-RM/stable-baselines3), giving access to many RL algorithms.

Here we choose the [Proximal Policy Optimization (PPO)](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) solver. It directly optimizes the weights of the policy network using stochastic gradient ascent. See more details in stable baselines [documentation](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) and [original paper](https://arxiv.org/abs/1707.06347).
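
For reference, the core quantity that PPO maximizes is the clipped surrogate objective from the original paper. Here $r_t(\theta)$ is the probability ratio between the new and old policies, $\hat{A}_t$ an advantage estimate, and $\epsilon$ the clipping range. Practical implementations, including stable-baselines3, combine it with further terms such as a value loss and an entropy bonus, so this is only a reminder of the main idea:

$$
L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$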

#### Solver instantiation

In [None]:
solver = StableBaseline(
    PPO, "MlpPolicy", learn_config={"total_timesteps": 10000}, verbose=True
)

#### Training solver on the domain
The solver will try to find an appropriate policy to solve the maze. 

In [None]:
MazeDomain.solve_with(solver, domain_factory)

This syntax lets scikit-decide apply its core *autocast* mechanism to the solver, so that generic solvers can be used to solve more specific domains. For instance, a solver that normally applies to multi-agent domains can also be applied to single-agent domains thanks to this mechanism.

#### Rolling out the solution (found by PPO)

We can use the trained solver to roll out an episode and see whether it actually solves the maze.

For educational purposes, we define our own rollout here (you will probably need to do so if you want to use the solver in a real case). If you want to take a look at the (more complex) one already implemented in the library, see the [utils.py](https://github.com/airbus/scikit-decide/blob/master/skdecide/utils.py) module.


In [None]:
def rollout(
    domain: MazeDomain,
    solver: Solver,
    max_steps: int,
    pause_between_steps: Optional[float] = 0.01,
):
    """Roll out one episode in a domain according to the policy of a trained solver.

    Args:
        domain: the maze domain to solve
        solver: a trained solver
        max_steps: maximum number of steps allowed to reach the goal
        pause_between_steps: time (s) paused between agent movements.
          No pause if None.

    """
    # Initialize episode
    solver.reset()
    observation = domain.reset()

    # Initialize image
    figure = domain.render(observation)
    display(figure)

    # loop until max_steps or goal is reached
    for i_step in range(1, max_steps + 1):
        if pause_between_steps is not None:
            sleep(pause_between_steps)

        # choose action according to solver
        action = solver.sample_action(observation)
        # apply the action and get the resulting outcome
        outcome = domain.step(action)
        observation = outcome.observation

        # update image
        figure = domain.render(observation)
        clear_output(wait=True)
        display(figure)

        # final state reached?
        if domain.is_terminal(observation):
            break

    # goal reached?
    is_goal_reached = domain.is_goal(observation)
    if is_goal_reached:
        print(f"Goal reached in {i_step} steps!")
    else:
        print(f"Goal not reached after {i_step} steps!")

    return is_goal_reached, i_step

We set a maximum number of steps to reach the goal according to maze size in order to decide if the proposed solution is working or not.

In [None]:
max_steps = maze.width * maze.height
print(f"Rolling out a solution with max_steps={max_steps}")

In [None]:
rollout(domain=domain, solver=solver, max_steps=max_steps, pause_between_steps=None)

As you can see, the goal is not reached at the end of the episode. Although it is a generic algorithm that applies to many problems, PPO does not seem able to solve this maze. This is because the reward is sparse (you get rewarded only when you reach the goal), and it is nearly impossible for this kind of RL algorithm to reach the goal just by chance without reward shaping.
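
To get a feeling for how hard it is to stumble on the goal by chance, the sketch below runs uniform random walks on an open grid of the same size as the maze, from the top-left to the bottom-right corner, within the same step budget. It rests on simplifying assumptions: there are no inner walls (so the task is easier than the real maze), and an untrained policy is approximated by purely random moves. The function `random_walk_reaches_goal` is a hypothetical helper, not part of scikit-decide.

```python
import random


def random_walk_reaches_goal(width, height, max_steps, rng):
    """Uniform random walk on an open grid, from top-left to bottom-right corner."""
    x, y = 0, 0
    goal = (height - 1, width - 1)
    for _ in range(max_steps):
        dx, dy = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        nx, ny = x + dx, y + dy
        # stay in place when the move would leave the grid (like bumping into a wall)
        if 0 <= nx < height and 0 <= ny < width:
            x, y = nx, ny
        if (x, y) == goal:
            return True
    return False


rng = random.Random(0)  # fixed seed so that the experiment is repeatable
episodes = 200
successes = sum(
    random_walk_reaches_goal(width=25, height=19, max_steps=25 * 19, rng=rng)
    for _ in range(episodes)
)
print(f"{successes}/{episodes} random episodes reached the goal")
```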

#### Cleaning up the solver

Some solvers need proper cleaning before being deleted.

In [None]:
solver._cleanup()

Note that this is automatically done if you use the solver within a `with` statement. The syntax would look something like:

```python
with solver_factory() as solver:
    MyDomain.solve_with(solver, domain_factory)
    rollout(domain=domain, solver=solver, max_steps=max_steps)
```
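
The mechanism behind such a `with` statement is the standard Python context manager protocol. The sketch below uses a hypothetical `ToySolver` class to show how a cleanup method can be triggered automatically on exit; we assume, without showing it, that scikit-decide solvers rely on a similar pattern.

```python
class ToySolver:
    """Hypothetical solver that must release resources when no longer needed."""

    def __init__(self):
        self.cleaned_up = False

    def _cleanup(self):
        # release resources (processes, files, ...) here
        self.cleaned_up = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # called when leaving the `with` block, even if an exception was raised
        self._cleanup()


with ToySolver() as toy_solver:
    inside = toy_solver.cleaned_up  # still False while the block is running

print(inside, toy_solver.cleaned_up)
```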

### A* solver

We now use [A*](https://en.wikipedia.org/wiki/A*_search_algorithm), which is well known to suit this kind of problem because it exploits knowledge of the goal and heuristic metrics (e.g. Euclidean or Manhattan distance) to reach it.

A* (pronounced "A-star") is a graph traversal and path search algorithm, which is often used in many fields of computer science due to its completeness, optimality, and optimal efficiency.
One major practical drawback is its $O(b^d)$ space complexity, as it stores all generated nodes in memory.

See more details in the [original paper](https://ieeexplore.ieee.org/document/4082128): P. E. Hart, N. J. Nilsson and B. Raphael, "A Formal Basis for the Heuristic Determination of Minimum Cost Paths," in IEEE Transactions on Systems Science and Cybernetics, vol. 4, no. 2, pp. 100-107, July 1968.
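
To make the algorithm concrete, here is a minimal pure-Python sketch of A* on a tiny hand-written grid. It is not the scikit-decide `Astar` implementation: the grid `GRID` and the functions `neighbors` and `astar` are hypothetical, every move costs 1, and walls are simply excluded from the neighbours instead of being penalized.

```python
import heapq
from math import sqrt

# Hypothetical 5x5 maze: 1 = wall, 0 = empty cell
GRID = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]


def neighbors(cell):
    """Yield the empty cells reachable in one move (up, down, left, right)."""
    x, y = cell
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < len(GRID) and 0 <= ny < len(GRID[0]) and GRID[nx][ny] == 0:
            yield (nx, ny)


def astar(start, goal):
    """Return a shortest path from start to goal as a list of cells, or None."""

    def h(cell):
        # Euclidean distance: never overestimates the number of unit-cost moves
        return sqrt((goal[0] - cell[0]) ** 2 + (goal[1] - cell[1]) ** 2)

    # priority queue of (f = g + h, g, cell), where g is the cost from the start
    frontier = [(h(start), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            # rebuild the path by following parents back to the start
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if g > best_cost[cell]:
            continue  # stale queue entry: a cheaper route was found since
        for nxt in neighbors(cell):
            new_cost = g + 1  # every move costs 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(frontier, (new_cost + h(nxt), new_cost, nxt))
    return None


path = astar(start=(1, 1), goal=(3, 1))
print(path)
```

The `best_cost` and `parent` dictionaries hold every generated node, which is where the memory cost mentioned above comes from.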


#### Solver instantiation

We use the heuristic previously defined in MazeDomain class.

In [None]:
solver = Astar(heuristic=lambda d, s: d.heuristic(s))

#### Training solver on the domain


In [None]:
MazeDomain.solve_with(solver, domain_factory)

#### Rolling out the solution (found by A*)

We use the same rollout function and maximum number of steps as for the PPO solver.

In [None]:
rollout(domain=domain, solver=solver, max_steps=max_steps, pause_between_steps=None)

This time, the goal is reached!

The fact that A* (which was designed for path planning problems) can do better than Deep RL here is due to:
- mainly, the fact that this algorithm uses more information from the domain to solve it efficiently, namely that all rewards are negative here ("positive costs") and that the list of next states is given exhaustively (which makes it possible to explore a structured graph instead of randomly looking for a sparse reward)
- the possible use of an admissible heuristic (distance to goal), which speeds up solving even more (while keeping the optimality guarantee)
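
A heuristic is admissible when it never overestimates the remaining cost. With four-directional moves that each cost at least 1, at least as many moves as the Manhattan distance are needed, and the Euclidean distance is never larger than the Manhattan one, so both are admissible here. The sketch below checks this inequality on a few cells; the function names and the sample cells are chosen for illustration, and the goal matches the `(height - 2, width - 2)` position used in this notebook.

```python
from math import sqrt


def euclidean(cell, goal):
    """Straight-line distance between two cells."""
    return sqrt((goal[0] - cell[0]) ** 2 + (goal[1] - cell[1]) ** 2)


def manhattan(cell, goal):
    """Number of moves needed on an empty grid with 4-directional moves."""
    return abs(goal[0] - cell[0]) + abs(goal[1] - cell[1])


goal = (17, 23)
for cell in [(1, 1), (10, 5), (17, 20)]:
    e, m = euclidean(cell, goal), manhattan(cell, goal)
    # the Euclidean estimate is always below or equal to the Manhattan one
    assert e <= m
    print(cell, round(e, 2), m)
```

Being the larger of the two while remaining admissible, the Manhattan distance is the more informative estimate on this kind of grid.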

#### Cleaning up the solver

In [None]:
solver._cleanup()

## Conclusion

We saw how to define a scikit-decide domain from scratch by specifying its characteristics at the finest level possible, and how to find the existing solvers matching those characteristics.

We also applied a quite classical solver from the RL community (PPO) as well as a more specific solver (A*) to the maze problem. Some important lessons:
- Even though it is for many the go-to method for decision making, PPO was not able to solve the "simple" maze problem;
- More precisely, PPO does not seem well suited to structured domains with sparse rewards (e.g. a goal state to reach);
- Solvers that take more advantage of all the available characteristics are generally better suited, as A* demonstrated.

That is why it is important to define the domain with the finest granularity possible and to use solvers that can make the most of the known characteristics of the domain.
