# Gym environment with scikit-decide tutorial: Continuous Mountain Car

In this notebook we will solve the continuous mountain car problem taken from [OpenAI Gym](https://gym.openai.com/), a toolkit for developing environments, usually to be solved by reinforcement learning algorithms.
Continuous Mountain Car, a standard testing domain in Reinforcement Learning (RL), is a problem in which an under-powered car must drive up a steep hill. Note that we use the *continuous* version of the mountain car here because it has a shaped or dense reward (i.e. not sparse) that solvers can exploit, as opposed to the other "Mountain Car" environments.

As a reminder, a sparse reward is a reward which is null almost everywhere, whereas a dense or shaped reward has meaningful values for most transitions.
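
To make the distinction concrete, here is a minimal sketch with two hypothetical reward functions defined on a 1D position. The names `sparse_reward` and `dense_reward` and the numeric values are illustrative only and are not taken from Gym.

```python
GOAL = 0.45  # illustrative goal position


def sparse_reward(position: float) -> float:
    """Null everywhere except once the goal is reached."""
    return 100.0 if position >= GOAL else 0.0


def dense_reward(position: float) -> float:
    """Informative on every transition: the closer to the goal, the higher."""
    return -abs(GOAL - position)


positions = [-0.5, 0.0, 0.3, 0.45]
print([sparse_reward(p) for p in positions])
print([dense_reward(p) for p in positions])
```

With the sparse version, a solver gets no hint about whether it is making progress until it actually reaches the goal.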


<div align="middle">
    <video controls autoplay preload 
         src="https://gym.openai.com/videos/2019-10-21--mqt8Qj1mwo/MountainCarContinuous-v0/original.mp4">
    </video>
</div>


This problem has been chosen for two reasons:
  - Show how scikit-decide can be used to solve Gym environments (the de-facto standard in the RL community),
  - Highlight that by doing so, you will be able to use not only solvers from the RL community (like the ones in [stable_baselines3](https://github.com/DLR-RM/stable-baselines3) for example), but also other solvers coming from other communities like genetic programming and planning/search (use of an underlying search graph) that can be very efficient.

Therefore in this notebook we will go through the following steps:
  - Wrap a Gym environment in a scikit-decide domain;
  - Use a classical RL algorithm like PPO to solve our problem;
  - Give CGP (Cartesian Genetic Programming)  a try on the same problem;
  - Finally use IW (Iterated Width) coming from the planning community on the same problem.

In [None]:
from typing import Optional, Callable
from time import sleep
import os

from IPython.display import clear_output
import matplotlib.pyplot as plt
from stable_baselines3 import PPO, SAC
import gym

from skdecide.hub.solver.stable_baselines import StableBaseline
from skdecide import Solver
from skdecide.hub.domain.gym import (
    GymDomain,
    GymWidthDomain,
    GymDiscreteActionDomain,
    GymPlanningDomain,
)
from skdecide.hub.solver.iw import IW
from skdecide.hub.solver.cgp import CGP

# choose standard matplotlib inline backend to render plots
%matplotlib inline

When running this notebook on remote servers such as Colab or Binder, rendering the Gym environment will fail because no actual display device exists. We therefore need to start a virtual display to make it work.

In [None]:
if "DISPLAY" not in os.environ:
    import pyvirtualdisplay

    _display = pyvirtualdisplay.Display(visible=False, size=(1400, 900))
    _display.start()

## About Continuous Mountain Car problem

In this problem, an under-powered car must drive up a steep hill.
The agent (a car) is started at the bottom of a valley. For any given
state the agent may choose to accelerate to the left, right or cease
any acceleration.

### Observations

- Car Position  [-1.2, 0.6]
- Car Velocity  [-0.07, +0.07]

### Action
- the power coefficient [-1.0, 1.0]


### Goal
The car position is more than 0.45.

### Reward

A reward of 100 is awarded if the agent reaches the flag (position = 0.45) on top of the mountain.
The reward is decreased based on the amount of energy consumed at each step.

### Starting State
The position of the car is assigned a uniform random value in [-0.6 , -0.4].
The starting velocity of the car is always assigned to 0.
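
The description above can be summarized by a one-step transition sketch. The constants used here (`POWER`, the `0.0025 * cos(3 * position)` gravity term, the `0.1` action penalty) are written from memory of the Gym implementation and should be treated as assumptions; check them against the `MountainCarContinuous-v0` source of your installed version. The termination test is also simplified to a position check.

```python
import math
from typing import Tuple

MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.45
POWER = 0.0015  # assumed engine power coefficient


def step(
    position: float, velocity: float, action: float
) -> Tuple[float, float, float, bool]:
    """Return (position, velocity, reward, done) after one step (illustrative)."""
    force = min(max(action, -1.0), 1.0)
    # engine push minus the slope-dependent gravity term (assumed form)
    velocity += force * POWER - 0.0025 * math.cos(3 * position)
    velocity = min(max(velocity, -MAX_SPEED), MAX_SPEED)
    position += velocity
    position = min(max(position, MIN_POSITION), MAX_POSITION)
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the car stops against the left wall
    done = position >= GOAL_POSITION
    # goal bonus minus the energy penalty (assumed coefficient 0.1)
    reward = (100.0 if done else 0.0) - 0.1 * action**2
    return position, velocity, reward, done


print(step(-0.5, 0.0, 1.0))
```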

         

## Wrap Gym environment in a scikit-decide domain

We choose the gym environment we would like to use.

In [None]:
ENV_NAME = "MountainCarContinuous-v0"

We define a domain factory using the `GymDomain` proxy available in scikit-decide, which will wrap the Gym environment.

In [None]:
domain_factory = lambda: GymDomain(gym.make(ENV_NAME))

Here is a screenshot of such an environment. 

Note: We close the domain straight away to avoid leaving the OpenGL pop-up window open on local Jupyter sessions.

In [None]:
domain = domain_factory()
domain.reset()
plt.imshow(domain.render(mode="rgb_array"))
plt.axis("off")
domain.close()

## Solve with Reinforcement Learning (StableBaseline + PPO)

We first try a solver coming from the Reinforcement Learning community that makes use of [stable_baselines3](https://github.com/DLR-RM/stable-baselines3), which gives access to a lot of RL algorithms.

Here we choose [Proximal Policy Optimization (PPO)](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) solver. It directly optimizes the weights of the policy network using stochastic gradient ascent. See more details in stable baselines [documentation](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) and [original paper](https://arxiv.org/abs/1707.06347). 
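
As a rough illustration of what PPO optimizes, the sketch below computes the clipped surrogate objective described in the original paper. It is a toy stand-alone computation in pure Python, not the stable_baselines3 implementation; the function name and the sample numbers are made up for the example.

```python
from typing import Sequence


def ppo_clipped_objective(
    ratios: Sequence[float], advantages: Sequence[float], clip_range: float = 0.2
) -> float:
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), to be maximized."""
    terms = []
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        terms.append(min(ratio * advantage, clipped_ratio * advantage))
    return sum(terms) / len(terms)


# probability ratios pi_new(a|s) / pi_old(a|s) and advantage estimates (made-up values)
ratios = [0.9, 1.0, 1.5]
advantages = [1.0, -1.0, 2.0]
print(ppo_clipped_objective(ratios, advantages))
```

The clipping removes the incentive to move the new policy far away from the old one in a single update, which is what keeps the gradient ascent steps "proximal".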

### Check compatibility
We check the compatibility of the domain with the chosen solver.

In [None]:
domain = domain_factory()
assert StableBaseline.check_domain(domain)
domain.close()

### Solver instantiation

In [None]:
solver = StableBaseline(
    PPO, "MlpPolicy", learn_config={"total_timesteps": 50000}, verbose=True
)

### Training solver on domain

In [None]:
GymDomain.solve_with(solver, domain_factory)

### Rolling out a solution

We can use the trained solver to roll out an episode to see if this is actually solving the problem at hand.

For educational purposes, we define our own rollout here (which will probably be needed if you want to actually use the solver in a real case). If you want to take a look at the (more complex) one already implemented in the library, see the `rollout()` function in the [utils.py](https://github.com/airbus/scikit-decide/blob/master/skdecide/utils.py) module.

By default we display the solution in a matplotlib figure. If you only need to check whether the goal is reached or not, you can specify `render=False`. In this case, the rollout is greatly sped up and a message is still printed at the end of the process specifying success or not, with the number of steps required.

In [None]:
def rollout(
    domain: GymDomain,
    solver: Solver,
    max_steps: int,
    pause_between_steps: Optional[float] = 0.01,
    render: bool = True,
):
    """Roll out one episode in a domain according to the policy of a trained solver.

    Args:
        domain: the domain to solve
        solver: a trained solver
        max_steps: maximum number of steps allowed to reach the goal
        pause_between_steps: time (s) paused between agent movements.
          No pause if None.
        render: if True, the rollout is rendered in a matplotlib figure as an animation;
            if False, the rollout runs much faster.

    """
    # Initialize episode
    solver.reset()
    observation = domain.reset()

    # Initialize image
    if render:
        plt.ioff()
        fig, ax = plt.subplots(1)
        ax.axis("off")
        plt.ion()
        img = ax.imshow(domain.render(mode="rgb_array"))
        display(fig)

    # loop until max_steps or goal is reached
    for i_step in range(1, max_steps + 1):
        if pause_between_steps is not None:
            sleep(pause_between_steps)

        # choose action according to solver
        action = solver.sample_action(observation)
        # apply the action and get the corresponding outcome
        outcome = domain.step(action)
        observation = outcome.observation

        # update image
        if render: 
            img.set_data(domain.render(mode="rgb_array"))
            fig.canvas.draw()
            clear_output(wait=True)
            display(fig)

        # final state reached?
        if outcome.termination:
            break

    # close the figure to avoid jupyter duplicating the last image
    if render:
        plt.close(fig)

    # goal reached?
    is_goal_reached = observation[0] >= 0.45
    if is_goal_reached:
        print(f"Goal reached in {i_step} steps!")
    else:
        print(f"Goal not reached after {i_step} steps!")

    return is_goal_reached, i_step

We create a domain for the rollout and close it at the end. If it is not closed, an OpenGL pop-up window stays open, at least on local Jupyter sessions.

In [None]:
domain = domain_factory()
try:
    rollout(domain=domain, solver=solver, max_steps=999, pause_between_steps=None, render=True)
finally:
    domain.close()

We can see that PPO does not find a solution to the problem. This is mainly due to the way the reward is computed. Indeed negative reward accumulates as long as the goal is not reached, which encourages the agent to stop moving.

Actually, typical RL algorithms like PPO are a good fit for domains with "well-shaped" rewards (guiding towards the goal), but can struggle in sparse or "badly-shaped" reward environments like Mountain Car Continuous.
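
A back-of-the-envelope computation illustrates the trap. Assuming the per-step penalty is `0.1 * action**2` and the goal bonus is 100 (assumptions about the Gym reward, to be checked against your installed version), the sketch compares the return of three hypothetical episodes. The 150-step successful episode is an invented figure used only for the comparison.

```python
ACTION_PENALTY = 0.1  # assumed penalty coefficient
GOAL_BONUS = 100.0
MAX_STEPS = 999


def episode_return(action_magnitude: float, n_steps: int, goal_reached: bool) -> float:
    """Return of an episode applying a constant-magnitude action for n_steps."""
    total = GOAL_BONUS if goal_reached else 0.0
    total -= ACTION_PENALTY * action_magnitude**2 * n_steps
    return total


idle = episode_return(0.0, MAX_STEPS, goal_reached=False)
failed_effort = episode_return(1.0, MAX_STEPS, goal_reached=False)
success = episode_return(1.0, 150, goal_reached=True)
print(idle, failed_effort, success)
```

Until the agent stumbles on the goal, standing still looks better than any exploratory effort, which is a local optimum that a gradient-based learner can settle in.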

We will see in the next sections that non-RL methods can overcome this issue.

### Cleaning up

Some solvers need proper cleaning before being deleted.

In [None]:
solver._cleanup()

Note that this is automatically done if you use the solver within a `with` statement. The syntax would look something like:

```python
with solver_factory() as solver:
    MyDomain.solve_with(solver, domain_factory)
    rollout(domain=domain, solver=solver, max_steps=999)
```
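
The mechanism behind this is Python's context-manager protocol. The toy class below is a hypothetical stand-in (not a scikit-decide class) that only shows how `__exit__` can trigger the cleanup, whether the body finishes normally or raises.

```python
class ToySolver:
    """Hypothetical solver illustrating cleanup via a `with` statement."""

    def __init__(self) -> None:
        self.cleaned_up = False

    def _cleanup(self) -> None:
        self.cleaned_up = True

    def __enter__(self) -> "ToySolver":
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # called on normal exit and on exceptions alike
        self._cleanup()


with ToySolver() as toy_solver:
    print(toy_solver.cleaned_up)  # still False inside the block
print(toy_solver.cleaned_up)  # True once the block is left
```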

## Solve with Cartesian Genetic Programming (CGP)

CGP (Cartesian Genetic Programming) is a form of genetic programming that uses a graph representation (2D grid of nodes) to encode computer programs.
See [Miller, Julian. (2003). Cartesian Genetic Programming. 10.1007/978-3-642-17310-3.](https://www.researchgate.net/publication/2859242_Cartesian_Genetic_Programming) for more details.

Pros:
+ ability to customize the set of atomic functions used by CGP (e.g. to inject some domain knowledge)
+ ability to inspect the final formula found by CGP (no black box)

Cons:
- the fitness function of CGP is defined by the rewards, so it may be unable to solve sparse-reward problems
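
To give an intuition of the graph encoding, here is a deliberately tiny sketch of how a CGP genome can be decoded and evaluated. The function set, the genome layout and all names are invented for the illustration and do not reflect the internals of the scikit-decide `CGP` solver.

```python
import math
from typing import Callable, List, Sequence, Tuple

# atomic functions available to each node (illustrative set)
FUNCTIONS: List[Callable[[float, float], float]] = [
    lambda a, b: a + b,
    lambda a, b: a * b,
    lambda a, b: math.sin(a),  # unary function: second input is ignored
]

# a node is (function index, index of 1st input, index of 2nd input);
# indices refer to the program inputs first, then to previously computed nodes
Node = Tuple[int, int, int]


def evaluate(genome: Sequence[Node], output_index: int, inputs: Sequence[float]) -> float:
    """Evaluate a feed-forward CGP genome on the given inputs."""
    values = list(inputs)
    for func_idx, in1, in2 in genome:
        values.append(FUNCTIONS[func_idx](values[in1], values[in2]))
    return values[output_index]


# encodes sin(position) + position * velocity
genome = [(2, 0, 0), (1, 0, 1), (0, 2, 3)]
print(evaluate(genome, output_index=4, inputs=[0.5, 0.02]))
```

In such a scheme, mutating the integers of the genome changes the encoded program, and the rewards collected by the decoded program serve as its fitness.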

### Check compatibility
We check the compatibility of the domain with the chosen solver.

In [None]:
domain = domain_factory()
assert CGP.check_domain(domain)
domain.close()

### Solver instantiation

In [None]:
solver = CGP("TEMP_CGP", n_it=25, verbose=True)

### Training solver on domain

In [None]:
GymDomain.solve_with(solver, domain_factory)

### Rolling out a solution

We use the same rollout function as for the PPO solver.

In [None]:
domain = domain_factory()
try:
    rollout(domain=domain, solver=solver, max_steps=999, pause_between_steps=None, render=True)
finally:
    domain.close()

CGP seems to do well on this problem. Indeed, the presence of inverse trigonometric functions ($asin$, $acos$, and $atan$) in its base set of atomic functions makes it suitable for modelling this kind of pendular motion.

***Warning***: In some cases, CGP does not actually find a solution. As there is randomness involved, this is to be expected. Running multiple episodes can sometimes solve the problem. If you are unlucky, you will even have to train the solver again.

In [None]:
for i_episode in range(10):
    print(f"Episode #{i_episode}")
    domain = domain_factory()
    try:
        rollout(domain=domain, solver=solver, max_steps=999, pause_between_steps=None, render=False)
    finally:
        domain.close()
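
Since `rollout()` returns whether the goal was reached, the per-episode results can be aggregated into a success rate. The helper below is a hypothetical addition (not part of scikit-decide): it accepts any zero-argument callable returning `(is_goal_reached, n_steps)` and is demonstrated here with a random stub instead of a real rollout.

```python
import random
from typing import Callable, Tuple


def success_rate(run_episode: Callable[[], Tuple[bool, int]], n_episodes: int) -> float:
    """Fraction of episodes in which the goal was reached."""
    successes = sum(1 for _ in range(n_episodes) if run_episode()[0])
    return successes / n_episodes


def stub_episode() -> Tuple[bool, int]:
    """Stand-in for a real rollout: succeeds about 70% of the time."""
    reached = random.random() < 0.7
    return reached, 200 if reached else 999


random.seed(0)
print(success_rate(stub_episode, n_episodes=100))
```

With the real objects, `run_episode` could wrap the creation of a domain, a call to `rollout(..., render=False)` and the final `domain.close()`.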

### Cleaning up

In [None]:
solver._cleanup()

## Solve with Classical Planning (IW)

Iterated Width (IW) is a width based search algorithm that builds a graph on-demand, while pruning non-novel nodes. 

In order to handle continuous domains, a state encoding specific to continuous state variables dynamically and adaptively discretizes the continuous state variables in such a way to build a compact graph based on intervals (rather than a naive grid of discrete point values). 

The novelty measure discards intervals that are included in previously explored intervals, thus favoring the extension of the state variable intervals.

See https://www.ijcai.org/proceedings/2020/578 for more details.
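
The interval-based novelty idea can be illustrated with a much simplified sketch: keep, for each state variable, the interval of values seen so far, and call a state novel only if it extends at least one interval. This is a toy approximation written for this tutorial; it is neither the encoding of the paper nor the scikit-decide implementation, and the class name is ours.

```python
from typing import List, Sequence, Tuple


class IntervalNovelty:
    """Track the explored [low, high] interval of each state variable."""

    def __init__(self, initial_state: Sequence[float]) -> None:
        self.intervals: List[Tuple[float, float]] = [(x, x) for x in initial_state]

    def is_novel(self, state: Sequence[float]) -> bool:
        """Return True and extend the intervals if the state lies outside one of them."""
        novel = False
        for i, x in enumerate(state):
            low, high = self.intervals[i]
            if x < low or x > high:
                self.intervals[i] = (min(low, x), max(high, x))
                novel = True
        return novel


novelty = IntervalNovelty(initial_state=[-0.5, 0.0])  # (position, velocity)
print(novelty.is_novel([-0.45, 0.01]))  # extends both intervals
print(novelty.is_novel([-0.48, 0.005]))  # inside the explored intervals: would be pruned
```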

### Prepare the domain for IW

We need to wrap the Gym environment in a domain with finer characteristics so that IW can be used on it. More precisely, it needs the methods inherited from `GymPlanningDomain`, `GymDiscreteActionDomain` and `GymWidthDomain`. In addition, we need to provide IW with a state features function to dynamically increase state variable intervals. For Gym domains, we use Boundary Extension Encoding (BEE) features as explained in the [paper](https://www.ijcai.org/proceedings/2020/578) mentioned above. This is implemented as the `bee2_features()` method in `GymWidthDomain`, which our domain class will inherit.

In [None]:
class D(GymPlanningDomain, GymWidthDomain, GymDiscreteActionDomain):
    pass


class GymDomainForWidthSolvers(D):
    def __init__(
        self,
        gym_env: gym.Env,
        set_state: Optional[Callable[[gym.Env, D.T_memory[D.T_state]], None]] = None,
        get_state: Optional[Callable[[gym.Env], D.T_memory[D.T_state]]] = None,
        termination_is_goal: bool = True,
        continuous_feature_fidelity: int = 5,
        discretization_factor: int = 3,
        branching_factor: Optional[int] = None,
        max_depth: int = 1000,
    ) -> None:
        GymPlanningDomain.__init__(
            self,
            gym_env=gym_env,
            set_state=set_state,
            get_state=get_state,
            termination_is_goal=termination_is_goal,
            max_depth=max_depth,
        )
        GymDiscreteActionDomain.__init__(
            self,
            discretization_factor=discretization_factor,
            branching_factor=branching_factor,
        )
        GymWidthDomain.__init__(
            self, continuous_feature_fidelity=continuous_feature_fidelity
        )
        gym_env._max_episode_steps = max_depth
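
Planning over a continuous action space requires a finite set of candidate actions. One plausible reading of `discretization_factor=3` is an evenly spaced grid over the action interval; the sketch below illustrates that idea only. It is an assumption made for this tutorial and the helper name is hypothetical, so refer to `GymDiscreteActionDomain` for the actual behavior.

```python
from typing import List


def discretize_interval(low: float, high: float, n_values: int) -> List[float]:
    """Return n_values evenly spaced actions covering [low, high]."""
    if n_values < 2:
        return [(low + high) / 2]
    step = (high - low) / (n_values - 1)
    return [low + i * step for i in range(n_values)]


# full throttle left, no acceleration, full throttle right
print(discretize_interval(-1.0, 1.0, 3))
```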


We redefine the domain factory accordingly.

In [None]:
domain4width_factory = lambda: GymDomainForWidthSolvers(gym.make(ENV_NAME))

### Check compatibility
We check the compatibility of the domain with the chosen solver.

In [None]:
domain = domain4width_factory()
assert IW.check_domain(domain)
domain.close()

### Solver instantiation

As explained earlier, we use the Boundary Extension Encoding state features `bee2_features` so that IW can dynamically increase state variable intervals. In other domains, other state features might be more suitable.

In [None]:
solver = IW(
    state_features=lambda d, s: d.bee2_features(s),
    node_ordering=lambda a_gscore, a_novelty, a_depth, b_gscore, b_novelty, b_depth: a_novelty
    > b_novelty,
    parallel=False,
    debug_logs=False,
    domain_factory=domain4width_factory,
)

### Training solver on domain

In [None]:
GymDomainForWidthSolvers.solve_with(solver, domain4width_factory)

### Rolling out a solution

**Disclaimer:** This rollout can be a bit painful to watch on local Jupyter sessions. Indeed, IW creates copies of the environment at each step, which makes a new OpenGL window pop up and then close each time.

We have to slightly modify the roll out function as observations for the new domain are now wrapped in a `GymDomainProxyState` to make them serializable. So to get access to the underlying numpy array, we need to look for `observation._state`.

In [None]:
def rollout_iw(
    domain: GymDomain,
    solver: Solver,
    max_steps: int,
    pause_between_steps: Optional[float] = 0.01,
    render: bool = False,
):
    """Roll out one episode in a domain according to the policy of a trained solver.

    Args:
        domain: the domain to solve
        solver: a trained solver
        max_steps: maximum number of steps allowed to reach the goal
        pause_between_steps: time (s) paused between agent movements.
          No pause if None.
        render: if True, the rollout is rendered in a matplotlib figure as an animation;
            if False, the rollout runs much faster.

    """
    # Initialize episode
    solver.reset()
    observation = domain.reset()

    # Initialize image
    if render:
        plt.ioff()
        fig, ax = plt.subplots(1)
        ax.axis("off")
        plt.ion()
        img = ax.imshow(domain.render(mode="rgb_array"))
        display(fig)

    # loop until max_steps or goal is reached
    for i_step in range(1, max_steps + 1):
        if pause_between_steps is not None:
            sleep(pause_between_steps)

        # choose action according to solver
        action = solver.sample_action(observation)
        # apply the action and get the corresponding outcome
        outcome = domain.step(action)
        observation = outcome.observation

        # update image
        if render:
            img.set_data(domain.render(mode="rgb_array"))
            fig.canvas.draw()
            clear_output(wait=True)
            display(fig)

        # final state reached?
        if outcome.termination:
            break

    # close the figure to avoid jupyter duplicating the last image
    if render:
        plt.close(fig)

    # goal reached?
    is_goal_reached = observation._state[0] >= 0.45
    if is_goal_reached:
        print(f"Goal reached in {i_step} steps!")
    else:
        print(f"Goal not reached after {i_step} steps!")

    return is_goal_reached, i_step

In [None]:
domain = domain4width_factory()
try:
    rollout_iw(domain=domain, solver=solver, max_steps=999, pause_between_steps=None, render=True)
finally:
    domain.close()

IW works especially well in mountain car. 

Indeed, we need to increase the kinetic + potential energy to reach the goal, which amounts to increasing as much as possible the values of the state variables (position and velocity). This is exactly what IW is designed to do (trying to explore novel states, which here means states with higher position or velocity).

As a consequence, IW can find an optimal strategy in a few seconds (whereas in most cases PPO and CGP can't find optimal strategies in the same computation time).
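
The energy argument can be checked numerically. Assuming a unit mass and a gravity term of `0.0025 * cos(3 * position)` in the velocity update (an assumption about the Gym dynamics), a potential energy consistent with that force is `(0.0025 / 3) * sin(3 * position)`. The function name below is ours.

```python
import math


def mechanical_energy(position: float, velocity: float) -> float:
    """Kinetic + potential energy of the car (unit mass, illustrative)."""
    kinetic = 0.5 * velocity**2
    potential = (0.0025 / 3) * math.sin(3 * position)
    return kinetic + potential


start = mechanical_energy(-0.5, 0.0)  # near the bottom of the valley, at rest
goal = mechanical_energy(0.45, 0.0)  # at the flag, at rest
print(start, goal)
```

Under these assumptions the goal state has a higher energy than the start state, so the car has to accumulate energy over several swings, and states with larger position or velocity are the ones that carry more of it.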

### Cleaning up

In [None]:
solver._cleanup()

## Conclusion

We saw that, thanks to scikit-decide, it is possible to apply solvers from different fields and communities (Reinforcement Learning, Genetic Programming, and Planning) to an OpenAI Gym environment.

Even though the domain used here is more typical of the RL community, the solvers from other communities performed far better. In particular, the IW algorithm was able to find an efficient solution in a very short time.