# Using RDDL domains and solvers with scikit-decide

In this notebook, we demonstrate how to use the RDDL scikit-decide wrapper domain and solve it with scikit-decide solvers. This domain is built upon the RDDL environment from the excellent pyrddlgym-project GitHub project. Some of the solvers tested here are also wrapped from the same project, but we will also see how to use other solvers (coded directly within scikit-decide or wrapped from other third-party libraries).

Concerning the Python kernel to use for this notebook:
- If running locally, be sure to use an environment with scikit-decide[all] installed.
- If running on Colab, the next cell installs it for you.
- If running on Binder, the environment should be ready.

In [None]:
# On Colab: install the library
on_colab = "google.colab" in str(get_ipython())
if on_colab:
    import glob
    import json
    import sys

    using_nightly_version = True

    if using_nightly_version:
        # look for nightly build download url
        release_curl_res = !curl -L   -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/repos/airbus/scikit-decide/releases/tags/nightly
        release_dict = json.loads(release_curl_res.s)
        release_download_url = sorted(
            release_dict["assets"], key=lambda d: d["updated_at"]
        )[-1]["browser_download_url"]
        print(release_download_url)

        # download and unzip
        !wget --output-document=release.zip {release_download_url}
        !unzip -o release.zip

        # get proper wheel name according to python version used
        wheel_pythonversion_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
        wheel_path = glob.glob(
            f"dist/scikit_decide*{wheel_pythonversion_tag}*manylinux*.whl"
        )[0]

        skdecide_pip_spec = f"{wheel_path}[all]"
    else:
        skdecide_pip_spec = "scikit-decide[all]"

    # uninstall google protobuf conflicting with ray and sb3
    ! pip uninstall -y protobuf

    # install scikit-decide with all extras
    !pip install {skdecide_pip_spec}
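
The wheel selection step in the cell above can be checked in isolation. The sketch below reproduces the same pattern matching on a list of file names instead of the `dist/` folder. The helper `pick_wheel` and the file names are hypothetical and only serve as an illustration. Unlike the `[0]` indexing used above, it returns `None` when no wheel matches the running Python version, which makes a missing build easier to diagnose.

```python
import fnmatch
import sys


def pick_wheel(filenames, version_info=sys.version_info):
    """Return the first wheel matching the CPython tag, or None if there is none."""
    tag = f"cp{version_info[0]}{version_info[1]}"
    pattern = f"scikit_decide*{tag}*manylinux*.whl"
    matches = fnmatch.filter(filenames, pattern)
    return matches[0] if matches else None


# Hypothetical archive content, for illustration only.
wheels = [
    "scikit_decide-0.0.0-cp310-cp310-manylinux_2_17_x86_64.whl",
    "scikit_decide-0.0.0-cp311-cp311-manylinux_2_17_x86_64.whl",
    "scikit_decide-0.0.0-cp311-cp311-win_amd64.whl",
]
print(pick_wheel(wheels, version_info=(3, 11)))  # the cp311 manylinux wheel
print(pick_wheel(wheels, version_info=(3, 7)))  # None: no matching build
```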

In [None]:
import logging
import os
import shutil

from pyRDDLGym_jax.core.simulator import JaxRDDLSimulator
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv
from ray.rllib.algorithms.ppo import PPO as RLLIB_PPO
from rddlrepository.archive.competitions.IPPC2023.MountainCar.MountainCarViz import (
    MountainCarVisualizer,
)
from rddlrepository.archive.standalone.Elevators.ElevatorViz import ElevatorVisualizer
from rddlrepository.archive.standalone.Quadcopter.QuadcopterViz import (
    QuadcopterVisualizer,
)
from rddlrepository.core.manager import RDDLRepoManager
from stable_baselines3 import PPO as SB3_PPO

from skdecide.hub.domain.rddl import RDDLDomain, RDDLDomainSimplifiedSpaces
from skdecide.hub.solver.cgp import CGP
from skdecide.hub.solver.ray_rllib import RayRLlib
from skdecide.hub.solver.rddl import RDDLGurobiSolver, RDDLJaxSolver
from skdecide.hub.solver.stable_baselines import StableBaseline
from skdecide.utils import rollout

## Instantiating and visualizing a RDDL domain

The pyrddlgym-project provides the [rddlrepository](https://github.com/pyrddlgym-project/rddlrepository) library of RDDL benchmarks from past IPPC competitions and third-party contributors. We list below the available problems with our pip installation of the library.

In [None]:
manager = RDDLRepoManager(rebuild=True)
print(sorted(manager.list_problems()))

We will use three different RDDL benchmarks here to demonstrate the scikit-decide integration of pyRDDLGym:
- MountainCar_ippc2023
- Quadcopter
- Elevators

Let's create the scikit-decide `RDDLDomain` instance and render it.
Note that here we use some options to display within the notebook:
- `display_with_pygame`: True by default (as in pyRDDLGym), here set to False to prevent a pygame window from popping up
- `display_within_jupyter`: useful to display within a jupyter notebook
- `visualizer`: we use a visualizer dedicated to the chosen benchmark
- `movie_name`: if set, a movie will be created at the end of a rollout 

In [None]:
problem_name = "MountainCar_ippc2023"
problem_info = manager.get_problem(problem_name)
problem_visualizer = MountainCarVisualizer
domain = RDDLDomain(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(1),
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
    movie_name=None,  # here left empty because not used in a roll-out
)
domain.reset()
img = domain.render()

In [None]:
problem_name = "Quadcopter"
problem_info = manager.get_problem(problem_name)
problem_visualizer = QuadcopterVisualizer
domain = RDDLDomain(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(1),
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
)
domain.reset()
img = domain.render()

In [None]:
problem_name = "Elevators"
problem_info = manager.get_problem(problem_name)
problem_visualizer = ElevatorVisualizer
domain = RDDLDomain(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(1),
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
)
domain.reset()
img = domain.render()

## Solving the domain with scikit-decide (potentially bridged) solvers

Now comes the fun part: solving the domain with scikit-decide solvers, some of them - especially the reinforcement learning ones - being bridged to state-of-the-art existing libraries (e.g. RLlib, SB3). You will see that once the domain is defined, solving it takes very few lines of code.

### RL solvers

First, we create the domain factory for the benchmark "MountainCar_ippc2023". For these RL solvers, we need the underlying RDDL environment to use the base class `SimplifiedActionRDDLEnv` from [pyRDDLGym-rl](https://github.com/pyrddlgym-project/pyRDDLGym-rl), which uses gym spaces tractable by RL algorithms. This is done thanks to the argument `base_class`, which is passed directly to `pyRDDLGym.make()`.

In [None]:
problem_name = "MountainCar_ippc2023"
problem_info = manager.get_problem(problem_name)
problem_visualizer = MountainCarVisualizer

domain_factory_rl = lambda alg_name=None: RDDLDomain(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(1),
    base_class=SimplifiedActionRDDLEnv,
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
    movie_name=f"{problem_name}-{alg_name}" if alg_name is not None else None,
)

#### RLlib's PPO algorithm

The code below creates a scikit-decide `RayRLlib` solver, calls its `solve()` method, and finally rolls out the optimized policy with scikit-decide's `rollout` utility function. That function renders the solution, and the domain generates a movie in the `rddl_movies` folder when the rollout episode reaches its termination condition.

In [None]:
solver_factory = lambda: RayRLlib(
    domain_factory=domain_factory_rl, algo_class=RLLIB_PPO, train_iterations=10
)

with solver_factory() as solver:
    solver.solve()
    rollout(
        domain_factory_rl(alg_name="RLLIB-PPO"),
        solver,
        max_steps=300,
        render=True,
        verbose=False,
    )

Here is an example of executing RLlib's PPO policy trained for 100 iterations on the mountain car benchmark (the cell above uses `train_iterations=10`):

![RLLIB PPO example solution](rddl_images/MountainCar_ippc2023-RLLIB-PPO_example.gif)
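
Conceptually, rolling out a policy means repeatedly asking it for an action and applying that action to the domain until the episode terminates or a step budget is exhausted. The sketch below illustrates this loop on a toy problem. It is not scikit-decide's `rollout` implementation: the class `ToyHill`, the function `toy_rollout` and the simplified `reset`/`step` signatures are assumptions made for the illustration.

```python
class ToyHill:
    """Hypothetical 1-D toy problem: reach `goal` by moving right."""

    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 or +1, and every step costs 1
        self.position += action
        terminated = self.position >= self.goal
        return self.position, -1.0, terminated


def toy_rollout(domain, policy, max_steps):
    """Run one episode and return (total_reward, number_of_steps)."""
    observation = domain.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, terminated = domain.step(action)
        total_reward += reward
        steps += 1
        if terminated:
            break
    return total_reward, steps


def always_right(observation):
    return +1


print(toy_rollout(ToyHill(goal=5), always_right, max_steps=300))  # (-5.0, 5)
```

The `max_steps` argument plays the same role as in the cells of this notebook: it bounds the episode length when the policy never reaches a terminal state.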

#### Stable-Baselines3's PPO

Once the domain is defined, very few lines of code are sufficient to test another solver whose capabilities are compatible with the domain. In the cell below, we now test Stable-Baselines3's PPO algorithm.

In [None]:
solver_factory = lambda: StableBaseline(
    domain_factory=domain_factory_rl,
    algo_class=SB3_PPO,
    baselines_policy="MultiInputPolicy",
    learn_config={"total_timesteps": 10000},
    verbose=0,
)

with solver_factory() as solver:
    solver.solve()
    rollout(
        domain_factory_rl(alg_name="SB3-PPO"),
        solver,
        max_steps=1000,
        render=True,
        verbose=False,
    )

### CGP

Scikit-decide provides an implementation of [Cartesian Genetic Programming](https://dl.acm.org/doi/10.1145/3205455.3205578) (CGP), a form of Genetic Programming which optimizes a function (e.g. control policy) by learning its best representation as a directed acyclic graph of mathematical operators. One of the great capabilities of scikit-decide is to provide simple high-level means to compare algorithms from different communities (RL, GP, search, planning, etc.) on the same domains with few lines of code.

<img src="rddl_images/cgp-sketch.png" alt="Cartesian Genetic Programming" width="700"/>
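
To make the graph representation more concrete, here is a minimal sketch of how a CGP-style genome can be evaluated. It is a generic illustration, not the encoding used by scikit-decide's `CGP` solver: the function set, the genome layout and the input names are all assumptions. Each node applies one operator to two earlier values, so the graph is acyclic by construction.

```python
import math

# Function set available to every node (an illustrative choice).
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sin": lambda a, b: math.sin(a),  # unary: ignores its second input
}


def evaluate(genome, output_index, inputs):
    """Evaluate a feed-forward graph of operators.

    genome: list of (function_name, i, j), where i and j index `values`,
            which holds the inputs first, then the outputs of earlier nodes.
    """
    values = list(inputs)
    for name, i, j in genome:
        # A node may only read earlier values, which keeps the graph acyclic.
        assert i < len(values) and j < len(values)
        values.append(FUNCTIONS[name](values[i], values[j]))
    return values[output_index]


# Inputs: index 0 = position, index 1 = velocity (hypothetical names).
genome = [
    ("mul", 0, 1),  # index 2: position * velocity
    ("add", 2, 1),  # index 3: position * velocity + velocity
    ("sin", 3, 3),  # index 4: computed but not connected to the output
]
action = evaluate(genome, output_index=3, inputs=[2.0, 0.5])
print(action)  # 1.5
```

An evolutionary search would then mutate the entries of `genome` and `output_index` and keep the variants whose rollouts collect the most reward.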

Since our current implementation of CGP in scikit-decide does not handle complex observation spaces such as the dictionary spaces returned by the RDDL simulator, we use `RDDLDomainSimplifiedSpaces` instead, where all actions and observations are numpy arrays thanks to the `flatten` and `flatten_space` functions of `gymnasium`.

We call the CGP solver on this simplified domain and we render the obtained solution after a few iterations (including the generation of the video in the `rddl_movies` folder).

In [None]:
problem_name = "MountainCar_ippc2023"
problem_info = manager.get_problem(problem_name)
problem_visualizer = MountainCarVisualizer

domain_factory_cgp = lambda alg_name=None: RDDLDomainSimplifiedSpaces(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(1),
    base_class=SimplifiedActionRDDLEnv,
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
    movie_name=f"{problem_name}-{alg_name}" if alg_name is not None else None,
    max_frames=200,
)

if os.path.exists("TEMP_CGP"):
    shutil.rmtree("TEMP_CGP")

solver_factory = lambda: CGP(
    domain_factory=domain_factory_cgp, folder_name="TEMP_CGP", n_it=25, verbose=False
)
with solver_factory() as solver:
    solver.solve()
    rollout(
        domain_factory_cgp("CGP"), solver, max_steps=200, render=True, verbose=False
    )

Here is an example of executing the CGP policy on the mountain car benchmark:

![CGP example solution](rddl_images/MountainCar_ippc2023-CGP_example.gif)

## Solving the domain with pyRDDLGym solvers wrapped in scikit-decide

One can also use the solvers implemented in the pyRDDLGym project from within scikit-decide, such as the Jax planner ([pyRDDLGym-jax](https://github.com/pyrddlgym-project/pyRDDLGym-jax)) or the Gurobi planner ([pyRDDLGym-gurobi](https://github.com/pyrddlgym-project/pyRDDLGym-gurobi)).

### JAX Agent


The scikit-decide solver `RDDLJaxSolver` wraps the offline version of the [JaxPlan](https://openreview.net/forum?id=7IKtmUpLEH) planner, which compiles the RDDL model to a Jax computation graph, allowing for planning by backpropagation.
The solver constructor takes a configuration file for the Jax planner, as explained [here](https://github.com/pyrddlgym-project/pyRDDLGym-jax/tree/main?tab=readme-ov-file#writing-a-configuration-file-for-a-custom-domain).
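
To give an intuition of planning by backpropagation, the sketch below optimizes an open-loop sequence of actions by gradient ascent through a differentiable rollout. It is a toy illustration and not JaxPlan's code: the dynamics, the reward, the step size and the number of iterations are all assumptions chosen for the example.

```python
import jax
import jax.numpy as jnp

GOAL = 1.0


def plan_return(actions):
    """Total reward of an open-loop plan under toy differentiable dynamics."""

    def step(position, action):
        next_position = position + 0.1 * jnp.tanh(action)  # bounded move
        reward = -((next_position - GOAL) ** 2)
        return next_position, reward

    start = jnp.asarray(0.0, dtype=jnp.float32)
    _, rewards = jax.lax.scan(step, start, actions)
    return jnp.sum(rewards)


# Gradient of the return with respect to every action of the plan.
grad_fn = jax.jit(jax.grad(plan_return))

actions = jnp.zeros(20)  # one action per timestep of the plan
initial_return = float(plan_return(actions))
for _ in range(300):
    actions = actions + 0.1 * grad_fn(actions)  # gradient ascent on the plan
final_return = float(plan_return(actions))

print(initial_return, final_return)
```

The return of the optimized plan is higher than that of the all-zero plan, because the gradient flows from the rewards back to each action through the simulated transitions.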

We apply it to the benchmark "Quadcopter".

Note that for this solver the domain needs
- to use the simulation backend specific to Jax,
- to be vectorized. 

This is done thanks to the arguments `backend` and `vectorized` which are passed to `pyRDDLGym.make()`.

In [None]:
problem_name = "Quadcopter"
problem_info = manager.get_problem(problem_name)
problem_visualizer = QuadcopterVisualizer

if not os.path.exists("Quadcopter_slp.cfg"):
    !wget https://raw.githubusercontent.com/pyrddlgym-project/pyRDDLGym-jax/main/pyRDDLGym_jax/examples/configs/Quadcopter_slp.cfg

domain_factory_jax_agent = lambda alg_name=None: RDDLDomain(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(1),
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
    backend=JaxRDDLSimulator,
    movie_name=f"{problem_name}-{alg_name}" if alg_name is not None else None,
    max_frames=500,
    vectorized=True,
)

assert RDDLJaxSolver.check_domain(domain_factory_jax_agent())

logging.getLogger("matplotlib.font_manager").disabled = True
with RDDLJaxSolver(
    domain_factory=domain_factory_jax_agent, config="Quadcopter_slp.cfg"
) as solver:
    solver.solve()
    rollout(
        domain_factory_jax_agent(alg_name="JaxAgent"),
        solver,
        max_steps=500,
        render=True,
        max_framerate=5,
        verbose=False,
    )

We obtain the following example execution of the Jax policy, which clearly converges towards the goal (quadcopters flying towards the red triangle):

![JaxAgent example solution](rddl_images/Quadcopter-JaxAgent_example.gif)

### Gurobi Agent

Finally, we try the online version of the [GurobiPlan](https://openreview.net/forum?id=7IKtmUpLEH) planner, which compiles the RDDL model to a [Gurobi](https://www.gurobi.com) MILP model.

We apply it to the "Elevators" benchmark.


<div class="alert alert-block alert-warning"><b>Note: </b>
To solve reasonably sized problems, the solver needs a full Gurobi license, as the free license available when installing gurobipy from PyPI is not sufficient to solve this domain. Here we limit the `rollout_horizon` so that the notebook runs with the free license, because optimization variables are created for each timestep.
</div>
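
The role of `rollout_horizon` in an online planner can be illustrated with a receding-horizon loop: at every step, plan over a short horizon, execute only the first action, then replan. In the sketch below, a brute-force enumeration of all plans stands in for the MILP solve, so this is not how the Gurobi planner works internally. The toy problem and the names `simulate` and `plan_first_action` are assumptions.

```python
from itertools import product

ACTIONS = (-1, 0, 1)
GOAL = 3


def simulate(position, plan):
    """Return the total reward of applying `plan` from `position`."""
    total = 0
    for action in plan:
        position += action
        total -= abs(position - GOAL)
    return total


def plan_first_action(position, rollout_horizon):
    """Search all plans of length `rollout_horizon`, return the first action."""
    best_plan = max(
        product(ACTIONS, repeat=rollout_horizon),
        key=lambda plan: simulate(position, plan),
    )
    return best_plan[0]


position, trajectory = 0, [0]
for _ in range(5):
    # Replan at every step and execute only the first action of the plan.
    position += plan_first_action(position, rollout_horizon=2)
    trajectory.append(position)

print(trajectory)  # [0, 1, 2, 3, 3, 3]
```

In this toy version the number of candidate plans grows as `len(ACTIONS) ** rollout_horizon`, which gives a feel for why a longer horizon makes each planning step more expensive.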

In [None]:
problem_name = "Elevators"
problem_info = manager.get_problem(problem_name)
problem_visualizer = ElevatorVisualizer

domain_factory_gurobi_agent = lambda alg_name=None: RDDLDomain(
    rddl_domain=problem_info.get_domain(),
    rddl_instance=problem_info.get_instance(0),
    visualizer=problem_visualizer,
    display_with_pygame=False,
    display_within_jupyter=True,
    movie_name=f"{problem_name}-{alg_name}" if alg_name is not None else None,
    max_frames=50,
)

assert RDDLGurobiSolver.check_domain(domain_factory_gurobi_agent())

with RDDLGurobiSolver(
    domain_factory=domain_factory_gurobi_agent,
    rollout_horizon=2,  # increase the rollout_horizon with real license
) as solver:
    solver.solve()
    rollout(
        domain_factory_gurobi_agent(alg_name="GurobiAgent"),
        solver,
        max_steps=50,
        render=True,
        max_framerate=5,
        verbose=False,
    )

Here is an example of executing the online `GurobiPlan` strategy on this benchmark:

![GurobiAgent example solution](rddl_images/Elevators-GurobiAgent_example.gif)