# ICAPS24 SkDecide Tutorial: solving problems (possibly imported from Gym) with Reinforcement Learning and Cartesian Genetic Programming

Alexandre Arnold, Guillaume Povéda, Florent Teichteil-Königsbuch

Credits to [IMACS](https://imacs.polytechnique.fr/) and especially to Nolwen Huet

This tutorial shows how to load a domain in scikit-decide and try to solve it with techniques from different communities:

*   [Reinforcement Learning](https://en.wikipedia.org/wiki/Reinforcement_learning) (RL)
*   [Cartesian Genetic Programming](https://en.wikipedia.org/wiki/Cartesian_genetic_programming) (CGP)

## Prerequisites

Install scikit-decide:

In [None]:
!pip install "scikit-decide[all]"

Install renderlab to render Gymnasium environments in Google Colab:

In [None]:
!pip install renderlab

## Loading a domain

Once a problem is formalized as a scikit-decide domain, it can be tackled by any compatible solver. Domains can be created from scratch or imported from various formats. Here we demonstrate how to import an environment from [Gymnasium](https://gymnasium.farama.org) (the maintained fork of OpenAI Gym, a standard API widely used in the RL community), such as [Mountain Car Continuous](https://gymnasium.farama.org/environments/classic_control/mountain_car_continuous):

In [1]:
import gymnasium as gym
from renderlab import RenderFrame
from skdecide.hub.domain.gym import GymDomain

# Select a Gymnasium environment
ENV_NAME = "MountainCarContinuous-v0"

# Create a domain factory, a callable returning a skdecide domain (used by solvers)
def domain_factory(record_videos=False):

    # Create a Gymnasium environment
    env = gym.make(ENV_NAME, render_mode="rgb_array")

    # Maybe wrap it with RenderFrame to record/play episode videos (works in Colab)
    if record_videos:
        env = RenderFrame(env, "./render")

    # Return a skdecide domain from a Gymnasium environment
    return GymDomain(env)

# In simple cases, domain_factory can be created in one line:
# domain_factory = lambda: GymDomain(gym.make(ENV_NAME))
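
A factory is passed to solvers instead of a single domain instance so that each call can produce a fresh, independent domain. The sketch below illustrates the idea without any dependency: `ToyDomain` and `toy_domain_factory` are hypothetical stand-ins (not part of scikit-decide), and `functools.partial` is one possible way to pre-set an argument such as `record_videos`.

```python
from functools import partial


class ToyDomain:
    """Hypothetical stand-in for a skdecide domain (illustration only)."""

    def __init__(self, record_videos=False):
        self.record_videos = record_videos


def toy_domain_factory(record_videos=False):
    # Each call builds a brand-new domain, so instances never share state
    return ToyDomain(record_videos=record_videos)


# Pre-set an argument to obtain a zero-argument factory
video_domain_factory = partial(toy_domain_factory, record_videos=True)

first = toy_domain_factory()
second = toy_domain_factory()
recorded = video_domain_factory()

print(first is second)         # two distinct instances
print(recorded.record_videos)  # argument pre-set by partial
```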

The rollout utility provides a quick way to run episodes by taking random actions (or a solver policy as shown later) in the domain:

In [2]:
from skdecide.utils import rollout

# Instantiate one domain (used for rollouts)
domain = domain_factory(record_videos=True)

# Do a random rollout of the domain (random actions are taken when no solver is specified)
rollout(domain, num_episodes=1, max_steps=1000, verbose=False) # try verbose=True for more printing
domain.unwrapped().play() # watch last episode in video by calling play() on the underlying Gymnasium environment (works in Colab)

Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4





Moviepy - Done !
Moviepy - video ready temp-{start}.mp4
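
To make the idea of a random rollout concrete, here is a minimal dependency-free sketch of the loop it performs. It is not the scikit-decide implementation: `ToyEnv` is a hypothetical 1-D environment that only mimics the Gymnasium `reset`/`step` return convention, `sample_action` is a made-up helper standing in for sampling from an action space, and the reward shape (small effort penalty, bonus at the goal) is loosely modeled on Mountain Car Continuous.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment mimicking the Gymnasium reset/step API."""

    GOAL = 1.0

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.position = 0.0
        return self.position, {}

    def sample_action(self):
        # Uniform random action in [-1, 1] (hypothetical helper)
        return self.rng.uniform(-1.0, 1.0)

    def step(self, action):
        self.position += 0.1 * action
        terminated = self.position >= self.GOAL
        # Small penalty for effort, bonus when the goal is reached
        reward = -0.1 * action**2 + (100.0 if terminated else 0.0)
        # Returns (observation, reward, terminated, truncated, info)
        return self.position, reward, terminated, False, {}


def random_rollout(env, max_steps=1000, seed=0):
    """Run one episode with random actions; return (steps, total reward)."""
    env.reset(seed=seed)
    total_reward = 0.0
    for step in range(1, max_steps + 1):
        _, reward, terminated, truncated, _ = env.step(env.sample_action())
        total_reward += reward
        if terminated or truncated:
            break
    return step, total_reward


steps, total_reward = random_rollout(ToyEnv(), max_steps=1000)
print(steps, total_reward)
```

Replacing `env.sample_action()` with an action chosen by a trained policy gives the solver-driven rollout used later in this tutorial.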


## Solving the domain

One of the key benefits of scikit-decide is its ability to connect the same domain definition to many different solvers from various communities. To demonstrate this versatility, we show how to solve the domain loaded above with both Reinforcement Learning and Cartesian Genetic Programming:

### With Reinforcement Learning (RL)

Scikit-decide provides wrappers for several RL solvers, such as [RLlib](https://docs.ray.io/en/latest/rllib/index.html) and [Stable-Baselines3](https://stable-baselines3.readthedocs.io). We use the latter in this example:

In [3]:
from stable_baselines3 import PPO
from skdecide.hub.solver.stable_baselines import StableBaseline

# Check domain compatibility with StableBaseline RL solver (good practice)
assert StableBaseline.check_domain(domain)

# Instantiate solver with parameters of choice (e.g. type of algo/neural net, learning steps...)
solver = StableBaseline(
    domain_factory,
    algo_class=PPO,
    baselines_policy="MlpPolicy",
    learn_config={"total_timesteps": 10000},
    verbose=1
)

# Solve with RL
solver.solve()

# Save solution
solver.save("saved_solution")


Using cpu device
-----------------------------
| time/              |      |
|    fps             | 1214 |
|    iterations      | 1    |
|    time_elapsed    | 1    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 833          |
|    iterations           | 2            |
|    time_elapsed         | 4            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0072984253 |
|    clip_fraction        | 0.0275       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.4         |
|    explained_variance   | -0.00135     |
|    learning_rate        | 0.0003       |
|    loss                 | 0.197        |
|    n_updates            | 10           |
|    policy_gradient_loss | -0.00321     |
|    std                  | 0.974        |
|    value_loss           | 41.4         |
------------------------------------------

Now we can run episodes with rollout using the latest solver policy:

In [4]:
# Visualize solution (pass solver to rollout to use its policy)
rollout(domain, solver, num_episodes=1, max_steps=1000, verbose=False)
domain.unwrapped().play()


Moviepy - Building video temp-{start}.mp4.
Moviepy - Writing video temp-{start}.mp4





Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


A saved solution can be reloaded at any time (especially useful in a new Python session), and learning can then continue from it. Running this cell a few times should typically yield progressively better solutions:

In [5]:
# Optional: reload solution (required if reloading in a new Python session)
solver.load("saved_solution")

# Continue learning
solver.solve()

# Save updated solution
solver.save("saved_solution")

# Visualize updated solution
rollout(domain, solver, num_episodes=1, max_steps=1000, verbose=False)
domain.unwrapped().play()


-----------------------------
| time/              |      |
|    fps             | 994  |
|    iterations      | 1    |
|    time_elapsed    | 2    |
|    total_timesteps | 2048 |
-----------------------------
------------------------------------------
| time/                   |              |
|    fps                  | 774          |
|    iterations           | 2            |
|    time_elapsed         | 5            |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0012354925 |
|    clip_fraction        | 0.00586      |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.2         |
|    explained_variance   | 0.048        |
|    learning_rate        | 0.0003       |
|    loss                 | 0.715        |
|    n_updates            | 60           |
|    policy_gradient_loss | -0.00144     |
|    std                  | 0.802        |
|    value_loss           | 40.6         |
------------------------------------------



Moviepy - Done !
Moviepy - video ready temp-{start}.mp4


After using a solver, it is good practice to clean it up as shown below (not critical here, but sometimes useful for the C++ parallel solvers in scikit-decide). This is done automatically when the solver is used within a `with` statement, as in the CGP subsection below.

In [6]:
# Clean up solver after use (good practice)
solver._cleanup()
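
The general Python mechanism behind automatic cleanup in a `with` statement is the context manager protocol. The sketch below illustrates that pattern only: `ToySolver` is a hypothetical class, and the way scikit-decide solvers implement it internally may differ.

```python
class ToySolver:
    """Hypothetical solver illustrating cleanup via a context manager."""

    def __init__(self):
        self.cleaned_up = False

    def solve(self):
        print("solving...")

    def _cleanup(self):
        # Release resources (threads, processes, files...) held by the solver
        self.cleaned_up = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the `with` block exits, even if an exception was raised
        self._cleanup()
        return False  # do not swallow exceptions


with ToySolver() as toy_solver:
    toy_solver.solve()

print(toy_solver.cleaned_up)
```

The benefit over a manual call is that cleanup still happens if `solve()` raises.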

### With Cartesian Genetic Programming (CGP)

In [7]:
from skdecide.hub.solver.cgp import CGP

# Check domain compatibility with CGP solver (good practice)
assert CGP.check_domain(domain)

# Instantiate solver with parameters of choice (using "with" syntax to avoid manual clean up)
with CGP(domain_factory, folder_name="TEMP_CGP", n_it=25) as solver:

    # Solve with CGP
    solver.solve()

    # Visualize solution
    rollout(domain, solver, num_episodes=1, max_steps=1000, verbose=False)
    domain.unwrapped().play()

[Box(-1.0, 1.0, (1,), float32)]
[Box([-1.2  -0.07], [0.6  0.07], (2,), float32)]
[ 9  1  1  2  0  0 11  1  3  6  0  2  9  4  1  9  5  2  2  7  2  7  6  1
  2  6  1  0  6  8  0  9  9  3  9  6 17 13 10  8 13  6 16  6 10 12  9 12
  4  0  2  7 18  3  6 16  4 20 13 14 19  0  7  1  2 18 20  7  9  1 11  2
  2 18 20 15 23 19 11 18 15 20 22  5  6  2 24  7 21  8 11  5 12 16 19  3
 12 10 33  9  4  5 20 31  8  6 10  3 11 13 11 17 14 29 19 17  7  7 31 14
 13 39  6 15 19 38 12 37 24  1 15  3 11 11 32 13 26 27  6 19 21 13 35 19
 15 10  0  9 36  7 18 35  6 15 49 49 20 17  8  4 29 18 20  0 27  1 26 54
 18 37 50  5 30 55  6 38 20 20 20 55 16  5 55 10 29 40 17 15 48 14 29  5
 19 40 55  2 15 29 17 62 44  2 12 30 12 65 56 15 48 34  1 16 29 15 17 31
  0 28 64  7  4 73 20 24 13  4  7 28 10 74 58 15 62 12 18 58 76  4 79 15
  5 58 59  3  9 47  0  8  3 10  7 40  1  7  3 20  0 79  8 29 81  6 37 48
  5 88  2  8 71 26  6  3 70  2 62 58 20 54 29  6  7 16  8 64 33  2 25 57
  1 12 38  0 96 35  4 87 79 12 68 16 15]



1 	 92.70000000000002 	 True 	 [-1000.     92.7  -999.9  -999.8]
2 	 92.80000000000001 	 True 	 [  92.1          92.7          92.8        -179.25470504]
3 	 96.91493590344449 	 True 	 [92.8       96.9149359  0.        92.1      ]
4 	 98.44227934031996 	 True 	 [95.73040789 98.02986625 98.44227934 97.6115466 ]
5 	 98.44227934031996 	 False 	 [  96.72557947   96.9166946    96.72393836 -999.8       ]
6 	 98.44227934031996 	 False 	 [  96.39711733   96.91513345   95.8910914  -999.9       ]
7 	 98.44227934031996 	 False 	 [-446.83336307   96.00030897   97.29849228   96.93801972]
8 	 98.44227934031996 	 False 	 [  96.22899305   96.24631185   96.37106949 -250.        ]
9 	 98.44227934031996 	 False 	 [  94.9923539    96.28628984 -999.9          96.20419366]
10 	 98.44227934031996 	 False 	 [-999.9          96.9983443  -999.9          97.55962985]
11 	 98.44227934031996 	 False 	 [  97.040722   -233.04850587 -250.           96.92035865]
12 	 98.44227934031996 	 False 	 [-999.9          98.097



Moviepy - Done !
Moviepy - video ready temp-{start}.mp4
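
For readers unfamiliar with CGP, the following sketch shows the core idea on a toy problem: a program is encoded as a list of integers (a genome) describing a feed-forward graph, and a (1+λ) evolution strategy keeps mutating the best genome found so far. This is a simplified illustration, not scikit-decide's implementation: the function set, genome layout, mutation rate and target function are all assumptions chosen for brevity. The output above appears consistent with this kind of scheme (an integer genome, then one line per iteration with the best fitness, whether the parent was replaced, and four offspring fitnesses), but that reading is an inference from the log.

```python
import random

FUNCTIONS = [
    lambda a, b: a + b,
    lambda a, b: a - b,
    lambda a, b: a * b,
]
N_INPUTS, N_NODES = 1, 8


def random_gene(position, rng):
    """Draw a valid value for the gene at `position` in the genome."""
    if position == 3 * N_NODES:  # last gene: index of the output node
        return rng.randrange(N_INPUTS + N_NODES)
    node, kind = divmod(position, 3)
    if kind == 0:  # function gene
        return rng.randrange(len(FUNCTIONS))
    # Connection gene: may only point to an input or an earlier node
    return rng.randrange(N_INPUTS + node)


def random_genome(rng):
    return [random_gene(p, rng) for p in range(3 * N_NODES + 1)]


def evaluate(genome, x):
    """Decode the genome into a feed-forward graph and compute its output."""
    values = [x]
    for node in range(N_NODES):
        f, a, b = genome[3 * node : 3 * node + 3]
        values.append(FUNCTIONS[f](values[a], values[b]))
    return values[genome[-1]]


def fitness(genome):
    # Toy objective: approximate x**2 + x (higher fitness is better)
    xs = [i / 5 for i in range(-5, 6)]
    return -sum((evaluate(genome, x) - (x * x + x)) ** 2 for x in xs)


def mutate(genome, rng, rate=0.1):
    return [
        random_gene(p, rng) if rng.random() < rate else g
        for p, g in enumerate(genome)
    ]


def evolve(n_it=200, n_offspring=4, seed=0):
    rng = random.Random(seed)
    parent = random_genome(rng)
    best = fitness(parent)
    history = [best]
    for _ in range(n_it):
        children = [mutate(parent, rng) for _ in range(n_offspring)]
        scores = [fitness(child) for child in children]
        top = max(range(n_offspring), key=scores.__getitem__)
        if scores[top] >= best:  # ties accepted, which allows neutral drift
            parent, best = children[top], scores[top]
        history.append(best)
    return parent, best, history


genome, best, history = evolve()
print(genome)
print(best)
```

In the tutorial's setting, the fitness would instead be the total reward obtained by running the decoded program as a policy in the domain.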


In this example, you may find that CGP often finds better solutions than RL (although this depends on the solver parameters and the random seed). Note however that this is highly problem-dependent: try re-running this notebook after setting `ENV_NAME = "CartPole-v1"` at the beginning and you may find the opposite. This shows the value of having a wide catalog of solvers from which to pick the best one for each specific problem!