
#  <a href="https://colab.research.google.com/github/enlite-ai/maze/blob/main/tutorials/notebooks/getting_started/getting_started_1_basic_workflow.ipynb" target="_top"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" /></a> Maze: Getting Started Part I - Basic Workflow


Part 1 of 4 in the *Getting started* series.

---

## On Maze

MazeRL is an application-oriented Deep Reinforcement Learning (RL) framework that addresses real-world decision problems. If this caught your interest, check out
* the [GitHub repository](https://github.com/enlite-ai/maze),
* the [documentation](https://maze-rl.readthedocs.io/en/latest/index.html#documentation-overview) or
* the official [website](https://www.enlite.ai/).


## Introduction

This notebook explains the basic workflow for training an agent on an existing OpenAI Gym environment (specifically: `CartPole-v0`), rolling out the trained policy and evaluating it. It is part of the *Getting Started* series, which aims to convey the basics of Maze and is therefore targeted at first-time users. You are not expected to have any prior experience with Maze (although basic knowledge of reinforcement learning concepts is recommended).

### Install Maze and Dependencies

Maze is available as a pip package. The other dependencies required for this notebook are PyTorch and OpenAI's gym. We recommend installing PyTorch via Conda. If you are executing this notebook on Google Colab, both libraries are already available.

In [None]:
!pip install torch
!pip install gym
!pip install maze-rl

## A First Example

As a primer we'll implement a minimal example that trains an agent in OpenAI gym's `CartPole-v0` environment (more on that [below](#working-with-runcontext)). This is very similar to the example you may have already seen in Maze's [readme](https://github.com/enlite-ai/maze/blob/main/README.md).

In [2]:
from maze.api.run_context import RunContext
from maze.core.wrappers.maze_gym_env_wrapper import GymMazeEnv

rc = RunContext(env=lambda: GymMazeEnv('CartPole-v0'), silent=True, algorithm="a2c")
rc.train(n_epochs=1)

# Run trained policy.
env = GymMazeEnv('CartPole-v0')
obs = env.reset()
total_reward = 0
max_step = 100
step = 0
done = False

while not done and step < max_step:
    action = rc.compute_action(obs)
    obs, reward, done, info = env.step(action)
    total_reward += reward
    step += 1

print("Total reward: {reward}".format(reward=total_reward))

INFO: Setting MKL_THREADING_LAYER=GNU to avoid PyTorch issues with conda!
INFO: Setting OMP_NUM_THREADS=1 to avoid performance drop when using distributed environments!


100%|██████████| 25/25 [00:03<00:00,  8.10it/s]

Total reward: 100.0





That's it! You implemented your first reinforcement learning application with Maze.

What happened here? We will go through the details in the course of this notebook, but in a nutshell we
* initialized a `RunContext`, which is Maze's [high-level API](#working-with-runcontext);
* used our `RunContext` instance to train on a Maze-compatible `CartPole` environment;
* initialized a new Maze-compatible `CartPole` environment for evaluating our agent;
* ran the trained agent on the new environment and collected the total reward.

In `CartPole`, every step in which the pole stays upright is rewarded with +1. A total reward of 100 over 100 steps thus means that our agent can already balance the pole for the entire rollout. In the following we will discuss the functionality and components involved in this process step by step.

## Working with RunContext

`RunContext` is Maze's high-level API providing capabilities for the training, rollout and evaluation of policies. It is designed to minimize the amount of boilerplate code while allowing a high degree of configurability. In this notebook we will use `RunContext`, but won't explore it in detail. To learn more about it, check out the [documentation](https://maze-rl.readthedocs.io/en/latest/concepts_and_structure/run_context_overview.html).

In a nutshell, we will instantiate a `RunContext` with `rc = RunContext(...)` and then use its training/rollout/evaluation functionality with `rc.train(...)`/`rc.rollout(...)`/`rc.evaluate()`. It is comparable with the trainer or algorithm classes from other RL libraries without being limited to training operations only.

## Setting up

### Environment

We will train on one of the "Hello World" problems in reinforcement learning: OpenAI gym's `CartPole-v0`. The goal in this environment is to move the cart so that the pole stays upright. It is described in detail [here](https://gym.openai.com/envs/CartPole-v0/):

  > A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

![Cartpole](https://cdn-images-1.medium.com/max/1200/1*oMSg2_mKguAGKy1C64UFlw.gif)
*Keeping the pole upright in `CartPole-v0`*.
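
The termination rule from the quoted description can be written down as a small predicate. This is only a sketch of the rule as quoted above, not Gym's source code: `episode_over` is a hypothetical helper, and the default thresholds (15 degrees, 2.4 units) are taken from the quote, so the actual environment may use different values.

```python
def episode_over(
    pole_angle_deg: float,
    cart_position: float,
    max_angle_deg: float = 15.0,  # threshold taken from the quoted description
    max_position: float = 2.4,    # threshold taken from the quoted description
) -> bool:
    """Return True if the episode ends according to the quoted rule.

    The episode ends when the pole tilts more than `max_angle_deg` from
    vertical or the cart moves more than `max_position` from the center.
    """
    return abs(pole_angle_deg) > max_angle_deg or abs(cart_position) > max_position


print(episode_over(pole_angle_deg=3.0, cart_position=0.5))
print(episode_over(pole_angle_deg=-16.0, cart_position=0.5))
```

Every step for which this predicate is false yields a reward of +1, which is why the episode return equals the number of steps the pole stayed up.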

Any OpenAI gym environment can be used in Maze by wrapping it in a `GymMazeEnv` like so:

In [3]:
import gym

env = GymMazeEnv(env=gym.make("CartPole-v0"))

### Algorithm

For this initial example we will apply [PPO](https://maze-rl.readthedocs.io/en/latest/trainers/maze_trainers.html#proximal-policy-optimization-ppo). We won't customize it in any way, i.e. we will have our agent learn using the default policy.
Maze provides sane defaults for all supported algorithms. The easiest way to specify which algorithm to use is to pass its name as a string when initializing a `RunContext` (see [the documentation](https://maze-rl.readthedocs.io/en/latest/trainers/maze_trainers.html) for an exhaustive list of supported algorithms).

### Initialization

One of the tasks of `RunContext` is to orchestrate the training and rollout process. In the case of a distributed training or rollout process this necessitates instantiating the environment multiple times. That's why the environment has to be passed as a callable environment factory function<sup>*</sup> instead of as an environment instance.

By default, Maze logs useful data to `stdout`. We'll suppress this for now with `silent=True`.

This is already all we need to know in order to start training. With this information we can initialize our `RunContext` instance.

<sup>*</sup> Maze also offers other ways to initialize components like environments. These will be covered in subsequent tutorials.

In [4]:
rc = RunContext(silent=True, algorithm="ppo", env=lambda: GymMazeEnv(env=gym.make("CartPole-v0")))
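
To illustrate why a factory is preferable to an instance, here is a minimal sketch that does not use Maze at all. `ToyEnv` and `make_workers` are hypothetical names invented for this illustration. The point is only that calling a factory repeatedly yields independent environment objects, whereas a single shared instance would make all workers share state.

```python
from typing import Callable, List


class ToyEnv:
    """Hypothetical stand-in for an environment; not part of Maze or Gym."""

    def __init__(self) -> None:
        self.steps_taken = 0

    def step(self) -> int:
        # Each environment keeps its own internal state.
        self.steps_taken += 1
        return self.steps_taken


def make_workers(env_factory: Callable[[], ToyEnv], n_workers: int) -> List[ToyEnv]:
    """Create one independent environment per worker by calling the factory."""
    return [env_factory() for _ in range(n_workers)]


# The factory is just a callable that builds a fresh environment on demand.
workers = make_workers(lambda: ToyEnv(), n_workers=3)

# Stepping one worker's environment leaves the others untouched.
workers[0].step()
step_counts = [env.steps_taken for env in workers]
print(step_counts)  # prints [1, 0, 0]
```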



## Training and Rollout

Having instantiated our `RunContext`, we are ready to train:

In [5]:
rc.train(n_epochs=1)

100%|██████████| 25/25 [00:06<00:00,  4.08it/s]


The trained agent can now be rolled out. As of now, `RunContext` lacks full support for rollouts. This will be added shortly, at which point this tutorial will be updated. In the interim we'll use `maze.utils.notebooks.rollout`, which wraps a manual rollout loop.
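
For orientation, a manual rollout loop of the kind such a helper wraps could look roughly like the sketch below. This is not the implementation of `maze.utils.notebooks.rollout`. `run_episode`, `CountdownEnv` and `ConstantPolicy` are hypothetical names, and the only assumptions are a Gym-style `reset()`/`step()` interface and a `compute_action(obs)` method as used in the first example.

```python
from typing import Any, Tuple


class CountdownEnv:
    """Hypothetical toy environment: pays +1 per step and ends after `horizon` steps."""

    def __init__(self, horizon: int = 5) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: Any) -> Tuple[int, float, bool, dict]:
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done, {}


class ConstantPolicy:
    """Hypothetical policy that ignores the observation and always returns action 0."""

    def compute_action(self, obs: Any) -> int:
        return 0


def run_episode(env: Any, policy: Any, max_steps: int) -> float:
    """Run one episode for at most `max_steps` steps and return the summed reward."""
    obs = env.reset()
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:
        action = policy.compute_action(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        step += 1
    return total_reward


# The environment terminates on its own after 5 steps.
print(run_episode(CountdownEnv(horizon=5), ConstantPolicy(), max_steps=200))  # prints 5.0
# The step limit cuts the episode short after 3 steps.
print(run_episode(CountdownEnv(horizon=5), ConstantPolicy(), max_steps=3))  # prints 3.0
```

The two stopping conditions mirror the `while not done and step < max_step` loop from the first example: an episode ends either because the environment reports `done` or because the step budget is exhausted.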

First we'll establish a baseline by rolling out a random policy for a few episodes and evaluating it.

In [6]:
import maze.utils.notebooks
from maze.core.agent.random_policy import RandomPolicy

steps = 200
n_episodes = 15
random_policy = RandomPolicy(env.action_spaces_dict)
rewards = [
    maze.utils.notebooks.rollout(rc.env_factory(), random_policy, steps)
    for _ in range(n_episodes)
]
print("Mean return with #{ne} episodes: {rew}".format(ne=n_episodes, rew=sum(rewards) / len(rewards)))

Mean return with #15 episodes: 17.933333333333334


This means that a random selection of actions can keep the pole upright for around 20 steps. How does our trained agent perform?

In [7]:
rewards = [
    maze.utils.notebooks.rollout(rc.env_factory(), rc, steps)
    for _ in range(n_episodes)
]

print("Mean return with #{ne} episodes: {rew}".format(ne=n_episodes, rew=sum(rewards) / len(rewards)))

Mean return with #15 episodes: 137.6


That's well above the random baseline and a good way toward the maximum of 200 - not bad for a single epoch of training! What does this look like in action?

In [8]:
maze.utils.notebooks.rollout(rc.env_factory(), rc, steps, True)

200.0

Our agent seems to have gotten the hang of it.
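
A single rendered episode and a mean over 15 episodes give only a rough picture, so it can help to look at the spread of returns as well. The sketch below uses only Python's standard library. `summarize_returns` is a hypothetical helper, and the `returns` list is made-up illustration data, not output from this notebook; in practice you would pass the `rewards` list computed above.

```python
import statistics
from typing import Sequence, Tuple


def summarize_returns(returns: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (mean, sample standard deviation, min, max) of episode returns."""
    mean = statistics.mean(returns)
    # stdev needs at least two data points; fall back to 0.0 for a single episode.
    std = statistics.stdev(returns) if len(returns) > 1 else 0.0
    return mean, std, min(returns), max(returns)


# Hypothetical episode returns, for illustration only.
returns = [200.0, 100.0, 150.0, 50.0]
mean, std, lowest, highest = summarize_returns(returns)
print(f"mean={mean:.1f} std={std:.1f} min={lowest} max={highest}")
```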

## Saving and Loading

Maze automatically stores checkpoints and all other data generated during training. It doesn't expose saving or loading functionality explicitly - whenever you train or roll out your agent, the resulting artifacts are stored. By default, the destination directory is the current working directory. If you have run the previous snippets, your current directory should contain the following structure:

```
outputs
|-- gym_env-flatten_concat-ppo-local
|   |-- [YYYY-MM-DD]_HH-mm-SS
|   |    |-- ...
```

`gym_env-flatten_concat-ppo-local` is an identifier that's generated automatically based on your agent's configuration (more on that in [the next installment of the getting started series](www.github.com/enlite-ai/maze/blob/main/tutorials/notebooks/getting_started_2.ipynb)).
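
If you want to locate the most recent run programmatically, a small standard-library sketch like the following may help. It assumes only the `outputs/<identifier>/<timestamp>` layout shown above. `latest_run_dir` is a hypothetical helper, not part of Maze, and it picks the newest directory by modification time rather than by parsing the timestamp in the directory name.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(outputs_root: str = "outputs") -> Optional[Path]:
    """Return the most recently modified run directory two levels below `outputs_root`.

    Returns None if the root does not exist or contains no run directories.
    """
    root = Path(outputs_root)
    if not root.is_dir():
        return None
    # The pattern "*/*" matches outputs/<identifier>/<timestamp>.
    run_dirs = [p for p in root.glob("*/*") if p.is_dir()]
    if not run_dirs:
        return None
    return max(run_dirs, key=lambda p: p.stat().st_mtime)


# Prints the newest run directory, or None if nothing has been trained yet.
print(latest_run_dir())
```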

Note that the directory holding all the generated artifacts will be named after the time at which you start your training or rollout. `RunContext` ensures that all training and rollout data are stored in the same directory as long as they are started from the same instance. However, if you want to continue training with another `RunContext` instance - or you are using Maze's low-level API instead of `RunContext` - you can fix the storage directory with `run_dir`:

In [9]:
rc = RunContext(
    silent=True,
    algorithm="ppo",
    env=lambda: GymMazeEnv(env=gym.make("CartPole-v0")),
    run_dir="."
)



## Summary

This notebook shows how to...
* ...train a policy on an existing (Gym) environment in two lines of code.
* ...roll out a trained policy.
* ...visualize state and actions in an environment.
* ...evaluate a trained policy.

### What's next?

* We recommend continuing with the [second part of the getting started series](www.github.com/enlite-ai/maze/blob/main/tutorials/notebooks/getting_started_2.ipynb), which covers configurability in Maze, e.g. how to implement your own components (like policies) and how to configure and use wrappers.
* If you would like to see more notebooks covering other areas of Maze, feel free to [kick off a discussion on GitHub](https://github.com/enlite-ai/maze/discussions).