# Tutorial 1: Environment Creation and Training

In this tutorial, we will show a simple example of how to create a gym environment using [EAGERx](https://eagerx.readthedocs.io/en/master/).
Also, we will use this environment to train a policy using [Stable Baselines 3](https://stable-baselines3.readthedocs.io/en/master/).

The following will be covered:
- Creating a [Graph](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html) with an [Object](https://eagerx.readthedocs.io/en/master/guide/api_reference/object/index.html)
- Using this [Graph](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html) and a [Bridge](https://eagerx.readthedocs.io/en/master/guide/api_reference/bridge/index.html) to create an [EAGERx Environment](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html)
- Training a policy with the [EAGERx Environment](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html)

In the remainder of this tutorial we will go into more detail on these concepts.


## Pendulum Swing-up

We will create an environment for solving the classic control problem of swinging up an underactuated pendulum, very similar to the [Pendulum-v1 environment](https://www.gymlibrary.ml/environments/classic_control/pendulum/).
Our goal is to swing up this pendulum to the upright position and keep it there, while minimizing the velocity of the pendulum and the input voltage.

Since the dynamics of a pendulum actuated by a DC motor are well known, we can simulate the pendulum by integrating the corresponding ordinary differential equations (ODEs):


$\mathbf{x} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix} \\ \dot{\mathbf{x}} = \begin{bmatrix} \dot{\theta} \\ \frac{1}{J}(\frac{K}{R}u - mgl \sin{\theta} - b \dot{\theta} - \frac{K^2}{R}\dot{\theta})\end{bmatrix}$

with $\theta$ the angle w.r.t. upright position, $\dot{\theta}$ the angular velocity, $u$ the input voltage, $J$ the inertia, $m$ the mass, $g$ the gravitational constant, $l$ the length of the pendulum, $b$ the motor viscous friction constant, $K$ the motor constant and $R$ the electric resistance.
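
As a rough illustration of what integrating these ODEs means, the sketch below implements the right-hand side exactly as written in the equation above and advances it with a simple explicit Euler step. The function names (`pendulum_ode`, `euler_step`), the parameter values and the choice of the Euler scheme are assumptions made for illustration only; they are not the implementation or the parameters used by EAGERx.

```python
import math


def pendulum_ode(x, u, J, m, l, b, K, R, g=9.81):
    """Right-hand side of the pendulum ODE, as written in the equation above.

    x is the state [theta, dtheta] and u is the input voltage.
    """
    theta, dtheta = x
    ddtheta = (K / R * u - m * g * l * math.sin(theta) - b * dtheta - K**2 / R * dtheta) / J
    return [dtheta, ddtheta]


def euler_step(x, u, dt, **params):
    """Advance the state by one explicit Euler step of size dt (hypothetical helper)."""
    dx = pendulum_ode(x, u, **params)
    return [x[0] + dt * dx[0], x[1] + dt * dx[1]]


# Illustrative parameter values only (not those of the tutorial's pendulum)
params = dict(J=1.0, m=1.0, l=1.0, b=0.1, K=1.0, R=1.0)

# Simulate one second with zero input voltage, using a small step size
x = [0.5, 0.0]
for _ in range(1000):
    x = euler_step(x, u=0.0, dt=0.001, **params)
print(x)
```

In practice a more accurate integrator than explicit Euler would usually be preferred; the point here is only to show how the state vector and its derivative relate.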

<img src="./figures/pendulum.GIF" width="480" />

## Notebook Setup

In order to be able to run the code, we need to install the *eagerx_tutorials* package and ROS.

In [None]:
try:
    import eagerx_tutorials
except ImportError:
    !{"echo 'Installing eagerx-tutorials with pip.' && pip install eagerx-tutorials >> /tmp/eagerx_install.txt 2>&1"}
if 'google.colab' in str(get_ipython()):
    !{"curl 'https://raw.githubusercontent.com/eager-dev/eagerx_tutorials/master/scripts/setup_colab.sh' > ~/setup_colab.sh"}
    !{"bash ~/setup_colab.sh"}

# Setup interactive notebook
# Required in interactive notebooks only.
from eagerx_tutorials import helper
helper.setup_notebook()
env = None

# Allows reloading of registered entities from changed files
# Required in interactive notebooks only.
%reload_ext autoreload
%autoreload 1

## Let's get started

First we will import EAGERx and initialize it.
EAGERx uses ROS functionality for communication, so during initialization a ROS master is started if one is not already running.

In [None]:
import eagerx
# Initialize eagerx (starts roscore if not already started.)
eagerx.initialize("eagerx_core")

An `Object` is an entity that has inputs (actuators), outputs (sensors) and states (that can be reset at the beginning of an episode).

We are going to create one object (the pendulum). In this first tutorial we will not go into much detail, so we start with an existing object.
If you are interested, you can find its definition [here](https://github.com/eager-dev/eagerx_tutorials/blob/master/eagerx_tutorials/pendulum/objects.py).
Note that we import the pendulum.
While this might look like an unused import, it is not.
During the import, the pendulum object is registered and we can therefore make it based on its ID, i.e. *Pendulum*.

Before making the object, we will first obtain some info on the *Pendulum*, such that we know with what arguments we should make it.

In [None]:
import eagerx_tutorials.pendulum  # Registers Pendulum

eagerx.Object.info("Pendulum")

We see that `eagerx.Object.info("Pendulum")` provides information on the *Pendulum* object.
It has four sensors (*theta*, *dtheta*, *image*, *u*), one actuator (*u*) and two states (*model_state*, *model_parameters*).
Here *theta*, *dtheta* and *u* correspond to $\theta$, $\dot{\theta}$ and $u$, respectively.
For now, we are only interested in how to make this object; the other information will be covered in later tutorials.
We can make the *Pendulum* object with the `eagerx.Object.make` method with the required arguments *entity_id* and (a unique) *name*.
Furthermore, we will specify which actuators, sensors and states we will use:

In [None]:
# Make pendulum
pendulum = eagerx.Object.make("Pendulum", "pendulum", actuators=["u"], sensors=["theta", "dtheta", "image"], states=["model_state"])

Next, we create a [Graph](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html) and add the pendulum to it.

The graph describes how nodes and objects are interconnected.
In this way, the creation of an environment becomes modular.
Users can implement nodes and objects once and easily create new environments by reusing these implementations.
This also makes it possible to construct complex environments using nodes and objects as basic building blocks.

After adding the pendulum to the graph, we will connect the actuator *u* to a new action called *voltage*.
We will connect the sensors *theta* and *dtheta* to the observations *angle* and *angular_velocity*, respectively.
In this way, the agent will be able to send actions to control $u$ of the pendulum and observe $\theta$ and $\dot{\theta}$.

Finally, we will also render the *image* sensor in order to visualize the pendulum.
More detailed information on rendering is covered in another tutorial.

In [None]:
# Define rate (depends on rate of ode)
rate = 30.0

# Initialize empty graph
graph = eagerx.Graph.create()

# Add pendulum to the graph
graph.add(pendulum)

# Connect the pendulum to an action and observation
graph.connect(action="voltage", target=pendulum.actuators.u)
graph.connect(source=pendulum.sensors.theta, observation="angle")
graph.connect(source=pendulum.sensors.dtheta, observation="angular_velocity")

# Render image
graph.render(source=pendulum.sensors.image, rate=rate)

It is also possible to inspect the graph using the eagerx-gui package.
It can be installed as follows:
```bash
pip3 install eagerx-gui
```
Jupyter notebooks have limited support for interactive applications, so we cannot open the GUI here.
But if we were to run
```python
graph.gui()
```
the output would be as follows:

<img src="./figures/tutorial_1_gui.svg" width=720>

Here we see that the actions of the agent are outputs of *env/actions* and that the observations of the agent are inputs of *env/observations*.
Also, we could render output by connecting to *env/render*, which will be covered in another tutorial.
Note that *env/actions*, *env/observations* and *env/render* represent connections of the `Graph` to the environment.
They are split up in the GUI as nodes for visualization purposes.

Next, we will create the [Bridge](https://eagerx.readthedocs.io/en/master/guide/api_reference/bridge/index.html).
Since objects can have implementations for multiple physics engines and real systems, we need to initialize the appropriate bridge.
In our case, we will use the [OdeBridge](https://github.com/eager-dev/eagerx_ode), which allows us to simulate systems based on ordinary differential equations (ODEs).
In other tutorials we will go into more detail on the bridge and how you can create your own bridge.

In [None]:
import eagerx_ode  # Registers OdeBridge

# Define bridges
bridge = eagerx.Bridge.make("OdeBridge", rate=rate)

Just like in normal Gym environments, we will create a step function in which we will calculate the reward at each time step and check for termination conditions.
Our goal is to stabilize the pendulum in the upright position, while minimizing the angular velocity and the input voltage that is applied.
Therefore we choose a cost that is a weighted sum of $\theta^2$, $\dot{\theta}^2$ and $u^2$, and use its negative as the reward.

Note that we can obtain the values of the actions and observations using the keys *voltage*, *angle* and *angular_velocity*, which correspond to the names of the actions and observations above in the screenshot of the GUI.

In [None]:
import numpy as np
from typing import Dict

# Define step function
def step_fn(prev_obs: Dict[str, np.ndarray], obs: Dict[str, np.ndarray], action: Dict[str, np.ndarray], steps: int):
    
    # Get angle and angular velocity
    # Take first element because of window size (covered in another tutorial)
    th = obs["angle"][0] 
    thdot = obs["angular_velocity"][0]
    
    # Convert from numpy array to float
    u = float(action["voltage"])
    
    # Normalize angle so it lies in [-pi, pi)
    th -= 2 * np.pi * np.floor((th + np.pi) / (2 * np.pi))
    
    # Calculate cost
    # Penalize angle error, angular velocity and input voltage
    cost = th**2 + 0.1 * thdot**2 + 0.001 * u**2  
    
    # Determine when the episode is over
    # currently just a timeout after 100 steps
    done = steps > 100
    
    # Set info, tell the algorithm the termination was due to a timeout
    # (the episode was truncated)
    info = {"TimeLimit.truncated": steps > 100}
    
    return obs, -cost, done, info
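
To see the angle normalization and the cost in isolation, the following sketch reproduces both computations from the step function as plain Python functions, without any EAGERx dependency. The names `wrap_angle` and `swing_up_cost` are hypothetical helpers introduced here for illustration; the formula and the weights are the ones used in the step function above.

```python
import math


def wrap_angle(th: float) -> float:
    """Map an angle in radians to the interval [-pi, pi)."""
    return th - 2 * math.pi * math.floor((th + math.pi) / (2 * math.pi))


def swing_up_cost(th: float, thdot: float, u: float) -> float:
    """Weighted sum of squared angle error, angular velocity and input voltage."""
    th = wrap_angle(th)
    return th**2 + 0.1 * thdot**2 + 0.001 * u**2


# A full extra rotation is equivalent to the zero angle, so it is not penalized
full_turn_cost = swing_up_cost(2 * math.pi, 0.0, 0.0)

# The reward is the negative cost, as returned by the step function
reward = -swing_up_cost(1.0, 2.0, 3.0)
print(full_turn_cost, reward)
```

Without the normalization, a pendulum that has spun around several times would be penalized even when it is standing still at the target angle.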

Having created a graph, a bridge and a step function, we can now construct the EAGERx environment.
We can use it like any other Gym environment.
We will now train a policy to swing up the pendulum using the Soft Actor-Critic (SAC) reinforcement learning algorithm implementation from [Stable Baselines 3](https://stable-baselines3.readthedocs.io/en/master/).

In [None]:
import stable_baselines3 as sb
from eagerx.wrappers import Flatten

# Initialize Environment
env = eagerx.EagerxEnv(name="PendulumEnv", rate=rate, graph=graph, bridge=bridge, step_fn=step_fn)

# Toggle render
env.render("human")

# Stable Baselines3 expects flattened actions & observations
# Convert observation and action space from Dict() to Box()
env = Flatten(env)

# Initialize learner
model = sb.SAC("MlpPolicy", env, verbose=1)

# Train for 1 minute (sim time)
model.learn(total_timesteps=int(60 * rate))

env.shutdown()