# Tutorial 2: Reset and Step Function

In this tutorial, we will show how to create a gym environment using [EAGERx](https://eagerx.readthedocs.io/en/master/) while specifying the [step function](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv.step_fn) and [reset function](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv.reset_fn).

The following will be covered:
- Extracting observations in the [step_fn](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv.step_fn)
- Resetting states using the [reset_fn](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv.reset_fn)
- The `window` argument of the [connect method](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html?highlight=connect#eagerx.core.graph.Graph.connect)
- Simulating delays using the `delay` argument of the [connect method](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html?highlight=connect#eagerx.core.graph.Graph.connect)

In the remainder of this tutorial we will go into more detail on these concepts.

Furthermore, at the end of this notebook you will find exercises.
For the exercises you will have to add/modify a couple of lines of code, which are marked by

```python

# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

## Pendulum Swing-up

We will create an environment for solving the classic control problem of swinging up an underactuated pendulum, very similar to the [Pendulum-v0 environment](https://gym.openai.com/envs/Pendulum-v0/).
Our goal is to swing up this pendulum to the upright position and keep it there, while minimizing the velocity of the pendulum and the input voltage.

Since the dynamics of a pendulum actuated by a DC motor are well known, we can simulate the pendulum by integrating the corresponding ordinary differential equations (ODEs):


$\mathbf{x} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix} \\ \dot{\mathbf{x}} = \begin{bmatrix} \dot{\theta} \\ \frac{1}{J}(\frac{K}{R}u - mgl \sin{\theta} - b \dot{\theta} - \frac{K^2}{R}\dot{\theta})\end{bmatrix}$

with $\theta$ the angle w.r.t. upright position, $\dot{\theta}$ the angular velocity, $u$ the input voltage, $J$ the inertia, $m$ the mass, $g$ the gravitational constant, $l$ the length of the pendulum, $b$ the motor viscous friction constant, $K$ the motor constant and $R$ the electric resistance.

## Notebook Setup

In order to be able to run the code, we need to install the *eagerx_tutorials* package and ROS.

In [1]:
try:
    import eagerx_tutorials
except ImportError:
    !{"echo 'Installing eagerx-tutorials with pip.' && pip install eagerx-tutorials >> /tmp/eagerx_install.txt 2>&1"}
if 'google.colab' in str(get_ipython()):
    !{"curl 'https://raw.githubusercontent.com/eager-dev/eagerx_tutorials/master/scripts/setup_colab.sh' > ~/setup_colab.sh"}
    !{"bash ~/setup_colab.sh"}

# Setup interactive notebook
# Required in interactive notebooks only.
from eagerx_tutorials import helper
helper.setup_notebook()
env = None

# Allows reloading of registered entities from changed files
# Required in interactive notebooks only.
%reload_ext autoreload
%autoreload 1

Not running on CoLab.
Execute ROS commands as "!...".
ROS noetic available.


## Let's get started

We start by importing the required packages and initializing EAGERx.

In [2]:
import eagerx
import eagerx_tutorials.pendulum  # Registers Pendulum
import eagerx_ode  # Registers OdeBridge

# Initialize eagerx (starts roscore if not already started.)
eagerx.initialize("eagerx_core")

... logging to /home/jelle/.ros/log/d734afb2-c7c2-11ec-ab25-bdefe663dbb0/roslaunch-jelle-Alienware-m15-R4-63389.log
started roslaunch server http://145.94.60.89:33347/
ros_comm version 1.15.14


SUMMARY

PARAMETERS
 * /rosdistro: noetic
 * /rosversion: 1.15.14

NODES

[INFO] [1651241088.184788]: Roscore cannot run as another roscore/master is already running. Continuing without re-initializing the roscore.


Next, we make the *Pendulum* object and add it to an empty graph, just like we did in the [first tutorial](https://colab.research.google.com/github/eager-dev/eagerx_tutorials/blob/master/tutorials/pendulum/pendulum_1.ipynb).

We will again connect the *u* actuator of the *Pendulum* to an action that we will call *voltage* and connect the sensors *theta* and *dtheta* to observations, which we will call *angle* and *angular_velocity*.
However, we will now go into a bit more detail on the [connect method](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html?highlight=connect#eagerx.core.graph.Graph.connect).
When connecting outputs, sensors or actions, we can specify among other things the `window` of the connection.
It specifies how to deal with messages that are sent between nodes in between calls to their callback.
In some cases it makes sense to use the last one only; in others you would like to receive all messages between calls.
This can be achieved by setting the `window` size:

- `window` $= 1$: Only the last received input message is available to the receiver.
- `window` $= x \ge 1$: The last $x$ received input messages are available to the receiver ($1 \le$ number of received messages $\le$ `window`).
- `window` $= 0$: All input messages received since the last call to the node's callback are available.

This is particularly relevant when connecting to observations, since it has consequences for the size of the observation space.
When connecting to an observation with `window` $= 0$, this observation will **not** be included in the observation space of the agent, because its dimensions might change every time step and are therefore unknown beforehand.
Also note that for observations with `window` $= x > 1$, at time step $t < x$ the first message is repeated $x - t$ times to ensure that the dimensions of the observation space are consistent.
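
To make the padding rule for `window` $> 1$ concrete, here is a minimal sketch in plain Python. It is not EAGERx code: the helper name `stack_window` is hypothetical, and the "oldest first" ordering is an assumption, chosen so that the most recent message sits at index `-1`.

```python
from typing import List, Sequence


def stack_window(received: Sequence[float], window: int) -> List[float]:
    """Return the last `window` messages, oldest first.

    If fewer than `window` messages have been received, the first message is
    repeated so that the result always has length `window`.
    """
    if window < 1:
        raise ValueError("this sketch only covers window >= 1")
    if not received:
        raise ValueError("at least one message is required")
    recent = list(received[-window:])
    # Pad at the front with copies of the first received message.
    padding = [received[0]] * (window - len(recent))
    return padding + recent


print(stack_window([0.1], window=3))            # [0.1, 0.1, 0.1]
print(stack_window([0.1, 0.2], window=3))       # [0.1, 0.1, 0.2]
print(stack_window([0.1, 0.2, 0.3, 0.4], 3))    # [0.2, 0.3, 0.4]
```

Because the result always has length `window`, the observation space can have a fixed shape from the very first time step.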

Next to the `window` size, we can also specify the `delay` of each connection.
In this way, we can easily simulate delays for inputs and sensors.
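
As a rough mental model of what a delayed connection means (an assumption for illustration, not a description of how EAGERx implements `delay`), the receiver at time `now` can be thought of as seeing the newest sample that was produced at or before `now - delay`. The helper `delayed_sample_index` below is hypothetical.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def delayed_sample_index(timestamps: Sequence[float], now: float, delay: float) -> Optional[int]:
    """Index of the newest sample produced at or before `now - delay`.

    `timestamps` must be sorted in increasing order. Returns None if no
    sample is old enough to be visible yet.
    """
    index = bisect_right(timestamps, now - delay) - 1
    return index if index >= 0 else None


rate = 30.0                                  # Hz, as in this tutorial
timestamps = [k / rate for k in range(5)]    # sample k is produced at k / rate
now = timestamps[2]

print(delayed_sample_index(timestamps, now, delay=0.0))   # 2
print(delayed_sample_index(timestamps, now, delay=0.01))  # 1
```

In this picture, even a delay that is shorter than one sample period (1/30 s here) makes the receiver lag one sample behind.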


In [3]:
# Define rate (Hz)
rate = 30.0

# Initialize empty graph
graph = eagerx.Graph.create()

# Make pendulum

# START EXERCISE 1.1
sensors = ["theta", "dtheta", "image"]
# END EXERCISE 1.1

# START EXERCISE 2.1
states = ["model_state"]
# END EXERCISE 2.1

pendulum = eagerx.Object.make("Pendulum", "pendulum", actuators=["u"], sensors=sensors, states=states)

# Add pendulum to the graph
graph.add(pendulum)

# Connect the pendulum to an action and observation
# We will now explicitly set the window size
graph.connect(action="voltage", target=pendulum.actuators.u, window=1)
graph.connect(source=pendulum.sensors.theta, observation="angle", window=1)

# START EXERCISE 1.2
graph.connect(source=pendulum.sensors.dtheta, observation="angular_velocity", window=1)
# END EXERCISE 1.2

# START EXERCISE 1.3

# END EXERCISE 1.3

# Render image
graph.render(source=pendulum.sensors.image, rate=rate)

# Make OdeBridge
bridge = eagerx.Bridge.make("OdeBridge", rate=rate)

Using the [*eagerx_gui* package](https://github.com/eager-dev/eagerx_gui), we see that the graph looks as follows:


```python
graph.gui()
```

<img src="./figures/tutorial_1_gui.svg" width=720>

We will now define the [step function](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv.step_fn).
Here we define the `reward` and fill the `info` dictionary at each time step.
Since we want to stabilize the pendulum in the upright position while minimizing the input voltage, we define the reward to be the negative of a weighted sum of $\theta^2$, $\dot{\theta}^2$ and $u^2$.

We will elaborate a bit more on this step function.
The step function is an argument to the [EagerxEnv](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv).
This function is called by the EAGERx environment every time step and it returns the same things as the `step()` method of OpenAI Gym environments, i.e. `observation` (**dict**), `reward` (**float**), `done` (**boolean**) and `info` (**dict**).
More information on this can be found [here](https://gym.openai.com/docs/#observations).
The inputs to the step function in EAGERx are:

- `previous_observation` (**dict**): The `observation` at the previous timestep.
- `observation` (**dict**): The `observation` at the current timestep.
- `action` (**dict**): The agent's action at the current timestep. 
- `steps` (**int**): The number of timesteps since the start of the episode (since the last reset).

Note that `observation` is both an input and an output of this function: it should only be used for extracting information and should not be modified.

The keys of the `observation` and `action` dictionaries correspond to the values of the `observation` and `action` arguments, respectively, provided in the [connect method](https://eagerx.readthedocs.io/en/master/guide/api_reference/graph/graph.html?highlight=connect#eagerx.core.graph.Graph.connect).


In [4]:
import numpy as np
from typing import Dict

# Define step function
def step_fn(previous_observation: Dict[str, np.ndarray], observation: Dict[str, np.ndarray], action: Dict[str, np.ndarray], steps: int):
    
    # Get angle 
    th = observation["angle"][-1]
    
    # START EXERCISE 1.4
    thdot = observation["angular_velocity"][-1]
    # END EXERCISE 1.4
    
    # Convert from numpy array to float
    u = float(action["voltage"])
    
    # Normalize angle so it lies in [-pi, pi)
    # (computed out of place so that the entry in `observation` is not modified)
    th = th - 2 * np.pi * np.floor((th + np.pi) / (2 * np.pi))
    
    # Calculate cost
    # Penalize angle error, angular velocity and input voltage
    cost = th**2 + 0.1 * thdot**2 + 0.001 * u**2
    
    # Determine when the episode is over
    # Currently just a timeout after 100 steps
    done = steps > 100
    
    # Set info, tell the algorithm the termination was due to a timeout
    # (the episode was truncated)
    info = {"TimeLimit.truncated": steps > 100}
    
    return observation, -cost, done, info
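
The angle normalization and cost used in the step function above can be checked in isolation. The following is a minimal standalone sketch using only the standard library; the function names `normalize_angle` and `swing_up_cost` are hypothetical, while the formulas and weights are copied from `step_fn`.

```python
import math


def normalize_angle(th: float) -> float:
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return th - 2 * math.pi * math.floor((th + math.pi) / (2 * math.pi))


def swing_up_cost(th: float, thdot: float, u: float) -> float:
    """Weighted sum of squared angle error, angular velocity and voltage."""
    th = normalize_angle(th)
    return th**2 + 0.1 * thdot**2 + 0.001 * u**2


# 3*pi/2 is the same physical angle as -pi/2.
wrapped = normalize_angle(1.5 * math.pi)

# Upright and at rest costs nothing; hanging down at rest costs pi**2.
cost_upright = swing_up_cost(0.0, 0.0, 0.0)
cost_down = swing_up_cost(math.pi, 0.0, 0.0)
```

Since the reward is `-cost`, the best achievable reward per step is zero, obtained when the pendulum is upright, at rest, and no voltage is applied.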

Next we will also define a [reset function](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html#eagerx.core.env.EagerxEnv.reset_fn).
The reset function allows us to specify how states are reset at the beginning of an episode.
Remember that we have one object (*Pendulum*) with one state (*model_state*).
This *model_state* corresponds to $\mathbf{x} = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix}$.
The default reset function as defined in [EagerxEnv](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html?highlight=eagerxenv#eagerx.core.env.EagerxEnv) is:
```python
reset_fn = lambda env: env.state_space.sample()
```
which results in resetting each state to a randomly sampled value from the corresponding state distribution.

Here we will create a custom reset function that will reset the pendulum at the beginning of an episode to $\mathbf{x} = \begin{bmatrix} \pi \\ 0 \end{bmatrix}$.
This corresponds to the downward position with zero velocity.

In [5]:
# Define reset function

def reset_fn(env: eagerx.EagerxEnv):
    state_dict = dict()
    
    # START EXERCISE 2.2
    
    # key = "[object_name]/[state_name]"
    state_dict["pendulum/model_state"] = np.array([np.pi, 0], dtype="float32")
    
    # END EXERCISE 2.2
    
    return state_dict

Finally, we will initialize the environment and train the agent using [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/), similar to the first tutorial.

In [6]:
import stable_baselines3 as sb
from eagerx.wrappers import Flatten

# Initialize Environment
env = eagerx.EagerxEnv(name="PendulumEnv", rate=rate, graph=graph, bridge=bridge, step_fn=step_fn, reset_fn=reset_fn)

# Toggle render
env.render("human")

# Stable Baselines3 expects flattened actions & observations
# Convert observation and action space from Dict() to Box()
env = Flatten(env)

# Initialize learner
model = sb.SAC("MlpPolicy", env, verbose=1)

# Train for 1 minute (sim time)
model.learn(total_timesteps=int(60 * rate))

env.shutdown()

[INFO] [1651241088.944307]: Node "/PendulumEnv/env/supervisor" initialized.
[INFO] [1651241089.087094]: Node "/PendulumEnv/bridge" initialized.
[INFO] [1651241089.209922]: Node "/PendulumEnv/environment" initialized.
[INFO] [1651241089.236116]: Node "/PendulumEnv/env/render" initialized.
[INFO] [1651241089.311365]: Node "/PendulumEnv/pendulum/theta" initialized.
[INFO] [1651241089.383828]: Node "/PendulumEnv/pendulum/dtheta" initialized.
[INFO] [1651241089.454643]: START RENDERING!
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
[INFO] [1651241089.483672]: Adding object "pendulum" of type "Pendulum" to the simulator.
[INFO] [1651241089.501378]: Node "/PendulumEnv/pendulum/x" initialized.
[INFO] [1651241089.512113]: [pendulum/image] START RENDERING!
[INFO] [1651241089.518910]: Node "/PendulumEnv/pendulum/image" initialized.
[INFO] [1651241089.533136]: Node "/PendulumEnv/pendulum/pendulum_actuator" initialized.
[INFO] [1651241089.547454]: No

# Exercises

In these exercises we will address the scenario in which the angular velocity $\dot{\theta}$ is not available.
So now we want to learn to swing up the pendulum using the angle $\theta$ only.
Step-by-step we will guide you through the process of creating an environment for this scenario in the following exercises.

For these exercises, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab there is the option *Restart and run all* under *Runtime*).


## 1. Violation of the Markov property

We could naively remove the sensor *dtheta* in the environment above and train the agent.
However, this will most likely not result in a successful policy because the [Markov property](http://www.incompleteideas.net/book/3/node6.html) is violated.
It is not possible to fully restore the Markov property without observing $\dot{\theta}$, but we can create a representation that is sufficient for solving the task.
If we stack the last three measurements of $\theta$ and provide this information as an observation, the agent will be able to approximate $\dot{\theta}$ (e.g. using a finite difference method).
With this information, the agent can estimate the angular velocity at the previous time step.
If we also provide the last applied action as an observation, the agent will be able to estimate $\dot{\theta}$ at the current time step.

After this the graph should look as follows:

<img src="./figures/tutorial_21_gui.svg" width=720>

Furthermore, the Markov property can also be violated due to delays.
If we want our policy to transfer from simulation to a real system, we also need to account for delays that are present in the real world.
Therefore we will simulate that the sensor *theta* has a delay of 0.01 seconds.

We also have to update `step_fn`, since we no longer have the *angular_velocity* observation.
In the reward function, we still want to penalize the angular velocity.
Therefore we will have to approximate $\dot{\theta}$ in `step_fn`, which could for example be done as follows: $\hat{\dot{\theta}} = \text{rate} \times (\theta_k  - \theta_{k - 1})$ where $k$ is the time step and rate is the `rate` of the [environment](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html).
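
The finite-difference formula above can be sketched in plain Python as follows. The helper name `estimate_angular_velocity` is hypothetical, and the sketch assumes that the stacked angles are ordered oldest first (so the newest measurement is at index `-1`) and that they do not jump across the $\pm\pi$ wrap-around between two consecutive steps.

```python
from typing import Sequence


def estimate_angular_velocity(angles: Sequence[float], rate: float) -> float:
    """Backward difference over the two most recent angles (newest last)."""
    if len(angles) < 2:
        raise ValueError("need at least two angle measurements")
    return rate * (angles[-1] - angles[-2])


# Three stacked measurements of theta in radians, sampled at 30 Hz.
stacked_angles = [0.10, 0.13, 0.17]
velocity = estimate_angular_velocity(stacked_angles, rate=30.0)  # about 1.2 rad/s
```

At the first time step of an episode the stacked measurements are all copies of the first message, so this estimate is zero there.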

### Add your code to the following blocks: 

1.1 Remove sensor *dtheta* from and add sensor *u* to the list of sensors.  
1.2 Connect sensor *theta* with `window` = 3 to stack the last three observations of $\theta$ and set `delay` to 0.01.  
1.3 Connect *u* to an observation called *action_applied* with `window` = 1.  
1.4 Update the `step_fn` such that we use an estimate of $\dot{\theta}$ to calculate the reward.
Hint: you could use the `previous_observation` for this.

## 2. Initial state sampling and domain randomization

Next, we will add [domain randomization](https://sites.google.com/view/domainrandomization/) in order to improve the robustness against model inaccuracies.
If we want to transfer a policy from simulation to a real system, we need to be aware that the model used for simulation is inaccurate and that the agent could possibly exploit these inaccuracies.
One of the techniques for addressing this problem is domain randomization, i.e. varying the simulator parameters in order to improve the robustness of the resulting policy.
More specifically, we will do this by varying the ODE parameters ($m, l, J, b, K$ and $R$).
We can do this by adding the *model_parameters* state.
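
To illustrate the idea of domain randomization independently of EAGERx, the sketch below draws a new set of ODE parameters for every episode by scaling each nominal value with a random factor. The nominal values, the 10% spread and the helper name `randomize_parameters` are all hypothetical and chosen for illustration only; they are not the values used by the *Pendulum* object.

```python
import random
from typing import Dict

# Hypothetical nominal values, for illustration only.
NOMINAL = {"m": 0.05, "l": 0.04, "J": 1e-4, "b": 1e-5, "K": 0.05, "R": 10.0}


def randomize_parameters(nominal: Dict[str, float], spread: float, rng: random.Random) -> Dict[str, float]:
    """Scale each parameter by a factor drawn uniformly from [1 - spread, 1 + spread]."""
    return {
        name: value * rng.uniform(1.0 - spread, 1.0 + spread)
        for name, value in nominal.items()
    }


rng = random.Random(0)  # seeded for reproducibility
episode_parameters = randomize_parameters(NOMINAL, spread=0.1, rng=rng)
```

A policy trained across many such parameter draws cannot rely on one exact model, which is the intended effect.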

After this the graph should look as follows:

<img src="./figures/tutorial_22_gui.svg" width=720>

We will also improve the reset procedure of the environment.
At the beginning of each episode, the environment is reset.
In the code as provided above, the pendulum is reset to the downward position with zero velocity each episode.
However, the initial state distribution can have a significant influence on the learning speed.
If we start from the initial state $x_0 = \begin{bmatrix} \pi \\ 0 \end{bmatrix}$ every time, it will take many timesteps for the agent to obtain experience for $-\frac{\pi}{2} < \theta < \frac{\pi}{2}$.
This is because, in the beginning, the policy will be random and it is unlikely that acting randomly will result in the pendulum gaining enough momentum to move upwards.
This is problematic, since the agent will obtain the highest rewards when the pendulum is pointed upwards.
If the agent does not explore enough (see [the exploration-exploitation trade-off](http://www.incompleteideas.net/book/2/node2.html)), the agent will not know that it can obtain the highest rewards by swinging the pendulum upward.
Therefore, we will update the `reset_fn` such that we sample the initial state randomly, rather than starting from $x_0 = \begin{bmatrix} \pi \\ 0 \end{bmatrix}$ every time.
We also need to make sure that the aforementioned *model_parameters* state is reset in order to perform domain randomization.
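
As a generic illustration of random initial state sampling (not the exercise solution itself), the sketch below draws $\theta$ uniformly over the full circle and $\dot{\theta}$ from a symmetric interval. The helper name `sample_initial_state` and the bound `max_speed` are hypothetical; in the environment, the actual bounds are determined by the state space of the object.

```python
import math
import random
from typing import List


def sample_initial_state(rng: random.Random, max_speed: float = 9.0) -> List[float]:
    """Sample [theta, theta_dot] uniformly; `max_speed` is a made-up bound in rad/s."""
    theta = rng.uniform(-math.pi, math.pi)
    theta_dot = rng.uniform(-max_speed, max_speed)
    return [theta, theta_dot]


rng = random.Random(42)  # seeded for reproducibility
initial_states = [sample_initial_state(rng) for _ in range(1000)]

# Fraction of episodes that start in the upper half of the circle.
upper_half = sum(abs(theta) < math.pi / 2 for theta, _ in initial_states) / len(initial_states)
```

With uniform sampling, roughly half of the episodes start with $-\frac{\pi}{2} < \theta < \frac{\pi}{2}$, so the agent sees high-reward states from the start of training.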

### Add your code to the following blocks: 

2.1 Add the state *model_parameters* to the list of states of the pendulum  
2.2 Update the reset function, such that the *model_state* and *model_parameters* states are reset to random values at the beginning of each episode.