# Exercises Tutorial 2

In these exercises we will address the scenario in which the angular velocity $\dot{\theta}$ is not available.
So now we want to learn to swing up the pendulum using the angle $\theta$ only.
Step-by-step we will guide you through the process of creating an environment for this scenario in the following exercises.

For these exercises, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab, use *Restart and run all* under *Runtime*).


## 1. Violation of the Markov property

We could naively remove the sensor *dtheta* in the environment above and train the agent.
However, this will most likely not result in a successful policy because the [Markov property](http://www.incompleteideas.net/book/3/node6.html) is violated.
It is not possible to fully restore the Markov property without observing $\dot{\theta}$, but we can create a representation that is sufficient for solving the task.
If we stack the last three measurements of $\theta$ and provide this information as an observation, the agent will be able to approximate $\dot{\theta}$ (e.g. using a finite difference method).
With this information, the agent can estimate the angular velocity at the previous time step.
If we also provide the last applied action as an observation, the agent will be able to estimate $\dot{\theta}$ at the current time step.

After this the graph should look as follows:

<img src="./img/tutorial_21_gui.svg" width=720>

Furthermore, the Markov property can also be violated due to delays.
If we want our policy to transfer from simulation to a real system, we also need to account for delays that are present in the real world.
Therefore we will simulate that the sensor *theta* has a delay of 0.01 seconds.

We also have to update `step_fn`, since we no longer have the *angular_velocity* observation.
In the reward function, we still want to penalize the angular velocity.
Therefore we will have to approximate $\dot{\theta}$ in `step_fn`, which could for example be done as follows: $\hat{\dot{\theta}} = \text{rate} \times (\theta_k  - \theta_{k - 1})$ where $k$ is the time step and rate is the `rate` of the [environment](https://eagerx.readthedocs.io/en/master/guide/api_reference/env/index.html) in Hz.
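The finite-difference approximation above can be sketched in plain Python as follows. This is a minimal illustration, not the tutorial's reference solution: the function names, the rate of 30 Hz, and the wrapping of the angle difference are assumptions added here. The wrapping only matters if $\theta$ is observed as a normalized angle that jumps between $\pi$ and $-\pi$.

```python
import math


def wrap_angle(angle):
    """Wrap an angle to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def estimate_dtheta(theta_k, theta_km1, rate):
    """Backward finite-difference estimate of the angular velocity in rad/s.

    theta_k is the current angle, theta_km1 the previous one, and rate the
    environment rate in Hz (so 1 / rate is the time between the two samples).
    """
    # Wrapping the difference (an addition to the formula in the text) avoids
    # a spurious spike when a normalized angle jumps between pi and -pi.
    return rate * wrap_angle(theta_k - theta_km1)


rate = 30.0  # assumed environment rate in Hz
print(estimate_dtheta(0.10, 0.05, rate))  # roughly 1.5 rad/s
```

Inside `step_fn`, the two angles would come from the current and the previous observation, as suggested by the hint in exercise 1.4.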

### Add your code to the following blocks: 

1.1 Remove sensor *dtheta* from and add sensor *u* to the list of sensors.  
1.2 Remove the connection from sensor *dtheta*. Also, connect sensor *theta* with `window` = 3 to stack the last three observations of $\theta$ and set `delay` to 0.01.  
1.3 Connect *u* to an observation called *action_applied* with `window` = 1. Do you know why *u* should be an observation to the agent in order to restore the Markov property?  
1.4 Update the `step_fn` such that we use an estimate of $\dot{\theta}$ to calculate the reward.
Hint: you could use the `previous_observation` for this.

## 2. Initial state sampling and domain randomization

Next, we will add [domain randomization](https://sites.google.com/view/domainrandomization/) in order to improve the robustness against model inaccuracies.
If we want to transfer a policy from simulation to a real system, we need to be aware that the model used for simulation is inaccurate and that the agent could possibly exploit these inaccuracies.
One of the techniques for addressing this problem is domain randomization, i.e. varying over simulator parameters in order to improve the robustness of the resulting policy.
More specifically, we will do this by varying over the ODE parameters ($m, l, J, b, K$ and $R$).
We can do this by adding the *model_parameters* state.

After this the graph should look as follows:

<img src="./img/tutorial_22_gui.svg" width=720>

We will also improve the reset procedure of the environment.
At the beginning of each episode, the environment is reset.
In the code as provided above, the pendulum is reset to the downward position with zero velocity each episode.
However, the initial state distribution can have a significant influence on the learning speed.
If the initial state is always $\mathbf{x}_0 = \begin{bmatrix} \pi \\ 0 \end{bmatrix}$, it will take many timesteps for the agent to obtain experience for $-\frac{\pi}{2} < \theta < \frac{\pi}{2}$.
Namely, in the beginning the policy will be random and it is unlikely that acting randomly will result in the pendulum gaining enough momentum to move upwards.
This is problematic, since the agent will obtain the highest rewards when the pendulum is pointed upwards.
If the agent does not explore enough (see [the exploration-exploitation trade-off](http://www.incompleteideas.net/book/2/node2.html)), the agent will not know that it can obtain the highest rewards by swinging the pendulum upward.
Therefore, we will update the `reset_fn` such that the initial state is sampled randomly, rather than always being $\mathbf{x}_0 = \begin{bmatrix} \pi \\ 0 \end{bmatrix}$.
We also need to make sure that the aforementioned *model_parameters* state is reset to a random value each episode in order to perform domain randomization.

### Add your code to the following blocks: 

2.1 Add the state *model_parameters* to the list of states of the pendulum  
2.2 Update the reset function, such that the *model_state* and *model_parameters* states are reset to random values at the beginning of each episode.
Hint: you can sample states from the environment's state space as follows:
```python
env.state_space["[object_name]/[state_name]"].sample()
```
where *object_name* should be replaced with the name of the object and *state_name* with the name of the state.
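To make the idea of random resets concrete, here is a small stand-alone sketch that uses only the Python standard library instead of the environment's state space. Everything in it is hypothetical: the nominal parameter values, the 10% spread, and the velocity bound are placeholders chosen for illustration, not values from the tutorial.

```python
import math
import random

# Hypothetical nominal ODE parameters (placeholder values for illustration only).
NOMINAL = {"m": 0.05, "l": 0.04, "J": 1e-4, "b": 1e-5, "K": 0.05, "R": 10.0}


def sample_model_parameters(rng, spread=0.1):
    """Scale every nominal parameter by a random factor in [1 - spread, 1 + spread]."""
    return {
        name: value * rng.uniform(1 - spread, 1 + spread)
        for name, value in NOMINAL.items()
    }


def sample_initial_state(rng, max_speed=9.0):
    """Sample [theta, dtheta] uniformly instead of always starting at [pi, 0]."""
    theta = rng.uniform(-math.pi, math.pi)
    dtheta = rng.uniform(-max_speed, max_speed)
    return [theta, dtheta]


rng = random.Random(0)  # seeded for reproducibility
print(sample_initial_state(rng))
print(sample_model_parameters(rng))
```

In the exercise itself, both samples would instead be drawn with the `sample()` call from the hint above, so that the ranges defined in the environment's state space are respected.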

# Exercises Tutorial 3

In these exercises you will improve the sample efficiency of the learning problem by modifying the space converter.

For these exercises, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab, use *Restart and run all* under *Runtime*).


## 1. Angle Decomposition

In the code as provided above, we reduced the observation space by normalizing $\theta$.
This will improve the sample efficiency, but we can do even better.
Normalizing $\theta$ results in discontinuous observations of $\theta$, i.e. the sign flips when the angle increases beyond $\pi$ or decreases below $-\pi$.
Many (reinforcement) learning algorithms have difficulties with such discontinuities.
Therefore it is better to choose a representation for $\theta$ without discontinuities, e.g. its sine and cosine component: $[\sin(\theta), \cos(\theta)]$.


### Add your code to the following blocks: 

1.1 Instead of the normalized angle, the `B_TO_A` method should return the decomposed angle: $[\sin(\theta), \cos(\theta)]$.  
1.2 The values of `low` and `high` of the Gym space should be updated accordingly.  
1.3 The function `step_fn` should be updated as well. Reconstruct $\theta$, since it is no longer observed directly by the agent.  
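The decomposition and the reconstruction mentioned above can be illustrated with a few lines of plain Python. This is only a sketch with hypothetical function names. It shows that two angles on either side of the $\pm\pi$ boundary are almost identical in the $[\sin(\theta), \cos(\theta)]$ representation, and that $\theta$ can be recovered with `atan2`.

```python
import math


def decompose(theta):
    """Return the continuous representation [sin(theta), cos(theta)]."""
    return [math.sin(theta), math.cos(theta)]


def reconstruct(sin_theta, cos_theta):
    """Recover theta in [-pi, pi] from its sine and cosine components."""
    return math.atan2(sin_theta, cos_theta)


eps = 1e-3
just_below_pi = decompose(math.pi - eps)
just_above_minus_pi = decompose(-math.pi + eps)

# The normalized angles differ by almost 2 * pi, but the decomposed
# observations differ only by about 2 * eps in the sine component.
gap = max(abs(a - b) for a, b in zip(just_below_pi, just_above_minus_pi))
print(gap)

# Round trip: decomposing and reconstructing returns the original angle.
print(reconstruct(*decompose(2.0)))
```

Since sine and cosine both lie in $[-1, 1]$, this also suggests the bounds to use for `low` and `high` in exercise 1.2.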

# Exercises Tutorial 4

In these exercises you will finalize the implementation of the moving average filter node.
Furthermore, the Markov property will be violated after implementing the moving average filter.
You will restore the Markov property, which will also require you to consider the graph validity.

For these exercises, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab, use *Restart and run all* under *Runtime*).


## 1. Finalize the moving average filter

In the code as provided above, the implementation of the moving average filter is not yet finalized.
Currently, it just outputs the input signal without applying any filtering.
Finalize the filter such that moving average filtering is applied to the actuator $u$.


### Add your code to the following blocks: 

1.1 Add the custom parameter *n* (the window size of the moving average filter) to the node specification.  
1.2 Having added *n* to the specification, it becomes an argument to the [initialize()](https://eagerx.readthedocs.io/en/master/guide/api_reference/node/node.html#eagerx.core.entities.Node.initialize) method.
Also, we need *n* to be available in [reset()](https://eagerx.readthedocs.io/en/master/guide/api_reference/node/node.html#eagerx.core.entities.Node.reset) and [callback()](https://eagerx.readthedocs.io/en/master/guide/api_reference/node/node.html#eagerx.core.entities.Node.callback).
Therefore it should be added to `self`.
Furthermore, initialize the moving average with the value 0 by adding a new variable *moving_average* to `self`.  
1.3 During a reset at the beginning of the episode, we should make sure that the moving average is reset to 0.
So, make sure that the instance variable *moving_average* you have just created is reset to 0 in [reset()](https://eagerx.readthedocs.io/en/master/guide/api_reference/node/node.html#eagerx.core.entities.Node.reset).  
1.4 In [callback()](https://eagerx.readthedocs.io/en/master/guide/api_reference/node/node.html#eagerx.core.entities.Node.callback), the actual moving average should be calculated.
Calculate the moving average recursively, i.e. $a_t = \frac{(n-1)a_{t-1} + x_t}{n}$, where $a_t$ is the moving average at time step $t$, $n$ is the moving average window size and $x_t$ the value of the input *signal* at time step $t$.
Make sure you store the resulting moving average in instance variable *moving_average*.
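The recursive update from exercise 1.4 can be tried out in isolation with the sketch below. It is a plain Python class, not an EAGERx node: the class name and the `update` method are hypothetical stand-ins for the node's `initialize()`, `reset()` and `callback()` methods.

```python
class MovingAverage:
    """Recursive moving average: a_t = ((n - 1) * a_{t-1} + x_t) / n."""

    def __init__(self, n):
        # Store the window size and start the average at 0.
        self.n = n
        self.moving_average = 0.0

    def reset(self):
        # At the start of an episode the average is set back to 0.
        self.moving_average = 0.0

    def update(self, signal):
        # Blend the previous average with the newest input value.
        self.moving_average = ((self.n - 1) * self.moving_average + signal) / self.n
        return self.moving_average


avg_filter = MovingAverage(n=4)
outputs = [avg_filter.update(1.0) for _ in range(3)]
print(outputs)  # [0.25, 0.4375, 0.578125]
```

With a constant input of 1.0, the output rises gradually towards 1.0, which is the smoothing effect the filter has on the action.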


## 2. Restore the Markov property

After implementing the moving average filter, we have violated the [Markov property](http://www.incompleteideas.net/book/ebook/node32.html).
Namely, the state is no longer memoryless due to the filtering procedure.
We can restore the Markov property by adding the moving average, i.e. output *filtered* of the *filter* node, as an observation to the agent.

After this the graph should look as follows:

<img src="./img/tutorial_42_gui.svg" width=720>

However, if we naively connect *filtered* to an observation and now run

```python
graph.is_valid()
```

we get the following output:

<img src="./img/tutorial_42_communication_graph.svg" width=420>

Here we see that this results in a causal loop between *actions/voltage*, *filter/filtered* and *observations*.
Therefore, you need to consider the graph validity when creating this connection.

### Add your code to the following blocks: 

2.1 Connect the output *filtered* of the node *filter* to an observation called *moving_average* in order to restore the Markov property.
Make sure that the graph is valid after this connection.

*Hint*: when connecting to an observation, you need to specify an initial observation (`initial_obs`) in some cases (when `skip=True` is set to avoid a cycle in the graph).

# Exercise Tutorial 5

In this exercise you will add a new engine-specific implementation to the object definition of the underactuated pendulum.

For this exercise, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab, use *Restart and run all* under *Runtime*).

## 1. Add support for a new physics-engine
Up until now, we have simulated the pendulum dynamics with the *engine-specific* implementation [here](https://github.com/eager-dev/eagerx_tutorials/blob/3ddc2eb7558c7825095611fec3a01a47f5e7af79/eagerx_tutorials/pendulum/objects.py#L108-L168) that was registered with the [OdeBridge](https://github.com/eager-dev/eagerx_ode).

An exercise that interfaces a real pendulum would be the most informative. Unfortunately, interactive notebooks do not allow us to demonstrate this easily without requiring users to have the exact same pendulum we have in our lab. Therefore, we will instead add an implementation for the existing [GymBridge](https://github.com/eager-dev/eagerx/blob/master/eagerx/bridges/openai_gym/bridge.py). We created [GymBridge](https://github.com/eager-dev/eagerx/blob/master/eagerx/bridges/openai_gym/bridge.py) so that any [OpenAI environment](https://gym.openai.com/envs/#classic_control) can be used as the physics-engine. In this exercise we will use the dynamics of the [Pendulum-v1](https://gym.openai.com/envs/Pendulum-v0/) environment to simulate our pendulum. For this, we will make use of the already defined engine nodes [here](https://github.com/eager-dev/eagerx/blob/master/eagerx/bridges/openai_gym/enginenodes.py).

Given that you've already created the engine nodes to interface the real pendulum, you can easily add an implementation for the [RealBridge](https://github.com/eager-dev/eagerx_reality/blob/master/eagerx_reality/bridge.py) to train with a real pendulum following the same steps. Creating [engine nodes](https://eagerx.readthedocs.io/en/master/guide/api_reference/node/engine_node.html) is very similar to creating regular nodes, which was already covered in [tutorial 4](https://colab.research.google.com/github/eager-dev/eagerx_tutorials/blob/master/tutorials/pendulum/4_nodes.ipynb).

### Add your code to the following blocks: 

1.1.a Make an `EngineNode` that will be `dtheta`. Use `entity_id=FloatOutput` and set `idx=1` (the angular velocity is the second entry in the processed observation array, `angular_velocity = obs[1]`, hence `idx=1`).
*(hint: look at the code for `theta`).* 

1.1.b Add EngineNode `dtheta` to the engine graph. *(hint: look at the code for `theta`).*  

1.1.c Connect `dtheta` to the corresponding sensor with `sensor=dtheta`. *(hint: look at the code for `theta`).*  

1.2 Select the GymBridge by uncommenting the marked line. Run the code *(note: you may need to restart your kernel)*.  

1.3 Now, select sensor `u` (not to be confused with the actuator `u`!) for the pendulum and connect it as an `observation`. Run the code and observe that it fails. As the error states, we did not provide an implementation for sensor `u`. This highlights that it is not compulsory to implement every actuator, sensor, or state that was defined by the object. You are free to only support a subset of them. However, you **will** get an error if you try to run with one that does not have an *engine-specific* implementation for the selected bridge.  
1.4 Switch back to using the OdeBridge (while still selecting sensor `u`). Run the code. It should again run without problems, as the OdeBridge **does** have an implementation for the sensor `u`.  

# Exercise Tutorial 6

In this exercise you will create a node that overlays the applied actions over raw images that are produced by the image sensor of the pendulum. As the overlay node is agnostic to the physics-engine, we have the same overlay in every physics-engine.

For this exercise, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab, use *Restart and run all* under *Runtime*).

## 1. Render more informative images


### Add your code to the following blocks: 

1.1 Add the overlay node to the graph and connect the inputs `overlay.inputs.raw_image` and `overlay.inputs.u` to `pendulum.sensors.image` and action `voltage`, respectively.  
1.2 Change the render source to `overlay.outputs.image`. If you call `graph.gui()` using the [*eagerx_gui* package](https://github.com/eager-dev/eagerx_gui), you will see that the graph looks as shown below. Run the code, and you should now see the rendered overlay instead of the raw sensor images.

<img src="./img/tutorial_6_gui.svg" width=720>

1.3 In the callback of the overlay node, add the current time (i.e. `t_n`) as text to the image. Run the code, and you should see a timestamp that increases as the episode progresses.  
1.4 Select the GymBridge by uncommenting the marked line. Also deselect the `model_state` by uncommenting the marked line of code.  Run the code, and you should see that the raw image has changed, but the overlay is still put on top. Hence, this demonstrates the agnostic behavior of the `graph`. 

# Exercise Tutorial 7

In this exercise you will modify the reset routine defined above. 

For this exercise, you will need to modify or add some lines of code in the cells above.
These lines are indicated by the following comments:

```python
# START EXERCISE [BLOCK_NUMBER]

# END EXERCISE [BLOCK_NUMBER]
```

However, feel free to play with the other code as well if you are interested.
We recommend that you restart and run all code after each section (in Colab, use *Restart and run all* under *Runtime*).

## 1. Modify the reset routine


### Add your code to the following blocks: 

1.1 Change the reset function, such that the desired angles are sampled randomly around the downward position of the pendulum. This will improve state-space coverage and speed up learning.  
1.2 Next, modify the callback of the reset node such that we do not use the PID controller, but perform random actions for 2 seconds before considering the reset finished. This will improve state-space coverage even more, because we now also allow for non-zero angular velocity resets. 
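The timing logic behind exercise 1.2 could look roughly like the sketch below. It is a plain Python stand-in, not an EAGERx reset node: the class name, the voltage bound, and the choice to return a zero action once the reset is finished are all assumptions made for illustration. The callback receives the current time `t_n` in seconds and reports whether the reset is done.

```python
import random


class RandomActionReset:
    """Hypothetical stand-in for the reset node's callback logic."""

    def __init__(self, duration=2.0, max_voltage=2.0, seed=None):
        self.duration = duration  # seconds of random actions per reset
        self.max_voltage = max_voltage  # assumed symmetric action bound
        self.rng = random.Random(seed)
        self.t_start = None

    def reset(self):
        # Forget the start time so that the next callback starts a new reset.
        self.t_start = None

    def callback(self, t_n):
        # Remember the time of the first callback after a reset.
        if self.t_start is None:
            self.t_start = t_n
        done = (t_n - self.t_start) >= self.duration
        # Apply a random action until the reset duration has elapsed.
        action = 0.0 if done else self.rng.uniform(-self.max_voltage, self.max_voltage)
        return action, done


node = RandomActionReset(seed=0)
rate = 20.0  # assumed node rate in Hz
steps = 0
while not node.callback(steps / rate)[1]:
    steps += 1
print(steps)  # number of random actions applied before the reset finished
```

Because the pendulum is driven by random actions until the reset ends, each episode starts from a different angle and a non-zero angular velocity.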