# Reinforcement Learning course notebook

## Environment simulation

We will start with a simple control environment that you will complete yourselves, then move on to a more complex, open-source aircraft control environment.

### Simple line control environment

We will begin with a very simple control problem: keeping a moving object along a straight line.
We control its acceleration, as shown in the picture below.
The objective is to keep the distance between the object and the center line as small as possible, for as long as possible.
To make the problem more interesting, we constrain the norm of the acceleration to be at least a given value $a_{min}$: $\forall t \geqslant 0, \Vert a(t) \Vert_2 \geqslant a_{min}$.

![Line control environment](line_control_diagram.png)

#### Mathematical modeling of the Reinforcement Learning problem

We want to learn to control the object using Reinforcement Learning.
The state of the system must contain the minimal information required to update the physics of the system from the action, i.e. the acceleration of the object.
In this case, we need at least the position and the speed of the object, so the state is defined at any time $t$ by $s(t)=(x(t), y(t), v_x(t), v_y(t))$.
Moreover, for classical RL algorithms to apply, we need to discretize time into steps of $\Delta t$ time units.
We can now approximate the physics of the object's movement with the following equations, knowing the initial state $s(0)=(x(0), y(0), v_x(0), v_y(0))$:
- $\forall t \geqslant 0, v_x(t+\Delta t) = v_x(t) + a_x(t) \cdot \Delta t$;
- $\forall t \geqslant 0, x(t+\Delta t) = x(t) + v_x(t) \cdot \Delta t$;
- $\forall t \geqslant 0, v_y(t+\Delta t) = v_y(t) + a_y(t) \cdot \Delta t$;
- $\forall t \geqslant 0, y(t+\Delta t) = y(t) + v_y(t) \cdot \Delta t$.
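
The four update rules above can be sketched as a single function. This is a minimal illustration, not the notebook's environment class: the name `euler_step` and the tuple layout of the state are assumptions made for this example. It applies the equations exactly as written, i.e. the position is advanced with the speed at time $t$.

```python
def euler_step(state, action, delta_t):
    """Advance the state (x, y, v_x, v_y) by one time step of length delta_t.

    Hypothetical helper: it implements the four update equations as written,
    using the speed at time t to update the position.
    """
    x, y, v_x, v_y = state
    a_x, a_y = action
    new_x = x + v_x * delta_t
    new_y = y + v_y * delta_t
    new_v_x = v_x + a_x * delta_t
    new_v_y = v_y + a_y * delta_t
    return (new_x, new_y, new_v_x, new_v_y)


# One step from s(0) = (x, y, v_x, v_y) with a downward acceleration.
state = (0.0, 0.5, 10.0, 1.0)
state = euler_step(state, action=(0.0, -1.0), delta_t=0.1)
print(state)
```

The `step` method of the environment defined later in this notebook updates the speed first and then moves the object with the updated speed. That is a slightly different (semi-implicit) scheme; both describe the same motion as $\Delta t$ tends to $0$.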

Since we want to keep the distance between the center line and the object as small as possible, we model the reward signal at any time $t$ by $r(t) = e^{-\vert y(t) \vert}$.
An RL agent that maximizes the cumulative sum of (discounted) rewards will therefore try to keep $y(t)$ as close as possible to $0$ at every time step.
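
To make the reward concrete, here is a small sketch that evaluates $r(t)$ and a discounted sum of rewards on two short, made-up trajectories. The function names and the discount factor `gamma=0.99` are assumptions chosen for illustration; the text above does not fix a discount value.

```python
from math import exp


def reward(y):
    """Reward r(t) = exp(-|y(t)|): equal to 1 on the center line, smaller away from it."""
    return exp(-abs(y))


def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] (gamma is an assumed discount factor)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Two made-up sequences of lateral positions y(t), for illustration only.
on_line = [reward(y) for y in (0.0, 0.0, 0.0)]   # object stays on the center line
drifting = [reward(y) for y in (0.0, 0.5, 1.0)]  # object drifts away from it

print(discounted_return(on_line, gamma=0.99))
print(discounted_return(drifting, gamma=0.99))
```

The trajectory that stays on the line collects the larger return, which is the behaviour the reward is meant to encourage.
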
There are two possible ways to enforce the constraint $\Vert a(t) \Vert_2 \geqslant a_{min}$ in RL:
- either by ensuring that the algorithm only selects actions that satisfy it;
- or by associating a very large penalty (i.e. a large negative reward) with transitions whose action violates it.
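
Both options can be sketched in a few lines. The helpers below are hypothetical (`project_action` and `penalized_reward` are names introduced here), and the values of `A_MIN` and `PENALTY` simply mirror the constants used in the environment cell later in this notebook. Rescaling the action is only one possible way to implement the first option.

```python
from math import hypot

A_MIN = 0.5        # assumed value of a_min
PENALTY = -1000.0  # assumed penalty for violating the constraint


def project_action(action, a_min=A_MIN, margin=1e-6):
    """Option 1: rescale an action so that its Euclidean norm is at least a_min.

    A small relative margin guards against floating-point rounding leaving
    the norm marginally below a_min.
    """
    a_x, a_y = action
    norm = hypot(a_x, a_y)
    if norm >= a_min:
        return (a_x, a_y)
    target = a_min * (1.0 + margin)
    if norm == 0.0:
        # The direction is undefined: pick an arbitrary feasible action.
        return (target, 0.0)
    scale = target / norm
    return (a_x * scale, a_y * scale)


def penalized_reward(action, base_reward, a_min=A_MIN, penalty=PENALTY):
    """Option 2: return a large penalty instead of the reward if the constraint is violated."""
    if hypot(action[0], action[1]) < a_min:
        return penalty
    return base_reward


print(project_action((0.1, 0.1)))
print(penalized_reward((0.1, 0.1), base_reward=1.0))
```

With the first option the agent never sees an infeasible transition, while the second lets it learn the constraint from experience.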

#### Implementation in a Gym environment

[OpenAI Gym](https://gym.openai.com/) is a popular Python library for modeling RL environments in a standard way, so that they can be used by RL algorithm libraries like [RLlib](https://www.ray.io/rllib) or [Stable Baselines](https://github.com/DLR-RM/stable-baselines3).
OpenAI Gym, or Gym for short, provides implementations of well-known environments like CartPole, but we can also implement our own environment by following its conventions, which allows us to solve it with the well-tested and efficient RL algorithms from the aforementioned libraries.
All we have to do is implement a class with the following methods:
```python
class MyEnvironment:
    def __init__(self):
        # Declare your variables here, including the environment's state.
        # Also declare the action and observation (i.e. state) spaces: the action space is used by the algorithm
        # to select actions, while the observation space is used by Deep RL algorithms to properly size
        # the input (i.e. observation) layer of their neural networks.
        pass

    def reset(self):
        # Initialize and return the initial state of the environment
        pass

    def step(self, action):
        # Perform one simulation step of the environment, i.e. compute the state resulting from applying the given action in the current state.
        # Don't forget to update the environment's state so that the next call to the step method will reason about the updated state.
        # Must return a tuple (state, reward, done, info) where done is true if the episode should stop now and info is a dictionary that can be left empty.
        pass

    def render(self, mode="human"):
        # If you want to render something at each simulation step (e.g. an image, some text, etc.)
        pass
```
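
As an illustration of this interface, here is a toy environment that follows the template, together with the usual interaction loop. `CountdownEnv` is a hypothetical example invented for this sketch; it omits the action and observation spaces for brevity and uses the `(state, reward, done, info)` return convention described above.

```python
class CountdownEnv:
    """Toy environment (hypothetical, for illustration only).

    The state is an integer counter. Action 1 decrements it, any other
    action leaves it unchanged. The episode ends when the counter reaches 0.
    """

    def __init__(self, start=3):
        self._start = start
        self._counter = None

    def reset(self):
        # Initialize and return the initial state.
        self._counter = self._start
        return self._counter

    def step(self, action):
        # Apply the action to the current state and store the new state.
        if action == 1:
            self._counter -= 1
        reward = -1.0  # every step costs 1, so shorter episodes are better
        done = self._counter <= 0
        return self._counter, reward, done, {}

    def render(self, mode="human"):
        print("counter =", self._counter)


# Standard interaction loop: reset once, then step until the episode is done.
env = CountdownEnv(start=3)
state = env.reset()
done = False
total_reward = 0.0
while not done:
    state, reward, done, info = env.step(1)  # fixed policy: always decrement
    total_reward += reward
    env.render()
```

The same loop works unchanged with any environment that follows the template, which is what lets RL libraries treat environments interchangeably.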

#### It's your turn!
Please fill in the missing lines in the definition below of the Gym environment that implements the "simple line control" problem.

In [None]:
import gym
import numpy as np
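# Note: recent Gym releases no longer ship this rendering module; this cell assumes an older Gym version.
from gym.envs.classic_control import rendering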
from math import sqrt, exp, fabs


HORIZON = 500
ACCELERATION_MIN = 0.5
PENALTY = -1000.


class SimpleLineControlGymEnv:
    """This class mimics an OpenAI Gym environment"""

    def __init__(self):
        """Initialize the action and observation spaces, the time step and the initial state."""
        inf = np.finfo(np.float32).max
        self.action_space = gym.spaces.Box(
            np.array([-1.0, -1.0]), np.array([1.0, 1.0]), dtype=np.float32
        )
        self.observation_space = gym.spaces.Box(
            np.array([-inf, -inf, -inf, -inf]),
            np.array([inf, inf, inf, inf]),
            dtype=np.float32,
        )
        self._delta_t = 0.001
        self._init_pos_x = 0.0
        self._init_pos_y = 0.5
        self._init_speed_x = 10.0
        self._init_speed_y = 1.0
        self._pos_x = None
        self._pos_y = None
        self._speed_x = None
        self._speed_y = None
        self.viewer = None
        self._path = []

    def get_state(self):
        return np.array(
            [self._pos_x, self._pos_y, self._speed_x, self._speed_y], dtype=np.float32
        )

    def set_state(self, state):
        self._pos_x = state[0]
        self._pos_y = state[1]
        self._speed_x = state[2]
        self._speed_y = state[3]

    def reset(self):
        self._pos_x = self._init_pos_x
        self._pos_y = self._init_pos_y
        self._speed_x = self._init_speed_x
        self._speed_y = self._init_speed_y
        self._path = []
        return self.get_state()

    def step(self, action):
        # Constraint violation: an acceleration norm below the minimum ends the episode with a penalty.
        if sqrt(action[0] * action[0] + action[1] * action[1]) < ACCELERATION_MIN:
            return self.get_state(), PENALTY, True, {}
        # Update the speed first, then move the object with the updated speed.
        self._speed_x = self._speed_x + action[0] * self._delta_t
        self._speed_y = self._speed_y + action[1] * self._delta_t
        self._pos_x = self._pos_x + self._delta_t * self._speed_x
        self._pos_y = self._pos_y + self._delta_t * self._speed_y
        obs = self.get_state()
        reward = exp(-fabs(self._pos_y))
        # The episode also ends when the object is more than 1 unit away from the center line.
        done = bool(fabs(self._pos_y) > 1.0)
        self._path.append((self._pos_x, self._pos_y))
        return obs, reward, done, {}

    def render(self, mode="human"):
        screen_width = 600
        screen_height = 400

        if self.viewer is None:
            self.viewer = rendering.Viewer(screen_width, screen_height)
            self.track = rendering.Line(
                (0, screen_height / 2), (screen_width, screen_height / 2)
            )
            self.track.set_color(0, 0, 1)
            self.viewer.add_geom(self.track)
            self.traj = rendering.PolyLine([], False)
            self.traj.set_color(1, 0, 0)
            self.traj.set_linewidth(3)
            self.viewer.add_geom(self.traj)

        if len(self.traj.v) != len(self._path):
            self.traj.v = []
            for p in self._path:
                self.traj.v.append((p[0] * 100, screen_height / 2 + p[1] * 100))

        return self.viewer.render(return_rgb_array=mode == "rgb_array")

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None
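
The constant `HORIZON` is defined in the cell above, but the `step` method does not use it, so the caller has to cap the episode length. The sketch below shows one way to do that with a generic rollout loop. Both `rollout` and `brake_policy` are hypothetical helpers written for this illustration; the policy is a naive hand-written rule, not a learned or tuned controller.

```python
HORIZON = 500  # same value as the constant defined in the cell above


def rollout(env, policy, horizon=HORIZON):
    """Run one episode of at most `horizon` steps; return (total reward, number of steps)."""
    state = env.reset()
    total_reward = 0.0
    steps = 0
    for _ in range(horizon):
        action = policy(state)
        state, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def brake_policy(state):
    """Naive hand-written rule (assumption): accelerate against the lateral offset and speed.

    The returned action always has norm 1, so it satisfies a minimum
    acceleration norm of 0.5.
    """
    _x, y, _v_x, v_y = state
    a_y = -1.0 if (y + v_y) > 0 else 1.0
    return (0.0, a_y)
```

Assuming the class above is defined in the same session, a call such as `rollout(SimpleLineControlGymEnv(), brake_policy)` runs one capped episode and returns its cumulative reward and length.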


### Aircraft heading and altitude control environment

Now we investigate a more complex control problem: controlling an aircraft in flight.
Based on the Gym environments from the [gym-jsbsim](https://github.com/galleon/gym-jsbsim) library, which simulate aircraft physics,
we will try to learn to follow a target heading and altitude, where a new target heading and altitude are set every 150 seconds.
The environment is explained [here](https://github.com/galleon/gym-jsbsim/blob/master/README.md#heading-and-altitude-task).

In [None]:
import gym
import gym_jsbsim

env = gym.make("GymJsbsim-HeadingAltitudeControlTask-v0")
env.reset()
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, _ = env.step(action)
    print('state: {}'.format(state))