# Reinforcement Learning graded exercise

## Environment simulation

We will start with a simple deterministic control environment that we will make stochastic in a second phase.

### Simple line control environment

We will begin with a very simple control problem: keeping a moving object close to a straight line.
We control its acceleration, as shown in the following picture.
The objective is to keep the distance between the object and the center line as small as possible for as long as possible.
To make the problem a bit more interesting, we constrain the norm of the acceleration to be at least a given value $a_{min}$: $\forall t \geqslant 0, \Vert a(t) \Vert_2 \geqslant a_{min}$.

![Line control environment](line_control_diagram.png)

#### Mathematical modeling of the Reinforcement Learning problem

We want to learn to control the object using Reinforcement Learning.
The state of the system must contain the minimal information required to update the physics of the system from the action, i.e. the acceleration of the object.
In this case, we need at least the position and the speed of the object, so the state is defined at any time $t$ by $s(t)=(x(t), y(t), v_x(t), v_y(t))$.
Moreover, for classical RL algorithms to apply, we need to discretize time into steps of $\Delta t$ time units.
We can now approximate the physics of the object movement with the following equations, knowing the initial state $s(0)=(x(0), y(0), v_x(0), v_y(0))$:
- $\forall t \geqslant 0, v_x(t+\Delta t) = v_x(t) + a_x(t) \cdot \Delta t$;
- $\forall t \geqslant 0, x(t+\Delta t) = x(t) + v_x(t) \cdot \Delta t$;
- $\forall t \geqslant 0, v_y(t+\Delta t) = v_y(t) + a_y(t) \cdot \Delta t$;
- $\forall t \geqslant 0, y(t+\Delta t) = y(t) + v_y(t) \cdot \Delta t$.
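
As a minimal sketch of the update equations above, the following plain-Python function advances the state by one time step. The name `euler_step` and the tuple-based state are illustrative assumptions; they are not part of scikit-decide, and the exercise itself stores the state in NumPy arrays.

```python
from typing import Tuple

# Hypothetical state layout for this sketch: (x, y, v_x, v_y)
State = Tuple[float, float, float, float]


def euler_step(state: State, action: Tuple[float, float], delta_t: float) -> State:
    """Advance the state by one time step of length delta_t."""
    x, y, v_x, v_y = state
    a_x, a_y = action
    # Positions are updated with the velocity at time t, as in the equations.
    new_x = x + v_x * delta_t
    new_y = y + v_y * delta_t
    # Velocities are updated with the acceleration (the action) at time t.
    new_v_x = v_x + a_x * delta_t
    new_v_y = v_y + a_y * delta_t
    return (new_x, new_y, new_v_x, new_v_y)


state = (0.0, 0.5, 10.0, 1.0)
next_state = euler_step(state, (0.0, -1.0), delta_t=0.5)
print(next_state)
```

Note that the position update uses the velocity before it is modified by the action, so an action only affects the position one step later.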

Since we want to minimize the distance between the center line and the object, we model the reward signal at any time $t$ by $r(t) = e^{-\vert y(t) \vert}$.
This way, an RL agent that maximizes the cumulative sum of (discounted) rewards will try to keep $y(t)$ as close as possible to $0$ at every time step.
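
To illustrate why this reward favors staying near the center line, here is a small sketch that compares the discounted return of two short sequences of lateral positions. The helper names and the discount factor `gamma` are assumptions chosen for illustration only.

```python
import math
from typing import Sequence


def reward(y: float) -> float:
    """Reward r = exp(-|y|): equal to 1 on the center line, decaying with distance."""
    return math.exp(-abs(y))


def discounted_return(ys: Sequence[float], gamma: float) -> float:
    """Sum of gamma^k * r(y_k) over a sequence of lateral positions."""
    return sum((gamma ** k) * reward(y) for k, y in enumerate(ys))


near = discounted_return([0.1, 0.05, 0.0], gamma=0.99)  # stays close to the line
far = discounted_return([1.0, 1.5, 2.0], gamma=0.99)    # drifts away from the line
print(near > far)
```
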
There are two possible ways to enforce the constraint $\Vert a(t) \Vert_2 \geqslant a_{min}$ in RL:
- either by ensuring that the algorithm only selects actions that satisfy it;
- or by associating a very large penalty (i.e. a negative reward) with transitions whose action violates it.
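
The two options can be sketched as follows. This is only one possible reading: rescaling a too-small action to norm $a_{min}$ is an assumed way of "only selecting valid actions", and the function names and constant values are hypothetical.

```python
import math
from typing import Tuple

A_MIN = 0.5        # assumed minimum acceleration norm
PENALTY = -1000.0  # assumed penalty for violating the constraint

Action = Tuple[float, float]


def norm(action: Action) -> float:
    return math.hypot(action[0], action[1])


def rescale_to_min_norm(action: Action, a_min: float = A_MIN) -> Action:
    """Option 1: turn any action into one that satisfies the constraint."""
    n = norm(action)
    if n >= a_min:
        return action
    if n == 0.0:
        # The direction is undefined for a zero action; pick one arbitrarily.
        return (a_min, 0.0)
    scale = a_min / n
    return (action[0] * scale, action[1] * scale)


def reward_with_penalty(y: float, action: Action, a_min: float = A_MIN) -> float:
    """Option 2: keep the action as is, but penalize constraint violations."""
    if norm(action) < a_min:
        return PENALTY
    return math.exp(-abs(y))


print(rescale_to_min_norm((0.0, 0.1)))
print(reward_with_penalty(0.0, (0.1, 0.0)))
```

The first option keeps the reward signal clean but changes the action actually applied. The second leaves the action space untouched and relies on the agent learning to avoid the penalty.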

#### Implementation in a Gym environment

[OpenAI Gym](https://gym.openai.com/) is a popular Python software library to model RL environments in a standard way which can be exploited by RL algorithm libraries like [RLlib](https://www.ray.io/rllib) or [Stable Baselines](https://github.com/DLR-RM/stable-baselines3).

OpenAI Gym, or Gym for short, provides well-known environment implementations like CartPole. We can also implement our own environment by following its standards, which allows us to solve it with the well-implemented and efficient RL algorithms of the aforementioned libraries. We will use the open-source Airbus library [scikit-decide](https://github.com/airbus/scikit-decide), which lets us use either RLlib or Stable Baselines as the underlying solver.

All we have to do is to implement a domain class with the following methods:
```python
class State:
    # Define your state class here
    pass

class Action:
    # Define your action class here
    pass


class D(RLDomain, UnrestrictedActions, FullyObservable, Renderable):
    T_state = State  # Type of states
    T_observation = T_state  # Type of observations
    T_event = Action  # Type of events
    T_value = float  # Type of transition values (rewards or costs)
    T_info = None  # Type of additional information in environment outcome

class MyDomain(D):
    def __init__(self):
        # Declare your variables here, including the environment's state.
        # Declare also the action and observation (i.e. state) spaces: the action space is used by the algorithm
        # to select actions while the observation space is used by Deep RL algorithms to properly initialize
        # the observation (i.e. state) layer of the tensors.
        pass

    def _state_reset(self) -> D.T_state:
        # Initialize and return the initial state of the environment
        pass

    def _state_step(self, action: D.T_event) -> TransitionOutcome[D.T_state, Value[D.T_value], D.T_predicate, D.T_info]:
        # Perform one simulation step of the environment, i.e. compute the state resulting from applying the given action in the current state.
        # Don't forget to update the environment's state so that the next call to the step method will reason about the updated state.
        # Must return a TransitionOutcome(state, Value(reward=...), done, info) where done is true if the episode should stop now and info is a dictionary that can be left empty.
        pass

    def _get_observation_space_(self) -> Space[D.T_observation]:
        pass

    def _get_action_space_(self) -> Space[D.T_event]:
        pass

    def _render_from(self, memory: D.T_state, **kwargs: Any) -> Any:
        # If you want to render something at each simulation step (e.g. an image, some text, etc.)
        pass
```

#### It's your turn!
Please fill in the missing lines in the definition below of the Gym environment that implements the "simple line control" problem.

In [None]:
from typing import *
from numpy.typing import ArrayLike

from skdecide import *
from skdecide.builders.domain import *
from skdecide.hub.space.gym import *
import numpy as np
import pygame
from pygame import gfxdraw
from math import sqrt, exp, fabs


HORIZON = 500
ACCELERATION_MIN = 0.5
PENALTY = -1000.


class D(RLDomain, UnrestrictedActions, FullyObservable, Renderable):
    T_state = ArrayLike  # Type of states
    T_observation = T_state  # Type of observations
    T_event = ArrayLike  # Type of events
    T_action = T_event
    T_value = float  # Type of transition values (rewards or costs)
    T_info = None  # Type of additional information in environment outcome


class SimpleLineControlDomain(D):
    """This class mimics an OpenAI Gym environment"""

    metadata = {"render_modes": ["human", "rgb_array"], "render_fps": 50}

    def __init__(self, env_config=None):
        """Initialize the simple line control domain.
        # Parameters
        env_config: Optional environment configuration (not used by this domain).
        """
        inf = np.finfo(np.float32).max
        self.action_space = BoxSpace(
            np.array([-1.0, -1.0]), np.array([1.0, 1.0]), dtype=np.float32
        )
        self.observation_space = BoxSpace(
            np.array([-inf, -inf, -inf, -inf]),
            np.array([inf, inf, inf, inf]),
            dtype=np.float32,
        )
        self._delta_t = 0.001
        self._init_pos_x = 0.0
        self._init_pos_y = 0.5
        self._init_speed_x = 10.0
        self._init_speed_y = 1.0
        self._pos_x = None
        self._pos_y = None
        self._speed_x = None
        self._speed_y = None
        self._path = []

        self.screen = None
        self.clock = None
        self.isopen = True

    def get_state(self):
        return np.array(
            [self._pos_x, self._pos_y, self._speed_x, self._speed_y], dtype=np.float32
        )

    def set_state(self, state):
        self._pos_x = state[0]
        self._pos_y = state[1]
        self._speed_x = state[2]
        self._speed_y = state[3]

    def _state_reset(self) -> D.T_state:
        self._pos_x = self._init_pos_x
        self._pos_y = self._init_pos_y
        self._speed_x = self._init_speed_x
        self._speed_y = self._init_speed_y
        self._path = []
        return np.array(
            [self._pos_x, self._pos_y, self._speed_x, self._speed_y], dtype=np.float32
        )

    def _state_step(
        self, action: D.T_event
    ) -> TransitionOutcome[D.T_state, Value[D.T_value], D.T_predicate, D.T_info]:
        ### WRITE YOUR CODE HERE ###

        ############################
        self._path.append((self._pos_x, self._pos_y))
        return TransitionOutcome(obs, Value(reward=reward if not done else PENALTY), done, {})

    def _get_observation_space_(self) -> Space[D.T_observation]:
        return self.observation_space

    def _get_action_space_(self) -> Space[D.T_event]:
        return self.action_space

    def close(self):
        if self.screen is not None:
            pygame.display.quit()
            pygame.quit()
            self.isopen = False

    def _render_from(self, memory: D.T_state, **kwargs: Any) -> Any:
        screen_width = 600
        screen_height = 400

        if self.screen is None:
            pygame.init()
            pygame.display.init()
            self.screen = pygame.display.set_mode((screen_width, screen_height))
        if self.clock is None:
            self.clock = pygame.time.Clock()

        self.surf = pygame.Surface((screen_width, screen_height))
        self.surf.fill((255, 255, 255))
        self.track = gfxdraw.hline(
            self.surf, 0, screen_width, int(screen_height / 2), (0, 0, 255)
        )

        if len(self._path) > 1:
            for p in range(len(self._path) - 1):
                gfxdraw.line(
                    self.surf,
                    int(self._path[p][0] * 100),
                    int(screen_height / 2 + self._path[p][1] * 100),
                    int(self._path[p + 1][0] * 100),
                    int(screen_height / 2 + self._path[p + 1][1] * 100),
                    (255, 0, 0),
                )

        self.surf = pygame.transform.flip(self.surf, False, True)
        self.screen.blit(self.surf, (0, 0))
        return np.transpose(
            np.array(pygame.surfarray.pixels3d(self.screen)), axes=(1, 0, 2)
        )

Now we test a trajectory of a random RL agent.

In [None]:
from typing import *

from skdecide import *
from skdecide.builders.domain import *
from skdecide.builders.solver import *


class D(RLDomain, UnrestrictedActions, FullyObservable, Renderable):
    pass


class RandomSolver(DeterministicPolicySolver):
    T_domain = D
    
    def __init__(self, action_space) -> None:
        super().__init__()
        self.action_space = action_space

    def _get_next_action(self, observation: D.T_observation) -> D.T_event:
        # Sample a random action directly from the action space.
        return self.action_space.sample()

    def _is_policy_defined_for(self, observation: D.T_observation) -> bool:
        return True

In [None]:
import matplotlib.pyplot as plt
import random
from IPython import display
%matplotlib inline

domain_factory = lambda: SimpleLineControlDomain()
domain = domain_factory()

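def rollout(domain, solver, max_steps):
    obs = domain.reset()

    for i in range(max_steps):
        plt.imshow(domain.render())
        display.display(plt.gcf())
        display.clear_output(wait=True)
        # The policy needs the current observation to choose an action.
        outcome = domain.step(solver.sample_action(obs))
        obs = outcome.observation
        # Stop as soon as the episode terminates.
        if outcome.termination:
            break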

rollout(domain=domain, solver=RandomSolver(domain.get_action_space()), max_steps=50)
domain.close()

We now optimize the RL agent using RLlib.

In [None]:
# Import the RL algorithm (Trainer) we would like to use.
from ray.rllib.algorithms.ppo import PPO
from skdecide.hub.solver.ray_rllib import RayRLlib

assert RayRLlib.check_domain(domain)
solver_factory = lambda: RayRLlib(
    PPO, train_iterations=100
)

with solver_factory() as solver:
    # Solve domain
    SimpleLineControlDomain.solve_with(solver, domain_factory)

    # Test solution
    rollout(
        domain=domain,
        solver=solver,
        max_steps=200,
    )

### Stochastic line control

Now we make the environment stochastic by assuming that the object's actuators are noisy: the lateral acceleration (along the y axis) follows a Gaussian distribution centered on the lateral acceleration command, with a standard deviation that depends on the magnitude of the longitudinal acceleration command.
If we denote by $a^c_x(t)$ and $a^c_y(t)$ the acceleration commands (the RL actions), the actual accelerations acting on the object are $a_x(t) = a^c_x(t)$ and $a_y(t) \sim \mathcal{N} \left( a^c_y(t), \sqrt{\vert a^c_x(t) \vert} \right)$, where the second parameter is the standard deviation.
Please implement this stochastic domain in a Gym environment and implement an RLlib agent that learns to solve it.
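
As a hint, the noise model above could be sampled along these lines. This sketch uses the standard-library `random` module to stay self-contained (NumPy's random generators would work just as well), and the function name `noisy_acceleration` is hypothetical.

```python
import math
import random
from typing import Tuple


def noisy_acceleration(
    command: Tuple[float, float], rng: random.Random
) -> Tuple[float, float]:
    """Return the actual (a_x, a_y) produced by a commanded acceleration."""
    a_x_c, a_y_c = command
    # The standard deviation grows with the longitudinal command magnitude.
    sigma = math.sqrt(abs(a_x_c))
    a_x = a_x_c                    # the longitudinal acceleration is not noisy
    a_y = rng.gauss(a_y_c, sigma)  # the lateral acceleration is Gaussian
    return (a_x, a_y)


rng = random.Random(0)  # fixed seed for reproducibility
samples = [noisy_acceleration((0.25, 1.0), rng)[1] for _ in range(10000)]
mean = sum(samples) / len(samples)
print(round(mean, 1))
```

With a zero longitudinal command the standard deviation is zero, so the lateral acceleration equals its command exactly.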

In [None]:
### YOUR TURN ###

Please experiment with another noisy acceleration model of your choice.

In [None]:
### YOUR TURN ###