In [None]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_ml_control
%set_random_seed 12

In [None]:
%presentation_style

In [None]:
import warnings

warnings.simplefilter("ignore", UserWarning)

In [None]:
%autoreload
import numpy as np
from training_ml_control.shortest_path_problem import (
    create_shortest_path_graph,
    plot_shortest_path_graph,
    plot_all_paths_graph,
)
from training_ml_control.environments import (
    create_grid_world_environment,
    plot_grid_graph,
    plot_grid_all_paths_graph,
    simulate_environment,
)
from training_ml_control.nb_utils import (
    show_video,
)

:::{figure} ./_static/images/aai-institute-cover.png
:width: 90%
:align: center
---
name: aai-institute
---
:::

## Dynamic Programming

Dynamic programming (DP) is a method for solving optimization problems that involve a sequence of decisions. It breaks the problem into subproblems of the same form, one per decision, such that an optimal solution of the original problem can be assembled from optimal solutions of the subproblems. The method is based on Bellman's Principle of Optimality:

> An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

:::{figure} _static/images/20_optimality_principle.png
:width: 50%
:align: center
Schematic illustration of the principle of optimality. The tail $\{u_k^∗, \dots, u_{N-1}^*\}$ of an optimal sequence $\{u_0^∗, \dots, u_{N-1}^*\}$ is optimal for the tail subproblem that starts at the state $x_k^*$ of the optimal state trajectory.
:::

Dynamic Programming is a very general solution method for problems which have these properties:

- Optimal substructure (Principle of optimality applies)
  - The optimal solution can be decomposed into subproblems, e.g., shortest path.
- Overlapping subproblems
  - The subproblems recur many times.
  - The solutions can be cached and reused.
- Additive cost function
  - The cost function along a given path can be decomposed as the sum of cost functions for each step.
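
The three properties above can be seen together in a minimal sketch. The grid and its step costs are hypothetical (they do not come from this training's code): the agent moves right or down through a 4x4 grid, the path cost is the sum of the costs of the cells it enters, and `functools.lru_cache` stores each subproblem's solution so that it is computed only once.

```python
from functools import lru_cache
from math import comb

# Hypothetical step costs: COST[i][j] is paid when entering cell (i, j).
COST = [
    [0, 3, 1, 2],
    [2, 1, 4, 1],
    [3, 2, 1, 3],
    [1, 4, 2, 0],
]
N = len(COST)


@lru_cache(maxsize=None)
def cost_to_go(i: int, j: int) -> int:
    """Minimum summed cost from cell (i, j) to the bottom-right cell."""
    if (i, j) == (N - 1, N - 1):
        return 0
    options = []
    if i + 1 < N:  # move down
        options.append(COST[i + 1][j] + cost_to_go(i + 1, j))
    if j + 1 < N:  # move right
        options.append(COST[i][j + 1] + cost_to_go(i, j + 1))
    return min(options)


best = cost_to_go(0, 0)
n_paths = comb(2 * (N - 1), N - 1)              # distinct right/down paths
n_subproblems = cost_to_go.cache_info().misses  # cells actually solved
print(best, n_paths, n_subproblems)
```

Enumerating paths grows combinatorially with the grid size, whereas the cached recursion solves each cell once. That gap is what overlapping subproblems buy.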

Dynamic programming is used across a wide variety of domains, e.g.

- Scheduling algorithms
- Graph algorithms (e.g., shortest path algorithms)
- Graphical models in ML (e.g., Viterbi algorithm)
- Bioinformatics (e.g., Sequence alignment, Protein folding) 

## Bellman Equation

We define the **cost-to-go function** (also known as **optimal value function**) for any feasible $x_0 \in \mathbf{X}$ as[^*]:

$$
V(x_0) := \min_{u \in \mathbf{U}} J(x_0, u)
$$

[^*]: for the sake of simplicity we will focus on the discrete-time case

An admissible control sequence $u^*$ is called optimal, if

$$
V(x_0) = J(x_0, u^*)
$$

Given an infinite-horizon decision problem:

$$
V(x_0) = \min_{u \in \mathbf{U}} J(x_0, u) = \min_{u \in \mathbf{U}} \left[ \sum \limits_{k = 0}^{\infty} \gamma^k c_k(x_k, u_k) \right]
$$

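We can unroll the expression by one step to obtain:

$$
V(x_0) = \min_{u_0} \left[ c_0(x_0, u_0) + \min_{u_1, u_2, \dots} \sum \limits_{k = 1}^{\infty} \gamma^k c_k(x_k, u_k) \right]
$$

If the stage cost does not depend on $k$, i.e. $c_k = c$, factoring $\gamma$ out of the tail sum shows that the inner minimum equals $\gamma V(x_1)$, where $x_1 = f(x_0, u_0)$ and $f$ denotes the system dynamics. The expression is therefore equivalent to:

$$
V(x_0) = \min_{u_0} \left[ c(x_0, u_0) + \gamma V(f(x_0, u_0)) \right]
$$

This is the Bellman equation.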

For a finite-horizon problem with $N$ stages, the cost-to-go depends on the stage index. For the full problem starting at stage $0$, we define the **cost-to-go function** (also known as **optimal value function**) as:

$$
V_0(x_0) := \min_{u \in \mathbf{U}} J_0(x_0, u)
$$

An admissible control sequence $u^*$ is called optimal, if

$$
V_0(x_0) = J_0(x_0, u^*)
$$

For any stage $k$ and any feasible state $x_k \in \mathbf{X}$, the optimal value function satisfies

$$
V_k(x_k) = \displaystyle \min_{u \in \mathbf{U}} \left[ c_{k}(x_k, u) + V_{k+1}(f(x_k, u)) \right]
$$

Moreover, if $u^*$ is an optimal control sequence with corresponding state trajectory $x^*$, then

$$
V_k(x_k^*) = c_{k}(x_k^*, u_k^*) + V_{k+1}(f(x_k^*, u_k^*))
$$

and

$$
V_0(x_0) = J_0(x_0, u^*)
$$

### DP Algorithm

For every initial state $x_0$, the optimal cost is equal to $V_0(x_0)$, given by the last step of the following algorithm, which proceeds backward in time from stage $N-1$ to stage $0$:

- Start with $V_N(x_N) = g_N(x_N)$, where $g_N$ is the terminal cost.
- Then, for $k = N-1, \dots, 0$, let:

  $$
  V_{k}(x_{k}) = \displaystyle \min_{u_k \in \mathbf{U}} \left\{ c_{k}(x_{k}, u_{k}) + V_{k+1}(f(x_{k}, u_{k})) \right\}
  $$
  
Once the functions $V_0, \dots , V_N$ have been obtained, we can use a forward algorithm to construct an optimal control sequence $\{u_0^*, \dots, u_{N-1}^*\}$ and corresponding state trajectory $\{x_1^∗, \dots, x_{N}^*\}$ for the given initial state $x_0$.

$$
u_k^* = \operatorname*{arg\,min}_{u \in \mathbf{U}}
\left[ c_{k}(x_k^*, u) + V_{k+1}\left( f(x_k^*, u) \right) \right],
\qquad x_{k+1}^* = f(x_k^*, u_k^*),
\qquad x_0^* = x_0.
$$
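
The backward and forward passes can be sketched on a toy problem. Everything below is a hypothetical illustration (the integer state range, the dynamics `f`, the stage cost `c` and the terminal cost `g_N` are assumptions, not part of this training's code): an integer state is driven towards $0$ in $N = 4$ steps.

```python
# Hypothetical toy problem: integer state in [-3, 3], controls in {-1, 0, 1}.
STATES = range(-3, 4)
CONTROLS = (-1, 0, 1)
N = 4


def f(x, u):
    """Dynamics: move by u, saturating at the ends of the state range."""
    return max(-3, min(3, x + u))


def c(k, x, u):
    """Stage cost c_k(x, u); in this sketch it does not depend on k."""
    return x * x + abs(u)


def g_N(x):
    """Terminal cost."""
    return 10 * x * x


# Backward pass: V[N] = g_N, then k = N-1, ..., 0.
V = [dict() for _ in range(N + 1)]
V[N] = {x: g_N(x) for x in STATES}
for k in range(N - 1, -1, -1):
    for x in STATES:
        V[k][x] = min(c(k, x, u) + V[k + 1][f(x, u)] for u in CONTROLS)


def rollout(x0):
    """Forward pass: pick the minimizing control at each stage."""
    xs, us = [x0], []
    for k in range(N):
        x = xs[-1]
        u = min(CONTROLS, key=lambda u: c(k, x, u) + V[k + 1][f(x, u)])
        us.append(u)
        xs.append(f(x, u))
    return xs, us


xs, us = rollout(3)
total = sum(c(k, xs[k], us[k]) for k in range(N)) + g_N(xs[N])
print(us, xs, total, V[0][3])
```

The cost accumulated along the forward rollout equals `V[0][3]`, which is the statement $V_0(x_0) = J_0(x_0, u^*)$ for this example.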

:::{figure} _static/images/20_dynamic_programming.png
:width: 60%
Illustration of the DP algorithm. The tail subproblem that starts at $x_k$ at time $k$ minimizes over
$\{u_k , \dots , u_{N-1}\}$ the "cost-to-go" from $k$ to $N$.
:::

## Graph Search

In [None]:
G = create_shortest_path_graph()
plot_shortest_path_graph(G)

We wish to travel from node A to node G at minimum cost. If the cost represents time then we want to find the shortest path from A to G.

- Arrows (edges) indicate the possible movements.
- Numbers on edges indicate the cost of moving along an edge.

We can use Dynamic Programming to solve this problem.

We start by determining all possible paths.

In [None]:
plot_all_paths_graph(G)

We then compute the cost-to-go at each node to determine the shortest path.

Each node in this new graph represents a state. We will start from the tail (the last states) and compute recursively the cost for each state transition.

Let $g(n_1, n_2)$ be the cost of going from node $n_1$ to node $n_2$, and let $V(n)$ be the cost-to-go from node $n$.

$$
\begin{array}{lll}
V(\text{ABDF}) &= g(\text{ABDF}, \text{ABDFG}) &= 1\\
V(\text{ABE}) &= g(\text{ABE}, \text{ABEG}) &= 4\\
V(\text{ACF}) &= g(\text{ACF}, \text{ACFG}) &= 1\\
V(\text{ADF}) &= g(\text{ADF}, \text{ADFG}) &= 1\\
\end{array}
$$

$$
\begin{array}{llll}
V(\text{ABD}) &= \min \left[ g(\text{ABD}, \text{ABDG}), g(\text{ABD}, \text{ABDF}) + V(\text{ABDF}) \right]
&= \min \left[ 8, 5 + 1 \right] &= 6
\\
V(\text{AB}) &= \min \left[ g(\text{AB}, \text{ABD}) + V(\text{ABD}), g(\text{AB}, \text{ABE}) + V(\text{ABE}) \right]
&= \min \left[ 9 + 6, 1 + 4 \right] &= 5
\\
V(\text{AC}) &= g(\text{AC}, \text{ACF}) + V(\text{ACF}) &= 2 + 1 &= 3
\\
V(\text{AD}) &= \min \left[ g(\text{AD}, \text{ADF}) + V(\text{ADF}), g(\text{AD}, \text{ADG}) \right]
&= \min \left[ 5 + 1, 8 \right] &= 6
\end{array}
$$

$$
\begin{array}{llll}
V(\text{A}) &= \min \left[
g(\text{A}, \text{AB}) + V(\text{AB}), g(\text{A}, \text{AC}) + V(\text{AC}), g(\text{A}, \text{AD}) + V(\text{AD})
\right]
&= \min \left[ 1 + 5, 5 + 3, 3 + 6 \right] &= 6
\end{array}
$$

The shortest path is ABEG, with a total cost of 6.
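
The same computation can be sketched in a few lines of Python. The edge costs below are read off the worked equations above and are assumed to match the plotted graph; the sketch works on the original nodes A to G rather than on the expanded graph of all paths, and does not use the `training_ml_control` helpers.

```python
from functools import lru_cache

# Edge costs taken from the worked equations (assumption: they match the plot).
EDGES = {
    "A": {"B": 1, "C": 5, "D": 3},
    "B": {"D": 9, "E": 1},
    "C": {"F": 2},
    "D": {"F": 5, "G": 8},
    "E": {"G": 4},
    "F": {"G": 1},
    "G": {},
}
GOAL = "G"


@lru_cache(maxsize=None)
def cost_to_go(node):
    """Optimal cost from `node` to the goal, computed recursively from the tail."""
    if node == GOAL:
        return 0
    return min(cost + cost_to_go(nxt) for nxt, cost in EDGES[node].items())


def shortest_path(start):
    """Follow, at each node, the edge that minimizes edge cost plus cost-to-go."""
    path = [start]
    while path[-1] != GOAL:
        node = path[-1]
        nxt = min(EDGES[node], key=lambda n: EDGES[node][n] + cost_to_go(n))
        path.append(nxt)
    return path


print(cost_to_go("A"), "".join(shortest_path("A")))  # prints: 6 ABEG
```

Because the cost-to-go of a node does not depend on how it was reached, `cost_to_go("D")` is computed once here, whereas the hand calculation evaluates it twice, as $V(\text{ABD})$ and $V(\text{AD})$.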

In [None]:
plot_all_paths_graph(G, show_solution=True)

### Value Iteration

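Another way to compute the optimal cost-to-go for all states, which is also applicable to stochastic problems,
is the **Value Iteration** algorithm. The pseudocode below is written in the reward-maximization form used for Markov decision processes (MDPs); for cost minimization, replace $\max$ by $\min$ and rewards by costs. The threshold $\theta > 0$ is a convergence tolerance: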

$$
\begin{array}{l}
  \textbf{Input}:\ \text{MDP}\ M = \langle S, s_0, A, P_a(s' \mid s), r(s,a,s')\rangle\\
  \textbf{Output}:\ \text{Value function}\ V\\[2mm]
  \text{Set}\ V\ \text{to arbitrary value function; e.g., }\ V(s) = 0\ \text{for all}\ s\\[2mm]
  \text{repeat}\ \\
  \quad\quad \Delta \leftarrow 0 \\
  \quad\quad \text{foreach}\ s \in S \\
  \quad\quad\quad\quad \underbrace{V'(s) \leftarrow \max_{a \in A(s)} \sum_{s' \in S}  P_a(s' \mid s)\ [r(s,a,s') + 
 \gamma\ V(s') ]}_{\text{Bellman equation}} \\
  \quad\quad\quad\quad \Delta \leftarrow \max(\Delta, |V'(s) - V(s)|) \\
  \quad\quad V \leftarrow V' \\
  \text{until}\ \Delta \leq \theta 
\end{array}
$$
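
A direct Python transcription of this pseudocode might look as follows. This is a minimal sketch, not the `value_iteration` function of `training_ml_control`: the function name, the callback-based interface and the four-state chain MDP are all assumptions made for illustration.

```python
def value_iteration_sketch(states, actions, transitions, gamma=0.9, theta=1e-8):
    """Value iteration for a finite MDP in reward-maximization form.

    actions(s) returns the actions available in s (empty for terminal states);
    transitions(s, a) returns a list of (probability, next_state, reward).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        V_new = {}
        for s in states:
            available = actions(s)
            if not available:  # terminal state: value stays at 0
                V_new[s] = 0.0
                continue
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in transitions(s, a))
                for a in available
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta <= theta:
            return V


# Hypothetical chain MDP: states 0..3, goal at 3, reward -1 per step.
# A move succeeds with probability 0.8; otherwise the agent stays in place.
GOAL = 3
STATES = [0, 1, 2, 3]


def actions(s):
    return [] if s == GOAL else ["left", "right"]


def transitions(s, a):
    target = min(GOAL, s + 1) if a == "right" else max(0, s - 1)
    return [(0.8, target, -1.0), (0.2, s, -1.0)]


V = value_iteration_sketch(STATES, actions, transitions)
print({s: round(v, 4) for s, v in V.items()})
```

For the state next to the goal, the Bellman equation reduces to $V(2) = -1 + 0.2 \gamma V(2)$, so the computed value can be checked against $-1 / (1 - 0.18)$.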

## Optimal Control as Graph Search

We can formulate optimal control as a graph search by either considering a system with discrete states and actions or by discretizing a system with continuous states and actions.
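
As a minimal sketch of the second option, the snippet below discretizes a hypothetical scalar system $x_{k+1} = x_k + u_k \, \Delta t$ with stage cost $x^2 + u^2$ into a graph. The grids, the step size and the cost are assumptions chosen so that every transition lands exactly on a grid point.

```python
# Hypothetical scalar system x_{k+1} = x_k + u_k * DT with stage cost x^2 + u^2.
DT = 1
X_GRID = [-2, -1, 0, 1, 2]
U_GRID = [-1, 0, 1]


def build_graph(x_grid, u_grid):
    """Return {x: {x_next: (u, cost)}}, keeping transitions that stay on the grid."""
    nodes = set(x_grid)
    graph = {x: {} for x in x_grid}
    for x in x_grid:
        for u in u_grid:
            x_next = x + u * DT
            if x_next in nodes:  # drop transitions that leave the grid
                graph[x][x_next] = (u, x**2 + u**2)
    return graph


graph = build_graph(X_GRID, U_GRID)
n_edges = sum(len(neighbors) for neighbors in graph.values())
print(n_edges, graph[2])
```

For a system whose next state does not fall on the grid, each transition would additionally have to be snapped to the nearest grid point, which introduces a discretization error that shrinks as the grid is refined.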

````{exercise-start} Grid World
:label: grid-world
````

:::{figure} _static/images/20_constrained_motion.png
:width: 60%
:::

In [None]:
%%html
<iframe width="800" height="600" src="https://www.youtube-nocookie.com/embed/p178eQpDI_E?si=7wzD4d1TIVj29WG0&amp;start=4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

In [None]:
env = create_grid_world_environment(render_mode="rgb_array", max_steps=50)
result = simulate_environment(env)
show_video(result.frames, fps=3)

The task can be represented as the following undirected graph:

In [None]:
env.reset()
G = env.unwrapped.get_graph()
plot_grid_graph(G, show_start_to_target_paths=False)

:::{exercise}
How many possible paths from start position to target position are there?
:::

In [None]:
plot_grid_graph(G, show_start_to_target_paths=True)

We wish to travel to the goal cell in green. If the cost represents time then we want to find the shortest path to the goal.

- Arrows (edges) indicate the possible movements.
- Numbers on edges indicate the cost of moving along an edge.

Use Dynamic Programming to solve this problem:

- Compute the optimal cost-to-go for each state.
- Determine the optimal plan using the computed optimal cost-to-go.
- Implement the plan in the environment.

:::{tip} Hint 1
:class: dropdown
Determine all possible paths first.

You can use `plot_grid_all_paths_graph(G)`.
:::

:::{tip} Hint 2
:class: dropdown
Compute the optimal cost-to-go at each node.

You can use `dict(G.nodes(data=True))` to get a dictionary that maps the nodes to their attributes,
and `G.start_node` and `G.target_node` to access the start and target (i.e. goal) nodes, respectively.
:::

````{exercise-end}
````

````{solution-start} grid-world
````

In [None]:
# Your Solution Here

:::{solution} grid-world
:class: dropdown

For this solution we first need to import some functions:

```{code-cell}
from training_ml_control.environments import (
    value_iteration,
    compute_best_path_and_actions_from_values,
)
```

After that, to plot all paths from start to end we use:

```{code-cell} python3
plot_grid_all_paths_graph(G)
```

To compute the optimal cost-to-go we use:

```{code-cell}
values = value_iteration(G)
```

Once that's computed, we can determine the best path and corresponding actions:

```{code-cell}
best_path, actions = compute_best_path_and_actions_from_values(G, start_node=G.start_node, target_node=G.target_node, values=values)
print(best_path)
```

To plot the shortest path on the graph of all paths we use:

```{code-cell} python3
plot_grid_all_paths_graph(G, show_solution=True)
```

:::

## Continuous Optimal Control

### Hamilton-Jacobi-Bellman Equation

Let's consider the continuous-time optimal control problem with finite horizon over the time period $[t_0 ,t_f]$.

The system's dynamics are given by:

$$
\dot{\mathbf{x}}(t) = f(\mathbf{x}(t), \mathbf{u}(t))
$$

The cost function consists of a terminal cost and the integral of a running cost:

$$
J(\mathbf{x}(t), \mathbf{u}(t), t_0, t_f) = c(\mathbf{x}(t_f), t_f) + \int\limits_{t_0}^{t_f} c(\mathbf{x}(\tau), \mathbf{u}(\tau)) d\tau
$$

The optimal cost-to-go function is given by:

$$
\displaystyle V(\mathbf{x}(t), t_0, t_f) = \min_{\mathbf{u}(t)} \left[ J(\mathbf{x}(t), \mathbf{u}(t), t_0, t_f) \right]
$$

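It can be shown that the principle of optimality also holds in this case. For any intermediate time $t \in [t_0, t_f]$:

$$
V(\mathbf{x}(t_0), t_0, t_f) = \min_{\mathbf{u}(t)} \left[ \int\limits_{t_0}^{t} c(\mathbf{x}(\tau), \mathbf{u}(\tau)) d\tau + V(\mathbf{x}(t), t, t_f) \right]
$$

$$
V(\mathbf{x}(t_0), t_0, t_f) = V(\mathbf{x}(t_0), t_0, t) + V(\mathbf{x}(t), t, t_f)
$$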

$$
\frac{d}{dt} V(\mathbf{x}(t), t, t_f) = \frac{\partial V}{\partial t} + \left( \frac{\partial V}{\partial \mathbf{x}} \right)^T \underbrace{\frac{d\mathbf{x}}{dt}}_{= f(\mathbf{x}(t), \mathbf{u}(t))}
$$

The term on the left can be simplified. The cost-to-go from time $t$ only accumulates cost over $[t, t_f]$, and the terminal cost does not depend on $t$, so:

$$
\begin{aligned}
\frac{d}{dt} V(\mathbf{x}(t), t, t_f) &= \min_{\mathbf{u}(t)} \frac{d}{dt} \left[ c(\mathbf{x}(t_f), t_f) + \int\limits_{t}^{t_f} c(\mathbf{x}(\tau), \mathbf{u}(\tau)) d\tau \right] \\
&= \min_{\mathbf{u}(t)} \left[ \frac{d}{dt} \int\limits_{t}^{t_f} c(\mathbf{x}(\tau), \mathbf{u}(\tau)) d\tau \right] \\
&= \min_{\mathbf{u}(t)} \left[ -c(\mathbf{x}(t), \mathbf{u}(t)) \right]
\end{aligned}
$$

Substituting this expression into the previous one and rearranging terms, we get:

$$
- \frac{\partial V}{\partial t} = \min_{\mathbf{u}(t)} \left[ \left( \frac{\partial V}{\partial \mathbf{x}} \right)^T f(\mathbf{x}(t), \mathbf{u}(t)) + c(\mathbf{x}(t), \mathbf{u}(t)) \right]
$$

This is called the Hamilton-Jacobi-Bellman (HJB) equation.
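
The HJB equation can be checked numerically on a simple case. The example below is hypothetical and not part of the original material: a scalar system $\dot{x} = u$ with running cost $x^2 + u^2$ and zero terminal cost, for which the candidate $V(x, t) = x^2 \tanh(t_f - t)$ is assumed and then verified by substitution, using finite differences for the partial derivatives.

```python
import math

T_F = 2.0  # final time t_f (assumed value)


def V(x, t):
    """Candidate cost-to-go for x' = u, running cost x^2 + u^2, no terminal cost."""
    return x * x * math.tanh(T_F - t)


def hjb_residual(x, t, h=1e-5):
    """Return (-dV/dt - min_u [dV/dx * u + x^2 + u^2], minimizing u)."""
    dV_dt = (V(x, t + h) - V(x, t - h)) / (2 * h)
    dV_dx = (V(x + h, t) - V(x - h, t)) / (2 * h)
    # The bracket is a convex quadratic in u, minimized at u = -dV_dx / 2.
    u_star = -dV_dx / 2
    rhs = dV_dx * u_star + x * x + u_star * u_star
    return -dV_dt - rhs, u_star


for x in (-1.5, 0.3, 2.0):
    for t in (0.0, 0.7, 1.9):
        residual, u_star = hjb_residual(x, t)
        print(f"x={x:5.2f} t={t:4.2f} residual={residual: .2e} u*={u_star: .4f}")
```

The residual stays at the level of the finite-difference error, and the minimizing control $u^* = -x \tanh(t_f - t)$ is a state feedback law whose gain decays to zero as $t$ approaches $t_f$.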

### Direct Single Shooting

### Direct Multiple Shooting

### Collocation