# Value Iteration for Minimum Time Control

In [3]:
import numpy as np
from IPython.display import HTML
from pydrake.all import (
    DiagramBuilder,
    DynamicProgrammingOptions,
    FittedValueIteration,
    LinearSystem,
    LogVectorOutput,
    Simulator,
    VectorSystem,
)

from underactuated.exercises.dp.minimum_time_utils import (
    create_animation,
    simulate_and_plot,
)

## Problem Description
In this problem you will analyze the performance of the value-iteration algorithm on the minimum-time problem for the double integrator.
Don't worry, the value iteration algorithm is provided by Drake, and you won't have to code it!
You will be asked to analyze the policy it produces and understand the algorithmic reasons behind the poor performance of the closed-loop system.
Then you will implement the closed-form controller we have studied in class on your own, and compare it with the one obtained numerically.

**These are the main steps of the notebook (items you need to complete are marked as "TODO"):**
1. Construct the double integrator system.
2. Define the objective function for the minimum time problem (TODO).
3. Run the value-iteration algorithm.
4. Animate the intermediate steps of the algorithm.
5. Simulate the double integrator in closed loop with the controller from the value iteration.
6. Write down a controller that implements the closed form solution, and test it (TODO).

## Dynamics of the Double Integrator
We start by writing a function that returns the double-integrator system.
We write the dynamics in linear state-space form
$$\dot{\mathbf{x}} = A \mathbf{x} + B u,$$
where $\mathbf{x} = [q, \dot{q}]^T$.
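
As a quick sanity check of these dynamics, the sketch below integrates $\ddot{q} = u$ with a forward-Euler scheme in plain Python, without Drake. The helper name `double_integrator_step` and the step size `dt` are illustrative assumptions and are not part of the notebook's code.

```python
def double_integrator_step(q, qdot, u, dt):
    """One forward-Euler step of x_dot = A x + B u for the double integrator."""
    # With A = [[0, 1], [0, 0]] and B = [[0], [1]] the dynamics reduce to
    # q_dot = qdot and qdot_dot = u.
    return q + dt * qdot, qdot + dt * u


# start at rest in the origin and apply u = 1 for one second
q, qdot = 0.0, 0.0
dt = 0.001
for _ in range(1000):
    q, qdot = double_integrator_step(q, qdot, 1.0, dt)
```

The exact solution for this input is $\dot{q}(t) = t$ and $q(t) = t^2/2$, so after one second the integrated state should be close to $q = 0.5$, $\dot{q} = 1$, up to the Euler discretization error.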

In [None]:
# we write a function since we will need to call
# this a handful of times


def get_double_integrator():
    A = np.array([[0, 1], [0, 0]])
    B = np.array([[0], [1]])
    C = np.eye(2)
    D = np.zeros((2, 1))
    return LinearSystem(A, B, C, D)

## Implementation of the Integrand of the Cost Function
Remember that the minimum-time objective can be written in integral form
$$\int_{0}^{\infty} \ell(\mathbf{x}) dt,$$
by defining
$$\ell(\mathbf{x}) = \begin{cases} 0 & \text{if} \quad \mathbf{x} =0,\\ 1 & \text{otherwise}. \end{cases}$$
(See also [the example from the textbook](https://underactuated.csail.mit.edu/dp.html#minimum_time_double_integrator).)
Implement the integrand of the cost function, $\ell(\mathbf{x})$, taking the context as an argument.

**Note**: To handle small numerical errors, check whether $\mathbf{x}=0$ approximately, using the `numpy` function `isclose` instead of `if x == 0`.
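
To see why an exact comparison is fragile, here is a small illustration of the same idea using the standard library's `math.isclose` (`numpy.isclose` accepts an `atol` argument in the same role). The tolerance `1e-8` is an arbitrary choice for this sketch, and the snippet is not the solution of the TODO below.

```python
import math

# a quantity that is zero on paper but not in floating-point arithmetic
residual = 0.1 + 0.2 - 0.3

# exact comparison: fails because of the rounding error
exactly_zero = residual == 0.0

# comparison up to an absolute tolerance: succeeds
nearly_zero = math.isclose(residual, 0.0, abs_tol=1e-8)
```

Here `exactly_zero` is `False` while `nearly_zero` is `True`, even though both tests ask the same question mathematically.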

In [None]:
def cost_function(context):
    # Modify here to get the correct state vector value from context.
    # Hint: Once you get a BasicVector in Drake, then call CopyToVector() to get a
    # numpy array.
    x = np.array([0.0, 0.0])
    return 0  # Modify here to compute the cost function

## Value Iteration Algorithm
The value iteration is implemented in the Drake function
`FittedValueIteration`. Take some time to have a look at [its
documentation](https://drake.mit.edu/doxygen_cxx/group__control.html#ga32d5768cb664f6d07fc58b4af536c45a),
and to go through the description of this algorithm in [the
textbook](https://underactuated.csail.mit.edu/dp.html#barycentric). Before
using it, we need to construct an appropriate discretization of the state and
input space.
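
To make the idea concrete before calling Drake, here is a toy value iteration for a minimum-time problem on a five-state chain. It is only a sketch of the basic Bellman update: it is not Drake's `FittedValueIteration`, it uses no interpolation between grid points, and the function name `value_iteration_chain` and its parameters are made up for illustration.

```python
def value_iteration_chain(n_states=5, goal=2, max_iter=100):
    """Minimum-time value iteration on a chain of integer states."""
    J = [0.0] * n_states  # initial guess for the cost to go
    actions = (-1, 0, 1)  # move left, stay, move right
    for _ in range(max_iter):
        J_new = []
        for s in range(n_states):
            # running cost: zero at the goal, one everywhere else
            cost = 0.0 if s == goal else 1.0
            # best cost to go among the reachable states
            # (moves are clamped to stay on the chain)
            best = min(J[min(max(s + a, 0), n_states - 1)] for a in actions)
            J_new.append(cost + best)
        if J_new == J:  # fixed point reached
            break
        J = J_new
    return J


J = value_iteration_chain()
```

At the fixed point, each entry of `J` is the number of steps needed to reach the goal state, which is the discrete analogue of the minimum time.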

**Important:** This code will work if you change the limits of the input to be
different from $u_{\text{min}} = -1$ and $u_{\text{max}} = 1$. However, be
aware that the closed-form solution we derived in class (and that you'll have
to implement at the end of this notebook) assumes exactly these limits! It's not hard to
generalize the closed-form solution to the case with generic bounds
$u_{\text{min}}$ and $u_{\text{max}}$. But if you don't want to do that, do not
change `mesh['u_lim']` below!

In [None]:
# discretization mesh of state space, input space,
# and time for the value-iteration algorithm
mesh = {}

# number of knot points in the grids
# odd to have a point in the origin
mesh["n_q"] = 31  # do not exceed ~51/101
mesh["n_qdot"] = 31  # do not exceed ~51/101
mesh["n_u"] = 11  # don't exceed ~11/21

# grid limits
mesh["q_lim"] = [-2.0, 2.0]
mesh["qdot_lim"] = [-2.0, 2.0]
mesh["u_lim"] = [-1.0, 1.0]  # do not change

# axis discretization
for s in ["q", "qdot", "u"]:
    mesh[f"{s}_grid"] = np.linspace(*mesh[f"{s}_lim"], mesh[f"n_{s}"])

    # important: ensure that a knot point is in the origin
    # otherwise there is no way the value iteration can converge
    assert 0.0 in mesh[f"{s}_grid"]

# time discretization in the value-iteration algorithm
mesh["timestep"] = 0.005

In the following cell we wrap Drake's `FittedValueIteration` function with a function we call `run_value_iteration`.
This returns the optimal value function, the optimal controller, and all the data we need for the upcoming animation.

In [None]:
def run_value_iteration(cost_function, mesh, max_iter=10000):
    # to create an animation, we store the values of
    # the cost to go and the optimal policy for each
    # iteration of the value-iteration algorithm
    J_grid = []
    pi_grid = []

    # callback from the value-iteration algorithm
    # that saves the intermediate values of J and pi
    # and that ensures we do not exceed max_iter
    # (iteration number i starts from 1)
    def callback(i, unused, J, pi):
        # check max iter is not exceeded
        if i > max_iter:
            raise RuntimeError(
                f"Value-iteration algorithm did not converge within {max_iter} iterations."
            )

        # store cost to go for iteration i
        # the 'F' order facilitates the plot phase
        J_grid.append(np.reshape(J, (mesh["n_q"], mesh["n_qdot"]), order="F"))
        pi_grid.append(np.reshape(pi, (mesh["n_q"], mesh["n_qdot"]), order="F"))

    # set up a simulation
    simulator = Simulator(get_double_integrator())

    # grids for the value-iteration algorithm
    state_grid = [set(mesh["q_grid"]), set(mesh["qdot_grid"])]
    input_grid = [set(mesh["u_grid"])]

    # add custom callback function as a visualization_callback
    options = DynamicProgrammingOptions()
    options.visualization_callback = callback

    # run value-iteration algorithm
    policy, cost_to_go = FittedValueIteration(
        simulator,
        cost_function,
        state_grid,
        input_grid,
        mesh["timestep"],
        options,
    )

    # recast J and pi from lists to 3d arrays
    J_grid = np.dstack(J_grid)
    pi_grid = np.dstack(pi_grid)

    return policy, cost_to_go, J_grid, pi_grid

## Animation of the Value-Iteration Algorithm
The animation of the value iteration is coded mainly using matplotlib. If you are interested, feel free to check the support function `create_animation` provided in [`minimum_time_utils.py`](https://github.com/RussTedrake/underactuated/blob/master/underactuated/exercises/dp/minimum_time_utils.py).
What it does can be summarized as follows:
- runs value iteration,
- initializes an empty 3D surface plot for the value function and the policy,
- creates the function `update_surf` that when called updates the surface plots from the previous point,
- creates a fancy animation by calling `update_surf` many times.

This animation is built only to visualize value iteration, so we keep the supporting functions in a separate file. We hope you appreciate the final result!

In [None]:
policy, cost_to_go, J_grid, pi_grid = run_value_iteration(cost_function, mesh)
animation = create_animation(J_grid, pi_grid, mesh)
HTML(animation.to_jshtml())

## Performance of the Value-Iteration Policy
Value iteration is an extremely powerful and very general algorithm.
However, its performance in solving "bang-bang" problems (i.e. problems where the control is always at the bounds) can be very poor.
In this section we simulate the double integrator in closed loop with the approximate optimal policy.
We'll see that things do not go exactly how we expect...

In [None]:
# function that simulates the double integrator
# starting from the state (q0, qdot0) for sim_time
# seconds in closed loop with the passed controller


def simulate(q0, qdot0, sim_time, controller):
    # initialize block diagram
    builder = DiagramBuilder()

    # add system and controller
    double_integrator = builder.AddSystem(get_double_integrator())
    controller = builder.AddSystem(controller)

    # wire system and controller
    builder.Connect(double_integrator.get_output_port(0), controller.get_input_port(0))
    builder.Connect(controller.get_output_port(0), double_integrator.get_input_port(0))

    # measure double-integrator state and input
    state_logger = LogVectorOutput(double_integrator.get_output_port(0), builder)
    input_logger = LogVectorOutput(controller.get_output_port(0), builder)

    # finalize block diagram
    diagram = builder.Build()

    # instantiate simulator
    simulator = Simulator(diagram)
    simulator.set_publish_every_time_step(False)  # makes sim faster

    # set initial conditions
    context = simulator.get_mutable_context()
    context.SetContinuousState([q0, qdot0])

    # run simulation
    simulator.AdvanceTo(sim_time)

    # unpack sim results
    q_sim, qdot_sim = state_logger.FindLog(context).data()
    u_sim = input_logger.FindLog(context).data().flatten()
    t_sim = state_logger.FindLog(context).sample_times()

    return q_sim, qdot_sim, u_sim, t_sim

To properly visualize the results of the simulator above, we need a bunch of helper functions. Since they are not directly relevant to the Drake simulation or the value-iteration algorithm, we included them in [`minimum_time_utils.py`](underactuated/exercises/dp/minimum_time_utils.py). Feel free to check the detailed implementation if you are interested.

We are finally ready to simulate and plot the trajectories of the double integrator controlled by the value-iteration policy.
Running the following cell you'll see two plots:
- The plot of the state-space trajectory of the double integrator superimposed on the level plot of the policy.
In the red regions the controller selects the input $u=1$ (full gas), in the blue regions it selects $u=-1$ (full brake). The areas in between approximate the quadratic boundaries we have seen in class, and are due to the discretization of the state space.
- The plot of the control force as a function of time.

Is this the optimal policy we expected to see?
Take your time to understand why these plots look so strange!
Does this get any better if you increase the number of knot points (finer discretization of $q$ and $\dot{q}$)?
If not, why?
(Questions not graded, do not submit.)

In [None]:
# initial state
q0 = -1.0
qdot0 = 0.0

# verify that the given initial state is inside the value-iteration grid
assert mesh["q_lim"][0] <= q0 <= mesh["q_lim"][1]
assert mesh["qdot_lim"][0] <= qdot0 <= mesh["qdot_lim"][1]

# duration of the simulation in seconds
sim_time = 5.0

# sim and plot
policy = run_value_iteration(cost_function, mesh)[0]
simulate_and_plot(q0, qdot0, sim_time, policy, mesh["u_lim"], simulate=simulate)

## Implementation of the Closed-Form Solution
Since value iteration didn't give us the results we wanted, in the next cell we ask you to implement [the closed-form solution we've derived in class](https://underactuated.csail.mit.edu/dp.html#minimum_time_double_integrator).
Note that in class we assumed the input to be bounded between $-1$ and $1$, so you can either do the math and generalize that result to generic bounds $u_{\text{min}} < 0$ and $u_{\text{max}} > 0$ (not hard), or double check that `mesh['u_lim']` is still set to `[-1., 1.]`.

**Note 1:**
To help you, we already partially filled the function.
In a small neighborhood of the origin we return $u = - \dot{q} - q$, even if the theoretical solution would say $u = 0$.
This gives the closed-loop dynamics $\ddot{q} = - q - \dot{q}$, which makes the origin a stable equilibrium.
This trick prevents the controller from chattering wildly between $u_{\text{max}}$ and $u_{\text{min}}$ because of small numerical errors.
Do not remove it.
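
If you want to convince yourself that this local controller is stabilizing, the sketch below integrates $\ddot{q} = -q - \dot{q}$ with forward Euler in plain Python, starting close to the origin. The function name `simulate_near_origin`, the time step, and the simulation length are assumptions made for this illustration only.

```python
import math


def simulate_near_origin(q, qdot, sim_time=20.0, dt=0.001):
    """Forward-Euler simulation of q_ddot = u with u = -q - qdot."""
    for _ in range(int(round(sim_time / dt))):
        u = -q - qdot  # stabilizing feedback used near the origin
        q, qdot = q + dt * qdot, qdot + dt * u
    return q, qdot


# start on the edge of a small neighborhood of the origin
q_end, qdot_end = simulate_near_origin(0.01, 0.0)
final_norm = math.hypot(q_end, qdot_end)
```

The closed-loop eigenvalues are $-\tfrac{1}{2} \pm \tfrac{\sqrt{3}}{2} i$, so the state norm decays roughly like $e^{-t/2}$ and `final_norm` ends up several orders of magnitude below the initial distance of $0.01$.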

**Note 2:**
To complete this function with [the control law from the textbook](https://underactuated.csail.mit.edu/dp.html#minimum_time_double_integrator)
you need to write two conditions on the state $[q, \dot{q}]^T$: one for the full-gas region and one for the full-brake region.
Notice that, for now, the function always returns $u = u_{\text{max}}$ if the state is not close to the origin.

In [None]:
def policy_closed_form(q, qdot, atol=1.0e-2):
    # system in a neighborhood of the origin
    # up to the absolute tolerance atol
    x_norm = np.linalg.norm([q, qdot])
    if np.isclose(x_norm, 0.0, atol=atol):
        # little trick, do not modify: use a stabilizing controller in the
        # neighborhood of the origin to prevent wild chattering
        return -q - qdot

    # full-brake region
    # check if the state of the system is
    # such that u must be set to -1
    elif False:  # modify here
        return mesh["u_lim"][0]

    # full-gas region
    # if all the others do not apply,
    # u must be set to 1
    else:  # modify here
        return mesh["u_lim"][1]

Now we just encapsulate the function you wrote in a Drake `VectorSystem` that can be sent to the simulator.
Do this state trajectory and this control signal look more reasonable than the ones from the value-iteration algorithm? (Question not graded, do not submit.)

In [None]:
# controller which implements the closed-form solution


class ClosedFormController(VectorSystem):
    # two inputs (system state)
    # one output (system input)
    def __init__(self):
        VectorSystem.__init__(self, 2, 1)

    # just evaluate the function above
    def DoCalcVectorOutput(self, context, x, controller_state, u):
        u[:] = policy_closed_form(*x)


# sim and plot
simulate_and_plot(
    q0,
    qdot0,
    sim_time,
    ClosedFormController(),
    mesh["u_lim"],
    simulate=simulate,
)

## Autograding
You can check your work by running the following cell:

In [None]:
from underactuated.exercises.dp.test_minimum_time import TestMinimumTime
from underactuated.exercises.grader import Grader

Grader.grade_output([TestMinimumTime], [locals()], "results.json")
Grader.print_test_results("results.json")
