# Solving the Inverted Pendulum with SVP

This notebook shows how to use the proposed structured value-based planning (SVP) approach to generate the state-action $Q$-value function for the classic inverted pendulum problem. The correctness of the solution is verified by trajectory simulations.

## Problem definition

The goal is to balance the inverted pendulum at its upright equilibrium position.
The state of the system is described by the angle and the angular speed, i.e., $(\theta, \dot{\theta})$. Let $\tau$ denote the time interval between decisions and $u$ the torque applied to the pendulum. The dynamics can then be written as

\begin{align}
    & \theta := \theta + \dot{\theta}~\tau,\\
    & \dot{\theta} := \dot{\theta} + \left( \sin{\theta} - \dot{\theta} + u \right)\tau.
\end{align}
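
The update above can be sketched as a small Julia function. This is an illustration only: `pendulum_step` is a hypothetical helper, not the `transition` function of the notebook's `InvertedPendulum` module, and it assumes both equations are evaluated from the current state (the angle used in $\sin\theta$ is the one before the update). It does not wrap the angle or clip the speed to the state space.

```julia
# Hypothetical helper (not from the InvertedPendulum module).
# Assumption: both updates read the current (theta, theta_dot).
function pendulum_step(theta, theta_dot, u; tau = 0.3)
    theta_next = theta + theta_dot * tau
    theta_dot_next = theta_dot + (sin(theta) - theta_dot + u) * tau
    return theta_next, theta_dot_next
end

# Upright and at rest with zero torque is a fixed point of the update.
pendulum_step(0.0, 0.0, 0.0)   # (0.0, 0.0)
```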

The reward function penalizes control effort while favoring an upright pendulum:

\begin{equation}
    r(\theta,u) = - 0.1u^2 + \exp{\left(\cos{\theta}-1 \right)}.
\end{equation}
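
As a quick check of the formula, the reward can be written as a one-line Julia function. The name `pendulum_reward` is hypothetical and is not the `reward` function passed to `MDP` later in this notebook.

```julia
# Hypothetical helper mirroring r(theta, u) = -0.1 u^2 + exp(cos(theta) - 1).
pendulum_reward(theta, u) = -0.1 * u^2 + exp(cos(theta) - 1)

pendulum_reward(0.0, 0.0)   # 1.0: upright with no torque gives the maximum reward
```

The reward is largest at $\theta = 0$ with $u = 0$ and decreases as the pendulum moves away from upright or as more torque is applied.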

In the simulation, the state space is $(-\pi, \pi]$ for $\theta$ and $[-10,10]$ for $\dot{\theta}$. We limit the input torque to $[-1,1]$ and set $\tau=0.3$.
We discretize each dimension of the state space into 50 values and the action space into 1000 values, so the $Q$-value function is a matrix of dimension $2500\times 1000$.
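
The following sketch shows one way such a discretization could be laid out. The grids are assumptions for illustration (uniform spacing, with the left endpoint $-\pi$ dropped so the angles lie in $(-\pi, \pi]$); the notebook's own `state_space()` and `action_space()` may be built differently.

```julia
# Hypothetical uniform grids; not taken from the InvertedPendulum module.
n_per_dim, n_actions = 50, 1000

# 51 evenly spaced angles on [-pi, pi], minus the first, gives 50 points in (-pi, pi].
thetas = range(-Float64(pi), stop = Float64(pi), length = n_per_dim + 1)[2:end]
theta_dots = range(-10.0, stop = 10.0, length = n_per_dim)
torques = range(-1.0, stop = 1.0, length = n_actions)

# Each state is an (angle, speed) pair; rows of Q index states, columns index actions.
states = vec([(th, thd) for th in thetas, thd in theta_dots])
Q = zeros(length(states), length(torques))

size(Q)   # (2500, 1000)
```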

## Structured Value-based Planning (SVP)

The proposed structured value-based planning (SVP) approach is based on $Q$-value iteration. At the $t$-th iteration, instead of making a full pass over all state-action pairs, SVP proceeds as follows:
- SVP first randomly selects a subset $\Omega$ of the state-action pairs. In particular, each state-action pair in $\mathcal{S}\times\mathcal{A}$ is observed (i.e., included in $\Omega$) independently with probability $p$. 
- For each selected $(s,a)$, the intermediate $\hat{Q}(s,a)$ is computed based on the $Q$-value iteration: 
    \begin{equation*}
    \hat{Q}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \left( r(s,a) + \gamma \max_{a'} Q^{(t)}(s',a') \right),\quad\forall\:(s,a)\in\Omega.
    \end{equation*}
- The current iteration then ends by reconstructing the full $Q$ matrix with matrix estimation, from the set of observations in $\Omega$. That is, $Q^{(t+1)}=\textrm{ME}\big(\{\hat{Q}(s,a)\}_{(s,a)\in\Omega}\big).$
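
The three steps above can be sketched for a small toy problem. This is a minimal illustration under stated assumptions, not the notebook's `value_iteration`: the MDP is deterministic (so the sum over $s'$ collapses to a single successor), all names are hypothetical, and the matrix estimation step is replaced by a simple stand-in, a rank-`r` truncated SVD of the zero-filled observations rescaled by the observation rate.

```julia
using LinearAlgebra, Random

# Stand-in for ME (assumption): rank-r truncated SVD of the zero-filled,
# rescaled observations. The notebook's ME routine may be different.
function me_truncated_svd(Qobs, mask, r)
    p_hat = count(mask) / length(mask)          # empirical observation rate
    F = svd((Qobs .* mask) ./ p_hat)
    return F.U[:, 1:r] * Diagonal(F.S[1:r]) * F.Vt[1:r, :]
end

# One SVP iteration for a deterministic toy MDP.
# R[s, a] is the reward; next_state[s, a] is the index of the successor state.
function svp_iteration(Q, R, next_state, discount, p, r; rng = Random.default_rng())
    mask = rand(rng, size(Q)...) .< p           # each (s, a) observed with probability p
    V = vec(maximum(Q, dims = 2))               # max over a' of Q(s', a') for every s'
    Qhat = zeros(size(Q))
    for idx in findall(mask)                    # Bellman update on observed pairs only
        Qhat[idx] = R[idx] + discount * V[next_state[idx]]
    end
    return me_truncated_svd(Qhat, mask, r)      # reconstruct the full Q matrix
end

# Toy usage with random rewards and transitions.
rng = MersenneTwister(0)
nS, nA = 20, 10
R = rand(rng, nS, nA)
next_state = rand(rng, 1:nS, nS, nA)
Q_next = svp_iteration(zeros(nS, nA), R, next_state, 0.9, 0.4, 3; rng = rng)
size(Q_next)   # (20, 10)
```

Only the entries selected by `mask` go through the Bellman update; every other entry of the returned matrix comes from the low-rank reconstruction.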

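Overall, each iteration reduces the cost of the $Q$-value updates by a fraction of roughly $1-p$, since only about a fraction $p$ of the state-action pairs are evaluated.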
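
As a rough worked example, take the $2500\times 1000$ matrix from the problem definition and assume $p = 0.4$, matching the "40% observed" runs later in this notebook. The expected number of updated entries per iteration is then

\begin{equation*}
    \mathbb{E}\,|\Omega| = p\,|\mathcal{S}|\,|\mathcal{A}| = 0.4 \times 2500 \times 1000 = 10^{6},
\end{equation*}

compared with $2.5\times 10^{6}$ entries for a full pass, i.e., about $60\%$ fewer $Q$-value updates. This count does not include the cost of the matrix estimation step.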

Through SVP, we can compute the final state-action $Q$-value function.
To obtain the optimal policy for state $s$, we compute

\begin{align*}
    \pi^{\star} \left(s\right) = \operatorname*{argmax}_{a \in \mathcal{A}} Q^{\star}\left(s, a\right).
\end{align*}
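
For a $Q$ function stored as a matrix with one row per state and one column per action, the argmax above amounts to a row-wise lookup. The helper below is a hypothetical sketch and is not how the notebook's `policy` object is necessarily represented.

```julia
# Hypothetical helper: index of the greedy action for every state (row) of Q.
greedy_policy(Q) = [argmax(view(Q, s, :)) for s in axes(Q, 1)]

Q_example = [0.1 0.5 0.2;
             0.9 0.3 0.4]
greedy_policy(Q_example)   # [2, 1]
```

The returned values are column indices; mapping them to torques requires the action grid used to build the matrix.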

### Generate state-action value function with SVP

In [None]:
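# Make the local modules in this directory loadable.
push!(LOAD_PATH, ".")
using MDPs, InvertedPendulum, Printf, LinearAlgebra
# Build the inverted pendulum MDP from the module's spaces, dynamics, and reward.
mdp = MDP(state_space(), action_space(), transition, reward)
__init__()
# Compute the policy; the file name suggests the run with 40% of entries observed.
policy = value_iteration(mdp, true, "../data/qip_otf_0.4.csv", true)
print("")  # suppress output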

### Visualize policy as a heat map

In [None]:
viz_policy(mdp, policy, "SVP policy (40% observed)", true, "ip/policy_ip_0.4.tex")

### Verify correctness
Simulate and visualize a trajectory from an initial state `[angle, speed]`.

In [None]:
ss, as = simulate(mdp, policy, [-0.5, 0.0])
viz_trajectory(ss, as, "SVP policy trajectory (40% observed)", "SVP policy input (40% observed)", true, "ip/traj_ip_0.4.tex")


In [None]:
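# Estimate the average deviation over random initial states.
nsim = 2000
deviation = 0.0
for sim = 1:nsim
    # Sample an initial state from a fine grid over the state space.
    state = [rand(PMIN:0.001 * (PMAX - PMIN):PMAX),
             rand(VMIN:0.001 * (VMAX - VMIN):VMAX)]
    traj, _ = simulate(mdp, policy, copy(state))
    # Norm of the first state component (the angle, given the `[angle, speed]`
    # ordering) from step 51 onward. `norm` replaces `vecnorm`, removed in Julia 1.0.
    deviation += norm(traj[51:end, 1])
end # for sim
deviation /= nsim
@printf("average deviation: %.3f\n", deviation)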