# Solving Mountain Car with SVP

This notebook shows how to use the proposed structured value-based planning (SVP) approach to generate the state-action $Q$-value function for the classic mountain car problem. The correctness of the solution is verified by trajectory simulations.

## Problem Definition

In this problem, an under-powered car aims to drive up a steep hill. The state of the system consists of the position and the velocity of the car, i.e., $\left(x, \dot{x}\right)$. Denoting by $u$ the acceleration input applied to the car, the dynamics can be written as

\begin{align}
    & x := x + \dot{x},\\
    & \dot{x} := \dot{x} - 0.0025\cos{(3x)} + 0.001u.
\end{align}
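
As a minimal sketch, the update above can be written as a single Julia function. The name `mc_step` is hypothetical and is not the `transition` function exported by the `MountainCar` module. The sketch also assumes that both lines of the update read the current state (a simultaneous update) and omits any clamping to the state bounds.

```julia
# Hypothetical helper for illustration; not the notebook's `transition` function.
# Assumption: both updates use the *current* (x, v), i.e., a simultaneous update.
function mc_step(x, v, u)
    x_next = x + v                                  # x := x + ẋ
    v_next = v - 0.0025 * cos(3 * x) + 0.001 * u    # ẋ := ẋ - 0.0025 cos(3x) + 0.001 u
    return x_next, v_next
end

# Usage: start at rest at x = -0.5 and apply the maximum input u = 1.
x1, v1 = mc_step(-0.5, 0.0, 1.0)
```

Since the car starts at rest, the position is unchanged after one step and only the velocity picks up the gravity and input terms.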

The reward function is defined to encourage the car to get onto the top of the mountain at $x_0=0.5$:

\begin{equation}
r(x) = \left\{
\begin{aligned}
10, & \qquad x\ge x_0,\\
-1, & \qquad \text{otherwise}.
\end{aligned}
\right.
\end{equation}
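
For reference, the piecewise reward translates directly into code. This is only an illustrative sketch: `X_GOAL` and `mc_reward` are hypothetical names, not the `reward` function used by the notebook.

```julia
# Hypothetical names for illustration; the notebook uses its own `reward` function.
const X_GOAL = 0.5                              # x_0 in the reward definition above

# +10 once the car is at or beyond the goal position, -1 on every other step.
mc_reward(x) = x >= X_GOAL ? 10.0 : -1.0
```

The constant penalty of $-1$ per step is what ties the return to the time needed to reach the top.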

We follow the standard settings and restrict the state space to $\left[-1.2, 0.6 \right]$ for $x$ and $\left[-0.07, 0.07 \right]$ for $\dot{x}$, with the input limited to $u\in[-1,1]$. The state space is discretized into 2500 values and the action space into 1000 values. The evaluation metric is the total time it takes to reach the top of the mountain from an initial state drawn uniformly at random.
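
The sketch below illustrates one way such a discretization could be set up. It assumes uniform grids and, purely as an assumption, arranges the 2500 states as a $50 \times 50$ (position $\times$ velocity) grid; the actual discretization lives in the `MountainCar` module and may differ. The helpers `nearest_index` and `state_index` are hypothetical.

```julia
# Assumption: uniform grids. The notebook's own discretization may differ.
actions = range(-1.0, stop=1.0, length=1000)    # u ∈ [-1, 1], 1000 values

# Hypothetical helper: index of the grid point nearest to a continuous value,
# clamped so that out-of-range values map to the first or last grid point.
nearest_index(grid, val) =
    clamp(round(Int, (val - first(grid)) / step(grid)) + 1, 1, length(grid))

# Assumption: 2500 states laid out as 50 position bins × 50 velocity bins.
n_x, n_v = 50, 50
state_index(ix, iv) = (ix - 1) * n_v + iv       # flatten (ix, iv) into 1..2500
```

With this layout the $Q$-value function becomes a $2500 \times 1000$ matrix, with one row per state and one column per action.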

## Structured Value-based Planning (SVP)

The proposed structured value-based planning (SVP) approach is based on $Q$-value iteration. At the $t$-th iteration, instead of performing a full pass over all state-action pairs:
- SVP first randomly selects a subset $\Omega$ of the state-action pairs. In particular, each state-action pair in $\mathcal{S}\times\mathcal{A}$ is observed (i.e., included in $\Omega$) independently with probability $p$. 
- For each selected $(s,a)$, the intermediate $\hat{Q}(s,a)$ is computed based on the $Q$-value iteration: 
    \begin{equation*}
    \hat{Q}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \left( r(s,a) + \gamma \max_{a'} Q^{(t)}(s',a') \right),\quad\forall\:(s,a)\in\Omega.
    \end{equation*}
- The current iteration then ends by reconstructing the full $Q$ matrix with matrix estimation, from the set of observations in $\Omega$. That is, $Q^{(t+1)}=\textrm{ME}\big(\{\hat{Q}(s,a)\}_{(s,a)\in\Omega}\big).$

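Overall, each iteration performs the $Q$-value backup for only a fraction $p$ of the state-action pairs, so the backup cost is reduced by roughly a fraction $1-p$ (for instance, about 60% fewer backups when $p=0.4$), at the price of the additional matrix estimation step.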
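
The three steps can be sketched on a small tabular MDP as follows. This is an illustration under stated assumptions, not the implementation behind `value_iteration` in this notebook: the matrix estimation step is replaced by a simple iterative rank-$k$ SVD imputation, and the function names, the toy MDP, and the choice $k=3$ are all hypothetical. No accuracy claim is made for the random toy problem.

```julia
using LinearAlgebra, Random

# Stand-in matrix estimation (an assumption, not necessarily the notebook's ME routine):
# iterative rank-k SVD imputation that keeps the observed entries fixed.
function complete_low_rank(Qhat, mask; k=3, iters=30)
    X = copy(Qhat)
    X[.!mask] .= sum(Qhat[mask]) / max(count(mask), 1)       # start missing entries at the observed mean
    for _ in 1:iters
        F = svd(X)
        r = min(k, length(F.S))
        L = F.U[:, 1:r] * Diagonal(F.S[1:r]) * F.Vt[1:r, :]  # best rank-r approximation of X
        X[.!mask] .= L[.!mask]                               # refill only the unobserved entries
    end
    return X
end

# One SVP iteration. P[s2, s, a] = P(s2 | s, a) and R[s, a] = r(s, a).
function svp_iteration(rng, Q, P, R, γ, p; k=3)
    nS, nA = size(Q)
    mask = rand(rng, nS, nA) .< p              # each (s, a) is observed independently with probability p
    V = vec(maximum(Q, dims=2))                # max over a' of Q(s', a') for every s'
    Qhat = zeros(nS, nA)
    for a in 1:nA, s in 1:nS
        if mask[s, a]
            # Backup; r(s, a) leaves the sum because P(. | s, a) sums to one.
            Qhat[s, a] = R[s, a] + γ * dot(view(P, :, s, a), V)
        end
    end
    return complete_low_rank(Qhat, mask; k=k)  # reconstruct the full Q matrix
end

# Repeat the iteration a fixed number of times, starting from Q = 0.
function run_svp(rng, P, R, γ, p; iters=50, k=3)
    Q = zeros(size(R))
    for _ in 1:iters
        Q = svp_iteration(rng, Q, P, R, γ, p; k=k)
    end
    return Q
end

# Toy usage on a random MDP (hypothetical sizes).
rng = MersenneTwister(0)
nS, nA, γ, p = 30, 8, 0.9, 0.4
P = rand(rng, nS, nS, nA)
P ./= sum(P, dims=1)                           # normalize over next states s2
R = rand(rng, nS, nA)
Q = run_svp(rng, P, R, γ, p)
```

Setting `p = 1.0` observes every pair, in which case the imputation has nothing to fill in and one call to `svp_iteration` reduces to a plain $Q$-value iteration step.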

Running SVP yields the final state-action $Q$-value function, which we use in place of $Q^{\star}$.
The corresponding policy for state $s$ is then obtained greedily:

\begin{align*}
    \pi^{\star} \left(s\right) = \mbox{argmax}_{a \in \mathcal{A}} Q^{\star}\left(s, a\right).
\end{align*}
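
For a tabular $Q$ stored as a matrix with one row per state and one column per action, the greedy policy is a row-wise argmax. The helper name `greedy_policy` and the small `Q_demo` matrix below are hypothetical and only serve as an illustration.

```julia
# Hypothetical helper: for each state (row), return the index of the best action (column).
greedy_policy(Q) = [argmax(view(Q, s, :)) for s in 1:size(Q, 1)]

# Toy example with 2 states and 3 actions.
Q_demo = [1.0 3.0 2.0;
          0.5 0.1 0.4]
pi_demo = greedy_policy(Q_demo)   # action index chosen in each state
```

The returned values are action indices; mapping them back to continuous inputs $u$ depends on how the action space was discretized.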

### Generate state-action value function with SVP

In [None]:
push!(LOAD_PATH, ".")  # make the modules in the current directory loadable
using MDPs, MountainCar
mdp = MDP(state_space(), action_space(), transition, reward)
__init__()
policy = value_iteration(mdp, true, "../data/qmc_otf_0.4.csv", true)
print("")  # suppress output

### Visualize policy as a heat map

In [5]:
viz_policy(mdp, policy, "SVP policy (40% observed)", true, "mc/policy_mc_0.4.tex")


### Verify correctness
Simulate and visualize a trajectory from a given initial state `[position, speed]`, then estimate the average time to reach the goal over many random initial states.

In [6]:
ss, as = simulate(mdp, policy, [-0.5, 0.0])
viz_trajectory(ss, as, "SVP policy trajectory (40% observed)", "SVP policy input (40% observed)", true, "mc/traj_mc_0.4.tex")

In [7]:
using Printf  # provides @printf

nsim = 1000
times = 0
for sim = 1:nsim
    # draw the initial state uniformly from a fine grid (step of 0.1% of each range)
    state = [rand(XMIN:0.001 * (XMAX - XMIN):XMAX),
             rand(VMIN:0.001 * (VMAX - VMIN):VMAX)]
    _, as = simulate(mdp, policy, copy(state))
    times += length(as)  # number of inputs applied in this run
end # for sim
times /= nsim
@printf("average times: %.3f", times)

average times: 58.424
