In [2]:
import gym
import numpy as np
from pprint import pprint

import seaborn as sns
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm

sns.set('notebook', font_scale=1.1, rc={ 'figure.figsize': (6, 3) })
sns.set_style('ticks', rc={ 'figure.facecolor': 'none', 'axes.facecolor': 'none'})
%config InlineBackend.figure_format = 'svg'
matplotlib.rcParams['figure.facecolor'] = 'white'

### 5.7 Off-policy Monte Carlo Control

> We implement exercise 5.12: **Racetrack** (page 111 in Sutton & Barto)

---

Consider driving a race car around a turn like those shown in Figure 5.5. You want to go as fast as possible, but not so fast as to run off the track.
   - In our simplified racetrack, the car is at one of a discrete set of grid positions, the cells in the diagram. 
   - The velocity is also discrete, a number of grid cells moved horizontally and vertically per time step. 
   - The actions are increments to the velocity components. Each may be changed by +1, -1, or 0 in each step, for a total of nine (3 x 3) actions. 
   - Both velocity components are restricted to be nonnegative and less than 5, and they cannot both be zero except at the starting line. 
   - Each episode begins in one of the randomly selected start states with both velocity components zero and ends when the car crosses the finish line. 
   - The rewards are -1 for each step until the car crosses the finish line. If the car hits the track boundary, it is moved back to a random position on the starting line, both velocity components are reduced to zero, and the episode continues. 
   - Before updating the car’s location at each time step, check to see if the projected path of the car intersects the track boundary. If it intersects the finish line, the episode ends; if it intersects anywhere else, the car is considered to have hit the track boundary and is sent back to the starting line. 
   - To make the task more challenging, with probability 0.1 at each time step the velocity increments are both zero, independently of the intended increments. 

**Task**: Apply a Monte Carlo control method to this task to compute the optimal policy from each starting state. Exhibit several trajectories following the optimal policy (but turn the noise off for these trajectories).
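
The noise rule and the projected-path check described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `noisy_action`, `projected_path`, and `classify_move` are hypothetical, the track and finish line are passed in as sets of `(x, y)` cells, and the path is approximated by evenly spaced, rounded points on the straight line between the old and new position. The exercise itself does not prescribe how the intersection test is computed.

```python
import numpy as np


def noisy_action(action, rng, p_noise=0.1):
    """With probability p_noise, replace the intended increments by (0, 0)."""
    if rng.random() < p_noise:
        return (0, 0)
    return action


def projected_path(position, velocity):
    """Cells visited when moving from `position` by `velocity`, in order.

    Assumption: the path is a straight line sampled at max(|vx|, |vy|)
    evenly spaced points, each rounded to the nearest cell. The start
    cell itself is not included.
    """
    x, y = position
    vx, vy = velocity
    n = max(abs(vx), abs(vy))
    return [(x + round(vx * k / n), y + round(vy * k / n)) for k in range(1, n + 1)]


def classify_move(path, track_cells, finish_cells):
    """Return 'finish', 'crash', or 'ok' based on the first decisive cell."""
    for cell in path:
        if cell in finish_cells:
            return "finish"   # the episode ends
        if cell not in track_cells:
            return "crash"    # the car is sent back to the starting line
    return "ok"


# Tiny hypothetical track: three cells in a row, finish line right after them.
track = {(0, 0), (1, 0), (2, 0)}
finish = {(3, 0)}
rng = np.random.default_rng(0)

action = noisy_action((1, 0), rng)            # (0, 0) about 10% of the time
outcome = classify_move(projected_path((0, 0), (3, 0)), track, finish)
```

Checking the path cell by cell, rather than only the landing cell, keeps a fast car from jumping over a boundary or over the finish line.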

**Environment**
- An irregularly shaped grid (open questions: how to create it, and what size should it be?)

**State**
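- A tuple, $\mathbf{x} = (x, y)$, indicating the 2D position of the car
  - _note_: the velocity $\mathbf{v} = (v_x, v_y)$ also determines where the car moves next, so a complete state for the control problem pairs the position with the velocity.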
  
**Actions**
- Increments to the velocity of the car, $\mathbf{a} = (a_x, a_y)$, where the velocity is the number of cells moved horizontally and vertically per time step
   - possible actions: $\mathcal{A} = \{ (0, 0), (0, 1), (0, -1), (1, 0), (-1, 0), (1, 1), (-1, -1), (-1, 1), (1, -1) \}$
- At each step, the velocity is updated with the chosen action: $\mathbf{v} \leftarrow \mathbf{v} + \mathbf{a}$
  - _note_: each component of $\mathbf{v}$ is restricted to be nonnegative and less than 5, and the two components cannot both be zero except at the starting line.
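
As a minimal sketch of the action set and the velocity constraint above: the names `ACTIONS`, `MAX_SPEED`, and `update_velocity` are hypothetical, and leaving the velocity unchanged when an increment would violate the constraint is an assumption, since the exercise does not say how such an increment is handled.

```python
import itertools

# The nine actions: every combination of (-1, 0, +1) for the two components.
ACTIONS = list(itertools.product((-1, 0, 1), repeat=2))

MAX_SPEED = 4  # each component must stay in 0..4 (nonnegative and less than 5)


def update_velocity(velocity, action, at_start=False):
    """Return the velocity after applying `action`.

    Assumption: an increment that would break the constraints is ignored
    and the old velocity is kept.
    """
    vx, vy = velocity[0] + action[0], velocity[1] + action[1]
    in_range = 0 <= vx <= MAX_SPEED and 0 <= vy <= MAX_SPEED
    both_zero = vx == 0 and vy == 0
    # Zero velocity is only allowed while the car is on the starting line.
    if in_range and (at_start or not both_zero):
        return (vx, vy)
    return velocity


v = update_velocity((0, 0), (1, 0), at_start=True)   # accelerate off the start
v = update_velocity(v, (1, 1))                       # now (2, 1)
```

Other conventions are possible, for example removing illegal actions from the set offered to the agent in each state.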

**Reward**
- $\mathcal{R} = \{ -1 \}$
- Agent receives $r = -1$ for each step
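
With a reward of $-1$ per step, the return from any time step is simply minus the number of remaining steps, provided the task is treated as undiscounted. The sketch below assumes $\gamma = 1$ (the exercise text above does not state a discount factor), and `returns_from_rewards` is a hypothetical helper name.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Compute the return G_t for every step of one episode, working backwards."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = gamma * G + r
        returns.append(G)
    return returns[::-1]


# An episode that takes three steps to reach the finish line.
episode_returns = returns_from_rewards([-1, -1, -1])
```

Maximizing this return is therefore the same as reaching the finish line in as few steps as possible.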

In [32]:
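# Create environment
# Note: the transpose below swaps the axes, so the grid has shape
# (cols, rows) = (20, 10): 20 array rows and 10 array columns.
rows = 10
cols = 20

race_track = np.zeros((rows, cols)).T

# Mark blocks of cells with 1; all other cells stay 0.
race_track[5:, rows-5:] = 1   # array rows 5-19, columns 5-9
race_track[:10, 0] = 1        # array rows 0-9, column 0
race_track[15:, 0] = 1        # array rows 15-19, column 0
race_track[15:, 1] = 1        # array rows 15-19, column 1
race_track

#sns.heatmap(race_track)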

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [0., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 0., 0., 0., 1., 1., 1., 1., 1.]])