# S&B RL Examples 9.1, 10.1, & 13.1

Chapters 9, 10, and 13 do not have any programming exercises. However, some of the examples in these chapters are interesting and we would like to implement them for the sake of getting more practice. We are interested in particular in:

+ **Example 9.1**: This example describes a 1000-state random walk in which function approximation is used for estimating the value function. The approximation used is state aggregation on every 100 states. The solution implements gradient MC.

+ **Example 10.1**: This example describes a control problem in which a car must learn how to fight gravity to get itself out of a valley. We want to implement n-step semi-gradient Sarsa to solve this problem using a linear model, as is done in the example.

+ **Example 12.2**: Same as Example 10.1 but implementing Sarsa($\lambda$) and True Online Sarsa($\lambda$).

+ **Example 13.1**: This example optimizes a very simple random walk where the results of the actions "left" and "right" get flipped in one of the states. Given the limited representation of the actions the optimal policy is stochastic. We want to learn this stochastic policy using policy gradients.
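
To see why the optimal policy in Example 13.1 is stochastic, we can compute the value of the start state exactly for a fixed probability of choosing "right". The sketch below is not part of the solution that follows. The helper `start_state_value` and the grid of probabilities are assumptions made for illustration. It solves the three Bellman equations of the corridor (reward of -1 per step, actions reversed in the second state) as a linear system.

```python
import numpy as np

def start_state_value(p_right):
    """Value of the start state when 'right' is chosen with probability p_right."""
    p, q = p_right, 1.0 - p_right
    # Bellman equations (terminal value is 0, reward is -1 per step):
    #   v0 = -1 + p*v1 + q*v0   (moving left in state 0 stays put)
    #   v1 = -1 + p*v0 + q*v2   (actions are reversed in state 1)
    #   v2 = -1 + p*0  + q*v1   (moving right in state 2 terminates)
    A = np.array([[1.0 - q, -p, 0.0],
                  [-p, 1.0, -q],
                  [0.0, -q, 1.0]])
    b = -np.ones(3)
    return np.linalg.solve(A, b)[0]

# Hypothetical grid of policies to scan over
probs = np.linspace(0.05, 0.95, 181)
values = np.array([start_state_value(p) for p in probs])
best_p = probs[np.argmax(values)]
print(best_p, values.max())
```

On this grid the maximum falls near a probability of 0.59 for "right", which is better than either near-deterministic choice and agrees with the figure quoted in the book's example.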

In [None]:
%matplotlib inline
import numpy as np
import copy
import time
import matplotlib.pyplot as plt
import scipy as sp
import math
import seaborn as sns

# Example 9.1

In [None]:
def generate_episode(n, width):
    """
    Generate a full episode of our MRP. Function 
    returns tuple containing sequence of states in
    episode and the final (total) reward.
    """
    
    episode = []
    s = int(n/2) # start state
    episode_done = False
    reward = None
    
    while not episode_done:
        # Record current state
        episode.append(s)
        
        # Determine the next state
        left_edge = s - width
        right_edge = s + width
        move_right = np.random.binomial(1, 0.5)
        
        if move_right:
            s_prime = np.random.randint(s + 1, right_edge + 1)
        else:
            s_prime = np.random.randint(left_edge, s)
            
        # Figure out if the transition was terminal and handle
        if s_prime >= n:
            episode_done = True
            s = n
            reward = 1
        elif s_prime <= 1:
            episode_done = True
            s = 1
            reward = -1
        else:
            s = s_prime
            
    return episode, reward

def zero_vector(group_size, n):

    assert n % group_size == 0, "Group size must divide total number of states"
    
    num_groups = int(n/group_size)
    return np.zeros(num_groups)

def feature_vector(s, group_size, n):
    """
    Given a state s this returns the feature 
    vector X(s). Note that in this problem we
    use state aggregation so each component of 
    the returned vector corresponds to a contiguous 
    group of states.
    """
    X = zero_vector(group_size, n)
    
    group_num = int((s - 1)/group_size)
    X[group_num] = 1
    
    return X
    
def value_function(X, w, group_size, n):
    """
    Calculates the value function linearly
    parametrized by w with feature vector 
    X(s).
    """

    return np.dot(X, w)

def gradient_monte_carlo(episodes, alpha, n, width, group_size):
    """
    Takes in the problem parameters and the number
    of episodes to be simulated and returns the vector
    w describing our estimate of the value function for
    the target policy (or process in our MRP case).
    """

    w = zero_vector(group_size, n)
    
    t_i = time.time()
    base_t_i = t_i

    print("\nBEGIN GRADIENT MC WITH PARAMS:\n" +
          "episodes: {}\n".format(episodes) +
          "alpha: {}\n".format(alpha) +
          "n: {}\n".format(n) +
          "width: {}\n".format(width) +
          "group_size: {}\n".format(group_size))
    
    for episode_num in range(episodes):
        
        episode, G = generate_episode(n, width)
        
        for s in episode:
            # Note derivative of value function is just X(s) for linear model
            # Also note the total reward (G_t) is the same at all t
            X = feature_vector(s, group_size, n)
            w += alpha * (G - value_function(X, w, group_size, n)) * X
            
        # Report progress ten times per run (guards against episodes < 10)
        report_every = max(1, episodes // 10)
        if (episode_num + 1) % report_every == 0:
            delta_t = round(time.time() - t_i, 2)
            print("Episodes {}-{} complete: {} s".format(episode_num + 1 - report_every,
                                                         episode_num + 1,
                                                         delta_t))
            t_i = time.time() 
            
    base_delta_t = round(time.time() - base_t_i, 2)
    print("\nSIMULATIONS COMPLETED, TOTAL TIME: {} s\n".format(base_delta_t))
            
    return w

def value_function_arrays(w, n, group_size):
    """
    Takes the vector w parametrizing our value 
    function and returns the arrays needed to plot 
    the value as a function of the states.
    """
    
    states = np.arange(1, n + 1)
    values = np.array([w[int((s - 1)/group_size)] for s in states])
    
    return states, values

In [None]:
agg_params = {"episodes": 10**5, "alpha": 2 * 10**(-5), "n": 10**3 , "width": 100, "group_size": 100}
non_agg_params = {"episodes": 10**5, "alpha": 3 * 10**(-3), "n": 10**3, "width": 100, "group_size": 1}
agg_w = gradient_monte_carlo(**agg_params)
non_agg_w = gradient_monte_carlo(**non_agg_params)

In [None]:
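# `n` is not defined at module level, so read it from the parameter dictionaries
agg_states, agg_values = value_function_arrays(agg_w, agg_params["n"], agg_params["group_size"])
nonagg_states, nonagg_values = value_function_arrays(non_agg_w, non_agg_params["n"], non_agg_params["group_size"])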

Note that in the plot below, which compares the estimates for group sizes of 1 and 100, the estimate of the value function for the smaller group size is much noisier. This is because, for the same number of episodes, we have far fewer samples for each group or bin. Thus aggregating states increases bias (since it forces contiguous states to share a value) but reduces variance (less noise in the estimated state values).
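
The trade-off can be illustrated on synthetic data. The sketch below does not use the random walk itself: it assumes a linear true value function, unit-variance Gaussian noise on the returns, and five samples per state, all chosen only for illustration. It splits the error of the aggregated estimate into a squared-bias part and a noise part and compares the latter with the noise of the per-state estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, group_size, samples_per_state = 1000, 100, 5

# Assumed ground truth: a value function that is linear in the state index
true_v = np.linspace(-1.0, 1.0, n_states)

# Noisy sampled returns for every state
returns = true_v[:, None] + rng.normal(0.0, 1.0, size=(n_states, samples_per_state))

# Group size 1: average only the samples of each individual state
per_state = returns.mean(axis=1)

# Group size 100: pool all samples of each block of 100 contiguous states
per_group = returns.reshape(n_states // group_size, -1).mean(axis=1)
aggregated = np.repeat(per_group, group_size)

# Best value a piecewise-constant estimate could reach in each group
group_target = np.repeat(true_v.reshape(-1, group_size).mean(axis=1), group_size)

noise_per_state = np.mean((per_state - true_v) ** 2)         # variance only, no bias
noise_aggregated = np.mean((aggregated - group_target) ** 2)  # variance of pooled means
bias_sq_aggregated = np.mean((group_target - true_v) ** 2)    # cost of sharing a value

print(noise_per_state, noise_aggregated, bias_sq_aggregated)
```

Pooling 100 states gives each bin 100 times as many samples, so its noise term is much smaller, while the bias term is nonzero because states within a bin are forced to share one value.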

In [None]:
figure = plt.figure(figsize=(20, 10))

plt.plot(nonagg_states[1:-1], nonagg_values[1:-1], label="Group Size = 1")
plt.plot(agg_states, agg_values, linewidth=5, label="Group Size = 100")
plt.xlabel("States $[s]$", size=30)
plt.xticks(size=20)
plt.ylabel(r"State Value $[v_{\pi}(s)]$", size=30)  # raw string avoids the invalid "\p" escape
plt.yticks(size=20)
plt.legend(prop={"size": 20})
plt.title("Gradient Monte Carlo Approximations To Value Function", size=30)


plt.tight_layout()

# Example 10.1
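For reference, the target used by $n$-step semi-gradient Sarsa is the $n$-step return $G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^{n} \hat{q}(S_{t+n}, A_{t+n}, \mathbf{w})$, where the bootstrap term is dropped once $t + n \geq T$. The sketch below is a standalone illustration of that formula rather than the class-based implementation in the next cell. The function name `n_step_return` and its arguments are hypothetical.

```python
import numpy as np

def n_step_return(rewards, gamma, bootstrap_q=None):
    """Discounted sum of the given rewards plus an optional bootstrapped tail.

    rewards     -- the (up to n) rewards R_{t+1}, ..., R_{t+n}
    bootstrap_q -- q(S_{t+n}, A_{t+n}, w), or None if the episode ended first
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)          # gamma^0, ..., gamma^(n-1)
    G = float(np.dot(discounts, rewards))
    if bootstrap_q is not None:
        # The bootstrap estimate is multiplied by gamma^n, not added to it
        G += gamma ** n * bootstrap_q
    return G

# Four steps at -1 each with gamma = 1, bootstrapping from an estimate of -20
print(n_step_return([-1, -1, -1, -1], gamma=1.0, bootstrap_q=-20.0))  # -24.0
```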

In [None]:
def check_bounds_valid(bounds):
    assert len(bounds) == 2 \
           and bounds[0] < bounds[1], "Passed bounds are invalid"


class MountainCarSimulator():
    
    def __init__(self, initial_position_bounds=[-0.6, -0.4], initial_velocity=0,
                 x_bounds=[-1.2, 0.5], x_dot_bounds=[-0.07, 0.07], acceleration=0.001, 
                 gravity=0.0025, stepness_factor=3):
    
        check_bounds_valid(x_bounds)
        check_bounds_valid(x_dot_bounds)
    
        self.x_bounds = x_bounds
        self.x_dot_bounds = x_dot_bounds
        self.acceleration = acceleration
        self.gravity = gravity
        self.stepness_factor = stepness_factor
        self.initial_position_bounds = initial_position_bounds
        self.initial_position = np.random.uniform(initial_position_bounds[0], initial_position_bounds[1])
        self.initial_velocity = initial_velocity
        
        MountainCarSimulator.check_initial_state([self.initial_position, self.initial_velocity], 
                                                 x_bounds, x_dot_bounds)
        
    @staticmethod
    def check_initial_state(initial_state, x_bounds, x_dot_bounds):
        assert x_bounds[0] <= initial_state[0] <= x_bounds[1], "Initial position is not valid"
        assert x_dot_bounds[0] <= initial_state[1] <= x_dot_bounds[1], "Initial velocity is not valid"
        
    @staticmethod
    def bound_clip(x, bounds):
        return max(min(x, bounds[1]), bounds[0])
    
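    def get_initial_state(self):
        # Resample the start position so that every episode begins at a new random x
        self.initial_position = np.random.uniform(self.initial_position_bounds[0],
                                                  self.initial_position_bounds[1])
        return [self.initial_position, self.initial_velocity]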
        
    def next_x(self, x, next_x_dot):
        """
        Note that in this problem x, x' are NOT feature
        vectors, together they specify the true state s.
        """
        next_x = x + next_x_dot
        next_x_bounded = MountainCarSimulator.bound_clip(next_x, self.x_bounds)
        
        return next_x_bounded

    def next_x_dot(self, a, x, x_dot):
        """
        Note that in this problem x, x' are NOT feature
        vectors, together they specify the true state s.
        """
        next_x_dot = x_dot + self.acceleration * a \
                     - self.gravity * math.cos(self.stepness_factor * x)
        
        next_x_dot_bounded = MountainCarSimulator.bound_clip(next_x_dot, self.x_dot_bounds)
        
        return next_x_dot_bounded
    
    def next_state(self, a, x, x_dot):
        """
        Convenience method for getting the 
        next state, but handles the crucial 
        case of the car hitting the left boundary.
        """
        
        next_x_dot = self.next_x_dot(a, x, x_dot)
        next_x = self.next_x(x, next_x_dot)
        
        if next_x == self.x_bounds[0]:
            next_x_dot = 0
            
        return [next_x, next_x_dot]
    
    def check_terminal(self, x):
        return x >= self.x_bounds[1]


class TileOffsetGenerator2D():
    
    def __init__(self, dim, width, num_tilings):
        """
        Based on recommendations of Miller and
        Glanz explained on page 220 of S&B.
        """
        assert num_tilings >= 4*dim and (num_tilings & (num_tilings - 1)) == 0, \
            "Miller and Glanz recommend that num_tilings be a power of 2 that is >= 4*dim"
        
        self.width = width
        self.num_tilings = num_tilings
        self.base_unit = self.width/num_tilings
        
        self.fundamental_vector = np.arange(1, 2*dim, 2)
        
        self.offsets = [step * self.base_unit * self.fundamental_vector for step in range(0, self.num_tilings)]
        
    def get_offsets(self):
        return self.offsets

    
class RectangularTiling2D():
    
    def __init__(self, x_num, y_num, offset=[0, 0]):
        assert len(offset) == 2, "offset_vec must have length two"
        assert type(x_num) == int and type(y_num) == int, "x_num and y_num must be ints"
        
        # Since the tiles are rectangular
        self.x_lines = np.append(np.sort((np.linspace(0, 1, x_num, endpoint=False) + offset[0]) % 1), [1.0])
        self.y_lines = np.append(np.sort((np.linspace(0, 1, y_num, endpoint=False) + offset[1]) % 1), [1.0])
        
        # When there is no offset we have exactly x_num tiles along x and y_num along y
        self.extra_x_tile = self.x_lines[0] != 0
        self.extra_y_tile = self.y_lines[0] != 0
        
        self.x_tiles = x_num + int(self.extra_x_tile)
        self.y_tiles = y_num + int(self.extra_y_tile)
        
        self.num_tiles = self.x_tiles * self.y_tiles
        
    @staticmethod
    def check_normalized(coord):
        assert 0 <= coord <= 1, "Coordinates passed must be normalized."
        
    def find_tile_index(self, x_norm, y_norm):
        """
        Passed coordinates are assumed to be normalized.
        Returns 2D index of the tile within which coordinates
        fall. 0-indexing is used.
        """
        RectangularTiling2D.check_normalized(x_norm)
        RectangularTiling2D.check_normalized(y_norm)
        
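        # Count the grid lines at or below each coordinate, then clamp so that a
        # coordinate of exactly 1.0 lands in the last tile instead of wrapping to index 0
        x_index = min(int(np.sum(x_norm >= self.x_lines)) - int(not self.extra_x_tile), self.x_tiles - 1)
        y_index = min(int(np.sum(y_norm >= self.y_lines)) - int(not self.extra_y_tile), self.y_tiles - 1)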
        
        assert 0 <= x_index < self.x_tiles, "x tile index {} out of bounds [0, {}).".format(x_index, self.x_tiles)
        assert 0 <= y_index < self.y_tiles, "y tile index {} out of bounds [0, {}).".format(y_index, self.y_tiles)
        
        return [x_index, y_index]
    
    def find_tile_scalar_index(self, x_norm, y_norm):
        """
        Returns scalar index for tile. This will be the
        index for this tile in the one dimensional vector
        representing the encoding for this tile.
        """
        
        x_index, y_index = self.find_tile_index(x_norm, y_norm)
        
        # Row-major flattening: the stride must be the number of tiles along x
        return x_index + self.x_tiles * y_index

    
class TileEncoder2D():
    
    def __init__(self, offsets, x_bounds, x_dot_bounds, x_num=8, x_dot_num=8,
                 tiler=RectangularTiling2D):
        
        check_bounds_valid(x_bounds)
        check_bounds_valid(x_dot_bounds)
        
        self.num_tilings = len(offsets)
        self.offsets = offsets
        self.x_bounds = x_bounds
        self.x_dot_bounds = x_dot_bounds
        self.x_num = x_num
        self.x_dot_num = x_dot_num
        
        # A list containing all of our tile objects
        self.tilings = []
        
        for offset in offsets:
            self.tilings.append(tiler(x_num, x_dot_num, offset))
            
        # What dimension vector X(s) is required to represent our encoding?
        self.encoder_dim = 0 
        
        for tiling in self.tilings:
            self.encoder_dim += tiling.num_tiles
        
    @staticmethod
    def normalize(var, var_range):
        return (var - var_range[0])/(var_range[1] - var_range[0])
    
    @staticmethod
    def denormalize(norm_var, var_range):
        return (var_range[1] - var_range[0]) * norm_var + var_range[0]
    
    def featurize(self, x, x_dot):
        """
        Returns feature vector X(s) corresponding 
        to the state coordinate s=(x, x').
        """
        x_norm, x_dot_norm = TileEncoder2D.normalize(x, self.x_bounds), \
                             TileEncoder2D.normalize(x_dot, self.x_dot_bounds)
        tiling_vectors = []
        
        for tiling in self.tilings:
            dim = tiling.num_tiles
            index = tiling.find_tile_scalar_index(x_norm, x_dot_norm)
            tiling_vector = np.zeros(dim, dtype=int)
            tiling_vector[index] = 1
            
            tiling_vectors.append(tiling_vector)
            
        feature_vector = np.concatenate(tiling_vectors)
        
        assert len(feature_vector) == self.encoder_dim, "Actual encoder dimension not what expected"
        assert len(feature_vector.shape) == 1, "Feature vector is not one dimensional"
            
        return feature_vector
    
class SemiGradientLinearNStepSarsa():
    
    """
    This class maintains a queue which contains the
    last seen N + 1 states, actions, and rewards. When a
    state, action, and reward are outside of the n-step 
    window they are removed from the queue. The queue will
    be shorter than length n if:
    
    1) The episode has fewer than n steps
    2) The episode has just begun
    3) The episode is ending
    """
    
    def __init__(self, n, alpha, gamma, epsilon):
        self.n = n
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.features = [] # X(s) -- storing features rather than states to save compute
        self.rewards = []
        self.actions = []
        
        self.t = 0
        self.tau = self.t - self.n + 1
        self.T = np.inf
        
    @staticmethod
    def Q(X, w_a):
        r"""
        We assume each of the three actions -1, 0, 1
        have their own weight vector of dimension 
        len(X(s)).
        
        This assumption is equivalent to assuming that
        X(s, a) \dot w = X(s, a) \dot [w_{-1}, w_0, w_1]^T
        = X(s, a) \dot w_{a*=a} = X(s) \cdot w_a. This follows
        from the fact that X(s, a) \dot  w_{a*!=a} = 0 because
        the components of the vector X(s, a_0) corresponding 
        to a != a_0 are equal to zero. 
        """
        
        assert len(X) == len(w_a), "X and w_a must have the same length"
        return np.dot(X, w_a)
    
    @staticmethod
    def Del_Q(X, w_a):
        """
        In linear case the derivative is just equal
        to the feature vector X(s). This follows from 
        the definition of our representation X(s) and
        of X(s, a) as explained in the Q method.
        """
        return X
    
    def pi(self, X, w):
        Q_s = np.array([SemiGradientLinearNStepSarsa.Q(X, w[a + 1]) for a in [-1, 0, 1]])
        optimal_actions = np.argwhere(Q_s == np.amax(Q_s)) - 1
        not_greedy = np.random.binomial(1, self.epsilon)
        
        if not_greedy:
            return np.random.choice([-1, 0, 1])
        else:
            return np.random.choice(optimal_actions.flatten())
        
    def add_observation(self, observation_list, observation):
        # We pop elements if the list is sufficiently long or if the episode has terminated
        if len(observation_list) > self.n or self.T != np.inf:
            observation_list.pop(0)
            
        # When None is passed as an observation it is not appended (handles post episode termination case)
        if observation is not None:
            observation_list.append(observation)
    
    def increment_time(self):
        self.t +=1
        self.tau += 1
    
    def add_timestep(self, feature_t_1, action_t_1, reward_t_2):
        feature_none = feature_t_1 is None
        action_none = action_t_1 is None
        reward_none = reward_t_2 is None
        
        if feature_none or action_none or reward_none:
            assert  feature_none and action_none and reward_none, \
                "If one state variable is `None` all should be."
        
        self.add_observation(self.features, feature_t_1)
        self.add_observation(self.actions, action_t_1)
        self.add_observation(self.rewards, reward_t_2)
        
    def G(self, w):
        """
        We represent w as a 3 x num_features matrix where
        each of the three rows corresponds to w_{-1}, w_{0},
        and w_{1} respectively.
        """
        
        n = min(self.n, len(self.rewards)) # If less than n rewards left we truncate sum
        gamma_factors = np.logspace(0, n - 1, n, base=self.gamma)
        
        G = np.dot(gamma_factors, self.rewards[:self.n])
        
        if self.tau + self.n < self.T:
            X_tau_n = self.features[self.n]
            w_a_tau_n = w[self.actions[self.n] + 1] # Actions are -1, 0, 1
            # Bootstrap term is gamma^n * Q(S_{tau+n}, A_{tau+n}), a product rather than a sum
            G += self.gamma**n * SemiGradientLinearNStepSarsa.Q(X_tau_n, w_a_tau_n)
            
        return G
    
    def w_update(self, w):
        G = self.G(w)
        X_tau = self.features[0]
        w_a_tau = w[self.actions[0] + 1] # Actions are -1, 0, 1
        
        # Note that we only get a gradient for the portion of w 
        # corresponding to the action value for which we update Q(S,A)
        delta_w = np.zeros((3, len(w_a_tau)))
        delta_w[self.actions[0] + 1] = self.alpha * (G - self.Q(X_tau, w_a_tau)) * self.Del_Q(X_tau, w_a_tau)
        
        return delta_w
    
    def reset(self):
        """
        Resets the state of the Sarsa optimizer
        """
        
        self.features = []
        self.rewards = []
        self.actions = []
        
        self.t = 0
        self.tau = self.t - self.n + 1
        self.T = np.inf
    
    def check_done(self):
        # Note that on the last iteration we pop first, then calculate the last G
        # THEN measure the length. This means we need to check the length equals 1
        # at the final timestep
        features_empty = len(self.features) == 1
        actions_empty = len(self.actions) == 1
        rewards_empty = len(self.rewards) == 1
        
        lists_empty = features_empty and actions_empty and rewards_empty
        time_limit = (self.tau == self.T - 1)
        
        # Just a consistency check that our implementation agrees with the pseudo-code
        # Note we expect to be one step ahead of pseudo-code when lists are empty
        assert (lists_empty == time_limit), "Time condition and queue being empty should match."
        
        if lists_empty:
            self.reset()
        
        return lists_empty
    
    def set_T_terminal(self):
        self.T = self.t + 1

In [None]:
## Main execution of algorithm here -- may want to wrap in method or class 
## if we want to be able to easily run multiple runs

class SolveMountainCar():
    
    def __init__(self, episodes, simulator, sarsa, tiler):
        self.episodes = episodes
        self.simulator = simulator
        self.sarsa = sarsa
        self.tiler = tiler
        self.X_dim = self.tiler.encoder_dim
        self._w_array = np.empty((self.episodes, 3, self.X_dim))
        self._steps_array = np.empty(self.episodes)
        
    def get_w_array(self):
        return self._w_array
    
    def get_steps_array(self):
        return self._steps_array
    
    def set_w_array(self, episode, w):
        self._w_array[episode] = w
        
    def set_steps_array(self, episode, steps):
        self._steps_array[episode] = steps
        
    def run(self):
        w = np.zeros((3, self.X_dim))
        
        print("BEGIN MOUNTAIN CAR SARSA\n")
        t_i = time.time()
        t_i_abs = t_i
        
        for episode in range(self.episodes):
            steps = 0
            
            x, v = self.simulator.get_initial_state()
            X = self.tiler.featurize(x, v)
            a = self.sarsa.pi(X, w)
            r_prime = -1
            
            self.sarsa.add_timestep(X, a, r_prime)
            sarsa_terminated = False
            episode_terminated = False
            
            while not sarsa_terminated:
                # Check at the beginning of each step if we've already terminated. If not, 
                # get the next state and check if that state is terminal
                if not episode_terminated:
                    x_prime, v_prime = self.simulator.next_state(a, x, v)
                    episode_terminated = self.simulator.check_terminal(x_prime)
                    
                    if episode_terminated:
                        self.sarsa.set_T_terminal()
                
                # If next state is terminal pass next features as None
                if episode_terminated:
                    X_prime = None
                    a_prime = None
                    r_dprime = None
                    
                # Else featurize next state, get next action and reward
                else:
                    X_prime = self.tiler.featurize(x_prime, v_prime)
                    a_prime = self.sarsa.pi(X_prime, w)
                    r_dprime = -1
                    steps += 1
                    
                # Store next feature, action, and reward
                self.sarsa.add_timestep(X_prime, a_prime, r_dprime)
                
                # If tau >= 0 apply the required update to w
                if self.sarsa.tau >= 0:
                    w += self.sarsa.w_update(w)
                
                # Finally, increment t and tau, and check whether we have applied 
                # sarsa to all data in the episode before moving to next episode
                x = x_prime
                v = v_prime
                a = a_prime
                
                sarsa_terminated = self.sarsa.check_done()
                if not sarsa_terminated:
                    self.sarsa.increment_time()
                    
            if self.episodes >= 100 and (episode + 1) % int(self.episodes/100) == 0:
                t_f = time.time()
                print("Iterations {}-{} Completed - Time: {}s".format(episode + 1 - int(self.episodes/100),
                                                                      episode + 1,
                                                                      round(t_f-t_i, 2)))
                t_i = t_f
                
            self.set_w_array(episode, w)
            self.set_steps_array(episode, steps)

        delta_t_abs = round((time.time() - t_i_abs)/(self.episodes), 2)
        print("\nEND MOUNTAIN CAR SARSA - Avg. Episode Time: {}s".format(delta_t_abs))
            
        return w

In [None]:
episodes = 100000 # full run; start small (e.g. 1, then 500) when debugging
simulator_params = {"x_bounds": [-1.2, 0.5], "x_dot_bounds": [-0.07, 0.07]}
learning_params = {"n": 4, "alpha": 0.1, "gamma": 1, "epsilon": 0.1}
offset_params = {"dim": 2, "width": 1/8, "num_tilings": 8}

offsets = TileOffsetGenerator2D(**offset_params)
tiling_params = {"offsets": offsets.get_offsets(), 
                 "x_bounds": simulator_params["x_bounds"], 
                 "x_dot_bounds": simulator_params["x_dot_bounds"]}

simulator = MountainCarSimulator(**simulator_params)
sarsa = SemiGradientLinearNStepSarsa(**learning_params)
tiler = TileEncoder2D(**tiling_params)

solver = SolveMountainCar(episodes, simulator, sarsa, tiler)
w_final = solver.run()
w_array = solver.get_w_array()
steps_array = solver.get_steps_array()

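import os
os.makedirs("example_10_1_outputs", exist_ok=True)  # np.save does not create missing directories
np.save("example_10_1_outputs/w_array_100000.npy", w_array)
np.save("example_10_1_outputs/steps_array_100000.npy", steps_array)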

In [None]:
w_array = np.load("example_10_1_outputs/w_array_100000.npy")
w_final = w_array[-1]
steps_array = np.load("example_10_1_outputs/steps_array_100000.npy")

In [None]:
nx, ny = 100, 100

x = np.linspace(0, 1, nx, False)
y = np.linspace(0, 1, ny, False)

xv, yv = np.meshgrid(x, y)

vdenorm_x = np.vectorize(lambda x: tiler.denormalize(x, tiler.x_bounds))
vdenorm_x_dot = np.vectorize(lambda x_dot: tiler.denormalize(x_dot, tiler.x_dot_bounds))

def vQ(x_v, x_dot_v, w, tiler, sarsa):
    assert x_v.shape == x_dot_v.shape, "Shapes must be equal"
    input_shape = x_v.shape
    Q = np.empty((input_shape[0] * input_shape[1], 3))

    x_flat = x_v.flatten()
    x_dot_flat = x_dot_v.flatten()
    
    index = 0
    for x, x_dot in zip(x_flat, x_dot_flat):
        # Features depend only on the state, so compute them once per grid point
        X = tiler.featurize(x, x_dot)
        for a in (-1, 0, 1):
            Q[index, a + 1] = sarsa.Q(X, w[a + 1])
            
        index += 1
        
    return Q.reshape(input_shape[0], input_shape[1], 3)

x_v = vdenorm_x(xv)
x_dot_v = vdenorm_x_dot(yv)
Q_a = vQ(x_v, x_dot_v, w_final, tiler, sarsa)
Q = np.max(Q_a, axis=2)
pi = np.argmax(Q_a, axis=2) - 1

In [None]:
%matplotlib inline
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(121)
ax.contourf(x_v, x_dot_v, pi)
ax = fig.add_subplot(122, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q, cmap="coolwarm")
plt.tight_layout()

In [None]:
%matplotlib inline
fig = plt.figure(figsize=(20,10))
ax = fig.add_subplot(121)
ax.contourf(x_v, x_dot_v, -Q)
ax = fig.add_subplot(122, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q, cmap="coolwarm")
plt.tight_layout()

In [None]:
%matplotlib inline 
pi0 = np.argmax(vQ(x_v, x_dot_v, w_array[0], tiler, sarsa), axis=2) - 1
pi1 = np.argmax(vQ(x_v, x_dot_v, w_array[10], tiler, sarsa), axis=2) - 1
pi2 = np.argmax(vQ(x_v, x_dot_v, w_array[100], tiler, sarsa), axis=2) - 1
pi3 = np.argmax(vQ(x_v, x_dot_v, w_array[1000], tiler, sarsa), axis=2) - 1
pi4 = np.argmax(vQ(x_v, x_dot_v, w_array[10000], tiler, sarsa), axis=2) - 1
pi5 = np.argmax(vQ(x_v, x_dot_v, w_array[99999], tiler, sarsa), axis=2) - 1

fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(321)
plt1 = ax.contourf(x_v, x_dot_v, pi0)
fig.colorbar(plt1)
ax = fig.add_subplot(322)
plt2 = ax.contourf(x_v, x_dot_v, pi1)
fig.colorbar(plt2)
ax = fig.add_subplot(323)
plt3 = ax.contourf(x_v, x_dot_v, pi2)
fig.colorbar(plt3)
ax = fig.add_subplot(324)
plt4 = ax.contourf(x_v, x_dot_v, pi3)
fig.colorbar(plt4)
ax = fig.add_subplot(325)
plt5 = ax.contourf(x_v, x_dot_v, pi4)
fig.colorbar(plt5)
ax = fig.add_subplot(326)
plt6 = ax.contourf(x_v, x_dot_v, pi5)
fig.colorbar(plt6)

plt.tight_layout()
plt.show()

In [None]:
Q0 = np.max(vQ(x_v, x_dot_v, w_array[0], tiler, sarsa), axis=2)
Q1 = np.max(vQ(x_v, x_dot_v, w_array[10], tiler, sarsa), axis=2)
Q2 = np.max(vQ(x_v, x_dot_v, w_array[100], tiler, sarsa), axis=2)
Q3 = np.max(vQ(x_v, x_dot_v, w_array[1000], tiler, sarsa), axis=2)
Q4 = np.max(vQ(x_v, x_dot_v, w_array[10000], tiler, sarsa), axis=2)
Q5 = np.max(vQ(x_v, x_dot_v, w_array[99999], tiler, sarsa), axis=2)

%matplotlib inline
fig = plt.figure(figsize=(20,20))
ax = fig.add_subplot(321, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q0, cmap="coolwarm")
ax = fig.add_subplot(322, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q1, cmap="coolwarm")
ax = fig.add_subplot(323, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q2, cmap="coolwarm")
ax = fig.add_subplot(324, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q3, cmap="coolwarm")
ax = fig.add_subplot(325, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q4, cmap="coolwarm")
ax = fig.add_subplot(326, projection='3d')
ax.plot_surface(x_v, x_dot_v, -Q5, cmap="coolwarm")

plt.tight_layout()

In [None]:
fig = plt.figure(figsize=(20,10))
plt.plot(np.log10(steps_array), zorder=0)
plt.hlines(2.18, 0, 100000, zorder=1, linewidth=3, linestyle="dashed")
plt.tight_layout()

# Example 12.2 [FIXME- Complete]

Use Sarsa($\lambda$) and True Online Sarsa($\lambda$) to solve the mountain car problem. We can reuse most of the classes from the solution to Example 10.1; only the Sarsa and main classes need to change. We want to compare the performance of these algorithms (with near-optimal parameters) to $n$-step Sarsa with $n=4$, which is optimal according to S&B.
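
As a starting point, a single Sarsa($\lambda$) update for a linear action-value function might look like the sketch below. It assumes accumulating eligibility traces and action-feature vectors $\mathbf{x}(s, a)$; it is not True Online Sarsa($\lambda$), and the function name `sarsa_lambda_step` and its signature are hypothetical rather than part of the classes above. The trace `z` should be reset to zeros at the start of every episode.

```python
import numpy as np

def sarsa_lambda_step(w, z, x, reward, x_next, alpha, gamma, lam):
    """One Sarsa(lambda) update with accumulating traces for q(s, a) = w . x(s, a).

    x and x_next are the feature vectors x(S, A) and x(S', A');
    pass x_next=None when S' is terminal.
    """
    z = gamma * lam * z + x                  # decay the trace, then add the gradient
    q = np.dot(w, x)
    q_next = 0.0 if x_next is None else np.dot(w, x_next)
    delta = reward + gamma * q_next - q      # one-step TD error
    w = w + alpha * delta * z                # credit all recently visited features
    return w, z

# Tiny usage example with two hypothetical features
w, z = np.zeros(2), np.zeros(2)
w, z = sarsa_lambda_step(w, z, np.array([1.0, 0.0]), -1.0, np.array([0.0, 1.0]),
                         alpha=0.1, gamma=1.0, lam=0.9)
w, z = sarsa_lambda_step(w, z, np.array([0.0, 1.0]), -1.0, None,
                         alpha=0.1, gamma=1.0, lam=0.9)
print(w, z)
```

Because the trace still holds a decayed copy of the first feature during the second update, the first weight is adjusted again even though its feature is no longer active. That is the mechanism by which $\lambda$ spreads credit backwards in time.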

# Example 13.1

We parametrize the policy using equations 13.2 and 13.3 of S&B (a softmax over linear action preferences). In this case, Exercise 13.3 (which we solved in our notebook) implies that:

$$ \nabla \ln{ \pi(a \ | \ s, \mathbf{\theta} ) } = \mathbf{x}(s,a) - \sum_{b} \pi(b  \ | \ s, \mathbf{\theta}) \ \mathbf{x}(s,b)$$ 

We use this formula to implement the REINFORCE Monte Carlo Policy-Gradient Control algorithm.
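
Before relying on the formula, it can be checked numerically. The sketch below assumes one-hot action features (one component of $\mathbf{\theta}$ per action, independent of the state) and compares the analytic gradient against central finite differences. The helper names `softmax`, `grad_log_pi`, and `numeric_grad_log_pi` are hypothetical and used only for this check.

```python
import numpy as np

def softmax(theta):
    """Softmax over action preferences h(a) = theta[a] (shifted for stability)."""
    e = np.exp(theta - np.max(theta))
    return e / e.sum()

def grad_log_pi(a, theta):
    """Analytic gradient: x(a) - sum_b pi(b) x(b), with one-hot features x."""
    x = np.eye(len(theta))       # row b is the feature vector x(b)
    pi = softmax(theta)
    return x[a] - pi @ x

def numeric_grad_log_pi(a, theta, eps=1e-6):
    """Central finite-difference estimate of the gradient of ln pi(a | theta)."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        upper = np.log(softmax(theta + step)[a])
        lower = np.log(softmax(theta - step)[a])
        grad[i] = (upper - lower) / (2 * eps)
    return grad

theta = np.array([0.3, -1.2])    # arbitrary test point
for a in (0, 1):
    assert np.allclose(grad_log_pi(a, theta), numeric_grad_log_pi(a, theta), atol=1e-6)
print("analytic and numeric gradients agree")
```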

In [None]:
def calc_h(a, theta):
    """
    Note that the feature representation is
    independent of the state so we can omit s
    from the arguments of h. 
    
    We represent right with 0 and left with 1
    to align with the representation in the example
    in the book.
    """
    
    assert a in [0, 1], "a must be one of `[0, 1]`"
    
    # Dot product is same as indexing since x(a) is basis vector
    return theta[a]

def calc_pi(a, theta):
    """
    See documentation for `calc_h`
    """
    
    h_0 = calc_h(0, theta) 
    h_1 = calc_h(1, theta) 
    h_a = h_0 if a == 0 else h_1
    
    return np.exp(h_a)/(np.exp(h_0) + np.exp(h_1))

def calc_del_log_pi(a, theta):
    """
    Calculates the derivative of the natural
    logarithm of the policy based on the formula
    above.
    """
    
    x_0 = np.array([1, 0])
    x_1 = np.array([0, 1])
    x_a = np.array([1, 0]) if a == 0 else np.array([0, 1])
    
    return x_a - calc_pi(0, theta) * x_0 - calc_pi(1, theta) * x_1 

def polgrad_episode(theta):
    
    s = 0 # Initial state representation
    states = []
    actions = []
    prob_right = calc_pi(0, theta)
    episode_over = False
    
    while not episode_over:
        states.append(s)
        try:
            take_right = np.random.binomial(1, prob_right)
        except ValueError as e:
            h_0 = calc_h(0, theta) 
            h_1 = calc_h(1, theta) 
            print(h_0)
            print(h_1)
            print(theta)
            print(prob_right)
            raise e
        actions.append(1 - take_right)    
        
        # This line of code captures the fact that at s=1 left and right are reversed
        actually_right = take_right if s != 1 else 1 - take_right
        
        # Move the agent to the state that the action actually transitions to
        if actually_right:
            s = s + 1
            episode_over = s >= 3
        else:
            s = s - 1 if s > 0 else 0
    
    assert len(states) == len(actions), "Length of actions and state list should be same"
    
    return states, actions

def reinforce(simulations, episodes, alpha, theta_i=np.zeros(2)):
    rewards = np.empty((simulations, episodes))
    pi = np.empty((simulations, episodes, 2))
    
    print("\nBEGIN REINFORCE POLICY GRADIENT WITH PARAMS:\n" +
          "simulations: {}\n".format(simulations) +
          "episodes: {}\n".format(episodes) +
          "alpha: {}\n".format(alpha) +
          "theta_i: {}\n".format(theta_i) + 
          "pi_i: {}\n".format(np.array([calc_pi(0, theta_i), calc_pi(1, theta_i)])))
    
    base_t_i = time.time()
    t_i = time.time()
    
    for simulation in range(simulations): 
        theta = copy.deepcopy(theta_i) # Note setting theta = theta_i just points theta to theta_i
                                       # which biases simulations beyond the first one
        
        for episode in range(episodes):
            states, actions = polgrad_episode(theta)
            theta_history = np.full((len(states), 2), np.nan) # For debugging
            G = - len(states) # G_0 is total steps until episode end
            rewards[simulation, episode] = G
            pi_0 = calc_pi(0, theta)
            pi[simulation, episode] = np.array([pi_0, 1 - pi_0]) 

            # Don't really need to loop through actions
            i = 0
            for s, a in zip(states, actions):
                prev_theta = copy.deepcopy(theta)
                theta_history[i] = prev_theta
                theta += alpha * G * calc_del_log_pi(a, theta)
                
                if np.log10(calc_pi(0, theta)) < -4 or np.log10(calc_pi(1, theta)) < -4:
                    # Checks for divergence scenario explained above
                    # Logs useful details about it
                    print("Previous theta")
                    print(prev_theta)
                    print("Current theta")
                    print(theta)
                    print("G")
                    print(G)
                    print("Current pi")
                    print([calc_pi(0, theta), 1-calc_pi(0, theta)])
                    print("Current policy gradient")
                    print(calc_del_log_pi(a, theta))
                    print("Current policy update")
                    print(alpha * G * calc_del_log_pi(a, theta))
                    print("Current action")
                    print(a)
                    print("Current state")
                    print(s)
                    print("Theta history")
                    print(np.concatenate((np.array(theta_history), 
                                          np.array(states).reshape(-1,1), 
                                          np.array(actions).reshape(-1,1)), axis=1))
                    raise ValueError("Probabilities are becoming too low")
                
                G += 1 # Remove single -1 from G since we move from G_t -> G_{t+1}
                
        if (simulation + 1) % 10 == 0:
            t_f = time.time()
            delta_t = round(t_f - t_i, 2)
            print("10 SIMULATIONS COMPLETED, TOTAL TIME: {} s".format(delta_t))
            t_i = t_f
            
    base_delta_t = round(time.time() - base_t_i, 2)
    print("\nALL SIMULATIONS COMPLETED, TOTAL TIME: {} s\n".format(base_delta_t))
            
    return rewards, pi

We note a couple of observations about the simulations in S&B:

+ The expected value of $G_0$ for the first episode seems to be around $-90$ which would imply a $\mathbf{\theta}$ such that the probability of taking a right is either $\sim 4\%$ or $\sim 97\%$. This follows from our solution to exercise 13.1 regarding the optimal probability. We achieve this approximately by setting $\mathbf{\theta} = [0, 4]^T, [4, 0]^T$.

    + Interestingly, it seems that increasing the difference between the components of $\mathbf{\theta}$ can increase the likelihood of $\pi$ diverging. A feedback cycle can start which drives the probability of taking a right to 1, which then makes episodes longer and longer, etc. 
    
    + We think this happens because we get a lot of feedback that "right is the wrong action": right has a high probability initially, but it tends to send the agent back to state 0 until we happen to pick left by chance in state 1. This feedback shifts the policy far towards taking left with high probability, which leads to the same problem in reverse and swings the probability back towards right (likely with a higher probability than before). Eventually either we get a NaN error because of exponential overflow, or the episodes get so long that they do not end in a reasonable time.
    
    + There are three possible solutions. We attempted both (2) and (3) with successful results below: 
    
        1) Avoid overflow errors by improving the numerical simulation framework. In theory the learning should correct itself if we can avoid numerical errors. However, a sufficiently small learning rate **may also** be needed.
        
        2) Reduce the learning rate so that we do not overshoot the minimum within an episode.
        
        3) Make the initial theta less extreme so this overly-strong feedback does not lead to the aforementioned instability.
        
    + We're not sure how the authors used the given learning rates and got $-90$ for the initial expected reward on episode one -- it could have been a perfectly balanced choice of $\alpha$ and $\mathbf{\theta}_i$. We're even less sure about how their $\alpha=2^{-12}$ performed the worst given their expected initial reward. Our results show the opposite.


+ There is a lot of variance in the solutions we have obtained by setting $\alpha = 2^{-12}, 2^{-13}, 2^{-14}$. We believe the learning rate in the book may be overstated so we're trying different learning rates with larger spacings between them.
    + It turns out that going further down slows learning so much that it basically doesn't happen in the first 1000 episodes.
    + We brought up the variance because we thought it meant the learning rate was too high, and that this was why our average rewards seemed to converge so quickly to the optimum. However, the actual reason it was learning so fast is that we were not resetting $\mathbf{\theta}_i$ correctly. $\mathbf{\theta}$ was being initialized to the previously learned value at each simulation start! This was biasing the results for simulations beyond the first one, which is why our average reward curve was flat while a single reward curve had the right (noisy) shape.

There is also a possibility that there is an error with our code and that is causing issues with our reward predictions. However, when we eliminate the flip of the actions on $s=1$ the curves we get make sense. Thus it seems that most of the code here should be working correctly...
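
One way to probe whether the simulator itself is at fault is to compare it against the exact value of a fixed policy. The sketch below is not part of the original notebook: `expected_return` and `simulate_return` are hypothetical helpers that re-implement the corridor dynamics used in `polgrad_episode`, and the conversion from $\mathbf{\theta}_i$ to a probability assumes the preferences are simply $h(a) = \theta_a$, so that $P(\text{right})$ is a sigmoid of $\theta_0 - \theta_1$.

```python
import numpy as np

def expected_return(p_right):
    """Exact v(S_0) when 'right' is chosen with probability p_right in every state."""
    p, q = p_right, 1.0 - p_right
    # Transition matrix between the non-terminal states 0, 1, 2.
    # State 1 has reversed actions: 'right' moves to 0 and 'left' moves to 2.
    P = np.array([[q,   p,   0.0],   # s=0: left stays at 0, right goes to 1
                  [p,   0.0, q  ],   # s=1: right goes to 0, left goes to 2
                  [0.0, q,   0.0]])  # s=2: left goes to 1, right terminates
    # Bellman equations with reward -1 per step: v = -1 + P v
    v = np.linalg.solve(np.eye(3) - P, -np.ones(3))
    return v[0]

def simulate_return(p_right, rng):
    """Sample one episode under the same dynamics; returns G_0 = -(number of steps)."""
    s, steps = 0, 0
    while s < 3:
        steps += 1
        go_right = rng.random() < p_right
        if s == 1:                      # left and right are flipped in state 1
            go_right = not go_right
        s = s + 1 if go_right else max(s - 1, 0)
    return -steps

rng = np.random.default_rng(0)
p_opt = 2 - np.sqrt(2)                  # optimal P(right), about 0.58579
p_init = 1 / (1 + np.exp(-3.737))       # P(right) for theta_i = [3.737, 0] under the assumption above

print("v(S_0) at the optimal policy:", expected_return(p_opt))
print("v(S_0) at theta_i = [3.737, 0]:", expected_return(p_init))

mc = np.mean([simulate_return(0.5, rng) for _ in range(20000)])
print("Monte Carlo vs exact at P(right) = 0.5:", mc, expected_return(0.5))
```

If the Monte Carlo mean and the exact value agree, the environment logic is probably fine, and any mismatch with the book is more likely to come from the choice of $\alpha$ or $\mathbf{\theta}_i$.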

In [None]:
# Why are we getting overflow errors, binomial parameter errors, or the code freezing altogether???
import sys
np.set_printoptions(threshold=sys.maxsize)

simulations = 100
episodes = 1000
alphas = [2**(-13), 2**(-14), 2**(-15)]

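# Note: the _12/_13/_14 suffixes below are just labels for alphas[0], alphas[1], alphas[2],
# i.e. 2**(-13), 2**(-14), 2**(-15) in this cell
# We set theta this way to attain an expected reward of about -90 for the first episode
# To avoid divergence this required starting alpha at 2**(-13) or smaller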
rewards_12_full, pi_12_full = reinforce(simulations, episodes, alphas[0], np.array([3.737, 0], dtype=float))
rewards_13_full, pi_13_full = reinforce(simulations, episodes, alphas[1], np.array([3.737, 0], dtype=float))
rewards_14_full, pi_14_full = reinforce(simulations, episodes, alphas[2], np.array([3.737, 0], dtype=float))

rewards_12, pi_12 = np.mean(rewards_12_full, axis=0), np.mean(pi_12_full[:,-1,:], axis=0)
rewards_13, pi_13 = np.mean(rewards_13_full, axis=0), np.mean(pi_13_full[:,-1,:], axis=0)
rewards_14, pi_14 = np.mean(rewards_14_full, axis=0), np.mean(pi_14_full[:,-1,:], axis=0)

In [None]:
figure = plt.figure(figsize=(20, 10))

plt.plot(rewards_12, label=r"$ \alpha=2^{-13}$", zorder=0)
plt.plot(rewards_13, label=r"$ \alpha=2^{-14}$", zorder=1)
plt.plot(rewards_14, label=r"$ \alpha=2^{-15}$", zorder=2)
plt.hlines(-11.6, 0, 1000, color="black", linestyle="dashed", linewidth=5, label="$v_{*}(S_0)$", zorder=3)

plt.xlabel("Episode", size=30)
plt.xticks(size=20)
plt.ylabel("$G_0$", size=30)
plt.yticks(size=20)
plt.legend(prop={"size": 20})
plt.title("Total Reward On Episodes - Average Over 100 Runs", size=30)


plt.tight_layout()

Note that the probability attained doesn't quite reach the optimal value! More learning time would be required under the chosen parameters. However, to match the graph in the book (i.e. the expected value in the first episode), it looks like we could have reduced $\mathbf{\theta}_i$ a little bit.

It's worth noting in other runs we got quite close to the optimal probability value under these parameters. This means that the average of 100 runs still has significant variance. More runs would be needed to get a better estimate of the performance of these parameters.
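
To put a number on that run-to-run variance, the standard error of the mean across runs can be reported next to each average. This is a minimal sketch with a hypothetical helper `mean_and_sem`; the random array is only a stand-in for something like `pi_12_full[:, -1, 0]` and is not real output from our simulations.

```python
import numpy as np

def mean_and_sem(per_run_values):
    """Mean over runs (axis 0) and the standard error of that mean."""
    values = np.asarray(per_run_values, dtype=float)
    n_runs = values.shape[0]
    mean = values.mean(axis=0)
    # ddof=1 gives the unbiased sample standard deviation across runs
    sem = values.std(axis=0, ddof=1) / np.sqrt(n_runs)
    return mean, sem

# Placeholder data: 100 made-up final values of P(right), one per run
rng = np.random.default_rng(0)
fake_final_p_right = rng.uniform(0.4, 0.8, size=100)

mean, sem = mean_and_sem(fake_final_p_right)
print("P(right) = {:.3f} +/- {:.3f} (standard error over {} runs)".format(
    mean, sem, len(fake_final_p_right)))
```

Because the standard error shrinks with the square root of the number of runs, halving it would take roughly four times as many simulations.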

In [None]:
figure = plt.figure(figsize=(20, 10))

plt.plot(np.mean(pi_12_full[:,:,0], axis=0), label=r"$ \alpha=2^{-13}$", linewidth=4)
plt.plot(np.mean(pi_13_full[:,:,0], axis=0), label=r"$ \alpha=2^{-14}$", linewidth=4)
plt.plot(np.mean(pi_14_full[:,:,0], axis=0), label=r"$ \alpha=2^{-15}$", linewidth=4)
plt.hlines(0.58579, 0, 1000, linestyle="dashed", label="$P_{*}(Right)$", linewidth=4)
plt.ylim(0, 1)

plt.xlabel("Episode", size=30)
plt.xticks(size=20)
plt.ylabel("$P(Right)$", size=30)
plt.yticks(size=20)
plt.legend(prop={"size": 20})
plt.title("Probability Of Taking Right - Average Over 100 Runs", size=30)

plt.tight_layout()

In [None]:
print("Probability of taking right for -13: {}".format(np.mean(pi_12_full[:,-1,0], axis=0)))
print("Probability of taking right for -14: {}".format(np.mean(pi_13_full[:,-1,0], axis=0)))
print("Probability of taking right for -15: {}".format(np.mean(pi_14_full[:,-1,0], axis=0)))

We re-run the experiment but with a slightly smaller $\mathbf{\theta}$ differential and with the learning rates specified in the book. Note the expected initial reward is higher but still pretty close to that pictured in the book! 

Although it's much less likely that divergence happens under these parameters, there is still some probability of it happening. We've had it occur in some instances with $\alpha=2^{-12}$ in the runs below... The more correct approach would be to decrease $\alpha$ and to have the learning occur over more episodes. But we want a direct comparison to the approach taken in the book.

The fact that divergence is probabilistic makes us think it is due to $\alpha$ not being small enough to guarantee the theoretical conditions necessary for convergence. The choice of $\alpha$ is not as clear in policy gradients as in other approaches because in practice each update uses only a noisy sample of a quantity proportional to the true gradient, not the exact gradient.
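
The overflow part of the problem can be separated from the learning-rate part by making the softmax itself overflow-proof. The sketch below is an assumption about how that could be done, not what the notebook currently does: `stable_softmax` is a hypothetical replacement that takes the vector of action preferences directly and subtracts the largest one before exponentiating, which leaves the probabilities unchanged mathematically.

```python
import numpy as np

def stable_softmax(h):
    """Softmax over action preferences h, shifted by max(h) so np.exp cannot overflow."""
    h = np.asarray(h, dtype=float)
    shifted = h - np.max(h)      # largest entry becomes 0, all others are negative
    weights = np.exp(shifted)    # values in [0, 1]; the largest is exactly 1
    return weights / weights.sum()

# Same policy as the unshifted formula for moderate preferences
print(stable_softmax([3.737, 0.0]))

# Still finite for extreme preferences, where np.exp(1000.0) alone overflows to inf
print(stable_softmax([1000.0, 0.0]))
```

This only removes the NaN values. A policy that saturates at $P(\text{right}) = 1$ still bounces between states 0 and 1 indefinitely, so a smaller $\alpha$ or a less extreme $\mathbf{\theta}_i$ would likely still be needed.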

In [None]:
# Why are we getting overflow errors, binomial parameter errors, or the code freezing altogether???
import sys
np.set_printoptions(threshold=sys.maxsize)

simulations = 100
episodes = 1000
alphas = [2**(-12), 2**(-13), 2**(-14)]

# With this choice for theta we can use the learning rates specified in the book without diverging
rewards_12_full, pi_12_full = reinforce(simulations, episodes, alphas[0], np.array([3.5, 0], dtype=float))
rewards_13_full, pi_13_full = reinforce(simulations, episodes, alphas[1], np.array([3.5, 0], dtype=float))
rewards_14_full, pi_14_full = reinforce(simulations, episodes, alphas[2], np.array([3.5, 0], dtype=float))

rewards_12, pi_12 = np.mean(rewards_12_full, axis=0), np.mean(pi_12_full[:,-1,:], axis=0)
rewards_13, pi_13 = np.mean(rewards_13_full, axis=0), np.mean(pi_13_full[:,-1,:], axis=0)
rewards_14, pi_14 = np.mean(rewards_14_full, axis=0), np.mean(pi_14_full[:,-1,:], axis=0)

In [None]:
figure = plt.figure(figsize=(20, 10))

plt.plot(rewards_12, label=r"$ \alpha=2^{-12}$", zorder=0)
plt.plot(rewards_13, label=r"$ \alpha=2^{-13}$", zorder=1)
plt.plot(rewards_14, label=r"$ \alpha=2^{-14}$", zorder=2)
plt.hlines(-11.6, 0, 1000, color="black", linestyle="dashed", linewidth=5, label="$v_{*}(S_0)$", zorder=3)

plt.xlabel("Episode", size=30)
plt.xticks(size=20)
plt.ylabel("$G_0$", size=30)
plt.yticks(size=20)
plt.legend(prop={"size": 20})
plt.title("Total Reward On Episodes - Average Over 100 Runs", size=30)


plt.tight_layout()

We nearly attain the optimal probability! We've seen this choice of parameters be more consistent (over admittedly only a couple of trials). This is a somewhat biased result since we've seen $2^{-12}$ diverge and only kept runs that succeeded, but we don't know what fraction of runs would be divergent and hence can't quantify the bias. 
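
The missing fraction could be estimated by counting failures instead of discarding them. A minimal sketch, assuming each trial is wrapped in a callable and that divergence surfaces as the `ValueError` raised inside `reinforce` (or as a `FloatingPointError` if NumPy is configured to raise); `divergence_rate` and `toy_run` are hypothetical names, and `toy_run` only imitates a run that fails on every fourth trial.

```python
def divergence_rate(run_once, n_trials):
    """Fraction of trials in which run_once(trial_index) raises a divergence error."""
    failures = 0
    for trial in range(n_trials):
        try:
            run_once(trial)
        except (ValueError, FloatingPointError):
            failures += 1
    return failures / n_trials

def toy_run(trial):
    """Stand-in for a single simulation: fails on trials 0, 4, 8, ..."""
    if trial % 4 == 0:
        raise ValueError("Probabilities are becoming too low")

rate = divergence_rate(toy_run, 20)
print("Estimated divergence rate: {}".format(rate))

# With the notebook's function this might look like (not executed here):
# divergence_rate(lambda i: reinforce(1, 1000, 2**(-12), np.array([3.5, 0.0])), 100)
```

Runs that freeze because episodes become extremely long would not be caught this way, so a cap on episode length would be needed to count those as well.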

In [None]:
figure = plt.figure(figsize=(20, 10))

plt.plot(np.mean(pi_12_full[:,:,0], axis=0), label=r"$ \alpha=2^{-12}$", linewidth=4)
plt.plot(np.mean(pi_13_full[:,:,0], axis=0), label=r"$ \alpha=2^{-13}$", linewidth=4)
plt.plot(np.mean(pi_14_full[:,:,0], axis=0), label=r"$ \alpha=2^{-14}$", linewidth=4)
plt.hlines(0.58579, 0, 1000, linestyle="dashed", label=r"$P_{*}(Right)$", linewidth=4)
plt.ylim(0, 1)

plt.xlabel("Episode", size=30)
plt.xticks(size=20)
plt.ylabel("$P(Right)$", size=30)
plt.yticks(size=20)
plt.legend(prop={"size": 20})
plt.title("Probability Of Taking Right - Average Over 100 Runs", size=30)

plt.tight_layout()

In [None]:
print("Probability of taking right for -12: {}".format(np.mean(pi_12_full[:,-1,0], axis=0)))
print("Probability of taking right for -13: {}".format(np.mean(pi_13_full[:,-1,0], axis=0)))
print("Probability of taking right for -14: {}".format(np.mean(pi_14_full[:,-1,0], axis=0)))

We believe the case $2^{-12}$ converges to a suboptimal solution for the authors because the updates in each episode overshoot the optimum. How they achieved this and simultaneously attained an expected reward on the first episode of about $-90$ is unclear to us. It seems to us you would need a $\mathbf{\theta}_i$ close to the optimum to begin with, along with a high $\alpha$, to fall into that scenario. At that point, however, the initial episode should have a better expectation than what is shown in their graph.

However, for our purposes the solution above is close enough and satisfactory!