# Policy-Based Methods


- Read Andrej Karpathy's well-known [blog post](http://karpathy.github.io/2016/05/31/rl/) on policy gradient methods.
- Follow this [Medium post](https://medium.com/@dhruvp/how-to-write-a-neural-network-to-play-pong-from-scratch-956b57d4f6e0) to implement a policy gradient method that wins at Pong.
- Many challenging problems can be found in [OpenAI's Request for Research](https://openai.com/blog/requests-for-research-2/).

In [None]:
from IPython.lib.display import YouTubeVideo

### Why Policy-Based Methods

In [None]:
YouTubeVideo('ToS8vXGdODE')

### Introduction

In [None]:
YouTubeVideo('mMnhi8yzwKk')

### Policy Function Approximation

In [None]:
YouTubeVideo('v8tGjlc2aG4')

# Black-Box Optimization

## Hill Climbing

In [None]:
YouTubeVideo('5E86a0OyVyI')

In [None]:
YouTubeVideo('0XzzqIXyax0')

The goal of the agent is to find the value of the policy network weights $\theta$ that maximizes expected return, which has been denoted by $J$.

In the hill climbing algorithm, each candidate value of $\theta$ is evaluated by the return $G$ collected in a single episode. Because of randomness in the environment (and in the policy, if it is stochastic), a second episode run with the same $\theta$ will likely yield a different return $G$. The sampled return $G$ is therefore only a noisy estimate of the expected return $J$, but it often turns out to be good enough in practice.
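
To make the gap between a sampled return $G$ and the expected return $J$ concrete, here is a minimal sketch on a toy problem. Both `expected_return` and `sample_return` are hypothetical stand-ins (a made-up objective plus Gaussian noise), not part of any environment used in this notebook, and only the standard library is used so the cell runs without Gym.

```python
import random

def expected_return(theta):
    # Hypothetical J(theta): a made-up concave objective peaking at theta = 2.
    return -(theta - 2.0) ** 2

def sample_return(theta, rng):
    # Hypothetical single-episode return G: J(theta) plus zero-mean noise,
    # standing in for randomness in the environment and the policy.
    return expected_return(theta) + rng.gauss(0.0, 1.0)

rng = random.Random(0)
theta = 0.5
g_first = sample_return(theta, rng)
g_second = sample_return(theta, rng)
n_samples = 10000
g_mean = sum(sample_return(theta, rng) for _ in range(n_samples)) / n_samples

print('J(theta)          :', expected_return(theta))
print('two single samples:', g_first, g_second)
print('mean of', n_samples, 'samples:', g_mean)
```

Two single-episode samples for the same $\theta$ differ, while the average over many episodes approaches $J(\theta)$. Averaging a few episodes per candidate is one way to reduce the noise, at the cost of more environment interaction.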

## Stochastic Policy Search

Although we do not know what the function $J(\theta)$ looks like, the _hill climbing_ algorithm lets us search for a value of $\theta$ that maximizes it. There are several improvements we can make to the hill climbing algorithm.

We refer to the general class of approaches that find $\displaystyle\arg\max_\theta J(\theta)$ through randomly perturbing the most recent best estimate as **stochastic policy search**.
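
The perturb-and-keep-the-best idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the method from the videos: `J` is a hypothetical, noise-free objective with its maximum at $(1, -2)$, and `stochastic_search` is a name introduced here.

```python
import random

def J(theta):
    # Hypothetical objective with its maximum (value 0) at theta = (1, -2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

def stochastic_search(n_iters=2000, noise_scale=0.1, seed=0):
    rng = random.Random(seed)
    best_theta = [0.0, 0.0]
    best_J = J(best_theta)
    for _ in range(n_iters):
        # Randomly perturb the most recent best estimate.
        candidate = [t + noise_scale * rng.gauss(0.0, 1.0) for t in best_theta]
        cand_J = J(candidate)
        # Keep the candidate only if it scores higher.
        if cand_J > best_J:
            best_theta, best_J = candidate, cand_J
    return best_theta, best_J

best_theta, best_J = stochastic_search()
print('best theta:', best_theta)
print('best J    :', best_J)
```

Because a candidate is accepted only when it improves the objective, `best_J` never decreases. With a noisy return in place of `J`, the same loop can accept a lucky but poor candidate, which is one reason the variants in this section exist.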

In [None]:
YouTubeVideo('QicxmyE5vTo')

## Cross-Entropy Method (CEM)

In [None]:
YouTubeVideo('2poDljPvY58')

Another very popular family of black-box optimization techniques is [evolution strategies](https://blog.openai.com/evolution-strategies). A well-written implementation is [here](https://github.com/alirezamika/evostra). To see how to apply it to an OpenAI Gym (BipedalWalker) environment, check out [this repository](https://github.com/alirezamika/bipedal-es).

To see one way to structure the analysis, check out [this blog post](http://kvfrans.com/simple-algoritms-for-solving-cartpole/), along with the [accompanying code](https://github.com/kvfrans/openai-cartpole). For instance, you will likely find that _hill climbing_ is very unstable: the number of episodes it takes to solve `CartPole-v0` varies greatly with the random seed.
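
The seed sensitivity mentioned above can be seen even on a toy problem. The sketch below is not a CartPole experiment: `episodes_to_solve` is a hypothetical stand-in that counts how many hill-climbing steps a one-dimensional search needs to get within a tolerance of a made-up optimum at 3.0, repeated over several seeds.

```python
import random

def episodes_to_solve(seed, noise_scale=0.5, tol=0.05, max_iters=5000):
    # Hypothetical stand-in for "episodes needed to solve the task":
    # count hill-climbing steps until theta is within tol of the optimum.
    rng = random.Random(seed)
    theta = 0.0
    for i in range(1, max_iters + 1):
        candidate = theta + noise_scale * rng.gauss(0.0, 1.0)
        # Accept the candidate only if it is closer to the optimum (higher J).
        if abs(candidate - 3.0) < abs(theta - 3.0):
            theta = candidate
        if abs(theta - 3.0) < tol:
            return i
    return max_iters

counts = [episodes_to_solve(seed) for seed in range(10)]
print('steps per seed:', counts)
print('min / max     :', min(counts), '/', max(counts))
```

Reporting the spread over seeds, rather than a single run, gives a fairer picture when comparing search methods.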

---

## Hill Climbing with Adaptive Noise Scaling

We will train an agent using hill climbing with adaptive noise scaling on OpenAI Gym's CartPole environment.
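
The "adaptive" part is a simple schedule for the perturbation size: shrink it after an improvement to search near the current best, and grow it after a failure to search more widely. The sketch below isolates that rule using the bounds that appear in this notebook's training cell (`1e-3` and `2`). The helper name `update_noise_scale` is hypothetical and does not exist in the notebook.

```python
def update_noise_scale(noise_scale, improved, min_scale=1e-3, max_scale=2.0):
    # Hypothetical helper isolating the schedule:
    # halve the scale after an improvement, double it otherwise, within bounds.
    if improved:
        return max(min_scale, noise_scale / 2)
    return min(max_scale, noise_scale * 2)

scale = 1e-2
for improved in [False, False, False, True]:
    scale = update_noise_scale(scale, improved)
    print('improved' if improved else 'no gain ', '-> noise_scale =', scale)
```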

### Import the Necessary Packages

In [None]:
import gym
import math
import numpy as np
from collections import deque
import matplotlib.pyplot as plt
%matplotlib inline

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()

is_ipython = 'inline' in plt.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

### Define the Policy

In [None]:
env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

class Policy():
    def __init__(self, s_size=4, a_size=2):
        self.w = 1e-4*np.random.rand(s_size, a_size) # weights for simple linear policy: state_space x action_space
        
    def forward(self, state):
        x = np.dot(state, self.w)
        return np.exp(x)/sum(np.exp(x))
    
    def act(self, state):
        probs = self.forward(state)
        # action = np.random.choice(2, p=probs) # option 1: stochastic policy
        action = np.argmax(probs)               # option 2: deterministic policy
        return action
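
`Policy.forward` turns the linear scores into action probabilities with a softmax. With the tiny initial weights used here the direct formula is fine, but exponentiating large scores can overflow. A common remedy, sketched below with the standard library only, is to subtract the maximum score first, which leaves the result unchanged. The function name `softmax` is introduced for illustration.

```python
import math

def softmax(logits):
    # Subtracting the max does not change the output but keeps exp() bounded.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Scores this large would overflow math.exp() if used directly.
probs = softmax([1000.0, 1001.0])
print('probabilities:', probs)
print('sum          :', sum(probs))
```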

### Train the Agent with Stochastic Policy Search

In [None]:
env.seed(0)
np.random.seed(0)

policy = Policy()

def hill_climbing(n_episodes=1000, max_t=1000, gamma=1.0, print_every=10, noise_scale=1e-2):
    '''
    Implementation of hill climbing with adaptive noise scaling.
        
    Params
    ======
        n_episodes (int): maximum number of training episodes
        max_t (int): maximum number of timesteps per episode
        gamma (float): discount rate
        print_every (int): how often to print average score (over last 100 episodes)
        noise_scale (float): initial scale of the additive noise (halved or doubled during training)
    '''
    scores_deque = deque(maxlen=100)
    scores = []
    best_R = -np.Inf
    best_w = policy.w.copy()
    for i_episode in range(1, n_episodes+1):
        rewards = []
        state = env.reset()
        for t in range(max_t):
            action = policy.act(state)
            state, reward, done, _ = env.step(action)
            rewards.append(reward)
            if done:
                break 
        scores_deque.append(sum(rewards))
        scores.append(sum(rewards))

        discounts = [gamma**i for i in range(len(rewards))]
        R = sum([a*b for a,b in zip(discounts, rewards)])

        if R >= best_R: # found better weights
            best_R = R
            best_w = policy.w.copy()  # copy, so the in-place += below does not also change best_w
            noise_scale = max(1e-3, noise_scale / 2)
            policy.w += noise_scale * np.random.rand(*policy.w.shape) 
        else:          # did not find better weights
            noise_scale = min(2, noise_scale * 2)
            policy.w = best_w + noise_scale * np.random.rand(*policy.w.shape)

        if i_episode % print_every == 0:
            print('Episode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
        if np.mean(scores_deque)>=195.0:
            print('Environment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
            policy.w = best_w
            break
    return scores

scores = hill_climbing()
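
The quantity `R` compared against `best_R` in the training loop is the discounted return of the episode, $R = \sum_t \gamma^t r_t$. A minimal standalone sketch, with `discounted_return` as a name introduced here for illustration:

```python
def discounted_return(rewards, gamma):
    # Sum of gamma**t * r_t over the time steps of one episode.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]
print('gamma = 1.0:', discounted_return(rewards, 1.0))
print('gamma = 0.5:', discounted_return(rewards, 0.5))
```

With `gamma=1.0`, the default used above, the discounted return is the plain sum of rewards, which in CartPole equals the number of time steps the pole stayed up.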

### Plot the Scores

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

### Watch a Smart Agent!

In [None]:
state = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))
for t in range(4000):
    action = policy.act(state)
    img.set_data(env.render(mode='rgb_array')) 
    plt.axis('off')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    state, reward, done, _ = env.step(action)
    if done:
        break 

In [None]:
env.close()

---

## Cross-Entropy Method

We will train an agent with the cross-entropy method on OpenAI Gym's MountainCarContinuous environment.

### Instantiate the Environment and Agent

In [None]:
ENVIRONMENT = 'MountainCarContinuous-v0'
MODEL_FILE = './models/cem-mountain-car-continuous.pt'
RANDOM_SEED = 101

In [None]:
env = gym.make(ENVIRONMENT)

env.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

print('observation space:', env.observation_space)
print('action space:', env.action_space)
print('\t- low:', env.action_space.low)
print('\t- high:', env.action_space.high)

class Agent(nn.Module):
    def __init__(self, env, h_size=16):
        super(Agent, self).__init__()
        self.env = env
        # state, hidden layer, action sizes
        self.s_size = env.observation_space.shape[0]
        self.h_size = h_size
        self.a_size = env.action_space.shape[0]
        # define layers
        self.fc1 = nn.Linear(self.s_size, self.h_size)
        self.fc2 = nn.Linear(self.h_size, self.a_size)
        
    def set_weights(self, weights):
        s_size = self.s_size
        h_size = self.h_size
        a_size = self.a_size
        # separate the weights for each layer
        fc1_end = (s_size*h_size)+h_size
        fc1_W = torch.from_numpy(weights[:s_size*h_size].reshape(s_size, h_size))
        fc1_b = torch.from_numpy(weights[s_size*h_size:fc1_end])
        fc2_W = torch.from_numpy(weights[fc1_end:fc1_end+(h_size*a_size)].reshape(h_size, a_size))
        fc2_b = torch.from_numpy(weights[fc1_end+(h_size*a_size):])
        # set the weights for each layer
        self.fc1.weight.data.copy_(fc1_W.view_as(self.fc1.weight.data))
        self.fc1.bias.data.copy_(fc1_b.view_as(self.fc1.bias.data))
        self.fc2.weight.data.copy_(fc2_W.view_as(self.fc2.weight.data))
        self.fc2.bias.data.copy_(fc2_b.view_as(self.fc2.bias.data))
    
    def get_weights_dim(self):
        return (self.s_size+1)*self.h_size + (self.h_size+1)*self.a_size
        
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        return x.cpu().data
        
    def evaluate(self, weights, gamma=1.0, max_t=5000):
        self.set_weights(weights)
        episode_return = 0.0
        state = self.env.reset()
        for t in range(max_t):
            state = torch.from_numpy(state).float().to(device)
            action = self.forward(state)
            state, reward, done, _ = self.env.step(action)
            episode_return += reward * math.pow(gamma, t)
            if done:
                break
        return episode_return
    
agent = Agent(env).to(device)
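
`set_weights` expects one flat vector holding every parameter of the network, laid out as the `fc1` weights, the `fc1` biases, the `fc2` weights, and the `fc2` biases, and `get_weights_dim` gives its length. The sketch below reproduces that bookkeeping on plain lists. The helper names are hypothetical, and the sizes assume the 2-dimensional observation and 1-dimensional action of `MountainCarContinuous-v0` with the default `h_size=16`.

```python
def weights_dim(s_size, h_size, a_size):
    # Weights and biases of fc1, then weights and biases of fc2.
    return (s_size + 1) * h_size + (h_size + 1) * a_size

def split_flat_weights(flat, s_size, h_size, a_size):
    # Slice a flat list in the same order as Agent.set_weights.
    fc1_end = s_size * h_size + h_size
    fc2_w_end = fc1_end + h_size * a_size
    return {
        'fc1_W': flat[:s_size * h_size],
        'fc1_b': flat[s_size * h_size:fc1_end],
        'fc2_W': flat[fc1_end:fc2_w_end],
        'fc2_b': flat[fc2_w_end:],
    }

s_size, h_size, a_size = 2, 16, 1
flat = list(range(weights_dim(s_size, h_size, a_size)))
parts = split_flat_weights(flat, s_size, h_size, a_size)
print('total parameters:', len(flat))
print({name: len(chunk) for name, chunk in parts.items()})
```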

### Train the Agent with a Cross-Entropy Method

Run the code cell below to train the agent from scratch.  Alternatively, you can skip to the next code cell to load the pre-trained weights from file.

In [None]:
def cem(n_iterations=500, max_t=1000, gamma=1.0, print_every=10, pop_size=50, elite_frac=0.2, sigma=0.5):
    '''
    PyTorch implementation of a cross-entropy method.
        
    Params
    ======
        n_iterations (int): maximum number of training iterations
        max_t (int): maximum number of timesteps per episode
        gamma (float): discount rate
        print_every (int): how often to print average score (over last 100 iterations)
        pop_size (int): size of population at each iteration
        elite_frac (float): fraction of top performers to use in update
        sigma (float): standard deviation of additive noise
    '''
    n_elite=int(pop_size*elite_frac)

    scores_deque = deque(maxlen=100)
    scores = []
    best_weight = sigma*np.random.randn(agent.get_weights_dim())

    for i_iteration in range(1, n_iterations+1):
        weights_pop = [best_weight + (sigma*np.random.randn(agent.get_weights_dim())) for i in range(pop_size)]
        rewards = np.array([agent.evaluate(weights, gamma, max_t) for weights in weights_pop])

        elite_idxs = rewards.argsort()[-n_elite:]
        elite_weights = [weights_pop[i] for i in elite_idxs]
        best_weight = np.array(elite_weights).mean(axis=0)

        reward = agent.evaluate(best_weight, gamma=1.0)
        scores_deque.append(reward)
        scores.append(reward)
        
        if i_iteration % print_every == 0:
            print('Iteration {}\tAverage Score: {:.2f}'.format(i_iteration, np.mean(scores_deque)))

        if np.mean(scores_deque)>=90.0:
            print('\nEnvironment solved in {:d} iterations!\tAverage Score: {:.2f}'.format(i_iteration, np.mean(scores_deque)))
            torch.save(agent.state_dict(), MODEL_FILE)
            break
    return scores

scores = cem()
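
Stripped of the environment and the network, the loop in `cem` is: sample a population around the current mean, keep the top `elite_frac`, and move the mean to the average of the elites. The sketch below applies the same steps to a hypothetical toy objective `J` with its maximum at 3.0 in every coordinate. Like the cell above, it keeps `sigma` fixed rather than re-estimating it from the elites, and the name `cem_toy` is introduced here for illustration.

```python
import random

def J(theta):
    # Hypothetical objective, maximized (value 0) when every coordinate is 3.0.
    return -sum((t - 3.0) ** 2 for t in theta)

def cem_toy(dim=2, n_iterations=50, pop_size=50, elite_frac=0.2, sigma=0.5, seed=0):
    rng = random.Random(seed)
    n_elite = int(pop_size * elite_frac)
    mean = [0.0] * dim
    for _ in range(n_iterations):
        # Sample a population of candidates around the current mean.
        population = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(pop_size)]
        # Rank by score (ascending) and keep the highest-scoring candidates.
        population.sort(key=J)
        elites = population[-n_elite:]
        # The new mean is the coordinate-wise average of the elites.
        mean = [sum(e[i] for e in elites) / n_elite for i in range(dim)]
    return mean

mean = cem_toy()
print('final mean:', mean)
print('J(mean)   :', J(mean))
```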

### Plot the Scores

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

### Watch a Smart Agent!

Load the trained weights from file to watch a smart agent!

In [None]:
map_location = (lambda storage, loc: storage.cuda()) if torch.cuda.is_available() else 'cpu'
agent.load_state_dict(torch.load(MODEL_FILE, map_location=map_location))

state = env.reset()
img = plt.imshow(env.render(mode='rgb_array'))
while True:
    state = torch.from_numpy(state).float().to(device)
    with torch.no_grad():
        action = agent(state)
    img.set_data(env.render(mode='rgb_array')) 
    plt.axis('off')
    display.display(plt.gcf())
    display.clear_output(wait=True)
    next_state, reward, done, _ = env.step(action)
    state = next_state
    if done:
        break

In [None]:
env.close()

---

Next: [Policy Gradient Methods](./Policy%20Gradient%20Methods.ipynb)