# Policy Gradients with Baseline: DOOM

In this notebook, we will create a Policy Gradient based agent that tries to survive in an hostile environement by collecting health.
<br>
We will use Policy Gradient with Baseline method <b>which is an approach to design the policy gradient algorithm</b>

TODO : Add doom gif

## Aknowledgements
Our implementation will be inspired by this tutorial:
https://www.oreilly.com/ideas/reinforcement-learning-with-tensorflow

## A recap: Reinforcement Learning Process 🎮

<img src="assets/rl.png" alt="Reinforcement Learning process"/>

Reinforcement Learning: is when an agent learns by interacting with the environement itself (through trial and error) it receives reward when performing correct actions. It's a decision making problem.

<br>
The Reinforcement Learning loop:

- Agent receive state S0 from the environment
- Based on that state S0 agent take an action A0
- Environnement transitions to a new state S1
- Give some reward R1 to the agent
<br>
<br>
→ This output a sequence of state, reward and action.<br>
→ The goal of the agent <b>is maximize expected cumulative reward in order to reach an optimal policy (way to behave).</b>

### Recap: What is policy gradient? 🤖

- Policy Gradient is a policy based reinforcement learning method: in this method, we want <b>to directly to learn an π* by optimize it without worrying about a value function, we’ll directly parameterize the π and do gradient descent into a direction that improves it.</b>
<br><br>
- Why? Because it's sometimes easier to approximate than the value function. Also, we need a parameterized policy to deal with continuous action spaces and environments where we need to act stochastically.
<br><br>
- Common choices for the policy function: Softmax for discrete actions, Gaussian parameters for continuous actions.
<br><br>
- To measure the quality of π, we calculate the objective/score function `J(theta)` : we can use 3 differents methods: start value, average value, or average reward per time step. In our case by calculating the maximum expected cumulative reward (we can use the start value : use the mean of the return from the first time step (G1) == cumulative discounted reward for the entire episode).
<br>
<img style="width: 400px;" src="assets/objective.png" alt="Objective function"/>
<br>
<br>
- Then, we have an objective function. Now, we want to find a π that max it. We Using gradient ascent by computing the gradients analytically: Policy Gradient Theorem: `grad(J(theta)) = Ex[grad(log(pi(s, a))) * Q(s, a)]`. Basically, we move our policy into a direction of more reward.
<br>
<img style="width: 600px;" src="assets/policygrad.png" alt="Policy gradients"/>
<br>
<br>
- REINFORCE (Monte Carlo Policy Gradient): We substitute a samples return `g_t` form an episode for Q(s, a) to make an update. Unbiased but high variance.
<img src="assets/montecarlo.png" alt="Monte Carlo"/>
<br><br>
    For each episode:
        At each time step within that episode:
            Compute the log probabilities produced by our policy function.
            Multiply it by the score function.
            Update the weights with some small learning rate alpha
<br>    <br>     


## Policy Gradients with Baseline method 👾
- Baseline: Instead of measuring the absolute goodness of an action we want to know how much better than "average" it is to take an action given a state. E.g. some states are naturally bad and always give negative reward. This is called the advantage and is defined as `Q(s, a) - V(s)`. We use that for our policy update, e.g. `g_t - V(s)` for REINFORCE.

# Let's implement our Policy Based Agent 🕹️

<img src="assets/doom.gif" alt="Doom game"/>
Environement:
- The player is standing on top of acid water.
- He needs to learn how to navigate and <b>collect health packs to stay alive</b>.

### Libraries 📚

In [1]:
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
from vizdoom import *
import skimage
from skimage import transform

### Preprocessing functions ⚙️
We have 2 preprocessing functions:
- discount: discount rewards and normalize them. This function evaluates a set of rewards from an episode + add normalization to our discounted rewards to make sure our reward range stays small.
- preprocess_frame: gray and resize our frame to 84x84x1

In [2]:
def discount(r, gamma, normal):
    discount = np.zeros_like(r)
    G = 0.0
    for i in reversed(range(0, len(r))):
        G = G * gamma + r[i]
        discount[i] = G
    # Normalize 
    if normal:
        mean = np.mean(discount)
        std = np.std(discount)
        discount = (discount - mean) / (std)
    return discount

"""# Preprocess the frame
## Reshape function:
# Processes Doom screen image to produce cropped and resized image. 
def preprocess_frame(frame):
    s = frame[10:-10,30:-30]
    s = scipy.misc.imresize(s,[84,84])
    s = np.reshape(s,[np.prod(s.shape)]) / 255.0
    return s
"""
def preprocess_frame(frame):
    # Greyscale Frame
    x = np.mean(frame,-1)
    # Normalize Pixel Values
    x = x/255
    x = transform.resize(x, [84,84])
    return(x)

### Environement 🎮
We use vizdoom that simulate a doom-like game

In [3]:
game = DoomGame()
game.load_config("health_gathering.cfg")
game.set_doom_scenario_path("health_gathering.wad")
game.init()
actions_list  = np.identity(3,dtype=bool).tolist()
render = False

### Hyperparameters 📊

In [4]:
## Model hyperparameters
action_size = 3
kernel = 8
strides = 2
padding = "VALID"

## Tune hyperparameters
alpha = 1e-4
gamma = 0.99 # Discount rate
normalize_reward = True

value_scale = 0.5
entropy_scale = 0.00
gradient_clip = 40

## Folders
save_path="models/healthGather.ckpt" # Place to save our model
episode_path = "episodes/"

## Training
num_epochs = 500
batch_size = 5000

### Neural Network 🧠
- Takes as input a state frame.
    - Implies preprocess the frame
- Output a probability distribution

We need as hyperparameters:
- action size = nb of probabilities
- kernel
- strides
- padding

In [5]:
# Our placeholders
# Our frame is 84x84x1
x = tf.placeholder(tf.float32, (None,84,84,1), name="inputs")
y = tf.placeholder(tf.int32, (None), name="actions")
# Compute the one hot vectors for each action given
actions_onehot = tf.one_hot(y ,action_size,dtype=tf.int32)
r = tf.placeholder(tf.float32, (None,), name='reward')
n = tf.placeholder(tf.float32, (None), name='episodes')
d_r = tf.placeholder(tf.float32, (None,), name='discounted_reward')

In [6]:
# POLICY NETWORK
# Input is 128*128*1
conv1 = tf.layers.conv2d(inputs = x,
                                 filters = 32,
                                 kernel_size = kernel,
                                 strides = strides,
                                 padding = padding,
                                kernel_initializer=tf.truncated_normal_initializer(stddev=0.02),
                                 name = "conv1")
        
conv1_batchnorm = tf.layers.batch_normalization(conv1,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm1')

conv1_out = tf.nn.elu(conv1_batchnorm, name="conv1_out")
# 64x64x32
        
conv2 = tf.layers.conv2d(inputs = conv1_out,
                                 filters = 64,
                                 kernel_size = kernel,
                                 strides = strides,
                                 padding = padding,
                                kernel_initializer=tf.truncated_normal_initializer(stddev=0.02),
                                 name = "conv2")
        
conv2_batchnorm = tf.layers.batch_normalization(conv2,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm2')

conv2_out = tf.nn.elu(conv2_batchnorm, name="conv2_out")
# 32x32x64
        
        
conv3 = tf.layers.conv2d(inputs = conv2_out,
                                 filters = 128,
                                 kernel_size = kernel,
                                 strides = strides,
                                 padding = padding,
                                kernel_initializer=tf.truncated_normal_initializer(stddev=0.02),
                                 name = "conv3")
        
conv3_batchnorm = tf.layers.batch_normalization(conv3,
                                                   training = True,
                                                   epsilon = 1e-5,
                                                     name = 'batch_norm3')

conv3_out = tf.nn.elu(conv3_batchnorm, name="conv3_out")
# 16x16x128
        
flatten = tf.layers.flatten(conv3_out)

fc1 = tf.layers.dense(inputs = flatten,
                                  units = 512,
                                  activation = tf.nn.elu,
                                name="fc1")


logits = tf.layers.dense(inputs = fc1,
                                  units = action_size,
                                  activation = None,
                        name = "logits")

value = tf.layers.dense(
        inputs=fc1, 
        units = 1, 
        name='value')
        
calc_action = tf.multinomial(logits, 1)
aprob = tf.nn.softmax(logits)
action_logprob = tf.nn.log_softmax(logits)

In [7]:
tf.trainable_variables()

[<tf.Variable 'conv1/kernel:0' shape=(8, 8, 1, 32) dtype=float32_ref>,
 <tf.Variable 'conv1/bias:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'batch_norm1/gamma:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'batch_norm1/beta:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'conv2/kernel:0' shape=(8, 8, 32, 64) dtype=float32_ref>,
 <tf.Variable 'conv2/bias:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'batch_norm2/gamma:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'batch_norm2/beta:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'conv3/kernel:0' shape=(8, 8, 64, 128) dtype=float32_ref>,
 <tf.Variable 'conv3/bias:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'batch_norm3/gamma:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'batch_norm3/beta:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'fc1/kernel:0' shape=(3200, 512) dtype=float32_ref>,
 <tf.Variable 'fc1/bias:0' shape=(512,) dtype=float32_ref>,
 <tf.Variable 'logits/kernel:0' shape=(512, 3) dtype=float32_ref>,

### Losses and gradients ⛰️

#### Policy Gradient Loss
- We defined an objective function  as <b> total reward an agent can achieve under policy π averaged over all starting states </b>:

In [8]:
%%latex
\begin{align}
J(\theta) = E_{\pi} [G_{1}]
\end{align}

<IPython.core.display.Latex object>

We also know that a gradient of this function is determined by policy gradient theorem as:
<img src="assets/pg.png" alt="pg"/>

We’re trying to maximize the J function so we can just say that:

In [9]:
%%latex
\begin{align}
L_{\pi} = - J(\theta)
\end{align}

<IPython.core.display.Latex object>

The final loss function is then:

In [10]:
%%latex
\begin{align}
L_{\pi} = -1/n \sum_{i=1}^{n} A(s_{i}, a_{i}) * log\pi(a_{i} | s_{i}) 
\end{align}

<IPython.core.display.Latex object>

Where: 
- -1/n sum : tf.reduce_mean
- A(si,ai): (d_r - value)
- logpi(ai|si): tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=y) calculates the softmax and log loss


tf.reduce_mean([[3.,4], [5.,6], [6.,7]]) <br>
--> [ 3.5  5.5  6.5] <br>
--> Mean between element <br>

In [11]:
policyGradient_loss = tf.reduce_mean((d_r - value) * tf.nn.sparse_softmax_cross_entropy_with_logits(logits = logits, labels=actions_onehot))

#### Value loss
Calculate our value loss by using common squared mean error loss:

In [12]:
value_loss = value_scale * tf.reduce_mean(tf.square(d_r - value))

#### Total loss

In [13]:
entropy_loss = -entropy_scale * tf.reduce_sum(aprob * tf.exp(aprob))
loss = policyGradient_loss + value_loss - entropy_loss

#### Optimizer and gradients

In [14]:
optimizer = tf.train.AdamOptimizer(alpha)
gradients = tf.gradients(loss, tf.trainable_variables())
gradients, _ = tf.clip_by_global_norm(gradients, gradient_clip) # gradient clipping
gradients_and_variables = list(zip(gradients, tf.trainable_variables()))
train_op = optimizer.apply_gradients(gradients_and_variables)

# Initialize Session
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

### Training ⏲️
#### Rollout function
This function will gather a batch of training data from multiple episodes.

In [15]:
def rollout(batch_size, render):
    states, actions, rewards, rewardsFeed, discountedRewards = [], [], [], [], []
    
    
    
    # Launch a new episode
    game.new_episode()
    state = game.get_state().screen_buffer
    state = preprocess_frame(state)
    episode_num = 0
    
    ### ???
    action_repeat = 3
    reward = 0
    
    while True:
        # Run State Through Policy & Calculate Action
        state = state.reshape(1, 84, 84, 1)
        feed = {x: state}
        action = sess.run(calc_action, feed_dict=feed)
        action = action[0][0]
        
        # Perform action
        for i in range(action_repeat):
            new_reward = game.make_action(actions_list[action])
            done = game.is_episode_finished()
            if done:
                break
                
            reward += new_reward
            new_state = game.get_state().screen_buffer
            
            
        
        # Store results
        states.append(state)
        rewards.append(reward)
        actions.append(action) 
        
        # Update Current State
        reward = 0
        state = preprocess_frame(new_state)
        if done:
            
            # Track Discounted Rewards
            rewardsFeed.append(rewards)
            discountedRewards.append(discount(rewards, gamma, normalize_reward))
            
            if len(np.concatenate(rewardsFeed)) > batch_size:
                break
                
            # Reset Environment
            rewards = []
            game.new_episode("episodes/episode" + str(episode_num) + "_rec.lmp")
            state = game.get_state().screen_buffer
            state = preprocess_frame(state)
            episode_num += 1
                         
    return np.stack(states), np.stack(actions), np.concatenate(rewardsFeed), np.concatenate(discountedRewards), episode_num
        

### Summaries 📋

In [None]:
avg_reward = tf.divide(tf.reduce_sum(r), n)
writer = tf.summary.FileWriter("/tensorboard/pg1")

## Losses
tf.summary.scalar("Total_Loss", loss)
tf.summary.scalar("PolicyGradient_Loss", policyGradient_loss)
tf.summary.scalar("Entropy_Loss", entropy_loss)
tf.summary.scalar("Value_Loss", value_loss)

## Rewards
tf.summary.scalar("Mean_Reward", avg_reward)

## Model
#tf.summary.histogram('Conv1', tf.trainable_variables()[0])
#tf.summary.histogram('Conv2', tf.trainable_variables()[2])
#tf.summary.histogram('FC', tf.trainable_variables()[4])
#tf.summary.histogram('Logits', tf.trainable_variables()[6])
#tf.summary.histogram('Value', tf.trainable_variables()[8])

write_op = tf.summary.merge_all()

In [None]:
average = []
step = 0
epoch = 0
while epoch < num_epochs + 1:
    # Gather training data
    print("Epoch ", epoch, "/", num_epochs)
    s_, a_, r_, d_r_, n_ = rollout(batch_size, render)
    
    # Calculate average reward
    avg_reward = np.sum(r_)/n_
    average.append(avg_reward)
    
    print('Training Episodes: {}  Average Reward: {:4.2f}  Total Average: {:4.2f}'.format(n, avg_reward, np.mean(average)))
    
    
    sess.run(train_op, feed_dict={x: s_.reshape(len(s_),84,84,1), y:a_, d_r: d_r_, r: r_, n: n_})
    
    # Write TF summaries
    summary = sess.run(write_op, feed_dict={x: s_.reshape(len(s_),84,84,1), y:a_, d_r: d_r_, r: r_, n: n_})
    writer.add_summary(summary, step)
    writer.flush()
    
    # Save Model
    if step % 10 == 0:
          print("SAVED MODEL")
          saver.save(sess, save_path, global_step=step)
          
    step += 1

Epoch  0 / 500


  warn("The default mode, 'constant', will be changed to 'reflect' in "


Training Episodes: Tensor("episodes:0", dtype=float32)  Average Reward: 555.56  Total Average: 555.56


### Display results 📈

In [None]:
game.init()
state = game.get_state().screen_buffer
state = preprocess_frame(state)
prob, val = sess.run([aprob, value], feed_dict={X: state.reshape(1, 84, 84, 1)})

print('Turn Right: {:4.2f}  Turn Left: {:4.2f}  Move Forward {:4.2f}'.format(prob[0][0],prob[0][2], prob[0][1]))
print('Approximated State Value: {:4.4f}'.format(val[0][0]))

### See our Agent play 🕹️

In [None]:
# New render settings for replay
game = DoomGame()
game.load_config("health_gathering.cfg")

# New render settings for replay
game.set_screen_resolution(ScreenResolution.RES_800X600)
game.set_render_hud(True)

# Replay can be played in any mode.
game.set_mode(Mode.SPECTATOR)


game.init()
i = 500


#sleep(2)

for i in range(1000):
    
    game.replay_episode("episode" + str(i) + "_rec.lmp")

    while not game.is_episode_finished():
        s = game.get_state()

        # Use advance_action instead of make_action.
        game.advance_action()

        r = game.get_last_reward()

    print("Episode finished.")
    print("total reward:", game.get_total_reward())
    print("************************")

game.close()