- [Blog](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724)
  - **key idea**: Collect rewards over a whole episode and use those experiences to update the NN all at once.
  - This forces the NN to account for the entire trace of actions (rollouts, experience traces). Note that the rewards for all actions must be discounted before they are used to update the NN.
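
For reference, the discounting mentioned above can be written as a formula. The notation is an assumption of this note, not taken from the blog: $r_t$ is the reward at step $t$, $\gamma$ is the discount rate, and $T$ is the number of steps in the episode.

$$
G_t = \sum_{k=0}^{T-t-1} \gamma^{k}\, r_{t+k} = r_t + \gamma\, G_{t+1}, \qquad G_T = 0
$$

The recursive form on the right is the one that `discount_rewards` evaluates, walking backwards from the last step of the episode.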

In [9]:
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim
from tensorflow.contrib.layers import fully_connected
import gym

In [10]:
print(fully_connected.__doc__)

Adds a fully connected layer.

  `fully_connected` creates a variable called `weights`, representing a fully
  connected weight matrix, which is multiplied by the `inputs` to produce a
  `Tensor` of hidden units. If a `normalizer_fn` is provided (such as
  `batch_norm`), it is then applied. Otherwise, if `normalizer_fn` is
  None and a `biases_initializer` is provided then a `biases` variable would be
  created and added the hidden units. Finally, if `activation_fn` is not `None`,
  it is applied to the hidden units as well.

  Note: that if `inputs` have a rank greater than 2, then `inputs` is flattened
  prior to the initial matrix multiply by `weights`.

  Args:
    inputs: A tensor of at least rank 2 and static value for the last dimension;
      i.e. `[batch_size, depth]`, `[None, None, None, channels]`.
    num_outputs: Integer or long, the number of output units in the layer.
    activation_fn: Activation function. The default value is a ReLU function.
      Explicitly set it to

In [11]:
from IPython.display import display_markdown
display_markdown(tf.gather.__doc__, raw=True)

Gather slices from `params` according to `indices`.

  `indices` must be an integer tensor of any dimension (usually 0-D or 1-D).
  Produces an output tensor with shape `indices.shape + params.shape[1:]` where:

  ```python
      # Scalar indices
      output[:, ..., :] = params[indices, :, ... :]

      # Vector indices
      output[i, :, ..., :] = params[indices[i], :, ... :]

      # Higher rank indices
      output[i, ..., j, :, ... :] = params[indices[i, ..., j], :, ..., :]
  ```

  If `indices` is a permutation and `len(indices) == params.shape[0]` then
  this operation will permute `params` accordingly.

  Args:
    params: A `Tensor`.
    indices: A `Tensor`. Must be one of the following types: `int32`, `int64`.
    validate_indices: An optional `bool`. Defaults to `True`.
    name: A name for the operation (optional).

  Returns:
    A `Tensor`. Has the same type as `params`.
  

In [12]:
display_markdown(tf.gradients.__doc__, raw=True)

Constructs symbolic partial derivatives of sum of `ys` w.r.t. x in `xs`.

  `ys` and `xs` are each a `Tensor` or a list of tensors.  `grad_ys`
  is a list of `Tensor`, holding the gradients received by the
  `ys`. The list must be the same length as `ys`.

  `gradients()` adds ops to the graph to output the partial
  derivatives of `ys` with respect to `xs`.  It returns a list of
  `Tensor` of length `len(xs)` where each tensor is the `sum(dy/dx)`
  for y in `ys`.

  `grad_ys` is a list of tensors of the same length as `ys` that holds
  the initial gradients for each y in `ys`.  When `grad_ys` is None,
  we fill in a tensor of '1's of the shape of y for each y in `ys`.  A
  user can provide their own initial `grad_ys` to compute the
  derivatives using a different initial gradient for each y (e.g., if
  one wanted to weight the gradient differently for each value in
  each y).

  Args:
    ys: A `Tensor` or list of tensors to be differentiated.
    xs: A `Tensor` or list of tensors to be used for differentiation.
    grad_ys: Optional. A `Tensor` or list of tensors the same size as
      `ys` and holding the gradients computed for each y in `ys`.
    name: Optional name to use for grouping all the gradient ops together.
      defaults to 'gradients'.
    colocate_gradients_with_ops: If True, try colocating gradients with
      the corresponding op.
    gate_gradients: If True, add a tuple around the gradients returned
      for an operations.  This avoids some race conditions.
    aggregation_method: Specifies the method used to combine gradient terms.
      Accepted values are constants defined in the class `AggregationMethod`.

  Returns:
    A list of `sum(dy/dx)` for each x in `xs`.

  Raises:
    LookupError: if one of the operations between `x` and `y` does not
      have a registered gradient function.
    ValueError: if the arguments are invalid.

  

In [13]:
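def discount_rewards(rewards, discount_rate=0.99):
    """Return the discounted return for every step of one episode."""
    # Use a float buffer so integer rewards are not truncated
    acc_rewards_discounted = np.zeros_like(rewards, dtype=np.float64)
    acc_rewards = 0
    # Walk backwards: return at step t = reward at t + discount_rate * return at t+1
    for t in reversed(range(0, len(rewards))):
        acc_rewards = acc_rewards*discount_rate + rewards[t]
        acc_rewards_discounted[t] = acc_rewards
    return acc_rewards_discounted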
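
A small worked example may help check the recursion by hand. The sketch below restates the function so it runs on its own. The reward values and the discount rate of 0.5 are made up for illustration and do not come from a CartPole episode.

```python
import numpy as np

def discount_rewards(rewards, discount_rate=0.99):
    # Same backward recursion as the cell above:
    # return at step t = reward at t + discount_rate * return at t+1
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = running * discount_rate + rewards[t]
        discounted[t] = running
    return discounted

# Illustrative rewards (assumption): three steps, reward 1.0 each.
example = discount_rewards([1.0, 1.0, 1.0], discount_rate=0.5)
print(example.tolist())  # [1.75, 1.5, 1.0]
```

The last step keeps its raw reward, while earlier steps also receive credit for what followed: 1 + 0.5 * 1 = 1.5, then 1 + 0.5 * 1.5 = 1.75.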

In [14]:
class Agent(object):

    def __init__(self, learn_rate, num_states, num_actions, num_hidden_neurons):
        # Policy network: state -> hidden ReLU layer -> softmax over actions
        self.state_in = tf.placeholder(shape=[None, num_states], dtype=tf.float32)
        hidden_layer = fully_connected(self.state_in,
                                       num_hidden_neurons,
                                       biases_initializer=None,
                                       activation_fn=tf.nn.relu)
        self.output = fully_connected(hidden_layer,
                                      num_actions,
                                      biases_initializer=None,
                                      activation_fn=tf.nn.softmax)
        # Greedy action, used when evaluating the trained agent
        self.chosen_action = tf.argmax(self.output, 1)

        # Discounted returns and the actions actually taken, one entry per step
        self.reward_holder = tf.placeholder(shape=[None], dtype=tf.float32)
        self.action_holder = tf.placeholder(shape=[None], dtype=tf.int32)

        # Flat index of output[i, action_holder[i]] in the flattened output
        indexes = tf.range(0, tf.shape(self.output)[0])*tf.shape(self.output)[1]+self.action_holder
        self.responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), indexes)
        # Policy-gradient loss: -mean(log(prob of chosen action) * discounted return)
        self.loss = -tf.reduce_mean(tf.log(self.responsible_outputs)*self.reward_holder)

        # One placeholder per trainable variable, so that gradients accumulated
        # outside the graph can be fed back in and applied in a single step
        weights = tf.trainable_variables()
        self.gradient_holders = []
        for i, _ in enumerate(weights):
            placeholder = tf.placeholder(tf.float32, name="{}_holder".format(i))
            self.gradient_holders.append(placeholder)
        self.gradients = tf.gradients(self.loss, weights)
        optimizer = tf.train.AdamOptimizer(learning_rate=learn_rate)
        self.train_op = optimizer.apply_gradients(zip(self.gradient_holders, weights))
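
The `indexes` / `tf.gather` pair in the class above picks, for every step, the probability the network assigned to the action that was actually taken. A minimal NumPy sketch of the same indexing idea follows. The probabilities and actions are made-up values, and NumPy stands in for the TensorFlow ops purely for illustration.

```python
import numpy as np

# Made-up policy outputs for a batch of 3 states and 2 actions (assumption).
probs = np.array([[0.9, 0.1],
                  [0.3, 0.7],
                  [0.4, 0.6]])
actions = np.array([0, 1, 1])  # action taken in each state

# In the flattened array, row i starts at offset i * num_actions,
# so the chosen entry of row i sits at i * num_actions + actions[i].
batch_size, num_actions = probs.shape
flat_indexes = np.arange(batch_size) * num_actions + actions
responsible = probs.reshape(-1)[flat_indexes]

# Cross-check against direct (row, column) indexing.
assert np.array_equal(responsible, probs[np.arange(batch_size), actions])
print(flat_indexes.tolist())  # [0, 3, 5]
print(responsible.tolist())   # [0.9, 0.7, 0.6]
```

Flattening first keeps the lookup a plain 1-D gather, which is all the `tf.gather` call in the class needs.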

----

In [None]:
env = gym.make("CartPole-v0")
env.reset() # initial observation: cart position, cart velocity, pole angle, pole velocity at tip

[2017-06-13 16:12:27,725] Making new env: CartPole-v0


array([-0.02738802, -0.0328304 ,  0.00793568,  0.01976243])

In [None]:
# Training the agent
tf.reset_default_graph()

agent = Agent(learn_rate=1e-2, num_states=4, num_actions=2, num_hidden_neurons=8)

num_episodes = 5000
max_iters_per_episodes = 999
update_frequency = 5 # number of episodes between weight updates

sess = tf.InteractiveSession()

tf.global_variables_initializer().run()
total_reward = []
total_length = []
    
# One zero-filled array per trainable variable; accumulates gradients across episodes
gradBuffer = sess.run(tf.trainable_variables())
for idx, grad in enumerate(gradBuffer):
    gradBuffer[idx] = grad * 0
    
for episode in range(num_episodes):
    current_state = env.reset()
    acc_reward = 0
    episode_history = []
    for i in range(max_iters_per_episodes):
        action_dist = sess.run(agent.output, feed_dict={agent.state_in:[current_state]})[0]
        action = np.random.choice(np.arange(action_dist.shape[0]), p=action_dist)
        new_state, reward, is_doomed, _ = env.step(action)
        episode_history.append((current_state, action, reward, new_state))
        current_state = new_state
        acc_reward += reward
            
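        if is_doomed:
            # dtype=object keeps the mixed (state, action, reward, new_state) rows intact
            episode_history = np.array(episode_history, dtype=object)
            # Replace the raw rewards with discounted returns
            episode_history[:,2] = discount_rewards(episode_history[:, 2])
            feed_dict = {agent.reward_holder:episode_history[:,2],
                         agent.action_holder:episode_history[:,1],
                         agent.state_in:np.vstack(episode_history[:,0])}
            # Gradients for this episode are accumulated here, not applied yet
            grads = sess.run(agent.gradients, feed_dict=feed_dict)
            for idx, grad in enumerate(grads):
                gradBuffer[idx] += grad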
                
            if episode % update_frequency == 0 and episode != 0:
                feed_dict = dict(zip(agent.gradient_holders, gradBuffer))
                _ = sess.run(agent.train_op, feed_dict=feed_dict)
                for idx, grad in enumerate(gradBuffer):
                    gradBuffer[idx] = grad*0
            total_reward.append(acc_reward)
            total_length.append(i)
            break
    if episode % 100 == 0:
        print(np.mean(total_reward[-100:]))
print("done")

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


21.0
26.17
34.41
47.28
57.55
84.1
108.24
146.65
176.95
180.59
192.8
188.18
175.14
160.03
111.52
118.12
156.36
183.76
186.92
168.74
189.97
183.44
195.09
181.98
168.06
175.55
176.43
146.55
160.27
185.0
187.76
196.57
200.0
200.0
200.0
200.0
200.0
196.74
193.14
198.49
195.3
190.96
179.18
186.45
184.56
189.01
196.47
198.24
198.75
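
The training loop above sums gradients over several episodes and applies them in one step. The sketch below isolates that bookkeeping with NumPy only. `fake_episode_gradients`, `variables`, and `applied_updates` are hypothetical stand-ins for `sess.run(agent.gradients, ...)`, the trainable variables, and the `train_op` call. No learning happens here.

```python
import numpy as np

# Hypothetical stand-ins for the network's trainable variables.
variables = [np.zeros((2, 2)), np.zeros(3)]
update_frequency = 5
applied_updates = []  # records the summed gradients handed to each update

def fake_episode_gradients(episode):
    # Placeholder for the per-episode gradients; always returns ones.
    return [np.ones_like(v) for v in variables]

grad_buffer = [np.zeros_like(v) for v in variables]
for episode in range(11):
    # Accumulate this episode's gradients into the buffer.
    for idx, grad in enumerate(fake_episode_gradients(episode)):
        grad_buffer[idx] += grad
    # Same update condition as the training cell above.
    if episode % update_frequency == 0 and episode != 0:
        applied_updates.append([g.copy() for g in grad_buffer])
        grad_buffer = [np.zeros_like(g) for g in grad_buffer]

episodes_per_update = [float(u[0][0, 0]) for u in applied_updates]
print(episodes_per_update)  # [6.0, 5.0]
```

With this condition the first update sums six episodes (0 through 5) and later updates sum five. The gradients are summed rather than averaged, so the effective step size grows with `update_frequency`.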


In [None]:
from gym.wrappers import Monitor
env = Monitor(gym.make("CartPole-v0"), "./exp_cartpole", force=True)

In [None]:
current_state = env.reset()
for _ in range(10000):
    # Act greedily (argmax over the policy output) instead of sampling
    action = sess.run(agent.chosen_action, feed_dict={agent.state_in:[current_state]})[0]
    new_state, _, is_doomed, _ = env.step(action)
    current_state = new_state
    if is_doomed:
        break

In [None]:
env.close()

- [tf.gradients](https://www.tensorflow.org/api_docs/python/tf/gradients)
  - [StackOverflow](https://stackoverflow.com/questions/41822308/how-tf-gradients-work-in-tensorflow)
- [CartPole-v0](https://github.com/openai/gym/wiki/CartPole-v0)