# Simple Reinforcement Learning in Tensorflow Part 2: Policy Gradient Method
This tutorial contains a simple example of how to build a policy-gradient based agent that can solve the CartPole problem. For more information, see this [Medium post](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-2-ded33892c724#.mtwpvfi8b).

For more Reinforcement Learning algorithms, including DQN and Model-based learning in Tensorflow, see my Github repo, [DeepRL-Agents](https://github.com/awjuliani/DeepRL-Agents). 

Parts of this tutorial are based on code by [Andrej Karpathy](https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5) and [korymath](https://gym.openai.com/evaluations/eval_a0aVJrGSyW892vBM04HQA).

In [1]:
import numpy as np
import _pickle as pickle
import tensorflow as tf
%matplotlib inline
import matplotlib.pyplot as plt
import math

### Loading the CartPole Environment
If you don't already have OpenAI Gym installed, use `pip install gym` to grab it.

In [2]:
import gym
from gym import wrappers
env = gym.make('CartPole-v0')
env = wrappers.Monitor(env, '/tmp/cartpole-experiment-3')


[2017-05-25 15:51:06,900] Making new env: CartPole-v0
[2017-05-25 15:51:06,906] Creating monitor directory /tmp/cartpole-experiment-3


What happens if we try running the environment with random actions? How well do we do? (Hint: not so well.)

In [3]:
env.reset()
# random_episodes = 0
# reward_sum = 0
# while random_episodes < 10:
#     env.render()
#     observation, reward, done, _ = env.step(np.random.randint(0,2))
#     reward_sum += reward
#     if done:
#         random_episodes += 1
#         print("Reward for this episode was:",reward_sum)
#         reward_sum = 0
#         env.reset()

[2017-05-25 15:51:06,912] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000000.mp4


array([-0.04197807,  0.01352341, -0.04684693, -0.01576206])

The goal of the task is to achieve a reward of 200 per episode. For every step the agent keeps the pole in the air, the agent receives a +1 reward. By randomly choosing actions, our reward for each episode is only a couple dozen. Let's make that better with RL!

### Setting up our Neural Network agent
This time we will be using a Policy neural network that takes observations, passes them through a single hidden layer, and then produces a probability of choosing a left/right movement. To learn more about this network, see [Andrej Karpathy's blog on Policy Gradient networks](http://karpathy.github.io/2016/05/31/rl/).
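
To make the data flow concrete, here is a minimal plain-Python sketch of the forward pass described above: observation, one ReLU hidden layer, then a sigmoid that gives the probability of choosing action 1. The helper name `policy_forward` and the uniform random weights are assumptions for illustration only; the actual network in this tutorial is built in TensorFlow with Xavier initialization.

```python
import math
import random

def policy_forward(observation, W1, W2):
    """Return P(action = 1) for one observation.

    observation: list of D floats
    W1: D x H weight matrix (list of rows); W2: list of H weights
    """
    n_hidden = len(W2)
    # Hidden layer: ReLU(observation @ W1), with no bias term
    hidden = [
        max(0.0, sum(observation[d] * W1[d][h] for d in range(len(observation))))
        for h in range(n_hidden)
    ]
    # Output layer: a single score squashed into (0, 1) by a sigmoid
    score = sum(hidden[h] * W2[h] for h in range(n_hidden))
    return 1.0 / (1.0 + math.exp(-score))

random.seed(0)
D, H = 4, 10  # 4 inputs and 10 hidden units, the sizes this tutorial uses
# Illustrative uniform initialization (an assumption, not Xavier)
W1 = [[random.uniform(-0.5, 0.5) for _ in range(H)] for _ in range(D)]
W2 = [random.uniform(-0.5, 0.5) for _ in range(H)]

obs = [-0.04197807, 0.01352341, -0.04684693, -0.01576206]  # the reset observation shown earlier
p = policy_forward(obs, W1, W2)
action = 1 if random.random() < p else 0  # sample the action from the Bernoulli probability
print(p, action)
```

Because the policy is stochastic, the same observation can produce different actions on different calls. That randomness is what lets the agent explore.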

In [4]:
# hyperparameters
H = 10 # number of hidden layer neurons
batch_size = 5 # every how many episodes to do a param update?
learning_rate = 1e-2 # feel free to play with this to train faster or more stably.
gamma = 0.99 # discount factor for reward

D = 4 # input dimensionality

In [5]:
tf.reset_default_graph()

# This defines the network as it goes from taking an observation of the environment to
# giving a probability of choosing the action of moving left or right.
observations = tf.placeholder(tf.float32, [None,D] , name="input_x")
W1 = tf.get_variable("W1", shape=[D, H],
           initializer=tf.contrib.layers.xavier_initializer())
layer1 = tf.nn.relu(tf.matmul(observations,W1))
W2 = tf.get_variable("W2", shape=[H, 1],
           initializer=tf.contrib.layers.xavier_initializer())
score = tf.matmul(layer1,W2)
probability = tf.nn.sigmoid(score)

#From here we define the parts of the network needed for learning a good policy.
tvars = tf.trainable_variables()
input_y = tf.placeholder(tf.float32,[None,1], name="input_y")
advantages = tf.placeholder(tf.float32,name="reward_signal")

# The loss function. This sends the weights in the direction of making actions 
# that gave good advantage (reward over time) more likely, and actions that didn't less likely.
loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability))
loss = -tf.reduce_mean(loglik * advantages) 
newGrads = tf.gradients(loss,tvars)

# Once we have collected a series of gradients from multiple episodes, we apply them.
# We don't just apply gradients after every episode in order to account for noise in the reward signal.
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
batchGrad = [W1Grad,W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
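
The `loglik` expression can look cryptic at first. The sketch below checks, for a single step in plain Python, that it reduces to the log-probability of the action that was actually taken. The training loop sets the "fake label" to 1 when action 0 was taken and to 0 when action 1 was taken. The value `p = 0.7` and the helper name `log_likelihood` are hypothetical, chosen only for illustration.

```python
import math

def log_likelihood(fake_label, p_action1):
    """Same expression as `loglik` in the graph, evaluated for a single step."""
    y, p = fake_label, p_action1
    return math.log(y * (y - p) + (1 - y) * (y + p))

p = 0.7  # hypothetical network output: P(action = 1)

# Action 1 was taken -> fake label y = 0 -> the expression reduces to log(p)
assert math.isclose(log_likelihood(0, p), math.log(p))
# Action 0 was taken -> fake label y = 1 -> the expression reduces to log(1 - p)
assert math.isclose(log_likelihood(1, p), math.log(1 - p))

# Each step contributes -loglik * advantage to the loss
for advantage in (+1.0, -1.0):
    print(advantage, -log_likelihood(0, p) * advantage)
```

With a positive advantage, lowering the loss means raising the probability of the action that was taken; with a negative advantage, it means lowering it.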

### Advantage function
This function lets us weight the rewards our agent receives. In the CartPole task, we want actions that kept the pole in the air for a long time to have a large reward, and actions that contributed to the pole falling to have a decreased or negative reward. We do this by discounting rewards backward from the end of the episode: each step's value is the discounted sum of the rewards that follow it. Actions near the end are seen as negative (once the values are normalized in the training loop), since they likely contributed to the pole falling and the episode ending. Likewise, early actions are seen as more positive, since they weren't responsible for the pole falling.

In [6]:
def discount_rewards(r):
    """ take 1D float array of rewards and compute discounted reward """
    discounted_r = np.zeros_like(r)
    running_add = 0
    for t in reversed(range(0, r.size)):
        running_add = running_add * gamma + r[t]
        discounted_r[t] = running_add
    return discounted_r
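
As a small worked example, the sketch below applies the same recurrence to a five-step episode and then normalizes the result the way the training loop does (subtract the mean, divide by the standard deviation). It uses plain Python lists rather than NumPy so it runs standalone; `discount_rewards_list` and `normalize` are hypothetical helper names, not part of the tutorial's code.

```python
def discount_rewards_list(rewards, gamma=0.99):
    """Pure-Python counterpart of discount_rewards, for illustration."""
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    # Walk backward so each step accumulates the discounted future rewards
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted

def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

rewards = [1.0, 1.0, 1.0, 1.0, 1.0]  # +1 for every step the pole stayed up
returns = discount_rewards_list(rewards)
advantages = normalize(returns)

print([round(r, 4) for r in returns])
print([round(a, 4) for a in advantages])
```

The first step gets the largest return (about 4.90) and the last step gets exactly 1.0. After normalization the early steps come out positive and the final steps negative, which matches the intuition described above.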

### Running the Agent and Environment

Here we run the neural network agent, and have it act in the CartPole environment.

As you can see, the network not only does much better than random actions, but achieves the goal of 200 points per episode, thus solving the task!

In [7]:
xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[]
running_reward = None
reward_sum = 0
episode_number = 1
total_episodes = 10000
init = tf.initialize_all_variables()

# Launch the graph
with tf.Session() as sess:
    rendering = False
    sess.run(init)
    observation = env.reset() # Obtain an initial observation of the environment

    # Reset the gradient placeholder. We will collect gradients in 
    # gradBuffer until we are ready to update our policy network. 
    gradBuffer = sess.run(tvars)
    for ix,grad in enumerate(gradBuffer):
        gradBuffer[ix] = grad * 0
    
    while episode_number <= total_episodes:
        
        # Rendering the environment slows things down. The commented-out condition below
        # would render only once our agent is doing a good job; `if True` renders every step.
#         if reward_sum/batch_size > 100 or rendering == True :
        if True:
            env.render("human")
            rendering = True
            
        # Make sure the observation is in a shape the network can handle.
        x = np.reshape(observation,[1,D])
        
        # Run the policy network and get an action to take. 
        tfprob = sess.run(probability,feed_dict={observations: x})
        action = 1 if np.random.uniform() < tfprob else 0
        
        xs.append(x) # observation
        y = 1 if action == 0 else 0 # a "fake label"
        ys.append(y)

        # step the environment and get new measurements
        observation, reward, done, info = env.step(action)
        reward_sum += reward

        drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

        if done: 
            episode_number += 1
            # stack together all inputs, hidden states, action gradients, and rewards for this episode
            epx = np.vstack(xs)
            epy = np.vstack(ys)
            epr = np.vstack(drs)
            tfp = tfps
            xs,hs,dlogps,drs,ys,tfps = [],[],[],[],[],[] # reset array memory

            # compute the discounted reward backwards through time
            discounted_epr = discount_rewards(epr)
            # size the rewards to be unit normal (helps control the gradient estimator variance)
            discounted_epr -= np.mean(discounted_epr)
            discounted_epr /= np.std(discounted_epr)
            
            # Get the gradient for this episode, and save it in the gradBuffer
            tGrad = sess.run(newGrads,feed_dict={observations: epx, input_y: epy, advantages: discounted_epr})
            for ix,grad in enumerate(tGrad):
                gradBuffer[ix] += grad
                
            # If we have completed enough episodes, then update the policy network with our gradients.
            if episode_number % batch_size == 0: 
                sess.run(updateGrads,feed_dict={W1Grad: gradBuffer[0],W2Grad:gradBuffer[1]})
                for ix,grad in enumerate(gradBuffer):
                    gradBuffer[ix] = grad * 0
                
                # Give a summary of how well our network is doing for each batch of episodes.
                running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01
                print('Average reward for episode %f.  Total average reward %f.' % (reward_sum/batch_size, running_reward/batch_size))
                
                # Note: CartPole-v0 caps episodes at 200 steps, so this strict `>` check never fires.
                if reward_sum/batch_size > 200: 
                    print("Task solved in",episode_number,'episodes!')
                    break
                    
                reward_sum = 0
            
            observation = env.reset()

print(episode_number,'Episodes completed.')


[2017-05-25 15:51:08,266] From <ipython-input-7-3c59ebf77625>:6: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
[2017-05-25 15:51:08,725] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000001.mp4
[2017-05-25 15:51:09,395] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000008.mp4


Average reward for episode 15.400000.  Total average reward 15.400000.
Average reward for episode 18.200000.  Total average reward 15.428000.
Average reward for episode 19.800000.  Total average reward 15.471720.
Average reward for episode 17.400000.  Total average reward 15.491003.
Average reward for episode 32.200000.  Total average reward 15.658093.


[2017-05-25 15:51:10,425] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000027.mp4


Average reward for episode 22.800000.  Total average reward 15.729512.
Average reward for episode 23.800000.  Total average reward 15.810217.
Average reward for episode 20.800000.  Total average reward 15.860115.
Average reward for episode 26.400000.  Total average reward 15.965513.
Average reward for episode 19.200000.  Total average reward 15.997858.
Average reward for episode 17.400000.  Total average reward 16.011880.
Average reward for episode 31.200000.  Total average reward 16.163761.


[2017-05-25 15:51:11,867] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000064.mp4


Average reward for episode 17.800000.  Total average reward 16.180123.
Average reward for episode 24.600000.  Total average reward 16.264322.
Average reward for episode 22.800000.  Total average reward 16.329679.
Average reward for episode 22.600000.  Total average reward 16.392382.
Average reward for episode 40.600000.  Total average reward 16.634458.
Average reward for episode 22.200000.  Total average reward 16.690114.
Average reward for episode 34.000000.  Total average reward 16.863213.
Average reward for episode 38.400000.  Total average reward 17.078580.
Average reward for episode 22.800000.  Total average reward 17.135795.
Average reward for episode 28.800000.  Total average reward 17.252437.
Average reward for episode 24.200000.  Total average reward 17.321912.


[2017-05-25 15:51:14,201] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000125.mp4


Average reward for episode 18.200000.  Total average reward 17.330693.
Average reward for episode 23.200000.  Total average reward 17.389386.
Average reward for episode 22.800000.  Total average reward 17.443492.
Average reward for episode 21.600000.  Total average reward 17.485057.
Average reward for episode 19.000000.  Total average reward 17.500207.
Average reward for episode 34.200000.  Total average reward 17.667205.
Average reward for episode 40.200000.  Total average reward 17.892533.
Average reward for episode 26.200000.  Total average reward 17.975607.
Average reward for episode 23.800000.  Total average reward 18.033851.
Average reward for episode 48.400000.  Total average reward 18.337513.
Average reward for episode 24.200000.  Total average reward 18.396138.
Average reward for episode 28.600000.  Total average reward 18.498176.
Average reward for episode 39.200000.  Total average reward 18.705195.
Average reward for episode 23.000000.  Total average reward 18.748143.

[2017-05-25 15:51:17,965] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000216.mp4


Average reward for episode 41.000000.  Total average reward 19.712429.
Average reward for episode 30.800000.  Total average reward 19.823304.
Average reward for episode 34.200000.  Total average reward 19.967071.
Average reward for episode 35.000000.  Total average reward 20.117401.
Average reward for episode 29.400000.  Total average reward 20.210227.
Average reward for episode 29.600000.  Total average reward 20.304124.
Average reward for episode 32.600000.  Total average reward 20.427083.
Average reward for episode 49.800000.  Total average reward 20.720812.
Average reward for episode 23.800000.  Total average reward 20.751604.
Average reward for episode 50.200000.  Total average reward 21.046088.
Average reward for episode 29.000000.  Total average reward 21.125627.
Average reward for episode 28.600000.  Total average reward 21.200371.
Average reward for episode 32.800000.  Total average reward 21.316367.
Average reward for episode 43.400000.  Total average reward 21.537204.

[2017-05-25 15:51:24,810] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000343.mp4


Average reward for episode 52.800000.  Total average reward 24.927735.
Average reward for episode 31.200000.  Total average reward 24.990458.
Average reward for episode 59.000000.  Total average reward 25.330553.
Average reward for episode 40.800000.  Total average reward 25.485247.
Average reward for episode 59.400000.  Total average reward 25.824395.
Average reward for episode 37.800000.  Total average reward 25.944151.
Average reward for episode 52.600000.  Total average reward 26.210710.
Average reward for episode 55.600000.  Total average reward 26.504602.
Average reward for episode 61.800000.  Total average reward 26.857556.
Average reward for episode 57.200000.  Total average reward 27.160981.
Average reward for episode 48.800000.  Total average reward 27.377371.
Average reward for episode 30.200000.  Total average reward 27.405597.
Average reward for episode 58.400000.  Total average reward 27.715541.
Average reward for episode 57.000000.  Total average reward 28.008386.

[2017-05-25 15:51:37,023] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000512.mp4


Average reward for episode 57.400000.  Total average reward 34.135395.
Average reward for episode 53.400000.  Total average reward 34.328041.
Average reward for episode 88.200000.  Total average reward 34.866761.
Average reward for episode 46.600000.  Total average reward 34.984093.
Average reward for episode 105.000000.  Total average reward 35.684252.
Average reward for episode 51.000000.  Total average reward 35.837410.
Average reward for episode 61.200000.  Total average reward 36.091036.
Average reward for episode 81.000000.  Total average reward 36.540125.
Average reward for episode 104.000000.  Total average reward 37.214724.
Average reward for episode 74.000000.  Total average reward 37.582577.
Average reward for episode 64.400000.  Total average reward 37.850751.
Average reward for episode 128.800000.  Total average reward 38.760244.
Average reward for episode 89.000000.  Total average reward 39.262641.
Average reward for episode 66.400000.  Total average reward 39.534015.

[2017-05-25 15:52:03,681] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video000729.mp4


Average reward for episode 198.000000.  Total average reward 59.466912.
Average reward for episode 170.000000.  Total average reward 60.572243.
Average reward for episode 161.400000.  Total average reward 61.580521.
Average reward for episode 132.000000.  Total average reward 62.284715.
Average reward for episode 165.600000.  Total average reward 63.317868.
Average reward for episode 68.000000.  Total average reward 63.364690.
Average reward for episode 144.800000.  Total average reward 64.179043.
Average reward for episode 148.400000.  Total average reward 65.021252.
Average reward for episode 154.000000.  Total average reward 65.911040.
Average reward for episode 185.400000.  Total average reward 67.105929.
Average reward for episode 158.200000.  Total average reward 68.016870.
Average reward for episode 192.600000.  Total average reward 69.262701.
Average reward for episode 200.000000.  Total average reward 70.570074.
Average reward for episode 191.400000.  Total average reward 71.7

[2017-05-25 15:53:03,474] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video001000.mp4


Average reward for episode 178.600000.  Total average reward 108.421716.
Average reward for episode 190.400000.  Total average reward 109.241499.
Average reward for episode 166.000000.  Total average reward 109.809084.
Average reward for episode 178.200000.  Total average reward 110.492993.
Average reward for episode 185.400000.  Total average reward 111.242063.
Average reward for episode 185.200000.  Total average reward 111.981643.
Average reward for episode 177.400000.  Total average reward 112.635826.
Average reward for episode 190.600000.  Total average reward 113.415468.
Average reward for episode 170.000000.  Total average reward 113.981313.
Average reward for episode 200.000000.  Total average reward 114.841500.
Average reward for episode 163.200000.  Total average reward 115.325085.
Average reward for episode 178.800000.  Total average reward 115.959834.
Average reward for episode 162.000000.  Total average reward 116.420236.
Average reward for episode 174.400000.  Total avera

[2017-05-25 15:57:01,721] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video002000.mp4


Average reward for episode 200.000000.  Total average reward 183.567276.
Average reward for episode 200.000000.  Total average reward 183.731604.
Average reward for episode 200.000000.  Total average reward 183.894288.
Average reward for episode 195.800000.  Total average reward 184.013345.
Average reward for episode 200.000000.  Total average reward 184.173211.
Average reward for episode 200.000000.  Total average reward 184.331479.
Average reward for episode 200.000000.  Total average reward 184.488164.
Average reward for episode 188.000000.  Total average reward 184.523283.
Average reward for episode 200.000000.  Total average reward 184.678050.
Average reward for episode 200.000000.  Total average reward 184.831269.
Average reward for episode 200.000000.  Total average reward 184.982957.
Average reward for episode 200.000000.  Total average reward 185.133127.
Average reward for episode 200.000000.  Total average reward 185.281796.
Average reward for episode 200.000000.  Total avera

[2017-05-25 16:00:59,994] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video003000.mp4


Average reward for episode 200.000000.  Total average reward 193.805086.
Average reward for episode 189.800000.  Total average reward 193.765035.
Average reward for episode 200.000000.  Total average reward 193.827385.
Average reward for episode 200.000000.  Total average reward 193.889111.
Average reward for episode 196.800000.  Total average reward 193.918220.
Average reward for episode 193.200000.  Total average reward 193.911037.
Average reward for episode 200.000000.  Total average reward 193.971927.
Average reward for episode 200.000000.  Total average reward 194.032208.
Average reward for episode 200.000000.  Total average reward 194.091886.
Average reward for episode 200.000000.  Total average reward 194.150967.
Average reward for episode 200.000000.  Total average reward 194.209457.
Average reward for episode 185.600000.  Total average reward 194.123363.
Average reward for episode 184.000000.  Total average reward 194.022129.
Average reward for episode 200.000000.  Total avera

[2017-05-25 16:04:52,392] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video004000.mp4


Average reward for episode 187.600000.  Total average reward 189.103584.
Average reward for episode 188.600000.  Total average reward 189.098548.
Average reward for episode 179.400000.  Total average reward 189.001563.
Average reward for episode 196.000000.  Total average reward 189.071547.
Average reward for episode 191.600000.  Total average reward 189.096832.
Average reward for episode 171.400000.  Total average reward 188.919863.
Average reward for episode 195.800000.  Total average reward 188.988665.
Average reward for episode 184.000000.  Total average reward 188.938778.
Average reward for episode 200.000000.  Total average reward 189.049390.
Average reward for episode 170.000000.  Total average reward 188.858896.
Average reward for episode 186.000000.  Total average reward 188.830307.
Average reward for episode 197.200000.  Total average reward 188.914004.
Average reward for episode 187.600000.  Total average reward 188.900864.
Average reward for episode 190.000000.  Total avera

[2017-05-25 16:08:52,739] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video005000.mp4


Average reward for episode 175.000000.  Total average reward 196.310149.
Average reward for episode 200.000000.  Total average reward 196.347047.
Average reward for episode 200.000000.  Total average reward 196.383577.
Average reward for episode 200.000000.  Total average reward 196.419741.
Average reward for episode 196.000000.  Total average reward 196.415544.
Average reward for episode 200.000000.  Total average reward 196.451388.
Average reward for episode 200.000000.  Total average reward 196.486874.
Average reward for episode 200.000000.  Total average reward 196.522006.
Average reward for episode 200.000000.  Total average reward 196.556786.
Average reward for episode 200.000000.  Total average reward 196.591218.
Average reward for episode 200.000000.  Total average reward 196.625306.
Average reward for episode 200.000000.  Total average reward 196.659053.
Average reward for episode 200.000000.  Total average reward 196.692462.
Average reward for episode 200.000000.  Total avera

[2017-05-25 16:13:06,445] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video006000.mp4


Average reward for episode 200.000000.  Total average reward 199.191850.
Average reward for episode 200.000000.  Total average reward 199.199932.
Average reward for episode 200.000000.  Total average reward 199.207933.
Average reward for episode 200.000000.  Total average reward 199.215853.
Average reward for episode 200.000000.  Total average reward 199.223695.
Average reward for episode 200.000000.  Total average reward 199.231458.
Average reward for episode 200.000000.  Total average reward 199.239143.
Average reward for episode 200.000000.  Total average reward 199.246752.
Average reward for episode 200.000000.  Total average reward 199.254284.
Average reward for episode 200.000000.  Total average reward 199.261741.
Average reward for episode 200.000000.  Total average reward 199.269124.
Average reward for episode 200.000000.  Total average reward 199.276433.
Average reward for episode 200.000000.  Total average reward 199.283668.
Average reward for episode 200.000000.  Total avera

[2017-05-25 16:17:08,418] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video007000.mp4


Average reward for episode 200.000000.  Total average reward 198.839633.
Average reward for episode 200.000000.  Total average reward 198.851237.
Average reward for episode 200.000000.  Total average reward 198.862725.
Average reward for episode 200.000000.  Total average reward 198.874097.
Average reward for episode 200.000000.  Total average reward 198.885356.
Average reward for episode 200.000000.  Total average reward 198.896503.
Average reward for episode 200.000000.  Total average reward 198.907538.
Average reward for episode 200.000000.  Total average reward 198.918462.
Average reward for episode 200.000000.  Total average reward 198.929278.
Average reward for episode 200.000000.  Total average reward 198.939985.
Average reward for episode 200.000000.  Total average reward 198.950585.
Average reward for episode 200.000000.  Total average reward 198.961079.
Average reward for episode 200.000000.  Total average reward 198.971469.
Average reward for episode 200.000000.  Total avera

[2017-05-25 16:21:08,937] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video008000.mp4


Average reward for episode 200.000000.  Total average reward 199.489042.
Average reward for episode 200.000000.  Total average reward 199.494152.
Average reward for episode 200.000000.  Total average reward 199.499210.
Average reward for episode 200.000000.  Total average reward 199.504218.
Average reward for episode 200.000000.  Total average reward 199.509176.
Average reward for episode 200.000000.  Total average reward 199.514084.
Average reward for episode 200.000000.  Total average reward 199.518943.
Average reward for episode 200.000000.  Total average reward 199.523754.
Average reward for episode 200.000000.  Total average reward 199.528516.
Average reward for episode 200.000000.  Total average reward 199.533231.
Average reward for episode 200.000000.  Total average reward 199.537899.
Average reward for episode 200.000000.  Total average reward 199.542520.
Average reward for episode 200.000000.  Total average reward 199.547095.
Average reward for episode 186.400000.  Total avera

[2017-05-25 16:25:08,074] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video009000.mp4


Average reward for episode 200.000000.  Total average reward 199.382854.
Average reward for episode 200.000000.  Total average reward 199.389026.
Average reward for episode 194.200000.  Total average reward 199.337135.
Average reward for episode 200.000000.  Total average reward 199.343764.
Average reward for episode 200.000000.  Total average reward 199.350326.
Average reward for episode 200.000000.  Total average reward 199.356823.
Average reward for episode 197.600000.  Total average reward 199.339255.
Average reward for episode 200.000000.  Total average reward 199.345862.
Average reward for episode 198.800000.  Total average reward 199.340404.
Average reward for episode 199.000000.  Total average reward 199.337000.
Average reward for episode 189.600000.  Total average reward 199.239630.
Average reward for episode 195.400000.  Total average reward 199.201233.
Average reward for episode 200.000000.  Total average reward 199.209221.
Average reward for episode 200.000000.  Total avera

[2017-05-25 16:28:59,360] Starting new video recorder writing to /tmp/cartpole-experiment-3/openaigym.video.0.6558.video010000.mp4


Average reward for episode 188.200000.  Total average reward 192.189493.
10001 Episodes completed.
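
The "Total average reward" column in the log is an exponential moving average of the per-batch reward totals, divided by `batch_size`. The sketch below replays that update on the first five batches. The totals in `batch_totals` are inferred from the first five logged averages (15.4, 18.2, 19.8, 17.4, 32.2) multiplied by the batch size of 5, so treat them as a reconstruction rather than recorded data.

```python
batch_size = 5
# Per-batch reward totals inferred from the first five log lines
# (logged average x batch_size); a reconstruction, not recorded data
batch_totals = [77, 91, 99, 87, 161]

running_reward = None
logged = []
for reward_sum in batch_totals:
    # First batch seeds the average; later batches move it by 1% of the new value
    running_reward = (reward_sum if running_reward is None
                      else running_reward * 0.99 + reward_sum * 0.01)
    logged.append(running_reward / batch_size)
    print('Average reward for episode %f.  Total average reward %f.'
          % (reward_sum / batch_size, running_reward / batch_size))
```

Because each batch only shifts the average by 1%, the running figure lags well behind the per-batch figure. That is why it is still near 16 while individual batches already reach 30 or more.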


In [9]:
# env.close()
gym.upload('/tmp/cartpole-experiment-3', api_key='YOUR_API_KEY')  # replace with your own OpenAI Gym API key


[2017-05-25 16:31:21,314] [CartPole-v0] Uploading 10000 episodes of training data
[2017-05-25 16:31:23,289] [CartPole-v0] Uploading videos of 20 training episodes (181901 bytes)
[2017-05-25 16:31:23,910] [CartPole-v0] Creating evaluation object from /tmp/cartpole-experiment-3 with learning curve and training video
[2017-05-25 16:31:24,220] 
****************************************************
You successfully uploaded your evaluation on CartPole-v0 to
OpenAI Gym! You can find it at:

    https://gym.openai.com/evaluations/eval_xyQALBq8SMyXO6yOMRcwOg

****************************************************
