Reinforcement Learning with OpenAI Gym
---
This notebook creates and tests reinforcement learning agents in OpenAI Gym environments.

In [1]:
import tensorflow as tf
import gym
import os

import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline



Load the Environment
---
Call `gym.make("environment name")` to load a new environment.

Check out the list of available environments at <https://gym.openai.com/envs/>

Edit this cell to load different environments!

In [2]:
# TODO: Load an environment
env = gym.make("CartPole-v1")

[2018-08-03 10:33:18,575] Making new env: CartPole-v1


In [3]:
# TODO: Print observation and action spaces
print(env.observation_space)
print(env.action_space)

Box(4,)
Discrete(2)


Run an Agent
---

Reset the environment before each run with `env.reset`, which returns the first observation.

Step forward through the environment with `env.step` to get new observations and rewards over time.

`env.step` takes the action to perform on this step and returns the following:
- Observation: the state of the environment after this step
- Reward: the reward earned on this step
- "Done": a boolean value indicating whether the game is finished
- Info: debug information that some environments provide

In [4]:
# TODO Make a random agent
games_to_play = 10
 
for i in range(games_to_play):
    # Reset the environment
    obs = env.reset()
    episode_rewards = 0
    done = False
     
    while not done:
        # Render the environment so we can watch
        env.render()
         
        # Choose a random action
        action = env.action_space.sample()
         
        # Take a step in the environment with the chosen action
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
 
    # Print episode total rewards when done
    print(episode_rewards)
     
# Close the environment
env.close()

13.0
15.0
17.0
12.0
17.0
21.0
15.0
13.0
13.0
11.0


Policy Gradients
---
The policy gradients algorithm records gameplay over a training period, then runs the recorded states, chosen actions, and rewards through a neural network. Actions that resulted in a reward become more likely, and unsuccessful actions become less likely.
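As a rough illustration of this idea, the NumPy sketch below computes a reward-weighted negative log-likelihood for a few made-up steps. The probabilities, actions, and rewards are invented for the example, and `weighted_nll` is a hypothetical helper name rather than part of the agent built in this notebook.

```python
import numpy as np

def weighted_nll(action_probs, actions, rewards):
    """Mean of -log(p(chosen action)) * reward over the recorded steps."""
    action_probs = np.asarray(action_probs, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)
    rewards = np.asarray(rewards, dtype=np.float64)
    # Pick out the probability the policy assigned to the action actually taken
    chosen = action_probs[np.arange(len(actions)), actions]
    return float(np.mean(-np.log(chosen) * rewards))

# Made-up data: three steps, two possible actions
probs = [[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]]
actions = [0, 0, 1]
rewards = [1.0, 1.0, -1.0]

loss = weighted_nll(probs, actions, rewards)
print(loss)
```

Minimizing a loss of this shape pushes up the probability of actions paired with positive rewards and pushes down the probability of actions paired with negative rewards.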

In [5]:
# TODO Build the policy gradient neural network
class Agent:
    def __init__(self, num_actions, state_size):
         
        initializer = tf.contrib.layers.xavier_initializer()
         
        self.input_layer = tf.placeholder(dtype=tf.float32, shape=[None, state_size])
         
        # Neural net starts here
         
        hidden_layer = tf.layers.dense(self.input_layer, 8, activation=tf.nn.relu, kernel_initializer=initializer)
        hidden_layer_2 = tf.layers.dense(hidden_layer, 8, activation=tf.nn.relu, kernel_initializer=initializer)
         
        # Output of neural net
        out = tf.layers.dense(hidden_layer_2, num_actions, activation=None)
         
        self.outputs = tf.nn.softmax(out)
        self.choice = tf.argmax(self.outputs, axis=1)
         
        # Training Procedure
        self.rewards = tf.placeholder(shape=[None, ], dtype=tf.float32)
        self.actions = tf.placeholder(shape=[None, ], dtype=tf.int32)
         
        one_hot_actions = tf.one_hot(self.actions, num_actions)
         
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(logits=out, labels=one_hot_actions)
         
        self.loss = tf.reduce_mean(cross_entropy * self.rewards)
         
        self.gradients = tf.gradients(self.loss, tf.trainable_variables())
         
        # Create a placeholder list for gradients
        self.gradients_to_apply = []
        for index, variable in enumerate(tf.trainable_variables()):
            gradient_placeholder = tf.placeholder(tf.float32)
            self.gradients_to_apply.append(gradient_placeholder)
        optimizer = tf.train.AdamOptimizer(learning_rate=1e-2)
        self.update_gradients = optimizer.apply_gradients(zip(self.gradients_to_apply, tf.trainable_variables()))

Discounting and Normalizing Rewards
---
To determine how "successful" a given action is, the policy gradient algorithm evaluates each action based on how much reward was earned after it was performed in an episode.

The discount rewards function goes through each time step of an episode and tracks the total rewards earned from each step to the end of the episode.

For example, if an episode took 10 steps to finish, and the agent earns 1 point of reward every step, the rewards for each frame would be stored as 
`[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]`

This lets the agent give more credit to early actions that didn't lose the game, and less credit to later actions that likely led to the end of the game.

One disadvantage of arranging rewards like this is that early actions didn't necessarily contribute directly to later rewards, so a **discount factor** is applied that scales rewards down over time. A discount factor less than 1 means that rewards earned closer to the current time step are worth more than rewards earned later.

With our reward example above, if we applied a discount factor of 0.90, the rewards would be stored as
`[ 6.5132156   6.12579511  5.6953279   5.217031    4.68559     4.0951      3.439
  2.71        1.9         1. ]`

This means that the early actions still get more credit than later actions, but not the full value of the rewards for the entire episode.
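The two reward lists above can be reproduced with a short NumPy sketch. `discount_rewards` is a hypothetical helper written for this example; it only discounts and does not normalize.

```python
import numpy as np

def discount_rewards(rewards, discount):
    """Return the discounted sum of rewards from each step to the end."""
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running_total = 0.0
    # Walk backwards so each step accumulates the rewards that follow it
    for i in reversed(range(len(rewards))):
        running_total = running_total * discount + rewards[i]
        discounted[i] = running_total
    return discounted

# Ten steps with 1 point of reward each
undiscounted = discount_rewards([1.0] * 10, 1.0)
discounted = discount_rewards([1.0] * 10, 0.9)
print(undiscounted)
print(discounted)
```

With a discount of 1.0 the result is the plain running total, and with 0.9 each later reward counts for a little less.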

Finally, the rewards are normalized by subtracting their mean and dividing by their standard deviation, which lowers the variance between reward values in longer or shorter episodes.
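A minimal sketch of the normalization step, assuming the rewards are not all identical (otherwise the standard deviation would be zero). `normalize` and the two sample episodes are made up for illustration.

```python
import numpy as np

def normalize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    values = np.asarray(values, dtype=np.float64)
    return (values - values.mean()) / values.std()

# Two made-up episodes whose rewards differ only in scale
short_episode = normalize([3.0, 2.0, 1.0])
long_episode = normalize([30.0, 20.0, 10.0])
print(short_episode)
print(long_episode)
```

After normalization both episodes produce the same values, so the scale of the raw rewards no longer dominates the update.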

You can tweak the discount factor as one of the hyperparameters of your model to find one that fits your task the best!

In [6]:
# TODO Create the discounted and normalized rewards function
discount_rate = 0.95
 
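def discount_normalize_rewards(rewards):
    # Use a float array so discounted values are not truncated for integer rewards
    discounted_rewards = np.zeros(len(rewards), dtype=np.float64)
    total_rewards = 0.0

    # Walk backwards so each step accumulates the discounted rewards that follow it
    for i in reversed(range(len(rewards))):
        total_rewards = total_rewards * discount_rate + rewards[i]
        discounted_rewards[i] = total_rewards

    # Normalize to zero mean and unit standard deviation
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)

    return discounted_rewards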

Training Procedure
---
The agent will play games and record the history of the episode. At the end of every game, the episode's history will be processed to calculate the **gradients** that the model learned from that episode.

Every few games the calculated gradients will be applied, updating the model's parameters with the lessons from the games so far.

While training, you'll keep track of average scores and render the environment occasionally to see your model's progress.
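The accumulate-then-apply pattern described above can be sketched with plain NumPy arrays. Everything here is a stand-in: the two buffer shapes, the constant "gradients", and the `applied_updates` list are invented for the example, and the sketch counts episodes from 1 so that each applied batch covers exactly `episode_batch_size` episodes.

```python
import numpy as np

# Hypothetical stand-ins for two trainable variables of different shapes
gradient_buffer = [np.zeros((4, 8)), np.zeros(8)]
episode_batch_size = 5
applied_updates = []

for episode in range(1, 11):
    # Pretend these gradients were computed from one finished episode
    episode_gradients = [np.ones(g.shape) for g in gradient_buffer]

    # Add this episode's gradients to the running buffer
    for index, gradient in enumerate(episode_gradients):
        gradient_buffer[index] += gradient

    # Every few episodes, hand the summed gradients to the optimizer and reset
    if episode % episode_batch_size == 0:
        applied_updates.append([g.copy() for g in gradient_buffer])
        for index in range(len(gradient_buffer)):
            gradient_buffer[index][:] = 0.0

print(len(applied_updates))
```

In the real training loop the appended copy would be replaced by a call that applies the buffered gradients to the model.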

In [7]:
# TODO Create the training loop
tf.reset_default_graph()
 
# Modify these to match shape of actions and states in your environment
num_actions = 2
state_size = 4
 
path = "./cartpole-pg/"
 
training_episodes = 1000
max_steps_per_episode = 10000
episode_batch_size = 5
 
agent = Agent(num_actions, state_size)
 
init = tf.global_variables_initializer()
 
saver = tf.train.Saver(max_to_keep=2)
 
if not os.path.exists(path):
    os.makedirs(path)
 
with tf.Session() as sess:
    sess.run(init)
     
    total_episode_rewards = []
     
    # Create a buffer of 0'd gradients
    gradient_buffer = sess.run(tf.trainable_variables())
    for index, gradient in enumerate(gradient_buffer):
        gradient_buffer[index] = gradient * 0
 
    for episode in range(training_episodes):
 
        state = env.reset()
         
        episode_history = []
        episode_rewards = 0
         
        for step in range(max_steps_per_episode):
             
            if episode % 100 == 0:
                env.render()
             
            # Get weights for each action
            action_probabilities = sess.run(agent.outputs, feed_dict={agent.input_layer: [state]})
            print(action_probabilities)
            action_choice = np.random.choice(range(num_actions), p=action_probabilities[0])
            print(action_choice)
             
            state_next, reward, done, _ = env.step(action_choice)
            episode_history.append([state, action_choice, reward, state_next])
            state = state_next
             
            episode_rewards += reward
             
            if done or step + 1 == max_steps_per_episode:
                total_episode_rewards.append(episode_rewards)
                episode_history = np.array(episode_history)
                episode_history[:,2] = discount_normalize_rewards(episode_history[:,2])
                 
                ep_gradients = sess.run(agent.gradients, feed_dict={agent.input_layer: np.vstack(episode_history[:, 0]),
                                                                    agent.actions: episode_history[:, 1],
                                                                    agent.rewards: episode_history[:, 2]})
                # add the gradients to the grad buffer:
                for index, gradient in enumerate(ep_gradients):
                    gradient_buffer[index] += gradient
                 
                break
             
        if episode % episode_batch_size == 0:
         
            feed_dict_gradients = dict(zip(agent.gradients_to_apply, gradient_buffer))
             
            sess.run(agent.update_gradients, feed_dict=feed_dict_gradients)
             
            for index, gradient in enumerate(gradient_buffer):
                gradient_buffer[index] = gradient * 0
                 
        if episode % 100 == 0:
            saver.save(sess, path + "pg-checkpoint", episode)
            print("Average reward / 100 eps: " + str(np.mean(total_episode_rewards[-100:])))

[2018-08-03 10:34:05,425] From <ipython-input-5-09f7aad063ea>:26: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.



[[0.49723607 0.50276387]]
1
[[0.43594557 0.5640545 ]]
1
[[0.36091578 0.6390842 ]]
1
[[0.2903715 0.7096285]]
1
[[0.22695674 0.7730433 ]]
0
[[0.27868316 0.7213168 ]]
1
[[0.21550758 0.78449243]]
0
[[0.26359284 0.7364071 ]]
1
[[0.20140174 0.7985983 ]]
1
[[0.14944082 0.85055923]]
1
[[0.10785232 0.8921477 ]]
1
[[0.07584309 0.9241569 ]]
0
Average reward / 100 eps: 12.0
[... per-step action probabilities and sampled actions omitted; output truncated ...]

Average reward / 100 eps: 12.13
[[0.22757582 0.7724242 ]]
0
[[0.2426284  0.75737154]]
1
[[0.22871299 0.77128696]]
1
[[0.18206465 0.8179353 ]]
1
[[0.13205495 0.867945  ]]
1
[[0.09239241 0.90760756]]
1
[[0.06316903 0.93683094]]
1
[[0.04029697 0.9597031 ]]
1
[[0.02466109 0.9753389 ]]
1
[[0.01479193 0.98520803]]
1
[[0.00870283 0.9912971 ]]
1
[[0.00502441 0.99497557]]
1
[... per-step action probabilities and sampled actions omitted; output truncated ...]

Average reward / 100 eps: 9.47
[[0.00888043 0.99111956]]
1
[[0.00451492 0.99548507]]
1
[[0.0022919 0.9977081]]
1
[[0.00116023 0.9988398 ]]
1
[[4.2256448e-04 9.9957746e-01]]
1
[[1.4956061e-04 9.9985039e-01]]
1
[[5.1641673e-05 9.9994838e-01]]
1
[[1.7349636e-05 9.9998260e-01]]
1
[... per-step action probabilities and sampled actions omitted; output truncated ...]
1
[[3.6363048e-04 9.9963641e-01]]
1
[[1.4029002e-04 9.9985969e-01]]
1
[[5.278716e-05 9.999472e-01]]
1
[[1.933230e-05 9.999807e-01]]
1
[[0.01404186 0.98595816]]
1
[[0.00804481 0.9919552 ]]
1
[[0.00452322 0.9954768 ]]
1
[[0.00236591 0.99763405]]
1
[[9.7546703e-04 9.9902

[[0.00128169 0.9987184 ]]
1
[[5.2803929e-04 9.9947196e-01]]
1
[[2.1276114e-04 9.9978727e-01]]
1
[[8.364562e-05 9.999163e-01]]
1
[[3.2022337e-05 9.9996793e-01]]
1
[[0.01953229 0.9804677 ]]
0
[[0.03133207 0.9686679 ]]
1
[[0.01972844 0.9802715 ]]
1
[[0.01157452 0.9884255 ]]
1
[[0.00664838 0.99335164]]
1
[[0.00377589 0.99622416]]
1
[[0.00162695 0.99837303]]
1
[[6.8778230e-04 9.9931216e-01]]
1
[[2.8456215e-04 9.9971539e-01]]
1
[[1.1493119e-04 9.9988508e-01]]
1
[[4.5208042e-05 9.9995482e-01]]
1
[[1.7285822e-05 9.9998271e-01]]
1
[[0.01916001 0.98084   ]]
1
[[0.01111105 0.9888889 ]]
1
[[0.00630709 0.993693  ]]
1
[[0.00318211 0.9968178 ]]
1
[[0.00134032 0.9986596 ]]
1
[[5.537819e-04 9.994462e-01]]
1
[[2.2389446e-04 9.9977607e-01]]
1
[[8.8362642e-05 9.9991167e-01]]
1
[[3.3970093e-05 9.9996603e-01]]
1
[[1.2701412e-05 9.9998724e-01]]
1
[[0.02053789 0.9794621 ]]
1
[[0.01185297 0.98814696]]
1
[[0.00675128 0.9932487 ]]
1
[[0.0035446  0.99645543]]
1
[[0.00150239 0.9984976 ]]
1
[[6.2465365e-04 9.993754

[[5.943794e-04 9.994056e-01]]
1
[[2.4085496e-04 9.9975914e-01]]
1
[[9.5261916e-05 9.9990475e-01]]
1
[[3.6707006e-05 9.9996328e-01]]
1
Average reward / 100 eps: 9.48
[[0.02083916 0.97916085]]
1
[[0.01180315 0.9881968 ]]
1
[[0.00665559 0.9933444 ]]
1
[[0.0032001 0.9967998]]
1
[[0.00135147 0.9986486 ]]
1
[[5.595045e-04 9.994405e-01]]
1
[[2.2655487e-04 9.9977344e-01]]
1
[[8.9526475e-05 9.9991047e-01]]
1
[[3.4461878e-05 9.9996555e-01]]
1
[[0.0206869 0.9793131]]
1
[[0.01189234 0.9881076 ]]
1
[[0.0068079  0.99319214]]
1
[[0.00356526 0.9964347 ]]
1
[[0.00154784 0.9984522 ]]
1
[[6.588892e-04 9.993411e-01]]
1
[[2.7435081e-04 9.9972564e-01]]
1
[[1.11468835e-04 9.99888539e-01]]
1
[[4.409780e-05 9.999559e-01]]
1
[[1.6958167e-05 9.9998307e-01]]
1
[[0.02271262 0.9772874 ]]
1
[[0.01277698 0.987223  ]]
1
[[0.00715627 0.9928437 ]]
1
[[0.00335177 0.99664825]]
1
[[0.00140256 0.99859744]]
1
[[5.7570101e-04 9.9942434e-01]]
1
[[2.3128053e-04 9.9976867e-01]]
1
[[9.074069e-05 9.999093e-01]]
1
[[3.470597e-05 9.

[[0.00152768 0.9984723 ]]
1
[[6.3299906e-04 9.9936706e-01]]
1
[[2.5653272e-04 9.9974340e-01]]
1
[[1.0142885e-04 9.9989855e-01]]
1
[[3.9038183e-05 9.9996102e-01]]
1
[[1.4600972e-05 9.9998546e-01]]
1
[[0.01678409 0.98321587]]
1
[[0.00928715 0.9907128 ]]
1
[[0.00511844 0.9948815 ]]
1
[[0.00253658 0.99746346]]
1
[[0.00105262 0.99894744]]
1
[[4.280513e-04 9.995720e-01]]
1
[[1.7015656e-04 9.9982977e-01]]
1
[[6.596214e-05 9.999341e-01]]
1
[[2.4886976e-05 9.9997509e-01]]
1
[[0.0166411 0.9833589]]
1
[[0.00918356 0.9908165 ]]
1
[[0.00502353 0.9949765 ]]
1
[[0.00242712 0.99757284]]
1
[[9.9656126e-04 9.9900347e-01]]
1
[[4.0116752e-04 9.9959880e-01]]
1
[[1.5793635e-04 9.9984205e-01]]
1
[[6.066708e-05 9.999393e-01]]
1
[[2.2693292e-05 9.9997735e-01]]
1
[[0.01733057 0.9826695 ]]
1
[[0.00960847 0.9903915 ]]
1
[[0.00530532 0.99469465]]
1
[[0.00271685 0.9972831 ]]
1
[[0.00113068 0.99886924]]
1
[[4.6108823e-04 9.9953890e-01]]
1
[[1.8379124e-04 9.9981624e-01]]
1
[[7.1437615e-05 9.9992859e-01]]
1
[[2.702210

1
[[0.00113787 0.99886215]]
1
[[4.5994602e-04 9.9954009e-01]]
1
[[1.8167014e-04 9.9981838e-01]]
1
[[6.991656e-05 9.999300e-01]]
1
[[2.6149963e-05 9.9997389e-01]]
1
[[9.4850384e-06 9.9999046e-01]]
1
[[3.3316817e-06 9.9999666e-01]]
1
[[0.01249924 0.9875008 ]]
1
[[0.00653488 0.9934651 ]]
1
[[0.00340498 0.99659497]]
1
[[0.00170186 0.9982981 ]]
1
[[6.624495e-04 9.993375e-01]]
1
[[2.5249925e-04 9.9974746e-01]]
1
[[9.3991424e-05 9.9990594e-01]]
1
[[3.4086104e-05 9.9996591e-01]]
1
[[1.2019525e-05 9.9998796e-01]]
1
[[0.01450716 0.9854929 ]]
1
[[0.00767765 0.9923224 ]]
1
[[0.00405032 0.9959496 ]]
1
[[0.00212826 0.99787176]]
1
[[9.3208614e-04 9.9906796e-01]]
1
[[3.644476e-04 9.996356e-01]]
1
[[1.3925631e-04 9.9986076e-01]]
1
[[5.1859210e-05 9.9994814e-01]]
1
[[1.8778446e-05 9.9998116e-01]]
1
[[6.6003904e-06 9.9999344e-01]]
1
[[0.01365967 0.9863403 ]]
1
[[0.00739851 0.99260145]]
1
[[0.00397363 0.9960264 ]]
1
[[0.00212561 0.9978744 ]]
1
[[0.00100139 0.9989986 ]]
1
[[4.0343386e-04 9.9959654e-01]]
1


[[5.9298898e-05 9.9994075e-01]]
1
[[2.0328549e-05 9.9997962e-01]]
1
[[6.7664173e-06 9.9999321e-01]]
1
[[0.00968691 0.99031305]]
1
[[0.0048707  0.99512935]]
1
[[0.002442 0.997558]]
1
[[0.00121924 0.9987808 ]]
1
[[5.1520375e-04 9.9948478e-01]]
1
[[1.9022732e-04 9.9980980e-01]]
1
[[6.8504087e-05 9.9993145e-01]]
1
[[2.3992876e-05 9.9997604e-01]]
1
[[8.153615e-06 9.999919e-01]]
1
[[0.00870868 0.9912913 ]]
1
[[0.00438819 0.9956118 ]]
1
[[0.00220498 0.997795  ]]
1
[[0.0011033 0.9988967]]
1
[[4.6546926e-04 9.9953461e-01]]
1
[[1.7234693e-04 9.9982762e-01]]
1
[[6.222768e-05 9.999378e-01]]
1
[[2.1846956e-05 9.9997818e-01]]
1
[[7.440358e-06 9.999926e-01]]
1
[[2.4539372e-06 9.9999750e-01]]
1
[[0.00953612 0.99046385]]
1
[[0.00475339 0.9952466 ]]
1
[[0.00236259 0.9976374 ]]
1
[[0.00116939 0.9988306 ]]
1
[[5.168180e-04 9.994831e-01]]
1
[[1.8908014e-04 9.9981099e-01]]
1
[[6.745419e-05 9.999325e-01]]
1
[[2.3398503e-05 9.9997663e-01]]
1
[[7.8732082e-06 9.9999213e-01]]
1
[[0.00940355 0.9905964 ]]
1
[[0.00

1
[[0.00321319 0.99678683]]
1
[[0.00154691 0.998453  ]]
1
[[7.4159278e-04 9.9925834e-01]]
1
[[3.418721e-04 9.996581e-01]]
1
[[1.21740355e-04 9.99878287e-01]]
1
[[4.2226187e-05 9.9995780e-01]]
1
[[1.4222173e-05 9.9998581e-01]]
1
[[4.6394307e-06 9.9999535e-01]]
1
[[1.4630622e-06 9.9999857e-01]]
1
[[0.00766535 0.99233466]]
1
[[0.00366235 0.99633765]]
1
[[0.00174531 0.9982547 ]]
1
[[8.2830724e-04 9.9917173e-01]]
1
[[3.678677e-04 9.996321e-01]]
1
[[1.2897601e-04 9.9987102e-01]]
1
[[4.4055512e-05 9.9995589e-01]]
1
[[1.4617785e-05 9.9998534e-01]]
1
[[4.7000672e-06 9.9999535e-01]]
1
[[0.00802276 0.9919773 ]]
1
[[0.00375405 0.9962459 ]]
1
[[0.00174713 0.9982528 ]]
1
[[8.1038463e-04 9.9918956e-01]]
1
[[3.4737328e-04 9.9965262e-01]]
1
[[1.1872682e-04 9.9988127e-01]]
1
[[3.957634e-05 9.999604e-01]]
1
[[1.282926e-05 9.999871e-01]]
1
[[4.0349669e-06 9.9999595e-01]]
1
[[0.006909 0.993091]]
1
[[0.00331071 0.99668926]]
1
[[0.00158237 0.9984176 ]]
1
[[7.5306540e-04 9.9924695e-01]]
1
[[3.5400494e-04 9.99

[[0.00313072 0.99686927]]
1
[[0.00143822 0.9985618 ]]
1
[[6.5826031e-04 9.9934167e-01]]
1
[[2.9949241e-04 9.9970055e-01]]
1
[[1.1078736e-04 9.9988925e-01]]
1
[[3.7282472e-05 9.9996269e-01]]
1
[[1.2186887e-05 9.9998784e-01]]
1
[[3.858784e-06 9.999962e-01]]
1
[[1.1810821e-06 9.9999881e-01]]
1
[[0.00708835 0.99291164]]
1
[[0.00318076 0.99681926]]
1
[[0.00142375 0.9985763 ]]
1
[[6.3471345e-04 9.9936527e-01]]
1
[[2.6515924e-04 9.9973482e-01]]
1
[[8.759483e-05 9.999124e-01]]
1
[[2.8189310e-05 9.9997187e-01]]
1
[[8.812607e-06 9.999912e-01]]
1
[[0.00750632 0.9924936 ]]
1
[[0.00340572 0.99659425]]
1
[[0.00154178 0.99845827]]
1
[[6.9534546e-04 9.9930465e-01]]
1
[[3.0897016e-04 9.9969101e-01]]
1
[[1.0381935e-04 9.9989617e-01]]
1
[[3.3998578e-05 9.9996603e-01]]
1
[[1.0818983e-05 9.9998915e-01]]
1
[[3.3375586e-06 9.9999666e-01]]
1
[[0.00670393 0.9932961 ]]
1
[[0.00309422 0.99690574]]
1
[[0.00142512 0.99857485]]
1
[[6.538352e-04 9.993462e-01]]
1
[[2.9814825e-04 9.9970180e-01]]
1
[[1.1000320e-04 9.99

[[5.424412e-06 9.999945e-01]]
1
[[1.6860740e-06 9.9999833e-01]]
1
[[0.00740811 0.99259186]]
1
[[0.00345106 0.99654895]]
1
[[0.00160368 0.99839634]]
1
[[7.4217503e-04 9.9925786e-01]]
1
[[3.2633622e-04 9.9967372e-01]]
1
[[1.1285601e-04 9.9988711e-01]]
1
[[3.8023652e-05 9.9996197e-01]]
1
[[1.2443876e-05 9.9998760e-01]]
1
[[3.9461652e-06 9.9999607e-01]]
1
[[0.00777931 0.9922207 ]]
1
[[0.00365056 0.99634945]]
1
[[0.0017089 0.9982911]]
1
[[7.9674594e-04 9.9920326e-01]]
1
[[3.6919152e-04 9.9963081e-01]]
1
[[1.3726769e-04 9.9986267e-01]]
1
[[4.6825484e-05 9.9995315e-01]]
1
[[1.5515285e-05 9.9998450e-01]]
1
[[4.980443e-06 9.999950e-01]]
1
[[1.5459308e-06 9.9999845e-01]]
1
[[0.00652597 0.99347407]]
1
[[0.00310923 0.99689084]]
1
[[0.00147758 0.99852246]]
1
[[6.9910591e-04 9.9930084e-01]]
1
[[3.2859994e-04 9.9967146e-01]]
1
[[1.2515335e-04 9.9987483e-01]]
1
[[4.356836e-05 9.999565e-01]]
1
[[1.4722644e-05 9.9998522e-01]]
1
[[4.8160250e-06 9.9999523e-01]]
1
[[1.5218312e-06 9.9999845e-01]]
1
[[0.0083

Testing the Model
---

This cell plays a few games with the trained model, choosing actions from the restored network without any further learning updates, so you can see how well your model has learned!

In [10]:
# TODO Create the testing loop
testing_episodes = 5

with tf.Session() as sess:
    # Restore the most recent checkpoint saved during training
    checkpoint = tf.train.get_checkpoint_state(path)
    saver.restore(sess, checkpoint.model_checkpoint_path)

    for episode in range(testing_episodes):
        state = env.reset()
        episode_rewards = 0

        for step in range(max_steps_per_episode):
            # Render the environment so we can watch
            env.render()

            # Get the action chosen by the network for the current state
            action_argmax = sess.run(agent.choice, feed_dict={agent.input_layer: [state]})
            action_choice = action_argmax[0]

            # Take the action and move on to the next state
            state_next, reward, done, _ = env.step(action_choice)
            state = state_next

            episode_rewards += reward

            # Stop when the game ends or the step limit is reached
            if done or step + 1 == max_steps_per_episode:
                print("Rewards for episode " + str(episode) + ": " + str(episode_rewards))
                break

[2018-08-03 00:06:08,622] Restoring parameters from ./cartpole-pg/pg-checkpoint-900


Rewards for episode 0: 251.0
Rewards for episode 1: 291.0
Rewards for episode 2: 306.0
Rewards for episode 3: 264.0
Rewards for episode 4: 268.0
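
The testing loop reads its action from `agent.choice`, and the variable name `action_argmax` suggests that this is the highest-probability action, whereas the training output above shows the less likely action being taken occasionally, which is what sampling from the probabilities would produce. As a minimal sketch of that difference, assuming the network outputs a `[[p0, p1]]` row like the ones printed during training, the two selection rules can be written with plain NumPy. The helper names `greedy_action` and `sampled_action` are hypothetical and are not part of the notebook's agent.

```python
import numpy as np


def greedy_action(probs):
    """Return the index of the highest-probability action (no randomness)."""
    return int(np.argmax(probs[0]))


def sampled_action(probs, rng):
    """Draw an action index at random, weighted by the policy's probabilities."""
    p = np.asarray(probs[0], dtype=np.float64)
    p = p / p.sum()  # renormalize to guard against float32 rounding
    return int(rng.choice(len(p), p=p))


# One probability row copied from the training output above
probs = np.array([[0.01953229, 0.9804677]])
rng = np.random.default_rng(0)

print(greedy_action(probs))        # always picks action 1 for this row
print(sampled_action(probs, rng))  # usually 1, occasionally 0
```

Greedy selection is a common choice for evaluation because it removes exploration noise, while sampling keeps some exploration during training.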


In [11]:
# Run to close the environment
env.close()